
									     IJCSIS Vol. 7 No. 3, March 2010
           ISSN 1947-5500

International Journal of
    Computer Science
      & Information Security

                           IJCSIS Editorial
                     Message from Managing Editor

In this March 2010 issue, we present selected publications from active
research areas of core and applied computer science, networking, information
retrieval, information systems, emerging communication technologies and
information security. The paper acceptance rate for this issue is 32%. The
journal covers state-of-the-art issues in computer science and their
applications in business, industry and other subjects. Printed copies of the
journal are distributed to accredited universities and libraries. All the papers in
IJCSIS are also freely available online in full text with a permanent
worldwide web link. The abstracts are indexed and available at major academic

We are very grateful to all our authors who have produced some excellent work
and the reviewers for providing prompt constructive review comments on the
manuscripts. Special thanks to our technical sponsors for their valuable service.

Available at
IJCSIS Vol. 7, No. 3,
March 2010 Edition
ISSN 1947-5500
© IJCSIS 2010, USA.
Indexed by (among others):
Dr. Gregorio Martinez Perez
Associate Professor - Professor Titular de Universidad, University of Murcia
(UMU), Spain

Dr. M. Emre Celebi,
Assistant Professor, Department of Computer Science, Louisiana State University
in Shreveport, USA

Dr. Yong Li
School of Electronic and Information Engineering, Beijing Jiaotong University,
P. R. China

Prof. Hamid Reza Naji
Department of Computer Engineering, Shahid Beheshti University, Tehran, Iran

Dr. Sanjay Jasola
Professor and Dean, School of Information and Communication Technology,
Gautam Buddha University

Dr Riktesh Srivastava
Assistant Professor, Information Systems, Skyline University College, University
City of Sharjah, Sharjah, PO 1797, UAE

Dr. Siddhivinayak Kulkarni
University of Ballarat, Ballarat, Victoria, Australia

Professor (Dr) Mokhtar Beldjehem
Sainte-Anne University, Halifax, NS, Canada

Dr. Alex Pappachen James, (Research Fellow)
Queensland Micro-nanotechnology center, Griffith University, Australia
                                  TABLE OF CONTENTS

1. Paper 28021082: Design and Implementation of an Intelligent Educational Model Based on
Personality and Learner’s Emotion (pp. 1-13)
Somayeh Fatahi, Department of Computer Engineering, Kermanshah University of Technology,
Kermanshah, Iran
Nasser Ghasem-Aghaee, Department of Computer Engineering, Isfahan University, Isfahan, Iran

2. Paper 22021039: Signature Recognition using Multi Scale Fourier Descriptor And Wavelet
Transform (pp. 14-19)
Ismail A. Ismail, Professor, Dean, College of Computers and Informatics, Misr International University,
Mohammed A. Ramadan, Professor, Department of Mathematics, Faculty of Science, Menofia University,
Talaat S. El danaf, Lecturer, Department of Mathematics, Faculty of Science, Menofia University, Egypt
Ahmed H. Samak, Asst. Lecturer, Department of Mathematics, Faculty of Science, Menofia University,

3. Paper 31011077: Feature-Based Adaptive Tolerance Tree (FATT): An Efficient Indexing
Technique for Content-Based Image Retrieval Using Wavelet Transform (pp. 20-29)
Dr.P.AnandhaKumar, Department of Information Technology, Madras Institute of Technology, Anna
University Chennai, Chennai India
V. Balamurugan, Research Scholar, Department of Information Technology, Madras Institute of
Technology Anna University Chennai, Chennai, India

4. Paper 28021084: Ontology-supported processing of clinical text using medical knowledge
integration for multi-label classification of diagnosis coding (pp. 30-35)
Phanu Waraporn 1,4,*, Phayung Meesad 2, Gareth Clayton 3
1 Department of Information Technology, Faculty of Information Technology
2 Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
3 Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology
North Bangkok
4 Division of Business Computing, Faculty of Management Science, Suan Sunandha Rajabhat University
Bangkok, Thailand

5. Paper 23021042: Botnet Detection by Monitoring Similar Communication Patterns (pp. 36-45)
Hossein Rouhani Zeidanloo, Faculty of Computer Science and Information System University of
Technology Malaysia 54100 Kuala Lumpur, Malaysia
Azizah Bt Abdul Manaf, College of Science and Technology University of Technology Malaysia 54100
Kuala Lumpur, Malaysia

6. Paper 04021004: JawaTEX: A System for Typesetting Javanese (pp. 46-52)
Ema Utami 1, Jazi Eko Istiyanto 2, Sri Hartati 3, Marsono 4, and Ahmad Ashari 5
1 Information System Major of STMIK AMIKOM Yogyakarta, Ring Road Utara ST, Condong Catur, Depok
Sleman Yogyakarta, Telp. (0274) 884201-884206, Faks. (0274) 884208, Candidate Doctor of Computer
Science of Postgraduate School Gadjah Mada University
2,3,5 Doctoral Program in Computer Science, Graha Student Internet Center (SIC) 3rd floor, Faculty of
Mathematic and Natural Sciences Gadjah Mada University, Sekip Utara Bulaksumur Yogyakarta. 55281
Telp/Fax: (0274) 522443
4 Sastra Nusantara Major of Culture Sciences Gadjah Mada University, Humaniora ST No.1, Bulaksumur,
Yogyakarta, Fax. (0274) 550451

7. Paper 04110902: Effective Query Retrieval System In Mobile Business Environment (pp. 53-57)
R.Sivaraman, Dy.Director, Center for Convergence of Technologies (CCT), Anna University
Tiruchirappalli, Tiruchirappalli, Tamil Nadu, India
RM. Chandrasekaran, Registrar, Anna University Tiruchirappalli, Tiruchirappalli, Tamil Nadu, India

8. Paper 10021011: Predictive Gain Estimation – A mathematical analysis (pp. 58-61)
P. Chakrabarti, Sir Padampat Singhania University, Udaipur, Rajasthan, India

9. Paper 12021012: Lightweight Distance bound Protocol for Low Cost RFID Tags (pp. 62-67)
Eslam Gamal Ahmed, Eman Shaaban, Mohamed Hashem
Faculty of Computer and Information Science, Ain Shams University, Abbasiaa, Cairo, Egypt

10. Paper 12021016: Analysis of Empirical Software Effort Estimation Models (pp. 68-77)
Saleem Basha, Department of Computer Science, Pondicherry University, Puducherry, India
Dhavachelvan Ponnurangam, Department of Computer Science, Pondicherry University, Puducherry,

11. Paper 12021018: A Survey on Preprocessing Methods for Web Usage Data (pp. 78-83)
V.Chitraa, Lecturer, CMS College of Science and Commerce, Coimbatore, Tamilnadu, India
Dr. Antony Selvdoss Davamani, Reader in Computer Science, NGM College (AUTONOMOUS ), Pollachi,
Coimbatore,Tamilnadu, India

12. Paper 16021023: Seamless Data Services for Real Time Communication in a Heterogeneous
Networks using Network Tracking and Management (pp. 84-91)
Adiline Macriga. T, Research Scholar, Department of Information & Communication, MIT Campus, Anna
University Chennai, Chennai – 600025.
Dr. P. Anandha Kumar, Asst. Professor, Department of Information Technology, MIT Campus, Anna
University Chennai, Chennai – 600025.

13. Paper 17021026: Effect of Weighting Scheme to QoS Properties in Web Service Discovery (pp.
Agushaka J. O., Lawal M. M., Bagiwa, A. M. and Abdullahi B. F.
Mathematics Department, Ahmadu Bello University Zaria-Nigeria

14. Paper 17021030: Fuzzy Logic of Speed and Steering Control System for Three Dimensional Line
Following of an Autonomous Vehicle (pp. 101-108)
Dr. Shailja Shukla, Department of Electrical Engineering, J.E.C. Jabalpur
Mr. Mukesh Tiwari, Department of Electrical Engineering, J.E.C. Jabalpur

15. Paper 19011013: A reversible high embedding capacity data hiding technique for hiding secret
data in images (pp.109-115)
Mr. P. Mohan Kumar, Asst. Professor, CSE Department, Jeppiaar Engineering College, Chennai., India.
Dr. K. L. Shunmuganathan, Professor and Head, CSE Department, R.M.K. Engineering College, Chennai.

16. Paper 19011014: Mining The Data From Distributed Database Using An Improved Mining
Algorithm (pp. 116-121)
J. Arokia Renjit, Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai,
TamilNadu,India – 600119.
Dr. K. L. Shunmuganathan, Professor & Head, Department of CSE, RMK Engineering College,
TamilNadu , India – 601 206.

17. Paper 20021033: Node Sensing & Dynamic Discovering Routes for Wireless Sensor Networks (pp.
Prof. Arabinda Nanda, Department of CSE, KEC, Bhubaneswar, India
Prof (Dr) Amiya Kumar Rath, Department of CSE & IT, CEB, Bhubaneswar, India
Prof. Saroj Kumar Rout, Department of CSE, KEC, Bhubaneswar, India

18. Paper 20021034: A Robust Fuzzy Clustering Technique with Spatial Neighborhood Information
for Effective Medical Image Segmentation (pp. 132-138)
S. Zulaikha Beevi, Assistant Professor, Department of IT, National College of Engineering, Tamilnadu,
M. Mohammed Sathik, Associate Professor, Department of Computer Science, Sathakathullah Appa
College, Tamilnadu, India.
K. Senthamaraikannan, Professor & Head, Department of Statistics, Manonmaniam Sundaranar University,
Tamilnadu, India.

19. Paper 20021035: Design And Implementation Of Multilevel Access Control In Medical Image
Transmission Using Symmetric Polynomial Based Audio Steganography (pp. 139-146)
J.Nafeesa Begum, Research Scholar &Sr. Lecturer in CSE, Government College of Engg, Bargur- 635104,
Krishnagiri District , Tamil Nadu , India
K. Kumar, Research Scholar &Lecturer in CSE ,Government College of Engg, Bargur- 635104, Tamil
Nadu , India
Dr. V. Sumathy, Asst. Professor in ECE , Government College of Technology,Coimbatore, Tamil Nadu,

20. Paper 23021040: Enhanced Authentication and Locality Aided - Destination Mobility in Dynamic
Routing Protocol for MANET (pp. 147-152)
Sudhakar Sengan, Lecturer, Department of CSE, Nandha College of Technology, Erode -TamilNadu –
Dr.S.Chenthur Pandian, Principal, Selvam College of Technology, Namakkal -TamilNadu – India

21. Paper 23021041: Processor Based Active Queue Management for providing QoS in Multimedia
Application (pp. 153-158)
N. Saravana Selvam, Department of Computer Science and Engineering, Sree Sowdambika College of
Engineering, Aruppukottai, India
Dr. S. Radhakrishnan, Department of Computer Science and Engineering, Arulmigu Kalasalingam College
of Engineering, Krishnankoil, India

22. Paper 23021045: New Clustering Algorithm for Vector Quantization using Rotation of Error
Vector (pp. 159-165)
Dr. H. B. Kekre, Computer Engineering, Mukesh Patel School of Technology Management and
Engineering, NMIMS University, Vileparle(w) Mumbai 400–056, India
Tanuja K. Sarode, Ph.D. Scholar, MPSTME, NMIMS University, Assistant Professor, Computer
Engineering, Thadomal Shahani Engineering College, Bandra(W), Mumbai 400-050, India

23. Paper 25021046: Enhanced Ad-Hoc on Demand Multipath Distance Vector Routing protocol (pp.
Mrs. Sujata V. Mallapur, Department of Information Science and Engineering, Appa Institute of
Engineering and Technology Gulbarga, India
Prof. Sujata .Terdal, Department of Computer Science and Engineering P.D.A College of Engineering
Gulbarga, India

24. Paper 25021047: A Survey on Space-Time Turbo Codes (pp. 171-177)
Dr. C. V. Seshaiah, Prof and Head, Sri Ramakrishna Engg. College
S. Nagarani, Research Scholar, Anna University, Coimbatore

25. Paper 25021048: Mathematical Principles in Software Quality Engineering (pp. 178-184)
Dr. Manoranjan Kumar Singh, PG Department of Mathematics, Magadha University, Bodhagaya, Gaya,
Bihar, India-823001
Rakesh. L, Department of Computer-Science, SCT Institute of Technology, Bangalore, India-560075

26. Paper 27021050: An Analytical Study on Behavior of Clusters Using K Means, EM and K* Means
Algorithm (pp. 185-190)
G. Nathiya, Department of Computer Science, P.S.G.R Krishnammal College for Women, Coimbatore-
641004, Tamilnadu India.
S. C. Punitha, Department of Computer Science, P.S.G.R Krishnammal College for Women, Coimbatore-
641004, Tamilnadu India.
Dr. M. Punithavalli, Director of the Computer Science Department, Sri Ramakrishna college of Arts and
Science for Women, Coimbatore, Tamilnadu, India.

27. Paper 27021055: Node inspection and analysis thereof in the light of area estimation and curve
fitting (pp. 191-197)
A. Kumar, Dept. of Comp. Sc. & Engg., Sir Padampat Singhania University, Udaipur, India.
P. Chakrabarti, Dept. of Comp. Sc. & Engg., Sir Padampat Singhania University, Udaipur, India.
P. Saini, Dept. of Comp. Sc. & Engg., Sir Padampat Singhania University, Udaipur, India.

28. Paper 27021059: An Improved Fixed Switching Frequency Direct Torque Control of Induction
Motor Drives Fed by Direct Matrix Converter (pp. 198-205)
Nabil Taïb and Toufik Rekioua
Electrical Engineering Department, University of A. Mira, Targua Ouzemour, Bejaia, 06000, Algeria.
Bruno François, L2EP Laboratory, Central School of Lille, Lille 59651, France

29. Paper 28021065: Internet ware cloud computing :Challenges (pp. 206-210)
Dr. S Qamar, Department of Computer Science, CAS, King Saud University,Riyadh, Saudi Arabia
Niranjan Lal, Department of Information Technology, SRM University-NCR Campus, Ghaziabad ,India
Mrityunjay Singh, Department of Information Technology, SRM University-NCR Campus, Ghaziabad ,

30. Paper 28021070: Mobile Database System: Role of Mobility on the Query Processing (pp. 211-
Samidha Dwivedi Sharma and Dr. R. S. Kasana,
Department of Computer Science & Applications, Dr. H. S. Gour, University, Sagar, MP, India

31. Paper 28021076: Secure Iris Authentication Using Visual Cryptography (pp. 217-221)
P.S. Revenkar, Faculty of Department of Computer Science and Engineering, Government College of
Engineering, Aurangabad, Maharashtra, India
Anisa Anjum, Department of Computer Science and Engineering, Government College of Engineering,
Aurangabad, Maharashtra, India
W. Z. Gandhare, Principal of Government College of Engineering, Aurangabad, Maharashtra , India

32. Paper 28021077: A New Approach to Lung Image Segmentation using Fuzzy Possibilistic C-
Means Algorithm (pp. 222-228)
M. Gomathi, Department of MCA, Velalar College Of Engineering and Technology, Thindal (PO), Erode,
Dr. P.Thangaraj, Dean, School of Computer Technology and Applications, Kongu Engineering College,
Perundurai, Erode, India

33. Paper 28021078: Protection of Web Applications from Cross-Site Scripting Attacks in Browser
Side (pp. 229-236)
K. Selvamani, Department of Computer Science and Engineering, Anna University, Chennai, India
A. Duraisamy, Department of Computer Science and Engineering, Anna University, Chennai, India
A.Kannan, Department of Computer Science and Engineering, Anna University, Chennai, India

34. Paper 28021079: Review of Robust Video Watermarking Algorithms (pp. 237-246)
Mrs Neeta Deshpande, Research Scholar, SRTM University, Nanded India
Dr.Archana Rajurkar, Professor and Head, MGM College of Engineering, Nanded India,
Dr. R Manthalkar, Professor and Head, SGGS Institute of Engineering and Technology Nanded India

35. Paper 28021080: Terrorism Event Classification Using Fuzzy Inference Systems (pp. 247-256)
Uraiwan Inyaem, Faculty of Information Technology, King Mongkut’s University of Technology North
Bangkok, Bangkok, Thailand
Choochart Haruechaiyasak, Human Language Technology Laboratory, National Electronics and Computer
Technology Center, Pathumthani, Thailand
Phayung Meesad, Faculty of Technical Education, King Mongkut’s University of Technology North
Bangkok, Bangkok, Thailand
Dat Tran, Faculty of Information Science and Engineering, University of Canberra, ACT, Australia

36. Paper 28021086: A Model of Cloud Based Application Environment for Software Testing (pp.
T. Vengattaraman, Department of Computer Science, Pondicherry University, India
P. Dhavachelvan, Department of Computer Science, Pondicherry University, India.
R. Baskaran, Department of Computer Science and Engineering, Anna University, Chennai, India.

37. Paper 28021089: Joint Design of Congestion Control Routing With Distributed Multi Channel
Assignment in Wireless Mesh Networks (pp. 261-266)
K. Valarmathi, Research Scholar, Sathyabama University, Chennai, India
N. Malmurugan, Principal, Oxford Engineering College, Trichy, India

38. Paper 28021095: Mobile Broadband Possibilities considering the Arrival of IEEE 802.16m &
LTE with an Emphasis on South Asia (pp. 267-275)
Nafiz Imtiaz Bin Hamid 1, Md. Zakir Hossain 2, Md. R. H. Khandokar 3, Taskin Jamal 4, Md. A. Shoeb 5
Department of Electrical and Electronic Engineering (EEE)
    Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh.
    The University of Asia Pacific (UAP), Dhanmondi R/A, Dhaka-1209, Bangladesh.
    Stamford University, Siddeswari, Dhaka-1217, Bangladesh.
    School of Engineering and Computer Science. Independent University, Bangladesh.
    Radio Access Network (RAN) Department, Qubee. Augure Wireless Broadband Bangladesh Limited.

39. Paper 28021096: SAR Image Segmentation using Vector Quantization Technique on Entropy
Images (pp. 276-282)
Dr. H. B. Kekre, Computer Engineering, MPSTME, NMIMS University, Vileparle(w) Mumbai 400–056,
Saylee Gharge, Ph.D. Scholar, MPSTME, NMIMS University, Assistant Professor, V.E.S.I.T, Mumbai-
400071, India
Tanuja K. Sarode, Ph.D. Scholar, MPSTME, NMIMS University, Associate Professor, TSEC, Mumbai 400-
050, India

40. Paper 28021098: Reversible Image data Hiding using Lifting wavelet Transform and Histogram
Shifting (pp. 283-289)
S. Kurshid Jinna, Professor, Dept of Computer Science & Engineering, PET Engineering College, Vallioor,
Tirunelveli, India
Dr. L. Ganesan, Professor, Dept of Computer Science & Engineering, A.C College of Engineering &
Technology, Karaikudi, India

41. Paper 31011061: GIS: (Geographic Information System) An application for socio-economical
data collection for rural area (pp. 290-293)
Mr.Nayak S.K., Head, Dept. of Computer Science, Bahirji Smarak Mahavidyalaya, Basmathnagar, Dist.
Hingoli. (MS), India
Dr.S.B.Thorat, Director, Institute of Technology and Mgmt, Nanded, Dist.Nanded. (MS), India
Dr.Kalyankar N.V., Principal, Yeshwant Mahavidyalaya, Nanded, Nanded (MS) India

42. Paper 28021073: Probabilistic Semantic Web Mining Using Artificial Neural Analysis (pp. 294-
Mr.T.Krishna Kishore, Assistant Professor, St.Ann's College of Engineering and Technology. Chirala-
Mr.T.Sasi Vardhan, Assistant Professor, St.Ann's Engineering College , Chirala-523187
Mr.N.Lakshmi Narayana, Assistant Professor, St.Ann's College of Engineering and Technology. Chirala-

43. Paper 28021075: Document Clustering using Sequential Information Bottleneck Method (pp. 305-
MS. P.J.Gayathri, M.Phil scholar, P.S.G.R. Krishnammal College for Women, Coimbatore, India
MRS. S.C. Punitha, HOD, Department of Computer science, P.S.G.R. Krishnammal College for Women,
Coimbatore, India.
Dr.M. Punithavalli, Director, Department of Computer science, Sri Ramakrishna college of Arts and
Science for Women, Coimbatore, India.
                                                         (IJCSIS) International Journal of Computer Science and Information Security,
                                                         Vol. 7, No. 3, March 2010

            Design and Implementation of an Intelligent
            Educational Model Based on Personality and
                        Learner’s Emotion

                      Somayeh Fatahi                                                        Nasser Ghasem-Aghaee
          Department of Computer Engineering                                          Department of Computer Engineering
          Kermanshah University of Technology                                                 Isfahan University
                    Kermanshah, Iran                                                             Isfahan, Iran
           Email:                                                Email:

Abstract—Personality and emotions are influential parameters in the learning process. Thus,
virtual learning environments should pay attention to these parameters. In this paper, a new
e-learning model is designed and implemented according to these parameters. The virtual
learning environment presented here uses two agents: a Virtual Tutor Agent (VTA) and a
Virtual Classmate Agent (VCA). During the learning process, the learner’s emotions change
depending on events happening in the environment. In this situation, the learning style
should be revised according to the learner’s personality traits as well as his or her current
emotions. The VTA selects a suitable learning style for learners based on their personality
traits. To improve the learning process, the system uses the VCA in some of the learning
steps. The VCA is an intelligent agent with its own personality. It is designed to present an
attractive and realistic learning environment in interaction with the learner. To recognize
the learner’s personality, the system uses the MBTI test, and to obtain emotion values it uses
the OCC model. Finally, results from testing the system in real environments show that
considering human features in interaction with the learner increases learning quality and
satisfies the learner.

   Keywords-Emotion; Learning Style; MBTI Indicator; Personality; Virtual Classmate Agent
(VCA); Virtual Tutor Agent (VTA); Virtual learning.

                       I.    INTRODUCTION

One of the most important applications of computers is virtual learning. In recent years,
many organizations have started to use distance learning tools. Although this type of
education has some advantages, such systems are not sufficiently dynamic and often lack
the capabilities of a real class [29]. Nowadays, an effort is being made to make virtual
learning environments as realistic as possible by using intelligent agents with emotions and
a personality, as well as by simulating human behavior.

It is clear that learners’ emotional states change during the learning process; the changes
depend on individual differences and on events that happen in the environment [9].
Positive emotions play an important role in creativity and flexibility in solving problems.
On the other hand, negative emotions block the thinking process and prevent sound
reasoning [7]. For example, if students are tired and stressed, they cannot concentrate on
lessons and think properly [24] [29]. So, the emotional states of learners in virtual learning
environments must be taken into account [7]. Besides, people have different personalities,
and these differences certainly affect their duties and daily activities. People with different
personalities show different emotions when facing events. Different personalities also play
an important role among learners in the learning process. The personality of learners can
affect their learning styles [13]. Each person has a particular learning style according to his
or her personality, and therefore the teaching style that should be used varies from student
to student.

In the virtual learning systems built so far, researchers have focused on the learner’s
emotions and have used emotional agents. Personality is treated as an independent
parameter in only a few of them. An important problem remains unsolved: although there
are some models in this field, none of them takes a complete view of the learner’s
personality and emotions together.

This paper is organized as follows: Section 2 reviews previous work and the literature.
Section 3 explains psychological principles. Section 4 describes the proposed model, and
Section 5 discusses the implementation of the model. Finally, Sections 6 and 7 present the
evaluation of the system, the results and future work.

                       II.   PREVIOUS WORKS

Many researchers have tried to design general models of emotion in the area of artificial
intelligence [28] [36] [41] [16] [34]. Because the effect of emotions on the learner is
unavoidable in the real world, neglecting emotions is counted as a major fault in virtual
learning systems [2].

There have been many efforts to model emotions in the field of virtual learning. Some of
these efforts are described here. Barry Kort, Rob Reilly and Rosalind Picard have presented
a model whose aim is to use the effect of emotions on learning; they then implemented
their model so that their system could recognize the emotional state of learners and respond
to them according to that state [34]. Shaikh Mostafa Al Masum and Mitsuru Ishizuka have
presented an emotional cubic model. The cubic model has three dimensions. From these
three dimensions, eight types of emotions that are effective in the learning process are
extracted. In this model, they have used fuzzy logic for emotional states [2]. Qiao Xiangjie
and his colleague have presented a self-assessment model. In this model, the learner finds
his or her emotional state from data gathered through self-assessment based on an emotion
map. They have used a polar model for extracting emotions [46].

Some researchers have presented only personality models, such as Ushida and his
colleague, who have modeled personality types based on differences in individual
emotional states. Rosis and his colleague have modeled and implemented personalities
according to changes in the priority of the agents' goals. Serra and Chittaro have suggested
a goal-bound method for modeling agents. Ball and Breese have modeled the intensity of
emotions in the agent with two personality features in a Bayesian Belief Network (BBN).
Thalmann and Kshirsagar have used a BBN for modeling features. They have used the FFM
personality model and the OCC emotional model, and have added the emotions of surprise
and hate to it [44]. Andre and colleagues have presented an integrated model of emotions
(based on the OCC model) and personality (based on the FFM model). Initially, they
simulated basic emotions such as sadness, joy, fear and anger, and two personality
dimensions: extroversion and pleasantness [44] [30]. Jin Due and his colleagues tried to
build a model of the learner. In their model, the learner’s personality is extracted using the
Cattell questionnaire, and then the relation between personality and behavior is obtained
using data mining techniques [13].

Having reviewed emotion and personality models in virtual learning, we now describe
some systems that use these models. Chaffar, Cepeda and Frasson have tried to predict the
learner’s emotion in e-learning systems. They have used the Naïve Bayes classifier method
for predicting and modeling the learner’s emotional reaction according to his or her
personal features such as sex, personality, etc. [7] [8]. Ju and his colleagues have designed a
software agent that can cooperate with the learner. They put this agent in an asynchronous
learning environment. The designed system includes two subsystems: the teacher
subsystem and the learner subsystem. In the learner subsystem, learners are grouped
according to their level of knowledge and willingness to cooperate, and are treated
accordingly. Keleş and his colleagues have presented a learning system called ZOSMAT, a
learner-based system implemented in a real classroom environment [44]. Chaffar and
Frasson have suggested the ESTEL architecture to determine the optimal emotional state
for learning and induce it in the learner. The architecture uses the Naïve Bayes classifier
method to predict the optimal emotional state of the learner based on his or her personality.
The optimal emotional state is defined as the one that increases the learner’s efficiency; it is
then induced in the learner using a combination of techniques such as pictures, music, etc.
[24]. Chalfoun, Chaffar and Frasson have designed an agent that can predict the emotional
state of the learner; the agent's prediction, based on personal and impersonal features, is
made with a machine learning technique called ID3. They also used the OCC emotional
model [9]. Haron has suggested a learning system with a learning module that can adapt to
each learner. This environment uses fuzzy logic and the MBTI personality test [1]. The
Passenger software was designed by Marin and his colleagues for laboratory lessons in
distance education. The OCC model is used in the implementation of the software, and the
system uses a virtual tutor agent [35]. Ju and his colleagues have implemented a learning
environment and have studied the effect of the efficiency of the virtual classmate agent on
the learner’s efficiency. In their environment, the virtual classmate agent acts as either a
competitor or a cooperator [25]. Maldonado and his colleagues have designed a system that
provides an environment for learning through interaction with software agents. The
software agent, which acts as a classmate in the system, tries to answer the learner
emotionally. This system uses a cooperative agent to help the learner [35].

                       III. PSYCHOLOGICAL PRINCIPLES

Emotion, personality and individual differences are influential parameters in human
activities, especially in learning. Every person has a particular learning style according to
his or her personality features [3].

A. Emotion

Emotions are our reactions to the surrounding world. Aristotle defined emotion as "that
which leads one’s condition to become so transformed that his judgment is affected and,
which is accompanied by pleasure and pain" [21]. Damasio has proven that emotions affect
reasoning, memorizing, learning and decision making [11]. Studies have shown that
emotion, level of interest and individual differences are as influential in the learning
process as intelligence is [29]. Others, such as Bower and Cohen, believe that emotions
affect remembering and decision making [8] [29].

  1) OCC Model

Many models have been designed for emotions. They help us implement emotions and
people’s reactions in different conditions, such as encountering an event. One of the best
known is the OCC model, which is used in much research. This computational model was
established by Ortony, Clore and Collins in 1988. The model defines 22 types of emotions.
Emotions are divided into positive and negative ones, based on positive or negative
reactions to events. The OCC model calculates the intensity of emotions based on a set of
variables. The variables are divided into two groups: global and local. Global variables
affect all the emotions, whereas local variables affect only some emotions. Global variables
include sense of reality, proximity, unexpectedness and arousal; the main local variables
include desirability, praiseworthiness and attraction. The other local variables include
desirability for others, deservingness, liking, likelihood, effort, realization, strength of
cognitive unit, expectation deviation and familiarity [28].

The OCC model has three branches. The first branch comprises the emotions that show the
result of events that happen. These results are obtained according to desirability or
undesirability
the learner in the e-learning environment. The prediction of the        level of events compared to the agents’ goals. This branch


includes four classes and twelve emotions (Happy-for, Resentment, Gloating, Pity, Hope, Fear, Joy, Distress, Satisfaction, Disappointment, Relief, and Fear-confirmed) [4].

The second branch contains the emotions that result from the agent's actions, judged by approval or disapproval relative to a set of standards. The second branch has just one class. It includes four emotions (Pride, Shame, Reproach, and Admiration).

The third branch consists of the emotions that follow from the agent's liking or disliking of his/her goals, given the agent's position and attitude. This branch has just one class that includes two emotions (Love and Hate).

Besides these three branches there is one more class, which includes four compound emotions (Anger, Gratitude, Remorse, Gratification) [38].

B. Personality

There are various psychological definitions of personality. In Schultz's view, the unique, relatively constant internal and external aspects of a person's character that influence his behavior in different situations are called personality. Personality includes the thoughts, feelings, wishes, inclinations and behavioral tendencies that are ingrained in various aspects of each person's existence [21].

1) Learning Style

Psychological studies show that each person displays several individual features in problem solving and decision making. These features are often considered learning styles or learning methods [33]. Each person has a specific learning style according to his/her personality features [47]. According to Keefe's definition of learning styles [32], "the learning style is composed of cognitive and emotional characteristics and factors of every individual and is applicable as a set of permanent indices for recognizing how the learner comprehends the concepts, and interacts with the learning environment and responds to the environment". Learning styles are criteria for how information is perceived and how its understanding is evaluated [1] [14] [32] [47]. An important point in the definition is that learning styles reflect individual preferences and priorities in the selection of learning conditions [12] [14].

a) Evaluation of Learning Styles

Different tools are used to determine learners' learning styles [3]. Many questionnaires categorize each person according to learning style: the Kolb questionnaire, the Honey and Mumford questionnaire [33], the GRSLSS questionnaire [33] [31], etc.

b) MBTI (Myers-Briggs Type Indicator)

Compared with the questionnaires used in other education sciences, MBTI is known as a strong instrument for determining the learning styles of individuals [14]. It is an evaluation instrument based on Jung's personality theory and was first used by Katharine Briggs and Isabel Briggs Myers in 1920 [42] [14]. It was first used as an evaluation test for job applicants. After that it was used in education sciences in 1957 [43]. Generally this questionnaire is used as an instrument in education and business to determine learning or teaching styles, communication styles and job selection [42] [12]. According to the MBTI grouping, every person has instinctive priorities that are decisive for their behavior in different conditions [12] [14]. The questionnaire helps to specify the personality features and learning priorities of each person, and to derive the teaching styles that are related to those features [40]. MBTI uses four two-dimensional functions according to Jung's theory. Jung's theory specifies three of them, Extraversion/Introversion (E/I), Sensing/Intuition (S/N) and Thinking/Feeling (T/F), and MBTI adds a fourth dimension, Judging/Perceiving (J/P) [1] [40].

c) The Features of "MBTI" Dimensions

Extroversion/Introversion (E/I): extroverted people are oriented toward the outside world. They have a strong tendency toward teamwork. They have a lot of friends. They are active and practical. Their emotions are easily expressed [12] [14] [22] [40]. Conversely, introverted people prefer their internal world of opinions and ideas. They are very independent. They spend a lot of time thinking about their tasks, and they have few friends. They try to hold back their emotions and express them only at certain times to particular people [14] [22] [40] [45].

Sensing/Intuition (S/N): sensing people are emotional people who obtain information from the environment through their five senses. They are realists. They usually pay attention to details and focus on practical subjects [14] [22] [24] [45]. Intuitive people, on the other hand, obtain information by perceiving relationships and results, and usually rely on their conception to do so. They try to form a mental picture of the subject for themselves and then move towards the details. They concentrate more on ideas and their integrity, and on the future rather than the present [14] [22] [24] [45].

Thinking/Feeling (T/F): thinking people make their decisions based on exact data, and they like accurate subjects. Their decisions are logical and impersonal [33] [22] [24] [45]. Feeling people, on the other hand, emphasize harmony and balance. They enjoy teamwork. Their judgments and decisions are based on personal values.

Judging/Perceiving (J/P): judging people prefer a completely organized life and well-ordered thoughts and ideas. They pay attention to activities that are important to them. Deadlines are important to them [14] [22] [45]. Perceiving people, however, have a flexible lifestyle. They are curious, agreeable and tolerant. They start several projects simultaneously. They do not pay attention to deadlines.

d) Personality Types of "MBTI"

Sixteen personality types result from combining the four two-dimensional functions, and individuals are assigned to these types after completing some questionnaires [12] [24]. The sixteen groups are shown in Table I. For example, people in the ENTP group are all extroverted, intuitive, thinking and perceiving.


TABLE I.    PERSONALITY TYPES OF "MBTI"

    ESTJ    ISTP    ENTJ    INTP
    ISTJ    ESTP    INTJ    ENTP
    ISFJ    ESFP    INFJ    ENFP
    ESFJ    ISFP    ENFJ    INFP

IV. PROPOSED MODEL

In this paper, a new model is presented that builds on the learning model based on emotions and personality [19] and the virtual classmate model [17] [18] from our previous research. The model is shown in Figure 1.

Figure 1. Proposed model (modules: Personality Identification; Choosing Teaching Style; Choosing Virtual Classmate Agent; Education and Learning; Emotion Evaluation; Choosing Behavior and Updating the Learning Style)

The model has the following goals:

•    Presenting a method to determine the learner's group based on a personality questionnaire
•    Presenting a method to recognize the learner's emotions based on his/her personality
•    Presenting a method to change the teaching style based on the learner's emotions and personality
•    Presenting a method to determine the VCA based on the learner's personality
•    Presenting suitable learning tactics based on the learner's emotions and personality, using the VTA and VCA

A. Model Architecture

The model includes six modules, each of which is described below.

Module of personality identification: At the learner's first encounter with the system, the MBTI questionnaire is presented to him/her, and the system determines the learner's personality type (for example ISFJ, ESTP, INTJ, ...).

Module of choosing a teaching style matching the learner's personality: Generally, there are three kinds of learning environment: independent, cooperative and competitive [15]. The system places the learner in one of three groups: the independent group, the cooperative group with VCA, or the competitive group with VCA. Individuals are classified according to personality type as shown in Table II. This classification is based on completed studies of personality characteristics [23].

Module of choosing virtual classmate agent: If the learner is placed in the independent group, the education and learning process starts immediately. Otherwise the system selects a suitable VCA according to the learner's personality type, and then the education and learning process starts. The personality of the VCA should be chosen so that it suits the learner's personality type and helps the learner progress during the learning process. According to completed studies, a VCA with an opposed personality is suitable [6] [10] [39]. The studies in [6] [10] [39] and other psychological sources show that people grouped with opposed personalities do much better than people grouped with similar personalities. The former are more efficient than the latter.

TABLE II.    LEARNER'S CLASSIFICATION BASED ON MBTI TYPE

    Independent    Cooperative with VCA    Competitive with VCA
    INTJ           INFJ                    ENTP
    INFP           ENFP                    ENTJ
    ISTJ           ISFJ                    ESFJ
    INTP           ESFP
    ISFP           ESTP
    ISTP           ENFJ
                   ESTJ

Module of education and learning: In this module, the points of the lesson are presented to the learner as exercises, and the learning process starts.

Module of emotion evaluation: While the learner does the exercises and the rate of learning is evaluated, some emotions arise in the learner, depending on the learner's learning level and the events that happen in the environment (e.g. liking the VCA, disappointment with the exercises, etc.). From completed studies, we have found that only specific emotions are effective in the learning process. The results of the studies in [29] [36] show that the first branch of emotions in the OCC model is effective in learning. Here, we use the first and third branches of the OCC model. The first branch includes the emotions that are effective in the learning process, and the third branch includes the emotions that the person shows in relation to others (e.g. the virtual classmate agent).
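
The grouping in Table II and the choice of an opposed VCA personality can be sketched as a simple lookup. The Python fragment below is only an illustration, not the authors' implementation: the names GROUPS, classify_learner, opposed_type and select_vca are hypothetical, the unlabeled first column of Table II is assumed to be the independent group, and "opposed personality" is assumed to mean flipping all four MBTI letters, which the text does not specify.

```python
# Hypothetical sketch of the Table II lookup; all names are illustrative.
GROUPS = {
    "independent": {"INTJ", "INFP", "ISTJ", "INTP", "ISFP", "ISTP"},
    "cooperative": {"INFJ", "ENFP", "ISFJ", "ESFP", "ESTP", "ENFJ", "ESTJ"},
    "competitive": {"ENTP", "ENTJ", "ESFJ"},
}

# Assumption: an "opposed" type differs in every MBTI dimension.
_FLIP = {"E": "I", "I": "E", "S": "N", "N": "S",
         "T": "F", "F": "T", "J": "P", "P": "J"}


def classify_learner(mbti_type: str) -> str:
    """Return the learning group that Table II assigns to an MBTI type."""
    mbti_type = mbti_type.upper()
    for group, types in GROUPS.items():
        if mbti_type in types:
            return group
    raise ValueError(f"unknown MBTI type: {mbti_type!r}")


def opposed_type(mbti_type: str) -> str:
    """Flip each of the four letters (assumed meaning of 'opposed')."""
    return "".join(_FLIP[letter] for letter in mbti_type.upper())


def select_vca(mbti_type: str):
    """Independent learners get no VCA; others get an opposed-type VCA."""
    if classify_learner(mbti_type) == "independent":
        return None
    return opposed_type(mbti_type)
```

Under these assumptions, `select_vca("INTJ")` yields no VCA because INTJ falls in the independent group, while `select_vca("ENTP")` yields a VCA of type ISFJ for a learner in the competitive group.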


Module of choosing behavior and updating the teaching style: When events in the environment change the learner's emotions, this module changes the teaching style based on the learner's personality characteristics and current emotions. In addition, the VTA and VCA perform suitable tactics to interact with the learner based on the event that has happened. The tactics are stored in the system's knowledge base as a set of "if-then" rules.

B. Calculation of Emotions' Values

In the proposed model, two emotion branches of the OCC model are used. The first branch includes the emotions that are effective in the learning process, and the third branch includes the person's emotions towards others. In this part, we examine how they are calculated.

1) First Branch of Emotions in OCC Model

We use a computational model to obtain the emotions' values [26]. In this model, to calculate the first-branch emotions of the OCC model, the values of a global variable (Unexpectedness) and of local variables (Likelihood and Desirability) must be calculated. The Unexpectedness and Likelihood variables are obtained directly from outside and are calculated from environmental variables. In the implementation based on this model [27] as a VTA, the values of the Unexpectedness and Likelihood variables were calculated according to the learner's answers to the exercises (e.g. True or False) [26]. Here, the values of these two variables are calculated with the same approach, but the calculation of Desirability is different.

a) Learner Agent's Goals According to Personality

According to the model [26], when an event happens in the environment, the event is passed to the event evaluation module. The evaluation is done according to the learner's goals, and the value of Desirability is obtained.

In this paper, according to the "MBTI" questionnaire, four goals are considered for each learner:

•    The goal of the learner's agent is to do the exercise alone and without any help from the VCA.
•    The goal of the learner's agent is to make an effort in doing the exercise, even if the desired result is not obtained.
•    The goal of the learner's agent is to answer the exercises at high speed.
•    The goal of the learner's agent is to cooperate with the VCA in doing the exercises.

The value of each goal is obtained from several questions in the MBTI questionnaire. After normalization, these values lie in the numerical domain 0-1, Eq. (1). The values are placed in the goals matrix G as follows [26]:

$$G = \begin{bmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{bmatrix}, \quad \forall i \in [1,4]: g_i \in [0,1] \tag{1}$$

In Eq. (1), $g_i$ holds the importance of the $i$-th goal.

b) Event's Impact on Learner's Goals

In this stage, the value of each event's impact on each goal is expressed by a matrix, Eq. (2):

$$\mathrm{Impact} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \quad \forall i \in [1,m],\ \forall j \in [1,n]: a_{ij} \in [-1,1] \tag{2}$$

Each element of the matrix gives the value of one event's impact on one specific goal. "m" is the number of events and "n" is the number of goals. A positive value of $a_{ij}$ shows a positive effect of the event on the goal, and a negative value of $a_{ij}$ shows a negative effect of the event on the goal. Each $a_{ij}$ is calculated for the $i$-th event and the $j$-th goal. To calculate each $a_{ij}$, the "Value Function" is used [26].

To calculate the "Value Function", environmental variables are used. These variables measure how far an event achieves a goal. Two vectors are defined: vector "V", which holds the values of the environmental variables that affect a goal, and vector "W", which holds the weight of each environmental variable for that goal. If "n" variables affect a goal, they are placed in vector "V" as in Eq. (3):

$$V = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}, \quad \forall i \in [1,n]: v_i \in [0,1] \tag{3}$$

The weight of each variable is placed in vector "W" as in Eq. (4):

$$W = \begin{bmatrix} w_1 & w_2 & \cdots & w_m \end{bmatrix}, \quad \forall j \in [1,m]: w_j \in [0,n] \tag{4}$$

The value of the "Value Function" is obtained by multiplying the two vectors "V" and "W". Because the value of the "Value Function" should be between 0 and 1, we divide the product by the sum of the elements of the weight vector, Eq. (5):

$$\mathrm{Value}(g_j) = \frac{W \times V}{\sum_{k=1}^{m} w_k} \tag{5}$$

c) Environmental Variables

In the proposed model, according to the goals defined for a learner, five environmental variables are considered. These variables influence the goals:


•    Independence
•    Potential of Cooperation
•    Response Speed
•    Grade of Exercises
•    Effort

According to completed studies on the MBTI questionnaire, each variable that influences the learner's goals is given a numerical weight in the domain 1-3 [42] [12] [14] [22] [24] [40] [45]. The variables and weights are shown in Table III.

TABLE III.    ENVIRONMENTAL VARIABLES

    Vector V                     Vector W
                                 Goal 1    Goal 2    Goal 3    Goal 4
    Independence                   3         1         1         1
    Potential of Cooperation       1         1         1         3
    Response Speed                 1         1         3         1
    Grade of Exercises             2         1         1         2
    Effort                         1         3         1         1

Independence: The value of this variable shows the learner's independence during the learning process. At the learner's first encounter with the system, the value of this variable is 0 for the cooperative and competitive groups and 1 for the independent group. During the learning process, the value of this variable is calculated according to Eq. (6):

$$\text{Independence} = 1 - AH \tag{6}$$

In this equation, the "Independence" variable shows the learner's independence, and "AH" is an indicator of help from the VTA or the VCA. AH is between 0 and 1, and it is obtained by dividing the value of the help requests to the VTA or VCA by the number of

Potential of Cooperation: The value of this variable shows the learner's interest in the cooperative group. It is calculated according to Eq. (7):

$$\text{Potential of Cooperation} = 1 - \text{Independence} \tag{7}$$

In this equation, the Independence variable is calculated from Eq. (6).

Response Speed: The value of this variable shows the learner's response speed, which is obtained with Eq. (8):

$$\text{Response Speed} = 1 - \frac{RT}{DT} \tag{8}$$

In Eq. (8), "RT" is the learner's response time, and "DT" is the system's default time for responding to the exercise.

Grade of Exercises: The value of this variable is between 0 and 1, and it shows the learner's grade for solving the exercises.

Effort: The value of this variable is obtained by asking the learner. It is between 0 and 1.

d) Calculating Desirability

After the goals' values and the impact value of the event on the learner's goals have been calculated, the Desirability value is obtained by multiplying the two matrices "Impact" and "G" according to Eq. (9) [26]. In this equation, $e_i$ represents the incoming event. Since the numerical result should be between -1 and 1, the product is divided by the sum of the elements of the vector "G".

$$\mathrm{Desirability}(e_i) = \frac{\sum_{j=1}^{n} a_{ij}\, g_j}{\sum_{k=1}^{n} g_k} \tag{9}$$

Finally, we obtain the values of the first-branch emotions of the OCC model from the Desirability, Unexpectedness and Likelihood variables. The value of the Likelihood variable is obtained from the student's past performance, and the value of the Unexpectedness variable is obtained according to the student's effectiveness.

2) Third Branch of Emotions in OCC Model

To determine the Love and Hate emotions, we consider the event type and the current emotions. When an event happens in the environment, either positive or negative emotions appear in the learner as a result of the event. The events and their classification into positive and negative groups are shown in Table IV. In addition, the classification of the emotions as positive or negative is shown in Table V.

TABLE IV.    CLASSIFICATION OF THE EVENTS

    Positive                                      Negative
    Accurate response to the exercise             Inaccurate response to the exercise
    Student's effort for solving the exercise     Finishing time of responding and unsolved exercises
    Student's thinking                            Leaving the class
    Requesting help from VCA                      Rejecting help or refusing to cooperate with VCA
                                                  Request for the next exercise and leaving the previous exercise unsolved
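
The calculations of Eqs. (5) to (9) can be traced with a small numerical sketch. The Python fragment below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Eq. (8) is read as 1 - RT/DT and clamped so the result stays in [0, 1], and Eq. (5) is read as the weighted average of the environmental variables using the Table III weights of one goal. How a Value Function result is mapped to an impact entry in [-1, 1] is not specified in the text, so the impact row is simply passed in.

```python
# Weights from Table III: one entry per environmental variable,
# one list position per goal (Goal 1 .. Goal 4).
WEIGHTS = {
    "independence": [3, 1, 1, 1],
    "cooperation": [1, 1, 1, 3],
    "response_speed": [1, 1, 3, 1],
    "grade": [2, 1, 1, 2],
    "effort": [1, 3, 1, 1],
}


def environment(help_ratio, response_time, default_time, grade, effort):
    """Build the variable vector V; every entry is expected to lie in [0, 1]."""
    independence = 1.0 - help_ratio                         # Eq. (6), AH = help_ratio
    # Assumed reading of Eq. (8); the clamp is an addition for safety.
    speed = 1.0 - min(response_time / default_time, 1.0)
    return {
        "independence": independence,
        "cooperation": 1.0 - independence,                  # Eq. (7)
        "response_speed": speed,
        "grade": grade,
        "effort": effort,
    }


def value_function(v, goal):
    """Eq. (5): W x V divided by the sum of the weights of one goal."""
    weights = {name: w[goal] for name, w in WEIGHTS.items()}
    total = sum(weights.values())
    return sum(weights[name] * v[name] for name in v) / total


def desirability(impact_row, goals):
    """Eq. (9): sum of a_ij * g_j over the goals, divided by the sum of g."""
    return sum(a * g for a, g in zip(impact_row, goals)) / sum(goals)
```

As a usage example with invented numbers, `environment(0.25, 30, 60, 0.8, 0.5)` gives an independence of 0.75 and a response speed of 0.5, for which `value_function(v, 0)` is about 0.64 for Goal 1. With a made-up impact row `[1.0, 0.5, -0.5, 0.0]` and goal values `[0.8, 0.6, 0.4, 0.2]`, `desirability` returns about 0.45, which stays inside the [-1, 1] range required by the text.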
        Response Speed= 1 –                                        (8)


      TABLE V.     CLASSIFICATION OF THE EMOTIONS IN OCC MODEL
                   Positive                Negative
                     Joy                   Distress
                    Hope                     Fear
                 Satisfaction           Disappointment
                    Relief               Fear-confirmed
                   Gloating                   Pity
                  Happy-for               Resentment
                    Pride                   Shame
                 Admiration                Reproach
                    Love                      Hate
                  Gratitude                 Anger
                 Gratification             Remorse

   In this section, based on the studies conducted in [37],
positive and negative emotions, and positive and negative
events, we consider four states as follows:

  •     If the learner’s current emotion is negative and a
        negative event has happened, Hate emotion toward VCA
        will be expressed.
  •     If the learner’s current emotion is negative and a
        positive event has happened, Hate emotion toward VCA
        will be expressed.
  •     If the learner’s current emotion is positive and a
        negative event has happened, Love emotion toward
        VCA will be expressed.
  •     If the learner’s current emotion is positive and a positive
        event has happened, Love emotion toward VCA will be
        expressed.

     According to these states, a Love or Hate emotion toward the
VCA is obtained. Since the numerical value of these two
emotions is not important for the rules of our expert system, it
is enough for the system to differentiate between them.

                       V.        IMPLEMENTATION
   We have implemented our model in an educational
environment. The educational domain in this environment is
learning the English language. To better evaluate the
proposed model, this environment is compared with two other
environments. The three environments examined are as follows:

   •     Educational environment without emotions
   •     Educational environment with emotional VTA
   •     Educational environment with emotional VTA and
         VCA who have emotions and personality

A. Educational Environment without Emotions
    This environment is a simple virtual education
environment. The learner enters the environment and just tries
to solve many exercises. After responding to the questions of a
level, the learner is promoted to an upper level. During the
learning process, the environment does not provide any guidance
for the learners, and just the learner’s grade is shown on the

B. Educational Environment with Emotional VTA
    In this environment, the VTA tries to infer the learner’s
emotions by simulating the learner, and then selects and performs
suitable tactics using the learner’s emotions and the events
happening in the environment. The VTA can also show emotional
states and interact with the learner. Considering the events
happening in the environment, the VTA infers the learner’s
emotions and, based on them, uses suitable tactics for teaching
the learner more effectively. The tactics are saved as a set of
rules in the VTA’s knowledge base. One of them, for example,
is as follows:

 Rule     IF              Disappointment       IS   High
                  OR      Disappointment       IS   Medium
                  AND     Event                IS   Wrong Answer
          THEN            Teacher-Tactic 1     IS   Increase-Student-Self-Ability
                  AND     Teacher-Tactic 2     IS   Increase-Student-Effort

C. Educational Environment with Emotional VTA and VCA
    with Emotions and Personality
    This environment is implemented based on the proposed
model. The environment uses VTA and VCA together. The two
agents, based on recognizing the learner’s emotions, select
suitable tactics to interact with him/her (Figure 2).

          Figure 2. Educational Environment with VTA and VCA

    In this environment, the selection of tactics for dealing with
the learner depends on the group’s type. According to the learner’s
learning group, certain tactics are used. At the beginning, based
on the learner’s personality, one of the independent,
cooperative, or competitive learning groups is considered for
him/her.

    The rules for determining the learner’s group are saved in
the system’s knowledge base. For example, one of the rules is
as follows:

Rule 1:
          IF        Student's Personality           IS     ISFJ
          THEN      His/her group                   IS     Cooperative
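
The group-assignment rule quoted above (Rule 1: ISFJ gives Cooperative) can be read as a simple lookup from MBTI type to learning group. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `LEARNING_GROUPS`, `GROUP_RULES`, and `learning_group` are hypothetical, and only the ISFJ entry comes from the text. The rules for the other personality types are not given here, so the sketch returns `None` for them rather than guessing.

```python
from typing import Optional

# The three learning groups named in the text.
LEARNING_GROUPS = ("Independent", "Cooperative", "Competitive")

# Hypothetical rule table. Only the ISFJ entry is taken from the text;
# the remaining personality types would come from the knowledge base.
GROUP_RULES = {"ISFJ": "Cooperative"}


def learning_group(personality: str, rules: dict = GROUP_RULES) -> Optional[str]:
    """Return the learning group stored for an MBTI type, or None if no rule matches."""
    group = rules.get(personality.strip().upper())
    if group is not None and group not in LEARNING_GROUPS:
        raise ValueError(f"unknown learning group: {group}")
    return group


print(learning_group("ISFJ"))
```

Keeping the rules in a table rather than in code mirrors the text's description of rules stored in a knowledge base: adding a rule for another personality type only adds an entry.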


  1) Independency’s Learning Group
    When a learner is in this group, he is supposed to solve the
exercises by himself and without the presence of VCA. In this
environment, VTA does not help the learner either and only
performs his administrative tasks. This learning environment is
very similar to the educational environment with the emotional
VTA. There is, however, a small difference: since the learner’s
personality type is considered independent, and such people like
to solve the exercises by themselves, helping with tasks is
removed from the VTA’s tasks. The VTA’s other tasks are similar
to those of the VTA in the educational environment described
above. The tactics defined for the VTA in this group are listed
in Table VI.

                                 VTA’s Tactics
    1    Increase-Student-Self-ability      7    Allow-to-Leave-Virtual-Class
    2    Increase-Student-Effort            8    Teacher-IS-Idle
    3    Congratulate-Student               9    Propose-Cooperate-With-
    4    Encourage-Student                  10   Change-Student-Group-to-Cooperative
    5    Recognize-Student-Effort           11   Change-Student-Group-to-Competitive
    6    Show-Student-New-Skills            12   Show-Next-Exercise

   VTA performs some tactics to interact with the learner.
The tactics are saved as rules in VTA’s knowledge base. For
example, one of the rules is as follows:

Rule 1:     IF                  Student Group         IS     Independent
                       AND      Disappointment        IS     High
                       OR       Disappointment        IS     Medium
                       AND      Event                 IS     Wrong Answer
            THEN                Teacher-Tactic 1      IS     Increase-Student-Self-Ability
                       AND      Teacher-Tactic 2      IS     Increase-Student-Effort
                       AND      Teacher-Tactic 3      IS     Change-Student-Group-to-Cooperative

    a) Physical and Verbal Behaviors
   Every tactic includes a set of physical and verbal behaviors.
The physical behaviors are the emotional behaviors that the
VTA expresses to show his emotional states. These
behaviors are executed through the built-in functions of the
Merlin agent designed by Microsoft.
    For each verbal behavior, there is a sentence that VTA
presents as text. Instances of the physical and verbal behaviors
are shown in Table VII and Table VIII, respectively.

          Physical Behavior                       Merlin Function
           Congratulation                    Congratulate, pleased, Speak

 TABLE VIII.        VERBAL BEHAVIORS OF VTA IN INDEPENDENCY LEARNING

    Verbal              Merlin’s Speaking
                        Uauuuuu! You are very well! Congratulations for the
                        efforts that you made!
    Congratulation      Congratulations! You obtain an excellent result!
                        Continue it.
    Congratulation      Congratulations! Your performance was stupendous!
    Congratulation      Congratulations for your efforts!
    Congratulation      Congratulations! You reached a good result!

  2) Cooperative Learning Group
    When a learner is in this group, according to his
personality type, he is interested in participating in the group
and doing cooperative activities. For this reason, the
environment has been designed in such a way that it uses a
VCA to interact with the learner.
    In this environment, due to the presence of VCA, VTA’s
tasks are changed, and we have defined a set of tasks for VTA
and VCA. The tactics of VTA and VCA in the Cooperative Learning
Environment are shown in Table IX.

 TABLE IX.          TACTICS OF VTA AND VCA IN COOPERATIVE LEARNING
                                ENVIRONMENT
    VCA’s Tactics                            VTA’s Tactics
    Increase-Student-Self-ability            Congratulate-Student
    Increase-Student-Effort                  Recognize-Student-Effort
    Persuade-Student-To-Think-More-          Show-Student-New-Skills
    Cooperate-With-Student                   Show-Next-Exercise
    Notify-Student-For-Deadline              Allow-To-Leave-Virtual-Class
    Encourage-Student                        Teacher-IS-Idle
    Give-Help                                Change-Student-Group-To-Independent
    Persuade-Student-To-Be-Independent       Change-Student-Group-To-Competitive
    Offer-Cooperation

    According to the above description, a set of rules is saved
in the knowledge base of the environment so that the VTA and
VCA interact with the learner in accordance with
them. For example, one of the rules is as follows:


Rule 1:   IF                  Student Group          IS     Cooperative
                     AND      Like                   IS     High
                     OR       Like                   IS     Medium
                     AND      Distress               IS     High
                     AND      Event                  IS     Wrong-Answer
                     AND      Student’s              IS     Higher-Than-Threshold
                              Response speed
                     AND      Virtual                IS     IN
                     OR       Virtual                IS     IS
          THEN                Teacher_Tactic1        IS     Recognize-
                     AND      Classmate_Tactic1      IS     Persuade-Student-to-Think-More-

    In this part, VTA’s behaviors are executed by the Merlin
agent, and VCA’s behaviors are executed by the Peedy agent.
Instances of the physical and verbal behaviors related to the
VCA are given in Table X and Table XI.

                              ENVIRONMENT

  Physical Behavior        Peedy Function
   Persuade_Student        Think, Speak

Verbal Behavior            Peedy’s Speaking
Persuade-Student           You have to think more for doing your
                           assignment.
Persuade-Student           I know, you are very intelligent, but you’d
                           better spend more time on your assignments.
Persuade-Student           Let think more on this idea, please. There are
                           enough times. Don’t festinate to answer.

  3) Competitive Learning Group
   When a learner is in this group, he/she may be interested in
competing with his/her classmate, based on his/her personality
type. For this reason, the environment has been designed in
such a way that it uses a VCA for competing with the learner
and simulates a competitive environment for the learner.

    In this environment, due to the presence of VCA, VTA’s
tasks are changed, and we have defined a set of tasks for VTA
and VCA. The tactics of VTA and VCA in the Competitive Learning
Environment are shown in Table XII.

 TABLE XII.       TACTICS OF VTA AND VCA IN COMPETITIVE LEARNING
                                ENVIRONMENT
    VCA’s Tactics                      VTA’s Tactics
    Increase-Student-Effort            Congratulate-Student
                                       Congratulate-Classmate
    More-For-Problem
    Notify-Student-For-Deadline        Increase-Student-Self-Ability
                                       Encourage-Student
                                       Show-Student-New-Skills
                                       Allow-To-Leave-Virtual-Class
                                       Independent
                                       Change-Student-Group-To-Cooperative

    According to the above description, a set of rules is saved
in the knowledge base of the environment so that the VTA and
VCA interact with the learner in accordance with
them. For example, one of the rules is as follows:

Rule 1:   IF                   Student Group          IS   Competitive
                      AND      Like                   IS   High
                      OR       Like                   IS   Medium
                      AND      Distress               IS   High
                      AND      Event                  IS   Wrong Answer
                      AND      Student’s              IS   Higher-Than-Threshold
                               Response speed
                      AND      Virtual Classmate's    IS   IN
                               Personality
                      OR       Virtual Classmate's    IS   IS
          THEN                 Classmate_Tactic1      IS   Persuade-Student-to-Think-
                      AND      Teacher_Tactic1        IS   Recognize-Student-Effort
                      AND      Teacher_Tactic2        IS   Change-Student-
                                                           Cooperative

    In this part, as in the previous part, a set of physical and
verbal behaviors is defined for each tactic of the VTA and VCA.
For a better comparison between the cooperative and competitive
states of the virtual classmate agent, the behaviors for the
Persuade-Student-to-Think tactic are given in Tables XIII
and XIV.

 TABLE XIII.      PHYSICAL BEHAVIORS OF VCA IN COMPETITIVE LEARNING
                                ENVIRONMENT
    Physical Behavior              Peedy Function
    Persuade_Student_to_Think      Speak


                        LEARNING ENVIRONMENT

  Verbal Behavior               Peedy’s Speaking

  Persuade-Student-to-Think     I think you should take your time doing
                                your assignments.
  Persuade-Student-to-Think     No need to rush. Think them over!
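
The IF/THEN tactic rules listed in this section, such as Rule 1 of the Independency learning group, can be read as conditions over labelled variables that fire a list of tactics. The sketch below is one possible reading, not the authors' expert system. It assumes that labels on the same variable are OR-ed ("Disappointment IS High OR Disappointment IS Medium") and that conditions on different variables are AND-ed. The text does not state the precedence of AND and OR, so this grouping is an assumption, and the names `Rule` and `select_tactics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # variable name -> set of accepted labels (labels for one variable are OR-ed)
    conditions: dict
    # agent role -> tactic names fired when every condition holds
    tactics: dict

    def matches(self, facts: dict) -> bool:
        # Conditions on different variables are AND-ed together (assumed grouping).
        return all(facts.get(var) in allowed
                   for var, allowed in self.conditions.items())


# Rule 1 of the Independency learning group, transcribed from the text.
INDEPENDENT_RULE_1 = Rule(
    conditions={
        "Student Group": {"Independent"},
        "Disappointment": {"High", "Medium"},
        "Event": {"Wrong Answer"},
    },
    tactics={
        "Teacher": [
            "Increase-Student-Self-Ability",
            "Increase-Student-Effort",
            "Change-Student-Group-to-Cooperative",
        ]
    },
)


def select_tactics(rules, facts):
    """Collect the tactics of every rule whose conditions all hold."""
    selected = {}
    for rule in rules:
        if rule.matches(facts):
            for role, names in rule.tactics.items():
                selected.setdefault(role, []).extend(names)
    return selected


facts = {"Student Group": "Independent",
         "Disappointment": "Medium",
         "Event": "Wrong Answer"}
print(select_tactics([INDEPENDENT_RULE_1], facts))
```

Rules for the cooperative and competitive groups would add a "Classmate" role to `tactics`, so that one matching rule can select tactics for both the VTA and the VCA.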

                    VI.   EXPERIMENTAL RESULTS
    To evaluate the educational environment based on the
proposed model, three educational environments were shown
to thirty users, who evaluated them. After that, they
were asked to answer a questionnaire (Appendix A). According
to the users’ responses, the following results were obtained
(Figures 3-14):

   Figure 3. Evaluating of Learning Rate

   Figure 4. Evaluating Attractiveness of Educational Environments

   Figure 5. Evaluating Interaction and User’s Satisfaction of
             Educational Environments

   Figure 6. Evaluating Interaction and User’s Satisfaction Of Educational

   Figure 7. Evaluating Interaction and User’s Satisfaction of Educational
             Environment 2

   Figure 8. Evaluating Interaction and User’s Satisfaction of Educational
             Environment 3

   Figure 9. Comparing User’s Interaction and Satisfaction of Three
             Educational Environments.


   Figure 10. Evaluating VTA’s Function in the Environment 2

   Figure 11. Evaluating VTA’s Function in the Environment 3

   Figure 12. Comparing VTA’s Function in the Environment 2 and 3

   Figure 13. Evaluating VCA’s Function

   Figure 14. Evaluating the Effect of VCA’s Presence in Learning Progress

                  VII. CONCLUSION AND FUTURE WORKS
    As the results of the evaluation show, users believe that
educational environment 3 is more attractive than educational
environments 1 and 2. They also believe that educational
environment 2 is more attractive than educational environment 1.
    The results of the evaluation show that the presence of
intelligent agents with human-like features can increase the
learning rate, and that it plays an important role in attracting
users to virtual educational environments. Comparing
educational environment 2 with educational environment 3
showed that the presence of a VCA increases users’ satisfaction
and users’ interaction with the environment.
    Finally, the results show that VTA and VCA have a
considerable effect on improving the learning process.
    In the future, we will try to complete the system. The
following enhancements are possible:
    Applying all dimensions of “MBTI”: In the proposed model,
just two dimensions of MBTI were used for VCA simulation.
To present this agent more realistically, the two other
dimensions can be added.
    Applying all emotions of the OCC model: In this model, just
the emotions of the first and third branches of the OCC model
were used. We can improve the model by using all of them.
    Applying the culture model: Much research on the impact of
culture on learning styles has been published. In the future,
by adding this parameter to the personality and emotion models,
we can increase the environment’s credibility for the learners.
    Adding the learning feature to agents: In the implemented
system, agents perceive the states according to the written rules
of the system and take a suitable action. By adding learning
features to VTA and VCA, the agents could learn the rules from
the learner’s reactions in the environment.
    Completing the system: We can add more functions to
the system, such as classifying lessons according to their
degree of difficulty. In addition, the number of VCAs in the
learning environment can be increased.


                                REFERENCES                                              [22] Higgs, M. (2001). "Is there a relationship between the Myers-Briggs
                                                                                             type indicator and emotional intelligence?", Journal of Managerial
[1]    Abrahamian, E., Weinberg, J., Grady, M., and Michael Stanton, C.                      Psychology, Vol. 16, no. 7, pp. 509-533.
       (2004). "The Effect of Personality-Aware Computer-Human Interfaces
       on Learning" , Journal of Universal Computer Science, Vol. 10, pp. 27-
       37.                                                                              [24] Jessee, S.A., O’Neill, P.N. and Dosch, R.O. (2006). "Matching Student
                                                                                             Personality Types and Learning Preferences to Teaching
[2]    Al Masum, S.M. and Ishizuka, M. (2005). "An Affective Role Model of
                                                                                             Methodologies", Journal of Dental Education, Vol. 70, pp. 644-651.
       Software Agent for Effective Agent-Based e-learning by Interplaying
       between Emotions and Learning", WEBIST, USA, pp. 449-456.                        [25] Ju, W., Nickell, S., Eng, K. and Nass, C. (2005). "Influence of colearner
                                                                                             agent behavior on learner performance and attitudes", CHI '05 extended
[3]    Anderson, T. and Elloumi, F. (2003). "Theory and practice of online
       learning".                                                                            abstracts on Human factors in computing systems, ACM Press, Portland,
                                                                                             OR, USA.
[4]    Berry, M. (2005). "A virtual learning environment in primary
       education".                                                                      [26] Kazemifard, M., Ghasem-Aghaee, N., and Oren, T. I. (2006). "An
                                                                                             Event-based Implementation of Emotional Agents". In Proceeding of the
[5]    Botsios, S., Mitropoulou, V., Georgiou, D. and Panapakidis, I. (2006).                Summer Simulation Conference (SCSC'06), Calgary, Canada,pp. 63-67.
       "Design of Virtual co-Learner for Asynchronous Collaborative e-
       Learning". ICALT, pp. 406-407.                                                   [27] Kazemifard, M., and Ghasem-Aghaee, N. (2007). "Virtual Tutor". In
                                                                                             Proceeding of the 15th Iranian Conference on Electrical Engineering
[6]    Capretz L., F. (2002). "Implications of MBTI in Software Engineering                  (ICEE'07), Tehran, Iran,pp. 358-363.
       Education", ACM SIGCSE Bulletin - inroads, ACM Press, New York,
       Vol. 34, pp. 134-137.                                                            [28] Kestern, A.J.(2001)."A supervised machine-learning approach to
                                                                                             artificial emotions". Master’s thesis, Department of Computer Science,
[7]    Chaffar, S. and Frasson, C. (2004). "Using an Emotional Intelligent                   University of Twente.
       Agent to Improve the Learner’s Performance", In Social and emotional
       intelligence in learning Environments workshop, 7th International                [29] Kort, B. and Reilly, R. (2001). "Analytical Models of Emotions,
       Conference on Intelligent Tutoring System, Brazil, pp. 37-43.                         Learning and Relationships: Towards an Affect-sensitive Cognitive
                                                                                             Machine". MIT Media Lab Tech Report, no. 548.
[8]    Chaffar, S., Cepeda, G. and Frasson, C. (2007). "Predicting the Learner’s
                                                                                        [30] Kshirsagar, S. and Magnenat-Thalmann, N. (2002). "A multilayer
       Emotional Reaction towards the Tutor’s Intervention", 7th IEEE
       International Conference, Japan, pp. 639-641.                                         personality model", Proceedings of the 2nd international symposium on
                                                                                             Smart graphics, Hawthorne, New York pp.107-115.
[9]    Chalfoun, P., Chaffar, S. and Frasson, C. (2006). "Predicting the
                                                                                        [31] Lang, H.G., Stinson, M.S., Kavanagh, F., Liu, Y., and Basile, M. (1999).
       Emotional Reaction of the Learner with a Machine Learning
                                                                                             "Learning styles of deaf college students and instructors’ teaching
       Technique", Workshop on Motivaional and Affective Issues in ITS,
       International Conference on Intelligent Tutoring Systems, Taiwan                      emphases". Journal of Deaf Studies and Deaf Education, Vol. 4, pp. 16-
[10]   Choi, K., Deek, F.P., and Im, I. (2008). "Exploring the Underlying
       Aspects of Pair Programming: The Impact of Personality", Journal of              [32] Li, Y.S., Chen, P.S., and Tsai, S.J. (2007). "A comparison of the
       Information and Software Technology.                                                  learning styles among different nursing programs in Taiwan:
                                                                                             Implications for nursing education", Journal of Nurse Education Today.
[11]   Damasio, A.R. (1994). "Descartes’ Error: Emotion, Reason, and the                     Vol. 28, no. 1, pp. 70-76.
       Human Brain ", Gosset/Putnam Press, New York.
                                                                                        [33] Logan, K., and Thomas, P. (2002). "Learning Styles in Distance
[12]   Dewar,T. and Whittington,D. (2000). "Online Learners and their                        Education Students Learning to Program", Proceedings of 14th
       Learning Strategies" , Journal of Educational Computing Research, Vol.                Workshop of the Psychology of Programming Interest Group, Brunel
       23, no. 4, pp. 415-433.                                                               University, pp. 29-44.
[13]   Du, J., Zheng, Q., Li, H. and Yuan, W. (2005). "The Research of Mining           [34] Maria, K.A. and Zitar, R.A. (2007). "Emotional Agents: A Model and an
       Association Rules Between Personality and Behavior of Learner Under                   Application", Journal of Information and Software Technology, Elsevier
       Web-Based Learning Environment", ICWL, pp. 406 – 417.                                 Publishing, Vol. 49, no. 6, pp. 695–716.
[14]   Durling, D., Cross, N., and Johnson, J. (1996). "Personality and learning        [35] Marin , B.F., Hunger, A. and Werner, S. (2006). "Corroborating
       preferences of students in design and design-related disciplines".                    Emotion Theory with Role Theory and Agent Technology: a Framework
       Proceedings of IDATER 96 (International Conference on Design and                      for Designing Emotional Agents as Tutoring Entities", Journal of
       Technology Educational Research), Loughborough University, pp. 88-                    Networks, Vol. 1, pp. 29-40.
                                                                                        [36] Miranda, J.M and Aldea, A. (2005). "Emotions in Human and Artificial
[15]   Ellis, S., and Whalen,       S. (1996). “Cooperative Learning: Getting                Intelligence", Journal of Computers in Human Behaviour, Vol. 21, no.
       Started”, Scholastic.                                                                 2, pp. 323-341.
[16]   El-Nasr, M., Yen, J. and Ioerger, T. (2000), "FLAME – fuzzy logic                [37] Morishima, Y., Nakajima, H., Brave, S., Yamada, R., Maldonado, H.,
       adaptive model of emotions", Autonomous Agents and Multi-Agent                        Nass, C. and Kawaji, S. (2004), "The Role of Affect and Sociality in the
       Systems, Vol. 3, no.3, pp. 219-57.                                                    Agent-Based Collaborative Learning System", In Tutorial and Research
[17]   Fatahi, S., Ghasem-Aghaee, N., and Kazemifard, M. (2008). "Design an                  Workshop, New York, pp. 265-275.
       Expert System for Virtual Classmate Agent (VCA)". Proceedings of                 [38] Ortony, A., Clore, G. L. and Collins, A .1988. “The Cognitive Structure
       World Congress Engineering 2008, U.K, London.                                         of Emotions”, Cambridge University Press, Cambridge, UK.
[18]   Fatahi, S., M. Kazemifard, and N. Ghasem-Aghaee. 2009. "Design and               [39] Peslak, A.R. (2006). "The impact of personality on information
       Implementation of an E-Learning Model by Considering Learner's                        technology team projects", Proceedings of the 2006 ACM SIGMIS CPR
       Personality and Emotions". In Advances in Electrical Engineering and                  conference on computer personnel research: Forty four years of
       Computational Science, Netherlands: Springer, Vol. 39, pp. 423-434.                   computer personnel research: achievements, challenges & the future,
[19]   Ghasem-Aghaee, N., Fatahi, S., and T.I. Ören. (2008). "Agents with                    Claremont, California, USA.
       Personality and Emotional Filters for an E-learning Environment",                [40] Rushton, S., Morgan, J. and Richard, M. (2007). "Teacher's Myers-
       Proceedings of Spring Agent Directed Simulation Conference, Ottawa,                   Briggspersonality profiles: Identifying effective teacher personality
       Canada.                                                                               traits", Journal of Teaching and Teacher Education, Vol. 23, pp. 432-
[20]   Harati Zadeh, S., bagheri Shouraki, S. and Halavati, R. (2006).                       441.
       "Emotional Behavior: A Resource Management Approach", Journal of                 [41] Sarmento, L.M. (2004). "An Emotion-Based Agent Architecture",
       SAGE, Vol. 14, pp. 357-380.                                                           Master Thesis in Artificial Intelligence and Computing, Universidade do
[21]   Hartmann, P. (2006). "The Five-Factor Model: Psychometric, biological                 Porto.
       and practical perspectives".                                                     [42] Schultz, D. P., and Schultz, S.E., (2008) “Theories of Personality”,
                                                                                             edition 4.


[43] Shermis, M.D. and Lombard, D. (1998). "Effects of computer-based test
     administration on test anxiety and performance". Journal of Computers
     in Human Behavior, Vol. 14, pp. 111-123.
[44] Vinayagamoorthy, V., Gillies, M., Steed, A., Tanguy, E., Pan, X.,
     Loscos, C., and Slater, M. (2006). "Building Expression into Virtual
     Characters", In Eurographics 2006. Vienna: Proceeding of
[45] Vincent, A. and Ross, D. (2001). "Personalize training: determine
     learning styles, personality types and multiple intelligences online", The
     Learning Organization, Vol. 8, no. 1, pp. 36-43, ISSN 0969-6474.
[46] Xiangjie, Q., Zhiliang, W., Jun, Y. and Xiuyan, M. (2006). "An
     Affective Intelligent Tutoring System Based on Artificial Psychology".
     Beijing University of Science and Technology, China, pp. 402-405.
[47] Yeung, A., Read, J. and Schmid, S. (2005). "Students' learning styles and
     academic performance in first year chemistry". Proceedings of the
     Blended Learning in Science Teaching and Learning Symposium.
     Sydney, NSW: UniServe Science, pp. 137-42.

Appendix A.

A few sample questions from the questionnaire on the user's
satisfaction with the system.

(i) Which of the educational environments is more attractive?
     (a) The educational environment 1 (simple educational
         environment)
     (b) The educational environment 2 (educational
         environment with emotional virtual tutor)
     (c) The educational environment 3 (educational
         environment with virtual tutor and classmate agent
         who has emotions and personality)

(ii) Your satisfaction level with the educational environment
     3 (educational environment with a virtual tutor and a
     classmate agent with emotions and personality):
     a) Very Low      b) Low      c) Medium      d) High
     e) Very High

(iii) Your satisfaction level with the virtual tutor's help in the
      educational environment 2:
      a) Very Low      b) Low      c) Medium      d) High
      e) Very High

(iv) Your satisfaction level with the virtual classmate's
     encouragements in the educational environment 2:
     a) Very Low      b) Low      c) Medium      d) High
     e) Very High

(v) Your satisfaction level with the virtual classmate's
     encouragements in the educational environment 3:
     a) Very Low      b) Low      c) Medium      d) High
     e) Very High

(vi) How do you evaluate the presence of the virtual classmate
     a) Poor          b) Fair     c) Acceptable  d) Good
     e) Excellent

(vii) Your evaluation of the effect of the cooperation with the
      virtual classmate agent on learning progress:
      a) Very Low      b) Low      c) Medium      d) High
      e) Very High

(viii) Your evaluation of the effect of the competition with the
       virtual classmate agent on learning progress:
       a) Very Low      b) Low      c) Medium      d) High
       e) Very High

Somayeh Fatahi received her B.Sc. and M.Sc. degrees in
Computer Engineering from Razi University, and Isfahan
University, Iran, in 2006 and 2008 respectively. She is a
Lecturer in the Department of Computer Engineering at
Kermanshah University of Technology. She also teaches at
Razi University, University of Applied Science and
Technology, Payam Noor University of Kermanshah, Islamic
Azad University of Kermanshah, Institute of Higher Education
of Kermanshah Jahad-Daneshgahi.
Her research activities include (1) simulation of agents with
dynamic personality and emotions, (2) computational
cognitive modeling, (3) simulation and formalization of
cognitive processes, (4) multi-agent systems, (5) modeling of
human behavior, and (6) fuzzy expert systems.

Dr. Nasser Ghasem-Aghaee is a co-founder of Sheikhbahaee
University of Higher Education in Isfahan, Iran, as well as
Professor in the Department of Computer Engineering at both
the Isfahan University and Sheikhbahaee University. In 1993-
1994 and 2002-2003, he was a visiting Professor at the
Ottawa Center of the McLeod Institute for Simulation
Sciences at the School of Information Technology and
Engineering at the University of Ottawa. He has been active in
simulation since 1984. His research interests are modeling and
simulation, cognitive simulation (including simulation of
human behaviour by fuzzy agents, and agents with dynamic
personality and emotions), artificial intelligence, expert
systems, fuzzy logic, object-oriented analysis and design,
multi-agent systems and their applications. He has published more
than 100 documents in journals and conferences.

                                                                                                                 ISSN 1947-5500
                                   (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                             Vol. 7, No. 3, 2010

   Signature Recognition using Multi Scale
       Fourier Descriptor And Wavelet
   Ismail A. Ismail 1, Mohammed A. Ramadan 2, Talaat S. El danaf 3 and Ahmed H. Samak 4
   1 Professor, Dean, College of Computers and Informatics, Misr International University, Egypt
   2 Professor, Department of Mathematics, Faculty of Science, Menofia University, Egypt
   3 Ass. Professor, Department of Mathematics, Faculty of Science, Menofia University, Egypt
   4 Ass. Lecturer, Department of Mathematics, Faculty of Science, Menofia University, Egypt
                                Email :

Abstract This paper presents a novel off-line signature recognition
method based on the multi scale Fourier descriptor and the wavelet
transform. The main steps of constructing a signature recognition
system are discussed, and experiments on real data sets show that the
average error rate can reach 1%. Finally, we compare 8 distance
measures between feature vectors with respect to the recognition

Key words: signature recognition, Fourier descriptor, wavelet transform, personal

    1- INTRODUCTION

    In the past decades, biometrics research has always been a focus of interest
for scientists and engineers. It is the art and science of using physical and
behavioral characteristics to verify or identify a person. In particular,
handwriting is believed to be singular, exclusive and personal to each
individual. The handwritten signature is the most popular identification method,
socially and legally, and has been widely used in bank check and credit card
transactions, document certification, etc.

    The objective of signature recognition is to recognize the signer for the
purpose of recognition or verification. Recognition is finding the identity of
the signature owner. Verification is the decision about whether the signature is
genuine or a forgery, as in Figure 1.

                          Figure 1: Verification task

    Handwritten signature recognition can be divided into on-line (or dynamic)
and off-line (or static) recognition. On-line recognition refers to a process in
which the signer uses a special pen called a stylus to create his or her
signature, producing the pen locations, speeds and pressures, while off-line
recognition deals only with signature images acquired by a scanner or a digital
camera. In general, off-line signature recognition is a challenging problem.
Off-line handwriting recognition is more difficult than on-line recognition
because dynamic information such as duration, time ordering, number of strokes
and direction of writing is lost. However, off-line systems have a significant
advantage in that they don't require access to special processing devices when
the signature is produced. This paper deals with an off-line signature
recognition and verification system.

    In the last few decades, many approaches to the off-line signature
verification problem have been developed in the pattern recognition area.
Justino, Bortolozzi and Sabourin proposed an off-line signature verification
system using a Hidden Markov Model [1]. Zhang, Fu and Yan (1998) proposed a


handwritten signature verification system based on Neural 'Gas' based Vector
Quantization [2]. Vélez, Sánchez and Moreno proposed a robust off-line signature
verification system using compression networks and positional cuttings [3].

    Arif and Vincent (2003) studied data fusion methods for the off-line
signature verification problem, namely Dempster-Shafer evidence theory,
possibility theory and the Borda count method [4]. Chalechale and Mertins used
the line segment distribution of sketches for Persian signature recognition [5].
Sansone and Vento (2000) increased the performance of a signature verification
system by a serial three-stage multi-expert system [6].

    Inan Güler and Majid Meghdadi (2008) proposed a method for automatic
handwritten signature verification (AHSV). This method relies on global features
that summarize different aspects of signature shape and the dynamics of
signature production. In designing the algorithm, they tried to detect the
signature without paying any attention to its thickness and size [7].

    Jing Wen, Bin Fang, Y.Y. Tang and TaiPing Zhang (2009) presented two models
utilizing rotation invariant structure features to tackle the problem. In
principle, the elaborately extracted ring-peripheral features are able to
describe internal and external structure changes of signatures periodically. In
order to evaluate the match score quantitatively, the discrete fast Fourier
transform is employed to eliminate phase shift, and verification is conducted
based on a distance model. In addition, the ring-hidden Markov model (HMM) is
constructed to directly evaluate the similarity between the test signature and
the training samples [8].

     2- DATABASE
The signature database consists of 840 signature images, scanned at a resolution
of 300 dpi, 8-bit gray-scale. They are organized into 18 sets, and each set
corresponds to one signature enrollment. There are 24 genuine and 24 forgery
signatures in a set. Each volunteer was asked to sign his or her own signature
on white paper 24 times. After this process had been done, we invited some
people who are good at imitating others' handwriting to produce the forgeries.
Examples of the database images are shown in Figure 2.

Any image-processing application suffers from noise like touching line segments,
isolated pixels and smeared images. This noise may cause severe distortions in
the digital image and hence ambiguous features and a correspondingly poor
recognition and verification rate. Therefore, a preprocessor is used to remove
noise. Preprocessing techniques eliminate much of the variability of signature
data. The preprocessor also achieves scaling and rotation invariance using slant
normalization.

3-1 Noise Reduction
Standard noise reduction and isolated peak noise removal techniques, such as
median filtering [1], are used to clean the initial image. The median filter is
a sliding-window spatial filter; it replaces the center value in the window with
the median of all the pixel values in the window. The kernel is usually square
but can be any shape. Figure 3 presents an example of noise reduction.

3-2 Slant Normalization
Normalization is necessary to achieve scaling and rotation invariance of the
target images before the recognition phase. Scale normalization can be done by
scaling the image along the x coordinate and the y coordinate respectively to
the prefixed size. For slant normalization, a moment-based algorithm is
described in [10]. The basic idea is to compute the major orientation, or slant
angle, of the handwriting strokes from the second moments of the foreground
pixels, and to rotate the foreground pixels by the computed angle in the
opposite direction so that the major orientation is horizontal. Figure 4
presents examples of slant normalization.
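
The two preprocessing steps described in Sections 3-1 and 3-2 can be sketched in
a few lines of standard-library Python. This is only an illustration under
stated assumptions, not the authors' implementation or the algorithm of [10]:
the image is assumed to be a nested list of gray levels (or 0/1 values for the
binary image), the window is a 3x3 square by default, border pixels simply use
the part of the window that lies inside the image, and all function names are
hypothetical.

```python
import math
from statistics import median_low


def median_filter(img, k=3):
    """Replace each pixel by the median of its k x k neighbourhood (k odd).

    Assumption: at the borders the window is clipped to the image, and
    median_low is used so the result is always an existing pixel value.
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = median_low(window)
    return out


def foreground_points(binary):
    """Return (x, y) coordinates of the non-zero (ink) pixels."""
    return [(x, y) for y, row in enumerate(binary)
            for x, v in enumerate(row) if v]


def major_orientation(points):
    """Centroid and principal-axis angle from second-order central moments."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return cx, cy, theta


def rotate_to_horizontal(points):
    """Rotate the points by -theta about the centroid so the major axis is horizontal."""
    cx, cy, theta = major_orientation(points)
    c, s = math.cos(theta), math.sin(theta)
    return [(cx + c * (x - cx) + s * (y - cy),
             cy - s * (x - cx) + c * (y - cy)) for x, y in points]
```

In this sketch the rotated coordinates are returned as real numbers; resampling
them back onto a pixel grid, and the scale normalization to a prefixed size, are
left out.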



                          Figure 2: Database examples: (a) genuine and (b) forgery signatures.

                               Figure 3: Noise reduction: (a) noisy image, (b) filtered image.

                                               Figure 4: Slant normalization.

    4- FEATURES EXTRACTION
The choice of a powerful set of features is crucial in signature identification
systems. In our system, we implement a new multi scale Fourier descriptor using
the wavelet transform, as discussed in the next section.

4-1 Multi scale Fourier descriptor using wavelet transform
The multi scale representation of the signature image can be achieved using the
wavelet transform. In the discrete wavelet transform (DWT), the wavelet
coefficients of the image f(x, y) are defined as

\[ W_{\varphi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \varphi_{j_0,m,n}(x, y) \]

\[ W_{\psi}^{i}(j, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \psi^{i}_{j,m,n}(x, y), \qquad i \in \{H, V, D\} \]

where $j_0$ is an arbitrary starting scale.

The problem with the coefficients obtained from the wavelet transform is that
they are not rotation invariant. Also, the dimensionality of the feature vector
depends on the signature image size. Therefore, the coefficient vectors of
different signatures cannot be directly matched in image retrieval. The proposed
solution for this problem is to apply the Fourier transform to the coefficients
obtained from the wavelet transform. In this way, the multi scale signature
representation can be transformed to the frequency domain, in which
normalization and matching are straightforward operations. Hence the benefits of
the multi scale representation and the Fourier representation can be combined.
The multi scale Fourier descriptor is formed by applying the discrete Fourier
transform

\[ a_n = \frac{1}{N} \sum_{t=0}^{N-1} u(t) \exp(-j 2\pi n t / N), \qquad \text{where } u(t) = w \]

This results in a set of Fourier coefficients $\{a_n\}$, which is a
representation of the signature region. Since images generated through rotation,
translation and scaling (called a similarity transform of an image) of the same
image are similar images, an image representation should be invariant to these
operations. The selection of a different start point on the image boundary to
derive $u(t)$ should not affect the representation. From Fourier theory, the
general form for the Fourier coefficients of a contour generated by translation,
rotation, scaling and change of start point is given by:

\[ a_n = \exp(jn\tau) \cdot \exp(j\phi) \cdot s \cdot a_n^{(0)}, \qquad n \neq 0 \]

where $a_n^{(0)}$ and $a_n$ are the Fourier coefficients of the original image
and the similarity transformed image, respectively; $\exp(jn\tau)$,
$\exp(j\phi)$ and $s$ are the terms due to change of starting point, rotation
and scaling. Except

normalized coefficient of the derived image $b_n$ and that of the original image
$b_n^{(0)}$ have only a difference of $\exp[j(n-1)\tau]$. If we ignore the phase
information and only use the magnitude of the coefficients, then $|b_n|$ and
$|b_n^{(0)}|$ are the same. In other words, $|b_n|$ is invariant to translation,
rotation, scaling and change of start point. The set of magnitudes of the
normalized Fourier coefficients of the signature image $\{|b_n|, 0 < n \le N\}$
can now be used as signature image descriptors, denoted as
$\{FD_n, 0 < n \le N\}$. The next section discusses some distance measures that
can be used.

    5- DISTANCE MEASURES

Let X, Y be feature vectors of length n. Then we can calculate the following
distances between these feature vectors:

Minkowski distance ($L_p$ metrics)
\[ d(X, Y) = L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]

Manhattan distance ($L_1$ metrics, city block distance)
\[ d(X, Y) = L_{p=1}(X, Y) = \sum_{i=1}^{n} |x_i - y_i| \]

Euclidean distance ($L_2$ metrics)
\[ d(X, Y) = L_{p=2}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Angle-based distance
\[ d(X, Y) = -\cos(X, Y), \qquad \cos(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2}} \]

Correlation coefficient-based distance
\[ d(X, Y) = -r(X, Y) \]
                                                                                                                                n             n
                                                                                                                             n xi y i   xi  y i
                                                                                                    r( X ,Y )                 i 1          i 1          i 1
                             a                                                                                        n        n                     n                       n
the DC component ( 0 ), all the other coefficients are not                                                         (n xi2  ( xi ) 2 ) (n yi2  ( yi ) 2 )
                                                                                                                     i 1     i 1                  i 1                 i 1
affected by translation. Now considering the following
expression                                                                                       Modified Manhattan distance

    a    exp( jnt )  exp( j )  c  a n0)
                                                                                                                                                          xi  yi
bn  n                                                                                               d ( X ,Y )                       i

          exp( jt )  exp( j )  c  a00)
                                       (                                                                                               n                                n
    a0                                                                                                                                
                                                                                                                                      i 1
                                                                                                                                                    xi              
                                                                                                                                                                    i 1
      ( 0)
     n
       exp[ j (n  1)t ]  bn0) exp[ j (n  1)t ]
                            (                                                                    Modified SSE-based distance
    a  0                                                                                                                                 (x                         i        yi ) 2
               0                                                                                             d ( X ,Y )                 i 1

      bn      bn                                                                                                                            n

                                                                                                                                             x                     2
                                                                                                                                                                                            y i2
where     and   are normalized Fourier coefficients of the                                                                                    i 1
                                                                                                                                                                                 i 1
derived image and the original image , respectively. The
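The distance measures listed above translate almost directly into code. The following Python sketch is illustrative only and is not the authors' Matlab implementation: the function names are hypothetical, the default order p = 3 for the Minkowski distance is an assumption (the text does not state which p was used), and the negative sign on the angle-based and correlation-based measures assumes the usual convention that a smaller distance means a better match.

```python
import math


def minkowski(x, y, p=3):
    """L_p distance; p=3 is an assumed default, not a value from the paper."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """L_1 (city block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L_2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angle_based(x, y):
    """Negative cosine of the angle between the two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return -dot / norm


def correlation_based(x, y):
    """Negative Pearson correlation coefficient of the two feature vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return -num / den


def modified_manhattan(x, y):
    """Manhattan distance divided by the product of the L_1 norms."""
    return manhattan(x, y) / (sum(abs(a) for a in x) * sum(abs(b) for b in y))


def modified_sse(x, y):
    """Sum of squared errors divided by the product of the squared L_2 norms."""
    sse = sum((a - b) ** 2 for a, b in zip(x, y))
    return sse / (sum(a * a for a in x) * sum(b * b for b in y))
```

For two parallel feature vectors such as `[1, 2, 3]` and `[2, 4, 6]`, the angle-based and correlation-based distances both evaluate to -1, their minimum, even though the Manhattan and Euclidean distances are non-zero. This shows why the choice of measure can change which stored signature counts as the closest match.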

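To illustrate the invariance argument for the normalized Fourier coefficients, the following Python sketch applies a similarity transform and a start-point shift to a synthetic closed contour and compares the magnitudes |b_n| before and after. It is a minimal sketch under stated assumptions: the contour is represented by complex boundary samples u(t) = x(t) + j y(t), the coefficients are normalized by the first harmonic a_1, only magnitudes are compared (which discards the remaining start-point phase factor), and the contour and all function names are made up for the illustration.

```python
import cmath
import math


def fourier_coefficients(points):
    """Discrete Fourier coefficients a_n of a closed contour given as complex samples."""
    n_points = len(points)
    return [
        sum(u * cmath.exp(-2j * math.pi * n * t / n_points)
            for t, u in enumerate(points)) / n_points
        for n in range(n_points)
    ]


def normalized_magnitudes(points):
    """Magnitudes |b_n| = |a_n / a_1| for n = 2 .. N-1 (a_0 and a_1 are skipped)."""
    a = fourier_coefficients(points)
    return [abs(a[n] / a[1]) for n in range(2, len(a))]


# Hypothetical test contour: an ellipse with a small third-harmonic bump.
N = 64
contour = [
    complex(3 * math.cos(2 * math.pi * t / N),
            math.sin(2 * math.pi * t / N) + 0.3 * math.sin(6 * math.pi * t / N))
    for t in range(N)
]

# Similarity transform: scale s, rotation phi, translation, start-point shift k.
s, phi, translation, k = 2.5, 0.7, complex(10, -4), 11
moved = [s * cmath.exp(1j * phi) * contour[(t + k) % N] + translation
         for t in range(N)]

original_fd = normalized_magnitudes(contour)
moved_fd = normalized_magnitudes(moved)

# Largest difference between the two descriptor vectors (floating-point noise only).
max_diff = max(abs(p - q) for p, q in zip(original_fd, moved_fd))
print(f"largest descriptor difference: {max_diff:.2e}")
```

The translation only changes a_0, which is left out of the descriptor, while scale and rotation cancel in the ratio a_n / a_1 and the start-point shift survives only as a phase, so the two magnitude vectors agree up to rounding error.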
    6- EXPERIMENTAL RESULTS

This section reports some experimental results obtained using our method. A total of 840 signature images is used in the following experiments. The experimental platform is an Intel Core 2 Duo 1.83 GHz processor with 1 GB RAM running Windows Vista, and the software is Matlab. The recognition performance is evaluated using different distance measures and different wavelet families, as presented in Table 1, Table 2 and Table 3. Table 1 uses only Fourier descriptors (FD) as features to represent the signature images, Table 2 uses only the wavelet transform, and Table 3 uses the new multiscale Fourier descriptor based on the wavelet transform.

         Distance measure                            Correct recognition rate

         Minkowski distance                          92.6%
         Manhattan distance                          96.2%
         Euclidean distance                          95.4%
         Angle-based distance                        93.4%
         Correlation coefficient-based distance
         Modified Manhattan distance
         Modified SSE-based distance

         Table 1: Recognition performance using Fourier descriptor coefficients.

         Distance measure \ Wavelet family           haar      DB2       DB8       DB15      Sym8

         Minkowski distance                          89.8%     91.4%     92.8%     93.6%     93%
         Manhattan distance                          93%       93.4%     94.4%     95.2%     94.6%
         Euclidean distance                          92.6%     93.4%     94.8%     94.4%     94.2%
         Angle-based distance                        93.2%     94.2%     94.2%     94.8%     94.2%
         Correlation coefficient-based distance      92.2%     92.8%     93.4%     94.2%     93.8%
         Modified Manhattan distance                 92%       93.8%     94.2%     94.6%     94%
         Modified SSE-based distance                 93.2%     94.2%     94.6%     95.6%     94.4%

         Table 2: Recognition performance using wavelet transform.

         Distance measure \ Wavelet family           haar      DB2       DB8       DB15      Sym8

         Minkowski distance                          97%       97.6%     97.2%     97%       96.6%
         Manhattan distance                          98.8%     98.4%     98%       98.6%     99%
         Euclidean distance                          98.2%     98.4%     98.2%     98.2%     98.2%
         Angle-based distance                        96.6%     97.8%     97%       97.6%     97.8%
         Correlation coefficient-based distance      95.8%     96.6%     96.6%     97.2%     97.2%
         Modified Manhattan distance                 96.2%     98%       97.4%     97.8%     97.6%
         Modified SSE-based distance                 92.8%     96.2%     96.8%     98%       95.8%

         Table 3: Recognition performance using multiscale Fourier descriptor and wavelet transform.
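As a quick cross-check of Table 3, the short Python sketch below transcribes the recognition rates and looks up the best combination of distance measure and wavelet family. The dictionary layout and the helper name `best_configuration` are hypothetical; only the numbers come from the table.

```python
# Recognition rates (%) transcribed from Table 3, one list per distance measure,
# ordered by wavelet family as in the table header.
WAVELETS = ["haar", "DB2", "DB8", "DB15", "Sym8"]
TABLE3 = {
    "Minkowski": [97.0, 97.6, 97.2, 97.0, 96.6],
    "Manhattan": [98.8, 98.4, 98.0, 98.6, 99.0],
    "Euclidean": [98.2, 98.4, 98.2, 98.2, 98.2],
    "Angle-based": [96.6, 97.8, 97.0, 97.6, 97.8],
    "Correlation coefficient-based": [95.8, 96.6, 96.6, 97.2, 97.2],
    "Modified Manhattan": [96.2, 98.0, 97.4, 97.8, 97.6],
    "Modified SSE-based": [92.8, 96.2, 96.8, 98.0, 95.8],
}


def best_configuration(table, wavelets):
    """Return (rate, distance, wavelet) for the highest recognition rate."""
    return max(
        (rate, distance, wavelet)
        for distance, rates in table.items()
        for rate, wavelet in zip(rates, wavelets)
    )


best_rate, best_distance, best_wavelet = best_configuration(TABLE3, WAVELETS)
print(f"{best_distance} distance with {best_wavelet}: {best_rate}%")
```

The lookup returns the Manhattan distance with the Sym8 wavelet family at 99%, the single highest entry in the table.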

   In this paper we develop a method for signature recognition based on a multiscale Fourier descriptor using the wavelet transform. Recognition experiments were performed on a database containing 840 signature images. Our method was tested using different wavelet families and various distance measures. The best recognition results were achieved using the Sym8 wavelet family and the Manhattan distance.

REFERENCES

[1] E. J. R. Justino, F. Bortolozzi and R. Sabourin, "Off-line Signature Verification Using HMM for Random, Simple and Skilled Forgeries", ICDAR 2001, International Conference on Document Analysis and Recognition, vol. 1, pp. 105-110, 2001.

[2] B. Zhang, M. Fu and H. Yan, "Handwritten Signature Verification based on Neural 'Gas' Based Vector Quantization", IEEE International Joint Conference on Neural Networks, pp. 1862-1864, May 1998.

[3] J. F. Vélez, Á. Sánchez and A. B. Moreno, "Robust Off-Line Signature Verification Using Compression Networks And Positional Cuttings", Proc. 2003 IEEE Workshop on Neural Networks for Signal Processing, vol. 1, pp. 627-636, 2003.

[4] M. Arif and N. Vincent, "Comparison of Three Data Fusion Methods For An Off-Line Signature Verification Problem", Laboratoire d'Informatique, Université de François Rabelais, 2003.

[5] A. Chalechale and A. Mertins, "Line Segment Distribution of Sketches for Persian Signature Recognition", IEEE Proc. TENCON, vol. 1, pp. 11-15, Oct. 2003.

[6] Sansone and Vento, "Signature Verification: Increasing Performance by a Multi-Stage System", Pattern Analysis & Applications, vol. 3, pp. 169-181, 2000.

[7] Inan Güler and Majid Meghdadi, "A different approach to off-line handwritten signature verification using the optimal dynamic time warping algorithm", Digital Signal Processing 18 (2008) 940-950.

[8] Jing Wen, Bin Fang, Y. Y. Tang and TaiPing Zhang, "Model-based signature verification with rotation invariant features", Pattern Recognition 42 (2009) 1458-1466.

[9] R. C. Gonzalez and P. Wintz, Digital Image Processing, second ed., Addison-Wesley, MA, 1987.

[10] B. Horn, Robot Vision, The MIT Press, 1986.


  Feature-Based Adaptive Tolerance Tree (FATT): An
   Efficient Indexing Technique for Content-Based
      Image Retrieval Using Wavelet Transform
Dr. P. AnandhaKumar
Department of Information Technology
Madras Institute of Technology, Anna University Chennai
Chennai, India
Email ID: annauniv@edu

V. Balamurugan
Research Scholar, Department of Information Technology
Madras Institute of Technology, Anna University Chennai
Chennai, India
Email ID:

   Abstract—This paper introduces a novel indexing and access method, called Feature-Based Adaptive Tolerance Tree (FATT), which uses the wavelet transform to organize large image data sets efficiently and to support popular image access mechanisms such as Content-Based Image Retrieval (CBIR). Conventional database systems are designed for managing textual and numerical data, and retrieving such data is often based on simple comparisons of text or numerical values. However, this method is no longer adequate for images, since the digital presentation of images does not convey the reality of images. Retrieval of images becomes difficult when the database is very large. This paper addresses such problems and presents a novel indexing technique, FATT, which is designed to bring an effective solution especially for indexing large databases. The proposed indexing scheme is then used along with a query by image content, in order to achieve the ultimate goal from the user's point of view, that is, retrieval of all relevant images. In the FATT indexing technique, features of the image are extracted using the 2-dimensional discrete wavelet transform (2D-DWT) and the index code is generated from the determinant value of the features. Multiresolution analysis using the 2D-DWT can decompose the image into components at different scales, so that the coarsest scale components carry the global approximation information while the finer scale components contain the detailed information. Experimental results show that FATT outperforms M-tree by up to 200%, Slim-tree by up to 120% and HCT by up to 89%. The FATT indexing technique is adopted to increase the efficiency of data storage and retrieval.

   Index Terms—CBIR, FATT, indexing, wavelet transform.

                        INTRODUCTION

   One of the challenges in the development of a content-based indexing and retrieval application is to achieve an efficient indexing scheme. Efficient image indexing and accessing tools are of paramount importance in order to fully utilize the increasing amount of digital data available on the internet and in digital libraries. Recent years have seen a growing interest in developing effective methods for indexing large image databases. Image databases are utilized in diverse areas such as advertising, medicine, entertainment, crime detection, education, the military, and museum and gallery image collections. This motivates the development of efficient image indexing and retrieval systems and algorithms. However, as the database grows larger, the traditional keyword-based method of retrieving a particular image becomes inefficient [21], [26], [33], [35].

   Content-based image retrieval (CBIR) has emerged as a promising alternative. CBIR has been a particular challenge as the image content covers a vast range of subjects and requirements from end users are often very loosely defined. In CBIR, images are indexed by their own visual contents, such as color, texture and shape. The challenge in CBIR is to develop methods that increase the retrieval accuracy and reduce the retrieval time. Hence the need arises for an efficient indexing structure for images that supports popular access mechanisms like CBIR [26], [33], [35].

   In order to overcome such problems, several content-based indexing and retrieval applications have been developed, such as the MUVIS system [22], [23], [29], Photobook [31], VisualSEEK [37], Virage [40] and VideoQ [9], all of which are designed to bring a framework structure for handling and especially retrieving digital multimedia items such as images, audio and/or video clips. In this way, the similarity between two database images can be obtained by calculating the similarity distance between their feature vectors. This is the general query by example (QBE) scenario, which on the other hand is expensive and CPU intensive, especially for large databases. The main challenge of implementing such an index structure is to make it capable of handling high-level image relationships easily and efficiently during access [7], [14], [21].

   The indexing techniques can be mainly grouped into two types: (1) Spatial Access Methods (SAMs) and (2) Metric Access Methods (MAMs) [21]. (1) Initial SAMs include KD-trees [1], R-trees [16], SS-tree [41], Hybrid tree [8], etc. SAMs have several drawbacks and limitations, especially for content-based indexing and retrieval in large-scale databases. A SAM-based indexing technique partitions and works over a single feature space. SAMs provide good results in low-dimensional feature spaces but do not scale up well to high-dimensional spaces; SAM-based indexing schemes even become less efficient than sequential indexing for high dimensions. (2) Static MAMs [17] such as VP-tree [43], MVP-tree [46] and GNAT [6] have been proposed, as well as dynamic ones: M-tree [7] and
other M-tree variants such as M+-tree. The existing multidimensional index structures support CBIR by translating the content-similarity measurement into feature-level equivalence, which is a very difficult job and can result in erroneous interpretation of the user's perception of similarity.

   The indexing structures addressed so far are all designed to speed up any Query By Example (QBE) process by some multidimensional index structure. However, all of them have significant drawbacks and shortcomings for indexing large-scale databases [1], [6], [7], [21], [41]. In order to overcome such problems and provide efficient solutions to the aforementioned shortcomings of the indexing algorithms, especially for image databases, we develop a MAM-based, balanced and dynamic indexing technique called Feature-Based Adaptive Tolerance Tree (FATT).

   In this paper, we present FATT, which consists of a root node (grandparent) with a maximum of 256 equal parent nodes; the depth of the tree depends upon the number of features considered and the width of the tree depends upon the individual index value.

   The rest of the paper is organized as follows: Section II presents the related work in the area of indexing and retrieval. In Section III we introduce the general structure of FATT and its algorithms. Section IV covers the wavelet transform and image coding. Section V presents the experimental results. Finally, Section VI concludes this paper.

                       RELATED WORK

   There are several multidimensional indexing techniques for capturing the low-level features, such as feature-based or distance-based techniques, each of which can be further classified as data-partitioned [3], [7], [16], [44] or space-partitioned algorithms [27], [34]. Feature-based indexing techniques project an image as a feature vector in a feature space and index the space. The basic feature-based index structures are KD-tree [1], R-tree [16], etc. Researchers proposed several indexing techniques that are formed mostly in a hierarchical tree structure that is used to cluster the feature space. Initial attempts such as KD-trees [1] used space-partitioning methods that divide the feature space into predefined hyper-planes regardless of the distribution of the feature vectors. Such regions are mutually disjoint and their union covers the entire space. In R-tree [16] the feature space is divided according to the distribution of the database items, and region overlapping may occur as a result. Both KD-tree and R-tree are the first examples of SAMs.

   Afterwards several enhanced SAMs have been proposed. R*-tree [2] provides consistently better performance than the R-tree and R+-tree [36] by introducing a policy called "forced reinsert". R*-tree also improves the node splitting policy of the R-tree by taking overlapping area and region parameters into consideration. Lin et al. proposed the TV-tree [19], which uses so-called telescope vectors. These vectors can be dynamically shortened, assuming that only dimensions with high variance are important, so that low-variance dimensions can be neglected. Berchtold et al. [3] introduced the X-tree, which is particularly designed for indexing high-dimensional data. The X-tree avoids overlapping of region bounding boxes in the directory structure by using a new organization of the directory; as a result, the X-tree outperforms both TV-tree and R*-tree significantly. It is 450 times faster than R-tree and up to 12 times faster than the TV-tree when the dimension is higher than two, and it also provides faster insertion times. Still, bounding rectangles can overlap in higher dimensions. In order to prevent this, White and Jain proposed the SS-tree [41], an alternative to the R-tree structure, which uses minimum bounding spheres instead of rectangles. Even though SS-tree outperforms R*-tree, overlapping in high dimensions still occurs. Thereafter, several other SAM variants were proposed, such as SR-tree [20], S2-tree [42], Hybrid-tree [8], A-tree [38], IQ-tree [5], Pyramid-tree [4], NB-tree [13], etc. The aforementioned degradations and shortcomings prevent a wide usage of SAM-based indexing structures, especially for large-scale image collections.

   In order to provide a more general approach to similarity indexing, several MAM-based indexing techniques have been proposed. Yianilos [43] presented the VP-tree, which is based on partitioning the feature vectors into two groups according to their similarity distances with respect to a reference point, a so-called vantage point. Bozkaya and Ozsoyoglu [47] proposed an extension of the VP-tree, the so-called MVP-tree (multiple vantage point), which basically assigns m vantage points to a node with a fan-out of m^2. They reported a 20%-80% reduction of similarity distance computations compared to VP-trees. Brin [6] introduced the geometric near-neighbor access tree (GNAT) indexing structure, which chooses k split points at the top level, and each of the remaining feature vectors is associated with the closest split point. GNAT is then built recursively and the parameter k is chosen to be a different value for each feature set depending on its cardinality. Koikkalainen and Oja introduced TS-SOM [25], which is used in PicSOM [28] as a CBIR indexing structure. TS-SOM provides a tree-structured vector quantization algorithm. Other SOM-based approaches were introduced by Zhang and Zhong [45], and Sethi and Coman [39]. All SOM-based indexing methods rely on training of the levels, each of which has a pre-fixed node size that has to be arranged according to the size of the database. This brings a significant limitation: they are all static indexing structures, which do not allow dynamic construction or updates for a particular database. Retraining and costly reorganizations are required each time the content of the database changes (i.e., new insertions and deletions), which is indeed nothing but rebuilding the whole indexing structure from scratch.

   Similarly, the rest of the MAMs addressed so far present several shortcomings. Contrary to SAMs, these metric trees are designed only to reduce the number of similarity distance computations, paying no attention to I/O costs (disk page accesses). They are also intrinsically static methods in the sense that the tree structure is built once and new insertions are not supported. Furthermore, all of them build the indexing structure from top to bottom and hence the resulting tree is not guaranteed to be balanced. Ciaccia et al. [7] proposed the M-tree to overcome such problems. The M-tree is a balanced and dynamic tree, which is built from bottom to top, creating a new level only when necessary. The node size is a fixed number M, and therefore the tree height depends on M and the
database size. Its performance optimization concerns both CPU computational time for similarity distances and I/O costs for disk accesses for feature vectors of the database items. Traina et al. [15] proposed Slim-tree, an enhanced variant of M-trees, which is designed for improving the performance by minimizing the overlaps between nodes. They introduced two parameters, "fat-factor" and "bloat-factor", to measure the degree of overlap and proposed the usage of the minimum spanning tree (MST) [24], [32] for splitting the node. Another slightly enhanced M-tree structure, the so-called M+-tree, can be found in [46]. Serkan Kiranyaz and Moncef Gabbouj [21] proposed the Hierarchical Cellular Tree (HCT), which has no limit for cell size as long as the cell keeps a definite compactness. Any action (i.e., an item insertion) and its consequent reactions cannot last indefinitely, due to the fact that each of them can occur only in a higher level and any HCT body has a naturally limited number of levels. Ming Zhang and John Fulcher [48] proposed GAT-trees, which are neural-network-based adaptive trees, but they take longer to train the network; they give an accuracy 5% higher than a general tree when random noise is added, and 7% when the gamma value of the Gaussian noise exceeds 0.3.

1. FATT indexing is MAM-based and has a hierarchical structure, i.e., levels (m).

2. FATT is designed to be dynamic and balanced, in a top-to-bottom fashion, to overcome the problem of overlap between nodes in metric spaces.

3. The tree grows from the root node (grandparent (A)) to the leaves; each parent node is represented with several left child (lchild) and right child (rchild) nodes, up to a maximum of 256 (N) nodes.

4. Each child node immediately proceeds with equal left child (lchild) and right child (rchild). Both the lchild and rchild of A+1 are at NA+N = N(A+1), unless N(A+1) > m, in which case (A+1) has no lchild. Hence this representation can clearly be used for all image indexing and fast searching processes.

  FATT Structure

   FATT is a balanced, adaptive indexing tree. It is designed as a complete N-ary tree, where N is the number of nodes. The FATT is constructed based on the index code developed. The height of the tree depends upon the number of features considered, and the width of the tree depends upon the number of individual index codes that are used in code generation. The leaf node contains the index value. The retrieval performance depends upon the tree construction. The size of the tree can be changed depending upon the number of images that have to be stored. The structure of the tree can be changed depending upon the images found. The general structure (4 levels, m = 4, N = 256) of the FATT is shown in Fig. 1.

   [Fig. 1. General structure of the FATT (N=256, m=4): a root node at the top level, followed by Level 1 (m=1) through Level 4 (m=4), each holding nodes labelled 00, 01, ..., FF; the leaf nodes hold the index value.]

   The maximum number of unique index values for an m-level, N-node tree is given as

  FATT Algorithms and Operations

   There are 3 FATT algorithms and operations, i.e., image insertion, searching and retrieval (indexing).

   The insertion algorithm Insert(nextImage, levelNo) first performs the searching algorithm, which hierarchically traverses the FATT from the root node (grandparent) to the least child node in order to locate the most suitable image for nextImage. Once the least node is located, the image is inserted into the immediate parent node. Let nextImage be the image to be inserted into a target level indicated by levelNo. Accordingly, the insertion algorithm is as follows, and sample insertions for 2 nodes, 4 levels (N=2, m=4) and 3 nodes, 4 levels (N=3, m=4) are shown in Fig. 2 and Fig. 3.

1. Insertion Algorithm

     Insert(nextImage, levelNo)
=Nm                                                                     Let the root level number; root LevelNo and the single image
                                                                        in the top level:
             No. of nodes at top level (m)=1                                      If(levelNo>rootlevelNo) then do;
      No. of nodes for each node at first level (m1) =256               Append nextImage into least lchild node
    No. of nodes for each node at second level (m2) =256                          If(levelNo=rootlevelNo) then do;
     No. of nodes for each node at third level (m3) =256                 Assign rootlevelNo=LevelNo
     No. of nodes for each node at fourth level (m4) =256                         If(levelNo<root levelNo) then do;
 No. of individual index value =(1×256×256×256×256)4=410                Append nextImage into least rchild node.


   Check least child node for post-processing
      If least child node is split then do;
         Create the least lchild node equal to immediate parent node
         Insert(nextImage, immediate rootLevelNo)
      Else if insertion of least child node then do;
         Let least lchild and nextImage be parent and new child node
      Else leave the immediate parent node unchanged
   Return.

A. Insertion of 4 levels and 2 nodes (m=4, N=2)

   Number of levels, m = 4
   Number of nodes, N = 2
   Then the total number of index values (N^m) is 16.

        Fig.2. A sample FATT (m=4, N=2)

   If a complete FATT with N=2, m=4 is represented sequentially in
depth-first search order as above, then for a grandparent node with
index A, 1≤A≤m, we have:

Case (1) grandparent(A) is at ⌊A/2⌋ if A≠1. When A=1, A is the root and
         has no parent.
Case (2) leftchild(A) is at 2A if 2A≤m. If 2A>m, then grandparent A has
         no left child (lchild).
Case (3) rightchild(A) is at 2A+1 if 2A+1≤m. If 2A+1>m, then A has no
         right child (rchild).

   We prove case (2); case (3) is an immediate consequence of case (2)
and the numbering of nodes on the same level from left to right, and
case (1) follows from cases (2) and (3). We prove case (2) by induction
on A. For A=1, clearly the lchild is at 2, unless 2>m, in which case 1
has no lchild. Then the two nodes immediately preceding lchild(A+1) in
the representation are the rchild of A and the lchild of A. The lchild
of A is at 2A. Hence the lchild of A+1 is at 2A+2 = 2(A+1), unless
2(A+1)>m, in which case A+1 has no lchild.

B. Insertion of 4 levels and 3 nodes (m=4, N=3)

   Number of levels, m = 4
   Number of nodes, N = 3
   Then the total number of index values (N^m) is 81.

        Fig.3. A sample FATT (m=4, N=3)

   If a complete FATT with N=3, m=4 is represented sequentially in
depth-first search (DFS) order as above, then for a grandparent node
with index A, 1≤A≤m, we have:

Case (1) grandparent(A) is at ⌊A/3⌋ if A≠1. When A=1, A is the root and
         has no parent.
Case (2) leftchild(A) is at 3A if 3A≤m. If 3A>m, then A has no left
         child (lchild).
Case (3) middlechild(A) is at 3A+1 if 3A+1≤m. If 3A+1>m, then A has no
         middle child (mchild).
Case (4) rightchild(A) is at 3A+2 if 3A+2≤m. If 3A+2>m, then A has no
         right child (rchild).

   We prove case (2); cases (3) and (4) are an immediate consequence of
case (2) and the numbering of nodes on the same level from left to
middle to right, and case (1) follows from cases (2), (3) and (4). We
prove case (2) by induction on A. For A=1, clearly the lchild is at 3,
unless 3>m. Then the nodes immediately preceding lchild(A+1) in the
representation are the middle child of A and the lchild of A. The
lchild of A is at 3A; hence the lchild of A+1 is at 3A+3 = 3(A+1),
unless 3(A+1)>m.

   In general, if a complete FATT with a maximum number of nodes of 256
(N=256), i.e., depth = N^m, is represented sequentially as above, then
for a grandparent node with index A, 1≤A≤N, we have:

Case (1) grandparent(A) is at ⌊A/N⌋ if A≠1. When A=1, A is the root and
         has no parent.
Case (2) leftchild(A) is at 2A if 2A≤256. If 2A>256, then A has no
         lchild. mchild(A) is at 2A+1 if 2A+1≤256. If 2A+1>256, then A
         has no mchild.
Case (3) rightchild(A) is at 2A+2 if 2A+2≤256. If 2A+2>256, then A has
         no rchild.

   We prove case (2); case (3) is an immediate consequence of case (2)
and the numbering of nodes on the same level from left to right, and
case (1) follows from cases (2) and (3). We prove case (2) by induction
on A. For A=1, clearly the lchild is at 2, unless 2>256, in which case
1 has no lchild. Then the two nodes immediately preceding lchild(A+1)
in the representation are the mchild of A and the lchild of A. The
lchild of A is at 2A. Hence the


lchild of A+1 is at 2A+2 = 2(A+1), unless 2(A+1)>256, in which case A+1
has no lchild.

2. Searching Algorithm
   The depth-first search algorithm is used for traversing the tree.
This algorithm searches as deeply as possible by visiting a node and
then recursively performing depth-first search on each adjacent node.

   Search(nextImage, levelNo)
      If the parent node > root node then do;
         Go to the lchild of the root node
         Until no lchild is available
         Check whether the indexing value is obtained
      Else if parent node < root node
         Return the index value
      Else not found, traverse steps 1 and 2
      Repeat steps 1, 2 and 3 until the index value is found.

   The searching algorithm Search(nextImage, levelNo) first performs
the search at the root node.
Case (1) If parent node > root node, searching begins with the lchild
         node and continues until no leftmost child is available.
Case (2) If parent node < root node, then return the index value;
         otherwise search while a leftmost child node is available.

3. Retrieval Algorithm
   FATT can index a large-scale image database using the generated
index code, with the Euclidean similarity measure used as the distance.

      If the parent node > root node then do;
         Go to the lchild of the root node
         Until no lchild is available
         Check whether the indexing value is obtained
      Else if parent node < root node
         Return the index value
      Else not found, traverse steps 1 and 2
      Repeat steps 1, 2 and 3 until the index value is found.

   The indexing algorithm Index(nextImage, levelNo) first performs the
indexing at the root node.
Case (1) If parent node > root node, indexing begins with the lchild
         node and continues until no leftmost child is available. Check
         whether the index value is available.
Case (2) If parent node < root node, then return the index value;
         otherwise index while a leftmost child node is available.

  Wavelet Transform
   A 2-dimensional discrete wavelet transform (2D-DWT) is a process
which decomposes a signal, that is, a series of digital samples, by
passing it through two filters, a low pass filter L and a high pass
filter H. The low pass sub band represents a down sampled version of
the original signal. The high pass sub band represents residual
information of the original signal, needed for the perfect
reconstruction of the original set from the low resolution version.
   Since an image is typically a two-dimensional signal, a 2D
equivalent of the DWT is performed. This is achieved by first applying
the L and H filters to the lines of samples, row by row, then
refiltering the output along the columns with the same filters. As a
result, the image shown in Fig.4 is divided into 4 subbands, LL, LH, HL
and HH, as depicted in Fig.5. The LL sub band contains the low pass
information, and the other sub bands contain the high pass information
of horizontal, vertical and diagonal orientation. The LL sub band
provides a half-sized version of the input image which can be
transformed again to have more levels of resolution. Generally, an
image is partitioned into L resolution levels by applying the 2D DWT
(L-1) times [11].

        Fig.4. Original image        Fig.5. One level decomposition

   By wavelet transform, we mean the decomposition of an image with a
family of real orthogonal bases $\psi_{m,n}(x)$ obtained through
translation and dilation of a kernel function $\psi(x)$ known as the
mother wavelet [11]:

$$\psi_{m,n}(x) = 2^{-m/2}\,\psi(2^{-m}x - n)$$   (1)

where $m$ and $n$ are integers. Due to the orthonormal property, the
wavelet coefficients of a signal $f(x)$ can be easily computed via

$$c_{m,n} = \int_{-\infty}^{\infty} f(x)\,\psi_{m,n}(x)\,dx$$   (2)

and the synthesis formula

$$f(x) = \sum_{m,n} c_{m,n}\,\psi_{m,n}(x)$$   (3)

can be used to recover $f(x)$ from its wavelet coefficients.
   To construct the mother wavelet $\psi(x)$, we may first determine a
scaling function $\phi(x)$ which satisfies the two-scale difference
equation

$$\phi(x) = \sqrt{2}\sum_{k} h(k)\,\phi(2x - k)$$   (4)

   Then the wavelet kernel $\psi(x)$ is related to the scaling function
via


$$\psi(x) = \sqrt{2}\sum_{k} g(k)\,\phi(2x - k)$$   (5)

where

$$g(k) = (-1)^{k}\,h(1 - k)$$   (6)

The coefficients $h(k)$ in (4) have to satisfy several conditions for
the set of basis wavelet functions in (1) to be unique, orthonormal,
and have a certain degree of regularity [11], [18].
   The coefficients $h(k)$ and $g(k)$ play a very crucial role in a
given discrete wavelet transform. Performing the wavelet transform does
not require the explicit forms of $\phi(x)$ and $\psi(x)$ but only
depends on $h(k)$ and $g(k)$. Consider a J-level wavelet decomposition
which can be written as

$$f_0(x) = \sum_{k} c_{0,k}\,\phi_{0,k}(x) = \sum_{k}\Big(c_{J+1,k}\,\phi_{J+1,k}(x) + \sum_{j=0}^{J} d_{j+1,k}\,\psi_{j+1,k}(x)\Big)$$   (7)

where coefficients $c_{0,k}$ are given and coefficients $c_{j+1,n}$ and
$d_{j+1,n}$ at scale $j+1$ are related to the coefficients $c_{j,k}$ at
scale $j$ via

$$c_{j+1,n} = \sum_{k} c_{j,k}\,h(k - 2n)$$   (8)

$$d_{j+1,n} = \sum_{k} c_{j,k}\,g(k - 2n)$$   (9)

where $0 \le j \le J$. Thus, (8) and (9) provide a recursive algorithm
for wavelet decomposition through $h(k)$ and $g(k)$, and the
coefficients $c_{J+1,n}$ for a low resolution component
$\phi_{J+1,n}(x)$. By using a similar approach, we can derive a
recursive algorithm for function synthesis based on its wavelet
coefficients $d_{j,n}$, $1 \le j \le J+1$, and $c_{J+1,n}$:

$$c_{j,k} = \sum_{n} c_{j+1,n}\,h(k - 2n) + \sum_{n} d_{j+1,n}\,g(k - 2n)$$   (10)

It is convenient to view the decomposition (8) as passing a signal
$c_{j,k}$ through a pair of filters H and G with impulse responses
$\tilde{h}(n)$ and $\tilde{g}(n)$ and down sampling the filtered
signals by two (dropping every other sample), where $\tilde{h}(n)$ and
$\tilde{g}(n)$ are defined as

$$\tilde{h}(n) = h(-n), \qquad \tilde{g}(n) = g(-n)$$

The pair of filters H and G correspond to the halfband lowpass and
highpass filters, respectively, and are called the quadrature mirror
filters in the signal processing literature [11]. The reconstruction
procedure is implemented by upsampling the subsignals $c_{j+1}$ and
$d_{j+1}$ and filtering with $h(n)$ and $g(n)$, respectively, and
adding these two filtered signals together. Usually the signal
decomposition scheme is performed recursively on the output of the
lowpass filter $\tilde{h}(n)$.
   The wavelet packet basis functions $\{W_n\}_{n=0}^{\infty}$ can be
generated from a given function $W_0$ as follows:

$$W_{2n}(x) = \sqrt{2}\sum_{k} h(k)\,W_n(2x - k)$$   (11)

$$W_{2n+1}(x) = \sqrt{2}\sum_{k} g(k)\,W_n(2x - k)$$   (12)

   where the function $W_0(x)$ can be identified with the scaling
function $\phi$ and $W_1(x)$ with the mother wavelet $\psi$. Then, the
wavelet packet bases can be defined to be the collection of orthonormal
bases composed of functions of the form $2^{l/2}\,W_n(2^{l}x - k)$,
where $l, k \in \mathbb{Z}$, $n \in \mathbb{N}$. Each element is
determined by a subset of the indices: a scaling parameter l, a
localization parameter k, and an oscillation parameter n.
   The 2D wavelet (or wavelet packet) basis functions can be expressed
by the tensor product of two 1-D wavelet (or wavelet packet) basis
functions along the horizontal and vertical directions. The
corresponding 2-D filter coefficients can be expressed as

$$h_{LL}(k,l) = h(k)\,h(l), \qquad h_{LH}(k,l) = h(k)\,g(l)$$   (13)

$$h_{HL}(k,l) = g(k)\,h(l), \qquad h_{HH}(k,l) = g(k)\,g(l)$$   (14)

   where the first and second subscripts in (13) and (14) denote the
low pass and highpass filtering characteristics in the x- and
y-directions, respectively [11]. We have applied an orthogonal wavelet
transformation with dyadic subsampling. Wavelet decomposition of images
is performed using the db4 wavelet basis function. On application of
the above procedure to an image of size 256×256, the db4 wavelet yields
subband matrices of 128×128 at the first level, 64×64 at the second
level, 32×32 at the third level and 16×16 at the fourth level of
wavelet decomposition.

  Image Coding
   In image coding, a six-digit code is generated for each image
feature which is to be stored and retrieved. The code generated depends
upon the value of the determinant generated. The number of different
codes that can be generated depends upon the value of the determinant.
The algorithm for finding the code is given below:

        Code = function(determinant value)

   The code generated depends upon the determinant value of the matrix.
The function generates a code between 0 and F based upon the value of
the determinant. It is a function whose input is the determinant value
obtained. Based upon the range of the determinant obtained, the
function returns a value which is the code for the image. The database
to store the images is a single table with three fields. The three
fields are serial number, image and index value.

  Image Coding Algorithm
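The coding function Code = function(determinant value) and the steps listed under this heading can be read as the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the helper names (transpose, matmul, determinant, code_digit, feature_code, image_code) are hypothetical, each feature matrix is assumed to contribute one hexadecimal digit of the final code, the product is taken as transpose times matrix, and determinant values at or below 10 (including 0) are assumed to map to code 0.

```python
from fractions import Fraction
import math


def transpose(a):
    """Step 2: transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Step 3: triple-loop product C[i][j] += A[i][k] * B[k][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                c[i][j] += a[i][k] * b[k][j]
    return c


def determinant(a):
    """Step 4: determinant by exact Gaussian elimination (any m x m)."""
    m = [[Fraction(x) for x in row] for row in a]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def code_digit(x):
    """Step 5: ranges of width 10 map to 0..14, anything above 150 to 15.

    Assumption: values at or below 10 (including 0) map to 0.
    """
    if x > 150:
        return 15
    if x <= 10:
        return 0
    return math.ceil(x / 10) - 1


def feature_code(feature):
    """Steps 1 to 5 for one feature matrix."""
    product = matmul(transpose(feature), feature)
    return code_digit(determinant(product))


def image_code(features):
    """Concatenate one hexadecimal digit per feature (assumed layout)."""
    return "".join(format(feature_code(f), "X") for f in features)


# Three small feature matrices for brevity; the paper uses six features.
features = [[[1, 2], [3, 4]], [[2, 0], [0, 3]], [[3, 0], [0, 5]]]
print(image_code(features))  # 03F
```

The determinants of the three products are 4, 36 and 225, which fall in the ranges for codes 00, 03 and 15. Exact fractions are used so that a value sitting on a range boundary is not pushed into the neighbouring code by rounding.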


Step 1: Get the matrix of the individual feature (m×m)
Step 2: Find the transpose of the matrix obtained
         Read the m×m matrix
         Give the input of the m×m matrix elements
         Transpose the m×m input matrix elements
Step 3: Multiply the two matrices to get a square matrix
         Read the order m×m of the first matrix
         Read the order p×q of the second matrix
         Give the input of the m×m first matrix elements
         Give the input of the p×q second matrix elements
         For i=0; i<m; i++, For j=0; j<q; j++, For k=0; k<p; k++
            C[i,j] = C[i,j] + (A[i,k] × B[k,j])
Step 4: Get the determinant value of the matrix (m×m)
         Read the m×m matrix
         Give the input m×m matrix elements
         The determinant value (shown for a 3×3 matrix) is given by
           A[i,j] × (A[i+1,j+1] × A[i+2,j+2] - A[i+2,j+1] × A[i+1,j+2])
         - A[i,j+1] × (A[i+1,j] × A[i+2,j+2] - A[i+2,j] × A[i+1,j+2])
         + A[i,j+2] × (A[i+1,j] × A[i+2,j+1] - A[i+1,j+1] × A[i+2,j])
Step 5: Based on the range of the determinant value x, assign the code y
         If x≥0 and x≤10 then y=00
         Else if x>10 and x≤20 then y=01
         Else if x>20 and x≤30 then y=02
         Else if x>30 and x≤40 then y=03
         Else if x>40 and x≤50 then y=04
         Else if x>50 and x≤60 then y=05
         Else if x>60 and x≤70 then y=06
         Else if x>70 and x≤80 then y=07
         Else if x>80 and x≤90 then y=08
         Else if x>90 and x≤100 then y=09
         Else if x>100 and x≤110 then y=10
         Else if x>110 and x≤120 then y=11
         Else if x>120 and x≤130 then y=12
         Else if x>130 and x≤140 then y=13
         Else if x>140 and x≤150 then y=14
         Else if x>150 then y=15

                    D. Error Handling Phase

   Errors may occur during retrieval of images from the database. The
error is caused by a change in the code during generation. The code
generated is made up of six parts, each obtained from an individual
feature. The error occurs when the determinant value obtained returns a
code that differs from the code of the image stored in the database.
This changes the code entirely. The severity of the error depends on
the level at which it occurs. When the error occurs at the bottom-most
level, the severity is minimal. Whenever such an error occurs,
searching the neighboring nodes can rectify it. This is possible only
if the error occurs at the last level. If the error occurs at a higher
level, tracing the tree becomes difficult.
Example
Let the image in the database have the node 122300.
Let the code generated for the image which is being given as the input
be 122301.
Searching the adjacent nodes, which would return the image required,
could rectify this error.

                    EXPERIMENTAL RESULTS

   We have tested the FATT for indexing and retrieving images with
three different databases through a query process such as query by
example (QBE). Database 1 (D1) consists of 1000 natural scene images of
size 256×256 from 60 selected categories: various beach images,
sunsets, waterfalls, clouds, mountains, glaciers, etc., from COREL.
Database 2 (D2) consists of 10000 natural scene images of 100
categories with contents similar to D1, from the COREL database.
Selected categories are rose, sunflower, dinosaur, bus, lion, elephant,
etc. We implemented FATT in MATLAB 7.4 under Windows Vista. The
experiments were performed on a Pentium Core 2 Duo 1.66 GHz PC with
1 GB RAM. Our implementations cover node access, distance computations
and the retrieval process.
   One indexing and retrieval result from D1 is shown in Fig.6, from D2
in Fig.7 and from D3 in Fig.8. The effectiveness of the proposed tree
has been validated by a large number of experiments.

        Fig.6. Retrieval of images from D1
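The neighbour search of the Error Handling Phase (stored node 122300, query code 122301) can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the index is modelled as a plain dictionary from six-digit codes to image names, the helper names neighbour_codes and lookup are hypothetical, and "adjacent" is taken to mean a difference of one step in the last (bottom-level) digit only, which matches the statement that the repair works only when the error occurs at the last level.

```python
HEX_DIGITS = "0123456789ABCDEF"


def neighbour_codes(code):
    """Codes that differ from `code` by one step in the last digit only."""
    last = HEX_DIGITS.index(code[-1].upper())
    return [code[:-1] + HEX_DIGITS[v]
            for v in (last - 1, last + 1)
            if 0 <= v < len(HEX_DIGITS)]


def lookup(index, code):
    """Return (matched_code, image) for an exact or adjacent match, else None."""
    if code in index:
        return code, index[code]
    for candidate in neighbour_codes(code):
        if candidate in index:
            return candidate, index[candidate]
    return None


# Hypothetical one-entry index holding the stored node from the example.
index = {"122300": "stored_image"}
print(lookup(index, "122301"))  # ('122300', 'stored_image')
print(lookup(index, "122400"))  # None: the error is above the last level
```

A query whose code differs in an earlier digit, such as 122400, finds no adjacent leaf, which is the case the text describes as difficult to trace.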


                                                                        Fig.11.I/Os versus number of objects in D2
                   Fig.7.Retrieval of images from D2

                    Fig.8.Retrieval of images from D3

                                                                    Fig.12.Computation costs versus no. of objects in D2

                                                                     Since FATT is balanced and dynamic tree the performances
 Fig.9.I/Os versus number of objects in D1                        like I/Os and computation costs is compared with existing
                                                                  balanced and dynamic tress such as M-tree, Slim-tree, HCT.
                                                                  The FATT consistently outperforms irrespective of the
                                                                  database size .Fig.9,Fig.11.,Fig.13 shows the I/Os versus
                                                                  number of objects in D1,D2 and D3.From the experimental
                                                                  though database size increased but the I/O cost maintains less
                                                                  and uniformly.upto 83% for the D1,upto81% for D2 and upto
                                                                  82% for D3 the performance is achieved.
                                                                     Fig.10. Fig.12, Fig.14.Shows the computation costs versus
                                                                  no.of objects for D1and D2.The computation cost the for
                                                                  D1decreased up to 82% up to 81% for D2 and upto83% for
                                                                  D3. Average distance computations performance also
                                                                  outperforms for D1 and D2.Experimental results show that
                                                                  the performance is no degradation irrespective of the database
                                                                  size is increased.FATT indexing technique can be
Fig.10.Computation costs versus no. of objects in D1
                                                                  conveniently used for retrieval and to achive the users goal to
                                                                  retrieve the most relevant images.

  Complexity of the algorithm
  Suppose we are given n images A, B, C, …, Z and suppose the images are inserted in order into
a FATT node. There are n! permutations of the n images. Each such permutation gives rise to a
corresponding tree. It can be shown that the average depth of the n! trees is approximately
log2 n + 1. Accordingly, the average running time f(n) to search for an image in a FATT with
n images is proportional to log2 n, i.e., f(n) = O(log2 n + 1) = O(log2 n).
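The logarithmic growth of the average depth can be checked empirically. The sketch below is an illustration only: it uses a plain, unbalanced binary search tree as a stand-in, because the internal structure of a FATT node is not reproduced here. The function names `average_depth` and `mean_over_orders` are hypothetical, and random shuffles are sampled instead of enumerating all n! insertion orders.

```python
import math
import random


def average_depth(keys):
    """Insert keys in the given order into a plain binary search tree
    and return the mean depth of its nodes (the root has depth 1)."""
    root = None  # each node is a list: [key, left_child, right_child]
    total = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1
                side = 1 if key < node[0] else 2
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        total += depth
    return total / len(keys)


def mean_over_orders(n, trials=50, seed=0):
    """Average the mean node depth over randomly sampled insertion orders."""
    rng = random.Random(seed)
    keys = list(range(n))
    acc = 0.0
    for _ in range(trials):
        rng.shuffle(keys)
        acc += average_depth(keys)
    return acc / trials


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, round(math.log2(n) + 1, 2), round(mean_over_orders(n), 2))
```

Under these assumptions the measured depth grows in proportion to log2 n, which is what the O(log2 n) search bound relies on; the constant factor depends on the tree variant.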
                        TABLE III
                  COMPLEXITY ANALYSIS
          log2 n       n     n log2 n       n^2          n^3
            1          2         2            4            8
          1.584        3       4.754          9           27
            2          4         8           16           64
          2.321        5      11.609         25          125
          2.584        6      15.509         36          216
          2.807        7      19.651         49          343
            3          8        24           64          512
          3.169        9      28.529         81          729
          3.321       10      33.219        100         1000
            8        256      2048        65536     16777216
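The entries of Table III can be recomputed directly. The following sketch is an illustration, not part of the original method: the helper names `truncate`, `complexity_row` and `complexity_table` are hypothetical, and values are truncated to three decimals, which is the convention the table appears to follow.

```python
import math


def truncate(value, places=3):
    """Cut a non-negative value to a fixed number of decimals without rounding up."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


def complexity_row(n):
    """Return (log2 n, n, n log2 n, n^2, n^3) for one database size n."""
    log_n = math.log2(n)
    return (truncate(log_n), n, truncate(n * log_n), n ** 2, n ** 3)


def complexity_table(sizes=(2, 3, 4, 5, 6, 7, 8, 9, 10, 256)):
    """Build one row per database size, in the order used by Table III."""
    return [complexity_row(n) for n in sizes]


if __name__ == "__main__":
    print(f"{'log2 n':>8} {'n':>5} {'n log2 n':>10} {'n^2':>8} {'n^3':>10}")
    for log_n, n, n_log_n, square, cube in complexity_table():
        print(f"{log_n:>8.3f} {n:>5d} {n_log_n:>10.3f} {square:>8d} {cube:>10d}")
```

The rows make the gap between the growth rates visible: at n = 256 the n log2 n term is 2048, while n^2 and n^3 are already 65536 and 16777216.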
   FATT nodes represent the database images. This requires prior calculation of the relative
similarity distances and hence incurs an O(n^2) computational cost. The nodes are constructed
dynamically because images can be inserted at any time, and such an operation requires O(n^3).
Therefore, an incremental FATT algorithm that adapts based on the leaf nodes is used. It is a
sequential algorithm with O(n) computational complexity per image and hence O(n^2) overall
cost, as desired.
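The per-image and overall costs quoted above can be illustrated by counting distance computations. The sketch below is not the FATT algorithm itself: the class `IncrementalDistanceStore`, its fields and the choice of Euclidean distance are assumptions made only to show why an insertion that compares the new image against the n stored images costs O(n), and why n such insertions cost n(n-1)/2 = O(n^2) in total.

```python
import math


class IncrementalDistanceStore:
    """Keeps all pairwise distances up to date as feature vectors arrive one by one."""

    def __init__(self):
        self.vectors = []
        self.distances = {}    # (i, j) with i < j -> Euclidean distance
        self.computations = 0  # number of distance evaluations so far

    def insert(self, vector):
        """Add one vector; only distances to already stored vectors are computed."""
        new_id = len(self.vectors)
        for old_id, old_vector in enumerate(self.vectors):
            self.distances[(old_id, new_id)] = math.dist(old_vector, vector)
            self.computations += 1
        self.vectors.append(vector)
        return new_id


if __name__ == "__main__":
    store = IncrementalDistanceStore()
    for i in range(100):
        store.insert((float(i), float(i % 7)))
    # 100 insertions need 100 * 99 / 2 distance evaluations in total.
    print(store.computations)
```

Recomputing every pairwise distance from scratch after each insertion would instead repeat the O(n^2) work n times, which is the cubic cost the incremental scheme avoids.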

                   CONCLUSION AND FUTURE WORK
   In this paper, a novel indexing technique for content-based image retrieval is designed,
particularly for fast insertion, searching and indexing, and to tackle the problem of overlap
between nodes. The M-tree has a fixed cell policy and, because of this inherent deficiency, is
not suitable for dynamic environments in which data are subject to permanent changes. The
Slim-tree is an extension of the M-tree that speeds up insertion using a node splitting
algorithm while also improving storage utilization; unfortunately, this algorithm does not
guarantee the minimal occupation of each node. HCT has a flexible cell policy, but its levels
are limited. Our proposed FATT greatly improves performance compared with the M-tree,
Slim-tree and HCT indexing schemes in I/O costs, computation costs and average distance
computations. The insertion speed of FATT improves by up to 84% compared with M-tree, up to
75% compared with Slim-tree and up to 70% compared with HCT. The searching speed improves by
up to 146% compared with M-tree, up to 105% compared with Slim-tree and up to 89% compared
with HCT. The possibility of using clustering, neural networks and other multimedia databases
such as audio and video, which may also give attractive performance, could be explored in
future.


   P. AnandhaKumar received the B.E. degree in Electronics and Communication from Government
College of Engineering, Salem, Tamil Nadu, and the M.E. degree in Computer Science and
Engineering from Government College of Technology, Coimbatore. He received the PhD degree from
College of Engineering Guindy in the Department of Computer Science and Engineering at Anna
University, Chennai. He is currently working as Assistant Professor in the Department of
Information Technology at Madras Institute of Technology, Anna University, Chennai, India. He
has published several papers in international and national journals and conferences. His areas
of interest include content-based image indexing techniques and frameworks, image processing
and analysis, video analysis, fuzzy logic, pattern recognition, knowledge management and
semantic analysis.

   V. Balamurugan received the B.E. degree in Electrical and Electronics Engineering from
Government College of Technology, Coimbatore, and the M.E. degree in Applied Electronics with
distinction from Karunya Institute of Technology, Coimbatore. He is currently pursuing the PhD
degree in the Department of Information Technology at Madras Institute of Technology, Anna
University, Chennai, India. His current research interests include content-based image
retrieval, image processing and analysis, digital signal processing, and wavelet theory and
applications.


  Ontology-supported processing of clinical text using
    medical knowledge integration for multi-label
          classification of diagnosis coding
                                    Phanu Waraporn1,4,*, Phayung Meesad2, Gareth Clayton3
                          Department of Information Technology, Faculty of Information Technology
                    Department of Teacher Training in Electrical Engineering, Faculty of Technical Education
                                  Department of Applied Statistics, Faculty of Applied Science
                                   King Mongkut’s University of Technology North Bangkok
                                Division of Business Computing, Faculty of Management Science
                                              Suan Sunandha Rajabhat University
                                                      Bangkok, Thailand

Abstract: This paper discusses the knowledge integration of clinical information extracted from
a distributed medical ontology in order to improve a machine learning-based multi-label coding
assignment system. The proposed approach is implemented using a decision tree based cascade
hierarchical technique on university hospital data for patients with Coronary Heart Disease
(CHD). The preliminary results obtained are satisfactory.

Keywords: medical ontology, diagnosis coding, knowledge integration, machine learning,
decision tree.

                      I.       INTRODUCTION

   Because medical data is typically heterogeneous, a traditional machine learning approach
alone cannot be applied directly and efficiently to the task of automating clinical coding.
Therefore, we present a knowledge integration method that uses a distributed medical ontology
to support knowledge capturing and integration, together with machine learning techniques, to
enhance the coding assignment of multi-label medical text.
   Text Categorization or Text Assignment (TA), as part of Natural Language Processing (NLP),
consists of the assignment of one or more preexisting categories to a text document [1].
In multi-label assignment, the problem can comprise various
   As large unstructured and structured medical databases are being generated continuously,
difficulties in accessing, integrating, extracting and managing knowledge from them are among
the many problems researchers are trying to overcome, including by utilizing ontologies, a
form of knowledge-based system that serves as a repository of structured knowledge, such as
UMLS, MeSH and ICD.
   According to Nelson et al. [2], several studies have shown that the use and integration of
several knowledge sources improves the quality and efficiency of information systems that
query an ontology, particularly in specific domains such as health information systems or
medicine. An ontology is a specification of a conceptualization that defines and/or specifies
the concepts, relationships, and other distinctions that are relevant for modeling a domain.
Such a specification takes the form of definitions of representational vocabulary (classes,
relations, and so on), which provide meanings for the vocabulary and formal constraints on its
coherent use [3].

             Fig.1. The framework of officially accepted disease [6].

   In compliance with the WHO ICD 10 for Coronary Heart Disease [4], a domain-specific ontology
based on a distributed architecture is constructed for use in this work to support the
knowledge capturing and integration processes. Figure 1 depicts an adapted framework of
officially accepted disease, generated in HOZO [5], which represents the high-level medical
ontology in our work and which can be shared and

      The essence of managing knowledge about semantic relations, how to acquire them, how to
integrate them and finally how to put them all together are discussed. Of the many machine
learning algorithms available, we short-listed a few and performed preliminary tests. We
compare the results with our own classifier, a variant of the existing decision tree, and
conclude that our approach marginally outperforms the traditional ones.
      The rest of the paper is arranged as follows: section 2 comments on some relevant works
in the field; section 3 briefly introduces the proposed system architecture; section 4
summarizes the data collection and experiments along with the preliminary results. We deliver
the conclusions and future work in section 5.

                      II.    RELATED WORKS

A. Ontology
      Our aim of enhancing the automatic assignment of medical coding by automatically
integrating ontologies is guided by the use of ontologies in various natural language
processing tasks such as automatic summarization, text annotation and word sense
disambiguation, among others [7]. The use of ontologies for relevance feedback, and of
corpus-dependent and corpus-independent knowledge models built on domain-specific and
domain-independent ontologies, all contribute to improving information retrieval systems [8].
      Although domain-independent ontologies such as WordNet and Medical WordNet improve word
sense disambiguation, their coverage is so broad that ambiguous terms remain problematic. A
domain-specific ontology is preferable in this respect, since its terminology is less
ambiguous. Furthermore, it models terms and concepts corresponding to a specific or given
domain [9].
      In our case, an Ischaemic Heart Disease ontology is manually built by reusing the
framework, high-level relations and concepts defined in accordance with [6]. Figures 2 and 3
illustrate the property inheritance and the top-level categories, respectively.

           Fig.2 Property Inheritance in HOZO Heart Disease [6].

           Fig.3 High Level category in HOZO [6].

B. Medical Knowledge Integration Framework and Process
      In a complex domain such as medicine, knowledge integration requires consolidating the
heterogeneous knowledge and reconfirming it with the induced models, since it originates from
sources with different levels of certainty and completeness. Therefore, new models are
collectively learned and comprehensively evaluated based on existing knowledge. Figure 4
illustrates the overview of medical knowledge integration, whilst a more comprehensive medical
knowledge integration process is depicted in Figure 5.

    Fig.4 Overview of medical knowledge integration architecture, adapted from [10].

                                                                                categories with the following meaning can be considered: P1
                                                                                can mean death, P2 deterioration of the patient’s state, P3
                                                                                steady state, P4 improvement of the patient’s state and P5

                                                                                As opposed to binary classification, multiple labels (e.g.,
                                                                                Breathing, Blood circulation, and Pain) are considered. In
                                                                                binary classification there was one label (e.g., Breathing) and
                                                                                the task was to assign instances either to the class of topically
                                                                                related (i.e., belongs to the class of Breathing) or to the class
                                                                                of topically unrelated (i.e., does not belong to the class of
                                                                                Breathing) objects. In other words, the multi-label
                                                                                classification task can be restructured to multiple binary
                                                                                classification tasks - one for each topic-label. The only
                                                                                difference in the multi-class classification is that in multi-label
    Fig.5 Domain knowledge integrated into medical knowledge integration        classification, each instance can belong to several classes at
                     process, adapted from [10].                                the same time.

C. Clinical data/text

     The patient records in use are discharge summary records, which
generally contain the principal diagnosis (PDx) information.
Particulars of a secondary diagnosis (SDx) may appear
provided that there is an additional ailment requiring treatment in
addition to the PDx. If present, the SDx can give details about the
comorbidity, the complication, the status of the disorder (chronic or
acute) and the external cause of the injury. Additionally, depending
on the necessity, the treatment may involve other medical
procedures (PROC), and such details are to be itemized. In the
case where only a PDx is reported, the coding process is fairly
simple and fast. Generally, however, patient records
contain many SDxs and multiple PROCs. The coder’s duty is to
translate the PDx(s), SDx(s) and PROC information into the
corresponding ICD 10 code(s). Other patient data aside from
the discharge summary include, but are not limited to, the
prescription information sheet, laboratory results, drugs prescribed,
nurse notes, progress notes, and other records [11].

D. Multi-label Classification Task [12][13]

     Pattern recognition, classification or categorization can be
viewed as a mapping from a set of input variables to an output
variable representing the class label. In classification problems
the task is to assign new inputs to labels of classes or
categories. The input data are the patient’s anamnesis, subjective
symptoms, observed symptoms and syndromes, measured
values (e.g. blood pressure, body temperature, etc.) and results
of laboratory tests. These data are coded by a vector x,
the components of which are binary or real numbers. The
patients are classified into categories D1, …, Dm that
correspond to their possible diagnoses d1, …, dm. Prediction
of the patient’s state can also be stated as a classification
problem. On the basis of the examined data represented by
vector x, patients are categorized into several categories P1, …,
Pm that correspond to different future states. For example five

E. Diagnosis Coding

     Figure 6 presents the ICD 10 codes for Coronary Heart Disease
(CHD) in hierarchical form. The first (top) level is
the high/concept level class, while the second and third levels
are the major and minor classes, respectively. The minor class
gives more specific information, yet not comprehensively
enough when compared with other major medical concepts
and ontologies such as UMLS, GALEN, SNOMED CT, etc. This
is due to the fact that the information is served differently.

     Key words/terms extracted from the patient discharge
summary undergo the coding process and, provided that
all relevant information is in accordance, the summary is
finally assigned the corresponding ICD 10 code(s) as
specified in the ICD 10.

        Fig.6 ICD 10 for Coronary Heart Disease in hierarchical form.

     With the preceding background and related work, development
of a system for automating the ICD 10 coding assignment for CHD
can now begin. In the next section, a framework for such a
system will be introduced. Furthermore, data collected for this

                                                                                                             ISSN 1947-5500
                                                                     (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                               Vol. 7, No. 3, 2010
preliminary experiment will be briefed, and an early result of
applying the proposed novel method called cascade
hierarchical Decision Tree (chiDT), which applies the
Decision Tree C4.5 algorithm twice, is presented.

     III.   THE PROPOSED FRAMEWORK AND PRELIMINARY
                     EXPERIMENTAL RESULTS
   This section explains the proposed framework shown in
Figure 7. Our emphasis is on the part surrounded by the
red block.

     Fig.7 High level framework for automating ICD 10 coding assignment.

    A machine-learning approach using a cascade of two
classifiers is used to predict the codes. Both classifiers are
trained with the same data and perform multi-label
classification by decomposing the task into 11 (out of 28)
binary classification problems, one for each code. In this
setting, it is possible for a classifier to predict an
empty, or impossible, combination of codes. Such known
errors are used to trigger the cascade: when the first classifier
makes a known error, the output of the second classifier is
used instead as the final prediction. No further correction of
the output of the second classifier is performed, as preliminary
experiments suggested that this would not further improve the
performance.

    In the context of Thai people, based on data collected
during 2009 for all patients with Coronary Heart Disease
(CHD) in one university hospital, out of 28 ICD 10 codes in
the CHD group, 11 classes were reported, the rest is non

    196 patients with CHD were reported to the university
hospital during 2009. 53 patient records were then separated
for training purposes. These instances were carefully selected
in order to cover all the classes. In order to understand the data
for preliminary testing purposes, we recombined the 53 training
records with the 143 remaining untrained ones and, based
on the chiDT classifier using WEKA [14] as the processing engine,
the following early results were obtained. The following
tables detail the results generated for each classifier.

    Table 1 presents the results for the 53 carefully selected
records covering the 11 classes widely seen in Thailand and used
for training, while Tables 2 to 4 present the results on all 196
records used for training the chiDT, SVM and Fuzzy
classifiers, respectively.

 Correctly Classified Instances        52       98.1132 %
 Incorrectly Classified Instances       1        1.8868 %
 Kappa statistic                        0.9756
 Mean absolute error                    0.012
 Root mean squared error                0.0818
 Relative absolute error                4.5652 %
 Root relative squared error           22.5907 %
 Total Number of Instances             53

   Table 1 With the original 53 records selected for training, the chiDT
              alone gave approximately 98.11 % accuracy.

 Correctly Classified Instances       185       94.3878 %
 Incorrectly Classified Instances      11        5.6122 %
 Kappa statistic                        0.9323
 Mean absolute error                    0.0107
 Root mean squared error                0.0927
 Relative absolute error                7.0466 %
 Root relative squared error           33.6758 %
 Total Number of Instances            196

   Table 2 With the original 196 records selected for training, the chiDT
              alone gave approximately 94.39 % accuracy.
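
The cascade described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' chiDT implementation: the per-code models are arbitrary callables standing in for trained C4.5 trees, and the names `predict_codes`, `is_known_error`, `cascade_predict` and the representation of "impossible" combinations as a set of forbidden code groups are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Set

Features = Sequence[float]
# One binary (yes/no) model per code; a stand-in for a trained decision tree.
BinaryModel = Callable[[Features], bool]


def predict_codes(models: Dict[str, BinaryModel], x: Features) -> Set[str]:
    """Multi-label prediction: ask each per-code binary model independently."""
    return {code for code, model in models.items() if model(x)}


def is_known_error(codes: Set[str], impossible: Set[FrozenSet[str]]) -> bool:
    """Assumed error rule: empty output, or output containing a forbidden group."""
    return not codes or any(combo <= codes for combo in impossible)


def cascade_predict(first: Dict[str, BinaryModel],
                    second: Dict[str, BinaryModel],
                    x: Features,
                    impossible: Set[FrozenSet[str]]) -> Set[str]:
    """Use the first classifier unless it makes a known error."""
    codes = predict_codes(first, x)
    if is_known_error(codes, impossible):
        # The second output is final; it is not corrected any further.
        return predict_codes(second, x)
    return codes


# Toy stand-in models with hypothetical code names "A" and "B".
first = {"A": lambda x: x[0] > 0.5, "B": lambda x: x[1] > 0.5}
second = {"A": lambda x: x[0] > 0.2, "B": lambda x: False}
impossible = {frozenset({"A", "B"})}
```

With these toy models, an input for which the first classifier returns nothing, or returns both "A" and "B", falls through to the second classifier, whose answer is accepted even if it is empty.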

 SVM
 Correctly Classified Instances       183       93.3673 %
 Incorrectly Classified Instances      13        6.6327 %
 Kappa statistic                        0.9196
 Mean absolute error                    0.1492
 Root mean squared error                0.2642
 Relative absolute error               97.9788 %
 Root relative squared error           95.9388 %
 Total Number of Instances            196

   Table 3 With the original 196 records selected for training, the chiDT
   with Support Vector Machine (SVM) classifier gave approximately
                          93.37 % accuracy.

 FLR
 Correctly Classified Instances       184       93.8776 %
 Incorrectly Classified Instances      12        6.1224 %
 Kappa statistic                        0.9262
 Mean absolute error                    0.0111
 Root mean squared error                0.1055
 Relative absolute error                7.3094 %
 Root relative squared error           38.3185 %
 Total Number of Instances            196

   Table 4 With the original 196 records selected for training, the chiDT
      with Fuzzy classifier gave approximately 93.88 % accuracy.

    The early results of these preliminary experiments show
that the proposed chiDT classifier, based on a cascade
hierarchical architecture, does produce satisfying output,
followed marginally by the Fuzzy and SVM techniques.

             IV.     CONCLUSION AND FUTURE WORKS
In this paper, we have provided an overview of the proposed
machine learning algorithm titled cascade hierarchical
Decision Tree (chiDT). Though the result was not
distinctly conclusive due to the limited size of the data set,
the authors expect that with an increased sample size the result
should improve substantially. To conclude that this
algorithm is reliable, a cost sensitivity analysis (CSA) will be
carried out in addition to other standard model evaluation
techniques. This remains upcoming work.
It is also appropriate to mention that by integrating
ontology and the medical knowledge integration framework into
this work, future medical knowledge management would
enhance the work of both computer scientists and clinical
specialists, to name just a few, through improved classifiers and
clinically generic models, respectively.

                      V.     ACKNOWLEDGEMENT
    We would like to extend our sincere thanks to the staff at
Mahidol University’s Faculty of Medicine Siriraj Hospital
Division of Molecular Genetics, the staff at the National Center
for Genetic Engineering and Biotechnology (BIOTEC) and
the staff at the Human Language Technology Laboratory of the
National Electronics and Computer Technology Center
(NECTEC) for all the resources and advice discussed in this
experimental study. In addition, we acknowledge the team at
the Mizoguchi Laboratory of the Institute of Scientific and
Industrial Research, Osaka University, for allowing us to
access the software HOZO, an ontology editor.

                            REFERENCES
[1]    Sebastiani, F. Machine Learning in automated text categorization. ACM
       Comput. Surv., 34(1), 1-47 (2002).
[2]    Nelson, S.J., et al. Relationships in medical subject headings. In C.A.
       Bean & R. Green (Eds.), Relationships in the Organization of Knowledge.
       New York: Kluwer Academic Publishers, (pp. 171-184) (2001).
[3]    Gruber, T. Toward Principles for the Design of Ontologies used for
       Knowledge Sharing. International Journal of Human-Computer Studies,
       43, 907-928 (1995).
[4]    WHO. World Health Organization.,
[5]    HOZO, an ontology editor.
[6]    Mizoguchi, R. et al. An Advanced Clinical Ontology. Available from
       Nature Precedings <> (2009)
[7]    Martin-Valdivia, M.T. Expanding terms with medical ontologies to
       improve a multi-label text categorization system. In P.Violaine and R.
       Mathieu (Eds.), Information Retrieval in Biomedicine: Natural Language
       Processing for Knowledge Integration, (pp 38-57) (2009).
[8]    Bhogal, J., et al. A review of ontology based query expansion.
       Information Processing & Management, 43(4), July 2007, 866-886
       (2007)
[9]    Waraporn, P., et al. Proposed framework for interpreting medical
       diagnosis records using adopted WordNet/Medical WordNet.
       Proceedings of Technology and Innovation for Sustainable Development
       Conference (TISD2008), 05_004_2008I, 433-436 (2008).
[10]   Kwiatkowska, M., et al. Knowledge-based induction of clinical
       prediction rules. In Berka, P., Rauch, J., Zighed, D. A. (Eds.), Data
       Mining and Medical Knowledge Management: Cases and Applications,
       (pp 356-357) (2009)
[11]   Waraporn, P., et al. Distributed Ontological Engineering and Integrated
       Development of Medical Diagnosis Coding Ontology for State Hospitals
       in Thailand, Proceedings of National Conference on Computer and
       Information Technology (NCIT 2008) (2008).
[12]   Vesely, A. Classification and prediction with neural networks. In Berka,
       P., Rauch, J., Zighed, D. A. (Eds.), Data Mining and Medical
       Knowledge Management: Cases and Applications (pp 76-107) (2009).
[13]   Suominen, H., et al. Machine learning to automate the assignment of
       diagnosis codes to free-text radiology reports: a method description. In
       Hauskrecht, M., Schuurmans, D., and Szepesvari, C., editors.
       Proceedings of the ICML/UAI/COLT 2008 Workshop on Machine
       Learning for Health-Care Applications, July 9, 2008, Helsinki, Finland
       (2008).
[14]   WEKA. An open sourced machine learning software from the University
       of Waikato. Http://

     AUTHORS PROFILE

Phanu Waraporn, a Ph.D. candidate in Information
Technology at the Faculty of Information Technology,
King Mongkut’s University of Technology North
Bangkok. Currently, he is the lecturer in Business
Computing at the Faculty of Management Science,
Suan Sunandha Rajabhat University, Bangkok,
Thailand.

Phayung Meesad, Ph.D., assistant professor. He is an
associate dean for academic affairs and research at the
faculty of Information Technology, King Mongkut’s
University of Technology North Bangkok. He earned
his MS and Ph.D. degrees in Electrical Engineering
from Oklahoma State University, U.S.

Gareth Clayton, Ph.D., a senior lecturer in statistics at
the Department of Applied Statistics, Faculty of
Applied Science, King Mongkut’s University of
Technology North Bangkok. He earned his Ph.D. in
Statistics from Melbourne University, Australia.


                 Botnet Detection by Monitoring Similar
                        Communication Patterns

              Hossein Rouhani Zeidanloo                                                   Azizah Bt Abdul Manaf
  Faculty of Computer Science and Information System                                 College of Science and Technology
           University of Technology Malaysia                                         University of Technology Malaysia
             54100 Kuala Lumpur, Malaysia                                             54100 Kuala Lumpur, Malaysia

Abstract- Botnets are among the most widespread and commonly
occurring threats in today’s cyber attacks, resulting in serious
threats to our network assets and organizations’ properties.
Botnets are collections of compromised computers (Bots) which are
remotely controlled by their originator (BotMaster) under a common
Command-and-Control (C&C) infrastructure. The C&C infrastructure is
used to distribute commands to the Bots for malicious activities
such as distributed denial-of-service (DDoS) attacks, spam and
phishing. Most of the existing Botnet detection approaches
concentrate only on particular Botnet command and control (C&C)
protocols (e.g., IRC, HTTP) and structures (e.g., centralized), and
can become ineffective as Botnets change their structure and C&C
techniques. In this paper we first provide a taxonomy of Botnet
C&C channels and evaluate the well-known protocols used in each of
them. Then we propose a new general detection framework which
currently focuses on P2P based and IRC based Botnets. The proposed
framework is based on the definition of a Botnet: a group of bots
that perform similar communication and malicious activity
patterns within the same Botnet. The point that distinguishes our
proposed detection framework from many other similar works is
that there is no need for prior knowledge of Botnets, such as a
Botnet signature.

   Keywords-Botnet; Bot; centralized; decentralized; P2P; similar

                       I.   INTRODUCTION
   A Bot is a new type of malware [1] installed on a
compromised computer, which can then be controlled remotely by
the BotMaster to execute orders through received
commands. After the Bot code has been installed on the
compromised computer, the computer becomes a Bot or
Zombie [2]. In contrast to existing malware such as viruses and
worms, whose main activities focus on attacking the
infected host, bots can receive commands from the BotMaster
and are used as a distributed attack platform.
   Botnets are networks consisting of a large number of Bots.
Botnets are created by the BotMaster (a person or a group of
persons who control remote Bots) to set up a private
communication infrastructure which can be used for malicious
activities such as Distributed Denial-of-Service (DDoS),
sending large amounts of SPAM or phishing mails, and other
nefarious purposes [3,4,5]. Bots infect a person’s computer in
many ways. Bots usually disseminate themselves across the
Internet by looking for vulnerable and unprotected computers
to infect. When they find an unprotected computer, they infect
it and then send a report to the BotMaster. The Bots stay hidden
until they are instructed by their BotMaster to perform an
attack or task. Other ways in which attackers infect a
computer on the Internet with a Bot include sending email and
using malicious websites, but the common way is searching the
Internet for vulnerable and unprotected computers [6].
    The main difference between Botnets and other kinds of
malware is the existence of the Command-and-Control (C&C)
infrastructure. The C&C allows Bots to receive commands and
malicious capabilities, as directed by the BotMaster. BotMasters
must ensure that their C&C infrastructure is sufficiently robust
to manage thousands of distributed Bots across the globe, as
well as to resist any attempts to shut down the Botnets.
Attackers are also continually improving their
approaches to protecting their Botnets. The first generation of
Botnets utilized IRC (Internet Relay Chat) channels as
their Command-and-Control (C&C) centers. The centralized
C&C mechanism of such Botnets has made them vulnerable to
being detected and disabled. Therefore, a new generation of
Botnets which can hide their C&C communication has
emerged: Peer-to-Peer (P2P) based Botnets. P2P Botnets
do not suffer from a single point of failure, because they do
not have centralized C&C servers [12]. Attackers have
accordingly developed a range of strategies and techniques to
protect their C&C infrastructure. The rest of the paper is
organized as follows. In Section 2, we analyze different Botnet
communication topologies and consider the protocols that are
currently being used in each model. In Section 3, we review
the related work. In Section 4, we describe our proposed
detection framework and all its components, and we
conclude in Section 5.

   According to the Command-and-Control (C&C) channel, we
categorize Botnet topologies into two different models, the
Centralized model and the Decentralized model.
A. Centralized model
    In this model, one central point is in charge of exchanging
commands and data between the BotMaster and the Bots. The
BotMaster chooses a host (usually a high bandwidth
computer) to be the central point (Command-and-Control
server) of all the Bots. The C&C server runs certain network
services such as IRC or HTTP. The main advantage of this
model is small message latency, which lets the BotMaster easily
arrange the Botnet and launch attacks. Since all connections
happen through the C&C server, the C&C server is a
critical point in this model. In other words, the C&C server is
the weak point in this model. If somebody manages to discover
and eliminate the C&C server, the entire Botnet will be
useless and ineffective. This is the main negative
aspect of this model.
    Since IRC and HTTP are two common protocols that the C&C
server uses for communication, we consider Botnets in this
model based on IRC and HTTP. Figure 1 shows the basic
communication architecture for a Centralized model. There are
two central points that forward commands and data between
the BotMaster and his Bots.

    Figure 1. Command and control architecture of a Centralized model

  1) Botnet based on IRC: IRC is a form of real-time
Internet text messaging or synchronous conferencing [13]. The
protocol is based on the Client-Server model, which can be
used on many computers in distributed networks. Some
advantages which made the IRC protocol widely used for
remote communication in Botnets are: (1) low latency
communication; (2) anonymous real-time communication; (3)
the ability of Group (many-to-many) and Private (one-to-one)
communication; (4) simple setup; (5) simple commands (the
basic commands are to connect to servers, join channels and
post messages in the channels); and (6) great flexibility in
communication. Therefore the IRC protocol is still the most
popular protocol being used in Botnet communication [5].

                      Figure 2. IRC based Botnet

    In this model, BotMasters can command their Bots as a
whole or command a few of the Bots selectively using one-to-
one communication. The C&C server runs an IRC service that is
the same as any other standard IRC service. The BotMaster usually
creates a designated channel on the C&C servers where all the
Bots will connect, awaiting commands in the channel which
will instruct each connected Bot to do the BotMaster’s
bidding. Figure 2 shows that there is one central IRC server
that forwards commands and data between the BotMaster (also
called the BotHerder) and his Bots.
  2) Botnet based on HTTP: The HTTP protocol is another
popular protocol used by Botnets. Since the use of the IRC
protocol within Botnets became well-known, more internet security
researchers gave attention to monitoring IRC traffic to detect
Botnets. Consequently, attackers started to use the HTTP protocol
as a Command-and-Control communication channel to make
Botnets more difficult to detect. The main advantage
of using the HTTP protocol is that Botnet traffic is hidden in
normal web traffic, so it can easily bypass firewalls with
port-based filtering mechanisms and avoid IDS detection.
Usually firewalls block incoming/outgoing traffic to unwanted
ports, which often include the IRC port. There are some
known Bots using the HTTP protocol, such as Bobax [16],
ClickBot [13] and Rustock [17]. Gu et al. in reference [10]
pointed out that the HTTP protocol is in a “pull” style and
IRC is in a “push” style. However the architecture of both is
B. Decentralized Model
     Due to the main disadvantage of the Centralized model, attackers
started to build alternative Botnet communication systems that
are much harder to discover and to destroy. Hence, they decided
to find a model in which the communication system does not
depend entirely on a few selected servers and survives even the
discovery and destruction of a number of Bots. As a result,
attackers exploit the idea of Peer-to-Peer (P2P)
communication as a Command-and-Control (C&C) pattern
which is more resilient to failure in the network. The P2P
based C&C model is expected to be used significantly in Botnets
in the near future, and Botnets that use the P2P based C&C
model pose a much bigger challenge for the defense of networks.
Since P2P based communication is more robust than
Centralized C&C communication, more Botnets will move to
the P2P protocol for their communication.
    In the P2P model, as shown in Figure 3, there is no
Centralized point for communication. Each Bot keeps some

connections to the other Bots of the Botnet. Bots act as both clients and servers. A new Bot
must know some addresses of the Botnet in order to connect to it. If some Bots in the Botnet are
taken offline, the Botnet can still continue to operate under the control of the BotMaster. P2P
Botnets aim at removing or hiding the central point of failure, which is the main weakness and
vulnerability of the Centralized model.

        Figure 3.   Example of Peer-to-peer Botnet Architecture

   Some P2P Botnets are decentralized only to a certain extent, while others are completely
decentralized. Completely decentralized Botnets allow a BotMaster to inject a command into any
Bot and have it broadcast to a specified node. Since P2P Botnets usually allow commands to be
injected at any node in the network, authentication of commands becomes essential to prevent
other nodes from injecting incorrect commands.

                       III.   RELATED WORK
   Different approaches have been proposed for Botnet detection. There are essentially two:
one is based on placing honeynets in the network, and the other on monitoring and analysis of
passive network traffic [20].
   Many papers discuss how to apply honeynets for Botnet detection
[5,3,21,22,23,24,1,25,26]. Honeynets are useful for understanding Botnet characteristics and
technology, but cannot always detect bot infection.
   We can categorize passive network traffic monitoring approaches as signature-based,
anomaly-based, DNS-based and mining-based.
   Signature-based Botnet detection uses the signatures of current Botnets. For instance,
Snort [27] can monitor network traffic to find signatures of existing bots. Signature-based
detection can only be used to detect well-known Botnets; consequently, it does not work for
unknown bots.
   Anomaly-based detection approaches try to detect Botnets based on a number of network
traffic anomalies, such as high network latency, high volumes of traffic, traffic on unusual
ports, and unusual system behavior that could indicate the existence of bots in the network [28].
Although this technique addresses the problem of detecting unknown Botnets, it cannot detect an
IRC Botnet that has not yet been used for attacks. To solve this, Binkley and Singh [14]
proposed an algorithm that combines TCP-based anomaly detection with IRC tokenization and IRC
message statistics to create a system that can clearly detect client Botnets. This algorithm can
also reveal bot servers [14]. However, Binkley's approach could easily be defeated by simply
using a minor cipher to encode the IRC commands.
   Gu et al. proposed Botsniffer [13], which uses network-based anomaly detection to identify
Botnet C&C channels in a local area network. Botsniffer is based on the observation that bots
within the same Botnet will likely show very strong similarities in their responses and
activities. Therefore, it employs several correlation analysis algorithms to detect
spatial-temporal correlation in network traffic with a very low false positive rate [13].
   DNS-based detection techniques rely on DNS information generated by a Botnet. As mentioned
before, bots normally initiate a connection with the C&C server to get commands. To reach the
C&C server, bots carry out DNS queries to locate it; the server is typically hosted by a DDNS
(Dynamic DNS) provider. Therefore, it is feasible to detect Botnet DNS traffic by monitoring DNS
and detecting DNS traffic anomalies [29, 30].
   In 2005, Dagon [31] proposed a method to discover Botnet C&C servers by detecting domain
names with unusually high or temporally intense DDNS query rates. This method is similar to the
approach proposed by Kristoff [32] in 2004.
   In 2007, Choi et al. [29] suggested an anomaly detection mechanism that monitors group
activities in DNS traffic. They defined some special features of DNS traffic to differentiate
valid DNS queries from Botnet DNS queries. This method is more efficient than the prior
approaches and can detect Botnets regardless of the type of bot by looking at their group
activities in DNS traffic [29].
   Goebel and Holz [15] proposed Rishi in 2007. Rishi is primarily based on passive traffic
monitoring for odd or suspicious IRC nicknames, IRC servers, and uncommon server ports. It uses
n-gram analysis and a scoring system to detect bots that use uncommon communication channels,
which are commonly not detected by classical intrusion detection systems [15]. The disadvantage
of this method is that it can detect neither encrypted communication nor non-IRC Botnets.
   Strayer et al. [33] proposed a network-based approach for detecting Botnet traffic which
uses a two-step process: first separating IRC flows, and then distinguishing Botnet C&C traffic
from normal IRC flows [33]. This technique is specific to IRC based Botnets.
   Masud et al. [34] proposed flow-based Botnet traffic detection by mining multiple log
files. They proposed several log correlations for C&C traffic detection. They categorize an
entire flow to identify Botnet C&C traffic. This method can detect non-IRC Botnets [34].
   Botminer [35] is the most recent approach which applies data mining techniques for Botnet
C&C traffic detection. Botminer is an improvement of Botsniffer [13]. It clusters similar
communication traffic and similar malicious traffic.

Then, it performs cross-cluster correlation to identify the hosts that share both similar
communication patterns and similar malicious activity patterns. Botminer is an advanced Botnet
detection tool which is independent of Botnet protocol and structure. Botminer can detect
real-world Botnets, including IRC-based, HTTP-based, and P2P Botnets, with a very low false
positive rate [35].
   As mentioned above, researchers have proposed several approaches and techniques
[14,15,16,13,17,18] for detecting Botnets. The majority of these approaches were developed for
detecting IRC or HTTP based Botnets [14,15,18]. For instance, BotSniffer [13] is designed
especially for detecting IRC and HTTP based Botnets. Rishi [15] is also designed for detecting
IRC based Botnets, using well-known IRC bot nickname patterns as signatures. Recently, however,
the structure of Botnets has moved from centralized to distributed (e.g., using P2P [9,19]).
Consequently, detection approaches designed for IRC or HTTP based Botnets may become
ineffective against the new P2P based Botnets. Therefore, we need to develop a next generation
Botnet detection system that is also effective against P2P based Botnets. In addition, this
detection system should require no prior knowledge of particular Botnets (such as Botnet
signatures or C&C server names/addresses).
   To come up with a new detection system that also meets the requirements for detecting P2P
based Botnets, we studied the communication and activity characteristics of a few P2P based
Botnets (e.g. Storm Worm) and eventually arrived at an effective definition of Botnets,
especially for P2P based Botnets:
    “A group of bots (at least three) within the same Botnet will perform similar communication
and malicious activities”.
We share a similar idea for the definition of a Botnet with Gu et al. in Botminer [35]. It means
that if each bot within the same Botnet shows different behavior (e.g. in terms of receiving
instructions), the bots are only isolated infected systems that we cannot consider a Botnet
under our definition. Based on this definition, we propose a new framework for Botnet detection
that mainly targets P2P based and IRC based Botnets; however, the framework can accommodate
another component for HTTP based Botnet detection. This framework monitors both the group of
hosts that show similar communication patterns and the group of hosts that perform malicious
activities, and tries to find the hosts common to both.

    IV.   PROPOSED BOTNET DETECTION FRAMEWORK AND COMPONENTS
   Our proposed framework is based on passively monitoring network traffic. This model is
based on the definition of P2P Botnets, namely that multiple bots within the same Botnet will
perform similar communication patterns and malicious activities. Figure 4 shows the architecture
of our proposed Botnet detection system, which consists of 7 main components: Filtering,
Application Classifier, Traffic Monitoring, Malicious Activity Detector, Analyzer, Monitoring &
Clustering and Flows Analyzer.
   Filtering is responsible for filtering out unrelated traffic flows. The main benefit of this
stage is reducing the traffic workload and making the application classifier process more
efficient. Application Classifier is responsible for separating IRC and HTTP traffic from the
rest of the traffic. Malicious Activity Detector is responsible for analyzing the traffic
carefully, detecting malicious activities that internal hosts may perform, and separating those
hosts and sending them to the next stage. Traffic Monitoring is responsible for detecting groups
of hosts that have similar behavior and communication patterns by inspecting network traffic.
Analyzer is responsible for comparing the results of the previous parts (Traffic Monitoring and
Malicious Activity Detector) and finding hosts that are common to the results of both.
Monitoring & Clustering is responsible for monitoring the traffic flows and clustering similar
flows into the same database. Flows Analyzer is responsible for detecting groups of hosts that
have similar behavior and communication patterns by comparing the databases received from the
previous stage, in order to detect IRC based bots.

A. Filtering
   The main objective of Filtering is to reduce the traffic workload and make the rest of the
system perform more efficiently. Figure 5 shows the architecture of the filtering.

                Figure 5. Traffics filtering stages

   In C1, we filter out traffic whose targets (destination IP addresses) are recognized servers
that are unlikely to host Botnet C&C servers. For this purpose we used the top 500 websites on
the web (Http://, which the top 3 are, and
   In C2, we filter out handshaking processes (connection establishments) that are not
completely established. Handshaking is an automated process of negotiation that dynamically sets
parameters of a communications channel established between two entities before normal
communication over the channel begins. It follows the physical establishment of the channel and
precedes normal information transfer [36]. To establish a connection, TCP uses a three-way
handshake; in this case we filter out traffic whose TCP handshake has not completed, for example
when a host sends SYN packets without completing the TCP handshake. Based on our experience,
these flows are mostly caused by scanning activities.
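
   The two filtering stages of the Filtering component (C1 and C2) can be sketched as follows.
This is a minimal illustration under stated assumptions, not the authors' implementation: the
Flow record, its handshake_complete flag, the function name filter_flows and the whitelist
contents are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record; field names are assumptions for this sketch."""
    src_ip: str
    dst_ip: str
    protocol: str                    # "TCP" or "UDP"
    handshake_complete: bool = True  # only meaningful for TCP flows


def filter_flows(flows, whitelist):
    """Apply the C1 and C2 stages and return the flows that survive both."""
    kept = []
    for flow in flows:
        # C1: drop traffic whose destination is a recognized (whitelisted) server.
        if flow.dst_ip in whitelist:
            continue
        # C2: drop TCP flows whose three-way handshake never completed
        # (for example, SYN packets sent by a scanner).
        if flow.protocol == "TCP" and not flow.handshake_complete:
            continue
        kept.append(flow)
    return kept


if __name__ == "__main__":
    sample = [
        Flow("10.0.0.5", "192.0.2.10", "TCP"),           # whitelisted destination
        Flow("10.0.0.5", "198.51.100.7", "TCP", False),  # incomplete handshake
        Flow("10.0.0.6", "198.51.100.8", "TCP"),         # kept
    ]
    for flow in filter_flows(sample, whitelist={"192.0.2.10"}):
        print(flow)
```

   In practice the whitelist would be built from the list of recognized servers mentioned for
C1; here it is a plain set of documentation-range addresses.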


                                       Figure 4. Architecture overview of our proposed detection framework
                                       (block labels recovered from the diagram: Network, Filtering, Application, IRC, HTTP,
                                       Other (P2P), Monitoring & Clustering, Flows Analyzer, Report)

B. Application classifier
    Application Classifier is responsible for separating IRC and HTTP traffic from the rest of
the traffic and sending it to the Monitoring & Clustering and HTTP components. To detect IRC
traffic we can inspect the contents of each packet and try to match the data against a set of
user-defined strings. For this purpose we use payload inspection that only inspects the first few
bytes of the payload, looking for specific strings. These IRC-specific strings are NICK for the
client’s nickname, PASS for a password, USER for the username, JOIN for joining a channel, OPER,
by which a regular user requests operator privileges, and PRIVMSG, which marks the message as a
private message [37]. Detecting IRC traffic with this strategy is fairly simple for most network
intrusion detection software, such as Snort. In some cases botmasters use encryption to secure
their communication, which makes packet content analysis useless. This issue is not our target
here.
    In the next step, we also have to separate HTTP traffic and send it to the Centralized part.
For this purpose we can also inspect the first few bytes of an HTTP request and, if it contains
certain patterns or strings, separate it and send it to the Centralized part. To detect HTTP
traffic we rely on the basic concept of the HTTP protocol. Like most network protocols, HTTP
uses the client-server model: an HTTP client opens a connection and sends a request message to an
HTTP server (e.g. "Get me the file 'home.html'"); the server then returns a response message,
usually containing the resource that was requested ("Here's the file", followed by the file
itself). After delivering the response, the server closes the connection (making HTTP a
stateless protocol, i.e. not maintaining any connection information between transactions) [38].
    In the format of the HTTP request message, we focus on HTTP methods. Three common HTTP
methods are “GET”, “HEAD” and “POST” [38]:

     •     GET is the most common HTTP method; it says "give me this resource".
     •     A HEAD request is similar to a GET request, except that it asks the server to return
           the response headers only, and not the actual resource (i.e. no message body). This
           is useful for checking the characteristics of a resource without downloading it,
           which saves bandwidth. HEAD is used when a file’s contents are not needed.
     •     A POST request is used to send data to the server to be processed in some way, for
           example by a CGI script. A POST request differs from a GET request in the following
           ways:
                •    There's a block of data sent with the request, in the message body. There
                     are usually extra headers to describe this message body.
                •    The request URI is not a resource to retrieve; it's usually a program to
                     handle the data you're sending.
                •    The HTTP response is normally program output, not a static file.

    Therefore we inspect the traffic, and if the first few bytes of an HTTP request contain
“GET”, “POST” or “HEAD”, this indicates the HTTP protocol; we separate those flows and send
them to the Centralized part. After filtering out HTTP and IRC traffic, the remaining traffic,
which may contain P2P traffic, is sent to the Traffic Monitoring part and the Malicious Activity
Detector. In parallel, however, we can use other approaches for identifying P2P traffic. We have
to take into consideration that P2P traffic is one of the most challenging application types.
Identifying P2P traffic is difficult both because of the large number of proprietary P2P
protocols and because of the deliberate use of random port numbers for communication.
Payload-based classification approaches tailored to P2P traffic have been presented in [41, 40],
while identification of P2P traffic through transport layer characteristics is proposed in [39].
Our suggestion for using

specific applications or tools for identifying P2P traffic other                        update their commands from botmasters or aim to attack a
than sending remaining traffic is to use BLINC [42], which can                          target; their similar behaviors are then more obvious. Therefore,
identify general P2P traffic. In contrast to previous methods,                          the next step is to look for groups of databases that are similar
BLINC is based on observing and identifying patterns of host                            to each other.
behavior at the transport layer. BLINC investigates these
patterns at three levels of increasing detail: (i) the social, (ii)                                       Table 2. Database for analogous flows
the functional and (iii) the application level. This approach has                            SPort           DPort               nbps              nbpp
two important features. First, it operates in the dark, having (a)
no access to packet payload, (b) no knowledge of port
numbers and (c) no additional information other than what
current flow collectors provide [42].
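
    The payload check described for the Application Classifier (matching the first few bytes of
a payload against IRC command strings and HTTP methods) can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' implementation: the function
name classify_payload, the 16-byte prefix_len and the returned labels are hypothetical.

```python
# Strings named in the text for IRC, and the three HTTP methods it discusses.
IRC_COMMANDS = (b"NICK", b"PASS", b"USER", b"JOIN", b"OPER", b"PRIVMSG")
HTTP_METHODS = (b"GET", b"HEAD", b"POST")


def classify_payload(payload: bytes, prefix_len: int = 16) -> str:
    """Label a payload as "HTTP", "IRC" or "OTHER" from its first few bytes.

    prefix_len is an assumed value; the text only says "the first few bytes".
    """
    head = payload[:prefix_len]
    parts = head.split(None, 1)        # first whitespace-delimited token
    token = parts[0] if parts else b""
    if token in HTTP_METHODS:          # HTTP methods are case-sensitive
        return "HTTP"
    if token.upper() in IRC_COMMANDS:  # IRC commands are case-insensitive
        return "IRC"
    return "OTHER"                     # candidate P2P traffic, handled elsewhere


if __name__ == "__main__":
    for sample in (b"GET /home.html HTTP/1.0\r\n\r\n",
                   b"NICK bot42\r\n",
                   b"\x13BitTorrent protocol"):
        print(classify_payload(sample), sample[:20])
```

    As the text notes, this kind of inspection stops working once the botmaster encrypts the
channel, so "OTHER" does not imply that a flow is benign.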
                                                                                           We propose a simple solution for finding similarities
C. Traffic Monitoring                                                                   among groups of databases. For each database we draw a
    Traffic Monitoring is responsible for identifying hosts that                        graph on x-y axes, where the x-axis is the average number of bytes
are likely part of a Botnet during the time that hosts (bots)                           per packet (nbpp) and the y-axis is the average number of bytes per
initiate attacks, by analyzing flow characteristics and finding                         second (nbps): (X, Y) = (nbpp, nbps).
similarities among them. Therefore, we capture network                                     For example, in database (di), each row has an nbpp value
flows and record some special information on each flow. We                              that specifies the x-coordinate and an nbps value that determines the
use the Audit Record Generation and Utilization System                                  y-coordinate. Together they determine a
(ARGUS), an open source tool [43], for monitoring flows and                             point (x, y) on the graph. We repeat this procedure for all
recording the information that we need in this part. Each                               rows (network flows) of each database. In the end, each
flow record has the following information: Source IP (SIP)                              database yields a number of points on the graph, and
address, Destination IP (DIP) address, Source Port (SPORT),                             connecting those points to each other gives a curve.
Destination Port (DPORT), Duration, Protocol (Pr), Number of                            Figure 6 shows an example of two different databases,
packets (np) and Number of bytes (nb) transferred in both                               based on data in our lab, whose graphs are almost similar to
directions.                                                                             each other.

                   Table 1. Recorded information of network flows

       fi    SIP      DIP    Sport    Dport    Pr    np    nb       duration
       f1
        .

   Then we insert this information into a database like Table 1,
in which {fi}, i = 1…n, are network flows. After this stage we
specify a period of time, which is 6 hours, and during each 6
hours, all n flows that have the same Source IP, Destination IP,
Destination port and protocol (TCP or UDP) are marked                                        Figure 6. Example of two similar graphs based on data in our lab
and for each network flow {fi} (row) we calculate the average
number of bytes per second and the average number of bytes per                             The next step is comparing the different x-y graphs;
packet:                                                                                 during that period of time (each 6 hours), graphs that are
   •        Average number of bytes per second (nbps) = Number of                       similar to each other are clustered in the same category. The
            bytes / Duration                                                            result is a set of x-y graphs that are similar to each
   •        Average number of bytes per packet (nbpp) = Number of                       other. Each of these graphs refers to its corresponding
            bytes / Number of packets                                                   database from the previous step. We record the SIP
                                                                                        addresses of those hosts and send the list to the next step for
Then, we insert these two new values (nbps and nbpp), together with                     analysis.
the SIP and DIP of the flows that have been marked, into another
database, similar to Table 2. Therefore, during the specified                           D. Malicious Activity Detector
period of time (6 hours), we might have a set of databases,                                 In this part we analyze the outbound traffic from
{di}, i = 1…m, in which each database has the same SIP,                                 the network and try to detect the possible malicious activities
DIP, DPORT and protocol (TCP/UDP). We focus only                                        that the internal machines are performing. Each host may
on the TCP and UDP protocols in this part.                                              perform different kinds of malicious activity, but scanning and
As we mentioned earlier, the bots belonging to the same                                 spamming are the most common and efficient malicious
Botnet have the same characteristics. They have similar behavior                        activities a botmaster may command bots to perform
and communication patterns, especially when they want to                                [44,26,45]. The output of this part is the list of hosts that
                                                                                        performed malicious activities.
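
    The per-flow features computed by Traffic Monitoring in Section C (nbps and nbpp, grouped
into one database per SIP, DIP, DPORT and protocol) can be sketched as follows. This is a
minimal illustration, not the authors' implementation: the dictionary keys, the function name
build_databases and the choice to skip flows with zero duration or zero packets are assumptions,
and the flows passed in are assumed to belong to a single 6-hour window already.

```python
from collections import defaultdict


def build_databases(flows):
    """Group flows by (SIP, DIP, DPORT, protocol) and compute (nbpp, nbps) per flow.

    Each flow is a dict with the hypothetical keys
    sip, dip, dport, proto, np (packets), nb (bytes) and duration (seconds).
    """
    databases = defaultdict(list)
    for flow in flows:
        # Only TCP and UDP are considered in this part of the framework.
        if flow["proto"] not in ("TCP", "UDP"):
            continue
        # Assumption: skip degenerate records to avoid division by zero.
        if flow["duration"] <= 0 or flow["np"] <= 0:
            continue
        nbps = flow["nb"] / flow["duration"]  # average bytes per second
        nbpp = flow["nb"] / flow["np"]        # average bytes per packet
        key = (flow["sip"], flow["dip"], flow["dport"], flow["proto"])
        databases[key].append((nbpp, nbps))   # one (x, y) point per flow
    return dict(databases)


if __name__ == "__main__":
    flows = [
        {"sip": "10.0.0.5", "dip": "203.0.113.9", "dport": 6667,
         "proto": "TCP", "np": 20, "nb": 1000, "duration": 10.0},
        {"sip": "10.0.0.5", "dip": "203.0.113.9", "dport": 6667,
         "proto": "TCP", "np": 10, "nb": 400, "duration": 4.0},
    ]
    for key, points in build_databases(flows).items():
        print(key, points)
```

    Each list of (nbpp, nbps) points corresponds to one graph in the comparison step; how two
graphs are judged "similar" is not specified in the text and is left out of this sketch.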

   1) Scanning: Scanning activities may be used for malware
propagation and DoS attacks. Most scan detection has been                       b) Outbound Scan Detection (OSD): OSD is based on a
based on detecting N events within a time interval of T                    voting scheme (AND, OR or MAJORITY). In this part SCADE
seconds. This approach has the problem that once the window                has three parallel anomaly detection models that track all
size is known, attackers can easily evade detection by                     outbound connections per internal host:
increasing their scanning interval. Snort also uses this                   • Outbound scan rate (s1): Detects local hosts that perform
approach. Snort version 2.0.2 uses two preprocessors. The                  high-rate scans for many external addresses.
first is packet-oriented, focusing on detecting malformed                  • Outbound connection failure rate (s2): Detects unusually
packets used for “stealth scanning” by tools such as nmap                  high connection failure rates, with sensitivity to HS port usage.
[46]. The second is connection-oriented. It checks whether a               The anomaly score s2 is calculated with this formula:
given source IP address touched more than X ports
or Y IP addresses within Z seconds. Snort’s                                        $s_2 = (w_1 f_{hs} + w_2 f_{ls}) / C$
parameters are tunable, but it suffers from the same drawbacks
as Network Security Monitor (NSM) [47], since both rely on the              $f_{hs}$: number of failed attempts at high-severity (HS) ports
same metrics [48]. Other work focusing on scan                              $f_{ls}$: number of failed attempts at low-severity ports
detection is by Staniford et al. on the Stealthy Probing and                $C$: total number of scans from the host within a time
Intrusion Correlation Engine (SPICE) [49]. SPICE focuses                         window
on detecting stealthy scans, especially scans that spread across
multiple source addresses and execute at very low rates.                   • Normalized entropy of scan target distribution (s3):
SPICE assigns anomaly scores to packets based on                           Calculates a Zipf (power-law) distribution of outbound address
conditional probabilities derived from the SIP, DIP and                    connection patterns. A uniformly distributed scan target
ports. It uses simulated annealing to cluster packets together             model provides an indication of a possible outbound scan. An
into port scans using heuristics developed from real                       anomaly scoring technique based on normalized entropy is
scans [49]. An important need in our system is prompt                      used to identify such candidates:
response; however, reaching our goals of promptness
and accuracy in detecting malicious scanners is a difficult task.                  $s_3 = H / \ln(m)$
Another solution is Threshold Random
Walk (TRW) [48], an online detection algorithm. TRW is based                $H$: entropy of the scan target distribution
on sequential hypothesis testing.
   After assessing different approaches for detecting scanning                     $H = -\sum_{i=1}^{m} p_i \ln(p_i)$
activities, the best solution for this part is the Statistical
sCan Anomaly Detection Engine (SCADE) [16], a Snort                         $m$: total number of scan targets
preprocessor plug-in system which has two modules, one for                  $p_i$: fraction of the scans at target i
inbound scan detection and another one for detecting outbound
attack propagation.                                                          2) Spam-related Activities: E-mail spam, known as
      a) Inbound Scan Detection(ISD): In this part SCADE has               Unsolicited Bulk Email (UBE), junk mail, is the practice of
focused on detection of scan activities based on ports that are            sending unwanted email messages, in large quantities to an
usually used by malware. One of the good advantages of this                indiscriminate set of recipients. More than 95% of email on
procedure is that it is less vulnerable to DOS attacks, mainly             the internet is spam[50], which most of these spam are sent
because its memory trackers do not maintain per-external-                  from Botnets. A number of famous Botnets which have been
source-IP. SCADE here just tracks scans that are targeted to               used specially for sending spam are Storm Worm which is P2P
internal hosts. The bases of Inbound Scan Detection are on                 Botnet and Bobax that used Http as its C&C.
failed connection attempts. SCADE in this part has defined                 A common approach for detecting spam is the use of DNS
two types of ports: High-Severity (hs) ports which                         Black/Black       Hole      List      (DNSBL)       such       as
representing highly vulnerable and commonly exploited                      ( DNSBLs specify a list
services and low-severity (ls) ports. For make it more                     of spam senders’ IP addresses and SMTP servers are blocking
applicable in current situation SCADE focused on TCP and                   the mail according to this list. This method is not efficient for
UDP ports as high-secure and all other as low-secure ports.                bot-infected hosts, because legitimate IP addresses may be
There are different weights to a failed scan attempt for                   used for sending spam in our network. Creation or misuse of
different types of ports.                                                  SMTP mail relays for spam is one of the most well-known
The warning for ISD for a local host is produced based on an               exploitation of Botnets. As we know user-level client mail
anomaly score that is calculated as based on this formula:                 application use SMTP for sending messages to mail server for
                                                                           relaying. However for receiving messages, client application
                     S= (w1fhs + w2 fls)
                                                                           usually use Post Office Protocol (POP) or the Internet
 fhs: indicate numbers of failed attempts at high-severity ports           Message Access Protocol (IMAP) to access the mail box on a
  fls: shows numbers of failed attempts at low-severity ports              mail server. Our idea in this part is very simple and efficient.
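As a rough illustration of the SCADE anomaly scores defined earlier in this
section (S for inbound scans, s2 and s3 for outbound scans), the following
Python sketch computes the three values from simple counts. The function
names, the default weights (3.0 and 1.0) and the handling of the corner cases
C = 0 and m < 2 are assumptions made for this example only; they are not taken
from SCADE or BotHunter.

```python
import math
from collections import Counter


def inbound_score(f_hs, f_ls, w1=3.0, w2=1.0):
    """S = w1*f_hs + w2*f_ls (the weights are illustrative assumptions)."""
    return w1 * f_hs + w2 * f_ls


def outbound_failure_score(f_hs, f_ls, total_scans, w1=3.0, w2=1.0):
    """s2 = (w1*f_hs + w2*f_ls) / C, where C is the number of scans in the window."""
    if total_scans == 0:
        return 0.0  # assumed convention: no scans means no anomaly
    return (w1 * f_hs + w2 * f_ls) / total_scans


def normalized_entropy(targets):
    """s3 = H / ln(m), with H = -sum(p_i * ln(p_i)) over the m distinct targets."""
    counts = Counter(targets)
    m = len(counts)
    if m < 2:
        return 0.0  # assumed convention: ln(1) = 0, so the ratio is undefined
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(m)
```

A host that spreads its scans evenly over many targets yields a value of s3
close to 1, while a host that contacts one target almost exclusively yields a
value much closer to 0.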

Our target here is not to recognize which email messages are spam, but to
detect groups of bots that are sending spam by finding similarities among
their actions and behaviors. Therefore the content of emails sent from the
internal network to external networks is not important in our solution. All
we want to do is determine which clients have been infected by a bot and are
sending spam. To reach this target, we focus on the number of emails sent by
clients to different mail servers. Based on our experience in our lab, a
single client using different external mail servers many times is an
indication of possible malicious activity. It is also unusual for a client in
our network to send many emails to the same mail server (SMTP server) within
a period such as one day. Therefore, we inspect outgoing traffic at the
gateway of our network and record in a database the SIP and DIP of traffic
whose destination port is 25 (SMTP) or 587 (Submission). Based on the network
flows between internal hosts and external computers (the mail servers) and
the number of times they occur, we can conclude which internal hosts are
behaving unusually and are sending many emails to different or the same mail
servers.

E. Analyzer
    The Analyzer, which is the last part of our proposed framework for
detection of Botnets, is responsible for finding common hosts that appear in
the results of the previous parts (Traffic Monitoring and Malicious Activity
Detector).

F. Monitoring and Clustering
    Since communication between an IRC server and its bots follows a
one-to-many (multicast) model, the network flows to all bots should show
similar characteristics and patterns. Our objective in this part is to detect
IRC based Botnets by monitoring network traffic. Our approach is based on
identifying hosts that are likely part of a Botnet before an attack is
initiated, particularly during the time that the IRC server is commanding or
updating its bots.
    Monitoring & Clustering is responsible for inspecting network traffic and
clustering network flows with similar characteristics. We therefore capture
network flows and record specific information about each flow. We use ARGUS
to monitor flows and record the information needed in this part. Each flow
record has the following information: Source IP (SIP) address, Destination IP
(DIP) address, Source Port (SPORT), Destination Port (DPORT), Duration (Dr),
Protocol (Pr), Packet Arrival Time (PAT), Number of packets (np) and Number
of bytes (nb) transferred in both directions. Then, we insert this
information into a database as shown in Table 3, in which
$\{f_i\}_{i=1,\ldots,n}$ are network flows.

              Table 3. Recorded information of network flows

   fi    SIP    DIP    SPORT    DPORT    Dr    Pr    PAT    np    nb
   f1
   f2
   ...

    After this stage, we specify a period of time, which is 6 hours. During
each 6-hour period, all n flows that have the same Source IP, Destination IP,
Source port, Destination port, Packet Arrival Time (PAT) and protocol (TCP or
UDP) are marked, and for each marked network flow (row) we calculate nbps and
nbpp based on the formulas mentioned earlier.
    Then, we insert these two new values (nbps and nbpp), together with the
SIP and DIP of the flows that have been marked, into another database, similar
to Table 2. Therefore, during the specified period of time (6 hours), we
might have a set of databases, $\{d_i\}_{i=1,\ldots,m}$, in which each
database has the same SIP, DIP, DPORT, PAT and protocol (TCP/UDP). We focus
only on the TCP and UDP protocols in this part. These databases are sent to
the next stage, the Flows Analyzer, to find similar databases.

G. Flows Analyzer
    The Flows Analyzer is responsible for looking for groups of databases that
are similar to each other. The comparison and search for analogous databases
is similar to the approach described for the Traffic Monitoring component.
After finding similar databases, we record the SIP addresses of the
corresponding hosts and report them as a group of bots that belong to an IRC
based Botnet.

                           V.     CONCLUSION
    The first seminar on Botnets was held in 2007, and since then many Botnet
detection techniques have been proposed and some real bot detection systems
have been implemented (e.g. BotHunter by Gu et al. [16]). Botnet detection is
a challenging problem. In this paper we first defined a taxonomy for better
understanding of Botnets. Then we proposed a new general detection framework,
which currently focuses on P2P based Botnets and IRC based Botnets. The
proposed framework is based on the definition of Botnets: a Botnet is a group
of bots that perform similar communication and malicious activity patterns
within the same Botnet. What distinguishes our proposed detection framework
from many similar works is that it needs no prior knowledge of Botnets, such
as Botnet signatures. In addition, we plan to further improve the efficiency
of the proposed detection framework by adding a dedicated detection method
for the HTTP part, to make it one general system for the detection of
Botnets, and to implement it in the near future.

                            ACKNOWLEDGMENT
    The authors would like to express their appreciation to Universiti
Teknologi Malaysia (UTM) for its invaluable technical and financial support
in encouraging the authors to publish this paper.
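To make the Monitoring and Clustering step of Section F more concrete, the
following Python sketch groups flow records into per-window "databases" and
attaches nbps and nbpp to each flow. It is only a sketch under stated
assumptions: the formulas for nbps and nbpp are given earlier in the paper
and are not reproduced here, so the code assumes bytes per second (nb / Dr)
and bytes per packet (nb / np). The Flow class, the group_flows function and
the choice to bucket PAT into 6-hour windows are hypothetical simplifications,
not the authors' implementation or ARGUS's record format.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 6 * 60 * 60  # the 6-hour period used in the text


@dataclass(frozen=True)
class Flow:
    sip: str          # source IP
    dip: str          # destination IP
    sport: int        # source port
    dport: int        # destination port
    dr: float         # duration in seconds
    pr: str           # protocol, e.g. "TCP" or "UDP"
    pat: float        # packet arrival time, seconds since some epoch
    n_packets: int    # np in Table 3
    n_bytes: int      # nb in Table 3


def group_flows(flows):
    """Group TCP/UDP flows by (6-hour window, SIP, DIP, DPORT, protocol)."""
    databases = defaultdict(list)
    for f in flows:
        if f.pr not in ("TCP", "UDP"):
            continue  # only TCP and UDP are considered in this part
        window = int(f.pat // WINDOW_SECONDS)
        # Assumed definitions: bytes per second and bytes per packet.
        nbps = f.n_bytes / f.dr if f.dr > 0 else 0.0
        nbpp = f.n_bytes / f.n_packets if f.n_packets > 0 else 0.0
        databases[(window, f.sip, f.dip, f.dport, f.pr)].append((nbps, nbpp))
    return dict(databases)
```

Each resulting list of (nbps, nbpp) pairs plays the role of one database
d_i; comparing such lists across hosts is the job of the Flows Analyzer and
is not sketched here.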

                                REFERENCES

[1]  P. Barford and V. Yegneswaran, "An Inside Look at Botnets," in Special
     Workshop on Malware Detection, Advances in Information Security,
     Springer, Heidelberg, 2006.
[2]  N. Ianelli and A. Hackworth, "Botnets as a Vehicle for Online Crime,"
     CERT, December 2005.
[3]  E. Cooke, F. Jahanian, and D. McPherson, "The zombie roundup:
     Understanding, detecting, and disrupting Botnets," in Proc. Workshop on
     Steps to Reducing Unwanted Traffic on the Internet (SRUTI'05), June 2005.
[4]  Honeynet Project, "Know your Enemy: Tracking Botnets," March 2005.
[5]  M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, "A multifaceted
     approach to understanding the Botnet phenomenon," in 6th ACM SIGCOMM on
     Internet Measurement Conference, IMC 2006, 2006, pp.
[6]  H. R. Zeidanloo and A. A. Manaf, "Botnet Command and Control
     Mechanisms," in Second International Conference on Computer and
     Electrical Engineering (ICCEE '09), 2009, pp. 564-568.
[7]  D. T. Ha, G. Yan, S. Eidenbenz, and H. Q. Ngo, "On the Effectiveness of
     Structural Detection and Defense Against P2P-based Botnet," in Proc.
     39th Annual IEEE/IFIP International Conference on Dependable Systems and
     Networks (DSN'09), June 2009.
[8]  J. Oikarinen and D. Reed, "Internet Relay Chat protocol," May 1993. Web
     publication. Available at URL:
[9]  J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon,
     "Peer-to-peer Botnets: Overview and case study," in Proc. USENIX
     HotBots'07, 2007.
[10] J. Stewart, "Bobax Trojan analysis."
[11] N. Daswani, M. Stoppelman, and the Google Click Quality and Security
     Teams, "The anatomy of ClickBot.A," in Proc. 1st Workshop on Hot Topics
     in Understanding Botnets (HotBots 2007), 2007.
[12] K. Chiang and L. Lloyd, "A case study of the rustock rootkit and spam
     Bot," in Proc. 1st Workshop on Hot Topics in Understanding Botnets
     (HotBots 2007), 2007.
[13] G. Gu, J. Zhang, and W. Lee, "BotSniffer: Detecting Botnet Command and
     Control Channels in Network Traffic," in Proc. 15th Annual Network and
     Distributed System Security Symposium (NDSS'08), San Diego, CA,
     February 2008.
[14] R. Binkley and S. Singh, "An algorithm for anomaly-based Botnet
     detection," in Proc. USENIX SRUTI'06, pp. 43-48, July 2006.
[15] J. Goebel and T. Holz, "Rishi: Identify bot contaminated hosts by IRC
     nickname evaluation," in Proc. USENIX HotBots'07, 2007.
[16] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee, "BotHunter:
     Detecting malware infection through IDS-driven dialog correlation," in
     Proc. 16th USENIX Security Symposium (Security'07), 2007.
[17] A. Karasaridis, B. Rexroad, and D. Hoeflin, "Wide-scale Botnet detection
     and characterization," in Proc. USENIX HotBots'07, 2007.
[18] W. T. Strayer, R. Walsh, C. Livadas, and D. Lapsley, "Detecting Botnets
     with tight command and control," in Proc. 31st IEEE Conference on Local
     Computer Networks (LCN'06), 2006.
[19] R. Lemos, "Bot software looks to improve peerage," Http://, 2006.
[20] Z. Zhu, G. Lu, Y. Chen, Z. J. Fu, P. Roberts, and K. Han, "Botnet
     Research Survey," in Proc. 32nd Annual IEEE International Conference on
     Computer Software and Applications (COMPSAC '08), 2008, pp. 967-972.
[21] A. Ramachandran and N. Feamster, "Understanding the network-level
     behavior of spammers," in Proc. ACM SIGCOMM, 2006.
[22] K. K. R. Choo, "Zombies and Botnets," Trends and issues in crime and
     criminal justice, no. 333, Australian Institute of Criminology,
     Canberra, March 2007.
[23] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A. C. Snoeren,
     G. M. Voelker, and S. Savage, "Scalability, Fidelity and Containment in
     the Potemkin Virtual Honeyfarm," in Proc. ACM SIGOPS Operating System
     Review, vol. 39(5), pp. 148-162, 2005.
[24] F. Freiling, T. Holz, and G. Wicherski, "Botnet tracking: Exploring a
     root-cause methodology to prevent distributed denial-of-service
     attacks," in Proc. 10th European Symposium on Research in Computer
     Security (ESORICS), Lecture Notes in Computer Science, vol. 3676,
     September 2005, pp. 319-335.
[25] D. Dagon, C. Zou, and W. Lee, "Modeling Botnet propagation using time
     zones," in Proc. 13th Network and Distributed System Security Symposium
     (NDSS'06), 2006.
[26] J. Oberheide, M. Karir, and Z. M. Mao, "Characterizing Dark DNS
     Behavior," in Proc. 4th International Conference on Detection of
     Intrusions and Malware, and Vulnerability Assessment, 2007.
[27] Snort IDS web page, March 2006.
[28] B. Saha and A. Gairola, "Botnet: An overview," CERT-In White Paper
     CIWP-2005-05, 2005.
[29] H. Choi, H. Lee, H. Lee, and H. Kim, "Botnet Detection by Monitoring
     Group Activities in DNS Traffic," in Proc. 7th IEEE International
     Conference on Computer and Information Technology (CIT 2007), 2007,
     pp. 715-720.
[30] R. Villamarin-Salomon and J. C. Brustoloni, "Identifying Botnets Using
     Anomaly Detection Techniques Applied to DNS Traffic," in Proc. 5th IEEE
     Consumer Communications and Networking Conference (CCNC 2008), 2008,
     pp. 476-481.
[31] D. Dagon, "Botnet Detection and Response, The Network is the
     Infection," in OARC Workshop, 2005.
[32] J. Kristoff, "Botnets," in 32nd Meeting of the North American Network
     Operators Group, 2004.
[33] W. Strayer, D. Lapsley, B. Walsh, and C. Livadas, "Botnet Detection
     Based on Network Behavior," ser. Advances in Information Security,
     Springer, 2008, pp. 1-24.
[34] M. M. Masud, T. Al-khateeb, L. Khan, B. Thuraisingham, and K. W. Hamlen,
     "Flow-based identification of Botnet traffic by mining multiple," in
     Proc. International Conference on Distributed Framework & Application,
     Penang, Malaysia, 2008.
[35] G. Gu, R. Perdisci, J. Zhang, and W. Lee, "BotMiner: Clustering analysis
     of network traffic for protocol- and structure-independent Botnet
     detection," in Proc. 17th USENIX Security Symposium, 2008.
[36]
[37] J. Rayome, "IRC on your dime? What you really need to know about
     Internet Relay Chat," CIAC/LLNL, May 22, 1998.
[38] HTTP Made Really Easy.
[39] T. Karagiannis, A. Broido, M. Faloutsos, and kc claffy, "Transport layer
     identification of P2P traffic," in ACM/SIGCOMM IMC, 2004.
[40] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, "Is
     P2P dying or just hiding?" in IEEE Globecom 2004, GI.
[41] S. Sen, O. Spatscheck, and D. Wang, "Accurate, Scalable In-Network
     Identification of P2P Traffic Using Application Signatures," in WWW,
     2004.
[42] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel
     traffic classification in the dark," in Proc. 2005 Conference on
     Applications, Technologies, Architectures, and Protocols for Computer
     Communications, pp. 229-240, Philadelphia, Pennsylvania, 2005.
[43] Argus (Audit Record Generation and Utilization System).
[44] M. Collins, T. Shimeall, S. Faber, J. Janies, R. Weaver, M. De Shon, and
     J. Kadane, "Using uncleanliness to predict future Botnet addresses," in
     Proc. ACM/USENIX Internet Measurement Conference, 2007.
[45] J. Zhuge, T. Holz, X. Han, J. Guo, and W. Zou, "Characterizing the
     IRC-based Botnet phenomenon," Peking University & University of
     Mannheim Technical Report, 2007.
[46] Nmap: free security scanner for network exploration & security audits.

[47] L. T. Heberlein, G. V. Dias, K. N. Levitt, B. Mukherjee, J. Wood, and
     D. Wolber, "A network security monitor," in Proc. IEEE Symposium on
     Research in Security and Privacy, pp. 296-304, 1990.
[48] "Fast Portscan Detection Using Sequential Hypothesis Testing," in IEEE
     Symposium on Security and Privacy 2004, Oakland, CA, May 2004.
[49] S. Staniford, J. A. Hoagland, and J. M. McAlerney, "Practical automated
     detection of stealthy portscans," in Proc. 7th ACM Conference on
     Computer and Communications Security, Athens, Greece, 2000.
[50] M. Ward, "More than 95% of e-mail is 'junk'," 2006.

                          AUTHORS PROFILE

Hossein Rouhani Zeidanloo received his B.Sc. in software engineering from
Meybod University, Iran. He is currently completing his Master's degree in
Information Security at the Universiti Teknologi Malaysia (UTM). He has
published two papers in international journals and many papers at
international conferences around the world. His area of interest is Network
Security and Ethical

Azizah Abdul Manaf is a Professor at Universiti Teknologi Malaysia (UTM). She
graduated with a B.Eng. (Electrical) in 1980, an M.Sc. in Computer Science in
1985 and a PhD in 1995 from UTM. Her current areas of interest and research
are image processing, watermarking, steganography, information security,
Botnets and worms, intrusion detection and computer forensics, and she has
postgraduate students at the Masters and PhD level to assist her in these
research areas. She has written numerous articles in journals and presented a
large number of papers at national and international conferences on her
research areas. Prof. Dr. Azizah has also held management positions at the
University and Faculty level, such as Head of Department, Deputy Dean, Deputy
Director and Academic Director, pertaining to academic development as well as
training for teaching and learning methodologies at UTM.

                                                  (IJCSIS) International Journal of Computer Science and Information Security,
                                                  Vol. 7, No. 3, March 2010

                              JawaTEX: A System for Typesetting Javanese

                Ema Utami1 , Jazi Eko Istiyanto2 , Sri Hartati3 , Marsono4 , and Ahmad Ashari5
                       Information System Major of STMIK AMIKOM Yogyakarta
                     Ring Road Utara ST, Condong Catur, Depok Sleman Yogyakarta
                           Telp. (0274) 884201-884206, Faks. (0274) 884208
          Candidate Doctor of Computer Science of Postgraduate School Gadjah Mada University
                                     Doctoral Program in Computer Science
                                 Graha Student Internet Center (SIC) 3rd floor
                     Faculty of Mathematic and Natural Sciences Gadjah Mada University
                                 Sekip Utara Bulaksumur Yogyakarta. 55281
                                          Telp/Fax: (0274) 522443
                        Sastra Nusantara Major of Culture Sciences Gadjah Mada University
                                  Humaniora ST No.1, Bulaksumur, Yogyakarta
                                              Fax. (0274) 550451

                         Abstract

Transliteration of Latin text to Javanese letters is a letter substitution
from the Latin alphabet to the Javanese alphabet. Research on the
transliteration of Latin text documents to Javanese letters in digital form
is still very scarce. At present, some research focuses on making fonts for
use in word processing software. To write Javanese letters with such a font,
the user must have good knowledge of reading and writing Javanese letters and
is expected to memorize the symbols and certain complicated Latin letters
needed to obtain the expected Javanese letters. Other research has produced
conversion programs, but these programs have weaknesses: the conversion
result is not easy to move to other media or to print, not all programs can
convert a text file, and the result of the conversion is not in accordance
with the rules. In addition, not all Latin character writing can be
transliterated to Javanese characters.
   This research is the first phase in overcoming the problems of
Latin-Javanese text document transliteration. It is therefore necessary to
build a transliteration model, named JawaTEX, to transliterate Latin text
documents to Javanese characters. The parser method used in this research is
the Context Free Recursive Descent Parser. The Latin text document is
processed into a list of Latin string split patterns using a rule-based
method, whereas the matching of each Latin string split pattern to its LaTeX
mapping form uses the Pattern Matching method. The use of the rule-based
method can solve problems of previous research that used certain other
methods. The established transliteration model is supported by the production
rules for browsing the Latin string split patterns, the models of the Latin
string split patterns, the production rules for the Latin-Javanese character
mapping, the models of the syntax coding pattern, the LaTeX style or macro,
and the Javanese character Metafont. A spelling checker that corrects typing
mistakes by applying the Brute Force algorithm is also provided within the
system.
   Several testing results show that if the user writes every word correctly,
including loanwords spelled according to their original pronunciation, and
writes or re-arranges the Latin spelling in the source text, then the
established transliteration model can be used to transliterate a Latin text
document to Javanese character writing. The concept of text document
splitting and the established transliteration in this article can be used as
a basis for developing other cases. For future research, well-formed
splitting of Javanese character writing still needs to be developed. Javanese
character writing sometimes cannot be justified, since the Javanese



character writing does not recognize space between words.
   Key words: transliteration, Javanese characters, typesetting, The Context Free
Recursive Descent Parser, Pattern Matching, rule based, Brute Force, LaTeX, JawaTEX

1    INTRODUCTION
In many countries, research has been done to develop character computerization for
the local culture. Latin to Javanese character machine transliteration is one research
field in computational linguistics. Less research has been done in this field in
Indonesia than in other countries [13]. At present there are two Javanese TrueType
fonts, made by Teguh Budi Sayoga [10] and Jason Glavy [4]. These fonts are used in
word processors such as Microsoft Word and Open Office [10, 4]. When using these
fonts, users have to remember several formats. For example, to write the Javanese
sentence aksara jawi punika teksih kathah kekiranganipun, the Latin characters that
have to be typed are ?aksrjwipunika [tkSih kqh kekirqnNipun/.
    How can Javanese characters adapt to the recent growth in vocabulary? It is
impossible to restrict the Latin input that will be transliterated to Javanese
characters [15]. For example, foreign communities have complex vocabularies,
including personal names, place names, and organization names, that contain
consonant combinations unlikely to occur in Javanese character writing. Besides
that, not all Latin characters have equivalents among the existing Javanese
characters [16]. At present, several rules of Javanese grammar are not relevant to
transliteration, because many words in the world have no equivalent in Javanese
characters. No researchers have yet developed an algorithm that handles the writing
of Javanese characters for the Latin characters x and q, multiple consonants
(sequences of more than two consonants), diphthongs, and the roman numbering system,
and that accommodates space, dash, and period to avoid ambiguity [13]. The grammar
or rules for writing Javanese characters need to be reconsidered in relation to the
huge number of words in use today. Rules that are not relevant need to be revised in
coordination with Javanese experts.
    Transliteration research using TeX/LaTeX and Metafont has not been done yet
[13], so this paper presents new research in this field (Latin to Javanese character
transliteration using TeX/LaTeX) [13]. Knowing the problems, advantages, and
disadvantages of the existing research is needed to make a font based on Metafont
and a transliterator from Latin to Javanese characters using TeX/LaTeX, which we
call JawaTEX (Javanese Typesetting and Transliteration Text Document). JawaTEX is
intended to make writing Javanese characters simpler and easier, based on a complex
transliteration algorithm. The transliteration model in this paper does not focus on
font making but on document transliteration. The input text used in this
transliteration is native text written in Latin characters. So the scope is not only
the Javanese community that knows Javanese characters but also communities that use
Latin characters [16].

2     NATURAL LANGUAGE PROCESSING
A parser is the base system in natural language processing. A parser is a subprogram
or system that reads every sentence, word by word, to determine "what is what" [11].
Parsing is the process of command extraction in natural language. There are three
kinds of parsers: The State Machine Parser, The Context Free Recursive Descent
Parser, and The Noise Disposal Parser. In The State Machine Parser, processing
follows a rule that depends on the current state. In The Context Free Recursive
Descent Parser, a sentence is seen as a group of items, each of which is in turn
composed of other items, until it can finally be broken down into atomic items [11].
The rule that governs how each part can be built is called a production rule of the
grammar. The Noise Disposal Parser works by removing words that are not important,
as in making a transcription.
    String matching is the problem of finding all occurrences of a string x of
length m, often called the pattern, in another string t of length n, called the text
[1]. String matching is the process of finding the characters that form a pattern in
another string or in the content of a text document [9, 2]. Pattern Matching can be
divided into two models: exact string matching and inexact (fuzzy) string matching
[5]. Inexact string matching is in turn divided into two models: approximate string
matching [3] and phonetic string matching [5]. String searching involves two
elements: the text (a string of n characters) and the pattern (a string of m
characters, m < n) that will be searched for in the text [8]. There are several
string searching algorithms, for example the Brute Force, Knuth Morris Pratt, and
Boyer Moore algorithms [12]. Each algorithm has its advantages and disadvantages
[9]. One algorithm is better than another in a particular case, while in a different
case the other may be more effective at finding the solution [19]. Compared with the
others, the Brute Force algorithm is simple and robust [6]. This research used the
Brute Force algorithm because its degree of success in finding the solution is 100%,
although it is weak in processing time efficiency [7].
    The choice of parsing method and linguistic computation depends on the problems
to be faced [13, 14, 15, 16]. This research used Pattern Matching to transliterate
the Latin string written by the user because 177 models for building the
transliteration of a text document have been determined. The production rules for
splitting the Latin string produce 177 Latin string split models that can be expanded to



produce more than 280490 Latin string split patterns [18]. The string to be
transliterated is a sequence of ASCII characters in a text document (not a graphic
document), so no training is necessary. Pattern matching is the act of checking for
the presence of the constituents of a given pattern or model. Pattern matching is
used to check for a wanted structure, to look for a relevant structure, to extract
the wanted part, and to replace the part found with something else. Here, pattern
matching is used to find the relevant Latin string breaking pattern model for every
Latin string breaking pattern obtained from syntax analysis. Every model has its own
production rule for transliteration. These production rules give a precise
transliteration from the Latin string breaking pattern to the destination character.
This method needs some care. To obtain the correct Latin string pattern, backward
searching is sometimes needed. In addition, there are several ways to determine the
first character in the Latin string breaking search, depending on the characters
around it. The Context Free Recursive Descent Parser can be used in this situation.
Sometimes parsing is applied not only to the source string but also to the result of
string processing. In formal language theory, every language has a consistent and
standard grammar. In fact, it is impossible to restrict the strings that will be
substituted with Javanese characters, so new rules were needed to revise some
irrelevant rules. A character in the input string, for example a Latin consonant,
can map to more than one Javanese letter depending on the character that follows
that consonant. The period can likewise indicate the end of a sentence, a decimal,
or an abbreviation, so it influences how that character builds the Latin string
split pattern. The transliteration model in this paper follows the production rules
for checking and breaking the Latin string by splitting it down to the smallest
token. The smallest token is the Latin string split pattern form. The rules in the
transliteration model were made according to linguistic knowledge of writing
Javanese script.

3    JawaTEX TRANSLITERATOR CONCEPT

The process schema of Latin to Javanese character transliteration with LaTeX is
shown in Figure 1.
    The text document to be transliterated to Javanese characters is written using a
text editor. Before the Latin string split patterns are browsed, the state of every
word must be known. Possible errors in the writing of the source text are checked by
matching against a dictionary [14], so the sensitivity or accuracy depends on the
completeness of the word list in the dictionary. Words are searched for in the
dictionary using the Brute Force algorithm: every time the pattern fails to match
the text, the pattern is shifted one character to the right. This spell checker
facility finds word similarity in order to correct misspelled words. The word
checking process consists of two checks, against an Indonesian list of 7962 words
and an English list of 29759 words [17]. If a word is similar to an Indonesian
entry, it is saved temporarily and then checked against the English list. If a word
is not in the database, it can still be replaced by new input.
    Parsing of the text document is done to get tokens from the sequences of
characters that make up the text document, read from left to right. The result of
the parsing process is a token list, composed of the sequences of characters that
make up the text document. The token list is broken into sets of characters that
make up the Latin string. Syntax analysis checks each token, compares it with the
token list, and matches it with the production rules [16]. Ambiguity problems that
arise when Latin text is transliterated into Javanese characters are prevented by
handling the space character [13]. The list of tokens is divided into the syllables
that make up the Latin string. Syntax analysis tracks down the character or token to
be compared with the available token list and matched with the production rules for
checking and breaking the Latin string. The Latin string split must accommodate the
handling of ambiguity problems [14]. This splitting determines the structure of the
Latin string split pattern obtained. So it must be determined which character
becomes the fundamental character, which character expresses a pasangan character,
and so on [16].
    The structure of the split Latin string pattern obtained here is then matched
with the Pattern Matching production rules to get the relevant pattern for writing
Javanese characters [14]. So production rules are used both for splitting the Latin
string and for obtaining the transliteration pattern. The next process is to look
for the relevant transliteration mapping pattern with which to replace each part of
the split Latin string pattern by a Javanese character, after first determining the
position of that Javanese character [17, 18].
    The intermediate text is a document with TeX/LaTeX code and syntax. The code and
syntax are used to transliterate the split Latin string pattern once the Javanese
character positions are known. The intermediate text is a text document with the
extension .tex that follows the correct rules for writing Javanese characters [13].
Metafont is used to design and create the appearance of the Javanese characters,
that is, the font. The Javanese font is written and saved with a text editor in a
Metafont file; the Metafont program then converts it into the TeX font files .gf and
.tfm. Metafont is a programming language for defining vector fonts. Metafont is also
a compiler that executes Metafont code, converting the source code (vector font) .mf
into a bitmap font with the .gf and .tfm extensions. JawaTEX.sty is a style file or
class file that contains macros



to define all the rules that will be used to write Javanese characters [13]. TeX
documents that are compiled to compose the text document become output that can be
viewed on a monitor screen. TeX Font Metrics (TFM) files are used by TeX to compile
the document. The dvi document (.dvi) is the compiled result. The generic font (.gf)
is compressed into a packed font (.pk) using the GftoPK program to obtain a smaller
size. The dvips program converts the dvi document into a print-ready document (ps
document). The result of this transliterator is a document that contains Javanese
characters, produced by complex rules using LaTeX and a Perl program.

Figure 1: The schema process of Latin to Javanese character transliteration with LaTeX

4    JawaTEX TESTING

Testing was done on the following hardware: an AMD Athlon XP 2500+ processor, 256 MB
of RAM, and a 40 GB hard disk. The software used was the GNU/Linux Debian 3.1 Sarge
operating system, Perl, LaTeX (e-TeX (Web2C 7.4.5) 3.14159-2.1), and Metafont
((Web2C 7.4.5) 2.718). The application was tested by giving it Latin text document
input which contains [15]:

 1. Sequences of strings with character combinations that can be written as
    syllables in Javanese characters.
 2. Sequences of Latin strings with character combinations that cannot be written as
    syllables in Javanese characters and that previously could not be transliterated
    to Javanese characters.

   There are two mechanisms for transliteration using JawaTEX: the user needs the
whole document to be transliterated, or the user needs only part of the document to
be transliterated. If the user needs the whole document to be transliterated, the
source text is written using a text editor and saved in .txt format [17]. The source
text is processed using Perl to produce the correct syllable split pattern according
to linguistic knowledge of writing Javanese script and to map it into LaTeX code. In
general, source text processing includes: file writing and reading, formatting,
roman number checking, spell checking to filter out wrong words, and LaTeX syntax
code writing. This process produces three files:

 1. A file in rev.txt format, containing the corrected text document after the spell
    checker process.
 2. A file in .jw format, containing the list of split Latin string patterns.
 3. A file in .tex format, containing the list of syntax code patterns.

   The text document split system is tested using a text-based file, as shown in
Figure 2.

Figure 2: File of document.txt

   This is an example of a Latin text that could not previously be transliterated
successfully, because it contains character combinations that cannot occur in
Javanese. Part of the processing of document.txt is shown in Figure 3.

Figure 3: The part of text document processing

   The corrected text document was saved by the system as document rev.txt [17]. The
system could then produce the Latin string split pattern, which it saved in the .jw
file shown in Figure 4.

Figure 4: File of document.jw

   The result of the testing showed that the syllable browsing model could form
split patterns in line with the existing linguistic knowledge. The transliterator
could produce the syntax code pattern, which the system saved in the .tex file shown
in Figure 5.

Figure 5: File of document.tex

   The .tex document is then processed by LaTeX, which calls the syntax code written
in the LaTeX style file JawaTEX.sty. The result is a .dvi file that can be converted
into .ps and .pdf files, as shown in Figure 6.

Figure 6: File of document.pdf

   No input character that is to be transliterated is deleted or removed. The Latin
string is transliterated just as it is, without transcription (substitution of the
writing to suit the pronunciation and the word).
   If the user needs only part of the document to be transliterated, the user writes
the LaTeX syntax code directly. The user must know how to split the syllables
according to the rules of writing Javanese script and must remember the JawaTEX
syntax code. Figure 7 is an example of how to write a document in which not all
parts will be transliterated. The user writes the source text using a text editor
and saves it in .tex format.
   The .tex file shown in Figure 7 is then ready to be compiled using the
instructions:
ema@debian:~/JawaTeX/$ latex double.tex
ema@debian:~/JawaTeX/$ dvips double.dvi
ema@debian:~/JawaTeX/$ ps2pdf



Figure 7: An example of how to write a document in which not all parts will be
transliterated

   This system could produce a file in .pdf format which contains the result of the
transliteration, as shown in Figure 8.

Figure 8: The part of text document processing

   The result of the testing shows that JawaTEX has the following capabilities:

 1. Able to find word similarity to correct spelling mistakes.
 2. Able to read, modify, and insert other characters into the input string in order
    to fulfill the writing format requirements.
 3. Able to handle the writing of Latin characters which have no equivalent in the
    Javanese alphabet.
 4. Able to handle the writing of diphthongs (multiple vowels).
 5. Able to handle the writing of the roman numbering system.
 6. Able to accommodate the period to avoid ambiguity, because a period can indicate
    the end of a sentence, a decimal, or an abbreviation, so it influences how that
    character builds the split Latin string pattern.
 7. Able to accommodate space to avoid ambiguity, because Javanese character writing
    does not recognize space to divide words.
 8. Able to accommodate the acute accent to avoid ambiguity.
 9. Able to handle the writing of more than three multiple consonant characters.
10. Able to handle sequences of Latin strings with character combinations that
    cannot be written as syllables in Javanese characters and that previously could
    not be transliterated to Javanese characters.

5    CONCLUSION

The experimental results showed that, by using The Context Free Recursive Descent
Parser algorithm, a text document in Latin writing can be processed so that syllable
split patterns are produced correctly. The syllable split patterns produced are
ready for the next step, which is the conversion of the syllable split patterns into
Javanese characters and the mapping of their writing scheme using Pattern Matching.
    The text document transliteration algorithm covers: production rules for
splitting the Latin string, which consist of the spell checker, roman number
checking, text formatting, and Latin string split pattern browsing; the list of
split Latin string pattern models; the list of syntax code pattern models; and the
production rules used to transliterate the split Latin string pattern to Javanese
characters. Every process has its own production rule. By building a complex set of
rules, a string sequence written in Latin characters could be processed so that
syllable split patterns were produced correctly. Ambiguity problems could be avoided
by handling the space, dash, and period characters, and the approach is expected to
handle complex cases as well. The concept of this transliterator model could improve
on the existing Latin string splitting machines made before.
    The JawaTEX program package contains two programs: one that checks and breaks
the Latin string to get the Latin string split pattern, and a LaTeX style for
writing the LaTeX syntax code.
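
As an illustration of one building block of the checking program, the Brute Force search described in Section 3 (the pattern is shifted one character to the right after every mismatch) can be sketched in a few lines. This is a minimal Python sketch, not the authors' Perl implementation; the function name and the sample strings are assumptions made for illustration only.

```python
def brute_force_search(text, pattern):
    """Return every index at which `pattern` occurs in `text`.

    The pattern is aligned at shift 0, 1, 2, ... and compared character
    by character; after a mismatch it moves one position to the right.
    """
    n, m = len(text), len(pattern)
    matches = []
    for shift in range(n - m + 1):
        j = 0
        # Compare pattern[j] with the text character under it.
        while j < m and text[shift + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched at this shift
            matches.append(shift)
    return matches
```

For example, `brute_force_search("aksara jawi aksara", "aksara")` returns `[0, 12]`. The search always finds every occurrence, at the cost of up to (n - m + 1) * m character comparisons, which is the trade-off between certainty and processing time noted in Section 2.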



JawaTEX can also be used without the checking and breaking program, but users must
then know how to obtain the tokens and how to write the LaTeX syntax code associated
with those tokens. The concept of checking and breaking the Latin string to get the
pattern, and the process of converting it into another character set as built in
this paper, can be used as a basis for development in other cases.

References

[1] Apostolico, A.; Galil, Z., Pattern Matching Algorithms. Oxford University Press,
    Oxford, UK, 1997.

[2] Black, P., Dictionary of Algorithms and Data Structures. National Institute of
    Standards and Technology. Online, 26 June 2008.

[3] French, J.; Powell, A.; Schulman, E., Applications of Approximate Word Matching.
    ACM ISSN 0-89791-970-x, 1997. Online, 26 June 2008.

[4] Glavy, J., Asian Fonts. Online, 10 September 2006.

[5] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science.
    Cambridge University Press, 1997.

[6] Guzma, V., Brute Force: Brute Force Algorithm design techniques. Institute of
    Software Systems/Tampere University of Technology, 2005. Online at
    tiraka/english/material2005/lecture3.pdf, 25 June 2008.

[7] Kumar, String Matching Algorithms. Online at
    ImportantAlgorithms/StringMatching(ESomach).pdf, 25 June 2008.

[8] Mohammad, A.; Saleh, O.; Abdeen, R., Occurrences Algorithm for String Searching
    Based on Brute-force Algorithm. Journal of Computer Science 2(1): 82-85, 2006.
    ISSN 1549-3636. Science Publications, 2006. Online, 25 June 2008.

[9] Nori, K.; Kumar, S., Foundations of Software Technology and Theoretical Computer
    Science. Springer, 1988.

[10] Sayoga, T., The Official Site of Aksara Jawa 2005. Online, 10 September 2006.

[11] Schildt, H., Artificial Intelligence Using C. Osborne-McGraw Hill, California,
    1987.

[12] Stephen, G., String Searching Algorithms. World Scientific.

[13] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., JawaTEX: Javanese
    Typesetting and Transliteration Text Document. Presented at Proceedings of
    International Conference on Advanced Computational Intelligence and Its
    Application (ICACIA) 2008, ISBN: 978-979-98352-5-3, 1 September 2008, pages
    149-153.

[14] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., Pemanfaatan
    Teknologi Informasi dan Komunikasi untuk Membangun Model Transliterasi Dokumen
    Teks Karakter Latin ke Aksara Jawa Melalui Komputasi Linguistik sebagai
    Alternatif Menarik dalam Melestarikan Kebudayaan Jawa. Presented at Proceedings
    of SNASTIA 2008, Surabaya University, ISSN: 1979-3960.

[15] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., Applying Natural
    Language Processing in Developing Split Pattern of Latin Character Text Document
    According to Linguistic Knowledge of Writing Javanese Script. Presented at
    Proceedings of International Graduate Conference on Engineering and Science
    (IGCES) 2008, ISSN: 1823-3287, 23-24 December 2008, D5.

[16] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., Applying of The
    Context Free Recursive Descent Parser and Pattern Matching Method in Developing
    Transliteration Pattern of Latin Character Text Document to Javanese Character.
    Presented at Proceedings of International Conference on Telecommunication
    (ICTEL) 2008, ISSN: 1858-2982, 20 August 2008, pages 108-113.

[17] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., Text Document Split
    Pattern Browsing Based on Linguistic Knowledge of Writing Javanese Script using
    Natural Language Processing. Presented at Proceedings of International
    Conference on Rural Information and Communication Technology 2009, ISBN:
    978-979-15509-4-9, 17-18 June 2009, pages 372-377.

[18] Utami, E.; Istiyanto, J.; Hartati, S.; Marsono; Ashari, A., Developing
    Transliteration Pattern of Latin Character Text Document Algorithm Based on
    Linguistics Knowledge of Writing Javanese Script. Proceeding of International
    Conference on Intrumentation, Communication, Information Technology and
    Biomedical Engineering (ICICI-BME) 2009, ISBN: 978-979-1344-67-8, IEEE:
    CFPO987H-CDR, 23-25 November 2009.

[19] Ute, A., String Matching. Proseminar "Algorithmen", Prof. Brandenburg,
    Sommersemester 2001. Online at www.infosun.fim.uni- ss01/Abel.pdf, 25 June 2008.


                                                                                                  ISSN 1947-5500
                                                           (IJCSIS) International Journal of Computer Science and Information Security,
                                                           Vol. 7, No. 3, March 2010

R. Sivaraman (1), RM. Chandrasekaran (2)
(1) Dy. Director, Center for Convergence of Technologies (CCT), Anna University Tiruchirappalli, Tiruchirappalli, Tamil Nadu, India
(2) Registrar, Anna University Tiruchirappalli, Tiruchirappalli, Tamil Nadu, India

Abstract
     Web Based Query Management System (WBQMS) is a methodology to design and implement mobile business, in which a server acts as the gateway connecting databases with clients that send requests and receive responses in a distributed manner. The gateway, which communicates with mobile phones via a GSM modem, receives the coded queries from users and sends packed results back. The client software, which communicates with the gateway system via short message, packs users' requests, IDs and codes, and sends the package to the gateway; it then interprets the packed data returned for the users to read on a GUI page. Whenever and wherever they are, customers can query information by sending messages through the client device, which may be a mobile phone or a PC. Mobile clients obtain the appropriate services through the mobile business architecture in a distributed environment. The messages are secured against intruders through a client-side encoding mechanism. The gateway system is programmed in Java, the client software in J2ME, and the database is created in Oracle for reliable and interoperable services.
Key words: Query, J2ME, Reliability, Database and Midlet

                     I.   INTRODUCTION

     Due to the growth of mobile networks and data management schemes, mobile business has drawn the attention of more customers in distributed environments. As a result, service providers must determine how to deliver compelling applications so that data services fulfill their potential.
     In order to provide mobile business applications to consumers, it is necessary to store and retrieve persistent information on the mobile device as well as to access remote information stored on a remote DBMS host from the mobile device. Storing information on the mobile device is necessary because a business application can stop suddenly for different reasons, such as an incoming phone call or the device running out of battery.
     As of now, most web based query management schemes are only at a preliminary stage, and the goal of this query system is to achieve on mobile phones what can already be achieved on a PC.
     With a short message based query, customers can retrieve data in a mobile business environment without wasting time. Because short messages are delivered reliably, consumers still receive the message once their phone is switched back on, even if it was turned off when the message was sent.
     The short message used in the web based query management system described in this paper is an interactive short message, which returns only the requested information.

                     II. RELATED WORK

     As of now, most existing web based business query management schemes are only at a preliminary stage. For example, a small number of short message queries are encoded in PDU (Protocol Data Unit) format, which increases the cost as the payload grows and leads to sending more than one message if the requested payload is large. The traditional PDU method is therefore not suitable for mobile business queries in a web based distributed environment.

          III. PROPOSED BUSINESS QUERY STRUCTURE

     The web based mobile business query structure proposed here differs from the traditional PDU format. It uses request and response short messages to achieve two-way delivery of information. A message of this business gateway system has a special data structure and is encoded in text mode based on AT commands, rather than in the PDU format of the traditional method. This greatly shortens the short message, since the message is coded into a standard format.
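Section III says that messages are sent in text mode using AT commands but does not list the commands. As a minimal sketch, assuming the standard GSM modem commands `AT+CMGF` (select message format) and `AT+CMGS` (send message), the strings a host would write to the modem's serial port could be built as follows. The function name and the example phone number are hypothetical, and no serial port is opened here.

```python
CTRL_Z = "\x1a"  # ends the message body that follows AT+CMGS


def text_mode_commands(destination, body):
    """Return, in order, the strings to write to the modem for one text-mode SMS.

    Assumption: the body uses the GSM 7-bit default alphabet, so a single
    short message holds at most 160 characters.
    """
    if len(body) > 160:
        raise ValueError("body does not fit into one short message")
    return [
        "AT+CMGF=1\r",                          # 1 = text mode, 0 = PDU mode
        'AT+CMGS="{}"\r'.format(destination),   # modem answers with a "> " prompt
        body + CTRL_Z,                          # message text, terminated by Ctrl-Z
    ]


if __name__ == "__main__":
    for command in text_mode_commands("+910000000000", "002REG000000001secret0000"):
        print(repr(command))
```

In text mode the query travels as plain characters, so the coded query string can be handed to the modem unchanged.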


     The structure of the mobile business querying system for academic education, which is composed of the mobile phone, the data source server (Business Data Management) and the gateway, is given below.

           Fig.1 Structure of mobile business query system

     The application software, which communicates with the gateway system via a short message connection, packs the query code, user ID and password and sends the encoded package to the gateway. The gateway receives the coded queries from customers, retrieves the data according to the query code and sends the packed results back. The client-side software then interprets the packed data for the users to read on a GUI page. The system can be extended to support GPRS, CDMA, Bluetooth and other communication protocols.
     The gateway must be set up as shown in Fig. 1 to establish a short message connection between mobile phones and the data management system host. The gateway is composed of the Communication Interface Module, the Processing Module and the Data Source Interface Module. This work takes educational queries as an example to analyze the implementation of a web based mobile business query system.

          IV. THE BUSINESS GATEWAY SYSTEM

     Mobile business query based on short message is achieved by the connection between the server, the short message service and the database management system.
     When the business application installed on the mobile phone is started, the user selects one business query from the available choices and sends the packed query information to the gateway system as a short message. The Communication Interface Module, which is connected serially to the Processing Module, receives the short message and passes it on. The Processing Module decodes the received business query message, extracts the user's query information (query code, user ID and password) and first checks whether the user is authorized to query the business information. It then converts the query information to SQL and queries the database according to the business query code. Finally, the results are encoded into a short message and sent back to the user's number. The application software on the mobile device displays the results as text. This sequence is called a complete query.

A.    Communication Interface Module

     The Communication Interface Module establishes a communication channel between mobile phones and the communication module via a short message connection. A database is created in the Communication Interface Module to store short message information such as the user's phone number, user ID, date/time and query code. To increase the reliability of the mobile business query system, the database has four tables: the InBox table, the OutBox table, the BackUp table and the FailureSend table.

1) InBox Table: The InBox table stores the business query messages received from the serial communication port before they are processed. The Processing Module constantly monitors the InBox table and fetches one message at a time whenever the table contains messages.

2) OutBox Table: The OutBox table stores the messages that are waiting to be sent to the corresponding mobile phone numbers. Once a message has been sent, it is cleared from the OutBox table and the next message in the table is sent.

3) BackUp Table: When the Processing Module has finished the query process for a short message, it clears the message from the InBox table and preserves it in the BackUp table for future use. All short messages in the BackUp table are removed every week.

4) FailureSend Table: A business application can stop suddenly at any point in time. The FailureSend table stores messages that could not be sent due to failures such as network failure, out-of-range devices or battery failure. After a particular period of time the Processing Module checks whether this table contains messages; if so, it fetches the message(s) and moves them back to the OutBox table for resending.
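The interplay of the four tables of the Communication Interface Module can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class, method and field names are hypothetical, plain Python containers stand in for the database tables, and the weekly clean-up of the BackUp table is left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    phone_number: str  # sender for InBox entries, receiver for OutBox entries
    text: str


@dataclass
class GatewayTables:
    """In-memory stand-ins for the InBox, OutBox, BackUp and FailureSend tables."""
    inbox: deque = field(default_factory=deque)
    outbox: deque = field(default_factory=deque)
    backup: list = field(default_factory=list)
    failure_send: deque = field(default_factory=deque)

    def receive(self, message):
        """Store an incoming query until the processing module picks it up."""
        self.inbox.append(message)

    def process_next(self, answer):
        """Take the oldest query, queue its reply and archive the query."""
        if not self.inbox:
            return False
        query = self.inbox.popleft()
        self.outbox.append(Message(query.phone_number, answer(query.text)))
        self.backup.append(query)
        return True

    def send_next(self, send):
        """Send the oldest reply; park it in FailureSend if sending fails."""
        if not self.outbox:
            return False
        reply = self.outbox.popleft()
        if not send(reply):
            self.failure_send.append(reply)
        return True

    def retry_failed(self):
        """Move every parked reply back to the OutBox for another attempt."""
        while self.failure_send:
            self.outbox.append(self.failure_send.popleft())
```

Here `answer` stands for the query processing and `send` for the modem call; both are passed in as functions so that a failed delivery can be simulated by a `send` that returns `False`.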


B. Back end (Web) Module

     The Processing Module provides the interface between the Communication Interface Module and the data management server, and ensures the complete implementation of the mobile business query function. Once the system is started, the web module of the business system monitors the InBox table and deals with one message at a time on a First Come First Serve (FCFS) basis. This module fetches the sender's number and writes it to the ReceiverMobileNo field of the OutBox table, then fetches the contents of the message, which include three data items for further processing: the query content, the student's ID number or teacher's number, and the password.

     First, the Web Module authenticates the user by checking the user ID and password extracted from the message against the short message database. It then determines which database should be queried, combines the query information (query content and additional elements), generates the SQL to query the database, and finally puts the result message into the OutBox table of the Communication Interface Module to wait for sending. The Processing Module uses a multi-thread technique to handle the queries.

     Since customers' request messages arrive at random times, a number of messages may arrive within a certain period of time. Moreover, the procedure of processing every customer's query message, from querying the database to sending messages through the GSM modem, should be safe and reliable. Consequently, a multi-thread technique is used throughout the program. The main thread creates a new sub-thread for each query process when there is a new message. The sub-thread processes the new message and is not released until the result is sent back to the user.

C. Data Management Module

     The data management server, which has been created using Oracle for the education department, includes the students' score table, students' credit table, students' exam schedule table, teachers' information table and teachers' schedule table. A business inquiry is carried out by connecting to the database server and to the table selected according to the query content extracted from the message.

    V.   THE BUSINESS SOFTWARE IMPLEMENTATION

     As the short messaging service is based on a store and forward mechanism, the transportation of messages is safe and reliable. The business query system designed in this paper combines the high-level and low-level user interfaces into a portable, extensible GUI component to account for different mobile terminals and functions.
     The application software of the mobile business is composed of four layers: the user interface, the record management system (RMS), the application software and the embedded OS.

                         User Interface
                 Record Management System (RMS)
                    J2ME Application Software
                          Embedded OS

           Fig.2 Layers of Business Software Implementation

A. Sending Module

     Two forms are created using J2ME technology and presented to customers through the Graphical User Interface (GUI) provided by the sending module. The first Form lets users select the query content from the list of available choices. When users have made their choice, they click the "OK" button to go to another Form, which has two TextFields for entering the user ID and password. The module then transforms the query content, user ID and password into a digital code. This greatly shortens the short message. The sending module starts a new thread when the user clicks the "send" button, so that the application does not become unresponsive due to network blocking, lack of coverage and similar problems.

     First, the short message connection is established through the method of message connection, in which the client mode and the appropriate port number are set. The host number of the destination address is the SIM card number of the GSM modem, and the destination address is specified in the sending module program.

B. Receiving Module

     Receiving the result message of a business query makes use of the Push mechanism. Since the introduction of MIDP 2.0, Midlet applications can start up asynchronously through a network connection.

     A Midlet application for receiving messages is registered in the Push Registry through a push event by means of static registration. When a new message arrives, the


application manager creates a new instance of the receiving Midlet with the new() method and invokes the startApp() method to activate it. The ListConnection() method is used to establish the message connection and the open() method is invoked to open and process the message. The connection is closed after the process is finished.

     As soon as the application starts, the receiving module monitors the designated port, starts a new thread to receive a message when one arrives, and displays it by translating the message into a text form the user can read.

C. Record Management System (RMS)

     Persistent storage of result messages is helpful for future use. The "javax.microedition.rms" package is based on Record Stores. A Record Store is the equivalent of a simple file and stores a set of records in binary format. This package allows developers to store and retrieve information to and from files on mobile devices. Two Record Stores are designed in this paper: CourseStore, which stores the code and name of courses, and ClassroomStore, which stores the code and name of teaching buildings. When curriculums and teaching buildings need to be displayed, the application can access them at any time.

            VI. SHORT MESSAGE SPECIFICATION

     The short message used for the web based business query system contains three parts: the code of the query content, the user ID and the password. The contents of the short message are listed in Table 1.

                             TABLE I
                     CONTENT OF SHORT MESSAGE

  Query Content          Code    User ID                        Password

  Student's score        001     Student's register number      -------
  Student's credit       002     (or) teacher's employee
  Teacher's quantity     003     number
  Teacher's class        004
  Exam schedule          005

     The 1st, 2nd and 3rd characters are the reference that decides which database to query. The 4th to 15th characters represent the student's number or teacher's number. The 18th to 27th characters are the user's password.

     When the query code is '002', for example, it means that the attendance credit of a student is to be queried.

     The objective of the paper is to reduce the length of the short message in this query system. To achieve this, each part is assigned a particular length of bits (digital codes). Since the message is encoded into a standard format, this reduces the length and cost of the short message based query system.

     The length of each part of the short message in this system is listed in Table 2.

                             TABLE II
                     LENGTH OF SHORT MESSAGE

  Code of Query          User ID          Password

  3 bits                 12 bits          10 bits

            VII. CONCLUSION & FUTURE SCOPE

     The application starts with a welcome page, shown for a time interval of 2000 milliseconds, before it loads the options form that provides a list of choices. The login form is then displayed. This login form requires users to enter their login name and password. Only valid users can pass through this form and proceed to the other forms. If the login name and password are not valid, an error message is displayed on the user's mobile device.

           Fig.3 Form I                            Fig. 4 Form II

           Fig. 5 Interface                        Fig.6 Result display

     The method to implement a mobile business query system based on request-response short messages was described and developed in this work. The system is software designed on the basis of typical practical applications, to be used by business providers and customers. Even when several business queries are available, a single query message for a particular query content is sent and the result is displayed correctly. The software was found to be reliable and practical. It is verified that the system is usable and easy to operate.

     In future work, more business query functions can be added to the prototype design, such as book queries, price queries, shopping queries and other business queries. When a large number of short messages are waiting for processing, a group of GSM modems can be used to increase the reliability of the system.
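To make the fixed-width message layout of Section VI concrete, the following sketch packs and unpacks a query string. It rests on several assumptions that the paper does not confirm: the three fields are simply concatenated with the widths given in Table II (3, 12 and 10, read here as characters), shorter values are padded with spaces on the right, and the function names are hypothetical. The character positions quoted in the text for the password differ from this contiguous layout, so the offsets should be adapted to the real format.

```python
# Field widths from Table II, interpreted as character counts (assumption).
FIELD_WIDTHS = (("query_code", 3), ("user_id", 12), ("password", 10))

# Query codes from Table I.
QUERY_CODES = {
    "001": "Student's score",
    "002": "Student's credit",
    "003": "Teacher's quantity",
    "004": "Teacher's class",
    "005": "Exam schedule",
}


def pack_query(query_code, user_id, password):
    """Concatenate the three fields into one fixed-width string."""
    values = {"query_code": query_code, "user_id": user_id, "password": password}
    parts = []
    for name, width in FIELD_WIDTHS:
        value = values[name]
        if len(value) > width:
            raise ValueError(f"{name} is longer than {width} characters")
        parts.append(value.ljust(width))  # assumption: pad with spaces
    return "".join(parts)


def unpack_query(message):
    """Split a fixed-width string back into its three fields."""
    if len(message) != sum(width for _, width in FIELD_WIDTHS):
        raise ValueError("unexpected message length")
    fields, offset = {}, 0
    for name, width in FIELD_WIDTHS:
        fields[name] = message[offset:offset + width].rstrip()
        offset += width
    if fields["query_code"] not in QUERY_CODES:
        raise ValueError("unknown query code")
    return fields
```

Because every field has a fixed position, the gateway can read the query code from the first three characters without any separators, which is what keeps the message short.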


[1]   Z.Wang, Z.Guo, and Y.Wang, “Design and Implementation of Short
      message query system for academic office based on J2ME,” in 2008
      ISECS International Colloquium on Computing, Communication,
      Control, and Management, 2008, paper 10.1109, p. 509.
[2]   Marc Alier, Pablo Casado, and M. José Casany, “J2MEMicroDB: a
      new Open Source lightweight Database Engine for J2ME Mobile
      Devices,” in International Conference on Multimedia and Ubiquitous
      Engineering 4 (MUE’07), 2007.
[3]   M.A. Mohammad, A. Norhayati, “A Short Message Service for
      Campus Wide Information Delivery,” in 4th National Conference on
      Telecommunication Technology Proceedings, Shah Alam, Malaysia,
      2003, p. 216.
[4]   R. Cooper, S. Manson, “Extracting temporal information from
      short messages, Data Management, Data Everywhere,” in 24th
      British National Conference on Databases, BNCOD 24, 2007, p. 224.
[5]   Wireless Messaging API (WMA) for Java™ 2 Micro Edition.
      [Online]. Available:
[6]   James Keogh. J2ME: The Complete Reference, 2nd ed. Berkeley,
      U.S.A: McGraw-Hill-Osborne, 2003.


Predictive Gain Estimation – A mathematical analysis
                                                  Sir Padampat Singhania University
                                                      Udaipur , Rajasthan , India
                                                Email_id :

Abstract- In case of realization of successful business, gain                 III. BEHAVIORAL THEORY OF GAIN ESTIMATION
analysis is essential. In this paper we have cited some new
techniques of gain expectation on the basis of neural property              For a particular business, yearwise gain is noted. It is
of perceptron. Support rule and Sequence mining based                       possible to rationalize this sequence of gain patterns on the
artificial intelligence oriented practices have also been done in
                                                                            basis of likelihood estimate , support rule and sequence
this context. In the view of above fuzzy and statistical based
gain sensing is also pointed out.                                           mining.

                     I. INTRODUCTION                                        A. Likelihood measure

Predictive gain estimation plays a pivotal role in forecast                 In case of gain prediction of a business, forecast based
based strategic business planning. Realization of gain                      expectation can be achieved after sensing original gains.
patterns is essential for proper gain utilization. the gain
analysis can be sensed using perceptron learning rule and                   Theorem1 : Maximum likelihood estimator of expected
other techniques. Prediction can also been performed on the                 gain , taking on threshold , depends on individual estimated
basis of fuzzy assumption theory on the factors on which                    gain.
gain of a business organization depends. Statistical
application of gain estimation can also be used as a tool in                Proof : Let g1 , g2,….,gnbe the gain of business organization
this context.                                                               noted after observation in respective years y1,y2,…yn. The
                                                                            business authority initially expects gain as ge .
II. GAIN ANALYSIS USING PERCEPTRON LEARNING                                 The joint density function of the gain patterns for that
                     RULE                                                   particular business is given by
                                                                            L = f(g1 , ge ) f(g2 , ge )...... f(gn , ge )where L is the
We assume that G= {g1 , g2,….,gn}be the set of gain of                      likelihood. We can write as follows:
business organization noted after observation in respective                  δf(g1,ge)/{f(g1,ge ).δge}+…δf(gn,ge)/{f(gn,ge ).δge} = 0…..(1)
years Y = {y1, y2, …, yn}. As per our assumption, gk ε G, where 1 < k < n.

The following steps must be done:

1. Obtain the difference in gain estimate as $\delta_i = | g - g_i |$ (i = 1 to n-1).

2. gk is optimum, i.e. Max(g1, g2, …, gn) = gk.

3. Compute $\delta_{AVG} = \frac{1}{n-1} \sum_{i=1}^{n-1} \delta_i$.

4. Predicted gain: $g_n = g_{n-1} + \delta_{AVG}$ (if $g_{n-1} < \delta_{AVG}$)
               or $g_n = g_{n-1} - \delta_{AVG}$ (if $g_{n-1} > \delta_{AVG}$).

5. Normalization: the parameters on which gain depends will be scaled
upwards/downwards by a factor of x, where x = (gn - gk).

Solution of ge from eq(1) reveals that expected gain depends on individual
estimated gain.

B. Gain analysis based on support rule

The factors on which the gain of a particular business depends are mainly
production, quality, market competition, risk involvement and cost; these
parameters are denoted as P, Q, M, R, C respectively. Let
PL = low production, PH = high production, QM = medium quality,
QB = best quality, ML = low market competition, MH = high market competition,
RL = less risk involvement, RH = high risk involvement, CL = low cost,
CH = high cost.

YEAR    GAIN    PARAMETER STATUS
y1      g1      PL  QB  MH  RL  CH
y2      g2      PH  QM  ML  RL  CL
y3      g3      PH  QB  MH  RH  CH
y4      g4      PL  QM  MH  RH  CH

Table 1 : Gain observation with parameter status
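The support rule applied to Table 1 amounts to counting the fraction of observed years in which each parameter status occurs. The following is a minimal sketch of that count; the names `TABLE_1` and `support_counts` are illustrative and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction

# Table 1: parameter status per year, in the order P, Q, M, R, C.
TABLE_1 = {
    "y1": ("PL", "QB", "MH", "RL", "CH"),
    "y2": ("PH", "QM", "ML", "RL", "CL"),
    "y3": ("PH", "QB", "MH", "RH", "CH"),
    "y4": ("PL", "QM", "MH", "RH", "CH"),
}


def support_counts(table):
    """Return, for each status label, the fraction of years in which it occurs."""
    counts = Counter(status for row in table.values() for status in row)
    return {status: Fraction(c, len(table)) for status, c in counts.items()}


if __name__ == "__main__":
    for status, support in sorted(support_counts(TABLE_1).items()):
        print(status, support)
```

Exact fractions are used so that the counts can be compared directly with values such as 1/2 or 3/4.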


As per support rule, support count is as follows:
PL = PH = 1/2
QM = QB = 1/2
MH = 3/4
ML = 1/4
RL = RH = 1/2
CH = 3/4
CL = 1/4

Optimum gain can be achieved if the gain condition is
ge = f(PH, QB, MH, RL, CL).

C. Gain analysis based on sequence mining

The sequence of gain patterns of a business organization depends on the
Boolean notation of its associated factors:
PL=0, PH=1, QM=0, QB=1, ML=0, MH=1, RL=0, RH=1, CL=0 and CH=1.
As per Table 1, the sequence table is as follows:

GAIN    P    Q    M    R    C
g1      0    1    1    0    1
g2      1    0    0    0    0
g3      1    1    1    1    1
g4      0    0    1    1    1

Table 2 : Sequence observation with parameter status

As per the sequence, it is quite obvious that the market competition is high
and hence there must be ample production with high cost and best quality.

   IV. CHAOS THEORY BASED GAIN ESTIMATION

Chaos theory can be applied in case of gain estimation. As per the theory, a
minute change in the input will affect the output a lot. Hence if a business
organization yields gain at a uniform rate and, due to some external factor,
a huge gain/loss is incurred, then the internal parameter status can be
inspected using chaos theory. The factors involved are production, quality,
market competition, risk involvement and cost. By crisp operation, the status
of each has two probable outputs, 0 or 1. The bit values expected are
X = {0,1} and the number of combinations is 2^5 = 32.

  V. FUZZY BASED OPTIMUM GAIN REALIZATION

If the gain estimates for a particular business are g1, g2, g3 and g4, then
the codes in terms of fuzzy value are as follows: g1 = 0.1, g2 = 0.2,
g3 = 0.3, g4 = 0.4. We assume that (i) the outcomes of the gains are based on
parameters P, Q, M, R, C as before; (ii) the parameter P will be high in the
cases of the gains g2, g3; Q will be on g1, g3; M on g1, g3, g4; R on g1, g2
and C on g2.

The desired optimum gain will be a fuzzy set whose membership functions are
such that the intersection rule is applied in case of R and C, while the
union rule is applied for the other parameters.

Statement    Fuzzy Value
g1           0.1
g2           0.2
g3           0.3
g4           0.4

Table 3: Fuzzy based value assignment

The above chart involves the gains and their respective fuzzy values and
sets. Since the gains are based on fuzzy operation, we can denote each gain
as a fuzzy set, and the results as per our assumption are shown below:

Fuzzy Set   Member Function   Value
G1~         P,Q,M,R,C         {(x1, 0.1), (x2, 0.6), (x3, 0.9), (x4, 0.2), (x5, 0.8)}
G2~         P,Q,M,R,C         {(x1, 0.7), (x2, 0.3), (x3, 0.5), (x4, 0.3), (x5, 0.2)}
G3~         P,Q,M,R,C         {(x1, 0.8), (x2, 0.8), (x3, 0.7), (x4, 0.7), (x5, 0.7)}
G4~         P,Q,M,R,C         {(x1, 0.2), (x2, 0.1), (x3, 0.8), (x4, 0.9), (x5, 0.8)}

Table 4: Fuzzy representation of respective gains

Now,
µ (G1~ U G2~ U G3~ U G4~) (x1) = max(0.1, 0.7, 0.8, 0.2) = 0.8;
µ (G1~ U G2~ U G3~ U G4~) (x2) = max(0.6, 0.3, 0.8, 0.1) = 0.8;
µ (G1~ U G2~ U G3~ U G4~) (x3) = max(0.9, 0.5, 0.7, 0.8) = 0.9;
µ (G1~ ∩ G2~ ∩ G3~ ∩ G4~) (x4) = min(0.2, 0.3, 0.7, 0.9) = 0.2;
µ (G1~ ∩ G2~ ∩ G3~ ∩ G4~) (x5) = min(0.8, 0.2, 0.7, 0.8) = 0.2.

Hence Gopt~ = {(x1, 0.8), (x2, 0.8), (x3, 0.9), (x4, 0.2), (x5, 0.2)}

PARAMETER    YEAR
P            y3
Q            y3
M            y1
R            y1
C            y2

Table 5: Corresponding year realization for optimum gain

In this case, we realize that the business flourished in year y1 by
maintaining the lowest risk in the most competitive market, had the lowest
cost involvement in y2, and achieved the highest production with the best
quality in y3.
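The union/intersection computation over Table 4 can be sketched as follows. This is an illustrative sketch only: it assumes that x1..x5 correspond to P, Q, M, R, C in that order and that the fuzzy set Gi~ belongs to year yi, which is how Table 5 reads; the names `G`, `PARAMETER` and `optimum_gain` are hypothetical.

```python
# Fuzzy sets from Table 4, keyed by year (assumption: Gi~ belongs to year yi).
G = {
    "y1": {"x1": 0.1, "x2": 0.6, "x3": 0.9, "x4": 0.2, "x5": 0.8},
    "y2": {"x1": 0.7, "x2": 0.3, "x3": 0.5, "x4": 0.3, "x5": 0.2},
    "y3": {"x1": 0.8, "x2": 0.8, "x3": 0.7, "x4": 0.7, "x5": 0.7},
    "y4": {"x1": 0.2, "x2": 0.1, "x3": 0.8, "x4": 0.9, "x5": 0.8},
}

# Assumed correspondence between the elements x1..x5 and the parameters.
PARAMETER = {"x1": "P", "x2": "Q", "x3": "M", "x4": "R", "x5": "C"}

# Intersection (min) is applied to R and C, union (max) to the rest.
INTERSECTION = {"R", "C"}


def optimum_gain(sets):
    """Return (Gopt membership values, year in which each value is attained)."""
    gopt, year_of = {}, {}
    for x, param in PARAMETER.items():
        pick = min if param in INTERSECTION else max
        year = pick(sets, key=lambda y: sets[y][x])
        gopt[x] = sets[year][x]
        year_of[param] = year
    return gopt, year_of


if __name__ == "__main__":
    gopt, year_of = optimum_gain(G)
    print(gopt)
    print(year_of)
```

Recording the year in which each extreme value is attained gives the parameter-to-year mapping of Table 5 in the same pass.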


VI. STATISTICAL APPROACH OF GAIN PREDICTION

We can analyze gain in time series form. In a business, the year of highest
gain is inspected for future use and prediction.

Theorem 2
If the gain (G) of a particular business changes over time (t) in an
exponential manner, then the value of the gain at the centre point of an
interval (a1, a2) is the geometric mean of its values at a1 and a2.

Proof: Let $G_a = m n^{a}$, m and n being constants.
Then $G_{a_1} = m n^{a_1}$ and $G_{a_2} = m n^{a_2}$.
Now, the value of G at $(a_1 + a_2)/2$
                      $= m n^{(a_1 + a_2)/2}$
                      $= [m^2 n^{a_1 + a_2}]^{1/2}$
                      $= [(m n^{a_1})(m n^{a_2})]^{1/2}$
                      $= (G_{a_1} G_{a_2})^{1/2}$

Theorem 3
If a variable m representing the gain of a business is related to another
variable n representing the year in the form m = an, where a is a constant,
then the harmonic mean of m is related to that of n by the same equation.

Proof: Let x be the number of observed gain values. Then
$m_{HM} = x / \sum_{i=1}^{x} (1/m_i)$
        $= x / \sum_{i=1}^{x} (1/(a n_i))$      [since $m_i = a n_i$]
        $= x / ((1/a) \sum_{i=1}^{x} (1/n_i))$
        $= a \, ( x / \sum_{i=1}^{x} (1/n_i) )$
        $= a \, n_{HM}$

Gain can also be estimated based on a probabilistic approach. Suppose G is
the set of gains g1, g2, g3, … for a particular business organization with
respective probabilities p1, p2, …, pm. When

$\sum_{i=1}^{m} p_i = 1$

then

$E(G) = \sum_{i=1}^{m} g_i p_i$ ……………………………..(2)

provided it is finite.

Here, we use bivariate probability based on G (g1, g2, g3, …, gm), i.e. the
set of achieved gains, and Q (q1, q2, q3, …, qn), i.e. the set of predictive
gains (1 < m < n).

Theorem 4
If the values of observed gain and predicted gain are two jointly distributed
random variables, then
    E(G + Q) = E(G) + E(Q).

Proof: G assumes values g1, g2, g3, …, gm and
       Q assumes values q1, q2, q3, …, qn, with

$P(G = g_i, Q = q_j) = p_{ij}$, i = 1 to m and j = 1 to n.

$E(G + Q) = \sum_i \sum_j (g_i + q_j) p_{ij}$
          $= \sum_i \sum_j g_i p_{ij} + \sum_i \sum_j q_j p_{ij}$
          $= \sum_i g_i \sum_j p_{ij} + \sum_j q_j \sum_i p_{ij}$
          $= \sum_i g_i P(G = g_i) + \sum_j q_j P(Q = q_j)$

E(G + Q) = E(G) + E(Q) ………………..........(3)

We can also predict gain on the basis of the autoregression property. Here,
from the next year onwards, future gain guessing will be done based on

$g_{m+1} = \phi_m g_m + \phi_{m-1} g_{m-1} + \phi_{m-2} g_{m-2} + \dots + \phi_1 g_1 + \varepsilon_{m+1}$ ......(4)

Here, $\varepsilon_{m+1}$ indicates a random error at time m+1. Each element
in the time series can be viewed as a combination of a random error and a
linear combination of previous values. Here $\phi_i$ are the autoregressive
parameters.

We can also predict based on the theory of moving average. In this case the
gain guessing strategy will be based on:

$g_{m+1} = a_{m+1} + \theta_m a_m + \theta_{m-1} a_{m-1} + \theta_{m-2} a_{m-2} + \dots + \theta_{m-q} a_{m-q}$ …(5)

where $a_i$ is a shock,
      $\theta$ is the estimate,
      q is the term indicating the last predicted value.

The values of the gain estimates for different years are not identical. In
some cases the values are close to one another, whereas in other cases they
deviate highly from one another. In order to get a proper idea about the
overall nature of a given set of values, it is necessary to know, besides
the average, the extent to which the gain estimates differ among themselves
or, equivalently, how they are scattered about the average. Let the values
g1, g2, g3, …, gm be the gain estimate values and c be the average of the
original values of gm+1, gm+2, …, gn.
The mean deviation of the gain estimates about c will be given by

$MD_c = \frac{1}{n-m} \sum_{i=1}^{n-m} | g_i - c |$ ………………….......(6)

In particular, when $c = \bar{g}$, the mean deviation about the mean will be
given by

$MD_{\bar{g}} = \frac{1}{n-m} \sum_{i=1}^{n-m} | g_i - \bar{g} |$ …………………….(7)
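Theorems 2 and 3 and the mean deviation formula lend themselves to a quick numerical check. The sketch below is illustrative only: the constants, the sample values and the function names (`exponential_gain`, `mean_deviation`) are assumptions chosen for the example, not values from the paper.

```python
import math
from statistics import harmonic_mean, mean


def exponential_gain(a, m=2.0, n=1.5):
    """Exponential gain model of Theorem 2: G_a = m * n**a (m, n arbitrary here)."""
    return m * n ** a


def mean_deviation(gains, c=None):
    """Mean absolute deviation of the gains about c (about their mean if c is None)."""
    if c is None:
        c = mean(gains)
    return sum(abs(g - c) for g in gains) / len(gains)


# Theorem 2: the gain at the midpoint equals the geometric mean of the end values.
a1, a2 = 1.0, 5.0
midpoint_gain = exponential_gain((a1 + a2) / 2)
geometric_mean = math.sqrt(exponential_gain(a1) * exponential_gain(a2))

# Theorem 3: if m_i = a * n_i, then HM(m) = a * HM(n).
a = 3.0
years = [1, 2, 4]
gains = [a * n for n in years]

if __name__ == "__main__":
    print(math.isclose(midpoint_gain, geometric_mean))
    print(math.isclose(harmonic_mean(gains), a * harmonic_mean(years)))
    print(mean_deviation([1, 2, 3, 6]))
```

The comparisons use `math.isclose` because the two sides of each identity are computed through different floating-point operations.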


                         VII. CONCLUSION

The paper deals with several gain estimates which play a crucial role in
strategic business planning, thereby yielding an optimum gain estimate. In
this context, gain analysis with neural, fuzzy and statistical justification
has been pointed out in this paper.

REFERENCES

[1] M. Spiegel, “Probability and Statistics”, Tata McGraw Hill, 2004.

[2] P. Chakrabarti et al., “A Mathematical realization of software analysis”,
to appear in International Journal of Computer and Electrical Engineering,
Vol. 2, No. 3, Jun 2010.

[3] P. Chakrabarti et al., “Approach towards realizing resource mining and
secured information transfer”, published in international journal IJCSNS,
Korea, Vol. 8, No. 7, July 2008.

[4] P. Chakrabarti et al., “Information representation and processing in the
light of neural-fuzzy analysis and distributed computing”, accepted for
publication in international journal AJIT (Asian Journal of Information
Technology), ISSN: 1682-3915, Article ID: 743-AJIT.

[5] P. Chakrabarti et al., “An Intelligent Scheme towards information
retrieval”, accepted for publication in international journal AJIT (Asian
Journal of Information Technology), ISSN: 1682-3915, Article ID: 706-

About author:

Dr. P. Chakrabarti (09/03/81) is currently serving as Associate Professor in
the Department of Computer Science and Engineering of Sir Padampat Singhania
University, Udaipur. Previously he worked at Bengal Institute of Technology
and Management, Oriental Institute of Science and Technology, Dr. B.C. Roy
Engineering College, Heritage Institute of Technology and Sammilani College.
He obtained his Ph.D (Engg) degree from Jadavpur University in Sep 2009, an
M.E. in Computer Science and Engineering in 2005, an Executive MBA in 2008
and a B.Tech in Computer Science and Engineering in 2003. He is a member of
Indian Science Congress Association, Calcutta Mathematical Society, Calcutta
Statistical Association, Indian Society for Technical Education, Computer
Society of India, VLSI Society of India, Cryptology Research Society of
India, IEEE (USA), IAENG (Hong Kong), CSTA (USA) and a Senior Member of
IACSIT (Singapore). He is a Reviewer of International Journal of Information
Processing and Management (Elsevier) and International Journal of Engineering
and Technology, Singapore. He has about 90 papers in national and
international journals and conferences to his credit. He has had several
visiting assignments at BHU Varanasi, IIT Kharagpur, Amity University,
Kolkata, et al.


      Lightweight Distance Bound Protocol for Low
                    Cost RFID Tags
                  Eslam Gamal Ahmed                      Eman Shaaban                         Mohamed Hashem
                                           Faculty of Computer and Information Science
                                                       Ain Shams University
                                                          Cairo, EGYPT

Abstract: Almost all existing RFID authentication schemes (tag/reader) are
vulnerable to relay attacks because of their inability to estimate the
distance to the tag. These attacks are very serious since they can be mounted
without either the reader or the tag noticing, and they cannot be prevented
by cryptographic protocols that operate at the application layer. Distance
bounding protocols represent a promising way to thwart relay attacks, by
measuring the round trip time of short authenticated messages. All the
existing distance bounding protocols use a random number generator and hash
functions at the tag side, which makes them inapplicable to low cost RFID
tags.
    This paper proposes a lightweight distance bound protocol for low cost
RFID tags. The proposed protocol is based on a modified version of the
Gossamer mutual authentication protocol. The implementation of the proposed
protocol meets the limited abilities of low cost RFID tags.

Keywords: RFID, Distance Bound, Mutual Authentication, Relay Attack,
Gossamer.

                    I.    INTRODUCTION
The Radio Frequency Identification (RFID) system is the latest technology
that plays an important role for object identification as a ubiquitous
infrastructure. RFID has many applications in access control, manufacturing
automation, maintenance, supply chain management, parking garage management,
automatic payment, tracking, and inventory.
      RFID tags and contactless smart cards are normally passive; they
operate without any internal battery and receive their power from the reader.
This offers a long lifetime but results in short read ranges and limited
processing power. They are also vulnerable to different attacks related to
location: distance fraud and relay attacks. Relay attacks occur when a valid
reader is tricked by an adversary into believing that it is communicating
with a valid tag, and vice versa. That is, the adversary performs a kind of
man-in-the-middle attack between the reader and the tag. It is difficult to
prevent these attacks since the adversary does not change any data between
the reader and the tag [3].
     There are three types of attacks related to the distance between the
reader and the tag. A dishonest tag may claim to be closer than it really is.
This attack is called the distance fraud attack. There are two types of relay
attacks: the mafia fraud attack and the terrorist fraud attack. In the mafia
fraud attack scenario, both the reader R and the tag T are honest, but a
malicious adversary performs a man-in-the-middle attack between the reader
and the tag by placing a fraudulent tag T’ and a fraudulent reader R’. The
fraudulent tag T’ interacts with the honest reader R and the fraudulent
reader R’ interacts with the honest tag T. T’ and R’ cooperate with each
other. This enables T’ to convince R that it is communicating with T, without
actually needing to know anything about the secret information. The terrorist
fraud attack is an extension of the mafia fraud attack. The tag T is not
honest and collaborates with the fraudulent tag T’. The dishonest tag T uses
T’ to convince the reader that it is close, while in fact it is not. T’ does
not know the long-term private or secret key of T. The problem with the mafia
fraud attack is that it can be mounted without either the reader or the tag
noticing.

                       Figure 1. Distance Fraud Attack

                  Figure 2. Mafia and terrorist fraud attack


    This attack is difficult to prevent since the adversary does not
change any data between the reader and the tag. Therefore the
mafia fraud attack cannot be prevented by cryptographic
protocols that operate at the application layer. Although one
could verify location through the use of GPS coordinates, small
resource-limited devices such as RFID tags do not lend
themselves to such applications.
     Distance bounding protocols are good solutions to
prevent such relay attacks [8, 9, 11]. These protocols measure
the signal strength or the round-trip time between the reader
and the tag. However, a proof based on measuring signal
strength is not secure, as an adversary can easily amplify the
signal strength as desired or use stronger signals to read from
afar. Therefore many works are devoted to devising efficient
distance bounding protocols based on measuring round-trip time.
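The reason round-trip time bounds distance is that no signal travels faster than light, so a measured round trip caps how far away the responder can be. The sketch below illustrates that relation; the function name, the `processing_s` parameter (the responder's processing delay, assumed to be known) and the example timing are assumptions for illustration, not part of any protocol described here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # propagation speed upper limit


def distance_upper_bound_m(round_trip_s, processing_s=0.0):
    """Upper bound on the reader-tag distance implied by a round-trip time.

    round_trip_s: measured time between sending a challenge and receiving
                  the response, in seconds.
    processing_s: assumed known processing delay at the responder, in seconds.
    """
    if round_trip_s < processing_s:
        raise ValueError("round-trip time cannot be shorter than processing delay")
    # The signal covers the distance twice (there and back).
    return SPEED_OF_LIGHT_M_PER_S * (round_trip_s - processing_s) / 2


if __name__ == "__main__":
    # A 20 ns round trip with no processing delay bounds the distance to about 3 m.
    print(distance_upper_bound_m(20e-9))
```

Because a relay can only add delay, it can make the tag appear farther away than it is but never closer, which is what makes timing a sounder basis than signal strength.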
    This paper is organized in the following way: Section II
presents the related work for distance bound protocols in low
cost RFID tags. Section III presents the proposed protocol.
Section IV presents the implementation of the proposed
protocol. Section V concludes the paper.

                   II.    RELATED WORK
Hancke and Kuhn proposed a distance bounding protocol (HKP) [9] that has been
chosen as a reference point because it is the most popular distance bounding
protocol in the RFID framework. As depicted in Figure 3, the protocol is
carried out as follows.

    After the exchange of random nonces (Na and Nb), the reader and the tag
compute two n-bit sequences, v0 and v1, using a pseudorandom function
(typically a MAC algorithm, a hash function, etc.). Then the reader sends a
random bit n times. Upon receiving a bit Ci, the tag sends back a bit Ri from
v0 if Ci equals 0. If Ci equals 1, it sends back a bit from v1.

    After n iterations, the reader checks the correctness of the Ri’s and
the propagation time. In each round, the probability that the adversary sends
a correct response is a priori ½. However, the adversary can query the tag in
advance with some arbitrary challenges C′i, between the time the nonces are
sent and the start of the rapid bit exchange. Doing so, the adversary obtains
n bits of the registers. For example, if the adversary queries with zeroes
only, he will get v0 entirely. In half of all cases, the adversary will have
guessed correctly, that is C′i = Ci, and will therefore have obtained in
advance the correct value Ri that is needed to satisfy the reader. In the
other half of all cases, the adversary can reply with a guessed bit, which
will be correct in half of all cases. Therefore, the adversary has a
probability of 3/4 of replying correctly in each round.

                     Figure 3. Hancke and Kuhn’s Protocol

    Munilla and Peinado [5] modified Hancke and Kuhn’s protocol by applying
“void challenges” in order to reduce the success probability of the
adversary. As shown in Figure 4, the challenges from the reader are divided
into two categories, full challenges and void challenges.

    After the exchange of random nonces (Na and Nb), the reader and the tag
compute a 3n-bit sequence, P||v0||v1, using a pseudorandom function. The
string P indicates the void challenges; that is, if Pi = 1 the reader sends a
random challenge and if Pi = 0 it does not. These void challenges allow the
tag to detect whether an adversary is trying to get the responses in advance.
When the tag detects an adversary it stops sending responses.

    The protocol ends with a message to verify that no adversary has been
detected. The adversary can choose between two main attack strategies: asking
the tag in advance, taking the risk that the tag uncovers him, or not asking
in advance and trying to guess the responses to the challenges when they
occur. The adversary’s success probability depends on pf, the probability of
the occurrence of a full challenge, and can be calculated as:

$P_{mp} = \begin{cases} (1 - p_f/2)^n, & \text{if } p_f \le 4/5 \text{ (without asking in advance)} \\ (p_f \cdot 3/4)^n, & \text{if } p_f > 4/5 \text{ (asking in advance)} \end{cases}$

    The success probability of the adversary is (5/8)^n if the string P is
random [5], which is less than (3/4)^n. Note that the final confirmation
message h(K, v0, v1) does not take any challenges Ci as an input, so it can
be pre-computed before the start of the fast bit exchange. On the other hand,
the disadvantage of their solution is that it requires three (physical)
states: 0, 1, and void, which may be difficult to implement. Furthermore, the
success probability of the adversary is higher than (1/2)^n.
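The per-round success probability of 3/4 for the pre-ask strategy against the Hancke and Kuhn protocol, and the piecewise expression for the Munilla and Peinado protocol, can be checked with a short script. This is a sketch under stated assumptions: the function names are illustrative, the first function simply enumerates all equally likely combinations of bits in one round, and the second evaluates the formula exactly as quoted in the text.

```python
from fractions import Fraction
from itertools import product


def hkp_round_success():
    """Exact per-round success probability of the pre-ask strategy against HKP.

    Enumerates the adversary's early challenge c_pre, the reader's real
    challenge c, the register bits v0 and v1, and the adversary's fallback
    guess, all taken as uniform and independent.
    """
    wins = total = 0
    for c_pre, c, v0, v1, guess in product((0, 1), repeat=5):
        learned = (v0, v1)[c_pre]   # bit obtained by querying the tag early
        expected = (v0, v1)[c]      # bit the reader expects in this round
        answer = learned if c_pre == c else guess
        wins += int(answer == expected)
        total += 1
    return Fraction(wins, total)


def hkp_success(n):
    """Success probability over n independent rounds."""
    return hkp_round_success() ** n


def mp_success(pf, n):
    """Piecewise success probability quoted for the void-challenge protocol.

    pf is the probability that a round carries a full challenge.
    """
    if pf <= Fraction(4, 5):
        base = 1 - pf / 2            # without asking in advance
    else:
        base = pf * Fraction(3, 4)   # asking in advance
    return base ** n


if __name__ == "__main__":
    print(hkp_round_success())
    print(mp_success(Fraction(4, 5), 1))
```

Using `Fraction` keeps the values exact, so the two branches of the piecewise formula can be compared at the crossover point pf = 4/5 without rounding error.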


           Figure 4. Munilla and Peinado Protocol

    To overcome the disadvantage of the Munilla and Peinado protocol, Chong
Hee Kim and Gildas Avoine [3] present a modification using mixed challenges:
the challenges from the reader to the tag in the fast bit exchanges are
divided into two categories, random challenges and predefined challenges. The
former are random bits from the reader and the latter are predefined bits
known to both the reader and the tag in advance.

Figure 5. Distance bounding protocol using mixed challenges

    As shown in Figure 5, the reader and the tag compute a 4n-bit sequence
T||D||v0||v1 after the exchange of random nonces (Na and Nb). The string T
indicates random challenges: if Ti = 1 the reader sends a random bit
Si ∈ {0, 1}, and if Ti = 0 it sends a predefined bit Di to the tag.
    From the adversary’s point of view, all Ci’s from the reader look
random. Therefore he cannot distinguish random challenges from predefined
challenges. However, thanks to these predefined challenges, the tag is able
to detect an adversary sending random challenges in order to get responses in
advance. Upon reception of a challenge Ci from the reader, either Ti = 1
(random challenge), in which case the tag sends out the i-th bit of v0 if
Ci = 0 and of v1 if Ci = 1; or Ti = 0 (predefined challenge), in which case
the tag sends out the bit v0i if Ci = Di and a random bit if Ci ≠ Di (it
detects an error). From the moment the tag detects an error, it replies with
a random value to all subsequent challenges. By doing this, both the reader
and the tag fight the adversary.
    In this protocol no confirmation message is used after the end of the
fast bit exchanges, which improves efficiency in terms of computation and
communication compared to Munilla and Peinado [5]. This protocol with
(binary) mixed challenges provides the best performance among all the
existing distance bounding protocols with binary challenges that do not use a
final signature. Indeed, while all previous protocols provide an adversary
success probability equal to (3/4)^n, the probability quickly converges
toward (1/2)^n in this case.

                         III. PROPOSED SOLUTION
The main issue in implementing previous distance bound protocols on low cost
RFID tags, which have only 4K security gates according to the EPC Class1
Generation2 standard [6], is that they require random number generation and
hash functions, which exceed the capabilities of low cost RFID tags.
     The modified Gossamer protocol [1] can be combined with the distance
bound protocol using mixed challenges to make the distance bound protocol
applicable to low cost RFID tags, by removing random number generation and
the hash function from the tag side so that they are performed only on the
reader side, as in Figure 6.

Figure 6. Modified Gossamer and distance bound using mixed challenges
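The tag-side rule of the mixed-challenge protocol of Kim and Avoine described in Section II can be sketched as a small state machine. This is an illustrative sketch, not the authors' implementation: the class and attribute names are hypothetical, the bit strings are passed in directly rather than derived from nonces, and the random reply uses Python's `secrets` module, which a low cost tag would not have.

```python
import secrets
from dataclasses import dataclass
from typing import List


@dataclass
class MixedChallengeTag:
    T: List[int]    # T[i] = 1: round i uses a random challenge, 0: predefined
    D: List[int]    # predefined challenge bits, known to reader and tag
    v0: List[int]   # response register selected by challenge bit 0
    v1: List[int]   # response register selected by challenge bit 1
    error_detected: bool = False

    def respond(self, i: int, c: int) -> int:
        """Return the tag's response to challenge bit c in round i."""
        if self.error_detected:
            # After an error, every later challenge gets a random reply.
            return secrets.randbits(1)
        if self.T[i] == 1:
            # Random challenge: answer from v0 or v1 depending on c.
            return (self.v0, self.v1)[c][i]
        if c == self.D[i]:
            # Predefined challenge received as expected.
            return self.v0[i]
        # Predefined challenge mismatch: an adversary is probing in advance.
        self.error_detected = True
        return secrets.randbits(1)


if __name__ == "__main__":
    tag = MixedChallengeTag(T=[1, 0, 0], D=[0, 1, 0], v0=[1, 0, 1], v1=[0, 1, 0])
    print(tag.respond(0, 1), tag.respond(1, 1), tag.error_detected)
```

Keeping `error_detected` as persistent state captures the point that one wrong predefined challenge poisons all later responses in the same run.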


    In this protocol the reader and the tag use the random numbers
n1, n2, n3 and n1’ instead of T||D||v0||v1, so there is no need for
random number generation or hash functions on the tag side.
In terms of storage, the tag needs 6*96 = 576 bits of RAM for the
old and new values of IDS, K1 and K2, and 96 bits of ROM.
This is feasible for a low cost RFID tag, which has 1K
memory bits according to the EPC Class 1 Generation 2 standard.

    The distance bound protocol using mixed challenges is
implemented in the proposed protocol as shown in Figure 7. The {T}
sequence is replaced with nonce n1, the {D} sequence with
nonce n2, the {v0} sequence with nonce n3 and the {v1} sequence
with nonce n1’. The proposed solution therefore prevents relay
attacks and supports mutual authentication on low cost RFID tags.

  Figure.7 Distance bound protocol using mixed challenges
            implemented in modified gossamer

The proposed protocol prevents many RFID attacks [2], as
shown in Table 1.

Layer               Attack Type                  Attack                            Status
Physical            Permanently Disabling Tags   Tag Removal                       X
                                                 Tag Destruction                   X
                                                 KILL Command                      X
                    Temporarily Disabling Tags   Passive Interference              X
                                                 Active Jamming                    X
                                                 Relay Attacks                     √
Network -           Attacks on the Tags          Cloning                           X
                                                 Spoofing                          √
                    Attacks on the Reader        Impersonation                     √
                                                 Eavesdropping                     √
                                                 Network Protocol Attacks          X
Application                                      Unauthorized Tag
                                                 Tag Modification                  √
                    Middleware Attacks           Buffer Overflows                  X
                                                 Malicious Code Injection
Strategic                                        Competitive Espionage             Partially
                                                 Social Engineering                X
                                                 Privacy Threats                   √
                                                 Targeted Security
Multilayer Attacks                               Covert Channels                   X
                                                 Denial of Service Attacks [3,6]
                                                 Traffic Analysis                  √
                                                 Crypto Attacks                    √
                                                 Side Channel Attacks              X
                                                 Replay Attacks                    √

              Table 1. Prevented attacks in proposed protocol
   X: Not protected      Partially: Partially protected      √: Protected

                IV.     IMPLEMENTATION
As shown in Figure 8, the proposed protocol has 11 states.
The tag starts in the begin state (S0) and waits for the reader
interrogation signal (S) and for the receive-A signal (R_A).
On receiving S and R_A, the tag moves to the Recv_A state
(S1) and waits until it has received A (96 bits). The tag proceeds
in the same way through states Recv_B (S2) and Recv_C (S3). In the
Calc_T_abc state (S4) the tag calculates A||B||C and
compares C with C’; if C = C’ the tag moves to the
Calc_T_d state (S5), otherwise it returns to S1.
    After calculating D, the tag waits for R_Ci to
begin the distance bound protocol; on receiving R_Ci the tag
moves to the Recv_Ci state (S6) and then to Calc_Ri (S7). If no
error is detected and the distance bits are not finished (N’), the tag
returns to S6. If no error is detected at S7 and n bits have been
received (N), the tag moves to the Tag_update state (S8). If an error
is detected in the distance stage, the tag moves to Gen_rnd (S9) to send
random bits and then moves to the final state (S10).

    The proposed protocol is implemented in VHDL using Xilinx 9.2i
on a VirtexII FPGA (XC2VP20) device [6]. The
implementation consists of two parallel processes, one for the
controller and one for the data path. The controller process
contains all control signals and gates, while the data path process
contains the data registers and ALU gates that implement the required
operations on these registers. Figure 9 shows the general block
diagram of the hardware implementation of the proposed protocol,
while Figure 10 shows the RTL schematic generated from

            Figure.8 State diagram of proposed protocol
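
The state diagram described in this section can be sketched as a small transition table in Python. This is a reading aid under stated assumptions, not the VHDL controller itself: the event names (`S_and_R_A`, `A_done`, `ok_more_bits`, and so on) are hypothetical labels for the conditions named in the text, the receipt of B and C is collapsed into single events, and the text does not say where the tag goes after Tag_update (S8), so the sketch leaves S8 without an outgoing transition.

```python
from enum import Enum

class State(Enum):
    BEGIN = 0        # S0
    RECV_A = 1       # S1
    RECV_B = 2       # S2
    RECV_C = 3       # S3
    CALC_T_ABC = 4   # S4
    CALC_T_D = 5     # S5
    RECV_CI = 6      # S6
    CALC_RI = 7      # S7
    TAG_UPDATE = 8   # S8
    GEN_RND = 9      # S9
    FINAL = 10       # S10

# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.BEGIN, "S_and_R_A"): State.RECV_A,
    (State.RECV_A, "A_done"): State.RECV_B,
    (State.RECV_B, "B_done"): State.RECV_C,
    (State.RECV_C, "C_done"): State.CALC_T_ABC,
    (State.CALC_T_ABC, "C_ok"): State.CALC_T_D,        # C = C'
    (State.CALC_T_ABC, "C_mismatch"): State.RECV_A,    # back to S1
    (State.CALC_T_D, "R_Ci"): State.RECV_CI,
    (State.RECV_CI, "Ci_done"): State.CALC_RI,
    (State.CALC_RI, "ok_more_bits"): State.RECV_CI,    # N': rounds remain
    (State.CALC_RI, "ok_all_bits"): State.TAG_UPDATE,  # N: n bits received
    (State.CALC_RI, "error"): State.GEN_RND,
    (State.GEN_RND, "random_sent"): State.FINAL,
}

def step(state, event):
    """Return the next state; an unlisted event leaves the state unchanged."""
    return TRANSITIONS.get((state, event), state)

def run(events, state=State.BEGIN):
    """Feed a sequence of events to the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, the events `S_and_R_A, A_done, B_done, C_done, C_ok, R_Ci` followed by two error-free rounds (`Ci_done, ok_more_bits, Ci_done, ok_all_bits`) end in `State.TAG_UPDATE`, whereas an `error` in any round leads through `GEN_RND` to `FINAL`.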


The implementation results from the synthesis report generated
by Xilinx, given in Table 2, show that the proposed
protocol meets the constraints of low cost RFID tags according to the
EPC Class 1 Generation 2 standard.

  # of 4-bit   # of Slices   # of Flip flops   ROM      RAM       Frequency
  860          510           886               96 bit   576 bit   487 MHz

                Table 2. Implementation results

  Figure.9 Hardware implementation of proposed protocol

         Figure.10 RTL schematic for proposed protocol

The total is 10,100 equivalent FPGA gates, which corresponds to about 1684
ASIC gates, since 1 ASIC gate ≈ 5-7 FPGA gates (10,100 / 6, rounded up). From
the implementation results, the proposed solution can be
implemented in low cost RFID tags.

                         V.      CONCLUSION

    This paper has proposed a distance bound protocol for low
cost RFID tags. The proposed protocol is based on the modified
Gossamer mutual authentication protocol and prevents relay
attacks. It does not need random number
generation or a hash function on the tag side.
    In order to verify the proposed protocol, we also
designed and implemented it in VHDL at the behavioral level
using Xilinx 9.2i. The system operates at a clock
frequency of 487 MHz on a VirtexII FPGA device. From the
implementation results, we have shown that our scheme is a
well-designed strong protocol which satisfies various security
requirements in low cost RFID tags.

                           REFERENCES

[1] Eslam Gamal Ahmed, Eman Shaaban, Mohamed Hashem,
“Lightweight Mutual Authentication Protocol for Low cost RFID
Tags”, International Journal of Network Security & Its Application
(IJNSA), Academy & Industry Research Collaboration Center
(AIRCC), April 2010, Vol.2, No.2.
[2] Aikaterini Mitrokotsa, Melanie R. Rieback and
Andrew S. Tanenbaum, “Classifying RFID attacks and defenses”, Inf
Syst Front, Springer, Published online 29 July 2009, DOI
[3] Chong Hee Kim and Gildas Avoine, “RFID Distance Bounding
Protocol with Mixed Challenges to Prevent Relay Attacks”,
CANS’09, December 2009, former version: IACR ePrint at
[4] Hung-Yu Chien, “DOS Attacks on Varying Pseudonyms-Based
RFID Authentication Protocols”, 2008 IEEE Asia-Pacific Services
Computing Conference, pp. 616-622.
[5] J. Munilla and A. Peinado. Distance bounding protocols for
RFID enhanced by using void-challenges and analysis in noisy
channels. Wireless communications and mobile computing.
Published online: Jan 17 2008.
[6] Pedro Peris López, Dr. D. Julio C. Hernández Castro and Dr. D.
Arturo Ribagorda Garnacho, “Lightweight Cryptography in Radio
Frequency Identification (RFID) Systems”, Ph.D. THESIS,
UNIVERSIDAD CARLOS III DE MADRID, Computer Science
Department, Leganés, October 2008.
[7] Hong Lei, Tianjie Cao, “RFID Protocol enabling Ownership
Transfer to protect against Traceability and DoS attacks”, First
International Symposium on Data, Privacy and E-Commerce, DOI
10.1109/ISDPE.2007.29, pp. 508-510.
[8] L. Bussard and W. Bagga. Distance-bounding proof of
knowledge to avoid real-time attacks. In IFIP/SEC, 2005.
[9] G. Hancke and M. Kuhn. An RFID distance bounding
protocol. In the 1st International Conference on Security and
Privacy for Emerging Areas in Communications Networks
(SECURECOMM'05), pages 67-73. IEEE Computer Society,


[10] Volnei A. Pedroni “Circuit Design with VHDL”. MIT Press
Cambridge, Massachusetts, London, England, ISBN 0-262-16224-5,
[11] S. Brands and D. Chaum. Distance-Bounding Protocols. In
Advances in Cryptology - EUROCRYPT 93, volume 765 of Lecture
Notes in Computer Science, pages 344-359. Springer, 1994.

                   AUTHORS PROFILE

Eslam Gamal Ahmed: Teaching assistant in the computer systems
department, faculty of computer and information sciences, Ain Shams
University. Graduated in June 2006 with a grade of excellent with honors.
Currently working on a master's degree titled “Developing Lightweight
Cryptography for RFID”. Fields of interest are computer and networks
security, RFID systems and computer

Dr. Eman Shaaban: Lecturer in the computer systems department, faculty
of computer and information sciences, Ain Shams University. Fields of
interest are RFID systems, embedded systems, computer and networks
security, logic design and computer architecture.

Prof. Mohamed Hashem: Head of the information systems department,
faculty of computer and information sciences, Ain Shams University.
Fields of interest are computer networks, ad-hoc and wireless networks,
QoS routing of wired and wireless networks, modeling and simulation of
computer networks, VANET, and computer and network security.


      Analysis of Empirical Software Effort Estimation

                      Saleem Basha
             Department of Computer Science
                 Pondicherry University
                    Puducherry, India

                     Dhavachelvan P
             Department of Computer Science
                 Pondicherry University
                    Puducherry, India

Abstract – Reliable effort estimation remains an ongoing
challenge to software engineers. Accurate effort estimation is
central to the state of the art in software engineering, and effort
estimation is the preliminary phase between the client and the business
enterprise: their relationship begins with the estimation of the
software, and the credibility between the client and the business
enterprise increases with accurate estimation. Effort estimation often
requires generalizing from a small number of historical projects.
Generalization from such limited experience is an inherently
under-constrained problem. Accurate estimation is a complex
process because it amounts to software effort prediction and,
as the term indicates, a prediction never becomes an actual. This
work follows the basics of the empirical software effort
estimation models. The goal of this paper is to study empirical
software effort estimation. The primary conclusion is that no
single technique is best for all situations, and that a careful
comparison of the results of several approaches is most likely to
produce realistic estimates.

Keywords-Software Estimation Models, Conte’s Criteria,
Wilcoxon Signed-Rank Test.

                       I.    INTRODUCTION
    Software effort estimation is one of the most critical and
complex, yet unavoidable, activities in software development
processes. Over the last three decades, a growing trend has
been observed in the use of a variety of software effort estimation
models in diversified software development processes. Along
with this growth, the importance of these models in estimating
software development costs and preparing schedules more quickly
and easily in the anticipated environments has also been recognized.
Although a great amount of research time and money has been devoted
to improving the accuracy of the various estimation models, the
inherent uncertainty in software development projects (complex
and dynamic interaction factors, intrinsic software complexity,
pressure on standardization and lack of software data) makes it
unrealistic to expect very accurate effort estimation of software
development processes [1]. Although there is no proof that
software cost estimation models perform consistently
within 25% of the actual cost 75% of the time
[30], the available cost estimation models still support
their intended activities to the extent possible. The
accuracy of the individual models decides their applicability in
the projected environments, whereas the accuracy can be
defined based on understanding the calibration of the software
data. Since the precision and reliability of effort estimation
are very important for the competitiveness of software
companies, enterprises and researchers have put
considerable effort into developing models that estimate
effort as accurately as possible. Many estimation
models have been proposed, and they can be categorized by
their basic formulation schemes: estimation by expert [5],
analogy based estimation schemes [6], algorithmic methods
including empirical methods [7], rule induction methods [8],
artificial neural network based approaches [9] [17] [18],
Bayesian network approaches [19], decision tree based
methods [21] and fuzzy logic based estimation schemes [10]
[20].
    Among these diversified models, empirical estimation
models are found to be possibly more accurate than other
estimation schemes, and the COCOMO, SLIM, SEER-SEM and FP
analysis schemes are popular in practice in the empirical
category [24] [25]. In empirical estimation models, the
estimation parameters are commonly derived from empirical
data that are usually collected from various sources of
historical or past projects. Accurate effort and cost
estimation of software applications continues to be a critical
issue for software project managers [23]. There have been many
introductions, modifications and updates of empirical
estimation models. A common modification among most of the
models is to increase the number of input parameters and to
assign appropriate values to them. Some models have
been loaded with a larger number of input and output features,
which increases the complexity of the estimation schemes,
yet the accuracy of these models has shown
little improvement. Although the models are diversified, they do
not generalize well to all types of environments [13]. Hence
there is no silver bullet estimation scheme for different
environments, and the available models are environment
specific.

              II.   COCOMO ESTIMATION MODEL

A. COCOMO 81
    COCOMO 81 (Constructive Cost Model) is an empirical
estimation scheme proposed in 1981 [29] as a model for
estimating effort, cost, and schedule for software projects. It

was derived from a large data set of 63 software projects                variables, which is highly dependent on development; the
ranging in size from 2,000 to 100,000 lines of code, and                 uncertainty at the input level of COCOMO yields
programming languages ranging from assembly to PL/I. These               uncertainty at the output, which leads to gross estimation error
data were analyzed to discover a set of formulae that were the           in the effort estimation [33]. Despite these drawbacks,
best fit to the observations. These formulae link the size of the        COCOMO II models remain influential in effort
system and Effort Multipliers (EM) to find the effort to develop         estimation activities due to their better accuracy compared to
a software system. In COCOMO 81, effort is expressed as                  other estimation schemes.
Person Months (PM) and it can be calculated as
                                                                                      III.   SEER-SEM ESTIMATION MODEL
    $PM = a \times Size^{b} \times \prod_{i=1}^{15} EM_i$
                                                                             SEER (System Evaluation and Estimation of Resources) is
                                                                         a proprietary model owned by Galorath Associates, Inc. In
                                                                         1988, Galorath Incorporated began work on the initial version
    where                                                                of SEER-SEM which resulted in an initial solution of 22,000
    “a” and “b” are the domain constants of the model, which             lines of code. SEER (SEER-SEM) is an algorithmic project
contains 15 effort multipliers. This estimation scheme draws on          management software application designed specifically to
the experience and data of past projects, which makes it                 estimate, plan and monitor the effort and resources required for
complex to understand and apply.                                         any type of software development and/or maintenance project.
                                                                         SEER, which comes from the noun, referring to one having the
    Cost drivers have a rating level that expresses the impact of        ability to foresee the future, relies on parametric algorithms,
the driver on development effort, PM. These ratings can range            knowledge bases, simulation-based probability, and historical
from Extra Low to Extra High. For the purpose of quantitative            precedents to allow project managers, engineers, and cost
analysis, each rating level of each cost driver has a weight             analysts to accurately estimate a project's cost, schedule, risk
associated with it. The weight is called Effort Multiplier. The          and effort before the project is started. Galorath chose
average EM assigned to a cost driver is 1.0 and the rating level         Windows due to the ability to provide a more graphical user
associated with that weight is called Nominal.                           environment, allowing more robust management tradeoffs and
                                                                         understanding of what drives software projects.[4]
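
As a rough illustration of the COCOMO 81 effort equation, the following Python sketch multiplies the size term by the 15 effort multipliers. The function name `cocomo81_effort` and the numeric values in the usage note are assumptions chosen for the example; they are not calibrated constants from the model, and the size is taken to be in thousands of lines of code, which the text does not state explicitly.

```python
from math import prod

def cocomo81_effort(size, a, b, effort_multipliers):
    """Effort in person-months: PM = a * Size^b * product of EM_i.

    `size` is assumed to be in thousands of lines of code.
    `a` and `b` are the domain constants of the model.
    `effort_multipliers` holds the 15 EM values; 1.0 means Nominal.
    """
    if len(effort_multipliers) != 15:
        raise ValueError("COCOMO 81 uses 15 effort multipliers")
    return a * size ** b * prod(effort_multipliers)
```

With every cost driver rated Nominal (all multipliers 1.0) the product is 1, so the estimate reduces to a * Size^b; for instance, placeholder values a = 3.0 and b = 1.0 with a size of 10 give 30 person-months. Raising one multiplier to 1.2 scales that estimate by 1.2.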
                                                                            This model is based upon the initial work of Dr. Randall
                                                                         Jensen. The mathematical equations used in SEER are not
     In 1997, an enhanced scheme for estimating the effort of            available to the public, but the writings of Dr. Jensen make the
software development activities was introduced, called COCOMO            basic equations available for review. The basic equation, which
II. In COCOMO II, the effort requirement can be calculated as            Dr. Jensen calls the "software equation", is:

                                                                             $S_e = C_{te} (K t_d)^{0.5}$
    $PM = a \times Size^{E} \times \prod_{i=1}^{17} EM_i$
                                                                             $S_e$ is the effective lines of code, $C_{te}$ is the effective
    where $E = B + 0.01 \times \sum_{j=1}^{5} SF_j$                      developer technology constant, $K$ is the total life cycle cost in
                                                                         man-years, and $t_d$ is the development time in years.
    COCOMO II is associated with 31 factors: the LOC measure as              This equation relates the effective size of the system and the
the estimation variable, 17 cost drivers, 5 scale factors, 3             technology being applied by the developer to the
adaptation percentages of modification, 3 adaptation cost drivers        implementation of the system. The technology factor is used to
and requirements & volatility. Cost drivers are used to capture          calibrate the model to a particular environment. This factor
characteristics of the software development that affect the              considers two aspects of the production technology -- technical
effort to complete the project.                                          and environmental. The technical aspects include those dealing
    COCOMO II used 31 parameters to predict effort and time              with the basic development capability: organization
[11] [12] and this larger number of parameters resulted in               capabilities, experience of the developers, development
strong co-linearity and highly variable prediction                       practices and tools etc. The environmental aspects address the
accuracy. Besides these meritorious claims, COCOMO II                    specific software target environment: CPU time constraints,
estimation schemes have some disadvantages. The                          system reliability, real-time operation, etc.
underlying concepts and ideas are not publicly defined and the               The SEER-SEM developers have taken the approach to
model has been provided as a black box to the users [26]. This           include over 30 input parameters, including the ability to run
model uses LOC (Lines of Code) as one of the estimation                  Monte Carlo simulation to compensate for risk [2].
variables, whereas Fenton et al. [27] explored the shortfalls of         Development modes covered include object oriented, reuse,
the LOC measure as an estimation variable. COCOMO                        COTS, spiral, waterfall, prototype and incremental
also uses FP (Function Point) as one of the estimation

development. Languages covered are 3rd and 4th generation                                                         • Outputs and Interfaces: many capability metrics, plus
languages (C++, FORTRAN, COBOL, Ada, etc.), as well as                                                              hundreds of reports and charts; trade-off analyses with
application generators. It allows staff capability, required                                                        side-by-side comparison of alternatives; integration
design and process standards, and levels of acceptable                                                              with other Windows applications plus user
development risk to be input as constraints [15]. Figure 1 is                                                       customizable interfaces.
adapted from a Galorath illustration and shows gross categories
of model inputs and outputs, but each of these represents                             Aside from SEER-SEM, Galorath, Inc. offers a suite of
dozens of specific input and output possibilities and                              many tools addressing hardware as well as software concerns.
parameters.                                                                        One of particular interest to software estimators might be
                                                                                   SEER-SEM, a tool designed to perform sizing of software
[Figure 1 diagram. Input parameters: Environment, Complexity,                      projects.
Constraints. Output parameters: Effort, Cost, Schedule,                                The study done by Thibodeau in 1981 and a study done by
Maintenance, Reliability.]                                                         IIT Research Institute (IITRI) in 1989 state that they
                                                                                   calibrated the SEER-SEM model using three databases. The
                                                                                   significance of these studies is as follows:
                                                                                       1. Results greatly improved with calibration, in fact, as high
                                                                                   as a factor of five.
                                                                                       2. Models consistently obtained better results when used
                                                                                   with certain types of applications.
                      Figure 1. SEER-SEM I/O Parameters                                The IITRI study was significant because it analyzed the
                                                                                   results of seven cost models (PRICE-S, two variants of
Features of the model include the following:                                       COCOMO, System-3, SPQR/20, SASET, SoftCost-Ada) to
                                                                                   eight Ada specific programs. Ada was specifically designed for
         Allows probability level of estimates, staffing and                      and is the principal language used in military applications, and
          schedule constraints to be input as independent                          more specifically, weapons system software. Weapons system
          variables.                                                               software is different then the normal corporate type of
         Facilitates extensive sensitivity and trade-off analyses                 software, commonly known as Management Information
          on model input parameters.                                               System (MIS) software. The major differences between
                                                                                   weapons system and MIS software are that weapons system
         Organizes project elements into work breakdown                           software is real time and uses a high proportion of complex
          structures for convenient planning and control.                          mathematical coding. Up to 1997, DOD mandated Ada as the
                                                                                   required language to be used unless a waiver was approved.
         Displays project cost drivers.
                                                                                   Lloyd Mosemann stated: The results of this study, like other
         Allows the interactive scheduling of project elements                    studies, showed estimating accuracy improved with calibration.
          on Gantt charts.                                                         The best results were achieved by SEER-SEM model were
                                                                                   accurate within 30 percent, 62 percent of the time.
         Builds estimates upon a sizable knowledge base of
          existing projects.
                                                                                                                           IV.   SLIM ESTIAMTION MODEL
Model specifications include these:                                                    SLIM Software Life-Cycle Model was developed by Larry
         Parameters: size, personnel, complexity, environment                     Putnam [3]. SLIM hires the probabilistic principle called
          and constraints - each with many individual                              Rayleigh distribution between personnel level and time. SLIM
          parameters; knowledge base categories for platform &                     is basically applicable for large projects exceeding 70,000 lines
          application, development & acquisition method,                           of code. [4].
          applicable standards, plus a user customizable                                                                               D
          knowledge base.
         Predictions: effort, schedule, staffing, defects and cost
                                                                                       Percentage of Total Effort

          estimates; estimates can be schedule or effort driven;
          constraints can be specified on schedule and staffing.                                                                                        dy              2 at 2
                                                                                                                                                            2 Kate
         Risk Analysis: sensitivity analysis available on all                                                                                          dt
          least/likely/most values of output parameters;
          probability settings for individual WBS elements
          adjustable, allowing for sorting of estimates by degree
          of WBS element criticality.
                                                                                                                    T=0               td               Time
         Sizing Methods: function points, both IFPUG
          sanctioned plus an augmented set; lines of code, both                                                               Figure 2. The Rayleigh Model
          new and existing.
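
The Rayleigh staffing curve of Figure 2 can be explored numerically. The following is a minimal sketch, not part of SLIM itself: it evaluates the staffing rate dy/dt = 2Kat e^{-at^2}, its integral (the cumulative effort), and the shape parameter that places the staffing peak at a chosen time. The function names and the sample values of K and the peak time are hypothetical and chosen only for illustration.

```python
import math

def rayleigh_staffing(t, K, a):
    """Staffing rate dy/dt = 2*K*a*t*exp(-a*t^2) at elapsed time t."""
    return 2.0 * K * a * t * math.exp(-a * t * t)

def cumulative_effort(t, K, a):
    """Effort spent up to time t: the integral of the staffing rate,
    which is K*(1 - exp(-a*t^2)) and tends to K (the area under the curve)."""
    return K * (1.0 - math.exp(-a * t * t))

def shape_from_peak_time(td):
    """Shape parameter a for which the staffing rate peaks at t = td.
    Setting the derivative of the rate to zero gives a = 1 / (2*td^2)."""
    return 1.0 / (2.0 * td * td)

# Hypothetical project: 100 person-years in total, staffing peak at 2 years.
K = 100.0
td = 2.0
a = shape_from_peak_time(td)
for year in range(0, 7):
    rate = rayleigh_staffing(year, K, a)
    spent = cumulative_effort(year, K, a)
    print(year, round(rate, 2), round(spent, 2))
```

Under these assumptions roughly 39 percent of the total effort K has been spent when the staffing peak is reached, since 1 - e^{-1/2} is about 0.39.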

    SLIM makes use of the Rayleigh curve, taken from [14] and shown in
Figure 2, for effort prediction. This curve represents manpower, measured
in persons per unit time, as a function of time. It is usually expressed in
person-years per year (PY/YR). It can be expressed as

\[ \frac{dy}{dt} = 2Kat\,e^{-at^{2}} \]

    where dy/dt is the manpower utilization per unit time, “t” is the
elapsed time, “a” is the parameter that affects the shape of the curve and
“K” is the area under the curve. There are two important terms associated
with this curve:

    1) Manpower buildup, given by $D_0 = K / t_d^{3}$

    2) Productivity = lines of code / cumulative manpower, i.e. $P = S/E$
and $S = C K^{1/3} t_d^{4/3}$, where C is the technology factor, which
reflects the effects of various factors on productivity such as hardware
constraints, program complexity, programming environment and personnel
experience.

    The SLIM model uses two equations: the software manpower equation and
the software productivity level equation. The SLIM model uses the Rayleigh
distribution to estimate project schedule and defect rate. Two key
attributes used in the SLIM method are the Productivity Index (PI) and the
Manpower Buildup Index (MBI). The PI is a measure of process efficiency
(cost-effectiveness of assets), and the MBI determines the effects on total
project effort that result from variations in the development schedule
[A Probabilistic Model].

    Inputs Required: To use the SLIM method, it is necessary to estimate
system size and to determine the technology factor and appropriate values
of the manpower acceleration. Technology factor and manpower acceleration
can be calculated using similar past projects. System size in terms of KDSI
is to be subjectively estimated. This is a disadvantage, because of the
difficulty of estimating KDSI at the beginning of a project and the
dependence of the measure on the programming language.

    Completeness of Estimate: The SLIM model provides estimates for effort,
duration, and staffing information for the total life cycle and the
development part of the life cycle. COCOMO I provides equations to estimate
effort and duration, and handles the effect of re-using code from
previously developed software. COCOMO II provides cost, effort, and
schedule estimation, depending on the model used (i.e., depending on the
degree of product understanding and marketplace of the project). It handles
the effect of reuse, re-engineering, and maintenance by adjusting the size
measures used with parameters such as percentage of code modification or
percentage of design modification.

    Assumptions: SLIM assumes the Rayleigh curve distribution of staff
loading. The underlying Rayleigh curve assumption does not hold for small
and medium sized projects. Cost estimation is only expected to take place
at the start of the design and coding, because requirement and
specification engineering is not included in the model.

    Complexity: The SLIM model’s complexity is relatively low. For COCOMO
the complexity increases with the level of detail of the model. For COCOMO
I the increasing levels of detail and complexity are the three model types:
basic, intermediate, and detailed. For COCOMO II the level of complexity
increases according to the following order: Application Composition, Early
Design, Post Architecture.

    Automation of Model Development: The Putnam method is supported by a
tool called SLIM (Software Life-Cycle Management). The tool incorporates an
estimation of the required parameter technology factor from the description
of the project. SLIM determines the minimum time to develop a given
software system. Several commercial tools exist to use COCOMO models.

    Application Coverage: SLIM aims at investigating relationships among
staffing levels, schedule, and effort. The SLIM tool provides facilities to
investigate trade-offs among cost drivers and the effects of uncertainty in
the size estimate.

    Generalizability: The SLIM model is claimed to be generally valid for
large systems. COCOMO I was developed within a traditional development
process, and was a priori not suitable for incremental development.
Different development modes are distinguished (organic, semidetached,
embedded). COCOMO II is adapted to meet the needs of new development
practices such as development processes tailored to COTS, or reusable
software availability. No empirical results are currently available
regarding the investigation of these capabilities.

    Comprehensiveness: Putnam’s method does not consider phase or activity
work breakdown. The SLIM tool provides information in terms of the effort
per major activity per month throughout development. In addition, the tool
provides error estimates and feasibility analyses. As the model does not
consider the requirement phase, estimation before design or coding is not
possible. Both COCOMO I and II are extremely comprehensive. They provide
detailed activity distributions of effort and schedule. They also include
estimates for maintenance effort, and an adjustment for code re-use. COCOMO
II provides prototyping effort when using the Application Composition
model. The Architectural Design model involves estimation of the actual
development and maintenance phase. The granularity is about the same as for
COCOMO I.

                      V.    REVIC ESTIMATION MODEL

    REVIC (REVised version of Intermediate COCOMO) is a direct descendant
of COCOMO. Ourada [16] was one of the first to analyze validation, using a
large Air Force database for calibration of the REVIC model. There are
several key differences between REVIC and the 1981 version of COCOMO,
however.

    •  REVIC adds an Ada development mode to the three original COCOMO
       modes: Organic, Semi-detached, and Embedded.
    •  REVIC includes Systems Engineering as a starting phase, as opposed
       to Preliminary Design for COCOMO.
    •  REVIC includes Development, Test, and Evaluation as the ending
       phase, as opposed to COCOMO ending with Integration and Test.
    •  The REVIC basic coefficients and exponents were derived from the
       analysis of a database of completed DoD projects. On the average,
       the estimates obtained with REVIC will be greater than the
       comparable estimates obtained with COCOMO.
    •  REVIC uses PERT (Program Evaluation and Review Technique)
       statistical techniques to determine the lines-of-code input value.
       Low, high, and most probable estimates for each program component
       are used to calculate the effective lines-of-code and the standard
       deviation. The effective lines-of-code and standard deviation are
       then used in the estimation equations rather than the linear sum of
       the line-of-code estimates.
    •  REVIC includes more cost multipliers than COCOMO. Requirements
       volatility, security, management reserve, and an Ada mode are added.

                      VI.   SASET ESTIMATION MODEL

    SASET (Software Architecture, Sizing and Estimating Tool) is a forward
chaining, rule-based expert system using a hierarchically structured
knowledge database of normalized parameters to provide derived software
sizing values. These values can be presented in many formats to include
functionality, optimal development schedule, and man-loading charts. SASET
was developed by Martin Marietta Denver Aerospace Corp. on contract to the
Naval Center for Cost Analysis. To use SASET, the user must first perform a
software decomposition of the system and define the functionalities
associated with the given software system [22].

    SASET uses a tiered approach for system decomposition; Tier I addresses
software developmental and environmental issues. These issues include the
class of the software to be developed, programming language, developmental
schedule, security, etc. Tier I output values represent preliminary budget
and schedule multipliers. Tier II specifies the functional aspects of the
software system, specifically the total lines-of-code (LOC). The total LOC
estimate is then translated into a preliminary budget estimate and
preliminary schedule estimate. The preliminary budget and schedule
estimates are derived by applying the multipliers from Tier I to the total
LOC estimate. Tier III develops the software complexity issues of the
system under study. These issues include: level of system definition,
system timing and criticality, documentation, etc. A complexity multiplier
is then derived and used to alter the preliminary budget and schedule
estimates from Tier II. The software system effort estimation is then
calculated. Tiers IV and V are not necessary for an effort estimation. Tier
IV addresses the in-scope maintenance associated with the project.

    The output of Tier IV is the monthly man-loading for the maintenance
life-cycle. Tier V provides the user with a capability to perform risk
analysis on the sizing, schedule and budget data. The actual mathematical
expressions used in SASET are published in the User's Guide, but the Guide
is very unclear as to what they mean and how to use them.

                      VII.  COSTMODL ESTIMATION MODEL

    COSTMODL (Cost MODeL) is a COCOMO based estimation model developed by
the NASA Johnson Space Center. The program delivered on computer disk for
COSTMODL includes several versions of the original COCOMO and a NASA
developed estimation model KISS (Keep It Simple, Stupid). The KISS model
will not be evaluated here, but it is very simple to understand and easy to
use; however, the calibration environment is unknown. The COSTMODL model
includes the basic COCOMO equations and modes, along with some
modifications to include an Ada mode and other cost multipliers.

    The COSTMODL as delivered includes several calibrations based upon
different data sets. The user can choose one of these calibrations or enter
user specified values. The model also includes a capability to perform a
self-calibration. The user enters the necessary information and the model
will "reverse" calculate and derive the coefficient and exponent, or a
coefficient only, for the input environment data. The model uses the COCOMO
cost multipliers and does not include more as does REVIC. This model
includes all the phases of a software life cycle. PERT techniques are used
to estimate the input lines-of-code in both the development and maintenance

                      VIII. STUDY OF EMPIRICAL MODELS

    Empirical estimation models have been studied for the past couple of
decades, and many of these studies reported results on accuracy and
performance. Table I summarizes a brief study of the most relevant
empirical models. Studies are listed in reverse chronological order. For
each study, estimation methods are ranked according to their performance.
A “1” indicates the best model, “2” the second best, and so on.


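TABLE I.  SUMMARY OF EMPIRICAL ESTIMATION STUDY
(A dash marks a method that was not ranked in the study.)

No.  Author, year                      Regression  COCOMO  Analogy  SLIM  CART  ANN  Stepwise  OSR  Expert    Other
                                                                                     ANOVA          Judgment  Methods
1    Luciana Q, 2009                   1           -       -        -     -     -    -         -    -         -
2    Yeong-Seok Seo, 2009              1           -       -        -     -     -    -         -    -         -
3    Jianfeng Wen, 2009                -           2       -        -     -     -    -         -    -         1
4    Petrônio L. Braga, 2007           2           -       -        -     -     -    -         -    -         1
5    Jingzhou Li, 2007                 -           2       1        -     -     -    -         -    -         -
6    Iris Fabiana de Barcelos Tronto   3           4       5        -     -     2    -         -    -         1
7    Chao-Jung Hsu, 2007               -           -       -        -     -     -    -         -    -         1
8    Kristian M Furulund, 2007         -           -       -        -     -     -    -         -    -         1
9    Bilge Başkeleş, 2007              -           2       -        -     -     -    -         -    -         1
10   Da Deng, 2007                     -           2       -        -     -     -    -         -    -         1
11   Simon, 2006                       -           2       -        -     -     -    -         -    -         1
12   Tim Menzies, 2005                 -           2       -        -     -     -    -         -    -         1
13   Bente Anda, 2005                  -           2       1        -     -     -    -         -    -         -
14   Cuauhtémoc López Martín, 2005     -           -       -        -     -     -    -         -    -         1
15   Parag C, 2005                     -           -       -        -     2     3    -         -    -         1
16   Randy K. Smith, 2001              -           2       -        -     -     -    -         -    -         1
17   Myrtveit, Stensrud, 1999          2           -       3        -     -     -    -         -    -         -
18   Walkerden, Jeffery, 1999          2           -       1        -     -     -    -         -    -         -
19   Kitchenham, 1998                  -           -       -        -     2     -    1         -    -         -
20   Finnie et al., 1997               2           -       1        -     -     1    -         -    -         -
21   Shepperd, Schofield, 1997         2           -       1        -     -     -    -         -    -         -
22   Jorgensen, 1995                   1           -       -        -     -     2    -         1    -         -
23   Srinivasan, Fischer, 1995         2           4       -        5     3     1    -         -    -         -
24   Bisio, Malabocchia, 1995          -           2       1        -     -     -    -         -    -         -
25   Subramanian, Breslawski, 1993     1           2       -        -     -     -    -         -    -         -
26   Mukhopadhyay, Kerke, 1992         1-3         2       -        -     -     -    -         -    -         -
27   Mukhopadhyay et al., 1992         3           4       2        -     -     -    -         -    1         -
28   Briand et al., 1992               2           3       -        -     -     -    -         1    -         -
29   Vicinanza et al., 1991            2           3       -        -     -     -    -         -    1         -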

A. Impact of Cost Drivers
    Empirical software estimation models rest mainly on cost drivers and
scale factors. These models suffer from instability with respect to the
values of the cost drivers and scale factors, which affects the sensitivity
of the effort estimate. Also, most of the models depend on the size of the
project; a change in the size leads to a proportionate change in the
effort. Misjudgments of the cost drivers can change the result even more
dramatically. For example, a misjudgment in personnel capability in COCOMO
or REVIC from ‘very high’ to ‘very low’ will result in a 300% increase in
effort. Similarly, in SEER-SEM, changing security requirements from ‘low’
to ‘high’ will result in a 400% increase in effort. In PRICE-S, a 20%
change in effort will occur due to a small change in the value of the
Productivity factor. All models have one or more inputs for which small
changes will result in large changes in effort and, perhaps, schedule.
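
The sensitivity to personnel capability can be illustrated with a short sketch. It assumes the usual multiplicative COCOMO form (effort = a x KDSI^b x product of effort multipliers). The analyst (ACAP) and programmer (PCAP) capability multipliers below are recalled from the published Intermediate COCOMO (1981) tables and should be checked against the original source; the coefficient, exponent and project size are placeholders, and the function name is hypothetical.

```python
# Effort multipliers for the two personnel-capability drivers.
# ASSUMPTION: values recalled from the Intermediate COCOMO (1981) tables.
ACAP = {"very_low": 1.46, "nominal": 1.00, "very_high": 0.71}  # analyst capability
PCAP = {"very_low": 1.42, "nominal": 1.00, "very_high": 0.70}  # programmer capability

def effort_person_months(kdsi, a, b, multipliers):
    """Effort = a * KDSI^b * (product of the effort multipliers)."""
    eaf = 1.0
    for m in multipliers:
        eaf *= m
    return a * (kdsi ** b) * eaf

# Hypothetical 50 KDSI project; A and B are placeholder calibration values.
A, B, SIZE = 3.0, 1.12, 50.0
best = effort_person_months(SIZE, A, B, [ACAP["very_high"], PCAP["very_high"]])
worst = effort_person_months(SIZE, A, B, [ACAP["very_low"], PCAP["very_low"]])
increase_pct = 100.0 * (worst - best) / best
print(f"very high: {best:.1f} PM, very low: {worst:.1f} PM, "
      f"increase: {increase_pct:.0f}%")
```

Because the multipliers enter as a product, the relative increase does not depend on the size or on the calibration constants. With the values assumed here it is a little over 300%, in line with the figure quoted above.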
Figure 3. Relative Cost Driver Impact [32]

    The input data problem is further compounded in that some inputs are
difficult to obtain, especially early in a program. The size must be
estimated early in a program using one or more sizing models. These models
usually have not been validated for a wide range of projects. Some
sensitive inputs, such as analyst and programmer capability, are subjective
and often difficult to determine. Studies like the one performed by Brent
L. Barber, Investigative Search of Quality Historical Software Support Cost
Data and Software Support Cost-Related Data, show that personnel parameter
data are difficult to collect. Figure 3, extended from the SEER-SEM User’s
Manual, shows
the relative impact on cost/effort of the different input parameters for
that model. Even "objective" inputs like Security Requirements in SEER-SEM
may be difficult to confirm early in a program, and later changes may
result in substantially different cost and schedule estimates. Some
sensitive inputs such as the PRICE-S Productivity Factor and the SLIM PI
should be calibrated from past data. If data are not available, or if
consistent values of these parameters cannot be calibrated, the model’s
usefulness may be questionable.

                      IX.   ANALYSIS OF STUDY

A. Accuracy Estimators

    The field of cost estimation suffers a lack of clarity about the
interpretation of a cost estimate. Jørgensen reports different applications
of the term ‘cost estimate’, being “the most likely cost, the planned cost,
the budget, the price, or, something else”. Consequently there is also
disagreement about how to measure the accuracy of estimates. Various
indicators for accuracy, relative and absolute, have been introduced
throughout the cost estimation literature, such as mean squared error
(MSE), absolute residuals (AR) or balanced residual error (BRE). Our
literature review indicated that the most commonly used by far are the mean
magnitude of relative error, or MMRE, and prediction within x, or PRED(x).
Of these two, the MMRE is the most widely used, yet both are based on the
same basic value of magnitude of relative error (MRE), which is defined
below. The first step will be to apply Conte’s criteria to determine the
accuracy of the calibrated and uncalibrated model. This will be achieved
using the following equations.

   Conte’s Criteria:

    The performance of a model generating continuous output can be assessed
in many ways including PRED(30), MMRE, correlation, etc. PRED(30) is a
measure calculated from the relative error, or RE, which is the relative
size of the difference between the actual and estimated value. One way to
view these measures is to say that training data contains records with
variables 1, 2, 3, ..., N and performance measures add additional new
variables N+1, N+2, ....

    MRE (Magnitude of Relative Error): First, calculate the Magnitude of
Relative Error (degree of estimating error in an individual estimate) for
each data point. This step is a precedent to the next step and is also used
to calculate PRED(n). Satisfactory results are indicated by a value of 25
percent or less [30].

\[ \mathrm{MRE} = \frac{|\mathrm{predicted} - \mathrm{actual}|}{\mathrm{actual}} \]

    MMRE (Mean Magnitude of Relative Error): The mean magnitude of the
relative error, or MMRE, is the average percentage of the absolute values
of the relative errors over an entire data set.

\[ \mathrm{MMRE} = \frac{100}{N} \sum_{i=1}^{N} \frac{|\mathrm{predicted}_i - \mathrm{actual}_i|}{\mathrm{actual}_i} \qquad (6) \]

    where N = total number of estimates.

    RMS (Root Mean Square): Now, calculate the Root Mean Square (model’s
ability to accurately forecast the individual actual effort) for each data
set. This step is a precedent to the next step only. Again, satisfactory
results are indicated by a value of 25 percent or less [30].

\[ \mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mathrm{predicted}_i - \mathrm{actual}_i)^{2}} \qquad (7) \]

    RRMS (Relative Root Mean Square): Lastly, calculate the Relative Root
Mean Square (model’s ability to accurately forecast the average actual
effort) for each data set. According to Conte, the RRMS should have a value
of 25 percent or less [30].

\[ \mathrm{RRMS} = \frac{\mathrm{RMS}}{\sum \mathrm{actual} / T} \qquad (8) \]

    PRED(n): A model should also be within 25 percent accuracy, 75 percent
of the time [30]. To find this accuracy rate, divide the total number of
points within a data set that have an MRE of 0.25 or less (represented by
k) by the total number of data points within the data set (represented by
N). Expressed as a fraction, the accuracy rate is k/N for a threshold of
0.25 [30]. In general, PRED(n) reports the percentage of estimates that
were within n percent of the actual values. Given N estimates, then

\[ \mathrm{PRED}(n) = \frac{100}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \mathrm{MRE}_i \le n/100 \\ 0 & \text{otherwise} \end{cases} \]

    For example, PRED(30) = 50% means that half the estimates are within 30
percent of the actual.

   Wilcoxon Signed-Rank Test.

    The next step will be to test the estimates for bias. The Wilcoxon
signed-rank test is a simple, nonparametric test that determines level of
bias. A nonparametric test may be thought of as a distribution-free test;
i.e. no assumptions about the distribution are made. The best result that
can be achieved by the model estimates is to show no difference between the
number of estimates that overestimated versus those that underestimated.
The Wilcoxon signed-rank test is accomplished using the following steps
[31]:

    1. Divide each validated subset into two groups based on whether the
estimated effort was greater (T+) or less (T-) than the actual effort.

    2. Sum the absolute value of the differences for the T+ and T- groups.
The closer the sums of these values for each group are to each other, the
lower the bias.

    3. Any significant difference indicates a bias to over- or underestimate.

    Another performance measure of a model predicting numeric values is the correlation between
predicted and actual values. Correlation ranges from +1 to -1, and a correlation of +1 means
that there is a perfect positive linear relationship between the variables. It can be calculated
as follows:

$$ p = \frac{\sum_{i=1}^{T} Predicted_i}{T}, \qquad a = \frac{\sum_{i=1}^{T} Actual_i}{T} $$

$$ S_p = \frac{\sum_{i=1}^{T} (Predicted_i - p)^2}{T-1}, \qquad S_a = \frac{\sum_{i=1}^{T} (Actual_i - a)^2}{T-1} \tag{10} $$

$$ S_{pa} = \frac{\sum_{i=1}^{T} (Predicted_i - p)(Actual_i - a)}{T-1} $$

$$ Corr = \frac{S_{pa}}{\sqrt{S_p \cdot S_a}} $$

    The correlation coefficient for COCOMO II is 0.6952 and the correlation coefficient for the
proposed model is 0.9985.

    All these performance measures (correlation, MMRE, and PRED) address subtly different
issues. Overall, PRED measures how well an effort model performs, while MMRE measures poor
performance.

    A single large mistake can skew the MMRE and not affect the PRED. Shepperd and Schofield
comment that MMRE is fairly conservative with a bias against overestimates, while PRED(30) will
identify those prediction systems that are generally accurate but occasionally wildly
inaccurate [28].

B. Model Accuracy

    There is no proof that software cost estimation models perform consistently within 25% of
the actual cost, 75% of the time [30]. In general, a model fails to produce accurate results
even with perfect input data. The above studies have compared empirical estimation models with
known input data and actual cost and schedule information, and have not found the accuracy to be
scintillating. Most models were accurate within 30% of the actual cost, 57% of the time.
The Ourada study showed even worse results for SEER-SEM, SASET, and REVIC for the 28 military
ground programs in an early edition of the Space and Missiles Center database. A 1981 study by
Robert Thibodeau entitled An Evaluation of Software Cost Estimating Models showed that
calibration could improve model accuracy by up to 400%. However, the average accuracy was still
only 30% for an early version of SLIM and 25% for an early version of PRICE-S. PRICE-S and
System-3 are within 30%, 62% of the time. An Air Force study performed by Ferens in 1983 and
published in the ISPA Journal of Parametrics concluded that no software support cost models
could be shown to be accurate. The software support estimation problem is further convoluted by
the lack of quality software support cost data for model development, calibration, and
validation. Even if models can be shown to be accurate, another effect must be considered.

    Table III summarizes the parameters used and activities covered by the models discussed.
Overall, model-based techniques are good for budgeting, tradeoff analysis, planning and control,
and investment analysis. As they are calibrated to past experience, their primary difficulty is
with unprecedented situations.

                 TABLE II.      ANALYSIS OF EMPIRICAL ESTIMATION MODELS

                                                                     Validated Accuracy
Study                       Model          Application Type       MMRE / MRE   RRMS      Pred
Karen Lum, 2002             COCOMO         Flight Software        34.4         -         -
                            SEER           Flight Software        140.7        -         -
                            COCOMO         Ground Software        88.22        -         -
                            SEER           Ground Software        552.33       -         -
Karen Lum, 2006             COSEEKMO       Kind:min               31           -         60(0.3)
                            COSEEKMO       Lang:ftn               44           -         42(0.3)
                            COSEEKMO       Kind:max               38           -         52(0.3)
                            COSEEKMO       All                    40           -         60(0.3)
                            COSEEKMO       Mode:org               32           -         62(0.3)
                            COSEEKMO       Lang:mol               36           -         56(0.3)
                            COSEEKMO       Project:Y              22           -         78(0.3)
                            COSEEKMO       Mission Planning       36           -         50(0.3)
                            COSEEKMO       Avionics monitoring    38           -         53(0.3)
                            COSEEKMO       Mode:sd                33           -         62(0.3)
                            COSEEKMO       Project:X              42           -         42(0.3)
                            COSEEKMO       Fg:g                   32           -         65(0.3)
                            COSEEKMO       Center:5               57           -         43(0.3)
                            COSEEKMO       All                    48           -         43(0.3)
                            COSEEKMO       Mode:e                 64           -         42(0.3)
                            COSEEKMO       Center:2               22           -         83(0.3)
Gerald L Ourada             REVIC          Aero Space             0.373        0.776     42(0.25)
                            SASET          Aero Space             5.95         -0.733    3.5(0.25)
                            [not given]    Aero Space             3.55         -1.696    10.7(0.25)
                            Cost Model     Aero Space             0.46         0.53      29(0.25)
Chris F. Kemerer            SLIM           ABC Software           771.87       -         -
                            COCOMO         ABC Software           610.09       -         -
                            FP             ABC Software           102          -         -
                            Estimac        ABC Software           85.48        -         -
                            SLIM           [not given]            772          -         -
                            COCOMO         [not given]            601          -         -
                            FP             [not given]            102.74       -         -
                            Estimac        [not given]            85           -         -
D Deng, De Tran-Cao, 2007   Cosmic         Random                 0.61         -         0.4(30)
                            Cosmic         B-1                    39           -         50(0.25)

    Table II summarizes the analysis of the studies and is the result of a collaborative effort
of the authors. It lists the author name, cost model name, application type, and validated
accuracy (MMRE, RRMS, Pred), where Pred is the percentage of estimates that fall within the
specified prediction level of 25 or 30 percent. Among these studies, Chris F. Kemerer validated
SLIM and obtained the highest MRE of 772, while COCOMO obtained an


                                                                                                                               Select      COCOMO
Group              Factor                       SLIM          CheckPoint          Price-s       Estimacs       SEER-SEM        Estimator   II
                   Source Instruction           Yes           Yes                 Yes           No             Yes             No          Yes
Size Attributes    Function Points              Yes           Yes                 Yes           Yes            Yes             No          Yes
                   OO-Related Metrics           Yes           Yes                 Yes           !              Yes             Yes         Yes
                   Type/Domain                  Yes           Yes                 Yes           Yes            Yes             Yes         No
                   Complexity                   Yes           Yes                 Yes           Yes            Yes             Yes         Yes
Program Attributes Language                     Yes           Yes                 Yes           !              Yes             Yes         Yes
                   Reuse                        Yes           Yes                 Yes           !              Yes             Yes         Yes
                   Required Reliability         !             !                   Yes           Yes            Yes             No          Yes
Computer           Resource Constraints         Yes           Yes                 Yes           Yes            Yes             No          Yes
Attributes         Platform Volatility          !             !                   !             !              Yes             No          Yes
                   Personnel Capability         Yes           Yes                 Yes           Yes            Yes             Yes         Yes
                   Personnel Continuity         !             !                   !             !              !               No          Yes
                   Personnel Experience         Yes           Yes                 Yes           Yes            Yes             No          Yes
                   Tools and Techniques         Yes           Yes                 Yes           Yes            Yes             Yes         Yes
                   Breakage                     Yes           Yes                 Yes           !              Yes             Yes         Yes
                   Schedule Constraints         Yes           Yes                 Yes           Yes            Yes             Yes         Yes
Project Attributes Process Maturity             Yes           Yes                 !             !              Yes             No          Yes
                   Team Cohesion                !             Yes                 Yes           !              Yes             Yes         Yes
                   Security Issues              !             !                   !             !              Yes             No          No
                   Multi Site Development       !             Yes                 Yes           Yes            Yes             No          Yes
                   Inception                    Yes           Yes                 Yes           Yes            Yes             Yes         Yes
                   Elaboration                  Yes           Yes                 Yes           Yes            Yes             Yes         Yes
Activity Covered
                   Construction                 Yes           Yes                 Yes           Yes            Yes             Yes         Yes
                   Transition and Maintenance   Yes           Yes                 Yes           No             Yes             No          Yes
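
The accuracy measures reported in Table II can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the procedure used by any of the cited studies. It computes MRE, MMRE and PRED(n) as defined in the text, the correlation from the variance and covariance terms S_p, S_a and S_pa, and the T+ and T- sums from the bias steps described earlier. The function names and the sample effort values are hypothetical, and `bias_sums` only reproduces the simplified over/under comparison described in the text, not a full ranked Wilcoxon test.

```python
import math


def mre(predicted, actual):
    """Magnitude of relative error of a single estimate."""
    return abs(predicted - actual) / actual


def mmre(predicted, actual):
    """Mean MRE over all data points, expressed as a percentage."""
    errors = [mre(p, a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


def pred(predicted, actual, n=30):
    """Percentage of estimates whose MRE is at most n/100."""
    hits = sum(1 for p, a in zip(predicted, actual) if mre(p, a) <= n / 100.0)
    return 100.0 * hits / len(actual)


def correlation(predicted, actual):
    """Correlation built from sample variances S_p, S_a and covariance S_pa."""
    t = len(actual)
    p_mean = sum(predicted) / t
    a_mean = sum(actual) / t
    s_p = sum((p - p_mean) ** 2 for p in predicted) / (t - 1)
    s_a = sum((a - a_mean) ** 2 for a in actual) / (t - 1)
    s_pa = sum((p - p_mean) * (a - a_mean)
               for p, a in zip(predicted, actual)) / (t - 1)
    return s_pa / math.sqrt(s_p * s_a)


def bias_sums(predicted, actual):
    """Sums of absolute differences for over-estimates (T+) and under-estimates (T-)."""
    t_plus = sum(p - a for p, a in zip(predicted, actual) if p > a)
    t_minus = sum(a - p for p, a in zip(predicted, actual) if p < a)
    return t_plus, t_minus


if __name__ == "__main__":
    actual = [100, 200, 400, 800]     # hypothetical actual efforts
    predicted = [110, 160, 560, 800]  # hypothetical model estimates
    print("MMRE (%):", mmre(predicted, actual))
    print("PRED(30) (%):", pred(predicted, actual, 30))
    print("Correlation:", correlation(predicted, actual))
    print("T+, T-:", bias_sums(predicted, actual))
```

The more the T+ sum differs from the T- sum, the stronger the suggested bias. A single very large MRE raises MMRE noticeably while changing PRED(30) by at most one estimate, which is the sensitivity difference noted in the text.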
   MRE rate of 610.09. Karen Lum validated SEER-SEM and found an MRE of 552.33. The validation
of COSEEKMO on the Mode:e subset, also evaluated by Karen Lum, gave a large MRE of 64. The
Ourada study showed that REVIC has an MMRE of 0.373 and that SASET has an even worse result
of 5.95.

                         X.     CONCLUSION
    The background readings suggested that the existing models were highly credible; however,
based on the research performed, this survey found this not to be so. None of the models could
predict the actual values against either the calibration data or the validation data with any
level of accuracy or consistency. Surprisingly, SEER and machine learning techniques were
reasonably good at predicting the effort. However, they are still not accurate, because every
model rests on prediction and, as these estimation models show, a prediction never comes
exactly true. In all the models, the two key factors that influenced the estimate were project
size, in terms of either LOC or FP, and the capabilities of the development team personnel.
This paper is not convinced that a model so sensitive to the abilities of the development team
can be applied across the board to any software development effort. Finally, this paper
concludes that no model is best for all situations and environments.

                           REFERENCES

[1]  Satyananda, "An Improved Fuzzy Approach for COCOMO's Effort Estimation Using Gaussian
     Membership Function," Journal of Software, vol. 4, pp. 452-459, 2009.
[2]  R. Jensen, "An improved macrolevel software development resource estimation model," in
     5th ISPA Conference, pp. 88-92, 1983.
[3]  B. Boehm, C. Abts, and S. Chulani, "Software Development Cost Estimation Approaches – A
     Survey," Technical Report USC-CSE-2000-505, University of Southern California – Center for
     Software Engineering, USA, 2000.
[4]  S. Chulani, B. Boehm, and B. Steece, "Bayesian Analysis of Empirical Software Engineering
     Cost Models," IEEE Trans. Software Eng., vol. 25, no. 4, pp. 573-583, 1999.
[5]  Jorgensen M, Sjoberg D.I.K, "The Impact of Customer Expectation on Software Development
     Effort Estimates," International Journal of Project Management, Elsevier, pp. 317-325, 2004.
[6]  Chiu NH, Huang SJ, "The Adjusted Analogy-Based Software Effort Estimation Based on
     Similarity Distances," Journal of Systems and Software, Volume 80, Issue 4, pp. 628-640, 2007.
[7]  Kaczmarek J, Kucharski M, "Size and Effort Estimation for Applications Written in Java,"
     Journal of Information and Software Technology, Volume 46, Issue 9, pp 589-60, 2004.
[8]  Jeffery R, Ruhe M, Wieczorek I, "Using Public Domain Metrics to Estimate Software
     Development Effort," in Proceedings of the 7th International Symposium on Software Metrics,
     IEEE Computer Society, Washington, DC, pp. 16-27, 2001.
[9]  Heiat A, "Comparison of Artificial Neural Network and Regression Models for Estimating
     Software Development Effort," Journal of Information and Software Technology, Volume 44,
     Issue 15, pp. 911-922, 2002.
[10] Huang SJ, Lin CY, Chiu NH, "Fuzzy Decision Tree Approach for Embedding Risk Assessment
     Information into Software Cost Estimation Model," Journal of Information Science and
     Engineering, Volume 22, Number 2, pp. 297-313, 2006.
[11] B.W. Boehm, "Software Engineering Economics," Prentice Hall.
[12] B.W. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark, B. Steece, A. W. Brown,
     S. Chulani, and C. Abts, "Software Cost Estimation with COCOMO II," Prentice Hall, 2000.
[13] Vu Nguyen, Bert Steece, Barry Boehm, "A Constrained Regression Technique for COCOMO
     Calibration," ESEM'08, ACM, pp. 213-222, 2008.


[14] K.K. Aggarwal, Yogesh Singh, A. Kaur, O.P. Sangwan, "A Neural Net Based Approach to Test
     Oracle," ACM Software Engineering Notes, vol. 29, no. 4, pp. 1-6, 2004.
[15] R. W. Jensen, "A Comparison of the Jensen and COCOMO Schedule and Cost Estimation Models,"
     Proceedings of the International Society of Parametric Analysts, pp. 96-106, 1983.
[16] Ourada, Gerald L, "Software Cost Estimation Models: A Calibration, Evaluation, and
     Comparison," Air Force Institute of Technology.
[17] K. Srinivasan and D. Fisher, "Machine learning approaches to estimating software
     development effort," IEEE Transactions on Software Engineering, vol. 21, pp. 126-137, 1995.
[18] A. R. Venkatachalam, "Software Cost Estimation Using Artificial Neural Networks," presented
     at the 1993 International Joint Conference on Neural Networks, Nagoya, Japan, 1993.
[19] G. H. Subramanian, P. C. Pendharkar, and M. Wallace, "An Empirical Study of the Effect of
     Complexity, Platform, and Program Type on Software Development Effort of Business
     Applications," Empirical Software Engineering, vol. 11, pp. 541-553, 2006.
[20] S. Kumar, B. A. Krishna, and P. S. Satsangi, "Fuzzy systems and neural networks in software
     engineering project management," Journal of Applied Intelligence, vol. 4, pp. 31-52, 1994.
[21] R. W. Selby and A. A. Porter, "Learning from examples: generation and evaluation of
     decision trees for software resource analysis," IEEE Transactions on Software Engineering,
     vol. 14, pp. 1743-1757, 1988.
[22] Ratliff, Robert W., "SASET 3.0 User Guide," Martin-Marietta, Denver, CO, 1993.
[23] K. Maxwell, L. Van Wassenhove, and S. Dutta, "Performance Evaluation of General and Company
     Specific Models in Software Development Effort Estimation," Management Science, vol. 45,
     pp. 787-803, 1999.
[24] M. van Genuchten and H. Koolen, "On the Use of Software Cost Models," Information &
     Management, vol. 21, pp. 37-44, 1991.
[25] T. K. Abdel-Hamid, "Adapting, Correcting, and Perfecting software estimates: A maintenance
     metaphor," Computer, vol. 26, pp. 20-29, 1993.
[26] F. J. Heemstra, "Software cost estimation," Information and Software Technology, vol. 34,
     pp. 627-639, 1992.
[27] N. Fenton, "Software Measurement: A necessary Scientific Basis," IEEE Transactions on
     Software Engineering, vol. 20, pp. 199-206.
[28] M. Shepperd and C. Schofield, "Estimating Software Project Effort Using Analogies," IEEE
     Trans. Software Eng., vol. 23, pp. 736-743.
[29] Barry Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
     ISBN 0-13-822122-7.
[30] S.D. Conte, H.E. Dunsmore, V.Y. Shen, "Software Engineering Metrics and Models,"
     Benjamin-Cummings Publishing Co., Inc.
[31] Mendenhall, W., Wackerly, D.D., Schaeffer, R.L., "Mathematical Statistics with
     Applications, 4th edition," PWS-KENT Publishing Company, Boston, 1990, 752 pp,
     ISBN 0-534-92026-8.
[33] Barry Boehm, Chris Abts and Sunita Chulani, "Software development cost estimation
     approaches - A survey," Annals of Software Engineering, pp. 177-205, 2000.

                              AUTHORS PROFILE

    Saleem Basha is a Ph.D. research scholar in the Department of Computer Science, Pondicherry
University. He obtained his B.E. in Electrical and Electronics Engineering from Bangalore
University, Bangalore, India, and his M.E. in Computer Science and Engineering from Anna
University, Chennai, India. He is currently working in the area of SDLC-specific effort
estimation models and web service modelling systems.

    Dr. Dhavachelvan Ponnurangam is working as Associate Professor, Department of Computer
Science, Pondicherry University, India. He obtained his M.E. and Ph.D. in Computer Science and
Engineering from Anna University, Chennai, India. He has more than a decade of experience as an
academician, and his research areas include Software Engineering and Standards, and web service
computing and technologies. He has published around 75 research papers in national and
international journals and conferences. He is collaborating and coordinating with research
groups working to develop standards for Attributes Specific SDLC Models and Web Services
computing and technologies.


                 A Survey on Preprocessing Methods for
                           Web Usage Data

V.Chitraa
Lecturer
CMS College of Science and Commerce
Coimbatore, Tamilnadu, India

Dr. Antony Selvdoss Davamani
Reader in Computer Science
NGM College (AUTONOMOUS)
Pollachi, Coimbatore, Tamilnadu, India

Abstract— The World Wide Web is a huge repository of web pages and links. It provides an
abundance of information for Internet users. The growth of the web is tremendous, as
approximately one million pages are added daily. Users' accesses are recorded in web logs.
Because of the tremendous usage of the web, web log files are growing at a fast rate and their
size is becoming huge. Web data mining is the application of data mining techniques to web
data. Web usage mining applies mining techniques to log data to extract the behavior of users,
which is used in various applications such as personalized services, adaptive web sites,
customer profiling, prefetching, and creating attractive web sites. Web usage mining consists
of three phases: preprocessing, pattern discovery and pattern analysis. Web log data is usually
noisy and ambiguous, so preprocessing is an important step before mining. For discovering
patterns, sessions have to be constructed efficiently. This paper reviews existing work done in
the preprocessing stage. A brief overview of various data mining techniques for discovering
patterns and of pattern analysis is given. Finally, a glimpse of various applications of web
usage mining is also presented.

   Keywords- Data Cleaning, Path Completion, Session Identification, User Identification, Web
Log Mining

                       I.     INTRODUCTION
    Data mining is defined as the automatic extraction of unknown, useful and understandable
patterns from large databases. The enormous growth of the World Wide Web increases the
complexity for users to browse effectively. To increase the performance of web sites, web site
design and web server activities are adapted to users' interests. The ability to know the
patterns of users' habits and interests helps the operational strategies of enterprises.
Various applications such as e-commerce, personalization, web site design and recommender
systems are built efficiently by knowing users' navigation through the web. Web mining is the
application of data mining techniques to automatically retrieve, extract and evaluate
information for knowledge discovery from web documents and services.

    The objects of Web mining are vast, heterogeneous and distributed documents. The logical
structure of the Web is a graph formed by documents and hyperlinks, so the mining results may
concern Web contents or Web structures. Web mining is divided into three types: Web content
mining, Web structure mining and Web usage mining.

    Web Content Mining deals with the discovery of useful information from web contents, data,
documents or services.

    Web Structure Mining mines the structure of hyperlinks within the web itself. Structure
represents the graph of the links in a site or between sites.

    Web Usage Mining mines the log data stored in the web server.

A. Web Usage Mining
    Web usage mining, also known as web log mining, is the application of data mining
techniques on large web log repositories to discover useful knowledge about users' behavioral
patterns and website usage statistics that can be used for various website design tasks. The
main source of data for web usage mining consists of textual logs collected by numerous web
servers all around the world. There are four stages in web usage mining.

    Data Collection: users' log data is collected from various sources such as the server side,
client side, proxy servers and so on.

    Preprocessing: performs a series of processing steps on the web log file, covering data
cleaning, user identification, session identification, path completion and transaction
identification.

    Pattern discovery: application of various data mining techniques to the processed data,
such as statistical analysis, association, clustering, pattern matching and so on.

    Pattern analysis: once patterns have been discovered from web logs, uninteresting rules are
filtered out. Analysis is done using knowledge query mechanisms such as SQL, or data cubes to
perform OLAP operations.

    All four stages are depicted in the following figure.
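
The preprocessing stage named above can be illustrated with a minimal sketch covering three of its steps: data cleaning, user identification and session identification. Everything specific in it is an assumption made for illustration rather than a method taken from this survey: the log lines are assumed to be in the common "combined" web server log format, users are distinguished by the pair of host address and user agent, and a session is closed after 30 minutes of inactivity. Path completion and transaction identification are not covered.

```python
import re
from datetime import datetime, timedelta

# Assumed input: combined log format
# host ident user [time] "method url protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)
# Embedded resources that are usually not page views (assumed list).
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumed inactivity threshold that separates two sessions.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


def clean(entries):
    """Data cleaning: keep successful GET requests for pages only."""
    return [
        e for e in entries
        if e["method"] == "GET"
        and 200 <= e["status"] < 300
        and not e["url"].lower().endswith(IGNORED_SUFFIXES)
    ]


def identify_users(entries):
    """User identification: group requests by (host, user agent)."""
    users = {}
    for e in entries:
        users.setdefault((e["host"], e["agent"]), []).append(e)
    return users


def sessionize(user_entries):
    """Session identification: start a new session after a long gap."""
    sessions = []
    for e in sorted(user_entries, key=lambda entry: entry["time"]):
        if sessions and e["time"] - sessions[-1][-1]["time"] <= SESSION_TIMEOUT:
            sessions[-1].append(e)
        else:
            sessions.append([e])
    return sessions
```

A typical use is to call `parse` on every line, drop the `None` results, pass the rest through `clean` and `identify_users`, and then call `sessionize` on each user's request list. Grouping by host and user agent is only a heuristic, since several users behind one proxy can share both values.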

    Figure 1. Phases of Web usage mining (Web Log; Data Cleaning, User Identification, Session
    Identification, Transaction Identification; Pattern Discovery; Pattern Analysis)

    The objective of this paper is to provide a review of web usage mining and a survey of the
preprocessing stage. The Data Collection section lists the various data sources, and the
preprocessing section reviews the work done on session identification and path completion. The
remaining sections give a brief account of pattern discovery, pattern analysis and the areas of
application in which web usage mining is used.

                     II. DATA COLLECTION
    Data collection is the first step in the web usage mining process. It consists of gathering
the relevant web data. Data can be collected at the server side, at the client side or at proxy
servers, or obtained from an organization's database, which contains business data or
consolidated Web data [13].
    Server level collection records client requests and stores them on the server as web logs.
Web server logs are plain text and are independent of the server platform. Most web servers
follow the common log format:
    "ipaddress username password date/timestamp url version status-code bytes-sent"
    Some servers follow the Extended log format, which adds the referrer and the user agent. The
referrer is the URL of the referring link, and the user agent is the string describing the type
and version of the browser software used. Web caching and IP address misinterpretation are the
two drawbacks of the server log. A web cache keeps track of the web pages that are requested and
saves a copy of these pages for a certain period. If the same page is requested again, the
cached page is used instead of making a new request to the server. Therefore, these requests are
not recorded in the log files.
    Cookies are unique IDs generated by the web server for individual client browsers, and they
automatically track the site visitors [16]. When the user visits the next time, the request is
sent to the web server along with the ID. However, users who are concerned about privacy and
security can disable the browser option for accepting cookies.
    Explicit user input is collected through registration forms and provides important personal
and demographic information and preferences. However, this data is not reliable, since the data
may be incorrect or users may neglect those forms.
    Client side collection has an advantage over server side collection, since it overcomes both
the caching and the session identification problems. Browsers are modified to record browsing
behavior, or remote agents such as Java applets are used to collect user browsing information.
Java applets may generate some additional overhead, especially when they are loaded for the
first time, and users have to be convinced to use a modified browser. Along with log files,
intentional browsing data from the client side, such as "add to my favorites" and "copy", is
also added for efficient web usage mining [27].
    Proxy level collection gathers data from an intermediate server between browsers and web
servers. Proxy caching is used to reduce the loading time of a Web page experienced by users as
well as the network traffic load at the server and client sides [16]. Access logs from proxy
servers have the same format as web server logs and record the web page requests and the
responses of the server. Proxy traces may reveal the actual HTTP requests from multiple clients
to multiple Web servers. This may serve as a data source for characterizing the browsing
behavior of a group of anonymous users sharing a common proxy server.

                     III. DATA PREPROCESSING
    The information available on the web is heterogeneous and unstructured. Therefore, the
preprocessing phase is a prerequisite for discovering patterns. The goal of preprocessing is to
transform the raw click stream data into a set of user profiles [8]. Data preprocessing presents
a number of unique challenges, which have led to a variety of algorithms and heuristic
techniques for preprocessing tasks such as merging and cleaning, and user and session
identification [18]. Various research works have been carried out in this area on grouping
sessions and transactions, which are used to discover user behavior patterns.

A. Data Cleaning
    Data cleaning is the process of removing irrelevant items such as JPEG, GIF or sound files
and references caused by spider navigation. Better data quality improves the analysis performed
on the data. The HTTP protocol requires a separate connection for every file requested from the
web server. When a user requests a particular page, graphics and scripts are downloaded in
addition to the HTML file, and they appear in the server log entries as well. An exception is a
site such as an art gallery, where the images are the important content. The status codes of the
log entries are also checked for success: entries with a status code below 200 or above 299 are
removed.

B. User Identification
    Identifying the individual users who access a web site is an important step in web usage
mining. Various methods can be followed to identify users. The simplest method is to assign a
different user ID to each IP address. However, behind a proxy server many users share the same
address, and the same user may use several browsers.
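
    The data cleaning rules and the simplest, IP-based identification rule described above can
be sketched as follows. This is a minimal illustration, not the procedure of any surveyed paper:
it assumes log lines in the widely used NCSA Common Log Format layout (host, ident, user,
timestamp in brackets, quoted request line, status, bytes), and the regular expression, the list
of file suffixes, the sample lines and all function names are assumptions made for the example.
Filtering of spider requests is left out.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: host ident user [timestamp] "METHOD url VERSION" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Illustrative suffixes treated as embedded objects rather than page views.
IRRELEVANT_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".wav", ".mp3")

# Hypothetical sample log used only for demonstration.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Mar/2010:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [10/Mar/2010:10:00:01 +0000] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Mar/2010:10:00:05 +0000] "GET /missing.html HTTP/1.0" 404 0',
    '10.0.0.2 - - [10/Mar/2010:10:00:09 +0000] "GET /products.html HTTP/1.0" 200 2048',
]


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_relevant(entry):
    """Keep successful (2xx) requests that are not images, sounds or scripts."""
    if not 200 <= int(entry["status"]) <= 299:
        return False
    path = urlsplit(entry["url"]).path.lower()
    return not path.endswith(IRRELEVANT_SUFFIXES)


def clean_log(lines):
    """Parse raw lines and drop malformed or irrelevant entries."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and is_relevant(e)]


def assign_user_ids(entries):
    """Simplest identification rule: one user id per distinct IP address."""
    ids = {}
    for entry in entries:
        entry["user_id"] = ids.setdefault(entry["host"], len(ids) + 1)
    return entries


if __name__ == "__main__":
    for e in assign_user_ids(clean_log(SAMPLE_LOG)):
        print(e["user_id"], e["host"], e["url"])
```

    On the four sample lines the image request and the 404 entry are dropped, and the two
remaining page views receive user ids 1 and 2. Two visitors behind one proxy would receive the
same id under this rule, which is the limitation of IP-based identification noted above.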

    An Extended Log Format overcomes this problem by using the referrer information and the user
agent. If the IP address of a user is the same as in the previous entry but the user agent is
different, the user is assumed to be a new user. If both the IP address and the user agent are
the same, the referrer URL and the site topology are checked. If the requested page is not
directly reachable from any of the pages already visited by the user, the user is identified as
a new user at the same address [20]. The caching problem can be rectified by assigning a short
expiration time to HTML pages, which forces the browser to retrieve every page from the
server [7].

C. Session Identification
    A user session can be defined as the set of pages visited by the same user within the
duration of one particular visit to a web site. A user may have a single session or multiple
sessions during a period. Once a user has been identified, the click stream of that user is
partitioned into logical clusters. This partitioning into sessions is called sessionization or
session reconstruction. A transaction is defined as a subset of a user session consisting of
homogeneous pages. There are three methods of session reconstruction: two depend on time and one
on navigation in the web topology.
    Time Oriented Heuristics: The simplest methods are time oriented; one is based on the total
session time and the other on the stay time on a single page. The set of pages visited by a
specific user at a specific time is called page viewing time. It varies from 25.5 minutes [5] to
24 hours [23], while 30 minutes is the default timeout used by R. Cooley [19]. The second method
depends on the page stay time, which is calculated as the difference between two timestamps. If
it exceeds 10 minutes, the second entry is assumed to start a new session. Time-based methods
are not reliable, because users may engage in other activities after opening a web page, and
factors such as a busy communication line, the loading time of the components of a web page and
the content size of web pages are not considered.
    Navigation-Oriented Heuristics: These use the web topology in graph format. They consider
web page connectivity; however, a hyperlink between two consecutive page requests is not
required. If a web page is not connected to any previously visited page in a session, it is
considered to belong to a different session. Cooley proposed a referrer-based heuristic on the
basis of navigation, in which the referrer URL of a page should exist in the same session. If no
referrer is found, the page is the first page of a new session.
    Both methods are used by many applications. To improve performance, different researchers
have devised further methods on the basis of the time-oriented and navigation-oriented
heuristics for the effective reconstruction of sessions.
    The referrer-based method and the time-oriented heuristic are combined to accomplish user
session identification in [13]. The Web Access Log Set is the set of all records in the web
access log, stored in time sequence. A User Session Set is obtained from the Web Access Log Set
by the following rules. Different users are distinguished by different IP addresses. If the IP
addresses are the same, different browsers or operating systems indicate different users. If the
IP addresses, browsers and operating systems are all the same, the referrer information is taken
into account. The Referrer URL field is checked, and a new user session is identified if the URL
in the Referrer URL field has never been accessed before, or, if the Referrer URL field is
empty, if there is a large interval (more than 10 seconds [12]) between the access time of this
record and that of the previous one. If the sessions identified by the previous step contain
more than one visit by the same user at different times, the time-oriented heuristic is then
used to divide the different visits into different user sessions.
    A simple algorithm was devised by Baoyao Zhou [4]. An access session is created as a
sequence of requests, each a pair of a URL and its request time. The duration of a URL is
estimated as the difference between the request time of the successor entry and that of the
current entry. The last URL has no successor, so its duration is estimated as the average
duration in the current session. The end time of a session is its start time plus its duration.
This algorithm is suitable when a session contains a large number of URLs. The default time set
by the author is 30 minutes per session.
    Smart Miner is a new method devised by Murat Ali and team [16, 17]. This framework is part
of a web analytics software. The sessions constructed by SMART-SRA contain pages accessed
sequentially, taken from server-side data; the method works in two stages and follows the
Timestamp Ordering Rule and the Topology Rule. In the first stage the data stream is divided
into shorter page sequences, called candidate sessions, by using the session duration time and
page stay time rules. In the second stage the candidate sessions generated in the first stage
are divided into maximal sub-sessions. In this stage the referrer constraints of the topology
rule are added, which eliminates the need to insert backward browser moves. Pages without any
referrer in the candidate session are determined from the web topology and then removed. If a
hyperlink to them exists from a previously constructed session, those pages are appended to
that session. In this way sessions are formed one after another. The authors developed an agent
simulator to simulate an actual web user. It randomly generates a typical web site topology and
a user agent that accesses the site from the client side and acts like a real user. An important
feature of the agent simulator is its ability to model the dynamic behavior of a web agent. A
time constraint is also applied: the difference between two consecutive page requests must be
smaller than 10 minutes.
    Another method, using integer programming, was proposed by Robert F. Dell [21]. The
advantage of this method is that all sessions are constructed simultaneously. He suggests that
each web log entry be considered as a register. Registers from the same IP address and agent
that are also linked are grouped to form a session. A binary variable is used, with a value of
1 or 0 depending on whether or not a register is assigned a position in a particular session.
Constraints are imposed, such as each register being used at most once and in only one session
for each ordered position. A maximization problem is formulated. To improve the solution time, a
subset of the binary variables is set to zero. An experiment shows how the objective function
varies, and results are obtained both with raw registers and with registers filtered for MM
objects and errors. Unique pages and links between pages are counted. Chunks with the same IP
address and
agent within a time limit are formed such that no register in one chunk could ever be part of a
session in another. The experiment focuses on IP addresses with high diversity and a higher
number of registers. The sessions produced match an expected empirical distribution better.
    Graphs are also used for session identification and give more accurate results. Web pages
are represented as vertices and hyperlinks as edges of a graph. User navigations are modeled as
traversals, from which frequent patterns can be discovered, that is, sub-traversals that are
contained in a large ratio of the traversals [22]. A method was proposed by Mehdi Heydari and
team [15], who consider client side data to be important as well for reconstructing a user's
session. The method has three phases. In the first phase an AJAX interface is designed to
monitor the user's browsing behavior. Events such as session start, session end, on page
request, on page load and on page focus are created along with the user's interaction and are
recorded in the session. In the second phase a base graph is constructed from the web usage
data. The browsing time of web pages is indicated on the vertices. A traversal is a sequence of
consecutive web pages on the base graph [22]. A database of traversals is created. In the third
phase a graph mining method is applied to the database to discover weighted frequent patterns. A
weighted frequent pattern is a pattern whose traversal weight is greater than or equal to a
given Minimum Browsing Time.
    Another algorithm, proposed by Junjie Chen and Wei Liu, combines data cleaning and session
identification [13]. Content gathered from the web logs that is foreign to the mining algorithms
is deleted. Each user activity record is checked to judge whether it is a spider record and
whether it is an embedded object in a page, according to the URL of the requested page and the
site structure graph. The session records are then searched; if no session exists, a new session
is established. If the present session has ended or exceeds the preset time threshold, the
algorithm closes it and starts a new one. Graph mining methods construct accurate sessions, and
the time taken is comparatively low. More research is needed in this area.

D. Path Completion
    Pages may be missing after transactions have been constructed, because of proxy servers and
caching [25][26]. Missing pages are added as follows. Each page request is checked to see
whether it is directly linked to the last page. If there is no link with the last page, the
recent history is checked. If the page is found in the recent history, it is clear that the
"back" button was used to step through cached pages until that page was reached. If the referrer
log is not clear, the site topology can be used to the same effect. If many pages are linked to
the requested page, the closest page is taken as the source of the new request, and that page is
added to the session. There are three approaches in this regard.
    Reference Length approach: This approach is based on the assumption that the amount of time
a user spends on a page correlates with whether the page is an auxiliary page or a content page
for that user. It is expected that the time spent on an auxiliary page is small and the time
spent on a content page is larger. A reference length can be calculated that estimates the
cut-off between auxiliary and content references. The length of each reference is estimated as
the difference between the time of the next reference and that of the current reference. The
last reference has no next reference, so this approach assumes that the last one is always an
auxiliary reference.
    Maximal Forward Reference: A transaction is considered to be the set of pages from the first
visited page up to the point where a backward reference occurs. Forward reference pages are
considered content pages, and the pages on the path to them are taken as index pages. A new
transaction starts when a backward reference is made.
    Time Window: A time window transaction is framed from triplets of IP address, user
identification and the time length of each web page, up to a limit called the time window. If
the time window is large, each transaction will contain all the page references of each user.
The time window method is also used as a merge approach in conjunction with one of the previous
methods.
    An optimal algorithm was devised by G. Arumugam and S. Suguna [2] to generate accurate path
sequences, using an access history list based on a two-way hashed structure to frame a complete
path in optimal time. In the existing approach the tree structure of the server pages is
searched. This search has two problems: a backward reference consumes time on unused pages as
well, and pages that are referred to directly from other servers lead to incorrect session
identification. To overcome these issues the authors give a different algorithm. Their session
identification algorithm uses data structures such as an array list to represent the web logs
and the user access list, a hash table to represent the server pages, and a two-way hashed
structure. The two-way hashed structure stores the Access History List (AHL), which represents
the sequence of pages accessed by a user. Two hash tables are used, a primary and a secondary
one; the primary table stores the sessions and pointers into the secondary table, which holds
the complete navigation path. To reduce the time consumed, only visited pages are stored in the
access history list and unused pages are not considered. With a single search in the history
list, the page sequences are located directly. When a page is referred to from another server,
the path starts directly from that page and not from the root. If the page is not available in
the present sessions, a new session is started, and it can be inferred that this is not a
backward reference but that the page was browsed from another server. This method generates more
correct complete paths than the maximal forward reference and reference length methods.

                     IV. PATTERN DISCOVERY AND ANALYSIS
    Once user transactions have been identified, a variety of data mining techniques are applied
for pattern discovery in web usage mining. These methods represent the approaches that often
appear in the data mining literature, such as the discovery of association rules and sequential
patterns, clustering and classification [13]. Classification is a supervised learning process,
because learning is driven by the assignment of instances to the classes in the training data.
It maps a data item into one of several predefined classes. This can be done by using inductive
learning algorithms such as decision tree classifiers, naive Bayesian classifiers and Support
Vector Machines. Association Rule Discovery techniques are
applied to databases of transactions in which each transaction consists of a set of items. Using
the Apriori algorithm, the largest frequent access item sets, that is, the user access patterns,
are discovered from the transaction databases. Clustering is a technique for grouping users who
exhibit similar browsing patterns. Such knowledge is especially useful for inferring user
demographics in order to perform market segmentation in e-commerce applications or to provide
personalized web content to the users. Sequential patterns are used to find inter-session
patterns in which the presence of a set of items is followed by another item in a time-ordered
set of sessions. With this approach, web marketers can predict future visit patterns, which
helps in placing advertisements aimed at certain user groups.
    Pattern analysis is the last stage of web usage mining. Mined patterns as such are not
suitable for interpretation and judgment, so it is important to filter out uninteresting rules
or patterns from the set found in the pattern discovery phase. In this stage tools are provided
to facilitate the transformation of information into knowledge. The exact analysis methodology
is usually governed by the application for which Web mining is done. A knowledge query mechanism
such as SQL is the most common method of pattern analysis [7]. Another method is to load the
usage data into a data cube in order to perform OLAP operations.

                     V. WEB USAGE MINING APPLICATIONS
    Users' behavior is used in different applications, such as personalization and e-commerce,
and to improve the system and its design according to users' interests. Web personalization
offers many functions, from a simple user salutation to more complicated ones such as content
delivery according to users' interests. Content delivery is very important, since non-expert
users are overwhelmed by the quantity of information available online. User behavior can be
anticipated by comparing the current navigation patterns with patterns extracted from past web
logs. Recommendation systems are the most common application, and personalized sites are an
example of recommendation systems. E-commerce applications need customer details for Customer
Relationship Management. Usage mining techniques are very useful for focusing on customer
attraction, customer retention, cross sales and customer departure. System improvement is
achieved by understanding web traffic behavior through mining log data, so that policies can be
developed for Web caching, load balancing, network transmission and data distribution. Mining
also provides patterns for detecting intrusion, fraud and attempted break-ins. Performance is
improved to satisfy users. Site modification is the process of modifying the web site and
improving the quality of its design and contents once the interests of users are known. Pages
are re-linked according to customer behavior.

                     VI. FUTURE
    There are a number of issues in the preprocessing of log data. The volume of requests in a
single web log file is the first challenge. Analyzing web user access log files helps to

entries of document traversal, file retrieval and unsuccessful web events, among many others,
that are organized according to date and time. It is important to eliminate the irrelevant data.
Cleaning is therefore done to speed up the analysis, as it reduces the number of records and
increases the quality of the results in the analysis stage. Efforts to find accurate sessions in
this data are likely to be the most fruitful in the creation of more effective web usage mining
and personalization systems. By following the data preparation steps, it becomes much easier to
generate rules that identify directories for website improvement. More research can be done in
the preprocessing stages to clean raw log files, to identify users and to construct accurate
sessions.

                     VII. CONCLUSION
    Web sites are among the most important tools for international advertising by universities
and other foundations. The quality of a web site can be evaluated by analyzing user accesses to
it by means of web usage mining. The results of mining can be used to improve the web site
design and increase user satisfaction, which helps in various applications. Log files are the
best source for learning about user behavior. However, raw log files contain unnecessary details
such as image accesses and failed entries, which affect the accuracy of pattern discovery and
analysis. The preprocessing stage is therefore an important part of mining and makes pattern
analysis efficient. To obtain accurate mining results, users' session details have to be known.
This survey covered a selection of web usage mining preprocessing methodologies proposed by the
research community. It concentrated on preprocessing stages such as session identification and
path completion, and presented the work done by different researchers. Our future research aims
to create more efficient session reconstruction through graphs and to mine the sessions using
graph mining, since high-quality sessions give more accurate patterns for the analysis of users.

                     REFERENCES
[1]   Archana N. Mahanta, "Web Mining: Application of Data Mining," Proceedings of NCKM, 2008.
[2]   Arumugam G. and Suguna S., "Optimal Algorithms for Generation of User Session Sequences
      Using Server Side Web User Logs," ESRGroups, France, 2009.
[3]   Bamshad Mobasher, "Data Mining for Web Personalization," LNCS, Springer-Verlag Berlin
      Heidelberg, 2007.
[4]   Baoyao Zhou, Siu Cheung Hui and Alvis C. M. Fong, "An Effective Approach for Periodic Web
      Personalization," Proceedings of the IEEE/ACM International Conference on Web
      Intelligence, IEEE, 2006.
[5]   Catledge L. and Pitkow J., "Characterising browsing behaviours in the world wide Web,"
      Computer Networks and ISDN Systems, 1995.
[6]   Chungsheng Zhang and Liyan Zhuang, "New Path Filling Method on Data Preprocessing in Web
      Mining," Computer and Information Science Journal, August 2008.
[7]   Cyrus Shahabi, Amir M. Zarkessh, Jafar Abidi and Vishal Shah, "Knowledge discovery from
      users Web page navigation," In Workshop on Research Issues in Data Engineering,
      Birmingham,
[8]   Demin Dong, "Exploration on Web Usage Mining and its Application," IEEE, 2009.
understand the user behaviors in web structure to improve the
design of web components and web applications. Log includes

[9]    Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou and
       Constantine D. Spyropoulos, “Web Usage Mining as a Tool for
       Personalization,” Kluwer Academic Publishers, 2003.
[10]   Feng Tao and Fionn Murtagh, “Towards Knowledge Discovery from
       WWW Log Data,” IEEE, 2007.
[11]   Jaideep Srivastava, Robert Cooley, Mukund Deshpande and Pang-Ning
       Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns
       from Web Data,” SIGKDD Explorations, ACM SIGKDD, 2000.
[12]   Jia Hu and Ning Zhong, “Clickstream Log Acquisition with Web
       Farming,” Proceedings of the International Conference on Web
       Intelligence, IEEE, 2005.
[13]   Jose M. Domenech and Javier Lorenzo, “A Tool for Web Usage
       Mining,” 8th International Conference on Intelligent Data Engineering
       and Automated Learning, 2007.
[14]   Jungie Chen and Wei Liu, “Research for Web Usage Mining Model,”
       International Conference on Computational Intelligence for Modelling
       Control and Automation, IEEE, 2006.
[15]   Mehdi Heydari, Raed Ali Helal and Khairil Imran Ghauth, “A Graph-
       Based Web Usage Mining Method Considering Client Side Data,”
       International Conference on Electrical Engineering and Informatics,
       IEEE, 2009.
[16]   Murat Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar and Guven Fidan,
       “Discovering more accurate Frequent Web Usage Patterns,”
       arXiv:0804.1409v1, 2008.
[17]   Murat Ali Bayir, Ismail Hakki Toroslu, Ahmet Cosar and Guven Fidan,
       “Smart Miner: A new Framework for Mining Large Scale Web Usage
       Data,” International World Wide Web Conference Committee, ACM,
       2009.
[18]   Raju G.T. and Sathyanarayana P., “Knowledge discovery from Web
       Usage Data: Complete Preprocessing Methodology,” IJCSNS, 2008.
[19]   Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, “Web
       mining: Information and Pattern Discovery on the World Wide Web,”
       In International Conference on Tools with Artificial Intelligence,
       pages 558-567, Newport Beach, IEEE, 1997.
[20]   Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, “Data
       Preparation for Mining World Wide Web Browsing Patterns,” Journal
       of Knowledge and Information Systems, 1999.
[21]   Robert F. Dell, Pablo E. Roman and Juan D. Velasquez, “Web User
       Session Reconstruction Using Integer Programming,” IEEE/ACM
       International Conference on Web Intelligence and Intelligent
[22]   Seong Dae Lee and Hyu Chan Park, “Mining Weighted Frequent
       Patterns from Path Traversals on Weighted Graph,” International
       Journal of Computer Science and Network Security, VOL.7 No.4, April
       2007.
[23]   Spiliopoulou M., Mobasher B. and Berendt B., “A framework for the
       Evaluation of Session Reconstruction Heuristics in Web Usage
       Analysis,” INFORMS Journal on Computing, Spring 2003.
[24]   Suresh R.M. and Padmajavalli R., “An Overview of Data
       Preprocessing in Data and Web usage Mining,” IEEE, 2006.
[25]   Yan Li, Boqin Feng and Qinjiao Mao, “Research on Path Completion
       Technique in Web Usage Mining,” International Symposium on
       Computer Science and Computational Technology, IEEE, 2008.
[26]   Yan Li and Boqin Feng, “The Construction of Transactions for Web
       Usage Mining,” International Conference on Computational
       Intelligence and Natural Computing, IEEE, 2009.
[27]   Yu-Hai Tao, Tsung-Pei Hong and Yu-Ming Su, “Web usage mining with
       intentional browsing data,” Expert Systems with Applications, Science
       Direct, 2008.

                           AUTHORS PROFILE

Mrs. V. Chitraa is a doctoral student in Manonmaniam Sundaranar
University, Tirunelveli, Tamilnadu. She is working as a lecturer in CMS
college of Science and Commerce, Coimbatore. Her research interest lies in
Database Concepts, Web Usage Mining, Clustering.

Dr. Antony Selvadoss Davamani is working as a Reader in NGM college with a
teaching experience of about 22 years. His research interests include
knowledge management, web mining, networks, mobile computing and
telecommunication. He has published about 8 books and 16 papers.

                                                                                                                          ISSN 1947-5500
                                                     (IJCSIS) International Journal of Computer Science and Information Security,
                                                     Vol. 7, No. 3, March 2010
        Seamless Data Services for Real Time Communication in a Heterogeneous
                 Networks using Network Tracking and Management

 Adiline Macriga. T1                                             Dr. P. Anandha Kumar2

 1. Research Scholar, Department of Information & Communication, MIT Campus, Anna University
 Chennai, Chennai – 600025. email:,
 2. Asst. Professor, Department of Information Technology, MIT Campus, Anna University Chennai,
Chennai – 600025. email:
   Abstract

 Heterogeneous Networks is the integration of all existing networks under
 a single environment, with an understanding between their functional
 operations, and includes the ability to make use of multiple broadband
 transport technologies and to support generalized mobility. Integrating
 several IP-based access technologies in a seamless way is a challenging
 requirement for heterogeneous networks. The focus of this paper is on the
 requirements of a mobility management scheme for multimedia real-time
 communication services, specifically mobile video conferencing. Nowadays,
 the range of available wireless access network technologies includes
 cellular or wide-area wireless systems, such as cellular networks
 (GSM/GPRS/UMTS) or WiMAX, and local area or personal area wireless
 systems, comprising, for example, WLAN (802.11 a/b/g) and Bluetooth. For
 mobile video conferencing, the more advanced mobile terminals are capable
 of having more than one interface active at the same time. In addition,
 the heterogeneity of access technologies and the demand for a seamless
 flow of information will increase in the future, making the seamless
 integration of the access networks a key challenge for mobility
 management in a heterogeneous network environment. Services must be
 provided to the user regardless of the particular access technology, the
 type of service provider, or the network used.

 Keywords: Location Tracking, Location Management, Mobility Management,
 Heterogeneous Networks, Seamless services.

   I. INTRODUCTION

 Today’s communication technology becomes outdated for tomorrow’s
 requirements. The growth of the communication industry is tremendous.
 Different modes such as wired, wireless, ad hoc and mobile support the
 growth of the communication industry, but each has certain limits. Now it
 is time to move into the world of mobility, where wireless communication
 plays a vital role and must satisfy the requirements of the modern world.
 A world without a mobile phone is unimaginable. It has taken people to a
 different world. Now it is the time for providing services in an
 uninterrupted way. In the medical industry, a millisecond delay in the
 transfer of information may lead to a loss of life. So the technology has
 to be updated day by day to meet the needs of the various industries. The
 communication industry is one of the most challenging for researchers.
 Different service providers have already deployed a large amount of
 infrastructure to satisfy their customers' needs. The challenge now is to
 provide a seamless flow of information without modifying the existing
 infrastructure. The most challenging task is to provide mobility
 management for real-time communication services. This paper mainly
 concentrates on framing a path in advance between the source and the
 destination based on the nature of the data being communicated, in this
 case mobile video conferencing. It also discusses the challenges involved
 in mobility and handoff techniques. In heterogeneous wireless networks,
 handoff is traditionally classified as horizontal handoff or vertical
 handoff. A horizontal handoff is made between different access points
 within the same link-layer technology, such as when transferring a
 connection from one Base Station to another or from one Access Point to
 another. A vertical handoff is a handoff between access networks with
 different link-layer technologies, which involves the transfer of a
 connection between a Base Station and an Access Point. Seamless and
 efficient Vertical Handoff between different access technologies is an
 essential and challenging problem in the development toward the
 next-generation wireless networks [1][12]. The handoff process itself can
 be divided into three main steps: system discovery, handoff decision, and
 handoff execution [24]. During the system discovery phase, mobile
 terminals equipped with multiple interfaces have to determine which
 networks can be used and the services available in each network. During
 the handoff decision phase, the mobile device determines which network it
 should connect to. During the handoff execution phase, connections need
 to be rerouted from the existing network to the new network in a seamless
 manner. There are three strategies for handoff decision mechanisms:
 mobile-controlled handoff, network-controlled handoff, and
 mobile-assisted handoff [14]. Mobile Controlled Handoff is used in IEEE
 802.11 WLAN networks, where a Mobile Host continuously monitors the
 signal of an Access Point and initiates the handoff procedure. Network
 Controlled Handoff is used in cellular voice networks, where the decision
 mechanism of handoff control is located in a network entity. Mobile
 Assisted Handoff has been widely adopted in the current WWANs such as
 GPRS, where the mobile host measures the signal of surrounding base
 stations and the network
then employs this information and decides whether or not to trigger
handoff [3][13]. The handoff algorithms [3] considered are based on the
threshold comparison of one or more metrics, with dynamic programming or
artificial intelligence techniques applied to improve the accuracy of the
handoff procedure. The most common metrics are received signal strength,
carrier-to-interference ratio, signal-to-interference ratio, and bit
error rate [2]. In heterogeneous wireless networks, even though the
functionalities of access networks are different, all the networks use a
specific signal beacon or reference channel with a constant transmit
power to enable received signal strength measurements. Thus, it is
natural and reasonable for vertical handoff algorithms to use received
signal strength as the basic criterion for handoff decisions [14] [16].
          In order to avoid the ping-pong effect [4], additional
parameters such as hysteresis and a dwelling timer can be used solely or
jointly in the handoff decision process. In addition to the absolute
received signal strength threshold, a relative received signal strength
hysteresis between the new base station and the old base station is added
as the handoff trigger condition to decrease unnecessary handoffs. With
this implementation it will be possible to provide an uninterrupted flow
of multimedia communication. The ultimate goal of this paper is to
provide the services based on the saying “Any time Anything Anywhere and
Everywhere Anystate and Everystate”. We follow this approach by proposing
a location tracking based solution that supports hybrid handovers without
disruption of real time multimedia communication services. The solution
provided is mobility management using Location based tracking and Network
management.
          The rest of the paper is organized as follows: Section II
reviews related work on location estimation and vertical handover.
Section III provides the solution for mobility management. Section IV
presents the proposed method. Section V deals with the performance
evaluation. Section VI presents the results and related discussion.
Section VII discusses directions for future work and concludes this
paper.

II. SURVEY OF EXISTING TECHNOLOGIES &
METHODS

The section below surveys the available technologies that support
communication at the various stages and aspects. Daniel et al. [5]
propose a handoff scheme based on RSS with the consideration of
thresholds and hysteresis for mobile nodes to obtain better performance.
However, in heterogeneous wireless networks, RSS from different networks
can vary significantly due to the different techniques used in the
physical layers and cannot be easily compared with each other. Thus, the
methods in [4] and [5] cannot be applied to VHO directly. Anthony et al.
[19] use the dwelling timer as a handoff initiation criterion to increase
the WLAN utilization. It was shown in [21] that the optimal value for the
dwelling timer varies along with the used data rate or, to be more
precise, with the effective throughput ratio. In [8], Olama et al. extend
the simulation framework in [14] by introducing a scenario for multiple
radio network environments. Their main results show that the handoff
delay caused by frequent handoff has a much bigger degrading effect on
the throughput in the transition region. In addition, the benefit that
can be achieved with the optimal value of the dwelling timer as in [7]
may not be enough to compensate for the effect of handoff delay. In [27],
Claudio et al. propose an automatic interface selection approach that
performs the VHO if a specific number of continuously received beacons
from the WLAN exceed or fall below a predefined threshold.
          Additionally, in the real-time service, the number of
continuous beacon signals should be lower than that of the non-real-time
service in order to reduce the handoff delay [26][30]. More parameters
may be employed to make more intelligent decisions. Li et al. [10]
propose a bandwidth aware VHO technique which considers the residual
bandwidth of a WLAN in addition to RSS as the criterion for handoff
decisions. However, it relies on the QBSS load defined in the IEEE
802.11e Standard to estimate the residual bandwidth in the WLAN. In [29],
Weishen et al. propose a method for defining the handoff cost as a
function of the available bandwidth and monetary cost. In [16], actual
RSS and bandwidth were chosen as two important parameters for the
Waveform design. Hossain et al. [15] propose a game theoretic framework
for radio resource management to perform VHO in heterogeneous wireless
networks. One main difficulty of the cost approach is its dependence on
some parameters that are difficult to estimate, especially in large
cellular networks. Mohanty and Akyildiz [14] developed a cross-layer
(Layer 2 + 3) handoff management protocol, CHMP, which calculates a
dynamic value of the RSS threshold for handoff initiation by estimating
the MH’s speed and predicting the handoff signaling delay of possible
handoffs.
          To sum up, the application scenario of current Vertical Handoff
algorithms is relatively simple. For example, most Vertical Handoff
algorithms only consider the pure Vertical Handoff scenario, where the
algorithm only needs to decide when to use a 3G network and when to use a
WLAN [1], [10], [17], [18], [21], [25]. In fact, at any moment, there may
be many available networks (homogeneous or heterogeneous), and the
Handoff algorithm has to select the optimal network for Horizontal
Handoff or Vertical Handoff from all the available candidates. For
example, if the current access network of the Mobile Host is a WLAN, the
Mobile Assisted Handoff may sense many other WLANs and a 3G network at a
particular moment, and it has to decide whether to trigger Horizontal
Handoff or Vertical Handoff. If the Horizontal Handoff trigger is
selected, Mobile Assisted Handoff then needs to decide which WLAN is the
optimal one [20] [22]. Consequently, an analytical framework to evaluate
VHO algorithms is needed to provide guidelines for optimization of
handoff in heterogeneous wireless networks. It is also necessary to build
reasonable and typical simulation models to evaluate the performance of
VHO algorithms. The proposed work will provide an optimal solution to
choose the network based on the type of service that is being carried
out. As the work mainly concentrates on the application oriented approach
the type of network and handover selection is
also based on requirement. It also concentrates on the identification of
the source and destination, and by locating them the path is chosen.

III. SOLUTIONS FOR MOBILITY
MANAGEMENT

A. HYBRID HANDOFF
Seamless and efficient information flow between different access
technologies is an essential and challenging problem in the development
toward the next-generation wireless networks. A single handoff type
cannot be prescribed, so the handoffs available for the different
technologies are grouped under the single heading of hybrid handoff. For
the seamless flow of information the main technique is handoff. In
general, the handoff process can be divided into three main steps: system
discovery, handoff decision, and handoff execution. During the system
discovery phase, mobile terminals equipped with multiple interfaces have
to determine which networks can be used and the services available in
each network. During the handoff decision phase, the mobile device
determines which network it should connect to. During the handoff
execution phase, connections need to be rerouted from the existing
network to the new network in a seamless manner. During the Vertical
Handoff procedure, the handoff decision is the most important step that
affects the Mobile Host’s communication [9][11]. An incorrect handoff
decision may degrade the QoS of traffic and even break off current
communication. Handoff algorithms in heterogeneous wireless networks
should support both Horizontal Handoff and Vertical Handoff and should be
able to trigger either one based on the network condition. It should be
noted that, because of the uncertainty of the network distribution and
the randomness of the MH’s mobility, it is impossible to forecast the
type of the next handoff in advance. For this reason, the proposed work
studies the nature of the networks between the source and the destination
in advance and triggers handoff based on the availability and services of
the available networks. Thus, handoff algorithms in heterogeneous
wireless networks must make the appropriate handoff decision based on the
network metrics in a relatively short time scale.
There are three strategies for handoff decision mechanisms:
mobile-controlled handoff, network-controlled handoff, and
mobile-assisted handoff [14]. Mobile-controlled handoff is used in IEEE
802.11 WLAN networks, where the Mobile Host continuously monitors the
signal of an AP and initiates the handoff procedure. Network-controlled
handoff is used in cellular voice networks, where the decision mechanism
of handoff control is located in a network entity. Mobile-assisted
handoff has been widely adopted in the current WWANs such as GPRS, where
the Mobile Host measures the signal of surrounding Base Stations and the
network then employs this information and decides whether or not to
trigger handoff. During Vertical Handoff, only Mobile Hosts have the
knowledge about what kind of interfaces they are equipped with. Even if
the network has this knowledge, there may be no way for it to control
another network to which the Mobile Host is about to hand off. This is
achieved by having a proper QoS understanding between the networks. For
example, the data to be transmitted may be large, and the available
network may have enough bandwidth to carry it but be unable to do so
because of the traffic density in the network. The proposed work takes
the nature of the data into account: if it is emergency information that
requires immediate attention, the corresponding network is requested to
rearrange its traffic by sharing the resources of the surrounding
networks and to make capacity available for the current transmission.

B. MOBILE INTERNET PROTOCOL
Mobile Internet Protocol is a mobility solution working at the network
layer. IPv4 assumes that every node has its own IP address that should
remain unchanged during a communication session. Mobile IP introduces the
concepts of the home address, which is the permanent address of the
Mobile Host, and of the care-of address. The latter is a temporary
address assigned to the Mobile Host as soon as it moves from its home
network to a foreign one. A specific router in the home network, the home
agent, is informed as soon as the node acquires the care-of address in
the foreign network from a so-called foreign agent. The home agent acts
as an anchor point, relaying the packets addressed to the home address
towards the actual location of the Mobile Host, at the care-of address.
Using mobile IP for real-time communications has some drawbacks. A
well-known problem is triangular routing, that is, the fact that the
packets sent to the Mobile Host are captured by the home agent and
tunneled, whereas the Mobile Host can send packets directly to the
Corresponding Host. This asymmetric routing adds delay to the traffic
towards the Mobile Host, and delay is an important issue in voice over IP
(VoIP). The fact that the packets are tunneled also means that an
overhead of typically 20 bytes, due to the IP encapsulation, will be
added to each packet. Still another drawback of using mobile IP is that
each Mobile Host requires a permanent home IP address, which can be a
problem because of the limited number of IP addresses in IPv4. A number
of works have built upon MIP to overcome its drawbacks. A notable one is
cellular IP [4], which improves MIP, providing fast handoff control and
paging functionality comparable to those of cellular networks. Being a
network level solution, cellular IP requires support from the access
networks, and it is suitable for micro-mobility, namely, mobility within
the environment of a single provider. The major work of this paper
concentrates on seamless streaming in heterogeneous networks; one of the
major challenges is the bit rate adaptation when the gap between two
different networks is large. In heterogeneous networks, the available
channel bandwidth usually fluctuates in a wide range, from bit rates
below 64 kbps to above 1 Mbps, according to the type of network. The
following technique will overcome the above mentioned drawbacks and will
provide a comfortable solution for the seamless flow of information based
on the nature of information.

IV. PROPOSED SERVICE METHOD
    Given the requirements for seamless mobility listed
    above, the roles to be supported by a mobility
    management function exploiting the terminal capability to
    access several radio access networks are the following:
    i.        Selection of the access network at application
    launch. This role is ensured by mobility management
    subfunctions here referred to as service-to-radio mapping
    ii.       Triggering of the handover during a session. The
    mobility management function aims at always providing
    the best access network to the terminal.
    iii.      A terminal-centric selection without network
    assistance or recommendation
    iv.       A network-controlled selection within network
    entities, based on both terminal and access network
    measurements, enforcing decisions on the terminal
    v.        Network-assisted selection on the terminal side,             Fig. 2 Graphical Representation between source (S)
    the network providing operator policies, and access/core               and Destination (D)
    load information (joint terminal/network decisions).
    When only one access remains available, network-
    assisted selection is applied; when access selection is
    triggered by network load considerations, network control
    may be used for load balancing.
    vi.     Finally, for access network selection, the mobility
    management function must retrieve the status of resource
    usage in each access network. This information is
    provided by an “access network resource management”
    function, which computes a technology-independent
    abstracted view of access resource availability.
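
    The selection roles listed above can be illustrated with a minimal
    sketch of service-to-radio mapping at application launch. The class
    names, field names, the 0-to-1 load scale and the 0.8 load limit are
    all assumptions made for this illustration; the paper does not define
    a data model or a scoring rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AccessNetwork:
    """Technology-independent view of one access network (hypothetical fields)."""
    name: str               # label only, e.g. "WLAN" or "UMTS"
    bandwidth_kbps: float   # nominal bandwidth available to a terminal
    latency_ms: float       # typical latency on this access
    load: float             # abstracted load indicator in [0, 1]


@dataclass(frozen=True)
class AppNeeds:
    """QoS constraints of the application being launched (hypothetical fields)."""
    min_bandwidth_kbps: float
    max_latency_ms: float


def select_access_network(needs: AppNeeds,
                          candidates: List[AccessNetwork],
                          load_limit: float = 0.8) -> Optional[AccessNetwork]:
    """Return the least-loaded network that satisfies the needs, or None."""
    feasible = [
        net for net in candidates
        # Bandwidth left over after the current load must cover the need.
        if net.bandwidth_kbps * (1.0 - net.load) >= needs.min_bandwidth_kbps
        and net.latency_ms <= needs.max_latency_ms
        and net.load < load_limit
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda net: net.load)
```

    With a WLAN at half load and a lightly loaded UMTS cell as candidates,
    a low-rate application would be mapped to the less loaded network,
    while a request that no network can satisfy returns None so that the
    caller can fall back to another policy.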

    Functional Entities Involved in Mobility Management
    — The mobility management functions are supported
    by functional entities described below that are
    responsible for selecting the best access network for
    each terminal. They may be triggered at application
    launch or during the time of connection establishment.
•      Generate a geographical map between the source and                  Fig. 3 Topological Representation between source (S)
      the destination based on the mode of transport                       and Destination (D)
•   Tabulate the list of service providers with their
      frequency and coverage
•   Create an acyclic graph by pointing out the service
      providers and the geographical scenarios
•   Form a topological graph from the acyclic graph by
      removing the obstacles and considering the signal
•      Get the traffic status of each and every link from
      source to destination
•      Now create a path from source to destination based on
      the network traffic and also by choosing the shortest
      path with minimum handovers
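
The path-creation steps above can be sketched as a shortest-path search
over the topological graph. This is only an illustration under stated
assumptions: the graph encoding, the function names and the rule that a
handover occurs whenever two consecutive nodes belong to different
service providers are hypothetical. The admission test reuses the load
condition the paper states for this procedure, NLc + ∆E ≤ NLth.

```python
import heapq


def admissible(nl_c, delta_e, nl_th):
    """Load condition from the paper: NLc + dE <= NLth."""
    return nl_c + delta_e <= nl_th


def best_path(links, provider, source, dest, delta_e):
    """Find a path with the fewest handovers, then the fewest hops.

    links:    {(u, v): (nl_c, nl_th)} undirected links with current and
              threshold load (hypothetical encoding)
    provider: {node: provider_name}; a provider change counts as a handover
    Returns (path, handovers), or (None, None) if no admissible path exists.
    """
    # Keep only links that can absorb the estimated load increase.
    adj = {}
    for (u, v), (nl_c, nl_th) in links.items():
        if admissible(nl_c, delta_e, nl_th):
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)

    # Dijkstra with a lexicographic cost (handovers, hops).
    heap = [((0, 0), source, [source])]
    settled = {}
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dest:
            return path, cost[0]
        if node in settled and settled[node] <= cost:
            continue
        settled[node] = cost
        for nxt in adj.get(node, []):
            handover = 1 if provider[nxt] != provider[node] else 0
            new_cost = (cost[0] + handover, cost[1] + 1)
            heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None, None
```

In this sketch a longer route that stays with one provider is preferred
over a shorter route that changes provider twice; if a link on the
preferred route would exceed its threshold load, the search falls back to
the next admissible route.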

                                                                           Fig. 4 Path Representation between source (S) and
                                                                           Destination (D)

                                                                       • For each network in the specified path between the
                                                                         source and destination
                                                                                Get the bandwidth of the network
                                                                                Calculate the average traffic of the network
             Note down the obstacles on the way                          Management.
             Note down the active service providers with                 •         The mobility register is a database that stores
     their coverage area                                                 profiles and high-level characteristics of users, radio
             Note down the type of network services [3G,                 access networks, and operators. It also stores session
     3GPP]                                                               information such as terminal location or application
                                                                         currently used. There is one database entity per
•    Calculate the traffic density in the network                        operator.
     NLc + ∆E ≤ NLth → take the same path                                •         The global Mobility Management registers
     NLc → Current Network Load                                          users to the seamless mobility service. The global
     ∆E → Estimated Increase in Load                                     Mobility Management activates the user context in the
     NLth → Network Threshold Load                                       mobility register and is the screen for the Authentication
                                                                         Availability Access rights process. Interfaces with the
• Generate a shortest path based on                                      mobility register receives “global” information from the
           The maximum signal coverage                                   terminal on application needs, user preferences, and
           Traffic density below the NLth                                basic access network data type, identity, availability and
• Continue the transmission in the specified path                        performs pre-selection of the access network. Using
• Continue the same procedure till the source reaches the                information from the terminal, the global Mobility
  destination or the source is stable or not in mobility                 Management provides the local Mobility Management
                                                                         with an ordered list of recommended access networks
    Assumptions made for the analysis                                    based on operator policies e.g., prioritize WLAN access
     At least one type of service provider is available within           for streaming applications, manage access privileges
    the limit                                                            according to user’s profile, user and network profiles,
                                                                         application quality of service (QoS) constraints (e.g.,
    V. PERFORMANCE EVALUATION                                            minimum bandwidth, latency guaranty), and basic radio
                                                                         information such as availability of access networks.
                                                                           •       The local Mobility Management receives
                                                                           measurements of the environment in its coverage area
                                                                           from access network RM entities access network load
                                                                           and quality information such as delay, packet loss
                                                                           ratio and from the terminal on an event-triggered or a
                                                                           periodic basis, processes application requests from the
                                                                           terminal in order to map QoS application needs to
                                                                           radio parameters (e.g., bandwidth vs. load on the
                                                                           different networks) within the service-to-radio
                                                                           mapping control function, Computes handoff triggers
                                                                           such as radio coverage and quality on the current
                                                                           network being below an acceptable threshold, current
                                                                           access network load being above a threshold, and
                                                                           modification of the network classification provided by
                                                                           the global Mobility Management. Also processes the
                                                                           global Mobility Management recommendation before
                                                                           selecting an access network for a user. This
                                                                           recommendation arises on a global Mobility
                                                                           Management event trigger or a terminal event trigger.
                                                                           Finally, makes the final handoff decision based on the
                                                                           radio triggers, as well as on the global Mobility
                                                                           Management recommendation, and orders the terminal
                                                                           to execute the handover.
                                                                           •       The terminal implements a seamless mobility
                                                                           application programming interface in charge of
                                                                           Computing handover triggers, related to coverage
    Fig.5 Cellular/ WLAN Seamless Mobile Architecture[17]                  radio signal strength and signal quality in order to
                                                                           detect whether the radio bearer fulfills application
      As far as the architecture is concerned, the existing                needs in terms of link quality.
      architecture is used without any modifications in the                 •      Sending out radio signal strength and quality
      hardware. Modifications are carried out at the router                measurement for current and target access networks
      level to reduce the router management delay as the                   on an event-triggered based on local Mobility
      information from the origin will have the path to be                 Management request or triggers below thresholds, but
      taken and the address of the destination also will have              also on a periodic basis if the radio metrics remain
      the details of the corresponding node through which it               below the threshold. If the signal is good enough for
      has to pass and when to go for handoff and what type                 the terminal, no periodic measurement will be sent to
      of handoff is required in advance.                                   the local Mobility Management unless the local
    They may be triggered at application launch or during                  Mobility Management sends out a specific request to
    the application session by the terminal or local Mobility              get information detecting available networks in line of
sight, Managing user preferences, Invoking services
via service-to-radio mapping control, the API                                                                 As the existing technology is considered the 3GPP/
provides application needs and a preferred access                                                             UMTS network has a constant coverage for a limited
network for this application. Then terminal pre-                                                              traffic. The maximum users allowed is 80 – 85. Also as
selection is confirmed or not by Mobility                                                                     the usage spectrum is considered only 80 – 85% of the
Managements according to access network condition.                                                            available spectrum is utilized efficiently. The result is
executing the handover upon local Mobility                                                                    shown in fig. 6. During which no traffic can be
Management order depends on the link layer                                                                    transmitted or received (it corresponds to traffic
technology. There is no specific constraint on the                                                            interruptions on Fig. 6). Upon returning to normal
mobility management protocol, which could be based                                                            operation, a peak of traffic is observed when the terminal
on Internet Engineering Task Force specifications, for                                                        transmits a burst of packets that could not be
instance triggering the handoff in case it does not                                                           transmitted during the scanning periods of 200–400 ms.
receive local Mobility Management orders in time or                                                           In order to avoid perceivable interruption of voice
when only one target access network remains                                                                   transmission, an adaptive buffer has been set up on the
available. Optionally, managing operator policies and                                                         receiver side, which enables the network to cope with
preferences so that it can efficiently make the                                                               “silences” but results in slightly increased latency. This
handover decision by itself in the absence of a local                                                         configuration could further be improved by breaking the
Mobility Management recommendation.                                                                           scanning period into shorter ones in order to avoid latency
The access network Resource Management entities.                                                              increase. However, this configuration may lead to lower
This entity is access-technology-specific as it                                                               measurement precision, so an acceptable compromise
interfaces each access network with the local Mobility                                                        must be reached. In any case, this scenario is not
Management Resource Managements receive real-                                                                 considered for wider-scale deployment, since the latency
time radio load-related information from the access                                                           on the EDGE network leads to unacceptable VoIP quality.
network access point and use it to provide load
indicators to the local Mobility Management in a
standardized abstracted format.

The following section is to provide the seamless flow
of information in a practical context that addresses the
integrating of cellular and WLAN access networks. In
order to implement the different functions listed
earlier, some initial technological choices need to be
made. First, only intertechnology handoffs WLAN are
considered in the seamless mobility architecture.
Intratechnology handoffs are taken care of by
technology-specific mechanisms. Then Mobile IP has
been chosen as the L3 protocol for handoff execution
in the proof of concept, and is used on top of either                                                           Fig. 7. Data Rate of WLAN Network (No. of Users vs. data rate in Mbps; usable and maximum data rate)
IPv4 or IPv6 in order to provide session continuity
during intertechnology handoff. A clear separation of                                                           Another goal of the test bed was to assess
handoff decision and execution processes allows any                                                             performance of mobility management in the WLAN
evolution of IP protocols to minimize new care-of                                                               environment. As an example, we considered handoff
address configuration and rerouting latencies, for                                                              delay for a 60 kb/s Real-time Transport Protocol
instance, to replace baseline Mobile IP without                                                                 streaming service, with handoff delay defined as the
modifying the proposed architecture.                                                                            delay between the first Real-time Transport
                                                                                                                Protocol packet on the new interface and the last
                                                                                                                packet on the old interface. When network control is
                                                                                                                enabled, the decision to perform handover is taken on
                                                                                                                load criteria: the streaming starts on the WLAN
                                                                                                                interface where other terminals load the AP; when the
                                                                                                                network load reaches a given threshold, mobility
                                                                                                                management entities trigger the handover. In both
                                                                                                                cases handoff delay was about 0.05 ms, because of
                                                                                                                Mobile IP and General Packet Radio Service network
                                                                                                                latencies. The results of the data rate in fig. 7 also
                                                                                                                give a clear picture that it was mainly based on the
                                                                                                                nature of the application and also the stability of the
                                                                                                                network varies upon the nature of the functions of the
                                                                                                                hardware deployed.
Fig. 6. Data rate of the Network 3GPP/ UMTS (No. of Users vs. data rate in Mbps; usable and maximum data rate)
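
The handoff delay definition and the load-based trigger used in the
WLAN test can be sketched as follows. The function names and the sample
timestamps are hypothetical and are not measurements from the test bed;
the sketch only restates the definition given in the text (delay between
the last packet on the old interface and the first packet on the new
interface) and the rule that handover is triggered when the load reaches
a threshold.

```python
def handoff_delay(old_iface_times, new_iface_times):
    """Delay between the last packet received on the old interface and
    the first packet received on the new interface (same time unit as
    the inputs)."""
    return min(new_iface_times) - max(old_iface_times)


def load_trigger(load, threshold):
    """True when the access network load has reached the threshold."""
    return load >= threshold


# Hypothetical arrival times in seconds, for illustration only.
old_times = [0.00, 0.02, 0.04]
new_times = [0.30, 0.32]
delay = handoff_delay(old_times, new_times)
```

Keeping the trigger and the delay measurement as separate functions
mirrors the separation of handoff decision and handoff execution that the
architecture relies on.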
                                                                                                                 having proper power management in the 3G network
                                                                                                                 and by making use of antennas with wider coverage in
                                                                                                                 WLAN environment, the available bandwidth can be
                                                                                                                 maximum utilized and also the number of handoffs
                                                                                                                 can be reduced as the nature of the network present in
                                                                                                                 the graphical architecture between the source and the
                                                                                                                 destination is studied in advance, a maximum
                                                                                                                 throughput can be achieved with minimum tolerable
                                                                                                                 delay or no delay based on the nature of the
                                                                                                                 information that is taken for transmission. The data
                                                                                                                 rate of the heterogeneous network is very close to the
                                                                                                                 available rate as shown in fig. 9.

                                                                                                                 VII. FUTURE SCOPE & CONCLUSION
  Fig. 8. Data Rate in the 3G Environment (No. of Users vs. data rate in Mbps; usable and maximum data rate for Wimax/ 3G)
                                                                                                                 Results have confirmed the feasibility of the approach;
  The higher transmission latency experienced in the                                                             its scalability to large geographical areas has to be
  cellular access network can be observed in the graph                                                           confirmed with additional validation through
  provided (fig. 8). On the transmit side, the                                                                   simulations and trials. A possible stepwise approach to
  transmission is performed with no silence period. On                                                           the deployment of the different functional elements of
  the receive side, handing over to the cellular network                                                         the presented architecture is defined. In this approach
  introduces more latency, results in a silence period the                                                       a vector based location tracking and management is
  order of magnitude of which is equal to the latency                                                            only considered for the seamless flow. By combining
  difference between both networks. The use of an                                                                the parameters such as signal strength and delay
  adaptive buffer at the receiver side makes it                                                                  management in flow and also the formats of the
  transparent to the user which is reflected as a smooth                                                         information we can have a seamless flow. Also the
  seamless flow in the heterogeneous Networks. When                                                              QoS between the Radio Access Networks should be
  considering the 3G/ Wimax cellular network, the                                                                standardized in such a way that there is no mismatch
  number of users is high compared with the other                                                                of transmission from one type of environment to
  networks and also had a wider coverage but there is a                                                          another type. Finally, with the advent of research on
  pitfall at the end, the bandwidth fluctuates beyond                                                            moving networks (e.g., Network Mobility), in which
  80%. At the time of mobility the network coverage is                                                           the whole network is mobile, the integration of
  limited as shown in fig. 8                                                                                     WLANs and WMANs can improve mobile network
                                                                                                                 connectivity. It is expected that public transportation
                                                                                                                 (trains, buses, airplanes, ships, etc.) will popularize
                                                                                                                 Internet access through WiFi connectivity to
                                                                                                                 passengers while in movement. To this end, it will be
                                                                                                                 equipped with a bridge to communicate with external
                                                                                                                 networks such as WiMAX. Moreover, seamless
                                                                                                                 mobility is still an issue as clients may be equipped
                                                                                                                 with both interfaces, and the vehicle gateway may also
                                                                                                                 give support to WiFi external connectivity through
                                                                                                                 dual gateways /interfaces (WiFi/ WiMAX) in order to
                                                                                                                 offer fault tolerance and load balance between
                                                                                                                 networks as well as new connectivity opportunities to
                                                                                                                 passengers. Apart from serving the movement
                                                                                                                 network, the mobile gateway can also be used by
                                                                                                                 external clients, such as those outside the WiFi AP
  Fig. 9. Data Rate in the Heterogeneous Network                                                                 and WiMAX BS coverage areas, but that have the
  Environment (No. of Users vs. data rate in Gbps)                                                               opportunity to download data or attain Internet access
                                                                                                                 through the dual gateway belonging to the vehicular
  By considering the positive measures of the above                                                              area network (VAN).
  mentioned networks and by having a thorough
  understanding between the available networks the                                                               REFERENCES
  heterogeneous        network     is    designed.   The
  heterogeneous        network      provides    maximum                                                     [1] Fei Yu and Vikram Krishnamurthy, “ Optimal Joint
  throughput, minimum number of handoffs and                                                                    Session Admission Control in integrated WLAN and
  maximum coverage at mobile. By designing a proper                                                             CDMA Cellular Networks with Vertical Handoff”,
  QoS standard and having proper understanding                                                                  IEEE Transaction on Mobile Computing, vol 6, No. 1,
  between the network the desires which are explained                                                           pp. 126 – 139, Jan’ 2007.
  at the initial paragraph can be achieved. By improving                                                    [2] Jaroslav Holis and Pavel Pechac,” Elevation
  the performance measures by deploying and allocating                                                          Dependent      Shadowing          Model     for       Mobile
  the code spectrum for the 3GPP network and by                                                                 Communications via High Altitude Platforms in Built-
                                                                                                                Up Areas” - IEEE TRANSACTIONS ON
     ANTENNAS AND PROPAGATION, VOL. 56, NO.                          Mobility Patterns in Wireless Networks” - IEEE
     4, APRIL 2008                                                   TRANSACTIONS                ON         VEHICULAR
[3] Sándor Imre, “Dynamic Call Admission Control for                 TECHNOLOGY, VOL. 56, NO. 1, JANUARY 2007
     Uplink in 3G/4G CDMA-Based Systems” - IEEE                 [15] Dusit Niyato and Ekram Hossain, “A Non-
     TRANSACTIONS              ON          VEHICULAR                 cooperative Game-Theoretic Framework for Radio
     TECHNOLOGY, VOL. 56, NO. 5, SEPTEMBER                           Resource Management in 4G Heterogeneous Wireless
     2007                                                            Access Networks” - IEEE TRANSACTIONS ON
[4] Yan Zhang, and Masayuki Fujise, “Location                        MOBILE COMPUTING, VOL. 7, NO. 3, MARCH
     Managemenet Congestion Problem in Wireless                      2008
     Networks”     -   IEEE     TRANSACTIONS        ON          [16] Marcus L. Roberts, Michael A. Temple, Richard A.
     VEHICULAR TECHNOLOGY, VOL. 56, NO. 2,                           Raines, Robert F. Mills, and Mark E. Oxley,
     MARCH 2007                                                      “Communication Waveform Design Using an
[5] Daniel Morris and A. Hamid Aghvami, “Location                    Adaptive Spectrally Modulated, Spectrally Encoded
     Management Strategies for Cellular Overlay                      (SMSE) Framework”          - IEEE JOURNAL OF
     Networks—A Signaling Cost Analysis” - IEEE                      SELECTED TOPICS IN SIGNAL PROCESSING,
     TRANSACTIONS ON BROADCASTING, VOL. 53,                          VOL. 1, NO. 1, JUNE 2007
     NO. 2, JUNE 2007                                           [17] Christian Makaya and Samuel Pierre, “An
[6] Haining Chen, Hongyi Wu, Sundara Kumar and                       Architecture for Seamless Mobility Support in IP-
     Nian-Feng Tzeng, “Minimum-Cost Data Delivery in                 Based Next-Generation Wireless Networks” - IEEE
     Heterogeneous Wireless Networks”          - IEEE                TRANSACTIONS                ON         VEHICULAR
     TRANSACTIONS              ON          VEHICULAR                 TECHNOLOGY, VOL. 57, NO. 2, MARCH 2008
     TECHNOLOGY, VOL. 56, NO. 6, NOVEMBER                       [18] Haiyun Luo, Xiaqiao Meng, Ram Ramjee, Prasun
     2007                                                            Sinha, and Li (Erran) Li, “The Design and Evaluation
[7] Yuh-Shyan Chen, Ming-Chin Chuang, and Chung-Kai                  of Unified Cellular and Ad Hoc Networks” - IEEE
     Chen, “DeuceScan: Deuce-Based Fast Handoff                      TRANSACTIONS ON MOBILE COMPUTING,
     Scheme in IEEE 802.11 Wireless Networks” - IEEE                 VOL. 6, NO. 9, SEPTEMBER 2007\
     TRANSACTIONS              ON          VEHICULAR            [19] Anthony Almudevar, “Approximate Calibration-Free
     TECHNOLOGY, VOL. 57, NO. 2, MARCH 2008                          Trajectory Reconstruction in a Wireless Network” -
[8] Mohammed M. Olama, Seddik M. Djouadi, Ioannis                    IEEE       TRANSACTIONS            ON      SIGNAL
     G. Papageorgiou, and Charalambos D. Charalambous                PROCESSING, VOL. 56, NO. 7, JULY 2008
     “Position and Velocity Tracking in Mobile Networks         [20] Yi Yuan-Wu and Ye Li, “Iterative and Diversity
     Using Particle and Kalman Filtering With                        Techniques for Uplink MC-CDMA Mobile Systems
     Comparison”      - IEEE TRANSACTIONS ON                         With Full Load” - IEEE TRANSACTIONS ON
     VEHICULAR TECHNOLOGY, VOL. 57, NO. 2,                           VEHICULAR TECHNOLOGY, VOL. 57, NO. 2,
     MARCH 2008                                                      MARCH 2008
[9] Archan Misra, Abhishek Roy and Sajal K. Das,
     “Information-Theory Based Optimal Location
     Management Schemes for Integrated Multi-System             [21] Abhishek Roy, Archan Misra, and Sajal K. Das,
     Wireless Networks” - IEEE/ACM TRANSACTIONS                      “Location Update versus Paging Trade-Off in Cellular
     ON NETWORKING, VOL. 16, NO. 3, JUNE 2008                        Networks: An Approach Based on Vector
[10] Yang Xiao, Yi Pan and Jie Li, “Design and Analysis              Quantization” - IEEE TRANSACTIONS ON
     of Location Management for 3G Cellular Networks” -              MOBILE COMPUTING, VOL. 6, NO. 12,
     IEEE TRANSACTIONS ON PARALLEL AND                               DECEMBER 2007
     DISTRIBUTED SYSTEMS, VOL. 15, NO. 4, APRIL                 [22] Enrique Stevens-Navarro, Yuxia Lin, and Vincent W.
     2004                                                            S. Wong, “An MDP-Based Vertical Handoff Decision
[11] Di-Wei Huang, Phone Lin and Chai-Hien Gan,                      Algorithm for Heterogeneous Wireless Networks” -
     “Design and Performance Study for a Mobility                    IEEE     TRANSACTIONS              ON      VEHICULAR
     Management Mechanism (WMM) Using Location                       TECHNOLOGY, VOL. 57, NO. 2, MARCH 2008
     Cache for Wireless Mesh Networks” -IEEE                    [23] Sai Shankar N and Mihaela van der Schaar,
     TRANSACTIONS ON MOBILE COMPUTING,                               “Performance Analysis of Video Transmission Over
     VOL. 7, NO. 5, MAY 2008                                         IEEE 802.11a/e WLANs”- IEEE TRANSACTIONS
[12] Yi-hua Zhu and Victor C. M. Leung, “Optimization                ON VEHICULAR TECHNOLOGY, VOL. 56, NO. 4,
     of Sequential Paging in Movement-Based Location                 JULY 2007
     Management Based on Movement Statistics” - IEEE            [24] Abhishek Roy, Archan Misra and Sajal K. Das,
     TRANSACTIONS              ON          VEHICULAR                 “Location Update versus Paging Trade-Off in Cellular
     TECHNOLOGY, VOL. 56, NO. 2, MARCH 2007                          Networks: An Approach Based on Vector
[13] Ramón M. Rodríguez-Dagnino, and Hideaki Takagi,                 Quantization” - IEEE TRANSACTIONS ON
     “Movement-Based Location Management for General                 MOBILE COMPUTING, VOL. 6, NO. 12,
     Cell Residence Times in Wireless Networks” - IEEE               DECEMBER 2007
     TRANSACTIONS              ON          VEHICULAR            [25] Dusit Niyato, and Ekram Hossain, “A Noncooperative
     TECHNOLOGY, VOL. 56, NO. 5, SEPTEMBER                           Game-Theoretic Framework for Radio Resource
     2007                                                            Management in 4GHeterogeneous Wireless Access
[14] Wenchao Ma, Yuguang Fang and Phone Lin,                         Networks” - IEEE TRANSACTIONS ON MOBILE
     “Mobility Management Strategy Based on User          91         COMPUTING, VOL. 7, NO. 3, MARCH 2008

   Effect of Weighting Scheme to QoS Properties in Web Service Discovery
               ¹Agushaka J. O., Lawal M. M., Bagiwa, A. M. and Abdullahi B. F.

               Mathematics Department, Ahmadu Bello University Zaria-Nigeria



  Specifying QoS properties can exclude good web services that the user would otherwise have considered, because the algorithm used strictly enforces a match between the QoS properties of the consumer and those of the available services. A situation may therefore arise in which some services do not have everything the user specifies but are rated highly in the properties they do have. With tradeoffs specified in the form of weights, these services can be made available to the user for consideration. This assertion follows from the fact that the user's requirements for the specified QoS properties are of varying degree, i.e. he will always prefer one ahead of another. This can be captured in the form of weights, i.e. the most preferred property has the highest weight. If a consumer specifies a light weight for those QoS properties in which a web service is deficient and a high weight for those it has, this minimizes the difference between them. Hence the service can be returned.

Key Words: QoS properties, QoS weighting vector, Distance Measure
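
The tradeoff described in the abstract can be illustrated with a small numeric sketch. It assumes the weighted Euclidean form that the paper introduces in Section 4.1, where each squared difference is multiplied by its weight before summing and taking the square root. The function names, service names, property values and weights below are made-up illustrative values, not data from the paper.

```python
import math

def weighted_distance(service, request, weights):
    """Weighted Euclidean distance: sqrt(sum_h weight_h * (s_h - r_h)^2)."""
    return math.sqrt(sum(w * (s - r) ** 2
                         for s, r, w in zip(service, request, weights)))

def best_match(services, request, weights):
    """Return the name of the service with the smallest weighted distance."""
    return min(services,
               key=lambda name: weighted_distance(services[name], request, weights))

# Hypothetical normalized QoS vectors (three properties, 1.0 is best).
request = [1.0, 1.0, 1.0]
services = {
    "ws_a": [0.9, 0.9, 0.2],  # strong in two properties, deficient in the third
    "ws_b": [0.6, 0.6, 0.6],  # uniformly average
}

equal_weights = [1.0, 1.0, 1.0]   # every property matters equally
skewed_weights = [1.0, 1.0, 0.1]  # the third property is given a light weight

print(best_match(services, request, equal_weights))   # nearest with equal weights
print(best_match(services, request, skewed_weights))  # nearest with skewed weights
```

With equal weights the uniformly average service is nearest to the request; once the third property is given a light weight, the service that is deficient only in that property becomes the nearest match.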

1. Introduction

Web Services are the third generation of web applications; they are modular, self-describing, self-contained applications that are accessible over the Internet (Cubera et al, 2001). A Web Service (sometimes called an XML Web Service) is an application that enables distributed computing by allowing one machine to call methods on other machines via common data formats and protocols such as XML and HTTP. Web Services are typically accessed without human intervention. Web service technology addresses the problem of platform interoperability; however, Plammer and Andrews (2001) showed that the take-off of web services technology has actually been slow, and DuWaldt and Trees (2002) attributed this slow take-off to factors such as a perceived lack of security and transaction support, and also the quality of the web service. Web Services standards like WSDL, SOAP, UDDI and BPEL (eloper/library/ws-bpel.pdf) provide syntax-based interaction and composition of Web Services in a loosely coupled way that does not take into account non-functional specifications like quality of service (QoS) properties such as scalability, performance, accessibility etc. QoS for Web services gives consumers assurance and confidence to use the services; consumers aim to experience good service performance, e.g. low waiting time, high reliability, and


availability to successfully use services. Service registries host hundreds of similar Web services, which makes it difficult for service consumers to choose among them, as the selection is based only on the functional properties although the services differ in the QoS that they deliver. Such variety in QoS is considered an important criterion for Web service selection. Taher, L. et al (2005a) proposed a generic QoS Information and Computation (QoS_IC) framework for QoS-based service selection, in which the QoS selection mechanism utilizes an established Registry Ontology that is used to present the semantics of the proposed framework and its QoS structure. The QoS selection mechanism also uses the Euclidean distance measure to evaluate the similarity between the consumer and provider QoS specifications in the matchmaking process. We try to extend the work of Taher et al (2005a) to accommodate a user defined weighting scheme. This weighting scheme is defined in such a way that the highest weight signifies the most desired QoS property; the weights decrease based on order of priority. Also, the weights normally lie in the range [0,1]. The algorithm presented here is a slight modification of Taher's, as it takes the weighting scheme into consideration. As part of the aim of this paper, we show that the introduction of a weighting scheme into the discovery algorithm can greatly address the issues of "trade off" that can arise in service selection. That is, depending on the weight specification, certain web services can perform better and hence be returned. The examples in this paper helped us in making these assertions. The sections in this paper are organized as follows: related work is given next, closely followed by QoS matching in Taher's work, then our proposed extensions. Detailed examples are given next to prove our assertions. Finally, the conclusion and future work are given.

2. Related work

At the present time, Universal Description, Discovery and Integration of Web services (UDDI) based look-ups for Web services are based on the functional aspects of the desired Web services. In his work, Ran (2003) extended the UDDI model by adding a new role called QoS certifier, which verifies the service provider's QoS claims. Figure 1 is an adaptation from the work of Ran (2003). In his proposed model, Ran assumes that a Web service provider needs to supply information about the company and the functional aspects of the provided service, as requested by the current UDDI registry, as well as quality of service information related to the proposed Web service. The claimed quality of service needs to be certified and registered in the repository.


Figure 1: Ran's UDDI model

The consumer searches the UDDI registry for a Web service with the required functionality as usual; they can also add constraints to the search operation. One type of constraint is the required quality of service. If there were multiple Web services in the UDDI registry with similar functionalities, then the quality of service requirement would enforce a finer search. The search would return a Web service that offers the required functionality with the desired set of quality of service. If there is no Web service with these qualities, feedback is given to the consumer. This approach lacks support for the dynamic nature of these QoS properties. The approach of Taher et al (2005a) takes this dynamic nature of QoS properties into account. We implemented his work using a different similarity metric and also improved his matching algorithm to accommodate a user defined weighting scheme for the QoS properties. The user gives a higher weight to whichever property he desires most. Apart from these two works, several attempts have been made to add QoS specification to the discovery process of web services. Examples of such approaches are: Web Service Level Agreement (WSLA) (Keller, A. et al, 2002), Web service-QoS (Ws-QoS) (Tian, M., 2004), Web Service Offering Language (WSOL) (Tosic, V., 2003), SLAng (Lamanna, D., 2003), UDDI eXtension (UX) (Chen, Z., 2003) and UDDIe (Ali, S., 2003). A comparison of the work of Taher et al (2005a) and other approaches is given in (Taher et al, 2005b). This serves as the basis for our selection of Taher's work for improvement. None of these approaches addresses issues related to adapting the consumers to the changing conditions of the providers' systems (Taher et al, 2005a).

3. QoS Matching in Taher's Work

The matchmaking problem raises the question of a distance measure between objects. There are many approaches to measuring the distance between any two objects based on their numerical or semantic closeness; the Euclidean distance measure was chosen for the algorithm. In other words, the distance is evaluated as the square root of the sum of squared differences between corresponding elements of the two vectors.

3.1. The QoS matchmaking algorithm

The QoS matchmaking algorithm determines which Web service $ws_i$, from WS, $WS = \{ws_1, ws_2, ws_3, \ldots, ws_n\}$, is selected based on the consumer's QoS


specifications (QPc). For that purpose, the QoS Manager constructs an $n \times k$ QoS matrix, where $n$ represents the total number of Web services (WS) that have the same functional properties, and $k$ represents the total number of QoS properties. To compensate for the different measurement units of the different QoS property values ($qp_{i,v}$), the values need to be normalized to be in the range [0, 1]. We will use the following equations to normalize them.

$$\mathrm{norm}(qp_{i,v}) = \frac{qp_{i,v}^{max} - qp_{i,v}}{qp_{i,v}^{max} - qp_{i,v}^{min}} \qquad (1)$$

$$\mathrm{norm}(qp_{i,v}) = \frac{qp_{i,v} - qp_{i,v}^{min}}{qp_{i,v}^{max} - qp_{i,v}^{min}} \qquad (2)$$

Where $qp_{i,v}$ is the QoS property that one wishes to normalize, by minimization using equation-1 or by maximization using equation-2; for example, response time needs to be normalized by minimization using equation-1, while availability needs to be normalized by maximization using equation-2. $qp_{i,v}^{max}$ is the $qp_{i,v}$ that has the maximum value among all values on column $v$, and $qp_{i,v}^{min}$ is the $qp_{i,v}$ that has the minimum value among all values on column $v$. To normalize matrix QoS, we need to define an array of $k$ flags, one for each column $v$ with $1 \le v \le k$. The value of each flag can be either 0 or 1: 0 indicates that column $v$ should be normalized using equation-1, whereas 1 indicates that it should be normalized using equation-2. The key idea of the QoS matchmaking algorithm is to find the nearest $ws_i$ to the QoS specifications of the consumer (QPc), i.e. to find a Web service with a minimum Euclidean distance. Given the QoS mode (qm) and the QoS properties (QPc) submitted by the consumer, the QoS matchmaking algorithm works as follows:

Step-1: Check qm, based on that

Step-2: Construct QoS matrix.

Step-3: Normalize (QPc) using equation-1 and equation-2.

Step-4-5: Normalize QoS matrix using equation-1 and equation-2.

Step-6-7: Compute the Euclidean distance between each QP(wsi) and (QPc).

Step-7: Find $ws_i$ with the minimum distance.

4. Proposed Extensions

In this section, we give the extensions proposed for the model given in Taher et al (2005a). The assumption here is that all other components given in Taher's model remain, except the similarity metric and the QoS matching algorithm. Detail is given in the following sections:

4.1. Similarity Metric

In their work, Taher et al (2005a) used the Euclidean distance to measure the similarity between two vectors. This does not capture any form of weighting for the QoS properties. As an extension to their work, we introduced a user specified weighting vector $weight_h$. As we have said earlier, it is normally between [0,1]. The modified formula is given below:


      • Euclidean distance measure is used to evaluate the similarity between two vectors $x = x_1 \ldots x_k$ and $y = y_1 \ldots y_k$. Here we introduce $weight_h$, which we use to multiply the Euclidean distance.

$$d(x, y) = \sqrt{\sum_{h=1}^{k} (x_h - y_h)^2 \times weight_h}$$

4.2. Assertion

We are saying that the introduction of a weighting scheme will help our algorithm to accommodate the tradeoffs that exist in nature. Specifying QoS properties can exclude good web services that the user would otherwise have considered, as the algorithm strictly enforces a match between the QoS properties of the consumer and those of the available services. That is, some services might not have everything the user specifies but are rated highly in the properties they do have. With tradeoffs specified in the form of weights, these services can be made available to the user for consideration. This assertion follows from the fact that the user's requirements for the specified QoS properties are of varying degree, i.e. he will always prefer one ahead of another. This can be captured in the form of weights, i.e. the most preferred property has the highest weight. If a consumer specifies a light weight for those QoS properties in which a web service is deficient and a high weight for those it has, this minimizes the difference between them. Hence the service can be returned, as case 2 shows.

We will use the same example given in Taher et al (2005a) to prove our assertion.

5. The new algorithm

This new algorithm is an improved version of that given in Taher et al (2005a), because it incorporates a user defined weighting scheme for the desired QoS properties. We are given web services $WS = \{ws_1, ws_2, \ldots, ws_n\}$ that satisfy the user's functional requirements, QoS properties $QP_c$, and QoS property weights based on the user's specified priority. Just as in the work of Taher et al (2005a), this algorithm tries to find which of the web services $ws_i$ best satisfies the consumer's request based on the non-functional specification (QoS). For this purpose, a quality matrix, $\mathbb{Q} = \{V(Q_{ij});\ 1 \le i \le k;\ 1 \le j \le n\}$, is created. This refers to a collection of quality attribute-values for a set of candidate services, such that each row of the matrix corresponds to the value of a particular QoS attribute (in which the user is interested) and each column refers to a particular candidate service. In other words, $V(Q_{ij})$ represents the value of the $i^{th}$ QoS attribute for the $j^{th}$ candidate service. The normalization equations given in Taher's work are used to normalize the QoS properties obtained from the profiles of the web services and $QP_c$ to be in the range [0,1]. Given the QoS mode (qm) as in Taher et al (2005a), the QoS properties ($QP_c$) submitted by the consumer and the QoS weight, the QoS matchmaking algorithm works as follows:

Step-1: Check qm, based on that

                                                                                                                                              ISSN 1947-5500
                                                                (IJCSIS) International Journal of Computer Science and Information Security,
                                                                Vol. 7, No. 3, March 2010

Step-2: Construct the Quality matrix.

Step-3: Normalize the consumer's submitted QoS properties and the Quality matrix.

Step-4: Compute the similarity, using the metric given in the previous section, between each candidate service's QoS values and the consumer's submitted QoS properties.

Step-5: Find the Web service with the minimum distance.

As with Taher's algorithm, the QoS matchmaking algorithm works even if the consumer's request does not include the whole set of QoS properties, since consumers are not expected to specify every QoS parameter defined previously. The complexity is O(n), where n is the total number of Web services that offer the same functionality according to the consumer's functional requirements, because the number of QoS parameters is constant. The complexity could change, however, if the number of QoS properties becomes large.

6. Example

This example is adopted from Taher et al (2005a). It considers a scenario showing how the QoS matchmaking algorithm works.

Assume that the consumer's submitted QoS properties are {0.9, 20, 50, 0.9, 1, 200} and that the QoS mode (max/min) is qm = {1, 0, 1, 1, 1, 0}, with the QoS properties in the order scalability, response time, throughput, availability, accessibility and cost. Also assume that the user specifies the following weighting schemes for the QoS properties:

     1. weight = {0.9, 1, 0.6, 0.4, 0.6, 0.1}
     2. weight = {0.9, 0.1, 1, 0.1, 0.2, 0.9}

These schemes show two different tradeoffs, or variations in what the user wants. Based on the functional specifications, assume that four Web services {ws1, ws2, ws3, ws4} have been returned by UDDI. The QoS matchmaking algorithm retrieves the relevant QoS properties associated with the qm mode of the four Web services and uses them to construct the Quality matrix, as shown in table-1.

Table-1: quality matrix

The QoS matchmaking algorithm continues by normalizing the consumer's submitted QoS properties and the Quality matrix. The QoS values of the submitted properties after normalization are {0.90, 0.00, 0.17, 0.75, 1.00, 0.75}. The QoS values of the Quality matrix after normalization are shown in table-2.

Table-2: normalized QoS matrix

The algorithm then calculates the similarities for the web services by using our extended formula given earlier.

     1. Case 1: using weighting scheme 1

          The algorithm calculates the distances for ws1, ws2, ws3 and ws4 to be


          0.782, 1.215, 1.266 and 0.655 respectively. Since ws4 has the minimum distance, it is returned.

          Note: comparing our result with that in Taher et al (2005a), we see that the distance for every web service is greatly reduced. In particular, the distance for ws1 is significantly smaller than in Taher's work. This is because the user is less interested in the QoS property cost, as shown by its weight, and cost is the property in which the web service differs from the request (as seen from the QoS profiles of both the service and the request).

     2. Case 2: using weighting scheme 2

          The algorithm returns distances for ws1, ws2, ws3 and ws4 of 0.8017, 1.00, 0.7222 and 0.8684 respectively. Since ws3 has the minimum distance, it is returned.

          Note: there is a significant decrease in the distance for ws3. This is because the user is less interested, as indicated by the weights, in the QoS properties for which ws3 is not rated highly. In case 1, ws4 was returned with the least difference, because the weighting scheme for case 1 puts more weight on the areas where ws4 and the user's required QoS properties are most similar. This shows the effect of the weighting scheme in determining which service is returned. Though the consumer wants all the QoS properties he specified, tradeoffs specified in the form of weights help return services which would otherwise not be considered because of their low rating in some QoS properties, which would have caused their difference to increase.

          This case confirms our assertion that weight plays a large part in returning services that actually meet the consumer's needs. There is always a tradeoff among these needs, and such tradeoffs are best captured using a weighting scheme.

As shown in the above example, the results are promising and support our concept. In our ongoing work, we are attempting to provide an implementation of this work, which we hope will be based on the implementation of Taher's work.

7. Conclusion

Quality of Service selection for Web services is becoming a significant challenge. We proposed an advanced QoS-based selection framework that improves on the work of Taher et al (2005a), which manages Web service quality and provides mechanisms for QoS updates. The proposed improvement preserves the architecture presented in Taher's work and proposes an extension that includes a user-defined weighting scheme in both the metric used to calculate similarities and the matching algorithm. This extension can be added seamlessly without any change to the architecture in Taher's work, and it can also be customized for a specific domain. From the example given herein, especially in case 2,


we see that adding a weighting scheme to the QoS matching process greatly affects which service is returned. A low weight on a QoS property in which a web service has a low rating makes that property insignificant and greatly affects the selection of the service, as can be seen in case 2 of our example. Our work also relies on the assumption in Taher's work that the QoS modes have been derived under static network conditions; in practice, network factors would have a direct effect on many QoS properties such as response time. We are working on addressing this limitation in our future work.

8. References

Curbera, F., Nagy, W. and Weerawarana, S. (2001). Web Services: Why and How.
      In Workshop on Object-Oriented Web Services – OOPSLA 2001,
      Tampa, Florida, USA.

Plummer, D. and Andrews, W. (2001). The Hype Is Right: Web Services Will
      Deliver Immediate Benefits.

Duwaldt and Trees (2002). Web Services A Technical Introduction, DEITEL™
      Web Services Publishing.

WSDL: Web Services Description Language,

SOAP Version 1.2 Part 1: Messaging Framework [online] Available
      part1/.

UDDI Version 3.0 [online]
      published-20020719.htm

Business Process Execution Language for Web Services Version 1.1. [online]
      available
      e/developer/library/ws-bpel.pdf

L. Taher, R. Basha, H. El Khatib (2005a). "A Framework and QoS Matchmaking
      Algorithm for Dynamic Web Services Selection", The Second International
      Conference on Innovations in Information Technology (IIT'05).

Ran, S. (2003). A Model for Web Services Discovery with QoS.

A. Keller, H. Ludwig (IBM). "The WSLA Framework: Specifying and Monitoring
      of Service Level Agreements for Web Services", IBM research report
      RC22456, 2002,
      es/paper_search.shtml

M. Tian, A. Gramm, H. Ritter, J. Schiller, R. Winter (2004). "A Survey of
      current Approaches towards Specification and Management of Quality of
      Service for Web Services". Freie Universität Berlin, Institut für
      Informatik.

V. Tosic, B. Pagurek, K. Patel, B. Esfandiari, W. Ma (2003). "Management
      Applications of the Web Service Offerings Language (WSOL)", Proc. of
      the 15th International Conference

      on Advanced Information Systems Engineering (CAiSE'03).

D. D. Lamanna, J. Skene, W. Emmerich (2003). "SLAng: A language for
      defining Service Level Agreements", The Ninth IEEE Workshop on Future
      Trends of Distributed Computing Systems (FTDCS'03).

Z. Chen, C. Liang-Tien, B. Silverajan, L. Bu-Sung (2003). "UX – An
      Architecture Providing QoS-Aware and Federated Support for UDDI",
      Proc. of the 2003 International Conference on Web Services (ICWS'03).

A. S. Ali, O. F. Rana, R. Al-Ali, and D. W. Walker (2003). "UDDIe: An
      Extended Registry for Web Services". In Workshop on Service Oriented
      Computing: Models, Architectures and Applications at SAINT
      Conference. IEEE Computer Society.

L. Taher, R. Basha, H. El Khatib (2005b). "QoS Information & Computation
      (QoS-IC) Framework for QoS-based Discovery of Web Services", to
      appear in MOSAIC, Upgrade journal.

Authors profile

Agushaka J. O. obtained his B.Sc. (mathematics with computer science) in
      2005 and his M.Sc. (computer science) in 2010, both at Ahmadu Bello
      University, Zaria, Nigeria. He currently lectures at the same
      institution. He has special interest in semantic web services,
      knowledge representation and software engineering. All other
      co-authors are his colleagues in the department of mathematics of the
      same institution.
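
As an illustration of the QoS matchmaking steps (Step-1 to Step-5) and of the two weighting schemes used in the Example section, the following is a minimal sketch and not the authors' implementation. It assumes that the similarity metric is a weighted Euclidean distance; the actual metric is defined in an earlier section and may differ. The function names `weighted_distance` and `match`, the service names `ws_a` and `ws_b`, and the two quality rows are hypothetical and are not the values of Table-2. Only the normalized request and the two weight vectors are taken from the example.

```python
import math


def weighted_distance(service, request, weights):
    """Weighted Euclidean distance between two normalized QoS vectors.

    ASSUMPTION: the form of this metric is illustrative only.
    """
    return math.sqrt(sum(w * (s - r) ** 2
                         for s, r, w in zip(service, request, weights)))


def match(quality_matrix, request, weights):
    """Return (name, distance) of the service closest to the request."""
    distances = {name: weighted_distance(row, request, weights)
                 for name, row in quality_matrix.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


# Normalized request and weighting schemes as given in the example.
# Property order: scalability, response time, throughput,
# availability, accessibility, cost.
request = [0.90, 0.00, 0.17, 0.75, 1.00, 0.75]
scheme_1 = [0.9, 1, 0.6, 0.4, 0.6, 0.1]
scheme_2 = [0.9, 0.1, 1, 0.1, 0.2, 0.9]

# HYPOTHETICAL normalized quality rows (not the paper's Table-2).
quality = {
    "ws_a": [0.9, 0.0, 0.2, 0.8, 1.0, 0.0],  # differs mainly in cost
    "ws_b": [0.4, 0.9, 0.2, 0.7, 0.9, 0.7],  # differs in scalability, response time
}

best_1, _ = match(quality, request, scheme_1)
best_2, _ = match(quality, request, scheme_2)
```

With these hypothetical rows, scheme 1 (low weight on cost) selects `ws_a`, while scheme 2 (low weight on response time, high weight on cost) selects `ws_b`, which mirrors the paper's observation that the weighting scheme can change which service is returned.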


   Fuzzy Logic of Speed and Steering Control System for Three Dimensional
                   Line Following of an Autonomous Vehicle

Dr. Shailja Shukla
Department of Electrical Engineering
Jabalpur Engineering College Jabalpur
Jabalpur (M.P) India

Mr. Mukesh Tiwari
Department of Electrical Engineering
Jabalpur Engineering College Jabalpur
Jabalpur (M.P) India

Abstract- A major problem of robotics research today is the high barrier to entry caused by system software complexity and the need for a researcher to learn the details, dependencies and intricacies of the complete system. This is because a robot system needs several different modules to communicate and execute in parallel. There are currently few controlled comparisons of algorithms and solutions for a given task, although such comparison is the standard scientific method in other sciences. There is also very little sharing between groups and projects, so code is written from scratch over and over again. This paper describes exploratory research on the design of a modular autonomous mobile robot controller. The controller incorporates a fuzzy logic [8] [9] approach for steering and speed control [37], a fuzzy logic approach for ultrasound sensing and an overall expert system for guidance. The advantages of a modular system are related to portability and transportability, i.e. any vehicle can become autonomous with minimal modifications. A mobile robot test bed has been constructed at the University of Cincinnati using a golf cart base. This cart has full speed control, with guidance provided by a vision system and obstacle avoidance using ultrasonic sensors. The speed and steering fuzzy logic controller is supervised through a multi-axis motion controller. The obstacle avoidance system is based on a microcontroller interfaced with ultrasonic transducers. This microcontroller independently handles all timing and distance calculations and sends distance information back to the fuzzy logic controller via the serial line. This design yields a portable independent system in which high-speed computer communication is not necessary. Vision guidance has been accomplished with CCD cameras that judge the current position of the robot [34] [35] [36]. The vision system is intended to produce a good image, reducing uncertain or wrong commands derived from the ground coordinates, in order to tackle the parameter uncertainties of the system and to obtain a good WMR dynamic response [1]. Here we apply a 3D line following methodology. It transforms 3D coordinates to 2D image coordinates and vice versa, which improves the accuracy of the WMR position. The fuzzy logic controller may give a good command signal; moreover, a highly accurate plant model can be found for designing the controller, taking into account unknown factors such as friction and a dynamic environment. This design, in its modularity, creates a portable autonomous fuzzy logic controller applicable to any mobile vehicle with only minor adaptations.

                      I. INTRODUCTION
    Controller design for any system needs some knowledge about the system. Usually this involves a mathematical description of the relation among the inputs to the process, its state variables, and its output. This description is called the model of the system. The model can be represented as a set of transfer functions for linear time-invariant systems, or as other relationships for non-linear or time-variant systems.
    Modeling of complex systems may be a very difficult task. In a complex system such as a multiple-input, multiple-output system, inaccurate models can lead to unstable systems or unsuitable system performance [9]. Fuzzy Logic Control (FLC) is an effective alternative approach for systems which are difficult to model. The FLC uses the qualitative aspects of the human decision process to construct the control algorithm. This can lead to a robust controller design. The modeling of a mobile robot is a very complex task, and a direct application of FLC can be found in this area. An excellent introduction to the mathematical analysis of mobile robots can be found in [1].
    Even though the visualization and recognition of image information for the guidance of mobile robots have been studied for many years, the design of a mobile vehicle system remains a challenging task, in the sense of determining what information to measure and how to use this information to design an intelligent controller that satisfies the performance specifications of the system.
    Fuzzy logic approaches for modeling control systems for vehicles have been studied in the past. A fuzzy logic controller [8] that guarantees the stability of a control system for a computer-simulated model car, and advanced fuzzy logic applications for automobiles, have been discussed in Altrock et al. [9].

                      II. OBJECTIVES
    The main aspect of intelligent control addressed in this paper is the design of a controller for a mobile robot using fuzzy logic. The design specifications selected here are sufficient for building a robot simulation which can follow a line, avoid obstacles, and adapt to variations in terrain. The adaptive capabilities of mobile robots depend on the fundamental analytical and architectural designs of the sensor systems used.
    The mobile robot provides an excellent test platform for investigations into generic vision-guided robot control, since it is similar to an automobile and is a multi-input, multi-output system. An algorithm has been developed to establish a mathematical and geometrical relationship between the physical three dimensional (3-D) ground coordinates of the line to follow and its corresponding two dimensional (2-D) digitized image coordinates.
    This relationship is incorporated into the vision tracking system to determine the perpendicular distance and angle of the line with respect to the centroid of the robot. The information from the vision tracking system is used as input to a closed-loop fuzzy logic controller to control the steering and the speed of the robot.

                      III. RESEARCH OBJECTIVE
    The main goal of this research is to model a modular Fuzzy Logic Control for an automated guided vehicle and to test the performance of the vehicle in a MATLAB simulation. The research is focused on the design of the Fuzzy Controller for vision and sonar navigation of the automated guided vehicle.
    The design of the controller has been executed in three stages. In the first stage the universe of discourse is identified

and fuzzy sets are defined. The rule base (Fuzzy Control Rules) for the control is then defined through a human decision-making process. The membership functions and their intervals are defined. Aggregation and defuzzification methods are selected. In the second stage the Fuzzy Controller is implemented on the autonomous guided vehicle. In the third and final stage the performance of the controller is tested through a series of simulations and real-time runs of the vehicle.

                     IV. METHODOLOGY
    The purpose of the vision guidance is to guide the robot to follow the line using a digital charge-coupled device (CCD) camera. To do this, the camera needs to be calibrated. Camera calibration [34] is a process to determine the relationship between a given 3-D coordinate system (world coordinates) and the 2-D image plane a camera perceives (image coordinates). More specifically, it determines the camera and lens model parameters that govern the mathematical or geometrical transformation from world coordinates to image coordinates, based on the known 3-D control field and its image. The CCD camera digitizes the line from the 3-D coordinate system to the 2-D image system. Since the process is autonomous, the relationship between the 2-D system and the 3-D system has to be accurately determined so that the robot can follow the line. The objective of this section is to show how a model was developed to calibrate the vision system so that, given any 2-D image coordinate point, the system can mathematically compute the corresponding ground coordinate point. The X and Y coordinates (Z is constant) of two ground points are then computed, from which the angle and the perpendicular distance of the line with respect to the centroid of the robot are determined. The vision system was modeled by the following equations:

   xPI = A11 Xg + A12 Yg + A13 Zg + A14 ............ (1)
   yPI = A21 Xg + A22 Yg + A23 Zg + A24 ............ (2)

where Anm are coefficients, xPI and yPI are the x and y image coordinates, and Xg, Yg, and Zg are the ground coordinates. In transforming the ground coordinate points to the image coordinate points, the following operations are used:
     Translation
     Scaling
     Rotation
     Shear

                 IV.II.I 2D TRANSLATION
     Current position: P = [not recovered]
     New position (after translation): P' = [not recovered]
     Translation operation: T = [not recovered]
     [Equations (3), (4) and (5) not recovered]

                 IV.II.II 2D SCALING
     Fig. 1 Scaling
     Current position: P = [not recovered]
     New position: P' = [not recovered]
     Scaling operation: [Equation (6) not recovered]
     Matrix multiplication

                 IV.II.III 2D ROTATION
     Current position: P = [not recovered]

     New position: P' = [not recovered]
     Rotation operation: [Equation (7) not recovered]
     Positive angles are "counter-clockwise".
     Derivation of rotation: [Equations (8) and (9) not recovered]
     Rotate θ2: [Equations (10) and (11) not recovered]
     Observation (important results from trigonometry): [Equations (12) and (13) not recovered]

     Current position: [not recovered]
     Shear position: [Equation (14) not recovered]
     Geometric meaning
     Shear operation along y axis
     Geometric meaning
     Consider more complicated cases. Various examples are shown in the class.
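
Because the matrices for the four operations did not survive in this copy, the following minimal sketch shows the standard textbook forms of 2D translation, scaling, rotation and shear. It is an illustration under that assumption, not a reconstruction of the paper's equations (3) to (14); the function names are hypothetical, and points are plain `(x, y)` tuples.

```python
import math


def translate(p, t):
    """Translation: P' = P + T, with T = (tx, ty)."""
    return (p[0] + t[0], p[1] + t[1])


def scale(p, sx, sy):
    """Scaling about the origin: x' = sx * x, y' = sy * y."""
    return (sx * p[0], sy * p[1])


def rotate(p, theta):
    """Rotation about the origin by theta radians.

    Positive angles are counter-clockwise:
        x' = x cos(theta) - y sin(theta)
        y' = x sin(theta) + y cos(theta)
    """
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def shear_x(p, k):
    """Shear along the x axis: x' = x + k * y, y' = y."""
    return (p[0] + k * p[1], p[1])


def shear_y(p, k):
    """Shear along the y axis: x' = x, y' = y + k * x."""
    return (p[0], p[1] + k * p[0])


# Example: rotate the point (1, 0) by 90 degrees, then shift it by (2, 3).
moved = translate(rotate((1.0, 0.0), math.pi / 2), (2.0, 3.0))
```

Rotating `(1, 0)` by 90 degrees gives approximately `(0, 1)`, so `moved` is approximately `(2, 4)`. Composing the operations as function calls keeps each step visible; a homogeneous 3x3 matrix form would allow them to be combined into a single matrix product instead.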
    Transformation operations occur on the points: scaling, translation, rotation, perspective, and projective. Solving for the transformation parameters to obtain the relationship between image and ground coordinates is a difficult task. Fortunately, in the model equations given above, the transformation parameters are embedded in the coefficients. To compute the coefficients, a calibration device was built to obtain 12 data points. With the 12 points, a matrix equation was obtained, as shown below:

                                                                                      =        ………………………...................(15)
                                                                                      =        ….. ………………………….... (16)


                             Fig. 2. 2D Rotation

                                                                                  Eqns. (15) & (16) consist of 12 linearly independent equations
                                                                                  and four unknowns; the least-square regression method is
                               Fig. 3. 2D Shear
                                                                                  applied to yield a minimum mean square error solution for the
                                                                                  coefficients. Below are the equations for the solution:
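
As a hedged illustration of the least-squares step just described, the sketch below solves an overdetermined system of 12 equations in four unknowns through the normal equations, A = (C^T C)^{-1} C^T x. The calibration matrix `C`, the coefficient values `true_a`, and all function names are hypothetical stand-ins chosen for the example; they are not the calibration data or code used for the robot.

```python
"""Least-squares fit of four coefficients from 12 calibration points (sketch)."""


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal-equation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(c, x):
    """Return A minimising ||C A - x||^2, i.e. A = (C^T C)^{-1} C^T x."""
    ct = transpose(c)
    ctc = matmul(ct, c)
    ctx = [sum(ci * xi for ci, xi in zip(row, x)) for row in ct]
    return solve(ctc, ctx)


# Hypothetical calibration data: 12 rows (points), 4 columns (unknown coefficients).
true_a = [2.0, -1.0, 0.5, 3.0]
C = [[float(i % 4), float(i // 4), float((i * i) % 5), 1.0] for i in range(12)]
x_pi = [sum(c * a for c, a in zip(row, true_a)) for row in C]

a_hat = least_squares(C, x_pi)
```

Because the synthetic data are noise-free, `a_hat` recovers `true_a` up to rounding; with measured calibration points the same call would return the minimum mean-square-error coefficients.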

                                                                           the resulting echo is detected. The elapsed time between the
$A_1 = (C^T C)^{-1} C^T X_{PI}$ ............ (17)                          start of the transmit pulse and the reception of the echo pulse is
                                                                           measured. Knowing the speed of sound in air, the system can
                                                                           convert the elapsed time into a distance measurement. The
$A_2 = (C^T C)^{-1} C^T Y_{PI}$ ............ (18)
                                                                           drive electronics has two major categories - digital and analog.
Given an image coordinate xPI and yPI, and z ground                        The digital electronics generate the ultrasonic frequency. A
coordinate (the z coordinate of the points with respect to the             drive frequency of 16 pulses per second at 52 kHz is used in
centroid of the robot is maintained constant since the robot is            this application.
run on a flat surface in this model) the corresponding Xg and
Yg ground coordinates are computed as indicated by the                         VI. DESCRIPTION OF THE AUTONOMOUS VEHICLE
following matrix equations.                                                     The system that is to be controlled is an electrically
                                                                           propelled mobile vehicle that was created and assembled during
                                                                           the spring quarter of 1998 in the Advanced Robotics Lab at the
                   ………………………….(19)                                         University of Cincinnati. This vehicle was built as part of the
                                                                           Autonomous Guided Vehicle contest sponsored by the Army
                                                                           Tank Command. A 3D rendering of this mobile vehicle
                                                                           is shown below. The vehicle is constructed of an aluminum
                                                                           frame designed to hold the controller, obstacle avoidance,
                                                                           vision sensing, vehicle power system, and drive components.
                                                                           Two independently driven brushless DC motors are used for
                                                                           both vehicle propulsion and vehicle steering. This
    Note that equations (1) and (19) can be modified to                    independent drive system not only gives the vehicle the capability
accommodate the computation of Zg when an elevation of the                 to move in a straight line and perform turning actions, but
ground surface is considered. The image processing of the
physical points is done by the ISCAN tracking device, which                also allows the vehicle to have a zero turning radius. This
returns the centroid of the brightest or darkest region in a               feature allows the vehicle to turn directly about
computer-controlled window and returns its X and Y                         the center of the drive without requiring forward motion,
coordinates. Two points on the line are windowed and their                 thereby giving the vehicle the ability to navigate through more
corresponding coordinates are computed as described above.                 complicated course requirements. A 3-axis Galil motion
From the computed x and y ground coordinates of the points,                control board is used as the interface between the controller
the angle of the line with respect to the centroid of the robot is
                                                                           CPU and the drive components, including the brushless servo
computed from a simple trigonometric relationship. In the next
section, we shall show how the angle of the line just computed             drive motors and the encoders. A Galil board, the brushless
is used with other parameters to model the steering control of             motor, and an optical encoder provide a closed-loop system
the robot with a fuzzy logic controller.                                   that allows the controller code to accurately specify motor
                                                                           dynamics parameters, including position, velocity and
                                                                           acceleration. The Galil controller contains a digital PID-type
                                                                           controller. This controller is tuned with the Servo Design Kit
                                                                           software package, which selects the Proportional, Integral, and
                                                                           Derivative gains (Kp, Ki, Kd) to optimize the system response.
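
To make the role of the three gains concrete, here is a minimal sketch of a discrete PID velocity loop. It is not the controller on the motion control board: the gain values, the time step, and the first-order motor model (time constant `tau`) are all assumptions chosen only so that the example runs and settles.

```python
class PID:
    """Textbook discrete PID controller (illustrative, not the board firmware)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, target, measured):
        """Return the control command for one sample."""
        error = target - measured
        self.integral += error * self.dt
        # Skip the derivative on the first sample to avoid a start-up spike.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target=1.0, steps=4000, dt=0.01, tau=0.5):
    """Drive an assumed first-order motor model toward the target velocity."""
    pid = PID(kp=2.0, ki=1.0, kd=0.05, dt=dt)  # hypothetical gains
    velocity = 0.0
    for _ in range(steps):
        command = pid.update(target, velocity)
        # Assumed plant: d(velocity)/dt = (command - velocity) / tau
        velocity += dt * (command - velocity) / tau
    return velocity


final_velocity = simulate()
```

With these assumed values the integral term removes the steady-state error, so `final_velocity` ends up close to the target of 1.0. Tuning, in the sense used above, means choosing Kp, Ki, and Kd so that this convergence is fast and well damped for the real motor.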

                     Fig. 4 Line following
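
The "simple trigonometric relationship" used for line following can be sketched as follows. The frame convention is an assumption made for this example (origin at the robot centroid, Y axis pointing forward, X axis to the right), and the function names are hypothetical.

```python
import math


def line_angle_deg(p1, p2):
    """Angle of the line through p1 and p2, measured from the forward (Y) axis."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    return math.degrees(math.atan2(dx, dy))


def line_distance(p1, p2):
    """Perpendicular distance from the robot centroid (origin) to the line."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    return abs(p1[0] * p2[1] - p2[0] * p1[1]) / math.hypot(dx, dy)


# Two windowed points already converted to ground coordinates (made-up values).
near_point = (1.0, 0.0)
far_point = (1.0, 2.0)
angle = line_angle_deg(near_point, far_point)
distance = line_distance(near_point, far_point)
```

For these two points the line runs parallel to the forward axis, one unit to the right of the robot, so the angle is zero and the distance is one.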

                V. OBSTACLE AVOIDANCE
    The obstacle avoidance system consists of six ultrasonic
transducers. An ultrasonic ranging system from Polaroid is
used for the purpose of calibrating the ultrasonic transducers.
An Intel 80C196 microprocessor and a circuit board with a                          Fig. 5 Description of the Autonomous Vehicle
liquid crystal display are used for processing the distance
calculations. The distance value is returned through an RS232                  Once tuned, the controller code is able to simply select
port to the control computer. The system requires an isolated              motor position, velocity and acceleration, and control the
power supply: 10-30 VDC, 0.5 amps. The two major                           trajectory of the mobile vehicle. It is important to distinguish
components of an ultrasonic ranging system are the transducer              these PID components from the control logic used for the vehicle:
                                                                           this controller drives the motor to a specified velocity, whereas
and the drive electronics. In operation, a
                                                                           the control logic selects the value of that specified velocity
pulse of electronic sound is transmitted toward the target and             based on vision system inputs. There are two control system

types that may be used to control an autonomous guided                   Galil control board to minimize the steady-state error between
vehicle. The first is to provide the vehicle with complete               the target velocity and the actual motor velocity. The end
information about the environment and the required path.                 result is the desired vehicle motion at the target velocity. This
    The vehicle then uses navigation sensors and the                     system provides motor output signals for each input sample.
programmed information about the environment to navigate                 The system is therefore able to adjust to a changing
through the course. This method requires extensive                       environment and changing path conditions [1]. The previous
programming to completely define the path and is unable to               control logic for the AGV utilized a classical linear control
navigate in unknown territory. The second method is to                   system with feedback and a PID (Proportional, Integral,
gather information from the environment using external                   Derivative) controller.
sensors and process the information to control the speed and                 This system was selected due to the simplicity of its
steering parameters of the vehicle. Because the AGV is used in           components. However, it proved to present challenges,
an unknown environment, this second method is                            mainly the complexity of the multiple-input /
utilized. For path generation, the input to the controller is            multiple-output design and the difficulty of tuning it with
obtained through two cameras mounted on the front of the                 respect to changing environments. For this reason, a fuzzy
vehicle. These cameras sense the visual image of the path line.          logic controller, which is simpler to understand and to
This information is processed through the controller.                    modify, has been created. To control the vehicle, a control
                                                                         algorithm is developed that can easily be coded for the Galil
    VII. SIMULATION MODEL OF AN AUTONOMOUS                               control hardware interface. The Galil board translates speed
                          VEHICLE                                        commands into output signals which, after proper tuning,
    In the simulation model of the autonomous vehicle there              result in a desired motor angular position, velocity, and
                                                                         acceleration. The membership functions were tuned to
are six inputs and three outputs. The six inputs are
                                                                         improve the performance of the vehicle.
combined by a multiplexer into a single input to the
fuzzy logic controller. All the inputs are evaluated for each                           IX. FUZZY LOGIC CONTROLLER
rule created in the fuzzy rule editor, giving a single output                The fuzzy logic controller [9] [13] uses the fuzzy set and fuzzy
that is decoded by a de-multiplexer. The two outputs for left and        logic theory previously introduced in its implementation. A
right motor speed are in a closed loop whose feedback follows the        detailed reference on how to design a fuzzy controller can be
characteristic of the PID with unity gain. The measured error            found in [29], [30], and [31]. Fuzzy inference system: fuzzy
signal between the desired and actual state of the linear                inference is the actual process of mapping from a given input
closed-loop control system is driven to zero, and the desired            to an output using fuzzy logic [27]. Fuzzy logic starts with the
state is achieved.                                                       concept of a fuzzy set. A fuzzy set is a set without a crisp,
                                                                         clearly defined boundary. It can contain elements with only a
                                                                         partial degree of membership. The MATLAB Fuzzy Logic
                                                                         Toolbox was used to build the initial experimental input fuzzy
                                                                         sets. The first level of the fuzzy system has two inputs, error and
                                                                         error. These inputs are resolved into a number of different
                                                                         fuzzy linguistic sets.
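
A small sketch may help show how two crisp inputs are resolved into linguistic sets and mapped to one output. Everything specific here is an assumption for illustration: the input names (`angle_error`, `distance_error`), the three triangular sets on a normalised range of -1 to 1, the rule table, and the use of a weighted average of singleton outputs in place of the toolbox's own defuzzification.

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 at a and c, 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical linguistic sets on a normalised universe [-1, 1].
SETS = {
    "negative": (-2.0, -1.0, 0.0),
    "zero": (-1.0, 0.0, 1.0),
    "positive": (0.0, 1.0, 2.0),
}

# Hypothetical rule table: (angle set, distance set) -> steering singleton.
RULES = {
    ("negative", "negative"): -1.0,
    ("negative", "zero"): -0.5,
    ("negative", "positive"): 0.0,
    ("zero", "negative"): -0.5,
    ("zero", "zero"): 0.0,
    ("zero", "positive"): 0.5,
    ("positive", "negative"): 0.0,
    ("positive", "zero"): 0.5,
    ("positive", "positive"): 1.0,
}


def fuzzify(x):
    """Return the degree of membership of x in each linguistic set."""
    x = max(-1.0, min(1.0, x))  # clamp to the universe of discourse
    return {name: tri(x, *abc) for name, abc in SETS.items()}


def steer(angle_error, distance_error):
    """Fire every rule with min() and combine outputs by weighted average."""
    mu_a = fuzzify(angle_error)
    mu_d = fuzzify(distance_error)
    num = den = 0.0
    for (set_a, set_d), output in RULES.items():
        strength = min(mu_a[set_a], mu_d[set_d])
        num += strength * output
        den += strength
    return num / den


example = steer(0.5, 0.0)
```

An angle error of 0.5 belongs half to "zero" and half to "positive", which illustrates the partial degree of membership mentioned above; tuning the membership functions amounts to moving the triangle corners in `SETS`.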

                                                                                                   X. RULE BASE
                                                                             The way one develops control rules [16] depends on
                                                                         whether or not the process can be controlled by a human
                                                                         operator. If the operator's knowledge and experience can be
                                                                         explained in words, then linguistic rules can be written
     Fig. 6 Simulation model of An Autonomous Vehicle                    immediately. If the operator's skill can keep the process under
                     VIII. ISCAN SYSTEM                                  control, but this skill cannot be easily expressed in words, then
    The ISCAN system returns two variables: the distance of              control rules may be formulated based on the observation of
the vehicle from the border line and the relative angle between          operator's actions in terms of the input - output operating data.
the path of the vehicle and the border line. The controller              However, when the process is exceedingly complex, it may
utilizes these two inputs to select motor speeds that will drive         not be controllable by a human expert. In this case, a fuzzy
the vehicle to follow the specified path without crossing the            model of the process is built and the control rules are derived
boundary lines. The vehicle can be driven in a straight line by
specifying equal angular velocities of the motors, or                    theoretically.
driven in a turn by specifying a difference between the angular              It should be noted, however, that this approach is quite
velocities of the motors. A schematic of the components of the           complicated and has not yet been fully developed. Therefore
control system is through a fuzzy logic algorithm, the code              the FLC is ideal for complex, ill-defined systems that can be
translates the input from the vision system into target output           controlled by a skilled human operator without the knowledge
velocities for each of the right and left vehicle drive motors.          of their underlying dynamics. In such cases an FL controller is
    These outputs are processed through the Galil motion                 quite easy to design, and implementation is less time
control system, which translates the target output into controller       consuming than for a conventional controller.
output signals. These signals are passed through the amplifiers
to increase the signals to a level that will drive the motors at                                    XI. RESULTS
the proper velocity. The encoders, mounted on the ends of the                The testing of the dual-level fuzzy system controller
motors, supply an angular position feedback signal back to the           explained in that has been done in two steps. First, a

theoretical simulation is run using the MATLAB Fuzzy Logic
Toolbox.                                                                    4. Extreme environment path.
    The input for this simulation was generated using                           In Case 4, an extreme environment with excessive noise in
theoretical test cases. The inputs to the MATLAB batch file                 the form of adverse vision input data and obstacles in sensitive
which is used to run the simulation (M-file) were error and                 positions (on curves, as pointed out in the simulation) has
error. The outputs were steering and the speeds of the motors.              been introduced. To make this case more difficult, steep curves
Two cases were considered. Results of the simulation are                    were used. The simulation presents a failure at the point when
shown in Table 1 (Obstacle avoidance) for each case.                        the robot path crosses the line it is supposed to follow. This
                                                                            extreme environment path graph has a starting time of obstacle
1. Straight line path.                                                      simulation of 0 sec, a settling angle of 60 degrees, and a settling
    The objective of this case is to test if the controller handles         time of 14 sec.
a simple case such as a straight line. This simulation is shown in              This presents a limitation of the fuzzy inference engine.
Fig. 7. The output indicates successful line following; on                  This limitation arises due to the limit on the number of
this path all conditions (settling time, angle, peak                        rules that can be implemented in the present controller. The
time) are zero.                                                             FLC fails due to the excess of adverse parameters introduced in
                                                                            this case. A solution to this problem is to use a neural network
                                                                            to identify such extreme conditions and use dedicated fuzzy
                                                                            inference engines for each case.
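
The proposed remedy, a classifier that routes each input to a dedicated inference engine, can be outlined as below. This is only a structural sketch: the hand-written thresholds stand in for the neural network, the three "engines" are placeholder gains rather than fuzzy rule bases, and every name and number is hypothetical.

```python
def classify_environment(vision_dropout, obstacle_on_curve):
    """Stand-in for the neural-network classifier (thresholds are assumptions)."""
    if vision_dropout > 0.5 and obstacle_on_curve:
        return "extreme"
    if vision_dropout > 0.1 or obstacle_on_curve:
        return "noisy"
    return "nominal"


def nominal_engine(angle_error):
    """Placeholder for a rule base tuned for clean input data."""
    return 1.0 * angle_error


def noisy_engine(angle_error):
    """Placeholder for a more cautious rule base."""
    return 0.5 * angle_error


def extreme_engine(angle_error):
    """Placeholder for a rule base dedicated to extreme conditions."""
    return 0.25 * angle_error


ENGINES = {
    "nominal": nominal_engine,
    "noisy": noisy_engine,
    "extreme": extreme_engine,
}


def steer(angle_error, vision_dropout, obstacle_on_curve):
    """Pick the engine for the detected environment class and apply it."""
    label = classify_environment(vision_dropout, obstacle_on_curve)
    return label, ENGINES[label](angle_error)


label, command = steer(0.8, vision_dropout=0.7, obstacle_on_curve=True)
```

The design point is that each engine keeps a small rule base of its own, so the limit on the number of rules in a single engine no longer bounds the whole controller.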

                    Fig. 7 Straight line path
2. Curved Path.
     In Case 2, the robot is made to follow a curved line. In
this case the input data was free of any noise in the form of                                 Fig. 10 Environment path
obstacles and loss of vision. The curved-path graph has a
starting time of 0 sec, a settling angle of 35 degrees, and a                                           TABLE 1
settling time of 20 sec. The result of this simulation, as seen
in Fig. 8, suggests that the fuzzy inference system was able to                              Obstacle avoidance
direct the vehicle along curved lines.                                      Direction of     Starting     Settling        Settling
                                                                            vehicle          time (sec)   angle θ (deg)   time (sec)
                                                                            Straight line    0            0               0
                                                                            Curved path      0            35              20
                                                                            Angular          2.2          35              19
                                                                            Environment      0            60              14
                      Fig. 8 Curved Path                                                         XII. MOTOR SPEED
3. Angular path.                                                                The drive components include the brushless servo drive motors
    In Case 3, the robot followed a curved line. Noise in the form          and the encoders. A Galil board, the brushless motor, and an
of loss of input vision data and obstacles was used during the data         optical encoder provide a closed-loop system that allows the
collection stage. The angular path has a starting time of 2.2               controller code to accurately specify motor dynamics
sec, a settling angle of 35 degrees, and a settling time of 19 sec.         parameters, including position, velocity and acceleration. The
The simulation presents successful line following. The inference            Galil controller contains a digital PID-type controller.
engine worked successfully for this case.                                       This controller is tuned with the Servo Design Kit
                                                                            software package, which selects the Proportional, Integral, and
                                                                            Derivative gains (Kp, Ki, Kd) to optimize the system response.
                                                                            For the right motor, the torque is 0.6 N/ms at t = 2 sec, the
                                                                            peak-time torque is 0.7 N/ms at t = 2.4 sec, and the
                                                                            settling-time torque is 0.5 N/ms at t = 16 sec; the left motor
                                                                            shows the same values.
                                                                                Once tuned, the controller code is able to simply select
                                                                            motor position, velocity and acceleration, and control the
                                                                            trajectory of the AGV. It is important to distinguish these PID
                                                                            components from the control logic used for the vehicle: this
                                                                            controller drives the motor to a specified velocity, whereas the
                                                                            control logic selects the value of that specified velocity based
                Fig. 9 Angular fluctuations path

on vision system inputs has the code for the M-file. Note that           selected as a soft computing solution for this problem, keeping
the results appear reasonable for both angle and speed. The              in mind its robustness and flexibility. The performance of the
results of these simulations encouraged further tests using a            robot was studied with simulations for five different cases
real-life scenario. As a result, the model has also been                 selected for the study. The FLC shows good stability and
implemented on the mobile robot, which is scheduled to take              response for three of the cases. The problem at hand seems to
part in a nationwide obstacle avoidance and path following               be a complex problem for just one inference engine to handle.
competition.                                                             This limitation arises due to the limit on the number of rules
                                                                         and membership functions that can be used in a single
                                                                         inference engine. A better system performance can be obtained
                                                                         if a FL approach is used. The environment in which the robot
                                                                         runs should be divided into a number of specific classes
                                                                         according to input data. The control system model will contain
                                                                         an identified and classified input data, which will finally fire
                                                                         the right inference engine for the input data class. From the
                                                                         results obtained in the MATLAB simulation and the
                                                                         preliminary testing of the model on the robot, it can be
                                                                         concluded that the model presented can be reliably and
                                                                         successfully implemented permanently on the robot. Fuzzy
                  Fig. 11 Right motor speed                              logic has been proven to be an excellent solution to control
                                                                         problems where the number of rules for a system is finite and
                                                                         can be easily established [16]. In this application an
                                                                         infinite number of rules can also be established. The fuzzy
                                                                         control in a way acts as a learning system control, as it has the
                                                                         ability to learn from situations where it fails. This learning is
                                                                         possible by increasing the number of rules in the system. In
                                                                         this way the system can keep on learning until it becomes a
                                                                         perfect system.

                                                                             [1]. P. F. Muir and C. P. Neuman, ‘Kinematic Modeling of Wheeled
                                                                                  Mobile Robots,’ Journal of Robotic Systems, 4(2), 1987, pp. 281-
                   Fig. 12 Left motor speed
                                                                             [2]. E. L. Hall and B. C. Hall, Robotics: A User-Friendly Introduction,
                                                                                    Holt, Rinehart, and Winston, New York, NY, 1985, pp. 23.
      Motor         Torque /      Peak-time       Settling-time              [3].   Z. L. Cao, S. J. Oh, and E. L. Hall, “Dynamic omni-directional
                    time          torque / time   torque / time                     vision for mobile robots,” Journal of Robotic Systems, 3(1), 1986,
                    (N/ms / sec)  (N/ms / sec)    (N/ms / sec)                      pp. 5-17.
      Right motor   0.6/2         0.7/2.4         0.5/16                     [4].   Z. L. Cao, Y. Y. Huang, and E. L. Hall, “Region Filling
                                                                                    Operations with Random Obstacle Avoidance for Mobile Robots,”
                                                                                    Journal of Robotic Systems, 5(2), 1988, pp. 87-102.
                                                                             [5].   S. J. Oh and E. L. Hall, “Calibration of an omni-directional vision
      Left motor    0.6/2         0.7/2.4         0.5/16                            navigation system using an industrial robot,” Optical Engineering,
                                                                                    Sept. 1989, Vol. 28, No. 9, pp. 955-962.
                                                                             [6].   R. M. H. Cheng and R. Rajagopalan, “Kinematics of Automated
                                                                                    Guided Vehicles with an Inclined Steering Column and an Offset
                      XIII. CONCLUSION
                                                                                    Distance: Criteria for Existence of Inverse Kinematic Solution,”
    The design and implementation of a modular fuzzy logic
based controller for an autonomous mobile robot for line                            Journal of Robotic Systems, 9(8), Dec. 1992, pp. 1059-1081.
following along with position control with respect to an                     [7].   M. P. Ghayalod, E. L. Hall, F. W. Reckelhoff, B. O. Mathews and
obstacle course has been presented. The control algorithm for                       M. A. Ruthemeyer, “Line Following Using Omni-directional
this application is based on vision navigation. The                                 Vision,” Proc. of SPIE Intelligent Robots and Computer Vision
development of the FLC [8] [9] was accomplished
                                                                                    Conf., SPIE Vol. 2056, Boston, MA, 1993, page 101.
after a detailed study of an autonomous guided vehicle and
its environment. A rule base was generated using expert                      [8].   Kazuo Tanaka, “Design of Model-based Fuzzy Controller Using
system knowledge. Fuzzy membership functions and fuzzy                              Lyapunov’s Stability Approach and Its Application to Trajectory
sets were developed. The FLC model was first tested on the                          Stabilization of a Model Car,” Theoretical Aspects of Fuzzy
MATLAB fuzzy logic toolkit with some special cases. A                               Control, John Wiley & sons, 1995. 2nd IEEE Conference on fuzzy
number of tests were run to analyze the stability and response
                                                                                    system, San Francisco, CA, 1993, Inc, pp.31-50.
of the system under fuzzy control in a real life scenario.
Tuning of the system in form of adjusting the membership                     [9].   C. V. Altrock et al., “Advanced fuzzy logic control technologies in
functions and the rules has been accomplished to improve the                        automotive applications”, Proceedings of 1st IEEE international
stability of the FLC. The fuzzy logic control is a very flexible                    Conference on Fuzzy Systems, 1992, pp. 835-842.
and robust soft computing tool for control. The number of
variants involved in the current application present a challenge
for any type of control system. A fuzzy logic control was

[10]. L. A. Zadeh, "Outline of a New Approach to the Analysis of Complex Systems and Decision Processes", 1973, IEEE Transactions on Systems, Man and Cybernetics, v3, pp. 28-44.
[11]. E. H. Mamdani, "Fuzzy Sets for Man-Machine Interaction", 1977, IEEE Transactions on Computer, v26, pp. 1182-1191.
[12]. J. F. Baldwin and N. C. F. Guild, "Feasible Algorithms For Approximate Reasoning Using Fuzzy Logic", 1980, Fuzzy Sets and Systems, v3, pp. 225-251.
[13]. J. F. Baldwin and B. W. Pilsworth, "Axiomatic Approach For Approximate Reasoning with Fuzzy Logic", 1980, Fuzzy Sets and Systems, v3, pp. 193-219.
[14]. J. B. Kiszka, M. M. Gupta and G. M. Trojan, "Multi Variable Fuzzy Controller under Goedel's Implication", Fuzzy Sets and Systems, v34, pp. 301-321.
[15]. Z. Cao and A. Kandel, "Applicability of some Fuzzy Implication Operators", 1989, Fuzzy Sets and Systems, v31, pp. 151-186.
[16]. A. Bardossy, L. Duckstein, "Fuzzy Rule based Modeling with Application to Geophysical, Biological and Engineering Systems", 1995, CRC Press, New York, pp. 62-68.
[17]. H. Hellendoorn, R. Palm, "Fuzzy Systems Technologies at Siemens R & D", 1994, Fuzzy Sets and Fuzzy Systems, Vol. 63, pp. 245-269.
[18]. M. Jamshidi, N. Vadiee, T. Ross, "Fuzzy Logic Control", 1993, PTR Prentice Hall, Englewood Cliffs, New Jersey 07632, pp. 89-101.
[19]. P. King, E. Mamdani, "The application of Fuzzy Control Systems to Industrial Process", 1977, Automatica, Vol. 3, pp. 235-242.
[20]. R. Stenz, U. Kuhn, "Automation of a batch distillation column using Fuzzy and Conventional Control", 1995, IEEE Transactions on Systems Technology, Vol. 3, pp. 171-176.
[21]. T. Takahashi, M. Kitou, M. Asai, M. Kido, T. Chiba, J. Kawakami, Y. Matsui, "A new voltage equipment using Fuzzy Inference", 1994, Electrical Engineering in Japan, Vol. 114, pp. 18-32.
[22]. C. von Altrock, Fuzzy Logic and Neuro Fuzzy Applications Explained, Prentice Hall PTR, Englewood Cliffs, NJ, 07632, page 78.
[23]. J. Yen, R. Langari, L. Zadeh, "Industrial applications of Fuzzy Logic and Intelligent Systems", 1995, IEEE Press, IEEE Inc, New York, pp. 08-09.
[24]. T. Yamakawa, "Stabilization of an inverted pendulum by a high-speed Fuzzy Logic Controller Hardware System", Fuzzy Sets and Systems, Vol. 32, pp. 161-180.
[25]. Charles P. Coleman and Datta Godbole, University of California at Berkeley, “A Comparison of Robustness: Fuzzy Logic, PID, & Sliding Mode Control”, 1996, pp. 06-08.
[26]. Z. Yuzhou, R. McLauchlan, R. Challoo and S. Omar, "Fuzzy Logic Control of a four-link Robotic Manipulator in a Vertical Plane", Proceedings of the Artificial Neural Networks in Engineering (ANNIE '96) Conference, held November 10-13, 1996, St. Louis, US, pp. 198-200.
[27]. J.-S. Roger Jang and Ned Gulley, Fuzzy Logic Toolbox For Use with MATLAB, The MathWorks Inc., 1995, pp. 90-92.
[28]. Bart Kosko, “Neural Networks and Fuzzy Systems - A Dynamical Systems Approach To Machine Intelligence”, Prentice Hall, Englewood Cliffs, NJ 07632, pp 45-
[29]. N. Gulley, "How To Design Fuzzy Logic Controllers", Machine Design, November 1992, page 26. 62
[30]. Mark Kantrowitz, Erik Horstkotte, and Cliff Joslyn, “The Internet fuzzy-logic FAQ ("frequently-asked questions") list”, postings for , 1996
[31]. Kevin Self, "Designing With Fuzzy Logic", IEEE SPECTRUM, November 1990, Volume 105, pp. 42-44.
[32]. T. Takagi, M. Sugeno, "Fuzzy Identification of Systems and its Application to Modeling and Control", 1985, IEEE Transactions on Systems, Man, and Cybernetics, V. SMC-15, pp. 116-132.
[33]. Sung, Eric; Loon, Ng Kok; Yin, Yee Chiang, "Parallel linkage steering for an automated guided vehicle", IEEE Control Systems Magazine, v 9 n 6, Oct 1989, pp. 3-8.
[34]. T. H. Lee, F. H. F. hung, "Position Control for Wheeled Mobile Robots Using a Fuzzy Logic Controller", 1999, IEEE.
[35]. G. Campion, G. Bastin and B. D'Andrea-Novel, "Structural properties and classification of kinematic and dynamic models of wheeled mobile robots," IEEE Trans. Robotics and Automation, vol. 12, no. 1, pp. 41-62, Feb. 1996.
[36]. Ernest Hall, Donald Rosselot, Mark Aull and Manohar Balapa, "A Comparison of Two and Three Dimensional Imaging", Center for Robotics Research, University of Cincinnati.
[37]. N. Kehtarnavaz, E. Nakamura, N. Griswold, J. Yen, "Autonomous Vehicle Following by a Fuzzy Logic Controller", Departments of Electrical Engineering and Computer Science, Texas A&M University, College Station, TX 77843.

                      AUTHORS PROFILE

Dr. Shailja Shukla received the B.E. degree in Electrical Engg. from Jabalpur Engg. College, Jabalpur in 1984 and the Ph.D. degree in Control System from Rajiv Gandhi Technical University, Bhopal in 2002. She is currently Professor in Electrical Engg. and the Chairperson of the Department of Computer Science and Engg. at Jabalpur Engg. College, Jabalpur. Her research interests include Large Scale Control Systems, Soft Computing, Machine Learning, Face Recognition and Digital Signal Processing. She has been the Organizing Secretary of the International Conference on Soft Computing and Intelligent Systems. She has published more than 40 research papers in international/national journals and conferences. She is an editorial member of many international journals.

Mr. Mukesh Tiwari was born in Katni, Madhya Pradesh, on 27th November 1983. He received the B.E. degree in Electronic & Communication from Rewa Institute of Technology, Rewa in 2007. He is a student of Master of Engineering (Control System) in the Department of Electrical Engineering, Jabalpur Engineering College, Jabalpur (M.P.), India. His research interests are fuzzy logic, communication, and control systems. He has published one international journal paper and three national conference papers.

 A reversible high embedding capacity data hiding technique for hiding secret data in images

Mr. P. Mohan Kumar, Asst. Professor, CSE Department, Jeppiaar Engineering College, Chennai, India.
Dr. K. L. Shunmuganathan, Professor and Head, CSE Department, R.M.K. Engineering College, Chennai, India.

Abstract -- As multimedia and internet technologies grow rapidly, the transmission of digital media plays an important role in communication. Digital media such as audio, video and images are transferred through the internet, where they are exposed to many threats, and a number of security techniques have been employed to protect such data. This paper proposes a new technique for sending secret messages securely using steganography. The proposed system uses multiple levels of security for data hiding: the data is hidden in an image file, and the resulting stego file is again concealed in another image. Before hiding, the secret message is encrypted with an encryption algorithm, which ensures highly secure data transfer through the internet.

Keywords – steganography, watermarking, stego image, payload

                      I.   INTRODUCTION

          Steganography is the technique of hiding information. The primary goal of cryptography is to make data unintelligible to a third party, whereas the goal of steganography is to hide the data from a third party. There are many steganographic methods, ranging from invisible ink and microdots to hiding a secret message in the second letter of each word of a large body of text and spread spectrum radio communication. With the vast development of computers and the internet, there are many other methods of hiding information [1], such as:
     a. Covert channels
     b. Concealment of text messages within Web pages
     c. Hiding files in "plain sight"
     d. Null ciphers
     One of the most important applications of steganography is digital watermarking. A watermark is the replication of an image, logo, or text on paper stock so that the source of the document can be at least partially authenticated. A digital watermark can accomplish the same function; an artist can post sample images on his website with an embedded signature so that he can prove his ownership in case others attempt to steal his work or present it as their own.

          The following formula provides a very generic description of the steganographic process:
Cover data + hidden data + stego key = stego data
          In this formula, the cover data is the file in which we will hide the hidden data, which may also be encrypted using the stego key. The resultant file is the stego data, which will be of the same type as the cover data [2]. The cover data and stego data are typically image or audio files. In this paper, we focus on image files and discuss the existing techniques of image steganography.

          Before discussing how information is hidden in an image file, we should understand how images are stored. An image file is simply a binary file containing a binary representation of the color or light intensity of each picture element (pixel) comprising the image.

          Images normally use either 8-bit or 24-bit color. With 8-bit color, up to 256 colors form a palette for the image, and each color is denoted by an 8-bit value. A 24-bit color scheme uses 24 bits per pixel, which provides a much larger set of colors. In this case, each pixel is represented by three bytes, each byte representing the intensity of one of the three primary colors red, green, and blue (RGB), respectively [3].

          The size of an image file is directly related to the number of pixels and the granularity of the color definition. A typical 640x480 pixel image using a palette of 256 colors would require a file about 307 KB in size (640 × 480 bytes), whereas a 1024x768 pixel high-resolution 24-bit color image would result in a 2.36 MB file (1024 × 768 × 3 bytes).

          A number of image compression schemes have been developed, such as the Bitmap (BMP), Graphics Interchange Format (GIF), and Joint Photographic Experts


Group (JPEG) file types. However, they cannot all be used in the same way for steganography.

          GIF and 8-bit BMP files use lossless compression, a scheme that allows the software to exactly reconstruct the original image. JPEG, on the other hand, uses lossy compression, which means that the expanded image is very nearly the same as the original but not an exact duplicate. While both of these methods allow computers to save storage space, lossless compression is much better suited to applications where the integrity of the original information must be maintained, such as steganography. Even though JPEG can be used for stego applications, GIF and BMP files are more commonly used for hiding data.

                 II. LITERATURE SURVEY

          The rapid advances of network technologies and digital devices make information exchange fast and easy. However, distributing digital data over public networks such as the Internet is not really secure due to copyright violation, counterfeiting, forgery, and fraud. Therefore, protective methods for digital data, especially sensitive data, are in high demand. Traditionally, secret data can be protected by cryptographic methods such as DES and RSA (Rivest et al., 1978) [4]. The drawback of cryptography is that it can protect secret data in transit, but once the data have been decrypted, their content has no further protection (Cox et al., 2007).

          In addition, cryptographic methods do not hide the very existence of the secret data. Alternatively, confidential data can be protected by using information hiding techniques. Information hiding embeds secret information into cover objects such as written texts, digital images, audio, and videos (Bender et al., 1996) [5]. For greater security, cryptographic techniques can be applied to an information hiding scheme to encrypt the secret data prior to embedding.

          In general, information hiding (also called data hiding or data embedding) includes digital watermarking and steganography (Petitcolas et al., 1999). Watermarking is used for copyright protection, broadcast monitoring, transaction tracking, etc. A watermarking scheme imperceptibly alters a cover object to embed a message about the cover object (e.g., the owner’s identifier) (Cox et al., 2007). The robustness (i.e., the ability to resist certain malicious attacks such as common signal processing operations) of digital watermarking schemes is critical. In contrast, steganography is used for secret communications.

          A steganographic method undetectably alters a cover object to embed a secret message (Cox et al., 2007) [6]. Thus, steganographic methods can hide the very presence of covert communications. Information hiding techniques can be performed in three domains (Bender et al., 1996) [7], namely, the spatial domain (Zhang and Wang, 2006), the compressed domain (Pan et al., 2004), and the frequency (or transformed) domain (Kamstra and Heijmans, 2005; Wu and Frank, 2007; Zhou et al., 2007) [8].

          Each domain has its own advantages and disadvantages in terms of embedding capacity, execution time, storage space, etc. The two main factors that affect an information hiding scheme are the visual quality of stego images (also called visual quality for short) and the embedding capacity (or payload). An information hiding scheme with low image distortion is more secure than one with high distortion because it does not raise the suspicion of adversaries. The second important factor is embedding capacity (also called capacity for short).

          An information hiding scheme with high payload is preferred because more secret data can be transferred [9]. However, embedding capacity and visual quality pull in opposite directions: increasing one tends to reduce the other. Thus, the tradeoff between the two factors varies from application to application, depending on users’ requirements and application fields. Consequently, different techniques are utilized for different applications, and a class of data hiding schemes is needed to span the range of possible applications. Embedding the secret data into an image degrades image quality. Even small image distortion is unacceptable in some applications such as law enforcement, military image systems, and medical diagnosis.

          If a data embedding scheme is irreversible (also called lossy), then a decoder can extract only the secret data, and the original cover image cannot be restored. In contrast, a reversible (also called invertible, lossless, or distortion-free) data embedding scheme allows a decoder to recover the original cover image completely upon the extraction of the embedded secret data [10]. Reversible data hiding schemes are well suited to applications such as the healthcare industry and online content distribution systems.

          To the best of our knowledge, the first reversible data embedding scheme was proposed in 1997 (Barton, 1997). Macq (2000) extended the patchwork algorithm (Bender et al., 1996) [11] to achieve reversibility. This method encounters the underflow and overflow problem (i.e., grayscale pixel values fall outside the allowable range [0, 255]). Honsinger et al. (2001) [12] used modulo arithmetic to resolve the underflow and overflow problem.
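As a rough illustration of how modulo arithmetic keeps pixel values inside [0, 255] while remaining invertible, the sketch below adds a small offset to 8-bit grayscale values modulo 256. It is not Honsinger et al.'s algorithm; the function names `add_mod256` and `sub_mod256` and the offset of 10 are assumptions chosen only for this example.

```python
def add_mod256(pixel: int, delta: int) -> int:
    """Add delta to an 8-bit grayscale value, wrapping around modulo 256."""
    return (pixel + delta) % 256


def sub_mod256(pixel: int, delta: int) -> int:
    """Invert add_mod256; this is what keeps the operation reversible."""
    return (pixel - delta) % 256


if __name__ == "__main__":
    delta = 10  # hypothetical embedding offset, for illustration only
    for original in (100, 250):
        marked = add_mod256(original, delta)
        restored = sub_mod256(marked, delta)
        # A mid-gray pixel moves slightly; a near-white pixel wraps around.
        print(original, marked, restored)
```

The result never leaves [0, 255] and can always be undone, but a bright pixel such as 250 wraps to the dark value 4, which is a large visible change for a small offset.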


Consequently, Honsinger et al.’s method raises the salt-and-pepper effect. Fridrich et al. (2001) [13] also proposed a reversible data embedding method, intended for authentication, so its embedding capacity is low. Later on, De Vleeschouwer et al. (2003) [14] proposed the circular interpretation of bijective transforms to address the underflow and overflow problem. However, the salt-and-pepper problem still remains in De Vleeschouwer et al.’s method.

         As a whole, the aforementioned methods suffer from either the salt-and-pepper problem or low embedding capacity. Tian (2003) [15] proposed a reversible data embedding scheme with high embedding capacity and good visual quality of embedded images (also called stego images). Tian’s scheme is a fragile technique, meaning that the embedded data will be mostly destroyed when common signal processing operations (e.g., JPEG compression) are applied to a stego image. Tian’s method uses the difference expansion (DE) operation to hide one secret bit in the difference value of two neighboring pixels. Thus, the embedding capacity of the DE method is at most 0.5 bits per pixel (bpp) for one-layer embedding. Tian also suggested multiple-layer embedding to achieve higher embedding capacity. Alattar (2004) [16] generalized Tian’s method to embed n − 1 secret bits into a group of n cover pixels. Thus, the embedding capacity of Alattar’s method is at most (n − 1)/n bpp.

         Kamstra and Heijmans (2005) [17] also improved Tian’s method in terms of visual quality at low embedding capacities. The maximum embedding capacity of Kamstra and Heijmans’ method is 0.5 bpp. Chang and Lu (2006) exploited Tian’s method to achieve an average embedding capacity of 0.92 bpp and an average PSNR of 36.34 dB for one-layer embedding. Next, Thodi and Rodriquez (2007) improved Tian’s scheme and proposed a novel method called prediction error expansion (PEE) embedding. The PEE method embeds one secret bit into one cover pixel at a time. However, at its maximum embedding capacity (i.e., around 1 bpp), the PSNR of the PEE method is always less than 35 dB for all test images. Then, Kim et al. (2008) improved Tian’s method by simplifying the location map to achieve higher embedding capacity while keeping the image distortion the same as in the original DE method. Lou et al. (2009) improved the DE method by proposing a multiple-layer data hiding scheme. Lou et al.’s method reduces the difference value of two neighboring cover pixels to enhance the visual quality. The problem with the aforementioned schemes is that the PSNR value becomes very low (i.e., less than 30 dB) at high embedding capacity (i.e., more than 1 bpp).

                      III. PROPOSED SYSTEM

          This section presents our new reversible steganographic scheme with good stego-image quality and high payload, which uses multiple embedding strategies to improve the image quality and the embedding capacity of the DE method. To increase the security of secret data delivery, it is assumed that the secret data have been encrypted with a well-known cryptosystem (e.g., DES or RSA) prior to embedding. Therefore, even if an attacker somehow extracts the secret data from the stego image, the attacker still cannot obtain the real information without the decryption key. The details of the proposed method are described next.

     A. The embedding phase

          Basically, the proposed method embeds one information bit b of the information bit stream into one grayscale cover pixel pair of an original grayscale cover image O sized H × W at a time in raster scan order. Specifically, the proposed scheme consists of two main stages, namely, the horizontal embedding procedure HEm and the vertical embedding procedure VEm. The secret bit stream S, whose length is LS, is divided into two secret bit streams S1 and S2. The lengths of S1 and S2 are denoted as LS1 and LS2, respectively. The information bit stream B1 is created by concatenating the secret bit stream S1 and the auxiliary data bit stream A1. That is, B1 = S1||A1.

          Similarly, the information bit stream B2 is created by concatenating the secret bit stream S2 and the auxiliary data bit stream A2 (i.e., B2 = S2||A2). The generation of A1 and A2 will be described later. The embedding then proceeds in four steps. First, the information bit stream B1 is horizontally embedded into O by using the procedure HEm to obtain the output image T sized H × W. Second, the compressed location map CM1, whose length is LC1 and which will be described later, is embedded into T by using the least significant bit (LSB) replacement technique to obtain the output image U sized H × W. Third, the information bit stream B2 is vertically embedded into U by using the procedure VEm to obtain the output image V sized H × W. Fourth, the compressed location map CM2, whose length is LC2 and which will be described later, is embedded into V by using the LSB replacement technique to obtain the final stego image X sized H × W.
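The LSB replacement steps mentioned above can be sketched generically as follows. This is a minimal illustration of LSB replacement on a flat list of 8-bit pixel values, not the authors' implementation; the function names `lsb_replace` and `lsb_extract` are hypothetical, and returning the overwritten bits is a choice made in this sketch so that the replacement can be undone.

```python
from typing import List, Tuple


def lsb_replace(pixels: List[int], bits: List[int]) -> Tuple[List[int], List[int]]:
    """Write bits into the LSBs of the first len(bits) pixels.

    Returns the modified pixels and the original LSBs that were overwritten.
    """
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the bit stream")
    out = list(pixels)
    saved = []
    for i, bit in enumerate(bits):
        saved.append(out[i] & 1)      # remember the original LSB
        out[i] = (out[i] & ~1) | bit  # clear the LSB, then set it to bit
    return out, saved


def lsb_extract(pixels: List[int], count: int) -> List[int]:
    """Read back the LSBs of the first count pixels."""
    return [p & 1 for p in pixels[:count]]


if __name__ == "__main__":
    marked, saved = lsb_replace([12, 255, 7, 100], [1, 0, 0])
    print(marked, saved, lsb_extract(marked, 3))
```

Each pixel changes by at most 1, and writing the saved bits back with the same function restores the original pixel values.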


         The overview of the proposed embedding process                 as follows. It is noted that LC1 is the length of the
is shown in the following diagram. For the horizontal                   compressed location map CM1 ended with the unique end-
embedding procedure HEm: horizontally scan the cover                    of-map indicator EOM1. Initially, B1 is equal to S1 (i.e.,
image O in raster scan order (i.e., from left to right and top          B1 = S1). During the execution of the procedure HEm, for
to bottom) to gather two neighboring pixels x and y into a              the first LC1 pixels in O, when each pixel has been
cover pixel pair (x, y). If y is an odd value, then the cover           processed for embedding, its LSB is taken as an auxiliary
pixel pair (x, y) is defined as a horizontally embeddable               data bit of A1 and appended to the end of B1. That is, B1 is
pixel pair. Otherwise, the cover pixel pair (x, y) is defined           gradually grown until the LC1 auxiliary data bits in A1 are
as a horizontally non-embeddable pixel pair. Let the set of             concatenated into B1. Finally, the information bit stream is
horizontally embeddable pixel pairs of O be E1 whose                    B1 = S1||A1, which is completely embedded into O.
cardinality is LE1. It is clear that the length of B1 is LE1. The horizontally
non-embeddable pixel pairs are kept unchanged during the horizontal embedding
stage. Each information bit b in B1 is embedded into one horizontally
embeddable pixel pair (x, y) in E1 at a time by using the proposed horizontal
embedding rule HR defined below.

         Fig 1. Embedding Phase of Proposed system

The horizontal embedding rule HR:
         For each horizontally embeddable pixel pair (x, y), we apply the
following embedding rules:
         HR1: If the information bit b = 1, then the stego pixel pair is
computed by (x0, y0) = (x, y).
         HR2: If the information bit b = 0, then the stego pixel pair is
computed by (x0, y0) = (x, y - 1).
         The horizontal embedding rule HR is repeatedly applied to embed each
information bit b in B1 into a cover pixel pair (x, y) in E1 of O until the
whole information bit stream B1 is embedded into O to obtain the output image
T. Note that the proposed horizontal embedding rule HR does not cause
underflow or overflow. That is, the embedded pixel pairs (x0, y0) are
guaranteed to fall in the allowable range [0, 255].
         The auxiliary data bit stream A1 is actually the LSBs of the first
LC1 pixels in the image T and generated

For the vertical embedding procedure VEm:
         Vertically scan the output image U in raster scan order to group two
neighboring pixels u and v into a pixel pair (u, v). If v is an even value,
then the pixel pair (u, v) is defined as a vertically embeddable pixel pair.
Otherwise, the pixel pair (u, v) is defined as a vertically non-embeddable
pixel pair. Let the set of vertically embeddable pixel pairs of U be E2, whose
cardinality is LE2. It is obvious that the length of B2 is LE2. The vertically
non-embeddable pixel pairs are left unchanged during the vertical embedding
stage. Each information bit b in B2 is embedded into one vertically embeddable
pixel pair (u, v) in E2 at a time by using the proposed vertical embedding
rule VR defined below.

The vertical embedding rule VR:
         For each vertically embeddable pixel pair (u, v), we apply the
following embedding rules:
         VR1: If the information bit b = 0, then the final stego pixel pair is
computed by (u0, v0) = (u, v).
         VR2: If the information bit b = 1, then the final stego pixel pair is
computed by (u0, v0) = (u, v + 1).
         The vertical embedding rule VR is iteratively applied to conceal each
information bit b in B2 into a pixel pair (u, v) in E2 of U until the entire
information bit stream B2 is concealed into U to obtain the output image V.
Note that the proposed vertical embedding rule VR does not cause underflow or
overflow: an embeddable v is even, so v <= 254 and v + 1 <= 255. That is, the
final stego pixel pairs (u0, v0) are assured to fall in the allowable range
[0, 255]. Similar to the generation of A1, the auxiliary data bit stream A2 is
actually the LSBs of the first LC2 pixels in the image V and is generated as
follows. Note that LC2 is the length of the compressed location map CM2 ended
with the unique end-of-map indicator EOM2.
         Initially, B2 equals the secret bit stream S2 (i.e., B2 = S2). During
the execution of the procedure VEm, for the first LC2 pixels in the image U,
when each pixel has been processed for embedding, its LSB is taken as an
auxiliary data bit of A2 and appended to the end of B2.
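As a minimal sketch (not the authors' implementation), the vertical embedding rule VR and its inverse, the vertical extracting rule VX given in the extracting phase, can be exercised on a flat list of pixel pairs. The function names embed_vr and extract_vx, the plain Python list used as location map, and the sample pixel values are assumptions made for illustration; location-map compression and the auxiliary LSB stream A2 are left out.

```python
def embed_vr(pairs, bits):
    """Embed one bit into every pair (u, v) whose second pixel v is even.

    Returns the stego pairs and a location map (1 = embeddable pair).
    Illustrative assumption: exactly one bit is supplied per embeddable
    pair, mirroring the statement that the length of B2 equals LE2.
    """
    location_map = [1 if v % 2 == 0 else 0 for _, v in pairs]
    if len(bits) != sum(location_map):
        raise ValueError("need exactly one bit per embeddable pair")
    remaining = iter(bits)
    stego = []
    for (u, v), embeddable in zip(pairs, location_map):
        if embeddable:
            b = next(remaining)
            # VR1: b = 0 keeps (u, v); VR2: b = 1 gives (u, v + 1).
            stego.append((u, v + b))
        else:
            # Non-embeddable pairs are left unchanged.
            stego.append((u, v))
    return stego, location_map


def extract_vx(stego, location_map):
    """Recover the embedded bits and the original pairs (rule VX)."""
    bits, recovered = [], []
    for (u0, v0), embeddable in zip(stego, location_map):
        if v0 % 2 == 0:
            # Even v0: an embeddable pair that carries b = 0.
            bits.append(0)
            recovered.append((u0, v0))
        elif embeddable:
            # Odd v0 inside E2: the pair carries b = 1, undo the +1.
            bits.append(1)
            recovered.append((u0, v0 - 1))
        else:
            # Odd v0 outside E2: no bit was embedded here.
            recovered.append((u0, v0))
    return bits, recovered


pairs = [(10, 20), (7, 33), (200, 254), (0, 0)]
stego, location_map = embed_vr(pairs, [1, 1, 0])
bits, recovered = extract_vx(stego, location_map)
print(stego, bits, recovered == pairs)
```

The pair (7, 33) shows why the location map is needed: after embedding, an odd second pixel alone does not tell whether the pair carries a 1 or was never embeddable.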


That is, B2 gradually grows until the LC2 auxiliary data bits in A2 are
concatenated into B2. Finally, the information bit stream is B2 = S2||A2,
which is fully embedded into the image U. For the purposes of extracting B1
and recovering O, a location map HL sized H × (W/2) is needed to record the
positions of the horizontally embeddable pixel pairs (x, y) in O. The location
map HL is a one-bit bitmap.
          All the entries of HL are initialized to 0. If the cover pixel pair
(x, y) is a horizontally embeddable pixel pair, then the corresponding entry
of HL is set to 1. Next, the location map HL is losslessly compressed by using
the JBIG2 codec (Howard et al., 1998) or an arithmetic coding toolkit
(Carpenter, 2002) to obtain the compressed location map CM1 whose length is
LC1. The compressed location map CM1 is embedded into the image T by using the
LSB replacement technique as mentioned above. Similarly, for the purposes of
extracting B2 and recovering the image U, a location map VL sized (H/2) × W is
required to save the positions of the vertically embeddable pixel pairs (u, v)
in U. The location map VL is a one-bit bitmap.
          All the entries of VL are initialized to 0. If the pixel pair (u, v)
is a vertically embeddable pixel pair, then the corresponding entry of VL is
set to 1. Then, VL is also losslessly compressed by using the JBIG2 codec
(Howard et al., 1998) or an arithmetic coding toolkit (Carpenter, 2002) to
obtain the compressed location map CM2 whose length is LC2. The compressed
location map CM2 is embedded into the image V by using the LSB replacement
technique as mentioned above. The final output of the embedding phase is the
final stego image X sized H × W. Then, the stego image X is sent to the
expected receivers.

     B. The extracting phase
          The extracting phase is the reverse process of the embedding phase.
It is composed of two main stages, namely, the vertical extracting procedure
VEx and the horizontal extracting procedure HEx. Firstly, the embedded CM2 is
retrieved by extracting the LSBs of the first LC2 pixels of the received stego
image X. The extracted CM2 is then decompressed to obtain VL, which is used to
identify the vertically embeddable pixel pairs belonging to the set E2 of X.
Next, A2 is extracted from the last LC2 pixel pairs in E2 of X by using the
vertical extracting rule VX. Then, the first LC2 pixel pairs of X are replaced
with the extracted A2 to obtain the image V. Secondly, from the image V,
extract the embedded B2 and recover the image U by using the vertical
extracting procedure VEx. Thirdly, the embedded CM1 is obtained by extracting
the LSBs of the first LC1 pixels of the image U. The extracted CM1 is then
decompressed to obtain HL, which is used to identify the horizontally
embeddable pixel pairs belonging to the set E1 of U.
          Next, A1 is extracted from the last LC1 pixel pairs in E1 of U by
using the horizontal extracting rule HX. Then, the first LC1 pixel pairs of U
are replaced with the extracted A1 to obtain the image T. Fourthly, from the
image T, extract the embedded B1 and recover the original cover image O by
using the horizontal extracting procedure HEx. The first LS1 bits of B1 are
the secret bit stream S1 and the first LS2 bits of B2 are the secret bit
stream S2. The extracted secret bit streams S1 and S2 are concatenated to form
the original secret bit stream S (i.e., S = S1||S2). The overview of the
proposed extracting process is shown in the following figure.

          Fig.2. Extracting phase of proposed system

For vertical extracting procedure VEx
          Vertically scan the image V in raster scan order to group two
neighboring pixels u0 and v0 into a pixel pair (u0, v0). The extracted VL is
used to determine whether a pixel pair (u0, v0) belongs to the set E2 (i.e.,
is a vertically embeddable pixel pair). The extraction of the embedded B2 and
the recovery of the image U are performed as follows.
          The vertical extracting rule VX
          If v0 is an even value, then
              the information bit in B2 is extracted by b = 0 and
              the pixel pair (u, v) is recovered by (u, v) = (u0, v0).
          Else if (u0, v0) belongs to the set E2, then
              the information bit in B2 is extracted by b = 1 and
              the pixel pair (u, v) is recovered by (u, v) = (u0, v0 - 1).
          Else


              there is no information bit extraction and
              the pixel pair (u, v) is recovered by (u, v) = (u0, v0).
          The output of the vertical extracting procedure VEx is the image U.
          From the image U, the embedded CM1 is extracted and the image T is
recovered as mentioned above.
          The location map HL is obtained by decompressing the extracted CM1.

For horizontal extracting procedure HEx
          Horizontally scan the image T in raster scan order to gather two
neighboring pixels x0 and y0 into a pixel pair (x0, y0). The location map HL
is used to identify whether a pixel pair (x0, y0) belongs to the set E1 (i.e.,
is a horizontally embeddable pixel pair). The extraction of the embedded B1
and the recovery of the original cover image O are performed as below.
          The horizontal extracting rule HX
          If y0 is an odd value, then
              the information bit in B1 is extracted by b = 1 and
              the original cover pixel pair (x, y) is recovered by
              (x, y) = (x0, y0).
          Else if (x0, y0) belongs to the set E1, then
              the information bit in B1 is extracted by b = 0 and
              the original cover pixel pair (x, y) is recovered by
              (x, y) = (x0, y0 + 1).
          Else
              there is no information bit extraction and
              the original cover pixel pair (x, y) is recovered by
              (x, y) = (x0, y0).

               IV. EXPERIMENTAL RESULTS

         To evaluate the performance of the proposed method, we implemented
the proposed method and Tian's method by using Borland C++ Builder 6.0
software running on a Pentium IV, 3.6 GHz CPU, 1.49 GB RAM hardware platform.
The secret bit stream S was randomly generated by using the library function
random(). The multiple-layer embedding was performed for the DE and proposed
methods. To make the DE method achieve its maximum embedding capacity, the
threshold TH was not used in the experiments. The location maps L, HL, and VL
were losslessly compressed and decompressed by using the arithmetic coding
toolkit (Carpenter, 2002). Commonly used grayscale images sized 512 × 512 were
used as the cover images in our experiments. Good visual quality of stego
images (i.e., images embedded with a secret message) is the most important
property of steganographic systems because it makes the hidden data hard for
detectors to detect. Because of the lack of a universal image quality
measurement tool, we used peak signal-to-noise ratio (PSNR) to measure the
distortion between an original cover image and the stego image. The PSNR is
defined, for 8-bit grayscale images, by

          PSNR = 10 log10(255^2 / MSE) dB,

where MSE is the mean squared error between the pixel values of the cover
image and the stego image.

Fig 3. (a) Host image; (b) image after preprocessing; (c) stego image;
(d) image quality after extracting secret image

                    V. CONCLUSION

         In this paper, we propose a simple reversible steganographic scheme
in the spatial domain for digital images by using the proposed multiple
embedding strategies. The experimental results show that the proposed
reversible steganographic method is capable of achieving very good visual
quality of stego images and high embedding capacity (especially when
multiple-layer embedding is performed). Specifically, with one-layer
embedding, the proposed method obtains an embedding capacity of more than 0.5
bpp and a PSNR value greater than 54 dB for all test images. In addition, with
two-layer embedding, the proposed method achieves an embedding capacity of
about 1 bpp and a PSNR value greater than 53 dB for all test images. With
five-layer embedding, the proposed method has an embedding capacity of more
than 2 bpp and a PSNR value higher than 52 dB for all test images. Therefore,
it can be said that the proposed method allows users to perform multiple-layer
embedding to achieve very high embedding capacity and very good visual quality
of stego images. As a whole, the proposed method outperforms many existing
reversible data embedding methods in terms of visual quality, embedding
capacity, and computational


complexity. Thus, we can conclude that our proposed method is applicable to
some information hiding applications such as secret communications, medical
imaging systems, and online content distribution systems.

         We take immense pleasure in thanking our chairman Dr. Jeppiaar M.A,
B.L, Ph.D, the Directors of Jeppiaar Engineering College Mr. Marie Wilson,
B.Tech, MBA, (Ph.D), Mrs. Regeena Wilson, B.Tech, MBA, (Ph.D) and the
principal Dr. Sushil Lal Das M.Sc(Engg.), Ph.D for their continual support and
guidance. We would like to extend our thanks to our guide, our friends and
family members, without whose inspiration and support our efforts would not
have come to fruition. Above all, we would like to thank God for making all
our efforts a success.

                             REFERENCES

[1]    Bender, W., Gruhl, D., Morimoto, N., Lu, A., 1996. Techniques for
       data hiding. IBM Systems Journal 35 (3–4), 313–336.
[2]    Alattar, A.M., 2004. Reversible watermark using the difference
       expansion of a generalized integer transform. IEEE Transactions on
       Image Processing 13 (8), 1147–1156.
[3]    Barton, J.M., 1997. Method and apparatus for embedding
       authentication information within digital data. US Patent 5 646 997.
[4]    Carpenter, B., 2002. Compression via Arithmetic Coding <http://>
[5]    Chang, C.C., Lu, T.C., 2006. A difference expansion oriented data
       hiding scheme for restoring the original host images. Journal of
       Systems and Software 79 (12), 1754–1766.
[6]    Cox, I.J., Miller, M.L., Bloom, J.A., Fridrich, J., Kalker, T., 2007.
       Digital Watermarking and Steganography. Morgan Kaufmann, ISBN
[7]    Davis, R.M., 1978. The data encryption standard in perspective.
       IEEE Communications Magazine 16 (6), 5–9.
[8]    De Vleeschouwer, C., Delaigle, J.F., Macq, B., 2003. Circular
       interpretation of bijective transformations in lossless watermarking
       for media asset management. IEEE Transactions on Multimedia 5
       (1), 97–105.
[9]    Fridrich, J., Goljan, M., Du, R., 2001. Invertible authentication. In:
       Proceedings of the SPIE Security Watermarking Multimedia
       Contents, San Jose, CA, pp. 197–208 (January).
[10]   Honsinger, C.W., Jones, P.W., Rabbani, M., Stoffel, J.C., 2001.
       Lossless recovery of an original image containing embedded data.
       US Patent 6 278 791 (August).
[11]   Howard, P.G., Kossentini, F., Martins, B., Forchhammer, S.,
       Rucklidge, W.J., 1998. The emerging JBIG2 standard. IEEE
       Transactions on Circuits and Systems for Video Technology 8 (7),
[12]   Kamstra, L., Heijmans, H.J.A.M., 2005. Reversible data embedding
       into images using wavelet techniques and sorting. IEEE Transactions
       on Image Processing 14 (12), 2082–2090.
[13]   Kim, H.J., Sachnev, V., Shi, Y.Q., Nam, J., Choo, H.G., 2008. A
       novel difference expansion transform for reversible data embedding.
       IEEE Transaction Information Forensics and Security 3 (3), 456–
[14]   Lou, D.C., Hu, M.C., Liu, J.L., 2009. Multiple layer data hiding
       scheme for medical images. Computer Standards and Interfaces 31
       (2), 329–335.
[15]   Macq, B., 2000. Lossless multiresolution transform for image
       authenticating watermarking. In: Proceedings of the EUSIPCO,
       Tampere, Finland, pp. 533–536 (September).
[16]   Pan, J.S., Sung, M.T., Huang, H.C., Liao, B.Y., 2004. Robust VQ-
       based digital watermarking for the memoryless binary symmetric
       channel. IEICE Transactions on Fundamentals E-87A (7), 1839–
[17]   Petitcolas, F.A.P., Anderson, R.J., Kuhn, M.G., 1999. Information
       hiding-a survey. Proceedings of IEEE 87 (7), 1062–1078.

                             AUTHORS PROFILE

P. Mohan Kumar B.E., M.E., (Ph.D) works as Assistant Professor in Jeppiaar
Engineering College and he has more than 8 years of teaching experience. His
areas of specialization are network security, image processing and artificial
intelligence.

Dr. K.L. Shanmuganathan B.E., M.E., M.S., Ph.D works as the Professor & Head
of the CSE Department of RMK Engineering College, Chennai, TamilNadu, India.
He has more than 18 years of teaching experience and his areas of
specialization are artificial intelligence, computer networks and DBMS.



                                       J. Arokia Renjit
                Asst. Professor/ CSE Department, Jeppiaar Engineering College,
                              Chennai, TamilNadu,India – 600119.

            Professor & Head, Department of CSE, RMK Engineering College,
                              TamilNadu , India – 601 206.

Abstract--Association rule mining (ARM) is an active data mining research area
and most ARM algorithms cater to a centralized environment. Centralized data
mining to discover useful patterns in distributed databases isn't always
feasible because merging data sets from different sites incurs huge network
communication costs. In this paper, an improved algorithm with a good
performance level for data mining is proposed. At the local sites, it runs the
application based on the improved LMatrix algorithm, which is used to
calculate local support counts. Each local site also finds a centre site that
manages every message exchanged to obtain all globally frequent item sets. The
algorithm also reduces the time needed to scan the partition databases by
using the LMatrix, which increases its performance. Therefore, the aim of the
research is to develop a distributed algorithm for geographically distributed
data sets that has lower communication costs, superior running efficiency, and
stronger scalability than direct application of a sequential algorithm in
distributed databases.

           I.        INTRODUCTION

    Most existing parallel and distributed ARM algorithms are based on a
kernel that employs the well-known Apriori algorithm [1]. Directly adapting an
Apriori algorithm will not significantly improve performance over frequent
item sets generation or overall distributed ARM performance. In distributed
mining, synchronization is implicit in message passing, so the goal becomes
communication optimization. Data decomposition is very important for
distributed memory [2]. Therefore, the main challenge for obtaining good
performance on distributed mining is to find a good data decomposition among
the nodes for good load balancing, and to minimize communication.
    Distributed ARM algorithms aim to generate rules from different data sets
spread over various geographical sites; hence, they require external
communications throughout the entire process [3]. They must reduce
communication costs so that generating global association rules costs less
than combining the participating sites' data sets into a centralized site [4].
Mining association rules means generating all association rules whose support
and confidence are larger than the user-specified minimum support and minimum
confidence, respectively [5]. The main challenges include work-load balancing,
synchronization, communication minimization, finding a good data layout, data
decomposition, and disk I/O minimization, which is especially important for
distributed ARM (DARM).
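The support and confidence thresholds mentioned above can be made concrete with a small sketch. It uses the standard definitions (support of an itemset is the share of transactions containing it; confidence of X -> Y is the support count of X and Y together divided by the support count of X) on a binary transaction matrix. The three transactions ABC, ABDE and ACE are those of the supermarket example given later in the paper; the names ITEMS, MATRIX, support_count, support and confidence are assumptions made for illustration, not part of the proposed algorithm.

```python
# Rows are transactions, columns are items A..E (1 = item present).
ITEMS = "ABCDE"
MATRIX = [
    [1, 1, 1, 0, 0],  # transaction 1: A, B, C
    [1, 1, 0, 1, 1],  # transaction 2: A, B, D, E
    [1, 0, 1, 0, 1],  # transaction 3: A, C, E
]


def support_count(itemset, matrix=MATRIX):
    """Number of transactions that contain every item of the itemset."""
    cols = [ITEMS.index(item) for item in itemset]
    return sum(all(row[c] for c in cols) for row in matrix)


def support(itemset, matrix=MATRIX):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, matrix) / len(matrix)


def confidence(antecedent, consequent, matrix=MATRIX):
    """Confidence of the rule antecedent -> consequent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = support_count(set(antecedent) | set(consequent), matrix)
    return both / support_count(antecedent, matrix)


print(support_count("A"), support_count("AC"), confidence("A", "C"))
```

A rule such as A -> C is reported only if both support("AC") and confidence("A", "C") reach the user-specified minimums.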


        II.     LITERATURE SURVEY

The Count Distribution (CD) Algorithm
        The CD algorithm uses the sequential Apriori algorithm in a parallel
environment and assumes datasets are horizontally partitioned among different
sites [6]. At each iteration, it generates the candidate sets at every site by
applying the Apriori-gen function on the set of frequent itemsets found at the
previous iteration. Every site then computes the local support counts of all
these candidate sets and broadcasts them to all the other sites. Subsequently,
all the sites can find the globally frequent itemsets for that iteration, and
then proceed to the next iteration. This algorithm has a simple communication
scheme for count exchange. However, it also suffers from a higher number of
candidate sets and a larger amount of communication overhead. It does not use
the memory of the system effectively.

The Fast Distributed Mining Algorithm

    FDM generates fewer candidates than CD, and uses effective pruning
techniques to minimize the messages for the support exchange step. At each
site, FDM finds the local support counts and prunes all infrequent local
support counts [7]. After completing local pruning, instead of broadcasting
the local counts of all candidates as in CD, the sites send the local counts
to a polling site. FDM's main advantage over CD is that it reduces the
communication overhead to O(|Cp|*n), where |Cp| is the number of potentially
frequent candidate item sets and n is the number of sites [8]. When different
sites have nonhomogeneous data sets, the number of disjoint candidate itemsets
among them is large, and FDM generates fewer candidate itemsets compared to
CD.

        III.      PROPOSED SYSTEM

Mining Association Rules

        Efficient algorithms for mining frequent itemsets are crucial for
mining association rules as well as for many other data mining tasks. Methods
for mining frequent itemsets have been implemented using a prefix-tree
structure, known as an FP-tree, for storing compressed information about
frequent itemsets. Numerous experimental results have demonstrated that these
algorithms perform extremely well. In this paper, we present a novel FP-array
technique that greatly reduces the need to traverse FP-trees, thus obtaining
significantly improved performance for FP-tree-based algorithms. Our technique
works especially well for sparse data sets. Furthermore, we present new
algorithms for mining all, maximal, and closed frequent itemsets. The results
show that our methods are the fastest for many cases. Even though the
algorithms consume much memory when the data sets are sparse, they are still
the fastest ones when the minimum support is low.

The L-Matrix Algorithm

Algorithm L-Matrix minimizes the communication overhead. Our solution also
reduces the average size of transactions and datasets, which leads to a
reduction of scan time. It minimizes the number of candidate sets and exchange
messages by local and global pruning. It reduces the time needed to scan the
partition databases for support counts by using a compressed matrix, the
L-Matrix, which is very effective in increasing the


performance. It finds a centre site to manage all the message exchanges needed
to obtain the globally frequent item sets, so only O(n) messages are needed
for support count exchange. It has superior running efficiency, lower
communication cost and stronger scalability than direct application of a
sequential algorithm in distributed databases.
This new algorithm, LMatrix, is used to achieve maximum efficiency. The
transaction database is first created to develop the L-Matrix. An LMatrix is
an object-by-variable compressed structure. The transaction database is a
binary matrix where the rows represent transactions and the columns represent
items. The partitioned databases need to be scanned only once to convert each
of them to the local LMatrix. The local LMatrix is read to find support counts
instead of scanning the partition databases time after time, which will save a
lot of memory. The proposed algorithm can be applied to the mining of
association rules in a large centralized database by partitioning the database
across the nodes of a distributed system. This is particularly useful if the
data set is too large for sequential mining.

LMatrix implementation

The algorithm is illustrated with the help of the following supermarket
example. Let the supermarket contain five items, namely coffee, tea, milk,
bread and butter, which are represented as A, B, C, D and E respectively.
Let us consider three transactions. The first transaction consists of the
items coffee, tea, and milk. The second transaction consists of the items
coffee, tea, bread, butter. The third transaction consists of coffee, milk,
butter. The LMatrix and the transaction table would look like the one given
below.

        LMatrix              TRANSACTION ID      ITEM
        A  B  C  D  E
        1  1  1  0  0              1             ABC
        1  1  0  1  1              2             ABDE
        1  0  1  0  1              3             ACE

Then we can obtain the support count of 'A' by accumulating the number of '1'
entries in the first column, which gives 3. Counting the number of '1' entries
in the AND of the metavectors (columns) of A and C, we get that the support
count of AC is 2.

Improved Mining Algorithm

        For a site Si, if an itemset X is both locally and globally frequent
at site Si, we say that X is heavy at site Si.

     A. Algorithm to compute frequent Itemset in Local Sites.

1. While flag_i = true, find heavy itemsets at site Si. Then generate the
   candidate sets using the Apriori algorithm.
2. For each candidate set at Si, prune away candidate sets whose max count
   value is less than s * D, where s is the minimum support and D is the
   partition size of the distributed database.
3. Read the LMatrix to compute the local support count of the remaining
   candidate sets. Locally frequent candidate sets are put in LLk.
4. Send the candidate sets in LLk to the centre site to collect their global
   support counts.
5. If Si receives a count request for itemset X from the centre site, it reads
   the LMatrix again to obtain support


   counts of X and sends them back to the center site; otherwise it receives
   the globally frequent itemsets and their support counts.

     B. Algorithm to compute globally frequent Itemset in Central Sites.

1. Center site receives all LLk sent to it from the partition sites. When
   LLk = Ø, set flag = false. For every candidate set X ∈ LLk, it finds the
   list of originating sites.
2. If all partition sites are in the list of X, put X in Lk. Else calculate
   X.MaxCount and prune away those X whose X.MaxCount < s*D.
3. Then broadcast the remaining candidate sets to the other sites not on the
   list to collect the support counts.
4. Center site receives the local support counts back and adds them together;
   if X.count >= s*D, put it also in Lk.
5. Center site then numbers all X ∈ Lk from 1 to m. X is frequent only when
   its (k-1) subsets are frequent. If |Lk| < k+1, set flag = false.
6. Finally, when flag = true, it broadcasts the globally frequent itemsets,
   together with their global support counts, to all the sites, which find
   the heavy itemsets in each site si.

If an item among A, B, C, D, E occurs in a particular transaction, its count
is incremented by 1. Then a combination of items is chosen; if it occurs in a
particular transaction its count is incremented by 1, and the itemsets whose
count is less than the minimum support count are removed from the list. After
that, a combination of three items is chosen and its count is incremented by 1
if it occurs in a particular transaction, and the itemsets having maximum
support are the result. We get Result = [AC, AE, CE]. In the above result,
item D is not in the list of frequent itemsets, so it is eliminated, and the
above step continues with the items remaining in the list. The steps above are
done locally; global pruning is then done, which takes the frequent itemsets
from both nodes and gives the final result [A, B, C, E]. So we get the list of
items which are locally frequent at site si and also globally frequent as
follows.

[Coffee, Tea, Milk, Butter]

A       Coffee
B       Tea
C       Milk
E       Butter

The following graphs have been drawn to see the performance of the algorithm
in terms of execution time with respect to various minimum supports and
database sizes.

[Figure: Execution time with different minimum support. x-axis: Minimum
Support (1 to 9); y-axis: execution time (0 to 250).]
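The local counting and pruning described in this section (count each item or item combination once per transaction that contains it, then drop anything below s * D) can be sketched roughly as follows. This is only an illustrative sketch: the function names `support_counts` and `prune`, the three sample transactions and the value s = 0.6 are assumptions made for the example, and the counting is done by brute force over the transactions rather than through the LMatrix structure or Apriori candidate generation used by the algorithm.

```python
from itertools import combinations

def support_counts(transactions, k):
    """Count how many transactions contain each k-item combination."""
    counts = {}
    for items in transactions:
        for combo in combinations(sorted(items), k):
            counts[combo] = counts.get(combo, 0) + 1
    return counts

def prune(counts, min_support, partition_size):
    """Keep only the itemsets whose count reaches s * D."""
    threshold = min_support * partition_size
    return {itemset: c for itemset, c in counts.items() if c >= threshold}

# Hypothetical partition held by one site (not the paper's data set).
transactions = [{"A", "B", "C"}, {"A", "B", "E"}, {"A", "C", "E"}]
s = 0.6                   # assumed minimum support
D = len(transactions)     # partition size

frequent_items = prune(support_counts(transactions, 1), s, D)
frequent_pairs = prune(support_counts(transactions, 2), s, D)
print(sorted(frequent_items))
print(sorted(frequent_pairs))
```

With these sample values the threshold is 1.8, so every single item survives while only the pairs that occur in two transactions are kept. A site would send the surviving sets to the center site for global counting.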


[Figure: Execution time with different Database Size. x-axis: Database Size
(1k, 2k, 4k, 8k, 10k, 12k); y-axis: execution time (0 to 60).]

The final transaction table, which contains only the frequent itemsets, looks
like this.

TRANS ID    ITEM ID
   1            A
   2            A
   3            A
   1            B
   2            B
   1            C
   3            C
   2            E
   3            E

                    V. CONCLUSION

We have developed an efficient algorithm for mining association rules in
distributed databases which reduces communication costs and removes the
overhead of combining the datasets of the partition sites into a centralized
site. It also has the advantage of reducing the size of the messages passed
through the network. It also reduces the time spent scanning the partition
database by using LMatrix, which increases the performance of the algorithm.
Furthermore, the improved mining algorithm can be applied to the mining of
association rules in a large centralized database by partitioning the database
to the nodes of a distributed system. This is particularly useful if the data
set is too large for sequential mining.

                    ACKNOWLEDGEMENT

We take immense pleasure in thanking our Chairman Dr. Jeppiaar M.A, B.L, Ph.D,
the Directors of Jeppiaar Engineering College Mr. Marie Wilson, B.Tech,
MBA., (Ph.D) and Mrs. Regeena Wilson, B.Tech, MBA., (Ph.D), and the Principal
Dr. Sushil Lal Das M.Sc(Engg.), Ph.D for their continual support and guidance.
We would like to extend our thanks to our guide, our friends and family
members, without whose inspiration and support our efforts would not have
come true. Above all, we would like to thank God for making all our efforts a
success.

[1] D.W. Cheung, et al., "A Fast Distributed Algorithm for Mining Association
Rules," Proc. Parallel and Distributed Information Systems, IEEE CS Press,
1996, pp. 31-42.
[2] M.J. Zaki and Y. Pan, "Introduction: Recent Developments in Parallel and
Distributed Data Mining," J. Distributed and Parallel Databases, vol. 11,
no. 2, 2002, pp. 123-127.
[3] Ma, Y., Liu, B., Wong, C.K.: Web for Data Mining: Organizing and
Interpreting the Discovered Rules Using the Web. SIGKDD Explorations,
Vol. 2 (1). ACM Press, New York (2000) 16-23
[4] A. Schuster and R. Wolff, "Communication-Efficient Distributed Mining of
Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM
Press, 2001, pp. 473-484.


[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between
Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of
Data, pp. 207-216, May 1993.
[6] M.Z. Ashrafi, Monash University, "ODAM: An Optimized Distributed
Association Rule Mining Algorithm," IEEE Distributed Systems Online,
1541-4922 © 2004, published by the IEEE Computer Society, Vol. 5,
[7] Kimball, R., Ross, M.: The Data Warehouse Toolkit, The Complete Guide to
Dimensional Modeling. 2nd edn. John Wiley & Sons, New York (2002)
[8] Hand, D., Mannila, H., Smyth, P.: Principles of Data Mining, MIT Press,
Cambridge-London (2001)

J. Arokia Renjit B.E., M.E., (Ph.D) works as Assistant Professor in Jeppiaar
Engineering College and has more than 8 years of teaching experience. His
areas of specialization are Networks, Artificial Intelligence and Software
Engineering.

Dr. K.L. Shanmuganathan B.E, M.E., M.S., Ph.D works as the Professor & Head of
the CSE Department of RMK Engineering College, Chennai, TamilNadu, India. He
has more than 18 years of teaching experience and his areas of specialization
are Artificial Intelligence, Networks and DBMS.


    Node Sensing & Dynamic Discovering Routes for Wireless Sensor Networks

Arabinda Nanda (1)                     Amiya Kumar Rath (2)                        Saroj Kumar Rout (3)
Department of CSE                      Department of CSE & IT                      Department of CSE
Krupajal Engineering College           College of Engineering                      Krupajal Engineering College
Bhubaneswar, India                     Bhubaneswar, India                          Bhubaneswar, India

Abstract - The applications of Wireless Sensor Networks (WSN) cover a wide
variety of scenarios. In most of them, the network is composed of a
significant number of nodes deployed over an extensive area in which not all
nodes are directly connected, so data exchange is supported by multihop
communications. Routing protocols are in charge of discovering and
maintaining the routes in the network. However, the appropriateness of a
particular routing protocol mainly depends on the capabilities of the nodes
and on the application requirements. This paper presents a dynamic
discovering routing method for communication between sensor nodes and a base
station in WSN. This method tolerates failures of arbitrary individual nodes
in the network (node failure) or of a small part of the network (area
failure). Each node in the network performs only local routing maintenance,
needs to record only its neighbor nodes’ information, and incurs no extra
routing overhead during failure free periods. It dynamically discovers new
routes when an intermediate node or a small part of the network in the path
from a sensor node to a base station fails. In our proposed method, every
node decides its path based only on local information, such as its parent
node and neighbor nodes’ routing information. So, it is possible to form a
loop in the routing path. We believe that the loop problem in sensor network
routing is not as serious as that in Internet routing or traditional mobile
ad-hoc routing. We try to find all possible loops and eliminate them as far
as possible in WSN.

Keywords- routing protocol; wireless sensor network; node failure; area
failure

              1. INTRODUCTION

A WSN is composed of a large number of tiny autonomous devices, called sensor
nodes. A sensor node has limited sensing and computational capabilities and
can communicate only over short distances. A routing protocol is a set of
rules defining how routers find the path that packets containing information
have to follow to reach the intended destination.
The concept of WSN is based on a simple equation:
Sensing + CPU + Radio = Thousands of potential applications
As soon as people understand the capabilities of a WSN, hundreds of
applications come to mind. Actually combining sensors, radios, and CPUs into
an effective WSN requires a detailed understanding of both the capabilities
and the limitations of each of the essential hardware components, as well as
a detailed understanding of modern networking technologies and distributed
systems theory. WSN technology, which combines data sensing, computing, and
communication, has been gaining great popularity in recent years. Several
real world applications have already been designed, implemented and deployed
[1]. A WSN consists of a large number of Sensor Nodes and one or more Base
Stations. A Base Station acts as a gateway to connect a WSN to the outside
world. Individual Sensor Nodes sense their environment, and transmit the
sensed data to a Base Station through a multi-hop network consisting of
several sensor nodes. The Base Station in turn transfers the data to the WSN
users. Several routing protocols for WSN have been proposed [2, 3, 4, 5, 6,
7, 8, 9, 10].
     Deng, Han & Mishra have done a lot of work on routing mechanisms for
WSN. They studied loops and how to eliminate them in WSN. Here we add a loop
finding algorithm to find all possible loops in a WSN. Although each
individual sensor node is highly constrained in its computing and
communication capabilities, a complete WSN is capable of performing complex
tasks.
     Common failures in the system include Node Failure, Area Failure and
Lost Message. In order to function properly, the rest of the system must
(1) detect failures; (2) determine the cause, such as identifying the type of
failure and the failed component; (3) reconfigure the system so that it can
continue to operate; and (4) recover when the failed component is repaired.
     A node engaged in a handshaking protocol of some kind usually
experiences a failure as the lack of the expected response from its partner
within a prescribed time limit. The use of time-outs is a common technique
for detecting a missing response. However, the choice of a specific time-out
value presents some practical problems. Too long a time-out results in slow
detection of a missing message. On the other hand, too short a time-out may
trigger false alarms by declaring as missing a message that is just delayed.
Moreover, short time-outs require the communication subsystem to deal with
duplicate messages sent in response to hurriedly requested replays.
     Every node in a WSN has similar chances to suffer from an arbitrary
Node Failure, which is

generally caused by battery drain or some internal problem in the node. An
Area Failure results in a failure of all nodes within a certain geographical
area. This is typically caused by outside accidents, such as a bomb blast,
fire, successful denial-of-service attacks, and so on. Figure 1 illustrates
these two types of failures in a WSN.

   Figure 1 Node failure in WSN

    Three different methods have been used to maintain routing paths in the
presence of node failures. In the first method, routing paths are
reconstructed from time to time. For example, in a simple beacon protocol
[11], a base station periodically broadcasts a beacon message. By receiving a
beacon message, a node receives an up-to-date routing path to the base
station. Reconstruction of routing paths is expensive in this method and
consumes lots of energy. In addition, since reconstruction is not on demand,
a node has to wait for the next beacon to update its routing information
after a node failure.

    In the second method, multiple routing paths are used to transfer data.
The idea is that unless every path from a sensor node to a base station is
broken by a failed node, data can be transmitted to the base station. The
multipath version of directed diffusion [12] uses this strategy. This method
can result in increased energy consumption and packet collisions, because
data is sent along multiple paths, irrespective of whether there is a node
failure or not. Also, this method cannot guarantee bypassing an area failure.

    In the third method a routing path is selected probabilistically. In this
method, a node chooses another node to forward a packet with a certain
probability. Since there is no fixed path to forward data, a failed node
can’t block all packets from a sensor node to a base station. The ARRIVE
routing protocol [13] uses this strategy to forward multiple copies of the
same data.
    Woo, Tong, and Culler [14] investigated the challenges of multihop
routing in wireless sensor networks and proposed a routing scheme based on a
node’s neighborhood link estimates. This protocol is for many-to-one, data
collection routing in WSN. A sensor network can quickly respond to node
failures and data transmission range changes, and find new routing paths for
sensor nodes. To do this, a sensor node needs to periodically broadcast its
routing information, or periodically search its neighbor nodes’ routing
information. In addition, it needs to maintain a table which contains its
neighbor nodes’ routing information.
    In this paper, we propose a dynamic discovering routing method that can
be integrated in any routing protocol for WSN to make it fault tolerant. It
dynamically repairs a routing path between a sensor node and a base station.
In contrast to [14], a node stores only its parent node’s routing
information, and asks for its neighbor nodes’ routing information when the
parent node is hard to find. When an original routing path is broken, a node
selects a new path from its neighbor nodes. This dynamic discovering routing
method tolerates both arbitrary node and area failures.

              2. PROTOCOL EXPLANATION

2.1. Assumptions

    In this paper, we focus on how each sensor node maintains its routing
path to base stations. We assume that the initial routing path from each
sensor node to a base station has already been set up. This can be done using
a number of protocols that have been proposed in the past, e.g. the TinyOS
beacon protocol discussed below. In particular, we assume that each node
already has a path to the base station, and knows its parent node, its
neighbor nodes and the number of hops it is from the base station. This
information can be initialized by using the TinyOS beacon protocol for
setting up routing paths. In this protocol, the base station floods a beacon
message in the network. When a node first receives the beacon message, it
records the sender of that beacon message as its parent node and forwards the
beacon message to all of its neighbor nodes. When a node needs to
send/forward a message to the base station, it sends the message to its
parent node. The parent node in turn forwards the message to


its parent node, and so on, until the message gets to the base station. A key
problem with this protocol is that it is not error tolerant. If the parent
node of a sensor node fails, the sensor node cannot communicate with the base
station.

2.2. Path Repair Algorithm

    The basic idea for repairing routing paths in case of arbitrary node or
area failures is quite simple: every node monitors its parent node. When it
finds that its parent node has failed, it asks its neighbor nodes for their
connection information. It then chooses a new parent node from its neighbor
nodes based on this connection information. As shown in Figure 1, the method
can tolerate node failures and routes messages around the failed nodes.
    This mechanism consists of four parts: failure detection, failure
information propagation, new parent detection, and new parent selection.
First, a node detects whether its parent node is alive and whether the parent
node can connect to the base station. This part is called failure detection.
If a node s detects that its parent node works well, it does no maintenance
work. If there is some problem with the parent node, such as node failure or
disconnection from the base station (possibly because one of the parent
node’s ancestor nodes has failed), node s informs its children nodes about
the failure, which is called failure information propagation. In addition, s
requests connection information from its neighbor nodes, since it needs to
choose a new parent node from them. This part is called new parent detection.
After collecting information from its neighbor nodes, s decides on a new
parent node based on the information it collected. This part is called new
parent selection.

We denote by a the node that tries to maintain its route path. Node p(a) is
a’s parent node.

1. Node a sends a FORWARD message to its parent node p(a), and sets a timeout
(timeout_ppt) for the BACK message from p(a).

     FORWARD: a → p(a): forward_ppt

2. If p(a) receives the FORWARD message, it replies with a BACK message. The
BACK message states whether p(a) connects to the base station or not and, if
it is connected, its hops to the base station. If p(a) connects to the base
station, it sends a BACK_Y message back to a.

     BACK_Y: p(a) → a: connect||hops

If p(a) cannot connect to the base station, it sends a BACK_N message back
to a:

     BACK_N: p(a) → a: broken||broken_hops

If p(a) cannot connect to its parent node p.parent, p.broken_hops is set
to 1. Otherwise,

     p.broken_hops = p.parent.broken_hops + 1

3 (a). If a receives BACK_Y from p(a), a resets its hops to parent p’s hops
plus one:
     a.hops ← p(a).hops + 1.
If a node’s hops exceed a maximum threshold value, it sets itself
unconnected:
     a.hops ← ∞.

  (b). If p(a) is dead or its signal is blocked, it cannot reply with a BACK
message within the timeout. If a does not receive a BACK message from p(a)
within the specified timeout, a knows that it cannot connect to the base
station through p(a). Then it broadcasts a REQUEST message to all of its
neighbor nodes to find a new parent node.

     REQUEST: a → NEIGHBOR: request_parent

  (c). If a receives BACK_N from p(a), a knows that p(a) cannot connect to
the base station at that moment. Instead of broadcasting a REQUEST message
immediately, a waits for a timeout before sending REQUEST. The timeout
depends on the value of broken_hops from the BACK_N message. This strategy
gives parent node p some time to find its new parent node. a sets its
broken_hops, and propagates it when its children nodes send FORWARD messages
to a.

4. When one of a’s neighbor nodes n receives a REQUEST message from a, and n
can connect to the base station, it sends a REPLY message back to a. The
REPLY message contains the ID of n’s parent node and n’s hops to the base
station:

     REPLY: n → a: connect||n_hops||n.parent

If n can’t connect to the base station, it does not send any message back
to a. Instead, it records a as one of its REQUEST senders. (Here, a’s
children nodes do not send a REPLY message back to a since it is not
necessary.)
If a has not got any REPLY message from its neighbor nodes, it resends
REQUEST after a certain timeout.
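The handshake in steps 1 to 4 above can be sketched as a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the function `probe_parent`, the return labels and the `MAX_HOPS` value are hypothetical names chosen for the example, and a missing BACK message (the timeout case) is modeled simply as a `None` return instead of a real timer.

```python
import math
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 16  # hypothetical threshold; the text does not give a value

@dataclass(eq=False)
class Node:
    name: str
    hops: float = math.inf             # hops to the base station
    parent: Optional["Node"] = None
    alive: bool = True
    connected: bool = False            # can currently reach the base station
    broken_hops: int = 0

    def on_forward(self):
        """Step 2: answer FORWARD with BACK_Y or BACK_N; None models a timeout."""
        if not self.alive:
            return None
        if self.connected:
            return ("BACK_Y", self.hops)
        return ("BACK_N", self.broken_hops)

    def on_request(self, sender):
        """Step 4: reply only if connected and not a child of the requester."""
        if self.alive and self.connected and self.parent is not sender:
            return ("REPLY", self.hops, self.parent)
        return None

def probe_parent(a):
    """Steps 1 and 3: node a sends FORWARD to p(a) and reacts to the answer."""
    back = a.parent.on_forward()
    if back is None:                   # 3(b): no BACK before the timeout
        a.connected = False
        a.broken_hops = 1
        return "REQUEST_NOW"
    kind, value = back
    if kind == "BACK_Y":               # 3(a): refresh the hop count
        a.hops = value + 1
        if a.hops > MAX_HOPS:
            a.hops = math.inf
            a.connected = False
            return "REQUEST_NOW"
        a.connected = True
        return "CONNECTED"
    a.connected = False                # 3(c): wait before sending REQUEST
    a.broken_hops = value + 1
    return "REQUEST_LATER"

# Usage: a probes a healthy parent, then a dead one.
bs = Node("bs", hops=0, connected=True)
p = Node("p", hops=1, parent=bs, connected=True)
a = Node("a", hops=5, parent=p, connected=True)
print(probe_parent(a), a.hops)   # CONNECTED 2
p.alive = False
print(probe_parent(a))           # REQUEST_NOW
```

The three return labels correspond to the three outcomes in step 3: keep the current parent, broadcast REQUEST immediately, or broadcast it after a delay that would depend on broken_hops.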


5. When a receives REPLY messages from its neighbor nodes, if the REPLY
message says that the sender connects to the base station, a records the
sender as a parent candidate. Finally, a selects as its new parent node the
candidate whose hops to the base station is smallest. After it selects the
parent node, a sets its hops to its parent node’s hops plus one:
     a.hops ← p(a).hops + 1.

If a has received REQUEST messages from its neighbor nodes, it sends REPLY
back to the REQUEST senders.

    In Figure 2, we present a formal description of this method as a finite
automaton (FA). This FA shows the major state transitions except the
processing of REQUEST requests. We use x : y to describe a state transition
condition: x denotes the action or event and y denotes the content of the
message. R means receives a message, S means sends a message.

[Figure 2 Protocol Finite Automata. States: Connected, Probing, Disconnected,
RQSTing, Pending. Transition labels include R:RQST, S:RQST, R:RPLY, R:BACK_Y,
R:BACK_N and Timeout-rqst.]

              3. PROPERTIES

3.1. Arbitrary node failure and area failure

    The proposed method is robust in finding new paths under arbitrary node
failure and area failure. Figure 3 demonstrates how a node finds an
alternative path when its parent node has failed. In Figure 3, p(a) is a
failed node, shown as a black node. When p(a)’s child node a detects that it
cannot
             I0     I1            I2        I3          I4        I5          I6          I7          I8
                                                                                                                      connect to p(a) by running step 1, a broadcasts
 Q0          Q0                                         Q1
                                                                                                                      REQUEST message to its neighbor nodes . If any
 Q1                 Q0            Q3                                                      Q2                          of a’s nonchild neighbor nodes can connect their
 Q2          Q2                                                   Q4                                                  parent nodes, they will send REPLY message back
 Q3          Q3     Q0                                                        Q2                                      to a. This figure demonstrates that a chooses n as
 Q4          Q4     Q0                      Q0                                                        Q2
                                                                                                                      its new parent node from the REPLY messages,
                                                                                                                      and then a has a new path to base station.

                    (Transition table for FA)

 The states of FA:
 Q={ Q0,Q1,Q2,Q3,Q4 }
 Q0 = Connect, Q1=Probing,Q2=Disconnected,
 The input of FA:
 I0= R: Request, I1= R: Back_Y,
 I2= R: Back_N,
 I3= R: Reply, I4= S: Forward, I5= S: Request,                                                                                Figure 4 Bypass Area Failures
 I6= Timeout-ppt, I7 = Timeout-Forward,
 I8= Timeout-Request.
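The transition table of the FA and the parent selection of step 5 can be
sketched together in a few lines of Python. This is a minimal illustration and
not the authors' implementation. The dictionary layout, the names step, run
and select_parent, the tuple format of a REPLY, and the convention that an
input without a table entry leaves the state unchanged are all assumptions
made for the example. The names used for Q3 and Q4 are read off the state
labels of Figure 2.

```python
# State names; Q3 and Q4 are taken from the labels in Figure 2.
STATES = {"Q0": "Connected", "Q1": "Probing", "Q2": "Disconnected",
          "Q3": "Pending", "Q4": "RQSTing"}

# (state, input) -> next state, one entry per filled cell of the table.
TRANSITIONS = {
    ("Q0", "R:Request"): "Q0", ("Q0", "S:Forward"): "Q1",
    ("Q1", "R:Back_Y"): "Q0", ("Q1", "R:Back_N"): "Q3",
    ("Q1", "Timeout-Forward"): "Q2",
    ("Q2", "R:Request"): "Q2", ("Q2", "S:Request"): "Q4",
    ("Q3", "R:Request"): "Q3", ("Q3", "R:Back_Y"): "Q0",
    ("Q3", "Timeout-ppt"): "Q2",
    ("Q4", "R:Request"): "Q4", ("Q4", "R:Back_Y"): "Q0",
    ("Q4", "R:Reply"): "Q0", ("Q4", "Timeout-Request"): "Q2",
}


def step(state, event):
    """Return the next state. Assumption: undefined inputs are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="Q0"):
    """Feed a sequence of inputs to the automaton and return the final state."""
    for event in events:
        state = step(state, event)
    return state


def select_parent(replies):
    """Step 5: choose the connected REPLY sender with the fewest hops.

    `replies` is a hypothetical list of (sender, connected, hops) tuples.
    Returns (parent, own_hops), or None when no sender is connected.
    """
    candidates = [(hops, sender) for sender, connected, hops in replies
                  if connected]
    if not candidates:
        return None
    hops, parent = min(candidates)
    return parent, hops + 1  # a.hops <- p(a).hops + 1


if __name__ == "__main__":
    # A failed probe followed by a successful REQUEST/REPLY exchange.
    trace = ["S:Forward", "R:Back_N", "Timeout-ppt", "S:Request", "R:Reply"]
    print(STATES[run(trace)])
    print(select_parent([("n", True, 2), ("m", True, 4), ("k", False, 1)]))
```

Keeping the table as data rather than as nested conditionals makes it easy to
compare the sketch cell by cell with the transition table.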


Figure 4 shows a case in which all the nodes within a certain area have
failed. This may be caused by an accident, e.g. a fire, a bomb, or a signal
blocking attack. This type of failure is called area failure. When it happens,
the nodes close to the failure area send REQUEST messages to their neighbor
nodes. In the beginning, some nodes choose other nodes along the failure edge
as their parent nodes, because these nodes may detect the failure at slightly
different times. But quickly, the nodes just behind the failure area detect
that their neighbor nodes are also disconnected from the base station. We call
this area the "block area". Because of routing update inconsistency, some
nodes may form routing loops in the "block area". At the edge of the failure
area, which we call the "edge area", nodes find the real path to the base
station, and the routing information of these nodes ultimately reaches the
nodes in the "block area" and connects them to the base station.

3.2. Routing Loop

3.2.1. Loops: In our proposed method, every node decides its path based only
on local information, such as its parent node and its neighbor nodes' routing
information. So it is possible to form a loop in the routing path. Because a
REPLY message contains the parent node of the REPLY sender, a node can find
and eliminate only short loops of 2 or 3 nodes. Longer loops cannot be
eliminated this way. A loop is more likely to occur in case of area failure
than of arbitrary node failure. When an area failure occurs, some nodes detect
their parent failure and send REQUEST messages, while nodes that have not yet
detected the failure keep their old routing information. This information
inconsistency can create loops. The problems caused by loops are energy
consumption and increased packet delay and loss. Nodes in a loop may waste
their power by continually forwarding packets.

3.2.2 Algorithm for finding all loops in Sensor Network.

An algorithm for finding the loops in a sensor network is presented here. The
algorithm first detects a basic loop, copies it, excluding the last element,
as the beginning of the second loop, and then searches in forward and backward
directions to find other loops. This process goes on until all the loops are
found. Loop finding is a typical searching process, and efficient algorithms
for searching loops are not readily available.

Development of Algorithm

Input Data: For loop finding, the data to be supplied are the basic line
information, that is, each line with its end nodes. From the raw data, the
algorithm prepares a line-node-incidence matrix (LNI), which contains the
lines connected to each node.

The searching process for loop finding is to start from the source node and go
forward until the source node is reached again. To a programmer, however, the
problem is a little harder, because at every step of the searching process the
searching direction has to be chosen judiciously. With a wrong direction, the
program may enter an unending search and never reach the source node, may not
be able to find all the loops, or may travel along the same loop every time.

The program must therefore remember the path along which it has already
traveled. A line, on the other hand, may participate in several loops. The
program thus has to select the lines through which it must travel again,
though they were already traveled while finding other loops. The problem
becomes easier to understand with the help of an example. The developed
algorithm has been tested with fairly large sensor networks; the network of
Figure 5 is taken as an example for simplicity. The network of Figure 5 has 5
nodes, eight lines and 10 loops starting from node 1.

[Figure 5: example network with nodes 1 to 5 and lines 1 to 8]

Figure 5 Example Network
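As a cross-check of the counts quoted for the example network, the sketch
below builds a line-node-incidence structure from raw line data and enumerates
the loops from a source node by plain depth-first search. It is not the
LECON/LCON procedure developed in this section; it only borrows the idea of
omitting a starting line once all loops beginning with that line have been
found. The end nodes of lines 1, 2, 4 and 7 are those stated in the text; the
numbering of the other four lines is an assumption, as are the names
build_lni, other_end and find_loops.

```python
from collections import defaultdict

# Raw line data: line number -> (end node, end node).
# Lines 1, 2, 4 and 7 follow the text; 3, 5, 6 and 8 are assumed numbering.
LINES = {1: (1, 2), 2: (1, 3), 3: (2, 4), 4: (3, 2),
         5: (3, 4), 6: (3, 5), 7: (5, 1), 8: (4, 5)}


def build_lni(lines):
    """Line-node incidence: node -> list of the lines connected to it."""
    lni = defaultdict(list)
    for line, (u, v) in sorted(lines.items()):
        lni[u].append(line)
        lni[v].append(line)
    return dict(lni)


def other_end(lines, line, node):
    """Return the end node of `line` that is not `node`."""
    u, v = lines[line]
    return v if node == u else u


def find_loops(lines, source):
    """Enumerate every loop through `source` exactly once."""
    lni = build_lni(lines)
    omitted = set()  # starting lines whose loops have all been handled
    loops = []

    def search(path):
        node = path[-1]
        for line in lni[node]:
            if line in omitted:
                continue
            nxt = other_end(lines, line, node)
            if nxt == source:
                if len(path) >= 3:  # a loop needs at least three nodes
                    loops.append(path + [source])
            elif nxt not in path:
                search(path + [nxt])

    for start in lni[source]:
        # Omit the starting line so that it is never used to return to the
        # source, neither in this group of loops nor in any later one.
        omitted.add(start)
        search([source, other_end(lines, start, source)])
    return loops


if __name__ == "__main__":
    for loop in find_loops(LINES, 1):
        print(*loop)
```

For the network above the search returns 10 loops from node 1, seven whose
second node is 2 and three whose second node is 3. The order in which they are
printed depends on the assumed line numbering.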


A list of the loops is given below:

1   2   3   1
1   2   3   4   5   1
1   2   3   5   1
1   2   4   3   1
1   2   4   3   5   1
1   2   4   5   1
1   2   4   5   3   1
1   3   4   5   1
1   3   2   4   5   1
1   3   5   1

Let node 1 be the source node. Starting from node 1, one may reach node 2 and
come back to the source node along line 2 or 7. There are seven such loops
with a common second node. All such loops with a common second node will be
referred to as a BLOCK. Once all the loops of the first BLOCK are found, the
starting line (line 1, here) must be omitted, because all the possible loops
with this line have been found. From the source node one may now reach node 3
and come back to node 1 through line 7. There are three such loops, forming
another BLOCK. Now line 2 will also be omitted, and no other loop is possible
with only one line connected to the source node.

For a network having N lines connected to the source node there will be N – 1
BLOCKS. While the program is in a particular BLOCK, it must store the numbers
of the lines along which it had to travel to complete each loop. All such
lines are stored in the LECON matrix. But a line may be part of many loops:
line 4 (3 – 2) appears in 3 loops and line 7 (5 – 1) appears in 4 loops of the
first BLOCK. The program thus defines another matrix, LCON, which initially
has the same content as LECON. Afterwards, on reaching a particular node, the
program judiciously selects some lines connected to that node to make free,
and these lines are eliminated from LCON of that node.

As one loop is found, the program copies it into the next row, excluding the
last entry. From the last entry of the new row, it then searches for any other
path back to the source node. If a path is available, a new loop is formed and
is copied again. If there is no way to proceed further in the forward
direction, the program moves one step backward and searches again. If, in the
process of going backward, the program comes to the second column and cannot
find any forward path, the end of a BLOCK is indicated. The line connecting
the first and second column entries of the previous BLOCK will never be
considered in the next BLOCK. At the start of a new BLOCK, LECON is to be
initialized again.

A counter NBLOCK counts the number of BLOCKS. When NBLOCK = number of lines
connected to the source node, the end of the search process with a source node
is indicated.

If all the loops of the network are required, consider the next node as the
source node and modify the LNI matrix to omit the previous source node from
the network. At least three nodes are required to form one loop. Thus the loop
finding process continues until the reduced network contains only two nodes.

The complete algorithm is given below:
    1. Form LNI from the raw data. Set row number K1 = 1.
    2. Set NBLOCK = 1, initialize the loops, set column no K2 = 1.
    3. Enter source node as the first column entry.
    4. From the LNI of the source node take the first line, find its end
       node, modify LNI of both the nodes to omit the first line. Elements of
       LNI are to be shifted towards left by one position. Enter the end node
       as an element of the loop.
    5. Set LECON = 0 and LCON = 0 for all nodes.
    6. Check serially all the lines connected to the second node. If all the
       lines have been considered go to step 29. If a new line is found,
       detect its end node.
    7. Detect any line connected to the new node. If no line exists go to
       next step. If the end node of the new line is the start node, end of a
       loop is indicated. Go to step 9. If the end node is not the start
       node, check if the node has already been entered in the loop. If so,
       consider the next line, otherwise enter the node in the loop and go on
       checking till the start node is reached.
    8. Go one step backward and repeat the search from step 7; if column
       number = 2, go to step 6.
    9. Enter the last node in the loop. Enter all the lines in the loop in
       LECON.
    10. Set LEVEL = K2 – 1.
    11. K1 = K1 + 1. Copy the last loop excluding the last element.
    12. Set LCON = LECON, NRESTNODE = last element of the present row.
        K2 = K2 – 1.
    13. If number of rows in the present BLOCK is less than 3, go to step 22,
        otherwise go to next step.


    14. LBS1=Loop (K1-1, K2), LBS2= Loop (K1-2,                 mechanism to eliminate loop. Suppose there is a
        K2). If LBS1 = LBS2 go to step 22.                      loop a1 → a2 → . . . → ak → a1. This loop exists
    15. Set ITN= 0.                                             because there is a node (ak) that finds that its
    16. If LBS2=Loop (K1-1, K2+1), go to step 19,
                                                                original path is broken and it can connect to a1.
        otherwise go to next step.
    17. If LBS2=any element of the present loop go              Before choosing a1 as its new parent node, ak
        to step 19, otherwise go to next step.                  needs to broadcast REQUEST to its neighbor
    18. Find the line connecting LBS1 and LBS2 if               nodes, and so ak−1 will know ak’s path is broken.
        any, if no line exists go to step 19. If any            If ak−1 and its downstream nodes continually
        line exists make this line free (eliminate              inform their downstream nodes of the path broken
        from LCON).                                             event, the “path broken” information will quickly
    19. If ITN = 1, go to step 22, otherwise set ITN            propagate to all downstream nodes. At the same time,
        = 1.                                                    ak accepts a1 as its new parent node and sends new
    20. Check the next column in the previous row.
                                                                hops information to its downstream nodes.
        If it is the source node go to step 22.
                                                                Although the “new hops” information will
    21. Make LBS2=Source node. Go to step 18.
                                                                eventually propagate to all downstream nodes, its
    22. Check if the present loop up to the present
        entry is same as any other previous loop. If            propagation speed is much slower than “path
        same, go to step 23. Otherwise, make                    broken” event, since child node gets “new hops”
        LCON of present node = 0.                               information after it sends FORWARD message
    23. Take a line from LNI of NRESTNODE.                      and receives BACK_Y. If there is a loop, from a1
        Check if it is in LCON. If no line                      through ak to a1, the “path broken” event will get
        connected to the node is free go to step 27,            to a1 and continue to reach ak, and eventually it
        otherwise go to next step.                              will catch “new hops” information. At that time,
    24. Find the end node, if end node=any node in              every node on the loop will get “path broken”
        the present row or next column in the                   event and the path of the loop will disappear.
        previous row except the source node go to
        step 23. Otherwise go to next step.                     One