Effective Classification Algorithms to Predict the Accuracy of Tuberculosis - A Machine Learning Approach

					                                                               (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                                                    Vol. 9, No. 7, July 2011

      Effective Classification Algorithms to Predict the
       Accuracy of Tuberculosis-A Machine Learning
                          Approach
                Asha.T                                        S. Natarajan                                      K.N.B. Murthy
  Dept. of Info.Science & Engg.,                    Dept. of Info. Science & Engg.                      Dept.of Info. Science & Engg.
 Bangalore Institute of Technology                  P.E.S. Institute of Technology                      P.E.S.Institute of Technology
        Bangalore, INDIA                                  Bangalore,INDIA                                     Bangalore,INDIA



Abstract: Tuberculosis is a disease caused by mycobacterium which can affect virtually all organs, not sparing even the relatively inaccessible sites. India has the world’s highest burden of tuberculosis (TB) with million estimated incident cases per year. Studies suggest that active tuberculosis accelerates the progression of Human Immunodeficiency Virus (HIV) infection. Tuberculosis is much more likely to be a fatal disease among HIV-infected persons than among persons without HIV infection. Diagnosis of pulmonary tuberculosis has always been a problem. Classification of medical data is an important task in the prediction of any disease, and it also helps doctors in their diagnosis decisions. In this paper we propose a machine learning approach to compare the performance of both basic learning classifiers and ensembles of classifiers on tuberculosis data. The classification models were trained using real data collected from a city hospital. The trained models were then used to predict tuberculosis in two categories: Pulmonary Tuberculosis (PTB) and Retroviral PTB (RPTB), i.e. TB along with Acquired Immune Deficiency Syndrome (AIDS). The prediction accuracy of the classifiers was evaluated using 10-fold cross-validation and the results were compared to obtain the best prediction accuracy. The results indicate that Support Vector Machine (SVM) performs best among the basic learning classifiers and Random Forest best among the ensembles, each with an accuracy of 99.14%. Various other measures such as Specificity, Sensitivity, F-measure and ROC area have been used in the comparison.

Keywords: Machine learning; Tuberculosis; Classification; PTB; Retroviral PTB

                       I.    INTRODUCTION
There is an explosive growth of bio-medical data, ranging from those collected in pharmaceutical studies and cancer therapy investigations to those identified in genomics and proteomics research. The rapid progress in data mining research has led to the development of efficient and scalable methods to discover knowledge from these data. Medical data mining is an active research area under data mining, since medical databases have accumulated large quantities of information about patients and their clinical conditions. Relationships and patterns hidden in this data can provide new medical knowledge, as has been proved in a number of medical data mining applications.

Data classification using knowledge obtained from known historical data has been one of the most intensively studied subjects in statistics, decision science and computer science. Data mining techniques have been applied to medical services in several areas, including prediction of the effectiveness of surgical procedures, medical tests and medication, and the discovery of relationships among clinical and diagnosis data. To help clinicians diagnose the type of disease, computerized data mining and decision support tools are used. These tools help clinicians process the huge amount of data available from previously solved cases and suggest the probable diagnosis based on the values of several important attributes. There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic. No single method has been found to be superior to all others for all data sets.

India has the world’s highest burden of tuberculosis (TB) with million estimated incident cases per year. It also ranks [20] among the countries with the world’s highest HIV burden, with an estimated 2.3 million persons living with HIV/AIDS. Tuberculosis is much more likely to be a fatal disease among HIV-infected persons than among persons without HIV infection. It is a disease caused by mycobacterium which can affect virtually all organs, not sparing even the relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs. They spread from the initial location in the lungs to other parts of the body via the blood stream. Tuberculosis presents a diagnostic dilemma even for physicians with a great deal of experience in this disease.

                       II.   RELATED WORK
Orhan Er, Temurtas and Tantrikulu [1] present a study on tuberculosis diagnosis carried out with Multilayer Neural Networks (MLNNs), using an MLNN with two hidden layers and a genetic algorithm as the training algorithm. A data mining approach was adopted to classify the genotype of Mycobacterium tuberculosis using the C4.5 algorithm [2]. Rethabile Khutlang et al. present methods for the



                                                                      89                              http://sites.google.com/site/ijcsis/
                                                                                                      ISSN 1947-5500
automated identification of Mycobacterium tuberculosis in images of Ziehl–Neelsen (ZN) stained sputum smears obtained using a bright-field microscope. They segment candidate bacillus objects using a combination of two-class pixel classifiers [3].

Sejong Yoon and Saejoon Kim [4] propose a mutual information-based Support Vector Machine Recursive Feature Elimination (SVM-RFE) as a classification method with feature selection. Diagnosis of breast cancer using different classification techniques was carried out in [5,6,7,8]. A new constrained-syntax genetic programming algorithm [9] was developed to discover classification rules for diagnosing certain pathologies. Kwokleung Chan et al. [10] used several machine learning and traditional classifiers in the classification of glaucoma disease and compared their performance using ROC. Various classification algorithms based on statistical and neural network methods were presented and tested for quantitative tissue characterization of diffuse liver disease from ultrasound images [11], and classifiers were compared for sleep apnea [18]. Ranjit Abraham et al. [19] propose a new feature selection algorithm, CHI-WSS, to improve the classification accuracy of Naïve Bayes on medical datasets.

Minou Rabiei et al. [12] use tree-based ensemble classifiers for the diagnosis of excess water production. Their results demonstrate the applicability of this technique to the successful diagnosis of water production problems. Hongqi Li, Haifeng Guo et al. [13] present a comprehensive comparative study on petroleum exploration and production using five feature selection methods (expert judgment, CFS, LVF, Relief-F and SVM-RFE) and fourteen algorithms from five distinct kinds of classification methods: decision tree, artificial neural network, support vector machine (SVM), Bayesian network and ensemble learning.

The paper “Mining Several Data Bases with an Ensemble of Classifiers” [14] analyzes two types of conflict and suggests ways to handle them. The first is created by data inconsistency within the area of intersection of the data bases. The second is created when the meta method selects different data mining methods with inconsistent competence maps for the objects of the intersected part and their combinations. Reference [15] studies medical data classification methods, comparing decision trees and system reconstruction analysis as applied to heart disease medical data mining. Under most circumstances, single classifiers such as neural networks, support vector machines and decision trees perform worse than combinations of classifiers. To further enhance performance, a combination of these methods in a multi-level combination scheme was proposed that improves efficiency [16]. Reference [17] demonstrates the use of abductive network classifier committees trained on different features for improving classification accuracy in medical diagnosis.

                       III.   DATA SOURCE
The medical dataset we are classifying includes 700 real records of patients suffering from TB, obtained from a city hospital. The entire dataset is stored in one file containing many records. Each record holds the most relevant information for one patient. The doctor's initial queries about symptoms and some required test details of the patients have been taken as the main attributes. In total there are 11 attributes (symptoms) and one class attribute. The attributes recorded for each patient are age, chroniccough(weeks), loss of weight, intermittent fever(days), night sweats, Sputum, Bloodcough, chestpain, HIV, radiographic findings, wheezing and class.
Table I shows the names of the 12 attributes considered along with their Data Types (DT). Type N indicates numerical and C indicates categorical.

              Table I. List of Attributes and their Datatypes
              No      Name                       DT
              1       Age                        N
              2       Chroniccough(weeks)        N
              3       WeightLoss                 C
              4       Intermittentfever          N
              5       Nightsweats                C
              6       Bloodcough                 C
              7       Chestpain                  C
              8       HIV                        C
              9       Radiographicfindings       C
              10      Sputum                     C
              11      Wheezing                   C
              12      Class                      C

                       IV.   CLASSIFICATION ALGORITHMS
SVM (SMO)
The original SVM algorithm was invented by Vladimir Vapnik. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input is a member of, which makes the SVM a non-probabilistic binary linear classifier.
A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
K-Nearest Neighbors (IBK)
The k-nearest neighbors algorithm (k-NN) is a method [22] for classifying objects based on the closest training examples in the




feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. Here an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small).

Naive Bayesian Classifier (Naive Bayes)
This is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions [23]. In probability theory, Bayes' theorem shows how one conditional probability (such as the probability of a hypothesis given observed evidence) depends on its inverse (in this case, the probability of that evidence given the hypothesis). In more technical terms, the theorem expresses the posterior probability (i.e. after evidence E is observed) of a hypothesis H in terms of the prior probabilities of H and E, and the probability of E given H. It implies that evidence has a stronger confirming effect if it was more unlikely before being observed.

C4.5 Decision Tree (J48 in Weka)
The C4.5 algorithm, developed by Quinlan, is perhaps the most popular tree classifier [21]. It is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. The Weka classifier package has its own version of C4.5 known as J48. J48 is an optimized implementation of C4.5 rev. 8.

Bagging (bagging)
Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve classification by combining classifications of randomly generated training sets. The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, where it combines the predicted classifications from multiple models, or from the same type of model trained on different learning data. The technique generates multiple training sets by sampling with replacement from the available training data and assigns a vote to each classification.

Adaboost (AdaBoost M1)
AdaBoost is an algorithm for constructing a “strong” classifier as a linear combination of simple “weak” classifiers. Instead of resampling, each training sample uses a weight to determine the probability of being selected for a training set. The final classification is based on a weighted vote of the weak classifiers. AdaBoost is sensitive to noisy data and outliers. However, in some problems it can be less susceptible to overfitting than most learning algorithms.

Random forest (or random forests)
The algorithm for inducing a random forest was developed by Leo Breiman [25]. The term came from random decision forests, first proposed by Tin Kam Ho of Bell Labs in 1995. It is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. It is a popular algorithm which builds a randomized decision tree in each iteration of the bagging algorithm and often produces excellent predictors.

                       V.    EXPERIMENTAL SETUP
The open source tool Weka was used in different phases of the experiment. Weka is a collection of state-of-the-art machine learning algorithms [26] for a wide range of data mining tasks such as data preprocessing, attribute selection, clustering, and classification. Weka has been used in prior research both in the field of clinical data mining and in bioinformatics.
Weka has four main graphical user interfaces (GUIs), of which the principal ones are the Explorer and the Experimenter. Our experiment was run under both the Explorer and the Experimenter GUI of Weka. In the Explorer we can flip back and forth between the results we have obtained, evaluate the models that have been built on different datasets, and visualize graphically both the models and the datasets themselves, including any classification errors the models make. The Experimenter, on the other hand, allows us to automate the process by making it easy to run classifiers and filters with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests. Advanced users can employ the Experimenter to distribute the computing load across multiple machines using Java remote method invocation.

A. Cross-Validation
Cross-validation with 10 folds has been used for evaluating the classifier models. Cross-Validation (CV) is the standard data mining method for evaluating the performance of classification algorithms, mainly to estimate the error rate of a learning technique. In CV a dataset is partitioned into n folds, where each fold is used for testing and the remainder is used for training. The procedure of testing and training is repeated n times so that each partition or fold is used once for testing. The standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation. Stratification means making sure that, when sampling is done, each class is properly represented in both training and test datasets. This is achieved by randomly sampling the dataset when doing the n-fold partitions.
In a stratified 10-fold cross-validation the data is divided randomly into 10 parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. The learning procedure is executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. When seeking an accurate error estimate, it is standard procedure to repeat the CV process 10 times. This means invoking the learning algorithm 100 times. Given two models M1 and M2 with different accuracies tested on different instances of a data set,



to say which model is best, we need to measure the confidence level of each and perform significance tests.

                       VI.    PERFORMANCE MEASURES
Supervised Machine Learning (ML) has several ways of evaluating the performance of learning algorithms and the classifiers they produce. Measures of the quality of classification are built from a confusion matrix, which records correctly and incorrectly recognized examples for each class. Table II presents a confusion matrix for binary classification, where TP are true positive, FP false positive, FN false negative, and TN true negative counts.

                      Table II. Confusion matrix

                                     Predicted Label
                             Positive               Negative
     Known      Positive     True Positive (TP)     False Negative (FN)
     Label      Negative     False Positive (FP)    True Negative (TN)

The measures derived from the confusion matrix are:
True positive rate (TPR), also called Recall or Sensitivity, is the percentage of positive-labeled instances that were predicted as positive, given as TP / (TP + FN).
False positive rate (FPR) is the percentage of negative-labeled instances that were predicted as positive, given as FP / (TN + FP).
Precision is the percentage of positive predictions that are correct, given as TP / (TP + FP).
Specificity is the percentage of negative-labeled instances that were predicted as negative, given as TN / (TN + FP).
Accuracy is the percentage of predictions that are correct, given as (TP + TN) / (TP + TN + FP + FN).
F-measure is the harmonic mean of precision and recall, given as (2 x Recall x Precision) / (Recall + Precision).

                       VII. RESULTS AND DISCUSSIONS
Results show that certain algorithms demonstrate superior detection performance compared to others. Table III lists the evaluation measures used for the various classification algorithms to predict the best accuracy. These measures are the main criteria for judging which classifier is the best algorithm for a given category in bioinformatics. By prediction accuracy, SVM and C4.5 decision trees are considered the best among the single classifiers, and Random Forest the best among the ensembles.
Other measures such as F-measure and ROC area of the above classifiers are graphically compared in Figure 1, which displays the average F-measure and ROC area of both classes. The prediction accuracy of these classifiers is shown in Figure 2.

Figure 1. Comparison of average F-measure and ROC area

Figure 2. Comparing the prediction accuracy of all classifiers

                       Conclusions
Tuberculosis is an important health concern as it is also associated with AIDS. Retrospective studies of tuberculosis suggest that active tuberculosis accelerates the progression of HIV infection. Recently, intelligent methods such as Artificial Neural Networks (ANN) have been intensively used for classification tasks. In this article we have proposed data mining approaches to classify tuberculosis using both basic and ensemble classifiers. Finally, two models for algorithm selection are proposed with great promise for performance improvement. Among the algorithms evaluated, SVM and Random Forest proved to be the best methods.

                       Acknowledgment
Our thanks to KIMS Hospital, Bangalore for providing the valuable real tuberculosis data and to principal Dr. Sudharshan for giving permission to collect data from the hospital.
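As an illustration of the measures defined in Section VI, the following is a minimal Python sketch that computes them from the four counts of Table II. The class name `ConfusionMatrix` and the example counts are hypothetical and chosen only to exercise the formulas; they are not taken from the tuberculosis dataset or from the Weka output used in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion-matrix counts, named as in Table II."""
    tp: int  # known positive, predicted positive
    fn: int  # known positive, predicted negative
    fp: int  # known negative, predicted positive
    tn: int  # known negative, predicted negative

    def sensitivity(self) -> float:
        # True positive rate / recall: TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    def false_positive_rate(self) -> float:
        # FP / (TN + FP)
        return self.fp / (self.tn + self.fp)

    def specificity(self) -> float:
        # TN / (TN + FP), i.e. 1 - FPR
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # TP / (TP + FP)
        return self.tp / (self.tp + self.fp)

    def accuracy(self) -> float:
        # (TP + TN) / (TP + TN + FP + FN)
        total = self.tp + self.tn + self.fp + self.fn
        return (self.tp + self.tn) / total

    def f_measure(self) -> float:
        # Harmonic mean of precision and recall
        p, r = self.precision(), self.sensitivity()
        return 2 * r * p / (r + p)


# Hypothetical counts, for illustration only (not the paper's data).
example = ConfusionMatrix(tp=90, fn=10, fp=5, tn=95)
print(f"sensitivity = {example.sensitivity():.3f}")
print(f"specificity = {example.specificity():.3f}")
print(f"accuracy    = {example.accuracy():.3f}")
print(f"F-measure   = {example.f_measure():.3f}")
```

Note that specificity and FPR always sum to one, so reporting both (as Table III does) is redundant but convenient for reading the results.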


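The stratified 10-fold cross-validation described in Section V.A can be sketched in plain Python as below. This is only an illustration of the fold assignment, not the procedure Weka implements internally. The function names are hypothetical, and the 400/300 class split is an invented assumption: the paper states that there are 700 records but does not give the per-class counts.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign record indices to n_folds folds so that each class is spread
    as evenly as possible across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random sampling within each class
        for position, index in enumerate(indices):
            # Deal indices out round-robin, continuing where the
            # previous class stopped so fold sizes stay balanced.
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


def cross_validation_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); each fold is held out once."""
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test


# Hypothetical label vector: 700 records as in Section III, but the
# 400/300 class split is an assumption made for this sketch.
labels = ["PTB"] * 400 + ["RPTB"] * 300
splits = list(cross_validation_splits(labels))
for train, test in splits[:2]:
    ptb_in_test = sum(labels[i] == "PTB" for i in test)
    print(len(train), len(test), ptb_in_test)
```

In an actual experiment, a classifier would be trained on each `train` list, its error rate measured on the matching `test` list, and the ten error rates averaged; repeating the whole procedure with ten different seeds gives the 100 runs mentioned in Section V.A.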



                                            Table III. Performance comparison of various classifiers

       Classifier category             Classifier model                         Various measures                Disease categories(class)

                                                                                                                PTB                      RPTB
       Basic Learning classifiers      SVM(SMO)                                 TPR/ Sensitivity                98.9%                    99.6%
                                                                                FPR                             0.004                    0.011
                                                                                Specificity                     99.6%                    98.9%
                                                                                Prediction                      99.14%
                                                                                Accuracy
                                       K-NN(IBK)                                TPR/ Sensitivity                99.1%                    96.9%
                                                                                FPR                             0.03                     0.008
                                                                                Specificity                     96.9%                    99.1%
                                                                                 Prediction Accuracy             98.4%
                                        Naive Bayes                              TPR / Sensitivity               96.4%                    96.5%
                                                                                 FPR                             0.035                    0.037
                                                                                 Specificity                     96.5%                    96.4%
                                                                                 Prediction Accuracy             96.4%
                                        C4.5 Decision Trees (J48)                TPR / Sensitivity               98.5%                    100%
                                                                                 FPR                             0                        0.015
                                                                                 Specificity                     100%                     98.5%
                                                                                 Prediction Accuracy             99%
            Ensemble classifiers        Bagging                                  TPR / Sensitivity               98.5%                    99.6%
                                                                                 FPR                             0.004                    0.015
                                                                                 Specificity                     99.6%                    98.5%
                                                                                 Prediction Accuracy             98.85%
                                        AdaBoost (AdaBoostM1)                    TPR / Sensitivity               98.5%                    100%
                                                                                 FPR                             0                        0.015
                                                                                 Specificity                     100%                     98.5%
                                                                                 Prediction Accuracy             99%
                                        Random Forest                            TPR / Sensitivity               98.9%                    99.6%
                                                                                 FPR                             0.004                    0.011
                                                                                 Specificity                     99.6%                    98.9%
                                                                                 Prediction Accuracy             99.14%
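
The measures tabulated above follow from the counts of a two-class confusion matrix. The following is a minimal Python sketch of the standard definitions, not the software used to produce the reported figures; the name `binary_metrics` and all counts are hypothetical and are not taken from the Tuberculosis data set. It also illustrates why, with two classes, the specificity of one class equals the sensitivity of the other, and why FPR = 1 - specificity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    sensitivity: float  # true positive rate (TPR)
    fpr: float          # false positive rate
    specificity: float  # true negative rate
    accuracy: float     # fraction of all instances classified correctly


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> BinaryMetrics:
    """Compute per-class measures from two-class confusion-matrix counts.

    tp/fn: instances of the class of interest predicted correctly/incorrectly.
    fp/tn: instances of the other class predicted incorrectly/correctly.
    """
    positives = tp + fn
    negatives = fp + tn
    if positives == 0 or negatives == 0:
        raise ValueError("each class needs at least one instance")
    return BinaryMetrics(
        sensitivity=tp / positives,
        fpr=fp / negatives,
        specificity=tn / negatives,
        accuracy=(tp + tn) / (positives + negatives),
    )


# Hypothetical counts for illustration only (assumption, not the paper's data).
class_a = binary_metrics(tp=90, fn=10, fp=5, tn=95)
# Swapping the roles of the two classes yields the measures for the other class.
class_b = binary_metrics(tp=95, fn=5, fp=10, tn=90)

print(class_a)
print(class_b)
```

Under these definitions the prediction accuracy is the same whichever class is treated as positive, which is consistent with the table reporting a single accuracy value per classifier.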




                                                      AUTHORS PROFILE

Mrs. Asha.T obtained her Bachelor's and Master's degrees in Engineering from Bangalore University, Karnataka, India. She is pursuing research leading to a Ph.D. at Visvesvaraya Technological University under the guidance of Dr. S. Natarajan and Dr. K.N.B. Murthy. She has over 16 years of teaching experience and is currently working as an Assistant Professor in the Dept. of Information Science & Engg., B.I.T., Karnataka, India. Her research interests are in Data Mining, Medical Applications, Pattern Recognition, and Artificial Intelligence.

Dr. S. Natarajan holds a Ph.D. (Remote Sensing) from JNTU, Hyderabad, India. His experience spans 33 years in R&D and 10 years in teaching. He worked at the Defence Research and Development Laboratory (DRDL), Hyderabad, India, for five years and later at the National Remote Sensing Agency, Hyderabad, India, for twenty-eight years. He has over 50 publications in peer-reviewed conferences and journals. His areas of interest are Soft Computing, Data Mining and Geographical Information Systems.

Dr. K. N. B. Murthy holds a Bachelor's degree in Engineering from the University of Mysore, a Master's degree from IISc, Bangalore, and a Ph.D. from IIT, Chennai, India. He has over 30 years of experience in teaching, training, industry, administration, and research. He has authored over 60 papers in national and international journals and conferences, serves as a peer reviewer for journals and conferences of national and international repute, and has authored a book. He is a member of several academic committees: Executive Council, Academic Senate, University Publication Committee, BOE & BOS, and Local Inquiry Committee of VTU. He is also a Governing Body Member of BITES and a Founding Member of the Creativity and Innovation Platform of Karnataka. He is currently the Principal & Director of P.E.S. Institute of Technology, Bangalore, India. His research interests include Parallel Computing, Computer Networks and Artificial Intelligence.

				