							PATTERN ANALYSIS OF BINARY DATA OF
MOTOR VEHICLE ACCIDENT FATALITIES


                 M. Salam1, E. Curtin2 and N. Langlois3
        1 School of Information Technologies and Western Clinical School
                             The University of Sydney
                                Sydney, Australia.
                       E-mail: msalam@med.usyd.edu.au

                          2 Western Clinical School
                          The University of Sydney
                               Sydney, Australia.
                      E-mail: ecurtin@med.usyd.edu.au

                  3 Department of Forensic Medicine, ICPMR
                        Westmead Hospital, Westmead
                             Sydney, Australia.




Abstract: Binary data are rare in the medical domain, but whenever they
          appear they pose serious problems in terms of determining
          clinically interesting rules and patterns. In this paper we
          develop algorithms that incorporate some data mining
          techniques. Various techniques have been
          devised, particularly in satellite navigation systems [3],
          that require high-powered computing facilities. Similar
          techniques have been used in image analysis as well. However,
          all of them use various approximation tools with extremely
          large data sets. We instead apply concepts of constraint
          programming to reduce a huge combinatorial problem to a
          more manageable level. We also compare our results with
          traditional statistical methods used in medicine.

Keywords: Data Mining, Pattern Recognition, Binary Data
1. INTRODUCTION
    Traditional research in evidence-based medicine is based on the
classical works by Cochran [4], Cochran and Cox [5] and Snedecor and
Cochran [11] on sampling techniques, statistical methods and
experimental designs. Although in other areas of interest sampling is
used more for its advantages, such as reduced cost, greater speed,
greater scope and greater accuracy, in medicine it is a necessity. Unlike
a national census, where almost every individual is included in the study
or survey, in medicine it is impossible to include every patient in any
study. That means the validity of the results depends very much on the
statistical properties of the data. Such statistical analysis has limited
scope in medicine, where it is used for testing a hypothesis such as a
comparison of two treatments or investigative techniques.
    Data mining techniques are widely used in the context of enterprise
and business data but are relatively unexploited in the medical and
clinical domain. In recent years there has been a great deal of interest in
the use of data mining techniques in the medical domain, with wide-ranging
applications such as genetic analysis, decision support systems, clinical
management and diagnostic problems. Pechenizkiy et al. [10]
presented an evaluation and comparison of several data mining
strategies that apply feature transformation for subsequent
classification, and considered their application to medical diagnostics.
Azuaje et al. [2] discuss improving clinical decision support systems
based on data fusion methods. Lenic et al. [8] addressed the hard
question of harnessing the subjective and at times conflicting opinions
drawn from human expertise in computer-based decision support
systems. Lavrac [7] looked at the big picture of data
analysis in medicine, discussing rule induction, if-then
rules, rough sets, association rules, ripple down rules, learning of
classification and regression trees, inductive logic programming,
discovery of concept hierarchies and constructive induction, case-based
reasoning, instance-based learning, neural networks, Bayesian
classifiers, etc. Jurisica et al. [6] proposed a case-based reasoning
system to suggest possible modifications to an IVF (In Vitro
Fertilization) treatment plan in order to improve the overall success rate.
The system is named the TA3 IVF system. A practical use of data mining for
the prediction of a disease process using statistical methods is
demonstrated at a website by Tigrani and John [13]. Lavrac [7] also gave a
detailed overview of some of the techniques used in the analysis of
medical data. Breault et al. showed the power of data mining in
determining interesting patterns from a large data warehouse of a
specific disease progression. The medical domain poses a unique
ethical issue in data mining in terms of confidentiality, as discussed by
Berman [3].
   There is a growing need to design algorithms that meet the varying
demands and specific requirements of dealing with medical data.
In particular, medicine requires that clinical
knowledge and expertise be incorporated in decision support systems. The results
should also be clinically plausible. To achieve that we use the
modified Composite Association Rules Algorithm of Ye and Keane
[15] as presented in [14]. This allows us to explore the various
combinations possible in a medical context.

2. THE DATA
    The injury patterns of motor vehicle accident (MVA) fatalities were studied
using post-mortem reports from the Department of Forensic Pathology
at Westmead Hospital. Each post-mortem report was accompanied by a
Police investigation report which confirmed the position of the
deceased in the vehicle.
   The analysis was restricted to MVAs occurring between 1996 and
2004. Post-mortem information after this period was not available at the
time of data collection. Reports prior to 1996 were not included, in an
effort to limit additional variables such as differences in vehicle design and
in the post-mortem methods of the pathologist.
    Few restrictions were placed on crash type: data was collected for
crashes involving single and multiple vehicles, with and without
rollover and occupant ejection. Excluded were collisions with animals
and MVAs which involved immersion or incineration.
     Drivers and front seat passengers were selected for inclusion only if
death occurred within seven days of the crash and a full post-mortem
report was available. Excluded were drivers and passengers of
motorcycles, pedal cycles and wheelchairs. Where the occupant's
position in the vehicle was unknown, that occupant was excluded.
     Victims included many more drivers than passengers, and
following consultation with a statistician it was decided to achieve an
approximate 2:1 ratio of drivers to passengers for each calendar year
included. Ninety one front passengers were identified and 206 drivers
were randomly selected from the eligible drivers. The data collected for
each of the 297 cases included age, gender, year of death, role within
the vehicle and injuries sustained during the MVA. Injuries were
catalogued using a code which carried information for the following
three variables:
         (A) The REGION of the body affected was assigned a number
         between 1 and 34 as shown in Figure 1.
         (B) The DEPTH of the injury was described as either
         superficial or deep: superficial referred to skin injuries such as
         abrasions, contusions and lacerations with an overall diameter
         of at least 2cm. Deep injuries included bone fractures, organ
         damage and lacerations extending to deep fascia.
         (C) The LOCATION of the injury within the region was
         described as either anterior, posterior, medial or lateral where
         appropriate.
    Aortic injuries were excluded from analysis since they are
non-localizing injuries arising from rapid deceleration forces rather than
direct trauma. Where injuries were described in the post-mortem report
using compound descriptors such as anterolateral, only the first
descriptor was coded; an anterolateral injury would thus be coded as an
anterior injury. Where the report did not specify the exact location of an
injury and used instead broad descriptors such as 'back' or 'upper
limb', the injury was assigned multiple region codes and location was
coded as unknown.
    Relative risks (RR) and their 95% confidence intervals (CI) were
calculated to quantify differences between drivers and passengers with
regard to the frequency of each injury type.
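
A relative risk and its 95% confidence interval can be sketched as follows. The paper does not say which interval method was used; this sketch assumes the common log-normal (Katz) approximation. The function name and the counts in the example are hypothetical and are not taken from the study data.

```python
import math

def relative_risk(a, n1, c, n2, z=1.96):
    """Relative risk of an injury in group 1 versus group 2.

    a of n1 subjects in group 1 and c of n2 subjects in group 2 have
    the injury. Returns (rr, lower, upper), where the interval uses the
    log-normal approximation (an assumption, not the paper's stated method).
    """
    if a == 0 or c == 0:
        raise ValueError("the log-based interval needs non-zero injury counts")
    rr = (a / n1) / (c / n2)
    # Standard error of log(RR)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

# Hypothetical counts: 50 of 100 in one group, 25 of 100 in the other.
rr, lower, upper = relative_risk(50, 100, 25, 100)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that contains 1 indicates no clear difference between the two groups for that injury type.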
                            Figure 1. Body Regions
3. OBJECTIVE AND METHODOLOGY
Our objective is to find the patterns of injuries that can discriminate
between driver and passenger fatalities. Given the number of fields, one
can expect 2^158 possible distinct patterns, which is a combinatorial
nightmare. Although the number of records is so low that only a
fraction of these patterns will be found, even that requires an
extensive search.

First of all we establish the constraints on our patterns of interest. Two
fundamental values tell us whether a pattern is of
any interest, namely sensitivity and specificity. The definitions of these
two parameters fit well with the characteristics of our binary data. We
can calculate these values using Table 1.

                                        Standard
                                        Yes    No
                         Test Positive  TP     FP
                              Negative  FN     TN

                       Table 1. Sensitivity and Specificity

The sensitivity is calculated as:
                         Sensitivity = TP/(TP + FN)

and the specificity is calculated as:

                         Specificity = TN/(TN + FP)

where
TP - true positive
FN - false negative
TN - true negative and
FP - false positive
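
The two formulas translate directly into code. The sketch below is illustrative only: the function names are ours, and the counts are hypothetical values chosen to match the group sizes reported in Section 2 (91 passengers as the standard, 206 drivers as the test group).

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts: a pattern found in 47 of 91 standard records
# and in 144 of 206 test records.
tp, fn = 47, 91 - 47
fp, tn = 144, 206 - 144

print(round(100 * sensitivity(tp, fn), 1))  # 51.6
print(round(100 * specificity(tn, fp), 1))  # 30.1
```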

    We can use these parameters if we divide the data on the
basis of driver and passenger records. The patterns and their
sensitivities and specificities can then be tested against each other by selecting
driver as standard and passenger as test, and then vice versa. But first
we need to reduce the patterns to a manageable number. For this purpose we
use another constraint borrowed from data mining [1], namely support,
which can be easily calculated for each field prior to any
attempt at extracting patterns. A minimum value can be set in
accordance with the acceptable level of evidence required.

    The following algorithm describes our approach.

Definition 1
A binary pattern p is said to be in another pattern q if Σi (pi - qi) ≤ 0
where pi and qi are elements of p and q.
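
Definition 1 can be written as a small predicate. The sketch below implements the summed criterion exactly as stated, alongside a stricter element-wise check. The element-wise version is our assumption about the intended meaning (every field set in p is also set in q) and is not given in the paper. For 0/1 vectors the summed criterion is necessary but not sufficient for element-wise containment, as the second example shows. Both function names are hypothetical.

```python
def in_pattern_sum(p, q):
    """Definition 1 as stated: the sum over i of (p_i - q_i) is at most 0."""
    return sum(pi - qi for pi, qi in zip(p, q)) <= 0

def in_pattern_elementwise(p, q):
    """Stricter reading (an assumption): every field set in p is set in q."""
    return all(pi <= qi for pi, qi in zip(p, q))

# p is contained in q under both readings.
p, q = (1, 0, 1, 0), (1, 1, 1, 0)
print(in_pattern_sum(p, q), in_pattern_elementwise(p, q))  # True True

# The sum is 0, yet p sets a field that q does not.
p, q = (1, 0), (0, 1)
print(in_pattern_sum(p, q), in_pattern_elementwise(p, q))  # True False
```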

Algorithm

1 Split the data into two groups, driver D and passenger T.
2 Use D as standard and T as test data.
3 Select S = {s1, s2, ··· , sn}, a set of minimum levels of support, one for
each field 1, ··· , n.
4 Select a minimum threshold t for true positive TP.
5 Calculate the support for each field in D using
                                    σi = Σj δij
 where δij is the value of the jth record of the ith field.
6 Discard the field i if σi ≤ si.
7 Let the number of remaining fields be n.
8 Select the minimum length l of a pattern in terms of number of fields.
9 Generate the set P of possible patterns from the combinations C(n, i) of
i fields chosen from the n remaining fields, with i ≥ l.
10 Treat each record as a pattern.
11 For each pattern p ∈ P, calculate the number of records it is in. This
is the true positive TP for this pattern. We will denote it by TPp.
12 Calculate the false negative FNp by subtracting TPp from the total number of
records in D.
13 Repeat steps 11 and 12 on the T dataset and calculate the false positive FPp
and true negative TNp for each p.
14 Calculate sensitivity and specificity for each p.
15 Repeat steps 2 - 14, reversing the roles of D and T.
16 Compare the results.
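
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: records are tuples of 0/1 values, a pattern is a tuple of field indices that must all be set, and the `max_len` bound on pattern length is a hypothetical parameter added to keep the search finite. All names and the toy data are ours.

```python
from itertools import combinations

def support(records):
    """Step 5: support of each field is its column sum."""
    return [sum(column) for column in zip(*records)]

def count_containing(records, fields):
    """Steps 10-11: number of records in which every listed field is set."""
    return sum(all(record[i] for i in fields) for record in records)

def mine_patterns(standard, test, min_support, min_tp, min_len, max_len):
    """Steps 2-14 for one choice of standard and test groups."""
    sigma = support(standard)
    # Step 6: discard field i when its support is at most s_i.
    kept = [i for i, s in enumerate(sigma) if s > min_support[i]]
    results = []
    for length in range(min_len, max_len + 1):       # steps 8-9
        for fields in combinations(kept, length):
            tp = count_containing(standard, fields)  # step 11
            if tp < min_tp:                          # step 4 threshold t
                continue
            fn = len(standard) - tp                  # step 12
            fp = count_containing(test, fields)      # step 13
            tn = len(test) - fp
            results.append((fields, tp / (tp + fn), tn / (tn + fp)))  # step 14
    return results

# Toy data (hypothetical): three binary fields per record.
drivers = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 1, 0)]
passengers = [(1, 0, 0), (0, 1, 1), (1, 1, 0)]

results = mine_patterns(drivers, passengers, min_support=[1, 1, 1],
                        min_tp=2, min_len=2, max_len=2)
for fields, sens, spec in results:
    print(fields, round(sens, 3), round(spec, 3))
```

Step 15 corresponds to calling `mine_patterns(passengers, drivers, ...)` with the roles swapped, and step 16 to comparing the two result lists.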




4. APPLICATIONS AND RESULTS
   First we apply the statistical methods and then our algorithm in
search of discriminators for the injuries sustained by the drivers and
front seat passengers.

4.1 Statistical Methods
   Initial analysis was done using the statistical methods of Cochran [4, 5],
with less than satisfactory results. For example, we could only calculate
the confidence intervals or sensitivities and specificities of various
regions, as shown in Table 2. There were significant overlaps of
confidence intervals. Overall accuracy was 65%. It is hard to draw any
conclusions from these results.

                        95% CI for Drivers    95% CI for Passengers
      Head                   94 – 99                 83 – 95
      Neck                   49 – 63                 41 – 62
      Chest                  90 – 97                 90 – 99
      Right Arm              70 – 82                 59 – 78
      Left Arm               64 – 77                 66 – 84
      Mid Abdomen            73 – 84                 61 – 80
      Lower Abdomen          49 – 63                 56 – 76
      Right Leg              71 – 82                 61 – 80
      Left Leg               74 – 85                 69 – 87

                         Table 2. Confidence Intervals (CI)

4.2 Data Mining Applications
    In this section we give the results of the algorithm we
developed along with the data mining techniques. The settings used in
applying our algorithm were as follows.

   ●    Since we obtained 65% confidence from the statistical methods
        we set support si to be 50 in order to catch most of the patterns.
   ●    We kept the threshold value t to be one third of all the records
        in the dataset under consideration.
   ●    Length of the pattern for passengers was 8 and for the drivers it
        was 9.

After various iterations requiring only a few seconds we obtained the
results in Tables 3 and 4. The binary patterns have been translated to
field names.

                        Pattern          Sensitivity Specificity
                       a58,a150             51.6          30.1
                        a22,a35             59.3          38.8
                    a22,a35,a58,a150        41.8          52.4
                         a3,a5              67.0          21.4
                     a3,a5,a58,a150         38.5          45.1
                     a3,a5,a22,a35          45.1          50.0
                         a1,a2              60.4          33.5
                     a1,a2,a22,a35          41.8          56.3
                       a1,a2,a3,a5          48.4          41.3
                   a1,a2,a3,a5,a22,a35      34.1          59.7

                                 Table 3. Passenger
It is important to note both the differences and similarities in the
patterns. The comparison indicates that the passenger deaths include
patterns of injuries other than head and neck, which are the most
common ones.

                      Pattern           Sensitivity Specificity
                     a58,a153              69.9          48.4
                  a3,a5,a58,a153           54.9          61.5
                 a1,a3,a5,a58,a153         43.2          72.5
                  a1,a2,a58,a153           49.0          67.0
               a1,a2,a3,a4,a58,a153        43.2          72.5

                                 Table 4. Driver

As expected, left side injuries are more common in passengers
whereas right side injuries are more common in drivers. It appears that central thoracic
injuries are not common in drivers. It may be the case that most cars
nowadays come with driver side air bags and fewer with dual air bags. The
elements of each pattern in Tables 3 and 4 are given in Table 5.

                   Code               Description
                    a1    Scalp including internal injuries
                    a2    Brain
                    a3    Left Face (Superficial)
                    a5    Right Face (Superficial)
                    a22   Central Thoracic (Deep)
                    a35   Left Chest (Lateral)
                    a58   Right Abdominal (Lateral)
                    a150  Left Shoulder (Deep)
                    a153  Right Shoulder (Deep)

                             Table 5. Injury Code
5. CONCLUSION AND DISCUSSION
   We have developed efficient algorithms and techniques that give us
a more detailed picture than can be obtained through statistical
methods. The first three rows of Table 3 and the first row of Table 4 are
prime candidates as discriminators. In the analysis of these results one
must keep in mind the limitations of the data. For example, there was
no information regarding the nature of the impact. A side impact will
have a significantly different outcome from a head-on collision. Other
factors such as speed, the colliding object, the type of vehicle and the use of air
bags (particularly driver only or dual) also influence the outcome of
the accident.




REFERENCES
[1] Agrawal, R., Imielinski, T. and Swami, A. (1993) “Mining association
    rules between sets of items in large databases”, Proceedings of ACM
    SIGMOD Conference, 207-216.
[2] Azuaje, F., Dubitzky, W., Black, N. and Adamson, K. Improving clinical
    decision support through case-based data fusion, IEEE Transactions on
    Biomedical Engineering, 46(10), 1181-1185.
[3] Bau-Hua et al. (2005) A star pattern recognition algorithm using bit
    match, In International Conference on Machine Learning and Cybernetics,
    8, 4818-4823.
[3] Berman, J. J. (2002) Confidentiality issues for medical data miners,
    Artificial Intelligence in Medicine, 26, 25-36.
    Breault, J. L., Goodall, C. R. and Fos, P. J. (2002) Data mining a
    diabetic data warehouse, Artificial Intelligence in Medicine, 26, 37-54.
[4] Cochran, W. G. (1977) Sampling Techniques, Wiley, New York.
[5] Cochran, W. G. and Cox, G. M. (1992) Experimental Designs, Wiley, New
    York.
[6] Jurisica, I., et al. (1998) Case-based reasoning in IVF: prediction and
    knowledge mining, Artificial Intelligence in Medicine, 12(1), 1-2.
[7] Lavrac, N. (1999) Selected techniques for data mining in medicine,
    Artificial Intelligence in Medicine, 16(1), 3-23.
[8] Lenic, M., Povalej, P., Zorman, M. and Kokol, P. “Multiple opinions for
    medical decision support”, Proceedings of 17th IEEE Symposium on
    Computer Based Systems, Bethesda, MD, USA, 230-235.
[10] Pechenizkiy, M., Tsymbal, A. and Puuronen, S. “PCA-based feature
    transformation for classification: issues in medical diagnostics”, in
    Proceedings of 17th IEEE Symposium on Computer Based Medical Systems,
    Bethesda, MD, USA, 535-540.
[11] Snedecor, G. W. and Cochran, W. G. (1989) Statistical Methods, Ames: Iowa State
    University Press, USA.
[12] Srikant, R. and Agrawal, R. (1995) “Mining generalised association
    rules”, Proceedings of 21st Int. Conference on VLDB.
[13] Tigrani, V. and John, G. H. (2005) Data Mining And Statistics In
    Medicine: An Application In Prostate Cancer Detection, URL:
    http://citeseer.nj.nec.com.
[14] Salam M. A., Illingworth P. and Davis J. (2005) “Applications of data
    mining techniques in assisted reproductive technology”, Proceedings of
    16th Australasian Conference on Information Systems, Sydney, Australia.
[15] Ye X. and Keane, J. A. (1997) “Mining Composite Items in Association
    Rules”, Proceedings of IEEE International Conference on Systems, Man
    and Cybernetics, 1367-1372.

						