PATTERN ANALYSIS OF BINARY DATA OF
MOTOR VEHICLE ACCIDENT FATALITIES
M. Salam1, E. Curtin2 and N. Langlois3
1 School of Information Technologies and Western Clinical School
The University of Sydney
Sydney, Australia.
E-mail: msalam@med.usyd.edu.au
2 Western Clinical School
The University of Sydney
Sydney, Australia.
E-mail: ecurtin@med.usyd.edu.au
3 Department of Forensic Medicine, ICPMR
Westmead Hospital, Westmead
Sydney, Australia.
Abstract: Binary data is rare in the medical domain, but whenever it
appears it poses serious problems in terms of determining
clinically interesting rules and patterns. In this paper we
develop algorithms for this purpose that incorporate some
data mining techniques. Various techniques have been
devised, particularly in satellite navigation systems [3],
that require high-powered computing facilities. Similar
techniques have been used in image analysis as well, but
all of them use various approximation tools with extremely
large data sets. We instead apply concepts of constraint
programming to reduce a huge combinatorial problem to a
more manageable level. We also compare our results with
traditional statistical methods used in medicine.
Keywords: Data Mining, Pattern Recognition, Binary Data
1. INTRODUCTION
The traditional research in evidence based medicine is based on the
classical works by Cochran [4], Cochran and Cox [5] and Snedecor and
Cochran [11] on sampling techniques, statistical methods and
experimental designs. Although in other areas of interest, sampling is
used more for its advantages such as reduced cost, greater speed,
greater scope and greater accuracy, in medicine it is a necessity. Unlike
a national census, where almost every individual is included in the study
or survey, in medicine it is impossible to include every patient in any
study. The validity of the results therefore depends heavily on the
statistical properties of the data. Such statistical analysis has limited
scope in medicine where it is used for testing a hypothesis such as
comparison of two treatments or investigative techniques.
Data mining techniques are widely used in the context of enterprise
and business data but are relatively unexploited in the medical and
clinical domain. In recent years there has been a great deal of interest in
the use of data mining techniques in the medical domain, with wide-ranging
applications such as genetic analysis, decision support systems, clinical
management and diagnostic problems. Pechenizkiy et al. [10]
presented an evaluation and comparison of several data mining
strategies that apply feature transformation for subsequent
classification, and considered their application to medical diagnostics.
Azuaje et al. [2] discuss improving clinical decision support systems
based on data fusion methods. Lenic et al. [8] have addressed the hard
question of harnessing the subjective and at times conflicting opinions
drawn from human expertise in computer-based decision support
systems. Lavrac [7] looked at the big picture of data
analysis in medicine, covering rule induction, if-then
rules, rough sets, association rules, ripple down rules, learning of
classification and regression trees, inductive logic programming,
discovery of concept hierarchies and constructive induction, case-based
reasoning, instance-based learning, neural networks, Bayesian
classifiers, etc. Jurisica et al. [6] proposed a case-based reasoning
system, named the TA3 IVF system, to suggest possible modifications to
an IVF (In Vitro Fertilization) treatment plan in order to improve the
overall success rate. A practical use of data mining for
the prediction of a disease process using statistical methods is
demonstrated at a website by Tigrani and John [13]. Lavrac [7] also gave a
detailed overview of some of the techniques used in the analysis of
medical data. Breault et al. showed the power of data mining in
determining interesting patterns from a large data warehouse of a
specific disease progression. The medical domain poses a unique
ethical issue in data mining in terms of confidentiality, as discussed by
Berman [3].
Algorithms for medical data need to be designed around the specific
requirements of the domain. In particular, decision support systems in
medicine must incorporate clinical knowledge and expertise, and their
results should be clinically plausible. To achieve this we use the
modified Composite Association Rules Algorithm of Ye and Keane
[15] as presented in [14]. This allows us to explore the various
combinations possible in a medical context.
2. THE DATA
The injury patterns of motor vehicle accident (MVA) fatalities were studied
using post-mortem reports from the Department of Forensic Pathology
at Westmead Hospital. Each postmortem report was accompanied by a
Police investigation report which confirmed the position of the
deceased in the vehicle.
The analysis was restricted to MVAs occurring between 1996 and
2004. Postmortem information after this period was not available at the
time of data collection. Reports prior to 1996 were not included in an
effort to restrict new variables such as differences in vehicle design and
in the postmortem methods of the pathologist.
Few restrictions were placed on crash type: data was collected for
crashes involving single and multiple vehicles, with and without
rollover and occupant ejection. Excluded were collisions with animals
and MVAs which involved immersion or incineration.
Drivers and front seat passengers were selected for inclusion only if
death occurred within seven days of the crash and a full post-mortem
report was available. Excluded were drivers and passengers of
motorcycles, pedal cycles and wheelchairs. Where the occupant's
position in the vehicle was unknown, that occupant was excluded.
Victims included many more drivers than passengers, and
following consultation with a statistician it was decided to achieve an
approximate 2:1 ratio of drivers to passengers for each calendar year
included. Ninety-one front passengers were identified and 206 drivers
were randomly selected from the eligible drivers. The data collected for
each of the 297 cases included age, gender, year of death, role within
the vehicle and injuries sustained during the MVA. Injuries were
catalogued using a code which carried information for the following
three variables:
(A) The REGION of the body affected was assigned a number
between 1 and 34 as shown in Figure 1.
(B) The DEPTH of the injury was described as either
superficial or deep: superficial referred to skin injuries such as
abrasions, contusions and lacerations with an overall diameter
of at least 2cm. Deep injuries included bone fractures, organ
damage and lacerations extending to deep fascia.
(C) The LOCATION of the injury within the region was
described as either anterior, posterior, medial or lateral where
appropriate.
Aortic injuries were excluded from analysis since they are non-
localizing injuries arising from rapid deceleration forces rather than
direct trauma. Where injuries were described in the postmortem report
using compound descriptors such as anterolateral, only the first
descriptor was coded; an anterolateral injury would thus be coded as an
anterior injury. Where the report did not specify the exact location of an
injury and used instead broad descriptors such as 'back' or 'upper
limb', the injury was assigned multiple region codes and location was
coded as unknown.
Relative risks (RR) and their 95% confidence intervals (CI) were
calculated to quantify differences between drivers and passengers with
regard to the frequency of each injury type.
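The paper does not state how the relative risks and their confidence intervals were computed. As a minimal sketch, assuming the common log-scale (Katz) interval, the calculation for a single injury type could look like the following. The function name and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def relative_risk(a, n1, c, n2, z=1.96):
    """Relative risk of an injury in group 1 versus group 2.

    a / n1: injured / total in group 1 (e.g. drivers)
    c / n2: injured / total in group 2 (e.g. passengers)
    Returns (rr, lower, upper) using a log-scale interval; z=1.96
    corresponds to an approximate 95% confidence interval.
    """
    if a <= 0 or c <= 0 or a > n1 or c > n2:
        raise ValueError("need 0 < a <= n1 and 0 < c <= n2")
    rr = (a / n1) / (c / n2)
    # Standard error of ln(RR) for two independent proportions
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Hypothetical counts for one injury type (not study data)
rr, lower, upper = relative_risk(a=150, n1=206, c=60, n2=91)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An interval that contains 1 indicates no detectable difference between the two groups for that injury type at this confidence level.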
Figure 1. Body Regions
3. OBJECTIVE AND METHODOLOGY
Our objective is to find the patterns of injuries that can discriminate
between driver and passenger fatalities. Given the number of fields, one
can expect 2^158 possible distinct patterns, which is a combinatorial
nightmare. Although the number of records is so low that only a
fraction of these patterns will actually be found, even that requires an
extensive search.
First of all we establish the constraints of our patterns of interest. There
are two fundamental values about the pattern that will tell us if it is of
any interest, namely, sensitivity and specificity. The definition of these
two parameters fits well with the characteristics of our binary data. We
can calculate these values using Table 1.
                      Standard
                      Yes    No
Test     Positive     TP     FP
         Negative     FN     TN
Table 1. Sensitivity and Specificity
The sensitivity is calculated as:
Sensitivity = TP/(TP + FN)
and the specificity is calculated as:
Specificity = TN/(TN + FP)
where
TP - true positive
FN - false negative
TN - true negative and
FP - false positive
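The two formulas translate directly into code. The sketch below is illustrative only: the function names are our own, and the counts are made-up values for groups of 91 and 206 records, not figures reported in the paper.

```python
def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("TP + FN must be positive")
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("TN + FP must be positive")
    return tn / (tn + fp)


# Illustrative counts only (not study data)
tp, fn = 47, 44    # standard group: 91 records
tn, fp = 62, 144   # test group: 206 records
print(f"sensitivity = {100 * sensitivity(tp, fn):.1f}%")
print(f"specificity = {100 * specificity(tn, fp):.1f}%")
```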
These parameters fit our problem well if we divide the data into
driver and passenger records: the patterns and their
sensitivities and specificities can then be tested against each other by
selecting driver as standard and passenger as test, and then vice versa. But
first we need to reduce the number of patterns to a manageable level. For this
purpose we use another constraint borrowed from data mining [1], namely
support, which can easily be calculated for each field before any
attempt is made to extract patterns. A minimum value can be set in
accordance with the acceptable level of evidence required.
The following algorithm describes our approach.
Definition 1
A binary pattern p is said to be in another pattern q if Σi (pi - qi) ≤ 0
where pi and qi are elements of p and q.
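A minimal sketch of this containment test follows. The first function implements the summed condition exactly as stated in Definition 1. The second is a stricter element-wise variant (pi ≤ qi for every i), which is our assumption about how containment would be applied when a pattern is matched against a record; the paper does not state it in this form. Both function names are hypothetical.

```python
def in_pattern_sum(p, q):
    """Definition 1 as stated: sum over i of (p_i - q_i) <= 0."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return sum(pi - qi for pi, qi in zip(p, q)) <= 0


def in_pattern_elementwise(p, q):
    """Stricter variant (assumption): every field set in p is also set in q."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return all(pi <= qi for pi, qi in zip(p, q))


p = (1, 0, 1, 0)
q = (1, 1, 1, 0)
print(in_pattern_sum(p, q), in_pattern_elementwise(p, q))

# The two tests can disagree when p and q set different fields
print(in_pattern_sum((1, 0, 0), (0, 1, 0)),
      in_pattern_elementwise((1, 0, 0), (0, 1, 0)))
```

The element-wise form is the one used in the algorithm sketch after the numbered steps, since it guarantees that a record counted for a pattern actually contains every injury in that pattern.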
Algorithm
1 Split the data into two groups, driver D and passenger T.
2 Use D as standard and T as test data.
3 Select S = {s1, s2, ···, sn}, a set of minimum levels of support, one for
each field 1, ···, n.
4 Select a minimum threshold t for true positive TP.
5 Calculate the support for each field in D using
σi = Σj δij
where δij is the value of the ith field in the jth record.
6 Discard the field i if σi ≤ si.
7 Let the number of remaining fields be n.
8 Select the minimum length l of a pattern in terms of number of fields.
9 Generate the set P of possible patterns from the combinations C(n, i).
10 Treat each record as a pattern.
11 For each pattern p ∈ P, calculate the number of records it is in. This
is the true positive TP for this pattern. We will denote it by TPp.
12 Calculate the false negative FNp by subtracting TPp from the total number
of records in D.
13 Repeat steps 11 and 12 on T dataset and calculate false positive FPp
and true negative TNp for each p.
14 Calculate sensitivity and specificity for each p.
15 Repeat steps 2 - 14 reversing the role of D and T.
16 Compare the results.
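The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: records are tuples of 0/1 values, a pattern is a tuple of field indices that must all be set in a record, and `max_len` is a hypothetical upper bound on pattern length that we add to keep the enumeration finite (the paper specifies only a minimum length). The toy data are invented.

```python
from itertools import combinations


def count_matches(pattern, records):
    """Number of records in which every field of the pattern is set."""
    return sum(all(r[i] for i in pattern) for r in records)


def mine_patterns(standard, test, min_support, min_tp, min_len, max_len):
    """Return (pattern, sensitivity, specificity) for each retained pattern."""
    if not standard or not test:
        raise ValueError("both datasets must be non-empty")
    n_fields = len(standard[0])
    # Steps 5-6: per-field support in the standard set; drop sigma_i <= s_i
    sigma = [sum(r[i] for r in standard) for i in range(n_fields)]
    kept = [i for i in range(n_fields) if sigma[i] > min_support[i]]
    results = []
    # Steps 8-9: enumerate combinations of the remaining fields
    for k in range(min_len, max_len + 1):
        for pattern in combinations(kept, k):
            tp = count_matches(pattern, standard)   # step 11
            if tp < min_tp:                         # step 4 threshold
                continue
            fn = len(standard) - tp                 # step 12
            fp = count_matches(pattern, test)       # step 13
            tn = len(test) - fp
            results.append((pattern, tp / (tp + fn), tn / (tn + fp)))
    return results


# Toy data (invented): three binary fields per record
drivers = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 1, 0)]
passengers = [(1, 0, 1), (0, 0, 1), (1, 1, 1)]

for pattern, sens, spec in mine_patterns(
        drivers, passengers, min_support=[1, 1, 1],
        min_tp=2, min_len=2, max_len=2):
    print(pattern, round(sens, 2), round(spec, 2))
```

Step 15 corresponds to calling `mine_patterns(passengers, drivers, ...)` with the roles reversed, after which the two result lists can be compared as in step 16.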
4. APPLICATIONS AND RESULTS
First we apply the statistical methods and then our algorithm in
search of discriminators for the injuries sustained by the drivers and
front seat passengers.
4.1 Statistical Methods
Initial analysis was done using the statistical methods of Cochran [4, 5],
with less than satisfactory results. For example, we could only calculate
the confidence intervals or sensitivities and specificities of various
regions, as shown in Table 2. There were significant overlaps of
confidence intervals. Overall accuracy was 65%. It is hard to draw any
conclusions from these results.
                  95% CI for Drivers    95% CI for Passengers
Head              94 – 99               83 – 95
Neck              49 – 63               41 – 62
Chest             90 – 97               90 – 99
Right Arm         70 – 82               59 – 78
Left Arm          64 – 77               66 – 84
Mid Abdomen       73 – 84               61 – 80
Lower Abdomen     49 – 63               56 – 76
Right Leg         71 – 82               61 – 80
Left Leg          74 – 85               69 – 87
Table 2. Confidence Intervals (CI)
4.2 Data Mining Applications
In this section we give the results of the algorithm we
developed along with the data mining techniques. The algorithm was
applied with the following settings.
● Since we obtained 65% accuracy from the statistical methods,
we set the support si to 50 in order to catch most of the patterns.
● We kept the threshold value t at one third of all the records
in the dataset under consideration.
● The length of the pattern was 8 for passengers and 9 for
drivers.
After various iterations requiring only a few seconds we obtained the
results in Tables 3 and 4. The binary patterns have been translated to
field names.
Pattern                    Sensitivity    Specificity
a58,a150                   51.6           30.1
a22,a35                    59.3           38.8
a22,a35,a58,a150           41.8           52.4
a3,a5                      67.0           21.4
a3,a5,a58,a150             38.5           45.1
a3,a5,a22,a35              45.1           50.0
a1,a2                      60.4           33.5
a1,a2,a22,a35              41.8           56.3
a1,a2,a3,a5                48.4           41.3
a1,a2,a3,a5,a22,a35        34.1           59.7
Table 3. Passenger
It is important to note both the differences and similarities in the
patterns. The comparison indicates that the passenger deaths include
patterns of injuries other than head and neck, which are the most
common ones.
Pattern                    Sensitivity    Specificity
a58,a153                   69.9           48.4
a3,a5,a58,a153             54.9           61.5
a1,a3,a5,a58,a153          43.2           72.5
a1,a2,a58,a153             49.0           67.0
a1,a2,a3,a4,a58,a153       43.2           72.5
Table 4. Driver
As expected, left side injuries are more common in passengers,
whereas right side injuries are more common in drivers. It appears that
central thoracic injuries are not common in drivers. It may be the case that
most cars nowadays come with driver side air bags and fewer with dual air
bags. The elements of each pattern in Tables 3 and 4 are given in Table 5.
Code     Description
a1       Scalp including internal injuries
a2       Brain
a3       Left Face (Superficial)
a5       Right Face (Superficial)
a22      Central Thoracic (Deep)
a35      Left Chest (Lateral)
a58      Right Abdominal (Lateral)
a150     Left Shoulder (Deep)
a153     Right Shoulder (Deep)
Table 5. Injury Code
5. CONCLUSION AND DISCUSSION
We have developed efficient algorithms and techniques that give us
a more detailed picture than can be obtained through statistical
methods. The first three rows of Table 3 and the first row of Table 4 are
prime candidates for discriminators. In the analysis of these results one
must keep in mind the limitations of the data. For example, there was
no information regarding the nature of the impact. A side impact will
have a significantly different outcome from a head-on collision. Other
factors, such as speed, the colliding object, the type of vehicle and the
use of air bags (particularly driver only or dual), also influence the
outcome of the accident.
REFERENCES
[1] Agrawal, R., Imielinski, T. and Swami, A. (1993) “Mining association
rules between sets of items in large databases”, Proceedings of ACM
SIGMOD Conference, 207-216.
[2] Azuaje, F., Dubitzky, W., Black, N. and Adamson, K. Improving clinical
decision support through case-based data fusion, IEEE Transactions on
Biomedical Engineering, 46(10), 1181 – 1185.
[3] Bau-Hua et al. (2005) A star pattern recognition algorithm using bit
match, In International Conference on Machine Learning and Cybernetics,
8, 4818-4823.
[3] Berman, J. J. (2002) Confidentiality issues for medical data miners,
Artificial Intelligence in Medicine, 26, 25-36.
Breault, J. L., Goodall, C. R. and Fos, P. J. (2002) Data mining a diabetic
data warehouse, Artificial Intelligence in Medicine, 26, 37-54.
[4] Cochran, W. G. (1977) Sampling Techniques, Wiley, New York.
[5] Cochran, W. G. and Cox, G. M. (1992) Experimental Designs, Wiley, New
York.
[6] Jurisica, I., et al., (1998) Case-based reasoning in IVF: prediction and
knowledge mining. Artificial Intelligence in Medicine, 12(1), 1-2.
[7] Lavrac, N. (1999) Selected techniques for data mining in medicine,
Artificial Intelligence in Medicine, 16(1), 3-23.
[8] Lenic, M., Povalej, P., Zorman, M. and Kokol, P. “Multiple opinions for
medical decision support”, Proceedings of 17th IEEE Symposium on
Computer Based Medical Systems, Bethesda, MD, USA, 230-235.
[10] Pechenizkiy, M., Tsymbal, A. and Puuronen, S. “PCA-based feature
transformation for classification: issues in medical diagnostics” in
Proceedings of 17th IEEE Symposium on Computer Based Medical Systems,
Bethesda, MD, USA, 535-540.
[11] Snedecor, G. W. and Cochran, W. G. (1989) Statistical Methods, Ames: Iowa State
University Press, USA.
[12] Srikant, R. and Agrawal, R. (1995) “Mining generalised association
rules”, Proceedings of 21st Int. Conference on VLDB.
[13] Tigrani, V. and John, G. H. (2005) Data Mining And Statistics In
Medicine: An Application In Prostate Cancer Detection, URL:
http://citeseer.nj.nec.com.
[14] Salam, M. A., Illingworth, P. and Davis, J. (2005) “Applications of data
mining techniques in assisted reproductive technology”, Proceedings of
16th Australasian Conference on Information Systems, Sydney, Australia.
[15] Ye, X. and Keane, J. A. (1997) “Mining Composite Items in Association
Rules”, Proceedings of IEEE International Conference on Systems, Man
and Cybernetics, 1367-1372.