

Design and Implementation of a Spatial Data Engine and Visualization Interface for a Crime Information System

Michael Wyland

   This work is completed in partial fulfillment of the requirements for course CS 6604, Spatial Data Management, at Virginia Tech, Spring 2008, under the guidance of Professor Chang-Tien Lu.
   M. S. Wyland is a student at Virginia Tech, National Capital Region, Falls Church, VA 22043 USA; Phone: 703-327-3009; e-mail:

   Abstract—Personal safety is of paramount importance to nearly all Americans. Law enforcement organizations in our nation’s cities are constantly challenged to apply their resources to keep us safe from crime. Under strict budgets, law enforcement is forced to apply resources to the areas with the most need, often referred to as “crime hotspots.” This is an efficient use of their law enforcement resources. A Crime Information System with a Spatial Data Engine and Visualization Interface would allow these law enforcement organizations to identify the crime hotspots and apply resources most efficiently. In addition, citizens themselves could use such a system to monitor crime activity in neighborhoods. This paper documents the design and implementation of such a Crime Information System.

   Index Terms— data processing, database systems, crime information systems, visualization

I. INTRODUCTION

   PERSONAL safety and the safety of our property are of paramount importance to nearly all Americans. Law enforcement organizations in our nation’s cities are constantly challenged to apply their resources to keep us safe from crime. Faced with shrinking budgets each year, law enforcement is forced to apply resources, such as patrolling police officers, to the areas of neighborhoods with the most need, often referred to as “crime hotspots.” This makes for an efficient use of the limited law enforcement resources that are available. But determining where those crime hotspots are can be a real problem. For example, police departments cannot realistically be expected to manually sort through hundreds or thousands of historical police reports to decide how to apply resources in the future.
   A better approach is to use an information system to solve this problem, specifically a Crime Information System (CIS). A CIS would allow police departments and other law enforcement organizations to electronically store historical information, such as police reports, and automatically make calculated assessments of crime hotspots so that resources can be applied against those areas. Technology exists to allow the creation of such a CIS.
   There are other uses for such a CIS, however, in addition to identification of crime hotspots. Average citizens could also use the CIS to determine how safe their neighborhood is. New residents moving into an area could identify the areas of the neighborhood that they should avoid at night, for example. One could also investigate local sex offenders, for example, to evaluate safety for their children.
   The CIS could also be used by crime intelligence analysts. These may not be the typical police officers that are interested in crime hotspots. Instead, a crime intelligence analyst may be interested in trends over time and space. For example, such an analyst may be interested in identifying what time of the year is worst for crime in a particular area. An analyst may correlate crime rates with the weather in the area at the time. Most importantly, a crime analyst may be able to identify spatial or temporal trends in crime, potentially indicating gang activity, for example.
   The CIS must be able to seamlessly integrate with the operations of law enforcement organizations, so it must be flexible and adaptable. For example, if the staff in a police department are used to typing police reports into a form, such as in Microsoft Word, the CIS needs to allow them to input that data in a similar way. That will ensure smooth continuity of business for the organization as they begin to use the CIS. The query and reporting functions of the CIS that serve law enforcement, crime analysts, and average citizens must also be intuitive and easy to use. The interface designs may be customized for each of these user segments due to their varying needs. The CIS must also be fast and efficient, able to ingest large volumes of crime data (potentially from multiple precincts or even from multiple cities), while also processing large spatial and temporal queries for crime information.
   Two key components of such a CIS are a Spatial Data Engine (SDE) and a Visualization Interface (VI). This document describes the design and implementation of an SDE and VI for just such a CIS.
   The SDE component is responsible for receiving, storing, and processing data from law enforcement organizations. The specific design of an SDE in this document will support data formats acquired from the Fairfax County, Virginia, Police
Department and the City of Falls Church, Virginia, Police Department, both in partnership with Virginia Tech. The SDE is flexible, however, to support data input from other sources. The SDE must support access to the data from external applications, such as the VI component, in a variety of ways to include temporal and spatial data aggregation.
   The VI component is responsible for retrieving raw or processed crime data from the SDE component for display to end users.
   Section II of this document discusses some related work in the field of crime information systems. Section III of this document discusses the detailed design and implementation of a Crime Information System, including the SDE and VI components.

II. RELATED WORK

   Much work has been done in the area of Crime Information Systems. Many organizations have attempted to implement web-based crime information systems, but not all of them meet the requirements described above. A lot of work has also been done in the general area of Spatial Data Mining, which can be directly applied to the design and implementation of a CIS. First we shall examine some related work in the area of Spatial Data Mining for Crime Information Systems.
   There are several spatio-temporal clustering algorithms that have been evaluated for use in CIS applications. These algorithms are primarily used to help identify and evaluate “hotspots” in the spatial data. One algorithm, known as Knox’s method[1], is very simple and uses only a straightforward application of a Chi-square value calculation based on user-defined proximity of time and space between crime events. Another algorithm, known as Mantel’s method[2], is also straightforward. Mantel’s method uses a cross product of the space and time distance between crimes, and then normalizes using the standard deviation. A better method, and one that seems to be mentioned in much of the research in this area, is Geoffrey’s “K-nearest Neighbor Method”[3]. This method is more complex than either Knox’s or Mantel’s method, but resolves some of their issues. The K-nearest Neighbor Method works by rating the location by its number of “kth” nearest neighbors (first, first and second, etc.) and accumulating these values to determine hotspots.
   Finally, it is important to note that crime data has properties that must be understood when applying a clustering algorithm. It most often contains large amounts of scattered noise points, and should be clustered based on the density of crime activity in a given area. One density-based spatial clustering algorithm that is particularly suitable for high-noise data is known as DBSCAN. The details of the DBSCAN algorithm will be discussed later.
   In addition to these efforts in the area of spatio-temporal clustering and hotspot detection, other work has been done regarding the overall approach to CIS implementation. Much research has been done to evaluate the necessary features of a CIS and the methodology for implementation.[4] Nath, from Oracle Corporation, evaluated a k-means clustering technique paired with crime data to determine where and when crime “sprees” were happening, in the form of pattern recognition.[5]
   Many other implementations of the CIS concept exist, and each has unique strengths and weaknesses. For example, some web-based CIS applications that exist today are very user-friendly and allow users to search for nearby crimes. Very few CIS applications that are widely available have implemented crime hotspot detection and analytical decision-making based on that data.
   [6]
      This website was intended to spatially visualize Fairfax County, Virginia, crime data and sex offender locations. To display crime data, the user selects from one-month datasets between March 2003 and August 2005. The fact that the data ends in August 2005 may be related to the manual process required to ingest data from the current format of Fairfax County Police Department crime incident reports (non-standardized, prose text in Microsoft Word). The site does allow mapping of incidents and provides police report data in an intuitive interface, but does not provide for crime hotspot identification, crime analysis, or other analytical functions.
   d/mynpolice.aspx?fxmResolution=1024x768[7]
      This website contains the current, official crime incident mapping utility for the Fairfax County Police Department. The site allows users to search by address and property identifiers to plot crime incidents on a map. The site is not extremely powerful, as it does not offer crime hotspot detection or crime analysis. The site is also less user-friendly than other CIS implementations are.
   http://www.georgia-sex- [8]
      This website allows users to plot sex offenders in the state of Georgia on a map. The user can easily select the city he or she is interested in and view detailed information about all sex offenders in the area. The site links to more detailed database information for each sex offender. One drawback of the system is that it is limited to sex offender information and cannot identify crime hotspots, in general, or perform any crime analysis.
   20Church&state=VA[9]
      This website is focused on reporting crime statistics for the City of Falls Church, VA. The site only contains data based on the 2003 FBI Report of Offenses Known to Law Enforcement, so it is clearly not up-to-date or dynamically connected to any law enforcement crime reporting processes. The site does not include any spatial visualization such as mapping. The strength of the site is its analytical presentation of the crimes in Falls Church. A citizen researching a possible move into the area could compare these statistics, which are detailed and
      graphically represented, against the same statistics for other cities.
   The CIS being designed in this paper intends to utilize some of the algorithms and best practices defined by these related works. For example, this paper will utilize a previously developed spatial clustering algorithm, DBSCAN, and will use interface design features that have been successful with previously implemented web-based CIS applications.

III. PROPOSED APPROACH

  A. System Architecture
   The Crime Information System would nominally be supported by a multi-tier architecture, with dedicated hardware and software for the backend database server as well as the front-end web server. The implementation described in this paper offers a simplified architecture, with the database server and web server combined on one hardware platform. The hardware architecture for the proposed system is depicted below in Figure 1, as implemented and demonstrated for the Spatial Data Management Lab. The server components will be discussed in detail in later sections.

           Figure 1 – Hardware Architecture Diagram

   The software architecture, including the relationship of the SDE and the VI components, is depicted below in Figure 2 in a workflow format from left to right. These components and their relationships will be described in more detail in later sections.

   [Figure 2 diagram: the Web App (UI) passes the Query Type (window, k-nearest, range), Entity Type Selection (corners, distance, crime type), Entity Value Selection (set or range), and Time Range to the SDE; the SDE returns a Data Set to the Web App (Visualization), which presents Charts, Data Grids, and Maps.]

           Figure 2 – Software Architecture Diagram

  B. Languages and Software Tools
   To implement the SDE component of the Crime Information System, Oracle 10g Express Edition was selected as the Relational Database Management System platform. Oracle was chosen over other available database platforms primarily due to the author’s familiarity with it. The Express Edition was selected because it offers the most flexible license agreement for non-commercial development, without feature or performance limitations that would impact the SDE. One downside to using the Express Edition is that the powerful Oracle Spatial component is not available for it. However, Oracle does offer “Locator,” a less powerful version of Oracle Spatial that is compatible with Express Edition and available with a flexible license agreement for non-commercial development. Locator has sufficient features to implement the envisioned CIS.
   Microsoft Internet Information Services (IIS) 5 was selected as the web server platform for the proposed system. IIS 5 was selected due to its wide availability with many Microsoft Windows-based systems. IIS 5 was recommended by the Spatial Data Management Lab as well.
   ASP.NET 2.x was used to build the highly dynamic user interface for the proposed system. ASP.NET makes developing web forms like the ones required for this system easy and powerful at the same time. As described below, several of the available web form controls have been integrated into the implemented system using ASP.NET.
   One of the .NET components included in the implemented system is Oracle Data Provider for .NET (ODP.NET). ODP.NET, available directly from Oracle and easily integrated with an Oracle database, offers Oracle-unique features and improved performance over other providers, such as the generic ones available with ASP.NET.
   Other web components used include Dundas Chart. This is a visualization control that makes implementation of the VI component easier. Dundas Chart allows extremely flexible graphing of crime data, in this instance, with professional-looking results. This component was integrated into the Crime Information System to easily display the crime information to end users.
   Another visualization platform, Google Maps, was included. The flexible and free Google Maps API is utilized. Another critical feature of Google Maps that is harnessed for the development of this CIS is the “Geocoding” capability. This is the ability to translate a given street address into a latitude/longitude point, along with other data.
   The development language used for the implementation is Visual Basic. This language can be used with ASP.NET for web applications, and was chosen based on the experience of the author.

  C. Data Sets
   To implement the CIS, the Virginia Tech Spatial Data Management Lab provided crime reports from the Fairfax County, Virginia, Police Department and the City of Falls Church Police Department. The Fairfax County dataset was provided from early 2006 through late 2007. The City of Falls Church
dataset was provided for the year 2007. Each dataset offered a unique set of challenges. The datasets are not provided in an easy-to-import format, such as a comma-delimited file. Other actions must be taken to import the data into a usable format.
   The Fairfax County Police Department releases their crime report data in the form of a Microsoft Word document. Each crime is documented in a prose paragraph with no standard structure or form. The document format and the non-standardized structure of the content make automated parsing of the data nearly impossible. Another challenging aspect of the police reports is the form that the crime addresses take. The reports often state crime locations in terms of the block or street they occur on, not a specific, unique address. For example, the addresses are written in the form “4100 block of Chain Bridge Road”.
   The City of Falls Church Police Department releases their crime report data in Acrobat PDF format. Each crime is concisely documented in a comma-delimited fashion, but the content is not consistently structured. For example, some crime entries include a specific date and time that the crime occurred, and others do not. This makes the comma-delimited structure of the document less useful.
   The Fairfax County Police Department dataset was chosen for use in this CIS implementation because it had significantly more data than the City of Falls Church dataset, and a larger overall spatial area to work with.

  D. Physical Data Model
   To implement the Spatial Data Engine to support the CIS, a physical data model and relational database implementation were needed. We must consider the fundamental attributes of crimes to determine the makeup of the physical data model.
   Crime records, in general, have the following consistent attributes:
      Date/time of the crime
      Type of the crime (e.g. robbery, assault)
      Narrative or description of the crime
      Address of the crime
   To implement a spatial database for this data, we need additional information. The following are examples of the additional data required:
      Crime identifier (primary key)
      Latitude/longitude of the crime
      A spatial object representing the location of the crime
      Accuracy of the “GeoCoding” process results described in Section B above
   To store this information, a single relation/table referred to as “tcrimes” was created with the following structure:

   TCRIMES
   Crime_ID | Crime_type | Crime_address | Crime_dt | Crime_geo_location | Narrative | Geocode_lat | Geocode_lon | Crime_city | Geocode_accuracy

             Figure 3 – Crime Physical Data Model

   Note that the “crime_geo_location” attribute is a spatial object designed to store point-type data indicating the location of the crime in a given record.

  E. Data Loading
   As described in Section C above, the sample crime datasets available for this CIS implementation had non-standard data formats, creating several challenges and driving the need for a unique data-loading solution.
   Rather than create a custom parser application to import the sample crime report data, the author chose to develop a “crime report input” interface for the CIS. This approach requires that a human manually type (or copy/paste) data into the form interface. This web-based form then interfaces with the Google Maps API GeoCoding object to retrieve latitude/longitude values for the crime address. This data is returned in a JSON format, and several attributes such as the address (reformatted), latitude, longitude, city, and accuracy of the geocoding process are all stored in the tcrimes table described in Section D. In addition, the other crime information input by the operator, such as the crime type, date/time, and narrative, is also stored in the same table. Note that this crime report input interface could be used by law enforcement professionals to input crime records into the system in the future, so this is seen as an advantage over other CIS implementations that use parsers to import sample data.

  F. Indexing Structures
   A typical B-tree index is implemented for the non-spatial primary key of the crimes table. In addition, in order for Oracle to support spatial queries, there must be a spatial index created and applied to the spatial object in the crimes table. By default Oracle implements this as a spatial R-tree index. In addition, Oracle recommends in its spatial query best practices documentation that index hints be supplied to drive the query optimizer toward use of the spatial index.
   The R-tree index is then automatically used for spatial queries offered by Oracle Locator. These queries are described in detail in Section G below.

  G. Spatial Query Support
   Oracle Locator offers spatial query functions that allow easy implementation of the CIS. For example, one type of query that would be useful in a CIS is a “Window Query”. In this type of query, the user is able to select a “box” or window on a map indicating the area over which he or she wants to find crime information. The Oracle Locator function sdo_relate is perfectly suited to handle query requests of this nature. This function can indicate the topological relationship of two or more spatial objects, and results returned by the function can be limited by the topological interaction type (e.g. overlap, intersect, disjoint). An additional option is known as “anyinteract”, which returns results any time the input spatial objects are non-disjoint. That is, with regard to the CIS, the sdo_relate function returns crime records if the spatial object (point location) of the crime has any non-disjoint spatial interaction with the selected window region (represented as a rectangle object).
   Two other types of queries useful for implementation of a CIS are the K-nearest neighbor query and the point range (“circle search”) query. These two types of queries operate based on a single reference point input by the user on a map interface. For the K-nearest neighbor query, the user inputs the number of crimes they would like to see returned by the CIS from that point. The user asks the question “Show me the closest X number of crimes in the system from this point location.” Again, Oracle
Locator offers a function uniquely suited to support this query. Here the function is known as sdo_nn, which returns a specified number of records containing a spatial object nearest to a given point location object. There is one nuance of this function that absolutely must be noted. With regard to the CIS implementation of this function, only certain types of crime records (e.g. burglaries, assaults) may be desired. If the “sdo_num_res” parameter is used with the sdo_nn function to indicate the number of desired results, the function returns exactly that number of nearest neighbors, without regard to any additional where clauses filtering based on non-spatial attributes (such as crime type). For example, if we desire to see the 5 nearest crimes to a location, in a SQL query the sdo_nn function, if using the sdo_num_res = 5 parameter, returns only 5 records, which may then be partly or entirely filtered out by another where clause value. We work around this by not using the sdo_num_res parameter and instead using the sdo_batch_size parameter along with a “rownum” constraint in the where clause. This is the recommended implementation from Oracle, and they indicate that the sdo_batch_size parameter should actually be set to 0 (zero) to allow Oracle to determine the optimal behavior of the sdo_nn function.
   The range query or “circle search” query is meant to return all crime records within a fixed distance from a reference point input by the user on the map interface. Oracle Locator offers the sdo_within_distance function to handle just such a query. This function returns records with a spatial object within a specified distance from a reference spatial object.

   Before discussing how the algorithm was implemented for this CIS, we must first describe how the cluster (“crime hotspot”) data is stored in the system. An additional table was added to the physical data model to support cluster storage. This additional relation is depicted here, and its attributes will be described as the algorithm itself is discussed.

   TCLUSTERS
   Crime_ID | Density | Eps | MinPts | Category | Cluster_Label

             Figure 4 – Cluster Physical Data Model

   The first step in the DBSCAN algorithm is to compute a “density score” for each point record (each crime location). The definition of the density score is the number of points within a specified radius (Eps) of the crime point location. Since this is exactly the definition of the range query described above, we again use the sdo_within_distance function of Oracle Locator, and count the number of entries it returns for each point in the crime database within Eps distance. We then store a record in the clusters table above for every point, and record its Density score, the Eps value used, and the MinPts value, which will be used in the next step in the algorithm.
   The second step in the DBSCAN algorithm is to categorize all records in the database (all crime records) as a core, border, or noise point. A core point is defined as a point that has more than a specified number of points (MinPts) within Eps distance. In other words, a core point has a Density Score greater than the specified MinPts value. These points eventually become the interior of each cluster (“hotspot”) identified. A border point is a point that has fewer than MinPts points within distance
  All three of these query types (window query, k-nearest           Eps, but is in the Eps-neighborhood of a core point. Finally a
neighbor query, and range query) were implemented for the           noise point is one that is neither core nor border. We execute
CIS discussed here. In addition, for each of these query types,     this step in the DBSCAN algorithm by updating the “category”
the user is able to constrain a start/stop date and time for the    attribute for each point in the cluster relation based on the
result set, and is also able to constrain the type of crime types   above criteria. A value of “core” is added to records with
returned. The user may also choose to return results from all       density score >= the minpts value, and a value of “border” is
date/times and all crime types.                                     added to records with density score < minpts, but are returned
  Dynamic SQL is constructed within ASP.NET to call each of         by sdo_within_distance from a core point. Finally a value of
these query types, utilizing the Oracle Locator functions           “noise” is added to records that are not already categorized as
described above, with the applicable constraint parameters.         core or border.
  Note that analysis of the Oracle query plans for each of these       Using the CIS implementation, we can visualize the
query types to confirm that the correct non-spatial and spatial     categorization of these points for sample values of Eps and
indices are being utilized.                                         MinPts. Below is an example:
  H. Crime Hotspot Identification and DBSCAN
The crime hotspot identification capability of this CIS was
implemented using the DBSCAN algorithm. DBSCAN was
chosen initially due to its straightforward algorithm leading to
easy integration and its density-based approach, which is very
applicable to crime hotspot identification. Where crime rates
are high in a fixed area (high density) a hotspot should be
identified and law enforcement resources applied to it. This
choice of spatial clustering algorithm was later validated as
DBSCAN is also resistant to “noise” which is prevalent in
crime data. This concept will be discussed further later.
   DBSCAN is a density-based spatial clustering algorithm
that operates based on two key parameters: an “Eps” (Epsilon)                 Figure 5 – DBSCAN Crime Categorization
value in the form of distance, and a “MinPts” (Minimum                In this example, points that are categorized as core are
Points) value.                                                      marked in red, those that are border are marked in green, and

those that are noise are marked in blue. This example uses an
Eps value of 0.5 miles and a MinPts value of 20. Some of the
potential crime hotspot areas already become clear visually at
this point.
   Also of note in Figure 5 are the larger number of blue
“noise” points. The existence of this large number of noise
points validates the choice of the DBSCAN algorithm for
clustering crime hotspots. We will see that these noise points
are removed and not considered in the cluster identification
process later.
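The first two DBSCAN steps described above (density scoring and core/border/noise categorization) can be illustrated with a minimal sketch. It is written in Python rather than the PL/SQL used by the CIS, purely for illustration. It assumes planar (x, y) coordinates in miles, replaces sdo_within_distance with a brute-force distance check, and assumes that each point counts itself in its own density score. The function names are hypothetical and do not appear in the CIS.

```python
from math import hypot

def density_scores(points, eps):
    """For each point, count the points within eps of it.

    Assumption: a point counts itself, as a range query centered
    on the point would also return the point.
    """
    return [
        sum(1 for q in points if hypot(p[0] - q[0], p[1] - q[1]) <= eps)
        for p in points
    ]

def categorize(points, eps, min_pts):
    """Return (scores, categories) using the rules from the text:
    core   -> density score >= min_pts
    border -> not core, but within eps of some core point
    noise  -> everything else
    """
    scores = density_scores(points, eps)
    is_core = [s >= min_pts for s in scores]
    categories = []
    for i, p in enumerate(points):
        if is_core[i]:
            categories.append("core")
        elif any(
            is_core[j] and hypot(p[0] - q[0], p[1] - q[1]) <= eps
            for j, q in enumerate(points)
        ):
            categories.append("border")
        else:
            categories.append("noise")
    return scores, categories
```

Calling categorize(points, 0.5, 20) would correspond to the parameter values used for Figure 5. The brute-force loops are quadratic in the number of points, whereas the CIS relies on Oracle's spatial index for the same neighborhood lookups.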
   Now that all points have been given a density score based on
Eps distance, and have been categorized based on MinPts and Eps
distance, we must now execute the main routine of the algorithm to
identify clusters. The first step in this process, one which makes
DBSCAN uniquely suitable for crime hotspot identification, is that
the points categorized as “noise” are simply removed from the
clusters relation. This is done with a straightforward delete
command. There have been many presentations of what follows for
the DBSCAN algorithm, in terms of actually identifying the unique
clusters and labeling each point with the cluster that it belongs
to. Below is the algorithm used for the initial implementation of
DBSCAN in this CIS:

            Figure 6 – DBSCAN Cluster Identification

   Based on this algorithm, we must cycle through the core points,
and for each one, if it is not already part of a cluster, we label
it into a new cluster. We then examine all points (non-noise)
around that cluster point and label its neighbors with the same
cluster label. When we complete this process for all of the core
points, we are finished, and the clusters have all been
identified.
   This entire algorithm, including the first steps of DBSCAN and
this cluster identification algorithm, was implemented directly
using PL/SQL in an Oracle stored procedure such that it could be
easily called repeatedly from the CIS web interface. Below is an
example of the map output after execution of this algorithm using
Eps = 0.25 miles and MinPts = 20.

              Figure 7 – Initial DBSCAN Test Results

   Immediately we see that clusters have been identified, and they
line up roughly with the major “red” areas in the categorization
results in Figure 5. These clusters line up with known crime
hotspots in Fairfax County, such as the Route 1 corridor, Tysons
Corner (mall located there), and Springfield (mall located there).
But the colors in the image, which are randomly generated and
assigned to each cluster, appear to be intermixed. This is an
indication that there may be a problem with the algorithm. Further
investigation confirms there is a problem, as can be seen in the
cluster example below:

                Figure 8 – DBSCAN Results Problem

   In this example we see what should clearly be labeled as one
unique cluster by the DBSCAN algorithm. Instead, two points are
marked as part of another cluster which is primarily located far
away from this location.
   Significant investigation indicates there is a fatal flaw with
the DBSCAN algorithm as presented in Figure 6. The order in which
the core points are evaluated is critical to the success of the
DBSCAN algorithm if it is to be implemented in this way. The
problem arises as follows: core point 1 is labeled as the start of
a new cluster, and its Eps-neighbors are labeled with that same
cluster. Core point 2 (the next core point in the unordered list)
may be far away from the cluster that was just identified, and so
core point 2 and its Eps-neighbors are labeled as a new cluster.
Core point 3 (again, the next core point in the unordered list) is
in the near vicinity of the original cluster identified, and so it
is already labeled. Despite already being labeled, the
Eps-neighbors of core point 3 are labeled

with the current cluster label, which is still set to the value of
the second cluster identified. So even though these neighbor
points are near the original cluster, they are labeled into a
cluster which is actually far away. This problem must be corrected
for DBSCAN to operate effectively.
   Another minor flaw with the algorithm presented in Figure 6 is
that it begins by initializing the current cluster label to a
value of 1 (one), but that value gets incremented before ever
being assigned to a cluster. The result is that the first cluster
label is always 2 (two).
   We can improve upon the presented DBSCAN algorithm in Figure 6
to resolve these issues. A modified version of the cluster
identification algorithm is presented below. The significant
changes from the original algorithm are highlighted in red.

current_cluster_label ← 0
for all core points do
   if the core point has no cluster label then
         current_cluster_label ← current_cluster_label + 1
         Label the current core point with cluster label current_cluster_label
   end if
   for all core points in the Eps-neighborhood, except the point itself do
         if the point does not have a cluster label then
             Label the point with cluster label current_cluster_label
         end if
   end for
end for
for all border points do
   if the border point has no cluster label then
         current_cluster_label ← cluster label of nearest core point to this border point
         Label the current border point with cluster label current_cluster_label
   end if
end for

      Figure 9 – Improved Cluster Identification Algorithm

   This improved cluster identification algorithm was implemented
as a separate stored procedure in the CIS, and the case that
produced the problem in Figure 8 was re-executed. The correct
results were returned as seen below. We see that the two points
that were in error before are now part of the expected cluster.

                 Figure 9 – Corrected DBSCAN Results

   Now that we have corrected the DBSCAN cluster identification
algorithm, we examine how to optimize the Eps and MinPts value
parameters. For DBSCAN this is done based on the assumption that,
for a given dataset, clusters can be identified best when
k-nearest neighbors are at “similar” distances from each point. In
other words, for points in a cluster, their kth nearest neighbors
are at roughly the same distance. To illustrate this we plot the
distance of the kth nearest neighbor from all points, for various
values of k. Results are depicted here for the same Fairfax County
crime dataset:

[Plot: “DBSCAN: Determining Eps and MinPts.” Horizontal axis:
Points Sorted According to Distance of kth Nearest Neighbor.
Vertical axis: kth Nearest Neighbor Distance (0 to 3). Series:
k=4, k=20.]

           Figure 10 – Eps and MinPts Optimization Plot

   To optimize Eps and MinPts, we choose a “vertical slice”
through the plot in Figure 10, where all points to the left of the
slice will be considered core points and those to the right will
be non-core. Therefore the further to the left we slide, the fewer
core points (and therefore fewer clusters) we will have. The
opposite applies when the vertical slice is made near the right
side of the graph. There is no strict definition for optimal Eps
and MinPts values based on this plot, but we use the plot to
identify “jumps” or discontinuities that indicate a group of
points that have similar distances for their kth nearest neighbor
(indicating good clusters). We see that for k=15, there is a clear
jump at distance = 0.1 miles (lower left corner). Using these
parameters will result in relatively few clusters, but they will
be highly correlated clusters. Below is the final example
illustrating this. We can clearly see crime hotspots identified in
Tysons Corner, Springfield, and the Route 1 corridor.

         Figure 11 – Crime Hotspot Identification Example
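The kth-nearest-neighbor distance curve used to choose Eps and MinPts can be illustrated with a minimal sketch. It is written in Python for illustration only, assumes planar (x, y) coordinates in miles, and uses brute-force distance computation rather than Oracle's sdo_nn. The function names are hypothetical. The largest_jump helper is an assumption added here to show one way a “jump” might be located automatically; in the approach described above, the jump is identified by inspecting the plot.

```python
from math import hypot

def k_distance_curve(points, k):
    """Distance from each point to its k-th nearest neighbor,
    sorted ascending (the curve plotted for one value of k)."""
    curve = []
    for i, p in enumerate(points):
        dists = sorted(
            hypot(p[0] - q[0], p[1] - q[1])
            for j, q in enumerate(points)
            if j != i  # a point is not its own neighbor
        )
        curve.append(dists[k - 1])
    return sorted(curve)

def largest_jump(curve):
    """Return (index, size) of the biggest gap between consecutive
    values of the sorted curve. Illustrative heuristic only."""
    size, index = max(
        (curve[i + 1] - curve[i], i) for i in range(len(curve) - 1)
    )
    return index, size
```

Under these assumptions, the value curve[index] just before the jump is a candidate for Eps, with k as the matching MinPts candidate. Points to the left of the jump have a kth neighbor within that distance, which corresponds to the “vertical slice” reading of the plot.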

  It should be noted that the crime hotspot identification
implemented for this CIS allows filtering by crime type and date
range, just as the other spatial-temporal queries do.

  I. Visualization
  The example CIS is implemented using an easy-to-use interface
with clear navigation between query types, a simple input/control
parameter approach, and straightforward map, datagrid, and graph
displays. The interaction with the map is intuitive and the
results are well organized. Markers representing the crimes from
any query can be clicked to display the primary crime metadata.
There is also an information tab available to display the full
report for that crime. An example, including a pie chart of the
crime types and a full datagrid display, is included below:

         Figure 12 – Full CIS Web Interface Example

   The CIS implemented here is fully functional with respect to
the initial objectives of the effort. This CIS implementation
requires additional features in order to meet the full scope of
law enforcement user needs. The primary prerequisite for these
additional features, however, is additional data to support them.
Law enforcement officials have indicated they would like other
CISs like the one implemented here to be able to perform
full-scope data mining, drawing connections between various crime
events. For example, if future data were available regarding the
actual criminals, their home locations, their current status (e.g.
in jail, at large), victim information, and other data, this CIS
implementation could be made significantly more powerful.
   The visualization techniques employed with this CIS could also
be enhanced in the future. Only one minor graph was implemented
here to display relative amounts of crime types. Other
implementations could enable analysis of “temporal” crime hotspots
(not only where the crime hotspots are, but when they are).

                             REFERENCES

1.      Knox, E.G. and M.S. Bartlett, The Detection of Space-Time
        Interactions. Applied Statistics, 1964. 13(1): p. 25-30.
2.      Mantel, N., The detection of disease clustering and a
        generalized regression approach. Cancer Research, 1967.
        27(2): p. 209-220.
3.      Jacquez, G.M., A k nearest neighbour test for space-time
        interaction. Statistics in Medicine, 1996. 15(18):
        p. 1935-1949.
4.      Chen, H., et al., Crime data mining: a general framework
        and some examples. Computer, 2004. 37(4): p. 50-56.
5.      Nath, S.V., Crime Pattern Detection Using Data Mining, in
        Proceedings of the 2006 IEEE/WIC/ACM International
        Conference on Web Intelligence and Intelligent Agent
        Technology. 2006, IEEE Computer Society.
6.      Fairfax County Sex Offenders. [cited 2008; Available from:
7.      My Neighborhood - Police Incident Mapper. [cited 2008;
        Available from:
8.      Georgia Sex Offender Maps and Alerts. [cited 2008;
        Available from: http://www.georgia-sex-
9.      Falls Church Crime Statistics. [cited 2008; Available
        from:
