Creation of a crime data mining and visualization system

NoVaCrime: A crime data mining and visualization system

Anukrati Gupta, Alex Galanes,
Kendrick Burnett
Date: Dec. 4th 2008
  •   Problem Definition
  •   Key Issues
  •   Motivation
  •   Related Work
  •   System Design
      ▫ Parsing and Loading
      ▫ Data Cleansing
      ▫ Weka
  • Software Design
      ▫   The Google Web Toolkit
      ▫   Querying
      ▫   Data Mining
      ▫   Visualization
           Charts
           Maps
  •   Demo
  •   Conclusion
  •   Future Work
  •   Questions
Problem Definition

• Definition
 ▫ Given XML files from three jurisdictions (Arlington,
   Fairfax and Falls Church), each representing a crime
   or series of crimes, identify and analyze
   suspicious patterns and trends.
• Components
 ▫   Data processing
 ▫   Database systems
 ▫   Crime information systems
 ▫   Visualization
Key Issues
• Input into the system may be very difficult to
  analyze: police reports do not follow a set or
  structured pattern and may have important
  pieces of information missing.
• Based on the information available it is hard to
  define the correctness or accuracy of
• Fast access to these records and queries against
  the records are necessary.
• Vast amounts of information need to be
  conveyed to the user in an easy-to-understand way.
Motivation
• There is a lack of crime analysis services
  available for the Northern Virginia area.
• Analysis of crime data can provide key
  information to police organizations to improve
  safety on the streets.
• Citizens desire crime information to make
  informed decisions about where to live, play,
  work, and attend school.
Related Work
• accepts an address and
  provides detailed information on crimes within a
  certain distance for a certain period of time.
 ▫ Good user interface but not all
   municipalities report criminal activity to
 ▫ As of December, 2008, only four
   jurisdictions in Virginia submitted their
   crime data to
Related Work – cont.
• Some municipalities also provide crime
  data on their websites, for example the Seminole County
  Sheriff’s Office in Florida.
  ▫ User interface not as easy to use as
  ▫ Data is shown on a map for a specific district for a
    given month.
  ▫ Unable to find a way to search for a given address
    or to look at data for more than one month at a time.
• Visualization techniques for existing crime-
  related websites are utilized as guidelines when
  displaying criminal data for the given data sets.
NoVaCrime System Design
Parsing and Loading
• Utilized provided XML files instead of reparsing
  raw data to conserve development time.
• Reverse-engineered an XSD to validate XML
• Created XSLT transformations to convert XML
  files into pipe-delimited files for loading into the
  database
  ▫ XSLT performed basic validation as well
• Loaded data first into Microsoft Excel and then
  into Microsoft SQL Server
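
The conversion step above can be sketched roughly as follows. This is a Python stand-in, not the XSLT the team wrote: the element names (`incident`, `date`, `crime`, `address`) and the validation rule (reject records with any empty field) are assumptions, since the slides do not show the XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real NoVaCrime XML schema is not shown in the slides.
SAMPLE_XML = """<incidents>
  <incident><date>2008-10-03</date><crime>Larceny</crime><address>100 Main St</address></incident>
  <incident><date>2008-10-04</date><crime></crime><address>200 Oak Ave</address></incident>
  <incident><date>2008-10-05</date><crime>Vandalism</crime><address>Elm St | Oak Ave</address></incident>
</incidents>"""

FIELDS = ["date", "crime", "address"]


def xml_to_pipe_delimited(xml_text):
    """Return (rows, rejected): pipe-delimited lines and records that fail validation."""
    rows, rejected = [], []
    for incident in ET.fromstring(xml_text).iter("incident"):
        values = [(incident.findtext(field) or "").strip() for field in FIELDS]
        # Basic validation (assumed rule): every field must be non-empty.
        if not all(values):
            rejected.append(values)
            continue
        # A literal pipe inside a value would break the delimiter, so replace it.
        rows.append("|".join(value.replace("|", "/") for value in values))
    return rows, rejected


rows, rejected = xml_to_pipe_delimited(SAMPLE_XML)
for row in rows:
    print(row)
print("rejected:", len(rejected))
```

Keeping rejected records in a separate list, rather than dropping them silently, makes it easier to see how much of the input fails validation.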
Parsing and Loading – cont.

• The 4000-character description field could not be
  loaded directly from the pipe-delimited file into
  SQL Server
• Dates loaded from the pipe-delimited file were treated
  by SQL Server as strings
• Using Microsoft Excel as an intermediary step
  solved both issues.
Data Cleansing
• Latitudes and longitudes were reversed in the Google geocodings
• Intersection addresses were not properly
  handled in the Fairfax County dataset
• Example
  ▫ Expected
      Bennington Woods Road and Baron Cameron Avenue
  ▫ Address in xml
      Road Baron Cameron Avenue/Bennington Woods
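
Both cleansing fixes can be illustrated with a small sketch. The function names are hypothetical. The intersection rule is inferred only from the single example above (the leading word of the left part is treated as the street suffix of the right part) and may not cover every record. The coordinate check assumes all points lie in Northern Virginia, where latitude is positive and longitude is negative.

```python
def fix_intersection(address):
    """Rewrite 'Road Baron Cameron Avenue/Bennington Woods' style addresses."""
    if "/" not in address:
        return address
    left, right = address.split("/", 1)
    # Assumed pattern: the first word of the left part belongs to the right street.
    suffix, _, first_street = left.strip().partition(" ")
    if not first_street:
        return address  # pattern does not match; leave the address untouched
    return f"{right.strip()} {suffix} and {first_street}"


def fix_coordinates(lat, lon):
    """Swap a reversed pair, assuming a region with positive latitude and negative longitude."""
    if lat < 0 < lon:
        return lon, lat
    return lat, lon


print(fix_intersection("Road Baron Cameron Avenue/Bennington Woods"))
print(fix_coordinates(-77.35, 38.96))
```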
Data Cleansing - cont.

• Crime Categories were null or uninformative for
  a significant portion of the data.
• Null categories were updated by hand by
  examining the specific crimes
• Large categories were split into smaller groups
  by running SQL queries against the
  description/crime fields.
 ▫ Primarily Larceny crimes
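
A minimal sketch of splitting a large category with SQL run against the description field. It uses an in-memory SQLite database as a stand-in for SQL Server. The table name, column names, sample rows, and keyword rules are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, category TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (category, description) VALUES (?, ?)",
    [
        ("Larceny", "Bicycle stolen from front porch"),
        ("Larceny", "Shoplifting at grocery store"),
        ("Larceny", "Wallet taken from unlocked vehicle"),
        ("Assault", "Altercation outside restaurant"),
    ],
)

# Hypothetical keyword rules: (new sub-category, LIKE pattern on the description).
RULES = [
    ("Larceny - Bicycle", "%bicycle%"),
    ("Larceny - Shoplifting", "%shoplift%"),
    ("Larceny - From Vehicle", "%vehicle%"),
]
for subcategory, pattern in RULES:
    conn.execute(
        "UPDATE incidents SET category = ? "
        "WHERE category = 'Larceny' AND description LIKE ?",
        (subcategory, pattern),
    )

counts = dict(conn.execute("SELECT category, COUNT(*) FROM incidents GROUP BY category"))
print(counts)
```

Because each update only touches rows still labeled `Larceny`, a record is claimed by the first rule that matches it, so rule order matters when descriptions contain several keywords.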

Weka
• NoVaCrime’s data mining engine
• Open source and free for development
• Rich API provides many common data mining
  algorithms without custom development
• Weka file format is easy to generate by hand or
  in code
• Weka Explorer GUI provided with distribution –
  allows for mining to occur prior to integration in
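
To illustrate how little is needed to generate Weka's ARFF format in code, here is a minimal sketch. The helper name, relation name, and attributes are hypothetical. It handles only numeric and nominal attributes whose values contain no spaces or commas, so no quoting is attempted.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, "numeric") or (name, [nominal values]).
    rows: list of value tuples; None becomes the ARFF missing-value marker '?'.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            values = ",".join(kind)
            lines.append("@attribute " + name + " {" + values + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "crimes",
    [
        ("latitude", "numeric"),
        ("longitude", "numeric"),
        ("day_of_week", ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]),
        ("category", ["Larceny", "Assault"]),
    ],
    [(38.88, -77.1, "Mon", "Larceny"), (38.85, -77.3, "Fri", None)],
)
print(arff_text)
```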
Weka – cont.
• Used generated ARFF files to try data mining through the Explorer
• Allowed for quick visualization before coding within NoVaCrime.

[Figure captions: SimpleKMeans -N 11 -S 10 · DBSCAN – 11 clusters E 0.9 -M 4 -I]
Weka – cont.
• Drawbacks
• Does not provide sufficient support for date/time or string attributes
  ▫ We converted most of these attributes into nominal attributes
• API could be improved
  ▫ Development team needed to examine underlying
    source code rather than just relying on Weka-provided
• Mining algorithms did not provide significant results
  ▫ Classifiers performed significantly worse than 50%.
  ▫ Clusters worked slightly better
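
One way the date/time workaround might look is to reduce each timestamp to nominal values before writing it out. This is a sketch only: the timestamp format, the function name, and the four time-of-day buckets are assumptions, not the team's actual conversion.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def to_nominal(timestamp, fmt="%Y-%m-%d %H:%M"):
    """Map a timestamp string to (day of week, part of day) nominal values."""
    moment = datetime.strptime(timestamp, fmt)
    if moment.hour < 6:
        part = "night"
    elif moment.hour < 12:
        part = "morning"
    elif moment.hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    # weekday() returns 0 for Monday, which matches the order of DAYS.
    return DAYS[moment.weekday()], part


print(to_nominal("2008-12-04 14:30"))
```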
Software Design
The Google Web Toolkit

• Write web application in Java
  ▫ Compiles client code to JavaScript
  ▫ Define/Use Widgets for display
     Large library of existing controls
         Grids
         Graphs
         Maps
• Communications
  ▫ RPC
     Define interface
     Asynchronous calls from client to server
         Primitive, Complex types

Querying
• Define data sources of interest
• Define time frame of interest,
  filter out noisy time-based data
• Define geographic area of interest
• Define crime information of interest
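
The four query dimensions can be sketched as a single filter. The `Incident` fields, the bounding-box representation of the geographic area, and the sample records are assumptions for illustration. The real system queried a database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    source: str      # data source, e.g. the reporting jurisdiction
    day: date
    lat: float
    lon: float
    category: str


def query(incidents, sources, start, end, bbox, categories):
    """Keep incidents matching source, time frame, bounding box, and category."""
    south, west, north, east = bbox
    return [
        i for i in incidents
        if i.source in sources
        and start <= i.day <= end
        and south <= i.lat <= north
        and west <= i.lon <= east
        and i.category in categories
    ]


incidents = [
    Incident("Arlington", date(2008, 10, 3), 38.88, -77.10, "Larceny"),
    Incident("Fairfax", date(2008, 10, 9), 38.85, -77.30, "Larceny"),
    Incident("Arlington", date(2008, 11, 20), 38.89, -77.09, "Assault"),
    Incident("Arlington", date(2008, 7, 1), 38.88, -77.11, "Larceny"),
]
matches = query(
    incidents,
    sources={"Arlington"},
    start=date(2008, 10, 1),
    end=date(2008, 12, 31),
    bbox=(38.80, -77.20, 38.95, -77.00),
    categories={"Larceny", "Assault"},
)
print(len(matches))
```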
Data Mining

• Three clustering algorithms
  ▫ DBSCAN
  ▫ K-Means
  ▫ Expectation-Maximization
• Select tab and specify
  parameters for algorithm
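
For readers unfamiliar with the clustering step, here is a minimal k-means sketch (Lloyd's algorithm) over latitude/longitude pairs. It is not Weka's SimpleKMeans and not the NoVaCrime code. It seeds the centroids with the first k points and uses plain Euclidean distance on degrees, both of which are simplifications chosen to keep the example deterministic and short.

```python
import math


def k_means(points, k, iterations=20):
    """Cluster 2-D points; return (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [
    (38.880, -77.100), (38.850, -77.300),
    (38.881, -77.101), (38.851, -77.299),
    (38.879, -77.099), (38.849, -77.301),
]
labels, centroids = k_means(points, k=2)
print(labels)
```

The resulting labels are what a "number of incidents per cluster" chart or a colored point map would consume.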
Visualization - Charts
• Shows several views of the data
  ▫ Day of Week chart
     Number of incidents per day of week
  ▫ Crime Category
     Number of incidents per category
  ▫ Crimes
     Number of incidents per crime type
  ▫ Clusters
     Number of incidents per cluster
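
Each chart above is a count of incidents grouped by one field. A minimal sketch of that aggregation, with a text rendering in place of a real chart library, might look like this. The record fields and sample data are hypothetical.

```python
from collections import Counter

incidents = [
    {"day": "Mon", "category": "Larceny", "cluster": 0},
    {"day": "Mon", "category": "Assault", "cluster": 1},
    {"day": "Fri", "category": "Larceny", "cluster": 0},
    {"day": "Sat", "category": "Larceny", "cluster": 2},
]


def incidents_per(field, records):
    """Count incidents grouped by the given field."""
    return Counter(record[field] for record in records)


def text_bar_chart(counts):
    """Render counts as rows of '#' marks, largest count first."""
    width = max(len(str(key)) for key in counts)
    return "\n".join(
        f"{str(key):<{width}} {'#' * n} ({n})" for key, n in counts.most_common()
    )


print(text_bar_chart(incidents_per("category", incidents)))
```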
Visualization - Maps
• Heat map
 ▫ Provides a density-
   based summary of the
   queried data.

• Points
 ▫ Colored map of
   points based on a
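
A density-based summary for a heat map can be approximated by binning points into a grid and counting incidents per cell. This is a sketch of that general idea, not the method NoVaCrime used. The cell size and sample points are arbitrary.

```python
import math
from collections import Counter


def density_grid(points, cell_size):
    """Count points per grid cell; keys are (row, column) cell indices."""
    grid = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        grid[cell] += 1
    return grid


points = [(38.885, -77.105), (38.886, -77.104), (38.855, -77.305)]
grid = density_grid(points, cell_size=0.01)
for cell, count in grid.most_common():
    print(cell, count)
```

Cells with higher counts would be drawn in hotter colors. A smaller `cell_size` gives finer detail at the cost of sparser cells.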
Demo - NoVaCrime Example
Conclusion
• Data quality
  ▫ Lots of noise/outliers in the data
  ▫ Difficult to clean/verify accuracy
  ▫ Empty values hurt the ability of classifiers to classify the records properly
• Weka
  ▫ Provided a quick way to run mining algorithms
  ▫ Did not support many data types well
  ▫ Unable to find an algorithm that appeared to produce useful results
• Google Web Toolkit/JFreeChart/Google Maps
  ▫ Provided relatively straightforward APIs for visualization
  ▫ We were able to produce several interesting charts and maps while
    providing an easy interface for data mining.
Future Work
• Additional Features
  ▫ Cleaning/Parsing
      Use raw data as input
      Additional noise reduction
  ▫ Querying
      Improved method to select a geographic area of interest
  ▫ Map
      Feedback on crime information for specific incidents from Map
      Better color selection/legend
  ▫ Charts
      Additional chart types
      Control over chart contents
  ▫ WEKA
      Additional Algorithms
      Feedback loop from algorithm run to subsequent runs
