Crime Data Mining and Visualization

Nemallapudi Chaitanya, Sunkara Anish, ELGammal
   Overview
   Objective
   Motivation
   Approach
   Design
   Mining
   Visualization
Overview
   Implementation of a Crime Data Mining and
    Visualization Application
   Use of Mining & Visualization techniques
     PostGIS
     Google Maps API
     WEKA

   Client-server approach using Java, JavaScript,
Objective
   Implement data mining methodologies on crime data.
   Provide visualization for better understanding the
   The project is based on publicly available dispatch reports from:
     City of Falls Church
     Fairfax County
     Arlington County
Motivation
   Data mining has proven useful for surfacing analytical
    information that traditional methods normally miss.
   Because it can draw conclusions from many perspectives,
    it can be used to:
     Identify crime trends and patterns/series.
     Assist law-enforcement agencies in resource planning.
     Aid the investigation process by offering a different perspective.
Approach
   Collect publicly available crime data.
   Parse the useful data and load it into a database.
   Use a spatial database to get the coordinates of the
    crime locations, criminal locations, etc.
   Use mining algorithms (DBScan, K-Nearest Neighbor
    and EM) to analyze trends in the data.
   Use Google Maps to show the crime data based on
   Use prefuse visualization to show graphs based on the
    data collected
Input (Data sets)
 Quality is an important characteristic for any data
 Challenges:
  Some attributes are difficult to extract (e.g., "juvenile BMs
  15-17yo wearing dark clothing...").
  Missing values (criminal age is not specified in all the
  In some cases, latitude and longitude values are
 Develop a crime ontology
Data Preprocessing
   Dimensionality reduction is needed to:
     Reduce the time and memory required by the data
      mining algorithms
     Make the data easier to visualize
     Help eliminate irrelevant features and reduce noise
   Aggregation was implemented:
     863 crime types were reduced to 45 crime types
     Crimes were classified into broader categories (e.g., Burglary
      Commercial and Burglary Residential are classified as Burglary)
   Crime information extraction (XML parsing)
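
The aggregation step above can be sketched as a simple prefix lookup. This is a minimal illustration, not the project's actual grouping rules: the function name aggregate_crime_type and the category list are assumptions, and only the Burglary example comes from the slides.

```python
# Hypothetical category list. Only "Burglary" is taken from the slides;
# the other entries are placeholders for illustration.
CATEGORY_PREFIXES = ["Burglary", "Larceny", "Auto Theft"]


def aggregate_crime_type(raw_type: str) -> str:
    """Collapse a detailed crime type into a coarser category."""
    # Normalize whitespace and capitalization before matching.
    normalized = " ".join(raw_type.split()).title()
    for prefix in CATEGORY_PREFIXES:
        if normalized.startswith(prefix):
            return prefix
    # Types that match no category are kept as they are.
    return normalized


if __name__ == "__main__":
    for raw in ["Burglary Commercial", "Burglary Residential", "Vandalism"]:
        print(raw, "->", aggregate_crime_type(raw))
```

A prefix rule is only one possible design; a hand-maintained lookup table would give finer control over the 45 final categories.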
System Architecture

 (Figure: system architecture diagram. Labels: Home page (Request),
 Home Page (Charts, Maps), Map Server, Mining, Data, Database.)
   Data Cleansing
     Automated grouping of the 863 crime types in the raw data
      into 45 final crime types
     A parser cleans missing information, handles nulls, and
      defines data types as it reads the data from the file and
      loads it into the
   Data Model
     Indexes are added on the fields most used in queries
      to improve performance.
                       Column Name
Id                           Zip
Dataset                      Criminal_age
Crime_type                   Criminal_gender
Description                  Victim_age
Crime_time                   Victim_gender
address                      Crime_latlng
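
The cleansing parser described above might look roughly like the following sketch. It assumes each raw record arrives as a dictionary of strings keyed by the column names in the table; the CrimeRecord class and the helper functions are hypothetical and cover only a few of the columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrimeRecord:
    """Typed subset of the data model (hypothetical class)."""
    id: int
    dataset: str
    crime_type: str
    zip_code: Optional[str]
    criminal_age: Optional[int]


def _clean(value: Optional[str]) -> Optional[str]:
    """Treat None, empty, and whitespace-only strings as missing."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def _to_int(value: Optional[str]) -> Optional[int]:
    """Convert to int, mapping missing or unparsable values to None."""
    value = _clean(value)
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def parse_row(row: dict) -> CrimeRecord:
    """Build a typed record from a raw row of strings."""
    return CrimeRecord(
        id=int(row["Id"]),
        dataset=row["Dataset"].strip(),
        crime_type=row["Crime_type"].strip(),
        zip_code=_clean(row.get("Zip")),
        criminal_age=_to_int(row.get("Criminal_age")),
    )
```

Mapping missing values to None keeps them distinguishable from real values such as age 0 when the record is later loaded into the database.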
Data Mining
   Several APIs are available for data mining.
     Ex: WEKA, Java Data Mining Package (JDMP),
      RapidMiner (YALE)
   WEKA
     WEKA is a machine learning and data mining
      software tool written in Java
     Open Source, well documented, support for
Data Mining Functions Implemented

   WEKA works on an “Attribute-Relation File Format
   Filters:
     Supervised: interface for filters that make use of a
      class attribute.
       Ex: Discretize, NominalToBinary, Resample
     Unsupervised: interface for filters that do not need a
      class attribute.
       Ex: Standardize, StringToNominal, StringToWordVector
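
ARFF, the file format mentioned above, is plain text: a header of @relation and @attribute lines followed by a @data section, with ? marking a missing value. The sketch below writes such a file from in-memory rows. The function name to_arff and the relation and attribute names are illustrative assumptions, and values containing spaces or commas would need quoting, which this sketch does not handle.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is "numeric",
    "string", or a list of allowed nominal values.
    rows: iterable of tuples in attribute order; None means missing.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_arff(
        "crimes",
        [("day_of_week", ["Mon", "Tue"]), ("criminal_age", "numeric")],
        [("Mon", 17), ("Tue", None)],
    ))
```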
Functions Available
   Supported classification functions
     Ex: NaiveBayes, RandomizableClassifier
   Supported association functions
     Ex: Apriori, Associator
   Supported clustering algorithms
     Ex: Simple K-means, DBScan, EM
   Outlier Detection based on location is facilitated by
   The data mining part is run separately on each of the
    3 datasets:
     Because attributes vary between the data sets.
     So that results do not reflect anomalies in the datasets.
       Ex: 725 records in Falls Church, compared to 11507 records
        in Fairfax (1705 records dealing with Auto-Theft)
   Fetching data
     Data is fetched from the database or CSV files using
      WEKA functions.
Implementation Continued…
   Filtering
       Unwanted attributes are filtered (removed) from the
        working data set. Ex: Criminal Description etc.
   Clustering
     The DBScan, K-means, and EM algorithms are implemented using
      the WEKA API.
     Simple K-means:
         Takes a range of values for K (say 1 to 45, since 45 is the
          number of different crime types in the database).
         Calculates the sum of squared errors (SSE) of the clusters for
          each K value and picks a K where the SSE is low.
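
The K selection described above can be sketched in plain Python. This is not the WEKA implementation: it is a small stand-in k-means (Lloyd's algorithm) that reports the SSE for each K, and the function names and seed are assumptions. Because SSE generally falls as K grows, one usually looks for the K where the improvement levels off instead of taking the absolute minimum.

```python
import random


def _sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, iterations=100, seed=0):
    """Run basic k-means and return the within-cluster SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return sum(_sq_dist(p, centroids[a]) for p, a in zip(points, assignment))


def sse_for_range(points, k_values):
    """Return {k: SSE} so the values can be compared across K."""
    return {k: kmeans_sse(points, k) for k in k_values}


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    for k, sse in sse_for_range(data, range(1, 4)).items():
        print(k, round(sse, 3))
```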
Implementation Continued…
   Visualization
     The mined data is then sent to the Visualization
      explorer in WEKA, where different attributes can be
      graphed and represented.
   Examples of some visualizations:

Example 1

       Arlington, day of week (DOW) vs. clusters: Auto Theft (C3) is low on weekends
Example 2

       Arlington data set: the data is skewed toward Wednesday.
Example 3

  Fairfax, month vs. clusters: the data for June is very sparse
Advantages of this Implementation

   Does not depend on one algorithm such as K-means.
   Modules can be added seamlessly to the existing
    code to implement other algorithms or using WEKA
   Open design: Algorithm implementation can be
    switched with simple parameter changes.
   Google Maps
     Visualization of data is implemented using the Google
      Maps API
   WEKA used for histograms and cluster visualization
PostGIS
   PostGIS is a spatial database extender for the
    PostgreSQL DBMS.
   Adds spatial functions such as distance and area, as well as
    specialty geometry data types, to the database.
   Relies on GiST (Generalized Search Tree) for
    indexing geometric data.
PostGIS (continued)
   Examples of geometry data types:
       POINT(2572292.2 5631150.7)
       LINESTRING(2566006.4 5633207.9, 2566028.6 5633215.1,
        2566062.3 5633227.1)
       POLYGON((2568262.1 5635344.1, 2568298.5 5635387.6,
        2568261.04 5635276.15, 2568262.1 5635344.1))
   Examples of PostGIS functions/operators:
       Distance(), Intersects(), Within(), Contains(), Length(), Area(),
        ConvexHull(), Extent(), ...
       A ~ B   (Does A contain B?)
       A @ B   (Does B contain A?)
       A && B  (Do A and B overlap?)
       ...
Detecting Outliers By Location
Map Visualization
   The client UI is implemented in JavaScript.
   The Google Maps API was used to view the map
    and draw all necessary illustrations.
   The UI uses asynchronous (AJAX) requests to
    communicate with the server.
   The server replies in JSON (a data-interchange
    format native to JavaScript).
   Request/Reply batching is used to improve
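
Request/reply batching over JSON can be sketched as follows: the client sends one JSON array holding several requests, and the server answers with one JSON array of replies, so a single round trip serves many queries. The request fields, handler names, and counts below are all hypothetical and are not taken from the project.

```python
import json

# Placeholder data standing in for a database query (illustrative only).
HYPOTHETICAL_COUNTS = {"22046": 3, "22030": 5}


def handle_request(request):
    """Answer a single request; 'crime_count' is an assumed request type."""
    if request.get("type") == "crime_count":
        region = request.get("region")
        return {
            "id": request["id"],
            "region": region,
            "count": HYPOTHETICAL_COUNTS.get(region, 0),
        }
    return {"id": request["id"], "error": "unknown request type"}


def handle_batch(body: str) -> str:
    """Decode a JSON array of requests and reply with a JSON array."""
    requests = json.loads(body)
    return json.dumps([handle_request(r) for r in requests])


if __name__ == "__main__":
    batch = json.dumps([
        {"id": 1, "type": "crime_count", "region": "22046"},
        {"id": 2, "type": "crime_count", "region": "22030"},
    ])
    print(handle_batch(batch))
```

Each reply echoes the request id so the client can match replies to requests without relying on their order.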
Map Visualization (continued)
   The UI can generate map-based visualizations
    showing the following:
     Crime rate in different regions, filtered by dataset,
      crime status (attempted/committed), year, month, or day
      of week.
     The crime type, year, month, or day of week with
      the highest frequency in each region.
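
The two kinds of map summaries above amount to filtered counts and per-region modes. A minimal sketch follows, assuming each record is a dictionary that already carries a region key; how regions are derived is not stated in the slides, and the function and field names here are hypothetical.

```python
from collections import Counter, defaultdict


def count_by_region(records, **filters):
    """Count records per region, keeping only those matching all filters."""
    counts = Counter()
    for record in records:
        if all(record.get(key) == value for key, value in filters.items()):
            counts[record["region"]] += 1
    return counts


def most_frequent_by_region(records, field):
    """For each region, return the most common value of the given field."""
    per_region = defaultdict(Counter)
    for record in records:
        per_region[record["region"]][record[field]] += 1
    return {
        region: counter.most_common(1)[0][0]
        for region, counter in per_region.items()
    }


if __name__ == "__main__":
    sample = [  # made-up records for illustration
        {"region": "A", "crime_type": "Burglary", "day_of_week": "Mon"},
        {"region": "A", "crime_type": "Burglary", "day_of_week": "Tue"},
        {"region": "A", "crime_type": "Larceny", "day_of_week": "Mon"},
        {"region": "B", "crime_type": "Larceny", "day_of_week": "Mon"},
    ]
    print(count_by_region(sample, day_of_week="Mon"))
    print(most_frequent_by_region(sample, "crime_type"))
```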
