CRIME DATA MINING AND
Nemallapudi Chaitanya, Sunkara Anish, ELGammal
Implementation of Crime Data Mining and
Use of Mining & Visualization techniques
Google Maps API
This project is to implement mining methodologies on
Provide visualization for better understanding the
This is based on publicly available dispatch reports of
City of Falls Church
Data mining has proven to be a useful methodology in
providing analytical data normally unseen by
Because of its ability to draw conclusions based on
many perspectives, it can be used to:
Identify crime trends and patterns/series.
Assist law-enforcement agencies in planning of resources
Aid investigation process by giving a different perspective
Collect publicly available crime data
Parse useful data and load it into database
Use a spatial database to get the coordinates of the
crime locations, criminal locations, etc.
Use mining algorithms (DBScan, K-Nearest Neighbor
and EM) to analyze the trends in the data
Use Google Maps to show the crime data based on
Use prefuse visualization to show graphs based on the
Input (Data sets)
Quality is an important characteristic for any data
Some attributes are difficult to extract (e.g., "juvenile BMs
15-17yo wearing dark clothing...")
Missing values (criminal age is not specified in all the
In some cases, latitude and longitude values are
Develop a crime ontology
Need to implement dimensionality reduction
Reduce amount of time and memory required by data
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
863 crime types were reduced to 45 crime types
Classification of crimes (e.g. Burglary Commercial and
Burglary Residential are classified as Burglary)
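The grouping of raw crime types into broader categories can be sketched as a keyword lookup. This is only an illustration: the keyword table and the `classify_crime` helper are hypothetical, and the project's actual 45 categories are not listed here.

```python
# Hypothetical keyword-to-category table (assumption: the real 45 categories differ).
CATEGORY_KEYWORDS = {
    "BURGLARY": "Burglary",
    "AUTO THEFT": "Auto Theft",
    "LARCENY": "Larceny",
}


def classify_crime(raw_type: str) -> str:
    """Map a raw crime type string to a broader category by keyword match."""
    # Normalize case and collapse repeated whitespace before matching.
    normalized = " ".join(raw_type.upper().split())
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in normalized:
            return category
    return "Other"


raw_types = ["Burglary Commercial", "Burglary Residential", "Auto Theft"]
categories = [classify_crime(t) for t in raw_types]
```

Both burglary variants fall into the same category, which is the kind of reduction described above.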
Crime Information Extraction (XML Parsing)
Home page (Request)
Map Server Mining
Automated grouping of 863 crime types in raw data
into 45 final crime types
Cleaning of missing information, handling of nulls, and
definition of data types are done by a parser that
reads the data from the file and loads it into the
Indexes are added to the fields most used in
queries to improve performance.
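A minimal sketch of such a parse, clean, load, and index step is shown below. It uses Python's built-in `sqlite3` as a stand-in for the project's database, and the column names (`crime_type`, `report_date`, `suspect_age`) are assumptions made for illustration.

```python
import csv
import io
import sqlite3


def parse_age(value):
    """Return an int age, or None when the field is blank or not a number."""
    value = (value or "").strip()
    return int(value) if value.isdigit() else None


def load_reports(csv_text, conn):
    """Parse CSV text, clean each row, load it, and index a frequently queried field."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crime ("
        "crime_type TEXT NOT NULL, report_date TEXT, suspect_age INTEGER)"
    )
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        crime_type = (row["crime_type"] or "").strip()
        if not crime_type:
            continue  # skip rows that lack the mandatory crime type
        report_date = (row["report_date"] or "").strip() or None
        cleaned.append((crime_type, report_date, parse_age(row["suspect_age"])))
    conn.executemany("INSERT INTO crime VALUES (?, ?, ?)", cleaned)
    # Index on a field assumed to appear in most queries.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_crime_type ON crime (crime_type)")
    conn.commit()
    return len(cleaned)


sample = (
    "crime_type,report_date,suspect_age\n"
    "Burglary,2009-03-01,17\n"
    "Auto Theft,2009-03-02,\n"
    ",2009-03-03,30\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_reports(sample, conn)
```

Missing ages are stored as NULL rather than as a placeholder value, so later queries can exclude them explicitly.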
Different APIs for data mining.
Ex: WEKA, Java Data Mining Package (JDMP),
WEKA is a Machine Learning and Data Mining
software tool written in Java
Open Source, well documented, support for
Data Mining Functions Implemented
WEKA works on an “Attribute-Relation File Format
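For illustration, an ARFF file has a `@relation` line, one `@attribute` line per column, and a `@data` section with comma-separated rows, where `?` marks a missing value. The sketch below writes such a file; the `to_arff` helper and the attribute names are hypothetical, and it assumes values contain no spaces or commas (which would need quoting).

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind), where kind is 'numeric', 'string',
    or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"  # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF uses '?' for a missing value.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "crime",
    [("crime_type", ["Burglary", "AutoTheft"]), ("hour", "numeric")],
    [("Burglary", 23), ("AutoTheft", None)],
)
```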
Supervised: Interface for filters that make use of a
Ex: Discretize, NominalToBinary, Resample
Unsupervised: Interface for filters that do not need a
Ex: Standardize, StringToNominal, StringToWordVector
Ex: NaiveBayes, RandomizableClassifier
Supported Association Functions
Ex: Apriori, Associator.
Supported Clustering Algorithms
Ex: Simple K-means, DBScan, EM
Outlier Detection based on location is facilitated by
Data mining part is implemented separately on the
Due to variations in attributes in data sets.
Results do not reflect anomalies in the datasets.
Ex: 725 records in Falls Church, compared to 11507 records
in Fairfax (1705 records dealing with Auto-Theft)
is fetched from the data base or CSV files using
Unwanted attributes are filtered (removed) from the
working data set. Ex: Criminal Description etc.
DBScan, K-means, EM algorithms are implemented using
Takes a range of values of K (say, 1 to 45), since 45 is the
number of different crime types in the database.
Calculates the sum of squared errors (SSE) within the clusters for each K
value and picks the K where SSE is low.
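The K sweep can be sketched in plain Python as follows. This is not the WEKA implementation: it is a basic k-means written only for illustration, run on a tiny made-up point set. Because SSE generally keeps shrinking as K grows, the `choose_k` rule (smallest K whose SSE falls below a fraction of the K=1 SSE) is a hypothetical way to make "picks the K where SSE is low" concrete.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, iterations=20, seed=0):
    """Run a basic k-means and return the within-cluster SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def choose_k(sse_by_k, fraction=0.1):
    """Hypothetical rule: smallest K whose SSE is at most `fraction` of the K=1 SSE."""
    baseline = sse_by_k[1]
    for k in sorted(sse_by_k):
        if sse_by_k[k] <= fraction * baseline:
            return k
    return max(sse_by_k)


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
chosen_k = choose_k(sse_by_k)
```

In the project the sweep would run from 1 to 45; the range here is kept short because the sample has only eight points.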
The mined data is then sent to the Visualization
explorer in WEKA, where different attributes can be
graphed and represented.
Examples of some visualizations:
Arlington, day of week (DOW) vs. clusters: Auto Theft (C3) is low on weekends.
Arlington data set: the data is skewed towards Wednesday.
Fairfax, month vs. clusters: the data is very sparse in June.
Advantages of this Implementation
Does not depend on one algorithm such as K-means.
Modules can be added seamlessly to the existing
code to implement other algorithms or using WEKA
Open design: Algorithm implementation can be
switched with simple parameter changes.
Visualization of data is implemented using Google
WEKA used for histograms and cluster visualization
PostGIS is a spatial database extender for the
Adds spatial functions such as distance, area, and
specialty geometry data types to the database.
Relies on GiST (Generalized Search Tree) for
indexing geometric data.
Examples of geometry data types:
LINESTRING (2566006.4 5633207.9, 2566028.6 5633215.1,
POLYGON ((2568262.1 5635344.1, 2568298.5 5635387.6,
2568261.04 5635276.15, 2568262.1 5635344.1))
Examples of PostGIS functions/operators:
Distance(), Intersects(), Within(), Contains(), Length(), Area(),
ConvexHull(), Extent(), ...
A ~ B (A contains B?)
A @ B (B contains A?)
A && B (Do A and B overlap?)
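To my understanding, these three operators compare the bounding boxes of the geometries rather than their exact shapes. The sketch below mimics that behavior in Python; the `Box` type and the function names are hypothetical and are not part of PostGIS.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def overlaps(a: Box, b: Box) -> bool:
    """Analogue of A && B: do the two boxes share any area or edge?"""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def contains(a: Box, b: Box) -> bool:
    """Analogue of A ~ B: does box A fully enclose box B?"""
    return (a.xmin <= b.xmin and a.ymin <= b.ymin
            and a.xmax >= b.xmax and a.ymax >= b.ymax)


def contained_by(a: Box, b: Box) -> bool:
    """Analogue of A @ B: is box A fully enclosed by box B?"""
    return contains(b, a)


city = Box(0, 0, 10, 10)
block = Box(2, 2, 3, 3)
suburb = Box(8, 8, 15, 15)
```

Here `contains(city, block)` and `contained_by(block, city)` both hold, while `city` and `suburb` only overlap.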
Detecting Outliers By Location
The Google Maps API was used to view the map
and draw all necessary illustrations.
The UI uses asynchronous (AJAX) requests to
communicate with the server.
The server replies in JSON (a data-interchange
Request/Reply batching is used to improve
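One way to picture request/reply batching is a server handler that accepts a JSON array of requests and answers all of them in a single JSON array. The message shape (`id`, `op`, `args`), the `handle_batch` function, and the `count` operation below are assumptions for illustration, not the project's actual protocol.

```python
import json


def handle_batch(batch_json, handlers):
    """Answer a JSON array of requests with a single JSON array of replies."""
    replies = []
    for request in json.loads(batch_json):
        handler = handlers.get(request["op"])
        if handler is None:
            replies.append({"id": request["id"], "error": "unknown op"})
        else:
            result = handler(**request.get("args", {}))
            replies.append({"id": request["id"], "result": result})
    return json.dumps(replies)


# Hypothetical operation: number of records for a region.
record_counts = {"Falls Church": 725, "Fairfax": 11507}
handlers = {"count": lambda region: record_counts.get(region, 0)}

batch = json.dumps([
    {"id": 1, "op": "count", "args": {"region": "Falls Church"}},
    {"id": 2, "op": "count", "args": {"region": "Fairfax"}},
    {"id": 3, "op": "nope"},
])
reply = json.loads(handle_batch(batch, handlers))
```

Each reply carries the `id` of its request, so the UI can match answers to pending requests after one round trip.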
Map Visualization (continued)
The UI can generate map-based visualizations
showing the following:
Crime rate in different regions filtered by: dataset,
crime status (attempted/committed), year, month, day
As well as: crime type, year, month, day of week with
the highest frequency in different regions.
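The "highest frequency per region" view can be sketched as a filtered count. The record fields (`region`, `crime_type`, `year`, `month`) and the sample rows are made up for illustration.

```python
from collections import Counter, defaultdict


def top_crime_by_region(records, year=None, month=None):
    """Return {region: (crime_type, count)} for the most frequent crime type,
    after applying the optional year and month filters."""
    counts = defaultdict(Counter)
    for record in records:
        if year is not None and record["year"] != year:
            continue
        if month is not None and record["month"] != month:
            continue
        counts[record["region"]][record["crime_type"]] += 1
    return {region: counter.most_common(1)[0] for region, counter in counts.items()}


# Made-up sample records.
records = [
    {"region": "Fairfax", "crime_type": "Auto Theft", "year": 2009, "month": 3},
    {"region": "Fairfax", "crime_type": "Auto Theft", "year": 2009, "month": 4},
    {"region": "Fairfax", "crime_type": "Burglary", "year": 2009, "month": 3},
    {"region": "Arlington", "crime_type": "Burglary", "year": 2008, "month": 3},
]
top_all = top_crime_by_region(records)
top_2009 = top_crime_by_region(records, year=2009)
```

The same pattern extends to the other filters (dataset, crime status, day) by adding further optional parameters.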
Google Maps API
RapidMiner (YALE) (www.rapidminer.com)