NoVaCrime: A crime data mining
and visualization system
Anukrati Gupta, Alex Galanes,
Date: Dec. 4th 2008
• Problem Definition
• Key Issues
• Related Work
• System Design
▫ Parsing and Loading
▫ Data Cleansing
• Software Design
▫ The Google Web Toolkit
▫ Data Mining
• Future Work
Problem Definition
▫ Given XML files from three jurisdictions (Arlington,
Fairfax, and Falls Church), each representing a crime or
series of crimes, identify and analyze
suspicious patterns and trends.
▫ Includes data processing
▫ database systems
▫ crime information systems
Key Issues
• Input to the system may be very difficult to
analyze: police reports do not follow a set or
structured pattern and may be missing important
pieces of information.
• Based on the information available it is hard to
define the correctness or accuracy of
• Fast access to these records and queries against
the records are necessary.
• Vast amounts of information need to be
conveyed to the user in an easy-to-understand
• There is a lack of crime analysis services
available for the Northern Virginia Area.
• Analysis of crime data can provide key
information to police organizations to improve
safety on the streets.
• Citizens desire crime information to make
informed decisions about where to live, play,
work, and attend school.
Related Work
• CrimeReports.com accepts an address and
provides detailed information on crimes within a
certain distance for a certain period of time.
▫ Good user interface, but not all
municipalities report criminal activity to
CrimeReports.com.
▫ As of December 2008, only four
jurisdictions in Virginia submitted their
crime data to CrimeReports.com.
Related Work – cont.
• Some municipalities also provide crime
data on their own websites, for example the Seminole County
Sheriff’s Office in Florida.
▫ User interface not as easy to use as
▫ Data is shown on a map for a specific district for a
▫ Unable to find a way to search for a given address
or to look at data for more than one month at a
time.
• Visualization techniques from existing crime-related
websites were used as guidelines for
displaying crime data from the given data sets.
NoVaCrime System Design
Parsing and Loading
• Utilized provided XML files instead of reparsing
raw data to conserve development time.
• Reverse-engineered an XSD to validate XML
• Created XSLT transformations to convert XML
files into pipe-delimited files for loading into the
database
▫ XSLT performed basic validation as well
• Loaded data first into Microsoft Excel and then
into Microsoft SQL Server
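The slides describe the XML-to-pipe-delimited step as an XSLT transformation. As a rough illustration of the same idea, here is a minimal Python sketch using only the standard library instead of XSLT. The element names (`incident`, `id`, `date`, `category`, `address`, `description`) and the "drop records without an id" rule are hypothetical, since the actual schema and validation rules are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the real NoVaCrime XML schema is not shown in the slides.
FIELDS = ["id", "date", "category", "address", "description"]


def xml_to_pipe_delimited(xml_text, record_tag="incident"):
    """Return one pipe-delimited line per record, skipping records with no id."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter(record_tag):
        values = []
        for field in FIELDS:
            text = record.findtext(field, default="") or ""
            # Collapse whitespace and replace the delimiter so each
            # record stays on a single, unambiguous line.
            values.append(" ".join(text.split()).replace("|", "/"))
        if not values[0]:
            continue  # basic validation (assumed rule): drop records without an id
        lines.append("|".join(values))
    return lines


SAMPLE = """<incidents>
  <incident><id>1</id><date>2008-10-03</date><category>Larceny</category>
    <address>100 Main St</address>
    <description>Bike stolen | no witnesses</description></incident>
  <incident><date>2008-10-04</date><category>Assault</category></incident>
</incidents>"""

for line in xml_to_pipe_delimited(SAMPLE):
    print(line)
```

Replacing the delimiter inside field text is one simple way to keep the output loadable; escaping or quoting would be an alternative.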
Parsing and Loading – cont.
• The 4000-character description field could not be
loaded directly from the pipe-delimited file into
SQL Server
• Dates loaded from the pipe-delimited file were treated
by SQL Server as strings
• Using Microsoft Excel as an intermediate step
solved both issues.
Data Cleansing
• Latitudes and longitudes were reversed for Google geocodings
• Addresses describing intersections were not properly
handled in the Fairfax County dataset
▫ Actual intersection
Bennington Woods Road and Baron Cameron Avenue
▫ Address in XML
Road Baron Cameron Avenue/Bennington Woods
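The two cleansing problems above could be handled along the following lines. This is only a sketch under stated assumptions: the bounding box for Northern Virginia is a rough guess, the list of street suffixes is illustrative, and the rule "a leading suffix on the first street belongs to the second street" is inferred from the single example on the slide, not from the team's actual code.

```python
# Rough bounding box for Northern Virginia (an assumption for illustration).
LAT_RANGE = (38.0, 39.5)
LON_RANGE = (-78.5, -76.5)

# Illustrative suffix list; a real cleanser would need a fuller one.
STREET_SUFFIXES = {"Road", "Avenue", "Street", "Drive", "Lane", "Court", "Way"}


def in_range(value, bounds):
    return bounds[0] <= value <= bounds[1]


def fix_coordinates(lat, lon):
    """Return (lat, lon), swapping the pair if it was stored reversed."""
    if in_range(lat, LAT_RANGE) and in_range(lon, LON_RANGE):
        return lat, lon
    if in_range(lon, LAT_RANGE) and in_range(lat, LON_RANGE):
        return lon, lat
    raise ValueError(f"coordinates outside expected region: {lat}, {lon}")


def normalize_intersection(raw):
    """Rewrite 'Suffix StreetA/StreetB' as 'StreetB Suffix and StreetA'."""
    if "/" not in raw:
        return raw
    first, second = (part.strip() for part in raw.split("/", 1))
    suffix, _, rest = first.partition(" ")
    if suffix in STREET_SUFFIXES and rest:
        # The displaced suffix is assumed to belong to the second street.
        return f"{second} {suffix} and {rest}"
    return f"{first} and {second}"


print(fix_coordinates(-77.35, 38.96))
print(normalize_intersection("Road Baron Cameron Avenue/Bennington Woods"))
```

A plain range check on latitude (between -90 and 90) would not catch the swap here, because a Northern Virginia longitude near -77 is itself a valid latitude; that is why the sketch tests against a regional bounding box.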
Data Cleansing - cont.
• Crime Categories were null or uninformative for
a significant portion of the data.
• Null categories were updated by hand by
examining the specific crimes
• Large categories were split into smaller groups
by running SQL queries against the
▫ Primarily Larceny crimes
Weka
• NoVaCrime’s Data Mining Engine
• Open source and free for development
• Rich API provides many common data mining
algorithms without custom development
• Weka file format is easy to generate by hand or
• Weka Explorer GUI provided with distribution –
allows for mining to occur prior to integration in
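To illustrate why Weka's ARFF file format is easy to generate, here is a minimal Python sketch that writes a small ARFF document from in-memory records. The relation name, attribute names, and sample rows are made up for the example; the attributes actually used in NoVaCrime are not listed in the slides.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind), where kind is the string 'numeric'
    or a list of allowed nominal values. Missing values are passed as None.
    Nominal values containing spaces would need quoting, which this
    sketch does not handle.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            nominal = ",".join(kind)
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
ATTRIBUTES = [
    ("latitude", "numeric"),
    ("longitude", "numeric"),
    ("day_of_week", ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]),
    ("category", ["Larceny", "Assault"]),
]
ROWS = [
    (38.96, -77.35, "Fri", "Larceny"),
    (38.88, -77.1, "Sat", None),
]

print(to_arff("crimes", ATTRIBUTES, ROWS))
```

A file produced this way can be opened in the Weka Explorer for experimentation before any integration code is written.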
Weka – cont.
• Used generated ARFF files to try data mining through the Weka Explorer
• Allowed for quick visualization before coding within NoVaCrime.
SimpleKMeans -N 11 -S 10
DBSCAN – 11 clusters: -E 0.9 -M 4 -I
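For readers unfamiliar with what the SimpleKMeans run above is doing, the following is a small from-scratch k-means sketch in Python. It is not Weka's implementation; the parameters `k` and `seed` merely play the roles of the `-N` (number of clusters) and `-S` (random seed) options, and the sample points are invented.

```python
import random


def kmeans(points, k, seed=10, iterations=50):
    """Plain k-means on 2-D points; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment


# Two well-separated groups of made-up points.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(POINTS, k=2)
print(labels)
```

With crime data, the points would be incident coordinates (possibly with other numeric attributes), and the resulting cluster labels are what a "number of incidents per cluster" view would count.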
Weka – cont.
• Does not provide sufficient support for date/time or string attributes
▫ We converted most of these attributes into nominal attributes
• API could be improved
▫ Development team needed to examine underlying
source code rather than just relying on Weka-provided
• Mining algorithms did not provide significant
▫ Classifiers performed significantly worse than 50%.
▫ Clusters worked slightly better
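The conversion of date/time attributes into nominal ones, mentioned above as a workaround, might look like the following sketch. The timestamp format and the four time-of-day buckets are assumptions chosen for illustration; the slides do not say which nominal values the team used.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_of_day(hour):
    """Map an hour (0-23) to a coarse nominal bucket (assumed boundaries)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def to_nominal(timestamp, fmt="%Y-%m-%d %H:%M"):
    """Turn a timestamp string into (day_of_week, time_of_day) nominal values."""
    dt = datetime.strptime(timestamp, fmt)
    return DAYS[dt.weekday()], time_of_day(dt.hour)


print(to_nominal("2008-12-04 14:30"))  # ('Thu', 'afternoon')
```

Nominal values like these can be declared directly as enumerated attributes in an ARFF file, which sidesteps the limited date/time support.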
The Google Web Toolkit
• Write web application in Java
▫ Define/Use Widgets for display
Large library of existing controls
Asynchronous calls from client to server
Primitive, Complex types
• Define data sources of interest
• Define time frame of interest;
filter out noisy time-based data
• Define geographic area of
interest
• Define crime information of
interest
• Three clustering algorithms
▫ DBSCAN
▫ K-Means
• Select tab and specify
parameters for algorithm
Visualization - Charts
• Shows several views of the data
▫ Day of Week chart
Number of incidents per day of week
▫ Crime Category
Number of incidents per category
Number of incidents per crime type
Number of incidents per cluster
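The counts behind charts like these are simple group-by tallies. The sketch below shows one way to compute incidents per day of week and per category in Python; the incident records are invented, and the actual system presumably drew these numbers from its database rather than from in-memory tuples.

```python
from collections import Counter
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up (date, category) records for illustration.
INCIDENTS = [
    (date(2008, 10, 3), "Larceny"),
    (date(2008, 10, 4), "Assault"),
    (date(2008, 10, 10), "Larceny"),
]


def incidents_per_weekday(incidents):
    """Return counts for every day of the week, including days with zero."""
    counts = Counter(DAYS[d.weekday()] for d, _ in incidents)
    return {day: counts.get(day, 0) for day in DAYS}


def incidents_per_category(incidents):
    """Return counts per crime category."""
    return dict(Counter(category for _, category in incidents))


print(incidents_per_weekday(INCIDENTS))
print(incidents_per_category(INCIDENTS))
```

Including zero-count days keeps the chart's x-axis stable from one query to the next.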
Visualization - Maps
• Heat map
▫ Provides a Density
based summary of
▫ Colored map of
points based on a
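A density-based summary such as a heat map can be approximated by binning incident coordinates into a grid and counting incidents per cell. The following Python sketch shows that idea only; the cell size and sample points are arbitrary assumptions, and it does not reflect how NoVaCrime or Google Maps actually render the heat map.

```python
import math
from collections import Counter


def density_grid(points, cell_size=0.01):
    """Count points per grid cell; cells are indexed by (lat_index, lon_index)."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts


# Made-up incident coordinates for illustration.
POINTS = [
    (38.961, -77.351),
    (38.965, -77.358),
    (38.881, -77.102),
]

grid = density_grid(POINTS)
for cell, count in sorted(grid.items()):
    print(cell, count)
```

Each cell's count can then be mapped to a color intensity when drawing the map overlay.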
Demo - NoVaCrime Example
• Data quality
▫ Lots of noise/outliers in the data
▫ Difficult to clean/verify accuracy
▫ Empty values hurt ability of classifiers to classify the records properly
• Weka
▫ Provided a quick way to run mining algorithms
▫ Did not support many data types well
▫ Unable to find an algorithm that appeared to produce useful results
• Google Web Toolkit/JFreeChart/Google Maps
▫ Provided relatively straightforward APIs for visualization
▫ We were able to produce several interesting charts and maps while
providing an easy interface for data mining.
• Additional Features
Use raw data as input
Additional noise reduction
Improved method to select a geographic area of interest
Feedback on crime information for specific incidents from Map
Better color selection/legend
Additional chart types
Control over chart contents
Feedback loop from algorithm run to subsequent runs