Criminal Incident Data Association Using OLAP Technology
Donald E. Brown & Song Lin
Department of Systems & Information Engineering
University of Virginia
Summary
In this paper, we combine OLAP (Online Analytical Processing) and data mining to associate criminal incidents. This method is tested with a robbery dataset from Richmond, Virginia.
Objectives of Spatial Knowledge Mining
Leverage DBMS (records management), OLAP, & GIS
Find spatial-temporal patterns and relationships in data
Support crime analysis & information sharing
Related Applications - UVa
ReCAP (Regional Crime Analysis Program)
  Provides support for regional analysis using RDBMS
  Requires implementation on each client computer
CARV (Crime Analysis and Reporting in Virginia)
  Runs on Citrix Metaframe, so the number of concurrent users is limited
GRASP (Geospatial Repository for Analysis and Safety Planning)
  Web interface for a central repository of criminal incident data and geospatial files
Outline
Introduction
Existing studies on OLAP & data mining
Combined approach
Application
Conclusions
Introduction (crime association)
80-20 rule: 20% of the criminals commit 80% of the crimes
How can we link criminal incidents committed by the same criminal?
Start by looking at the same crime types
Theories of criminal behavior (criminology)
Rational choice (Clarke and Cornish)
  Criminals evaluate “benefit” and “risk”, and make rational decisions to maximize “profit”.
Routine activity (Felson)
  A ready criminal
  Suitable target
  Lack of effective guardian
Theories of criminal behavior (template)
“Template” (Brantingham & Brantingham)
  Environment sends out cues about its characteristics
  Criminals use cues to evaluate
  Template is built to associate certain cues with suitable targets
  Template is self-reinforcing and enduring
  A criminal does not have many templates
An operational approach to the theories (template)
Criminal incidents committed by the same person
  Similar patterns in time
  Similar patterns in space
  Similar patterns in MO
It is possible to associate incidents from the same person by discovering these patterns
Existing Association Methods & Systems
AREST (Badiru et al.)
  Suspect matching
ViCAP (FBI)
  Incident matching
COPLINK (U. Arizona)
  Link search terms with cases (concept space)
Existing Association Methods & Systems
TSM (Brown)
  Total similarity measures
  Could be used for both incident and suspect matching
SQL
  Used by analysts in practice
Comments on existing methods
Computer technologies are central to criminal incident association
For example
  MIS
  Databases
  Information Retrieval
  GIS
Comments on existing methods
Two additional techniques that enable incident association
  Data Warehousing / OLAP
  Data Mining
We develop a method that seamlessly integrates OLAP and data mining.
Related Work on OLAP and data mining
OLAP
  Ancestor: OLTP (transactional data)
  OLAP: summary data for analysis
Dimension
  OLAP data is multidimensional
  Dimension: numeric or categorical attributes
  Hierarchical structures exist in dimensions
Aggregates: sum, count, average, max, min, …
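To make the idea of multidimensional aggregates concrete, here is a minimal sketch that computes the "count" aggregate for every cell of a small cube. The `ALL` marker, the function name `count_cube`, and the sample records are illustrative assumptions, not part of the original slides.

```python
from collections import Counter
from itertools import product

ALL = "*"  # assumed marker for a rolled-up (aggregated) dimension


def count_cube(records):
    """Count records in every cell of the cube, including rolled-up cells."""
    cube = Counter()
    for record in records:
        # Each dimension is either kept as-is or rolled up to ALL.
        for mask in product((False, True), repeat=len(record)):
            cell = tuple(ALL if rolled else value
                         for value, rolled in zip(record, mask))
            cube[cell] += 1
    return cube


# Hypothetical (weapon, time of day) records.
records = [("gun", "night"), ("gun", "day"), ("knife", "night")]
cube = count_cube(records)
print(cube[("gun", ALL)])    # gun incidents at any time of day
print(cube[(ALL, "night")])  # night incidents with any weapon
```

Enumerating every roll-up is exponential in the number of dimensions, so this only suits small examples; a real OLAP engine would materialize aggregates far more selectively.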
OLAP and Data Mining
Both are powerful tools to support the decision-making process, but
  OLAP focuses on efficiency; few quantitative analysis methods are used
  Data mining is typically applied to 2-D datasets (spreadsheets), not to multidimensional OLAP data structures
Idea: combine them
Existing studies on combining OLAP and Data mining
Cubegrade problem (Imielinski)
  Generalized version of association rules
  Association rule: change in the “count” aggregate when imposing another constraint or performing a “drill-down” operation
  Other aggregates could also be considered
Existing studies on combining OLAP and Data mining
Constrained Gradient Analysis
  Retrieve pairs of OLAP cells
    Quite different in aggregates
    Similar in dimension (parents, children, siblings)
  More than one aggregate could be considered simultaneously (e.g., sum and mean).
Existing studies on combining OLAP and Data mining
Data-driven exploration (Sarawagi)
  Find “exceptions”
  Mean and STD are calculated for a cell
  If the aggregate of the cell is outside (-2.5σ, +2.5σ) -> exception
  OLAP version of the “3σ” rule
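The exception test described on this slide can be sketched as a simple check on how far a cell's aggregate lies from its mean, measured in standard deviations. This is only an illustration of the 2.5-standard-deviation rule: the function name `is_exception` and the sample numbers are assumptions, and the cited work estimates the mean and standard deviation with its own model, which is not reproduced here.

```python
def is_exception(aggregate, mean, std, k=2.5):
    """Flag a cell whose aggregate is more than k standard deviations from its mean."""
    if std <= 0:
        # No spread to compare against, so nothing is flagged (assumed convention).
        return False
    return abs(aggregate - mean) > k * std


# Hypothetical cell statistics: mean 20, standard deviation 10.
print(is_exception(48, mean=20, std=10))  # 28 away, beyond 2.5 * 10
print(is_exception(40, mean=20, std=10))  # 20 away, within 2.5 * 10
```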
Associating records by finding distinctive values or outliers
Basic idea
  If a group of records have common characteristics, and these “common” characteristics are unusual or “outliers”, we are more confident in asserting that these records come from the same causal mechanism.
  Look for distinctive characteristics; the best would be DNA
OLAP-outlier-based method to associate records
Rationale for distinctive values or outliers
  Weapon used in robberies
    “gun”: very common, hard to associate
    “Japanese sword”: distinctive, likely to come from the same person
We build an outlier score function to measure this “distinctiveness”
  Higher score -> more distinctive -> more confident to associate
  It is for categorical attributes (MO is important in linking criminal incidents)
Definitions
Cell, Parent, Neighbor
  Cell: a vector of values for some attributes.
  Parent: replace one attribute of the cell with the wildcard element “*”.
  Neighbor: a group of cells having the same Parent.
Derived from the OLAP field
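The three definitions above can be sketched with plain tuples. The function names `parents` and `neighbors`, and the explicit `domain` argument listing the possible values of an attribute, are assumptions made for illustration.

```python
WILDCARD = "*"


def parents(cell):
    """Return every parent of a cell: one non-wildcard attribute replaced by '*'."""
    result = []
    for i, value in enumerate(cell):
        if value != WILDCARD:
            result.append(cell[:i] + (WILDCARD,) + cell[i + 1:])
    return result


def neighbors(cell, k, domain):
    """Return the cells sharing the parent obtained by wildcarding attribute k.

    `domain` lists the possible values of attribute k; the cell itself is included.
    """
    return [cell[:k] + (value,) + cell[k + 1:] for value in domain]


cell = ("a5", "b3")
print(parents(cell))
print(neighbors(cell, 0, ["a1", "a2", "a3", "a4", "a5"]))
```

A cell with d specified attributes has d parents, so the two-dimension cell (a5, b3) has the two parents (*, b3) and (a5, *).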
Illustration -- Cell
[Figure: grid with Dimension 1 values a1 to a4 and Dimension 2 values b1 to b4, marking a two-dimension cell (a4, b2) and a one-dimension cell (*, b4).]
Illustration --parent
[Figure: grid with values a1 to a5 on one dimension and b1 to b4 on the other.]
Cell (a5, b3) has two parents: (a5, *) and (*, b3)
Illustration -- Neighbor
Neighbor is a collection of cells sharing the same parent
Outlier Score Function
We start building this function from one dimension, and then we generalize to higher dimensions.
For one dimension, we have the following two observations.
  Values with small probability (frequency) are more “unusual”
  Outlier score is high when the uncertainty level is low.
Observation I
[Figure: bar chart of Count (0 to 50) by HairColor (Blond, Brown, Black, Red, Gray); one bar is labeled “P = 0.1, Outlier”.]
For attribute “color”, value “blond” covers 10% of the records. Hence, it should get a higher outlier score.
Observation II
[Figure: two bar charts of Count (0 to 80) by HairColor; one shows five values (Blond, Brown, Black, Red, Gray), the other shows two (Blond, Brown).]
Although both of them have frequency = 0.2, the left one is more “unusual”, because its uncertainty level is low.
Observation III
“More evidence”
  More evidence is better than less -> higher outlier score
OSF for One Dimension
-log(p) comes from information theory, where p is the probability of a value
Entropy measures the information in a message (in this case, in a data record)
\[ \mathrm{OSF} = \frac{-\log(p)}{\mathrm{Entropy}} \]
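As a rough illustration of the one-dimension score, -log(p) divided by the entropy of the attribute, the sketch below scores each value of a single categorical attribute. The function name `one_dim_osf`, the zero-entropy convention, and the hair-color data are assumptions for illustration only.

```python
import math
from collections import Counter


def one_dim_osf(values):
    """Score each distinct value as -log(p) / entropy of the value distribution."""
    counts = Counter(values)
    n = len(values)
    probs = {value: count / n for value, count in counts.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    if entropy == 0:
        # A single value carries no uncertainty; score it 0 (assumed convention).
        return {value: 0.0 for value in probs}
    return {value: -math.log(p) / entropy for value, p in probs.items()}


# Hypothetical data: "blond" has frequency 0.2 in both attributes.
skewed = ["blond"] * 20 + ["brown"] * 80
uniform = [c for c in ("blond", "brown", "black", "red", "gray") for _ in range(20)]
print(one_dim_osf(skewed)["blond"])   # low entropy, so the score is higher
print(one_dim_osf(uniform)["blond"])  # high entropy, so the score is lower
```

Because the same logarithm base appears in the numerator and the denominator, the ratio does not depend on the base chosen.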
OSF for Higher Dimensions
For any cell, calculate the sum of the OSF of its parent cell and the OSF conditional on the neighbor of this cell (one-dimension OSF).
Do this calculation for all parent cells.
Take the maximum as the outlier score for this cell.
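\[
f(c) =
\begin{cases}
0, & c = (*, *, \ldots, *) \\
\max\limits_{k} \left( f(\mathrm{parent}(c, k)) + \dfrac{-\log(\mathrm{frequency}(c))}{\mathrm{Entropy}(k^{\mathrm{th}}\ \mathrm{neighbor\ of}\ c)} \right), & \text{otherwise}
\end{cases}
\]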
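The recursion for higher dimensions can be sketched as follows: for each specified attribute k, add the score of the k-th parent to the one-dimension score of the cell within its k-th neighbor, then take the maximum. The names `make_outlier_score` and `matches`, the handling of empty cells and zero entropy, and the sample records are assumptions, so this is an illustration rather than the authors' implementation.

```python
import math
from collections import Counter
from functools import lru_cache

WILDCARD = "*"


def matches(record, cell):
    """True if the record falls inside the cell ('*' matches any value)."""
    return all(c == WILDCARD or c == r for r, c in zip(record, cell))


def make_outlier_score(records):
    """Build a memoized outlier score function f(cell) over the given records."""
    records = [tuple(r) for r in records]

    @lru_cache(maxsize=None)
    def score(cell):
        best = 0.0  # the all-wildcard cell scores 0
        for k, value in enumerate(cell):
            if value == WILDCARD:
                continue
            parent = cell[:k] + (WILDCARD,) + cell[k + 1:]
            # Distribution of attribute k among records inside the parent cell.
            counts = Counter(r[k] for r in records if matches(r, parent))
            if counts[value] == 0:
                continue  # empty cell: nothing to score (assumed convention)
            total = sum(counts.values())
            entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
            term = 0.0 if entropy == 0 else -math.log(counts[value] / total) / entropy
            best = max(best, score(parent) + term)
        return best

    return score


# Hypothetical (weapon, time of day) records.
records = [("gun", "night")] * 6 + [("gun", "day")] * 2 + [("sword", "night")] * 2
score = make_outlier_score(records)
print(score(("sword", WILDCARD)))
print(score(("gun", WILDCARD)))
```

Each added term is non-negative, so a cell never scores lower than any of its parents, which matches the "more evidence" observation.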
Association (using this OLAP-outlier method)
For a pair of incidents (A, B)
  If there is a cell that contains both A and B
  And the outlier score of this cell is large enough (threshold test)
  Associate them
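A minimal sketch of this threshold test follows. It assumes a score that never decreases as a cell becomes more specific, in which case it is enough to test the most specific cell containing both incidents. The names `common_cell` and `associate`, the use of `>=` for "large enough", and the toy score function are all assumptions.

```python
WILDCARD = "*"


def common_cell(a, b):
    """Most specific cell containing both incidents: keep agreeing values only."""
    return tuple(x if x == y else WILDCARD for x, y in zip(a, b))


def associate(a, b, outlier_score, threshold):
    """Associate two incidents if their common cell scores at least `threshold`."""
    return outlier_score(common_cell(a, b)) >= threshold


# Toy stand-in for an outlier score: hypothetical per-value weights, summed.
WEIGHTS = {"gun": 0.4, "sword": 3.2, "night": 0.5, "day": 1.5}


def toy_score(cell):
    return sum(WEIGHTS[v] for v in cell if v != WILDCARD)


a = ("sword", "night")
b = ("sword", "day")
print(common_cell(a, b))
print(associate(a, b, toy_score, threshold=3.0))
```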
Application (dataset)
Applied to a robbery dataset (Richmond, VA, 1998)
Why robbery? For evaluation purposes:
  Number of multiple offenses > murder
  Number of known suspects > B & E (breaking and entering)
Attributes
Three types of attributes
  Modus Operandi -- categorical
  Census features -- numeric
  Distance features -- numeric
Feature Selection
Redundant features -> feature selection
  Cluster features (similar features in the same group)
  Pick a representative feature for each group
Method: k-medoid clustering
  Applicable to a distance matrix
  Returns “medoids”
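To show why k-medoid clustering suits this step, the sketch below runs a simple alternating k-medoids procedure directly on a distance matrix and returns the medoid indices as representatives. It is a simplified variant with a naive initialization, not necessarily the algorithm used by the authors; the function names and the toy distances are assumptions.

```python
def assign(dist, medoids):
    """Assign every item to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))  # naive deterministic start (assumption)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [
            min(members or [m], key=lambda c: sum(dist[c][j] for j in members))
            for m, members in clusters.items()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    medoids = sorted(medoids)
    return medoids, assign(dist, medoids)


# Hypothetical one-dimensional "features" turned into a distance matrix.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, clusters = k_medoids(dist, k=2)
print(medoids, clusters)
```

Only pairwise distances are needed, and each medoid is an actual item, so it can be kept directly as the representative feature of its group.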
Feature Selection Result
[Figure: features plotted on Component 1 (horizontal axis, -0.8 to 0.6) against Component 2 (vertical axis, -0.6 to 0.4).]
These two components explain 44.25% of the point variability.
Medoids -- 1: HUNT, 2: ENRL3, 3: TRANS.PC
Final Selected Features
Medoids
HUNT (housing unit density)
ENRL3 (public school enrollment)
POP3 (population: 12-17): more meaningful (attacker and victims)
TRAN_PC (transportation expense per capita)
MHINC (median income)
Discretize
Discretize these numeric features into bins
  Similar to a histogram
  Sturges’ number-of-bins rule
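A short sketch of this discretization step, assuming the common form of Sturges' rule (number of bins = ceil(log2 n) + 1) and equal-width bins. The slides do not state the bin widths, so equal width is an assumption, as are the function names and the sample values.

```python
import math


def sturges_bins(n):
    """Number of bins for n observations under Sturges' rule."""
    return math.ceil(math.log2(n)) + 1


def discretize(values):
    """Map each numeric value to an equal-width bin index (0-based)."""
    k = sturges_bins(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid zero width when all values are equal
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical numeric feature with 8 observations, giving 4 bins.
print(discretize([0, 1, 2, 3, 4, 5, 6, 8]))
```

After this step the numeric census and distance features can be treated as categorical values, just like the MO attributes.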
Evaluation
For incidents with known suspects (170)
  Generate all incident pairs
  If a pair of incidents have the same criminal suspect, then “true association”
Compare results given by the algorithm with the “true result”
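The ground truth described above can be sketched by enumerating all incident pairs and keeping those that share a suspect. The function name, the dictionary layout, and the incident and suspect identifiers are hypothetical.

```python
from itertools import combinations


def true_associations(suspect_by_incident):
    """Return the set of incident pairs that share the same known suspect."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(suspect_by_incident), 2)
        if suspect_by_incident[a] == suspect_by_incident[b]
    }


# Hypothetical incidents with known suspects.
suspects = {"i1": "s1", "i2": "s1", "i3": "s2", "i4": "s1"}
truth = true_associations(suspects)
print(len(truth))
```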
Evaluation Criteria
Two measures
Detected true associations
  Larger is better
Average number of relevant records
  Similar to search engines like “Google”
  Given one record, the system returns a list
  Take the average of the length of all lists
  Shorter is better
Evaluation Criteria (cont.)
From information retrieval
  Recall: ability to provide relevant items
  Precision: ability to provide only relevant items
1st measure is “recall”; 2nd is equivalent to “precision”
2nd also measures the user effort (in further investigation)
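The two measures can be computed from a set of predicted pairs and a set of true pairs, as in the sketch below. The function name `evaluate` and the pair representation (frozensets of incident identifiers) are assumptions. The usage example associates every pair among 170 hypothetical incidents, which gives an average list length of 169.

```python
from itertools import combinations


def evaluate(incidents, predicted_pairs, true_pairs):
    """Return (detected true associations, average number of returned records)."""
    detected = len(predicted_pairs & true_pairs)
    # For each incident, count how many other incidents the system links to it.
    list_length = {i: 0 for i in incidents}
    for pair in predicted_pairs:
        for incident in pair:
            list_length[incident] += 1
    average = sum(list_length.values()) / len(list_length)
    return detected, average


# Hypothetical check: associating every pair returns all other records each time.
incidents = list(range(170))
all_pairs = {frozenset(p) for p in combinations(incidents, 2)}
true_pairs = {frozenset((0, 1)), frozenset((2, 3))}
print(evaluate(incidents, all_pairs, true_pairs))
```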
Result (OLAP-outlier based)
Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
1           32                           121.04
2           30                           62.54
3           23                           28.38
4           18                           13.96
5           16                           7.51
6           8                            4.25
7           2                            2.29
            0                            0.00
Result of binary association method (calculating similarity score)
Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
0.5         33                           112.98
0.6         25                           80.05
0.7         15                           45.52
0.8         7                            19.38
0.9         0                            3.97
            0                            0.00
Comparison Outlier vs. Binary
[Figure: curves for the Similarity and Outlier methods; horizontal axis “Avg. relevant records” (0 to 180), vertical axis 0 to 35.]
Comparison (cont.)
Generally, the curve of our method lies above the other one
  Given the same accuracy level, this method returns fewer records
  Given the same list “length”, this method is more accurate
The other method is better at the tail
  However, that means the average number of relevant records is > 100
  Given that the dataset size is 170, no analyst would investigate 100 incidents
Generally, the new method is effective.
Comparison (Outlier vs. Simple Combination)
[Figure: curves for the Similarity, Outlier, and Combine methods; horizontal axis 0 to 200, vertical axis 0 to 35.]
WebCAT Implementation
A secure web environment that can read several data formats and translate them into a uniform standard (XML)
Uses free, open-source technology
  ASP, XML, MapServer, SVG, etc.
Provides tools to meet spatial and statistical analysis needs, including association
Provides utilities for querying and reporting
Conclusions
Developed a new data association method for linking criminal incidents that combines
  Concepts in OLAP (multidimensional)
  Ideas in data mining (outlier detection)
Testing with a robbery dataset shows promise
Deployment through WebCAT provides open-source (XML-based) capability for data access and analysis over the web
Questions?