Macro Trends in
Counter-Terrorism Technologies
And Thoughts on Responsible Innovation
DETECTER Project, Brussels
September 7th, 2011
Jeff Jonas, IBM Distinguished Engineer
Chief Scientist, IBM Entity Analytics
JeffJonas@us.ibm.com
1
Today's Material
Background
Macro Trends
Detecting Bad Guys in Big Data
Challenging Privacy and Civil Liberties Issues
Privacy by Design (PbD) Considerations
Questions and Answers
2
Background
Early 1980s: Founded Systems Research & Development
(SRD), a custom software consultancy
1989 – 2003: Built numerous systems for Las Vegas
casinos including a technology known as Non-Obvious
Relationship Awareness (NORA)
2001/2003: Funded by In-Q-Tel
2005: SRD acquired by IBM; now Chief Scientist, IBM Entity
Analytics
Cumulatively: I have had a hand in a number of systems
with billions of rows describing hundreds of millions of
entities
3
Roles
Member, Markle Foundation Task Force on National
Security in the Information Age
Board Member, US Geospatial Intelligence Foundation
(USGIF), the GEOINT organizing body
Senior Associate, Center for Strategic and International
Studies (CSIS)
Member, EPIC advisory board
Advisor, Privacy International
4
Current Primary Area of Interest
Making sense of information in large data sets,
across complex ecosystems with emphasis on privacy
and civil liberties protections
– 1996: Created an identity-centric customer repository based on
4,200 disparate systems … >100 million resolved identities
– 2001: Assistance in various post-9/11 data analysis programs for
public and private sector
– 2005: Missing persons project following Hurricane Katrina
resulting in re-unification of >100 loved ones
5
A Late Bloomer to Privacy
1980 – 2001 No clue whatsoever
2001 – 2006 Slowly waking up
2007 – 2011 Today, at best, a
student of privacy
6
A Journey Fraught with Reflection and Rethinking
The greater my privacy and civil liberties awareness,
the greater the number of imperfections that appear in my
rearview mirror
7
Katrina – Missing Persons Reunification Project
Information about the status of persons quickly ends up
scattered across countless databases
– Over 50 such web sites/organizations were identified as having
victim related data
– Many people were registered multiple times in the same
database
– Many people were registered multiple times across databases
– Many people were registered as missing in one database and
found in another database
Connecting found persons previously reported as
missing becomes nearly impossible
– Too many databases
– Constantly changing data
8
Katrina Reunification Project Statistics
Total data sources 15
Usable records 1,570,000
Unique persons 36,815
Total loved ones reunited >100
9
Katrina – Missing Persons Reunification Project
Privacy by Design (PbD)
– Contractually authorized to delete all the data
after the reunification office completed its work
– Hence, a few months later, all collected data and
reporting products were deleted
DESTRUCTION OF EVIDENCE!
Data Decommissioning – Destruction of Accountability
10
Macro Trends
11
Good News: The World is Not More Dangerous
Avg Age
1900 (Western Europe): 37
Today (Global Average): 67
Number Dead
1300s (“Black Death”): 75M, ~17+%
Today (if America sunk into ocean and everyone dies): 300M, ~4.5%
12
Prediction
Your doctor is 102
and this is not weird.
13
Bad News: “More Death Cheaper in Future” Graph
[Graph: axes Complexity of Execution vs. Death; plotted points: 10 Kiloton Nuke, 1918 Spanish Influenza]
14
1918 Spanish Influenza Genome
15
“More Death Cheaper in Future” Graph
[Graph: axes Complexity of Execution (toward “Easier”) vs. Death (toward “More Death”); plotted points: 10 Kiloton Nuke, 1918 Spanish Influenza; annotation: “= Bad”]
16
Jerome Kerviel – US$7B
www.chinapost.com.tw/news_images/20080127/p1d.jpg
17
Jerome Kerviel – US$7B
[Timeline labels: Back it out, Back it in, Back it out, Back it in; two “Analytic Checkpoint” markers; “1 Day”]
18
2050 Predictions
A single person can
kill 100M people for
<$1,000.
19
State of the Union:
Enterprise Amnesia
20
Amnesia, definition
A defect in memory, especially resulting
from brain damage.
21
US National Security Amnesia Events
9/11
Two known terrorists were admitted into the US (only discovered
after the fact).
Christmas Day Bomber
Abdulmutallab held a multi-entry visa while
on the terrorist watch list (only discovered after the
fact).
22
Trend: Organizations Are Getting Dumber
“Every two days now we create as much information as we did from
the dawn of civilization up until 2003.”
~ Eric Schmidt, CEO Google
[Graph labels: Computing Power Growth, Time, Available Observation Space, Sensemaking Algorithms, Context, Enterprise Amnesia]
23
Trend: Organizations Are Getting Dumber
[Graph labels: Computing Power Growth, Time, Available Observation Space, Sensemaking Algorithms, Context; annotation: “WHY?”]
24
Algorithms at Dead End.
You Can't
Squeeze Knowledge
Out of a Pixel.
25
No Context
scrila34@msn.com
26
Context, definition
Better understanding
something by taking into
account the things around it.
27
Information without
context
is hardly actionable.
28
Lack of Context – Consequences
Alert queues growing faster than
humans can address them – filled mostly with false
positives
The top item in the queue is not the most
relevant item
Items require so much investigative
effort – they are often abandoned
prematurely
Risk assessment becomes the risk
29
Information in Context … and Accumulating
[Figure: scrila34@msn.com, now in context, linked to: Job Applicant; Most Trusted Source; Known Terrorist; No Fly List]
30
The Puzzle Metaphor
Imagine an ever-growing pile of puzzle pieces of varying
sizes, shapes and colors
What it represents is unknown – there is no picture
Is it one puzzle, 15 puzzles, or 1,500 different puzzles?
Some pieces are duplicates, missing, incomplete, low
quality, or have been misinterpreted
Some pieces may even be professionally fabricated lies
Until you take the pieces to the table and attempt
assembly, you don't know what you are dealing with
31
32
Puzzling: 4 Puzzles, 620 Useful Pieces
Pieces   Share   Note
270      90%
200      66%
150      50%
30       10%     (duplicates)
6        2%      (pure noise)
+36 Useless Pieces!
33
34
First Discovery
35
More Data Finds Data
36
Duplicates in Front Of Your Eyes
37
First Duplicate Found Here
38
39
40
Incremental Context – Incremental Discovery
6:40pm START
22min “Hey, this one is a duplicate!”
35min “I think some pieces are missing.”
37min “Looks like a bunch of hillbillies on
a porch.”
44min “Hillbillies, playing guitars, sitting
on a porch, near a barber sign …
and a banjo!”
41
150 pieces
50%
42
Incremental Context – Incremental Discovery
47min “We should take the sky and grass
off the table.”
2hr “Let's switch sides, and see if we
can make sense of this from
different perspectives.”
2hr10m “Wait, there are three … no, four
puzzles.”
2hr17m “We need a bigger table.”
2hr18m “I think you threw in a few random
pieces.”
43
44
45
46
Trend: Big Data [in context] = New Physics
More data: better predictions
– Lower false positives
– Lower false negatives
More data: bad data … good
– Suddenly glad your data was not perfect
More data: less compute
47
From Pixels to Pictures to Insight
Observations → (Contextualization) → Persistent Context → (Relevance Detection) → Consumer
(an analyst, a system, the sensor itself, etc.)
48
One Form of Context is “Expert Counting”
Is it 5 people each with 1 account … or is it 1
person with 5 accounts?
Is it 20 cases of H1N1 in 20 cities … or one
case reported 20 times?
If one cannot count … one cannot estimate
vector or velocity (direction and speed).
Without vector and velocity … prediction is
nearly impossible.
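
A minimal sketch of the counting question, under a deliberately simple assumption that is not from the slides: a normalized patient name plus date of birth identifies one person. The reports and the `count_cases` helper are hypothetical.

```python
def count_cases(reports):
    """Count distinct cases, assuming (for illustration only) that a
    normalized patient name plus date of birth identifies one person."""
    seen = set()
    for name, dob, _city in reports:
        # Lowercase and collapse whitespace so trivial variants compare equal.
        key = (" ".join(name.lower().split()), dob)
        seen.add(key)
    return len(seen)


# Hypothetical reports: one patient reported from three cities,
# plus one genuinely different patient.
reports = [
    ("Pat Doe", "1980-01-02", "Austin"),
    ("pat  doe", "1980-01-02", "Dallas"),
    ("PAT DOE", "1980-01-02", "Houston"),
    ("Sam Roe", "1975-06-30", "Austin"),
]

print(len(reports))          # 4 raw reports
print(count_cases(reports))  # 2 distinct cases
```

Counting raw reports here would overstate the number of cases; only the resolved count gives a usable estimate of direction and speed.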
49
Skilled adversaries engage in
“channel separation.”
Cell Phone #1 Cell Phone #2 Bank Acct #1 Passport #1
Unknown Unknown Billy K. William A.
50
Hence, detection requires
“channel consolidation.”
William A
aka Billy K.
• Cell Phone #1
• Cell Phone #2
• Bank Acct #1
• Passport #1
51
Expert Counting: Degrees of Difficulty
Difficulty              Record 1             Record 2
Deceit                  Bob Jones, 123455    Ken Wells, 550119
Incompatible Features   Bob Jones, 123455    bjones@hotmail
Fuzzy                   Bob Jones, 123455    Robert T Jonnes, 000123455
Exactly Same            Bob Jones, 123455    Bob Jones, 123455
52
Deceit Detection Using Context Accumulation
[Figure: Deceit, then Feature Accumulation, then Resolved!]
Deceit: Bob Jones, 123455 and Ken Wells, 550119
Feature Accumulation:
– Robert Jones, 123455, POB 13452, DOB 03/12/73
– Ken Wells, 550119, POB 999911, DOB 03/12/73
– Bob Jones, gw3e56@hotmail.com, POB 13452
– Further observations carrying gw3e56@hotmail.com and DOB 03/12/73
Resolved!: Robert Jones, 123455 and Ken Wells, 550119
53
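
A rough sketch of context accumulation using the values from the deceit slide above. Two things are assumptions rather than anything the slides state: the matching rule (a shared account number, email or POB is enough to merge; names and DOB alone are not) and the pairing in the last record, which is a guess at the garbled figure. `ContextStore` is a hypothetical name.

```python
class ContextStore:
    """Persistent context: every new record is resolved against what is
    already known. Assumed rule: a shared acct, email or pob merges."""

    STRONG = ("acct", "email", "pob")

    def __init__(self):
        self.entities = {}   # entity id -> list of records
        self.index = {}      # (feature, value) -> entity id
        self._next_id = 1

    def observe(self, record):
        keys = [(f, record[f]) for f in self.STRONG if f in record]
        matches = sorted({self.index[k] for k in keys if k in self.index})
        if matches:
            target = matches[0]
            # The new record links previously separate entities: merge them.
            for other in matches[1:]:
                self.entities[target].extend(self.entities.pop(other))
        else:
            target = self._next_id
            self._next_id += 1
            self.entities[target] = []
        self.entities[target].append(record)
        # Re-point every strong feature of the merged entity at the target.
        for rec in self.entities[target]:
            for f in self.STRONG:
                if f in rec:
                    self.index[(f, rec[f])] = target
        return target


records = [
    {"name": "Bob Jones", "acct": "123455"},
    {"name": "Ken Wells", "acct": "550119"},
    {"name": "Robert Jones", "acct": "123455", "pob": "13452", "dob": "03/12/73"},
    {"name": "Ken Wells", "acct": "550119", "pob": "999911", "dob": "03/12/73"},
    {"name": "Bob Jones", "email": "gw3e56@hotmail.com", "pob": "13452"},
    # Assumed pairing: the shared email turning up on the Ken Wells account.
    {"name": "Ken Wells", "acct": "550119", "email": "gw3e56@hotmail.com"},
]

store = ContextStore()
counts = []
for record in records:
    store.observe(record)
    counts.append(len(store.entities))

print(counts)  # [1, 2, 2, 2, 2, 1]
```

The two identities stay separate until the final observation arrives, which is the "data finds data" effect: no record is re-queried, the new piece simply lands next to the old ones.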
3 Models for
Information Sharing
54
1. Bulk Transfer
Large collections are passed along to appropriate third parties
May be required if the recipient must commingle the data in
secret
The recipients must have a capacity much larger than their own
native requirements
The more copies the more difficult it is to maintain the
information currency across the ecosystem
The more copies the more difficult it is to prevent unintended
disclosure
Useful when the number of recipients and transactional
volumes are very small
55
2. Services for Inquiry
Owners enable third party inquiry (human or machine lookups)
When lots of systems are integrated, federated search can be
automated to search all third party data sources based on a
single user/machine search
Each system in the federation must be sized for all volume
Third party systems often lack the necessary indexes
Nearly impossible to ensure each federated system is on-line
Useful for periodic, on-demand, inquiry using each third party
data source like a reference system – particularly appropriate
for narrow investigative work and/or forensic analysis
Not that useful for detect/preempt missions
56
3. Central Catalog/Index
Parties interested in information sharing supply metadata to a
central catalog (index)
Inquiries can discover the location of all available documents
using a single lookup
Card catalogs provide pointers to source systems and
documents enabling efficient/scalable lookup (aka federated
fetch)
Easier to keep the data current … than bulk transfer
Scales massively
Easier to secure
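
The card-catalog idea can be sketched as follows. Everything named here (`CentralCatalog`, `federated_fetch`, the system names, fields and records) is hypothetical; the only point taken from the slides is that the catalog holds metadata and pointers while the records stay with their owners.

```python
from collections import defaultdict


class CentralCatalog:
    """Holds metadata and pointers only; records stay with their owners."""

    def __init__(self):
        # (field, value) -> set of (system, record_id) pointers
        self._index = defaultdict(set)

    def publish(self, system, record_id, metadata):
        for field, value in metadata.items():
            self._index[(field, value.lower())].add((system, record_id))

    def lookup(self, field, value):
        """One lookup returns pointers to every holder of matching data."""
        return sorted(self._index.get((field, value.lower()), set()))


def federated_fetch(pointers, sources, requester):
    """Fetch full records from the owning systems; each owner decides access."""
    results = []
    for system, record_id in pointers:
        source = sources[system]
        if requester in source["allowed"]:
            results.append(source["records"][record_id])
    return results


catalog = CentralCatalog()
catalog.publish("employee_db", "E-1", {"who": "M. Smith", "where": "Houston"})
catalog.publish("fraud_db", "F-9", {"who": "M. Smith", "what": "chargeback"})

sources = {
    "employee_db": {"allowed": {"analyst_1"},
                    "records": {"E-1": {"name": "M. Smith", "role": "cashier"}}},
    "fraud_db": {"allowed": set(),
                 "records": {"F-9": {"name": "M. Smith", "case": "open"}}},
}

pointers = catalog.lookup("who", "m. smith")
print(pointers)  # [('employee_db', 'E-1'), ('fraud_db', 'F-9')]
print(federated_fetch(pointers, sources, "analyst_1"))
# [{'name': 'M. Smith', 'role': 'cashier'}]
```

The requester learns that a fraud record exists, but must still ask its owner for the content.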
57
Discovery at the Library
?
Subject Title Author
58
Enterprise Discovery
Who What Where When How
59
The Policy Focus Becomes … “Discoverability”
If you don't publish your meta-data (who,
what, where, when) to the enterprise
catalog …
Information is not discoverable …
Therefore, the value of your operational
system to the broad strategic interests of
the enterprise is effectively ZERO!
60
Are You Playing Well With Others?
SHARING SCORECARD(*)
DISCOVERABILITY
Organization      Records   Discoverable   %
This org          5B        2.5B           50%
That org          120B      6B             5%
The other org     3B        1B             33%
Their org         1B        750M           75%
Their other org   1B        500M           50%
(*) Any resemblance to real organizations and real numbers would be coincidental
61
Challenging
Privacy and Civil Liberties
Issues
62
Issue #1: Essential Secrets vs. Transparency
To detect professionally fabricated lies,
using only data, one must either:
1. Collect observations the adversary doesn't know you have
2. Or, be able to perform compute over your observations in a
manner the adversary cannot fathom
The Challenge: How can organizations
catch bad guys if there is transparency
over their observational space and what
is computable?
63
Issue #2: More Data Good
The good news: both the counterterrorism business
and the privacy community detest false positives
– The government recognizes that false positives waste government
resources
– The privacy community recognizes that false positives place the innocent
under undeserved government scrutiny
The challenge: Two remedies for false positives
1. Change the rules to reduce the number of alerts (which increases the
false negatives)
2. Add more information such that the additional context permits greater
discrimination
The more data, the lower the false positives and the lower
the false negatives
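
A toy illustration of remedy 2, with invented data: matching on name alone versus name plus date of birth. The passenger list, the `listed` ground-truth flag and the `alerts` helper are assumptions made up for this sketch, not real watch-list logic.

```python
def alerts(passengers, watch_list, fields):
    """Return passengers matching a watch-list entry on every listed field."""
    listed = {tuple(entry[f] for f in fields) for entry in watch_list}
    return [p for p in passengers if tuple(p[f] for f in fields) in listed]


def false_positives(hits):
    """Count alerts raised on people who are not actually the listed person."""
    return sum(1 for p in hits if not p["listed"])


watch_list = [{"name": "john smith", "dob": "1970-05-01"}]
passengers = [
    {"name": "john smith", "dob": "1970-05-01", "listed": True},
    {"name": "john smith", "dob": "1988-11-23", "listed": False},
    {"name": "john smith", "dob": "1954-02-14", "listed": False},
    {"name": "mary jones", "dob": "1970-05-01", "listed": False},
]

name_only = alerts(passengers, watch_list, ["name"])
name_dob = alerts(passengers, watch_list, ["name", "dob"])

print(len(name_only), false_positives(name_only))  # 3 2
print(len(name_dob), false_positives(name_dob))    # 1 0
```

In this toy case the extra field removes both false positives while the one true match is still caught, so nothing is traded away in false negatives.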
64
Issue #3: Necessity of Central Indexes
Federated search is extremely limited
– Does not scale when the mission is to get “left of boom”
(detection)
Central card catalogs (indexes) are the only
viable way forward
– Only the metadata is centralized, with pointers; not all the
data
The Challenge: General reaction to central
databases, even if just an index
65
Issue #4: Lone Gunmen Surveillance
Rare events planned by one or a small group are more difficult
to detect
The size of the observation space needed to detect lone
gunmen planning acts of terrorism … approaches ubiquitous
surveillance
Risk-based surveillance
– A car bomb in a public place
– A sector of national infrastructure at risk
– WMD over a major city
The Challenge: At some point when one person can create
extraordinary damage, cheaply, without a trace … then what?
66
Issue #5: Fewer Secrets Lead to Chilling Effects?
It is becoming harder and harder to
have secrets
Will this chill behavior?
– Will population behavior gravitate towards the
center of the bell curve?
– Or, will mankind become more tolerant of
diversity?
67
Privacy by Design (PbD)
Considerations
68
Universal Declaration of Human Rights
Article 9
No one shall be subjected to arbitrary arrest, detention or exile.
Article 12
No one shall be subjected to arbitrary interference with his privacy,
family, home or correspondence, nor to attacks upon his honor and
reputation. Everyone has the right to the protection of the law
against such interference or attacks.
Article 15
(1) Everyone has the right to a nationality.
(2) No one shall be arbitrarily deprived of his nationality nor denied
the right to change his nationality.
Article 17
(1) Everyone has the right to own property alone as well as in
association with others.
(2) No one shall be arbitrarily deprived of his property.
69
PbD: Information Attribution
Avoid the receipt of any data that does not come with an
ability to track its pedigree/attribution.
When passing your data into secondary systems, pass the data
pedigree/attribution along to the recipient (even if that means
only a pointer to your copy).
If the 'chain of where data came from' is not maintained in the
information sharing ecosystem – there is no hope of keeping it
current and very difficult to reconcile cross-system
consistency.
More here:
Source Attribution, Don't Leave Home Without It
Out-bound Record-level Accountability in Information Sharing Systems
70
PbD: Data Destruction
When the data is no longer needed or there is a mandate …
purge it.
For example, at the close of a special information analysis
project; consider decommissioning the data sets in proportion
to the consequences of unintended disclosure or misuse.
If there is a legal requirement to retain data, or long term
accountability is necessary, consider pushing the data to forms
of retrieval useful only in the context of
forensic/investigatory purposes.
More here:
Decommissioning Data: Destruction of Accountability
71
PbD: Limit Data Transfers
If you don't have to move the entire record: don't.
Using information sharing systems as an example, it is best not
to send all the data to each (and every) information sharing
partner. Better to create a central index with prescribed
fields. The index then points to the original data holder – and
getting access to the original record requires permission at
that time, from the original data holder. This ensures a degree
of transparency.
More here:
Discoverability: The First Information Sharing Principle
72
PbD: Data Tethering
When data is moved from systems of record out into
secondary systems, as the source data changes (adds, changes
and deletes) these secondary systems should be notified.
If the secondary systems have themselves forwarded the data
to tertiary systems, these same changes should be passed
through the entire food chain.
More here:
Data Tethering: Managing the Echo
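
One way to picture tethering, as a sketch only: each system remembers who it forwarded data to and passes every add, change and delete along. The `System` class and its methods are hypothetical, and the sketch assumes the sharing chain has no cycles.

```python
class System:
    """A data holder that remembers which systems it forwarded data to."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.downstream = []

    def share_with(self, other):
        """Start forwarding to another system and send it the current data."""
        self.downstream.append(other)
        for record_id, value in self.records.items():
            other.apply("add", record_id, value)

    def apply(self, op, record_id, value=None):
        """Apply a change locally, then echo it down the chain."""
        if op == "delete":
            self.records.pop(record_id, None)
        else:  # "add" or "change"
            self.records[record_id] = dict(value)
        for system in self.downstream:
            system.apply(op, record_id, value)


source = System("system of record")
secondary = System("secondary")
tertiary = System("tertiary")

source.apply("add", "r1", {"status": "missing"})
source.share_with(secondary)
secondary.share_with(tertiary)

source.apply("change", "r1", {"status": "found"})
print(tertiary.records)  # {'r1': {'status': 'found'}}

source.apply("delete", "r1")
print(tertiary.records)  # {}
```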
73
PbD: Obfuscate Data
For every copy there is an increasing risk of unintended
disclosure.
When there is an opportunity to perform data masking,
anonymization, encryption … do it.
Techniques now exist whereby data can be first obfuscated
(e.g., encrypted, anonymized, masked, etc.) before information
transfer ... while still maintaining a capability of performing
deep analytics (e.g., data matching) post obfuscation.
More here:
To Anonymize or Not Anonymize, That is the Question
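
A hedged sketch of matching after obfuscation. The slides do not name a technique; this one assumes keyed one-way hashing (HMAC-SHA256 from Python's standard library) over normalized features, with a key that both data holders share. The key, the `normalize` rule and the sample records are illustrative only.

```python
import hashlib
import hmac


def normalize(value):
    """Keep only letters and digits, lowercased, so formatting does not matter."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def obfuscate(record, key):
    """Replace each feature with a keyed one-way hash of its normalized value."""
    return {
        hmac.new(key, f"{field}:{normalize(value)}".encode(),
                 hashlib.sha256).hexdigest()
        for field, value in record.items()
    }


key = b"shared-secret-for-illustration-only"
employee = {"dob": "06/07/74", "phone": "713 731 5577",
            "address": "123 Main Street"}
fraud = {"dob": "06/07/74", "phone": "713-731-5577"}

# Only the hashes leave each system; the match is computed on hashes alone.
shared = obfuscate(employee, key) & obfuscate(fraud, key)
print(len(shared))  # 2 features in common (dob and phone)
```

Exact hashing only matches values that normalize identically, so fuzzy name variants would need further preparation before hashing. Low-entropy fields such as dates of birth also remain guessable by anyone who holds the key.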
74
Maximizing Discovery - Minimizing Disclosure
[Figure: Sensors, Observations, Persistent Context]
Employee Database, Record #A-701, features:
0d06b31faa7c…
B5e341a4b0c…
00c9782a552…
Fraud Database, Record #B-9103, features:
Cd5dced41028cb7ea51
00c9782a552a2d09b1b
7f2b6e48ea7d042bbe8
Persistent Context holds Cd5dced41028cb …, 00c9782a552a2 …, 7f2b6e48ea7d0 … and raises an alert (!)
75
Maximizing Discovery - Minimizing Disclosure
[Figure: Sensors, Observations, Discovery; each database sits behind Policy Controls]
Employee Database, Record #A-701:
Mark Randy Smith
DOB: 06/07/74
123 Main Street
713 731 5577
Fraud Database, Record #B-9103:
M. Randal Smith
DOB: 06/07/74
713 731 5577
Discovery: Record #A-701 Matches Record #B-9103
76
PbD: Build Accountability into Systems
Opt for the use of tamper-resistant audit logs. The greater
the lack of transparency, the greater the need for immutable
logs: mandated or not.
More here:
Immutable Audit Logs (IALs)
Found: An Immutable Audit Log
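
A small sketch of one common way to make a log tamper-evident: chain each entry to the hash of the previous one, so any later edit breaks verification. This is an assumed technique for illustration, not the design described in the linked posts, and `AuditLog` is a hypothetical name.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def append(self, event):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev, "hash": digest})

    def verify(self):
        """Recompute the chain; any edited or reordered entry fails."""
        prev = self.GENESIS
        for entry in self.entries:
            body = json.dumps(entry["event"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append({"user": "analyst_1", "action": "search", "query": "Record #A-701"})
log.append({"user": "analyst_1", "action": "view", "record": "Record #A-701"})
print(log.verify())  # True

log.entries[0]["event"]["user"] = "someone_else"  # after-the-fact edit
print(log.verify())  # False
```

A chain like this detects tampering rather than preventing it; in practice the latest hash would also have to be held somewhere the log's operator cannot rewrite.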
77
Comments on: Data Mining
Data mining is not bad. There are settings where data mining is
very valuable and saves lives
Predictive Data Mining – Limited efficacy without volumes of
training data
Predicate Triage Data – Used to organize data sets containing
only “subjects of interest”
More here:
Effective Counterterrorism and the Limited Role of Predictive Data Mining
Data Mining, Predicate Triage and NSA Domestic Surveillance
78
Data Mining Defined (humorous)
“Torturing the data until it confesses …
and if you torture it enough, you can
get it to confess to anything.”
ACM SIGKDD Conference, Philadelphia 2006
79
Comments on: Link Analysis
Link analysis is very powerful, when used in a narrow fashion.
Inspection of “subjects of interest” outward.
Predicate-based link analysis: Big social maps are not useful
unless one has an entrance point.
Link analysis: prune early
More here:
Hunting Bad Guys, Phone Records and a Few Good Dead Men
Predicate-based Link Analysis: A Post 9/11 Analysis (1+1= 13)
Sometimes a Big Picture is Worth a 1,000 False Positives
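
"Subjects of interest outward" and "prune early" can be sketched as a breadth-first walk that starts only from known subjects and stops after a fixed number of hops. The graph, the hop limit and the `expand` helper are hypothetical.

```python
from collections import deque


def expand(graph, seeds, max_hops):
    """Walk outward from known subjects, pruning everything past max_hops."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # prune: do not follow links beyond the limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical contact graph; "x" and "y" have no path from the subject.
graph = {
    "subject": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "x": ["y"],
}

reached = expand(graph, ["subject"], max_hops=2)
print(reached)  # {'subject': 0, 'a': 1, 'b': 1, 'c': 2}
```

Nodes with no path from an entrance point are never touched, and the hop limit keeps the map from growing into the whole population.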
80
Comments on: Watch Listing and False Positives
Difference between wrongly named and wrongly matched
Low fidelity watch lists are the single biggest cause of false
positives - solving this ambiguity involves additional data
Minimize collection, maximize consumer participation and
election
Provide a redress process
More here:
Precision in TSA's Terrorist Watch List
Comments on the TSA No-Fly and Selectee Watch List Process
81
Closing Thoughts
82
“The data must find the
data … and the relevance
must find the user.”
83
In Closing
There are going to be more sensors, more data
This data will be commingled for greater accuracy to serve
consumers and protect countries
What data is collected/observed and when … will be the debate
Chief privacy principle: Avoid consumer surprise
If it has been collected, the holder has the obligation to make
sense of it
Organizations must harness data to be smart, efficient, and
survive … but how smart do they need to be and do we trust
them?
Hence the tension
84
Related Papers
Heritage Foundation: Paul Rosenzweig/Jeff Jonas
Correcting False Positives: Redress and the Watch List Conundrum
Cato Institute: Jeff Jonas/Jim Harper
Effective Counterterrorism and the Limited Role of Predictive Data Mining
Steptoe & Johnson: Stewart Baker
Anonymization, Data-Matching and Privacy: A Case Study
IEEE Security and Privacy: Jeff Jonas
Threat and Fraud Intelligence: Las Vegas Style
Giannino Bassetti Foundation: Jeff Ubois
Transparency, Privacy and Responsibility: An Interview with Jeff Jonas
Markle Foundation
Nation At Risk: Policy Makers Need Better Information to Protect the Country
85
Related Blog Posts
Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
Puzzling: How Observations Are Accumulated Into Context
When Risk Assessment is the Risk
Big Data. New Physics.
The Christmas Day Intelligence Failure – Part II: Jeff Jonas' Christmas Wish List
Decommissioning Data: Destruction of Accountability
Source Attribution, Don't Leave Home Without It
Data Tethering: Managing the Echo
Out-bound Record-level Accountability in Information Sharing Systems
To Anonymize or Not Anonymize, That is the Question
Immutable Audit Logs (IALs)
The Information Sharing Paradox
Discoverability: The First Information Sharing Principle
When Federated Search Bites
Using Transparency As A Mask
86