QR’09
False Identity Detection
Order-of-Magnitude Based Approach
Tossapon Boongoen & Qiang Shen,
Aberystwyth University, UK
An integration of qualitative reasoning and link analysis, to detect possible use of false (or deceptive) identity.
Date 24/06/09
Outline
Background of False Identity
False Identity Detection Approaches
Order-of-Magnitude Based Model
Experimental Results
Conclusion
Background
Age of Terror
False identity has become the common denominator of serious crimes and terrorism.
In the UK, financial losses due to false identity are reported to be around 1.3 billion pounds each year.
In the case of the 9/11 attacks in particular, US authorities failed to discover the use of false identities by the terrorists.
19 terrorists entered the US on 9-11 with false identities.
Background
Identity is a set of characteristic descriptors unique to a specific person.
Identity
Attributed: name, date of birth
Biographical: educational, financial or criminal history
Biometric: DNA and fingerprint
Easy to falsify!
Attributed Identity
Name deception is the most common practice with attributed identity.
False Identity (attributed): Name (100%), DOB (66.7%), ID (56.3%), Resident (33.3%)
Name deception:
Completely different name
Add-on abbreviation
Similar pronunciation
First-second name swap
False Identity Detection
The text-based approach uses string-matching techniques to compare the similarity of two strings (X, Y), e.g. edit distance and Jaro.
Edit distance is based on the number of edit operations needed to transform X into Y.
Jaro relies on the number and order of the common characters between X and Y. It is effective for short strings, especially personal names.
This approach is effective for problems caused by data-entry or translation errors, but it fails to deal with 'highly deceptive cases'.
Problems of high deception, e.g. the alias pairs:
Bin laden / The prince
Bin laden / The emir
Fadil muhamad / Harun fazul
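To make the limitation concrete, here is a minimal sketch of edit distance (the standard Levenshtein formulation) applied to name pairs that appear in these slides. The `similarity` helper, which normalises by the longer string's length, is an assumption for illustration and not necessarily the normalisation used in the experiments.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn x into y."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            current.append(min(previous[j] + 1,          # delete cx
                               current[j - 1] + 1,       # insert cy
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(x: str, y: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


pairs = [("Hepu Deng", "Hepu Dong"),         # data-entry style variation
         ("Bin laden", "The prince"),        # highly deceptive alias
         ("Fadil muhamad", "Harun fazul")]   # highly deceptive alias
for a, b in pairs:
    print(f"{a!r} vs {b!r}: similarity = {similarity(a.lower(), b.lower()):.2f}")
```

A one-character variation scores close to 1, whereas the deceptive aliases share too few characters for any string-matching threshold to link them.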
False Identity Detection
Link-based approach
Despite using several false identities, a criminal (e.g. a terrorist) typically exhibits a unique pattern of relations to other information objects.
The similarity of objects can be estimated from the link patterns they take part in.
Example methods: SimRank (publication domain), PageSim (Internet domain)
[Figure: Identities A, B and C linked to information objects such as House, Phone No., Cash Card and Email.]
Link Analysis
Terminology
Vertex: name
Edge: co-occurrence relation
Edge weight: co-occurrence frequency
Link-based similarity
Several methods use different properties of shared neighbours. E.g., for the neighbours of (A, B):
Cardinality = 2 (i.e. C and D)
Uniqueness = average of the uniquenesses of the individual shared neighbours = (Uniqueness of C + Uniqueness of D) / 2
[Figure: example weighted graph with vertices A, B, C and D; edge weights are co-occurrence frequencies.]
Uniqueness Measure
Uniqueness is estimated for each shared neighbour k of vertices i and j:
$UQ^{k}_{ij} = \dfrac{f_{ik} + f_{jk}}{\sum_{m} f_{mk}}$
f_ik: frequency of the link between vertices i and k,
f_jk: frequency of the link between vertices j and k,
f_mk: frequency of the link between any vertex m and the vertex k.
The uniqueness measure captures the relative density of unique links to the nodes in question.
Uniqueness Measure
[Figure: the example graph, highlighting the links into the shared neighbours D and C of (A, B).]
Uniqueness of D = (1 + 3) / (1 + 3) = 1
Uniqueness of C = (2 + 6) / (2 + 6 + 1 + 1) = 0.8
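The two link properties can be sketched in a few lines of Python. The edge assignment below is an assumed reading of the example figure (A-D = 1, B-D = 3, A-C = 2, B-C = 6, plus two further links of weight 1 into C); the vertex names E and F are hypothetical, and the remaining edges of the figure are omitted because they do not touch C or D.

```python
# graph[u][v] = co-occurrence frequency of u and v (undirected).
EDGES = [("A", "D", 1), ("B", "D", 3),
         ("A", "C", 2), ("B", "C", 6),
         ("C", "E", 1), ("C", "F", 1)]   # E and F are hypothetical names


def build_graph(edges):
    graph = {}
    for u, v, freq in edges:
        graph.setdefault(u, {})[v] = freq
        graph.setdefault(v, {})[u] = freq
    return graph


def shared_neighbours(graph, i, j):
    return sorted(set(graph[i]) & set(graph[j]))


def cardinality(graph, i, j):
    """Number of neighbours that i and j have in common."""
    return len(shared_neighbours(graph, i, j))


def uniqueness(graph, i, j, k):
    """UQ^k_ij = (f_ik + f_jk) / sum over m of f_mk."""
    return (graph[i][k] + graph[j][k]) / sum(graph[k].values())


def average_uniqueness(graph, i, j):
    shared = shared_neighbours(graph, i, j)
    if not shared:
        return 0.0
    return sum(uniqueness(graph, i, j, k) for k in shared) / len(shared)


graph = build_graph(EDGES)
print(cardinality(graph, "A", "B"))          # shared neighbours C and D
print(uniqueness(graph, "A", "B", "D"))
print(uniqueness(graph, "A", "B", "C"))
print(average_uniqueness(graph, "A", "B"))
```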
OM-based Model
Motivations
Existing numerical techniques suffer from inaccurate description (often caused by unduly large values). E.g. the normalised interpretation of cardinality = 100 is 0.1 when the maximum cardinality = 1,000.
Link properties, such as cardinality, are usually a matter of degree.
Link property measures are therefore gauged and described qualitatively, using the order-of-magnitude formalism.
Additionally, most link-based similarity methods take into account only one property of the neighbourhood context. Here, multiple properties (e.g. cardinality and uniqueness) are combined to improve the quality of the similarity measure.
OM-based Model
Constructing an OM scale: Cardinality
Numerical scale: 0, 2, 6, …
Landmark set (human analyst) = {2, 6}
OM scale:
Small: [0, 2]
Medium: (2, 6]
Large: (6, +∞)
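As a minimal sketch, a numeric cardinality can be mapped onto this OM scale with a binary search over the landmark set. The function name `om_label` is an assumption for illustration; the intervals are treated as right-closed, as on the slide.

```python
from bisect import bisect_left

CT_LANDMARKS = [2, 6]
CT_LABELS = ["small", "medium", "large"]


def om_label(value, landmarks, labels):
    """Return the qualitative label of `value` on a right-closed OM scale,
    e.g. landmarks {2, 6} give [0, 2], (2, 6] and (6, +inf)."""
    if len(labels) != len(landmarks) + 1:
        raise ValueError("need exactly one more label than landmarks")
    return labels[bisect_left(landmarks, value)]


for cardinality in (0, 2, 3, 6, 7, 100):
    print(cardinality, om_label(cardinality, CT_LANDMARKS, CT_LABELS))
```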
OM-based Model
OM Space: Cardinality
[Figure: OM space for cardinality, ordered from the most abstract description (top) to the most precise (bottom).]
[small, large]
[small, medium], [medium, large]
[small, small], [medium, medium], [large, large]
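The OM space above is the set of all label intervals [a, b] with a no larger than b. A minimal sketch that enumerates it and groups the intervals by width (wider means more abstract); the function names are assumptions for illustration.

```python
from itertools import combinations_with_replacement

LABELS = ["small", "medium", "large"]   # ordered from lowest to highest


def om_space(labels):
    """All intervals [a, b] over the ordered labels with a <= b."""
    return list(combinations_with_replacement(labels, 2))


def by_abstraction(labels):
    """Group intervals by width: 0 is the most precise level."""
    levels = {}
    for low, high in om_space(labels):
        width = labels.index(high) - labels.index(low)
        levels.setdefault(width, []).append((low, high))
    return levels


for width, intervals in sorted(by_abstraction(LABELS).items(), reverse=True):
    print(width, intervals)
```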
OM-based Model
Semi-supervised determination of landmarks
Human-directed landmarks are not optimal for different datasets. A better alternative is to learn them from data.
In this work, a density function is used to determine landmarks:
$D(t) = \dfrac{N(t)}{N^{*}}$
D(t): density of property measure t,
N(t): number of entity pairs whose property measure ≥ t,
N*: number of all entity pairs.
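A minimal sketch of the density function follows. The slides do not spell out how landmarks are read off D(t), so `landmark_candidates` uses an assumed rule (the smallest observed t at which D(t) falls to a given threshold) purely for illustration; the data and the thresholds are made up.

```python
def density(values, t):
    """D(t) = N(t) / N*: share of entity pairs whose property measure is >= t."""
    return sum(1 for v in values if v >= t) / len(values)


def landmark_candidates(values, thresholds):
    """ASSUMED selection rule, not taken from the slides: for each threshold,
    pick the smallest observed t with D(t) <= threshold."""
    observed = sorted(set(values))
    candidates = []
    for threshold in thresholds:
        for t in observed:
            if density(values, t) <= threshold:
                candidates.append(t)
                break
    return candidates


# Hypothetical cardinalities of 100 entity pairs.
cardinalities = [1] * 90 + [5] * 9 + [20]
print(density(cardinalities, 5))
print(landmark_candidates(cardinalities, thresholds=[0.1, 0.01]))
```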
OM-based Model
Learning landmark values
[Figure: order-of-magnitude values of D(t), with axis ticks at successive powers of ten and learned landmark values 4, 7, 10 and 23.]
OM-based Model
Homogenisation of OM Models
Multiple link properties are described in different OM spaces.
Prior to combining these measures, the homogenisation of property-specific OM scales is required.
For instance, the landmark sets of cardinality and uniqueness are {2, 6} and {0.1, 0.3, 0.6, 0.8}, to be mapped onto the homogenised scale of {-3, -2, -1, 0, 1, 2, 3}.
Step 1.1: Select the central landmark (lc), which is in the middle of each ordered landmark set.
CT = {2, 6}: lc = 2 or lc = 6
UQ = {0.1, 0.3, 0.6, 0.8}: lc = 0.3 or lc = 0.6
Homogenisation
Step 1.2: Modify each original landmark li to its new value sli, such that sli = li - lc.
CT = {2, 6} → {0, 4}, lc = 2
UQ = {0.1, 0.3, 0.6, 0.8} → {-0.2, 0, 0.3, 0.5}, lc = 0.3
Step 2: Add landmark values, such that they appear symmetrically on both the positive and negative sides of 0.
CT = {0, 4} → {-4, 0, 4}
UQ = {-0.2, 0, 0.3, 0.5} → {-0.5, -0.3, -0.2, 0, 0.2, 0.3, 0.5}
Step 3: Add additional landmarks, such that all landmark sets have the same granularity.
CT = {-4, 0, 4} → {-4, -2, -1, 0, 1, 2, 4}
UQ = {-0.5, -0.3, -0.2, 0, 0.2, 0.3, 0.5} (unchanged)
Homogenisation
Finally, map the modified scales to the homogenised set.
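The steps can be sketched as follows, under two stated assumptions: for an even-sized landmark set the lower of the two middle landmarks is taken as lc (matching lc = 2 and lc = 0.3 in the example), and the final mapping sends each landmark to its signed rank around 0. The slides do not give the rule for choosing the extra landmarks in Step 3, so the Step 3 result for CT is copied from the slide rather than computed.

```python
def shift_by_central(landmarks):
    """Steps 1.1 and 1.2: subtract the central landmark lc from each landmark."""
    ordered = sorted(landmarks)
    lc = ordered[(len(ordered) - 1) // 2]   # assumed: lower-middle landmark
    # Rounding guards against floating-point noise such as 0.1 - 0.3.
    return [round(l - lc, 9) for l in ordered]


def symmetrise(shifted):
    """Step 2: mirror every landmark around 0."""
    return sorted({s for v in shifted for s in (v, -v)})


def homogenise(landmarks, final_scale):
    """Final step: map each original landmark to its signed rank in the
    symmetric scale, so that the central landmark maps to 0."""
    ordered = sorted(final_scale)
    zero = ordered.index(0)
    shifted = shift_by_central(landmarks)
    return {l: ordered.index(s) - zero
            for l, s in zip(sorted(landmarks), shifted)}


ct = [2, 6]
uq = [0.1, 0.3, 0.6, 0.8]
ct_final = [-4, -2, -1, 0, 1, 2, 4]          # Step 3 result copied from the slide
uq_final = symmetrise(shift_by_central(uq))  # already has seven landmarks

print(homogenise(ct, ct_final))
print(homogenise(uq, uq_final))
```

Under these assumptions the cardinality landmarks 2 and 6 map to 0 and 3, and the uniqueness landmarks 0.1, 0.3, 0.6 and 0.8 map to -1, 0, 2 and 3.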
OM-based Model
Homogenised and Original Scales
Property     Label      Original     Homogenised
Cardinality  small      [0, 2]       (-∞, 0]
Cardinality  medium     (2, 6]       (0, 3]
Cardinality  large      (6, +∞)      (3, +∞)
Uniqueness   very low   [0, 0.1]     (-∞, -1]
Uniqueness   low        (0.1, 0.3]   (-1, 0]
Uniqueness   moderate   (0.3, 0.6]   (0, 2]
Uniqueness   high       (0.6, 0.8]   (2, 3]
Uniqueness   very high  (0.8, 1]     (3, +∞)
OM-based Model
Combining property measures
Different properties carry different relevance (importance) degrees. Qualitative relevance is used here:
Cardinality (CT) = ++ (or 2) and Uniqueness (UQ) = + (or 1).
OMS (Order-of-Magnitude based Similarity):
$\mathrm{OMS} = [f](CT, UQ, RV_{CT}, RV_{UQ}) = [f(CT, UQ, RV_{CT}, RV_{UQ})]_{S^{*}} = [2\,CT + UQ]_{S^{*}}$
RV_CT, RV_UQ: relevance degrees of CT and UQ, respectively,
f(.): real weighted sum,
[f(.)]: qualitative expression of f(.),
S*: OM space for expressing OMS values.
OM-based Model
Combining property measures
Example: CT = [medium, medium] and UQ = [moderate, high]
OMS = 2CT + UQ
= (2×(0, 3] + (0, 2]) ∪ (2×(0, 3] + (2, 3])
= (0, 8] ∪ (2, 9] = (0, 9]
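The arithmetic of this example can be checked with a small interval sketch. Intervals are represented as (low, high) tuples and read as left-open, right-closed; the helper names are assumptions for illustration, and only positive relevance degrees are handled.

```python
def scale(k, interval):
    """Multiply an interval by a positive relevance degree k."""
    low, high = interval
    return (k * low, k * high)


def add(a, b):
    """Interval addition: add the lower bounds and the upper bounds."""
    return (a[0] + b[0], a[1] + b[1])


def hull(intervals):
    """Smallest interval covering all the given intervals."""
    return (min(low for low, _ in intervals),
            max(high for _, high in intervals))


ct_medium = (0, 3)      # homogenised "medium" cardinality
uq_moderate = (0, 2)    # homogenised "moderate" uniqueness
uq_high = (2, 3)        # homogenised "high" uniqueness

# OMS = 2*CT + UQ, evaluated for each uniqueness label in [moderate, high].
terms = [add(scale(2, ct_medium), uq) for uq in (uq_moderate, uq_high)]
oms = hull(terms)
print(terms, oms)
```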
OM-based Model
Order-of-magnitude Similarity (OMS)
• Estimated with respect to the homogenised scale
• Described using the OM space of S*
S* scale: VL | -1 | L | 0 | M | 6 | H | 9 | VH (landmarks -1, 0, 6 and 9)
OMS of (0, 9] = [M, H]
• A different S* can be used for the specific precision level required.
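A minimal sketch of expressing a numeric OMS interval in S*, assuming the interval is left-open and right-closed like the scales in these slides. The function name `qualitative` is hypothetical.

```python
from bisect import bisect_left, bisect_right

S_STAR_LANDMARKS = [-1, 0, 6, 9]
S_STAR_LABELS = ["VL", "L", "M", "H", "VH"]


def qualitative(interval, landmarks=S_STAR_LANDMARKS, labels=S_STAR_LABELS):
    """Express a left-open, right-closed interval (low, high] as [first, last]."""
    low, high = interval
    # Values just above `low` fall in the label to the right of any landmark
    # equal to `low`; `high` itself belongs to the label it closes.
    first = labels[bisect_right(landmarks, low)]
    last = labels[bisect_left(landmarks, high)]
    return [first, last]


print(qualitative((0, 9)))
```

Swapping in a different landmark and label list changes the precision of the description without touching the numeric computation.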
Terrorist Data
Terrorist Data is extracted from online news and web stories
Wanted Al-Qaeda chief Osama bin Laden and his top aide, Ayman al-Zawahri, have been witnessed ... ... Osama bin Laden and Ayman al-Zawahri, moved out of Pakistan and are believed to have crossed the border back into Afghanistan ...
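[Figure: co-occurrence graph built from the text above, with vertices Al-Qaeda, Osama bin Laden, Ayman al-Zawahri and Afghanistan; edge weights count co-occurrences.]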
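A minimal sketch of turning extracted names into a weighted co-occurrence graph. It assumes that name extraction has already been done and that each document is given as a list of names; the two name lists below are an illustrative reading of the excerpts, not the actual dataset.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents):
    """Count, for every pair of names, the number of documents mentioning both.
    Keys are alphabetically sorted (name, name) tuples."""
    weights = Counter()
    for names in documents:
        for pair in combinations(sorted(set(names)), 2):
            weights[pair] += 1
    return weights


# Assumed output of a name-extraction step for the two excerpts.
documents = [
    ["Al-Qaeda", "Osama bin Laden", "Ayman al-Zawahri"],
    ["Osama bin Laden", "Ayman al-Zawahri", "Afghanistan"],
]

for (a, b), freq in sorted(cooccurrence(documents).items()):
    print(f"{a} -- {b}: {freq}")
```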
Example Data
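[Figure: example link networks with co-occurrence frequencies as edge weights.
Terrorist: vertices include Abu abdallah, Abu muhammad, Bin laden, Terrorist, September11 attack, Al qaida and Afghanistan.
DBLP: vertices include Chung-Hsing Yeh, Rowena Chen, Hepu Dong, Jisong Chen, Hepu Deng and Kate A. Smith.]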
OMS Performance
Different Combination Methods
OMS: Order-of-magnitude model with semi-supervised landmarks.
For Terrorist, CT = {4, 7, 10, 23}, UQ = {0.05, 0.12, 0.27, 0.43, 1}
For DBLP, CT = {2, 5, 9, 15}, UQ = {0.008, 0.04, 0.17, 0.31, 1}
OMSH: OMS with human-directed landmarks.
CT = {2, 6}, UQ = {0.1, 0.3, 0.6, 0.8}
QT: Numerical weighted summation. As with OMS, the relevance degrees of CT and UQ are 2:1 here.
OMS Performance
With Terrorist Data (Precision/Recall)
K: number of name-pairs with top values
Method  K=200        K=400        K=600        K=800        K=1,000
OMS     0.215/0.047  0.200/0.087  0.192/0.125  0.183/0.159  0.180/0.196
OMSH    0.045/0.009  0.143/0.062  0.151/0.099  0.134/0.120  0.138/0.150
QT      0.040/0.008  0.103/0.045  0.100/0.065  0.094/0.082  0.102/0.111
precision = #(disclosed alias pairs) / #(retrieved pairs)
recall = #(disclosed alias pairs) / #(alias pairs in dataset)
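A minimal sketch of how precision and recall at K can be computed from a ranked list of name-pairs. The ranking and the set of true alias pairs below are made-up toy data, and the function name is an assumption.

```python
def precision_recall_at_k(ranked_pairs, alias_pairs, k):
    """ranked_pairs: name-pairs sorted by decreasing similarity.
    alias_pairs: the known alias pairs in the dataset (order-insensitive)."""
    truth = {frozenset(pair) for pair in alias_pairs}
    retrieved = ranked_pairs[:k]
    disclosed = sum(1 for pair in retrieved if frozenset(pair) in truth)
    precision = disclosed / len(retrieved) if retrieved else 0.0
    recall = disclosed / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data.
ranking = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
aliases = [("b", "a"), ("e", "f"), ("x", "y"), ("z", "w")]

print(precision_recall_at_k(ranking, aliases, k=2))
print(precision_recall_at_k(ranking, aliases, k=4))
```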
OMS Performance
With DBLP Data (Precision/Recall)
K: number of name-pairs with top values
Method  K=100        K=200        K=300        K=400        K=500
OMS     0.040/0.174  0.025/0.217  0.017/0.217  0.015/0.261  0.020/0.435
OMSH    0.010/0.043  0.010/0.087  0.010/0.130  0.012/0.217  0.012/0.261
QT      0.010/0.043  0.005/0.043  0.007/0.087  0.010/0.174  0.010/0.217
OMS Performance
OMS against other link-based methods
With Terrorist Data (Precision/Recall)
K: number of name-pairs with top values
Method   K=200        K=400        K=600        K=800        K=1,000
OMS      0.215/0.047  0.200/0.087  0.192/0.125  0.183/0.159  0.180/0.196
SimRank  0.000/0.000  0.000/0.000  0.002/0.001  0.001/0.001  0.002/0.002
PageSim  0.035/0.008  0.090/0.039  0.105/0.069  0.099/0.086  0.092/0.100
OMS Performance
With DBLP Data
K: number of name-pairs with top values
Method   K=100        K=200        K=300        K=400        K=500
OMS      0.040/0.174  0.025/0.217  0.017/0.217  0.015/0.261  0.020/0.435
SimRank  0.000/0.000  0.005/0.043  0.007/0.087  0.005/0.087  0.006/0.130
PageSim  0.010/0.043  0.005/0.043  0.003/0.043  0.005/0.087  0.008/0.174
Conclusion
Contribution:
OMS, a combination of OM reasoning and link analysis, with (semi-supervised) data-driven determination of landmarks.
It usually performs better than numerical link-based approaches.
It improves the similarity measure by combining link properties.
It allows explanation, for a possible reduction of false positives.
Further Work:
Evaluation with more relevant data.
Learning of relevance degrees from data.
Acknowledgement:
This research is supported by UK EPSRC grant EP/D057086.