Analyzing Characteristic Host Access Patterns for Re-Identification of Web User Sessions
Dominik Herrmann, Christoph Gerber, Christian Banse, Hannes Federrath
Management of Information Security, University of Regensburg, Germany
NordSec, 27-29 October 2010, Aalto University, Espoo, Finland
behavior-based web user re-identification Christoph Gerber 2
agenda
• problem description
• relation to text-mining
• case study and test setting
• re-identification
problem description
• small user group (e.g. users of a proxy-server)
• all HTTP-requests are recorded
• changing IP-addresses / different surfing sessions
Proxy-Server
IP 1, User 1: www.wikipedia.de
IP 2, User 2: www-sec.uni-r.de
IP 2, User 2: www.cse.tkk.fi
IP 1, User 1: www.google.de
perspective of a proxy server
[Figure: HTTP-requests plotted by Host over time t, grouped into Session 1 (IP1), Session 2 (IP2), Session 3 (IP3), and Session 4 (IP4). Legend: each marker is an HTTP-request to a certain host, issued by the user with the IP-address of that session.]
perspective of a proxy server
[Figure: the same plot of Sessions 1 to 4 (Host over time t), with sessions attributed to User 1 and User 2.]
perspective of a proxy server
[Figure: the same plot of Sessions 1 to 4 (Host over time t), with sessions attributed to User 1 and User 2. Question posed for the remaining session: User 1? User 2? Someone else?]
perspective of a proxy server
[Figure: the same plot of Sessions 1 to 4 (Host over time t), with the labels User 1, User 2, and "aggregated session".]
modeling the classification problem
$(x_1^{1}, x_2^{1}, x_3^{2}, x_4^{1})$
X1: www.wikipedia.de
X2: www-sec.uni-r.de
X3: www.cse.tkk.fi
X4: www.google.de
[Figure: the five requests of Session 4 (IP4) plotted against these four hosts.]
• each session (s) consists of a multiset of requested hosts
• each surfing session (s) is an instance of a class
• each class represents a user
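The multiset representation can be sketched in a few lines of Python. This is only an illustration: the function name `session_vector` and the order of the five requests are assumptions, chosen so that the counts match the example vector on this slide.

```python
from collections import Counter

def session_vector(requests, hosts):
    """Return the access count of every host in `hosts` for one session."""
    counts = Counter(requests)
    return [counts[h] for h in hosts]

# Host order X1..X4 as listed on the slide.
hosts = ["www.wikipedia.de", "www-sec.uni-r.de", "www.cse.tkk.fi", "www.google.de"]

# Hypothetical request sequence for Session 4 (five requests in total).
session_4 = [
    "www.wikipedia.de",
    "www-sec.uni-r.de",
    "www.cse.tkk.fi",
    "www.cse.tkk.fi",
    "www.google.de",
]

vector = session_vector(session_4, hosts)
print(vector)  # [1, 1, 2, 1]
```

The order of requests within a session is discarded; only the per-host counts remain.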
classification of user sessions
[Figure: Host over time t for Session 1 (IP1), Session 2 (IP2), Session 3 (IP3), and Session 4 (IP4). Labels: Training (User 1, User 2), Test, Classifier, aggregated session.]
similarity to text-mining-problems
• word frequencies and host frequencies both follow a power law
[Figure, two panels. text-retrieval: frequency (0 to 70,000) vs. word ranking (1 to 96), source text http://www.cs.princeton.edu/introcs/data/bible.txt. user re-identification: frequency (0 to 70,000) vs. host ranking (1 to 101).]
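A rank-frequency curve like the ones plotted here can be computed with a short script. The sketch below is an illustration on synthetic data, not the study data: the helper names and the toy request list are assumptions. A slope close to -1 on log-log axes indicates a Zipf-like power law.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return the frequencies sorted in descending order (rank 1 first)."""
    return sorted(Counter(items).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic log: the host of rank k is requested about 1000 / k times.
requests = [f"host{k}" for k in range(1, 51) for _ in range(1000 // k)]

freqs = rank_frequency(requests)
slope = loglog_slope(freqs)
print(round(slope, 2))
```

The same two functions apply unchanged to a list of words from a text corpus, which is the analogy the slide draws.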
text-mining toolbox
• Multinomial Naive Bayes (MNB)
• vector transformations
  - TF transformation
  - IDF transformation
  - cosine normalisation (N)
cf. Manning, Raghavan, Schütze: Introduction to Information Retrieval. Cambridge University Press, 2009. [21]
[Figure labels: Training, Test.]
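The three vector transformations can be sketched as follows. The slide does not give formulas, so the definitions used here are assumptions based on common text-mining practice: TF as log(1 + f), IDF as log(N / n) with N sessions in total and n sessions containing the host, and N as division by the Euclidean length.

```python
import math

def tf(vector):
    """TF transformation (assumed form): dampen raw counts with log(1 + f)."""
    return [math.log(1 + f) for f in vector]

def idf_weights(vectors):
    """IDF weight per host (assumed form): log(N / n)."""
    n_sessions = len(vectors)
    weights = []
    for j in range(len(vectors[0])):
        n = sum(1 for v in vectors if v[j] > 0)
        weights.append(math.log(n_sessions / n) if n else 0.0)
    return weights

def idf(vector, weights):
    """IDF transformation: scale every count by the weight of its host."""
    return [f * w for f, w in zip(vector, weights)]

def cosine_normalize(vector):
    """Cosine normalisation (N): scale the vector to Euclidean length 1."""
    length = math.sqrt(sum(f * f for f in vector))
    return [f / length for f in vector] if length else list(vector)

# TF-N: apply TF first, then normalise.
tf_n = cosine_normalize(tf([1, 1, 2, 1]))
print(tf_n)
```

The combinations named in the results (for example TF-IDF-N) would chain these functions in the order given by the name.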
related work
• Pang et al. (2007) - re-identification of users in 802.11 wireless networks
• Yang (2008) - focus on fraud detection
• Kumpost (2009) - focus on re-identification of web users
test setting and case study
• test users
• local proxy server
• host obfuscation
• client/server architecture

key                          value
participants                 28
duration of study in days    57
number of HTTP requests      2,684,736
number of unique hosts       25,124
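data acquisition
[Diagram: users #1, #2, ..., #n each run a local proxy-server inside the scope of protection. The proxy forwards HTTP-traffic to the WWW and applies host-obfuscation to the recorded study data, which is collected as aggregated study data.]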
host obfuscation
[Diagram: www.google.de + <salt> -> hash-function (repeated several times) -> 134BC2D1F..0D]
• hashing of hostnames
• + salt to prevent dictionary attacks
• + iterations to hinder an attacker from building their own dictionary
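A salted, iterated hash of this kind might look like the sketch below. The slide does not name the hash function, the salt, or the iteration count, so SHA-256, the example salt, and `iterations=1000` are all assumptions made for illustration.

```python
import hashlib

def obfuscate_host(hostname, salt, iterations=1000):
    """Map a hostname to a pseudonym with a salted, iterated hash.

    SHA-256 and the iteration count are assumed; the study's actual
    parameters are not given on the slide.
    """
    value = hostname.encode("utf-8")
    for _ in range(iterations):
        value = hashlib.sha256(salt + value).digest()
    return value.hex().upper()

# Hypothetical secret salt, shared by all local proxies of the study.
salt = b"example-study-salt"
pseudonym = obfuscate_host("www.google.de", salt)
print(pseudonym)
```

Because the mapping is deterministic, equal hostnames yield equal pseudonyms, so access frequencies can still be counted on the obfuscated data.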
re-identification attack
• attacker's view
  - limited knowledge
  - practical relevance
• simulations
  - for evaluating the driving factors
• countermeasures
attacker's view (training)
• Δt = 24 h
• decision to track a specific user $u_t$ on day t
• training with $U_t$ classes on day t with $S_t$ sessions
[Figure: requests by Host over time on day t; the sessions of User 1 and User 2 are used to train the MNB classifier.]
attacker's view (attack)
• Δt = 24 h
• decision to track a specific user $u_t$ on day t
• training with $U_t$ classes on day t with $S_t$ sessions
• on day t+1: assigning each session s to a class $u_t$
• evaluating the classification result for class $c_u$
[Figure: requests by Host over time; the sessions of User 1 and User 2 on day t train the MNB classifier, which is then tested on the sessions of day t+1.]
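The train-on-day-t, classify-on-day-t+1 procedure can be sketched with a small Multinomial Naive Bayes classifier. This is a minimal textbook version with Laplace smoothing written for illustration, not the implementation used in the study; the class name, the `alpha` parameter, and the toy sessions are assumptions.

```python
import math
from collections import Counter

class MultinomialNB:
    """Textbook Multinomial Naive Bayes over multisets of hosts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, sessions, labels):
        self.class_counts = Counter(labels)
        self.host_counts = {label: Counter() for label in self.class_counts}
        self.vocab = set()
        for hosts, label in zip(sessions, labels):
            self.host_counts[label].update(hosts)
            self.vocab.update(hosts)
        return self

    def predict(self, hosts):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.host_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # class prior
            for host, f in Counter(hosts).items():
                if host in self.vocab:  # hosts unseen in training are ignored
                    score += f * math.log((counts[host] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Day t: one labeled session per user (toy data).
day_t = [
    ["www.wikipedia.de", "www.wikipedia.de", "www.google.de"],
    ["www-sec.uni-r.de", "www.cse.tkk.fi", "www.cse.tkk.fi"],
]
model = MultinomialNB().fit(day_t, ["User 1", "User 2"])

# Day t+1: an unlabeled session is assigned to the most likely class.
session = ["www.cse.tkk.fi", "www-sec.uni-r.de", "www.google.de"]
print(model.predict(session))  # User 2
```

Each session of day t+1 is assigned to exactly one class, which is why the result for the tracked class still has to be evaluated separately.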
prediction scheme of attacker's view
correctly classified by proxy-server
  - attacker successfully recognizes the user
  - attacker successfully recognizes the absence of the user
wrong classification (error is detectable by the proxy-server)
  - more than one user was predicted to belong to class $c_u$
wrong classification (error is not detectable by the proxy-server)
  - attacker detects absence of user, but user was online
  - attacker wrongly recognizes the user
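The five outcomes can be expressed as a small decision function. The sketch below is one reading of the slide, assuming that the attacker collects all day t+1 sessions predicted as class c_u and that the tracked user has at most one true session on that day; the function name and return strings are hypothetical.

```python
def classify_outcome(predicted_sessions, true_session):
    """Categorize the attack result for the tracked class.

    predicted_sessions: ids of all sessions assigned to the tracked class.
    true_session: id of the user's real session, or None if the user was absent.
    """
    predicted = set(predicted_sessions)
    if len(predicted) > 1:
        # The attacker can notice this error without knowing the truth.
        return "detectable error: more than one session predicted"
    if not predicted:
        if true_session is None:
            return "correct: absence of the user recognized"
        return "undetectable error: user reported absent but was online"
    if predicted == {true_session}:
        return "correct: user recognized"
    return "undetectable error: user wrongly recognized"

print(classify_outcome(["s7"], "s7"))  # correct: user recognized
```

Only the first branch can be evaluated by the attacker alone; the other four need the ground truth available in the case study.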
results from the attacker's view
• user re-identification works
  - 60.5% correctly classified sessions
• and can be improved by vector transformations
  - 73.1% by applying TF-N transformation
• further improvements are possible
  - 77.6% by 'learning' the user habits
• more improvements conceivable
  - timing-information
  - filenames
  - GET-parameters
  - destination-ports
  - ...
results from the attacker's view
proportion of correctly classified sessions per vector transformation:

transformation   none    N       IDF     IDF-N   TF      TF-N    TF-IDF  TF-IDF-N
correct          60.5%   62.9%   65.0%   62.8%   56.0%   73.1%   66.1%   72.8%
simulations
• simulation of simultaneous surfing sessions
  - putting together the chronologically succeeding sessions
  - always 28 users / session
• in each experiment one parameter was modified
  - session duration
  - number of simultaneous users
  - offset between last training and first test session
  - number of consecutive training instances
• each experiment was repeated 25 times
session duration
• longer session times support re-identification
[Figure (SIM): proportion of correctly classified sessions (0 to 1) vs. session duration in minutes (logarithmic axis, 1 to 10,000).]
number of simultaneous users
• the fewer simultaneous users, the better it works
[Figure: proportion of correctly classified sessions (0 to 1) vs. number of concurrent users (0 to 30), one curve per session duration: 24 hours, 3 hours, 1 hour, 10 min.]
offset between test and training sessions
• each user tends to act similarly at the same time of day
[Figure: proportion of correctly classified sessions (0 to 1) vs. offset between test and training in hours (0 to 160), curves for session durations of 3 hours and 1 hour.]
number of training instances
• more training instances are better, but only a few are needed
[Figure: proportion of correctly classified sessions (0 to 1) vs. number of training instances (0 to 20), curves for 3 hours, 1 hour, 1 hour (48 hours train/test offset), and 10 min.]
countermeasures
• using multiple, non-colluding proxy servers works
  - but is not practicable (at this early stage)
• more distribution schemes conceivable
[Figure: proportion of correctly classified (re-identified) sessions (0 to 1) vs. number of proxy servers (0 to 50), curves for session durations of 1 day and 3 hours.]
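One conceivable distribution scheme is sketched below: every host is mapped to a fixed proxy, so each non-colluding proxy sees only a part of a user's host profile. This is a hypothetical scheme for illustration; the slide does not specify how requests were distributed in the experiment, and the function names are assumptions.

```python
import hashlib
from collections import defaultdict

def proxy_for(host, n_proxies):
    """Pick a proxy index for a host with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_proxies

def split_session(requests, n_proxies):
    """Return the partial session that each proxy gets to observe."""
    views = defaultdict(list)
    for host in requests:
        views[proxy_for(host, n_proxies)].append(host)
    return dict(views)

session = [
    "www.wikipedia.de",
    "www-sec.uni-r.de",
    "www.cse.tkk.fi",
    "www.cse.tkk.fi",
    "www.google.de",
]
views = split_session(session, n_proxies=3)
for proxy, hosts in sorted(views.items()):
    print(proxy, hosts)
```

A per-request random choice of proxy would be another option; it spreads each host over all proxies instead of partitioning the host set.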
countermeasures
• analyzing a part of the host frequency distribution
[Figure: access frequency vs. Host ranking, both on logarithmic axes from 1 to 100,000.]
countermeasures
• analyzing a part of the host frequency distribution
  - keep the most popular hosts
[Figure: access frequency vs. Host ranking, both on logarithmic axes from 1 to 100,000.]
countermeasures
• analyzing a part of the host frequency distribution
  - keep the most popular hosts
  - cannot prevent user re-identification
[Figure: proportion of correctly classified (re-identified) sessions (0 to 1) vs. proportion of most popular hosts kept (0 to 0.5), curves for session durations of 1 day, 3 hours, 1 hour, and 10 minutes.]
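Restricting sessions to the most popular hosts can be sketched as follows. The helper names and the toy sessions are assumptions; popularity is taken here to mean the total number of accesses over all sessions, which the slide does not state explicitly.

```python
from collections import Counter

def popular_hosts(sessions, proportion):
    """Return the top `proportion` of hosts by total access frequency."""
    totals = Counter(host for session in sessions for host in session)
    keep = int(len(totals) * proportion)
    return {host for host, _ in totals.most_common(keep)}

def filter_session(session, kept):
    """Drop all requests to hosts outside the kept set."""
    return [host for host in session if host in kept]

# Toy data: hosts a (3 accesses), b (2), c (1), d (1).
sessions = [["a", "a", "b"], ["a", "c", "b"], ["d"]]
kept = popular_hosts(sessions, proportion=0.5)
filtered = [filter_session(s, kept) for s in sessions]
print(filtered)  # [['a', 'a', 'b'], ['a', 'b'], []]
```

The filtered sessions would then be fed to the classifier in place of the full ones to measure how much re-identification accuracy remains.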
conclusion and discussion
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET http://www.ab.com/index.html HTTP/1.0" 200 2326
• re-identification is a feasible attack
• evaluated in a privacy-preserving case study
• works well for small closed groups
• not only relevant for proxy-servers
• improvements possible by using context information
• improvements possible by gathering more realistic sessions