Date post: | 24-May-2015 |
Category: |
Technology |
Upload: | saeed-samet |
Saeed Samet and Ali Miri
School of Information Technology and Engineering
University of Ottawa
Privacy Preserving ID3 using Gini Index over
Horizontally Partitioned Data
2
Outline
Our Motivation
Decision Tree and ID3
Information Gain, Entropy, and Gini Index
Privacy-Preserving Data Mining
Background
Our Main Protocol and Sub-Protocols
Complexity
Future Work
3
Our Motivation
All existing work on privacy-preserving decision trees uses entropy
The Gini index can also be used to compute information gain
Using the Gini index, the largest class goes into one pure node, while the other classes go into the other node
Entropy normally tries to create a balanced tree
4
Decision Tree and ID3
A Decision Tree describes a tree structure wherein leaves represent classifications and branches represent conjunctions of features that lead to those classifications
ID3 is a decision tree induction algorithm developed by Quinlan. ID3 stands for "Iterative Dichotomiser 3".
5
[Figure: a decision tree as a predictive model. Observations (variables, i.e. normal attributes, and their possible values) lead to conclusions (predicted values of the target variable, i.e. the class attribute).]
6
Decision Tree Example
Day  Outlook  Humidity  Wind    Play
1    Sunny    High      Weak    No
2    Sunny    High      Weak    No
3    Cloudy   High      Strong  No
4    Rain     Normal    Strong  No
5    Rain     Normal    Weak    Yes
6    Rain     High      Weak    No
7    Cloudy   High      Weak    Yes
8    Sunny    Normal    Strong  No
9    Sunny    Normal    Strong  No
10   Sunny    High      Strong  No
11   Rain     Normal    Weak    Yes
12   Cloudy   Normal    Weak    Yes
13   Cloudy   High      Weak    Yes
...
Normal (or Independent) Attributes: Outlook, Humidity, Wind
Class (or Dependent) Attribute: Play
7
Decision Tree Example (cont.)
[Figure: decision tree built from the example data. Internal nodes test Wind, Humidity, and Outlook; branches are labelled with the attribute values (Weak/Strong, High/Normal, Sunny/Cloudy/Rain); leaves are annotated with their Yes/No counts.]
Target Data: (Outlook = Cloudy, Wind = Strong), Humidity = High → Play = No
8
Information Gain
The information gain of a given attribute $A$ with respect to the class attribute $C$ is the reduction in uncertainty about the value of $C$ when we know the value of $A$.
The uncertainty about the value of $C$ is measured by its entropy.
The uncertainty about the value of $C$ when we know the value of $A$ is given by the conditional entropy of $C$ given $A$.
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\,\mathrm{Entropy}(S_a)$$
where $a$ is a value of $A$, $S_a$ is the subset of instances of $S$ where $A$ takes the value $a$, and $|S|$ is the number of instances.
9
Entropy
Amount of uncertainty about an event associated with a given probability distribution
Shannon defines entropy in terms of a discrete random event $X$, with possible states $x_1, \ldots, x_n$, as:
$$H(X) = \sum_{i=1}^{n} p(x_i) \log_2 \frac{1}{p(x_i)} = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where $p(x_i) = \Pr(X = x_i)$ is the probability of the $i$-th outcome of $X$.
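As a small illustration of the entropy formula, the sketch below computes $H$ for the Play column of the example table. It uses only the 13 rows that are actually listed (the rows elided by "..." are not available), and the function name `entropy` is an illustrative choice, not something taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(X) = -sum p(x_i) * log2 p(x_i) over the label frequencies."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

# Play column of the 13 listed rows of the example table (5 Yes, 8 No).
play = ["No", "No", "No", "No", "Yes", "No", "Yes",
        "No", "No", "No", "Yes", "Yes", "Yes"]

h = entropy(play)
print(f"Entropy of Play over the listed rows: {h:.4f} bits")
```

A pure node (all labels equal) has entropy 0, and an evenly split two-class node has entropy 1 bit.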
10
Gini Index
Another sensible measure of impurity is the Gini Index:
$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{n} p(x_i)^2$$
where $p(x_i)$ is the relative frequency of $x_i$ in $S$.
Therefore, information gain using Gini Index is:
$$\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\,\mathrm{Gini}(S_a)$$
We will come back to this formula…
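The Gini-based gain can be sketched in a few lines of Python. This is a plain, centralized computation with no privacy protection, run on the 13 listed rows of the example table only; the helper names `gini` and `gini_gain` are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(S) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(rows, attr, target):
    """Gain(S, A) = Gini(S) - sum_a |S_a|/|S| * Gini(S_a)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[target] for row in rows]) - weighted

COLUMNS = ("Outlook", "Humidity", "Wind", "Play")
DATA = [  # the 13 listed rows; the rows elided in the slide are not included
    ("Sunny", "High", "Weak", "No"), ("Sunny", "High", "Weak", "No"),
    ("Cloudy", "High", "Strong", "No"), ("Rain", "Normal", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"), ("Rain", "High", "Weak", "No"),
    ("Cloudy", "High", "Weak", "Yes"), ("Sunny", "Normal", "Strong", "No"),
    ("Sunny", "Normal", "Strong", "No"), ("Sunny", "High", "Strong", "No"),
    ("Rain", "Normal", "Weak", "Yes"), ("Cloudy", "Normal", "Weak", "Yes"),
    ("Cloudy", "High", "Weak", "Yes"),
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]

gains = {a: gini_gain(rows, a, "Play") for a in ("Outlook", "Humidity", "Wind")}
best = max(gains, key=gains.get)
for attr, g in gains.items():
    print(f"{attr}: {g:.4f}")
print("Best split on the listed rows:", best)
```

Because only part of the data set is listed, the attribute chosen here need not match the root of the tree shown in the slides.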
11
Privacy-Preserving Data Mining
Privacy-preserving data mining
  Extracting desired knowledge without revealing the private data values, by developing new algorithms or modifying the standard algorithms
In co-operative and distributed computation
  Prevents access to unnecessary and private information while each party wants to achieve some aggregate results
12
Privacy-Preserving Approaches
Data Distribution
  Centralized data environment
  Distributed data environment
    Horizontal
    Vertical
    Arbitrary
Main approaches
  Secure Multi-party Computation (SMC)
  Randomization and perturbation
13
Background
Pinkas and Lindell
  Computing information gain using Entropy
  Presenting a secure protocol to compute $x \ln x$ when $x$ is distributed between two parties
  Working only for two parties (because of using Oblivious Polynomial Evaluation protocol)
Xiao et al.
  Computing information gain using Entropy
  Working for multi-party case
  Using Homomorphic encryption
14
Our Protocol
Privacy Preserving ID3 over Horizontally Partitioned Data
Using Gini Index to compute information gain
Working for multi-party cases
Sub-protocols:
  Secure multi-party addition
  Secure multi-party multiplication
  Secure multi-party square-division
15
Main Protocol
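Computing information gain using Gini Index
$$\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - \sum_{a \in A} \frac{|S_a|}{|S|}\,\mathrm{Gini}(S_a)$$
$$\mathrm{Gini}(S_a) = 1 - \sum_{c \in C} p_{ac}^2$$
$p_{ac}$ is the probability that the value of class attribute $C$ is $c$ while the value of attribute $A$ is $a$.
$$\mathrm{Gini}(S_a) = 1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2}$$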
16
Main Protocol (cont.)
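$\mathrm{Gini}(S)$ is fixed. Therefore, we have to compute
$$F(S, A) = \sum_{a \in A} \frac{|S_a|}{|S|} \left( 1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2} \right)$$
which reduces to
$$F(S, A) = 1 - \frac{1}{|S|} \left( \sum_{a \in A} \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|} \right)$$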
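The step between the two forms of $F(S, A)$ can be filled in as follows. The only fact used is that the subsets $S_a$ partition $S$, so $\sum_{a \in A} |S_a| = |S|$.

```latex
\begin{aligned}
F(S, A) &= \sum_{a \in A} \frac{|S_a|}{|S|}
           \left( 1 - \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|^2} \right) \\
        &= \frac{1}{|S|} \sum_{a \in A} |S_a|
           \;-\; \frac{1}{|S|} \sum_{a \in A} \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|} \\
        &= 1 - \frac{1}{|S|} \sum_{a \in A} \sum_{c \in C} \frac{|S_{ac}|^2}{|S_a|}
\end{aligned}
```

Since $\mathrm{Gain}(S, A) = \mathrm{Gini}(S) - F(S, A)$, the attribute with the largest gain is the one with the largest double sum $\sum_{a}\sum_{c} |S_{ac}|^2 / |S_a|$.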
17
Main Protocol (cont.)
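For instance, in the previous example, for attribute Outlook we have to compute
$$\sum_{c \in C} \frac{|S_{\text{Sunny},c}|^2}{|S_{\text{Sunny}}|} = \frac{|S_{\text{Sunny},\text{Yes}}|^2 + |S_{\text{Sunny},\text{No}}|^2}{|S_{\text{Sunny}}|}$$
$$\sum_{c \in C} \frac{|S_{\text{Rain},c}|^2}{|S_{\text{Rain}}|} = \frac{|S_{\text{Rain},\text{Yes}}|^2 + |S_{\text{Rain},\text{No}}|^2}{|S_{\text{Rain}}|}$$
$$\sum_{c \in C} \frac{|S_{\text{Cloudy},c}|^2}{|S_{\text{Cloudy}}|} = \frac{|S_{\text{Cloudy},\text{Yes}}|^2 + |S_{\text{Cloudy},\text{No}}|^2}{|S_{\text{Cloudy}}|}$$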
18
Main Protocol (cont.)
For instance, two parties have to compute
$$\frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$
$x_1$ and $y_1$ belong to party 1
$x_2$ and $y_2$ belong to party 2
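To make the two-party expression concrete, the sketch below evaluates it in the clear for one attribute value. It is not secure: it simply pools the local counts, which is exactly what the protocol is designed to avoid. The way the counts are split between the two parties is made up for illustration, and `value_term` is a hypothetical helper name.

```python
from fractions import Fraction

def value_term(x1, y1, x2, y2):
    """((x1 + x2)^2 + (y1 + y2)^2) / ((x1 + x2) + (y1 + y2)), computed in the clear."""
    x = x1 + x2  # pooled count of Play = Yes for this attribute value
    y = y1 + y2  # pooled count of Play = No for this attribute value
    return Fraction(x * x + y * y, x + y)

# Hypothetical horizontal split of the Outlook = Cloudy rows (3 Yes, 1 No overall):
# party 1 holds 2 Yes and 0 No, party 2 holds 1 Yes and 1 No.
term = value_term(x1=2, y1=0, x2=1, y2=1)
print("Cloudy term:", term)
```

The secure sub-protocols aim to produce the same value without either party learning the other's counts.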
19
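Secure Multi-party Addition
$n$ parties are involved, $P_1, \ldots, P_n$
Inputs: $P_i : x_i$, $1 \le i \le n$
Outputs: $P_i : l_i$, $1 \le i \le n$
such that:
$$\sum_{i=1}^{n} x_i = \prod_{i=1}^{n} l_i$$
Suppose $E$ is an Additive Homomorphic Encryption, with public key $e$ and private key $d$:
$$E(m_1, e) \cdot E(m_2, e) = E(m_1 + m_2, e)$$
$$D\left(E(m_1, e)^{m_2}, d\right) = D\left(E(m_1 m_2, e), d\right)$$
Thus we have:
$$E\left(\sum_{i=1}^{n} x_i, e\right) = \prod_{i=1}^{n} E(x_i, e) = E\left(\prod_{i=1}^{n} l_i, e\right) = E(l_1, e)^{\prod_{i=2}^{n} l_i}$$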
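The two homomorphic properties can be checked with a toy additively homomorphic scheme. The slides do not name a concrete cryptosystem; the sketch below assumes a textbook Paillier construction with generator $n + 1$ and deliberately tiny primes, so it is illustrative only and not secure. The names `keygen`, `E`, and `D` mirror the slide notation but are otherwise hypothetical. It needs Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (far too small to be secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    e = n                          # public key: the modulus (generator fixed to n + 1)
    d = (n, phi, pow(phi, -1, n))  # private key
    return e, d

def E(m, e):
    """Encrypt m under public key e: (1 + n)^m * r^n mod n^2 with random r."""
    n2 = e * e
    r = random.randrange(1, e)
    while math.gcd(r, e) != 1:
        r = random.randrange(1, e)
    return (1 + m * e) * pow(r, e, n2) % n2

def D(c, d):
    """Decrypt ciphertext c with private key d."""
    n, phi, mu = d
    return (pow(c, phi, n * n) - 1) // n * mu % n

e, d = keygen()
c1, c2 = E(20, e), E(22, e)
sum_plain = D(c1 * c2 % (e * e), d)    # multiplying ciphertexts adds the plaintexts
prod_plain = D(pow(c1, 22, e * e), d)  # raising to m2 multiplies the plaintext by m2
print(sum_plain, prod_plain)
```

All plaintext arithmetic here is modulo $n$, so the identities hold as long as the results stay below the modulus.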
20
Secure Multi-party Addition (cont.)
1. $P_1$ selects an additive homomorphic encryption and sends the public key $e$ to all other parties.
2. $P_1$ encrypts its input $x_1$, $E(x_1, e)$, and sends it to $P_2$.
3. For i=2 to n-1, $P_i$ encrypts its input $x_i$, $E(x_i, e)$, multiplies it by $\prod_{j=1}^{i-1} E(x_j, e)$ and sends $\prod_{j=1}^{i} E(x_j, e)$ to $P_{i+1}$.
4. $P_n$ encrypts its input $x_n$, $E(x_n, e)$, and computes $\prod_{i=1}^{n} E(x_i, e)$.
5. $P_n$ randomly selects its nonzero output share $l_n$, calculates $\left(\prod_{i=1}^{n} E(x_i, e)\right)^{1/l_n}$ and sends it to $P_{n-1}$.
21
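Secure Multi-party Addition (cont.)
6. For i=n-1 to 2, $P_i$ randomly selects its nonzero output $l_i$, calculates $\left(\prod_{j=1}^{n} E(x_j, e)\right)^{\frac{1}{l_n} \cdots \frac{1}{l_i}}$ and sends it to $P_{i-1}$.
7. $P_1$ decrypts the received value from $P_2$ and sets it as its output $l_1$:
$$l_1 = D\left(\left(\prod_{i=1}^{n} E(x_i, e)\right)^{\frac{1}{l_n} \cdots \frac{1}{l_2}}, d\right)$$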
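The seven steps can be simulated in a single process to check the arithmetic. This is a sketch under several assumptions that the slides do not spell out: the cryptosystem is a toy Paillier instance with tiny primes, the exponent $1/l_i$ is interpreted as the modular inverse of $l_i$ modulo the plaintext modulus, and shares are therefore elements of $\mathbb{Z}_n$. Running all parties in one function gives no privacy at all; it only shows that $\sum x_i = \prod l_i$ holds modulo $n$. Python 3.8+ is assumed.

```python
import math
import random

def keygen(p=1009, q=1013):
    n, phi = p * q, (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))  # (public key e, private key d)

def random_unit(n):
    """Random nonzero value invertible modulo n."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return r

def E(m, e):
    n2 = e * e
    return (1 + m * e) * pow(random_unit(e), e, n2) % n2

def D(c, d):
    n, phi, mu = d
    return (pow(c, phi, n * n) - 1) // n * mu % n

def secure_addition(xs, e, d):
    """Turn additive inputs xs into multiplicative shares (all parties simulated locally)."""
    n2 = e * e
    c = 1
    for x in xs:                        # steps 2-4: running product equals E(sum of xs)
        c = c * E(x, e) % n2
    shares = [None] * len(xs)
    for i in range(len(xs) - 1, 0, -1):  # steps 5-6: P_n down to P_2
        shares[i] = random_unit(e)       # nonzero output share l_i
        c = pow(c, pow(shares[i], -1, e), n2)  # "raise to 1/l_i" as a modular inverse
    shares[0] = D(c, d)                  # step 7: P_1 decrypts and obtains l_1
    return shares

e, d = keygen()
shares = secure_addition([3, 5, 7], e, d)
print(shares, math.prod(shares) % e)     # product of shares is the sum, modulo n
```

In a real deployment only $P_1$ would hold $d$, and each loop iteration would run at a different party with the ciphertext sent over the network.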
22
Other Sub-Protocols
Secure Multi-party Multiplication
  Same as Secure Multi-party Addition
Secure Multi-party Square-Division
  Using two previous sub-protocols
23
Complexity
Common parameters
  Size of the database
  Number of parties involved
  Number of attributes
  Number of possible values for attributes (on average)
To compute the cost of the protocol, suppose:
  Number of parties involved in the protocol is denoted by n
  Number of remaining normal attributes at the current step and node is denoted by a
  Number of possible values for those normal attributes, on average, is denoted by v
  Number of possible values for the class attribute is denoted by c
  Number of bits exchanged from one party to another party is denoted by b
  Computational overhead of sub-protocols 1 and 2 is denoted by CPn
24
Complexity (cont.)
CPn includes n encryptions, one decryption, n-1 multiplications and n-1 power computations.
The overall computational cost, by assuming that c is dominated by b and n, is
The overall communication cost is
nCPCPnva 22
bnva 2
25
Future Work
Using proposed sub-protocols in other techniques in PPDM and presenting new building blocks in SMC
Implementation of the protocol to
  Find the exact cost and efficiency of the algorithm
  Compare with other existing techniques
26
References
1. Rakesh Agrawal, Alexandre V. Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In ACM Special Interest Group on Management of Data (SIGMOD) Conference, pages 86–97, 2003.
2. Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In ACM Special Interest Group on Management of Data (SIGMOD) Conference, pages 439–450, 2000.
3. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.
4. Leo Breiman. Technical note: Some properties of splitting criteria. Machine Learning, 24(1):41–47, 1996.
5. Christian Cachin and Jan Camenisch, editors. Advances in Cryptology - EUROCRYPT 2004, International Conference on the Theory and Applications of Cryptographic Techniques, Interlaken, Switzerland, May 2-6, 2004, Proceedings, volume 3027 of Lecture Notes in Computer Science. Springer, 2004.
6. Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 4(2):28–34, 2003.
7. DTREG. How trees are built. http://www.dtreg.com/treebuild.htm, 2006. (Last posted: 22/7/2006).
8. W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In ACSAC ’01: Proceedings of the 17th Annual Computer Security Applications Conference, pages 102–110, New Orleans, Louisiana, USA, December 10-14 2001.
9. Wenliang Du and Zhijun Zhan. Building decision tree classifier on private data. In CRPITS’14: Proceedings of the IEEE international conference on Privacy, security and data mining, pages 1–8, Darlinghurst, Australia, 2002. Australian Computer Society, Inc.
10. Bart Goethals, Sven Laur, Helger Lipmaa, and Taneli Mielikäinen. On private scalar product computation for privacy preserving data mining. In ICISC, pages 104–120, 2004.
11. R. J. Light and B. H. Margolin. An analysis of variance for categorical data. In Journal of The American Statistical Association, volume 66, pages 534–544, 1971.
12. Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In CRYPTO, pages 36–54, 2000.
27
References (cont.)
13. Behzad Malek and Ali Miri. Secure dot-product protocol using trace functions. 2006 IEEE International Symposium on Information Theory, 2006.
14. Moni Naor and Benny Pinkas. Oblivious transfer and polynomial evaluation. In STOC ’99: Proceedings of the thirty-first annual ACM Symposium on Theory of Computing, pages 245–254, New York, NY, USA, 1999. ACM Press.
15. Moni Naor and Benny Pinkas. Efficient oblivious transfer protocols. In SODA ’01: Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 448–457, Philadelphia, PA, USA, 2001. Society for Industrial and Applied Mathematics.
16. Benny Pinkas. Cryptographic techniques for privacy-preserving data mining. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 4(2):12–19, 2002.
17. Laura Elena Raileanu and Kilian Stoffel. Theoretical comparison between the gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93, 2004.
18. Eakalak Suthampan and Songrit Maneewongvatana. Privacy preserving decision tree in multi party environment. In Asia Information Retrieval Symposium (AIRS), pages 727–732, 2005.
19. Salford Systems. Do splitting rules really matter? http://www.salford-systems.com/423.php, 2006.
20. Jaideep Vaidya and Chris Clifton. Privacy-preserving decision trees over vertically partitioned data. In Data and Application Security (DBSec), pages 139–152, 2005.
21. Jaideep Vaidya and Chris Clifton. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security, 13(4):593–622, 2005.
22. Ming-Jun Xiao, Liu-Sheng Huang, Yong-Long Luo, and Hong Shen. Privacy preserving ID3 algorithm over horizontally partitioned data. In Parallel and Distributed Computing, Applications and Technologies, pages 239–243, 2005.
28
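Secure Multi-party Multiplication
$n$ parties are involved, $P_1, \ldots, P_n$
Inputs: $P_i : x_i$, $1 \le i \le n$
Outputs: $P_i : l_i$, $1 \le i \le n$
such that:
$$\prod_{i=1}^{n} x_i = \sum_{i=1}^{n} l_i$$
Suppose $E$ is an Additive Homomorphic Encryption, with public key $e$ and private key $d$. Thus we have:
$$E\left(\sum_{i=1}^{n} l_i, e\right) = \prod_{i=1}^{n} E(l_i, e) = E\left(\prod_{i=1}^{n} x_i, e\right) = E(x_1, e)^{\prod_{i=2}^{n} x_i}$$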
29
Secure Multi-party Multiplication (cont.)
1. $P_1$ selects an additive homomorphic encryption and sends the public key $e$ to all other parties.
2. $P_1$ encrypts its input $x_1$, $E(x_1, e)$, and sends it to $P_2$.
3. For i=2 to n-1, $P_i$ raises the received value to the power of its input $x_i$, $E(x_1, e)^{x_2 \cdots x_i}$, and sends it to $P_{i+1}$.
4. For i=n to 2, $P_i$ randomly selects its nonzero output share $l_i$, encrypts it, $E(l_i, e)$, computes its inverse, $E(l_i, e)^{-1}$, multiplies the received value by that, $E(x_1, e)^{x_2 \cdots x_n} \cdot \prod_{j=i}^{n} E(l_j, e)^{-1}$, and sends it to $P_{i-1}$.
30
Secure Multi-party Multiplication (cont.)
6. $P_1$ decrypts the received value from $P_2$ and sets it as its output $l_1$:
$$l_1 = D\left(E(x_1, e)^{x_2 \cdots x_n} \cdot \prod_{i=2}^{n} E(l_i, e)^{-1}, d\right)$$
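The multiplication sub-protocol can be simulated in the same single-process style to check that $\prod x_i = \sum l_i$ holds modulo the plaintext modulus. As before, this assumes a toy Paillier instance with tiny primes (the slides do not name a cryptosystem) and offers no privacy. One further assumption: the listed steps let the exponentiation loop stop at $P_{n-1}$, so the sketch lets $P_n$ also raise the ciphertext to $x_n$ before it subtracts its share, since the stated identity requires every input to be applied. Python 3.8+ is assumed.

```python
import math
import random

def keygen(p=1009, q=1013):
    n, phi = p * q, (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))  # (public key e, private key d)

def E(m, e):
    n2 = e * e
    r = random.randrange(1, e)
    while math.gcd(r, e) != 1:
        r = random.randrange(1, e)
    return (1 + m * e) * pow(r, e, n2) % n2

def D(c, d):
    n, phi, mu = d
    return (pow(c, phi, n * n) - 1) // n * mu % n

def secure_multiplication(xs, e, d):
    """Turn multiplicative inputs xs into additive shares (all parties simulated locally)."""
    n2 = e * e
    c = E(xs[0], e)                      # step 2: P_1 encrypts x_1
    for x in xs[1:]:                     # step 3 (assumption: P_n applies x_n as well)
        c = pow(c, x, n2)                # ciphertext now encrypts x_1 * ... * x_i
    shares = [None] * len(xs)
    for i in range(len(xs) - 1, 0, -1):  # step 4: P_n down to P_2
        shares[i] = random.randrange(1, e)            # nonzero output share l_i
        c = c * pow(E(shares[i], e), -1, n2) % n2     # multiply by E(l_i)^(-1): subtracts l_i
    shares[0] = D(c, d)                  # step 6: P_1 decrypts and obtains l_1
    return shares

e, d = keygen()
shares = secure_multiplication([2, 3, 7], e, d)
print(shares, sum(shares) % e)           # sum of shares is the product, modulo n
```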
31
Secure Multi-party Square-Division
Suppose two parties, $P_1$ and $P_2$, are horizontally involved
Inputs: $P_1 : x_1, y_1$, $P_2 : x_2, y_2$
Outputs:
$$\frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$
Using two-party multiplication
Inputs: $P_1 : x_1, y_1$, $P_2 : x_2, y_2$
Outputs: $P_1 : k_1, l_1$, $P_2 : k_2, l_2$
such that: $x_1 x_2 = k_1 + k_2$ and $y_1 y_2 = l_1 + l_2$
32
Secure Multi-party Square-Division (cont.)
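Next step
Inputs: $P_1 : z_1 = x_1^2 + y_1^2 + 2k_1 + 2l_1$ and $w_1 = x_1 + y_1$
        $P_2 : z_2 = x_2^2 + y_2^2 + 2k_2 + 2l_2$ and $w_2 = x_2 + y_2$
Outputs: $\frac{z_1 + z_2}{w_1 + w_2}$
Sub-step, using two-party addition
Inputs: $P_1 : z_1, w_1$, $P_2 : z_2, w_2$
Outputs: $P_1 : m_1, n_1$, $P_2 : m_2, n_2$
Such that: $z_1 + z_2 = m_1 m_2$ and $w_1 + w_2 = n_1 n_2$
$P_1$ computes $r_1 = \frac{m_1}{n_1}$ and sends it to $P_2$
$P_2$ computes $r_2 = \frac{m_2}{n_2}$ and sends it to $P_1$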
33
Secure Multi-party Square-Division (cont.)
Each party computes
$$r_1 r_2 = \frac{m_1}{n_1} \cdot \frac{m_2}{n_2} = \frac{m_1 m_2}{n_1 n_2} = \frac{z_1 + z_2}{w_1 + w_2} = \frac{(x_1 + x_2)^2 + (y_1 + y_2)^2}{(x_1 + x_2) + (y_1 + y_2)}$$
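The algebra of the square-division sub-protocol can be checked end to end with exact rational arithmetic. In the sketch below the two secure sub-protocols are replaced by insecure stand-ins that simply split a known value into additive or multiplicative shares; only the share relations ($x_1 x_2 = k_1 + k_2$, $z_1 + z_2 = m_1 m_2$, and so on) are taken from the slides, and all function names are hypothetical.

```python
from fractions import Fraction
import random

def additive_shares_of(value):
    """Stand-in for two-party multiplication output: returns (k1, k2) with k1 + k2 = value."""
    k1 = Fraction(random.randint(-50, 50))
    return k1, value - k1

def multiplicative_shares_of(value):
    """Stand-in for two-party addition output: returns (m1, m2) with m1 * m2 = value."""
    m1 = Fraction(random.randint(1, 50))
    return m1, value / m1

def square_division(x1, y1, x2, y2):
    # Two-party multiplication: x1*x2 = k1 + k2 and y1*y2 = l1 + l2.
    k1, k2 = additive_shares_of(Fraction(x1 * x2))
    l1, l2 = additive_shares_of(Fraction(y1 * y2))
    # Each party forms its local z and w from its own inputs and shares.
    z1, w1 = x1**2 + y1**2 + 2 * k1 + 2 * l1, x1 + y1
    z2, w2 = x2**2 + y2**2 + 2 * k2 + 2 * l2, x2 + y2
    # Two-party addition: z1 + z2 = m1*m2 and w1 + w2 = n1*n2 (w1 + w2 must be nonzero).
    m1, m2 = multiplicative_shares_of(z1 + z2)
    n1, n2 = multiplicative_shares_of(Fraction(w1 + w2))
    # Each party publishes r_i = m_i / n_i; the product is the desired quotient.
    r1, r2 = m1 / n1, m2 / n2
    return r1 * r2

result = square_division(x1=2, y1=0, x2=1, y2=1)
print(result)
```

The random shares cancel out, so the result depends only on the pooled counts $x_1 + x_2$ and $y_1 + y_2$.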