A Review on Anonymization Approach to Preserve Privacy of Published Data through Record Elimination
Isha K. Gayki
P.R. Patil COE & T
Computer Science & Engg. Dept., Amravati
Ishagayki09@gmail.com
Prof. Arvind S. Kapse
P.R. Patil COE & T
Computer Science & Engg. Dept., Amravati
Arvind.kapse@yahoo.com
Abstract
Data mining is the process of analyzing data. Data privacy concerns the collection and dissemination of data. Privacy issues arise in different areas such as health care, intellectual property, biological data, financial transactions etc. It is very difficult to protect the data when there is a transfer of data; sensitive information must be protected. There are two kinds of major attacks against privacy, namely record linkage and attribute linkage attacks. Researchers have proposed some methods, namely k-anonymity, ℓ-diversity and t-closeness, for data privacy. The k-anonymity method preserves the privacy against the record linkage attack alone; it is unable to prevent the attribute linkage attack. The ℓ-diversity method overcomes the drawback of the k-anonymity method, but it fails to prevent the identity disclosure attack. The t-closeness method preserves the privacy against the attribute linkage attack but not the identity disclosure attack. The proposed method is used to preserve the privacy of individuals' sensitive data from record and attribute linkage attacks. In the proposed method, privacy preservation is achieved through generalization, by setting range values, and through record elimination. The proposed method overcomes the drawback of both the record linkage attack and the attribute linkage attack.
1. Introduction

1.1. Data Mining
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data
process of discovering patterns in large data
sets involving methods at the intersection of artificial
intelligence, machine learning, statistics,
and database systems. The overall goal of the data
mining process is to extract information from a data
set and transform it into an understandable structure
for further use. The actual data mining task is the
automatic or semi-automatic analysis of large
quantities of data to extract previously unknown
interesting patterns such as groups of data records,
unusual records and dependencies. Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. Data mining analyzes data stored in data warehouses, which hold the data being analyzed. Data mining turns its data into real-time analysis that can be used to increase sales, promote new products, or delete products that are not value-added to the company. Government and private sectors are publishing micro data to facilitate pure research; individuals' privacy should be safeguarded.
1.2. Published Data
Published scientific data sets provide a great
opportunity for instructors who want to get their
students working with data. Using available data can
make it quicker and easier to get rolling. Allowing
Isha K Gayki et al , Int.J.Computer Technology & Applications,Vol 4 (6),986-989
IJCTA | Nov-Dec 2013 Available online@www.ijcta.com
986
ISSN:2229-6093
students to explore real questions with the best scientific data available adds excitement to a course and enhances student motivation for learning. Other reasons to use published data resources from scientific projects include:
Ease of use.
Data quality (quality control of the data).
Spatial coverage.
Temporal coverage.
Focus on process.
Visualization.
1.3. Data Privacy
One of the most exciting developments in privacy research in recent years has been the emergence of a workable, formal definition of privacy-preserving data access, along with algorithms that can provide a mathematical proof of privacy preservation in certain cases. Information privacy, or data privacy, is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them. Privacy concerns exist wherever personally identifiable information is collected and stored - in digital form or otherwise. Improper or non-existent disclosure control can be the root cause of privacy issues. Data privacy issues can arise in response to information from a wide range of sources, such as:
Healthcare records
Criminal justice investigations and proceedings
Financial institutions and transactions
Biological traits, such as genetic material
Residence and geographic records
Ethnicity
Privacy breach
Location-based service and geolocation
The challenge in data privacy is to share data while
protecting personally identifiable information. The
fields of data security and information security design
and utilize software, hardware and human resources
to address this issue.
Data mining involves seven common classes of tasks:
1) Anomaly detection (outlier/change/deviation detection) - the identification of unusual data records.
2) Association rule learning (dependency modeling) - searches for relationships between variables.
3) Clustering - the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
4) Classification - the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
5) Regression - attempts to find a function which models the data with the least error.
6) Summarization - providing a more compact representation of the data set, including visualization and report generation.
7) Sequential pattern mining - finds sets of data items that occur together frequently in some sequences. Sequential pattern mining extracts frequent subsequences from a sequence database, e.g. web user analysis, stock trend prediction, DNA sequence analysis, finding language or linguistic patterns in natural language texts, and using the history of symptoms to predict a certain kind of disease.
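As a toy sketch of task 2, association rule learning, the first step is usually counting item sets that frequently co-occur. The basket data and the `min_support` threshold below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count item pairs across transactions and keep those that
    occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # each unordered pair is counted once per transaction
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

# hypothetical market-basket data
baskets = [["bread", "milk"], ["bread", "milk", "eggs"], ["milk", "eggs"]]
print(frequent_pairs(baskets, 2))  # pairs seen in at least 2 baskets
```

Rules such as "bread implies milk" would then be derived from the surviving pairs by comparing pair counts with single-item counts.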
2. Literature review
Existing techniques find a solution for the privacy problem to some extent. k-anonymity [7] can prevent the identity disclosure attack but not the attribute disclosure attack. Another method, ℓ-diversity [9], preserves the privacy against the attribute disclosure attack, but not the identity disclosure attack. The t-closeness method [9] is good at preventing the attribute disclosure attack, but it is computationally complex and fails to protect the privacy against the identity disclosure attack.
In the p-sensitive k-anonymity model [7], the modified micro data table T* satisfies the (p+, α)-sensitive k-anonymity property if it satisfies k-anonymity and each QI group has at least p distinct categories of the sensitive attribute whose total weight is at least α. This method significantly reduces the possibility of the similarity attack and incurs a lower distortion ratio compared to the p-sensitive k-anonymity method.
Tamir Tassa [2] proposed an alternative model of k-type anonymity. It reduces the information loss compared to k-anonymity and obtains the anonymized table with less generalization. It preserves the privacy against identity disclosure alone. Qiang Wang [4] proposed an enhanced k-anonymity model for protection against attribute disclosure. It can prevent attribute disclosure by controlling the average leakage probability and the probability difference of sensitive attribute values. Mahesh and Meyyappan [11] proposed a new method to anonymize the dataset by setting range values in quasi-identifiers when the quasi-identifier consists of the same attribute values in any class.
In t-closeness method[9], an equivalence class is
said to have t-closeness if the distance between the
distribution of a sensitive attribute in this class and
the distribution of the attribute in the whole table is
no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness. It preserves the privacy against homogeneity and background knowledge attacks.
In the (α,k)-anonymity [6] model, a view of the table is said to be an (α,k)-anonymization if the modification of the table satisfies both the k-anonymity and α-deassociation properties with respect to the quasi-identifier. It does not address the identity disclosure attack. The proposed method provides a new anonymization technique comprising record elimination and generalization, and it reduces the information loss compared to existing methods.
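The (α,k) check above can be sketched in a few lines, under the simplifying assumption that α-deassociation is enforced for every sensitive value rather than for one designated value as in the original model; the patient rows are hypothetical:

```python
from collections import Counter, defaultdict

def is_alpha_k_anonymous(table, qi_indices, sens_index, alpha, k):
    """k-anonymity plus alpha-deassociation: every QI group has at
    least k rows, and within each group the relative frequency of
    each sensitive value is at most alpha."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[i] for i in qi_indices)].append(row[sens_index])
    for values in groups.values():
        if len(values) < k:
            return False  # k-anonymity violated
        if max(Counter(values).values()) / len(values) > alpha:
            return False  # alpha-deassociation violated
    return True

# hypothetical patient rows: (age range, sex, disease)
rows = [("20-30", "M", "flu"), ("20-30", "M", "cold"),
        ("30-40", "F", "flu"), ("30-40", "F", "flu")]
print(is_alpha_k_anonymous(rows, (0, 1), 2, 0.6, 2))  # False: "flu" fills a whole group
```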
The versatile publishing method [3] preserves the privacy by splitting the anonymized table T* according to framed privacy rules. Privacy can be breached by applying conditional probability in the published table.
Mahesh and Meyyappan [1] proposed a method to anonymize the dataset by setting range values in quasi-identifiers when the quasi-identifier consists of the same attribute values in any class, but it fails to preserve the privacy.
3. Methods for privacy preservation

3.1. k-anonymity
When attributes are suppressed or generalized until each row is identical with at least k-1 other rows, the method is called k-anonymity. It prevents definite database linkages and also guarantees that the data released is accurate. But it has some limitations: it does not hide individual identity and is unable to protect against attacks based on background knowledge. k-anonymity cannot be applied to high-dimensional data without complete loss of utility, and special methods are required when the data is anonymized and published more than once.
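A minimal sketch of the k-anonymity check described above, assuming the table is a list of tuples and the quasi-identifier columns are given by index; the patient rows are hypothetical:

```python
from collections import Counter

def is_k_anonymous(table, qi_indices, k):
    """True if every combination of quasi-identifier values
    appears in at least k rows of the table."""
    groups = Counter(tuple(row[i] for i in qi_indices) for row in table)
    return all(size >= k for size in groups.values())

# hypothetical patient rows: (age range, sex, disease)
rows = [("20-30", "M", "flu"), ("20-30", "M", "cold"),
        ("30-40", "F", "flu"), ("30-40", "F", "flu")]
print(is_k_anonymous(rows, (0, 1), 2))  # True: both QI groups have 2 rows
```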
3.2. ℓ-diversity
The method overcomes the drawbacks of k-anonymity but fails to preserve the privacy against skewness and similarity attacks.
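The ℓ-diversity condition can be sketched as a check that each quasi-identifier group carries at least ℓ distinct sensitive values; the rows below are hypothetical:

```python
from collections import defaultdict

def is_l_diverse(table, qi_indices, sens_index, l):
    """True if every QI group contains at least l distinct
    values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[i] for i in qi_indices)].add(row[sens_index])
    return all(len(values) >= l for values in groups.values())

# hypothetical patient rows: (age range, sex, disease)
rows = [("20-30", "M", "flu"), ("20-30", "M", "cold"),
        ("30-40", "F", "flu"), ("30-40", "F", "flu")]
print(is_l_diverse(rows, (0, 1), 2, 2))  # False: the second group holds only "flu"
```

The second group illustrates the homogeneity problem that ℓ-diversity targets: every record in it maps to the same disease.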
3.3. t-closeness
A method is called t-closeness when the distance between the distribution of a sensitive attribute in the same class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness. It preserves the privacy against homogeneity and background knowledge attacks.
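The t-closeness condition can be sketched with total-variation distance as the distance measure; note that Li et al. [9] use the Earth Mover's Distance, so this is a simplified stand-in, and the rows are hypothetical:

```python
from collections import Counter, defaultdict

def max_class_distance(table, qi_indices, sens_index):
    """Largest total-variation distance between any equivalence
    class's sensitive-value distribution and the whole table's."""
    overall = Counter(row[sens_index] for row in table)
    n = len(table)
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[i] for i in qi_indices)].append(row[sens_index])
    worst = 0.0
    for values in groups.values():
        local = Counter(values)
        dist = 0.5 * sum(abs(local[v] / len(values) - overall[v] / n)
                         for v in overall)
        worst = max(worst, dist)
    return worst

def satisfies_t_closeness(table, qi_indices, sens_index, t):
    return max_class_distance(table, qi_indices, sens_index) <= t

# hypothetical patient rows: (age range, sex, disease)
rows = [("20-30", "M", "flu"), ("20-30", "M", "cold"),
        ("30-40", "F", "flu"), ("30-40", "F", "flu")]
print(max_class_distance(rows, (0, 1), 2))  # 0.25
```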
4. Proposed work
The database of any organization, company or medical institution is confidential; such a database must be preserved so that no confidential information can be lost. The proposed method provides a new anonymization technique comprising record elimination and generalization. An application will be developed which will have one graphical user interface and one database. This database will hold patient records. The data is passed to the algorithm and the generalized database is then released for research. The proposed method reduces the information loss compared to the existing system.
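As a hypothetical reconstruction of how generalization by range values plus record elimination might operate on a numeric quasi-identifier: the bucket width, column layout, and function name below are all assumptions for illustration, not the authors' actual algorithm:

```python
from collections import defaultdict

def generalize_and_eliminate(table, age_index, width, k):
    """Hypothetical sketch: generalize a numeric quasi-identifier
    into ranges of the given width, then eliminate every record
    whose resulting group still has fewer than k members."""
    generalized = []
    for row in table:
        low = (row[age_index] // width) * width
        new_row = list(row)
        new_row[age_index] = f"{low}-{low + width - 1}"  # range value
        generalized.append(tuple(new_row))
    groups = defaultdict(list)
    for row in generalized:
        groups[row[age_index]].append(row)
    # record elimination: drop undersized groups entirely
    return [row for group in groups.values() if len(group) >= k
            for row in group]

# hypothetical patient rows: (age, disease)
patients = [(23, "flu"), (27, "cold"), (41, "flu")]
print(generalize_and_eliminate(patients, 0, 10, 2))
```

Eliminating whole undersized groups trades records for privacy; the paper's claim is that this costs less information than further generalization would.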
5. Conclusion
Information about age or any confidential data published in web pages is growing enormously every year. When the data is utilized for research purposes, the privacy of the individuals whose data is going to be published should not be compromised. The proposed method addresses static micro data only, which contains numeric quasi-identifiers. In future we are going to study different data publishing scenarios such as multiple-view publishing, anonymizing sequential releases with new attributes and incrementally updated data records, as well as non-numeric quasi-identifiers.
REFERENCES
[1] Mahesh R, Meyyappan T, "Anonymization Technique through Record Elimination to Preserve Privacy of Published Data", International Workshop on Pattern Recognition, Informatics and Mobile Engineering, proceedings, ISBN 978-1-4673-5845-3, 2013.
[2] Tamir Tassa, Arnon Mazza and Aristides Gionis, "k-Concealment: An Alternative Model of k-Type Anonymity", Transactions on Data Privacy 5, 2012, pp. 189-222.
[3] Xin Jin, Mingyang Zhang, Nan Zhang and Gautam Das, "Versatile Publishing For Privacy Preservation", KDD '10, ACM, 2010.
[4] Qiang Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal of Software, Vol. 6, No. 10, October 2011, pp. 1945-1952.
[5] Benjamin C.M. Fung, Ke Wang, Ada Wai-Chee Fu and Philip S. Yu, Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques, ISBN 978-1-4200-9148-9, 2010.
[6] Raymond Wong, Jiuyong Li, Ada Fu and Ke Wang, "(α,k)-anonymous data publishing", Journal of Intelligent Information Systems, 2009, pp. 209-234.
[7] Xiaoxun Sun, Hua Wang, Jiuyong Li and Traian Marius Truta, "Enhanced P-Sensitive K-Anonymity Models for Privacy Preserving Data Publishing", Transactions on Data Privacy, 2008, pp. 53-66.
[8] B.C.M. Fung, Ke Wang and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007, pp. 711-725.
[9] Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity", International Conference on Data Engineering (ICDE), 2007, pp. 106-115.
[10] X. Xiao and Y. Tao, "Personalized Privacy Preservation", Proceedings of the ACM Conference on Management of Data (SIGMOD '06), 2006, pp. 229-240.
[11] Mahesh R, Meyyappan T, "A New Method for Preserving Privacy in Data Publishing", International Workshop on Cryptography and Information Security, CS&IT proceedings, 2012, pp. 261-266.
[12] Neha V. Mogre, Girish Agarwal, Pragati Patil, "A Review On Data Anonymization Technique For Data Publishing", International Journal of Engineering Research & Technology (IJERT), Vol. 1 Issue 10, December 2012, ISSN 2278-0181.
[13] Y. Xu, K. Wang, A.W.-C. Fu, and P.S. Yu, "Anonymizing Transaction Databases for Publication", Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2008, pp. 767-775.
[14] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity", Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE), 2007, pp. 106-115.
[15] T.M. Truta and V. Bindu, "Privacy Protection: p-Sensitive k-Anonymity Property", International Workshop of Privacy Data Management (PDM 2006), in conjunction with the 22nd International Conference on Data Engineering (ICDE), 2006, p. 94.
[16] X. Xiao and Y. Tao, "Personalized Privacy Preservation", Proceedings of the ACM Conference on Management of Data (SIGMOD '06), 2006, pp. 229-240.
[17] L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, pp. 571-588.