1
Advanced databases –
Inferring implicit/new knowledge from data(bases):
Some thoughts about mining and privacy
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 13 December 2007
2
Agenda
(Some) questions
(Some) answers
Outlook
3
Is this a man or a woman?
clicked on
4
Is this the same person?
5
Who is this?
6
Who is this? (Sample from a search-query log)
7
Agenda
(Some) questions
(Some) answers
Outlook
8
Gender prediction I
9
Data, data preparation and learning
Blogspot.com – optional demographic annotation
150,000 blogs:
75,000 male entries / 75,000 female entries
size 200 - 4,000 characters / entry
post-processing
quality of automatic “gender separation”
Naïve Bayes text classifier: training set of 140,000, test set of 10,000
Accuracy: 71% (>> 50% baseline)
[Liu, H. & Mihalcea, R. (2007). Of men, women, and computers: Data-driven gender modeling for improved user interfaces. In Proc. of the International Conference on Weblogs and Social Media]
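To make the learning step concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the authors' implementation: the whitespace tokenisation, the function names `train_naive_bayes` and `predict`, and the toy entries are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_texts):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_docs = Counter()               # number of entries per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocabulary = set()
    for text, label in labelled_texts:
        words = text.lower().split()     # naive whitespace tokenisation
        class_docs[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # add-one (Laplace) smoothing avoids zero probabilities
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy entries, not taken from the Blogspot corpus.
toy_entries = [
    ("football match tonight with the guys", "male"),
    ("fixed the car engine today", "male"),
    ("shopping with my sister today", "female"),
    ("my boyfriend cooked dinner tonight", "female"),
]
model = train_naive_bayes(toy_entries)
print(predict(model, "car engine trouble"))
```

On a balanced two-class corpus, always guessing one class gives the 50% baseline against which the reported 71% accuracy is compared.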
10
Results: Characteristic words and n-grams
Feature weights (F = feature, C = class): features shown have weight ≥ threshold
11
Deploying the results – personalization can have different faces (1)
12
Deploying the results – personalization can have different faces (2)
13
Gender prediction II
clicked on
14
Input data and prediction problem
Only 8% of Internet users write blogs
Informal observations of correlations between browsing behaviour and demographic attributes (gender, age)
Problem:
How to predict the user's gender from the Web pages s/he clicked on
Basic idea:
users – (user-to-page matrix) – pages – (document-to-term matrix) – terms
[Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen (2007). Demographic Prediction Based on User’s Browsing Behavior. In Proc. WWW 2007]
15
Steps (1)
1. Define the gender tendency of a Web page
Proportion of requests for the page by male/female users (class c), relative to all requests
(R: user-to-page matrix)
2. Learn the gender tendency of Web pages
Pages: those with variance on gender ≥ threshold
Linear form of support-vector machine regression
Features: content words with highest information gain
Target attribute: gender tendency
3. Predict the user's gender
Naive Bayes
Features: visited pages
Target attribute: gender
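Step 1 can be sketched in a few lines. The sketch assumes that R is stored as a list of rows holding request counts and that each user's class is known. The function name `gender_tendency` and the toy numbers are illustrative and not taken from the paper. Steps 2 and 3 (the regression and the Naive Bayes prediction) are not covered here.

```python
def gender_tendency(R, genders, cls):
    """Share of each page's requests that come from users of class `cls`.

    R is a user-to-page matrix given as a list of rows: R[i][j] is the
    number of times user i requested page j. genders[i] is the known
    class of user i.
    """
    n_pages = len(R[0])
    tendencies = []
    for j in range(n_pages):
        total = sum(row[j] for row in R)
        by_class = sum(row[j] for row, g in zip(R, genders) if g == cls)
        # pages that were never requested get a tendency of 0.0
        tendencies.append(by_class / total if total else 0.0)
    return tendencies


# Invented toy data: three users (rows), three pages (columns).
R = [
    [3, 0, 1],
    [1, 0, 1],
    [0, 4, 2],
]
genders = ["m", "m", "f"]
print(gender_tendency(R, genders, "m"))
```

In this toy matrix the first page is requested only by users labelled "m", the second only by the user labelled "f", and the third by both in equal shares.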
16
Steps (2)
4. Optimize the process by leveraging latent semantic structure
Smooth the user-to-page matrix (replace each 0 by some small constant)
Apply LSI to the user-to-page matrix
Smooth each predicted value by replacing it by a weighted average of its closest neighbours' values
(and various other optimizations)
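Two of these optimizations can be illustrated with a rough sketch. The constant `eps`, the use of cosine similarity, the neighbourhood size `k` and both function names are assumptions, since the slide does not specify them. The LSI step itself is left out.

```python
import math


def smooth_zeros(R, eps=0.01):
    """Replace zero entries of the user-to-page matrix by a small constant."""
    return [[x if x != 0 else eps for x in row] for row in R]


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbour_smooth(vectors, values, k=2):
    """Replace each value by a similarity-weighted average of the values
    of its k most similar items (the item itself is excluded)."""
    smoothed = []
    for i, vec in enumerate(vectors):
        sims = [(cosine(vec, other), j)
                for j, other in enumerate(vectors) if j != i]
        # keep the k most similar items, dropping non-positive similarities
        top = [(s, j) for s, j in sorted(sims, reverse=True)[:k] if s > 0]
        weight_sum = sum(s for s, _ in top)
        if weight_sum > 0:
            smoothed.append(sum(s * values[j] for s, j in top) / weight_sum)
        else:
            smoothed.append(values[i])  # no usable neighbour: keep the value
    return smoothed


# Invented toy data: three items described by 2-dimensional vectors.
vectors = [[1, 0], [1, 0], [0, 1]]
values = [1.0, 0.0, 0.5]
print(smooth_zeros([[0, 2], [1, 0]], eps=0.5))
print(neighbour_smooth(vectors, values, k=1))
```

In the paper's setting the vectors would presumably come from the reduced LSI space. Here they are raw toy vectors.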
17
Results (steps 1-3)
18
Results (steps 1-4)
A kind of analogue of the BOW used for predicting gender from produced content (words over all visited pages)
19
Data, preprocessing and learning
Click-through log collected by a large scale Web site
189,480 users
Exclude those who did not provide gender or age information
Exclude those who clicked fewer than 10 pages
223,786 pages
Exclude those that were not crawled by the crawler
Exclude those that were visited fewer than 10 times
Further preprocessing on page content: remove stopwords
Different forms of regression and classification
Training set / test set: each ½ of the users
20
Re-identification
21
Anonymized?
„In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient-specific data with nearly one hundred attributes per encounter along the lines of the those shown in the leftmost circle of Figure 1 [...] Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry.
For twenty dollars I purchased the voter registration list for Cambridge Massachusetts [...] The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter.
This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals.
22
Results
For example, [...] Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.“
„87% of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.“
Based on that, Sweeney defined k-anonymity: A relational table is k-anonymous if every sequence of values for an
attribute set that can re-identify (that can be used for linking) occurs in at least k records.
L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
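The definition translates directly into a check over a table. The sketch below assumes that records are dictionaries and that the caller names the linking attributes (often called quasi-identifiers). The function name and the toy records are invented for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs
    in `records` occurs in at least k records."""
    combos = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())


# Invented toy table; the values do not describe real people.
table = [
    {"zip": "02138", "birth_year": 1950, "gender": "M", "diagnosis": "A"},
    {"zip": "02138", "birth_year": 1950, "gender": "M", "diagnosis": "B"},
    {"zip": "02139", "birth_year": 1962, "gender": "F", "diagnosis": "C"},
    {"zip": "02139", "birth_year": 1962, "gender": "F", "diagnosis": "A"},
]
linking_attributes = ["zip", "birth_year", "gender"]
print(is_k_anonymous(table, linking_attributes, 2))
print(is_k_anonymous(table, linking_attributes, 3))
```

The toy table is 2-anonymous but not 3-anonymous, because each combination of linking values occurs exactly twice.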
23
Data and data analysis
Medical records on 135,000 state employees and their families
The census data
Matching on shared attributes
(Note: This matching is generally not considered to be a data mining technique. However, k-anonymity has become an important goal of privacy-preserving data mining techniques.)
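The matching itself amounts to a join on the shared attributes. The following sketch assumes dictionary records with the field names `zip`, `birth_date`, `gender`, `name` and `diagnosis`. All names and values are fictitious, and the function name `link_records` is hypothetical.

```python
from collections import defaultdict


def link_records(medical, identified, keys=("zip", "birth_date", "gender")):
    """Join de-identified medical records with an identified list on the
    shared attributes; report only matches that are unique."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person)
    linked = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            linked.append((candidates[0]["name"], record["diagnosis"]))
    return linked


# Fictitious toy data.
medical = [
    {"zip": "02138", "birth_date": "1950-03-01", "gender": "M", "diagnosis": "A"},
    {"zip": "02139", "birth_date": "1971-07-15", "gender": "F", "diagnosis": "B"},
]
identified = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-03-01", "gender": "M"},
    {"name": "Person Two", "zip": "02139", "birth_date": "1971-07-15", "gender": "F"},
    {"name": "Person Three", "zip": "02139", "birth_date": "1971-07-15", "gender": "F"},
]
print(link_records(medical, identified))
```

Only the first medical record is linked. The second matches two people, so within this toy data that combination of values is 2-anonymous.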
24
Merging identities
25
Keeping identities apart – the basic setting
Paper published by the MovieLens team (collaborative-filtering movie ratings) who were considering publishing a ratings dataset, see http://movielens.umn.edu/
Public dataset: users mention films in forum posts
Private dataset (may be released e.g. for research purposes): users' ratings
Film IDs can easily be extracted from the posts
Observation: Every user will talk about items from a sparse relation space (those – generally few – films s/he has seen)
[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]
26
Keeping identities apart – the computational problem
Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset
Rank these users u by their likelihood of being t
Evaluate:
If t is in the top k of this list, then t is k-identified
Count percentage of users who are k-identified
E.g. measure likelihood by TF.IDF (m: item)
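The ranking and the k-identification test can be sketched as follows. The scoring is a simplified, IDF-only stand-in and not the exact formula from the paper: each item m that the target mentioned and a candidate rated contributes log(N / number of users who rated m). The function names and the toy data are assumptions.

```python
import math


def rank_candidates(mentions, ratings):
    """Rank users of the ratings dataset by similarity to a target.

    mentions: set of items the target mentioned in the public forum.
    ratings:  dict mapping user id -> set of items that user rated.
    """
    n_users = len(ratings)
    df = {}  # number of users who rated each item
    for items in ratings.values():
        for m in items:
            df[m] = df.get(m, 0) + 1
    scores = {}
    for user, items in ratings.items():
        # rarely rated items carry more weight than popular ones
        scores[user] = sum(math.log(n_users / df[m]) for m in mentions & items)
    # highest score first; ties are broken by user id
    return sorted(scores, key=lambda u: (-scores[u], u))


def is_k_identified(target, mentions, ratings, k):
    """True if the target is among the top k candidates."""
    return target in rank_candidates(mentions, ratings)[:k]


# Invented toy ratings dataset.
ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "D"},
    "u3": {"A", "B"},
    "u4": {"A", "E"},
}
print(is_k_identified("u1", {"A", "C"}, ratings, 1))
print(is_k_identified("u3", {"A"}, ratings, 1))
```

Mentioning the rarely rated item C is enough to 1-identify u1, whereas mentioning only the item that everyone rated (A) tells an attacker nothing about u3.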
27
Results
28
What do you think helps?
29
Data and data analysis
3,828 movie references made by 133 forum posters about 1,685 different movies
12,565,530 ratings from 140,132 users, on 8,957 items
Extraction of movie references from the posts
Different algorithms for finding „similar“ users
30
Re-identification
31
Who is this? (Sample from an anonymized search-query log)
32
Result (a 1-identified person)
[M. Barbaro and T. Zeller. A face is exposed for AOL Searcher No. 4417749. New York Times, 9 August 2006]
33
Noise injection for queries
34
Agenda
(Some) questions
(Some) answers
Outlook
35
Trends in Web mining
More data types (e.g., multimedia)
More varied semantics (e.g., social networks: documents, people, ...)
More action types (usage becomes reading and writing) and richer data (texts instead of queries)
Ubiquity
Of data and devices (e.g., mobile)
Of people
Privacy-preserving data mining
More users of KD
36
Other uses of data mining / KD: Policing
Rodney Munroe, chief of the Richmond, Va. police service, accepted the 2007 Business Intelligence Award for Excellence from Gartner, a leading information technology research and advisory company [...].
The award usually goes to innovative business applications, but judges were impressed with how Munroe turned analytics into a formidable crime-fighting tool.
When Munroe took over as chief two years ago, his department was drowning in crime and data. Police had a mass of data from 911 calls and crime reports; what they didn’t have was a way to connect the dots and see a pattern of behaviour.
Using some sophisticated software and hardware they started overlaying crime reports with other data, such as weather, traffic, sports events and paydays for large employers. The data was analyzed three times a day and something interesting emerged: Robberies spiked on paydays near cheque cashing storefronts in specific neighbourhoods. Other clusters also became apparent, and pretty soon police were deploying resources in advance and predicting where crime was most likely to occur.
Ian Harvey. Fighting crime with databases. Aug. 6, 2007 http://www.cbc.ca/news/background/tech/data-mining.html
37
The Automated Targeting System
The Automated Targeting System assigns a “risk assessment” to every person and container seeking to enter or exit the U.S.
The assessment is based on combining a number of data sources.
In Fiscal Year 2005, Customs and Border Protection “processed 431 million pedestrians and passengers, 121 million privately owned vehicles, and processed and cleared 25.3 million sea, rail, and truck containers,” according to Customs.
38
Department of Homeland Security – announcement of changes to the ATS, 7 August 2007
„ATS assists U.S. Customs and Border Protection (CBP) frontline officers in frustrating the ability of terrorists to gain entry into the United States, enforcing all import and export laws, and facilitating legitimate trade and travel across our borders. [...] the department received several hundred comments [...], many of which concerned ATS-P, the passenger screening module used by CBP officers. The department responded to these comments by revising the SORN.
Notable revisions to the SORN include:
ATS-P will retain the information for a far shorter period of time. Under the revised SORN, the retention period is 15 years (7 years active and 8 years dormant), a significant decrease from the proposed 40-year period.
Under ATS-P, the purposes for which Passenger Name Record data (PNR) may be used have been narrowed.
The updated SORN implements the department’s mixed system policy, which administratively extends the protections of the Privacy Act of 1974 to non-U.S. persons by providing access and redress to their PNR data.
As well, ATS-P treats all passengers equally. ATS does not profile by race, ethnicity or arbitrary assumptions. The department does not collect information on race, ethnicity, religion, or orientation, or make decisions based on such information, and to the extent such information may be provided by a carrier, the department filters that information out.
Further, ATS-P does not use a score to determine an individual’s risk level. Rather, ATS-P compares PNR and Advanced Passenger Information System data with law enforcement records and threat-based scenarios for use by law enforcement officials to intercept high-risk travelers, identify persons of concern, and identify patterns of suspicious activity, which may be used to identify other high risk travelers previously unknown to law enforcement. The scenarios are drawn from previous and current law enforcement and intelligence information.
Importantly, ATS does not replace human decision making. It is a decision-making support tool for use by trained law enforcement officials. [...]“
http://www.dhs.gov/xnews/releases/pr_1186178812301.shtm
39
„My data belong to me“?! – or: my mined profile, part 2; or: external effects I
40
„My data belong to me“?! – or: external effects II
Friendship is generally symmetric
If A wants to hide her friendships,
But B shows that „A is my friend“,
B has disclosed private information of A.
(More elaborate problems follow from this ...)
For a discussion, see
Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.
41
Scope for action for you!
as a researcher
as a netizen
as a citizen
42
Next lecture: Application example – What is the impact of genetically modified organisms?
43
References and background reading
Liu, H. & Mihalcea, R. (2007). Of men, women, and computers: Data-driven gender modeling for improved user interfaces. In Proc. of the International Conference on Weblogs and Social Media. http://www.icwsm.org/papers/paper3.html
Hu, J., Zeng, H.-J., Li, H., Niu, C., & Chen, Z. (2007). Demographic prediction based on user's browsing behavior. In Proc. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. http://www2007.org/papers/paper686.pdf
Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 557-570. http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html
Frankowski, D., Cosley, D., Sen, S., Terveen, L.G., & Riedl, J. (2006). You are what you say: privacy risks of public mentions. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006 (pp. 565–572). ACM. http://www-users.cs.umn.edu/~dfrankow/files/privacy-sigir2006.pdf
Barbaro, M. & Zeller, T. (2006). A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006. http://www.nytimes.com/2006/08/09/technology/09aol.html
Jonas, J. & Harper, J. (2006). Effective Counterterrorism and the Limited Role of Predictive Data Mining. Policy Analysis No. 584, Dec. 11, 2006. http://www.cato.org/pubs/pas/pa584.pdf