1 Advanced databases – Inferring implicit/new knowledge from data(bases): Some thoughts about...

1

Advanced databases –

Inferring implicit/new knowledge from data(bases):

Some thoughts about mining and privacy

Bettina Berendt

Katholieke Universiteit Leuven, Department of Computer Science

http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/

Last update: 13 December 2007

2

Agenda

(Some) questions

(Some) answers

Outlook

3

Is this a man or a woman?

clicked on

4

Is this the same person?

5

Who is this?

6

Who is this? (Sample from a search-query log)

7

Agenda

(Some) questions

(Some) answers

Outlook

8

Gender prediction I

9

Data, data preparation and learning

Blogspot.com – optional demographic annotation

150,000 blogs:

75,000 male entries / 75,000 female entries

size 200 - 4,000 characters / entry

post-processing

quality of automatic “gender separation”

Naïve Bayes text classifier: 140,000 entries in the training set, 10,000 in the test set

Accuracy: 71% (>> 50% baseline)

[Liu, H. & Mihalcea, R. (2007). Of men, women, and computers: Data-driven gender modeling for improved user interfaces. In Proc. of the International Conference on Weblogs and Social Media]
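The learning step on this slide can be illustrated with a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The whitespace tokenizer, the toy tokens, the class labels "A"/"B" and the function names `train_nb`, `predict_nb` and `accuracy` are illustrative assumptions; this is not the setup used by Liu & Mihalcea.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_texts):
    """Collect class and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_texts:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of (text, label) test pairs classified correctly."""
    correct = sum(predict_nb(model, text) == label for text, label in test_set)
    return correct / len(test_set)


# Toy data with made-up tokens (assumption, for illustration only).
toy_train = [
    ("alpha beta alpha", "A"),
    ("alpha gamma", "A"),
    ("delta beta delta", "B"),
    ("delta gamma", "B"),
]
toy_model = train_nb(toy_train)
print(predict_nb(toy_model, "alpha gamma"))
```

With two equally large classes, as on this slide, always guessing one class gives the 50% baseline against which the reported accuracy is compared.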

10

Results: Characteristic words and n-grams

Feature weights (F = feature, C = class): features with weight ≥ threshold

11

Deploying the results – personalization can have different faces (1)

12

Deploying the results – personalization can have different faces (2)

13

Gender prediction II

clicked on

14

Input data and prediction problem

Only 8% of Internet users write blogs

Informal observations of correlations between browsing behaviour and demographic attributes (gender, age)

Problem:

How to predict the user's gender from the Web pages s/he clicked on

Basic idea:

users → (user-to-page matrix) → pages → (document-to-term matrix) → terms

[Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen (2007). Demographic Prediction Based on User’s Browsing Behavior. In Proc. WWW 2007]

15

Steps (1)

1. Define the gender tendency of a Web page

Proportion of requests for the page by users of gender c (male/female), relative to all requests (R: user-to-page matrix)

2. Learn the gender tendency of Web pages

Pages: those with variance on gender ≥ threshold

Method: linear form of support-vector machine regression

Features: content words with highest information gain

Target attribute: gender tendency

3. Predict the user's gender

Method: Naive Bayes

Features: visited pages

Target attribute: gender
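Step 1 can be sketched in a few lines. The sketch assumes that R is stored as a nested dictionary mapping each user to the number of requests per page, and reads the verbal definition above as: tendency of page p for gender c = (requests for p by users of gender c) / (all requests for p). The names `gender_tendency`, `R` and `genders` and the toy counts are illustrative assumptions, not taken from Hu et al.

```python
def gender_tendency(R, genders, page, c):
    """Share of the requests for `page` that come from users of gender `c`.

    R: dict user -> dict page -> number of requests (user-to-page matrix)
    genders: dict user -> gender label
    Returns None if the page was never requested.
    """
    total = sum(clicks.get(page, 0) for clicks in R.values())
    if total == 0:
        return None
    from_c = sum(
        clicks.get(page, 0)
        for user, clicks in R.items()
        if genders[user] == c
    )
    return from_c / total


# Toy user-to-page matrix (made-up numbers).
R = {
    "u1": {"p1": 3, "p2": 1},
    "u2": {"p1": 1},
    "u3": {"p2": 4},
}
genders = {"u1": "f", "u2": "m", "u3": "m"}

print(gender_tendency(R, genders, "p1", "f"))  # 3 of 4 requests
print(gender_tendency(R, genders, "p2", "m"))  # 4 of 5 requests
```

In steps 2 and 3, values of this kind serve as the regression target for pages and, once predicted for unseen pages, as evidence about the user.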

16

Steps (2)

4. Optimize the process by leveraging latent semantic structure

Smooth the user-to-page matrix (replace 0 entries by some small constant)

Apply LSI to the user-to-page matrix

Smooth each predicted value by replacing it with a weighted average of its closest neighbours' values

(and various other optimizations)
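The two smoothing operations in step 4 can be sketched as follows; the LSI step is left out. The constant `epsilon`, the number of neighbours `k`, the precomputed `similarity` matrix and both function names are assumptions made for illustration; the slide does not specify how neighbours or weights are chosen.

```python
def smooth_zeros(matrix, epsilon=0.01):
    """Replace every 0 entry of the matrix by a small constant."""
    return [[v if v != 0 else epsilon for v in row] for row in matrix]


def smooth_by_neighbours(values, similarity, k=2):
    """Replace each value by the similarity-weighted average of the
    values of its k most similar neighbours.

    values: list of predicted values, one per item
    similarity: square matrix, similarity[i][j] between items i and j
    """
    smoothed = []
    for i in range(len(values)):
        others = [j for j in range(len(values)) if j != i]
        neighbours = sorted(
            others, key=lambda j: similarity[i][j], reverse=True
        )[:k]
        weight_sum = sum(similarity[i][j] for j in neighbours)
        if weight_sum == 0:
            smoothed.append(values[i])  # no usable neighbours: keep value
        else:
            weighted = sum(similarity[i][j] * values[j] for j in neighbours)
            smoothed.append(weighted / weight_sum)
    return smoothed


print(smooth_zeros([[0, 2], [1, 0]]))
print(smooth_by_neighbours(
    [1.0, 0.0, 0.5],
    [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]],
))
```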

17

Results (steps 1-3)

18

Results (steps 1-4)

A kind of analogue of the BOW used for predicting gender from produced content (words over all visited pages)

19

Data, preprocessing and learning

Click-through log collected by a large scale Web site

189,480 users

Exclude those who did not provide gender or age information

Exclude those who clicked fewer than 10 pages

223,786 pages

Exclude those that were not crawled by the crawler

Exclude those that were visited fewer than 10 times

Further preprocessing on page content: remove stopwords

Different forms of regression and classification

Training set / test set: each ½ of the users
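The filtering rules above can be sketched as follows. The record layout (a `clicks` list per user), counting distinct pages per user, counting page visits over the remaining users, and the deterministic split are all assumptions made for illustration; the slide does not fix these details.

```python
from collections import Counter


def filter_users(users, min_pages=10):
    """Keep users who gave gender and age and clicked >= min_pages distinct pages."""
    return {
        uid: u
        for uid, u in users.items()
        if u.get("gender") is not None
        and u.get("age") is not None
        and len(set(u["clicks"])) >= min_pages
    }


def filter_pages(users, crawled, min_visits=10):
    """Keep pages that were crawled and visited at least min_visits times."""
    visits = Counter(page for u in users.values() for page in u["clicks"])
    return {p for p, n in visits.items() if p in crawled and n >= min_visits}


def remove_stopwords(text, stopwords):
    """Lower-case, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def split_half(user_ids):
    """Split the users into two halves (sorted here to stay deterministic)."""
    ordered = sorted(user_ids)
    middle = len(ordered) // 2
    return ordered[:middle], ordered[middle:]
```

A random split would normally be preferable to the sorted one used here; sorting only keeps the sketch reproducible.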

20

Re-identification

21

Anonymized?

„In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. GIC collected patient-specific data with nearly one hundred attributes per encounter along the lines of the those shown in the leftmost circle of Figure 1 [...] Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry.

For twenty dollars I purchased the voter registration list for Cambridge Massachusetts [...] The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter.

This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals.

22

Results

For example, [...] Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.“

„87% of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.“

Based on that, Sweeney defined k-anonymity: A relational table is k-anonymous if every combination of values of an attribute set that can be used for re-identification (i.e. for linking) occurs in at least k records.

L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
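The definition translates directly into a check: group the records by their values on the linking attributes and verify that every group has at least k members. The following is a minimal sketch; the function name `is_k_anonymous`, the attribute names and the toy records are assumptions for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs
    in `records` occurs in at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(size >= k for size in groups.values())


# Toy table (made-up values).
table = [
    {"zip": "00001", "gender": "f", "diagnosis": "x"},
    {"zip": "00001", "gender": "f", "diagnosis": "y"},
    {"zip": "00002", "gender": "m", "diagnosis": "z"},
]

print(is_k_anonymous(table, ["zip", "gender"], 2))      # third record is unique
print(is_k_anonymous(table[:2], ["zip", "gender"], 2))  # both records share one group
```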

23

Data and data analysis

Medical records on 135,000 state employees and their families

The census data

Matching on shared attributes

(Note: This matching is generally not considered to be a data mining technique. However, k-anonymity has become an important goal of privacy-preserving data mining techniques.)
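Matching on shared attributes amounts to a join on {ZIP code, birth date, gender}. The following sketch uses entirely fictitious records; the field names and the function `link_records` are assumptions for illustration. It reports only those medical records for which exactly one person in the identified list matches.

```python
from collections import defaultdict


def link_records(medical, identified, keys=("zip", "birth_date", "gender")):
    """Return (name, diagnosis) pairs for medical records whose key values
    match exactly one record in the identified list."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person)

    linked = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # unique match: re-identified
            linked.append((candidates[0]["name"], record["diagnosis"]))
    return linked


# Fictitious data.
medical = [
    {"zip": "00001", "birth_date": "1950-01-01", "gender": "m", "diagnosis": "dx-1"},
    {"zip": "00002", "birth_date": "1960-02-02", "gender": "f", "diagnosis": "dx-2"},
]
identified = [
    {"name": "Person A", "zip": "00001", "birth_date": "1950-01-01", "gender": "m"},
    {"name": "Person B", "zip": "00002", "birth_date": "1960-02-02", "gender": "f"},
    {"name": "Person C", "zip": "00002", "birth_date": "1960-02-02", "gender": "f"},
]

print(link_records(medical, identified))
```

The second medical record is not linked because two people share its key values; in k-anonymity terms, that group has size 2 rather than 1.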

24

Merging identities

25

Keeping identities apart – the basic setting

Paper published by the MovieLens team (collaborative-filtering movie ratings) who were considering publishing a ratings dataset, see http://movielens.umn.edu/

Public dataset: users mention films in forum posts

Private dataset (may be released e.g. for research purposes): users' ratings

Film IDs can easily be extracted from the posts

Observation: Every user will talk about items from a sparse relation space (those – generally few – films s/he has seen)

[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]

26

Keeping identities apart – the computational problem

Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset

Rank these users u by their likelihood of being t

Evaluate:

If t is in the top k of this list, then t is k-identified

Count percentage of users who are k-identified

E.g. measure likelihood by TF.IDF (m: item)
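The ranking and the k-identification test can be sketched as follows. The sketch assumes one common TF.IDF formulation: binary term frequency (a user either relates to item m or not), idf(m) = log(number of users / number of users who rated m), and cosine similarity between the target's mentions and each user's ratings. This weighting and all function names are assumptions; the exact scoring used by Frankowski et al. may differ.

```python
import math
from collections import Counter


def idf_weights(ratings_users):
    """idf(m) = log(|users| / |users who rated m|) for every rated item m."""
    n = len(ratings_users)
    df = Counter(m for items in ratings_users.values() for m in set(items))
    return {m: math.log(n / d) for m, d in df.items()}


def similarity(target_items, user_items, idf):
    """Cosine similarity of two binary item vectors weighted by idf."""
    shared = set(target_items) & set(user_items)
    dot = sum(idf.get(m, 0.0) ** 2 for m in shared)
    norm_t = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in set(target_items)))
    norm_u = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in set(user_items)))
    if norm_t == 0 or norm_u == 0:
        return 0.0
    return dot / (norm_t * norm_u)


def rank_candidates(target_items, ratings_users):
    """Users of the ratings dataset, most similar to the target first."""
    idf = idf_weights(ratings_users)
    return sorted(
        ratings_users,
        key=lambda u: (-similarity(target_items, ratings_users[u], idf), u),
    )


def is_k_identified(true_user, target_items, ratings_users, k):
    """True if the target's real account is among the top k candidates."""
    return true_user in rank_candidates(target_items, ratings_users)[:k]


def share_k_identified(targets, ratings_users, k):
    """targets: dict true_user -> set of items mentioned in the forum."""
    hits = sum(
        is_k_identified(u, items, ratings_users, k)
        for u, items in targets.items()
    )
    return hits / len(targets)
```

Items rated by everyone get idf 0 and so contribute nothing, while a mention of a rarely rated film narrows the candidate list sharply.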

27

Results

28

What do you think helps?

29

Data and data analysis

3,828 movie references made by 133 forum posters about 1,685 different movies

12,565,530 ratings from 140,132 users, on 8,957 items

Extraction of movie references from the posts

Different algorithms for finding „similar“ users

30

Re-identification

31

Who is this? (Sample from an anonymized search-query log)

32

Result (a 1-identified person)

[M. Barbaro and T. Zeller. A face is exposed for AOL Searcher No. 4417749. New York Times, 9 August 2006]

33

Noise injection for queries

34

Agenda

(Some) questions

(Some) answers

Outlook

35

Trends in Web mining

More data types (e.g., multimedia)

More varied semantics (e.g., social networks: documents, people, ...)

More action types (usage becomes reading and writing) and richer data (texts instead of queries)

Ubiquity

Of data and devices (e.g., mobile)

Of people

Privacy-preserving data mining

More users of KD

36

Other uses of data mining / KD: Policing

Rodney Munroe, chief of the Richmond, Va. police service, accepted the 2007 Business Intelligence Award for Excellence from Gartner, a leading information technology research and advisory company [...].

The award usually goes to innovative business applications, but judges were impressed with how Munroe turned analytics into a formidable crime-fighting tool.

When Munroe took over as chief two years ago, his department was drowning in crime and data. Police had a mass of data from 911 calls and crime reports; what they didn’t have was a way to connect the dots and see a pattern of behaviour.

Using some sophisticated software and hardware they started overlaying crime reports with other data, such as weather, traffic, sports events and paydays for large employers. The data was analyzed three times a day and something interesting emerged: Robberies spiked on paydays near cheque cashing storefronts in specific neighbourhoods. Other clusters also became apparent, and pretty soon police were deploying resources in advance and predicting where crime was most likely to occur.

Ian Harvey. Fighting crime with databases. Aug. 6, 2007 http://www.cbc.ca/news/background/tech/data-mining.html

37

The Automated Targeting System

The Automated Targeting System assigns a “risk assessment” to every person and container seeking to enter or exit the U.S.

The assessment is based on combining a number of data sources.

In Fiscal Year 2005, Customs and Border Protection “processed 431 million pedestrians and passengers, 121 million privately owned vehicles, and processed and cleared 25.3 million sea, rail, and truck containers,” according to Customs.

38

Department of Homeland Security – announcement of changes to the ATS, 7 August 2007

„ATS assists U.S. Customs and Border Protection (CBP) frontline officers in frustrating the ability of terrorists to gain entry into the United States, enforcing all import and export laws, and facilitating legitimate trade and travel across our borders. [...] the department received several hundred comments [...], many of which concerned ATS-P, the passenger screening module used by CBP officers. The department responded to these comments by revising the SORN.

Notable revisions to the SORN include:

ATS-P will retain the information for a far shorter period of time. Under the revised SORN, the retention period is 15 years (7 years active and 8 years dormant), a significant decrease from the proposed 40-year period.

Under ATS-P, the purposes for which Passenger Name Record data (PNR) may be used have been narrowed.

The updated SORN implements the department’s mixed system policy, which administratively extends the protections of the Privacy Act of 1974 to non-U.S. persons by providing access and redress to their PNR data.

As well, ATS-P treats all passengers equally. ATS does not profile by race, ethnicity or arbitrary assumptions. The department does not collect information on race, ethnicity, religion, or orientation, or make decisions based on such information, and to the extent such information may be provided by a carrier, the department filters that information out.

Further, ATS-P does not use a score to determine an individual’s risk level. Rather, ATS-P compares PNR and Advanced Passenger Information System data with law enforcement records and threat-based scenarios for use by law enforcement officials to intercept high-risk travelers, identify persons of concern, and identify patterns of suspicious activity, which may be used to identify other high risk travelers previously unknown to law enforcement. The scenarios are drawn from previous and current law enforcement and intelligence information.

Importantly, ATS does not replace human decision making. It is a decision-making support tool for use by trained law enforcement officials. [...]“

http://www.dhs.gov/xnews/releases/pr_1186178812301.shtm

39

„My data belong to me“?! – or: my mined profile, part 2; or: external effects I

40

„My data belong to me“?! – or: external effects II

Friendship is generally symmetric

If A wants to hide her friendships, but B shows that „A is my friend“, then B has disclosed private information of A.

(More elaborate problems follow from this ...)

For a discussion, see

Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.
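The external effect described on this slide can be made concrete with a small sketch: if friendship is treated as symmetric, an observer can reconstruct A's friends from other users' public lists even when A discloses nothing. The data layout and the function name `inferred_friends` are assumptions for illustration.

```python
def inferred_friends(person, disclosed):
    """Friends of `person` that an observer can infer, assuming friendship
    is symmetric.

    disclosed: dict user -> set of friends that user shows publicly
    """
    own = set(disclosed.get(person, set()))
    revealed_by_others = {
        other
        for other, friends in disclosed.items()
        if other != person and person in friends
    }
    return own | revealed_by_others


# A hides her friendships; B lists A publicly.
disclosed = {"A": set(), "B": {"A"}, "C": {"B"}}
print(inferred_friends("A", disclosed))
```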

41

Scope for action for you!

as a researcher

as a netizen

as a citizen

42

Next lecture: Application example – What is the impact of genetically modified organisms?

43

References and background reading

Liu, H. & Mihalcea, R. (2007). Of men, women, and computers: Data-driven gender modeling for improved user interfaces. In Proc. of the International Conference on Weblogs and Social Media. http://www.icwsm.org/papers/paper3.html

Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen (2007). Demographic Prediction Based on User’s Browsing Behavior. In Proc. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. http://www2007.org/papers/paper686.pdf

L. Sweeney (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 557-570. http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html

Frankowski, D., Cosley, D., Sen, S., Terveen, L.G., Riedl, J. (2006). You are what you say: privacy risks of public mentions. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006 (pp. 565–572). ACM. http://www-users.cs.umn.edu/~dfrankow/files/privacy-sigir2006.pdf

Barbaro, M., Zeller, T.: A face is exposed for AOL searcher no. 4417749. New York Times (9 August 2006) http://www.nytimes.com/2006/08/09/technology/09aol.html

Jeff Jonas and Jim Harper (2006). Effective Counterterrorism and the Limited Role of Predictive Data Mining. Policy Analysis No. 584, Dec. 11, 2006. http://www.cato.org/pubs/pas/pa584.pdf
