Page 1: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Rules of Thumb for Information Acquisition from Large and Redundant Data

Wolfgang Gatterbauer

http://UniqueRecall.com

Database group, University of Washington

Version April 21, 2011

33rd European Conference on Information Retrieval (ECIR'11)

Page 2: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Information Acquisition from Redundant Data

Pareto principle (80-20 rule): 20% of the causes produce 80% of the effect

e.g. business: clients / sales
e.g. software: bugs / errors
e.g. health care: patients / HC resources

Information acquisition: do 20% of the data (instances) yield 80% of the information (concepts)?

e.g. words in a corpus: all words / different words
e.g. used first names: individual names / different names
e.g. web harvesting: web data / web information

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

Page 3: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Information Acquisition from Redundant Data

[Diagram: "unique" information Au → available data A → retrieved and extracted data B → acquired information Bu. Steps: information dissemination (Au to A), information retrieval and information extraction (A to B), information integration (B to Bu). Information acquisition covers retrieval, extraction, and integration.]

Redundancy distribution k, recall r → expected sample distribution k, expected unique recall ru

Three assumptions
• no ambiguity in data
• random sampling w/o replacement
• very large data sets

Page 4: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 5: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A Simple Balls-and-Urn Sampling Model

Data: a=15 balls in au=5 colors
Redundancy distribution k=(6,3,3,2,1), where ki is the frequency of the i-th most often appearing information (color)

[Figure: bar charts of redundancy ki over information i=1..5, for the data and for the sample]

Sampled data: b=3 balls
Sampled information: bu=2 colors
Recall r=3/15=0.2
Unique recall ru=2/5=0.4
Sample redundancy distribution k=(2,1)
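The numbers on this slide can be reproduced with a few lines of Python. This is only a sketch: the color labels and the particular sample are hypothetical, chosen so that b=3 balls hit bu=2 colors as in the slide.

```python
from collections import Counter

def sample_statistics(data, sample):
    """Return (recall, unique recall, sample redundancy distribution)."""
    recall = len(sample) / len(data)                    # r = b / a
    unique_recall = len(set(sample)) / len(set(data))   # ru = bu / au
    sample_k = tuple(sorted(Counter(sample).values(), reverse=True))
    return recall, unique_recall, sample_k

# Urn with redundancy distribution k=(6,3,3,2,1): 15 balls, 5 colors.
k = (6, 3, 3, 2, 1)
colors = ["red", "blue", "green", "yellow", "black"]  # hypothetical labels
urn = [c for c, ki in zip(colors, k) for _ in range(ki)]

# One possible sample of b=3 balls that contains bu=2 colors.
sample = ["red", "red", "blue"]
r, ru, sample_k = sample_statistics(urn, sample)
print(r, ru, sample_k)
```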

Page 6: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A model for sampling in the Limit of large data sets

k=(6,3,3,2,1), a=15 → k=(6,6,3,3,3,3,2,2,1,1), a=30 → ... → a→∞

[Figure: redundancy bar charts for a=15, a=30, and the limit a→∞]

Vertical perspective: k2-3=3 for a=15, k3-6=3 for a=30
Horizontal perspective: normalized redundancy distribution a=(0.2,0.2,0.4,0,0,0.2), where ak is the fraction of information with redundancy k; a3=0.4 stays the same as the data grows
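A small numerical check can illustrate why the large-data limit is convenient. The sketch below replicates the example distribution k=(6,3,3,2,1) and compares the exact expected unique recall under sampling without replacement (a standard hypergeometric argument, not taken from the slides) with the limit in which each ball is kept independently with probability r. The function names and the choice r=0.2 are assumptions made for illustration.

```python
from math import comb

def exact_unique_recall(k, b):
    """Expected unique recall when drawing b of sum(k) balls without replacement."""
    a = sum(k)
    # A color with redundancy ki is missed with probability C(a-ki, b) / C(a, b).
    return sum(1 - comb(a - ki, b) / comb(a, b) for ki in k) / len(k)

def limit_unique_recall(k, r):
    """Large-data limit: every ball is kept independently with probability r."""
    return sum(1 - (1 - r) ** ki for ki in k) / len(k)

base = (6, 3, 3, 2, 1)
r = 0.2
for copies in (1, 2, 10, 100):
    k = base * copies        # replicate the distribution: a = 15 * copies
    b = round(r * sum(k))    # number of sampled balls
    print(sum(k), exact_unique_recall(k, b), limit_unique_recall(k, r))
```

The limit value depends only on the normalized distribution, so it is the same in every row, while the exact value approaches it as the data set grows.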

Page 7: Rules of Thumb  for Information Acquisition from Large and Redundant Data

The Intuition for constant redundancy k

k=const=3, r=0.5, lim a→∞: indep. sampling with p=r

Unique recall: ru = 1-(1-0.5)^3 = 0.875

Expected sample distribution (binomial distribution), e.g. fraction with sample redundancy 2: (3 choose 2) 0.5^2 (1-0.5)^1 = 0.375
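The two numbers on this slide follow directly from independent sampling with p=r. A minimal sketch, with hypothetical function names:

```python
from math import comb

def unique_recall_const(k, r):
    """Probability that at least one of k copies is sampled."""
    return 1 - (1 - r) ** k

def sample_redundancy_pmf(k, r):
    """Binomial probabilities of seeing j = 0..k of the k copies."""
    return [comb(k, j) * r**j * (1 - r) ** (k - j) for j in range(k + 1)]

print(unique_recall_const(3, 0.5))    # 0.875
print(sample_redundancy_pmf(3, 0.5))  # [0.125, 0.375, 0.375, 0.125]
```

The entry for j=0 is the fraction of information that is missed, so it equals 1 minus the unique recall.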

Page 8: Rules of Thumb  for Information Acquisition from Large and Redundant Data

The Intuition for arbitrary redundancy distributions

a6=0.2, a3=0.4, a2=0.2, a1=0.2

[Figure: redundancy k (1 to 6) over the fraction of information (0 to 1), split into layers of equal redundancy]

Stratified sampling, lim a→∞, r=0.5:

ru = a6[1-(1-r)^6] + a3[1-(1-r)^3] + a2[1-(1-r)^2] + a1[1-(1-r)^1]
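The stratified sum over redundancy layers can be evaluated for the example distribution as follows. This is a sketch; the dictionary representation of the redundancy frequencies is an assumption made for illustration.

```python
def unique_recall(a, r):
    """a maps redundancy k to the fraction of information with that redundancy."""
    return sum(ak * (1 - (1 - r) ** k) for k, ak in a.items())

# Example from the slide: a6=0.2, a3=0.4, a2=0.2, a1=0.2.
a = {6: 0.2, 3: 0.4, 2: 0.2, 1: 0.2}
print(unique_recall(a, 0.5))
```

For r=0.5 this evaluates to roughly 0.797, i.e. about 80% of the information is seen after sampling half of the data.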

Page 9: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A horizontal Perspective for Sampling

[Figure: four panels with au=5, au=20, au=100, au=1000, sampled with r=0.5]

Horizontal layer of redundancy 1=0.8

Expected sample redundancy k1=3

bu=800

Page 10: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall for arbitrary redundancy distributions

Page 11: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 12: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Three formulations of Power law distributions

• Zipf [1932], Zipf-Mandelbrot: redundancy ki of the i-th most frequent information
• Power-law probability: redundancy frequency ak
• Pareto: complementary cumulative frequency

Commonly assumed to be different representations of the same distribution

Stumpf et al. [PNAS’05], Mitzenmacher [Internet Math.’04], Adamic [TR’00]
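To make the three views concrete, the sketch below derives all of them from one redundancy vector, using the earlier example k=(6,3,3,2,1). The function names are hypothetical, and the complementary cumulative frequency is computed with the "redundancy at least k" convention, which is an assumption (some texts use "greater than k").

```python
from collections import Counter

def rank_view(k):
    """Zipf-style view: redundancy of the i-th most frequent information."""
    return sorted(k, reverse=True)

def redundancy_frequency(k):
    """Fraction of information items that appear exactly k times."""
    counts = Counter(k)
    return {red: n / len(k) for red, n in sorted(counts.items())}

def complementary_cumulative(k):
    """Fraction of information items with redundancy >= each observed value."""
    return {v: sum(1 for x in k if x >= v) / len(k) for v in sorted(set(k))}

k = (6, 3, 3, 2, 1)
print(rank_view(k))
print(redundancy_frequency(k))
print(complementary_cumulative(k))
```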

Page 13: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall with Power laws

For =1 (log-log plot)

Page 14: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall with Power laws

Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information

For =1
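A rough numerical illustration of this rule, under assumptions that are not from the slides: redundancy frequencies proportional to k^(-gamma) with gamma=2, truncated at k_max=100000, and the large-data limit in which each copy is sampled independently with probability r. The parameter named on the slide was lost in extraction, so this sketch does not claim to reproduce the plotted curve.

```python
def power_law_frequencies(gamma, k_max):
    """Redundancy frequencies proportional to k^(-gamma), truncated at k_max."""
    weights = [k ** -gamma for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def unique_recall(freqs, r):
    """freqs[k-1] is the fraction of information with redundancy k."""
    return sum(ak * (1 - (1 - r) ** k) for k, ak in enumerate(freqs, start=1))

# Assumed parameters: exponent 2, truncation at 100000, recall r = 0.2.
freqs = power_law_frequencies(gamma=2.0, k_max=100_000)
ru = unique_recall(freqs, r=0.2)
print(round(ru, 3))
```

For these assumed parameters the result lies between 0.2 and 0.4. Other exponents or other power-law formulations give different values.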

Page 15: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Invariants under Sampling

Given our model: Which redundancy distribution remains invariant under sampling?
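One way to explore this question numerically is to compute the expected sample distribution for a given redundancy distribution and compare the two. The sketch below applies the binomial sampling of the large-data model to each redundancy layer and renormalizes over the information that was actually seen. The function names and the dictionary representation are assumptions, and the code does not itself identify the invariant distribution.

```python
from math import comb

def expected_sample_distribution(a, r):
    """a maps redundancy k to a fraction; returns sample redundancy j -> fraction."""
    out = {}
    for k, ak in a.items():
        for j in range(k + 1):
            # Binomial probability of seeing j of the k copies.
            p = comb(k, j) * r**j * (1 - r) ** (k - j)
            out[j] = out.get(j, 0.0) + ak * p
    return out

def normalized_sample_distribution(a, r):
    """Sample distribution renormalized over information seen at least once."""
    dist = expected_sample_distribution(a, r)
    seen = 1 - dist[0]  # equals the expected unique recall
    return {j: v / seen for j, v in sorted(dist.items()) if j >= 1}

a = {6: 0.2, 3: 0.4, 2: 0.2, 1: 0.2}
print(normalized_sample_distribution(a, 0.5))
```

A distribution is invariant under sampling if the renormalized sample distribution has the same shape as the input.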

Page 16: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling

ruk=k/k

Page 17: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Also, the power law tail breaks in

Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows

Page 18: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 19: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Sampling from Real Data

Incoming links for one domain

Tag distribution on delicious.com

Rule of Thumb 1: not good!

Rule of Thumb 1: perfect!

Rule of Thumb 2: works, but only for small area

Rule of Thumb 2: works!

Page 20: Rules of Thumb  for Information Acquisition from Large and Redundant Data


Theory on Sampling from Power-Laws

Stumpf et al. [PNAS’05]

"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."

This paper

"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law functions remains invariant under sampling too."

Page 21: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Some other related work

Population sampling
• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement

Downey et al. [IJCAI’05]
• urn model to estimate the probability that extracted information is correct
• random sampling with replacement

Ipeirotis et al. [Sigmod’06]
• decision framework to search or to crawl
• random sampling without replacement with known population sizes

Stumpf et al. [PNAS’05]
• show that random subnets sampled from scale-free networks are not scale-free

Bar-Yossef, Gurevich [WWW’06]
• biased methods to sample from search engine's index to estimate index size

Page 22: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Summary 1/2

• A simple model of information acquisition from redundant data
  - no ambiguity
  - random sampling w/o replacement
  - large data
• Full analytic solution

[Diagram: information Au → available data A → retrieved data B → acquired information Bu, via Inf. Dissemination, IR & IE, and Inf. Integration; Inf. Acquisition; recall r, unique recall ru]

[Diagram: redundancy distribution k, normalized redundancy distribution a, sample distribution k; ru(r), ruk(r), ruk=k/k]

Page 23: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Summary 2/2

Unique recall for power-laws (3 different power-laws)
• Rule of thumb 1:
  - 80/20 → 40/20
  - sensitive to exact power-law root

Sampling from power-laws (invariant distribution)
• Rule of thumb 2:
  - power-law core remains invariant

[Figure labels: ruk, r]

http://uniqueRecall.com

Page 24: Rules of Thumb  for Information Acquisition from Large and Redundant Data


backup

Page 25: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Geometric interpretation of k(, k, r)

BACKUP

Page 26: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Information Acquisition from Redundant Data

3 pieces of data, containing 2 pieces of (“unique”) information*

Capurro, Hjørland [ARIST’03]

Data (instances) / Information (concepts)

e.g. words of a corpus: word appearances / vocabulary

e.g. used first names in a group: individual names / different names

e.g. web harvesting: web data / web information

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

*data interpreted as redundant representation of information

