
Rules of Thumb for Information Acquisition from Large and Redundant Data

Version April 21, 2011. Rules of Thumb for Information Acquisition from Large and Redundant Data. Wolfgang Gatterbauer. 33rd European Conference on Information Retrieval (ECIR'11). Database group, University of Washington. http://UniqueRecall.com.
Transcript
Page 1: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Rules of Thumb for Information Acquisition

from Large and Redundant Data

Wolfgang Gatterbauer

http://UniqueRecall.com

Database group

University of Washington

Version April 21, 2011

33rd European Conference on Information Retrieval (ECIR'11)

Page 2: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Information Acquisition from Redundant Data

Pareto principle (80-20 rule): 20% of the causes → 80% of the effect

• e.g. business: clients → sales
• e.g. software: bugs → errors
• e.g. health care: patients → HC resources

Information acquisition: 20% of the data (instances) → 80% of the information (concepts)?

• e.g. words in a corpus: all words → different words
• e.g. used first names: individual names → different names
• e.g. web harvesting: web data → web information

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

Page 3: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Information Acquisition from Redundant Data

[Figure: pipeline from "unique" information (Au) via Information Dissemination to available data (A), via Information Retrieval and Information Extraction to retrieved and extracted data (B), and via Information Integration to acquired information (Bu). The overall process is labeled Information Acquisition. Further labels: redundancy distribution k, recall r, expected sample distribution k, expected unique recall ru.]

Three assumptions

• no ambiguity in the data
• random sampling w/o replacement
• very large data sets

Page 4: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 5: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A Simple Balls-and-Urn Sampling Model

[Figure: bar charts of redundancy ki (axis 1 to 6) over information i (axis 1 to 5), once for the data and once for a sample.]

Data

• Redundancy distribution k = (6,3,3,2,1), where ki is the frequency of the i-th most often appearing information (color)
• # colors: au = 5
• # balls: a = 15

Sample

• Sampled data (# balls): b = 3
• Sampled information (# colors): bu = 2
• Sample redundancy distribution k = (2,1)
• Recall r = 3/15 = 0.2
• Unique recall ru = 2/5 = 0.4
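The urn example can be replayed as a small simulation. This is only a sketch under the slide's assumptions (random sampling without replacement, no ambiguity); the function name `sample_urn` and the fixed seed are illustrative choices, not part of the original deck.

```python
import random
from collections import Counter

def sample_urn(redundancy, b, seed=0):
    """Draw b balls without replacement; return (recall, unique recall, sample redundancy)."""
    # One ball per data item; the color index identifies the piece of information.
    urn = [color for color, k in enumerate(redundancy) for _ in range(k)]
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    drawn = rng.sample(urn, b)
    counts = Counter(drawn)
    recall = b / len(urn)                          # r = b / a
    unique_recall = len(counts) / len(redundancy)  # ru = bu / au
    sample_redundancy = tuple(sorted(counts.values(), reverse=True))
    return recall, unique_recall, sample_redundancy

r, ru, sample_k = sample_urn((6, 3, 3, 2, 1), b=3)
print(r, ru, sample_k)
```

Recall is always 3/15 = 0.2 here, while unique recall depends on which balls are drawn; the slide's outcome, sample redundancy (2,1) with ru = 0.4, is one of the possible draws.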

Page 6: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A model for sampling in the Limit of large data sets

[Figure: the same redundancy distribution drawn for growing data sets, read from a vertical perspective (redundancy of each piece of information) and a horizontal perspective (fraction of information at each redundancy level).]

• a = 15: k = (6,3,3,2,1); vertical perspective: k2-3 = 3; horizontal perspective: a3 = 0.4
• a = 30: k = (6,6,3,3,3,3,2,2,1,1); vertical perspective: k3-6 = 3; horizontal perspective: a3 = 0.4
• a → ∞: normalized redundancy distribution a = (0.2,0.2,0.4,0,0,0.2), with a3 = 0.4
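The step from the vertical to the horizontal perspective can be sketched in a few lines: count how many pieces of information have each redundancy level and divide by the number of pieces. The helper name `normalized_redundancy` is an assumption made for this illustration.

```python
from collections import Counter

def normalized_redundancy(k):
    """Fraction of information at each redundancy level 1..max(k)."""
    counts = Counter(k)
    n = len(k)  # number of pieces of information (colors)
    return [counts.get(level, 0) / n for level in range(1, max(k) + 1)]

small = normalized_redundancy((6, 3, 3, 2, 1))                 # a = 15
large = normalized_redundancy((6, 6, 3, 3, 3, 3, 2, 2, 1, 1))  # a = 30
print(small)
print(large)
```

Both data sets yield the same normalized distribution, which is why the horizontal view stays well defined as the data set grows.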

Page 7: Rules of Thumb  for Information Acquisition from Large and Redundant Data

The Intuition for constant redundancy k

[Figure: information with constant redundancy k = 3, sampled with recall r = 0.5; horizontal axis from 0 to 1.]

Setting: k = const = 3, r = 0.5, $\lim_{a \to \infty}$

Independent sampling with p = r, so the sample redundancy follows a binomial distribution.

Unique recall: $r_u = 1 - (1 - 0.5)^3 = 0.875$

Expected sample distribution, e.g. the fraction of information with sample redundancy 2: $\binom{3}{2}\, 0.5^2\, (1 - 0.5)^1 = 0.375$
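The two numbers on this slide follow directly from the binomial distribution. A minimal sketch, assuming each of the k copies is sampled independently with probability r (the function name is illustrative):

```python
from math import comb

def sample_redundancy_fractions(k, r):
    """Fraction of information seen j times in the sample, for j = 0..k."""
    return [comb(k, j) * r**j * (1 - r) ** (k - j) for j in range(k + 1)]

fractions = sample_redundancy_fractions(3, 0.5)
unique_recall = 1 - fractions[0]  # everything that was seen at least once
print(fractions)      # [0.125, 0.375, 0.375, 0.125]
print(unique_recall)  # 0.875
```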

Page 8: Rules of Thumb  for Information Acquisition from Large and Redundant Data

The Intuition for arbitrary redundancy distributions

[Figure: stratified sampling over redundancy layers k = 1 to 6 with layer fractions $a_1 = 0.2$, $a_2 = 0.2$, $a_3 = 0.4$, $a_6 = 0.2$; horizontal axis from 0 to 1.]

Setting: r = 0.5, $\lim_{a \to \infty}$

Stratified sampling:

$r_u = a_6\,[1-(1-r)^6] + a_3\,[1-(1-r)^3] + a_2\,[1-(1-r)^2] + a_1\,[1-(1-r)^1]$
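The stratified sum can be evaluated for the running example. This sketch assumes the layer fractions from the slide (0.2, 0.2, 0.4, and 0.2 for redundancies 1, 2, 3, and 6); the dictionary layout and function name are illustrative.

```python
def unique_recall(freq, r):
    """Expected unique recall: sum over layers of a_k * (1 - (1 - r)^k)."""
    return sum(a_k * (1 - (1 - r) ** k) for k, a_k in freq.items())

# Redundancy level -> fraction of information with that redundancy.
freq = {1: 0.2, 2: 0.2, 3: 0.4, 6: 0.2}
ru = unique_recall(freq, 0.5)
print(ru)  # approximately 0.797
```

With half of the data sampled, roughly 80% of the information is expected to appear at least once in this example.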

Page 9: Rules of Thumb  for Information Acquisition from Large and Redundant Data

A horizontal Perspective for Sampling

[Figure: panels for au=5, au=20, au=100, and au=1000, sampled with r=0.5. Labels: horizontal layer of redundancy 1=0.8; expected sample redundancy k1=3; bu=800.]

Page 10: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall for arbitrary redundancy distributions
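The formula for this slide did not survive in the transcript. As a sketch only, generalizing the two intuition slides (binomial sampling within each redundancy layer, then summing over layers), and writing $a_k$ for the fraction of information with redundancy $k$, the expected unique recall would read:

```latex
r_u(r) \;=\; \sum_{k \ge 1} a_k \left[\, 1 - (1 - r)^k \,\right]
```

Under the same assumptions, the expected fraction of the original information that appears exactly $j$ times in the sample would be:

```latex
\sum_{k \ge j} a_k \binom{k}{j}\, r^{j} (1 - r)^{k - j}
```

Both expressions reduce to the earlier examples: with $a_3 = 1$ and $r = 0.5$ they give 0.875 and, for $j = 2$, 0.375.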

Page 11: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 12: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Three formulations of Power law distributions

• Zipf [1932], Zipf-Mandelbrot: redundancy ki
• Power-law probability: redundancy frequency ak
• Pareto: complementary cumulative frequency k

References: Stumpf et al. [PNAS’05], Mitzenmacher [Internet Math.’04], Adamic [TR’00], Mitzenmacher [IM’04]

Commonly assumed to be different representations of the same distribution

Page 13: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall with Power laws

For =1

log-log plot

Page 14: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Unique Recall with Power laws

Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information

For =1
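Rule of Thumb 1 can be probed numerically. The sketch below is an illustration under stated assumptions, not the deck's own computation: it assumes a redundancy frequency proportional to k^(-2), truncated at a hypothetical k_max, and applies the layer-wise sum 1 - (1 - r)^k. The exponent and the cutoff are chosen for illustration only, and other exponents give noticeably different values.

```python
def power_law_frequencies(exponent, k_max):
    """Normalized frequencies a_k proportional to k^(-exponent), for k = 1..k_max."""
    weights = [k ** -exponent for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def unique_recall(freqs, r):
    """Expected unique recall for recall r; freqs[0] belongs to redundancy 1."""
    return sum(a_k * (1 - (1 - r) ** k) for k, a_k in enumerate(freqs, start=1))

# Assumed exponent and cutoff; neither value is taken from the slides.
freqs = power_law_frequencies(exponent=2.0, k_max=100_000)
ru = unique_recall(freqs, r=0.2)
print(round(ru, 3))
```

For this particular choice the expected unique recall at r = 0.2 stays below 0.4, in line with the rule of thumb.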

Page 15: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Invariants under Sampling

Given our model: Which redundancy distribution remains invariant under sampling?

Page 16: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling

ruk=k/k

Page 17: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Also, the power law tail breaks in

Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows

Page 18: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion

Page 19: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Sampling from Real Data

Incoming links for one domain

Tag distribution on delicious.com

Rule of Thumb 1: not good!

Rule of Thumb 1: perfect!

Rule of Thumb 2: works, but only for small area

Rule of Thumb 2: works!

Page 20: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Theory on Sampling from Power-Laws

Stumpf et al. [PNAS’05]

"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."

This paper

"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law functions remains invariant under sampling too."

Page 21: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Some other related work

Population sampling

• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement

Downey et al. [IJCAI’05]

• urn model to estimate the probability that extracted information is correct
• random sampling with replacement

Ipeirotis et al. [Sigmod’06]

• decision framework to search or to crawl
• random sampling without replacement with known population sizes

Stumpf et al. [PNAS’05]

• show that random subnets sampled from scale-free networks are not scale-free

Bar-Yossef, Gurevich [WWW’06]

• biased methods to sample from search engine's index to estimate index size

Page 22: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Summary 1/2

• A simple model of information acquisition from redundant data
  - no ambiguity in the data
  - random sampling w/o replacement
  - large data
• Full analytic solution

[Figure: pipeline from information (Au) via Inf. Dissemination to available data (A), via IR & IE to retrieved data (B), and via Inf. Integration to acquired information (Bu); the overall process is labeled Inf. Acquisition. Further labels: recall r, unique recall ru, redundancy distribution k, normalized redundancy distribution a, sample distribution k, ru(r), ruk(r), ruk=k/k.]

Page 23: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Summary 2/2

• Rule of thumb 1: 80/20 → 40/20
  - sensitive to exact power-law root
• Rule of thumb 2: power-law core remains invariant

[Figure panels: unique recall for power-laws; 3 different power-laws; sampling from power-laws; invariant distribution (ruk, r).]

http://uniqueRecall.com

Page 24: Rules of Thumb  for Information Acquisition from Large and Redundant Data

backup

Page 25: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Geometric interpretation of k(, k, r)

BACKUP

Page 26: Rules of Thumb  for Information Acquisition from Large and Redundant Data

Information Acquisition from Redundant Data

3 pieces of data, containing 2 pieces of (“unique”) information*

Capurro, Hjørland [ARIST’03]

• e.g. words of a corpus: word appearances / vocabulary
• e.g. used first names in a group: individual names / different names
• e.g. web harvesting: web data / web information

Data (instances)

Information (concepts)

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

*data interpreted as redundant representation of information

