Rules of Thumb for Information Acquisition from Large and Redundant Data

Wolfgang Gatterbauer

http://UniqueRecall.com

Database group, University of Washington

Version April 21, 2011

33rd European Conference on Information Retrieval (ECIR'11)

Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com

Information Acquisition from Redundant Data

Pareto principle (80-20 rule): 20% causes → 80% effect

e.g. business: clients → sales
e.g. software: bugs → errors
e.g. health care: patients → HC resources

Information acquisition: 20% data (instances) → 80% information (concepts)?

e.g. words in a corpus: all words → different words
e.g. used first names: individual names → different names
e.g. web harvesting: web data → web information

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

Information Acquisition from Redundant Data

[Figure: Au, “unique” information → (Information Dissemination) → A, available data → (Information Retrieval, Information Extraction) → B, retrieved and extracted data → (Information Integration) → Bu, acquired information. The chain as a whole is Information Acquisition. Quantities: redundancy distribution k, recall r, expected sample distribution k, expected unique recall ru.]

Three assumptions
• no ambiguity in data
• random sampling w/o replacement
• very large data sets


Outline

• A horizontal sampling model
• The role of power-laws
• Real data & Discussion


A Simple Balls-and-Urn Sampling Model

[Figure: an urn with a=15 balls (data) in au=5 colors (information), shown with a bar chart of redundancy ki over information i = 1..5. The redundancy ki is the frequency of the i-th most often appearing information (color).]

Redundancy distribution: k=(6,3,3,2,1)

Sampled data (# balls): b=3
Sampled information (# colors): bu=2
Recall: r=3/15=0.2
Unique recall: ru=2/5=0.4
Sample redundancy distribution: k=(2,1)
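The quantities on this slide can be reproduced with a small simulation. The sketch below assumes the urn k=(6,3,3,2,1) from the slide; the function names `build_urn` and `sample_stats` and the seeded `random.Random` are illustrative choices, not part of the original deck.

```python
import random
from collections import Counter

def build_urn(k):
    """One ball per data item; a ball's color is the index of its information."""
    return [color for color, redundancy in enumerate(k, start=1)
            for _ in range(redundancy)]

def sample_stats(urn, b, rng):
    """Draw b balls without replacement and summarize the sample."""
    sample = rng.sample(urn, b)
    counts = Counter(sample)          # color -> number of sampled balls
    a, a_u = len(urn), len(set(urn))  # total balls, total colors
    return {
        "recall": b / a,
        "unique_recall": len(counts) / a_u,
        "sample_redundancy": sorted(counts.values(), reverse=True),
    }

urn = build_urn((6, 3, 3, 2, 1))
stats = sample_stats(urn, b=3, rng=random.Random(0))
print(stats)
```

Recall is fixed by the sample size (3/15), while unique recall and the sample redundancy distribution vary from draw to draw. The slide's values ru=0.4 and k=(2,1) are one possible outcome.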


A model for sampling in the Limit of large data sets

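[Figure: the urn from the previous slide scaled up. k=(6,3,3,2,1) with a=15, then k=(6,6,3,3,3,3,2,2,1,1) with a=30, and so on to the limit a→∞.]

Vertical perspective: redundancy per piece of information, k2-3=3 for a=15 and k3-6=3 for a=30.

Horizontal perspective: fraction of information per redundancy, a=(0.2,0.2,0.4,0,0,0.2); a3=0.4 for a=15, for a=30, and in the limit.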
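A minimal sketch of the horizontal perspective, assuming the two urns from the slide. The helper name `normalized_redundancy` is hypothetical. It shows that the fractions per redundancy level do not change when the urn is scaled from a=15 to a=30.

```python
from collections import Counter
from fractions import Fraction

def normalized_redundancy(k):
    """Fractions a_1..a_max of information that has redundancy exactly 1..max."""
    counts = Counter(k)   # redundancy -> number of pieces of information
    a_u = len(k)          # number of pieces of information
    return [Fraction(counts.get(j, 0), a_u) for j in range(1, max(k) + 1)]

small = (6, 3, 3, 2, 1)                    # a = 15
doubled = (6, 6, 3, 3, 3, 3, 2, 2, 1, 1)   # a = 30

print([float(x) for x in normalized_redundancy(small)])
# [0.2, 0.2, 0.4, 0.0, 0.0, 0.2]
print(normalized_redundancy(small) == normalized_redundancy(doubled))
# True
```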


The Intuition for constant redundancy k

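[Figure: constant redundancy k=const=3, sampled at r=0.5; axis from 0 to 1; sample redundancies 1, 2, 3; unique recall ru.]

In the limit a→∞: independent sampling with p=r.

Unique recall: $r_u = 1 - (1 - 0.5)^3 = 0.875$

Expected sample distribution (binomial distribution), e.g. the fraction of information with sample redundancy 2: $\binom{3}{2}\, 0.5^2\, (1 - 0.5)^1 = 0.375$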
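The two numbers on this slide follow directly from the binomial distribution. The sketch below assumes independent sampling with p=r, as stated on the slide; the function names are illustrative.

```python
from math import comb

def sample_fraction(k, j, r):
    """Fraction of information with redundancy k that appears exactly j times
    in the sample, assuming each copy is sampled independently with p = r."""
    return comb(k, j) * r**j * (1 - r)**(k - j)

def unique_recall_constant(k, r):
    """Fraction of information seen at least once when every redundancy is k."""
    return 1 - (1 - r)**k

print(unique_recall_constant(3, 0.5))  # 0.875
print(sample_fraction(3, 2, 0.5))      # 0.375
```

Unique recall is the complement of the j=0 case: `1 - sample_fraction(3, 0, 0.5)` gives the same 0.875.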


The Intuition for arbitrary redundancy distributions

[Figure: stratified sampling at r=0.5 in the limit a→∞. Redundancy k from 1 to 6 with layers a1=0.2, a2=0.2, a3=0.4, a6=0.2; horizontal axis from 0 to 1.]

Stratified sampling:

$r_u = a_6\,[1-(1-r)^6] + a_3\,[1-(1-r)^3] + a_2\,[1-(1-r)^2] + a_1\,[1-(1-r)^1]$
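A short sketch of the stratified computation, assuming the layer fractions from the slide (a1=0.2, a2=0.2, a3=0.4, a6=0.2). Each redundancy level is treated as its own stratum with the constant-redundancy formula, and the strata are weighted by their fractions. The dictionary layout is an illustrative choice.

```python
def unique_recall(a, r):
    """a maps redundancy k to the fraction a_k of information with that redundancy."""
    return sum(a_k * (1 - (1 - r)**k) for k, a_k in a.items())

a = {1: 0.2, 2: 0.2, 3: 0.4, 6: 0.2}
print(round(unique_recall(a, 0.5), 4))  # 0.7969
```

For this example, sampling half of the data is expected to reveal roughly 80% of the information.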


A horizontal Perspective for Sampling

[Figure: panels for au=5, au=20, au=100 and au=1000, sampled at r=0.5. Labels: horizontal layer of redundancy 1 = 0.8; expected sample redundancy k1=3; bu=800.]


Unique Recall for arbitrary redundancy distributions
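The formula of this slide did not survive extraction. As a hedged reconstruction from the two "Intuition" slides (binomial sampling per redundancy level, then a weighted sum over levels), and writing $a_k$ for the fraction of information with redundancy $k$, the general form would be:

```latex
r_u(r) \;=\; \sum_{k \ge 1} a_k \left[ 1 - (1-r)^k \right],
\qquad \sum_{k \ge 1} a_k = 1 .
```

Under the same assumptions, the expected fraction of all information that appears exactly $j \ge 1$ times in the sample is $\sum_{k \ge j} a_k \binom{k}{j} r^j (1-r)^{k-j}$.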



Outline



Three formulations of Power law distributions

Three formulations, by the quantity that follows the power law:

• Zipf-Mandelbrot (Zipf [1932]): redundancy ki
• Power-law probability: redundancy frequency ak
• Pareto: complementary cumulative frequency

References on the slide: Stumpf et al. [PNAS’05], Mitzenmacher [Internet Math.’04], Adamic [TR’00]

Commonly assumed to be different representations of the same distribution
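To make the relation between the frequency view and the complementary cumulative view concrete, here is a minimal sketch. It uses the small non-power-law example k=(6,3,3,2,1) from the urn slides rather than an actual power law, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

def complementary_cumulative(a):
    """a[k-1] is the fraction of information with redundancy exactly k.
    Returns the fractions with redundancy >= k (suffix sums of a)."""
    return list(accumulate(reversed(a)))[::-1]

# Frequency view of k=(6,3,3,2,1): a_1=0.2, a_2=0.2, a_3=0.4, a_6=0.2
a = [Fraction(1, 5), Fraction(1, 5), Fraction(2, 5), 0, 0, Fraction(1, 5)]
print([float(x) for x in complementary_cumulative(a)])
# [1.0, 0.8, 0.6, 0.2, 0.2, 0.2]
```

Reading the output against the rank view: 80% of the information (4 of 5 colors) has redundancy at least 2, and 20% (1 of 5) has redundancy at least 4.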


Unique Recall with Power laws

[Figure: log-log plot for power-law exponent = 1]


Unique Recall with Power laws


Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information

For power-law exponent = 1
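Rule of Thumb 1 can be checked numerically for one of the three formulations. The sketch below assumes the power-law probability form with redundancy frequency a_k proportional to k^-2, which is taken here as the counterpart of a Zipf exponent of 1, and it truncates the distribution at a hypothetical `k_max`. The other two formulations give different values, so this illustrates the rule rather than proving it.

```python
def power_law_frequencies(exponent, k_max):
    """Normalized frequencies a_k proportional to k**(-exponent), k = 1..k_max."""
    weights = [k ** -exponent for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]   # entry k-1 holds a_k

def unique_recall(a, r):
    """Expected unique recall for recall r, with independent sampling (p = r)."""
    return sum(a_k * (1 - (1 - r) ** k) for k, a_k in enumerate(a, start=1))

a = power_law_frequencies(exponent=2.0, k_max=100_000)
print(round(unique_recall(a, 0.2), 2))  # 0.35
```

Under these assumptions, sampling 20% of the data yields about 35% of the information, which is below the 40% bound stated in the rule.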


Invariants under Sampling


Given our model: Which redundancy distribution remains invariant under sampling?


Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling

ruk=k/k
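The invariance claim can be illustrated numerically. The slide's own formula is not recoverable from the text, so the sketch below uses one family that is known to be invariant under independent sampling, a_k = (-1)^(k-1) * binom(tau, k) with 0 < tau < 1, whose tail decays like k^-(1+tau). Treat the choice of family, the parameter name `tau` and the truncation `k_max` as assumptions made for illustration.

```python
from math import comb

def candidate_family(tau, k_max):
    """a_k = (-1)^(k-1) * binom(tau, k) for k = 1..k_max, built by recurrence."""
    a = [tau]
    for k in range(1, k_max):
        a.append(a[-1] * (k - tau) / (k + 1))
    return a  # a[k-1] holds a_k

def sample_shape(a, r, j_max):
    """Expected redundancy frequencies among information seen at least once."""
    seen = []
    for j in range(1, j_max + 1):
        seen.append(sum(a[k - 1] * comb(k, j) * r**j * (1 - r)**(k - j)
                        for k in range(j, len(a) + 1)))
    # Assumes the untruncated a_k sum to 1, so the unseen mass is sum_k a_k (1-r)^k.
    unseen = sum(a_k * (1 - r)**k for k, a_k in enumerate(a, start=1))
    r_u = 1 - unseen
    return [w / r_u for w in seen], r_u

a = candidate_family(tau=0.5, k_max=500)
shape, r_u = sample_shape(a, r=0.5, j_max=3)
print(a[:3], shape, r_u)
```

For this family the normalized sample frequencies match the original a_1, a_2, a_3, and unique recall comes out as r^tau.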


Also, the power law tail breaks in


Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows


Outline



Sampling from Real Data

Incoming links for one domain

Tag distribution on delicious.com

Rule of Thumb 1: not good!

Rule of Thumb 1: perfect!

Rule of Thumb 2: works, but only for small area

Rule of Thumb 2: works!


Theory on Sampling from Power-Laws

Stumpf et al. [PNAS’05]

"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."

This paper

"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law functions remains invariant under sampling too."


Some other related work

Downey et al. [IJCAI’05]
• urn model to estimate the probability that extracted information is correct
• random sampling with replacement

Stumpf et al. [PNAS’05]
• show that random subnets sampled from scale-free networks are not scale-free

Ipeirotis et al. [Sigmod’06]
• decision framework to search or to crawl
• random sampling without replacement with known population sizes

Population sampling
• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement

Bar-Yossef, Gurevich [WWW’06]
• biased methods to sample from search engine's index to estimate index size


Summary 1/2

• A simple model of information acquisition from redundant data
  - no ambiguity
  - random sampling w/o replacement
  - large data
• Full analytic solution

[Figure: Au (information) → Inf. Dissemination → A (available data) → IR & IE → B (retrieved data) → Inf. Integration → Bu (acquired information); the whole chain is Inf. Acquisition. Labels: recall r, unique recall ru, redundancy distribution k, normalized redundancy distribution a, sample distribution k, ru(r), ruk(r), ruk=k/k.]


Summary 2/2

Unique recall for power-laws (3 different power-laws)

• Rule of thumb 1:
  - 80/20 → 40/20
  - sensitive to exact power-law root

Sampling from power-laws (invariant distribution)

• Rule of thumb 2:
  - power-law core remains invariant

(plot labels: ruk, r)

http://uniqueRecall.com

backup


Geometric interpretation of k(, k, r)


BACKUP


Information Acquisition from Redundant Data

3 pieces of data, containing 2 pieces of (“unique”) information*

Data (instances)

Information (concepts)

e.g. words of a corpus: word appearances / vocabulary

e.g. used first names in a group: individual names / different names

e.g. web harvesting: web data / web information

Motivating question: Can we learn 80% of the information by looking at only 20% of the data?

*data interpreted as redundant representation of information

Capurro, Hjørland [ARIST’03]
