Rules of Thumb for Information Acquisition
from Large and Redundant Data
Wolfgang Gatterbauer
http://UniqueRecall.com
Database group
University of Washington
Version April 21, 2011
33rd European Conference on Information Retrieval (ECIR'11)
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Information Acquisition from Redundant Data
Pareto principle (80-20 rule): 20% of the causes produce 80% of the effect
• e.g. business: clients → sales
• e.g. software: bugs → errors
• e.g. health care: patients → HC resources
Information acquisition: do 20% of the data (instances) yield 80% of the information (concepts)?
• e.g. words in a corpus: all words → different words
• e.g. used first names: individual names → different names
• e.g. web harvesting: web data → web information
Motivating question: Can we learn 80% of the information by looking at only 20% of the data?
Information Acquisition from Redundant Data
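[Figure: information acquisition pipeline. "Unique" information (A_u) becomes available data (A) through information dissemination; information retrieval and information extraction yield the retrieved and extracted data (B); information integration yields the acquired information (B_u). Information acquisition spans the path from available data to acquired information.]
Inputs: redundancy distribution k, recall r
Outputs: expected sample distribution, expected unique recall r_u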
Three assumptions
• no ambiguity in data
• random sampling w/o replacement
• very large data sets
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
A Simple Balls-and-Urn Sampling Model
[Figure: an urn of balls in different colors, with a bar chart of redundancy k_i over information i = 1..5, where k_i is the frequency of the i-th most often appearing information (color).]
Redundancy distribution: k = (6, 3, 3, 2, 1)
Data: a = 15 balls; information: a_u = 5 colors
Sampled data: b = 3 balls; sampled information: b_u = 2 colors
Recall: r = b/a = 3/15 = 0.2
Unique recall: r_u = b_u/a_u = 2/5 = 0.4
Sample redundancy distribution: k = (2, 1)
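The urn example can be replayed in a few lines of Python. This is a minimal sketch, not code from the talk: the function names `sample_urn` and `mean_unique_recall`, the seed, and the number of trials are all assumptions made for illustration.

```python
import random
from collections import Counter

def sample_urn(redundancy, b, rng):
    """Draw b balls without replacement from an urn described by `redundancy`.

    Returns (recall, unique recall, sample redundancy distribution).
    """
    # One ball per occurrence: color i appears redundancy[i] times.
    balls = [color for color, k in enumerate(redundancy) for _ in range(k)]
    drawn = rng.sample(balls, b)                      # sampling without replacement
    counts = Counter(drawn)                           # sampled color -> times drawn
    recall = b / len(balls)                           # r = b / a
    unique_recall = len(counts) / len(redundancy)     # r_u = b_u / a_u
    sample_distribution = tuple(sorted(counts.values(), reverse=True))
    return recall, unique_recall, sample_distribution

def mean_unique_recall(redundancy, b, trials=20000, seed=0):
    """Monte Carlo estimate of the expected unique recall (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(sample_urn(redundancy, b, rng)[1] for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    k = (6, 3, 3, 2, 1)
    print(sample_urn(k, 3, random.Random(1)))
    print(mean_unique_recall(k, 3))
```

The draw on the slide (two colors, sample distribution (2, 1), r_u = 0.4) is one possible outcome. Averaged over many draws, the unique recall for this small urn comes out at roughly 0.48.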
A model for sampling in the Limit of large data sets
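[Figure: the same redundancy distribution at growing data sizes: k = (6, 3, 3, 2, 1) for a = 15, k = (6, 6, 3, 3, 3, 3, 2, 2, 1, 1) for a = 30, and so on in the limit a → ∞.]
Normalized redundancy distribution: α = (0.2, 0.2, 0.4, 0, 0, 0.2)
Vertical perspective: k_{2-3} = 3 for a = 15, k_{3-6} = 3 for a = 30
Horizontal perspective: α_3 = 0.4 at every data size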
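The step from a redundancy distribution to its normalized form can be sketched as follows. The function name `normalized_redundancy` is an assumption for illustration; the two inputs are the distributions shown for a = 15 and a = 30.

```python
from collections import Counter

def normalized_redundancy(redundancy):
    """Return alpha, where alpha[k - 1] is the fraction of information
    items that appear exactly k times in the data."""
    counts = Counter(redundancy)          # redundancy value -> number of items
    n_items = len(redundancy)             # a_u, the number of distinct items
    return [counts.get(k, 0) / n_items for k in range(1, max(redundancy) + 1)]

if __name__ == "__main__":
    print(normalized_redundancy((6, 3, 3, 2, 1)))
    print(normalized_redundancy((6, 6, 3, 3, 3, 3, 2, 2, 1, 1)))
```

Both inputs give the same normalized distribution, which is why the horizontal perspective is independent of the data size.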
The Intuition for constant redundancy k
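[Figure: unique recall r_u over recall r from 0 to 1 for constant redundancy k = const = 3, with the expected sample distribution (sample redundancy 1, 2, 3) marked at r = 0.5.]
In the limit a → ∞, sampling becomes independent sampling with p = r, so the sample redundancy follows a binomial distribution.
Unique recall at r = 0.5: r_u = 1 - (1 - 0.5)^3 = 0.875
Expected fraction of information with sample redundancy 2 at r = 0.5: (3 choose 2) · 0.5^2 · (1 - 0.5)^1 = 0.375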
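The two numbers for constant redundancy can be checked with a short sketch. The function names are assumptions; the formulas are the binomial expressions stated for independent sampling with p = r.

```python
from math import comb

def sample_distribution_const(k, r):
    """Probability that an item with redundancy k appears exactly j times
    in the sample, for j = 0..k, under independent sampling with p = r."""
    return [comb(k, j) * r**j * (1 - r)**(k - j) for j in range(k + 1)]

def unique_recall_const(k, r):
    """Probability that an item with redundancy k appears at least once."""
    return 1 - (1 - r)**k

if __name__ == "__main__":
    print(sample_distribution_const(3, 0.5))   # entry j = 2 is 0.375
    print(unique_recall_const(3, 0.5))         # 0.875
```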
The Intuition for arbitrary redundancy distributions
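[Figure: horizontal layers of redundancy k = 1..6 with widths α_k, stacked over the interval from 0 to 1: α_6 = 0.2, α_3 = 0.4, α_2 = 0.2, α_1 = 0.2.]
Stratified sampling in the limit a → ∞:
r_u = α_6 [1 - (1 - r)^6] + α_3 [1 - (1 - r)^3] + α_2 [1 - (1 - r)^2] + α_1 [1 - (1 - r)^1]
For r = 0.5: r_u ≈ 0.797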
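The stratified sum can be evaluated directly. This is a minimal sketch: the function name `unique_recall` and the dictionary representation of the normalized redundancy distribution are assumptions for illustration.

```python
def unique_recall(alpha, r):
    """Expected unique recall for a normalized redundancy distribution.

    alpha maps a redundancy k to the fraction of information with that
    redundancy; each layer contributes alpha_k * (1 - (1 - r)^k).
    """
    return sum(a_k * (1 - (1 - r)**k) for k, a_k in alpha.items())

if __name__ == "__main__":
    alpha = {1: 0.2, 2: 0.2, 3: 0.4, 6: 0.2}
    print(unique_recall(alpha, 0.5))   # about 0.797
```

Each layer is treated as its own constant-redundancy problem, and the layers are weighted by their share of the information.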
A horizontal Perspective for Sampling
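[Figure: the horizontal perspective shown in four panels with a_u = 5, a_u = 20, a_u = 100, and a_u = 1000 pieces of information, sampled at r = 0.5. Annotations: horizontal layer of redundancy (0.8), expected sample redundancy k_1 = 3, b_u = 800.]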
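How quickly the large-data limit is approached can be checked by comparing exact sampling without replacement against the limit formula. This is a sketch under stated assumptions: the function names are hypothetical, the base distribution (6, 3, 3, 2, 1) is reused from the urn example, and the sample size is taken as the integer part of r · a.

```python
from math import comb

def exact_unique_recall(redundancy, b):
    """Expected unique recall when drawing b of a items without replacement.
    An item with redundancy k is missed with probability C(a-k, b) / C(a, b)."""
    a = sum(redundancy)
    total = comb(a, b)
    hit = sum(1 - comb(a - k, b) / total for k in redundancy)
    return hit / len(redundancy)

def limit_unique_recall(redundancy, r):
    """Large-data limit: every copy is sampled independently with p = r."""
    return sum(1 - (1 - r)**k for k in redundancy) / len(redundancy)

if __name__ == "__main__":
    base, r = (6, 3, 3, 2, 1), 0.5
    for copies in (1, 4, 20, 200):          # a_u = 5, 20, 100, 1000
        data = base * copies
        b = int(r * sum(data))
        print(len(data), exact_unique_recall(data, b), limit_unique_recall(data, r))
```

For a_u = 1000 the exact value and the limit differ by less than 0.001, which is consistent with reading about 800 of 1000 pieces of information at r = 0.5.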
Unique Recall for arbitrary redundancy distributions
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
Three formulations of Power law distributions
• Zipf [1932], Zipf-Mandelbrot: redundancy k_i as a function of rank i
• Power-law probability: redundancy frequency α_k
• Pareto: complementary cumulative frequency
Commonly assumed to be different representations of the same distribution
References: Stumpf et al. [PNAS’05]; Mitzenmacher [Internet Math.’04]; Adamic [TR’00]; Mitzenmacher [IM’04]
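The three views can be computed from one list of redundancies. This is a rough sketch: the function names, the exponent name `gamma`, and the rounding used to obtain integer redundancies are all assumptions for illustration, not definitions taken from the slides.

```python
from collections import Counter

def zipf_redundancies(n_items, k_max, gamma=1.0):
    """Rank view: k_i proportional to i^(-gamma), rounded to an integer >= 1."""
    return [max(1, round(k_max / i**gamma)) for i in range(1, n_items + 1)]

def redundancy_frequency(redundancy):
    """Frequency view: fraction of items with redundancy exactly k."""
    n = len(redundancy)
    return {k: c / n for k, c in sorted(Counter(redundancy).items())}

def complementary_cumulative(redundancy):
    """Cumulative view: fraction of items with redundancy at least k."""
    n = len(redundancy)
    return {k: sum(1 for x in redundancy if x >= k) / n
            for k in sorted(set(redundancy))}

if __name__ == "__main__":
    k = (6, 3, 3, 2, 1)
    print(redundancy_frequency(k))
    print(complementary_cumulative(k))
    print(zipf_redundancies(10, 20))
```

All three functions describe the same data, but a power law in one view does not translate into an identical functional form in the others once redundancies are discrete.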
Unique Recall with Power laws
[Figure: unique recall for power-law redundancy distributions, log-log plot, for exponent = 1.]
Rule of Thumb 1: When sampling 20% of the data from a power-law distribution, we expect to learn less than 40% of the information (for exponent = 1).
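A numeric check of this rule is possible under one assumed discretization of a Zipf distribution. Everything specific here is an assumption of the sketch: the rounding k_i = round(k_max / i), the cutoff at redundancy 1, the value of k_max, and the function names. Other discretizations of the same power law give different values.

```python
def rounded_zipf(k_max):
    """Assumed Zipf form with exponent 1: k_i = round(k_max / i),
    keeping every rank i whose rounded redundancy is at least 1."""
    redundancy = []
    i = 1
    while True:
        k = round(k_max / i)
        if k < 1:
            break
        redundancy.append(k)
        i += 1
    return redundancy

def unique_recall_from_items(redundancy, r):
    """Large-data limit: an item with redundancy k is seen with
    probability 1 - (1 - r)^k; average over all items."""
    return sum(1 - (1 - r)**k for k in redundancy) / len(redundancy)

if __name__ == "__main__":
    k = rounded_zipf(1000)
    print(unique_recall_from_items(k, 0.2))   # stays below 0.4
```

Under this assumption most items have redundancy 1, so sampling 20% of the data leaves most of them unseen, and the 80-20 expectation is far out of reach.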
Invariants under Sampling
Given our model: Which redundancy distribution remains invariant under sampling?
The invariant is a power law! Hence, the tail of all power laws remains invariant under sampling.
ruk=k/k
Also, the power law tail breaks in
Rule of Thumb 2: When sampling data from a power law, the core of the sample distribution remains invariant.
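One way to probe this rule numerically is to compute the expected sample distribution layer by layer, using the binomial model for independent sampling with p = r, and to compare log-log slopes. This is a sketch under assumptions: the function names, the exponent name `beta`, the truncation at `k_max`, and the choice of sample redundancies at which the slope is measured are all illustrative.

```python
from math import exp, lgamma, log

def binomial_pmf(k, j, r):
    """P(exactly j of k copies are sampled), computed in log space."""
    log_p = (lgamma(k + 1) - lgamma(j + 1) - lgamma(k - j + 1)
             + j * log(r) + (k - j) * log(1 - r))
    return exp(log_p)

def sample_frequency(alpha, j, r):
    """Expected fraction of information with sample redundancy j."""
    return sum(a_k * binomial_pmf(k, j, r) for k, a_k in alpha.items() if k >= j)

def power_law(beta, k_max):
    """Redundancy frequency alpha_k proportional to k^(-beta), k = 1..k_max."""
    weights = {k: k ** -beta for k in range(1, k_max + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def local_slope(alpha, j1, j2, r):
    """Log-log slope of the sample distribution between j1 and j2."""
    ratio = sample_frequency(alpha, j2, r) / sample_frequency(alpha, j1, r)
    return log(ratio) / log(j2 / j1)

if __name__ == "__main__":
    alpha = power_law(2.0, 5000)
    print(local_slope(alpha, 40, 80, 0.5))   # close to -2
    print(local_slope(alpha, 1, 2, 0.5))     # visibly steeper than -2
```

With these settings the slope between sample redundancies 40 and 80 stays close to the original exponent, while the slope at the lowest sample redundancies deviates from it.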
Outline
• A horizontal sampling model
• The role of power-laws
• Real data & Discussion
Sampling from Real Data
Incoming links for one domain
• Rule of Thumb 1: not good!
• Rule of Thumb 2: works, but only for small area
Tag distribution on delicious.com
• Rule of Thumb 1: perfect!
• Rule of Thumb 2: works!
Theory on Sampling from Power-Laws
Stumpf et al. [PNAS’05]
"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."
This paper
"Here, we show that there is one power-law family that is invariant under sampling, and that the core of other power-law functions remains invariant under sampling too."
Some other related work
Population sampling
• goal: estimate size of population
• e.g. mark-recapture animal sampling
• sampling of small fraction w/ replacement
Downey et al. [IJCAI’05]
• urn model to estimate the probability that extracted information is correct
• random sampling with replacement
Ipeirotis et al. [Sigmod’06]
• decision framework to search or to crawl
• random sampling without replacement with known population sizes
Stumpf et al. [PNAS’05]
• show that random subnets sampled from scale-free networks are not scale-free
Bar-Yossef, Gurevich [WWW’06]
• biased methods to sample from search engine's index to estimate index size
Summary 1/2
• A simple model of information acquisition from redundant data
- no ambiguity
- random sampling w/o replacement
- large data
[Figure: A_u (information) → A (available data) → B (retrieved data) → B_u (acquired information), via inf. dissemination, IR & IE, and inf. integration; inf. acquisition relates recall r to unique recall r_u.]
• Full analytic solution
- redundancy distribution k
- normalized redundancy distribution α
- r_u(r)
- r_uk(r)
ruk=k/k
Sample distribution k
Summary 2/2
Unique recall for power-laws (3 different power-laws)
• Rule of thumb 1:
- 80/20 → 40/20
- sensitive to exact power-law root
Sampling from power-laws (invariant distribution; r_uk over r)
• Rule of thumb 2:
- power-law core remains invariant
http://uniqueRecall.com
backup
Geometric interpretation of k(, k, r)
Information Acquisition from Redundant Data
3 pieces of data, containing 2 pieces of (“unique”) information*
Data (instances) / Information (concepts)
• e.g. words of a corpus: word appearances / vocabulary
• e.g. used first names in a group: individual names / different names
• e.g. web harvesting: web data / web information
Motivating question: Can we learn 80% of the information by looking at only 20% of the data?
*data interpreted as redundant representation of information
Capurro, Hjørland [ARIST’03]