Date post: | 14-Apr-2017 |
Category: |
Data & Analytics |
Upload: | yasanka-sameera-horawalavithana |
1
An Efficient incremental indexing mechanism for extracting Top-k
representative queries over continuous data streams
Y.S. Horawalavithana, D.N. Ranasinghe
Adaptive and Reflective Middleware (ARM) ACM/IFIP/USENIX Middleware
Vancouver, BC, Canada, December 08, 2015
University of Colombo School of Computing, Sri Lanka
2
Overview
• Motivation
• Adaptive Diversification
• Incremental Top-k
• Evaluation
• Conclusion
• Future work
3
4
Diversity: Top-k representative set
Representative Top-k
Drawback (without diversity)
What we want (with diversity)
Method to retrieve Top-k publications from matching publications
1.Motivation 2.Adaptive Diversification 3.Incremental Top-k 4.Evaluation 5.Conclusion 6.Future Work
5
Minimum independent-dominating set
[Figure: publication space with publications p1 to p5 and threshold α, and the corresponding graph model with vertices v1 to v5.]
Neighborhood: $N(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$
[Figure: four candidate vertex sets on the graph model. The first three are independent and dominating; the fourth is dominating but not independent.]
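The two properties named on this slide can be checked mechanically. The sketch below is illustrative only: the 1-D coordinates, the absolute-difference distance, the threshold value, and all function names are assumptions, not taken from the talk. It builds the graph model from a publication space and finds a minimum independent dominating set by exhaustive search.

```python
from itertools import combinations

def neighborhood_graph(points, dist, alpha):
    """Link two publications when their distance is at most alpha (assumed rule)."""
    adj = {name: set() for name in points}
    for a, b in combinations(points, 2):
        if dist(points[a], points[b]) <= alpha:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def is_independent(adj, chosen):
    """No two chosen vertices are neighbors."""
    return all(b not in adj[a] for a, b in combinations(chosen, 2))

def is_dominating(adj, chosen):
    """Every vertex is either chosen or has a chosen neighbor."""
    chosen = set(chosen)
    return all(v in chosen or bool(adj[v] & chosen) for v in adj)

def minimum_independent_dominating_set(adj):
    """Exhaustive search, smallest sets first; only practical for tiny graphs."""
    for size in range(1, len(adj) + 1):
        for subset in combinations(sorted(adj), size):
            if is_independent(adj, subset) and is_dominating(adj, subset):
                return set(subset)
    return set()

# Hypothetical 1-D publication space: five points one unit apart.
points = {"v1": 0.0, "v2": 1.0, "v3": 2.0, "v4": 3.0, "v5": 4.0}
adj = neighborhood_graph(points, lambda a, b: abs(a - b), alpha=1.0)
best = minimum_independent_dominating_set(adj)
```

The exhaustive search is exponential in the number of vertices, which is why a greedy heuristic or an index is needed for realistic streams.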
6
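NAÏVE Greedy
[Equation: an argmax over publications p_i of a score that sums a product of two terms over publications p_j; the exact expression did not survive extraction.]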
7
Handling streaming publications
[Figure: publication space with publications p1 to p5 and threshold α; a new publication p6 arrives and adds vertex v6 to the graph model of v1 to v5.]
Continuity requirements
1. Durability: an item selected as diverse in one window may still be selected in the next window, provided it has not expired and the other valid items in that window fail to outcompete it.
2. Order: the publication stream follows chronological order. Item j is not selected as diverse later if an item i that is not older than j has already been selected.
8
Adaptive Diversification
[Figure: the matching publication stream p1, p2, p3, p4, ... with the ith window and the (i+1)th window, each producing its own result set.]
Requirements: Independence, Dominance, Durability, Order
Straightforward solution: apply the naïve greedy method at each instance.
Proposal: an incremental index mechanism that avoids the curse of re-calculating neighborhoods.
9
Locality Sensitive Hashing (LSH)
Simple idea: if two points are close together, then after a "projection" operation these two points will remain close together.
10
LSH in Adaptive Diversification: Publications as categorical data
11
LSH in Adaptive Diversification: Characteristic Matrix
12
LSH in Adaptive Diversification: Minhashing
No publications any more! A signature represents each publication.
Technique: randomly permute the rows of the characteristic matrix m times. For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1.
[Figure: first permutation of rows of the characteristic matrix]
Advantage: reduces the dimensions to a small minhash signature.
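The permutation technique described on this slide can be sketched as follows. This is a minimal illustration, assuming each publication is a non-empty set of categorical attribute values; the attribute strings, function names, and seed are hypothetical and not taken from the talk.

```python
import random

def characteristic_matrix(publications):
    """Rows are attribute values, columns are publications (1 = value present)."""
    rows = sorted(set().union(*publications.values()))
    cols = sorted(publications)
    matrix = [[1 if row in publications[c] else 0 for c in cols] for row in rows]
    return rows, cols, matrix

def minhash_signatures(matrix, m, seed=0):
    """One m-sized signature per column, using m explicit row permutations.

    Assumes every column holds at least one 1.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    signatures = [[] for _ in range(n_cols)]
    for _ in range(m):
        order = list(range(n_rows))
        rng.shuffle(order)  # one random permutation of the rows
        for c in range(n_cols):
            # Position, in permuted order, of the first row with a 1 in column c.
            first = next(pos for pos, r in enumerate(order) if matrix[r][c])
            signatures[c].append(first)
    return signatures

# Hypothetical categorical publications.
pubs = {
    "A": {"brand=acme", "color=red"},
    "B": {"brand=acme", "color=red"},
    "C": {"brand=zeta", "color=blue"},
}
rows, cols, matrix = characteristic_matrix(pubs)
sigs = minhash_signatures(matrix, m=8)
```

Identical publications always receive identical signatures, while publications with no attribute in common never agree in any signature position.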
13
LSH in Adaptive Diversification: Signature Matrix
Fast minhashing: select m random hash functions to model the effect of m random permutations.
This is mathematically proved only when the number of rows is a prime.
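A small sketch of why the prime condition matters. The slides do not state which hash family was used; the linear form h(x) = (a·x + b) mod n below is an assumption, chosen because it is a true permutation of the row numbers when n is prime and a is non-zero. All names are hypothetical.

```python
import random

def make_hash_functions(m, n_rows, seed=0):
    """Draw m pairs (a, b) for h(x) = (a * x + b) mod n_rows, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, n_rows), rng.randrange(n_rows)) for _ in range(m)]

def fast_minhash(row_indices, hash_functions, n_rows):
    """Signature of one publication, given the rows where its column holds a 1."""
    return [min((a * x + b) % n_rows for x in row_indices) for a, b in hash_functions]

def is_permutation(a, b, n_rows):
    """True when h(x) = (a * x + b) mod n_rows visits every row number exactly once."""
    return sorted((a * x + b) % n_rows for x in range(n_rows)) == list(range(n_rows))

n_rows = 7  # a prime number of rows
funcs = make_hash_functions(m=4, n_rows=n_rows)
sig_x = fast_minhash({0, 2, 5}, funcs, n_rows)
sig_y = fast_minhash({0, 2, 5}, funcs, n_rows)
```

With a composite number of rows, some choices of a collapse several rows onto the same value (for example a = 2 with 6 rows), so the hash no longer simulates a permutation.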
14
LSH in Adaptive Diversification: LSH Buckets
Take r-sized signature vectors from the m-sized minhash signature.
Map them into L hash-tables, each with an arbitrary number b of buckets.
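The bucket mapping can be sketched as below. Two points are assumptions rather than facts from the slides: the r-sized vectors are taken as consecutive slices of the signature (so L = m // r), and Python's built-in `hash` on a tuple of integers stands in for whatever bucket hash the authors used. Function names are hypothetical.

```python
def bucket_ids(signature, r, b):
    """Split an m-sized signature into L = m // r vectors; hash each to one of b buckets."""
    n_tables = len(signature) // r
    return [hash(tuple(signature[t * r:(t + 1) * r])) % b for t in range(n_tables)]

def build_tables(signatures, r, b):
    """Map (table index, bucket id) to the set of publications stored there."""
    tables = {}
    for pub, sig in signatures.items():
        for t, bucket in enumerate(bucket_ids(sig, r, b)):
            tables.setdefault((t, bucket), set()).add(pub)
    return tables

# Hypothetical signatures with m = 4, cut into L = 2 vectors of size r = 2.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
tables = build_tables(sigs, r=2, b=101)
```

Here A and B agree on their first r-sized vector, so they share a bucket in the first hash-table even though their full signatures differ.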
15
LSH in Adaptive Diversification: Batch-wise Top-k computation
Bucket "Winner": the publication with the highest relevancy score in the bucket.
The winner is dominant and represents its bucket neighborhood.
Top-k: the k "winners" that hold the majority of votes. The k winners are independent.
[Figure: publications A to H in the ith window]
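The winner-and-vote scheme on this slide can be sketched as follows, assuming each bucket casts one vote for its winner. The tie-breaking rules (higher relevancy first, then publication id) and all names are assumptions; the sketch also does not verify independence between the returned winners.

```python
from collections import Counter

def batch_top_k(tables, relevance, k):
    """Each bucket votes for its most relevant member; return the k most-voted winners."""
    votes = Counter()
    for members in tables.values():
        winner = max(members, key=lambda p: (relevance[p], p))
        votes[winner] += 1
    # Assumed tie-breaking: more votes, then higher relevancy, then id.
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevance[p], p))
    return ranked[:k]

# Hypothetical buckets over two hash-tables and hypothetical relevancy scores.
tables = {
    (0, 11): {"A", "B"},
    (0, 42): {"C"},
    (1, 7): {"A", "B"},
    (1, 30): {"C"},
}
relevance = {"A": 0.9, "B": 0.5, "C": 0.7}
top2 = batch_top_k(tables, relevance, k=2)
```

Publication B never wins a bucket because A, which shares both of its buckets, has the higher relevancy score, so A represents that neighborhood.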
16
LSH in Dynamic Diversification: Incremental Top-k computation
1. A new publication is created with its characteristic vector (Characteristic Matrix).
2. Generate its minhash signature (Signature Matrix).
3. Map the signature into L hash-tables.
4. Update the "Winner" at each bucket the signature maps into.
5. Vote [remainder of label illegible].
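The update path on this slide can be sketched as an index that touches only the buckets a new publication maps into. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, window expiry is not handled, and keeping the existing winner on a relevancy tie is an assumed policy rather than the authors' rule.

```python
from collections import Counter

class IncrementalIndex:
    """Keeps one winner per (table, bucket) and updates it as publications arrive."""

    def __init__(self, relevance):
        self.relevance = relevance  # publication id -> relevancy score
        self.winners = {}           # (table index, bucket id) -> current winner

    def insert(self, pub, buckets):
        """buckets[t] is the bucket id of `pub` in hash-table t.

        Returns the bucket keys whose winner changed.
        """
        touched = []
        for key in enumerate(buckets):  # key = (table index, bucket id)
            current = self.winners.get(key)
            # Assumed policy: the existing winner stays on a tie.
            if current is None or self.relevance[pub] > self.relevance[current]:
                self.winners[key] = pub
                touched.append(key)
        return touched

    def votes(self):
        """One vote per bucket for its current winner."""
        return Counter(self.winners.values())

idx = IncrementalIndex({"A": 0.5, "B": 0.9, "C": 0.1})
first = idx.insert("A", [3, 4])
second = idx.insert("B", [3, 7])
third = idx.insert("C", [3, 4])
```

Each insertion costs one lookup per hash-table, with no neighborhood recomputation over the rest of the window.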
17
LSH in Dynamic Diversification: When new publication F arrives…
Only buckets will vote.
Follow the continuity requirements: Durability, Order.
[Figure: publications A to H across the ith window and the (i+1)th window]
18
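LSH in Adaptive Diversification: Analysis
For two vectors x, y with Jaccard similarity $s$, a single minhash value agrees with probability $s$.
For publications x & y at a particular hash-table (one r-sized signature vector):
x & y map into the same bucket: $s^r$
x & y do not map into the same bucket: $1 - s^r$
At L hash-tables:
x & y do not map into the same bucket in any table: $(1 - s^r)^L$
x & y map into the same bucket in at least one table: $1 - (1 - s^r)^L$
True near neighbors are unlikely to be unlucky in all the projections.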
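A quick numerical check of the "unlikely to be unlucky" claim, using the standard minhash LSH collision probabilities. The symbols are assumptions in the sense that the slide's own formulas did not survive extraction: s is the Jaccard similarity of two publications, r the size of each signature vector, and L the number of hash-tables. The parameter values are hypothetical.

```python
def p_same_bucket(s, r):
    """Probability that all r minhash values of one signature vector agree."""
    return s ** r

def p_missed_everywhere(s, r, L):
    """Probability that the pair shares no bucket in any of the L hash-tables."""
    return (1 - p_same_bucket(s, r)) ** L

def p_candidate(s, r, L):
    """Probability that the pair shares a bucket in at least one hash-table."""
    return 1 - p_missed_everywhere(s, r, L)

# Hypothetical parameters: r = 5 values per vector, L = 20 hash-tables.
near = p_candidate(0.8, r=5, L=20)  # similar pair
far = p_candidate(0.2, r=5, L=20)   # dissimilar pair
```

With these values a pair of similarity 0.8 is almost certain to collide somewhere, while a pair of similarity 0.2 rarely does, which is the S-shaped behavior that makes bucket neighborhoods usable.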
19
Evaluation: Dataset
Publication stream: Amazon on-line marketplace data from 17th to 19th November 2014
Subscriptions: Zipfian, with normalized preferences
Zipf distribution parameters:
N - number of elements in the distribution
k - rank of the element
s - value of the exponent
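The three parameters listed here are those of the standard Zipf probability mass function, f(k; s, N) = (1/k^s) / Σ_{n=1..N} (1/n^s). The sketch below assumes that standard form; the slide's own formula did not survive extraction, and the parameter values are hypothetical.

```python
def zipf_pmf(k, s, N):
    """Probability of the element of rank k among N elements, with exponent s."""
    normalizer = sum(1 / n ** s for n in range(1, N + 1))
    return (1 / k ** s) / normalizer

# Hypothetical setting: 100 distinct preferences, exponent 1.0.
weights = [zipf_pmf(k, s=1.0, N=100) for k in range(1, 101)]
```

The weights sum to 1 and decrease with rank, so they can be used directly as normalized preferences when sampling synthetic subscriptions.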
20
Terminology: ILSH, BLSH and NAÏVE
[Figure: publication stream p1 to p8, with four markers labeled "BLSH or NAÏVE" and one label "ILSH"]
21
Accuracy: ILSH vs. NAÏVE
Probability of ILSH producing the optimal diverse set of results under Jaccard similarity threshold (s)
22
Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE
log(Top-k matching time) against the number of publications, with D=500
23
Conclusions
Locality Sensitive Hashing (LSH) indexing method
Produces a diverse set of results at 70% accuracy on average relative to the naïve method
Reduces the matching time very significantly compared with the NAÏVE method
Is further refined by its incremental version
For handling streaming publications
Avoids the curse of re-computing neighborhoods
Top-k to restrict the delivery to the top publications
Given a window size & delivery method, the model can produce the best diverse set of personalized results
To represent the set of all matching publications at a given instance
24
Future work
Explore other suitable use-cases for the proposed model & develop prototype applications, e.g.
Personalized newspaper for every Facebook user
Adaptive resource scheduling in large-scale distributed systems
Exploit the overlap among diversified results of users who have similar interests
Develop an LSH-based index over a multi-threaded distributed environment
25
Q&A
THANK YOU!