Compute "Closeness" in Graphs using Apache Giraph.

13.1.2014 DIMA - TU Berlin

Compute “Closeness” in Graphs using Apache Giraph

… using probabilistic data structures.

Today: Validation

IMPRO-3, TU Berlin, Winter 13/14

Robert Metzger, Robert Waury


Quick Recap on our Task

● Measure the number of nodes reachable within s steps from a node n in a graph. → N(n,s)

N(“Robert”,1)=80 N(“Robert”,2)=10413 …

● The largest s at which N(n,s) still grows, taken over all nodes n, is the graph diameter.
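
As an illustration of the reachability count described above, the following minimal sketch computes it exactly with a breadth-first search on a small in-memory graph. It is not the Giraph implementation from the slides. The function name `reachable_within`, the adjacency-dict format, and the toy graph are assumptions made for this example, as is the choice to exclude the start node from the count.

```python
def reachable_within(graph, start, max_steps):
    """Return [N(start, 0), ..., N(start, max_steps)].

    Assumption: the start node itself is not counted, so N(start, 0) == 0.
    `graph` maps each node to an iterable of its neighbours.
    """
    seen = {start}
    frontier = [start]
    counts = [0]
    for _ in range(max_steps):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
        # Everything seen so far, minus the start node.
        counts.append(len(seen) - 1)
    return counts


# Hypothetical toy graph (undirected, stored as symmetric adjacency lists).
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

print(reachable_within(toy_graph, "a", 3))  # [0, 1, 3, 4]
```

The counts stop growing once every node of the connected component has been reached.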

Robert’s Xing Network


What happened so far ...

● Giraph implementation:
  ○ a) Bitfield
  ○ b) Flajolet-Martin sketch
    ■ 32 bit with Thomas Wang’s integer hash
    ■ 64 bit MurmurHash 2.0
  ○ c) HyperLogLog sketch with MurmurHash 2.0

● Drafted Stratosphere “Spargel” implementation

● Benchmarked a) and b) for AIM-3
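
To make the sketch-based variants more concrete, here is a minimal single-bitmask Flajolet-Martin sketch with a superstep-style propagation loop, under several assumptions. The hash is a truncated SHA-256 used as a stand-in, not Thomas Wang's integer hash or MurmurHash. All function names and the toy graph are hypothetical. The loop only mimics the idea of merging neighbour sketches in each step; it does not use the Giraph API.

```python
import hashlib

NUM_BITS = 32   # sketch width; the slides mention 32- and 64-bit variants
PHI = 0.77351   # Flajolet-Martin correction constant


def hash_bits(item):
    """Stand-in hash: SHA-256 truncated to NUM_BITS (an assumption)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << NUM_BITS) - 1)


def fm_add(sketch, item):
    """Set the bit at the position of the lowest 1 bit of the item's hash."""
    h = hash_bits(item)
    index = (h & -h).bit_length() - 1 if h else NUM_BITS - 1
    return sketch | (1 << index)


def fm_merge(a, b):
    """The sketch of a set union is the bitwise OR of the two sketches."""
    return a | b


def fm_estimate(sketch):
    """Estimate the cardinality from the position of the lowest 0 bit."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / PHI


def propagate(graph, steps):
    """After `steps` rounds, each sketch summarises the node and all nodes
    within `steps` hops of it (the node itself is included here)."""
    sketches = {node: fm_add(0, node) for node in graph}
    for _ in range(steps):
        updated = {}
        for node, neighbours in graph.items():
            merged = sketches[node]
            for neighbour in neighbours:
                merged = fm_merge(merged, sketches[neighbour])
            updated[node] = merged
        sketches = updated
    return sketches


# Hypothetical toy graph; every node must appear as a key.
sketch_graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}
after_two_steps = propagate(sketch_graph, 2)
print(fm_estimate(after_two_steps["a"]))
```

A single bitmask gives a high-variance estimate, so practical implementations typically keep several bitmasks per node and average them. The merge step stays a bitwise OR either way, which is why the sketches fit a message-passing model.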


Validating the correctness of the implementation ...

● Approach: Assume the “bitfield” implementation as the reference and measure the correlation with the results from the other implementations.

● On two (small) datasets:
  ○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest connected component: 4,158 nodes.
  ○ Enron email network. Largest connected component: 33,696 nodes.


Statistical Methods to determine correlation

● Kendall's τ (tau)
  ○ -1 ≤ τ ≤ 1
  ○ expects an order (ranking), e.g. Comparable interface ;-)

● Spearman's ρ (rho)
  ○ same properties as Kendall's τ, but checks whether the relation is monotonic (not just linear)

● Pearson’s r
  ○ checks for linear correlation
  ○ uses the actual values (not just ranks)
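
The three coefficients above can be sketched in plain Python as follows. This is an illustrative implementation, not the code used to produce the results in these slides. The function names, the choice of the tie-aware τ-b variant of Kendall's τ, and the sample values `exact` and `approx` are assumptions.

```python
import math


def pearson(x, y):
    """Pearson's r: linear correlation of the actual values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, O(n^2): compares the ordering of every pair."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx != 0 and dy != 0:
                if (dx > 0) == (dy > 0):
                    concordant += 1
                else:
                    discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt((pairs - tied_x) * (pairs - tied_y))


# Made-up values: the approximation preserves the order but not the scale.
exact = [10, 20, 30, 40, 50]
approx = [12, 18, 33, 39, 70]
print(kendall_tau_b(exact, approx), spearman(exact, approx), pearson(exact, approx))
```

In this made-up example the ordering is preserved, so both rank-based coefficients reach 1 while Pearson's r stays below 1 because the values are not on a straight line.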


Coauthorship Results (I)

        Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32    0.906881050538273    0.98765689317449     0.991695076216846
FM64    0.905736944670186    0.987400738579957    0.991700042774567
HLL     0.931782793461063    0.993272573234886    0.9956213651786

→ High (linear) correlation with all metrics ✔

→ HyperLogLog has the highest correlation and the best memory properties


Coauthorship Results (II)

→ HLL is the best approximation

→ outliers can be identified with higher confidence than central nodes

→ nodes with the highest closeness tend to have similar values

        Top10    Top100     Top1000     Last1    Last100
FM32    6/10     76/100     891/1000    1/1      94/100
FM64    5/10     69/100     881/1000    1/1      94/100
HLL     8/10     80/100     932/1000    1/1      95/100
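
One plausible reading of the TopK and LastK columns is the number of nodes that appear both in the reference ranking's first (or last) K entries and in the approximation's. The sketch below computes that overlap under this assumption. The function name `rank_overlap`, the tie-breaking by node id, and the toy scores are hypothetical.

```python
def rank_overlap(reference, approximation, k, from_top=True):
    """Count nodes shared by the k highest (or lowest) ranked nodes.

    Both arguments map node -> score. Ties are broken by node id so that
    the result is deterministic (an assumption of this sketch).
    """
    def select(scores):
        if from_top:
            ordered = sorted(scores, key=lambda node: (-scores[node], node))
        else:
            ordered = sorted(scores, key=lambda node: (scores[node], node))
        return set(ordered[:k])

    return len(select(reference) & select(approximation))


# Made-up scores for five nodes.
reference_scores = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
approx_scores = {"a": 5, "b": 3, "c": 4, "d": 1, "e": 2}

print(rank_overlap(reference_scores, approx_scores, 2))                  # 1
print(rank_overlap(reference_scores, approx_scores, 3))                  # 3
print(rank_overlap(reference_scores, approx_scores, 2, from_top=False))  # 2
```

Because the measure compares sets, swapping two nodes inside the first K entries does not lower it; only nodes that cross the cut-off do.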


Enron Results (I)

→ High (linear) correlation with all metrics ✔

→ HyperLogLog has the highest correlation and the best memory properties

        Kendall’s τ           Spearman’s ρ          Pearson’s r
FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085


Enron Results (II)

        Top10    Top100     Top1000     Last1    Last100
FM32    5/10     80/100     877/1000    1/1      96/100
FM64    7/10     66/100     839/1000    1/1      97/100
HLL     8/10     86/100     889/1000    1/1      97/100

→ HLL is again the best approximation

→ outliers can be identified with higher confidence than central nodes


Validation Summary

● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.

● We assume that these results hold for larger data sets.


Next step

● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)

● Datasets:

Description                                             Name            Vertices       Edges            Text file size (GB)
The data of Stanford's WebBase 2001 crawl as a graph    webbase-2001    118,142,155    1,019,903,190    9.46
Follower relationships                                  twitter-2010    41,652,230     1,468,365,182    12.49


References

U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.

U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.

Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.

Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.

Formulas taken from Wikipedia.

http://www.siam.org/meetings/sdm11/
