Performance and Scalability of Indexed Subgraph Query Processing Methods

Presented by: Zohreh Raghebi

Fall 2015

Foteini Katsarou

Nikos Ntarmos

Peter Triantafillou

University of Glasgow, UK

Motivations

Graph data management systems have become very popular

One of the main problems for these systems is subgraph query processing: given a query graph, return all graphs in the dataset that contain the query.

The naive approach performs a subgraph isomorphism test against each graph in the dataset.

This does not scale, as subgraph isomorphism is NP-complete.

Many indexing methods have been proposed to reduce the number of candidate graphs for the subgraph isomorphism test.
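The filter-then-verify idea behind these indexing methods can be sketched as follows. This is a minimal illustration, not one of the indexing methods reviewed here: the "index" is just a per-graph count of vertex labels (an assumption chosen for brevity), and the verification step is a brute-force test that is only practical for tiny graphs. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

# A graph is a pair (labels, edges): labels maps vertex -> label,
# edges is a set of frozenset({u, v}) pairs (undirected, no self-loops).

def label_counts(graph):
    labels, _ = graph
    return Counter(labels.values())

def may_contain(index_entry, query_counts):
    # Filtering: a graph can only contain the query if it has at least
    # as many vertices of every label as the query does.
    return all(index_entry[label] >= n for label, n in query_counts.items())

def contains(graph, query):
    # Verification: brute-force (non-induced) subgraph isomorphism test.
    g_labels, g_edges = graph
    q_labels, q_edges = query
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if any(q_labels[q] != g_labels[mapping[q]] for q in q_nodes):
            continue
        if all(frozenset(mapping[v] for v in e) in g_edges for e in q_edges):
            return True
    return False

def subgraph_query(dataset, query):
    index = [label_counts(g) for g in dataset]  # built once in practice
    q_counts = label_counts(query)
    candidates = [i for i, entry in enumerate(index) if may_contain(entry, q_counts)]
    answers = [i for i in candidates if contains(dataset[i], query)]
    return candidates, answers
```

Candidates that pass the filter but fail verification are the false positives; a better index produces fewer of them and so triggers fewer expensive isomorphism tests.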

Problem definition

A set of key factors (parameters) that influence the performance of related methods:

the number of nodes per graph

the graph density

the number of distinct labels

the number of graphs in the dataset

the query graph size

Basic idea

First, to derive conclusions about the algorithms’ relative performance.

Second, to highlight how both performance and scalability depend on the above factors.

Six well-established indexing methods are compared: Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode.

Related work

Most related works are tested against the AIDS antiviral dataset and synthetic datasets, both formed of many small graphs.

Grapes alone used several real datasets, but its authors did not evaluate scalability.

The iGraph comparison framework compared the performance of older algorithms (up to 2010). Since then, several more efficient algorithms have been proposed.

Conclusion

A linear increase in the number of nodes results in a quadratic increase in the number of edges.

The number of edges increases linearly with the graph density.

An increase in either of these two factors leads to a detrimental increase in the indexing time.

The number of graphs increases the overall complexity only linearly.

The frequent mining techniques are more sensitive, because more features have to be located across more graphs.
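The first two observations follow from simple arithmetic, sketched below. The sketch assumes the usual definition of density for an undirected simple graph (the fraction of the n(n-1)/2 possible edges that are present); the slides do not state which definition the paper uses, and the function name is hypothetical.

```python
def expected_edges(num_nodes, density):
    # Assumption: density = |E| / (n * (n - 1) / 2) for an undirected simple graph.
    max_edges = num_nodes * (num_nodes - 1) // 2
    return density * max_edges

# Doubling the node count at fixed density roughly quadruples the edges,
# while doubling the density at fixed node count exactly doubles them.
small = expected_edges(100, 0.25)
more_nodes = expected_edges(200, 0.25)
denser = expected_edges(100, 0.5)
```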

Conclusion

The increase in the number of distinct labels leads to an easier dataset to index:

1. It results in fewer occurrences of any given feature.

2. It decreases the false positive ratio of the various algorithms.

Our findings give rise to the following adage: “Keep It Simple and Smart”. The simpler the feature structure and extraction process, the faster the indexing and query processing algorithm.

Conclusion

Frequent mining algorithms (gIndex, Tree+∆) may be competitive for small/sparse datasets

Techniques using exhaustive enumeration (Grapes, GGSX, CT-Index) are the clear winners

Especially with those indexing simple features (paths; i.e., Grapes, GGSX)

Instead of having more complex features (trees, cycles; i.e., CT-Index)

One Trillion Edges: Graph Processing at Facebook-Scale

Industrial paper

Avery Ching

Sergey Edunov

Maja Kabiljo


Fall 2015

Motivation

Analyzing real-world graphs at the scale of hundreds of billions of edges with available software is very difficult.

Graph processing engines tend to have additional challenges in scaling to larger graphs.

Related work

Apache Giraph is an iterative graph processing system designed to scale to trillions of edges; it is used at Facebook.

Giraph was inspired by Pregel, the graph processing system developed at Google.

However, Giraph did not initially scale to the needs of Facebook, with over 1.39B users and hundreds of billions of social connections.

The platform had to be improved to support these workloads.

Limitations:

Giraph’s graph input model was only vertex-centric.

Parallelizing the Giraph infrastructure relied on MapReduce’s task-level parallelism; there was no multithreading support.

Giraph’s flexible types were initially implemented using native Java objects, which consumed excessive memory and garbage collection time.

The aggregator framework was inefficiently implemented in ZooKeeper, while very large aggregators needed to be supported.

Improvements:

Giraph was modified to allow loading vertex data and edges from separate sources.

Parallelization support: worker-local multithreading takes advantage of additional CPU cores.

Memory optimization: by default, the edges of every vertex are serialized into a byte array rather than instantiated as native Java objects.

Aggregator architecture: each aggregator is now randomly assigned to one of the workers, so aggregation responsibilities are balanced across all workers and are not bottlenecked by the master.
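The memory optimization above can be illustrated with a small sketch. Giraph itself is written in Java; the Python below only shows the general idea of packing a vertex's edges into one contiguous byte buffer instead of keeping one object per edge. The record layout (a 64-bit target id followed by a 32-bit float weight) and the function names are assumptions made for illustration, not Giraph's actual format.

```python
import struct

# Hypothetical edge layout: little-endian int64 target id + float32 weight.
EDGE = struct.Struct("<qf")

def pack_edges(edges):
    # Serialize all out-edges of one vertex into a single byte string.
    buf = bytearray(EDGE.size * len(edges))
    for i, (target, weight) in enumerate(edges):
        EDGE.pack_into(buf, i * EDGE.size, target, weight)
    return bytes(buf)

def iter_edges(buf):
    # Decode edges lazily while iterating, without materializing objects up front.
    for target, weight in EDGE.iter_unpack(buf):
        yield target, weight
```

With this layout each edge costs a fixed 12 bytes, and the garbage collector sees one buffer per vertex rather than one object per edge.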

Bonding Vertex Sets Over Distributed Graph: A Betweenness Aware Approach

Xiaofei Zhang†, Hong Cheng‡, Lei Chen†

†Department of Computer Science & Engineering, HKUST


Fall 2015

Vertex Set Bonding query

Extracts the most prominent vertices.

Returns a minimum set of vertices with the maximum importance, measured by total betweenness and shortest path reachability, in connecting two sets of input vertices.

Applications: logistic planning, social community bonding.

Motivation

In social network study: to understand information propagation and hidden correlations.

To find the “bonding” of communities, an ideal bonding agent would:

1. Reside on as many cross-group pairwise shortest paths as possible

2. Connect as large a portion of the two groups as possible

Such agents could best serve the message passing between two groups

The VSB query ranks a vertex’s prominence with two factors: betweenness and shortest path connectivity
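As a rough illustration of the first criterion, the sketch below counts, for every vertex outside the two groups, how many cross-group pairs have at least one shortest path passing through it. This is a simplified stand-in, not the paper's betweenness definition or its distributed algorithm: it assumes an unweighted, undirected graph stored as an adjacency dictionary, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    # Hop distances from source in an unweighted graph.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def cross_group_coverage(adj, group_a, group_b):
    # score[v] = number of pairs (s in A, t in B) with v on some shortest s-t path,
    # i.e. d(s, v) + d(v, t) == d(s, t).
    dist = {v: bfs_distances(adj, v) for v in set(group_a) | set(group_b)}
    score = {v: 0 for v in adj if v not in group_a and v not in group_b}
    for s in group_a:
        for t in group_b:
            if t not in dist[s]:
                continue  # s and t are disconnected
            for v in score:
                if v in dist[s] and v in dist[t] and dist[s][v] + dist[t][v] == dist[s][t]:
                    score[v] += 1
    return score
```

A vertex with a high score sits between many cross-group pairs, which is the intuition behind ranking bonding candidates; the exact betweenness value would additionally weight each pair by the fraction of its shortest paths that use the vertex.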

Related works

Minimum cut finds the minimum set of edges whose removal turns a graph into two disjoint subgraphs.

It does not show how other vertices contribute to the connection between the sets.

Top-k betweenness computation is employed to find important vertices in a network.

However, due to the local dominance property of the betweenness metric, such queries cannot serve vertex set bonding properly.

Basic Idea

Two novel building blocks for efficient VSB query evaluation:

Guided graph exploration: a vertex filtering scheme reduces redundant vertex accesses.

The minimum set of vertices with the highest accumulative betweenness is returned as the bonding vertices.

Betweenness ranking on exploration: instead of computing the exact betweenness values, the betweenness of vertices is ranked during graph exploration to save computation cost.

A Scalable Distributed Graph Partitioner

Daniel Margo, Harvard University, Cambridge

Margo Seltzer, Harvard University, Cambridge


Fall 2015

Distributed graph partitioning algorithm

Capable of handling graphs that far exceed main memory

High quality edge partitions

Graph partitioning is an important problem that affects many graph-structured systems

Partitioning quality greatly impacts the performance of distributed graph analysis frameworks

Related works

METIS: the gold standard for graph partitioning

A multi-level graph partitioning algorithm

Such approaches do not scale to today’s large graphs

Streaming partitioners

A graph loader reads serial graph data from a disk onto a cluster

It must make a decision about the location of each node as it is loaded

The goal is to find an optimal balanced partitioning with as little computation as possible

They are sensitive to the stream order, which can affect performance

Streaming algorithms are difficult to parallelize
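The stream-order sensitivity mentioned above can be seen in a small sketch of a streaming vertex partitioner. The greedy scoring rule used here (prefer the partition that already holds most of the vertex's neighbors, discounted by how full it is) is an assumption chosen as a typical example of such heuristics; it is not taken from the paper, and the function names are hypothetical.

```python
def stream_partition(stream, num_parts, capacity):
    # stream yields (vertex, neighbors) in arrival order; each vertex is placed
    # immediately and never moved. Requires num_parts * capacity >= number of vertices.
    assignment = {}
    sizes = [0] * num_parts
    for vertex, neighbors in stream:
        def score(p):
            placed = sum(1 for n in neighbors if assignment.get(n) == p)
            return placed * (1 - sizes[p] / capacity)
        open_parts = [p for p in range(num_parts) if sizes[p] < capacity]
        # Ties go to the emptier partition, then to the lower index.
        best = max(open_parts, key=lambda p: (score(p), -sizes[p], -p))
        assignment[vertex] = best
        sizes[best] += 1
    return assignment

def cut_size(adj, assignment):
    # Number of undirected edges whose endpoints land in different partitions.
    return sum(1 for u in adj for w in adj[u] if u < w and assignment[u] != assignment[w])
```

On two triangles joined by one bridge edge, streaming the vertices triangle by triangle recovers the natural split, while starting the stream at the bridge endpoints mixes the triangles and cuts more edges; the same graph and the same heuristic give different quality purely because of arrival order.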

Contribution

Sheep partitions by a method whose result does not vary with how the input graph is distributed.

Sheep can arbitrarily divide the input graph for parallelism and to fit tasks in memory.

Contribution

Sheep reduces the input graph to a small elimination tree

Sheep’s tree transformation is a distributed map-reduce operation

Using simple degree ranking, Sheep creates competitive edge partitions faster than other partitioners

Extracting Logical Hierarchical Structure of HTML Documents Based on Headings

Type: Research Paper

Authors: Tomohiro Manabe, Keishi Tajima

Presented by: Siddhant Kulkarni

Term: Fall 2015

Motivation

How is logical structure different from the mark-up structure?

Difference between human understanding and browser interpretation

“Mark-up structure does not necessarily always correspond to the logical hierarchy”

Basic idea for webpages with improper tag usage

Related Work

HTML5 solves this problem, but we cannot port all web pages to HTML 5

Other techniques for document segmentation:

Based on margins between blocks

Based on text density

Based on identification of important blocks

Etc.

Most rely on tags (so does this paper, but not entirely)

Contribution

Define Blocks and Headings for their own structure extraction

Logical hierarchy extraction using:

Preprocessing

Heading-based page segmentation

Dataset: Web snapshot ClueWeb09 Category B document collection

Conclusion

Calculate accuracy based on Precision and Recall of extracted relationship

Types of Relationships – parent, ancestor, siblings, child, descendants
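The accuracy measure above can be sketched as a set comparison between extracted and ground-truth relationships. The triple representation `(block, other_block, relationship_type)` is an assumption made for illustration; the slides do not say how the paper encodes relationships, and the function name is hypothetical.

```python
def precision_recall(extracted, truth):
    # Each relationship is a hashable item, e.g. ("h1", "h2", "parent").
    extracted, truth = set(extracted), set(truth)
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall
```

Precision penalizes relationships that were extracted but are not in the ground truth; recall penalizes ground-truth relationships that the extraction missed.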

Argonaut: Macrotask Crowdsourcing for Complex Data Processing

Type: Industry Paper

Authors: Daniel Haas, Jason Ansel, Lydia Gu, Adam Marcus

Presented by: Siddhant Kulkarni

Term: Fall 2015

Motivation

What is Macrotask Crowdsourcing? What is the problem with it?

Related Work

The related work focuses on Macrotasking and Crowdsourcing frameworks

Contribution

Argonaut

Predictive models to identify trustworthy workers who can perform reviews

A model to identify which tasks need review

Evaluation of the trade-off between single and multiple phases of review, based on budget

QOCO: A Query Oriented Data Cleaning System with Oracles

Presented by: Shahab Helmi

VLDB 2015 Paper Review Series

Fall 2015

Paper Info

Authors:

Moria Bergman, Tel-Aviv University

Tova Milo, Tel-Aviv University

Slava Novgorodov, Tel-Aviv University

Wang-Chiew Tan, University of California, Santa Cruz

Publication: VLDB 2015

Type: Demonstration Paper

Introduction

It is important for the database to be as complete (no missing values) and correct (no wrong values) as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However these tools:

are not able to remove all erroneous data:

95% accuracy in the YAGO database.

It is impossible to correct all errors manually in big datasets.

are not usually able to determine what information is missing from a database.

What Is QOCO?

A novel query oriented data cleaning system with oracle crowds.

It uses materialized views (i.e., query-oriented views defined through user queries) as a trigger for identifying incorrect or missing information.

If an error (i.e., a wrong tuple or missing tuple) in the materialized view is detected, the system will interact minimally with a crowd of oracles by asking only pertinent questions.

Answers to a certain question will help to identify the next pertinent questions to ask and ultimately, a sequence of edits is derived and applied to the underlying database.

Cleaning the entire database is not the goal of QOCO. It cleans parts of the database as needed. Hence, it can be used as a complementary tool alongside other cleaning tools.

Related Works & Contributions

Data cleaning techniques:

QOCO uses crowd to correct query results.

QOCO propagates updates back to the underlying database.

QOCO discovers and inserts true tuples that are missing from the input database.

Crowdsourcing is a model where humans perform small tasks to help solve challenging problems such as:

Entity/conflict resolution.

Duplicate detection.

Schema matching.

Example: Deleting Wrong Tuples

Consider a user query which searches for European teams that won the World Cup at least twice.

The result will contain ESP (which is wrong), and ITA will be missing.

Correct tuples

Wrong tuples

Missing tuples

Example: Deleting Wrong Tuples (2)

Tuples which contain the wrong answer:

t1 = Game(11:07:10; ESP; NED; final; 1:0)

t2 = Game(17:07:94; ESP; NED; final; 3:1)

t3 = Team(ESP; EU)



1. Find the most frequent tuple (t3) and ask the oracle whether it is correct.

t3 is correct -> the remaining candidates will be {t1, t2}, {t2, t4}, {t4, t1}

2. The remaining tuples have the same frequency, so QOCO will choose one of them randomly, say t1.

t1 is correct -> {t2}, {t2, t4}, {t4} -> ESP won the World Cup only once; hence both t2 and t4 are wrong and should be deleted.
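The questioning strategy in this example can be sketched as a greedy loop over candidate sets. This is a reconstruction from the slide, not QOCO's actual algorithm: it assumes each candidate set is a group of tuples that together produce the wrong answer (so at least one tuple in every set must be wrong), it breaks frequency ties by name instead of randomly so that the run is repeatable, and all function names are hypothetical.

```python
from collections import Counter

def find_wrong_tuples(candidate_sets, oracle):
    # oracle(t) returns True if tuple t is correct. Returns (tuples to delete, questions asked).
    remaining = [set(s) for s in candidate_sets]
    to_delete, asked = set(), []
    while remaining:
        # A set with a single tuple left: that tuple must be wrong, no question needed.
        forced = {next(iter(s)) for s in remaining if len(s) == 1}
        if forced:
            to_delete |= forced
            remaining = [s for s in remaining if not (s & forced)]
            continue
        counts = Counter(t for s in remaining for t in s)
        top = max(counts.values())
        question = min(t for t, c in counts.items() if c == top)  # deterministic tie-break
        asked.append(question)
        if oracle(question):
            remaining = [s - {question} for s in remaining]
        else:
            to_delete.add(question)
            remaining = [s for s in remaining if question not in s]
    return to_delete, asked
```

Run on the candidate sets {t1, t2, t3}, {t2, t4, t3}, {t4, t1, t3} with an oracle that confirms t1 and t3, the loop asks about t3 and then t1 and concludes that t2 and t4 must be deleted, which matches the walk-through on the slide.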

Adding a Missing Answer & More Details

These can be found in the original research paper:

M. Bergman, T. Milo, S. Novgorodov, and W. Tan. Query-oriented data cleaning with oracles. In ACM SIGMOD, 2015

Aggregate Estimations Over Location Based Services

Presented by: Shahab Helmi

VLDB 2015 Paper Review Series

Fall 2015

Paper Info

Authors:

Weimo Liu, The George Washington University

Md Farhadur Rahman, University of Texas at Arlington

Saravanan Thirumuruganathan, University of Texas at Arlington

Nan Zhang, The George Washington University

Gautam Das, University of Texas at Arlington

Publication: VLDB 2015

Type: Research Paper

1. Introduction

Location-based services (LBS):

Location-returned services (LR-LBS): these services return the locations of the k returned tuples, e.g., Google Maps.

Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, only other attributes such as ID and ranking, e.g., WeChat and Sina Weibo.

K-nearest-neighbors (kNN) queries: return the k nearest tuples to the query point according to a ranking function (Euclidean distance in this paper).
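The two interface types can be contrasted with a small sketch of a kNN query over an in-memory list of tuples. The tuple fields (`id`, `loc`) and the function name are hypothetical; a real service would answer such queries behind its web interface rather than expose the data.

```python
import math

def knn_query(tuples, query_point, k, return_location=True):
    # Rank all tuples by Euclidean distance to the query point and keep the top k.
    ranked = sorted(tuples, key=lambda t: math.dist(t["loc"], query_point))
    top_k = ranked[:k]
    if return_location:
        return top_k  # LR-LBS style: locations are part of the answer
    # LNR-LBS style: only identifiers and ranks are exposed.
    return [{"id": t["id"], "rank": r} for r, t in enumerate(top_k, start=1)]
```

In the LNR case the caller learns only the order of the neighbors, which is why estimating aggregates over such interfaces is harder.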

1. Introduction (2)

LBS with a kNN interface: third-party applications and/or end users do not have complete and direct access to the entire database. The database is essentially “hidden”, and access is typically limited to a restricted public web query interface or API.

These interfaces impose some constraints:

Query limitation: 10,000 per user per day in Google Maps

Maximum coverage limit, for example 5 miles away from the query point

Aggregate Estimations: For many interesting third-party applications, it is important to collect aggregate statistics over the tuples contained in such hidden databases such as sum, count, or distributions of the tuples satisfying certain selection conditions.

A hotel recommendation application would like to know the average review scores for Marriott vs Hilton hotels in Google Maps;

A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region;

A demographics researcher may wish to know the gender ratio of users of social networks in China, etc.

2. Motivation

Aggregate information can be obtained by:

Entering into data sharing agreements with the location-based service providers, but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.

Retrieving the whole dataset through the limited interfaces, which would take too long.

Goals:

Approximate estimates of such aggregates by only querying the database via its restrictive public interface.

Minimizing the query cost (i.e., ask as few queries as possible) in an effort to adhere to the rate limits or budgetary constraints imposed by the interface.

Making the aggregate estimations as accurate as possible.

3. Related Work

Analytics and Inference over LBS:

Estimating COUNT and SUM aggregates.

Error reduction, such as bias correction.

Aggregate Estimations over Hidden Web Repositories:

Unbiased estimators for COUNT and SUM aggregates for static databases.

Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.

Estimating the size of search engines.

4. Contributions

For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT and SUM aggregates, represents a significant improvement over prior work along multiple dimensions:

a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations;

top-k returned tuples are leveraged rather than only top-1;

several innovative techniques are developed for reducing error and increasing efficiency.

For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.

5. Background: Voronoi Diagrams

Top-1 Voronoi Top-2 Voronoi

In a Voronoi diagram, for each point, there is a corresponding region consisting of all points closer to that point than to any other.
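Why exact Voronoi cells give an unbiased COUNT estimate can be seen in a one-dimensional sketch. A uniformly random query point lands in the cell of tuple t with probability p(t) = |cell(t)| / |region|, so averaging 1 / p(t) over random queries has expected value equal to the number of tuples. The sketch is a simplification under stated assumptions: it works on a line segment instead of a map, and it computes each cell directly from the full point list, whereas the real algorithm has to discover the cell through further kNN queries. All function names are hypothetical.

```python
import random

def voronoi_cell_length(points, i, lo, hi):
    # Length of the top-1 Voronoi cell of points[i] inside [lo, hi].
    # points must be sorted and distinct; cell borders are midpoints to the neighbors.
    left = lo if i == 0 else (points[i - 1] + points[i]) / 2
    right = hi if i == len(points) - 1 else (points[i] + points[i + 1]) / 2
    return right - left

def nearest(points, q):
    # Index of the point returned by a top-1 nearest-neighbor query at q.
    return min(range(len(points)), key=lambda i: abs(points[i] - q))

def estimate_count(points, lo, hi, num_queries, rng):
    total = 0.0
    for _ in range(num_queries):
        q = rng.uniform(lo, hi)                      # random query location
        i = nearest(points, q)                       # tuple returned by the service
        p = voronoi_cell_length(points, i, lo, hi) / (hi - lo)
        total += 1.0 / p                             # inverse-probability weight
    return total / num_queries
```

For example, `estimate_count(sorted_points, 0.0, 1.0, 1000, random.Random(0))` approximates `len(sorted_points)`; tuples with small cells are hit rarely but contribute large weights, which is also where the variance comes from.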

6. LR-LBS-AGG | LNR-LBS-AGG Algorithms

1. Precisely Compute Voronoi Cells

Faster Initialization

Leverage history on Voronoi cell computation

2. Error Reduction

Bias error removal/reduction

Variance reduction

...

Experimental Results

Datasets:

Offline Real-World Dataset (OpenStreetMap, USA Portion): to verify the correctness of the algorithm.

Online LBS Demonstrations: to evaluate the efficiency of the algorithm.

Google Maps

WeChat

Sina Weibo
