GraphGen: Exploring Interesting Graphs in Relational Data
Presented by: Zohreh Raghebi
Fall 2015
Authors
Konstantinos Xirogiannopoulos, Udayan Khurana, Amol Deshpande
University of Maryland, College Park
Introduction
Analyzing the interconnection structure among the underlying entities can provide significant insights and value in many application domains, such as social networks and communication networks.
There is increasing interest in executing a wide variety of graph analysis tasks and graph algorithms (e.g., community detection, influence propagation, network evolution, anomaly detection, centrality analysis).
Introduction
This has led to the development of many specialized graph databases (e.g., Neo4j, Titan, OrientDB) and graph execution engines (e.g., Apache Giraph, GraphLab, Ligra, Galois, GraphX, XStream, to name a few).
Recently, several researchers have also investigated the possibility of executing graph analysis tasks using a relational database system (e.g., Vertexica, GRAIL, Aster Graph Analytics).
Although such specialized graph data management systems have made significant advances in analyzing graph-structured data, a large fraction of the data of interest still resides in relational database systems.
Main Idea
We are building a system called GRAPHGEN, with the goal of making it easy for users to extract a variety of different types of graphs from a relational database and execute graph analysis tasks over them in memory.
GRAPHGEN supports an expressive Domain Specific Language (DSL), based on Datalog, to specify the graph(s) to be extracted from the relational data; the sketch below illustrates the kind of extraction such a query describes.
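To make this concrete, here is a hedged sketch of what such an extraction boils down to in plain SQL, assuming a hypothetical AuthorPub(author_id, pub_id) table; this is illustrative SQL, not GRAPHGEN's actual Datalog-based DSL syntax:

-- Hypothetical schema: AuthorPub(author_id, pub_id).
-- Nodes of the co-authorship graph: the distinct authors.
SELECT DISTINCT author_id AS node_id
FROM AuthorPub;

-- Edges: pairs of authors who share a publication; this is the
-- self-join that a declarative graph-extraction query compiles to.
SELECT a1.author_id AS src, a2.author_id AS dst
FROM AuthorPub a1
JOIN AuthorPub a2
  ON a1.pub_id = a2.pub_id
 AND a1.author_id < a2.author_id;  -- no self-loops, no duplicate pairs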
GRAPHGEN has fundamentally different goals than the recent work on graph analytics using relational databases (e.g., Vertexica, GRAIL, Aster Graph Analytics, and SQL Server-based approaches).
Related Work
In Vertexica and GRAIL, a graph is normalized and stored in the relational database. Those works do not consider the problem of extracting graphs from relational data, and they can only execute analysis tasks that can be written in the vertex-centric programming framework.
GRAPHGEN, on the other hand, pushes some computation to the relational engine, but most of the complex graph algorithms are executed in memory on a graph representation of the data.
This allows GRAPHGEN to execute more complex analysis tasks, such as community detection, dense subgraph detection, and matching, as long as the extracted graph fits in memory.
Related Work
Ringo has somewhat similar goals to GRAPHGEN and provides operators to convert from an in-memory relational table representation to a graph representation.
However, it does not provide an expressive declarative DSL for graph extraction, and it does not consider optimizations to reduce memory requirements.
Ringo does support a large library of built-in graph algorithms, and the GRAPHGEN authors plan to support Ringo as a frontend analytics engine for their system.
TreeScope: Finding Structural Anomalies in Semi-Structured Data
Presented by: Zohreh Raghebi
Fall 2015
Authors
Shanshan Ying (Advanced Digital Sciences Center)
Flip Korn (Google)
Barna Saha (University of Massachusetts)
Divesh Srivastava (AT&T Labs–Research)
Introduction
Semi-structured data are prevalent on the web and in NoSQL document databases, with formats such as XML (eXtensible Markup Language) and JSON (JavaScript Object Notation).
Their popularity is due to their generality, flexibility, and easy customization.
However, these benefits come at the cost of being prone to a range of data quality errors, from errors in content to errors in structure.
Errors in content have been well studied in the literature, but very little attention has been paid to errors in structure.
Motivation
This neglect is based on the assumption that once data are valid according to the specified schema (e.g., DTD or XSD for XML data), there can be no errors in their structure.
We have found this assumption to often be incorrect: we observe that DTD/XSD specifications for heterogeneous XML data sets tend to be quite liberal, allowing semantically incorrect (though syntactically valid) data to creep into the data sets.
The existence of such errors can lead to incorrect query results and, even worse, to poor data-driven decisions.
The paper gives illustrative examples of such errors in the well-known and widely used DBLP Computer Science bibliography data set.
Main Idea
In this work, we present TREESCOPE, a system that analyzes semi-structured data sets with the goal of automatically identifying potential structural errors in the data.
A key insight is that it is not necessary to learn a precise schema to identify structural errors: it is sufficient to learn robust structural models of subsets of the semi-structured data with high support, and to identify structural anomalies as violations of the learned models.
Main Idea
TREESCOPE learns robust structural models through a controlled exploration of the lattice of context path expressions that have high support, computing frequency distributions of candidate target tags, and then finds those structural models that exhibit a significant skew in their frequency distributions; the sketch below illustrates the skew test.
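As a rough illustration of the skew test (not TREESCOPE's actual implementation), suppose tag occurrences have been pre-aggregated into two hypothetical tables; high-support context paths whose target tag is almost, but not quite, always present point to candidate anomalies:

-- Hypothetical pre-aggregated tables (names invented for illustration):
--   tag_counts(context_path, target_tag, n_with)  -- records under the path
--                                                 -- that contain the tag
--   path_counts(context_path, n_total)            -- all records under the path
SELECT t.context_path, t.target_tag,
       t.n_with * 1.0 / p.n_total AS presence_ratio
FROM tag_counts t
JOIN path_counts p ON p.context_path = t.context_path
WHERE p.n_total >= 1000                     -- high support
  AND t.n_with * 1.0 / p.n_total >= 0.99    -- strongly skewed distribution
  AND t.n_with < p.n_total;                 -- yet a few violating records exist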
A Time Machine for Information: Looking Back to Look Forward
Authors: Xin Luna Dong (Google Inc.), Wang-Chiew Tan (UC Santa Cruz)
Presented by: Omar Alqahtani
Fall 2015
Motivation
Goal: to develop a complete understanding of the history of an entity, and to depict trends over time. Why is this difficult?
The lack of explicit temporal data; the lack of tools for interpreting such data; and many of the challenges that occur in data integration and knowledge curation.
Proposed direction: making every step of the data integration process time-aware.
Example
To illustrate the limitations of current knowledge management techniques: the Google search query "Google CEO in 2015" returns a speech by the "ex-CEO".
Another example: "Google's CEO before Larry Page" returns articles about Larry Page.
Time Machine for Information
Anyone can easily and incrementally ingest temporal data to: form a more comprehensive understanding of entities over time; search and query facts for a particular time period (see the sketch below); understand trending patterns over time; and perform analytics.
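For instance, if temporal facts were ingested with validity intervals, the motivating "Google CEO in 2015" query becomes a simple interval predicate. A minimal sketch; the ceo table and its columns are invented for illustration:

-- Hypothetical temporal fact table: ceo(company, person, valid_from, valid_to).
SELECT person
FROM ceo
WHERE company = 'Google'
  AND valid_from <= DATE '2015-12-31'   -- fact began before the period ended
  AND valid_to   >= DATE '2015-01-01';  -- and ended after the period began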
A Demonstration of the BigDAWG Polystore System
Presented by: Shahab Helmi
Fall 2015
Paper Info
Publication: VLDB 2015
Type: Demonstration Paper
Introduction / Motivation
"One size does not fit all": MIMIC II is a publicly accessible dataset covering about 26,000 ICU admissions at Boston's Beth Israel Deaconess Hospital. Its different kinds of data call for different engines:
Waveform data (up to 125 Hz measurements from bedside devices): SciDB, stored as time series (arrays).
Device stream information: S-Store.
Patient metadata (name, age, ...): PostgreSQL.
Doctors' and nurses' notes (text): Apache Accumulo, which stores the associated text data in a key-value store.
Lab results and prescriptions filled (both semi-structured data).
Historical data plus real-time feeds from current patients.
Introduction / Motivation
It is hard for programmers to: Use different databases in their applications.
Learn new query languages.
BigDAWG
BigDAWG is a reference implementation of a new architecture for "Big Data" applications, developed at the Intel Science and Technology Center (ISTC) for Big Data.
Its UI provides:
Data browsing.
Exploratory Analysis.
Complex Analytics (linear regression, fast Fourier transform, ...).
Real-Time Monitoring.
BigDAWG Architecture
The goal is to enable users to enjoy the performance advantages of multiple vertically integrated systems (such as column stores, NewSQL engines, and array stores) without sacrificing the expressiveness of their queries or burdening the user with learning multiple front-end languages.
Each island is a front-facing abstraction for the user, and it includes a query language, data model, and a set of connectors or shims for interacting with the underlying storage engines that it is federating.
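As a sketch of how the island abstraction might look to a user, the following pseudo-query scopes each fragment to an island. The ARRAY/RELATIONAL keywords and the table names are hypothetical, illustrative stand-ins, not the actual BigDAWG grammar:

-- Illustrative pseudo-syntax only: island scopes route each fragment
-- to the storage engine federated behind that island.
ARRAY(                                  -- array island, e.g. over SciDB
  SELECT patient_id, avg(signal)
  FROM waveforms
  WHERE patient_id IN
    RELATIONAL(                         -- relational island, e.g. PostgreSQL
      SELECT id FROM patients WHERE age > 65
    )
  GROUP BY patient_id
);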
Real-Time Analytical Processing with SQL Server
Presented by: Shahab Helmi
Fall 2015
Paper Info
Publication: VLDB 2015
Type: Industry Paper
Introduction / Motivation
Transactional processing (OLTP) and analytical processing have traditionally been separated and run on different systems.
Separation reduces the load on transactional systems, which makes it easier to ensure consistent throughput and response times for business-critical applications.
However, users are increasingly demanding access to ever-fresher data for analytical purposes as well. The freshest data resides in transactional systems, so the most up-to-date results are obtained by running analytical queries directly against the transactional database.
Introduction / Motivation (2)
Over the last two releases SQL Server has added column store indexes (CSI) and batch mode (vectorized) processing to speed up analytical queries and the Hekaton in-memory engine to speed up OLTP transactions.
Each feature works best for a specific workload:
Columnstore indexes are optimized for large scans, but operations such as point lookups or small range scans also require a complete scan, which is clearly prohibitively expensive.
Lookups are very fast in in-memory tables, but complete table scans are expensive because of the large number of cache and TLB misses and the high instruction and cycle count associated with row-at-a-time processing.
SQL Server 2016 will include several enhancements that are targeted primarily for such hybrid workloads.
SQL Server 2016 Enhancements on Hybrid Workloads
1. Columnstore indexes on in-memory tables. These greatly speed up queries that require complete table scans.
2. Updatable secondary columnstore indexes. Secondary CSIs on disk-based tables were introduced in SQL Server 2012. However, adding a CSI makes the table read-only. This limitation will be remedied in SQL Server 2016.
3. B-tree indexes on primary columnstore indexes. CSIs work great for data warehousing applications but not for point lookups and small range scans (which would otherwise require a full scan). To speed up such operations, users will be able to create normal B-tree indexes on primary column stores.
4. Columnstore scan improvements. The new scan operator makes use of SIMD (single instruction, multiple data) instructions, and the handling of filters and aggregates has been extended and improved.
Faster scans speed up many analytical queries considerably.
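A minimal T-SQL sketch of features 1 through 3; the table, column, and index names are invented for illustration, and the disk-based tables in items 2 and 3 are assumed to already exist:

-- 1. In-memory table with a columnstore index (SQL Server 2016).
CREATE TABLE dbo.Trades (
    TradeId   INT NOT NULL PRIMARY KEY NONCLUSTERED,
    Symbol    NVARCHAR(10) NOT NULL,
    Price     MONEY NOT NULL,
    INDEX cs_Trades CLUSTERED COLUMNSTORE       -- speeds up full scans
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- 2. Updatable secondary columnstore index on a disk-based table.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncs_Orders
    ON dbo.Orders (OrderDate, CustomerId, Amount);

-- 3. B-tree index on a table stored as a primary (clustered) columnstore,
--    making point lookups and small range scans cheap.
CREATE NONCLUSTERED INDEX ix_OrderHistory_CustomerId
    ON dbo.OrderHistory (CustomerId);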
Experimental Results
Cost of inserts, updates, deletes and their effect on query time
Experimental Results (2)
Scan speedup using CSI
Experimental Results (3)
Query performance with and without scan enhancements (using 12 cores)
Experimental Results (4)
Comparing performance of basic operations with and without SIMD instructions
Efficient Evaluation of Object-Centric Exploration Queries for Visualization
Type: Research Paper
Authors: You Wu, Jun Yang (Duke University), Boulos Harb, Cong Yu (Google)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
What is an effective way to explore data? Example:
Player LeBron James has scored 35 or more points in every one of the last k games (see the SQL sketch below).
Identifying clusters and outliers.
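The streak pattern in the example can be checked with window functions. A minimal sketch, assuming a hypothetical GameLog(player, game_date, points) table and k = 5; this is illustrative SQL, not the paper's evaluation algorithm:

-- Hypothetical schema: GameLog(player, game_date, points).
WITH ranked AS (
    SELECT player, points,
           ROW_NUMBER() OVER (PARTITION BY player
                              ORDER BY game_date DESC) AS rn
    FROM GameLog
)
SELECT player
FROM ranked
WHERE rn <= 5             -- last k = 5 games per player
GROUP BY player
HAVING MIN(points) >= 35  -- 35+ points in every one of them
   AND COUNT(*) = 5;      -- player actually has 5 games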
Contribution
Related work focuses on other visualization techniques; this paper targets efficient evaluation through sampling.
Two algorithms: Ssparse and SSketch.
A proposed algorithm also takes the allocated budget into account.
A Demonstration of AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data
Type: Demonstration Paper
Authors: Ahmed M. Aly, Ahmed S. Abdelhamid, Ahmed R. Mahmood, Walid G. Aref, Mohamed S. Hassan (Purdue), Hazem Elmeleegy (Turn Inc.), Mourad Ouzzani (Qatar Computing Research Institute)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Too many location-aware devices! Too much location data! Other systems use static data partitioning.
The idea
Four steps: Initialization, Query Execution, Data Acquisition, Repartitioning.
Contribution
AQWA adapts to the query workload and the data distribution, and incrementally updates the partitioning!
Deployed on Hadoop
Spark