YZStack: Provisioning Customizable Solution for Big Data

Sai Wu, Chun Chen, Gang Chen, Lidan Shou, Ke Chen, Zhejiang University

Hui Cao, yzBigData Co. Ltd.

He Bai, City Cloud Technology

3H Problem in Deploying the Big Data System

• How can I build and deploy a big data system without background knowledge?

• How can I migrate existing applications to the big data system?

• How can I use my big data system to do the analysis job?

Too Many Choices

• Virtualization:
  – OpenStack
  – CloudStack
  – VMware

• Cloud storage:
  – Key-value store (HBase, Cassandra, Redis, …)
  – Relational service (AWS, Spanner, …)

• Processing engine:
  – MapReduce/Hadoop
  – Dryad
  – Pregel, GraphLab
  – Spark
  – epiC

• Application service:
  – Mahout
  – Hive
  – Spatial Hadoop

Can I Deploy a Big Data System Like Installing Windows Software?

• Configure the installation as a customization process

• The installation software copies the binaries to all servers and does the configuration automatically

• A browser-based management system starts/stops the services and monitors their status

YZStack: the Architecture

• Layers are loosely connected

• Each layer includes many selectable modules

• Modules of different layers are linked via the common interfaces

• Optimizations are implemented as special plugins

[Figure: YZStack architecture diagram. Layer labels: IaaS, DaaS, PaaS, SaaS, Plugins, Applications. Component labels: Cloud Virtual Server, Cloud Storage, Cloud Network; Relational Data Service, Object Based Service, Distributed File System, Key-Value Store; YZepiC, Graph Engine, Relational Analytical Engine, Relational Transactional Engine; Data Mining, OLAP Processing, Stream Processing, OLTP Processing, Visualization; Security Module, System Monitor, Optimization Modules, Data Integration Module, Data Importer/ETL tools; Smart Traffic, Hangzhou E-card, Analyzer for Power Grid, Green Hangzhou.]

Features of YZStack

• Adaptive Image
  – Based on OpenStack, partition the big image into small chunks
  – Different images share the same chunks

• Optimization Plugins
  – Column-oriented plugin
  – Index plugin
  – Query optimization plugin
  – Iterative job plugin

• Visualization Tool
  – Zoom in/out for different dimensions
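The adaptive image idea (split an image into chunks, let different images share identical chunks) can be sketched as a small content-addressed store. This is only an illustration under stated assumptions: the slides do not say how YZStack detects shared chunks, so keying chunks by their SHA-256 digest, the `ChunkStore` class, and the tiny `CHUNK_SIZE` are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only for the demo


class ChunkStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest (assumption)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.images = {}  # image name -> ordered list of chunk digests

    def add_image(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is not stored a second time.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.images[name] = digests

    def read_image(self, name):
        # Reassemble the image from its chunk list.
        return b"".join(self.chunks[d] for d in self.images[name])


store = ChunkStore()
store.add_image("base", b"AAAABBBBCCCC")
store.add_image("with_hadoop", b"AAAABBBBDDDD")
print(len(store.chunks))  # distinct chunks stored for both images
```

The two images reference six chunks in total, but the first two chunks are identical and are stored once.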

Optimization Plugin

[Figure: two layers (Layer 1 and Layer 2). Each layer shows a common interface, modules A … K, a default implementation, and hooks. Optimization Plugin 1 and Optimization Plugin 2 supply customized functions.]
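One way to read the hook mechanism is that a layer calls its default implementation unless a plugin has registered a customized function for that operation. The sketch below illustrates that reading only; `Layer`, `register_hook`, `select`, and `IndexPlugin` are hypothetical names, not YZStack's actual interfaces.

```python
class Layer:
    """A layer exposing a common interface with a replaceable default."""

    def __init__(self, name):
        self.name = name
        self._hooks = {}  # operation name -> customized function

    def register_hook(self, operation, func):
        self._hooks[operation] = func

    def select(self, rows, column, value):
        # Use the plugin's function if one is registered, else the default.
        impl = self._hooks.get("select", self._default_select)
        return impl(rows, column, value)

    @staticmethod
    def _default_select(rows, column, value):
        # Default implementation: full scan.
        return [r for r in rows if r[column] == value]


class IndexPlugin:
    """Hypothetical optimization plugin: answers equality lookups from a hash index."""

    def __init__(self, rows, column):
        self.column = column
        self.lookups = 0
        self.index = {}
        for r in rows:
            self.index.setdefault(r[column], []).append(r)

    def select(self, rows, column, value):
        if column != self.column:
            # Fall back to the default for columns that are not indexed.
            return Layer._default_select(rows, column, value)
        self.lookups += 1
        return list(self.index.get(value, []))


rows = [{"id": 1, "dept": "tax"}, {"id": 2, "dept": "traffic"}, {"id": 3, "dept": "tax"}]
layer = Layer("relational data service")
before = layer.select(rows, "dept", "tax")  # default full scan
plugin = IndexPlugin(rows, "dept")
layer.register_hook("select", plugin.select)
after = layer.select(rows, "dept", "tax")   # served by the plugin
```

Callers keep using `layer.select` in both cases, which is the point of linking modules through a common interface.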

Use Case: the Smart Financial System

• Built for the Zhejiang Provincial Department of Finance (ZPDF)

[Figure: architecture of the smart financial system. Labels: Virtual Server (several); Distributed File System; Relational Data Service; Schema Metadata, Data Statistics, Table File, Tablet File; YZepiC; SQL Query Parser, Query Optimizer, Query Engine; Relational Analytical Engine; OLAP Module, Data Mining Module; Data Importer; Tax, Energy, Environment, Traffic, Human, Electronic; Index Plugin, Visualization Tool, Security Plugin, Monitor Plugin.]

Economic Prediction

• Collaborate with researchers from the College of Economics, Zhejiang University

• Step 1:
  – Use the OLAP module to provide a basic view for each registered company

Economic Prediction (cont.)

• Step 2:
  – Healthy Model: Based on the historical data, the healthy model discovers risks and predicts the prospects of an industry.
  – Energy Consumption Model: We link the financial data with the electronic, water, and environment data to rank each industry by its energy consumption per unit of output value.
  – Economic Impact Model: By connecting the financial data to the human resource data, we study how many workers an industry employs and their average salary.
  – Combine all three models to rank all industries accordingly.
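The energy consumption ranking can be illustrated with a few lines of arithmetic. This is a toy sketch: the industry names and numbers are made up for the example, the units are arbitrary, and the slides do not describe how the three models are combined, so only the "energy per unit of output value" ratio is shown.

```python
# Made-up illustrative records (not data from the ZPDF system).
industries = {
    "industry_a": {"energy": 500.0, "output_value": 100.0},
    "industry_b": {"energy": 300.0, "output_value": 150.0},
    "industry_c": {"energy": 900.0, "output_value": 120.0},
}


def energy_intensity(record):
    """Energy consumed per unit of output value."""
    return record["energy"] / record["output_value"]


# Rank from the most to the least energy-efficient industry.
ranking = sorted(industries, key=lambda name: energy_intensity(industries[name]))
print(ranking)
```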

Economic Prediction (cont.)

• Step 3: Economic Index (ongoing work)
  – To predict the status of the whole of Zhejiang Province using statistics generated by the previous two steps
  – Involving multiple complex economic models
  – Our economic researchers are using the visualization tools to build and study their models

Detection of Improper Payment

• What is an improper payment?
  – A person is classified as low-income and buys a house reserved for low-and-medium wage earners, but is actually employed by an IT company.
  – A company may submit different registration files to different government departments (e.g., it registers as a high-tech company with the Department of Science, but as a labor-intensive one with the Department of Labor) to collect various allowances from the government.

Why ZPDF?

• A harbor of financial data in Zhejiang Province
  – Electronic department
  – Traffic department
  – Tax department
  – …

• It is well motivated
  – Expected to save more than 1 billion CNY

Improper Payment

• Step 1 (Consistent Problem):
  – To detect improper payment from two databases, D0 and D1, we first generate two star-join queries, Q0 and Q1, which selectively merge the fact tables with the dimension tables.
  – The trick is that the entities returned by Q0 should not exist in the results of Q1.
  – E.g., Q0 returns the high-income persons, while Q1 returns the users who own a house reserved for low-and-medium wage earners.
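The Q0/Q1 check can be sketched with two small star joins in SQLite. Everything concrete here is an assumption made for illustration: the table and column names, the income cutoff, the sample rows, and above all the shared `person_id` key. In practice the two databases may not share a clean key, which is presumably why the deck goes on to match tuples by similarity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- D0 (hypothetical): income fact table with a person dimension.
CREATE TABLE person_dim (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE income_fact (person_id INTEGER, year INTEGER, income REAL);
-- D1 (hypothetical): housing fact table with a house-type dimension.
CREATE TABLE house_type_dim (type_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE house_fact (person_id INTEGER, type_id INTEGER);

INSERT INTO person_dim VALUES (1, 'p1'), (2, 'p2'), (3, 'p3');
INSERT INTO income_fact VALUES (1, 2014, 900000), (2, 2014, 30000), (3, 2014, 850000);
INSERT INTO house_type_dim VALUES (10, 'subsidized'), (20, 'commercial');
INSERT INTO house_fact VALUES (1, 10), (2, 10), (3, 20);
""")

# Q0: high-income persons (the cutoff is an arbitrary example value).
Q0 = """
SELECT p.person_id
FROM income_fact f JOIN person_dim p ON p.person_id = f.person_id
WHERE f.income > 500000
"""

# Q1: persons who own a house reserved for low-and-medium wage earners.
Q1 = """
SELECT h.person_id
FROM house_fact h JOIN house_type_dim t ON t.type_id = h.type_id
WHERE t.label = 'subsidized'
"""

# Entities returned by Q0 should not appear in Q1; any overlap is suspicious.
violations = [row[0] for row in conn.execute(f"{Q0} INTERSECT {Q1}")]
print(violations)
```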

Consistent Problem

• We apply LSH (Locality-Sensitive Hashing) to generate k hash values for each tuple from T0 and T1.

• Tuples sharing the same hash value are considered a candidate group.

• We define a similarity function sim(ti, tj) to evaluate the probability that two tuples represent the same entity. If sim(ti, tj) is greater than a predefined threshold, the pair is forwarded to the verification module, where a human-aided algorithm filters out the false positives.
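The candidate-group step can be sketched with MinHash, one common LSH family. The slides do not name the hash family or define sim, so the choices below are assumptions: character 3-gram tokens, k = 4 seeded MD5-based MinHash values, Jaccard similarity as sim, and a threshold of 0.5. T0 and T1 are taken here to be the tuple sets being compared, and the sample tuples are made up.

```python
import hashlib
from collections import defaultdict

K = 4            # number of hash values per tuple (assumed)
THRESHOLD = 0.5  # similarity cutoff (assumed)


def tokens(tup):
    """Character 3-grams over the lowercased, space-joined fields."""
    text = " ".join(str(v).lower() for v in tup)
    return {text[i:i + 3] for i in range(max(1, len(text) - 2))}


def minhash(token_set, seed):
    """Smallest seeded hash over the token set (one MinHash value)."""
    return min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
               for t in token_set)


def sim(a, b):
    """Jaccard similarity of the two tuples' token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(t0, t1):
    # Bucket key: (which hash function, hash value). Each bucket keeps the
    # T0 tuples and T1 tuples that landed in it: one candidate group.
    buckets = defaultdict(lambda: ([], []))
    for side, table in enumerate((t0, t1)):
        for tup in table:
            ts = tokens(tup)
            for seed in range(K):
                buckets[(seed, minhash(ts, seed))][side].append(tup)
    return {(a, b) for left, right in buckets.values() for a in left for b in right}


def to_verify(t0, t1):
    """Candidate pairs whose similarity exceeds the threshold."""
    return sorted(p for p in candidate_pairs(t0, t1) if sim(*p) > THRESHOLD)


T0 = [("zhang wei", "hangzhou")]
T1 = [("zhang wei", "hangzhou"), ("li na", "ningbo")]
print(to_verify(T0, T1))
```

Pairs returned by `to_verify` would then go to the human-aided verification step; only tuples that share a bucket are ever compared, which avoids scoring every T0 × T1 pair.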

[Figure: two star schemas, each a fact table with dimension tables, produce candidate groups that are passed to verification.]

Conclusion

• YZStack is tailored for users who have little or no experience in deploying and maintaining a cloud system.

• It reduces the development of a new big data application to a process of module selection and customization.

• To show the flexibility and usability of YZStack, we demonstrate how we built a smart financial system for the Zhejiang Provincial Department of Finance using YZStack.

