Data platform at Samsung (Big Learning)

SRA-SV | Cloud Research Lab

Guangdeng Liao, Zhan Zhang

Samsung Cloud Research Lab

Data Platform at Samsung


Our Mission: provide scalable, reliable, and secure storage and computation for Samsung R&D

Samsung Data Platform

Resources:
• Hundreds of machines
• Petabytes of storage
• Still growing


What we have in our platform

Offline
• Distributed MR processing
• Data warehousing with Hive/Pig
• In-house web-based ETL portal
• Many more..

Online
• K-V store: HBase
• In-house Blob store
• Online: Storm
• Many more..

• Apache Mahout
• ElasticSearch

Dev. & management tools
• In-house unified web portal
• In-house Single Sign On
• Visualization
• Many more..

With this platform, we have already significantly improved the ETL process, data management, and data processing for other teams!


So, are we done?

No. Many more complex challenges.


Challenge #1: How to build scalable and efficient machine learning over Big Data?


MR-based Mahout is good but...

Not good at expressing data dependency and iterative algorithms like PageRank

Map: distribute rank to link targets

Reduce: collect ranks from multiple sources

Iterate

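$$PR(x) = \alpha \left(\frac{1}{N}\right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

where t_1, ..., t_n are the pages linking to x, C(t_i) is the number of outgoing links of t_i, N is the total number of pages, and α is the random-jump probability.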

One job per iteration:
• Startup penalty
• I/O penalty

Unfortunately, many machine learning and data mining (MLDM) algorithms are iterative jobs
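The Map / Reduce / Iterate pattern above can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not Hadoop or Mahout code: the function names, the toy graph, and the random-jump probability `alpha = 0.15` are assumptions made for the example, and every page is assumed to have at least one outgoing link.

```python
from collections import defaultdict

def map_phase(graph, ranks):
    """Map: each page distributes its rank evenly to its link targets."""
    for page, targets in graph.items():
        share = ranks[page] / len(targets)
        for target in targets:
            yield target, share

def reduce_phase(pairs, pages, alpha):
    """Reduce: collect the shares sent to each page, then apply the update."""
    incoming = defaultdict(float)
    for target, share in pairs:
        incoming[target] += share
    n = len(pages)
    return {p: alpha / n + (1 - alpha) * incoming[p] for p in pages}

def pagerank(graph, alpha=0.15, iterations=20):
    """Iterate: run one map + reduce pass per iteration."""
    pages = list(graph)
    ranks = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # On a MapReduce cluster each pass would be a separate job, so the
        # job startup cost and the read/write of all ranks are paid every time.
        ranks = reduce_phase(map_phase(graph, ranks), pages, alpha)
    return ranks

# Hypothetical toy graph: page -> pages it links to.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

The loop body is where the per-iteration startup and I/O penalties come from: nothing is kept in memory between passes when each pass is its own job.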


Graph naturally represents data dependency


Graph-based Processing: Think like a Vertex

Scheduling

[Figure: in-memory data graph over a cluster]

Communication
– Message-based
– Shared memory-based

Vertex abstraction
– Think like a vertex
– In-memory processing

Execution engine
– Bulk synchronous parallel
– Asynchronous parallel

Popular frameworks:
– Giraph
– GraphLab
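To make "think like a vertex" concrete, here is a small single-process sketch of a message-based, bulk synchronous parallel loop. It is not the Giraph or GraphLab API: `compute`, `run_bsp`, and the example graph are hypothetical names chosen for illustration, and the vertex program (propagating the smallest vertex id to find connected components) is just one simple choice.

```python
def compute(value, messages, superstep):
    """Vertex program: keep the smallest label seen so far.

    Returns (new_value, active). A vertex is active in superstep 0 and
    whenever a message lowers its value; otherwise it votes to halt.
    """
    best = min([value, *messages])
    return best, superstep == 0 or best < value

def run_bsp(edges, max_supersteps=50):
    """Bulk synchronous parallel loop over an undirected in-memory graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    values = {v: v for v in neighbors}      # initial label = own vertex id
    inbox = {v: [] for v in neighbors}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        any_active = False
        for v in neighbors:
            if superstep > 0 and not inbox[v]:
                continue                    # halted and no message woke it up
            values[v], active = compute(values[v], inbox[v], superstep)
            if active:
                any_active = True
                for n in neighbors[v]:
                    outbox[n].append(values[v])
        inbox = outbox                      # barrier: deliver in next superstep
        if not any_active:
            break
    return values

# Two components: {1, 2, 3} and {4, 5}.
labels = run_bsp([(1, 2), (2, 3), (4, 5)])
```

Unlike the one-job-per-iteration MapReduce pattern, the graph and vertex values stay in memory across supersteps, and only messages move between vertices.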


Graph-based Machine Learning

We used Apache Giraph 1.0 and developed a machine learning library on top of it:

Recommendation
• Alternating Least Squares (ALS)
• Weighted ALS
• SGD (Matrix Factorization)
• Biased SGD

Graphical Model
• Belief Propagation

Clustering
• KMeans
• KMeans++
• Fuzzy Clustering

We see an order-of-magnitude speedup over the MR-based approach on our cluster


Challenge #2: How to make Big Model + Big Data workloads such as Deep Learning scalable and efficient?


One example: Deep Learning [1]

Many more examples (millions to billions of parameters) in speech recognition, image processing, and NLP

[1] ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012


Model-Parallel Framework

User-defined model

Auto-generation of model topology

Auto-partition of topology over cluster

Auto-deployment of topology (in-memory)

[Figure: model topology partitioned across cluster nodes c1, c2, c3]

Neuron-like programming

Message-based communication

Message-driven computation

Parallelize a big machine learning model over a cluster
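The neuron-like, message-driven style can be illustrated with a toy single-process sketch. Everything here is hypothetical and not taken from the framework itself: the `Neuron` class, `run_forward`, the sigmoid activation, and the hand-written `partition_of` map (the slides describe partitioning as automatic; here it is assigned by hand to keep the example short).

```python
from collections import deque
import math

class Neuron:
    """A unit that reacts to incoming messages (message-driven computation)."""
    def __init__(self, weights, bias=0.0):
        self.weights = weights      # {upstream neuron name: weight}
        self.bias = bias
        self.received = {}          # inputs seen so far in this pass

    def on_message(self, sender, value):
        """Store the input; fire once every upstream value has arrived."""
        self.received[sender] = value
        if len(self.received) < len(self.weights):
            return None
        total = self.bias + sum(self.weights[s] * v
                                for s, v in self.received.items())
        self.received = {}
        return 1.0 / (1.0 + math.exp(-total))   # sigmoid activation

def run_forward(neurons, downstream, partition_of, inputs):
    """Deliver messages until the queue drains; count cross-partition hops."""
    queue = deque((src, dst, value)
                  for src, value in inputs.items()
                  for dst in downstream[src])
    outputs, remote_messages = {}, 0
    while queue:
        src, dst, value = queue.popleft()
        if partition_of[src] != partition_of[dst]:
            remote_messages += 1    # would cross the network on a real cluster
        result = neurons[dst].on_message(src, value)
        if result is not None:
            outputs[dst] = result
            for nxt in downstream.get(dst, []):
                queue.append((dst, nxt, result))
    return outputs, remote_messages

neurons = {
    "h1": Neuron({"x1": 1.0, "x2": -1.0}),
    "h2": Neuron({"x1": 0.5, "x2": 0.5}),
    "o": Neuron({"h1": 1.0, "h2": 1.0}),
}
downstream = {"x1": ["h1", "h2"], "x2": ["h1", "h2"], "h1": ["o"], "h2": ["o"]}
partition_of = {"x1": "c1", "x2": "c1", "h1": "c1", "h2": "c2", "o": "c3"}
outputs, remote_messages = run_forward(
    neurons, downstream, partition_of, {"x1": 1.0, "x2": 1.0})
```

Counting `remote_messages` hints at why the partitioning matters: edges that cross partitions turn into network traffic, so a good partition keeps densely connected neurons together.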


Architecture over Yarn

[Figure: YARN cluster with an Application Master, a Controller, Node Managers, and Containers]

Controller: partition and deploy topology

Data communication:
• node-level
• group-level

Control comm. based on Thrift

Data comm. based on Netty


Execution Engine

• Execution Engine (Deep Neural Net)
– Training layer by layer, controlled by the Execution Engine
– Progress reporting
– Process control: the end user can control the training process, and even restart it from a certain point
– System snapshot for fault tolerance

[Figure: layer stack: Input, RBM, RBM, Softmax; fully connected]

• Generic Execution Engine
– Abstracts the common design patterns from our experience developing deep neural net algorithms
– Generalized to support various other algorithms
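The combination of layer-by-layer training, progress reporting, and snapshot-based restart can be sketched as a small control loop. This is an illustration under assumptions, not the engine described in the slides: `run_engine`, `train_layer`, the JSON snapshot file, and the layer names are all hypothetical, and the actual training is replaced by a placeholder.

```python
import json
import os
import tempfile

def train_layer(name, epochs):
    """Placeholder for real layer training; returns a small summary record."""
    return {"layer": name, "epochs": epochs}

def run_engine(layers, snapshot_path, epochs=3, report=print):
    """Train layers in order, snapshotting after each so a rerun can resume."""
    state = {"done": []}
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            state = json.load(f)            # restart from the last snapshot
    finished = {entry["layer"] for entry in state["done"]}
    for index, name in enumerate(layers, start=1):
        if name in finished:
            continue                        # already trained before the restart
        state["done"].append(train_layer(name, epochs))
        with open(snapshot_path, "w") as f:
            json.dump(state, f)             # snapshot for fault tolerance
        report(f"layer {index}/{len(layers)} trained: {name}")
    return state

layers = ["rbm1", "rbm2", "softmax"]
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
log = []
run_engine(layers[:2], path, report=log.append)      # simulate a run that stops early
final = run_engine(layers, path, report=log.append)  # resumes; trains only softmax
```

Writing the snapshot after every layer keeps the restart point coarse but simple; a real engine would likely also snapshot within a layer's training.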


Model parallelism alone is still not scalable enough over Big Data


Deep Learning Platform: Hybrid of Data-parallelism and Model-parallelism

[Figure: data chunks feed multiple model-parallel model instances, which coordinate parameters through Parameter Server 1 ... Parameter Server n]

Data-parallelism
• Lots of model instances
• Parameter servers help models learn from each other


Distributed Parameter Servers

[Figure: Clients; Netty communication layer (Pull/Push/Sync); Server 1, Server 2, Server 3; in-memory cache/storage; HBase/HDFS]

Currently we support asynchronous parameter pull and push. A synchronized version is also supported.
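A minimal single-process sketch of the pull/push idea follows. It is not the system described here: `ParameterServer`, `ShardedClient`, `worker_step`, the byte-sum sharding rule, and the toy linear model are assumptions for illustration, and the "workers" simply take turns instead of running concurrently over a network layer.

```python
class ParameterServer:
    """One shard: holds a slice of the parameters in memory."""
    def __init__(self, learning_rate):
        self.params = {}
        self.learning_rate = learning_rate

    def pull(self, key):
        return self.params.get(key, 0.0)

    def push(self, key, gradient):
        # Applied on arrival, without waiting for other clients (asynchronous).
        self.params[key] = self.pull(key) - self.learning_rate * gradient

class ShardedClient:
    """Routes each parameter key to one server shard."""
    def __init__(self, servers):
        self.servers = servers

    def _server(self, key):
        # Deterministic toy sharding rule (an assumption, not a real scheme).
        return self.servers[sum(key.encode()) % len(self.servers)]

    def pull(self, key):
        return self._server(key).pull(key)

    def push(self, key, gradient):
        self._server(key).push(key, gradient)

def worker_step(client, chunk):
    """Pull current parameters, compute a gradient on one data chunk, push it."""
    w, b = client.pull("w"), client.pull("b")
    grad_w = grad_b = 0.0
    for x, y in chunk:
        error = (w * x + b) - y
        grad_w += error * x / len(chunk)
        grad_b += error / len(chunk)
    client.push("w", grad_w)
    client.push("b", grad_b)

servers = [ParameterServer(learning_rate=0.1) for _ in range(2)]
client = ShardedClient(servers)
# Two data chunks sampled from y = 2x + 1, one per model instance.
chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
for step in range(2000):
    worker_step(client, chunks[step % len(chunks)])   # workers take turns
w, b = client.pull("w"), client.pull("b")
```

Each worker only ever sees its own chunk, yet both converge on shared values because every update goes through the servers. A synchronized variant would collect the pushes from all workers for a step before applying them.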


Deep Learning Algorithms

Aimed at three major application fields: speech recognition, image processing, and NLP

What we have developed / Our Roadmap

• Feed Forward Neural Network
• Restricted Boltzmann Machine
• Deep Belief Network
• Sparse Auto-encoder
• Convolutional Neural Network
• Recurrent Neural Network


Summary

• We are providing our Hadoop-based data platform
– Hundreds of machines, petabytes of storage
– Hadoop ecosystem (MapReduce, HBase, Yarn, HDFS, Zookeeper, Oozie, Lipstick, Mahout, etc.)
– In-house ETL pipeline
– In-house unified web portal with SSO

• We are working hard on big learning to make our platform intelligent
– Large-scale graph-based machine learning
– Large-scale deep learning
– And many more in progress

Q&A

Date post:	26-Jan-2015