Date posted: 26-Jan-2015
SRA-SV | Cloud Research Lab
Guangdeng Liao, Zhan Zhang
Samsung Cloud Research Lab
Data Platform at Samsung
SRA-SV | Cloud Research Lab Slide 2
Our Mission: provide scalable, reliable, and secure storage and computation for Samsung R&D
Samsung Data Platform
Resources:
• Hundreds of machines
• Petabytes of storage
• Keeps increasing
SRA-SV | Cloud Research Lab Slide 3
What we have in our platform
Offline:
• Distributed MR processing
• Data warehousing with Hive/Pig
• In-house web-based ETL portal
• Many more..

Online:
• K-V store: HBase
• In-house blob store
• Online: Storm
• Many more..

Apache Mahout, ElasticSearch

Dev. & management tools:
• In-house unified web portal
• In-house Single Sign-On
• Visualization
• Many more..
By using the platform, we have already significantly improved the ETL process, data management, and data processing for other teams.
SRA-SV | Cloud Research Lab Slide 4
So, are we done?
No. Many more complex challenges.
SRA-SV | Cloud Research Lab Slide 5
Challenge #1: How to build scalable and efficient machine learning over Big Data?
SRA-SV | Cloud Research Lab Slide 6
MR-based Mahout is good but...
Not good at expressing data dependency and iterative algorithms like PageRank
Map: distribute rank to link targets
Reduce: collect ranks from multiple sources
Iterate
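$$PR(x) = (1-\alpha)\,\frac{1}{N} + \alpha \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

where $t_1, \dots, t_n$ are the pages linking to $x$, $C(t_i)$ is the number of outgoing links of $t_i$, $N$ is the total number of pages, and $\alpha$ is the damping factor.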
One job per iteration: startup penalty, I/O penalty
Unfortunately, many machine learning and data mining (MLDM) algorithms are iterative jobs
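The Map, Reduce, and Iterate steps on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Mahout or Hadoop code: the function names, the toy graph, and the damping factor of 0.85 are assumptions made for the example, and every page is assumed to have at least one outgoing link.

```python
from collections import defaultdict


def pagerank_map(page, rank, out_links):
    """Map: distribute this page's rank evenly to its link targets."""
    for target in out_links:
        yield target, rank / len(out_links)


def pagerank_reduce(contributions, n_pages, alpha):
    """Reduce: combine the ranks collected from all source pages."""
    return (1 - alpha) / n_pages + alpha * sum(contributions)


def pagerank(graph, alpha=0.85, iterations=20):
    """Iterate: on a real cluster each pass would be one full MapReduce job."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        shuffled = defaultdict(list)  # stands in for the shuffle phase
        for page, out_links in graph.items():
            for target, share in pagerank_map(page, ranks[page], out_links):
                shuffled[target].append(share)
        ranks = {page: pagerank_reduce(shuffled[page], n, alpha) for page in graph}
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical input
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

Each pass of the loop corresponds to one job submission, which is where the startup and I/O penalties accumulate: the full rank table is written out and read back between iterations.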
SRA-SV | Cloud Research Lab Slide 7
Graph naturally represents data dependency
SRA-SV | Cloud Research Lab Slide 8
Graph-based Processing: Think like a Vertex
[Figure: in-memory data graph over a cluster, with nodes labeled p]
• Vertex abstraction
– Think like a vertex
– In-memory processing
• Communication
– Message-based
– Shared-memory-based
• Execution engine
– Bulk synchronous parallel
– Asynchronous parallel
• Scheduling
• Popular frameworks
– Giraph
– GraphLab
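A minimal sketch of the "think like a vertex" model with message-based communication and bulk synchronous parallel execution, again using PageRank. This is illustrative Python, not the Giraph or GraphLab API: `compute`, `run_bsp`, and the damping factor of 0.85 are hypothetical names and values, and every vertex is assumed to have at least one outgoing edge.

```python
def compute(value, messages, out_edges, superstep, n_vertices, alpha=0.85):
    """Vertex program: runs once per vertex in every superstep."""
    if superstep > 0:
        # Combine the messages that arrived during the previous superstep.
        value = (1 - alpha) / n_vertices + alpha * sum(messages)
    # Send an equal share of the current value along every outgoing edge.
    outgoing = [(target, value / len(out_edges)) for target in out_edges]
    return value, outgoing


def run_bsp(graph, supersteps=20):
    """Bulk synchronous parallel loop over an in-memory graph."""
    n = len(graph)
    values = {v: 1.0 / n for v in graph}
    inbox = {v: [] for v in graph}
    for step in range(supersteps):
        next_inbox = {v: [] for v in graph}
        for v, out_edges in graph.items():
            values[v], outgoing = compute(values[v], inbox[v], out_edges, step, n)
            for target, message in outgoing:
                next_inbox[target].append(message)
        inbox = next_inbox  # barrier: messages become visible in the next superstep
    return values


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical input
    print(run_bsp(toy_graph))
```

Unlike the one-job-per-iteration MapReduce pattern, the graph and vertex values stay in memory across supersteps, and only messages move between vertices.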
SRA-SV | Cloud Research Lab Slide 9
Graph-based Machine Learning
We used Apache Giraph 1.0 and developed a machine learning library on top of it:
• Recommendation
– Alternating Least Squares (ALS)
– Weighted ALS
– SGD (matrix factorization)
– Bias SGD
• Graphical model
– Belief Propagation
• Clustering
– KMeans
– KMeans++
– Fuzzy clustering
We see an order-of-magnitude speedup compared to the MR-based approach in our cluster
SRA-SV | Cloud Research Lab Slide 10
Challenge #2: How to make Big Model + Big Data workloads such as Deep Learning scalable and efficient?
SRA-SV | Cloud Research Lab Slide 11
One example: Deep Learning [1]
Many more examples (millions to billions of parameters) in Speech Recognition, Image Processing and NLP
[1] ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
SRA-SV | Cloud Research Lab Slide 12
Model-Parallel Framework
• User-defined model
• Auto-generation of model topology
• Auto-partition of topology over the cluster (c1, c2, c3)
• Auto-deployment of topology (in-memory)
• Neuron-like programming
• Message-based communication
• Message-driven computation
Parallelize a big machine learning model over a cluster
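The neuron-like, message-driven style can be illustrated with a small single-process simulation. Everything here is an assumption made for the example rather than the framework's actual interface: the `Neuron` class, the round-robin `partition` policy, and the sigmoid activation are all hypothetical.

```python
import math


class Neuron:
    """Fires once it has received a message from every upstream input."""

    def __init__(self, weights, bias):
        self.weights = weights  # one weight per upstream input index
        self.bias = bias
        self.received = {}

    def on_message(self, source_index, activation):
        """Message-driven computation: compute only when all inputs arrived."""
        self.received[source_index] = activation
        if len(self.received) < len(self.weights):
            return None  # still waiting for more messages
        total = self.bias + sum(w * self.received[i]
                                for i, w in enumerate(self.weights))
        self.received = {}
        return 1.0 / (1.0 + math.exp(-total))  # sigmoid (assumed activation)


def partition(neuron_ids, n_workers):
    """Assign neurons to workers round-robin (hypothetical policy)."""
    return {nid: nid % n_workers for nid in neuron_ids}


def forward(inputs, layer, placement):
    """Deliver inputs as messages; return {neuron: (worker, activation)}."""
    outputs = {}
    for nid, neuron in enumerate(layer):
        result = None
        for source_index, activation in enumerate(inputs):
            result = neuron.on_message(source_index, activation)
        outputs[nid] = (placement[nid], result)
    return outputs


if __name__ == "__main__":
    layer = [Neuron([0.5, -0.5], 0.0) for _ in range(3)]
    placement = partition(range(len(layer)), 3)  # workers c1, c2, c3
    print(forward([1.0, 2.0], layer, placement))
```

In a real deployment each worker would hold only its own neurons and the messages would cross the network. The simulation keeps everything in one process to show the control flow only.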
SRA-SV | Cloud Research Lab Slide 13
Architecture over Yarn
[Figure: an Application Master and a Controller alongside Node Managers and their Containers]
• Controller: partitions and deploys the topology
• Data communication: node-level and group-level
• Control communication based on Thrift
• Data communication based on Netty
SRA-SV | Cloud Research Lab Slide 14
Execution Engine
• Execution Engine (Deep Neural Net)
– Training layer by layer, controlled by the Execution Engine
– Progress reporting
– Process control: the end user can control the training process, and even restart it from a certain point
– System snapshot for fault tolerance
[Figure: network stack with Input, RBM, RBM, and Softmax layers, fully connected]
• Generic Execution Engine
– Abstracts the common design pattern from our experience developing deep neural net algorithms
– Generalized to support various other algorithms
SRA-SV | Cloud Research Lab Slide 15
Model-parallel is still not scalable enough over Big Data
SRA-SV | Cloud Research Lab Slide 16
Deep Learning Platform: Hybrid of Data-parallelism and Model-parallelism
[Figure: data chunks feed model-parallel model instances; Parameter Server 1 through Parameter Server n handle parameter coordination]
• Data-parallelism: lots of model instances
• Parameter servers help the models learn from each other
SRA-SV | Cloud Research Lab Slide 17
Distributed Parameter Servers
[Figure: clients issue Pull/Push/Sync requests through a Netty communication layer to Server 1, Server 2, and Server 3, each with in-memory cache/storage, backed by HBase/HDFS]
Currently we support asynchronous parameter pull and push. A synchronized version is also supported.
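A minimal sketch of asynchronous pull and push against sharded, in-memory parameter servers. It runs in a single process with threads standing in for model instances. The class names, the CRC32-based key routing, the learning rate, and the toy quadratic objective are all assumptions for illustration, and the Netty transport and HBase/HDFS backing store are omitted.

```python
import threading
import zlib


class ParameterShard:
    """One server: keeps its slice of the parameters in memory."""

    def __init__(self):
        self._params = {}
        self._lock = threading.Lock()

    def pull(self, keys):
        with self._lock:
            return {k: self._params.get(k, 0.0) for k in keys}

    def push(self, grads, lr):
        # Apply the update as soon as it arrives; no barrier across clients.
        with self._lock:
            for k, g in grads.items():
                self._params[k] = self._params.get(k, 0.0) - lr * g


class ParameterClient:
    """Routes each key to a shard with a stable hash (hypothetical scheme)."""

    def __init__(self, shards, lr=0.1):
        self.shards = shards
        self.lr = lr

    def _shard_for(self, key):
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def pull(self, keys):
        values = {}
        for k in keys:
            values.update(self._shard_for(k).pull([k]))
        return values

    def push(self, grads):
        for k, g in grads.items():
            self._shard_for(k).push({k: g}, self.lr)


def train_worker(client, target, steps):
    """Toy model instance: minimizes sum((w_k - target_k)^2)."""
    for _ in range(steps):
        weights = client.pull(list(target))
        grads = {k: 2.0 * (weights[k] - target[k]) for k in target}
        client.push(grads)


if __name__ == "__main__":
    shards = [ParameterShard() for _ in range(3)]
    target = {"w0": 3.0, "w1": -1.0}  # hypothetical optimum
    workers = [threading.Thread(target=train_worker,
                                args=(ParameterClient(shards), target, 50))
               for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(ParameterClient(shards).pull(["w0", "w1"]))
```

Because workers never wait for each other, a pushed gradient may have been computed from slightly stale weights. A synchronized variant would add a barrier so that all workers pull the same version before computing the next update.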
SRA-SV | Cloud Research Lab Slide 18
Deep Learning Algorithms
Aimed at three major application fields: speech recognition, image processing, and NLP
What we have developed / Our roadmap:
• Feed Forward Neural Network
• Restricted Boltzmann Machine
• Deep Belief Network
• Sparse Auto-encoder
• Convolutional Neural Network
• Recurrent Neural Network
SRA-SV | Cloud Research Lab Slide 19
Summary
• We are providing our Hadoop-based data platform
– Hundreds of machines, petabytes of storage
– Hadoop ecosystem (MapReduce, HBase, Yarn, HDFS, Zookeeper, Oozie, Lipstick, Mahout, etc.)
– In-house ETL pipeline
– In-house unified web portal with SSO
• We are working hard on big learning to make our platform intelligent
– Large-scale graph-based machine learning
– Large-scale deep learning
– And many more in progress
Q&A