Date post: | 11-Nov-2014 |
Category: |
Documents |
Upload: | edward-j-yoon |
View: | 1,058 times |
Download: | 0 times |
Who Am I
• Edward J. Yoon–@eddieyoon
• Founder of Apache Hama• PMC member of Apache BigTop• Oracle Employee
What’s Hama?
• Open Source – Under Apache 2.0 License
• Written In Java• Apache Top Level Project
Characteristics
• a General BSP computing engine– M/R like Input/Output Formatter
• SequenceFile, Text, Accumulo, Hbase, …, etc.
– Job Manager– Checkpoint Recovery
• Streaming and Pipes – Python, C++, …, etc.
• Graph and Machine Learning Packages– K-means, Gradient Descent, Collaborative Filtering
Bulk Synchronous Parallel?
• Originally introduced by Valiant• a Sequence of supersteps
Compare to M/R and MPI
• Supports message-passing paradigm style of application development
• Provides a flexible, simple, and easy-to-use small APIs
• Enables to perform better than MPI for communication-intensive applications
• Guarantees impossibility of deadlocks or collisions in the communication mechanisms
So, fit for what?
• Processing Big Data w/ complicated relationships– e.g., graph or network.
• Iterative or Recursive scientific applications
• Continuous Event Processing
Which is the Big Data?
Could be applied to
• Analyze user actions and patterns• Social Target Marketing• Observe evolution of Social networks• Detect anomaly rapidly in Real-time• Business Intelligence
Internals
• Pluggable RPC Architecture for message transfer– e.g., Hadoop RPC, Avro RPC, …, etc.
• Message Collector, Bundler, and Compressor to reduce network overheads and contentions– e.g., Snappy, Bzip2, …, etc.
BSP API
public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException, SyncException;
BSP Examples
• Pi Calculation• Sparse Matrix-Vector Multiplication• K-means Clustering• Gradient Descent
Graph API
public void compute(Iterator<M> messages) throws IOException;
Graph Examples
• In-link Count• Single Source Shortest Path• Pagerank• Bipartitie Matching• Semi-Clustering
Find Maximum Value
SSSP Performance
• a SSSP for random graph of 1 billion edges is computed in 400 seconds on 1 Oracle BDA