Overview of Hadoop for Data Mining
Federal Big Data Group
confidential
Mark Silverman
Treeminer, Inc.
155 Gibbs Street, Suite 514
Rockville, Maryland 20850
(240) [email protected]
TREEMINER, INC.CONFIDENTIAL
Agenda
• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm
Introduction to Hadoop
• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)
Hadoop File System
• “Rack” aware
• Local storage
• Distributed copies (generally 3)
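The three bullets above can be illustrated with a small placement sketch. It follows the rule described in the HDFS documentation for a replication factor of 3: one replica on the writer's node, one on a node in a different rack, and one on another node in that same remote rack. The function name, the `topology` dictionary, and the node and rack names are hypothetical, and the sketch assumes at least two racks with at least two nodes on the remote rack. It is plain Python for illustration, not HDFS code.

```python
import random


def place_three_replicas(topology, writer_node, rng=None):
    """Choose three nodes for one block, in the spirit of HDFS's default policy.

    topology: dict mapping a rack name to the list of node names in that rack
              (a hypothetical cluster layout, not read from a real cluster).
    writer_node: the node that is writing the block.
    """
    rng = rng or random.Random()

    # Find which rack the writer lives on.
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the writer's own node (local storage).
    first = writer_node

    # Replica 2: a node on a different rack, so losing one rack
    # cannot lose every copy of the block.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    second = rng.choice(topology[remote_rack])

    # Replica 3: a different node on the same remote rack as replica 2.
    third = rng.choice([node for node in topology[remote_rack] if node != second])

    return [first, second, third]


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_three_replicas(cluster, "n1"))
```

Placing two of the three copies on one remote rack keeps cross-rack write traffic low while still surviving the loss of a whole rack.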
Sample Hadoop File System
Hadoop “Eco-System”
• Hive: Allows SQL-like querying of data in HDFS
• Pig: Basic scripting language for Hadoop
• Databases: HBase, Accumulo, Cassandra, Neo4j
Map / Reduce: Parallel Execution Framework
WordCount Example
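As a rough illustration of what WordCount does, the sketch below walks the three Map/Reduce phases in memory: the mapper emits a (word, 1) pair per word, the shuffle groups values by key, and the reducer sums each group. This is plain Python with hypothetical function names, not the Hadoop API; in a real job the framework performs the shuffle and runs mappers and reducers in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group all values by key, sorted by key.

    In Hadoop the framework does this between the map and reduce phases.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(word, counts):
    """Reducer: sum the list of 1s collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog the end"]))
```

Because each mapper call sees only one line and each reducer call sees only one key, both phases can be spread across many machines without coordination.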
Getting Started
• Cloudera and Hortonworks offer sandboxes that are easy to download and provide a fully self-contained implementation in a VM. Hadoop can also be downloaded directly from Apache.
http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html
Developing In Map / Reduce
• Standalone Mode – Hadoop runs as a single process, best for debugging
• Pseudo-Distributed – Separate processes on the same server
• Fully Distributed – Full-blown cluster
Eclipse Framework
• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, remote debug and profiling
• Profiling: YourKit
WordCount
• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode
$ hadoop jar path/to/file.jar input output
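Once the job finishes, the results can be checked with a short script. The sketch below assumes the job used Hadoop's default text output, where each line of a `part-*` file in the output directory holds a key and a value separated by a tab, and that the output directory is on the local file system, as it is in standalone mode. The function names are hypothetical.

```python
from pathlib import Path


def parse_output_lines(lines):
    """Turn 'word<TAB>count' lines into a dict of word -> int count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


def read_job_output(output_dir):
    """Read every part-* file in a local job output directory."""
    counts = {}
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            counts.update(parse_output_lines(handle))
    return counts


if __name__ == "__main__":
    # "output" matches the output argument of the hadoop jar command above.
    print(read_job_output("output"))
```

Marker files such as `_SUCCESS` do not match the `part-*` pattern, so they are skipped.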
Monitoring Hadoop Jobs
Resources
http://www.cloudera.com
http://www.hortonworks.com
http://hadoop.apache.org
http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf
Hadoop: The Definitive Guide by Tom White
Example: Document AutoClustering using Hadoop and Storm
https://www.youtube.com/watch?v=5X65WV0n4rU