Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer,...

Post on 22-Dec-2015

Overview of Hadoop for Data Mining
Federal Big Data Group
confidential

Mark Silverman
Treeminer, Inc.
155 Gibbs Street, Suite 514
Rockville, Maryland 20850
(240) 389-0750
msilverman@treeminer.com

TREEMINER, INC. CONFIDENTIAL

Agenda

• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm

Introduction to Hadoop

• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)

Hadoop File System

• “Rack” aware
• Local storage
• Distributed copies (generally 3)
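The bullets above can be illustrated with a small simulation. This is only a sketch: it assumes the default placement policy described in the Apache HDFS architecture guide (first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function name `place_replicas`, the `topology` dictionary, and the node and rack names are hypothetical and not part of any Hadoop API.

```python
import random


def place_replicas(topology, writer_node, replication=3, rng=None):
    """Choose the nodes that hold one block's copies (illustrative only).

    topology: dict mapping a rack name to the list of node names in that rack.
    writer_node: the node writing the block; it keeps the first copy locally.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Copy 1: local storage on the writer's own node.
    chosen = [writer_node]

    # Copy 2: a node on a different rack, so a rack failure cannot lose the block.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != rack_of[writer_node] and nodes
    ]
    if replication >= 2 and remote_racks:
        remote_rack = rng.choice(remote_racks)
        second = rng.choice(topology[remote_rack])
        chosen.append(second)

        # Copy 3: a different node on the same remote rack as copy 2.
        peers = [node for node in topology[remote_rack] if node != second]
        if replication >= 3 and peers:
            chosen.append(rng.choice(peers))

    # Any copies still missing go to randomly chosen unused nodes.
    spare = [node for node in rack_of if node not in chosen]
    rng.shuffle(spare)
    chosen.extend(spare[: replication - len(chosen)])
    return chosen


if __name__ == "__main__":
    # Hypothetical two-rack cluster.
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(cluster, "n1"))
```

With two racks of two nodes each and a writer on `n1`, the second and third copies both land on the other rack, so the block survives the loss of either rack.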

Sample Hadoop File System

Hadoop “Eco-System”

• Hive: Allows SQL-like querying of data in HDFS
• Pig: Basic scripting language for Hadoop
• Databases: HBase, Accumulo, Cassandra, Neo4j

Map / Reduce: Parallel Execution Framework

WordCount Example
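The WordCount flow can be sketched in a few lines of plain Python. This is a local, single-process imitation of the map, shuffle, and reduce steps, not Hadoop's Java API and not the code shown on the slide. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and splitting on whitespace is an assumption about tokenisation.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample))
```

On a real cluster the mapper calls run in parallel on the nodes holding each input split, and the framework performs the grouping step across the network; here everything runs in one process so the data flow is easy to follow.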

Getting Started

• Cloudera and Hortonworks offer sandboxes that are easy to download; each is a fully self-contained Hadoop installation in a VM. Hadoop can also be downloaded directly from Apache.

http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html

Developing In Map / Reduce

• Standalone Mode – Hadoop runs as single process, best for debugging

• Pseudo-Distributed – Separate processes on same server

• Fully Distributed – Full blown cluster

Eclipse Framework

• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, remote debug and profiling
• Profiling: YourKit

WordCount

• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode
$ hadoop jar path/to/file.jar input output

Monitoring Hadoop Jobs

Resources

http://www.cloudera.com

http://www.hortonworks.com

http://hadoop.apache.org

http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf

Hadoop: The Definitive Guide by Tom White

Example: Document AutoClustering using Hadoop and Storm

https://www.youtube.com/watch?v=5X65WV0n4rU