Date post: 22-Dec-2015
Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850 (240) 389-0750 [email protected]
Transcript
Page 1: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850.

Overview of Hadoop for Data Mining
Federal Big Data Group
confidential

Mark Silverman
Treeminer, Inc.

155 Gibbs Street Suite 514
Rockville, Maryland 20850

(240) 389-0750
[email protected]

Page 2

TREEMINER, INC. CONFIDENTIAL

Agenda

• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm

Page 3

Introduction to Hadoop

• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)

Page 4

Hadoop File System

• “Rack” aware
• Local storage
• Distributed copies (generally 3)

[Figure: rack diagram]
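To make "rack aware" and "distributed copies" concrete, here is a minimal Python sketch of the placement idea usually described for HDFS with three copies: the first copy on the writing node (assuming the writer is itself a storage node), the second on a node in a different rack, and the third on another node in that same remote rack. This is an illustration only, not HDFS code; the function name `place_replicas`, the topology layout, and the node names are all hypothetical.

```python
import random


def place_replicas(topology, writer_node, replication=3, rng=None):
    """Choose nodes for the copies of one block (illustrative, not HDFS code).

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    writer_node: node writing the block; it receives the first copy.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer_node not in rack_of:
        raise ValueError(f"unknown node: {writer_node}")

    chosen = [writer_node]

    if replication >= 2:
        # Second copy: any node on a different rack, so a rack failure
        # cannot destroy every copy.
        remote = [n for n in rack_of if rack_of[n] != rack_of[writer_node]]
        if remote:
            chosen.append(rng.choice(remote))

    if replication >= 3 and len(chosen) == 2:
        # Third copy: a different node on the same rack as the second copy,
        # which keeps cross-rack traffic down.
        same_rack = [n for n in topology[rack_of[chosen[1]]] if n not in chosen]
        if same_rack:
            chosen.append(rng.choice(same_rack))

    # Fallback for small clusters or higher replication: random unused nodes.
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)
    while len(chosen) < replication and remaining:
        chosen.append(remaining.pop())
    return chosen


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
    print(place_replicas(racks, "n1"))
```

With two racks of three nodes, a block written from `n1` ends up with one copy on `n1` and two copies on distinct nodes of the other rack.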

Page 5

Sample Hadoop File System

Page 6

Hadoop “Eco-System”

• Hive
  Allows SQL-like querying of data in HDFS

• Pig
  Basic scripting language for Hadoop

• Databases
  HBase, Accumulo, Cassandra, Neo4j

Page 7

Map / Reduce: Parallel Execution Framework

Page 8

Map / Reduce: Parallel Execution Framework

Page 9

WordCount Example
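The slide's diagram did not survive extraction, so here is a minimal sketch of the WordCount data flow in plain Python, with no Hadoop dependency. It assumes whitespace tokenization and runs the map, shuffle, and reduce phases in a single process; on a cluster the framework would run many mappers and reducers in parallel and perform the shuffle itself. The function names are illustrative, not Hadoop API names.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle_phase(pairs)
    # Reducer input arrives sorted by key, so sort here as well.
    return dict(reduce_phase(word, counts) for word, counts in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in word_count(sample).items():
        print(f"{word}\t{count}")
```

Each mapper call sees only one line and each reducer call sees only one key, which is what lets the real framework spread the work across machines.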

Page 10

Getting Started

• Cloudera and Hortonworks offer sandboxes that are easy to download; each is a fully contained Hadoop implementation in a VM. Hadoop can also be downloaded directly from Apache.

http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html

Page 11

Developing In Map / Reduce

• Standalone Mode – Hadoop runs as a single process, best for debugging

• Pseudo-Distributed – Separate processes on the same server

• Fully Distributed – Full-blown cluster
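The three modes differ mainly in a handful of configuration properties. The sketch below renders those properties as Hadoop-style `<configuration>` XML using Python. The property names (`fs.defaultFS`, `dfs.replication`, `mapreduce.framework.name`) are the Hadoop 2.x names as I understand them; the values for the fully distributed case, in particular the host `namenode.example.com` and port 8020, are placeholders, and running pseudo-distributed jobs on YARN is one option rather than a requirement. Check the documentation for your distribution before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Per mode: which site file carries which property (values are assumptions).
MODE_SETTINGS = {
    "standalone": {
        # These are the built-in defaults, so standalone needs no edits.
        "core-site.xml": {"fs.defaultFS": "file:///"},
        "mapred-site.xml": {"mapreduce.framework.name": "local"},
    },
    "pseudo-distributed": {
        "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
        "hdfs-site.xml": {"dfs.replication": "1"},  # only one DataNode
        "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
    },
    "fully-distributed": {
        # Placeholder host and port; substitute the real NameNode address.
        "core-site.xml": {"fs.defaultFS": "hdfs://namenode.example.com:8020"},
        "hdfs-site.xml": {"dfs.replication": "3"},
        "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
    },
}


def render_site_file(mode, filename):
    """Return the XML text of one site file for the given mode."""
    root = ET.Element("configuration")
    for name, value in MODE_SETTINGS[mode][filename].items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename in MODE_SETTINGS["pseudo-distributed"]:
        print(filename)
        print(render_site_file("pseudo-distributed", filename))
```

Pseudo-distributed mode drops the replication factor to 1 because a single machine cannot hold three independent copies.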

Page 12

Eclipse Framework

• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, Remote debug and profiling
• Profiling: YourKit

Page 13

WordCount

• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode
$ hadoop jar path/to/file.jar input output
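Once the job finishes, the results sit in the `output` directory, which in standalone mode is a directory on the local file system. As a rough sketch, assuming the job used the default text output (one `word<TAB>count` line per key, in files named `part-...`), the helper below merges those files into one Python dictionary. The function name `read_counts` is hypothetical, and the file naming should be checked against your own job's output.

```python
from pathlib import Path


def read_counts(output_dir):
    """Merge every part-* file in a job output directory into one dict.

    Assumes tab-separated "word<TAB>count" lines; other files in the
    directory (such as a _SUCCESS marker) are ignored.
    """
    counts = {}
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                # Split on the last tab so the count is always the final field.
                word, _, count = line.rpartition("\t")
                counts[word] = counts.get(word, 0) + int(count)
    return counts


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "output"
    for word, count in sorted(read_counts(target).items()):
        print(f"{word}\t{count}")
```

Note that Hadoop refuses to start a job whose output directory already exists, so remove or rename `output` before running the command again.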

Page 14

Monitoring Hadoop Jobs

Page 15

Monitoring Hadoop Jobs

Page 16

Resources

http://www.cloudera.com

http://www.hortonworks.com

hadoop.apache.org

http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf

Hadoop: The Definitive Guide by Tom White

Page 17

Example: Document AutoClustering using Hadoop and Storm

https://www.youtube.com/watch?v=5X65WV0n4rU

