
Introduction to Data Analysis with Hadoop

PRESENTED BY: Mahidhar Tatineni, PhD, SDSC

Feng “Kevin” Chen, PhD, TACC

and Weijia Xu, PhD, TACC

Today’s Agenda

• Overview of Big Data – scope, challenges
• Introduction to Hadoop Architecture
  • HDFS
  • YARN
• MapReduce Paradigm
• Hadoop on Wrangler
  • Wrangler overview
  • Hadoop reservations, usage
• Hadoop hands-on session
  • HDFS commands
  • Simple wordcount example
  • Hadoop streaming
  • Mahout

XSEDE Training Survey

Please complete a short on-line survey about this module at http://bit.ly/xsedejackson. We value your feedback, and will use it to help improve our training offerings. Slides from this workshop are available at

http://hpcuniversity.org/trainingMaterials/238/

What/Why Big Data?

Big data: high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

• 5 billion mobile phones in use in 2010.
• Facebook: 34,722 likes every minute, 100 TB uploaded daily.
• YouTube: users upload 48 hours of new video every minute.
• Wal-Mart handles more than 1 million customer transactions every hour.
• Akamai analyzes 75 million events per day to better target advertisements.
• Twitter: roughly 400 million tweets every day and 465M accounts.
• 571 new websites are created every minute.

Continued..

• Data volume
  • 50x increase from 2010 to 2020.
• Data velocity
  • 10k payment card transactions are made every second around the globe.
  • Wal-Mart handles 1M+ transactions an hour.
• Data variety
  • Structured data, such as ATM and POS bank transactions.
  • Semi-structured data has some tagging/method of differentiation.
  • Unstructured data: everything else falls in this category, e.g. tweets, FB likes, posts, etc.

Continued..

• Rapid growth, with almost 90% of the data generated in the last 2 years.
• Classification of data:
  • 51% is structured data
  • 27% is semi-structured data
  • 22% is unstructured data
• A total of 6M data jobs, of which 2M are in the US. Growth is limited by the number of people with deep analysis skills, and it will be difficult to address big data requirements in the US by 2018.

What’s the big challenge with big data analysis?

• The data analysis process requires a lot of computational resources:
  – Storage: often triple the size of the raw data, to store the intermediate files, output, etc.
  – Memory: e.g. an algorithm may want to store the pair-wise distance matrix among data points.

• The analysis process takes much longer:
  – Typical hard drive read speed is about 150 MB/sec, so reading 1 TB takes roughly 2 hours.
  – Analysis could require processing time quadratic in the size of the data: an analysis that took 1 second for 1 GB of data would require about 11 days for 1 TB (1000^2 seconds ≈ 11.6 days).

What is Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. It is a top-level Apache project being built and used by a global community of contributors/users. It is licensed under the Apache License 2.0.

Features:
• Cost effective
• Flexible/heterogeneous hardware
• Scalable
• Resilient to failure

Hadoop

• Implementation of the MapReduce programming model in Java, with interfaces to other programming languages such as C/C++ and Python.
• Hadoop includes:
  – HDFS, a distributed file system based on the Google File System (GFS), as its shared file system.
  – YARN, a resource manager to assign resources to the computational tasks.
  – MapReduce, a library to enable efficient distributed data processing easily.
  – Mahout, a scalable machine learning and data mining library.
  – Hadoop streaming, which enables processing with other languages.
  – …

HDFS

Features:
• Based on the Google File System (GFS).
• Fault tolerant and easy to manage.
• Scalable and extremely simple to expand.
• Hadoop supports shell-like commands to interact with HDFS directly.
• The typical replication factor is 3, but it can be set at the file level as well (see the example after this slide).
• 128 MB is the default block size.
• Daemon services of HDFS:
  • NameNode
  • Secondary NameNode
  • DataNode

Limitations:
• Cannot be mounted directly by an existing OS.
• Not designed for low-latency access or for systems requiring concurrent writes.
• No parallel writes or arbitrary file modifications.
• Meant for a small number of huge files, not large numbers of small files.

NameNode

• Master node in the cluster / single point of failure.
• Data nodes send heartbeats every 3 seconds.
• Every 10th heartbeat is a block report.
• The NameNode builds its metadata from block reports.
• All requests (read/write) are processed by the NameNode only.

YARN Architecture

• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
• The Scheduler is responsible for allocating resources, but takes no responsibility for job completion.
• The ApplicationManager accepts job submissions, spins up the ApplicationMaster, and tracks its progress.
• The ApplicationMaster is responsible for negotiating resources from the Scheduler, tracking their status, and monitoring progress.

Hadoop and YARN Architecture

YARN has two main components: the Scheduler and the ApplicationManager.
The Scheduler does the scheduling and allocates resources to running applications. It does so based on the abstract notion of a resource container, which incorporates elements such as memory, CPU, disk, and network.
The ApplicationManager is responsible for accepting job submissions, creating the ApplicationMaster container, and restarting the ApplicationMaster in case of failure.
The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager (RM).

MapReduce

MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner.

Features:
• Accessibility
• Flexibility
• Reliability

MapReduce deep dive

The reduce phase can start before all mappers end: reducers can begin the shuffle early, since that is data transfer only, but not the sort and reduce steps. The property mapreduce.job.reduce.slowstart.completedmaps controls when the reducers start (see the example after this slide).
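The slow-start threshold can be overridden per job with the generic -D option. A minimal sketch using the bundled examples jar (input/output paths are placeholders):

# start reducers once 80% of the map tasks have completed (the default is 0.05)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduce.slowstart.completedmaps=0.80 input_dir output_dir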

“Big” Ideas in MapReduce

• Move computations to data.
  – Do not use/assume a high-bandwidth interconnect between nodes.
  – If possible, avoid or reduce the need for data transfer over the network, as this is often the bottleneck to scaling.

• Process data in sequential order; avoid random access.
  – Same idea as applied in RDBMSs.
  – However, assume data can be processed in any order.
  – All data are packed into blocks (default 64 MB or 128 MB).

“Big” Ideas in MapReduce

• Scale out instead of scaling up:
  – The same code can run with 1 node or 100 nodes.

• Hide system details from the user:
  – Provide abstractions for writing parallel code:
    • Mapper and reducer
    • Partitioner and combiner
  – Isolate the developer from (and keep the code independent of) system hardware details.
  – Once all required components are specified, data is automatically “sliced” and processed in parallel.

How does MapReduce work in Hadoop?

• The computation is broken down into two major steps:
  – Map instances:
    • process the data stored within one data block sequentially,
    • the result is in the form of a list of key-value pairs.
  – Reduce instances:
    • collect (key, value) pairs emitted by Map instances,
    • pairs with the same key are sent to the same reducer,
    • each reducer processes the key-value pairs it receives and writes output to a file.

• The user is only required to develop functions for Map and Reduce; the workload distribution is handled automatically by the Hadoop cluster.

• Either the “key” or the “value” in a key-value pair can be any type of data.


WordCount Example

• Read text files and count how often words occur.
  – The input is a text file.
  – The output is a text file: each line contains a word, a tab, and a count.
• Map: produce pairs of (word, count).
• Reduce: for each word, sum up the counts.

• The AppMaster launches one map task for each map split. Typically there is a map split for each input file (or each block of a large file).
• Mappers transform the input data to intermediate data. Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or many output pairs.
• All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output.
• The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
• Users can optionally specify a combiner to perform local aggregation of the intermediate outputs. This helps to cut down the amount of data shuffled.

WordCount Overview

import ...

public class WordCount {

  public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    public void map ...
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce ...
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    ...
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

WordCount Mapper

public static class Map extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

WordCount Reducer

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

WordCount main

public static void main(String[] args) throws Exception {
  Job job = Job.getInstance(new Configuration(), "wordcount");

  job.setJarByClass(WordCount.class);
  job.setMapperClass(Map.class);
  job.setCombinerClass(Reduce.class);
  job.setReducerClass(Reduce.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Mapper

• Maps input key-value pairs to a set of intermediate key-value pairs.
  – Class for individual tasks to run.
  – One mapper task per InputSplit.
  – The map function is automatically called once per key-value pair.

public static class Map extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Partitioner and Combiner

• The intermediate output generated by each Mapper is sorted and partitioned based on the number of reducers.

• Partitioner (optional implementation): determines how to group intermediate key-value pairs for each reducer.

• Combiner (optional implementation): combines key-value pairs with the same key.
  – Think of it as running the reducer locally on each mapper's output.
  (See the command-line sketch after this slide for how the reducer count drives the number of partitions.)
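Because the number of partitions equals the number of reduce tasks, the effect of the partitioner is easy to observe from the command line. A minimal sketch using the bundled examples jar (paths are placeholders):

# run word count with 4 reducers; the partitioner splits the intermediate keys into 4 partitions
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=4 input_dir wc_out
hadoop fs -ls wc_out    # one part-r-0000N output file per reducer/partition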

Reducer

• Reduces the set of values for the same key to a smaller set.
  – Each reducer will process the subset generated by the partitioner.
  – Each reducer will generate one output file.

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Shuffle and Sort

• There are two other phases on the reduce side:
  – Shuffle: gathering the partitions from each mapper.
  – Sort: merging and sorting the partitions gathered from multiple mappers.

• These two phases usually run simultaneously.

• A common cause of bottlenecks, as this may involve large data movement.

Accessing a Hadoop Cluster using Wrangler

Computing Cluster at a Glance

Big data processing platform
• Big data processing platforms like Hadoop and Spark are not simple applications.
• They co-locate computations and data.
• They need a dedicated storage manager and resource manager.
• Deployment requires mapping and configuration against the underlying hardware.

Hadoop Support at TACC: Wrangler

[System architecture diagram: Wrangler spans TACC and Indiana. Public network connectivity at 40 Gb/s Ethernet / 100 Gbps, with Globus for data transfer. The high-speed flash storage system (500+ TB, 1 TB/s, 250M+ IOPS) is reached over an interconnect with 1 TB/s throughput and an IB interconnect of 120 lanes (56 Gb/s, non-blocking). Access & Analysis systems of Haswell nodes with 128 GB+ memory: 96 nodes at TACC and 24 at Indiana, each site backed by a replicated 10 PB mass storage subsystem.]

• A direct-attached PCI interface allows access to the NAND flash.

• Not limited by the networking connection.

• Flash storage is not tied to individual nodes.

• The Hadoop cluster can be dynamically created over 2 to 48 nodes for each project to use in its allocated time.

• Each node has access to 4 TB of flash storage across four channels.

• The Hadoop cluster is accessible via idev, batch job submission, and VNC sessions.

Filesystems on Wrangler

• /home
  – Home directory for each user.
  – Small area for configuration files and programs/source code.

• /work
  – Each user gets 1 TB of space on work, shared across all systems at TACC.
  – Your work directory is stored in the $WORK environment variable on Wrangler.

• /data
  – Staging area for input and output files.
  – Only available on Wrangler.
  – Supports allocation of a shared project directory.

Get Started with Hadoop on Wrangler

• Step 1: create a Hadoop reservation through the Wrangler data portal.
  – What do you need? Any web browser.

• Step 2: access your Hadoop cluster and submit jobs.
  – What do you need? A Secure Shell client and any VNC client.

Create Hadoop Reservation

• Wrangler data portal: portal.wrangler.tacc.utexas.edu

• On the project page choose: Manage -> Create Hadoop Reservation

• Specify the number of nodes (1–10) to be used for the Hadoop cluster.

• Duration (1–30 days).

• Schedule the start time.

About Hadoop Reservations on Wrangler

• Anyone in the project can submit a Hadoop reservation request.

• Anyone in the project can access the Hadoop reservation.

• The SU on Wrangler is a “node hour”.

• A Hadoop reservation requires a minimum of 2 nodes and 1 day, i.e. 48 SUs.
  – One node in the reservation will be used as the NameNode, resource manager, application master, etc.
  – The rest of the nodes will be used as data nodes.
  – Each node will have 4 TB of flash storage mounted as part of HDFS.

Check Reservation Status

• The web portal will show the reservation status after the request has been submitted.

• More information about the reservation is available through the Slurm command scontrol.

Check Hadoop Reservations Using the Command Line

• Once logged on to a Wrangler login node, a user can check the reservation status with the scontrol command:

> scontrol show reservation

• The reservation will include all users from the project.

• The first node in the reservation will be used as the NameNode.

• The Hadoop cluster will start with a set of default settings.
  • Users may override most settings, such as replication factor and block size, at run time per application.
  • Hadoop clusters with specific settings are available upon request.

Access Hadoop Reservation

• Once the reservation status is “active”, a user can access it (via Slurm jobs) in multiple ways:

  – VNC job: starts a VNC server session on one of the nodes in the Hadoop cluster.
    • Check cluster information and Hadoop job status.
    • Applications with a graphical/web user interface.

  – idev job: assigns one node in the Hadoop cluster to the user.
    • Manage data in and out of the Hadoop cluster.
    • Submit Hadoop jobs via the command line.
    • Code testing.

  – Batch job: submit jobs to the YARN resource manager in the Hadoop cluster.
    • Submit large analysis jobs.
    • Submit a batch of processing jobs to run sequentially.

Access Hadoop Cluster with VNC

• Please visit: vis.tacc.utexas.edu

• Choose “TACC User Portal User” and enter your credentials.

1. Choose the Wrangler tab.

2. Set a VNC password (only needed once).

3. Fill in the reservation name: hadoop+TRAINING-HPC+1419, and choose the “hadoop” queue.

Access Hadoop Reservation via an idev Session

• A user can submit an idev session using the Hadoop reservation:
  ➢ idev -r hadoop+JSU-Training

• idev defaults to your default project; the -A allocation_name option specifies which allocation to use.

• The default duration for idev is 30 minutes; the -m minutes option specifies the length of the idev session.

• Please limit your usage to Hadoop-related tasks; you can also submit idev without the reservation for non-Hadoop tasks.

Access Hadoop Reservation via an idev Session

• Access by secure shell client:
  – ssh wrangler.tacc.utexas.edu
  – idev -r hadoop+JSU-Training -m 240

• Access via the vis portal:
  – Go to vis.tacc.utexas.edu using a web browser.
  – Log in with your credentials.
  – Go to the Wrangler tab to start a VNC session using the reservation and the Hadoop queue.

Working with the Hadoop Cluster

Hadoop Distributed File System (HDFS)

• HDFS will be set up with three top-level directories (see the listing sketch after this slide):

  • /tmp – publicly writable; used by many Hadoop-based applications as temporary space.
  • /user – all users' home directories, /user/$USERNAME.
  • /var – publicly readable; used by many Hadoop-based applications to store log files, etc.
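A quick way to confirm this layout once you are on a node in the Hadoop cluster, as a minimal sketch:

hadoop fs -ls /              # should show /tmp, /user and /var
hadoop fs -ls /user/$USER    # your HDFS home directory (may be empty at first)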

Working with HDFS

• HDFS has a file system shell:

  hadoop fs [commands]

• The file system shell includes a set of commands to work with HDFS. The commands are similar to common Linux commands, e.g.:

  > hadoop fs -ls          # list the contents of the default user directory
  > hadoop fs -mkdir abc   # make a directory in HDFS

Getting Data in and out of HDFS

• hadoop fs -put local_file [path_in_HDFS]
  – Put a file from your local system into HDFS.
  – Each file will be stored in one or more “blocks”.
    • The default block size is 128 MB.
    • The block size used can be overridden by users.

• hadoop fs -get path_in_hdfs [path_in_local]
  – Get a file from the Hadoop cluster to the local file system.

(A short put/get round trip is sketched after this slide.)
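A minimal put/get round trip, assuming a local file of your own (results.txt is a placeholder):

hadoop fs -mkdir -p demo                   # create a directory under your HDFS home
hadoop fs -put results.txt demo/           # copy the local file into HDFS
hadoop fs -ls demo                         # verify the file and its size
hadoop fs -get demo/results.txt copy.txt   # copy it back out under a new local name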

Other File Shell Commands

• -stat        returns stat information for a path
• -cat / -tail output a file to stdout
• -setrep      set the replication factor

• For a complete list just run: hadoop fs

• Reference and more details: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

Note: for dfs-specific commands, newer versions of Hadoop have transitioned to using “hdfs” commands. For example, “hadoop dfs -ls” is the same as “hdfs dfs -ls”.

HDFS Command Examples

• hdfs fsck <path>
  • File system checking utility. Can identify problems like missing blocks, under-replication, etc. (see the usage sketch below).
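A minimal usage sketch, checking your own HDFS home directory:

hdfs fsck /user/$USER -files -blocks    # report per-file block counts, replication and overall health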

HDFS Command Examples

• hdfs dfsadmin -report
  • Reports basic file system information and statistics; needs admin privileges.

YARN

• YARN: Yet Another Resource Negotiator
  – Manages computing resources within the Hadoop cluster.
  – All jobs should be submitted to YARN to run, e.g. using either yarn jar or hadoop jar.
  – When using other Hadoop-supported applications, such as Spark, also specify YARN as the resource manager.

• YARN commands
  – Show cluster status.
  – Help manage jobs running inside the Hadoop cluster.

YARN Commands

• yarn application
  – -list: list applications submitted to YARN; by default shows active/queued jobs.
  – -kill: kill the application specified by job ID.
  – -appStates / -appTypes: filter options.

YARN Commands

• yarn node
  – -list: list the status of data nodes.
  – Lets us know if there are fewer live data nodes than expected.

• yarn logs
  – Dump the logs of a finished application.
  – -applicationId: specify which application's log.
  – -containerId: specify which container's log.

(A few concrete invocations are sketched after this slide.)
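Concrete invocations, as a minimal sketch (<application_id> and <container_id> are placeholders):

yarn application -list -appStates RUNNING        # only applications currently running
yarn application -kill <application_id>          # kill a specific application
yarn node -list                                  # live data nodes and their state
yarn logs -applicationId <application_id>        # aggregated logs of a finished application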

Running a Hadoop Application

• All Hadoop jobs can be run as a console command.

• The basic format is as follows:

hadoop jar java_jar_name java_class_name [parameters]

• Users can use -D to specify more Hadoop options:
  -D mapred.map.tasks      # number of map instances to be generated
  -D mapred.reduce.tasks   # number of reduce instances to be used

How many Mapper and Reducer tasks?

• Unfortunately, it depends on the particular case.
• The input parameter is only a “suggestion”.
• Mapper
  – More or less depends on the input workload: InputSplit, InputFormat.
    • Number of blocks used to store the input.
    • Number of lines in a text file.

• Reducer
  – More or less depends on the computation workload and available resources, e.g. 0.95x or 1.75x the number of available containers.
  – Also consider:
    • How many partitions could be generated?
    • How many output splits are desired?

Running the Wordcount example code

• The Hadoop distribution comes with an example jar which includes a set of exemplar MapReduce programs.

• To run the wordcount example from that jar file, you can run a command like the following:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount \                   # java class name to run
  -Dmapred.map.tasks=500 \      # number of mapper instances
  -Dmapred.reduce.tasks=256 \   # number of reducer instances
  /tmp/data/enwiki-20120104-pages-articles.xml \   # input file on hdfs
  wiki_wc                       # folder to store the output

Running the Wordcount example code

• The on-screen output will show the job's running status.

Running the WordCount Example

• The number of mappers actually created may be limited by the number of actual data blocks in HDFS.
  – In the example, the input file is about ~35 GB; with the default block size of 128 MB, the file is stored in 266 blocks in HDFS.
  – All mappers may not run at the same time if there are not enough resources.

• Each reducer will generate an output file independent of the others.
  – So 256 reducers would result in 256 files in the output folder.

Running the WordCount example

• Each output file is a text file as well:
  – Each line contains a word and its count.
  – You will notice that within each output file, the words are sorted in alphabetical order.

• You can copy the files out of HDFS using a command like:
  • hadoop fs -get /tmp/wiki_wc wiki_wc

• Or you can view the content of each output file using a command like:
  • hadoop fs -cat /tmp/wiki_wc/part-r-00238 | more

Other examples

• There are more examples in the examples jar. You can use the following command to get the list:
  – hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

• Here are a few examples:
  – grep: a map/reduce program that counts the matches of a regex in the input.
  – pi: a map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  – terasort: sorts a large set of randomly generated 100-byte records.
  – sudoku: a sudoku solver.

Hands-On Session

1. Use your training account to log in to login.xsede.org (remember to use DUO).
2. Access Wrangler using gsissh.

Logging in to the Single Sign-On Host

Mac/Linux:

  ssh your_username@login.xsede.org

Windows (PuTTY):

  login.xsede.org

Log in to Wrangler from the Single Sign-On host:

  gsissh wrangler-tacc

Try It Out

• Example code/data location: /work/03024/chenk/Hadoop_Training_TACC.tar.gz

  Copy or extract it to your home directory, e.g.:
  > tar -xvf /work/03024/chenk/Hadoop_Training_TACC.tar.gz
  > cd ~/Hadoop_Training_TACC

• Today's reservation is: hadoop+TRAINING-HPC+2188

• Use idev to access nodes:
  > idev -r hadoop+JSU-Training -m 240

• Take a look at exercise.txt (https://drive.google.com/file/d/0B4PqgCa0ORIgNUNIVWpyMVVIZnc/view?usp=sharing)
  – Try exercise 1 and exercise 2 first.

• hadoop fs command syntax:
  – hadoop fs -{ls | rm | cat | mkdir | put | get}

Hadoop Streaming

• Enables MR jobs in other scripting languages: Python, Perl, R, C, etc.

• The user needs to provide scripts/programs for the Map and Reduce processing.
  – The input/output format needs to be compatible with key-value pairs.

• Intermediate data are passed through stdin/stdout.
  – A trade-off between convenience and performance.

Hadoop Streaming API

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /path/to/input/in/hdfs \    # input file location
  -output /path/to/output/in/hdfs \  # output file location
  -mapper map \                      # mapper implementation
  -reducer reduce \                  # reducer implementation
  -file map \                        # location of the map code on the local filesystem
  -file reduce                       # location of the reduce code on the local filesystem

• The map and reduce could be implemented in any programming language, even with a bash script.

WordCount using Bash with Hadoop streaming

• The map code (sketched below):
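A minimal sketch of what a streaming word-count mapper such as mapwc.sh (referenced in the command two slides on) could look like; splitting on whitespace is an assumption:

#!/bin/bash
# mapwc.sh: read lines from stdin and emit "word<TAB>1" for every whitespace-separated word
set -f    # disable globbing so tokens like * are not expanded
while read -r line; do
  for word in $line; do
    printf '%s\t1\n' "$word"
  done
done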

WordCount using Bash with Hadoop streaming

• The reduce code (sketched below):
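A minimal sketch of what reducewc.sh could look like; it relies on the framework delivering intermediate keys in sorted order, so equal words arrive consecutively:

#!/bin/bash
# reducewc.sh: sum the counts for each word; input lines are "word<TAB>count", sorted by word
current=""
count=0
while IFS=$'\t' read -r word num; do
  if [[ "$word" == "$current" ]]; then
    count=$((count + num))
  else
    if [[ -n "$current" ]]; then
      printf '%s\t%d\n' "$current" "$count"
    fi
    current="$word"
    count=$num
  fi
done
# flush the last word
if [[ -n "$current" ]]; then
  printf '%s\t%d\n' "$current" "$count"
fi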

WordCount using Bash with Hadoop streaming

• Putting it together:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapred.map.tasks=512 \
  -Dmapred.reduce.tasks=256 \
  -Dstream.num.map.output.key.fields=1 \   # specify the position of the key
  -input /tmp/data/20news-all/alt.atheism \
  -output wiki_wc_bash \
  -mapper ./mapwc.sh -reducer ./reducewc.sh \
  -file ./mapwc.sh -file ./reducewc.sh

Hadoop Streaming

• The default data-split behavior of Hadoop is by blocks.

• However, Hadoop streaming splits data per “line”, so:
  – You can use a higher number of mappers.
  – Streaming only works with text files.

Try It Out

• If you are comfortable with Java:
  – Try exercise 3.

• Exercise 4 is about Hadoop streaming.
  – Can you write a wordcount example with your favorite programming language?
  – Exemplar code for Python, R and bash is available in the corresponding directory.

• Question to think about:
  – Where/how did we specify parallelism?

Hadoop WordCount Example (Screenshot)

Mahout

• Machine learning libraries for Hadoop.

• Not a very comprehensive library, but it still provides good coverage.

• In practice somewhat raw and complex to utilize. Much of the code is specific to the particular dataset being processed (e.g. 20 newsgroups).

• A subset of the analytic methods can also be run from the command line.

Running Mahout from the Command Line

• Typing mahout will show a list of programs that can be run.

• Some of potential interest:
  – buildforest: build a random forest classifier
  – kmeans: K-means clustering
  – recommenditembased: compute recommendations using item-based collaborative filtering
  – runlogistic: run a logistic regression model against CSV data
  – svd: Lanczos singular value decomposition
  – …

Walkthrough of using Mahout for K-means clustering

• Dataset: Reuters-21578 news data
  – wget https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
  – mkdir reuters-sgm
  – tar -zxvf reuters21578.tar.gz -C reuters-sgm

• Step 1: prepare the files and move them to HDFS
  – $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters-out
  – $ hadoop fs -put reuters-out reuters-sgm-extract

• Step 2: convert the extracted text files into Mahout's sequence file format
  – $ mahout seqdirectory -i reuters-sgm-extract -o reuters-seqdir -c UTF-8 -chunk 5

Walkthrough of using Mahout for K-means clustering

• Hadoop Sequence File:
  – A sequence of records, where each record is a <Key, Value> pair, e.g.
    • <Key1, Value1>
    • <Key2, Value2>

• For this example:
  – Key <- document ID
  – Value <- content of the document

Walkthrough of using Mahout for K-means clustering

• Step 3: create vector representations from the sequence file

> mahout seq2sparse -i reuters-seqdir/ -o reuters-seqdir-vectors

• This step creates the term frequency / inverse document frequency (TF-IDF) matrix for the document set.

Walkthrough of using Mahout for K-means clustering

• Step 4: run k-means clustering with Mahout

mahout kmeans -i reuters-seqdir-vectors/tfidf-vectors/ -c reuters-kmeans-clusters -o reuters-kmeans -x 10 -k 20

• Step 5: check the result

mahout clusterdump -i reuters-kmeans/clusters-* -d reuters-seqdir-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o ./cluster-output.txt

Walkthrough of using Mahout for K-means clustering

• You can get the list of options for each program with:
  – mahout [program_name]

• Options for k-means in Mahout:
  --input (-i) input                          Path to the job input directory.
  --clusters (-c) clusters                    The input centroids, as Vectors. (optional)
  --k (-k) k                                  The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the centroids and written to the clusters input path.
  --output (-o) output                        The directory pathname for output.
  --distanceMeasure (-dm) distanceMeasure     The class name of the DistanceMeasure. Default is SquaredEuclidean.
  --convergenceDelta (-cd) convergenceDelta   The convergence delta value. Default is 0.5.
  --maxIter (-x) maxIter                      The maximum number of iterations.
  --maxRed (-r) maxRed                        The number of reduce tasks. Defaults to 2.
  --overwrite (-ow)                           If present, overwrite the output directory before running the job.
  --help (-h)                                 Print out help.
  --clustering (-cl)                          If present, run clustering after the iterations have taken place.

Other Packages and Tools

• HBase
  – Open-source implementation of BigTable.
  – Columnar, append-only key-value store.
  – Built on HDFS but not on MR. Provides fast random lookup for HDFS data.

• Spark
  – A different programming model from Hadoop MapReduce.
  – Distributed collections of objects that can be cached in memory across cluster nodes.
  – Currently supported via the YARN resource manager (see the launch sketch after this list).

• XSEDE machines provide a broad range of capabilities. Please let us know what other tools/packages you may need.
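A minimal sketch of launching a Spark application on the same YARN cluster (the jar, class name, and paths are placeholders, and this assumes Spark is available on the cluster):

# submit a Spark job to the YARN resource manager backing the Hadoop reservation
spark-submit --master yarn --deploy-mode cluster --class my.package.MyApp my-app.jar input_dir output_dir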

Summary

• The Hadoop framework is extensively used for scalable distributed processing of large datasets.
• Several tools are available that leverage the framework.
• Hadoop Streaming can be used to write mapper/reducer functions in any language.
• Mahout with Hadoop provides scalable machine learning tools.
• Tools like Spark have expanded capabilities, taking advantage of in-memory storage/caching. They can also go beyond the MapReduce paradigm.

• Please fill out the survey:
• http://bit.ly/xsedejackson

