
Copyright © 2010-2015 Cloudera, Inc. All rights reserved. Not to be reproduced without prior written consent from Cloudera.

Cloudera Administrator Training for Apache Hadoop: Hands-On Exercises

General Notes
Setup Activity: Configuring Networking
Hands-On Exercise: Installing Cloudera Manager Server
Hands-on Exercise: Creating a Hadoop Cluster
Hands-On Exercise: Working With HDFS
Hands-On Exercise: Running YARN Applications
Hands-On Exercise: Explore Hadoop Configurations and Daemon Logs
Hands-On Exercise: Using Flume to Put Data into HDFS
Hands-On Exercise: Importing Data with Sqoop
Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala
Hands-On Exercise: Using Hue to Control Hadoop User Access
Hands-On Exercise: Configuring HDFS High Availability
Hands-On Exercise: Using the Fair Scheduler
Hands-On Exercise: Breaking The Cluster
Hands-On Exercise: Verifying The Cluster’s Self-Healing Features
Hands-On Exercise: Taking HDFS Snapshots
Hands-On Exercise: Configuring Email Alerts
Troubleshooting Challenge: Heap O’ Trouble
Appendix A: Setting up VMware Fusion on a Mac for the Cloud Training Environment
Appendix B: Setting up VirtualBox for the Cloud Training Environment

201510

General Notes

Training Environment Overview

In this training course, you will use a cloud environment consisting of five Amazon EC2 instances. You will also use a local virtual machine (VM) to access the cloud environment.

All of the EC2 instances use the CentOS 6.4 Linux distribution.

Use the training user account to do your work. You should not need to enter passwords for the training user.

Should you need superuser access, you can use sudo as the training user without entering a password. The training user has unlimited, passwordless sudo privileges.

For the training environment:

• You will start the VM, and then use it to connect to your EC2 instances

• The EC2 instances have been configured so that you can connect to them without entering a password

• Your instructor will provide the following details for five EC2 instances per student:

o EC2 public IP addresses – You will run a script that adds the EC2 public IP addresses of your five EC2 instances to the /etc/hosts file on your VM

o EC2 private IP addresses – You will use these addresses when you run a script that configures the /etc/hosts file on your EC2 instances. EC2 private IP addresses start with the number 10, 172, or 192.168

o The instructor will not provide EC2 internal host names. Please use the host names elephant, tiger, horse, monkey and lion for the five internal host names. It is important that you use these five host names, because several scripts, which expect these five host names, have been provided for your use while you perform exercises. The scripts will not work if you use different host names.

• Please write out a table similar to the following with your EC2 instance information:

Host Name    EC2 Public IP Address    EC2 Private IP Address
elephant
tiger
horse
monkey
lion
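As a sanity check before filling in the private-address column, the sketch below tests whether an address falls in one of the RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The function name is_private_ip and the sample addresses are hypothetical; this helper is not one of the scripts provided with the course.

```shell
# Hypothetical helper, not part of the course scripts.
# Returns 0 if the dotted-quad address is in an RFC 1918 private range.
is_private_ip() {
    # Split the argument on dots into four octets.
    IFS=. read -r o1 o2 o3 o4 <<EOF
$1
EOF
    # Every octet must be a non-empty decimal number no larger than 255.
    for octet in "$o1" "$o2" "$o3" "$o4"; do
        case $octet in
            ''|*[!0-9]*) return 1 ;;
        esac
        [ "$octet" -le 255 ] || return 1
    done
    # 10.0.0.0/8
    if [ "$o1" -eq 10 ]; then return 0; fi
    # 172.16.0.0/12 covers 172.16.x.x through 172.31.x.x
    if [ "$o1" -eq 172 ] && [ "$o2" -ge 16 ] && [ "$o2" -le 31 ]; then return 0; fi
    # 192.168.0.0/16
    if [ "$o1" -eq 192 ] && [ "$o2" -eq 168 ]; then return 0; fi
    return 1
}

# Sample addresses are made up for illustration.
for ip in 10.0.12.7 172.31.4.20 192.168.1.5 54.210.8.9; do
    if is_private_ip "$ip"; then
        echo "$ip private"
    else
        echo "$ip public"
    fi
done
```

An address that the check reports as public belongs in the public column of your table, and should not be given to a script that asks for private addresses.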

Notational Convention

In some command-line steps in the exercises, you will see lines like this:

$ hdfs dfs -put shakespeare \
/user/training/shakespeare

The backslash at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.
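The equivalence of the two forms can be checked without a cluster. The sketch below uses echo as a stand-in for the real command, so it only illustrates how the shell joins the lines; the variable names are illustrative.

```shell
# Two-line form: the trailing backslash joins the lines into one command.
two_line=$(echo put shakespeare \
/user/training/shakespeare)

# Single-line form: the same command with no backslash.
one_line=$(echo put shakespeare /user/training/shakespeare)

if [ "$two_line" = "$one_line" ]; then
    echo "identical"
fi
```

The backslash must be the very last character on the line. If a space follows it, the shell no longer treats the next line as a continuation.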


Copying and Pasting from the Hands-On Exercises Manual

If you wish, you can usually copy commands and strings from this Hands-On Exercises manual and paste them into your terminal sessions. However, please note one important exception:

If you copy and paste table entries that exceed a single line, a space may be inserted at the end of each line.

Dash (-) characters are especially problematic and do not always copy correctly. Be sure to delete any pasted dash characters and key them in manually after pasting in text that contains them.

Please use caution when copying and pasting when performing the exercises.

Resetting Your Cluster

You can use the reset_cluster.sh script to change the state of your cluster so that you can start with a fresh, correct environment for performing any exercise. Use the script in situations such as the following:

• While attempting one of the exercises, you misconfigure your machines so badly that attempting to do the next exercise is no longer possible.

• You have successfully completed a number of exercises, but then you receive an emergency call from work and you have to miss some time in class. When you return, you realize that you have missed two or three exercises. But you want to do the same exercise everyone else is doing so that you can benefit from the post-exercise discussions.

The script is destructive: any work that you have done is overwritten when the script runs. If you want to save files before running the script, you can copy the files you want to save to a subdirectory under /home/training.

Before you attempt to run the script, verify that networking among the five hosts in your cluster is working. If networking has not been configured correctly, you can rerun the CM_config_hosts.sh script to reset the networking configuration prior to running the reset_cluster.sh script.

Run the script on elephant only. You do not need to change to a directory to run the script; it is in your shell’s PATH.

The script starts by prompting you to enter the number of an exercise. Specify the exercise you want to perform after the script has run. Then confirm that you want to reset the cluster (thus overwriting any work you have done).

The script will further prompt you to specify if you want to run only the steps that simulate the previous exercise, or if you want to completely uninstall and reinstall the cluster and then catch you up to the specified exercise. Note that choosing to only complete the previous exercise does not offer as strong an assurance of properly configuring your cluster as a full reset would do. It is, however, a more expedient option.

After you have responded to the initial prompts, the script begins by cleaning up your cluster: terminating Hadoop processes, removing Hadoop software, deleting Hadoop-related files, and reverting other changes you might have made to the hosts in your cluster. Please note that as this system cleanup phase is running, you will see errors such as “unrecognized service” and “No packages marked for removal.” These errors are expected. They occur because the script attempts to remove anything pertinent that might be on your cluster. The number of error messages that appear during this phase of script execution depends on the state the cluster is in when you start the script.

Next, the script simulates steps for each exercise up to the one you want to perform.

Script completion time varies from 5 minutes to almost an hour depending on how many exercise steps need to be simulated by the script.

Setup Activity: Configuring Networking

In this preparatory exercise you will configure networking for your five instances.

Task Overview

In this task, you will run scripts to configure networking.

First, you will start the local Get2EC2 VM and run a script to configure the /etc/hosts file on that VM with the addresses of the five EC2 instances and the host names elephant, tiger, horse, monkey and lion.

Next, you will run a script to configure the /etc/hosts file on the EC2 instances, setting the five host names to elephant, tiger, horse, monkey and lion.

Then you will verify that networking has been set up correctly.

Finally, you will run a script that starts a SOCKS5 proxy server on the local VM.

Steps to Complete

1. If you are using VMware Fusion on Mac OS to connect to your five EC2 instances, read ‘Appendix A, Setting up VMware Fusion on a Mac for the Cloud Training Environment’ and perform any indicated actions. After you have done so, continue to the next step.

2. Start the Get2EC2 VM.

You should find the VM on your Desktop, on the C: drive of your machine, or in the location to which you downloaded it. Double-click the icon whose name ends in .vmx to launch the VM.

After a few minutes, the GNOME interface will appear.

Note: A VirtualBox VM is available for students running Mac OS who are

unable to or prefer not to install VMware Fusion. However, we strongly

recommend using the VMware version VM. Use VirtualBox for this course

only if it is your preferred virtualization environment and if you are

knowledgeable enough to be self-sufficient to troubleshoot problems you

might run into.

If you are using a VirtualBox VM, follow the steps in ‘Appendix B, Setting up

VirtualBox for the Cloud Training Environment.’ When you have completed

the steps in the Appendix, continue to the next step.

3. Run a script that modifies the /etc/hosts file on your VM by adding your five EC2 instances.

Enter the following command in the terminal window in the VM:

$ CM_config_local_hosts_file.sh

The CM_config_local_hosts_file.sh script will prompt you to enter the five EC2 public IP addresses for your five EC2 instances. Refer to the table you created when your instructor gave you the EC2 instance information.

4. Log in to the elephant EC2 instance as the training user.

$ connect_to_elephant.sh

When prompted to confirm that you want to continue connecting, enter yes and then press Enter.

5. Run the CM_config_hosts.sh script on elephant. This script sets up the /etc/hosts file on elephant, copies that file to tiger, horse, monkey and lion, and then sets the host names for the five hosts to elephant, tiger, horse, monkey and lion.

Enter the following command on elephant:

$ CM_config_hosts.sh

When the script prompts you to enter IP addresses for your EC2 instances, enter the EC2 private IP addresses (which typically start with 10, 172, or 192.168).

When the script prompts you to confirm that you want to continue connecting, enter yes each time and then press Enter.

6. Terminate and restart the SSH session with elephant.

$ exit

$ connect_to_elephant.sh

Note: You exit and reconnect to elephant to reset the value of the $HOSTNAME variable in the shell after the CM_config_hosts.sh script changes the host name.

7. Start SSH sessions with your other four EC2 instances, logging in as the training user.

Open four more terminal windows (or tabs) on your VM, so that a total of five terminal windows are open.

In one of the new terminal windows, connect to the tiger EC2 instance.

$ connect_to_tiger.sh

When the script prompts you to confirm that you want to continue connecting, enter yes and then press Enter. (You will also need to do this when you connect to horse and monkey.)

In another terminal window, connect to the horse EC2 instance.

$ connect_to_horse.sh

In another terminal window, connect to the monkey EC2 instance.

$ connect_to_monkey.sh

In the remaining terminal window, connect to the lion EC2 instance.

$ connect_to_lion.sh

8. Verify that you can communicate with all the hosts in your cluster from elephant by using the host names. Each ping runs until you interrupt it, so press Ctrl+C once replies appear and then move on to the next host.

On elephant:

$ ping elephant

$ ping tiger

$ ping horse

$ ping monkey

$ ping lion

9. Verify that passwordless SSH works by running the ip addr command on tiger, horse, monkey and lion from a session on elephant.

On elephant:

$ ssh training@tiger ip addr

$ ssh training@horse ip addr

$ ssh training@monkey ip addr

$ ssh training@lion ip addr

The ip addr command shows you IP addresses and properties.

Note: Your environment is set up to allow you to use passwordless SSH to

submit commands from a session on elephant to the root and training

users on tiger, horse, and monkey, and to allow you to use the scp

command to copy files from elephant to the other four hosts. Passwordless

SSH and scp are not required to deploy a Hadoop cluster. We make these

tools available in the classroom environment as a convenience.

10. Run the uname -n command to verify that all five host names have been changed as expected.

On all five hosts:

$ uname -n

11. Verify that the value of the $HOSTNAME environment variable has been set to the new host name.

On all five hosts:

$ echo $HOSTNAME

The value elephant, tiger, horse, monkey or lion should appear.

12. Start a SOCKS5 proxy server on the VM. The browser on your VM will use the proxy to communicate with your EC2 instances.

Open one more terminal window on the local VM (not a tab in an existing window) and enter the following command:

$ start_SOCKS5_proxy.sh

When the script prompts you to confirm that you want to continue connecting, enter yes and then press Enter.

13. Minimize the terminal window in which you started the SOCKS5 proxy server.

Important: Do not close this terminal window or exit the SSH session running in it while you are working on exercises. If you accidentally terminate the proxy server, restart it by opening a new terminal window and rerunning the start_SOCKS5_proxy.sh script.
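The per-host connectivity check in step 8 can also be run as a loop. The sketch below is a hypothetical helper, not one of the provided course scripts: check_hosts runs whatever command you pass it once per host name, appending the host as the last argument, and reports which hosts fail. The ping options shown in the comment (-c for count, -W for timeout in seconds) are those of the Linux ping utility.

```shell
# Hypothetical helper, not part of the course scripts.
HOSTS="elephant tiger horse monkey lion"

check_hosts() {
    # "$@" is the check command; the host name is appended as its last argument.
    failed=""
    for host in $HOSTS; do
        if "$@" "$host" >/dev/null 2>&1; then
            echo "OK     $host"
        else
            echo "FAILED $host"
            failed="$failed $host"
        fi
    done
    # Succeed only if no host failed.
    [ -z "$failed" ]
}

# On elephant, send one echo request per host with a 2-second timeout:
#   check_hosts ping -c 1 -W 2
```

Passing the check as an argument keeps the loop reusable and avoids the need to press Ctrl+C for each host.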

At the End of Each Day

At the end of the day, exit all active SSH sessions you have with EC2 instances

(including the session running the SOCKS5 proxy server). To exit an SSH

session, simply execute the exit command.

CAUTION: Do not shut down your EC2 instances.

Then suspend the Get2EC2 VM.

When you return in the morning, resume the Get2EC2 VM, reestablish SSH

sessions with the EC2 instances, and restart the SOCKS5 proxy server.

To reestablish SSH sessions with the EC2 instances, open a terminal window (or

tab), and then execute the appropriate connect_to script.

To restart the SOCKS5 proxy server, open a separate terminal window, and then

execute the start_SOCKS5_proxy.sh script.

This is the end of the setup activity for the training environment.

Hands-On Exercise: Installing Cloudera Manager Server

With Cloudera Manager, you can install your Hadoop cluster using one of two installation options, referred to as Installation Path A and Installation Path B in the Cloudera Manager documentation. Installation Path A is driven entirely by GUI-based wizards and is appropriate for a demo or proof of concept installation. Installation Path B is command-line driven and is appropriate for production deployments.

For this installation, you will use Installation Path B to install Cloudera Manager Server on the lion machine. Before installing Cloudera Manager you will configure an external database (MySQL) to be used by Cloudera Manager and CDH, which you will install in the next exercise.

Complete all steps in this exercise on lion.

Verify Existing Environment

1. Verify the Oracle JDK is installed and that JAVA_HOME is defined and referenced in the system PATH.

On lion:

$ java -version

$ echo $JAVA_HOME

$ env | grep PATH

2. Verify Python is installed. It is a requirement for Hue, which you will install later in the course.

On lion:

$ sudo rpm -qa | grep python-2.6

3. Verify Oracle MySQL Server is installed and running on lion.

On lion:

$ chkconfig | grep mysqld

$ sudo service mysqld status

Note that in a true production deployment you would also modify the /etc/my.cnf MySQL configurations and move the InnoDB log files to a backup location.

4. Verify the MySQL JDBC Connector is installed. Sqoop (a part of CDH that you will install in this course) does not ship with a JDBC connector, but does require one.

On lion:

$ ls -l /usr/share/java

Configure the External Database

In keeping with the approach to installation that is appropriate for a production cluster environment, you will use an external database rather than the embedded PostgreSQL database option.

1. Create the required databases in MySQL.

On lion, open ~/training_materials/admin/scripts/mysql-setup.sql in a text editor and observe the script details. This script creates databases for Cloudera Manager, the Hive Metastore, the Activity Monitor, and the Reports Manager.

Note that if you were going to add the Sentry Server CDH service or install Cloudera Navigator, additional databases would also need to be created.

On lion:

$ mysql -u root < \
~/training_materials/admin/scripts/mysql-setup.sql

$ mysql -u root

Now, at the mysql> prompt on lion, issue the following commands:

show databases;

exit;

The ‘show databases’ command should show that the four databases were created.

Note: In a true production deployment you would also regularly back up your database.

2. Make your MySQL installation secure and set the root user password.

On lion:

$ sudo /usr/bin/mysql_secure_installation

[...]

Enter current password for root (enter for none):

OK, successfully used password, moving on...

[...]

Set root password? [Y/n] Y

New password: training

Re-enter new password: training

Remove anonymous users? [Y/n] Y

[...]

Disallow root login remotely? [Y/n] Y

[...]

Remove test database and access to it [Y/n] Y

[...]

Reload privilege tables now? [Y/n] Y

All done!

3. Verify the Cloudera Manager local software repository.

Your instances contain a local yum repository of Cloudera Manager 5 software to save download time in this course.

CentOS (and Red Hat) store software repository references in /etc/yum.repos.d. Issue the command below to view the settings.

On lion:

$ cat /etc/yum.repos.d/cloudera-cm5.repo

View the software packages in a subfolder of the referenced directory.

On lion:

$ ls ~/software/cloudera-cm5/RPMS/x86_64

Note that these two locations exist on each of the five machines in your environment and have also been made available on HTTP ports 8050 and 8000 respectively via a Linux service. This setup is specific to the course environment and is not required for Cloudera Manager or CDH installations.

If you wanted to install from the online repository, you would create a reference to the Cloudera repository in your /etc/yum.repos.d directory.

Install Cloudera Manager Server

1. Install Cloudera Manager Server.

On lion:

$ sudo yum install -y cloudera-manager-daemons \
cloudera-manager-server

Note: The -y option provides an answer of yes in response to an expected confirmation prompt.

2. Set the Cloudera Manager Server service to not start on boot.

On lion:

$ sudo chkconfig cloudera-scm-server off

3. Run the script to prepare the Cloudera Manager database.

On lion:

$ sudo /usr/share/cmf/schema/scm_prepare_database.sh \
mysql cmserver cmserveruser password

After running the command above you should see the message, “All done, your SCM database is configured correctly!”

4. Verify the local software repositories.

Run the following two commands to verify the URLs are working.

On lion:

$ curl lion:8000

$ curl lion:8050

Each command should show hyperlinks (<a href=”…” code) to software repositories. If either command did not successfully contact the web server, discuss with the instructor before continuing.

5. Start the Cloudera Manager Server.

On lion:

$ sudo service cloudera-scm-server start

6. Review the processes started by the Cloudera Manager Server.

On lion:

$ top

Cloudera Manager Server runs as a java process that you can view by using the top Linux utility. Notice that the CPU usage is at 90 percent or above for a few seconds while the server starts. Once the CPU usage drops, the Cloudera Manager browser interface will become available.

On lion:

$ ps wax | grep [c]loudera-scm-server

The results of the ps command above show that Cloudera Manager Server is using the JDBC MySQL connector to connect to MySQL. They also show logging configuration and other details.

7. Review the Cloudera Manager Server log file.

The path to the log file is /var/log/cloudera-scm-server/cloudera-scm-server.log. Note that you must use sudo to access Cloudera Manager logs because of restricted permissions on the Cloudera Manager log file directories.

On lion:

$ sudo less /var/log/cloudera-scm-server/cloudera-scm-\
server.log

Note: You will log in to Cloudera Manager in the next exercise.
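The brackets in the grep pattern used in step 6 keep grep from matching its own entry in the process list. The sketch below illustrates why, using simplified, made-up process lines in place of real ps output; the variable names are illustrative.

```shell
# Simulated process listings (simplified and made up for illustration).
# With a plain pattern, the grep command line itself contains the search text.
plain_ps='1234 java cloudera-scm-server
5678 grep cloudera-scm-server'

# With the bracket pattern, the grep command line contains "[c]loudera",
# which the regular expression [c]loudera does not match.
bracket_ps='1234 java cloudera-scm-server
5678 grep [c]loudera-scm-server'

plain_count=$(printf '%s\n' "$plain_ps" | grep -c 'cloudera-scm-server')
bracket_count=$(printf '%s\n' "$bracket_ps" | grep -c '[c]loudera-scm-server')

echo "plain pattern matched $plain_count lines"
echo "bracket pattern matched $bracket_count line"
```

Quoting the pattern, as done here, also keeps the shell from treating the brackets as a filename wildcard if a matching file happens to exist in the current directory.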

This is the end of the exercise.

Hands-on Exercise: Creating a Hadoop Cluster

In this task, you will log in to the Cloudera Manager Admin Console and use the Cloudera Manager services wizard to create a Hadoop cluster. The wizard will prompt you to identify the machines to be managed by Cloudera Manager. It will then install the Cloudera Manager Agent on each machine. At that point, your environment will be ready for the CDH software to be installed.

You will then be prompted to choose which CDH services you want to add in the cluster and to which machines you would like to add each service.

At the end of this exercise, you will have Hadoop daemons deployed across your cluster as depicted here (daemons added in this exercise shown in blue).

In subsequent exercises, you will add more Hadoop services to your cluster.

Because you have only five hosts to work with, you will have to run multiple Hadoop daemons on all the hosts except lion, which will be used for Cloudera Manager services only. For example, the NameNode, a DataNode and a NodeManager will all run on elephant. Having a very limited number of hosts is not unusual when deploying Hadoop for a proof-of-concept project. Please follow the best practices in the “Planning Your Hadoop Cluster” chapter when sizing and deploying your production clusters.

After completing the installation steps, you will review a Cloudera Manager Agent log file and review processes running on a machine in the cluster. Finally, there is a section at the end of this exercise that provides a brief tour of the Cloudera Manager Admin UI.

IMPORTANT: This exercise builds on the previous one. If you were unable to complete the previous exercise or think you may have made a mistake, run the following command on elephant and follow the prompts to prepare for this exercise before continuing:

$ ~/training_materials/admin/scripts/reset_cluster.sh

Login to Cloudera Manager Admin UI

1. Verify that you are running a SOCKS5 proxy server started with the start_SOCKS5_proxy.sh script. The proxy server should be running in a separate terminal window on the Get2EC2 VM.

2. Start the Cloudera Manager Admin Console from a browser. The URL is http://lion:7180.

If an Unable to Connect message appears, the Cloudera Manager server has not yet fully started. Wait several moments, and then attempt to connect again.

3. Log in with the username admin. The password is admin.

The “Welcome to Cloudera Manager” page appears with a table that provides details about editions of Cloudera Manager software.

Install Cloudera Manager Agents

1. A green box highlighting a product edition column should appear.

Select the Cloudera Enterprise Data Hub Edition Trial.

Click Continue.

2. The “Thank you for choosing Cloudera Manager and CDH” page appears.

Click Continue.

3. The “Specify hosts for your CDH cluster installation” page appears.

Type in the names of all five machines: elephant tiger horse monkey lion

Click Search. All five machines should be found. Ensure they are all selected.

Click Continue.

4. The “Cluster Installation – Select Repository” page appears.

Specify the following option:

• Choose Method - Use Parcels

Click the “More Options” button.

• In the “Remote Parcel Repository URLs” area, remove all the current lines by using the minus (-) sign button.

• Once the existing entries are all removed, click on the plus (+) sign to add a new URL.

• In the blank field that appears, type in http://lion:8000

• Click OK to return to the Select Repository page.

The ‘Select Repository’ page should now show that CDH-5.x.x-x.cdh5.x.x.px.xx (where each x is a version number) is available. Select this version of CDH.

In the “Select the specific release of the Cloudera Manager Agent you want to install on your hosts” area choose “Custom Repository”.

• In the blank field that appears, type in http://lion:8050

Click Continue.

5. The “Cluster Installation – JDK Installation Options” page appears.

A supported version of the Oracle JDK is already installed.

Verify the box is unchecked so that the installation of JDK will be skipped.

Click Continue.

6. The “Cluster Installation – Enable Single User Mode” page appears.

Keep the default setting (Single User Mode not enabled) and click Continue.

7. The “Cluster Installation – Provide SSH login credentials” page appears.

Keep “Login To All Hosts As” set to root.

For “Authentication Method”, choose “All hosts accept same private key”.

Click the Browse button. In the “Places” column choose “training”. Then, in the “Name” area, right-click (or on a Mac Ctrl-click), and select “Show Hidden Files.” Finally, click into the .ssh directory, select the id_rsa file and click Open.

Leave the passphrase fields blank and click Continue.

When prompted to continue with no passphrase click OK.

8. The “Cluster Installation – Installation in progress” page appears.

Cloudera Manager installs the Cloudera Manager Agent on each machine.

Once the installation successfully completes on all machines, click Continue.

Install Hadoop Cluster

1. The “Cluster Installation – Installing Selected Parcels” page appears.

The CDH parcel is downloaded, distributed, and activated on all hosts in the cluster.

The “distributing” action may take a few minutes since this involves copying the large CDH parcel file from the lion machine to all of the other machines in the cluster.

When it completes, click Continue.

2. The “Cluster Installation – Inspect hosts for correctness” page appears.

After a moment, output from the Host Inspector appears.

Click Finish.

3. The “Cluster Setup - Choose the CDH 5 services that you want to install on your cluster” page appears.

Click “Custom Services”.

A table appears with a list of CDH service types.

4. Select the HDFS and YARN (MR2 Included) service types.

You will add additional services in later exercises.

Click Continue.

5. The “Cluster Setup – Customize Role Assignments” page appears.

Assign the following roles.

Role                           Node(s)

HDFS
NameNode                       elephant
SecondaryNameNode              tiger
Balancer                       horse
HttpFS                         Do not specify any hosts
NFS Gateway                    Do not specify any hosts
DataNode                       elephant, tiger, horse, monkey, but not lion; the order in which the hosts are specified is not significant

Cloudera Management Service
Service Monitor                lion
Activity Monitor               lion
Host Monitor                   lion
Reports Manager                lion
Event Server                   lion
Alert Publisher                lion

YARN (MR2 Included)
ResourceManager                horse
JobHistory Server              monkey
NodeManager                    Same as DataNode (elephant, tiger, horse, monkey, but not lion)

To assign a role, click the fields with one or more host names in them. For example, the field under NameNode initially has the value lion. To change the role assignment for the NameNode to elephant, click the field under NameNode (that has lion in it). A list of hosts appears. Select elephant, and then click OK. The field under NameNode now has the value elephant in it.

When you have finished assigning roles, compare your role assignments to the role assignments in the screenshot on the next page.

Verify that your role assignments are correct.

When you are certain that you have the correct role assignments, click Continue and proceed to the step after the screenshot.

6. The “Cluster Setup – Database Setup” page appears.

Fill in the details as shown here.

Database Host Name   Database Type   Database Name   Username   Password
lion                 MySQL           amon            amonuser   password
lion                 MySQL           rman            rmanuser   password

Click “Test Connection” to verify that Cloudera Manager can connect to the MySQL databases you created in an earlier exercise in this course.

After you have verified that all connections are successful, click Continue.

7. The “Cluster Setup – Review Changes” page appears.

Specify the following values (change lib to log):

• Host Monitor Storage Directory – /var/log/cloudera-host-monitor

• Service Monitor Storage Directory – /var/log/cloudera-service-monitor

Click Continue.

8. The “Cluster Setup – Progress” page appears.

Progress messages appear while cluster services are being created and starting.

The following is a summary of what Cloudera Manager’s Cluster Setup wizard lists that it does for you:

• Formats the HDFS NameNode

• Starts the HDFS service

• Creates a /tmp directory in HDFS

• Creates a MR2 Job History directory in HDFS

• Creates a YARN NodeManager remote application log directory in HDFS

• Starts the YARN service

• Starts the Cloudera Management Service

• Deploys all client configurations

When all the cluster services have started, click Continue.

9. The “Cluster Setup – Congratulations!” page appears.

The page indicates that services have been added and are now configured and running. Click Finish.

The Cloudera Manager Home page appears. Cluster installation is now complete.

10. A note regarding the configuration warnings.

The configuration warnings, highlighted in red in the screenshot below, are expected, and indicate that although the services are in good health, they are operating in low memory conditions. In a production deployment you would want to ensure these warnings were addressed. However, in our training environment these warnings can be safely ignored.

Examine Running Processes on a Cluster Node

1. Review operating system services on a host in the cluster.

On elephant:

$ chkconfig | grep cloudera

$ sudo service cloudera-scm-agent status

You have already added the YARN (MR2 Included) and HDFS services, yet the only service registered with init scripts at the operating system level is the cloudera-scm-agent service.

With Cloudera Manager managed clusters, the cloudera-scm-agent service on each cluster node manages starting and stopping the deployed Hadoop daemons.

2. Review the running Java processes on a host in the cluster.

On elephant:

$ sudo jps

The Hadoop daemons run as Java processes. You should see the NodeManager, NameNode, and DataNode processes running.

Examine the details of one of the running CDH daemons.

$ sudo ps -ef | grep NAMENODE

You will see details about the process.

If you grep for DATANODE or NODEMANAGER you will see details on those processes. Grep for ‘cloudera’ to see all CDH and Cloudera Manager processes currently running on the machine.

Testing Your Hadoop Installation

We will now test the Hadoop installation by uploading some data. The hdfs dfs command you use in this step will be explored in greater detail in the next exercise.

1. Upload some data to the cluster.

The commands below have you unzip a file on the local drive and then upload it to the /tmp directory in HDFS.

On elephant:

$ cd ~/training_materials/admin/data

$ gunzip shakespeare.txt.gz

$ hdfs dfs -put shakespeare.txt /tmp

2. Verify that the file is now in HDFS.

In Cloudera Manager choose Clusters > HDFS. Then click on File Browser.

Browse into tmp and confirm that shakespeare.txt appears.

A Brief Tour of Cloudera Manager

Now that you have the Cloudera Manager managed cluster running, let’s briefly explore a few areas in the Admin console.

On the Home page, you will see the cluster named Cluster 1 that you created.

1. Click on the drop-down menu to the right of Cluster 1 to see many of the actions you can perform on the cluster, such as Add a Service, Stop, and Restart.

2. Click Hosts to view the current status of each managed host in the cluster. Clicking on the > icon in the Roles column for a host will reveal which roles are deployed on that host.

In this exercise and the exercises that follow you will discover many other areas of Cloudera Manager that will prove useful for administering your Hadoop cluster(s).

Hands-On Exercise: Working With HDFS

In this Hands-On Exercise you will copy a large Web server log file into HDFS and explore the results in the Hadoop NameNode Web UI and the Linux file system. You will then create a snapshot of a directory in HDFS, delete some data, and then restore it to its original location in HDFS.

IMPORTANT: This exercise builds on the previous one. If you were unable to complete the previous exercise or think you may have made a mistake, run the following command on elephant and follow the prompts to prepare for this exercise before continuing:

$ ~/training_materials/admin/scripts/reset_cluster.sh

Perform all command-line steps in this exercise on elephant.

Confirm HDFS Processes and Settings

1. Confirm that all HDFS processes are running.

From the Cloudera Manager Home page click on HDFS and then click on the Instances tab.

Notice that the three daemons that support HDFS (the NameNode, SecondaryNameNode, and DataNode daemons) are running. In fact there are DataNodes running on four hosts.

2. Determine the current HDFS replication factor.

From the Cloudera Manager Home page click into the Search box and type “replication”.

Choose “HDFS: Replication Factor” which should be one of the search results.

You are taken to the HDFS Configuration page where you will find the dfs.replication setting that has the default value of 3.

3. Similarly, use the search box in the HDFS configuration page, and search for “block size” to discover the HDFS Block Size setting which defaults to 128 MiB.

Add Directories and Files to HDFS

1. As the hdfs user, create a home directory for the training user on HDFS and give the training user ownership of it.

On elephant:

$ sudo -u hdfs hdfs dfs -mkdir -p /user/training/

$ sudo -u hdfs hdfs dfs -chown training /user/training

2. Create an HDFS directory called /user/training/weblog, in which you will store the access log. The relative path weblog in the command below resolves against the training user's HDFS home directory, /user/training.

On elephant:

$ hdfs dfs -mkdir weblog

3. Extract the access_log.gz file and upload it to HDFS in a single step.

On elephant:

$ gunzip -c access_log.gz \
| hdfs dfs -put - weblog/access_log

The gunzip -c command writes the uncompressed data to stdout. The dash (-) given to put as its source tells it to read from stdin, so the piped data is written directly to the destination file in HDFS without creating an uncompressed copy on the local disk.

Since the file size is 504 MB, HDFS will have split it into multiple blocks. Let’s explore how the file was stored.

4. Run the hdfs dfs -ls command to review the file’s permissions in HDFS.

On elephant:

$ hdfs dfs -ls /user/training/weblog
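The way HDFS splits the uploaded file in step 3 can be estimated with simple arithmetic. The sketch below assumes the 504 MB file size quoted above is measured in the same units as the 128 MiB block size; the variable names are illustrative, and the result is an estimate, not output from the cluster.

```shell
# Illustrative arithmetic only; sizes are taken from the text and
# assumed to be in MiB.
file_mib=504
block_mib=128

# Ceiling division gives the number of blocks.
blocks=$(( (file_mib + block_mib - 1) / block_mib ))

# The final block holds whatever is left after the full blocks.
last_mib=$(( file_mib - (blocks - 1) * block_mib ))

echo "blocks=$blocks last_block_mib=$last_mib"
```

Under these assumptions the file occupies four blocks: three full 128 MiB blocks and a final, smaller block of 120 MiB.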

Analyze File Storage in HDFS and On Disk

1. Locate information about the access_log file in the NameNode Web UI.

a. From the Cloudera Manager Clusters menu, choose HDFS for your cluster.

b. In the HDFS Status page, click on “NameNode Web UI (Active)”. This will open the NameNode Web UI at http://elephant:50070.

c. Choose Utilities > “Browse the file system,” then navigate into the /user/training/weblog directory.

d. Notice that the permissions shown for the access_log file are identical to the permissions you observed when you ran the hdfs dfs -ls command.

e. Now select the access_log file in the NameNode Web UI file browser to bring up the File information window.

Notice that the Block Information drop-down list has four entries: Block 0, Block 1, Block 2, and Block 3. This makes sense because, as you discovered, the HDFS Block Size on your cluster is 128 MB. The extracted access_log file is 559 MB.

Blocks 0, 1, and 2 all show a size of 134217728 (or 128 MB), in accordance with the size specified in the HDFS Block Size setting you observed earlier in this exercise. Block 3 is smaller than the others, as you can see in the “Size” details if you choose it.

Notice also that each block is available on three different hosts in the cluster. This is what you would expect, since three is the current (and default) replication factor in HDFS. Also notice that each block may be replicated to different DataNodes than the other blocks that make up the file.


a. Choose Block 0 and write down the value that appears in the Size field. If Block 0 is not replicated on elephant, then choose another block that is replicated on elephant.

________________________________________________________________________________

You will need this value when you examine the Linux file that contains this block.

f. Select the value of the Block ID field and copy it (Edit menu > Copy). You will need this value for the next step in this exercise.

2. Locate the HDFS block on the elephant Linux file system.

a. In Cloudera Manager’s HDFS Configuration page, conduct a search for “Data Directory”. You will see that the DataNode Data Directory is /dfs/dn.

b. Let’s find the blocks stored in that directory. On elephant:

$ sudo find /dfs/dn -name '*BLKID*' -ls

where BLKID is the actual Block ID you copied from the NameNode Web UI.

c. Verify that two files with the Block ID you copied appear in the find command output: one file with the extension .meta, and another file without this extension.

d. Verify in the find command output that the size of the file containing the HDFS block is exactly the size that was reported in the NameNode Web UI.

3. Starting any Linux editor with sudo, open the file containing the HDFS block. Verify that the first few lines of the file match the first chunk of the access_log file content.


Note: You must start your editor with sudo because you are logged in to Linux as the training user, and this user does not have privileges to access the Linux file that contains the HDFS block.

$ sudo head /dfs/dn/path/to/block

Note: Replace /path/to/block in the command above with the actual path to the block as shown in the results of the find command you ran in the previous step.

You can review the access_log file content on HDFS as follows:

$ hdfs dfs -cat weblog/access_log | head -

The results returned by the last two commands should match exactly.
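
The comparison in this final step can also be done mechanically rather than by eye. The following is a minimal sketch that uses two local stand-in files instead of a real block file and HDFS; same_head is a hypothetical helper, and on the cluster its two inputs would be the output of the two commands above saved to files.

```shell
# Hypothetical helper: succeed if the first 10 lines of two files are identical.
same_head() {
  a=$(head -n 10 "$1")
  b=$(head -n 10 "$2")
  [ "$a" = "$b" ]
}

# Stand-in data (illustrative only): a "whole file" and its "first block".
workdir=$(mktemp -d)
seq 1 100 > "$workdir/whole_file"
head -n 40 "$workdir/whole_file" > "$workdir/first_block"

if same_head "$workdir/first_block" "$workdir/whole_file"; then
  result="match"
else
  result="differ"
fi
echo "$result"
rm -rf "$workdir"
```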



Hands-On Exercise: Running YARN Applications

In this exercise you will run two YARN applications on your cluster. The first application is a MapReduce job. The second is an Apache Spark application. You will add the Spark service on your cluster before running the Spark application. After completing this exercise, your cluster will have the following components installed (items installed in this exercise highlighted in blue):




Perform all steps in this exercise on elephant.

Submitting a MapReduce Application to Your Cluster

We will now test the Hadoop installation by running a sample Hadoop application that ships with the Hadoop source code. This is WordCount, a classic MapReduce program. We’ll run the WordCount program against the Shakespeare data added to HDFS in a previous exercise.

1. Since the code for the application we want to execute is in a Java Archive (JAR) file, we’ll use the hadoop jar command to submit it to the cluster. Like many MapReduce programs, WordCount accepts two additional arguments: the HDFS directory path containing input and the HDFS directory path into which output should be placed. Therefore, we can run the WordCount program with the following command.

On elephant:

$ hadoop jar /opt/cloudera/parcels/CDH-5.3.2-1.cdh\
5.3.2.p0.10/lib/hadoop-mapreduce/hadoop-mapreduce-\
examples.jar wordcount /tmp/shakespeare.txt counts

Verifying MapReduce Job Output

2. Once the program has completed, you can inspect the output by listing the contents of the output (counts) directory.

On elephant:

$ hdfs dfs -ls counts

3. This directory should show all the data output for the job. Job output will include a _SUCCESS flag and one file created by each Reducer that ran. You can view the output by using the hdfs dfs -cat command.

On elephant:


$ hdfs dfs -cat counts/part-r-00000
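
To get a feel for what the part-r-00000 file contains, the same counting can be mimicked on a tiny local sample with standard Unix tools. This is only a sketch: local_wordcount is a hypothetical name, the sample sentence is made up, and the real job runs as distributed map and reduce tasks rather than as a single pipeline. As in the job output, each line holds a word, a tab, and its count.

```shell
# Hypothetical local imitation of WordCount: split on whitespace,
# group identical words, and print "word<TAB>count" sorted by word.
local_wordcount() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

sample_counts=$(printf 'to be or not to be\n' | local_wordcount)
printf '%s\n' "$sample_counts"
```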

Review the MapReduce Application Details and Logs

In this task you will start by looking at details in the YARN Applications area of Cloudera Manager. The Application Details link in Cloudera Manager will then take you to the HistoryServer Web UI at http://monkey:19888.

As you go through the steps below, see if you can reconstruct what occurred where when you ran the MapReduce job, by creating a chart like the one below.

                        Node(s)
ApplicationMaster       ____________
Map Task(s)             ____________
Reduce Task(s)          ____________

1. Locate your MapReduce application in Cloudera Manager.

In Cloudera Manager, choose Clusters > YARN (MR2 Included), then click on Applications.

In the Results tab that displays, you should see the MapReduce application that just ran.

Note: There will be an entry for each completed MapReduce job that you have run on your cluster within the time frame of your search. The default search is for applications run within the last 30 minutes.

2. Access the HistoryServer Web UI to discover where the ApplicationMaster ran.

From the drop-down menu for the “wordcount” application, choose “Application Details.”


This action will open a page in the HistoryServer Web UI with details about the job.

3. Locate where the ApplicationMaster ran and view the log.

Notice the ApplicationMaster section shows which cluster node ran the MapReduce ApplicationMaster. Click the “logs” link to view the ApplicationMaster log.

Notice also the number of map and reduce tasks that ran in order to complete the word count job. The number of reducers run by the job should correspond to the number of part-r-##### files you saw when you ran the hdfs dfs -ls command earlier. There are no part-m-##### files because the job ran at least one reducer.

4. Locate where the mapper task ran and view the log.

From the HistoryServer Web UI’s “Job” menu choose “Map tasks”.

From the Map Tasks table, click on the link in the Name column for the task.

The Attempts table displays. Notice the “Node” column shows you where the map task attempt ran.

Click the “logs” link and review the contents of the mapper task log. When done, click the browser back button to return to the previous page.

5. Locate where the reduce tasks ran and view the logs.

From the HistoryServer Web UI’s Job menu choose “Reduce tasks.”


From the Reduce Tasks table, click on the link in the Name column for one of the tasks.

The Attempts table displays. Notice the “Node” column shows you where this reducer task ran.

Click the “logs” link and review the contents of the log. Observe the amount of output in the Reducer task log. When done, click the browser back button to return to the previous page.

6. Determine the log level for Reducer tasks for the word count job.

Expand the Job menu and choose “Configuration.”

Twenty entries from the job configuration that were in effect when the word count job ran appear.

In the Search field, enter log.level.

Locate the mapreduce.reduce.log.level property. Its value should be INFO.

Note: INFO is the default value for the “JobHistory Server Logging Threshold”, which can be found in the Cloudera Manager YARN Configuration page for your cluster.

Run the MapReduce Application with a Custom Log Level Setting

1. Remove the counts directory from HDFS and rerun the WordCount program, this time passing it a log level argument.

On elephant:


$ hdfs dfs -rm -r counts



examples.jar wordcount \

-D mapreduce.reduce.log.level=DEBUG \

/tmp/shakespeare.txt counts

Note: You must delete the counts directory before running the WordCount program a second time because MapReduce will not run if you specify an output path which already exists.

MapReduce programs coded to take advantage of the Hadoop ToolRunner allow you to pass several types of arguments to Hadoop, including run-time configuration parameters. The hadoop jar command shown above sets the log level for reduce tasks to DEBUG.

Note: The -D option, as used in the hadoop jar command above, allows you to override a default property setting by specifying the property and the value you want to assign.

When your job is running, look for a line in standard output similar to the following:

14/12/09 05:47:16 INFO mapreduce.Job: Running job: job_1391249780844_0004

2. After the job completes, locate and view one of the reducer logs.

From Cloudera Manager’s YARN Applications page for your cluster, locate the entry for the application that you just ran.

Click on the ID link, and use the information available under the Job menu’s “Configuration” and “Reduce tasks” links to verify the following:

• The value of the mapreduce.reduce.log.level configuration attribute is DEBUG.

• The Reducer task’s logs for this job contain DEBUG log records and are larger than the Reducer task’s logs from the previous WordCount job execution.

3. Verify the results of the word count job were written to HDFS using any of the following three methods.

Option 1: In Cloudera Manager, browse to the HDFS page for your cluster, then choose File Browser. Drill down into /user/training/counts.

Option 2: Access the HDFS NameNode Web UI at http://elephant:50070. Choose Utilities > “Browse the file system”, and navigate to the /user/training/counts directory.

Option 3: On any machine in the cluster that has the DataNode installed (elephant, tiger, monkey, or horse), run the following command in a terminal:

$ hdfs dfs -tail counts/part-r-00000
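
The job ID printed while the job runs is also the key to finding the job from the command line: a MapReduce job ID and its YARN application ID share the same numeric suffix and differ only in the prefix. The sketch below extracts the job ID from a log line like the sample shown in step 1 and derives the application ID from it; the variable names are illustrative.

```shell
# Sample line copied from the job's standard output shown in step 1.
log_line='14/12/09 05:47:16 INFO mapreduce.Job: Running job: job_1391249780844_0004'

# Pull out the job ID (job_<cluster start timestamp>_<sequence number>).
job_id=$(printf '%s\n' "$log_line" | grep -o 'job_[0-9]*_[0-9]*')

# The YARN application ID uses the same numbers with a different prefix.
app_id="application_${job_id#job_}"

echo "$job_id -> $app_id"
```

The resulting application ID is the kind of value that the yarn logs -applicationId command, used later in this exercise, expects.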

Add the Apache Spark Service

In this task, you will add the Spark service to your cluster using Cloudera Manager. You will then run a Spark application.

1. In Cloudera Manager, navigate to the Home page.

2. Select the “Add a Service” option from the drop-down menu for Cluster 1.


The Add Service Wizard appears.

3. Select Spark and click Continue.

The “Select the set of dependencies for your new service” page appears.

4. Select the row containing the hdfs and yarn services, then click Continue.

The Customize Role Assignments page appears.

5. Specify host assignments as follows:

• History Server – monkey only
• Gateway – monkey only

Click Continue.

6. Progress messages appear on the Progress page.

When the service has been added, click Continue.

7. The Congratulations page appears.

Click Finish.

8. The Home page appears.

A status indicator shows you that the Spark service is in good health.

Note: You may notice a configuration warning that appears next to “Hosts” on the Cloudera Manager home page. If you look into it, Cloudera Manager indicates that memory is overcommitted on host monkey. This configuration issue, along with the five other configuration warnings that appeared after the initial cluster installation, would need to be addressed in a true production cluster; however, they can be safely ignored in the classroom environment.

Run Spark as a YARN Application

You should complete these steps on monkey, since monkey is where you just added the Spark gateway role.


1. Start the Spark shell and connect to the yarn-client Spark context on monkey.

Recall that the Spark Gateway service was installed on monkey, so the Spark shell should be run from monkey.

On monkey:

$ spark-shell --master yarn-client

The Scala Spark shell will launch. You should eventually see the message, “Spark context available as sc.” If necessary, press the <Enter> key on your keyboard to see the scala> prompt.

2. Type in the commands below to run a word count application using the shakespeare.txt file that is already in HDFS.

This accomplishes something very similar to the job you ran in the MapReduce exercise, but this time the computational framework being used is Spark.

scala> val file = sc.textFile(
"hdfs://elephant:8020/tmp/shakespeare.txt")

scala> val counts = file.flatMap(line => line.split(
" ")).map(word => (word, 1)).reduceByKey(
_ + _).sortByKey()

scala> counts.saveAsTextFile(
"hdfs://elephant:8020/tmp/sparkcount")

scala> sc.stop()

scala> sys.exit()

3. View the application results written to HDFS by Spark.

On elephant:


$ hdfs dfs -cat /tmp/sparkcount/part-00000 | less

Review Application Details in the Spark History Server

View the Spark application details in Cloudera Manager and the Spark History Server Web UI.

1. In Cloudera Manager, go to the YARN Applications page for your cluster.

You will see a “Spark shell” application.

2. Click the link in the ID field.

A page in the YARN ResourceManager Web UI opens with details about the application.

3. Click on the History link.

A Spark Jobs page in the Spark History Server Web UI opens.

Notice that this Spark application consisted of two jobs that are now completed.

4. In the Completed Jobs area, click on the “sortByKey…” link for the first job that ran (Job Id 0).

Notice that this first job consisted of two stages.

5. Click on the “map at…” link for the first stage (Stage 0).

The “Details for Stage 0” page appears.

Here you can see that there were two tasks in this stage, and you can see on which executor and on which host each task ran. You can also see task details such as duration, input data size, and the amount of data written to disk during shuffle operations.

6. Click on the Executors tab to see a summary of all the executors used by the Spark application.


Review the Spark Application Logs

1. Access the Spark application logs from the command line.

First locate the application ID.

On elephant:

$ yarn application -list -appStates FINISHED

Copy the application ID for the Spark application returned by the command above.

Now run this command (where appId is the actual application ID).

On elephant:

$ yarn logs -applicationId appId | less

Scroll through the logs returned by the shell. Notice that the logs for all the containers that ran the Spark executors are included in the results.

These Spark application logs are stored in HDFS under /user/spark/applicationHistory.



Hands-On Exercise: Explore Hadoop Configurations and Daemon Logs

IMPORTANT: This exercise builds on the previous one. If you were unable to complete the previous exercise or think you may have made a mistake, run the following command and follow the prompts to prepare for this exercise before continuing:


Exploring Hadoop Configuration Settings

In this task, you will explore some of the Hadoop configuration files auto-generated by Cloudera Manager.

1. Go to the directory that contains the Hadoop configuration for Cloudera Manager-managed daemons running on elephant and then view the contents.

On elephant:

$ cd /var/run/cloudera-scm-agent/process

$ sudo tree

Notice how there are separate directories for each role instance running on elephant: DataNode, NameNode, and NodeManager. Notice also that some files, such as hdfs-site.xml and core-site.xml, exist in more than one daemon’s configuration directory.

2. Compare the first 20 lines of the NameNode’s copy of core-site.xml to the NodeManager’s copy of the same file.

On elephant:


$ sudo head -20 nn-hdfs-NAMENODE/core-site.xml

$ sudo head -20 nn-yarn-NODEMANAGER/core-site.xml

In the commands above, replace each nn with the actual numbers generated by Cloudera Manager.

The entries in these core-site.xml files reflect the settings configured in Cloudera Manager for these particular role instances. Some of the settings reflect choices you made when you ran the Cloudera Manager installation wizard. Other entries are optimal initial or default values chosen by Cloudera Manager.

3. Make a configuration change in Cloudera Manager.

In Cloudera Manager, browse to the HDFS page for your cluster and then choose Configuration.

Conduct a search for the word “trash”. The NameNode Default Group “Filesystem Trash Interval” setting appears.

Double-click into the Value area where it currently reads “1 day(s)”.

Change this value to 2 day(s). Click Save Changes.

Notice the “Stale Configuration - Restart needed” icon that appears on the screen. Click on it.

The “Stale Configurations - Review Changes” screen appears, showing the changes to core-site.xml that will be made.

Click “Restart Cluster”. The “Cluster 1 Stale Configurations - Restart Cluster” screen appears.

Click “Restart Now”. The “Stale Configurations - Progress” screen appears. Wait for the child commands to complete successfully. Click Finish.

4. Return to the elephant terminal window and list the contents of the /var/run/cloudera-scm-agent/process directory.


$ sudo ls -l /var/run/cloudera-scm-agent/process

Notice how there are now two directories each for NameNode, DataNode, and NodeManager. The old settings have been retained; however, the new settings have also been deployed and will now be used. The directory with the higher number in the name is the newer one.

5. Find the difference between the old NameNode core-site.xml file and the new one.

On elephant:

$ sudo diff -C 2 nn-hdfs-NAMENODE/core-site.xml nn-hd\
fs-NAMENODE/core-site.xml

The nn values above should be replaced with the actual numbers with which the configuration directories are named.

You should see that the fs.trash.interval property value change has been deployed to the new NameNode configuration file.

6. Revert the configuration change and restart the cluster.

In Cloudera Manager, go to the HDFS Configuration page for your cluster.

Click the “History and Rollback” button.

The “Configuration and Role Group History” page displays.


Under Current Revision click “Details”.

The Revision Details screen displays. Notice the Filesystem Trash Interval property value that you just modified is listed. Choose “Revert Configuration Changes”.

A message appears indicating that the revision was reverted. Go to the HDFS Status page and notice the “Stale Configuration - Restart needed” icon.

Click on the icon and follow the steps to restart the cluster.
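
To interpret the change seen in step 5, it helps to know that Hadoop’s fs.trash.interval property is expressed in minutes, so 1 day and 2 days correspond to 1440 and 2880. The sketch below builds two stand-in files in a temporary directory (they are not the real Cloudera Manager files, and make_site and get_property are hypothetical helpers) and extracts the value from each, mirroring the diff in step 5.

```shell
workdir=$(mktemp -d)

# Hypothetical helper: write a minimal stand-in core-site.xml.
# $1 = output file, $2 = trash interval in minutes
make_site() {
  cat > "$1" <<EOF
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}
make_site "$workdir/old-core-site.xml" $(( 1 * 24 * 60 ))   # 1 day
make_site "$workdir/new-core-site.xml" $(( 2 * 24 * 60 ))   # 2 days

# Hypothetical helper: print the <value> on the line after a given <name>.
# $1 = file, $2 = property name
get_property() {
  grep -A1 "<name>$2</name>" "$1" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

old=$(get_property "$workdir/old-core-site.xml" fs.trash.interval)
new=$(get_property "$workdir/new-core-site.xml" fs.trash.interval)
echo "fs.trash.interval: $old -> $new minutes"

# Same diff form as in step 5; diff exits non-zero when files differ.
diff -C 2 "$workdir/old-core-site.xml" "$workdir/new-core-site.xml" || true
rm -rf "$workdir"
```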

Examining Hadoop Daemon Log Files

In the previous exercise, you reviewed the application logs from MapReduce and Spark running as YARN applications. Here you will review Hadoop daemon log files, including the HDFS NameNode and YARN ResourceManager log files.

With Cloudera Manager, Hadoop daemons generate a .log.out file, a standard error log (stderr.log), and a standard output log (stdout.log).

In this task, you will examine Hadoop daemon log files using the NameNode Web UI, Cloudera Manager, the ResourceManager Web UI, and the NodeManager Web UI.

1. View the NameNode log file using the NameNode Web UI.


Access the NameNode Web UI from the Quick Links on the HDFS Status page in Cloudera Manager.

From the NameNode Web UI, select Utilities > Logs. The list of folders and files in the /var/log/hadoop-hdfs directory on elephant appears.

Open the NameNode log file and review the file.

2. Access the daemon logs directly from Cloudera Manager.

In Cloudera Manager, choose Hosts from the top navigation bar.

Select elephant and then choose the Processes tab.

Locate the row for the NameNode and click “Full log file.”

The Log Details page opens at the tail end of the NameNode log file. Scroll up to view earlier log messages.

Note: If the log file is very large and you want to see messages near the top, scrolling in the Cloudera Manager UI will be slow. Other tools provide quick access to the entire log file.

Click “Download Full Log” (in the upper right corner of the Log Details page) to download the entire log.

3. Review the NameNode daemon’s standard error and standard output logs using Cloudera Manager.

Return to the Processes page for elephant.

Click the Stdout link for the NameNode instance. The standard output log appears. Review the file, then return to the Processes page.

Click the Stderr link for the NameNode instance. The standard error log appears. Review the file.

Note: If you want to locate these log files on disk, they can be found on elephant in the /var/log/hadoop-hdfs and /var/run/cloudera-scm-agent/process/nn-hdfs-NAMENODE/logs directories.

4. Using Cloudera Manager, review recent entries in the SecondaryNameNode logs.


To find the log, go to the HDFS Instances page for your cluster, then click on the SecondaryNameNode role type, and in the Status page for the tiger host, click “Log File”.

5. Access the ResourceManager log file using the ResourceManager Web UI.

Start the ResourceManager Web UI (from Cloudera Manager’s YARN Status page or by specifying the URL http://horse:8088 in your browser).

Choose “Nodes” from the Cluster menu on the left side of the page.

Click the horse:8042 link to be taken to the NodeManager Web UI.

Expand the Tools menu on the left side of the page.

Click “Local logs.”

Finally, click the entry for the ResourceManager log file and review the file.


Hands-On Exercise: Using Flume to Put Data into HDFS

In this Hands-On Exercise you will use Flume to import dynamically generated data into HDFS. A very common use case for Flume is to collect access logs from a large number of Web servers on your network; we will simulate a simple version of this in the following exercise.

The diagram below shows the data flow that will occur once you complete the Exercise.




Adding the Flume Service

1. Add the Flume service on elephant and horse.

From the Cloudera Manager Home page, click the down arrow next to your cluster and choose “Add a Service”.

Select “Flume” and click “Continue”. Note that HDFS is a dependency; however, the HDFS service is already added. Click “Continue” again.

Use the “Select hosts” button to add the Flume Agent on both elephant and horse, then click OK and Continue. At the Congratulations screen click “Finish”.

Note: You may notice that there are now two new ‘Hosts’ configuration issues, this time on elephant and horse, that (like the one that appeared on monkey earlier) are related to memory overcommit validation thresholds. In a true production cluster, you would want to address all configuration issues; however, you can safely ignore these in the classroom environment.

2. Update the configuration for the Flume agent on elephant.


From the Cloudera Manager Home page, click the Flume link and then click on the Instances tab.

Select the “Agent” that resides on elephant, then click Configuration.

Delete the default contents of the two properties listed in the table below entirely and replace with the lines shown below.

Note: The Configuration File lines are also available in ~/training_materials/admin/scripts/flume-tail1.txt

Tip: You can expand the Configuration File text area to make it easier to edit in by dragging out the bottom right corner of the text box.

Property: Agent Name
Value: tail1

Property: Configuration File
Value:

tail1.sources = src1
tail1.channels = ch1
tail1.sinks = sink1
tail1.sources.src1.type = exec
tail1.sources.src1.command = tail -F /tmp/access_log
tail1.sources.src1.channels = ch1
tail1.channels.ch1.type = memory
tail1.channels.ch1.capacity = 500
tail1.sinks.sink1.type = avro
tail1.sinks.sink1.hostname = horse
tail1.sinks.sink1.port = 6000
tail1.sinks.sink1.batch-size = 1
tail1.sinks.sink1.channel = ch1

Click Save Changes.

3. Update the configuration for the Flume agent on horse.

From the Cloudera Manager Flume page, click on the Instances tab.

Select the Agent that resides on horse, then click Configuration.

Delete the default contents of the two properties listed in the table below entirely and replace with the lines shown below.


Note: The Configuration File lines are also available in ~/training_materials/admin/scripts/flume-collector1.txt

Property: Agent Name
Value: collector1

Property: Configuration File
Value:

collector1.sources = src1
collector1.channels = ch1
collector1.sinks = sink1
collector1.sources.src1.type = avro
collector1.sources.src1.bind = horse
collector1.sources.src1.port = 6000
collector1.sources.src1.channels = ch1
collector1.channels.ch1.type = memory
collector1.channels.ch1.capacity = 500
collector1.sinks.sink1.type = hdfs
collector1.sinks.sink1.hdfs.path = hdfs://elephant/user/flume/collector1
collector1.sinks.sink1.hdfs.filePrefix = access_log
collector1.sinks.sink1.channel = ch1

Ensure that “collector1.sinks.sink1.hdfs.path = hdfs://elephant/user/flume/collector1” shown above is all on a single line.

Click “Save Changes.”


4. Create the /user/flume/collector1 directory in HDFS to store the files.

On elephant:

$ sudo -u hdfs hdfs dfs -mkdir -p \
/user/flume/collector1

$ sudo -u hdfs hdfs dfs -chown -R flume /user/flume

Starting the Data Generator

1. Open a new terminal window on elephant (or an ssh connection to elephant). In this terminal window, run the accesslog-gen.bash shell script, which simulates a Web server creating log files. This shell script also rotates the log files regularly.

On elephant:

$ accesslog-gen.sh /tmp/access_log

Note: The accesslog-gen.bash script is specific to the training environment and is not part of CDH.

2. Open a second new terminal window on elephant (or an ssh connection to elephant). Verify that the log file has been created. Notice that the log file is rotated periodically.

On elephant:

$ ls -l /tmp/access*

-rw-rw-r-- 1 training training 498 Nov 15 15:12 /tmp/access_log

-rw-rw-r-- 1 training training 997 Nov 15 15:12 /tmp/access_log.0

-rw-rw-r-- 1 training training 1005 Nov 15 15:11 /tmp/access_log.1


Starting the Flume Collector Agent

Here you start the Flume agent that will insert the data into HDFS. This agent receives data from the source Flume agent.

1. Start the collector1 Flume Agent on horse.

In Cloudera Manager, go to Flume’s Instances tab. Select the Agent hosted on horse.

From the “Actions for Selected” menu choose Start. In the “Start” window that appears, click “Start”.

In the “Command Details: Start” screen wait for confirmation that the agent started successfully. Click Close.

Starting the Flume Source Agent

Here you start the agent that reads the source log files and passes the data along to the collector agent you have already started.

1. Start the tail1 Flume Agent on elephant.

From the Cloudera Manager Flume Instances tab, select the Agent hosted on elephant.

From the “Actions for Selected” menu choose Start. In the “Start” window that appears, click “Start”.

In the “Command Details: Start” screen wait for confirmation that the agent started successfully. Click Close.

Viewing Data in HDFS

1. Confirm the data is being written into HDFS.

In Cloudera Manager browse to the HDFS page for your cluster and click on File Browser.

Drill down into /user/flume/collector1. You should see many access_log files.


2. View Metric Details.

Return to the Flume page and click on the Metric Details tab. Here you can see details related to the Channels, Sinks, and Sources of your running Flume agents. If you are interested, an explanation of the metrics available is at http://bit.ly/flumemetrics.

Increase the File Size in HDFS (Optional)

These steps are optional, but may be of interest to some students.

1. Edit the collector1 agent configuration settings on horse by adding these three additional name-value pairs:

collector1.sinks.sink1.hdfs.rollSize = 2048

collector1.sinks.sink1.hdfs.rollCount = 100

collector1.sinks.sink1.hdfs.rollInterval = 60

Click Save Changes.

2. From the Flume Status page, click on the “Stale Configuration - Refresh needed” icon and follow the prompts to refresh the cluster.

3. Execute hdfs dfs -ls /user/flume/collector1 in a terminal window and note the file size of the more recent content posted by Flume to HDFS.
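
The three properties added in step 1 act as independent triggers: the HDFS sink closes the current file and starts a new one as soon as any one threshold is reached, with rollSize counted in bytes, rollCount in events, and rollInterval in seconds. The following sketch models that rule with a hypothetical should_roll function; it illustrates the logic only and is not Flume’s actual code.

```shell
# Thresholds from the optional configuration in step 1.
roll_size=2048      # bytes written to the current file
roll_count=100      # events written to the current file
roll_interval=60    # seconds since the current file was opened

# Hypothetical model: roll when ANY threshold has been reached.
# Arguments: bytes written, events written, seconds open.
should_roll() {
  [ "$1" -ge "$roll_size" ] || [ "$2" -ge "$roll_count" ] || [ "$3" -ge "$roll_interval" ]
}

# Three illustrative situations: nothing reached, size reached, time reached.
if should_roll 500 10 5;  then echo "roll"; else echo "keep writing"; fi
if should_roll 2100 10 5; then echo "roll"; else echo "keep writing"; fi
if should_roll 500 10 61; then echo "roll"; else echo "keep writing"; fi
```

Because the first threshold reached wins, a slow trickle of events tends to produce files closed by rollInterval, while a busy source tends to hit rollSize or rollCount first.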

Viewing the Logs

1. Check the log files to see messages.

In Cloudera Manager choose Diagnostics > Logs.

Click Select Sources and configure as follows:

• Uncheck all sources except Flume
• Set the Minimum Log Level to INFO
• Leave the time frame of your search set to 30 minutes

Click Search.


Browse through the logged actions from both Flume agents.

Cleaning Up

1. Stop the log generator by hitting Ctrl-C in the first terminal window.

2. Stop both Flume agents from the Flume Instances page in Cloudera Manager.

3. Remove the generated access log files from the /tmp directory to clear up space on your virtual machine.

On elephant:

$ rm -rf /tmp/access_log*

This is the end of the Exercise.


Hands-On Exercise: Importing Data with Sqoop

For this exercise you will import data from a relational database using Sqoop. The data you load here will be used in a subsequent exercise.

Consider the MySQL database movielens, derived from the MovieLens project from the University of Minnesota. (See note at the end of this exercise.) The database consists of several related tables, but we will import only two of these: movie, which contains about 3,900 movies; and movierating, which has about 1,000,000 ratings of those movies.

Perform all steps in this exercise on elephant.

Reviewing the Database Tables

First, review the database tables to be loaded into Hadoop.

1. Log on to MySQL.

On elephant:

$ mysql --user=training --password=training movielens

2. Review the structure and contents of the movie table.

On elephant:


mysql> DESCRIBE movie;

. . .

mysql> SELECT * FROM movie LIMIT 5;

3. Note the column names for the table.

____________________________________________________________________________________________

4. Review the structure and contents of the movierating table.

On elephant:

mysql> DESCRIBE movierating;

. . .

mysql> SELECT * FROM movierating LIMIT 5;

5. Note these column names.

____________________________________________________________________________________________

6. Exit MySQL.

On elephant:

mysql> quit;

Adding the Sqoop 1 Client

1. Add the Sqoop 1 Client gateway on elephant.

From the Home page in Cloudera Manager, click the down arrow icon next to Cluster 1 and choose “Add a Service”.

Select Sqoop 1 Client and click Continue.

At the “Custom Role Assignments” page, click on the Select hosts box and choose to add the Gateway on elephant. Click Continue.


The “Progress” page appears. Once the client configuration has been deployed successfully, click Continue. At the “Congratulations” screen click Finish.

2. Using sudo, create a symlink to the MySQL JDBC driver.

On elephant:

$ sudo ln -s /usr/share/java/mysql-connector-java.jar \
/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10\
/lib/sqoop/lib/

Now run the command below to confirm the symlink was properly created.

On elephant:

$ readlink -f /opt/cloudera/parcels/CDH-5.3.2-\
1.cdh5.3.2.p0.10/lib/sqoop/lib/mysql-connector-java.jar

If the symlink was properly defined, the command should return the /usr/share/java/mysql-connector-java.jar path.
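
The ln -s and readlink -f pair can be tried safely in a scratch directory before touching the parcel directory. The sketch below uses made-up file and directory names under a temporary directory; it shows that giving ln -s a directory as its last argument creates a link named after the target file, and that readlink -f resolves the link back to the target. It assumes a Linux system with GNU coreutils, as on the training hosts.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/java" "$workdir/lib"
touch "$workdir/java/driver.jar"          # stand-in for the JDBC driver

# A trailing directory argument makes ln name the link after the target.
ln -s "$workdir/java/driver.jar" "$workdir/lib/"

# Resolve both paths fully so they can be compared reliably.
resolved=$(readlink -f "$workdir/lib/driver.jar")
target=$(readlink -f "$workdir/java/driver.jar")
echo "link resolves to: $resolved"
```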

Importing with Sqoop

You invoke Sqoop on the command line to perform several commands. With it you can connect to your database server to list the databases (schemas) to which you have access, and list the tables available for loading. For database access, you provide a connect string to identify the server, and your username and password.

1. Show the commands available in Sqoop.

On elephant:

$ sqoop help

You can safely ignore the warning that Accumulo does not exist since this course does not use Accumulo.

2. List the databases (schemas) in your database server.


On elephant:

$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training

(Note: Instead of entering --password training on your command line, you may prefer to enter -P, and let Sqoop prompt you for the password, which is then not visible when you type it.)

3. List the tables in the movielens database.

On elephant:

$ sqoop list-tables \

--connect jdbc:mysql://localhost/movielens \


4. Import the movie table into Hadoop.

On elephant:

$ sqoop import \

--connect jdbc:mysql://localhost/movielens \

--table movie --fields-terminated-by '\t' \


The --fields-terminated-by '\t' option separates the fields in the HDFS file with the tab character, which is sometimes useful if users will be working with Hive and Pig.

Warnings that packages such as hbase, hive-hcatalog, and accumulo are not installed are expected. It is not a problem that these packages are not installed on your system.

Notice how the INFO messages that appear show that a MapReduce job consisting of four map tasks was completed.


5. Verify that the command has worked.

On elephant:

$ hdfs dfs -ls movie

$ hdfs dfs -tail movie/part-m-00000

6. Import the movierating table into Hadoop using the command in step 4 as an example.

Verify that the movierating table was imported using the command in step 5 as an example or by using the Cloudera Manager HDFS page’s File Browser.

7. Optionally observe the results in Cloudera Manager’s YARN Applications page.

Navigate to the YARN Applications page.

Notice the last two YARN applications that ran (movie.jar and movierating.jar).

Explore the job details for either or both of these jobs.

This is the end of the Exercise

Note:

This exercise uses the MovieLens data set, or subsets thereof. This data is freely available for academic purposes, and is used and distributed by Cloudera with the express permission of the UMN GroupLens Research Group. If you would like to use this data for your own research purposes, you are free to do so, as long as you cite the GroupLens Research Group in any resulting publications. If you would like to use this data for commercial purposes, you must obtain explicit permission. You may find the full dataset, as well as detailed license terms, at http://www.grouplens.org/node/73


Hands-On Exercise: Querying HDFS With Hive and Cloudera Impala

In this exercise, you will add Hive and Cloudera Impala services to your cluster, enabling you to query data stored in HDFS.

You will start by adding the ZooKeeper service to your cluster. ZooKeeper is a prerequisite for HiveServer2, which you will deploy when you add the Hive service.

Then you will add the Hive service, including a Hive Metastore Server and HiveServer2, to your Hadoop cluster, and configure the service.

Next, you will add the Impala service to your cluster and configure the service.

Then you will populate HDFS with data from the movierating table and run queries against it using both Hive and Impala.


At the end of this exercise, you should have daemons deployed on your five hosts as follows (new services added in this exercise are highlighted in blue):



Adding the ZooKeeper Service

In this task, you will add a ZooKeeper service to your cluster. A running ZooKeeper service is a prerequisite for many other services, so you will add this service now. When you add additional services later in the class, you may notice the Exercise instructions have you select ZooKeeper as part of the set of dependencies for the new service to use.

1. From the Cloudera Manager Home page, select the ‘Add a Service’ menu option from the drop-down menu to the right of Cluster 1.

2. Select ZooKeeper and click Continue.

3. Specify the following host assignments:

• Server – elephant, horse, and tiger, but not lion or monkey

Click OK and then click Continue.

4. The Review Changes page appears.

Review the default values specified on this page and click Continue.




ClickFinish.

7. The Cloudera Manager Home page appears.

The zookeeper service now appears in the list of services.

A health issue icon may appear next to the new ZooKeeper service temporarily; however, this should go away momentarily and the status should change to Good Health.


Note: You may notice that there is now one more ‘Hosts’ configuration issue, this time on tiger, that (like the ones that appeared earlier) is related to memory overcommit validation thresholds. In a true production cluster, you would want to address all configuration issues; however, you can safely ignore these in the classroom environment.

Adding the Hive Service to Your Cluster

In this task, you will add the Hive service to your cluster using Cloudera Manager.

You will configure the Hive Metastore to use the MySQL metastore database and to have a HiveServer2 instance.

Hive and Impala can both make use of a single, common Hive metastore. Recall that you created a few databases by running a SQL script prior to installing your CDH cluster. One of the databases you created is named metastore, which will be used by Hive and Impala as a common metastore for storing table definitions.

At the end of the task, you will run a simple Hive command to verify that the Hive service has been added.

1. From the Cloudera Manager Home page, select the “Add a Service” option for Cluster 1.

2. Select Hive and click Continue.

3. Select the row containing the hdfs, yarn, and zookeeper services, then click Continue.




• Gateway – elephant only
• Hive Metastore Server – elephant only
• WebHCat Server – Do not select any hosts
• HiveServer2 – elephant only

Verify that you have selected only elephant and not any additional hosts for the Gateway, Hive Metastore Server, and HiveServer2 roles.

Click Continue.

5. The “Database Setup” page appears.

Specify values for the database as follows:

• Database Host Name – lion
• Database Type – MySQL
• Database Name – metastore
• User Name – hiveuser
• Password – password

Click “Test Connection” and verify that the connection to the MySQL database is successful.

Click Continue.






Click Finish.

A status indicator shows you that the hive service is in good health.


10. Verify that you can run a Hive command from the Beeline shell.

On elephant:

$ beeline -u jdbc:hive2://elephant:10000/default \

-n training

In the Beeline shell, type the following command:

> SHOW TABLES;

No tables should appear, because you haven’t defined any Hive tables yet, but you should not see error messages.

11. Exit the Beeline shell.

> !quit

Adding the Impala Service

In this task, you will add the Impala service to your cluster.

Cloudera Manager will automatically configure Impala to use the Hive Metastore service that you created earlier in this exercise.

1. From the Cloudera Manager Home page, select the Add a Service option for Cluster 1.

2. Select Impala and click Continue.

3. Select the row containing the hdfs and hive services, then click Continue.




• Impala Catalog Server – horse
• Impala StateStore – horse
• Impala Daemon – elephant, horse, monkey, and tiger

Click Continue.

Note: When you added the Hive service, you specified elephant as a Gateway host, which caused the Hive client to be added on elephant. With Impala, the Impala client (impala-shell) is automatically added on all hosts running Impala.

Review the default values specified for the Impala Daemon Scratch Directories on this page and click Continue.

Click Finish.

8. Restart the HDFS service.

After adding Impala, on the Cloudera Manager home page you will notice that the HDFS service has stale configurations, as indicated by the icon that appears.

Click on the “Stale Configuration: Restart needed” icon. The “Review Changes” page appears.

Click “Restart Cluster” then click “Restart Now”. When the action completes, click Finish.

9. Confirm the Impala services started.

Browse to the Impala page for your cluster and click on Instances.


You should see that the Impala Catalog Server, Impala Daemons, and Impala StateStore services have all started successfully.

Running Hive and Impala Queries

In this task, you will define the movierating table in Hive and run a simple query against the table. Then you will run the same query in Impala and compare performance.

Note: You already imported the movierating table into HDFS in the Importing Data With Sqoop exercise.

1. Review the movierating table data imported into HDFS during the Sqoop exercise.

On elephant:

$ hdfs dfs -cat movierating/part-m-00000 | head

2. Start the Beeline shell and connect to HiveServer2.

On elephant:

$ beeline -u jdbc:hive2://elephant:10000 -n training

3. Define the movierating table in Hive.

> CREATE EXTERNAL TABLE movierating

> (userid INT, movieid STRING, rating TINYINT)

> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

> LOCATION '/user/training/movierating';

4. Verify that you created the movierating table in the Hive metastore.

On elephant:


> SHOW TABLES;

You should see an entry for the movierating table.

5. Run a simple Hive test query that counts the number of rows in the movierating table.

On elephant:

> SELECT COUNT(*) FROM movierating;

Browse to Cloudera Manager’s YARN Applications page to view the MapReduce job that was run when you executed the Hive query. In the Results tab, make note of the amount of time the query takes to execute when running in Hive.

6. Terminate the Beeline shell.

On elephant:

> !quit

7. Start the Impala shell.

On elephant:

$ impala-shell

8. Connect to the Impala daemon running on horse.

> CONNECT horse;


9. Since you defined a new table after starting the Impala server on horse, you must now refresh that server’s copy of the Hive metadata.

> INVALIDATE METADATA;

10. In Impala, run the same query against the movierating table that you ran in Hive.

> SELECT COUNT(*) FROM movierating;

Compare the amount of time it took to run the query in Impala to the amount of time it took in Hive.

11. Exit the Impala shell.

> quit;

12. Explore Impala queries in Cloudera Manager.

In Cloudera Manager, from the Clusters menu’s Activities section, choose Impala Queries.

Notice that both of the queries you ran from the impala-shell appear in the Results panel.

For the ‘SELECT’ query that you ran, choose “Query Details” from the drop-down menu on the right. Browse through the query details, noting the information about the query that is available to you.



Hands-On Exercise: Using Hue to Control Hadoop User Access

In this exercise, you will configure a Hue environment that provides business analysts with the following capabilities:

• Submitting Pig, Hive, and Impala queries
• Managing definitions in the Hive Metastore
• Browsing the HDFS file system
• Browsing YARN applications

Users will be able to access their environments by using a Web browser, eliminating the need for administrators to install Hadoop client environments on the analysts’ systems.


At the end of this exercise, you should have daemons deployed on your five hosts as follows (new daemons shown in blue):

The Hue server will be deployed on monkey. The HttpFS and Oozie servers on monkey will support several Hue applications.




Adding an HttpFS Role Instance to the hdfs Service

In this task, you will use the Cloudera Manager wizard to add an HttpFS role instance to the hdfs service. The HttpFS role instance will reside on monkey. After adding the role instance, you will run a curl command from the command line to verify that HttpFS works correctly.

1. In Cloudera Manager, navigate to the HDFS Instances page.

2. Click Add Role Instances.

The “Add Role Instances to HDFS” page appears.

3. For HttpFS, specify monkey.

Click Continue.

4. The hdfs Role Instances page reappears. The HttpFS (monkey) role instance now appears in the list of role instances.

Notice that the status for this role instance is Stopped.

A status indicator shows you that the HDFS service has good health.

5. Start the HttpFS role instance.

In the HDFS Instances page, check the box next to HttpFS and from the Actions for Selected menu choose Start.

In the Start dialog window, click Start. When the Command Details window shows that the command completed successfully, click Close.


6. To verify HttpFS operation, run the HttpFS LISTSTATUS operation to examine the content in the /user/training directory in HDFS.

On elephant:

$ ssh training@monkey netstat -tan | grep :14000

$ curl -s "http://monkey:14000/webhdfs/v1/\

user/training?op=LISTSTATUS&user.name=training" \

| python -m json.tool

Note: The HttpFS REST API returns JSON objects. Piping the JSON objects to python -m json.tool makes the objects easier to read in standard output.
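The request in step 6 follows the WebHDFS REST URL layout: host and port, the /webhdfs/v1 prefix, the HDFS path, then the operation and user name as query parameters. A minimal sketch of that layout as a shell helper follows; the function name httpfs_url is hypothetical, and port 14000 is simply the one used in the command above.

```shell
# Hypothetical helper: build an HttpFS (WebHDFS-compatible) URL.
# Arguments: host, absolute HDFS path, operation, user name.
httpfs_url() {
    local host="$1" path="$2" op="$3" user="$4"
    # Port 14000 is the HttpFS port used in this exercise.
    printf 'http://%s:14000/webhdfs/v1%s?op=%s&user.name=%s\n' \
        "$host" "$path" "$op" "$user"
}

# Example (needs the running HttpFS role on monkey, so shown as a comment):
#   curl -s "$(httpfs_url monkey /user/training LISTSTATUS training)" | python -m json.tool
```

Keeping the URL construction in one place makes it easier to try other read-only operations against the same path by changing only the operation argument.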

Adding the Oozie Service

In this task, you will add the Oozie service to your cluster. With Cloudera Manager, Oozie is a prerequisite for adding the Hue service. We won’t use these services in these exercises, but feel free to explore them on your own if you like.

You will configure the Oozie instance to reside on monkey and the Solr instance to run on tiger.

2. Select the Add a Service option for Cluster 1.

3. Select Oozie and click Continue.

4. Select the row containing the hdfs, yarn, and zookeeper services, then click Continue.




• Oozie Server – monkey

Click Continue.

Click Finish.

A status indicator shows you that the oozie service is in good health.

Adding the Hue Service

In this task, you will add a Hue service to your cluster, configuring the Hue instance to run on monkey.

1. From the Cloudera Manager Home page, select the Add a Service option for Cluster 1.

2. Select Hue and click Continue.

3. Select the row containing the hdfs, hive, impala, oozie, yarn, and zookeeper services, then click Continue.




• Hue Server – monkey

Click Continue.

Click Finish.

A status indicator shows you that the hue service is in good health.

8. Submit a Hadoop WordCount job so that there will be a MapReduce job entry that you can browse in Hue after you start the Hue UI.

On elephant:



examples.jar wordcount /tmp/shakespeare.txt test_output

9. Install Spark client configuration on elephant.

Previously you ran the Spark shell on monkey. Here you will add the Spark Gateway role on elephant so that you can run the Spark shell from elephant.

Navigate to the Spark Instances page for your cluster and click Add Role Instances.

Add the Gateway role to elephant.

Click on the “Stale configuration - client configuration redeployment needed” icon and in the “Cluster 1 Stale Configurations” page click “Deploy Client Configuration”.


In the Deploy Client Configuration page click “Deploy Client Configuration”.

In the Progress screen, wait for the commands to complete, then click Finish.

10. Start the Spark Shell so that there will be a Spark job entry in Hue.

On elephant:

$ spark-shell --master yarn-client

Leave the spark-shell open in the terminal for the rest of this exercise.

Exploring the Hue User Interface

In this task, you will log in to the Hue UI as an administrative user and briefly explore the following applications: Hue Home page, Hive UI, Cloudera Impala Query UI, Pig Editor, File Browser, Metastore Manager, Job Browser, Hue Shell, User Admin, and Help.

You will also explore how the Hue UI reports misconfiguration and determine why you cannot use the Job Designer and Oozie Editor/Dashboard applications.

Logging Into Hue

1. Maximize the browser window to give Hue enough space to display as many options as possible on its top menu.

2. View the Hue UI.

Access the “Hue Web UI” from the Cloudera Manager Hue Status page for your cluster or just browse to the URL at http://monkey:8888.

3. Log in to Hue.

Note: As the message box in the browser indicates, because you are the first person to log in to this Hue service, you are actually defining the Hue superuser credentials in this step.


Type in admin as the user, with the password training, then click “Create Account”.

4. The Quick Start Wizard displays.

Click the Home icon.

At the “Did you know?” dialog, click “Got it, prof!”

The “My documents” page appears.

Access Hive Using Hue

1. If you completed the Querying HDFS With Hive and Cloudera Impala exercise, start the Hive Query Editor by selecting Query Editors > Hive.

Enter the following query to verify that Hive is working in Hue:

SHOW TABLES;

Click Execute. The result of the query should be the movierating table.

Enter another query to count the number of records in the movierating table:

SELECT COUNT(*) FROM movierating;

Give it some time to complete. The UI will first show the Log tab contents, then it should eventually show the Results tab. The query should run successfully.


Access Impala Using Hue

1. If you completed the Querying HDFS With Hive and Cloudera Impala exercise, start the Impala Query Editor by selecting Query Editors > Impala.

Enter the following query to verify that Impala is working in Hue:

SHOW TABLES;

Click Execute. The result of the query should be the movierating table.

Enter another query to count the number of records in the movierating table:

SELECT COUNT(*) FROM movierating;

The query should run successfully.

Access the Metastore Using Hue

1. Start the Metastore Manager by selecting Data Browsers > Metastore Tables.

The Metastore Manager appears with an entry for the movierating table.

Select the entry for the movierating table.

The schema for the Hive movierating table, which you created in the Hive exercise, appears in the Metastore Manager.

Notice the list of actions available in the ACTIONS menu.

Access Pig Using Hue

1. Start the Pig UI by selecting Query Editors > Pig.


The Pig Query Editor appears.

You can edit and save Pig scripts using Hue’s Pig Query Editor in your current Hue deployment.

Access HDFS Using Hue

1. Click on the File Browser (document icon) towards the top right of the Hue UI.

This opens the File Browser application.

Browse the HDFS file system. If you wish, execute some hdfs dfs commands from the command line to verify that you obtain the same results from the command line and the Hue File Browser.

Note the Upload menu as well. You could upload files to HDFS through Hue.

There are also Actions available giving the Hue user options to Rename, Move, Download or change permissions on HDFS files (assuming the user has the permissions to do so).

2. In the File Browser, navigate to the /user/admin directory.

On the first Hue login, Hue created a superuser – in this case, the admin user – and an HDFS path for that user – in this case, the /user/admin path.

3. In the File Browser, navigate to the /user/training/test_output directory – the output directory of the WordCount job that you ran before starting the Hue UI.

4. Click the entry for the part-r-00000 file – the output file from the WordCount job.

A read-only editor showing the contents of the part-r-00000 file appears.


Browse YARN Applications Using Hue

1. Select the Job Browser (list icon) option.

An entry for the Hive job that you ran earlier appears in the Hue Job Browser.

Specify training in the Username field.

An entry for the MapReduce wordcount job you ran in the previous step appears with the status “SUCCEEDED.”

Another entry for the Spark shell you should still have running appears with the status “RUNNING.”

Browse the completed “wordcount” job details by clicking on the link in the ID column and then looking through the details in the Attempts, Tasks, Metadata, and Counters tabs.

If you are interested, look in Cloudera Manager’s YARN Applications page for your cluster, locate the entry for the same wordcount job, and follow the “Application Details” link, which takes you to the HistoryServer Web UI. Compare the details you find there with the information available in the Hue Job Browser.

2. Back in the terminal window on elephant, type exit to end the Spark interactive shell.

Browse Users, Documentation, Settings and Logs of Hue


1. Start the User Management Tool by selecting the “admin” menu (cog and wheels icon) and then choosing “Manage Users.”

The User Admin screen appears where you can define Hue users and groups and set permissions.

Notice the automatically created entry for the admin user.

You will create another user and a group in the next task.

2. Click the Documentation (question mark) icon.

Hue UI user documentation appears.

3. Click the About Hue icon (to the left of the Home icon).

The Quick Start Wizard’s Check Configuration tab shows “All OK. Configuration check passed.”

Choose the About Hue top-level menu and click into the Configuration tab to examine Hue’s configuration.

Click the Server Logs tab to examine Hue’s logs.

Setting up the Hue Environment for Business Analysts

Consider a scenario where you have been given a requirement to set up a Hue environment for business analysts. The environment will allow analysts to submit Hive and Impala queries, edit and save Pig queries, browse HDFS, manage table definitions in the Hive Metastore, and browse Hadoop jobs. Analysts who have this environment will not need Hadoop installed on their systems. Instead, they will access all the Hadoop functionality that they need through a Web browser.

You will use the User Admin application to set up the analysts’ Hue environment.

1. Verify that you are still logged in to Hue as the admin user.

2. Activate the Hue User Management tool by selecting admin > Manage Users.

3. Select Groups.


4. Click Add group and name the new group analysts.

Configure the permissions by selecting the ones listed below:

• about.access
• beeswax.access
• filebrowser.access
• hbase.access
• help.access
• impala.access
• jobbrowser.access
• metastore.write
• metastore.access
• pig.access

Click Add group.

5. Select Users.

6. Add a Hue user named fred with the password training.

In the “Step 2: Names and Groups” tab, make fred a member of the analysts group. However, make sure that fred is not a member of the default group.

Click “Add User”.

7. Sign out of the Hue UI (using the arrow icon in the top right of the screen).

8. Log back in to the Hue UI as user fred with password training.

Verify that in the session for fred, only the Hue applications configured for the analysts group appear. For example, the Administration menu does not allow fred to manage users. Fred also has no access to the Workflows, Search, and Security menus that are available to the admin user.



Hands-On Exercise: Configuring HDFS High Availability

In this exercise, you will reconfigure HDFS, eliminating the NameNode as a single point of failure for your Hadoop cluster.

You will start by modifying the hdfs service’s configuration to enable HDFS high availability.

You will then shut down services that you will no longer use in this exercise or other subsequent exercises.

Next, you will enable automatic failover for the NameNode. Automatic failover uses the ZooKeeper service that you added to your cluster in an earlier exercise. ZooKeeper is a prerequisite for HDFS HA automatic failover.

Once you have enabled HDFS high availability with automatic failover, you will intentionally bring one of the servers down as a test. HDFS services should still be available, but they will be served by a NameNode running on a different host.


At the end of this exercise, you should have daemons deployed and running on your five hosts as follows (new daemons shown in blue):




Bringing Down Unneeded Services

Since you will no longer use Hue, Oozie, Impala, or Hive for the remaining exercises, you can stop their services to improve your cluster’s performance.

2. Stop the hue service as follows:

In the row for the hue service, click Actions > Stop.

Click Stop in the confirmation window.

Click Close after messages in the Command Details: Stop page indicate that the hue service has stopped.

The Home page reappears. The status of the hue service should have changed to Stopped.

3. Using steps similar to the steps you followed to stop the hue service, stop the oozie service.

4. Stop the impala service.

5. Stop the hive service.

6. Stop the flume service if it is running.


Verify that the only services that are still up and running on your cluster are the hdfs, yarn, spark, zookeeper, and mgmt services. All of these services should have good health.

Enabling HDFS High Availability

In this task, you will update your Hadoop configuration to use HDFS high availability.

1. Create the directories in which the JournalNodes will store edits.

On elephant:

$ sudo mkdir /dfs/jn

$ sudo chown hdfs:hadoop /dfs/jn

$ ssh training@horse sudo mkdir /dfs/jn

$ ssh training@horse sudo chown hdfs:hadoop /dfs/jn

$ ssh training@tiger sudo mkdir /dfs/jn

$ ssh training@tiger sudo chown hdfs:hadoop /dfs/jn

2. In Cloudera Manager, browse to the HDFS Configuration page for your cluster.

Select the Service-Wide category.

For ZooKeeper Service, click in the Value area and choose ZooKeeper.

Save Changes.

Now select the Instances tab.

The hdfs Role Instances page appears.

Observe that the hdfs service comprises the following role instances:


• Balancer on horse
• DataNodes on elephant, monkey, horse, and tiger
• The active NameNode on elephant
• The SecondaryNameNode on tiger
• An HttpFS server on monkey

The list of role instances will change after you enable high availability.

3. Click “Enable High Availability”.

4. The “Getting Started” page appears.

Change the Nameservice Name to mycluster.

Click Continue.

5. The “Assign Roles” page appears.

Specify the following:

• NameNode Hosts
  • elephant (Current)
  • tiger

• JournalNode Hosts
  • elephant, horse, tiger

Click Continue.


6. The “Review Changes” page appears.

Specify the value /dfs/jn in all three JournalNode Edits Directory fields.

Click Continue.

The “Progress” page appears and displays messages as Cloudera Manager enables HDFS high availability.

Note: Formatting the name directories of the current NameNode will fail. As described in the Cloudera Manager interface, this is expected.


When the process of enabling high availability has finished, click Continue.


An informational message appears informing you of post-setup steps regarding the Hive Metastore. You will not perform the post-setup steps because you will not be using Hive for any remaining exercises.

Click Finish.

7. The hdfs Role Instances page appears.

Observe that the hdfs service now comprises the following role instances:

• Balancer on horse
• DataNodes on elephant, tiger, horse, and monkey
• Failover Controllers on elephant and tiger
• An HttpFS server on monkey
• JournalNodes on elephant, tiger, and horse
• The active NameNode on elephant
• The standby NameNode on tiger
• No SecondaryNameNode

Verifying Automatic NameNode Failover

In this task, you will restart the active NameNode, bringing it down and then up. The standby NameNode will transition to the active state when the active NameNode goes down, and the formerly active NameNode will transition to the standby state.

Then you will restart the new active NameNode in order to restore the original states of the two NameNodes.

1. Navigate to the HDFS service’s Instances tab.

2. In the “Federation and High Availability” section of the page, observe that the state of one of the NameNodes is active and the state of the other NameNode is standby.

3. Scroll down to the “Role Instances” section.

4. In the Role Instances section of the page, select the check box to the left of the entry for the active NameNode.


5. Click Actions for Selected > Restart.

Click Restart to confirm that you want to restart the instance.

6. Wait for the restart operation to complete. When it has successfully completed, click Close.

Verify that the states of the NameNodes have changed: the NameNode that was originally active (elephant) is now the standby, and the NameNode that was originally the standby (tiger) is now active. If the Cloudera Manager UI does not immediately reflect this change, give it a few seconds and it will.


7. Go to Diagnostics > Events and notice the many recent event entries related to the restarting of the NameNode.

8. Back in the HDFS Instances tab, restart the NameNode that is currently the active NameNode.

After the restart has completed, verify that the states of the NameNodes have again changed.



Hands-On Exercise: Using the Fair Scheduler

In this exercise, you will submit some jobs to the cluster and observe the behavior of the Fair Scheduler.

1. Adjust the YARN memory setting and restart the cluster.

In order to demonstrate the Fair Scheduler, you will need to increase the amount of memory that YARN containers are permitted to use on your cluster.

In Cloudera Manager, go to the YARN (MR2 Included) Configuration page.

In the search box, search for yarn.nodemanager.resource.memory-mb

Change the value of this parameter to 3 GB in both of the Role Groups where it appears.


Save the changes.

Navigate to the YARN Status page and click on the “Stale Configuration: Restart Needed” icon.

Follow the steps to restart the cluster. When the restart has completed, click Finish.

2. Analyze the script you will run in this exercise.

To make it easier to start and stop MapReduce jobs during this exercise, a script has been provided. View the script to gain an understanding of what it does.

On elephant:


$ cd ~/training_materials/admin/scripts

$ cat pools.sh

The script will start or stop a job in the pool you specify. It takes two parameters. The first is the pool name and the second is the action to take (start or stop). Each job it runs will be relatively long-running and consist of 10 mappers and 10 reducers.

Note: Remember, we use the terms pool and queue interchangeably.

3. Start three Hadoop jobs, each in a different pool.

On elephant:

$ ./pools.sh pool1 start



It is recommended that you go through this exercise by starting jobs only when prompted in the instructions. However, depending on how quickly you complete the steps, a job may have completed earlier than the instructions anticipated. Therefore, please note that at any time during this exercise you can start additional jobs in any pool using any of the three commands you ran in this step.

4. Verify the jobs started.

In Cloudera Manager, browse to Clusters > YARN Applications.

You should notice that the three jobs have started. If the jobs do not yet display in the page, wait a moment and then refresh the page.

Note: Cloudera Manager does refresh pages automatically; however, in this exercise you may find it useful to refresh the pages manually to more quickly observe the latest status of pools and jobs running in the pools.

Leave this browser tab open.


5. Observe the status of the pools.

Open another browser tab, and in Cloudera Manager, browse to Clusters > Dynamic Resource Pools.

Analyze the details in the “Resource Pools Usage” table.

If pool1, pool2, and pool3 do not yet display, refresh the browser tab.

The table displays the pools in the cluster. The pools you submitted jobs to should have pending containers and allocated containers.

The table also shows the amount of memory and vcores that have been allocated to each pool.

6. Analyze the Per Pool charts.

On the same page in Cloudera Manager, notice the Per Pool Shares charts. There is one chart for Fair Share Memory and another for Fair Share VCores.

If nothing is displaying in these charts yet, wait a moment and they will. Optionally refresh the browser page.

Leave this browser tab open.

7. Start another job in another pool.

In the same shell session on elephant:


8. Back in Cloudera Manager, observe the effect on resource allocation of starting a new job in a new pool.

Occasionally refresh the Dynamic Resource Pools page.

Some pools may be initially over their fair share because the first jobs to run will take all available cluster resources.

However, over time, notice that the jobs running over fair share begin to shed resources, which are reallocated to other pools to approach fair share allocation for all pools.


Tip: Mouse over any one of the “Per Pool..” charts and then click the double-arrow icon to expand the chart size.

9. Conduct further experiments.

Stop the job running in pool1.

$ ./pools.sh pool1 stop

Wait a minute or two, then observe the results in the charts on the Dynamic Resource Pools page.

Start a second job in pool3.


Again observe the results.

10. Configure a Dynamic Resource Pool for pool2.

In the Dynamic Resource Pools page, click into the Configuration tab and click on the Resource Pools tab.

Click on “Add Resource Pool”.

In the General tab, set the Resource Pool Name to pool2. Keep DRF as the scheduling policy.

In the YARN tab, configure the following settings:

• Weight: 2
• Virtual Cores Min: 1
• Virtual Cores Max: 2
• Memory Min: 2400
• Memory Max: 5000

Click OK to save the changes.


11. Observe the effect of the new Dynamic Resource Pool on the resource allocations within the cluster.

In the YARN applications page, check how many jobs are still running and in which pools they are running.

Use the pools.sh script to start or stop jobs so that there is one job running in each of the four pools.

Return to the Dynamic Resource Pools page’s Status tab to observe the effect of the pool settings you defined.

As you continue to observe the Per-Pool Shares chart, you should soon see that pool2 is given a greater share of resources.

12. Clean up.

When you are done observing the behavior of the Fair Scheduler, stop all running jobs either by using the pools.sh script or by killing the applications from the YARN Applications page in Cloudera Manager.
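The effect of the Weight setting from step 10 can be estimated with simple arithmetic, since each active pool's fair share is proportional to its weight. The sketch below assumes four NodeManagers with 3 GB (3072 MB) each, which this exercise does not state explicitly, and it ignores the Min/Max limits and the CPU dimension of DRF. The function name fair_share_mb is hypothetical.

```shell
# Hypothetical helper: weighted share of cluster memory in MB (integer math).
# Arguments: total cluster memory (MB), this pool's weight,
#            sum of the weights of all active pools.
fair_share_mb() {
    echo $(( $1 * $2 / $3 ))
}

# Assumption: 4 NodeManagers x 3072 MB (yarn.nodemanager.resource.memory-mb = 3 GB).
total_mb=$(( 4 * 3072 ))

# Four active pools: pool2 has weight 2, the other three keep the default weight 1.
weights_sum=$(( 1 + 2 + 1 + 1 ))

echo "pool2:  $(fair_share_mb "$total_mb" 2 "$weights_sum") MB"
echo "others: $(fair_share_mb "$total_mb" 1 "$weights_sum") MB each"
```

Under these assumptions pool2 is entitled to roughly 4915 MB and each other pool to roughly 2457 MB, which is the two-to-one ratio you should see emerge in the Per-Pool Shares chart.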



Hands-On Exercise: Breaking The Cluster

In this exercise, you will see what happens during failures of portions of the Hadoop cluster.

1. Verify the existence of a large file in HDFS.

In a previous exercise you placed the weblog files in HDFS. Verify the files are there.

On elephant:

$ hdfs dfs -ls weblog

Only if you do not see the access_log file in HDFS, place it there now.

On elephant:


$ hdfs dfs -mkdir weblog


| hdfs dfs -put - weblog/access_log

2. Locate a block that has been replicated on elephant as follows:

In the NameNode Web UI, navigate to the /user/training/weblog/access_log file. The File Information window appears.


Locate the Availability section in the File Information window for Block 0. You should see three hosts on which Block 0 is available. If one of the replicas is on elephant, note its Block ID. You will need to refer to the Block ID in the next exercise.

If none of Block 0’s replicas are on elephant, view the replication information for other blocks in the file until you locate a block that has been replicated on elephant. Once you have located a block that has been replicated on elephant, note its block ID.

We will revisit this block when the NameNode recognizes that one of the DataNodes is a ‘dead node’ (after 10 minutes).

3. Now, intentionally cause a failure and observe what happens.

Using Cloudera Manager, from the HDFS Instances page, stop the DataNode running on elephant.

4. Visit the NameNode Web UI again and click on ‘Datanodes’. Refresh the browser several times and notice that the ‘Last contact’ value for the elephant DataNode keeps increasing.

5. Run the HDFS file system consistency check to see that the NameNode currently thinks there are no problems.

On elephant:

$ sudo -u hdfs hdfs fsck /

6. Wait for at least ten minutes to pass before starting the next exercise.

(Optional) In any terminal:

$ sleep 600 && echo "10 minutes have passed."
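The ten-minute wait reflects how the NameNode decides that a DataNode is dead. A sketch of the arithmetic follows, assuming the stock HDFS defaults (dfs.namenode.heartbeat.recheck-interval of 300 seconds and dfs.heartbeat.interval of 3 seconds); the values on your cluster may differ, and the function name is hypothetical.

```shell
# Hypothetical helper: seconds without heartbeats before a DataNode is marked dead.
# HDFS computes this as 2 * recheck-interval + 10 * heartbeat-interval.
dead_node_timeout_s() {
    local recheck_s="$1" heartbeat_s="$2"
    echo $(( 2 * recheck_s + 10 * heartbeat_s ))
}

# Assumed defaults: recheck interval 300 s, heartbeat interval 3 s.
dead_node_timeout_s 300 3
```

With those defaults the result is 630 seconds, so the dead-node status may only appear about half a minute after a 600-second sleep finishes.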



Hands-On Exercise: Verifying The Cluster’s Self-Healing Features

In this exercise, you will see what has happened to the data on the dead DataNode.

1. In the NameNode Web UI, click on Datanodes and confirm that you now have one ‘dead node.’

2. View the location of the block from the access_log file you investigated in the previous exercise. Notice that Hadoop has automatically re-replicated the data to another host to retain three-fold replication.

3. In Cloudera Manager, choose Clusters > Reports.

Click on “Custom report” at the bottom of the Disk Usage (HDFS) section.

Build a report with the following two filters (use the plus sign to add the second filter):

• Replication < 4
• Group equal to hadoop

Click “Generate Report”. You should see that all files match the criteria since the replication factor for your cluster is set to three.

Change the Replication filter from < 4 to < 3 and generate the report again.

Now there should be no files that match the criteria since data stored on elephant was replicated over to one of the other three DataNodes.

4. View charts that show when replication occurred.

From the Cloudera Manager HDFS page, choose Charts > Charts Library.


In the search box, type in replication. This will show you the “Pending Replication Blocks” and “Scheduled Replication Blocks” charts.

Note the spike in activity that occurred after the DataNode went down.

5. View the audit and log trails in Cloudera Manager.

In Cloudera Manager, click on Audits.

Note the timestamp for when the HDFS service was stopped.

Choose Diagnostics > Logs, select sources HDFS only, set the Minimum Log Level to “INFO”, and enter the search term “replicate”.

Click Search.

Scroll to the bottom. Notice the log messages related to blocks being replicated.

6. Run the hdfs fsck command again to observe that the file system is still healthy.

On elephant:


7. Run the hdfs dfsadmin -report command to see that one dead DataNode is now reported.

On elephant:


$ sudo -u hdfs hdfs dfsadmin -report

8. Use Cloudera Manager to restart the DataNode on elephant, bringing your cluster back to full strength.

9. Run the hdfs fsck command again to observe the temporary over-replication of blocks.

On elephant:


Note that the over-replication situation will resolve itself (if it has not already) now that the previously unavailable DataNode is once again running.

If the command above did not show any over-replicated blocks, go to Diagnostics > Logs in Cloudera Manager and search the HDFS source for “ExcessRepl”. You should find evidence of the temporary over-replication in the log entries.
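When repeating the fsck check in step 9, it can help to pull just the over-replication count out of the summary. The sketch below assumes the fsck summary contains a line labelled "Over-replicated blocks:" followed by the count, which may vary between Hadoop versions; the function name is hypothetical.

```shell
# Hypothetical helper: print the over-replicated block count from fsck output on stdin.
# Assumes a summary line of the form " Over-replicated blocks:<TAB>N (P %)".
over_replicated_count() {
    awk -F':' '/Over-replicated blocks/ { split($2, f, " "); print f[1] }'
}

# Example (needs a running cluster, so shown as a comment):
#   sudo -u hdfs hdfs fsck / | over_replicated_count
```

Running the pipeline a few times after the DataNode restarts should show the count falling back to zero as the excess replicas are removed.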



Hands-On Exercise: Taking HDFS Snapshots

In this exercise, you will enable HDFS snapshots on a directory and then practice restoring data from a snapshot.

1. Enable snapshots on a directory in HDFS.

In Cloudera Manager, go to the HDFS page for your cluster and click File Browser.

Browse to /user/training, then click Enable Snapshots.

In the “Enable Snapshots” window, keep the Snapshottable Path set to /user/training and click Enable Snapshots.

The command completes. Notice in the message displayed on the “Program:” line that snapshots can also be enabled from the command line using the hdfs dfsadmin tool.

Click Close. Notice that there is now a “Take Snapshot” button.

2. Take a snapshot.

Still in the Cloudera Manager File Browser at /user/training, click “Take Snapshot”. Give it the name snap1 and click OK.

After the snapshot completes, click Close.

The snapshot section should now show your “snap1” listing.

3. Delete data from /user/training then restore data from the snapshot.


Now let’s see what happens if we delete some data.

On elephant:

$ hdfs dfs -rm -r weblog

$ hdfs dfs -ls /user/training

The second command should show that the weblog directory is now gone.

However, your weblog data is still available, which you can see by running the commands here:

$ hdfs dfs -ls /user/training/.snapshot/snap1

$ hdfs dfs -tail .snapshot/snap1/weblog/access_log

Restore a copy of the weblog directory to the original location and then verify it is back in place.

$ hdfs dfs -cp .snapshot/snap1/weblog weblog

$ hdfs dfs -ls /user/training
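The commands in step 3 all rely on the same path convention: data captured by a snapshot is read through a .snapshot directory under the snapshottable directory. A minimal sketch of that convention follows, together with the command-line counterparts of steps 1 and 2 shown as comments because they need a running cluster. The function name snapshot_path is hypothetical.

```shell
# Hypothetical helper: path of an item inside a named snapshot.
# Arguments: snapshottable directory, snapshot name, path relative to that directory.
snapshot_path() {
    printf '%s/.snapshot/%s/%s\n' "$1" "$2" "$3"
}

# Command-line counterparts of steps 1 and 2 (not run here):
#   sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/training
#   hdfs dfs -createSnapshot /user/training snap1
#   hdfs dfs -ls "$(snapshot_path /user/training snap1 weblog)"
```

Snapshot contents are read-only, which is why the restore in step 3 copies the data back out rather than moving it.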



Hands-On Exercise: Configuring Email Alerts

In this exercise, you will configure Cloudera Manager to use an email server to send alerts.

1. Configure Cloudera Manager to send email alerts using the email server on lion.

In Cloudera Manager, choose Clusters > Cloudera Management Service.

Click on Configuration and then choose “Alert Publisher Default Group”.

Confirm the “Alerts: Enable Email Alerts” property is checked.

Configure the following:

• Alerts: Mail Server Username: training
• Alerts: Mail Server Password: training
• Alerts: Mail Message Recipients: training@localhost
• Alerts: Mail Message Format: text

Save the changes.

2. Restart the Cloudera Management Service.

3. Send a test alert from Cloudera Manager.

In Cloudera Manager, go to Administration > Alerts. You should see that the recipient(s) of alerts is now set to training@localhost.

Click on the “Send Test Alert” button at the top of the page.

4. Confirm emails are being received from Cloudera Manager.


The postfix email server is running on lion. Here you use the mail command-line client to access the training user’s inbox.

On lion:

$ mail

The “Test Alert” email should show as unread (U).

At the & prompt, type in the number that appears to the right of the U and hit the <Enter> key so you can read the email.

After you are done reading the email, type q <Enter> to exit the mail client.



Troubleshooting Challenge: Heap O’ Trouble

It’s 8:30 AM and you are enjoying your first cup of coffee. Keri, who is making the transition from writing RDBMS stored procedures to coding Java MapReduce, shows up in your doorway before you’re even halfway through that first cup.

“I just tried to run a MapReduce job and I got an out of memory exception. I heard that there was 32GB on those new machines you bought. But when I run this stupid job, I keep getting out of memory errors. Isn’t 32GB enough memory? If I don’t fix this thing, I’m going to be in a heap of trouble. I told my manager I was 99% complete with my project but now I’m not even sure if I can do what I want to do with Hadoop.”

Put down your coffee and see if you can help Keri get her job running.



Recreating the Problem

1. Confirm files in HDFS.

On elephant:

$ hdfs dfs -ls /tmp/shakespeare.txt

The command above should show that shakespeare.txt is in HDFS.

Only if shakespeare.txt was not found, run these commands to place the file in HDFS.

On elephant:



$ gunzip shakespeare.txt.gz

$ hdfs dfs -put shakespeare.txt /tmp

On elephant:

$ hdfs dfs -ls weblog/access_log

The command above should confirm that the access_log file exists.

Only if access_log was not found, run these commands to place the file in HDFS.

On elephant:


$ hdfs dfs -mkdir weblog

$ gunzip -c access_log.gz \
| hdfs dfs -put - weblog/access_log

2. Run the Heap of Trouble program.

On elephant:

$ cd ~/training_materials/admin/java

$ hadoop jar EvilJobs.jar HeapOfTrouble \
/tmp/shakespeare.txt heapOfTrouble
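
Step 1 follows a check-then-load pattern: look for the file in HDFS and upload it only if it is missing. As a rough sketch, the same pattern can be scripted with `hdfs dfs -test -e`, which exits with status 0 when the path exists. The function name `put_if_missing` is made up for this illustration and is not part of the course materials; the sketch also assumes the local file has already been decompressed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the course materials): upload a local
# file to HDFS only when the destination path does not exist yet.
put_if_missing() {
  local src=$1 dest=$2
  # `hdfs dfs -test -e` exits with status 0 if the path exists.
  if hdfs dfs -test -e "$dest"; then
    echo "already in HDFS: $dest"
  else
    hdfs dfs -put "$src" "$dest"
  fi
}

# Example call, run from the directory that holds the local file:
#   put_if_missing shakespeare.txt /tmp/shakespeare.txt
```

Because the existence test and the upload are separate commands, this is a convenience for a single administrator on a training cluster rather than a safeguard against concurrent writers.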


Attacking the Problem

The primary goal of this and all the other troubleshooting exercises is to start to become more comfortable analyzing problem scenarios by using Hadoop’s log files and Web UIs. Although you might be able to determine the source of the problem and fix it, doing so successfully is not the primary goal here.

Take as many actions as you can think of to troubleshoot this problem. Please write down the actions that you take while performing this challenge so that you can share them with other members of the class when you discuss this exercise later.

Fix the problem if you are able to.

Do not turn to the next page unless you are ready for some hints.


Some Questions to Ask While Troubleshooting a Problem

This list of questions provides some steps that you could follow while troubleshooting a Hadoop problem. Not all of the steps apply to every Hadoop issue, but this list is a good place to start.

• What is there that is different in the environment that was not there before the problem started occurring?
• Is there a pattern to the failure? Is it repeatable?
• If a specific job seems to be the cause of the problem, locate the task logs for the job, including the ApplicationMaster logs, and review them. Does anything stand out?
• Are there any unexpected messages in the NameNode, ResourceManager, and NodeManager logs?
• How is the health of your cluster?
    • Is there adequate disk space?
    • More specifically, does the /var/log directory have adequate disk space?
    • Might this be a swapping issue?
    • Is network utilization extraordinarily high?
    • Is CPU utilization extraordinarily high?
• Can you correlate this event with any of the issues?
• If it seems like a Hadoop MapReduce job is the cause of the problem, is it possible to get the source code for the job?
• Does searching the Web for the error provide any useful hints?
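
Some of the cluster-health questions in the list can be answered from the command line. The following is a minimal sketch, not part of the course materials, that reads `df -P` output and flags a filesystem that is nearly full. The function names and the 90 percent threshold are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the disk-space questions in the list.

# Read `df -P` output on stdin and print the Capacity figure without the
# percent sign. In POSIX format, Capacity is the fifth field of line two.
df_used_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Print OK or WARN for a path. The default limit of 90 percent is an
# arbitrary illustrative value, not a Cloudera recommendation.
check_disk() {
  local path=$1 limit=${2:-90} used
  used=$(df -P "$path" | df_used_pct)
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is ${used}% full"
  else
    echo "OK: $path is ${used}% full"
  fi
}

# Example call:
#   check_disk /var/log
```

For the swapping question, `free -m` and `vmstat` report swap usage on Linux. For job logs, `yarn logs -applicationId` followed by the application ID retrieves the aggregated container logs when log aggregation is enabled.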

Fixing the Problem

If you have time and are able to, fix the problem so that Keri can run her job.

Post-Exercise Discussion

After some time has passed, your instructor will ask you to stop troubleshooting and will lead the class in a discussion of troubleshooting techniques.



Appendix A: Setting up VMware Fusion on a Mac for the Cloud Training Environment

When performing the Hands-On Exercises for this course, you use a small CentOS virtual machine called Get2EC2. This VM is configured to use NAT networking. You connect to Amazon EC2 instances from the guest OS by starting SSH sessions. The Get2EC2 VM is supported for VMware or VirtualBox.

VMware Fusion, like other hypervisors, runs an internal DHCP server for NAT-ted guests that assigns IP addresses to the guests. From time to time, the internal DHCP server releases and renews the guests’ leases. Unfortunately, the internal DHCP server in VMware Fusion does not always assign the same IP address to a guest that it had prior to the release and renew, and the Get2EC2 VM’s IP address changes.

Changing the IP address results in problems for active SSH sessions. Sometimes the terminal window in which the client is running will freeze up, becoming unresponsive to mouse and keyboard input, and no longer displaying standard output. At other times, sessions will be shut down with a Broken Pipe error. If this happens, you will have to re-open any failed sessions.

If you are using VMware Fusion on a Mac to perform the Hands-On Exercises for this course, you need to decide whether you would prefer to take action or do nothing:

• If you have administrator privileges on your Mac, you can configure VMware Fusion to use a fixed IP address. The instructions for configuring VMware Fusion to use a fixed IP address appear below.

• If you have VirtualBox installed, you can use the VirtualBox Get2EC2 VM instead of VMware Fusion.

• You can do nothing, in which case you might encounter terminal freezes as described above.

To configure VMware Fusion to use a fixed IP address, perform the following steps:


1. Start VMware Fusion.

2. Create an entry in the Virtual Machines list for the Get2EC2 VM. To create the entry, drag the Cloudera-Training-Get2EC2-VM-1.0.vmx file to the Virtual Machines list.

You should see the Cloudera-Training-Get2EC2-VM-1.0 entry in the Virtual Machines list. We will refer to the Cloudera-Training-Get2EC2-VM-1.0 VM as the Get2EC2 VM.

3. Make sure the Get2EC2 VM is powered down.

4. Click once on the Get2EC2 VM entry in the Virtual Machines list to select the VM.

Note: If you accidentally double-click the entry, you start the VM. Before you proceed to the next step, power down the VM.

5. Click the Settings icon in the VMware Fusion Toolbar (or select Virtual Machines > Settings).

6. Click Network Adapter.

7. Click Advanced Options.

The MAC Address field appears.

8. If the MAC Address field is empty, click Generate to generate a MAC address for the Get2EC2 VM.

9. Copy the MAC address and paste it into a file where you can access it later. You will need to use the MAC address in a subsequent step.


10. Open the following file on your Mac using superuser (sudo) privileges:

• VMware Fusion 4 and higher: /Library/Preferences/VMware Fusion/vmnet8/dhcpd.conf

• VMware Fusion 3: /Library/Application Support/VMware Fusion/vmnet8/dhcpd.conf

Look for the range statement. It should have a range of IP addresses. For example:

range 172.16.73.128 172.16.73.254;

11. Choose an IP address for the Get2EC2 VM. The IP address should have the same first three octets as the IP addresses in the range statement, but the fourth octet should be outside of the addresses in the range statement. Given the example of the range statement in the previous step, you would choose an IP address that starts with 172.16.73 and ends with a number lower than 128 (but not 0, 1, or 2; those numbers are reserved for other purposes).

For example, 172.16.73.10.

12. Add four lines to the bottom of the dhcpd.conf file as follows:

host Get2EC2 {
    hardware ethernet <MAC_Address>;
    fixed-address <IP_Address>;
}

Replace <MAC_Address> with the MAC address you generated in an earlier step.

Replace <IP_Address> with the IP address you chose in the previous step.

Be sure to include the semicolons after the MAC and IP addresses as shown in the example.

13. Save and close the dhcpd.conf file.


14. Run the following commands from a terminal window on your Mac:

For VMware Fusion 4 or higher:

$ sudo /Applications/VMware\ Fusion.app/Contents/\
Library/vmnet-cli --stop

$ sudo /Applications/VMware\ Fusion.app/Contents/\
Library/vmnet-cli --start

For VMware Fusion 3:

$ sudo /Library/Application\ Support/VMware\ Fusion/\
boot.sh --restart

15. Start the Get2EC2 VM.

16. After the VM has come up, run the following command in a Linux terminal window:

$ ip addr

Verify that the IP address that appears is the IP address that you specified in the dhcpd.conf file.
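
Parts of steps 10 through 16 can be checked with a script. The sketch below is an illustration only, with made-up function names: it tests whether a candidate address follows the rule in step 11, prints the host block from step 12, and scans `ip addr` output for the expected address. Excluding 255 as the broadcast address is an added assumption that the steps do not state, and the range is assumed to stay within one 256-address block, as in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of
# VMware Fusion or the course materials.

# Succeed if the candidate shares the first three octets with the DHCP
# range and its last octet is outside the range and not 0, 1, 2, or 255.
ip_is_safe() {
  local range_line=$1 candidate=$2 start end
  # Drop the semicolon, then split the line: range <start> <end>
  read -r _ start end <<<"${range_line//;/}"
  local low=${start##*.} high=${end##*.} last=${candidate##*.}
  [ "${candidate%.*}" = "${start%.*}" ] || return 1
  [ "$last" -ge 3 ] && [ "$last" -le 254 ] || return 1
  [ "$last" -lt "$low" ] || [ "$last" -gt "$high" ]
}

# Print the four-line host block for dhcpd.conf.
host_block() {
  local mac=$1 ip=$2
  printf 'host Get2EC2 {\n    hardware ethernet %s;\n    fixed-address %s;\n}\n' "$mac" "$ip"
}

# Read `ip addr` output on stdin; succeed if an inet line carries the address.
has_ip() {
  grep -Fq "inet $1/"
}
```

With the example range from step 10, `ip_is_safe 'range 172.16.73.128 172.16.73.254;' 172.16.73.10` succeeds, while 172.16.73.200 (inside the range) and 172.16.73.2 (reserved) are rejected. Inside the VM, `ip addr | has_ip 172.16.73.10` performs the check from step 16.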

This is the end of this Appendix.


Appendix B: Setting up VirtualBox for the Cloud Training Environment

Follow these steps to set up VirtualBox for the cloud training environment if you do not want to install VMware Fusion on your Mac.

VMware Fusion is our preferred hypervisor for students running this course on Mac OS. Please use VMware Fusion if possible. Use VirtualBox for this course only if it is your preferred virtualization environment and if you are knowledgeable enough to be self-sufficient to troubleshoot problems you might run into.

This setup activity comprises:

• Creating the Get2EC2 VM

• Powering up the VM

• Installing VirtualBox Guest Additions on the VM

This setup requires VirtualBox version 4 or higher.

1. Get the .vmdk file for the class from your instructor and copy it onto the system on which you will be doing the Hands-On Exercises.

2. Start VirtualBox.

3. Select Machine > New.

The Name and Operating System dialog box appears.

4. In the Name and Operating System dialog box, specify Get2EC2 as the Name, Linux as the Type, and Red Hat (not Red Hat 64-bit) as the Version. Click Continue.

The Memory Size dialog box appears.

5. In the Memory Size dialog box, accept the default memory size of 512 MB and click Continue.


The Hard Drive dialog box appears.

6. In the Hard Drive dialog box, select “Use an existing virtual hard drive file.”

7. In the field under the “Use an existing virtual hard drive file” selection, navigate to the .vmdk file for the class and click Open.

8. Click Create.

The Oracle VM VirtualBox Manager dialog box appears. The Get2EC2 VM appears on the left side of the dialog box, with the status Powered Off.

9. Click Start to start the Get2EC2 VM.

The VM starts up. After startup is complete, the GNOME interface appears. You are automatically logged in as the training user.

Now you are ready to install VirtualBox Guest Additions on the VM.

Note: The version of VirtualBox and Guest Additions must be the same. You must install Guest Additions now to guarantee compatibility between your version of VirtualBox and Guest Additions.

10. Select Devices > Install Guest Additions.

After several seconds, a dialog box appears prompting you to select how you want to install the version of VBOXADDITIONS on your system. Verify that Open Autorun Prompt is selected as the Action, then click OK.

11. Another dialog box appears prompting you to confirm you want to run the Guest Additions installer. Click Run.

12. The Authenticate dialog box appears, prompting you to enter the root user’s password. Specify training and click Authenticate.

13. Messages appear in the terminal window while VirtualBox is building and installing the Guest Additions.

When installation is complete, the message “Press Return to close this window” appears in the terminal window.


14. Press Return.

15. Select System > “Log Out training” to log out of your GNOME session.

After you have logged out, you are automatically logged back in as the training user.

You have completed the VirtualBox setup. Please return to the next step in “Configuring Networking on Your Cluster: Cloud Training Environment” and continue the setup activity for the cloud training environment.

This is the end of this Appendix.
