INTELLIGENT SYSTEMS ENGINEERING
Geoffrey C. Fox, Gregor von Laszewski
(c) Gregor von Laszewski, 2018, 2019
1 PREFACE 1.1 Disclaimer ☁ 1.1.1 Acknowledgment 1.1.2 Extensions
1.2 Contributors ☁ 2 SYLLABUS 2.1 E222: Intelligent Systems Engineering II ☁ 2.1.1 Teaching and learning methods 2.1.2 Representative bibliography 2.1.3 Grading 2.1.4 Incomplete 2.1.5 Other classes I423, I523, I524, B649, E516, E616 2.1.6 Communication 2.1.6.1 How to take this class
2.1.7 Covered Topics 2.1.7.1 Week 1. Overview of this Class 2.1.7.2 Week 1 and 2. Review of Python for Intelligent Systems Engineering 2.1.7.3 Week 2. Review of Linux shell for OSX, Linux, and Windows 2.1.7.4 Week 3. Introduction to REST 2.1.7.5 Week 4. Introduction to Scientific Writing 2.1.7.6 Week 5 to 9. Introduction to Cloud Computing 2.1.7.7 Week 10: Lecture Free Time 2.1.7.8 Week 11. Introduction to Cloud Platforms 2.1.7.9 Week 12 to 16. Review of AI for AI-Cloud Computing Integration 2.1.7.10 Cloud Edge Computing 2.1.7.11 Alternative Projects
2.2 Assignments ☁ 2.2.1 Account Creation 2.2.2 Sections, Chapters, Examples 2.2.3 Project 2.2.3.1 Project Deliverables 2.2.3.2 Project Topic
2.2.4 Alternate Project: Virtual Cluster 2.2.5 Alternative Project: 100 node Raspberry Pi cluster 2.2.6 Submission of sections and chapters and projects
3 PYTHON 3.1 Introduction to Python ☁ 3.1.1 References
3.2 Python 3.7.4 Installation ☁ 3.2.1 Hardware 3.2.2 Prerequisites Ubuntu 19.04 3.2.3 Prerequisites macOS 3.2.3.1 Installation from Apple App Store 3.2.3.2 Installation from python.org 3.2.3.3 Installation from Homebrew
3.2.4 Prerequisites Ubuntu 18.04 3.2.5 Prerequisite Windows 10 3.2.5.1 Linux Subsystem Install
3.2.6 Prerequisite venv 3.2.7 Install Python 3.7 via Anaconda 3.2.7.1 Download conda installer 3.2.7.2 Install conda 3.2.7.3 Install Python 3.7.4 via conda
3.3 Interactive Python ☁ 3.3.1 REPL (Read Eval Print Loop) 3.3.2 Interpreter 3.3.3 Python 3 Features in Python 2
3.4 Editors ☁ 3.4.1 PyCharm 3.4.2 Python in 45 minutes
3.5 Language ☁ 3.5.1 Statements and Strings 3.5.2 Comments 3.5.3 Variables 3.5.4 Data Types 3.5.4.1 Booleans 3.5.4.2 Numbers
3.5.5 Module Management 3.5.5.1 Import Statement
3.5.5.2 The from … import Statement 3.5.6 Date Time in Python 3.5.7 Control Statements 3.5.7.1 Comparison 3.5.7.2 Iteration
3.5.8 Datatypes 3.5.8.1 Lists 3.5.8.2 Sets 3.5.8.3 Removal and Testing for Membership in Sets 3.5.8.4 Dictionaries 3.5.8.5 Dictionary Keys and Values 3.5.8.6 Counting with Dictionaries
3.5.9 Functions 3.5.10 Classes 3.5.11 Modules 3.5.12 Lambda Expressions 3.5.12.1 map 3.5.12.2 dictionary
3.5.13 Iterators 3.5.14 Generators 3.5.14.1 Generators with function 3.5.14.2 Generators using for loop 3.5.14.3 Generators with List Comprehension 3.5.14.4 Why to use Generators?
3.6 LIBRARIES 3.6.1 Python Modules ☁ 3.6.1.1 Updating Pip 3.6.1.2 Using pip to Install Packages 3.6.1.3 GUI 3.6.1.3.1 GUIZero 3.6.1.3.2 Kivy
3.6.1.4 Formatting and Checking Python Code 3.6.1.5 Using autopep8 3.6.1.6 Writing Python 3 Compatible Code 3.6.1.7 Using Python on FutureSystems 3.6.1.8 Ecosystem 3.6.1.8.1 pypi
3.6.1.8.2 Alternative Installations 3.6.1.9 Resources 3.6.1.9.1 Jupyter Notebook Tutorials
3.6.1.10 Exercises 3.6.2 Data Management ☁ 3.6.2.1 Formats 3.6.2.1.1 Pickle 3.6.2.1.2 Text Files 3.6.2.1.3 CSV Files 3.6.2.1.4 Excel spreadsheets 3.6.2.1.5 YAML 3.6.2.1.6 JSON 3.6.2.1.7 XML 3.6.2.1.8 RDF 3.6.2.1.9 PDF 3.6.2.1.10 HTML 3.6.2.1.11 ConfigParser 3.6.2.1.12 ConfigDict
3.6.2.2 Encryption 3.6.2.3 Database Access 3.6.2.4 SQLite 3.6.2.4.1 Exercises
3.6.3 Plotting with matplotlib ☁ 3.6.4 DocOpts ☁ 3.6.5 Cloudmesh Command Shell ☁ 3.6.5.1 CMD5 3.6.5.1.1 Resources 3.6.5.1.2 Installation from source 3.6.5.1.3 Execution 3.6.5.1.4 Create your own Extension 3.6.5.1.5 Bug: Quotes
3.6.6 cmd Module ☁ 3.6.6.1 Hello, World with cmd 3.6.6.2 A More Involved Example 3.6.6.3 Help Messages 3.6.6.4 Useful Links
3.6.7 OpenCV ☁
3.6.7.1 Overview 3.6.7.2 Installation 3.6.7.3 A Simple Example 3.6.7.3.1 Loading an image 3.6.7.3.2 Displaying the image 3.6.7.3.3 Scaling and Rotation 3.6.7.3.4 Gray-scaling 3.6.7.3.5 Image Thresholding 3.6.7.3.6 Edge Detection
3.6.7.4 Additional Features 3.6.8 Secchi Disk ☁ 3.6.8.1 Setup for OSX 3.6.8.2 Step 1: Record the video 3.6.8.3 Step 2: Analyse the images from the Video 3.6.8.3.1 Image Thresholding 3.6.8.3.2 Edge Detection 3.6.8.3.3 Black and white
3.7 DATA 3.7.1 Data Formats ☁ 3.7.1.1 YAML 3.7.1.2 JSON 3.7.1.3 XML
3.7.2 MongoDB in Python ☁ 3.7.2.1 Cloudmesh MongoDB Usage Quickstart 3.7.2.2 MongoDB 3.7.2.2.1 Installation 3.7.2.2.1.1 Installation procedure
3.7.2.2.2 Collections and Documents 3.7.2.2.2.1 Collection example 3.7.2.2.2.2 Document structure 3.7.2.2.2.3 Collection Operations
3.7.2.2.3 MongoDB Querying 3.7.2.2.3.1 Mongo Queries examples
3.7.2.2.4 MongoDB Basic Functions 3.7.2.2.4.1 Import/Export functions examples
3.7.2.2.5 Security Features 3.7.2.2.5.1 Collection based access control example
3.7.2.2.6 MongoDB Cloud Service 3.7.2.3 PyMongo 3.7.2.3.1 Installation 3.7.2.3.2 Dependencies 3.7.2.3.3 Running PyMongo with Mongo Daemon 3.7.2.3.4 Connecting to a database using MongoClient 3.7.2.3.5 Accessing Databases 3.7.2.3.6 Creating a Database 3.7.2.3.7 Inserting and Retrieving Documents (Querying) 3.7.2.3.8 Limiting Results 3.7.2.3.9 Updating Collection 3.7.2.3.10 Counting Documents 3.7.2.3.11 Indexing 3.7.2.3.12 Sorting 3.7.2.3.13 Aggregation 3.7.2.3.14 Deleting Documents from a Collection 3.7.2.3.15 Copying a Database 3.7.2.3.16 PyMongo Strengths
3.7.2.4 MongoEngine 3.7.2.4.1 Installation 3.7.2.4.2 Connecting to a database using MongoEngine 3.7.2.4.3 Querying using MongoEngine
3.7.2.5 Flask-PyMongo 3.7.2.5.1 Installation 3.7.2.5.2 Configuration 3.7.2.5.3 Connection to multiple databases/servers 3.7.2.5.4 Flask-PyMongo Methods 3.7.2.5.5 Additional Libraries 3.7.2.5.6 Classes and Wrappers
3.7.3 Mongoengine ☁ 3.7.3.1 Introduction 3.7.3.2 Install and connect 3.7.3.3 Basics
3.8 CALCULATION 3.8.1 Word Count with Parallel Python ☁ 3.8.1.1 Generating a Document Collection 3.8.1.2 Serial Implementation
3.8.1.3 Serial Implementation Using map and reduce 3.8.1.4 Parallel Implementation 3.8.1.5 Benchmarking 3.8.1.6 Exercises 3.8.1.7 References
3.8.2 NumPy ☁ 3.8.2.1 Installing NumPy 3.8.2.2 NumPy Basics 3.8.2.3 Data Types: The Basic Building Blocks 3.8.2.4 Arrays: Stringing Things Together 3.8.2.5 Matrices: An Array of Arrays 3.8.2.6 Slicing Arrays and Matrices 3.8.2.7 Useful Functions 3.8.2.8 Linear Algebra 3.8.2.9 NumPy Resources
3.8.3 Scipy ☁ 3.8.3.1 Introduction 3.8.3.2 References
3.8.4 Scikit-learn ☁ 3.8.4.1 Introduction to Scikit-learn 3.8.4.2 Installation 3.8.4.3 Supervised Learning 3.8.4.4 Unsupervised Learning 3.8.4.5 Building an end to end pipeline for Supervised machine learning using Scikit-learn 3.8.4.6 Steps for developing a machine learning model 3.8.4.7 Exploratory Data Analysis 3.8.4.7.1 Bar plot 3.8.4.7.2 Correlation between attributes 3.8.4.7.3 Histogram Analysis of dataset attributes 3.8.4.7.4 Box plot Analysis 3.8.4.7.5 Scatter plot Analysis
3.8.4.8 Data Cleansing - Removing Outliers 3.8.4.9 Pipeline Creation 3.8.4.9.1 Defining DataFrameSelector to separate Numerical and Categorical attributes 3.8.4.9.2 Feature Creation / Additional Feature Engineering
3.8.4.10 Creating Training and Testing datasets 3.8.4.11 Creating pipeline for numerical and categorical attributes 3.8.4.12 Selecting the algorithm to be applied 3.8.4.12.1 Linear Regression 3.8.4.12.2 Logistic Regression 3.8.4.12.3 Decision trees 3.8.4.12.4 KMeans 3.8.4.12.5 Support Vector Machines 3.8.4.12.6 Naive Bayes 3.8.4.12.7 Random Forest 3.8.4.12.8 Neural networks 3.8.4.12.9 Deep Learning using Keras 3.8.4.12.10 XGBoost
3.8.4.13 Scikit Cheat Sheet 3.8.4.14 Parameter Optimization 3.8.4.14.1 Hyperparameter optimization/tuning algorithms
3.8.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline) 3.8.4.15.1 Creating a parameter grid 3.8.4.15.2 Implementing Grid search with models and also creating metrics from each of the models 3.8.4.15.3 Results table from the Model evaluation with metrics 3.8.4.15.4 ROC AUC Score
3.8.4.16 K-means in scikit learn 3.8.4.16.1 Import
3.8.4.17 K-means Algorithm 3.8.4.17.1 Import 3.8.4.17.2 Create samples 3.8.4.17.3 Create samples 3.8.4.17.4 Visualize 3.8.4.17.5 Visualize
3.8.5 Parallel Computing in Python ☁ 3.8.5.1 Multi-threading in Python 3.8.5.1.1 Thread vs Threading 3.8.5.1.2 Locks
3.8.5.2 Multi-processing in Python 3.8.5.2.1 Process
3.8.5.2.2 Pool 3.8.5.2.2.1 Synchronous Pool.map() 3.8.5.2.2.2 Asynchronous Pool.map_async()
3.8.5.2.3 Locks 3.8.5.2.4 Process Communication 3.8.5.2.4.1 Value
3.8.6 Dask - Random Forest Feature Detection ☁ 3.8.6.1 Setup 3.8.6.2 Dataset 3.8.6.3 Detecting Features 3.8.6.3.1 Data Preparation
3.8.6.4 Random Forest 3.8.6.5 Acknowledgement
4 DEVOPS TOOLS 4.1 Refcards ☁ 4.2 VirtualBox ☁ 4.2.1 Installation 4.2.2 Guest additions 4.2.3 Exercises
4.3 Vagrant ☁ 4.3.1 Installation 4.3.1.1 macOS 4.3.1.2 Windows 4.3.1.3 Linux
4.3.2 Usage 4.4 Linux Shell ☁ 4.4.1 History 4.4.2 Shell 4.4.3 The command man 4.4.4 Multi-command execution 4.4.5 Keyboard Shortcuts 4.4.6 bashrc, bash_profile or zprofile 4.4.7 Makefile 4.4.8 chmod 4.4.9 Exercises
4.5 Secure Shell ☁ 4.5.1 ssh-keygen
4.5.2 ssh-add 4.5.3 SSH Add and Agent 4.5.3.1 Using SSH on Mac OS X 4.5.3.2 Using SSH on Linux 4.5.3.3 Using SSH on Raspberry Pi 3/4 4.5.3.4 Accessing a Remote Machine
4.5.4 SSH Port Forwarding 4.5.4.1 Prerequisites 4.5.4.2 How to Restart the Server 4.5.4.3 Types of Port Forwarding 4.5.4.4 Local Port Forwarding 4.5.4.5 Remote Port Forwarding 4.5.4.6 Dynamic Port Forwarding 4.5.4.7 ssh config 4.5.4.8 Tips 4.5.4.9 References
4.5.5 SSH to FutureSystems Resources ☁ 4.5.5.1 Testing your FutureSystems ssh key
4.5.6 Exercises ☁ 4.6 Github ☁ 4.6.1 Overview 4.6.2 Upload Key 4.6.3 Fork 4.6.4 Rebase 4.6.5 Remote 4.6.6 Pull Request 4.6.7 Branch 4.6.8 Checkout 4.6.9 Merge 4.6.10 GUI 4.6.11 Windows 4.6.12 Git from the Command line 4.6.13 Configuration 4.6.14 Upload your public key 4.6.15 Working with a directory that will be provided for you 4.6.16 README.yml and notebook.md 4.6.17 Contributing to the Document
4.6.17.1 Stay up to date with the original repo 4.6.17.2 Resources
4.6.18 Exercises 4.6.19 Github Issues 4.6.19.1 Git Issue Features 4.6.19.2 Github Markdown 4.6.19.2.1 Task lists 4.6.19.2.2 Team integration 4.6.19.2.3 Referencing Issues and Pull requests 4.6.19.2.4 Emojis
4.6.19.3 Notifications 4.6.19.4 cc 4.6.19.5 Interacting with issues
4.6.20 Glossary 4.6.21 Example commands 4.6.21.1 Local commands to version control your files 4.6.21.2 Interacting with the remote
4.7 Git Pull Request ☁ 4.7.1 Introduction 4.7.2 How to create a pull request 4.7.3 Fork the original repository 4.7.4 Clone your copy 4.7.5 Adding an upstream 4.7.6 Making changes 4.7.7 Creating a pull request
4.8 Tig ☁ 5 Introduction to Cloud Computing and Data Engineering for Cloud Computing and Machine Learning ☁ 5.1 A. Summary of Introduction to Cloud Computing & Data Engineering 5.2 B. Defining Clouds I 5.3 C. Defining Clouds II 5.4 D. Defining Clouds III 5.5 E. Virtualization 5.6 F. Technology Hypecycle I 5.7 G. Technology Hypecycle II 5.8 H. Cloud Infrastructure I 5.9 I. Cloud Infrastructure II
5.10 J. Cloud Software 5.11 K. Cloud Applications I 5.12 L. Cloud Applications II 5.13 M. Cloud Applications III 5.14 N. Clouds and Parallel Computing 5.15 O. Storage 5.16 P. HPC and Clouds 5.17 Q. Comparison of Data Analytics with Simulation 5.18 R. Jobs 5.19 S. The Future I 5.20 T. The Future and other Issues II 5.21 U. The Future and other Issues III
6 REST 6.1 Introduction to REST ☁ 6.1.0.1 Collection of Resources 6.1.0.2 Single Resource 6.1.0.3 REST Tool Classification
6.2 OpenAPI REST Services with Swagger ☁ 6.2.1 Swagger Tools 6.2.2 Swagger Community Tools 6.2.2.1 Converting Json Examples to OpenAPI YAML Models
6.3 OpenAPI 2.0 Specification ☁ 6.3.1 The Virtual Cluster example API Definition 6.3.1.1 Terminology 6.3.1.2 Specification
6.3.2 References 6.4 OpenAPI 3.0 REST Service via Introspection ☁ 6.4.1 Verification 6.4.2 Swagger-UI 6.4.3 Mock service 6.4.4 Exercise
6.5 OpenAPI REST Service via Codegen ☁ 6.5.1 Step 1: Define Your REST Service 6.5.2 Step 2: Server Side Stub Code Generation and Implementation 6.5.2.1 Setup the Codegen Environment 6.5.2.2 Generate Server Stub Code 6.5.2.3 Fill in the actual implementation
6.5.3 Step 3: Install and Run the REST Service: 6.5.3.1 Start a virtualenv: 6.5.3.2 Make sure you have the latest pip: 6.5.3.3 Install the requirements of the server side code: 6.5.3.4 Install the server side code package: 6.5.3.5 Run the service 6.5.3.6 Verify the service using a web browser:
6.5.4 Step 4: Generate Client Side Code and Verify 6.5.4.1 Client side code generation: 6.5.4.2 Install the client side code package: 6.5.4.3 Using the client API to interact with the REST service
6.5.5 Towards a Distributed Client Server 6.6 Flask RESTful Services ☁ 6.7 Rest Services with Eve ☁ 6.7.1 Ubuntu install of MongoDB 6.7.2 macOS install of MongoDB 6.7.3 Windows 10 Installation of MongoDB 6.7.4 Database Location 6.7.5 Verification 6.7.6 Building a simple REST Service 6.7.7 Interacting with the REST service 6.7.8 Creating REST API Endpoints 6.7.9 REST API Output Formats and Request Processing 6.7.10 REST API Using a Client Application 6.7.11 Towards cmd5 extensions to manage eve and mongo
6.8 HATEOAS ☁ 6.8.1 Filtering 6.8.2 Pretty Printing 6.8.3 XML
6.9 Extensions to Eve ☁ 6.9.1 Object Management with Eve and Evegenie 6.9.1.1 Installation 6.9.1.2 Starting the service 6.9.1.3 Creating your own objects
6.10 Django REST Framework ☁ 6.11 Github REST Services ☁ 6.11.1 Issues
6.11.2 Exercise 7 MAPREDUCE 7.1 Introduction to Mapreduce ☁ 7.1.1 MapReduce Algorithm 7.1.1.1 MapReduce Example: Word Count
7.1.2 Hadoop MapReduce and Hadoop Spark 7.1.2.1 Apache Spark 7.1.2.2 Hadoop MapReduce 7.1.2.3 Key Differences
7.1.3 References 7.2 Hadoop ☁ 7.2.1 Hadoop and MapReduce 7.2.2 Hadoop EcoSystem 7.2.3 Hadoop Components 7.2.4 Hadoop and the Yarn Resource Manager 7.2.5 PageRank
7.3 Installation of Hadoop ☁ 7.3.1 Releases 7.3.2 Prerequisites 7.3.3 User and User Group Creation 7.3.4 Configuring SSH 7.3.5 Installation of Java 7.3.6 Installation of Hadoop 7.3.7 Hadoop Environment Variables
7.4 Hadoop Virtual Cluster Installation Using Cloudmesh ☁ 7.4.1 Cloudmesh Cluster Installation 7.4.1.1 Create Cluster 7.4.1.2 Check Created Cluster 7.4.1.3 Delete Cluster
7.4.2 Hadoop Cluster Installation 7.4.2.1 Create Hadoop Cluster 7.4.2.2 Delete Hadoop Cluster
7.4.3 Advanced Topics with Hadoop 7.4.3.1 Hadoop Virtual Cluster with Spark and/or Pig 7.4.3.2 Word Count Example on Spark
7.5 SPARK 7.5.1 Spark Lectures ☁
7.5.1.1 Motivation for Spark 7.5.1.2 Spark RDD Operations 7.5.1.3 Spark DAG 7.5.1.4 Spark vs. other Frameworks
7.5.2 Installation of Spark ☁ 7.5.2.1 Prerequisites 7.5.2.2 Installation of Java 7.5.2.3 Install Spark with Hadoop 7.5.2.4 Spark Environment Variables 7.5.2.5 Test Spark Installation 7.5.2.6 Install Spark With Custom Hadoop 7.5.2.7 Configuring Hadoop 7.5.2.8 Test Spark Installation
7.5.3 Spark Streaming ☁ 7.5.3.1 Streaming Concepts 7.5.3.2 Simple Streaming Example 7.5.3.3 Spark Streaming For Twitter Data 7.5.3.3.1 Step 1 7.5.3.3.2 Step 2 7.5.3.3.3 Step 3 7.5.3.3.4 Step 4 7.5.3.3.5 Step 5 7.5.3.3.6 Step 6
7.5.4 User Defined Functions in Spark ☁ 7.5.4.1 Resources 7.5.4.2 Instructions for Spark installation 7.5.4.2.1 Linux
7.5.4.3 Windows 7.5.4.4 MacOS 7.5.4.5 Instructions for creating Spark User Defined Functions 7.5.4.5.1 Example: Temperature conversion 7.5.4.5.1.1 Description about dataset 7.5.4.5.1.2 How to write a python program with UDF 7.5.4.5.1.3 How to execute a python spark script 7.5.4.5.1.4 Filtering and sorting
7.5.4.6 Instructions to install and run the example using docker 7.6 ADVANCED HADOOP
7.6.1 Amazon EMR (Elastic MapReduce) ☁ 7.6.1.1 Why EMR? 7.6.1.2 Understanding Clusters and Nodes 7.6.1.2.1 Submit Work to a Cluster 7.6.1.2.2 Processing Data
7.6.1.3 AWS Storage 7.6.1.4 Create EMR in AWS 7.6.1.4.1 Create the buckets 7.6.1.4.2 Create Key Pairs 7.6.1.4.2.1 Create Key Value Pair Screenshots
7.6.1.5 Create Step Execution – Hadoop Job 7.6.1.5.0.1 Screenshots
7.6.1.6 Create a Hive Cluster 7.6.1.6.1 Create a Hive Cluster - Screenshots
7.6.1.7 Create a Spark Cluster 7.6.1.7.1 Create a Spark Cluster - Screenshots
7.6.2 Twister2 ☁ 7.6.2.1 Introduction 7.6.2.2 Twister2 API’s 7.6.2.2.1 TSet API 7.6.2.2.2 Task API
7.6.2.3 Operator API 7.6.2.3.1 Resources
7.6.3 Twister2 Installation ☁ 7.6.3.1 Prerequisites 7.6.3.1.1 Maven Installation 7.6.3.1.2 OpenMPI Installation 7.6.3.1.3 Install Extras 7.6.3.1.4 Compiling Twister2 7.6.3.1.5 Twister2 Distribution
7.6.4 Twister2 Examples ☁ 7.6.4.1 Submitting a Job 7.6.4.2 Batch WordCount Example
7.6.5 HADOOP RDMA ☁ 7.6.5.1 Launching a Virtual Hadoop Cluster on Bare-metal InfiniBand Nodes with SR-IOV on Chameleon 7.6.5.2 Launching Virtual Machines Manually
7.6.5.3 Extra Initialization when Launching Virtual Machines 7.6.5.4 Important Note for Tearing Down Virtual Machines and Deleting Network Ports
8 CONTAINER 8.1 Introduction to Containers ☁ 8.1.1 Motivation - Microservices 8.1.2 Motivation - Serverless Computing 8.1.3 Docker 8.1.4 Docker and Kubernetes
8.2 DOCKER 8.2.1 Introduction to Docker ☁ 8.2.1.1 Docker Engine 8.2.1.2 Docker Architecture 8.2.1.3 Docker Survey
8.2.2 Running Docker Locally ☁ 8.2.2.1 Installation for OSX 8.2.2.2 Installation for Ubuntu 8.2.2.3 Installation for Windows 10 8.2.2.4 Testing the Install
8.2.3 Dockerfile ☁ 8.2.3.1 Specification 8.2.3.2 References
8.2.4 Docker Hub ☁ 8.2.4.1 Create Docker ID and Log In 8.2.4.2 Searching for Docker Images 8.2.4.3 Pulling Images 8.2.4.4 Create Repositories 8.2.4.5 Pushing Images 8.2.4.6 Resources
8.3 DOCKER AS PAAS 8.3.1 Docker Swarm ☁ 8.3.1.1 Terminology 8.3.1.2 Creating a Docker Swarm Cluster 8.3.1.3 Create a Swarm Cluster with VirtualBox 8.3.1.4 Initialize the Swarm Manager Node and Add Worker Nodes 8.3.1.5 Deploy the application on the swarm manager
8.3.2 Docker and Docker Swarm on FutureSystems ☁
8.3.2.1 Getting Access 8.3.2.2 Creating a service and deploy to the swarm cluster 8.3.2.3 Create your own service 8.3.2.4 Publish an image privately within the swarm cluster 8.3.2.5 Exercises
8.3.3 Hadoop with Docker ☁ 8.3.3.1 Building Hadoop using Docker 8.3.3.2 Hadoop Configuration Files 8.3.3.3 Virtual Memory Limit 8.3.3.4 hdfs Safemode leave command 8.3.3.5 Examples 8.3.3.5.1 Statistical Example with Hadoop 8.3.3.5.1.1 Base Location 8.3.3.5.1.2 Input Files 8.3.3.5.1.3 Compilation 8.3.3.5.1.4 Archiving Class Files 8.3.3.5.1.5 HDFS for Input/Output 8.3.3.5.1.6 Run Program with a Single Input File 8.3.3.5.1.7 Result for Single Input File 8.3.3.5.1.8 Run Program with Multiple Input Files 8.3.3.5.1.9 Result for Multiple Files
8.3.3.5.2 Conclusion 8.3.3.6 References
8.3.4 Docker Pagerank ☁ 8.3.4.1 Use the automated script 8.3.4.2 Compile and run by hand
8.3.5 Apache Spark with Docker ☁ 8.3.5.1 Pull Image from Docker Repository 8.3.5.2 Running the Image 8.3.5.2.1 Running interactively 8.3.5.2.2 Running in the background
8.3.5.3 Run Spark 8.3.5.3.1 Run Spark in Yarn-Client Mode 8.3.5.3.2 Run Spark in Yarn-Cluster Mode
8.3.5.4 Observe Task Execution from Running Logs of SparkPi 8.3.5.5 Write a Word-Count Application with Spark RDD 8.3.5.5.1 Launch Spark Interactive Shell
8.3.5.5.2 Program in Scala 8.3.5.5.3 Launch PySpark Interactive Shell 8.3.5.5.4 Program in Python
8.3.5.6 Docker Spark Examples 8.3.5.6.1 K-Means Example 8.3.5.6.2 Join Example 8.3.5.6.3 Word Count
8.3.5.7 Interactive Examples 8.3.5.7.1 Stop Docker Container 8.3.5.7.2 Start Docker Container Again 8.3.5.7.3 Remove Docker Container
8.4 KUBERNETES 8.4.1 Introduction to Kubernetes ☁ 8.4.1.1 What are containers? 8.4.1.2 Terminology 8.4.1.3 Kubernetes Architecture 8.4.1.4 Minikube 8.4.1.4.1 Install minikube 8.4.1.4.2 Start a cluster using Minikube 8.4.1.4.3 Create a deployment 8.4.1.4.4 Expose the service 8.4.1.4.5 Check running status 8.4.1.4.6 Call service api 8.4.1.4.7 Take a look from Dashboard 8.4.1.4.8 Delete the service and deployment 8.4.1.4.9 Stop the cluster
8.4.1.5 Interactive Tutorial Online 8.4.2 Using Kubernetes on FutureSystems ☁ 8.4.2.1 Getting Access 8.4.2.2 Example Use 8.4.2.3 Exercises
8.5 SINGULARITY 8.5.1 Running Singularity Containers on Comet ☁ 8.5.1.1 Background 8.5.1.2 Tutorial Contents 8.5.1.3 Why Singularity? 8.5.1.4 Hands-On Tutorials
8.5.1.5 Downloading & Installing Singularity 8.5.1.5.1 Download & Unpack Singularity 8.5.1.5.2 Configure & Build Singularity 8.5.1.5.3 Install & Test Singularity
8.5.1.6 Building Singularity Containers 8.5.1.6.1 Upgrading Singularity
8.5.1.7 Create an Empty Container 8.5.1.8 Import Into a Singularity Container 8.5.1.9 Shell Into a Singularity Container 8.5.1.10 Write Into a Singularity Container 8.5.1.11 Bootstrapping a Singularity Container 8.5.1.12 Running Singularity Containers on Comet 8.5.1.12.1 Transfer the Container to Comet 8.5.1.12.2 Run the Container on Comet 8.5.1.12.3 Allocate Resources to Run the Container 8.5.1.12.4 Integrate the Container with Slurm 8.5.1.12.5 Use Existing Comet Containers
8.5.1.13 Using Tensorflow With Singularity 8.5.1.14 Run the job
8.6 Exercises ☁ 9 NIST 9.1 NIST Big Data Reference Architecture ☁ 9.1.1 Pathway to the NIST-BDRA 9.1.2 Big Data Characteristics and Definitions 9.1.3 Big Data and the Cloud 9.1.4 Big Data, Edge Computing and the Cloud 9.1.5 Reference Architecture 9.1.6 Framework Providers 9.1.7 Application Providers 9.1.8 Fabric 9.1.9 Interface definitions
10 AI 10.1 Artificial Intelligence Service with REST ☁ 10.1.1 Unsupervised Learning 10.1.2 KMeans 10.1.3 Lab: Practice on AI 10.1.4 k-NN
10.1.5 Machine Learning and Cloud Services 10.1.5.1 Introduction and Regression 10.1.5.2 K-means Clustering 10.1.5.3 Visualization 10.1.5.4 Clustering Examples 10.1.5.5 General Clustering with Examples 10.1.5.6 In Depth Example with four centers 10.1.5.7 Parallel Computing and K-means
10.1.6 Example Project with SVM 11 REFERENCES
1 PREFACE

Sat Nov 23 05:18:45 EST 2019 ☁
1.1 DISCLAIMER ☁

This book has been generated with Cyberaide Bookmanager.

Bookmanager is a tool to create a publication from a number of sources on the internet. It is especially useful to create customized books, lecture notes, or handouts. Content is best integrated in markdown format as it is very fast to produce the output.

Bookmanager has been developed based on our experience over the last 3 years with a more sophisticated approach. Bookmanager takes the lessons from this approach and distributes a tool that can easily be used by others.

The following shields provide some information about it. Feel free to click on them.

pypi v0.2.28 | License Apache 2.0 | python 3.7 | format wheel | status stable | build unknown
1.1.1 Acknowledgment

If you use bookmanager to produce a document you must include the following acknowledgement.

“This document was produced with Cyberaide Bookmanager developed by Gregor von Laszewski available at https://pypi.python.org/pypi/cyberaide-bookmanager. It is in the responsibility of the user to make sure an author acknowledgement section is included in your document. Copyright verification of content included in a book is responsibility of the book editor.”
The bibtex entry is:

@Misc{www-cyberaide-bookmanager,
  author = {Gregor von Laszewski},
  title = {{Cyberaide Book Manager}},
  howpublished = {pypi},
  month = apr,
  year = 2019,
  url = {https://pypi.org/project/cyberaide-bookmanager/}
}

1.1.2 Extensions

We are happy to discuss with you bugs, issues and ideas for enhancements. Please use the convenient github issues at

https://github.com/cyberaide/bookmanager/issues

Please do not file issues with us that relate to an editor's book. The editors will provide you with their own mechanism on how to correct their content.

1.2 CONTRIBUTORS ☁

Contributors are sorted by the first letter of their combined first name and last name, and if not available, by their github ID. Please note that the authors are identified through git logs in addition to some contributors added by hand. The git repository from which this document is derived contains more than the documents included in this document. Thus not everyone in this list may have directly contributed to this document. However, if you find someone missing that has contributed (they may not have used this particular git), please let us know. We will add you. The contributors that we are aware of include:

Anand Sriramulu, Ankita Rajendra Alshi, Anthony Duer, Arnav, Averill Cate, Jr, Bertolt Sobolik, Bo Feng, Brad Pope, Brijesh, Dave DeMeulenaere, De’Angelo Rutledge, Eliyah Ben Zayin, Eric Bower, Fugang Wang, Geoffrey C. Fox, Gerald Manipon, Gregor von Laszewski, Hyungro Lee, Ian Sims, IzoldaIU, Javier Diaz, Jeevan Reddy Rachepalli, Jonathan Branam, Juliette Zerick, Keith Hickman, Keli Fine, Kenneth Jones, Mallik Challa, Mani Kagita, Miao Jiang, Mihir Shanishchara, Min Chen, Murali Cheruvu, Orly Esteban, Pulasthi Supun, Pulasthi Supun Wickramasinghe, Pulkit Maloo, Qianqian Tang, Ravinder Lambadi, Richa Rastogi, Ritesh Tandon, Saber Sheybani, Sachith Withana, Sandeep Kumar Khandelwal, Sheri Sanders, Shivani Katukota, Silvia Karim, Swarnima H. Sowani, Tharak Vangalapat, Tim Whitson, Tyler Balson, Vafa Andalibi, Vibhatha Abeykoon, Vineet Barshikar, Yu Luo, ahilgenkamp, aralshi, azebrowski, bfeng, brandonfischer99, btpope, garbeandy, harshadpitkar, himanshu3jul, hrbahramian, isims1, janumudvari, joshish-iu, juaco77, karankotz, keithhickman08, kkp, mallik3006, manjunathsivan, niranda perera, qianqian tang, rajni-cs, rirasto, sahancha, shilpasingh21, swsachith, toshreyanjain, trawat87, tvangalapat, varunjoshi01, vineetb-gh, xianghangmi, zhengyili4321
2 SYLLABUS

2.1 E222: INTELLIGENT SYSTEMS ENGINEERING II ☁

In this undergraduate course students will be familiarized with different specific applications and implementations of intelligent systems and their use in desktop and cloud solutions.

Piazza: Link
Registrar: Link
Lecture Notes: ePub
Indiana University Faculty: Geoffrey C. Fox
Credits: 3
Hardware: You will need a computer to take this class; a phone, tablet, or chromebook is not sufficient.
Prerequisite(s): Knowledge of a programming language, the ability to pick up other programming languages as needed, willingness to enhance your knowledge from online resources and additional literature. You will need access to a modern computer that allows using virtual machines and/or containers. If such a system is not available to you, you can also use IU computers or cloud virtual machines. The latter have to be requested.
Course Description: Link
This is an introductory class. In case you like to do research and more advanced topics, consider taking an independent study with Dr. Fox or Dr. von Laszewski.

An introduction video is available at:

222 Class Introduction and Management
2.1.1 Teaching and learning methods

Lectures
Assignments including specific lab activities
Final project
2.1.2 Representative bibliography

1. Cloud Computing for Science and Engineering by Ian Foster and Dennis B. Gannon, https://mitpress.mit.edu/books/cloud-computing-science-and-engineering
2. (This document) Handbook of Clouds and Big Data, Gregor von Laszewski, Geoffrey C. Fox, and Judy Qiu, Fall 2017, https://tinyurl.com/vonLaszewski-handbook
3. Use Cases in Big Data Software and Analytics Vol. 1, Gregor von Laszewski, Fall 2017, https://tinyurl.com/cloudmesh/vonLaszewski-i523-v1.pdf
4. Use Cases in Big Data Software and Analytics Vol. 2, Gregor von Laszewski, Fall 2017, https://tinyurl.com/cloudmesh/vonLaszewski-i523-v2.pdf
5. Use Cases in Big Data Software and Analytics Vol. 3, Gregor von Laszewski, Fall 2017, https://tinyurl.com/vonLaszewski-projects-v3
6. Big Data Software Vol 1., Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/paper1/proceedings.pdf
7. Big Data Software Vol 2., Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/paper2/proceedings.pdf
8. Big Data Projects, Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/project/projects.pdf
9. Gregor von Laszewski, Geoffrey C. Fox, Cloud Computing and Big Data, http://cyberaide.org/papers/vonLaszewski-bigdata.pdf
10. Introduction to Python for Cloud Computing, https://laszewski.github.io/book/python/
2.1.3 Grading

Grade Item      Percentage
Assignments     30%
Final Project   60%
Participation   10%
2.1.4 Incomplete

Please see the university regulations for getting an incomplete. However, as this class uses state-of-the-art technology that changes frequently, you must expect that an incomplete may result in significant additional work on your behalf, as your project may need significant updates on infrastructure, technology, or even programming models used. It is best to complete the course within one semester.
2.1.5 Other classes I423, I523, I524, B649, E516, E616

IU offers other undergraduate classes in this topic area such as I423. If you are interested in taking it, please see when they are taught. Additional related graduate level classes, which can also be taken only by special permission, include:

CSCI B-649 Cloud Computing is the same as E516 but for computer science students.
I524 is the same as E516 but for Data Engineering students.
E516 Introduction to Cloud Computing and Cloud Engineering.

All of these classes are project based and require a significant and consistent effort of time on your side.
2.1.6 Communication

To ask for help use piazza:

Piazza Resources
Piazza Questions

Please do not use CANVAS for communicating with us. Use Piazza. Make sure you have access to Piazza, while posting your formal Bio.

2.1.6.1 How to take this class

This class is an undergraduate class that contains two sections that you must attend.
In thisdocumentwewill introduceyoutheoretically tosomeconcepts thatareimportantforthisclass.Thisisdoneeitherthroughlectures,writtenmaterial,orpointerstoWebresources.Youareresponsibleto
1. listentotheonlinelecturesandunderstandthem.2. identifyadditionalmaterialthatmayhelpyouinunderstandingthelectures.
Thiscouldincludeadditionalresourcesontheinternet3. Contributetothematerialbycorrectingerrorsandupdatesyoumayfind.
Pleasenotethatwetrytokeepthematerialuptodatewithyourhelp.However,in our field software and documentation changes quickly and if you identifyupdatedmaterialweexpectthatyouhelpusfixingit.Youwillgetcreditdoingso.
Toallowyoutobemostflexibleintakingthisclass,wecertainlyallowyoutoworkahead.Thusyoucanuseallbuttheinpersonlecturesaheadoftime.TheSyllabuswillclearlyidentifywhichmaterialisavailable.Notethatthebookmayinclude sections that are notmarked in the syllabus.You do not have to readsuchsections.
Pleasenotethatthisclassdoesnothavesmallassignmentsandanyassignmentis likelytotakeyouasignificantamountof time.Thusit isadvisablethatyoustartyourassignmentsearlyandmakesureyoudonotdotheminthelastweekbefore the assignment is due. This contrasts other undergraduate classes, thatmayfocusontheassignmentofanumberoftoyexerises.Insteadwewillworkthroughout the entire semester towards a project youwill conduct. In order tomake it earlier for you, we will introduce graded checkpoints of all largeassignments.Thegradesforthesecheckpointsarefinalandcannotbeimprovedby work done later. Also here please be advised that somemay take severalweeks to conduct and it is your responsibility to devote enough time to theseactivities.
To assure progress, you will have to manage a notebook.md file in your github directory (that we will create for you) in which you will update your weekly progress. If you miss a lecture, it is your responsibility to inform yourself what was being taught. Attendance and participation will be graded, as will the updates to notebook.md.

In the following calendar we put in the last day of the week when the assignments are typically due.
2.1.7 Covered Topics

As part of this class you will have to explore the following topics. These topics are either included in this document, or we are pointing you within this document to other documents with the information.

If we forgot anything, let us know. The order of the lectures and the lecture material are subject to change as we see fit.

This weekly agenda will be updated every week. You are required to check in every week for updates. At this time we have included an approximate weekly agenda.

To see the differences to previous versions of this document, you can look at:

https://github.com/cloudmesh-community/book/commits/master/chapters/e222-syllabus.md

To see if checkins succeed you can look at:

https://circleci.com/gh/cloudmesh-community/book

Currently, the topics covered in the class include the following.
2.1.7.1 Week 1. Overview of this Class

We will provide an overview of this class.

Logistic: Get familiar with the class structure.

Read: Preface; Class Overview; Start reviewing your python knowledge
Assignment Accounts: Find a computer you can do the class programming on (a tablet or chromebook will not suffice). Get an account on piazza.com with your ??? name. Get an account on github.com (this is NOT the IU github) and apply there for a github username. Post the username into a form that will be sent to you. Make sure that the account you send us is your github.com account. This is a graded assignment that must be completed in the first week of class.

This must be completed in the first week by Friday. (Survey will be posted on Piazza.)

Assignment notebook: Once you get your github directory, update the file notebook.md. Mind the spelling: notebook is lowercase. Use simple markdown bullet lists to record your activities, as shown below.
Assignment Development Environment: (Multi-week assignment, to be completed in the first month.) It is important that you have a development environment to conduct the class assignments. We recommend that you use virtualbox and use ubuntu. We have provided an extensive set of material for you to achieve this in this document. Please consult additional resources from the Web and utilize the Lab hours.
2.1.7.2 Week 1 and 2. Review of Python for Intelligent Systems Engineering

Theory: basic Python Language
Theory: pyenv, setup.py, modules

Practice: Living without anaconda

Python specific topics include:

Assignment: Install Python and use it throughout the semester
Why not anaconda?
Using python 3.7
pip

Language
Numpy
Scipy
OpenCV
Scikit Learn

Report: Create an empty report based on our template in github. The TAs may do this with you in the Lab.

Github Pull Requests: Find a spelling error in the class material and create a pull request to correct it.
2.1.7.3 Week 2. Review of Linux shell for OSX, Linux, and Windows

Theory: Basic Linux Shell
Practice: SSH
Assignment: ssh key generation on your computer, upload to github.com (a command sketch follows)
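On Linux, macOS, and recent Windows 10 versions the key can, for example, be generated with the following command; accept the default file location and choose a passphrase when prompted:

$ ssh-keygen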
2.1.7.4 Week 3. Introduction to REST

Theory: Overview of REST, Eve, OpenAPI
Practice: develop a REST service with OpenAPI

Theory: Learn about REST services and use Swagger OpenAPI to create a rest service that returns the CPU information about your computer

We will be starting the class with introducing you to REST services that provide a foundation for setting up services in the cloud and to interact with these services. As part of this class we will be revisiting the REST services and use them to deploy them on a cloud as well as develop our own AI based rest services in the second half of the class.

Focus on the OpenAPI example posted in the NIST github repository

Project team: Build a project team with no more than 3 people. There will not be an exception. You are allowed to work alone. Make sure your project team does the work together. E.g., you must not have 3 people on the team while the project could have been executed by a single person. In case of more than one person, the sum of the deliverables must be larger than what one team member can achieve. It is an advantage to work in a team as you can check each other.

If a team member does not contribute to the project, the team has the right to exclude the non-working team member in consultation with the instructors. We will have a joint meeting with the team to identify the best path forward. Choose your team members wisely. Ideally you should make this decision in the first 3 weeks.
2.1.7.5 Week 4. Introduction to Scientific Writing

Theory: Scientific writing with markdown and bibtex
Practice: Contribute a significant chapter to the book (as a group)
Practice: Project Report (as a group)
Practice: Introduction to Emacs
Practice: Introduction to jabref

See the separate ePub for more information: Link

Assignment Scientific Writing: Learn about markdown. See our class notes and internet resources. Note that we use pandoc markdown that may not render properly in github, especially when it comes to figure captions, references, and bibliography. (You have till the end of the month.) Install and use jabref.

Report: Learn bibtex and create references in report.bib that you use in report.md. Make sure that you do only one report per team and update your README.yml file accordingly. Check in the Lab with the TAs if you have done it correctly.
Project Idea due: A one page formal document that summarizes the project. This is not a proposal. The words I, project, and report must not be used. It is essentially a snapshot of your final report. Discuss with the TAs in the Lab how to define a project.

Github: make sure your teammates have access to your project directory.
2.1.7.6 Week 5 to 9. Introduction to Cloud Computing

Theory:

Introduction - Part A
Introduction - Part B - Defining Clouds I
Introduction - Part C - Defining Clouds II
Introduction - Part D - Defining Clouds III
Introduction - Part E - Virtualization
Introduction - Part F - Technology Hypecycle I
Introduction - Part G - Technology Hypecycle II
Introduction - Part H - IaaS I
Introduction - Part I - IaaS II
Introduction - Part J - Cloud Software
Introduction - Part K - Applications I
Introduction - Part M - Applications III
Introduction - Part N - Parallelism
Introduction - Part O - Storage (Released)
Introduction - Part P - HPC in the Cloud (Released)
Introduction - Part Q - Analytics and Simulation (Released)
Introduction - Part R - Jobs (Released)
Introduction - Part S - The Future (Released)
Introduction - Part T - Security (Released)
Introduction - Part U - Fault Tolerance

Practice: Manage virtual machines with Virtualbox
Practice: Manage virtual machines with Cloudmesh v4
Practice: Manage a container with Docker

Theory: Containers
Week 5: Project Update due: A two page formal document that summarizes the project. This is not a proposal. The words I, project, and report must not be used. It is essentially a snapshot of your final report.

Week 7: Project Update due: A multi-paragraph description about the data that you use for your project is to be added to your report. This includes details about the data. In a documented program you showcase how you download the data with python requests in an automated fashion.

Week 8: Project Update Due: Have a documented program ready that uses a REST service to obtain data for your analysis. Identify how to do benchmarks and time the execution of your project. Add planned benchmarks to your report. Do not use the word plan or will; write it in such a form as if it were done. Instead put a marker on benchmarks that you are still working on.

Week 9: Project update: Study matplotlib and Bokeh and identify how to visualize other aspects of your projects. You are also allowed to use D3.js and addons to it. You are not allowed to use Tableau.
2.1.7.7 Week 10: Lecture Free Time

March 10-17 Lecture free time, no class support. A good week to work ahead on your project.
2.1.7.8 Week 11. Introduction to Cloud Platforms

We will introduce you to the concept of Mapreduce. We will discuss systems such as Hadoop and Spark and how they differ. You will be deploying Hadoop via a container on your machine and use it to gain hands-on experience. We start with using cloudmesh on your computer to manage virtual machines that you may be able to use during your test developments.

Background about Hadoop, Spark and Twister

Theory: Background to Cloudmesh
Theory: Background to Hadoop
Theory: Background to Spark
Theory: Background to Twister

Week 11 Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data.
2.1.7.9 Week 12 to 16. Review of AI for AI-Cloud Computing Integration

Theory: Introduction to basic AI

Practice: Develop a nontrivial AI REST service

See #sec:ai

Overview of AI for this class
Theory: Unsupervised Learning
Deep Learning
Forecasting

Week 12: Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data.

Week 13: Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data. Start benchmarks.

Week 14: Project update: Focus on your project report and finalize it. The project report must include references in bibtex format. Double-check integration in proceedings.

Week 15: Apr 19 - Project due date.
As the project will take time to grade, all projects are due two weeks (yes, you read correctly) before the semester ends. The project will have the following artifacts:

completed project report
completed project code
completed instructions on how to replicate your project on someone else’s computer or a cloud service
any other outstanding task.

Week 16: Apr 26

Project improvement if needed (majority should be finished)

Make sure your project report is showing up correctly in the proceedings
2.1.7.10 Cloud Edge Computing

If time allows, we may additionally cover:

Theory: Raspberry PI as Platform

2.1.7.11 Alternative Projects

If you are interested, the following could be chosen by you as a project. Participation in these projects needs to be approved by Dr. von Laszewski. The project starts in this case in week 2 or 3.

Project (if elected): Document the build of a 100 node Raspberry PI Cluster
Project: Environmental Robot Boat
2.2 ASSIGNMENTS ☁

For more details see the course syllabus and overview pages. We give here just some summary.

2.2.1 Account Creation

As part of the class you will need a number of accounts:

piazza.com
github.com
Optional accounts include (only apply for them if you know you need them.Note that applying for some accountsmay take 1 - 2weeks to complete, youshould have identified before themiddle of the semester if you need some ofthem.
futuresystems.org(optional)chameleoncloud.org(optional)
aws.com(optional)google.com(optional)azure.com(optional)watsonfromIBM(optional)googleIaas(optional)
Inourpiazzawehavedetailshowtosubmitthemtous.Wesplitthesubmissioninmultiplesub-assignmentsasthegithub.comandpiazza.comareneededwithinthefirstweek.
2.2.2 Sections, Chapters, Examples

As part of the class, we expect you to get familiar with topics related to intelligent systems engineering. Those that like to go for an A+ are also expected to contribute significantly to this document or have a truly outstanding project. This is done in Sections, Examples, and Chapters, or excellent project reports and code.

Section:

A section is a small section that explains a topic that is not yet in the handbook or improves an existing section significantly. It is typically multiple paragraphs long and can even include an example if needed. Example sections that have been provided are for example the Lambda section in the python chapter.

Samples of student contributed sections include:

Project Natick
Lambda Expressions

(please fix links)
Chapter:

A chapter is a much longer topic and is a coherent description of a topic related to cloud computing. A chapter could either be a review of a topic or a detailed technical contribution. Several Sections (10+) may be a substitute for a chapter.

You will be contributing a significant chapter that can be used by other students in the class and introduces the reader to a general topic related to the topic of the class. In addition it is expected, if applicable, to develop a practical example demonstrating how to use a technology. The chapter and the practical example can be done together. We do not like to use the term tutorial in our writeup, but sometimes we refer to it in our assignments as such. Chapters that focus on theory may not have an example and it can be substituted by a longer text.

A sample of a student contributed chapter is GraphQL.
Example:

An example is a document that showcases the use of a particular technology. Typically it is a console session or a program. Examples augment chapters and Sections.

It is expected from you that you self-identify a section or a chapter, as this shows competence in the area of cloud computing. If however you do not know what to select, you must attend an online hour with us in which we identify sections and chapters with you. The emphasis here is that we do not decide them for you, but we identify them with you.

Sample topics that could form a section or chapter are clearly marked with a special symbol. There are plenty in the handbook, but you are welcome to define your own contributions. Discuss them with us in the online hours.
A list of topics identified by students is maintained in a spreadsheet.

See https://piazza.com/class/jgxybbf5rnx5qd?cid=201 for details.

You are expected to sign up in this spreadsheet. This is done to avoid overlap and foster uniqueness of the assignment for sections and chapters.
2.2.3 Project

Project:

We refer with the term project to the major activity that you chose as part of your class. The default case is an implementation project that requires a project report and project code.

License:

All projects are developed under an open source license such as Apache 2.0 License. You will be required to add a LICENCE.txt file and, if you use other software, identify how it can be reused in your project. If your project uses different licenses, please add in a README.md file which packages are used and which license these packages have.
Project Report:

A project report is an enhanced topic paper that includes not just the analysis of a topic, but actual code, with benchmark and demonstrated application use. Obviously it is longer than a term paper and includes descriptions about reproducibility of the application. A README.md is provided that describes how others can reproduce your project and run it. Remember tables and figures do not count towards the paper length. The following length is required:

4 pages, one student in the project
6 pages, two students in the project
8 pages, three students in the project

We estimate that a single page is between 1000-1200 words. Please note that for 2018 the format will be markdown, so the word count will be used instead. How to use figures is explained in the Notation of the handbook. We use bibtex for bibliographies. Please be reminded that images and tables as well as code are excluded from the page length. Make sure that your text is mostly developed by midterm time.
Project Code:

This is the documented and reproducible code and scripts that allows a TA to replicate the project. In case you use images they must be created from scratch locally and may not be uploaded to services such as dockerhub. You can however reuse vendor uploaded images such as from ubuntu or centos. All code, scripts, and documentation must be uploaded to github.com under the class specific github directory.
Data:

Data is to be hosted on IU’s google drive if needed. If you have larger data, it should be downloaded from the internet. It is your responsibility to develop a download program. The data must not be stored in github. You will be expected to write a python program that downloads the data.
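A minimal sketch of such a download program is shown next; the URL and file name are placeholders that you would replace with the location of your own dataset:

import requests

# hypothetical location of the dataset; replace with your own
url = "https://example.com/dataset.csv"

# fetch the file and fail loudly if the server reports an error
response = requests.get(url, timeout=60)
response.raise_for_status()

# store the content locally so the analysis can read it from disk
with open("dataset.csv", "wb") as f:
    f.write(response.content)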
Work Breakdown:

This is an appendix to the document that describes in detail who did what in the project. This section comes on a new page after the references. It does not count towards the page length of the document. It also includes explicit URLs to the git history that documents the statistics to demonstrate that not only one student has worked on the project. If you cannot provide such a statistic, or all check-ins have been made by a single student, the project has shown that they have not properly used git. Thus points will be deducted from the project. Furthermore, if we detect that a student has not contributed to a project, we may invite the student to give a detailed presentation of the project.
Bibliography:

All bibliography has to be provided in a jabref/bibtex file. There is NO EXCEPTION to this rule. Please be advised that doing references right takes some time, so you want to do this early. Please note that exports of Endnote or other bibliography management tools do not lead to properly formatted bibtex files, despite them claiming to do so. You will have to clean them up, and we recommend to do it the other way around: manage your bibliography with jabref. Make sure labels only include characters from [a-zA-Z0-9-]. Use dashes and not underscore or : in the label.
2.2.3.1 Project Deliverables

The objective of the project is to define a clear problem statement and create a framework to address that problem as it relates to cloud computing. In this class it is especially important to address the reproducibility of the deployment. A test and benchmark, possibly including a dataset, must be used to verify the correctness of your approach. Projects related to NIST focus on the specification and implementation. The report here can be smaller, but the contribution must be includable in the specification document.

In general any project must be deployable by the TA. If it takes hours to deploy your project, please talk to us before final submission.

You have plenty of time to execute a wonderful project.
The deliverables include but need to be updated according to your specificproject, for example if you do Edge Computing some deliverabl;es will bedifferent:
Providebenchmarks.
Take results in two different cloud services and your local PC (ex:ChameleonCloud,echokubernetes).Makesureyoursystemcanbecreatedanddeployedbasedonyourdocumentation.
Each team member must provide a benchmark on their computer and acloudIaaS,wherethecloudisdifferentfromeachteammember.
CreateaMakefilewith the tagsdeploy,run,kill,view,clean thatdeploysyourenvironment,runsapplication,killsit,viewstheresultandcleansupafterwards.Youare allowed tohavedifferentmakefiles for thedifferentcloudsanddifferentdirectories.Keepthecodeanddirectorystructurecleananddocumenthowtoreproduceyourresults.
For python use a requirements.txt file also.

For docker use a Dockerfile also.
Write a report that includes the following sections:

Abstract
Introduction
Design
Architecture
Implementation
Technologies Used
Results
Deployment Benchmarks
Application Benchmarks
(Limitations)
Conclusion
(Work Breakdown)

Your paper will not have a Future Work section, as this implies that you will do work in the future and your paper is incomplete; instead you can use an optional “Limitations” section.
2.2.3.2 Project Topic

As part of this class you will be developing an OpenAPI based Artificial Intelligence REST service and demonstrate its use. You will be developing documentation and a report that showcases the use of the service. The OpenAPI service must be non trivial, e.g. you should show upload of data, submission of parameters including the function to be executed, and potential development of a GUI for the service.

We will work with you to solidify the project throughout the semester.
2.2.4 Alternate Project: Virtual Cluster

All students can contribute to the creation of the Virtual Cluster code that we will be using throughout the class to improve and interface with cloud and container frameworks. This project is typically done in a graduate class, but interested undergraduates can contribute also. Those that like to contribute must have significant programming experience in either Python or Javascript. This project could replace the regular AI REST service project. A weekly meeting and demonstrated progress has to be shown to Gregor von Laszewski.

https://github.com/cloudmesh-community/cm

The residential students have been assigned this task, but online students can join and contribute.
2.2.5 Alternative Project: 100 node Raspberry Pi cluster

In this project you will be developing a 100 node Raspberry PI cluster. This includes putting the hardware together, and developing software that allows using all 100 nodes as a cluster. Software is to be used to make management easy. It is not sufficient to just install the software; you must develop a framework that allows us to easily share this resource with other users.

Documentation has to be written for this project so others can replicate your cluster build. A good start for this is to look at our cm-burn command that creates Raspberry PI OS based on manipulation of the file system:

https://github.com/cloudmesh/cm-burn
https://github.com/cloudmesh-community/cm

Substantial contributions are expected beyond the hardware build. We also like to design a case with a Laser cutter for the 100 nodes. Building the cluster would take place in MESH and transportation to and from it is provided by the university. You will be able to work in an office there to put the cluster together. A weekly meeting with Gregor von Laszewski or the TAs is needed to showcase progress.
2.2.6 Submission of sections and chapters and projects

Sections and subsections are to be added to the book github repo. Do a pull request. The headline of the section needs to be marked with the appropriate status symbol: one if you still work on it, another if you want it to be graded. It must also list the hids of all people that contribute to that section.
In addition, simply add them to your README.yml file in your github repo.Addthefollowingtoit(Iamusinga18-516-18asexample).
Please look at https://github.com/cloudmesh-community/fa18-516-18 andhttps://raw.githubusercontent.com/cloudmesh-community/fa18-523-62/master/README.ymlforanexamples.Pleasenotethatincaseyouworkinagroupthecodeandreportissupposedtobeonlystoredinthefirsthidmentionedin the group field. If you store it in multiple directories your project will berejected.
YouMUST run yamllint on the README.yml file. YAML errors will givepointdeductions.
section:
  - title: title of the section 1
    url: https://github.com/cloudmesh-community/book/chapters/...
  - title: title of the section 2
    url: https://github.com/cloudmesh-community/book/chapters/...
  - title: title of the section 3
    url: https://github.com/cloudmesh-community/book/chapters/...
chapter:
  - title: title of the chapter
    url: https://github.com/cloudmesh-community/fa18-516-18/blob/master/chapter/whatever.md
    group: fa18-523-62 fa18-523-69
    keyword: whatever
project:
  - title: title of the project
    url: url in your hid space or that of your partner
    group: fa18-523-62 fa18-523-69
    keyword: kubernetes, NIST, Database
    code: the url to the code
other:
  - activity: spell checked md document
    url: put url here
3 PYTHON

3.1 INTRODUCTION TO PYTHON ☁

Learning Objectives

Learn quickly Python under the assumption you know a programming language
Work with modules
Understand docopts and cmd
Conduct some python examples to refresh your python knowledge
Learn about the map function in Python
Learn how to start subprocesses and redirect their output
Learn more advanced constructs such as multiprocessing and Queues
Understand why we do not use anaconda
Get familiar with pyenv

Portions of this lesson have been adapted from the official Python Tutorial copyright Python Software Foundation.
Python is an easy to learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation. The Python interpreter can be extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

Python is an interpreted, dynamic, high-level programming language suitable for a wide range of applications.
The philosophy of python is summarized in The Zen of Python as follows:

Explicit is better than implicit
Simple is better than complex
Complex is better than complicated
Readability counts
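These are only a few of the lines; you can print the complete Zen of Python yourself with a one-liner:

$ python -c "import this"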
The main features of Python are:

Use of indentation whitespace to indicate blocks
Object orient paradigm
Dynamic typing
Interpreted runtime
Garbage collected memory management
a large standard library
a large repository of third-party libraries

Python is used by many companies and is applied for web development, scientific computing, embedded applications, artificial intelligence, software development, and information security, to name a few.
The material collected here introduces the reader to the basic concepts andfeaturesofthePythonlanguageandsystem.Afteryouhaveworkedthroughthematerialyouwillbeableto:
usePythonusetheinteractivePythoninterfaceunderstandthebasicsyntaxofPythonwriteandrunPythonprogramshaveanoverviewofthestandardlibraryinstall Python libraries using pyenv for multipython interpreterdevelopment.
It does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules.
In order to conduct this lesson you need:

A computer with Python 2.7.16 or 3.7.4
Familiarity with command line usage
A text editor such as PyCharm, emacs, vi, or others. You should identify which works best for you and set it up.
3.1.1 References

Some important additional information can be found on the following Web pages:

Python
Pip
Virtualenv
NumPy
SciPy
Matplotlib
Pandas
pyenv
PyCharm

Python module of the week is a Web site that provides a number of short examples on how to use some elementary python modules. Not all modules are equally useful and you should decide if there are better alternatives. However, for beginners this site provides a number of good examples:

Python 2: https://pymotw.com/2/
Python 3: https://pymotw.com/3/
3.2 PYTHON 3.7.4 INSTALLATION ☁

Learning Objectives

Learn how to install python.
Find additional information about Python.
Make sure your Computer supports Python.

In this section we explain how to install python 3.7.4 on a computer. Likely much of the code will work with earlier versions, but we do the development in Python on the newest version of python available at https://www.python.org/downloads.
3.2.1 Hardware

Python does not require any special hardware. We have installed Python not only on PCs and Laptops, but also on Raspberry PIs and Lego Mindstorms.

However, there are some things to consider. If you use many programs on your desktop and run them all at the same time, you will find that in up-to-date operating systems you quickly run out of memory. This is especially true if you use editors such as PyCharm, which we highly recommend. Furthermore, as you likely have lots of disk access, make sure to use a fast HDD or better an SSD.

A typical modern developer PC or Laptop has 16GB RAM and an SSD. You can certainly do python on a $35 Raspberry PI, but you probably will not be able to run PyCharm. There are many alternative editors with less memory footprint available.
3.2.2 Prerequisites Ubuntu 19.04

Python 3.7 is installed in ubuntu 19.04. Therefore, it already fulfills the prerequisites. However, we recommend that you update to the newest version of python and pip. Please visit: https://www.python.org/downloads
3.2.3 Prerequisites macOS

3.2.3.1 Installation from Apple App Store

You want a number of useful tools on your macOS. They are not installed by default, but are available via Xcode. First you need to install xcode from

https://apps.apple.com/us/app/xcode/id497799835

Next you need to install the macOS xcode command line tools:

$ xcode-select --install

3.2.3.2 Installation from python.org

The easiest installation of Python for cloudmesh is to use the installation from https://www.python.org/downloads. Please visit the page and follow the instructions. After this install you have python3 available from the command line.

3.2.3.3 Installation from Homebrew

An alternative installation is provided from Homebrew. To use this install method, you need to install Homebrew first. Start the process by installing python3 using homebrew. Install homebrew using the instructions on their web page:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then you should be able to install Python 3.7.4 using:

$ brew install python

3.2.4 Prerequisites Ubuntu 18.04

We recommend you update your ubuntu version to 19.04 and follow the instructions for that version instead, as it is significantly easier. If you however are not able to do so, the following instructions may be helpful.

We first need to make sure that the correct version of Python3 is installed. The default version of Python on Ubuntu 18.04 is 3.6. You can get the version with:

$ python3 --version
If the version is not 3.7.4 or newer, you can update it as follows:

$ sudo apt-get update
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt-get install python3.7 python3-dev python3.7-dev

You can then check the installed version using python3.7 --version, which should be 3.7.4.

Now we will create a new virtual environment:

$ python3.7 -m venv --without-pip ~/ENV3

Then edit the ~/.bashrc file and add the following line at the end:

alias ENV3="source ~/ENV3/bin/activate"

Now activate the virtual environment using:

$ source ~/.bashrc
$ ENV3

Now you can install the pip for the virtual environment without conflicting with the native pip:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py
$ rm get-pip.py

3.2.5 Prerequisite Windows 10

Python 3.7 can be installed on Windows 10 using: https://www.python.org/downloads

For 3.7.4 you can go to the download page and download one of the different files for Windows.

Let us assume you chose the Web based installer; then you click on the file in the edge browser (make sure the account you use has administrative privileges). Follow the instructions that the installer gives. Important is that you select at one point “[x] Add to Path”. There will be an empty checkmark about this that you will click on.

Once it is installed, choose a terminal and execute

python --version
However, ifyouhave installedconda for somereasonyouneed to readuponhowtoinstall3.7.4pythonincondaoridentifyhowtoruncondaandpython.orgatthesametime.Weseeoftenothersgivingthewronginstallationinstructions.
Analternative is tousepythonfromwithin theLinuxSubsystem.But thathassomelimitationsandyouwillneedtoexplorehowtoexxessthefilesysteminthesubssytemtohaveasmoothintegrationbetweenyourWindowshostsoyoucanforexampleusePyCharm.
3.2.5.1 Linux Subsystem Install
To activate the Linux Subsystem, please follow the instructions at
https://docs.microsoft.com/en-us/windows/wsl/install-win10
A suitable distribution would be
https://www.microsoft.com/en-us/p/ubuntu-1804-lts/9n9tngvndl3q?activetab=pivot:overviewtab
However, as it uses an older version of Python, you will have to update it.
3.2.6 Prerequisite venv
This step is highly recommended if you have not already installed a venv for Python, to make sure you are not interfering with your system Python. Not using a venv could have catastrophic consequences, including the destruction of your operating system tools if they rely on Python. The use of venv is simple. For our purposes we assume that you use the directory:
~/ENV3
Follow these steps first: cd to your home directory, then execute
$ python3 -m venv ~/ENV3
$ source ~/ENV3/bin/activate
You can add at the end of your .bashrc (Ubuntu) or .bash_profile (macOS) file the line
source ~/ENV3/bin/activate
so the environment is always loaded. Now you are ready to install cloudmesh.
Check if you have the right version of Python installed with
$ python --version
To make sure you have an up-to-date version of pip, issue the command
$ pip install pip -U
3.2.7 Install Python 3.7 via Anaconda
3.2.7.1 Download conda installer
Miniconda is recommended here. Download an installer for Windows, macOS, and Linux from this page: https://docs.conda.io/en/latest/miniconda.html
3.2.7.2 Install conda
Follow the instructions to install conda for your operating system:
- Windows: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html
- macOS: https://conda.io/projects/conda/en/latest/user-guide/install/macos.html
- Linux: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
3.2.7.3 Install Python 3.7.4 via conda
$ cd ~
$ conda create -n ENV3 python=3.7.4
$ conda activate ENV3
$ conda install -c anaconda pip
$ conda deactivate ENV3
It is very important to make sure you have a newer version of pip installed. After you have installed and created the ENV3, you need to activate it. This can be done with
$ conda activate ENV3
If you like to activate it when you start a new terminal, please add this line to your .bashrc or .bash_profile.
If you use zsh, please add it to .zprofile instead.
3.3 INTERACTIVE PYTHON ☁
Python can be used interactively. You can enter the interactive mode by entering the interactive loop by executing the command:
$ python
You will see something like the following:
Python 3.7.1 (default, Nov 24 2018, 14:27:15)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
The >>> is the prompt used by the interpreter. This is similar to bash, where commonly $ is used.
Sometimes it is convenient to show the prompt when illustrating an example. This is to provide some context for what we are doing. If you are following along, you will not need to type in the prompt.
This interactive Python process does the following:
- read your input commands
- evaluate your command
- print the result of the evaluation
- loop back to the beginning
This is why you may see the interactive loop referred to as a REPL: Read-Evaluate-Print-Loop.
3.3.1 REPL (Read Eval Print Loop)
There are many different types beyond what we have seen so far, such as dictionaries, lists, and sets. One handy way of using the interactive Python is to get the type of a value using type():
>>> type(42)
<class 'int'>
>>> type('hello')
<class 'str'>
>>> type(3.14)
<class 'float'>
You can also ask for help about something using help():
>>> help(int)
>>> help(list)
>>> help(str)
Using help() opens up a help message within a pager. To navigate you can use the spacebar to go down a page, w to go up a page, the arrow keys to go up/down line-by-line, or q to exit.
3.3.2 Interpreter
Although the interactive mode provides a convenient tool to test things out, you will quickly see that for our class we want to use the Python interpreter from the command line. Let us assume the program is called prg.py. Once you have written it in that file, you simply can call it with
$ python prg.py
It is important to name the program with a meaningful name.
3.3.3 Python 3 Features in Python 2
In this course we want to be able to seamlessly switch between Python 2 and Python 3. Thus it is convenient from the start to use Python 3 syntax when it is also supported in Python 2. One of the most used functions is the print statement, which in Python 3 has parentheses. To enable it in Python 2 you just need to import this function:
from __future__ import print_function, division
The first of these imports allows us to use the print function to output text to the screen, instead of the print statement, which Python 2 uses. This is simply a design decision that better reflects Python’s underlying philosophy.
Other functions such as the division also behave differently. Thus we use
from __future__ import division
This import makes sure that the division operator behaves in a way a newcomer to the language might find more intuitive. In Python 2, division / is floor division when the arguments are integers, meaning that the following
(5 / 2 == 2) is True
In Python 3, division / is a floating point division, thus
(5 / 2 == 2.5) is True
3.4 EDITORS ☁
This section is meant to give an overview of the Python editing tools needed for this course. There are many other alternatives; however, we do recommend using PyCharm.
3.4.1 PyCharm
PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with git.
Python 8:56 PyCharm
3.4.2 Python in 45 minutes
An additional community video about the Python programming language that we found on the internet. Naturally there are many alternatives to this video, but the video is probably a good start. It also uses PyCharm, which we recommend.
Python 43:16 PyCharm
How much you want to understand of Python is actually a bit up to you. While it is good to know classes and inheritance, you may be able to get away without using them for this class. However, we do recommend that you learn them.
PyCharm Installation:
Method 1: PyCharm installation on Ubuntu using umake
sudo add-apt-repository ppa:ubuntu-desktop/ubuntu-make
sudo apt-get update
sudo apt-get install ubuntu-make
Once umake is installed, use the next command to install the PyCharm community edition:
umake ide pycharm
If you want to remove PyCharm installed using the umake command, use this:
umake -r ide pycharm
Method 2: PyCharm installation on Ubuntu using PPA
sudo add-apt-repository ppa:mystic-mirage/pycharm
sudo apt-get update
sudo apt-get install pycharm-community
PyCharm also has a Professional (paid) version, which can be installed using the following command:
sudo apt-get install pycharm
Once installed, go to your VM dashboard and search for PyCharm.
3.5 LANGUAGE ☁
3.5.1 Statements and Strings
Let us explore the syntax of Python while starting with a print statement:
print("Hello world from Python!")
This will print on the terminal:
Hello world from Python!
The print function was given a string to process. A string is a sequence of characters. A character can be alphabetic (A through Z, lower and upper
print("HelloworldfromPython!")
HelloworldfromPython!
case), numeric (any of the digits), white space (spaces, tabs, newlines, etc.), syntactic directives (comma, colon, quotation, exclamation, etc.), and so forth. A string is just a sequence of characters and is typically indicated by surrounding the characters in double quotes.
Standard output is discussed in the Section Linux.
So, what happened when you pressed Enter? The interactive Python program read the line print("Hello world from Python!"), split it into the print statement and the "Hello world from Python!" string, and then executed the line, showing you the output.
3.5.2 Comments
Comments in Python start with a #:
# This is a comment
3.5.3 Variables
You can store data into a variable to access it later. For instance:
hello = 'Hello world from Python!'
print(hello)
This will print again
Hello world from Python!
3.5.4 Data Types
3.5.4.1 Booleans
A boolean is a value that can have the values True or False. You can combine booleans with boolean operators such as and and or:
print(True and True)    # True
print(True and False)   # False
print(False and False)  # False
print(True or True)     # True
print(True or False)    # True
print(False or False)   # False
3.5.4.2 Numbers
The interactive interpreter can also be used as a calculator. For instance, say we wanted to compute a multiple of 21:
print(21 * 2)  # 42
We saw here the print statement again. We passed in the result of the operation 21 * 2. An integer (or int) in Python is a numeric value without a fractional component (those are called floating point numbers, or float for short).
The mathematical operators compute the related mathematical operation on the provided numbers. Some operators are:
Operator  Function
*         multiplication
/         division
+         addition
-         subtraction
**        exponent
Exponentiation x**y is x to the y-th power:
print(2 ** 3)  # 8
You can combine floats and ints:
print(3.14 * 42 / 11 + 4 - 2)  # 13.9890909091
print(1 + 2 * 3 - 4 / 5.0)     # 6.2
Note that operator precedence is important. Using parentheses to affect the order of operations gives different results, as expected:
print(3.14 * (42 / 11) + 4 - 2)  # 11.42
print((1 + 2) * (3 - 4) / 5.0)   # -0.6
3.5.5 Module Management
A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can bind and reference. A module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.
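To illustrate the idea, here is a minimal sketch; the file name greet.py and the function in it are made up for this example:
# greet.py -- a small module defining a single function
def say_hello(name):
    return 'Hello ' + name
Any other Python program located next to this file can then import and use it:
import greet

print(greet.say_hello('Albert'))  # Hello Albert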
3.5.5.1 Import Statement
You can use any Python source file as a module by executing an import statement in some other Python source file. When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module. It is preferred to use a separate line for each import, such as:
import numpy
import matplotlib
3.5.5.2 The from…import Statement
Python’s from statement lets you import specific attributes from a module into the current namespace. The from…import has the following syntax:
from datetime import datetime
3.5.6 Date Time in Python
The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.
from datetime import datetime
The dateutil module offers a generic date/time string parser which is able to parse most known formats to represent a date and/or time:
from dateutil.parser import parse
pandas is an open source Python library for data analysis that needs to be imported:
import pandas as pd
Create a string variable with the class start time:
fall_start = '08-21-2018'
Convert the string to datetime format:
datetime.strptime(fall_start, '%m-%d-%Y')
# datetime.datetime(2018, 8, 21, 0, 0)
Creating a list of strings as dates:
class_dates = [
    '8/25/2017',
    '9/1/2017',
    '9/8/2017',
    '9/15/2017',
    '9/22/2017',
    '9/29/2017']
Convert class_dates strings into datetime format and save the list into variable a:
a = [datetime.strptime(x, '%m/%d/%Y') for x in class_dates]
Use parse() to attempt to auto-convert common string formats. The argument to parse must be a string or character stream, not a list:
parse(fall_start)  # datetime.datetime(2018, 8, 21, 0, 0)
Use parse() on every element of the class_dates list:
[parse(x) for x in class_dates]
# [datetime.datetime(2017, 8, 25, 0, 0),
#  datetime.datetime(2017, 9, 1, 0, 0),
#  datetime.datetime(2017, 9, 8, 0, 0),
#  datetime.datetime(2017, 9, 15, 0, 0),
#  datetime.datetime(2017, 9, 22, 0, 0),
#  datetime.datetime(2017, 9, 29, 0, 0)]
Use parse, but designate that the day is first:
parse(fall_start, dayfirst=True)
# datetime.datetime(2018, 8, 21, 0, 0)
Create a dataframe. A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet or database table. Think of a DataFrame as a group of Series objects that share an index (the column names):
data = {
    'dates': [
        '8/25/2017 18:47:05.069722',
        '9/1/2017 18:47:05.119994',
        '9/8/2017 18:47:05.178768',
        '9/15/2017 18:47:05.230071',
        '9/22/2017 18:47:05.230071',
        '9/29/2017 18:47:05.280592'],
    'complete': [1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(
    data,
    columns=['dates', 'complete'])
print(df)
#                        dates  complete
# 0  8/25/2017 18:47:05.069722         1
# 1   9/1/2017 18:47:05.119994         0
# 2   9/8/2017 18:47:05.178768         1
# 3  9/15/2017 18:47:05.230071         1
# 4  9/22/2017 18:47:05.230071         0
# 5  9/29/2017 18:47:05.280592         1
Convert df['dates'] from string to datetime:
pd.to_datetime(df['dates'])
# 0   2017-08-25 18:47:05.069722
# 1   2017-09-01 18:47:05.119994
# 2   2017-09-08 18:47:05.178768
# 3   2017-09-15 18:47:05.230071
# 4   2017-09-22 18:47:05.230071
# 5   2017-09-29 18:47:05.280592
# Name: dates, dtype: datetime64[ns]
3.5.7 Control Statements
3.5.7.1 Comparison
Computer programs do not only execute instructions. Occasionally, a choice needs to be made. Such a choice is based on a condition. Python has several conditional operators:
Operator  Function
>         greater than
<         smaller than
==        equals
!=        is not
Conditions are always combined with variables. A program can make a choice using the if keyword. For example:
x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
In this example, Correct! will only be printed if the variable x equals four. Python can also execute multiple conditions using the elif and else keywords:
x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
elif abs(4 - x) == 1:
    print('Wrong, but close!')
else:
    print('Wrong, way off!')
3.5.7.2 Iteration
To repeat code, the for keyword can be used. For example, to display the numbers from 1 to 10, we could write something like this:
for i in range(1, 11):
    print(i)
The second argument to range, 11, is not inclusive, meaning that the loop will only get to 10 before it finishes. Python itself starts counting from 0, so this code will also work:
for i in range(0, 10):
    print(i + 1)
In fact, the range function defaults to a starting value of 0, so it is equivalent to:
for i in range(10):
    print(i + 1)
We can also nest loops inside each other:
for i in range(0, 10):
    for j in range(0, 10):
        print(i, ' ', j)
In this case we have two nested loops. The code will iterate over the entire coordinate range (0,0) to (9,9).
3.5.8 Datatypes
3.5.8.1 Lists
see: https://www.tutorialspoint.com/python/python_lists.htm
Lists in Python are ordered sequences of elements, where each element can be accessed using a 0-based index.
To define a list, you simply list its elements between square brackets [ ]:
names = [
    'Albert',
    'Jane',
    'Liz',
    'John',
    'Abby']
# access the first element of the list
names[0]
# 'Albert'
# access the third element of the list
names[2]
# 'Liz'
You can also use a negative index if you want to start counting elements from the end of the list. Thus, the last element has index -1, the second before last element has index -2, and so on:
# access the last element of the list
names[-1]
# 'Abby'
# access the second last element of the list
names[-2]
# 'John'
Python also allows you to take whole slices of the list by specifying a beginning and end of the slice separated by a colon:
# the middle elements, excluding first and last
names[1:-1]
# ['Jane', 'Liz', 'John']
As you can see from the example, the starting index in the slice is inclusive and the ending one, exclusive.
Python provides a variety of methods for manipulating the members of a list.
You can add elements with append:
names.append('Liz')
names
# ['Albert', 'Jane', 'Liz',
#  'John', 'Abby', 'Liz']
As you can see, the elements in a list need not be unique.
Merge two lists with extend:
names.extend(['Lindsay', 'Connor'])
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Abby', 'Liz', 'Lindsay', 'Connor']
Find the index of the first occurrence of an element with index:
names.index('Liz')  # 2
Remove elements by value with remove:
names.remove('Abby')
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']
Remove elements by index with pop:
names.pop(1)
# 'Jane'
names
# ['Albert', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']
Notice that pop returns the element being removed, while remove does not.
If you are familiar with stacks from other programming languages, you can use insert and pop:
names.insert(0, 'Lincoln')
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay', 'Connor']
names.pop()
# 'Connor'
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']
The Python documentation contains a full list of list operations.
To go back to the range function you used earlier, it simply creates a list of numbers (in Python 3, wrap it in list() to see the elements):
range(10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
range(2, 10, 2)
# [2, 4, 6, 8]
3.5.8.2 Sets
Python lists can contain duplicates, as you saw previously:
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']
When we do not want this to be the case, we can use a set:
unique_names = set(names)
unique_names
# set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'])
Keep in mind that the set is an unordered collection of objects, thus we cannot access them by index:
unique_names[0]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: 'set' object does not support indexing
However, we can convert a set to a list easily:
unique_names = list(unique_names)
unique_names
# ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']
unique_names[0]
# 'Lincoln'
Notice that in this case, the order of elements in the new list matches the order in which the elements were displayed when we created the set. We had set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']) and now we have ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'].
You should not assume this is the case in general. That is, do not make any assumptions about the order of elements in a set when it is converted to any type of sequential data structure.
You can change a set’s contents using the add, remove and update methods, which correspond to the append, remove and extend methods in a list. In addition to these, set objects support the operations you may be familiar with from mathematical sets: union, intersection, difference, as well as operations to check containment. You can read about this in the Python documentation for sets.
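As a small sketch of these mathematical operations (the variables a and b are made up for this illustration):
a = {1, 2, 3}
b = {3, 4}

print(a.union(b))         # {1, 2, 3, 4}
print(a.intersection(b))  # {3}
print(a.difference(b))    # {1, 2}
print(3 in a)             # True, a containment check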
3.5.8.3 Removal and Testing for Membership in Sets
One important advantage of a set over a list is that access to elements is fast. If you are familiar with different data structures from a Computer Science class, the Python list is implemented by an array, while the set is implemented by a hash table.
We will demonstrate this with an example. Let us say we have a list and a set of the same number of elements (approximately 100 thousand):
import sys, random, timeit

nums_set = set([random.randint(0, sys.maxsize) for _ in range(10**5)])
nums_list = list(nums_set)
len(nums_set)
# 100000
We will use the timeit Python module to time 100 operations that test for the existence of a member in either the list or set:
timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_set), number=100)
# 0.0004038810729980469
timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_list), number=100)
# 0.398054122924804
The exact duration of the operations on your system will be different, but the takeaway will be the same: searching for an element in a set is orders of magnitude faster than in a list. This is important to keep in mind when you work with large amounts of data.
3.5.8.4 Dictionaries
One of the very important data structures in Python is a dictionary, also referred to as dict.
A dictionary represents a key value store:
person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
print("person['Name']:", person['Name'])
# person['Name']: Albert
print("person['Age']:", person['Age'])
# person['Age']: 100
A convenient way to print by named attributes is
print('{Name} {Age}'.format(**person))
This form of printing with the format statement and a reference to data increases the readability of the print statements.
You can delete elements with the following commands:
del person['Name']  # remove entry with key 'Name'
person
# {'Age': 100, 'Class': 'Scientist'}
person.clear()      # remove all entries in dict
person
# {}
del person          # delete entire dictionary
person
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# NameError: name 'person' is not defined
You can iterate over a dict:
person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
for item in person:
    print(item, person[item])
# Age 100
# Name Albert
# Class Scientist
3.5.8.5 Dictionary Keys and Values
You can retrieve both the keys and values of a dictionary using the keys() and values() methods of the dictionary, respectively:
person.keys()    # ['Age', 'Name', 'Class']
person.values()  # [100, 'Albert', 'Scientist']
Both methods return lists. Notice, however, that the order in which the elements appear in the returned lists (Age, Name, Class) is different from the order in which we listed the elements when we declared the dictionary initially (Name, Age, Class). It is important to keep this in mind:
You cannot make any assumptions about the order in which the elements of a dictionary will be returned by the keys() and values() methods.
However, you can assume that if you call keys() and values() in sequence, the order of elements will at least correspond in both methods. In the example Age corresponds to 100, Name to Albert, and Class to Scientist, and you will observe the same correspondence in general as long as keys() and values() are called one right after the other.
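A related convenience worth knowing is the items() method, which yields the key/value pairs together, so the correspondence never becomes an issue:
for key, value in person.items():
    print(key, value)
# Age 100
# Name Albert
# Class Scientist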
3.5.8.6 Counting with Dictionaries
One application of dictionaries that frequently comes up is counting the elements in a sequence. For example, say we have a sequence of coin flips:
import random

coin_flips = [
    random.choice(['heads', 'tails']) for _ in range(10)
]
# coin_flips
# ['heads', 'tails', 'heads',
#  'tails', 'heads', 'heads',
#  'tails', 'heads', 'heads', 'heads']
The actual list coin_flips will likely be different when you execute this on your computer, since the outcomes of the flips are random.
To compute the probabilities of heads and tails, we could count how many heads and tails we have in the list:
counts = {'heads': 0, 'tails': 0}
for outcome in coin_flips:
    assert outcome in counts
    counts[outcome] += 1
print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
# Probability of heads: 0.70
print('Probability of tails: %.2f' % (counts['tails'] / sum(counts.values())))
# Probability of tails: 0.30
In addition to how we use the dictionary counts to count the elements of coin_flips, notice a couple of things about this example:
1. We used the assert outcome in counts statement. The assert statement in Python allows you to easily insert debugging statements in your code to help you discover errors more quickly. assert statements are executed whenever the internal Python __debug__ variable is set to True, which is always the case unless you start Python with the -O option, which allows you to run optimized Python.
2. When we computed the probability of tails, we used the built-in sum function, which allowed us to quickly find the total number of coin flips. sum is one of many built-in functions you can read about here.
3.5.9 Functions
You can reuse code by putting it inside a function that you can call in other parts of your programs. Functions are also a good way of grouping code that logically belongs together in one coherent whole. A function has a unique name in the program. Once you call a function, it will execute its body, which consists of one or more lines of code:
def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

print(check_triangle(4, 5, 6))
The def keyword tells Python we are defining a function. As part of the definition, we have the function name, check_triangle, and the parameters of the function – variables that will be populated when the function is called.
We call the function with arguments 4, 5 and 6, which are passed in order into the parameters a, b and c. A function can be called several times with varying parameters. There is no limit to the number of function calls.
It is also possible to store the output of a function in a variable, so it can be reused:
def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

result = check_triangle(4, 5, 6)
print(result)
3.5.10 Classes
A class is an encapsulation of data and the processes that work on them. The data is represented in member variables, and the processes are defined in the methods of the class (methods are functions inside the class). For example, let us see how to define a Triangle class:
class Triangle(object):

    def __init__(self, length, width,
                 height, angle1, angle2, angle3):
        if not self._sides_ok(length, width, height):
            print('The sides of the triangle are invalid.')
        elif not self._angles_ok(angle1, angle2, angle3):
            print('The angles of the triangle are invalid.')
        self._length = length
        self._width = width
        self._height = height
        self._angle1 = angle1
        self._angle2 = angle2
        self._angle3 = angle3

    def _sides_ok(self, a, b, c):
        return \
            a < b + c and a > abs(b - c) and \
            b < a + c and b > abs(a - c) and \
            c < a + b and c > abs(a - b)

    def _angles_ok(self, a, b, c):
        return a + b + c == 180

triangle = Triangle(4, 5, 6, 35, 65, 80)
Python has full object-oriented programming (OOP) capabilities; however, we cannot cover all of them in this section, so if you need more information please refer to the Python docs on classes and OOP.
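As a minimal sketch of what such object-oriented features look like (the class names here are invented for this example), a subclass can override a method of its parent:
class Shape(object):
    def describe(self):
        return 'a shape'

class Circle(Shape):  # Circle inherits from Shape
    def describe(self):
        return 'a circle'

print(Circle().describe())  # a circle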
3.5.11 Modules
Now write this simple program and save it:
print("Hello world!")
As a check, make sure the file contains the expected contents on the command line:
$ cat hello.py
print("Hello world!")
To execute your program, pass the file as a parameter to the python command:
$ python hello.py
Hello world!
Files in which Python code is stored are called modules. You can execute a Python module from the command line like you just did, or you can import it in other Python code using the import statement.
Let us write a more involved Python program that will receive as input the lengths of the three sides of a triangle, and will output whether they define a valid triangle. A triangle is valid if the length of each side is less than the sum of the lengths of the other two sides and greater than the difference of the lengths of the other two sides:
"""Usage: check_triangle.py [-h] LENGTH WIDTH HEIGHT

Check if a triangle is valid.

Arguments:
  LENGTH   The length of the triangle.
  WIDTH    The width of the triangle.
  HEIGHT   The height of the triangle.

Options:
  -h --help
"""
from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__)
    a, b, c = int(arguments['LENGTH']), \
        int(arguments['WIDTH']), \
        int(arguments['HEIGHT'])
    valid_triangle = \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)
    print('Triangle with sides %d, %d and %d is valid: %r' % (
        a, b, c, valid_triangle
    ))
Assuming we save the program in a file called check_triangle.py, we can run it like so:
$ python check_triangle.py 4 5 6
Triangle with sides 4, 5 and 6 is valid: True
Let us break this down a bit.
1. We are importing the print_function and division modules from Python 3 like we did earlier in this section. It is a good idea to always include these in your programs.
2. We have defined a boolean expression that tells us if the sides that were input define a valid triangle. The result of the expression is stored in the valid_triangle variable: it is True if the conditions are met, and False otherwise.
3. We have used the backslash symbol \ to format our code nicely. The backslash simply indicates that the current line is being continued on the next line.
4. When we run the program, we do the check if __name__ == '__main__'. __name__ is an internal Python variable that allows us to tell whether the current file is being run from the command line (value '__main__'), or is being imported by a module (the value will then be the name of the module). Thus, with this statement we are just making sure the program is being run from the command line.
5. We are using the docopt module to handle command line arguments. The advantage of using this module is that it generates a usage help statement for the program and enforces command line arguments automatically. All of this is done by parsing the docstring at the top of the file.
6. In the print function, we are using Python’s string formatting capabilities to insert values into the string we are displaying.
3.5.12 Lambda Expressions
As opposed to normal functions in Python, which are defined using the def keyword, lambda functions in Python are anonymous functions which do not have a name and are defined using the lambda keyword. The generic syntax of a lambda function is of the form lambda arguments: expression, as shown in the following example:
greeter = lambda x: print('Hello %s!' % x)
greeter('Albert')
As you could probably guess, the result is:
Hello Albert!
Now consider the following example:
power2 = lambda x: x ** 2
The power2 function defined in the expression is equivalent to the following definition:
def power2(x):
    return x ** 2
Lambda functions are useful when you need a function for a short period of time. Note that they can also be very useful when passed as an argument to other built-in functions that take a function as an argument, e.g. filter() and map(). In the next example we show how a lambda function can be combined with the filter function. Consider the array all_names, which contains five words that rhyme together. We want to filter the words that contain the word name. To achieve this, we pass the function lambda x: 'name' in x as the first argument. This lambda function returns True if the word name exists as a substring in the string x. The second argument of the filter function is the array of names, i.e. all_names.
all_names = ['surname', 'rename', 'nickname', 'acclaims', 'defame']
filtered_names = list(filter(lambda x: 'name' in x, all_names))
print(filtered_names)
# ['surname', 'rename', 'nickname']
As you can see, the names are successfully filtered as we expected.
In Python 3, the filter function returns a filter object, an iterator which gets lazily evaluated. This means we can neither access the elements of the filter object with an index nor use len() to find its length. We can, however, convert the filter object to a list:
list_a = [1, 2, 3, 4, 5]
filter_obj = filter(lambda x: x % 2 == 0, list_a)
# convert the filter object to a list
even_num = list(filter_obj)
print(even_num)
# Output: [2, 4]
In Python, we can have a small, usually single-line, anonymous function called a lambda function, which can have any number of arguments just like a normal function, but with only one expression and no return statement. The result of this expression can be applied to a value.
Basic syntax:
lambda arguments: expression
For example, consider this function in Python:
def multiply(a, b):
    return a * b

# call the function
multiply(3, 5)  # outputs: 15
The same function can be written as a lambda function. This function, named multiply, has 2 arguments and returns their multiplication.
The lambda equivalent for this function would be:
multiply = lambda a, b: a * b
print(multiply(3, 5))
# outputs: 15
Here a and b are the 2 arguments and a * b is the expression whose value is returned as an output.
Also, we do not need to assign a lambda function to a variable:
(lambda a, b: a * b)(3, 5)
Lambda functions are mostly passed as a parameter to a function which expects a function object, like in map or filter.
3.5.12.1 map
The basic syntax of the map function is
map(function_object, iterable1, iterable2, ...)
The map function expects a function object and any number of iterables, like list or dictionary. It executes the function_object for each element in the sequence and returns a list of the elements modified by the function object.
Example:
def multiply(x):
    return x * 2

map(multiply, [2, 4, 6, 8])
# Output [4, 8, 12, 16]
If we want to write the same function using a lambda function:
map(lambda x: x * 2, [2, 4, 6, 8])
# Output [4, 8, 12, 16]
3.5.12.2 dictionary
Now, let us see how we can iterate over a dictionary using map and lambda. Let us say we have a list of dictionary objects:
dict_movies = [
    {'movie': 'avengers', 'comic': 'marvel'},
    {'movie': 'superman', 'comic': 'dc'}]
We can iterate over this list and read its elements using map and lambda functions in the following way:
map(lambda x: x['movie'], dict_movies)  # Output: ['avengers', 'superman']
map(lambda x: x['comic'], dict_movies)  # Output: ['marvel', 'dc']
map(lambda x: x['movie'] == "avengers", dict_movies)
# Output: [True, False]
In Python 3, the map function returns an iterator or map object which gets lazily evaluated, which means we can neither access the elements of the map object with an index nor use len() to find the length of the map object. We can force convert the map output, i.e. the map object, to a list as shown next:
map_output = map(lambda x: x * 2, [1, 2, 3, 4])
print(map_output)
# Output: map object: <map object at 0x04D6BAB0>
list_map_output = list(map_output)
print(list_map_output)  # Output: [2, 4, 6, 8]
3.5.13 Iterators
In Python, the iterator protocol is defined using two methods: __iter__() and __next__(). The former returns the iterator object and the latter returns the next element of a sequence. Some advantages of iterators are as follows:
- Readability
- Support for sequences of infinite length
- Saving resources
There are several built-in objects in Python which implement the iterator protocol, e.g. string, list, dictionary. In the following example, we create a new class that follows the iterator protocol. We then use the class to generate log2 of numbers:
from math import log2

class LogTwo:
    "Implements an iterator of log two"

    def __init__(self, last=0):
        self.last = last

    def __iter__(self):
        self.current_num = 1
        return self

    def __next__(self):
        if self.current_num <= self.last:
            result = log2(self.current_num)
            self.current_num += 1
            return result
        else:
            raise StopIteration

L = LogTwo(5)
i = iter(L)
print(next(i))
print(next(i))
print(next(i))
print(next(i))
As you can see, we first create an instance of the class and assign its __iter__() function to a variable called i. Then by calling the next() function four times, we get the following output:
$ python iterator.py
0.0
1.0
1.584962500721156
2.0
As you probably noticed, the lines are log2() of 1, 2, 3, 4 respectively.
3.5.14 Generators
Before we go to generators, please understand iterators. Generators are also iterators, but they can only be iterated over once. That is because generators do not store their values in memory; instead they generate the values on the go. If we want to print those values, we can either simply iterate over them or use a for loop.
3.5.14.1 Generators with function
For example, we have a function named multiplyBy10 which prints all the input numbers multiplied by 10:
def multiplyBy10(numbers):
    result = []
    for i in numbers:
        result.append(i * 10)
    return result

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
print(new_numbers)  # Output: [10, 20, 30, 40, 50]
Now, if we want to use a generator here, then we make the following changes:
def multiplyBy10(numbers):
    for i in numbers:
        yield(i * 10)

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
print(new_numbers)        # Output: generator object
print(next(new_numbers))  # Output: 10
In generators, we use yield in place of return. So when we try to print the new_numbers list now, it just prints the generator object. The reason for this is that generators do not hold any value in memory; they yield one result at a time. Essentially, the generator is just waiting for us to ask for the next result. To print the next result we can just say print(next(new_numbers)): it reads the first value, multiplies it by 10, and yields the value 10. In this case we can call print(next(new_numbers)) 5 times to print all numbers, and if we do it a 6th time we will get an error, StopIteration, which means the generator has exhausted its limit and has no 6th element to print.
3.5.14.2 Generators using for loop
If we now want to print the complete list of values, then we can simply do:
def multiplyBy10(numbers):
    for i in numbers:
        yield(i * 10)

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
for num in new_numbers:
    print(num)
The output will be:
10
20
30
40
50
3.5.14.3 Generators with List Comprehension
Python has something called list comprehension; if we use this, then we can replace the complete function definition with just:
new_numbers = [x * 10 for x in [1, 2, 3, 4, 5]]
print(new_numbers)  # Output: [10, 20, 30, 40, 50]
Here the point to note is that the square brackets [] in the first line are very important. If we change them to (), then again we will get a generator object:
new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(new_numbers)  # Output: generator object
We can get the individual elements again from the generator if we do a for loop over new_numbers, like we did previously. Alternatively, we can convert it into a list and then print it:
new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(list(new_numbers))  # Output: [10, 20, 30, 40, 50]
But if we convert this into a list, then we lose on performance, which we will look at next.
3.5.14.4 Why use Generators?
Generators are better for performance because they do not hold values in memory. With the small examples we provide here this is not a big deal, since we are dealing with a small amount of data, but consider a scenario where the records are in the millions. If we try to convert millions of data elements into a list, then that will definitely make an impact on memory and performance, because everything will be in memory.
Let us see an example of how generators help with performance. First, without generators, a normal function takes 1 million records and returns the result (people) for all of them.
We are just giving approximate values to compare with the next execution, but if we try to run it, we will see a serious consumption of memory along with a good amount of time taken.
import random
import time
import mem_profile  # helper module used in this example to report the process memory usage

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_list(people):
    result = []
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result

t1 = time.clock()
people = people_list(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))

# Output
Memory (Before): 15 Mb
Memory (After): 318 Mb
Took 1.2 seconds
import random
import time
import mem_profile  # same helper module as before

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_generator(people):
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.clock()
people = people_generator(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))

# Output
Memory (Before): 15 Mb
Memory (After): 15 Mb
Took 0.01 seconds
Now, after running the same code using generators, we see a significant performance boost, with almost 0 seconds. The reason behind this is that in the case of generators we do not keep anything in memory; the system just reads one element at a time and yields it.
3.6 LIBRARIES
3.6.1 Python Modules ☁
Often you may need functionality that is not present in Python’s standard library. In this case you have two options:
- implement the features yourself
- use a third-party library that has the desired features
Often you can find a previous implementation of what you need. Since this is a common situation, there is a service supporting it: the Python Package Index (or PyPI for short).
Our task here is to install the autopep8 tool from PyPI. This will allow us to illustrate the use of virtual environments using the pyenv or virtualenv command, and installing and uninstalling PyPI packages using pip.
3.6.1.1 Updating Pip
It is important that you have the newest version of pip installed for your version of Python. Let us assume your Python is registered with python and you use pyenv; then you can update pip with
pip install -U pip
without interfering with a potentially system-wide installed version of pip that may be needed by the system default version of Python. See the section about pyenv for more details.
3.6.1.2 Using pip to Install Packages
Let us now look at another important tool for Python development: the Python Package Index, or PyPI for short. PyPI provides a large set of third-party Python packages. If you want to do something in Python, first check PyPI, as odds are someone already ran into the problem and created a package solving it.
In order to install a package from PyPI, use the pip command. We can search PyPI for packages:
$ pip search --trusted-host pypi.python.org autopep8 pylint
It appears that the top two results are what we want, so install them:
$ pip install --trusted-host pypi.python.org autopep8 pylint
This will cause pip to download the packages from PyPI, extract them, check their dependencies and install those as needed, then install the requested packages.
You can skip the ‘--trusted-host pypi.python.org’ option if you have a patched urllib3 on Python 2.7.9.
3.6.1.3 GUI
3.6.1.3.1 GUIZero
Install guizero with the following command:
sudo pip install guizero
For a comprehensive tutorial on guizero, click here.
3.6.1.3.2 Kivy
You can install Kivy on macOS as follows:
brew install pkg-config sdl2 sdl2_image sdl2_ttf sdl2_mixer gstreamer
pip install -U Cython
pip install kivy
pip install pygame
A hello world program for kivy is included in the cloudmesh.robot repository, which you can find here:
https://github.com/cloudmesh/cloudmesh.robot/tree/master/projects/kivy
To run the program, please download it or execute it in cloudmesh.robot as follows:
cd cloudmesh.robot/projects/kivy
python swim.py
To create standalone packages with kivy, please see:
- https://kivy.org/docs/guide/packaging-osx.html
3.6.1.4 Formatting and Checking Python Code
First, get the bad code:
$ wget --no-check-certificate http://git.io/pXqb -O bad_code_example.py
Examine the code:
$ emacs bad_code_example.py
As you can see, this is very dense and hard to read. Cleaning it up by hand would be a time-consuming and error-prone process. Luckily, this is a common problem, so there exist a couple of packages to help in this situation.
3.6.1.5 Using autopep8
We can now run the bad code through autopep8 to fix formatting problems:
$ autopep8 bad_code_example.py > code_example_autopep8.py
Let us look at the result. This is considerably better than before. It is easy to tell what the example1 and example2 functions are doing.
It is a good idea to develop a habit of using autopep8 in your Python development workflow. For instance: use autopep8 to check a file, and if it passes, make any changes in place using the -i flag:
$ autopep8 file.py     # check output to see if it passes
$ autopep8 -i file.py  # update in place
If you use PyCharm, you have the ability to use a similar function while pressing on Inspect Code.
3.6.1.6 Writing Python 3 Compatible Code
To write Python 2 and 3 compatible code, we recommend that you take a look at: http://python-future.org/compatible_idioms.html
3.6.1.7 Using Python on FutureSystems
This is only important if you use FutureSystems resources.
In order to use Python you must log in to your FutureSystems account. Then at the shell prompt execute the following command:
$ module load python
This will make the python and virtualenv commands available to you.
The details of what the module load command does are described in the future lesson modules.
3.6.1.8 Ecosystem
3.6.1.8.1 pypi
The Python Package Index is a large repository of software for the Python programming language containing a large number of packages. The nice thing about PyPI is that many packages can be installed with the program pip.
To do so you have to locate the <package_name>, for example with the search function in PyPI, and say on the command line:
$ pip install <package_name>
where package_name is the string name of the package. An example would be the package called cloudmesh_client, which you can install with:
$ pip install cloudmesh_client
If all goes well, the package will be installed.
3.6.1.8.2 Alternative Installations
The basic installation of Python is provided by python.org. However, others claim to have alternative environments that allow you to install Python. These include
- Canopy
- Anaconda
- IronPython
Typically they include not only the Python compiler but also several useful packages. It is fine to use such environments for the class, but it should be noted that in both cases not every Python library may be available for install in the given environment. For example, if you need to use cloudmesh client, it may not be available as a conda or Canopy package. This is also the case for many other cloud-related and useful Python libraries. Hence, we do recommend, if you are new to Python, to use the distribution from python.org, and use pip and virtualenv.
Additionally, some Python versions have platform-specific libraries or dependencies; for example, Cocoa libraries, .NET, or other frameworks are examples. For the assignments and the projects such platform-dependent libraries are not to be used.
If, however, you can write platform-independent code that works on Linux, macOS and Windows while using the python.org version, but develop it with any of the other tools, that is just fine. However, it is up to you to guarantee that this independence is maintained and implemented. You do have to write requirements.txt files that will install the necessary Python libraries in a platform-independent fashion. The homework assignment PRG1 even has a requirement to do so.
In order to provide platform independence, we have given in the class a minimal Python version that we have tested with hundreds of students: python.org. If you use any other version, that is your decision. Additionally, some students not only use python.org but have used iPython, which is fine too. However, this class is not only about Python, but also about how to have your code run on any platform. The homework is designed so that you can identify a setup that works for you.
However, we have concerns if you, for example, wanted to use chameleon cloud, which we require you to access with cloudmesh. cloudmesh is not available as a conda, Canopy, or other framework package. Cloudmesh client is available from pypi, which is standard and should be supported by the frameworks. We have not tested cloudmesh on any other Python version than python.org, which is the open source community standard. None of the other versions are standard.
In fact, we had students over the summer using Canopy on their machines, and they got confused, as they now had multiple Python versions and did not know how to switch between them and activate the correct version. Certainly, if you know how to do that, then feel free to use Canopy, and if you want to use Canopy, all this is up to you. However, the homework and project require you to make your program portable to python.org. If you know how to do that, even if you use Canopy, Anaconda, or any other Python version, that is fine. Graders will test your programs on a python.org installation, and not Canopy, Anaconda, or IronPython, while using virtualenv. It is obvious why: if you do not know the answer, you may want to think about the fact that every time they test a program they need to create a new virtualenv and run vanilla Python in it. If we were to run two installs in the same system, this would not work, as we do not know if one student will cause a side effect for another. Thus, we as instructors do not just have to look at your code, but at the code of hundreds of students with different setups. This is a non-scalable solution, as every time we test out code from a student, we would have to wipe out the OS, install it anew, install a new version of whatever Python you have elected, become familiar with that version, and so on. This is the reason why the open source community is using python.org. We follow best practices. Using other versions is not a community best practice, but may work for an individual.
We have, however, in regards to using other Python versions, additional bonus projects, such as:
- deploy, run, and document cloudmesh on IronPython
- deploy, run, and document cloudmesh on Anaconda; develop a script to generate a conda package from github
- deploy, run, and document cloudmesh on Canopy; develop a script to generate a package from github
- other documentation that would be useful
3.6.1.9 Resources
If you are unfamiliar with programming in Python, we also refer you to some of the numerous online resources. You may wish to start with Learn Python or the book Learn Python the Hard Way. Other options include Tutorials Point or Code Academy, and the Python wiki page contains a long list of references for learning as well. Additional resources include:
- https://virtualenvwrapper.readthedocs.io
- https://github.com/yyuu/pyenv
- https://amaral.northwestern.edu/resources/guides/pyenv-tutorial
- https://godjango.com/96-django-and-python-3-how-to-setup-pyenv-for-multiple-pythons/
- https://www.accelebrate.com/blog/the-many-faces-of-python-and-how-to-manage-them/
- http://ivory.idyll.org/articles/advanced-swc/
- http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
- http://www.youtube.com/watch?v=0vJJlVBVTFg
- http://www.korokithakis.net/tutorials/python/
- http://www.afterhoursprogramming.com/tutorial/Python/Introduction/
- http://www.greenteapress.com/thinkpython/thinkCSpy.pdf
- https://docs.python.org/3.3/tutorial/modules.html
- https://www.learnpython.org/en/Modules_and_Packages
- https://docs.python.org/2/library/datetime.html
- https://chrisalbon.com/python/strings_to_datetime.html
A very long list of useful information is also available from
- https://github.com/vinta/awesome-python
- https://github.com/rasbt/python_reference
This list may be useful as it also contains links to data visualization and manipulation libraries, and AI tools and libraries. Please note that for this class you can reuse such libraries if not otherwise stated.
3.6.1.9.1 Jupyter Notebook Tutorials
A Short Introduction to Jupyter Notebooks and NumPy. To view the notebook, open this link in a background tab: https://nbviewer.jupyter.org/ and copy and paste the following link in the URL input area: https://cloudmesh.github.io/classes/lesson/prg/Jupyter-NumPy-tutorial-I523-F2017.ipynb Then hit Go.
3.6.1.10 Exercises
E.Python.Lib.1:
Write a Python program called iterate.py that accepts an integer n from the command line. Pass this integer to a function called iterate.
The iterate function should then iterate from 1 to n. If the i-th number is a multiple of three, print multiple of 3; if a multiple of 5, print multiple of 5; if a multiple of both, print multiple of 3 and 5; else print the value.
E:Python.Lib.2:
1. Create a pyenv or virtualenv ~/ENV
2. Modify your ~/.bashrc shell file to activate your environment upon login.
3. Install the docopt Python package using pip
4. Write a program that uses docopt to define a command line program. Hint: modify the iterate program.
5. Demonstrate the program works.
3.6.2 Data Management ☁
Obviously, when dealing with big data we may not only be dealing with data in one format but in many different formats. It is important that you are able to master such formats and seamlessly integrate them in your analysis. Thus we provide some simple examples of which different data formats exist and how to use them.
3.6.2.1 Formats
3.6.2.1.1 Pickle
Python pickle allows you to save data in a Python-native format into a file that can later be read in by other programs. However, the data format may not be portable among different Python versions, thus the format is often not suitable to store information. Instead, we recommend for standard data to use either json or yaml.
import pickle

flavor = {
    "small": 100,
    "medium": 1000,
    "large": 10000
}

pickle.dump(flavor, open("data.p", "wb"))
To read it back in, use
flavor = pickle.load(open("data.p", "rb"))
3.6.2.1.2 Text Files
To read text files into a variable called content, you can use
content = open('filename.txt', 'r').read()
You can also use the following code while using the convenient with statement:
with open('filename.txt', 'r') as file:
    content = file.read()
To split up the lines of the file into an array you can do
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()
This can also be done with the built-in readlines function:
lines = open('filename.txt', 'r').readlines()
In case the file is too big, you will want to read the file line by line:
with open('filename.txt', 'r') as file:
    for line in file:
        print(line)
3.6.2.1.3 CSV Files
Often data is contained in comma separated values (CSV) within a file. To read such files you can use the csv package:
import csv

with open('data.csv', 'r') as f:
    contents = csv.reader(f)
    for row in contents:
        print(row)
Using pandas you can read them as follows:
import pandas as pd

df = pd.read_csv("example.csv")
There are many other modules and libraries that include CSV read functions. In case you need to split a single line by comma, you may also use the split function. However, remember it will split at every comma, including those contained in quotes. So this method, although looking convenient at first, has limitations.
3.6.2.1.4 Excel spreadsheets
pandas contains a method to read Excel files:
import pandas as pd

filename = 'data.xlsx'
data = pd.ExcelFile(filename)
df = data.parse('Sheet1')
3.6.2.1.5 YAML
YAML is a very important format, as it allows you to easily structure data in hierarchical fields. It is frequently used to coordinate programs, using yaml as the specification for configuration files, but also for data files. To read in a yaml file, the following code can be used:
import yaml

with open('data.yaml', 'r') as f:
    content = yaml.load(f)
The nice part is that this code can also be used to verify if a file is valid yaml. To write data out we can use
with open('data.yml', 'w') as f:
    yaml.dump(data, f, default_flow_style=False)
The flow style set to False formats the data in a nice, readable fashion with indentations.
3.6.2.1.6 JSON
JSON (JavaScript Object Notation) is a lightweight data interchange format that is widely used by web services. To read a json file into a dictionary, you can use:
import json

with open('strings.json') as f:
    content = json.load(f)
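Writing works analogously with json.dump; the following is a small sketch where the dictionary and the file name are made up for this example:
import json

data = {"small": 100, "medium": 1000}
with open('data.json', 'w') as f:
    json.dump(data, f, indent=2)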
3.6.2.1.7 XML
The XML format is extensively used to transport data across the web. It has a hierarchical data format, and can be represented in the form of a tree.
A sample XML document looks like:
<data>
    <items>
        <item name="item-1"></item>
        <item name="item-2"></item>
        <item name="item-3"></item>
    </items>
</data>
Python provides the ElementTree XML API to parse and create XML data.
Importing XML data from a file:
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
Reading XML data from a string directly:
root = ET.fromstring(data_as_string)
Iterating over child nodes in a root:
for child in root:
    print(child.tag, child.attrib)
Modifying XML data using ElementTree:
Modifying text within a tag of an element using the .text attribute:
tag.text = new_data
tree.write('output.xml')
Adding/modifying an attribute using the .set() method:
tag.set('key', 'value')
tree.write('output.xml')
Other Python modules used for parsing XML data include
- minidom: https://docs.python.org/3/library/xml.dom.minidom.html
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
3.6.2.1.8 RDF
To read RDF files you will need to install RDFLib with
$ pip install rdflib
This will then allow you to read RDF files:
from rdflib.graph import Graph
g = Graph()
g.parse("filename.rdf", format="format")

for entry in g:
    print(entry)
Good examples of using RDF are provided on the RDFLib web page at https://github.com/RDFLib/rdflib
From the web page we also showcase how to directly process RDF data from the web:
import rdflib
g = rdflib.Graph()
g.load('http://dbpedia.org/resource/Semantic_Web')

for s, p, o in g:
    print(s, p, o)
3.6.2.1.9 PDF
The Portable Document Format (PDF) has been made available by Adobe Inc. royalty free. This has enabled PDF to become a worldwide adopted format that also has been standardized in 2008 (ISO/IEC 32000-1:2008, https://www.iso.org/standard/51502.html). A lot of research is published in papers, making PDF one of the de-facto standards for publishing. However, PDF is difficult to parse and is focused on high quality output instead of data representation. Nevertheless, tools to manipulate PDF exist:
PDFMiner
https://pypi.python.org/pypi/pdfminer/ allows the simple translation of PDF into text that can then be further mined. The manual page helps to demonstrate some examples: http://euske.github.io/pdfminer/index.html.
pdf-parser.py
https://blog.didierstevens.com/programs/pdf-tools/ parses pdf documents and identifies some structural elements that can then be further processed.
If you know about other tools, let us know.
3.6.2.1.10 HTML
A very powerful library to parse HTML web pages is provided with https://www.crummy.com/software/BeautifulSoup/
More details about it are provided in the documentation page https://www.crummy.com/software/BeautifulSoup/bs4/doc/
TODO: Students can contribute a section
BeautifulSoup is a Python library to parse, process and edit HTML documents.
To install BeautifulSoup, use the pip command as follows:
$ pip install beautifulsoup4
In order to process HTML documents, a parser is required. BeautifulSoup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers, like the lxml parser, which is commonly used [1].
The following command can be used to install the lxml parser:
$ pip install lxml
To begin with, we import the package and instantiate an object as follows for an HTML document html_handle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_handle, 'lxml')
Now we will discuss a few functions, attributes and methods of BeautifulSoup.
prettify function
The prettify() method will turn a BeautifulSoup parse tree into a nicely formatted Unicode string, with a separate line for each HTML/XML tag and string. It is analogous to the pprint() function. The object created above can be viewed by printing the prettified version of the document as follows:
print(soup.prettify())
tag object
A tag object refers to tags in the HTML document. It is possible to go down to the inner levels of the DOM tree. To access a tag div under the tag body, it can be done as follows:
body_div = soup.body.div
print(body_div.prettify())
The attrs attribute of the tag object returns a dictionary of all the defined attributes of the HTML tag as keys.
has_attr() method
To check if a tag object has a specific attribute, the has_attr() method can be used:
if body_div.has_attr('p'):
    print('The value of \'p\' attribute is:', body_div['p'])
tag object attributes
- name - This attribute returns the name of the tag selected.
- attrs - This attribute returns a dictionary of all the defined attributes of the HTML tag as keys.
- contents - This attribute returns a list of contents enclosed within the HTML tag.
- string - This attribute returns the text enclosed within the HTML tag. This returns None if there are multiple children.
- strings - This overcomes the limitation of string and returns a generator of all strings enclosed within the given tag.
The following code showcases usage of the above discussed attributes:
body_tag = soup.body
print('Name of the tag:', body_tag.name)
attrs = body_tag.attrs
print('The attributes defined for body tag are:', attrs)
print('The contents of \'body\' tag are:\n', body_tag.contents)
print('The string value enclosed in \'body\' tag is:', body_tag.string)
for s in body_tag.strings:
    print(repr(s))
Searching the Tree
- The find() function takes a filter expression as argument and returns the first match found.
- The find_all() function returns a list of all the matching elements.
search_elem = soup.find('a')
print(search_elem.prettify())
search_elems = soup.find_all("a", class_="sample")
pprint(search_elems)
- The select() function can be used to search the tree using CSS selectors.
# Select 'a' tag with class 'sample'
a_tag_elems = soup.select('a.sample')
print(a_tag_elems)
3.6.2.1.11 ConfigParser
TODO: Students can contribute a section
https://pymotw.com/2/ConfigParser/
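Until such a section is contributed, the following minimal sketch shows how the standard library configparser module reads an INI-style file (the file name example.ini and its keys are made up for this illustration):
import configparser

config = configparser.ConfigParser()
config.read('example.ini')
# print every section with its key/value pairs
for section in config.sections():
    print(section, dict(config[section]))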
3.6.2.1.12 ConfigDict
https://github.com/cloudmesh/cloudmesh-common/blob/master/cloudmesh/common/ConfigDict.py
3.6.2.2 Encryption
Often we need to protect the information stored in a file. This is achieved with encryption. There are many methods of supporting encryption, and even if a file is encrypted it may be a target of attacks. Thus it is not only important to encrypt data that you do not want others to see, but also to make sure that the system on which the data is hosted is secure. This is especially important if we talk about big data, which has a potentially large effect if it gets into the wrong hands.
To illustrate one type of encryption that is non-trivial, we have chosen to demonstrate how to encrypt a file with an ssh key. In case you have openssl installed on your system, this can be achieved as follows:
#!/bin/sh
# Step 1. Creating a file with data
echo "Big Data is the future." > file.txt
# Step 2. Create the pem
openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/id_rsa.pub.pem
# Step 3. look at the pem file to illustrate how it looks like (optional)
cat ~/.ssh/id_rsa.pub.pem
# Step 4. encrypt the file into secret.txt
openssl rsautl -encrypt -pubin -inkey ~/.ssh/id_rsa.pub.pem -in file.txt -out secret.txt
# Step 5. decrypt the file and print the contents to stdout
openssl rsautl -decrypt -inkey ~/.ssh/id_rsa -in secret.txt
Most important here are Step 4, which encrypts the file, and Step 5, which decrypts the file. Using the Python os module it is straightforward to implement this. However, we are providing in cloudmesh a convenient class that makes the use in Python very simple:
from cloudmesh.common.ssh.encrypt import EncryptFile

e = EncryptFile('file.txt', 'secret.txt')
e.encrypt()
e.decrypt()
In our class we initialize it with the locations of the file that is to be encrypted and decrypted. To initiate that action, just call the methods encrypt and decrypt.
3.6.2.3 Database Access
TODO: Students: define conventional database access section
see: https://www.tutorialspoint.com/python/python_database_access.htm
3.6.2.4 SQLite
TODO: Students can contribute to this section
https://www.sqlite.org/index.html
https://docs.python.org/3/library/sqlite3.html
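Until the section is contributed, here is a minimal sketch of the standard library sqlite3 module; the database file example.db and the table are made up for this illustration:
import sqlite3

# the file is created if it does not exist
con = sqlite3.connect('example.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS grades (name TEXT, points INTEGER)')
cur.execute('INSERT INTO grades VALUES (?, ?)', ('Albert', 100))
con.commit()
for row in cur.execute('SELECT * FROM grades'):
    print(row)
con.close()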
3.6.2.4.1 Exercises
E:Encryption.1:
Test the shell script to replicate how this example works.
E:Encryption.2:
Test the cloudmesh encryption class.
E:Encryption.3:
What other encryption methods exist? Can you provide an example and contribute to the section?
E:Encryption.4:
What is the issue of encryption that makes it challenging for Big Data?
E:Encryption.5:
Given a test dataset with many text files, how long will it take to encrypt and decrypt them on various machines? Write a benchmark that you test. Develop this benchmark as a group, and test out the time it takes to execute it on a variety of platforms.
3.6.3 Plotting with matplotlib ☁

A brief overview of plotting with matplotlib along with examples is provided. First matplotlib must be installed, which can be accomplished with pip install as follows:

We will start by plotting a simple line graph using built-in numpy functions for sine and cosine. The first step is to import the proper libraries, shown next.

Next we will define the values for the x axis; we do this with the linspace function in numpy. The first two parameters are the starting and ending points; these must be scalars. The third parameter is optional and defines the number of samples to be generated between the starting and ending points; this value must be an integer. Additional parameters for the linspace utility are documented at https://numpy.org/doc/stable/reference/generated/numpy.linspace.html

Now we will use the sine and cosine functions in order to generate y values; for this we will use the values of x as the argument of both our sine and cosine functions, i.e. cos(x).

You can display the values of the three parameters we have defined by typing them in a python shell.

Having defined x and y values we can generate a line plot, and since we imported matplotlib.pyplot as plt we simply use plt.plot.
$ pip install matplotlib

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)

x
array([-3.14159265, -2.72271363, -2.30383461, -1.88495559, -1.46607657,
       -1.04719755, -0.62831853, -0.20943951,  0.20943951,  0.62831853,
        1.04719755,  1.46607657,  1.88495559,  2.30383461,  2.72271363,
        3.14159265])
We can display the plot using plt.show(), which will pop up a figure displaying the plot defined.

Additionally we can add the sine line to our line graph by entering the following.

Invoking plt.show() now will show a figure with both sine and cosine lines displayed. Now that we have a figure generated, it would be useful to label the x and y axis and provide a title. This is done by the following three commands:

Along with axis labels and a title, another useful figure feature may be a legend. In order to create a legend you must first designate a label for the line; this label will be what shows up in the legend. The label is defined in the initial plt.plot(x,y) instance; next is an example.

Then in order to display the legend the following command is issued:

The location is specified by using upper or lower and left or right. Naturally all these commands can be combined and put in a file with the .py extension and run from the command line.
plt.plot(x, cos)
plt.show()

plt.plot(x, sin)

plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")

plt.plot(x, cos, label="cosine")

plt.legend(loc='upper right')

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)
plt.plot(x, cos, label="cosine")
plt.plot(x, sin, label="sine")
plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")
plt.legend(loc='upper right')
plt.show()
An example of a bar chart is presented next using data from [T:fast-cars].

You can customize plots further by using plt.style.use() in python 3. If you provide the following command inside a python command shell you will see a list of available styles.

An example of using a predefined style is shown next.

Up to this point we have only showcased how to display figures through python output; however, web browsers are a popular way to display figures. One example is Bokeh; the following lines can be entered in a python shell and the figure is output to a browser.
3.6.4 DocOpts ☁

When we want to design command line arguments for python programs, we have many options. However, as our approach is to create documentation first, docopts provides also a good approach for Python. The code for it is located at
https://github.com/docopt/docopt
import matplotlib.pyplot as plt

x = ['Toyota Prius',
     'Tesla Roadster',
     'Bugatti Veyron',
     'Honda Civic',
     'Lamborghini Aventador']
horse_power = [120, 288, 1200, 158, 695]

x_pos = [i for i, _ in enumerate(x)]

plt.bar(x_pos, horse_power, color='green')
plt.xlabel("Car Model")
plt.ylabel("Horse Power (Hp)")
plt.title("Horse Power for Selected Cars")
plt.xticks(x_pos, x)
plt.show()
print(plt.style.available)

plt.style.use('seaborn')

from bokeh.io import show
from bokeh.plotting import figure

x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]

p = figure()
p.circle(x=x_values, y=y_values)
show(p)
It can be installed with

$ pip install docopt

A sample program is located at
https://github.com/docopt/docopt/blob/master/examples/options_example.py
A sample program using docopts for our purposes looks as follows:

Another good feature of using docopts is that we can use the same verbal description in other programming languages, as showcased in this book.
3.6.5 Cloudmesh Command Shell ☁

3.6.5.1 CMD5

Python's CMD (https://docs.python.org/2/library/cmd.html) is a very useful package to create command line shells. However, it does not allow the dynamic integration of newly defined commands. Furthermore, additions to CMD need to be done within the same source tree. To simplify developing commands by a number of people and to have a dynamic plugin mechanism, we developed cmd5. It is a rewrite of our earlier efforts in cloudmesh client and cmd3.

3.6.5.1.1 Resources
"""Cloudmesh VM management

Usage:
  cm-go vm start NAME [--cloud=CLOUD]
  cm-go vm stop NAME [--cloud=CLOUD]
  cm-go set --cloud=CLOUD
  cm-go -h | --help
  cm-go --version

Options:
  -h --help      Show this screen.
  --version      Show version.
  --cloud=CLOUD  The name of the cloud.

Arguments:
  NAME  The name of the VM

"""
from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__, version='1.0.0rc2')
    print(arguments)
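When the sample program above is invoked, for example, as cm-go vm start myvm --cloud=chameleon, docopt returns the arguments as a plain dictionary, roughly like the following (the invocation and cloud name are illustrative):

{'--cloud': 'chameleon',
 '--help': False,
 '--version': False,
 'NAME': 'myvm',
 'set': False,
 'start': True,
 'stop': False,
 'vm': True}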
The source code for cmd5 is located in github:
https://github.com/cloudmesh/cmd5
We have discussed in Section ¿sec:cloudmesh-cms-install? how to install cloudmesh as a developer and have access to the source code in a directory called cm. As you read this document we assume you are a developer and can skip the next section.
3.6.5.1.2 Installation from source

WARNING: DO NOT EXECUTE THIS IF YOU ARE A DEVELOPER OR YOUR ENVIRONMENT WILL NOT PROPERLY WORK.

However, if you are a user of cloudmesh you can install it with

3.6.5.1.3 Execution

To run the shell you can activate it with the cms command. cms stands for cloudmesh shell:

It will print the banner and enter the shell:

To see the list of commands you can say:

To see the manual page for a specific command, please use:
$ pip install cloudmesh-cmd5

(ENV2) $ cms
+-------------------------------------------------------+
|            (ASCII art "cloudmesh" banner)             |
+-------------------------------------------------------+
|                  Cloudmesh CMD5 Shell                 |
+-------------------------------------------------------+
cms>

cms> help

help COMMANDNAME
3.6.5.1.4 Create your own Extension

One of the most important features of CMD5 is its ability to extend it with new commands. This is done via packaged namespaces. We recommend you name it cloudmesh-mycommand, where mycommand is the name of the command that you like to create. This can easily be done while using the sys cloudmesh command (we suggest you use a different name than gregor, maybe your firstname):

It will download a template from cloudmesh called cloudmesh-bar and generate a new directory cloudmesh-gregor with all the needed files to create your own command and register it dynamically with cloudmesh. All you have to do is to cd into the directory and install the code:

Adding your own command is easy. It is important that all objects are defined in the command itself and that no global variables be used, in order to allow each shell command to stand alone. Naturally you should develop API libraries outside of the cloudmesh shell command and reuse them, in order to keep the command code as small as possible. We place the command in:

Now you can go ahead and modify your command in that directory. It will look similar to (if you used the command name gregor):
$ cms sys command generate gregor

$ cd cloudmesh-gregor
$ python setup.py install
# pip install .

cloudmesh/mycommand/command/gregor.py
from __future__ import print_function
from cloudmesh.shell.command import command
from cloudmesh.shell.command import PluginCommand


class GregorCommand(PluginCommand):

    @command
    def do_gregor(self, args, arguments):
        """
        ::

          Usage:
                gregor -f FILE
                gregor list

          This command does some useful things.

          Arguments:
              FILE   a file name

          Options:
              -f     specify the file
        """
        print(arguments)
        if arguments.FILE:
            print("You have used file:", arguments.FILE)
        return ""
An important difference to other CMD solutions is that our commands can leverage (besides the standard definition) docopts as a way to define the manual page. This allows us to use the arguments as a dict and use simple if conditions to interpret the command. Using docopts has the advantage that contributors are forced to think about the command and its options and document them from the start. Previously we used argparse and click. However, we noticed that for our contributors both systems lead to commands that were either not properly documented, or the developers delivered ambiguous commands that resulted in confusion and wrong usage by subsequent users. Hence, we do recommend that you use docopts for documenting cmd5 commands. The transformation is enabled by the @command decorator that generates a manual page and creates a proper help message for the shell automatically. Thus there is no need to introduce a separate help method, as would normally be needed in CMD, while reducing the effort it takes to contribute new commands in a dynamic fashion.
3.6.5.1.5 Bug: Quotes

We have one bug in cmd5 that relates to the use of quotes on the command line.

For example, you need to say

$ cms gregor -f \"filename with spaces\"

If you like to help us fix this, that would be great. It requires the use of shlex. Unfortunately we did not yet have time to fix this "feature".
3.6.6 cmd Module ☁

If you consider using this module, you may instead want to use cloudmesh cmd5, as it provides some very nice features that are not included in cmd. However, to do the basics, cmd will do.

The Python cmd module is useful for any more involved command-line application. It is used in the Cloudmesh Project, for example, and students have found it helpful in their projects to quickly develop high quality command line tools with documentation so that others can replicate and use the programs. The Python cmd module contains a public class, Cmd, designed to be used as a base class for command processors such as interactive shells and other command interpreters.
return""
$cmsgregor-f\"filenamewithspaces\"
class for command processors such as interactive shells and other commandinterpreters.
3.6.6.1 Hello, World with cmd

This example shows a very simple command interpreter that simply responds to the greet command.

In order to demonstrate commands provided by cmd, let's save the following program in a file called helloworld.py.

A session with this program might look like this:

The Cmd class can be used to customize a subclass that becomes a user-defined command prompt. After you have executed your program, commands defined in your class can be used. Take note of the following in this example:
from __future__ import print_function, division
import cmd


class HelloWorld(cmd.Cmd):
    '''Simple command processor example.'''

    def do_greet(self, line):
        if line is not None and len(line.strip()) > 0:
            print('Hello, %s!' % line.strip().title())
        else:
            print('Hello!')

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    HelloWorld().cmdloop()
$ python helloworld.py
(Cmd) help

Documented commands (type help <topic>):
========================================
help

Undocumented commands:
======================
EOF  greet

(Cmd) greet
Hello!
(Cmd) greet albert
Hello, Albert!
(Cmd) <CTRL-D pressed>
bye, bye
- The methods of the class of the form do_xxx implement the shell commands, with xxx being the name of the command. For example, in the HelloWorld class, the function do_greet maps to the greet command on the command line.
- The EOF command is a special command that is executed when you press CTRL-D on your keyboard.
- As soon as any command method returns True, the shell application exits. Thus, in this example the shell is exited by pressing CTRL-D, since the do_EOF method is the only one that returns True.
- The shell application is started by calling the cmdloop method of the class.
3.6.6.2 A More Involved Example

Let us look at a little more involved example. Save the following code in a file called calculator.py.

A session with this program might look like this:
from __future__ import print_function, division
import cmd


class Calculator(cmd.Cmd):

    prompt = 'calc >>> '
    intro = 'Simple calculator that can do addition, subtraction, multiplication and division.'

    def do_add(self, line):
        args = line.split()
        total = 0
        for arg in args:
            total += float(arg.strip())
        print(total)

    def do_subtract(self, line):
        args = line.split()
        total = 0
        if len(args) > 0:
            total = float(args[0])
            for arg in args[1:]:
                total -= float(arg.strip())
        print(total)

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    Calculator().cmdloop()
$ python calculator.py
Simple calculator that can do addition, subtraction, multiplication and division.
calc >>> help
In this case we are using the prompt and intro class variables to define what the default prompt looks like and a welcome message when the command interpreter is invoked.

In the add and subtract commands we are using the strip and split methods to parse all arguments. If you want to get fancy, you can use Python modules like getopt or argparse for this, but this is not necessary in this simple example.

3.6.6.3 Help Messages

Notice that all commands presently show up as undocumented. To remedy this, we can define help_ methods for each command:
Documented commands (type help <topic>):
========================================
help

Undocumented commands:
======================
EOF  add  subtract

calc >>> add
0
calc >>> add 4 5 6
15.0
calc >>> subtract
0
calc >>> subtract 10 2
8.0
calc >>> subtract 10 2 20
-12.0
calc >>> bye, bye
from __future__ import print_function, division
import cmd


class Calculator(cmd.Cmd):

    prompt = 'calc >>> '
    intro = 'Simple calculator that can do addition, subtraction, multiplication and division.'

    def do_add(self, line):
        args = line.split()
        total = 0
        for arg in args:
            total += float(arg.strip())
        print(total)

    def help_add(self):
        print('\n'.join([
            'add [number,]',
            'Add the arguments together and display the total.'
        ]))

    def do_subtract(self, line):
        args = line.split()
        total = 0
        if len(args) > 0:
            total = float(args[0])
            for arg in args[1:]:
                total -= float(arg.strip())
        print(total)
Now, we can obtain help for the add and subtract commands:
3.6.6.4 Useful Links

- cmd Python 2 Docs
- cmd Python 3 Docs
- Python Module of the Week: cmd – Create line-oriented command processors
3.6.7 OpenCV ☁

Learning Objectives

- Provide some simple calculations so we can test cloud services.
    def help_subtract(self):
        print('\n'.join([
            'subtract [number,]',
            'Subtract all following arguments from the first argument.'
        ]))

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    Calculator().cmdloop()
$ python calculator.py
Simple calculator that can do addition, subtraction, multiplication and division.
calc >>> help

Documented commands (type help <topic>):
========================================
add  help  subtract

Undocumented commands:
======================
EOF

calc >>> help add
add [number,]
Add the arguments together and display the total.
calc >>> help subtract
subtract [number,]
Subtract all following arguments from the first argument.
calc >>> bye, bye
- Showcase some elementary OpenCV functions
- Show an environmental image analysis application using Secchi disks
OpenCV (Open Source Computer Vision Library) is a library of thousands of algorithms for various applications in computer vision and machine learning. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. In this section, we will explain basic features of this library, including the implementation of a simple example.
3.6.7.1 Overview

OpenCV has countless functions for image and video processing. The pipeline starts with reading the images, low-level operations on pixel values, preprocessing, e.g. denoising, and then multiple steps of higher-level operations which vary depending on the application. OpenCV covers the whole pipeline, especially providing a large set of library functions for high-level operations. A simpler library for image processing in Python is Scipy's multi-dimensional image processing package (scipy.ndimage).
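For comparison, a minimal sketch of such a low-level preprocessing step with scipy.ndimage follows; it assumes scipy is installed, and the image here is a random placeholder array:

# A minimal sketch: Gaussian smoothing (denoising) with scipy.ndimage,
# which offers a smaller API than OpenCV for such low-level steps.
import numpy as np
from scipy import ndimage

img = np.random.rand(64, 64)                    # placeholder image
smooth = ndimage.gaussian_filter(img, sigma=2)  # blur with a Gaussian kernel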
3.6.7.2 Installation

OpenCV for Python can be installed on Linux in multiple ways, namely PyPI (Python Package Index), the Linux package manager (apt-get for Ubuntu), the Conda package manager, and also building from source. You are recommended to use PyPI. Here is the command that you need to run:

This was tested on Ubuntu 16.04 with a fresh Python 3.6 virtual environment. In order to test, import the module in the Python command line:

If it does not raise an error, it is installed correctly. Otherwise, try to solve the error.

For installation on Windows, see:
$ pip install opencv-python

import cv2
https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_setup/py_setup_in_windows/py_setup_in_windows.html#install-opencv-python-in-windows
Note that building from source can take a long time and may not be feasible for deploying to limited platforms such as Raspberry Pi.
3.6.7.3 A Simple Example

In this example, an image is loaded, a simple processing step is performed, and the result is written to a new image.

3.6.7.3.1 Loading an image

The image was downloaded from the USC standard database:
http://sipi.usc.edu/database/database.php?volume=misc&image=9
3.6.7.3.2 Displaying the image

The image is saved in a numpy array. Each pixel is represented with 3 values (R,G,B). This provides you with access to manipulate the image at the level of single pixels. You can display the image using OpenCV's imshow function as well as Matplotlib's imshow function.

You can display the image using the imshow function:

or you can use Matplotlib. If you have not installed Matplotlib before, install it using:

Now you can use:
%matplotlib inline
import cv2

img = cv2.imread('images/opencv/4.2.01.tiff')

cv2.imshow('Original', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

$ pip install matplotlib

which results in Figure 1.

Figure 1: Image display
3.6.7.3.3 Scaling and Rotation

Scaling (resizing) the image relative to different axes:

which results in Figure 2.

Figure 2: Scaling and rotation
import matplotlib.pyplot as plt

plt.imshow(img)

res = cv2.resize(img,
                 None,
                 fx=1.2,
                 fy=0.7,
                 interpolation=cv2.INTER_CUBIC)
plt.imshow(res)
Rotation of the image by an angle t:

which results in Figure 3.

Figure 3: Image rotation

3.6.7.3.4 Gray-scaling

which results in Figure 4.

Figure 4: Gray-scaling
rows, cols, _ = img.shape

t = 45
M = cv2.getRotationMatrix2D((cols/2, rows/2), t, 1)
dst = cv2.warpAffine(img, M, (cols, rows))

plt.imshow(dst)

img2 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(img2, cmap='gray')
3.6.7.3.5 Image Thresholding

which results in Figure 5.

Figure 5: Image Thresholding

3.6.7.3.6 Edge Detection

Edge detection using the Canny edge detection algorithm:

which results in Figure 6.

Figure 6: Edge detection
3.6.7.4 Additional Features

OpenCV has implementations of many machine learning techniques such as KMeans and Support Vector Machines that can be put into use with only a few lines of code. It also has functions especially for video analysis, feature detection, object recognition and many more. You can find out more about them on their website.
ret, thresh = cv2.threshold(img2, 127, 255, cv2.THRESH_BINARY)
plt.subplot(1, 2, 1), plt.imshow(img2, cmap='gray')
plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

edges = cv2.Canny(img2, 100, 200)
plt.subplot(121), plt.imshow(img2, cmap='gray')
plt.subplot(122), plt.imshow(edges, cmap='gray')
[OpenCV](https://docs.opencv.org/3.0-beta/index.html) was initially developed for C++ and still has a focus on that language, but it is still one of the most valuable image processing libraries in Python.
3.6.8 Secchi Disk ☁

We are developing an autonomous robot boat that you can be part of developing within this class. The robot boat is actually measuring turbidity or water clarity. Traditionally this has been done with a Secchi disk. The use of the Secchi disk is as follows:

1. Lower the Secchi disk into the water.
2. Measure the point when you can no longer see it.
3. Record the depth at various levels and plot it in a geographical 3D map.

One of the things we can do is take a video of the measurement instead of a human recording them. Then we can analyse the video automatically to see how deep a disk was lowered. This is a classical image analysis program. You are encouraged to identify algorithms that can identify the depth. The simplest seems to be to do a histogram at a variety of depth steps, and measure when the histogram no longer changes significantly. The depth at that image will be the measurement we look for.

Thus if we analyse the images we need to look at the image and identify the numbers on the measuring tape, as well as the visibility of the disk.
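The histogram idea can be sketched as follows. This is a minimal sketch, assuming the video has been split into a list of frames (one per depth step); the stability threshold is a parameter to tune:

# A minimal sketch of the histogram idea described above; frames and
# threshold are hypothetical inputs taken from the recorded video.
import cv2

def histogram(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.calcHist([gray], [0], None, [256], [0, 256])

def disk_vanishes(frames, threshold=0.99):
    """Return the index of the first frame whose histogram no longer
    changes significantly compared to the previous one."""
    previous = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        current = histogram(frame)
        if cv2.compareHist(previous, current, cv2.HISTCMP_CORREL) > threshold:
            return i      # histogram stable: disk no longer visible
        previous = current
    return None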
To showcase how such a disk looks, we refer to the image showcasing different Secchi disks. For our purpose the black-white contrast Secchi disk works well. See Figure 7.

Figure 7: Secchi disk types. A marine style on the left and the freshwater version on the right (wikipedia).

More information about the Secchi Disk can be found at:

https://en.wikipedia.org/wiki/Secchi_disk
We have included next a couple of examples using some obviously useful OpenCV methods. Surprisingly, the use of edge detection, which comes to mind first to identify whether we can still see the disk, seems too complicated to use for analysis. We at this time believe the histogram will be sufficient.

Please inspect our examples.
3.6.8.1 Setup for OSX

First let us set up the OpenCV environment for OSX. Naturally you will have to update the versions based on your versions of python. When we tried the install of OpenCV on MacOS, the setup was slightly more complex than other packages. This may have changed by now, and if you have improved instructions, please let us know. However, we do not want to install it via Anaconda for the obvious reason that anaconda installs too many other things.

import os, sys
from os.path import expanduser
3.6.8.2 Step 1: Record the video

Record the video on the robot.

We have actually done this for you and will provide you with images and videos if you are interested in analyzing them. See Figure 8.

3.6.8.3 Step 2: Analyse the images from the Video

For now we just selected 4 images from the video:
home = expanduser("~")
sys.path.append('/usr/local/Cellar/opencv/3.3.1_1/lib/python3.6/site-packages/')
sys.path.append(home + '/.pyenv/versions/OPENCV/lib/python3.6/site-packages/')
import cv2
cv2.__version__
!pip install numpy > tmp.log
!pip install matplotlib >> tmp.log

%matplotlib inline
import cv2
import matplotlib.pyplot as plt

img1 = cv2.imread('secchi/secchi1.png')
img2 = cv2.imread('secchi/secchi2.png')
img3 = cv2.imread('secchi/secchi3.png')
img4 = cv2.imread('secchi/secchi4.png')

figures = []
fig = plt.figure(figsize=(18, 16))
for i in range(1, 13):
    figures.append(fig.add_subplot(4, 3, i))
count = 0
for img in [img1, img2, img3, img4]:
    figures[count].imshow(img)
    color = ('b', 'g', 'r')
    for i, col in enumerate(color):
        histr = cv2.calcHist([img], [i], None, [256], [0, 256])
        figures[count + 1].plot(histr, color=col)
    figures[count + 2].hist(img.ravel(), 256, [0, 256])
    count += 3

print("Legend")
print("First column = image of Secchi disk")
print("Second column = histogram of colors in image")
print("Third column = histogram of all values")

plt.show()
Figure 8: Histogram

3.6.8.3.1 Image Thresholding

See Figure 9, Figure 10, Figure 11, Figure 12.

def threshold(img):
    ret, thresh = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
    plt.subplot(1, 2, 1), plt.imshow(img, cmap='gray')
    plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

threshold(img1)

Figure 9: Threshold 1

Figure 10: Threshold 2

Figure 11: Threshold 3

Figure 12: Threshold 4
3.6.8.3.2 Edge Detection

threshold(img2)
threshold(img3)
threshold(img4)

See Figure 13, Figure 14, Figure 15, Figure 16, Figure 17. Edge detection using the Canny edge detection algorithm:

Figure 13: Edge Detection 1

Figure 14: Edge Detection 2

Figure 15: Edge Detection 3

def find_edge(img):
    edges = cv2.Canny(img, 50, 200)
    plt.subplot(121), plt.imshow(img, cmap='gray')
    plt.subplot(122), plt.imshow(edges, cmap='gray')

find_edge(img1)
find_edge(img2)
find_edge(img3)
find_edge(img4)

Figure 16: Edge Detection 4

3.6.8.3.3 Black and white

bw1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
plt.imshow(bw1, cmap='gray')

Figure 17: Black White conversion
3.7 DATA

3.7.1 Data Formats ☁

3.7.1.1 YAML

The term YAML stands for "YAML Ain't Markup Language". According to the Web Page at
http://yaml.org/
"YAML is a human friendly data serialization standard for all programming languages." There are multiple versions of YAML existing, and one needs to take care that your software supports the right version. The current version is YAML 1.2.
YAML is often used for configuration and in many cases can also be used as an XML replacement. Important is that YAML, in contrast to XML, removes the tags while replacing them with indentation. This has naturally the advantage that it is easier to read; however, the format is strict and needs to adhere to proper indentation. Thus it is important that you check your YAML files for correctness, either by writing, for example, a python program that reads your YAML file, or an online YAML checker such as provided at
http://www.yamllint.com/
An example of how to use yaml in python is provided next. Please note that YAML is a superset of JSON. Originally YAML was designed as a markup language. However, as it is not document oriented but data oriented, it has been recast and it no longer classifies itself as a markup language.

import os
import sys
import yaml

try:
    yamlFilename = os.sys.argv[1]
    yamlFile = open(yamlFilename, "r")
except:
    print("filename does not exist")
    sys.exit()
try:
    yaml.load(yamlFile.read())
except:
    print("YAML file is not valid.")

Resources:

- http://yaml.org/
- https://en.wikipedia.org/wiki/YAML
- http://www.yamllint.com/
3.7.1.2 JSON

The term JSON stands for JavaScript Object Notation. It is targeted as an open-standard file format that emphasizes the integration of human-readable text to transmit data objects. The data objects contain attribute-value pairs. Although it originates from JavaScript, the format itself is language independent. It uses brackets to allow organization of the data. Please note that YAML is a superset of JSON and not all YAML documents can be converted to JSON. Furthermore, JSON does not support comments. For these reasons we often prefer to use YAML instead of JSON. However, JSON data can easily be translated to YAML as well as XML.
Resources:

- https://en.wikipedia.org/wiki/JSON
- https://www.json.org/

3.7.1.3 XML

XML stands for Extensible Markup Language. XML allows to define documents with the help of a set of rules in order to make them machine readable. The emphasis here is on machine readable, as documents in XML can become quickly complex and difficult to understand for humans. XML is used for documents as well as data structures.

A tutorial about XML is available at

https://www.w3schools.com/xml/default.asp

Resources:

- https://en.wikipedia.org/wiki/XML
3.7.2 MongoDB in Python ☁

Learning Objectives

- Introduction to basic MongoDB knowledge
- Use of MongoDB via PyMongo
- Use of MongoEngine and its Object-Document mapper
- Use of Flask-Mongo

In today's era, NoSQL databases have developed an enormous potential to process unstructured data efficiently. Modern information is complex, extensive, and may not have pre-existing relationships. With the advent of advanced search engines, machine learning, and Artificial Intelligence, technology expectations to process, store, and analyze such data have grown tremendously [2]. NoSQL database engines such as MongoDB, Redis, and Cassandra have successfully overcome traditional relational database challenges such as scalability, performance, unstructured data growth, agile sprint cycles, and growing needs of processing data in real-time with minimal hardware processing power [3]. The NoSQL databases are a new generation of engines that do not necessarily require the SQL language and are sometimes also called Not Only SQL databases. However, most of them support various third-party open connectivity drivers that can map NoSQL queries to SQL's. It would be safe to say that although NoSQL databases are still far from replacing relational databases, they are adding immense value when used in hybrid IT environments in conjunction with relational databases, based on application specific needs [3]. We will be covering the MongoDB technology, its driver PyMongo, its object-document mapper MongoEngine, and the Flask-PyMongo micro-web framework that make MongoDB more attractive and user-friendly.
3.7.2.1 Cloudmesh MongoDB Usage Quickstart

Before you read on, we like you to read this quickstart. The easiest way for many of the activities we do to interact with MongoDB is to use our cloudmesh functionality. This prelude section is not intended to describe all the details, but to get you started quickly while leveraging cloudmesh.

This is done via the cloudmesh cmd5 and the cloudmesh_community/cm code:
https://cloudmesh-community.github.io/cm/
To install mongo on, for example, macOS you can use

$ cms admin mongo install

To start, stop, and see the status of mongo you can use

$ cms admin mongo start
$ cms admin mongo stop
$ cms admin mongo status

To add an object to Mongo, you simply have to define a dict with predefined values for kind and cloud. In the future such attributes can be passed to the function to determine the MongoDB collection.

from cloudmesh.mongo.DataBaseDecorator import DatabaseUpdate

@DatabaseUpdate
def test():
    data = {
        "kind": "test",
        "cloud": "testcloud",
        "value": "hello"
    }
    return data

result = test()

When you invoke the function, it will automatically store the information into MongoDB. Naturally this requires that the ~/.cloudmesh/cloudmesh.yaml file is properly configured.
3.7.2.2 MongoDB

Today MongoDB is one of the leading NoSQL databases, which is fully capable of handling dynamic changes, processing large volumes of complex and unstructured data, easily using object-oriented programming features, as well as distributed system challenges [4]. At its core, MongoDB is an open source, cross-platform, document database mainly written in the C++ language.

3.7.2.2.1 Installation

MongoDB can be installed on various Unix platforms, including Linux, Ubuntu, Amazon Linux, etc. [5]. This section focuses on installing MongoDB on Ubuntu 18.04 Bionic Beaver, used as a standard OS for a virtual machine used as a part of the Big Data Application Class during the 2018 Fall semester.

3.7.2.2.1.1 Installation procedure

Before installing, it is recommended to configure a non-root user and provide administrative privileges to it, in order to be able to perform general MongoDB admin tasks. This can be accomplished by login as the root user in the following manner [6].
When logged in as a regular user, one can perform actions with superuser privileges by typing sudo before each command [6].

Once the user setup is completed, one can login as a regular user (mongoadmin) and use the following instructions to install MongoDB.

To update the Ubuntu packages to the most recent versions, use the next command:

To install the MongoDB package:

To check the service and database status:

Verifying the status of a successful MongoDB installation can be confirmed with an output similar to this:

To verify the configuration, more specifically the installed version, server, and port, use the following command:

Similarly, to restart MongoDB, use the following:

To allow access to MongoDB from an outside hosted server, one can use the following command, which opens the fire-wall connections [5].

Status can be verified by using:
$ adduser mongoadmin
$ usermod -aG sudo mongoadmin

$ sudo apt update

$ sudo apt install -y mongodb

$ sudo systemctl status mongodb
mongodb.service - An object/document-oriented database
   Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-11-15 07:48:04 UTC; 2min 17s ago
     Docs: man:mongod(1)
 Main PID: 2312 (mongod)
    Tasks: 23 (limit: 1153)
   CGroup: /system.slice/mongodb.service
           └─2312 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf
$ mongo --eval 'db.runCommand({ connectionStatus: 1 })'

$ sudo systemctl restart mongodb

$ sudo ufw allow from your_other_server_ip/32 to any port 27017
Other MongoDB configurations, such as port, hostnames, and file paths, can be edited through the /etc/mongodb.conf file.

Also, to complete this step, a server's IP address must be added to the bindIP value [5].

MongoDB is now listening for a remote connection that can be accessed by anyone with appropriate credentials [5].
3.7.2.2.2 Collections and Documents

Each database within the Mongo environment contains collections which in turn contain documents. Collections and documents are analogous to tables and rows respectively in relational databases. The document structure is in a key-value form which allows storing of complex data types composed out of field and value pairs. Documents are objects which correspond to native data types in many programming languages; hence a well defined, embedded document can help reduce expensive joins and improve query performance. The _id field helps to identify each document uniquely [3].

MongoDB offers flexibility to write records that are not restricted by column types. The data storage approach is flexible as it allows one to store data as it grows and to fulfill varying needs of applications and/or users. It supports a JSON-like binary format known as BSON where data can be stored without specifying the type of data. Moreover, it can be distributed to multiple machines at high speed. It includes a sharding feature that partitions and spreads the data out across various servers. This makes MongoDB an excellent choice for cloud data processing. Its utilities can load high volumes of data at high speed, which ultimately provides greater flexibility and availability in a cloud-based environment [2].

The dynamic schema structure within MongoDB allows easy testing of the small sprints in the Agile project management life cycles and research projects that require frequent changes to the data structure with minimal downtime. Contrary to this flexible process, modifying the data structure of relational databases can be a very tedious process [2].
$ sudo ufw status

$ sudo nano /etc/mongodb.conf

logappend=true
bind_ip = 127.0.0.1,your_server_ip
port = 27017
3.7.2.2.2.1 Collection example

The following collection example for a person named Albert includes additional information such as age, status, and group [7].

{
   name: "Albert",
   age: "21",
   status: "Open",
   group: ["AI", "Machine Learning"]
}

3.7.2.2.2.2 Document structure

{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}

3.7.2.2.2.3 Collection Operations

If a collection does not exist, the MongoDB database will create the collection by default.

> db.myNewCollection1.insertOne({x: 1})
> db.myNewCollection2.createIndex({y: 1})
3.7.2.2.3 MongoDB Querying

The data retrieval patterns and the frequency of data manipulation statements such as inserts, updates, and deletes may demand the use of indexes or incorporating the sharding feature to improve query performance and efficiency of the MongoDB environment [3]. One of the significant differences between relational databases and NoSQL databases are joins. In a relational database, one can combine results from two or more tables using a common column, often called a key. The native table contains the primary key column while the referenced table contains a foreign key. This mechanism allows one to make changes in a single row instead of changing all rows in the referenced table. This action is referred to as normalization. MongoDB is a document database and mainly contains denormalized data, which means the data is repeated instead of indexed over a specific key. If the same data is required in more than one table, it needs to be repeated. This constraint has been eliminated in MongoDB's new version 3.2. The new release introduced a $lookup feature which more or less works as a left-outer-join. Lookups are restricted to aggregate functions, which means that data usually needs some type of filtering and grouping operations to be conducted beforehand. For this reason, joins in MongoDB require more complicated querying compared to traditional relational database joins. Although at this time lookups are still very far from replacing joins, this is a prominent feature that can resolve some of the relational data challenges for MongoDB [8]. MongoDB queries support regular expressions as well as range asks for specific fields that eliminate the need of returning entire documents [3]. MongoDB collections do not enforce document structure like SQL databases, which is a compelling feature. However, it is essential to keep in mind the needs of the applications [2].
3.7.2.2.3.1 Mongo Queries examples

The queries can be executed from the Mongo shell as well as through scripts.

To query the data from a MongoDB collection, one would use MongoDB's find() method.

The output can be formatted by using the pretty() command.

The MongoDB insert statements can be performed in the following manner:

"The $lookup command performs a left-outer-join to an unsharded collection in the same database to filter in documents from the joined collection for processing" [9].

> db.COLLECTION_NAME.find()

> db.mycol.find().pretty()

> db.COLLECTION_NAME.insert(document)
{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}
This operation is equivalent to the following SQL operation:

SELECT *, <output array field>
FROM collection
WHERE <output array field> IN (SELECT *
                               FROM <collection to join>
                               WHERE <foreignField> = <collection.localField>);

To perform a Like Match (Regex), one would use the following command:

> db.products.find({sku: {$regex: /789$/}})
3.7.2.2.4 MongoDB Basic Functions

When it comes to the technical elements of MongoDB, it possesses a rich interface for importing and storage of external data in various formats. By using the Mongo Import/Export tool, one can easily transfer contents from JSON, CSV, or TSV files into a database. MongoDB supports CRUD (create, read, update, delete) operations efficiently and has detailed documentation available on the product website. It can also query geospatial data, and it is capable of storing geospatial data in GeoJSON objects. The aggregation operation of MongoDB processes data records and returns computed results. The MongoDB aggregation framework is modeled on the concept of data pipelines [10].

3.7.2.2.4.1 Import/Export functions examples

To import JSON documents, one would use the following command:

$ mongoimport --db users --collection contacts --file contacts.json

The CSV import uses the input file name to import a collection, hence the collection name is optional [10].

$ mongoimport --db users --type csv --headerline --file /opt/backups/contacts.csv

"Mongoexport is a utility that produces a JSON or CSV export of data stored in a MongoDB instance" [10].

$ mongoexport --db test --collection traffic --out traffic.json
3.7.2.2.5 Security Features
Data security is a crucial aspect of enterprise infrastructure management and is the reason why MongoDB provides various security features such as role-based access control, numerous authentication options, and encryption. It supports mechanisms such as SCRAM, LDAP, and Kerberos authentication. The administrator can create role/collection-based access control; also, roles can be predefined or custom. MongoDB can audit activities such as DDL, CRUD statements, authentication and authorization operations [11].
3.7.2.2.5.1 Collection based access control example

A user defined role can contain the following privileges [11].

privileges: [
  {resource: {db: "products", collection: "inventory"}, actions: ["find", "update"]},
  {resource: {db: "products", collection: "orders"}, actions: ["find"]}
]
3.7.2.2.6 MongoDB Cloud Service

In regards to cloud technologies, MongoDB also offers a fully automated cloud service called Atlas with competitive pricing options. The Mongo Atlas Cloud interface offers an interactive GUI for managing cloud resources and deploying applications quickly. The service is equipped with geographically distributed instances to ensure no single point of failure. Also, a well-rounded performance monitoring interface allows users to promptly detect anomalies and generate index suggestions to optimize the performance and reliability of the database. Global technology leaders such as Google, Facebook, eBay, and Nokia are leveraging MongoDB and Atlas cloud services, making MongoDB one of the most popular choices among the NoSQL databases [12].

3.7.2.3 PyMongo

PyMongo is the official Python driver or distribution that allows work with a NoSQL type database called MongoDB [13]. The first version of the driver was developed in 2009 [14], only two years after the development of MongoDB was started. This driver allows developers to combine both Python's versatility and MongoDB's flexible schema nature into successful applications. Currently, this driver supports MongoDB versions 2.6, 3.0, 3.2, 3.4, 3.6, and 4.0 [15]. MongoDB and Python represent a compatible fit considering that BSON (binary JSON), used in this NoSQL database, is very similar to Python dictionaries, which makes the collaboration between the two even more appealing [16]. For this reason, dictionaries are the recommended tools to be used in PyMongo when representing documents [17].
3.7.2.3.1 Installation

Prior to being able to exploit the benefits of Python and MongoDB simultaneously, the PyMongo distribution must be installed using pip. To install it on all platforms, the following command should be used [18]:

$ python -m pip install pymongo

Specific versions of PyMongo can be installed with command lines such as in our example, where the 3.5.1 version is installed [18].

A single line of code can be used to upgrade the driver as well [18].

Furthermore, the installation process can be completed with the help of the easy_install tool, which requires users to use the following command [18].

To do an upgrade of the driver using this tool, the following command is recommended [18]:

There are many other ways of installing PyMongo directly from the source; however, they require the C extension dependencies to be installed prior to the driver installation step, as they are the ones that skim through the sources on GitHub and use the most up-to-date links to install the driver [18].

To check if the installation was completed accurately, the following command is used in the Python console [19].

If the command returns zero exceptions within the Python shell, one can consider the PyMongo installation to have been completed successfully.
$ python -m pip install pymongo==3.5.1

$ python -m pip install --upgrade pymongo

$ python -m easy_install pymongo

$ python -m easy_install -U pymongo

import pymongo
3.7.2.3.2 Dependencies

The PyMongo driver has a few dependencies that should be taken into consideration prior to its usage. Currently, it supports CPython 2.7, 3.4+, PyPy, and PyPy 3.5+ interpreters [15]. An optional dependency that requires some additional components to be installed is the GSSAPI authentication [15]. For Unix based machines, it requires pykerberos, while for Windows machines WinKerberos is needed to fulfill this requirement [15]. The automatic installation of this dependency can be done simultaneously with the driver installation, in the following manner:

Other third-party dependencies such as ipaddress, certifi, or wincerstore are necessary for connections with help of TLS/SSL and can also be simultaneously installed along with the driver installation [15].

3.7.2.3.3 Running PyMongo with the Mongo Daemon

Once PyMongo is installed, the Mongo daemon can be run with a very simple command in a new terminal window [19].

3.7.2.3.4 Connecting to a database using MongoClient

In order to be able to establish a connection with a database, a MongoClient class needs to be imported, which subsequently allows the MongoClient object to communicate with the database [19].

This command allows a connection with a default, local host through port 27017; however, depending on the programming requirements, one can also specify those by listing them in the client instance or use the same information via the Mongo URI format [19].

3.7.2.3.5 Accessing Databases
$ python -m pip install pymongo[gssapi]

$ mongod

from pymongo import MongoClient
client = MongoClient()
Since MongoClient plays a server role, it can be used to access any desired databases in an easy way. To do that, one can use two different approaches. The first approach would be doing this via the attribute method, where the name of the desired database is listed as an attribute, and the second approach would include dictionary-style access [19]. For example, to access a database called cloudmesh_community, one would use the following commands for the attribute and for the dictionary method, respectively.

3.7.2.3.6 Creating a Database

Creating a database is a straightforward process. First, one must create a MongoClient object and specify the connection (IP address) as well as the name of the database they are trying to create [20]. An example of this command is presented in the following section:

3.7.2.3.7 Inserting and Retrieving Documents (Querying)

Creating documents and storing data using PyMongo is equally easy as accessing and creating databases. In order to add new data, a collection must be specified first. In this example, a decision is made to use the cloudmesh group of documents.

Once this step is completed, data may be inserted using the insert_one() method, which means that only one document is being created. Of course, insertion of multiple documents at the same time is possible as well with the use of the insert_many() method [19]. An example of this method is as follows:

Another example of this method would be to create a collection. If we wanted to create a collection of students in the cloudmesh_community, we would do it in the following manner:
db = client.cloudmesh_community

db = client['cloudmesh_community']

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']

cloudmesh = db.cloudmesh
course_info = {
    'course': 'Big Data Applications and Analytics',
    'instructor': 'Gregor von Laszewski',
    'chapter': 'technologies'
}
result = cloudmesh.insert_one(course_info)
Retrieving documents is equally simple as creating them. The find_one() method can be used to retrieve one document [19]. An implementation of this method is given in the following example.

Similarly, to retrieve multiple documents, one would use the find() method instead of find_one(). For example, to find all courses taught by professor von Laszewski, one would use the following command:

One thing that users should be cognizant of when using the find() method is that it does not return results in an array format but as a cursor object, which is a combination of methods that work together to help with data querying [19]. In order to return individual documents, iteration over the result must be completed [19].
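A minimal sketch of such an iteration, reusing the find() query from the previous example, looks like:

# Iterate over the cursor returned by find(); each iteration yields
# one document as a dictionary.
courses = cloudmesh.find({'instructor': 'Gregor von Laszewski'})
for course in courses:
    print(course['course'])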
3.7.2.3.8 Limiting Results

When it comes to working with large databases, it is always useful to limit the number of query results. PyMongo supports this option with its limit() method [20]. This method takes in one parameter which specifies the number of documents to be returned [20]. For example, if we had a collection with a large number of cloud technologies as individual documents, one could modify the query results to return only the top 10 technologies. To do this, the following example could be utilized:
student = [{'name': 'John', 'st_id': 52642},
           {'name': 'Mercedes', 'st_id': 5717},
           {'name': 'Anna', 'st_id': 5654},
           {'name': 'Greg', 'st_id': 5423},
           {'name': 'Amaya', 'st_id': 3540},
           {'name': 'Cameron', 'st_id': 2343},
           {'name': 'Bozer', 'st_id': 4143},
           {'name': 'Cody', 'st_id': 2165}]

client = MongoClient('mongodb://localhost:27017/')

with client:
    db = client.cloudmesh
    db.students.insert_many(student)

gregors_course = cloudmesh.find_one({'instructor': 'Gregor von Laszewski'})

gregors_course = cloudmesh.find({'instructor': 'Gregor von Laszewski'})

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['technologies']
topten = col.find().limit(10)
3.7.2.3.9 Updating Collection

Updating documents is very similar to inserting and retrieving them. Depending on the number of documents to be updated, one would use the update_one() or update_many() method [20]. Two parameters need to be passed in the update_one() method for it to successfully execute. The first argument is the query object that specifies the document to be changed, and the second argument is the object that specifies the new value in the document. An example of the update_one() method in action is the following:

Updating all documents that fall under the same criteria can be done with the update_many method [20]. For example, to update all documents in which the course title starts with the letter B with different instructor information, we would do the following:

3.7.2.3.10 Counting Documents

Counting documents can be done with one simple operation called count_documents() instead of using a full query [21]. For example, we can count the documents in the cloudmesh_community by using the following command:

To create a more specific count, one would use a command similar to this:

This technology supports some more advanced querying options as well. Those advanced queries allow one to add certain constraints and narrow down the results even more. For example, to get the courses taught by professor von Laszewski after a certain date, one would use the following command:
myquery = {'course': 'Big Data Applications and Analytics'}
newvalues = {'$set': {'course': 'Cloud Computing'}}

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['courses']
query = {'course': {'$regex': '^B'}}
newvalues = {'$set': {'instructor': 'Gregor von Laszewski'}}
edited = col.update_many(query, newvalues)

cloudmesh.count_documents({})

cloudmesh.count_documents({'author': 'von Laszewski'})

import datetime
import pprint

d = datetime.datetime(2017, 11, 12, 12)
for course in cloudmesh.find({'date': {'$lt': d}}).sort('author'):
    pprint.pprint(course)
3.7.2.3.11 Indexing

Indexing is a very important part of querying. It can greatly improve query performance but also add functionality and aid in storing documents [21].

"To create a unique index on a key that rejects documents whose value for that key already exists in the index" [21].

We need to first create the index in the following manner:

This command actually creates two different indexes. The first one is the _id, created by MongoDB automatically, and the second one is the user_id, created by the user.

The purpose of those indexes is to cleverly prevent future additions of invalid user_ids into a collection.
3.7.2.3.12 Sorting

Sorting on the server-side is also available via MongoDB. The PyMongo sort() method is equivalent to the SQL order by statement, and it can be performed as pymongo.ASCENDING and pymongo.DESCENDING [22]. This method is much more efficient, as it is being completed on the server-side, compared to sorting completed on the client side. For example, to return all users with first name Gregor sorted in descending order by birthdate, we would use a command such as this:

3.7.2.3.13 Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to
result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                  unique=True)
sorted(list(db.profiles.index_information()))
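With the unique index from the previous example in place, a second insert with the same user_id is rejected. The following minimal sketch (with illustrative documents) shows the resulting DuplicateKeyError:

# The first insert succeeds; the second violates the unique index
# on user_id and raises DuplicateKeyError.
import pymongo

db.profiles.insert_one({'user_id': 211, 'name': 'Gregor'})
try:
    db.profiles.insert_one({'user_id': 211, 'name': 'Albert'})
except pymongo.errors.DuplicateKeyError:
    print('user_id 211 already exists')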
users = cloudmesh.users.find({'firstname': 'Gregor'}).sort('dateofbirth', pymongo.DESCENDING)
for user in users:
    print(user.get('email'))
"provide projection capabilities to reshape the returned data" [23].

In the aggregation pipeline, documents pass through multiple pipeline stages which convert documents into result data. The basic pipeline stages include filters. Those filters act like a document transformation by helping change the document output form. Other pipeline stages help group or sort documents with specific fields. By using native operations from MongoDB, the pipeline operators are efficient in aggregating results.

The addFields stage is used to add new fields into documents. It reshapes each document in the stream, similarly to the project stage. The output document will contain existing fields from the input documents and the newly added fields [24]. The following example shows how to add student details into a document.

The bucket stage is used to categorize incoming documents into groups based on specified expressions. Those groups are called buckets [24]. The following example shows the bucket stage in action.

In the bucketAuto stage, the boundaries are automatically determined in an attempt to evenly distribute documents into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [24].
db.cloudmesh_community.aggregate([
    {
        $addFields: {
            "document.StudentDetails": {
                $concat: ['$document.student.FirstName', '$document.student.LastName']
            }
        }
    }
])
db.user.aggregate([
    {"$group": {
        "_id": {
            "city": "$city",
            "age": {
                "$let": {
                    "vars": {
                        "age": {"$subtract": [{"$year": new Date()}, {"$year": "$birthDay"}]}},
                    "in": {
                        "$switch": {
                            "branches": [
                                {"case": {"$lt": ["$$age", 20]}, "then": 0},
                                {"case": {"$lt": ["$$age", 30]}, "then": 20},
                                {"case": {"$lt": ["$$age", 40]}, "then": 30},
                                {"case": {"$lt": ["$$age", 50]}, "then": 40},
                                {"case": {"$lt": ["$$age", 200]}, "then": 50}
                            ]}}}}},
        "count": {"$sum": 1}}}])
db.artwork.aggregate([
    {
        $bucketAuto: {
            groupBy: "$price",
            buckets: 4
        }
    }
])
The collStats stage returns statistics regarding a collection or view [24].

The count stage passes a document to the next stage that contains the number of documents that were input to the stage [24].

The facet stage helps process multiple aggregation pipelines in a single stage [24].

The geoNear stage returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [24].

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [24].
groupBy:"$price",
buckets:4
}
}
])
db.matrices.aggregate([{$collStats: {latencyStats: {histograms: true}}}])

db.scores.aggregate([
    {$match: {score: {$gt: 80}}},
    {$count: "passing_scores"}])

db.artwork.aggregate([{
    $facet: {"categorizedByTags": [{$unwind: "$tags"},
                                   {$sortByCount: "$tags"}],
             "categorizedByPrice": [
                 // Filter out documents without a price e.g., _id: 7
                 {$match: {price: {$exists: 1}}},
                 {$bucket: {groupBy: "$price",
                            boundaries: [0, 150, 200, 300, 400],
                            default: "Other",
                            output: {"count": {$sum: 1},
                                     "titles": {$push: "$title"}}}}],
             "categorizedByYears(Auto)": [
                 {$bucketAuto: {groupBy: "$year", buckets: 4}}]}}])

db.places.aggregate([
    {$geoNear: {
        near: {type: "Point", coordinates: [-73.99279, 40.719296]},
        distanceField: "dist.calculated",
        maxDistance: 2,
        query: {type: "public"},
        includeLocs: "dist.location",
        num: 5,
        spherical: true
    }}])

db.travelers.aggregate([
    {
        $graphLookup: {
            from: "airports",
            startWith: "$nearestAirport",
            connectFromField: "connects",
            connectToField: "airport",
            maxDepth: 2,
            depthField: "numConnections",
            as: "destinations"
        }
    }
])
The group stage consumes the document data per each distinct group. It has a RAM limit of 100 MB. If the stage exceeds this limit, group produces an error [24].

The indexStats stage returns statistics regarding the use of each index for a collection [24].

The limit stage is used for controlling the number of documents passed to the next stage in the pipeline [24].

The listLocalSessions stage gives the session information currently connected to the mongos or mongod instance [24].

The listSessions stage lists out all sessions that have been active long enough to propagate to the system.sessions collection [24].

The lookup stage is useful for performing outer joins to other collections in the same database [24].
connectToField:"airport",
maxDepth:2,
depthField:"numConnections",
as:"destinations"
}
}
])
db.sales.aggregate(
    [
        {
            $group: {
                _id: {month: {$month: "$date"}, day: {$dayOfMonth: "$date"},
                      year: {$year: "$date"}},
                totalPrice: {$sum: {$multiply: ["$price", "$quantity"]}},
                averageQuantity: {$avg: "$quantity"},
                count: {$sum: 1}
            }
        }
    ]
)

db.orders.aggregate([{$indexStats: {}}])

db.article.aggregate(
    {$limit: 5}
)

db.aggregate([{$listLocalSessions: {allUsers: true}}])

use config
db.system.sessions.aggregate([{$listSessions: {allUsers: true}}])

{
    $lookup:
      {
        from: <collection to join>,
        localField: <field from the input documents>,
        foreignField: <field from the documents of the "from" collection>,
        as: <output array field>
      }
}
The match stage is used to filter the document stream. Only matching documents pass to the next stage [24].

The project stage is used to reshape the documents by adding or deleting fields.

The redact stage reshapes stream documents by restricting information using information stored in the documents themselves [24].

The replaceRoot stage is used to replace a document with a specified embedded document [24].

The sample stage is used to sample out data by randomly selecting a number of documents from the input [24].

The skip stage skips a specified initial number of documents and passes the remaining documents to the pipeline [24].
db.articles.aggregate(
    [{$match: {author: "dave"}}]
)

db.books.aggregate([{$project: {title: 1, author: 1}}])

db.accounts.aggregate(
    [
        {$match: {status: "A"}},
        {
            $redact: {
                $cond: {
                    if: {$eq: ["$level", 5]},
                    then: "$$PRUNE",
                    else: "$$DESCEND"
                }}}]);

db.produce.aggregate([
    {
        $replaceRoot: {newRoot: "$in_stock"}
    }
])

db.users.aggregate(
    [{$sample: {size: 3}}]
)

db.article.aggregate(
    {$skip: 5}
);
The sort stage is useful for reordering the document stream by a specified sort key [24].

The sortByCount stage groups the incoming documents based on a specified expression value and counts the documents in each distinct group [24].

The unwind stage deconstructs an array field from the input documents to output a document for each element [24].

The out stage is used to write aggregation pipeline results into a collection. This stage should be the last stage of a pipeline [24].

Another option from the aggregation operations is the Map/Reduce framework, which essentially includes two different functions, map and reduce. The first one provides the key-value pair for each tag in the array, while the latter one

"sums over all of the emitted values for a given key" [23].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [23]. The Map/Reduce operation provides result data in a collection or returns results in-line. One can perform subsequent operations with the same input collection if the output of the same is written to a collection [25]. An operation that produces results in an in-line form must provide results within the BSON document size limit. The current limit for a BSON document is 16 MB. These types of operations are not supported by views [25]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [26]. Moreover, Map/Reduce has the ability to get more detailed results by passing the full_response=True argument to the map_reduce() function [26].
db.users.aggregate(
    [
        {$sort: {age: -1, posts: 1}}
    ]
)

db.exhibits.aggregate(
    [{$unwind: "$tags"}, {$sortByCount: "$tags"}])

db.inventory.aggregate([{$unwind: "$sizes"}])
db.inventory.aggregate([{$unwind: {path: "$sizes"}}])

db.books.aggregate([
    {$group: {_id: "$author", books: {$push: "$title"}}},
    {$out: "authors"}
])
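Based on the Map/Reduce description above, a minimal sketch may look as follows; the things collection and its tags array are illustrative:

# A minimal sketch of Map/Reduce with PyMongo: the mapper emits a
# (tag, 1) pair for each tag in the array, the reducer sums the
# emitted values for a given key.
from bson.code import Code

mapper = Code("""
    function () {
        this.tags.forEach(function (tag) {
            emit(tag, 1);
        });
    }
""")

reducer = Code("""
    function (key, values) {
        return Array.sum(values);
    }
""")

result = db.things.map_reduce(mapper, reducer, "tag_counts")
for doc in result.find():
    print(doc)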
3.7.2.3.14 Deleting Documents from a Collection

The deletion of documents with PyMongo is fairly straightforward. To do so, one would use the remove() method of the PyMongo Collection object [22]. Similarly to the reads and updates, specification of the documents to be removed is a must. For example, removal of the entire document collection with a score of 1 would require one to use the following command:

The safe parameter set to True ensures the operation was completed [22].
3.7.2.3.15 Copying a Database

Copying databases within the same mongod instance or between different mongod servers is made possible with the command() method after connecting to the desired mongod instance [27]. For example, to copy the cloudmesh database and name the new database cloudmesh_copy, one would use the command() method in the following manner:

There are two ways to copy a database between servers. If a server is not password-protected, one would not need to pass in the credentials nor to authenticate to the admin database [27]. In that case, to copy a database one would use the following command:

On the other hand, if the server where we are copying the database to is protected, one would use this command instead:
cloudmesh.users.remove({"score": 1}, safe=True)

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy')

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

client = MongoClient('target.example.com',
                     username='administrator',
                     password='pwd')
client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

3.7.2.3.16 PyMongo Strengths
One of PyMongo's strengths is that it allows document creation and querying natively

“through the use of existing language features such as nested dictionaries and lists” [22].

For moderately experienced Python developers, it is very easy to learn and one quickly feels comfortable with it.

“For these reasons, MongoDB and Python make a powerful combination for rapid, iterative development of horizontally scalable backend applications” [22].

According to [22], MongoDB is very applicable to modern applications, which makes PyMongo equally valuable [22].
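As a small sketch of this native style (the collection, field names, and values here are illustrative assumptions):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['cloudmesh_community']

# nested dictionaries and lists are stored as-is, with no mapping layer
db.users.insert_one({
    'name': 'Albert',                   # an illustrative document
    'courses': ['e222', 'e516'],        # a native Python list
    'address': {'city': 'Bloomington',  # a nested dictionary
                'state': 'IN'}
})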
3.7.2.4 MongoEngine
“MongoEngine is an Object-Document Mapper, written in Python for working with MongoDB” [28].

It is actually a library that allows a more advanced communication with MongoDB compared to PyMongo. As MongoEngine is technically considered to be an object-document mapper (ODM), it can also be considered to be

“equivalent to a SQL-based object relational mapper (ORM)” [19].

The primary reason why one would use an ODM is data conversion between computer systems that are not compatible with each other [29]. For the purpose of converting data to the appropriate form, a virtual object database must be created within the utilized programming language [29]. This library is also used to define schemata for documents within MongoDB, which ultimately helps with minimizing coding errors as well as defining methods on existing fields [30]. It is also very beneficial to the overall workflow as it tracks changes made to the documents and aids in the document saving process [31].
3.7.2.4.1 Installation
The installation process for this technology is fairly simple as it is considered to be a library. To install it, one would use the following command [32]:

A bleeding-edge version of MongoEngine can be installed directly from GitHub by first cloning the repository on the local machine, virtual machine, or cloud.
3.7.2.4.2 Connecting to a database using MongoEngine
Once installed, MongoEngine needs to be connected to an instance of mongod, similarly to PyMongo [33]. The connect() function must be used to successfully complete this step, and the argument that must be used in this function is the name of the desired database [33]. Prior to using this function, the function name needs to be imported from the MongoEngine library.

Similarly to the MongoClient, MongoEngine uses localhost and port 27017 by default; however, the connect() function also allows specifying other hosts and port arguments as well [33].

Other types of connections are also supported (i.e., URI) and they can be completed by providing the URI in the connect() function [33].
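For example, a URI-style connection might look as follows (the URI itself is an illustrative assumption):

from mongoengine import connect

# pass the full MongoDB URI via the host argument
connect(host='mongodb://localhost:27017/cloudmesh_community')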
3.7.2.4.3 Querying using MongoEngine
To query MongoDB using MongoEngine, an objects attribute is used, which is, technically, a part of the document class [34]. This attribute is called the QuerySetManager, which in return

“creates a new QuerySet object on access” [34].

To be able to access individual documents from a database, this object needs to be iterated over. For example, to return/print all students in the cloudmesh_community object (database), the following command would be used.
$ pip install mongoengine

from mongoengine import connect
connect('cloudmesh_community')

connect('cloudmesh_community', host='196.185.1.62', port=16758)

for user in cloudmesh_community.objects:
    print(user)
MongoEngine also has a capability of query filtering, which means that a keyword can be used within the called QuerySet object to retrieve specific information [34]. Let us say one would like to iterate over cloudmesh_community students that are natives of Indiana. To achieve this, one would use the following command:

This library also allows the use of all operators except for the equality operator in its queries, and moreover, has the capability of handling string queries, geo queries, list querying, and querying of the raw PyMongo queries [34].

The string queries are useful in performing text operations in the conditional queries. A query to find a document exactly matching the state ACTIVE can be performed in the following manner:

The query to retrieve document data for names that start with a case-sensitive AL can be written as:

To perform the same query for a case-insensitive AL, one would use the following command:

MongoEngine allows data extraction of geographical locations by using geo queries. The geo_within operator checks if a geometry is within a polygon.

The list query looks up the documents where the specified field matches exactly to the given value. To match all pages that have the word coding as an item in the tags list, one would use the following query:
indy_students = cloudmesh_community.objects(state='IN')

cloudmesh_community.objects(state__exact="ACTIVE")

cloudmesh_community.objects(name__startswith="AL")

cloudmesh_community.objects(name__istartswith="AL")
cloudmesh_community.objects(
    point__geo_within=[[[40, 5], [40, 6], [41, 6], [40, 5]]])

cloudmesh_community.objects(
    point__geo_within={"type": "Polygon",
                       "coordinates": [[[40, 5], [40, 6], [41, 6], [40, 5]]]})

class Page(Document):
    tags = ListField(StringField())

Page.objects(tags='coding')
Overall, it would be safe to say that MongoEngine has good compatibility with Python. It provides different functions to utilize Python easily with MongoDB, which makes this pair even more attractive to application developers.
3.7.2.5 Flask-PyMongo
“Flask is a micro-web framework written in Python” [35].

It was developed after Django, and it is very pythonic in nature, which implies that it is explicitly targeting the Python user community. It is lightweight as it does not require additional tools or libraries, and hence is classified as a micro-web framework. It is often used with MongoDB through the PyMongo connector, and it treats data within MongoDB as searchable Python dictionaries. Applications such as Pinterest, LinkedIn, and the community web page for Flask are using the Flask framework. Moreover, it supports various features such as RESTful request dispatching, secure cookies, Google App Engine compatibility, and integrated support for unit testing, etc. [35]. When it comes to connecting to a database, the connection details for MongoDB can be passed as a variable or configured in the PyMongo constructor with additional arguments such as username and password, if required. It is important that the versions of both Flask and MongoDB are compatible with each other to avoid functionality breaks [36].
3.7.2.5.1 Installation

Flask-PyMongo can be installed with an easy command such as this:

PyMongo can be added in the following manner:
3.7.2.5.2 Configuration

There are two ways to configure Flask-PyMongo. The first way would be to pass a MongoDB URI to the PyMongo constructor, while the second way would be to
$ pip install Flask-PyMongo

from flask import Flask
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/cloudmesh_community"
mongo = PyMongo(app)
“assign it to the MONGO_URI Flask configuration variable” [36].
3.7.2.5.3 Connection to multiple databases/servers

Multiple PyMongo instances can be used to connect to multiple databases or database servers. To achieve this, one would use a command similar to the following:
3.7.2.5.4 Flask-PyMongo Methods

Flask-PyMongo provides helpers for some common tasks. One of them is the Collection.find_one_or_404 method, shown in the following example:

This method is very similar to MongoDB's find_one() method; however, instead of returning None it causes a 404 Not Found HTTP status [36].

Similarly, the PyMongo.send_file and PyMongo.save_file methods work on file-like objects and save them to GridFS using the given filename [36].
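A minimal sketch of these two helpers, following the pattern in the Flask-PyMongo documentation (the route paths and function names are illustrative assumptions):

from flask import request, redirect, url_for

@app.route("/uploads/<path:filename>")
def get_upload(filename):
    # stream a file stored in GridFS back to the client
    return mongo.send_file(filename)

@app.route("/uploads/<path:filename>", methods=["POST"])
def save_upload(filename):
    # store the uploaded file-like object in GridFS under filename
    mongo.save_file(filename, request.files["file"])
    return redirect(url_for("get_upload", filename=filename))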
3.7.2.5.5 Additional Libraries

Flask-MongoAlchemy and Flask-MongoEngine are additional libraries that can be used to connect to a MongoDB database while using enhanced features with the Flask app. Flask-MongoAlchemy is used as a proxy between Python and MongoDB to connect. It provides options such as server- or database-based authentication to connect to MongoDB. While the default is set to server-based, to use database-based authentication, the config value MONGOALCHEMY_SERVER_AUTH parameter must be set to False [37].

Flask-MongoEngine is the Flask extension that provides integration with MongoEngine. It handles connection management for the apps. It can be installed through pip and set up very easily as well. The default configuration is
app = Flask(__name__)
mongo1 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_one")
mongo2 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_two")
mongo3 = PyMongo(app, uri=
    "mongodb://another.host:27017/cloudmesh_community_Three")

@app.route("/user/<username>")
def user_profile(username):
    user = mongo.db.cloudmesh_community.find_one_or_404({"_id": username})
    return render_template("user.html", user=user)
set to the local host and port 27017. For a custom port, and in cases where MongoDB is running on another server, the host and port must be explicitly specified in connect strings within the MONGODB_SETTINGS dictionary with app.config, along with the database username and password, in cases where database authentication is enabled. URI-style connections are also supported; supply the URI as the host in the MONGODB_SETTINGS dictionary with app.config. There are various custom QuerySets available within Flask-MongoEngine that are attached to MongoEngine's default QuerySet [38].
3.7.2.5.6 Classes and Wrappers

Attributes such as cx and db in the PyMongo objects are the ones that help provide access to the MongoDB server [36]. To achieve this, one must pass the Flask app to the constructor or call init_app() [36].

“Flask-PyMongo wraps PyMongo's MongoClient, Database, and Collection classes, and overrides their attribute and item accessors” [36].

This type of wrapping allows Flask-PyMongo to add methods to Collection while at the same time allowing MongoDB-style dotted expressions in the code [36].

Flask-PyMongo creates connectivity between Python and Flask using a MongoDB database and supports

“extensions that can add application features as if they were implemented in Flask itself” [39],

hence, it can be used as additional Flask functionality in Python code. The extensions are there for the purpose of supporting form validations, authentication technologies, object-relational mappers, and framework-related tools, which ultimately adds a lot of strength to this micro-web framework [39]. One of the main reasons and benefits why it is frequently used with MongoDB is its capability of adding more control over databases and history [39].
type(mongo.cx)
type(mongo.db)
type(mongo.db.cloudmesh_community)
3.7.3 MongoEngine ☁

3.7.3.1 Introduction

MongoEngine is a document mapper for working with MongoDB in Python. To be able to use MongoEngine, MongoDB should already be installed and running.
3.7.3.2 Install and connect

MongoEngine can be installed by running:

This will install six, pymongo, and mongoengine.

To connect to MongoDB, use the connect() function by specifying the MongoDB instance name. You do not need to go to the mongo shell; this can be done from the Unix shell or the cmd line. In this case we are connecting to a database named student_db.

If MongoDB is running on a port different from the default port, the port number and host need to be specified. If MongoDB requires authentication, the username and password need to be specified.
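A sketch of such a connection (the host, port, and credentials here are illustrative assumptions):

from mongoengine import connect

# non-default host and port
connect('student_db', host='192.168.1.35', port=12345)

# with authentication against the admin database
connect('student_db',
        username='admin',
        password='secret',
        authentication_source='admin')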
3.7.3.3 Basics

MongoDB does not enforce schemas. Compared to an RDBMS, a row in MongoDB is called a “document” and a table can be compared to a Collection. Defining a schema is helpful as it minimizes coding errors. To define a schema we create a class that inherits from Document.
$ pip install mongoengine

from mongoengine import *
connect('student_db')

from mongoengine import *

class Student(Document):
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)
Fields are not mandatory, but if needed, set the required keyword argument to True. There are multiple values available for field types. Each field can be customized by keyword arguments. If each student is sending text messages to the university's central database, these can be stored using MongoDB. Each text can have different data types; some might have images and some might have URLs. So we can create a class Text and link it to Student by using a ReferenceField (similar to a foreign key in an RDBMS).

MongoDB supports adding tags to individual texts rather than storing them separately and then having them referenced. Similarly, comments can also be stored directly in a Text, as sketched next.
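The Text class with comments shown later in this section references a Comment type; a minimal sketch of such an embedded document (the fields are illustrative assumptions) could be:

class Comment(EmbeddedDocument):
    content = StringField()              # the comment text
    name = StringField(max_length=120)   # who wrote the comment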
For accessing data: if we need to get titles.

Searching texts with tags.
3.8 CALCULATION

3.8.1 Word Count with Parallel Python ☁

We will demonstrate Python's multiprocessing API for parallel computation by writing a program that counts how many times each word in a collection of documents appears.
class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    meta = {'allow_inheritance': True}

class OnlyText(Text):
    content = StringField()

class ImagePost(Text):
    image_path = StringField()

class LinkPost(Text):
    link_url = StringField()

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    tags = ListField(StringField(max_length=30))
    comments = ListField(EmbeddedDocumentField(Comment))

for text in OnlyText.objects:
    print(text.title)

for text in Text.objects(tags='mongodb'):
    print(text.title)
3.8.1.1 Generating a Document Collection

Before we begin, let us write a script that will generate document collections by specifying the number of documents and the number of words per document. This will make benchmarking straightforward.

To keep it simple, the vocabulary of the document collection will consist of random numbers rather than the words of an actual language:

Notice that we are using the docopt module, which you should be familiar with from the Section [Python DocOpts](#s-python-docopts), to make the script easy to run from the command line.
'''Usage: generate_nums.py [-h] NUM_LISTS INTS_PER_LIST MIN_INT MAX_INT DEST_DIR

Generate random lists of integers and save them
as 1.txt, 2.txt, etc.

Arguments:
  NUM_LISTS      The number of lists to create.
  INTS_PER_LIST  The number of integers in each list.
  MIN_INT        Each generated integer will be >= MIN_INT.
  MAX_INT        Each generated integer will be <= MAX_INT.
  DEST_DIR       A directory where the generated numbers will be stored.

Options:
  -h --help
'''
from __future__ import print_function
import os, random, logging
from docopt import docopt


def generate_random_lists(num_lists,
                          ints_per_list, min_int, max_int):
    return [[random.randint(min_int, max_int)
             for i in range(ints_per_list)] for i in range(num_lists)]


if __name__ == '__main__':
    args = docopt(__doc__)
    num_lists, ints_per_list, min_int, max_int, dest_dir = [
        int(args['NUM_LISTS']),
        int(args['INTS_PER_LIST']),
        int(args['MIN_INT']),
        int(args['MAX_INT']),
        args['DEST_DIR']
    ]
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    lists = generate_random_lists(num_lists,
                                  ints_per_list,
                                  min_int,
                                  max_int)
    curr_list = 1
    for lst in lists:
        with open(os.path.join(dest_dir, '%d.txt' % curr_list), 'w') as f:
            f.write(os.linesep.join(map(str, lst)))
        curr_list += 1
    logging.debug('Numbers written.')
You can generate a document collection with this script as follows:

3.8.1.2 Serial Implementation

A first serial implementation of word count is straightforward:

3.8.1.3 Serial Implementation Using map and reduce

We can improve the serial implementation in anticipation of parallelizing the program by making use of Python's map and reduce functions.

In short, you can use map to apply the same function to the members of a collection. For example, to convert a list of numbers to strings, you could do:
python generate_nums.py 1000 10000 0 100 docs-1000-10000
'''Usage: wordcount.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)


def wordcount(files):
    counts = {}
    for filepath in files:
        with open(filepath, 'r') as f:
            words = [word.strip() for word in f.read().split()]
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return counts


if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    counts = wordcount(glob.glob(os.path.join(args['DATA_DIR'], '*.txt')))
    logging.debug(counts)
import random
nums = [random.randint(1, 2) for _ in range(10)]
print(nums)
[2, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(list(map(str, nums)))  # wrap in list() so the result displays in Python 3
We can use reduce to apply the same function cumulatively to the items of a sequence. For example, to find the total of the numbers in our list, we could use reduce as follows:

We can simplify this even more by using a lambda function:

You can read more about Python's lambda function in the docs.

With this in mind, we can reimplement the word count example as follows:
['2', '1', '1', '1', '2', '2', '2', '2', '2', '2']

from functools import reduce  # needed in Python 3; built in for Python 2

def add(x, y):
    return x + y

print(reduce(add, nums))
17

print(reduce(lambda x, y: x + y, nums))
17
'''Usage: wordcount_mapreduce.py [-h] DATA_DIR

Read a collection of .txt documents and count how
many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from functools import reduce  # needed in Python 3
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)


def count_words(filepath):
    counts = {}
    with open(filepath, 'r') as f:
        words = [word.strip() for word in f.read().split()]
    for word in words:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts


def merge_counts(counts1, counts2):
    for word, count in counts2.items():
        if word not in counts1:
            counts1[word] = 0
        counts1[word] += counts2[word]
    return counts1


if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
3.8.1.4 Parallel Implementation

Drawing on the previous implementation using map and reduce, we can parallelize the implementation using Python's multiprocessing API:

3.8.1.5 Benchmarking

To time each of the examples, enter it into its own Python file and use Linux's time command:

The output contains the real run time and the user run time. real is wall clock time, that is, the time from start to finish of the call. user is the amount of CPU time spent in user-mode code (outside the kernel) within the process, that is, only the actual CPU time used in executing the process.
    # list() so the map result can be concatenated in Python 3
    per_doc_counts = list(map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt'))))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)
'''Usage: wordcount_mapreduce_parallel.py [-h] DATA_DIR NUM_PROCESSES

Read a collection of .txt documents and count, in parallel, how many
times each word appears in the collection.

Arguments:
  DATA_DIR       A directory with documents (.txt files).
  NUM_PROCESSES  The number of parallel processes to use.

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from functools import reduce  # needed in Python 3
from docopt import docopt
from wordcount_mapreduce import count_words, merge_counts
from multiprocessing import Pool

logging.basicConfig(level=logging.DEBUG)

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    num_processes = int(args['NUM_PROCESSES'])

    pool = Pool(processes=num_processes)
    per_doc_counts = pool.map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt')))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)
$ time python wordcount.py docs-1000-10000
3.8.1.6 Exercises

E.python.wordcount.1:

Run the three different programs (serial, serial w/ map and reduce, parallel) and answer the following questions:

1. Is there any performance difference between the different versions of the program?
2. Does user time significantly differ from real time for any of the versions of the program?
3. Experiment with different numbers of processes for the parallel example, starting with 1. What is the performance gain when you go from 1 to 2 processes? From 2 to 3? When do you stop seeing improvement? (this will depend on your machine architecture)
3.8.1.7 References

Map, Filter and Reduce
multiprocessing API
3.8.2 NumPy ☁

NumPy is a popular library that is used by many other Python packages, such as Pandas, SciPy, and scikit-learn. It provides a fast, simple-to-use way of interacting with numerical data organized in vectors and matrices. In this section, we will provide a short introduction to NumPy.
3.8.2.1 Installing NumPy

The most common way of installing NumPy, if it wasn't included with your Python installation, is to install it via pip:

If NumPy has already been installed, you can update to the most recent version using:
$ pip install numpy
You can verify that NumPy is installed by trying to use it in a Python program:

Note that, by convention, we import NumPy using the alias np. Whenever you see np sprinkled in example Python code, it's a good bet that it is using NumPy.
3.8.2.2 NumPy Basics

At its core, NumPy is a container for n-dimensional data. Typically, 1-dimensional data is called an array and 2-dimensional data is called a matrix. Beyond 2 dimensions would be considered a multidimensional array. Examples where you'll encounter these dimensions may include:

1 Dimensional: time series data such as audio, stock prices, or a single observation in a dataset.
2 Dimensional: connectivity data between network nodes, user-product recommendations, and database tables.
3+ Dimensional: network latency between nodes over time, video (RGB+time), and version controlled datasets.

All of these data can be placed into NumPy's array object, just with varying dimensions.
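As a small sketch of this, arrays of different dimensionality are all created through the same interface:

import numpy as np

a = np.array([1, 2, 3])          # 1-dimensional array
m = np.array([[1, 2], [3, 4]])   # 2-dimensional matrix
t = np.zeros((2, 3, 4))          # 3-dimensional array (e.g. frames of data)
print(a.ndim, m.ndim, t.ndim)    # prints: 1 2 3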
3.8.2.3 Data Types: The Basic Building Blocks

Before we delve into arrays and matrices, we will start off with the most basic element of those: a single value. NumPy can represent data utilizing many different standard data types such as uint8 (an 8-bit unsigned integer), float64 (a 64-bit float), or str (a string). An exhaustive listing can be found at:

https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Before moving on, it is important to know about the tradeoff made when using different data types. For example, a uint8 can only contain values between 0 and 255. This, however, contrasts with float64, which can express any value from +/-
$ pip install -U numpy

import numpy as np
1.80e+308. So why wouldn't we just always use float64s? Though they allow us to be more expressive in terms of numbers, they also consume more memory. If we were working with a 12-megapixel image, for example, storing that image using uint8 values would require 3000*4000*8 = 96 million bits, or 11.44 MB of memory. If we were to store the same image utilizing float64, our image would consume 8 times as much memory: 768 million bits or 91.55 MB. It's important to use the right data type for the job to avoid consuming unnecessary resources or slowing down processing.
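We can check this arithmetic directly with NumPy's nbytes attribute; a quick sketch:

import numpy as np

img8 = np.zeros((3000, 4000), dtype=np.uint8)
img64 = np.zeros((3000, 4000), dtype=np.float64)
print(img8.nbytes)    # 12000000 bytes, about 11.44 MB
print(img64.nbytes)   # 96000000 bytes, about 91.55 MB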
Finally, while NumPy will conveniently convert between data types, one must be aware of overflows when using smaller data types. For example:

In this example, it makes sense that 6+7=13. But how does 13+245=2? Put simply, the object type (uint8) simply ran out of space to store the value and wrapped back around to the beginning. An 8-bit number is only capable of storing 2^8, or 256, unique values. An operation that results in a value above that range will 'overflow' and cause the value to wrap back around to zero. Likewise, anything below that range will 'underflow' and wrap back around to the end. In our example, 13+245 became 258, which was too large to store in 8 bits and wrapped back around to 0 and ended up at 2.

NumPy will, generally, try to avoid this situation by dynamically retyping to whatever data type will support the result:

Here, our addition caused our array, 'a', to be upscaled to use uint16 instead of uint8. Finally, NumPy offers convenience functions akin to Python's range() function to create arrays of sequential numbers:
a = np.array([6], dtype=np.uint8)
print(a)
>>> [6]
a = a + np.array([7], dtype=np.uint8)
print(a)
>>> [13]
a = a + np.array([245], dtype=np.uint8)
print(a)
>>> [2]

a = a + 260
print(a)
>>> [262]

X = np.arange(0.2, 1, .1)
print(X)
>>> array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=float32)
We can also use this function to generate parameter spaces that can be iterated on:
3.8.2.4 Arrays: Stringing Things Together

With our knowledge of data types in hand, we can begin to explore arrays. Simply put, arrays can be thought of as a sequence of values (not necessarily numbers). Arrays are 1-dimensional and can be created and accessed simply:

Arrays (and, later, matrices) are zero-indexed. This makes it convenient when, for example, using Python's range() function to iterate through an array:

Arrays are, also, mutable and can be changed easily:

NumPy also includes incredibly powerful broadcasting features. This makes it very simple to perform mathematical operations on arrays in a way that also makes intuitive sense:

Arrays can also interact with other arrays:
P = 10.0**np.arange(-7, 1, 1)
print(P)

for x, p in zip(X, P):
    print('%f, %f' % (x, p))

a = np.array([1, 2, 3])
print(type(a))
>>> <class 'numpy.ndarray'>
print(a)
>>> [1 2 3]
print(a.shape)
>>> (3,)
a[0]
>>> 1

for i in range(3):
    print(a[i])
>>> 1
>>> 2
>>> 3

a[0] = 42
print(a)
>>> array([42, 2, 3])

a * 3
>>> array([3, 6, 9])
a**2
>>> array([1, 4, 9], dtype=int32)

b = np.array([2, 3, 4])
print(a * b)
>>> array([2, 6, 12])
In this example, the result of multiplying together two arrays is to take the element-wise product, while multiplying by a constant will multiply each element in the array by that constant. NumPy supports all of the basic mathematical operations: addition, subtraction, multiplication, division, and powers. It also includes an extensive suite of mathematical functions, such as log() and max(), which are covered later.

3.8.2.5 Matrices: An Array of Arrays

Matrices can be thought of as an extension of arrays - rather than having one dimension, matrices have 2 (or more). Much like arrays, matrices can be created easily within NumPy:

Accessing individual elements is similar to how we did it for arrays. We simply need to pass in a number of arguments equal to the number of dimensions:

In this example, our first index selected the row and the second selected the column - giving us our result of 3. Matrices can be extended out to any number of dimensions by simply using more indices to access specific elements (though use-cases beyond 4 may be somewhat rare).

Matrices support all of the normal mathematical functions such as +, -, *, and /. A special note: the * operator will result in an element-wise multiplication. Use @ or np.matmul() for matrix multiplication:
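A quick self-contained sketch of the difference:

import numpy as np

m = np.array([[1, 2], [3, 4]])
print(m * m)   # element-wise: [[ 1  4]
               #                [ 9 16]]
print(m @ m)   # matrix product: [[ 7 10]
               #                  [15 22]]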
More complex mathematical functions can typically be found within the NumPy library itself:
A full listing can be found at:
m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]
m[1][0]
>>> 3

print(m - m)
print(m * m)
print(m / m)

print(np.sin(x))
print(np.sum(x))
https://docs.scipy.org/doc/numpy/reference/routines.math.html
3.8.2.6 Slicing Arrays and Matrices

As one can imagine, accessing elements one-at-a-time is both slow and can potentially require many lines of code to iterate over every dimension in the matrix. Thankfully, NumPy incorporates a very powerful slicing engine that allows us to access ranges of elements easily:

The ':' value tells NumPy to select all elements in the given dimension. Here, we've requested all elements in the second row (index 1). We can also use indexing to request elements within a given range:

Here, we asked NumPy to give us elements 4 through 7 (ranges in Python are inclusive at the start and non-inclusive at the end). We can even go backwards:

In the previous example, the negative value is asking NumPy to return the last 5 elements of the array. Had the argument been ':-5', NumPy would've returned everything BUT the last five elements:

Becoming more familiar with NumPy's accessor conventions will allow you to write more efficient, clearer code, as it is easier to read a simple one-line accessor than it is a multi-line, nested loop when extracting values from an array or matrix.
3.8.2.7 Useful Functions

The NumPy library provides several convenient mathematical functions that users can use. These functions provide several advantages to code written by
m[1, :]
>>> array([3, 4])

a = np.arange(0, 10, 1)
print(a)
>>> [0 1 2 3 4 5 6 7 8 9]

a[4:8]
>>> array([4, 5, 6, 7])

a[-5:]
>>> array([5, 6, 7, 8, 9])

a[:-5]
>>> array([0, 1, 2, 3, 4])
users:

They are open source and typically have multiple contributors checking for errors.
Many of them utilize a C interface and will run much faster than native Python code.
They're written to be very flexible.
NumPy arrays and matrices contain many useful aggregating functions such as max(), min(), mean(), etc. These functions are usually able to run an order of magnitude faster than looping through the object, so it's important to understand what functions are available to avoid 'reinventing the wheel.' In addition, many of the functions are able to sum or average across axes, which makes them extremely useful if your data has inherent grouping. To return to a previous example:

In this example, we created a 2x2 matrix containing the numbers 1 through 4. The sum of the matrix returned the element-wise addition of the entire matrix. Summing across axis 0 collapses the rows, returning the sum of each column. Likewise, summing across axis 1 collapses the columns, returning the sum of each row.
3.8.2.8 Linear Algebra

Perhaps one of the most important uses for NumPy is its robust support for linear algebra functions. Like the aggregation functions described in the previous section, these functions are optimized to be much faster than user implementations and can utilize processor-level features to provide very quick computations. These functions can be accessed very easily from the NumPy package:
m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]

m.sum()
>>> 10
m.sum(axis=1)
>>> [3, 7]
m.sum(axis=0)
>>> [4, 6]

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.matmul(a, b))
Included within np.linalg are functions for calculating the Eigendecomposition of square matrices and symmetric matrices. Finally, to give a quick example of how easy it is to implement algorithms in NumPy, we can easily use it to calculate the cost and gradient when using simple Mean-Squared-Error (MSE):

Finally, more advanced functions are easily available to users via the linalg library of NumPy as:
3.8.2.9 NumPy Resources

https://docs.scipy.org/doc/numpy
http://cs231n.github.io/python-numpy-tutorial/#numpy
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.linalg.html
https://en.wikipedia.org/wiki/Mean_squared_error
3.8.3 SciPy ☁

SciPy is a library built around NumPy and has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (such as integration), statistics, linear algebra, image processing, signal processing, and machine learning.

To achieve this, SciPy bundles a number of useful open-source software packages for mathematics, science, and engineering. It includes the following packages:

NumPy,
for managing N-dimensional arrays
>>> [[19 22]
>>>  [43 50]]

cost = np.power(Y - np.matmul(X, weights), 2).mean(axis=1)
gradient = np.matmul(X.T, np.matmul(X, weights) - Y)

from numpy import linalg

A = np.diag((1, 2, 3))
w, v = linalg.eig(A)
print('w =', w)
print('v =', v)
SciPy library,
to access fundamental scientific computing capabilities

Matplotlib,
to conduct 2D plotting

IPython,
for an interactive console (see jupyter)

Sympy,
for symbolic mathematics

pandas,
for providing data structures and analysis
3.8.3.1 Introduction

First we add the usual scientific computing modules with the typical abbreviations, including sp for scipy. We could invoke scipy's statistical package as sp.stats, but for the sake of laziness we abbreviate that too.

Now we create some random data to play with. We generate 100 samples from a Gaussian distribution centered at zero.

How many elements are in the set?

What is the mean (average) of the set?
import numpy as np          # import numpy
import scipy as sp          # import scipy
from scipy import stats    # refer directly to stats rather than sp.stats
import matplotlib as mpl   # for visualization
from matplotlib import pyplot as plt  # refer directly to pyplot
                                      # rather than mpl.pyplot

s = sp.randn(100)

print('There are', len(s), 'elements in the set')
What is the minimum of the set?

What is the maximum of the set?

We can use the scipy functions too. What's the median?

What about the standard deviation and variance?

Isn't the variance the square of the standard deviation?

How close are the measures? The differences are close, as the following calculation shows.

How does this look as a histogram? See Figure 18, Figure 19, Figure 20
print('The mean of the set is', s.mean())

print('The minimum of the set is', s.min())

print('The maximum of the set is', s.max())

print('The median of the set is', sp.median(s))

print('The standard deviation is', sp.std(s),
      'and the variance is', sp.var(s))

print('The square of the standard deviation is', sp.std(s)**2)

print('The difference is', abs(sp.std(s)**2 - sp.var(s)))

print('And in decimal form, the difference is %0.16f' %
      (abs(sp.std(s)**2 - sp.var(s))))

plt.hist(s)  # yes, one line of code for a histogram
plt.show()
Figure 18: Histogram 1

Let us add some titles.

Figure 19: Histogram 2

Typically we do not include titles when we prepare images for inclusion in LaTeX. There we use the caption to describe what the figure is about.
plt.clf()  # clear out the previous plot
plt.hist(s)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

plt.clf()  # clear out the previous plot
plt.hist(s)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Figure 20: Histogram 3

Let us try out some linear regression, or curve fitting. See Figure 21.

Figure 21: Result 1
import random

def F(x):
    return 2 * x - 2

def add_noise(x):
    return x + random.uniform(-1, 1)

X = range(0, 10, 1)

Y = []
for i in range(len(X)):
    Y.append(add_noise(X[i]))

plt.clf()  # clear out the old figure
plt.plot(X, Y, '.')
plt.show()
Now let's try linear regression to fit the curve.

What is the slope and y-intercept of the fitted curve?

Now let's see how well the curve fits the data. We'll call the fitted curve F'.

To save images into a PDF file for inclusion into LaTeX documents, you can save the images as follows. Other formats such as png are also possible, but the quality is naturally not sufficient for inclusion in papers and documents. For that you certainly want to use PDF. The save of the figure has to occur before you use the show() command. See Figure 22.
m, b, r, p, est_std_err = stats.linregress(X, Y)

print('The slope is', m, 'and the y-intercept is', b)

def Fprime(x):  # the fitted curve
    return m * x + b

X = range(0, 10, 1)
Yprime = []
for i in range(len(X)):
    Yprime.append(Fprime(X[i]))

plt.clf()  # clear out the old figure

# the observed points, blue dots
plt.plot(X, Y, '.', label='observed points')

# the interpolated curve, connected red line
plt.plot(X, Yprime, 'r-', label='estimated points')

plt.title("Linear Regression Example")  # title
plt.xlabel("x")  # horizontal axis title
plt.ylabel("y")  # vertical axis title

# legend labels to plot
plt.legend(['observed points', 'estimated points'])

# comment out so that you can save the figure
# plt.show()

plt.savefig("regression.pdf", bbox_inches='tight')
plt.savefig('regression.png')
plt.show()
Figure 22: Result 2

3.8.3.2 References

For more information about SciPy we recommend that you visit the following link:

https://www.scipy.org/getting-started.html#learning-to-work-with-scipy

Additional material and inspiration for this section are from:

“Getting Started guide”, https://www.scipy.org/getting-started.html
Prasanth. “Simple statistics with SciPy.” Comfort at 1 AU. February 28, 2011. https://oneau.wordpress.com/2011/02/28/simple-statistics-with-scipy/
SciPy Cookbook. Last updated: 2015. http://scipy-cookbook.readthedocs.io/
3.8.4 Scikit-learn ☁

Learning Objectives

Exploratory data analysis
Pipeline to prepare data
Full learning pipeline
Fine tune the model
Significance tests

3.8.4.1 Introduction to Scikit-learn

Scikit-learn is a machine-learning-specific library used in Python. The library can be used for data mining and analysis. It is built on top of NumPy, matplotlib, and SciPy. Scikit-learn features dimensionality reduction, clustering, regression, and classification algorithms. It also features model selection using grid search, cross validation, and metrics.

Scikit-learn also enables users to preprocess the data, which can then be used for machine learning, using modules like preprocessing and feature extraction.

Later in this section we demonstrate how simple it is to use k-means in scikit-learn.
3.8.4.2 Installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
3.8.4.3 Supervised Learning

Supervised learning is used in machine learning when we already know a set of output predictions based on input characteristics, and based on that we need to predict the target for a new input. Training data is used to train the model, which then can be used to predict the output from a bounded set.
$ pip install numpy
$ pip install scipy -U
$ pip install -U scikit-learn
Problems can be of two types:

1. Classification: Training data belongs to three or four classes/categories, and based on the label we want to predict the class/category for the unlabeled data.
2. Regression: Training data consists of input vectors with corresponding continuous target values; the task is to predict such a value for a new input. A simple example is predicting a house price from its size and location.
3.8.4.4 Unsupervised Learning

Unsupervised learning is used in machine learning when we have the training set available but without any corresponding target. The outcome of the problem is to discover groups within the provided input. It can be done in many ways.

A few of them are listed here:

1. Clustering: Discover groups of similar characteristics.
2. Density estimation: Finding the distribution of data within the provided input, or changing the data from a high dimensional space to two or three dimensions.
3.8.4.5 Building an end-to-end pipeline for supervised machine learning using Scikit-learn

A data pipeline is a set of processing components that are sequenced to produce meaningful data. Pipelines are commonly used in machine learning, since there is a lot of data transformation and manipulation that needs to be applied to make data useful for machine learning. All components are sequenced in a way that the output of one component becomes the input for the next, and each of the components is self-contained. Components interact with each other using data.

Even if a component breaks, the downstream component can run normally using the last output. Sklearn provides the ability to build pipelines that can be transformed and modeled for machine learning.
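As a minimal sketch of this idea with scikit-learn's Pipeline class (the step names and estimators are illustrative choices):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),      # component 1: standardize the features
    ('model', LogisticRegression()),  # component 2: fit a classifier
])
# calling pipe.fit(X_train, y_train) runs the components in sequence,
# feeding the output of 'scale' into 'model'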
3.8.4.6 Steps for developing a machine learning model

1. Explore the domain space
2. Extract the problem definition
3. Get the data that can be used to make the system learn to solve the problem definition.
4. Discover and visualize the data to gain insights
5. Feature engineering and prepare the data
6. Fine tune your model
7. Evaluate your solution using metrics
8. Once proven, launch and maintain the model.
3.8.4.7 Exploratory Data Analysis

Example project: Fraud detection system

The first step is to load the data into a data frame in order for a proper analysis to be done on the attributes.

Perform the basic analysis on the data shape and null value information.

Here are examples of a few of the visual data analysis methods.
3.8.4.7.1 Bar plot

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.

Bar graphs are used:

To make comparisons between variables
To visualize any trend in the data, i.e., they show the dependence of one variable on another
To estimate values of a variable
import pandas as pd  # assumed import for the examples that follow

data = pd.read_csv('dataset/data_file.csv')
data.head()

print(data.shape)
print(data.info())
data.isnull().values.any()
Figure 23: Example of scikit-learn bar plots
3.8.4.7.2 Correlation between attributes

Attributes in a dataset can be related based on different aspects.

Examples include attributes dependent on another, or loosely or tightly coupled. Another example is two variables that can be associated with a third one.

In order to understand the relationship between attributes, correlation represents the best visual way to get an insight. Positive correlation means both attributes move in the same direction. Negative correlation refers to opposite directions: an increase in one attribute's values results in a value decrease for the other. Zero correlation is when the attributes are unrelated.
plt.ylabel('Transactions')
plt.xlabel('Type')
data.type.value_counts().plot.bar()

# compute the correlation matrix
corr = data.corr()

# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))
Figure 24: scikit-learn correlation array
3.8.4.7.3 Histogram Analysis of dataset attributes

A histogram consists of a set of counts that represent the number of times some event occurred.
import seaborn as sns  # assumed import for the heatmap and box plots

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);

%matplotlib inline
data.hist(bins=30, figsize=(20, 15))
plt.show()
Figure 25: scikit-learn
3.8.4.7.4 Box plot Analysis

Box plot analysis is useful in detecting whether a distribution is skewed and in detecting outliers in the data.

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
tmp = data.loc[(data.type == 'TRANSFER'), :]

a = sns.boxplot(x='isFlaggedFraud', y='amount', data=tmp, ax=axs[0][0])
axs[0][0].set_yscale('log')

b = sns.boxplot(x='isFlaggedFraud', y='oldbalanceDest', data=tmp, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))

c = sns.boxplot(x='isFlaggedFraud', y='oldbalanceOrg', data=tmp, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))

d = sns.regplot(x='oldbalanceOrg', y='amount',
                data=tmp.loc[(tmp.isFlaggedFraud == 1), :], ax=axs[1][1])
plt.show()
Figure 26: scikit-learn
3.8.4.7.5 Scatter plot Analysis

The scatter plot displays values of two numerical variables as Cartesian coordinates.

plt.figure(figsize=(12, 8))
sns.pairplot(data[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']],
             hue='isFraud')

Figure 27: scikit-learn scatter plots
3.8.4.8 Data Cleansing - Removing Outliers

If the transaction amount is lower than 5 percent of all the transactions AND does not exceed USD 3000, we will exclude it from our analysis to reduce Type 1 costs.

If the transaction amount is higher than 95 percent of all the transactions AND exceeds USD 500000, we will exclude it from our analysis, and use a blanket review process for such transactions (similar to the isFlaggedFraud column in the original dataset) to reduce Type 2 costs.

low_exclude = np.round(np.minimum(fin_samp_data.amount.quantile(0.05), 3000), 2)
high_exclude = np.round(np.maximum(fin_samp_data.amount.quantile(0.95), 500000), 2)

### Updating Data to exclude records prone to Type 1 and Type 2 costs
low_data = fin_samp_data[fin_samp_data.amount > low_exclude]
3.8.4.9 Pipeline Creation

A machine learning pipeline is used to help automate machine learning workflows. Pipelines operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

3.8.4.9.1 Defining a DataFrameSelector to separate Numerical and Categorical attributes

Sample class to separate out numerical and categorical attributes.

3.8.4.9.2 Feature Creation / Additional Feature Engineering

During EDA we identified that there are transactions where the balances do not tally after the transaction is completed. We believe these could potentially be cases where fraud is occurring. To account for this error in the transactions, we define two new features, “errorBalanceOrig” and “errorBalanceDest”, calculated by adjusting the amount with the before and after balances for the Originator and Destination accounts.

Below, we create a class that allows us to create these features in a pipeline.
data = low_data[low_data.amount < high_exclude]

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.base import BaseEstimator, TransformerMixin

# column index
amount_ix, oldbalanceOrg_ix, newbalanceOrig_ix, oldbalanceDest_ix, newbalanceDest_ix = 0, 1, 2, 3, 4

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):  # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        errorBalanceOrig = X[:, newbalanceOrig_ix] + X[:, amount_ix] - X[:, oldbalanceOrg_ix]
        errorBalanceDest = X[:, oldbalanceDest_ix] + X[:, amount_ix] - X[:, newbalanceDest_ix]
        return np.c_[X, errorBalanceOrig, errorBalanceDest]
3.8.4.10 Creating Training and Testing datasets

The training set includes the set of input examples that the model will be fit into, or trained on, by adjusting the parameters. The testing dataset is critical to test the generalizability of the model. By using this set, we can get the working accuracy of our model.

The testing set should not be exposed to the model until model training has been completed. This way the results from testing will be more reliable.
3.8.4.11 Creating a pipeline for numerical and categorical attributes

Identifying columns with numerical and categorical characteristics.

3.8.4.12 Selecting the algorithm to be applied

Algorithm selection primarily depends on the objective you are trying to solve and what kind of dataset is available. There are different types of algorithms which can be applied, and we will look into a few of them here.
3.8.4.12.1 Linear Regression

This algorithm can be applied when you want to compute some continuous value. To predict some future value of a process which is currently running, you
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig",
                       "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig",
               "oldbalanceDest", "newbalanceDest", "type"]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])
can go with a regression algorithm.

Examples where linear regression can be used are:

1. Predict the time taken to go from one place to another
2. Predict the sales for a future month
3. Predict sales data and improve yearly projections.
3.8.4.12.2 Logistic Regression

This algorithm can be used to perform binary classification. It can be used if you want a probabilistic framework, or in case you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

1. Customer churn prediction.
2. Credit scoring and fraud detection, which is the example problem we are trying to solve in this chapter.
3. Calculating the effectiveness of marketing campaigns.
3.8.4.12.3 Decision trees

Decision trees handle feature interactions and they're non-parametric. They don't support online learning, and the entire tree needs to be rebuilt when new training
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time

scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)  # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(
    X_train, y_train, stratify=y_train,
    train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(
    X_test, y_test, stratify=y_test,
    train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(
    multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)

y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results
dataset comes in. Memory consumption is very high.

Decision trees can be used for the following cases:

1. Investment decisions
2. Customer churn
3. Bank loan defaulters
4. Build vs Buy decisions
5. Sales lead qualifications
3.8.4.12.4 KMeans

This algorithm is used when we are not aware of the labels and one needs to be created based on the features of objects. An example would be to divide a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to know exactly the number of clusters or groups that is required. It takes a lot of iterations to come up with the best K.
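One common heuristic for picking K is the so-called elbow method; a minimal sketch (the data here is a random illustrative placeholder):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # illustrative data; replace with your features

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# plot k versus inertia and look for the "elbow" where the curve
# stops dropping sharply; that k is a reasonable choice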
3.8.4.12.5 Support Vector Machines

SVM is a supervised ML technique and is used for pattern recognition and
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(
    X_train, y_train, stratify=y_train,
    train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(
    X_test, y_test, stratify=y_test,
    train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)

y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)

results.loc[len(results)] = ["KNN Arbitary Sklearn", np.round(acc, 3)]
results
classification problems when your data has exactly two classes. It is popular in text classification problems.

A few cases where SVM can be used are:

1. Detecting persons with common diseases.
2. Hand-written character recognition
3. Text categorization
4. Stock market price prediction
3.8.4.12.6 Naive Bayes

Naive Bayes is used for large datasets. This algorithm works well even when we have a limited CPU and memory available. It works by calculating a bunch of counts, and it requires less training data. The algorithm can't learn interactions between features.

Naive Bayes can be used in real-world applications such as:

1. Sentiment analysis and text classification
2. Recommendation systems like Netflix, Amazon
3. To mark an email as spam or not spam
4. Face recognition
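The other algorithms in this section come with code snippets; for completeness, a minimal Naive Bayes sketch with scikit-learn's GaussianNB, trained on the same kind of X_train/y_train split as above, could look like:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

model_nb = GaussianNB()
model_nb.fit(X_train, y_train)           # fit class-conditional Gaussians

y_pred_test = model_nb.predict(X_test)   # predict the most probable class
print(accuracy_score(y_test, y_pred_test))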
3.8.4.12.7 Random Forest

Random forest is similar to a decision tree. It can be used for both regression and classification problems with large datasets.

A few cases where it can be applied:

1. Predict patients with high risks.
2. Predict parts failures in manufacturing.
3. Predict loan defaulters.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=400, criterion='mse',
                               random_state=1, n_jobs=-1)

start = time.time()
forest.fit(X_train_std, y_train)
y_train_pred = forest.predict(X_train_std)
train_time = time.time() - start

start = time.time()
3.8.4.12.8 Neural networks

A neural network works based on the weights of the connections between neurons. The weights are trained, and based on that the neural network can be utilized to predict a class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

1. Unsupervised learning tasks, such as feature extraction.
2. Extracting features from raw images or speech with much less human intervention
3.8.4.12.9 Deep Learning using Keras

Keras is one of the most powerful and easy-to-use Python libraries for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.
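A minimal sketch of a Keras model (the layer sizes and the 8-feature input are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))  # hidden layer
model.add(Dense(1, activation='sigmoid'))             # binary output
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10, batch_size=32)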
3.8.4.12.10 XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
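A minimal sketch using the scikit-learn-style wrapper, as imported later in this section (the hyperparameter values are illustrative):

from xgboost.sklearn import XGBClassifier

model_xgb = XGBClassifier(max_depth=3,       # depth of each boosted tree
                          n_estimators=100)  # number of boosting rounds
model_xgb.fit(X_train, y_train)
y_pred_test = model_xgb.predict(X_test)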
3.8.4.13 Scikit Cheat Sheet

Scikit-learn has published a very in-depth and well-explained flowchart to help you choose the right algorithm, which I find very handy.
y_test_pred = forest.predict(X_test_std)
test_time = time.time() - start
Figure 28: scikit-learn
3.8.4.14 Parameter Optimization

Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem.

A parameter is a configuration that is part of the model, and its values can be derived from the given data.

1. Required by the model when making predictions.
2. Values define the skill of the model on your problem.
3. Estimated or learned from data.
4. Often not set manually by the practitioner.
5. Often saved as part of the learned model.
3.8.4.14.1 Hyperparameter optimization/tuning algorithms

Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

Random search provides a statistical distribution for each hyperparameter from which values may be randomly sampled.
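A sketch contrasting the two with scikit-learn (the model and parameter ranges are illustrative assumptions):

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from scipy.stats import uniform

# grid search: every combination in the grid is evaluated
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    {'C': [0.01, 1.0, 100], 'penalty': ['l1', 'l2']}, cv=4)

# random search: C is sampled from a continuous distribution
rand = RandomizedSearchCV(LogisticRegression(solver='liblinear'),
                          {'C': uniform(0.01, 100)}, n_iter=10, cv=4,
                          random_state=42)
# grid.fit(X_train, y_train); rand.fit(X_train, y_train)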
3.8.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)

3.8.4.15.1 Creating a parameter grid

3.8.4.15.2 Implementing grid search with the models and also creating metrics from each of the models.
grid_param = [
    [{  # LogisticRegression
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.01, 1.0, 100]
    }],
    [{  # keras
        'model__optimizer': optimizer,
        'model__loss': loss
    }],
    [{  # SVM
        'model__C': [0.01, 1.0, 100],
        'model__gamma': [0.5, 1],
        'model__max_iter': [-1]
    }],
    [{  # XGBClassifier
        'model__min_child_weight': [1, 3, 5],
        'model__gamma': [0.5],
        'model__subsample': [0.6, 0.8],
        'model__colsample_bytree': [0.6],
        'model__max_depth': [3]
    }]
]

Pipeline(memory=None,
         steps=[('preparation', FeatureUnion(n_jobs=None,
                 transformer_list=[('num_pipeline', Pipeline(memory=None,
                     steps=[('selector', DataFrameSelector(attribute_names=['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest'
         tol=0.0001, verbose=0, warm_start=False))])
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC

test_scores = []

# Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    linear_model.LogisticRegression(),
    keras_model,
    SVC(),
    XGBClassifier()
]
3.8.4.15.3 Results table from the model evaluation with metrics.

Figure 29: scikit-learn
# create table to compare MLA metrics
MLA_columns = ['Name', 'Score', 'Accuracy_Score', 'ROC_AUC_score',
               'final_rmse', 'Classification_error', 'Recall_Score',
               'Precision_Score']
MLA_compare = pd.DataFrame(columns=MLA_columns)
Model_Scores = pd.DataFrame(columns=['Name', 'Score'])

row_index = 0
for alg in MLA:

    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'Name'] = MLA_name
    # MLA_compare.loc[row_index, 'Parameters'] = str(alg.get_params())

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),  # combination of numerical and categorical pipelines
        ("model", alg)
    ])

    grid_search = GridSearchCV(full_pipeline_with_predictor,
                               grid_param[row_index], cv=4,
                               verbose=2, scoring='f1',
                               return_train_score=True)
    grid_search.fit(X_train[X_model_col], y_train)
    y_pred = grid_search.predict(X_test)

    MLA_compare.loc[row_index, 'Accuracy_Score'] = np.round(accuracy_score(y_pred, y_test), 3)
    MLA_compare.loc[row_index, 'ROC_AUC_score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)
    MLA_compare.loc[row_index, 'Score'] = np.round(grid_search.score(X_test, y_test), 3)

    negative_mse = grid_search.best_score_
    scores = np.sqrt(-negative_mse)
    final_mse = mean_squared_error(y_test, y_pred)
    final_rmse = np.sqrt(final_mse)
    MLA_compare.loc[row_index, 'final_rmse'] = final_rmse

    confusion_matrix_var = confusion_matrix(y_test, y_pred)
    TP = confusion_matrix_var[1, 1]
    TN = confusion_matrix_var[0, 0]
    FP = confusion_matrix_var[0, 1]
    FN = confusion_matrix_var[1, 0]
    MLA_compare.loc[row_index, 'Classification_error'] = np.round(((FP + FN) / float(TP + TN + FP + FN)), 5)
    MLA_compare.loc[row_index, 'Recall_Score'] = np.round(metrics.recall_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'Precision_Score'] = np.round(metrics.precision_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'F1_Score'] = np.round(f1_score(y_test, y_pred), 5)

    MLA_compare.loc[row_index, 'mean_test_score'] = grid_search.cv_results_['mean_test_score'].mean()
    MLA_compare.loc[row_index, 'mean_fit_time'] = grid_search.cv_results_['mean_fit_time'].mean()

    Model_Scores.loc[row_index, 'MLA Name'] = MLA_name
    Model_Scores.loc[row_index, 'ML Score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)

    # Collect Mean Test scores for statistical significance test
    test_scores.append(grid_search.cv_results_['mean_test_score'])
    row_index += 1
3.8.4.15.4 ROC AUC Score

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
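As a small self-contained sketch of computing this score with scikit-learn (the labels and scores are a toy illustration):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]                   # true binary labels
y_scores = [0.1, 0.4, 0.35, 0.8]        # predicted probabilities
print(roc_auc_score(y_true, y_scores))  # 0.75

# the underlying curve: false positive rate vs. true positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_scores)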
Figure 30: scikit-learn

Figure 31: scikit-learn
3.8.4.16 K-means in scikit-learn

In this section we demonstrate how simple it is to use k-means in scikit-learn.

3.8.4.16.1 Import
from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
3.8.4.16.2 Create samples

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300
print("n_digits:%d,\tn_samples%d,\tn_features%d"%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init''timeinertiahomocomplv-measARIAMIsilhouette')
print("n_digits:%d,\tn_samples%d,\tn_features%d"
%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init'
'timeinertiahomocomplv-measARIAMIsilhouette')
def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
3.8.4.16.3 Visualize

See Figure 32
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_,
                     n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
print(79 * '_')
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
Figure32:Result
3.8.5 Parallel Computing in Python ☁

In this module we will review the available Python modules that can be used for parallel computing. Parallel computing can take the form of either multi-threading or multi-processing. In the multi-threading approach, the threads run in the same shared memory heap, whereas in the case of multi-processing, the memory heaps of the processes are separate and independent; the communication between processes is therefore a little more complex.
3.8.5.1 Multi-threading in Python

Threading in Python is perfect for I/O operations where the process is expected to be idle regularly, e.g., web scraping. This is a very useful feature because several applications and scripts might spend the majority of their runtime waiting for network or data I/O. In several cases, e.g., web scraping, the resources, i.e., downloads from different websites, are most of the time independent. Therefore the processor can download in parallel and join the results at the end.
3.8.5.1.1 Thread vs Threading

There are two built-in modules in Python that are related to threading, namely thread and threading. The former module has been deprecated for quite some time in Python 2, and in Python 3 it was renamed to _thread. The _thread module provides a low-level threading API for multi-threading in Python, whereas the threading module builds a high-level threading interface on top of it.
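The following minimal sketch (not from the original text) contrasts the two APIs; both simply run a small illustrative function in a separate thread:

import _thread
import threading
import time

def task(name):
    print("running", name)

# low-level API: start_new_thread takes a callable and an argument tuple
_thread.start_new_thread(task, ("low-level",))

# high-level API: Thread objects can be started and joined
t = threading.Thread(target=task, args=("high-level",))
t.start()
t.join()

time.sleep(0.1)  # give the low-level thread time to finish before the program exits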
Thread() is the main class of the threading module. Its two most important arguments are target, for specifying the callable object, and args, to pass the arguments for the target callable. We illustrate these in the following example, saved in a file called hello_thread.py (we do not name the file threading.py, as that would shadow the standard library module it imports):

import threading

def hello_thread(thread_num):
    print("Hello from Thread ", thread_num)

if __name__ == '__main__':
    for thread_num in range(5):
        t = threading.Thread(target=hello_thread, args=(thread_num,))
        t.start()

This is the output of the previous example:

In [1]: %run hello_thread.py
Hello from Thread 0
Hello from Thread 1
Hello from Thread 2
Hello from Thread 3
Hello from Thread 4

In case you are not familiar with the if __name__ == '__main__': statement, what it does is basically make sure that the code nested under this condition will be run only if you run your module as a program, and will not run in case your module is imported into another file.

3.8.5.1.2 Locks

As mentioned prior, the memory space is shared between the threads. This is at the same time beneficial and problematic: it is beneficial in the sense that the communication between the threads becomes easy; however, you might experience strange outcomes if you let several threads change the same variable without caution, e.g., thread 2 changes variable x while thread 1 is working with it. This is when lock comes into play. Using a lock, you can allow only one thread to work with a variable. In other words, only a single thread can hold the lock. If the other threads need to work with that variable, they have to wait until the other thread is done and the variable is "unlocked".
We illustrate this with a simple example:
Suppose we want to print multiples of 3 between 1 and 12, i.e., 3, 6, 9 and 12. For the sake of argument, we try to do this using 2 threads and a nested for loop. We create a global variable called counter and initialize it with 0. Then, whenever each of the incrementer1 or incrementer2 functions is called, the counter is incremented by 3 twice (the counter is incremented by 6 in each function call):

import threading

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is now %d" % counter)

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()

If you run the previous code, you should be really lucky if you get the following as part of your output:

Counter is now 3
Counter is now 6
Counter is now 9
Counter is now 12

The reason is the conflict that happens between the threads while incrementing the counter in the nested for loop. As you probably noticed, the first-level for loop is the equivalent of adding 3 to the counter, and the conflict that might happen is not effective on that level but on the nested for loop. Accordingly, the output of the previous code is different in every run. This is an example output:

Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 4
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 8
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 10
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

We can fix this issue using a lock: whenever one of the functions is going to increment the value by 3, it will acquire() the lock, and when it is done the function will release() the lock. This mechanism is illustrated in the following code:
import threading

increment_by_3_lock = threading.Lock()

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

def incrementer2():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()
No matter how many times you run this code, the output would always be in the correct order:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 3
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 6
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 9
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12
Using the threading module increases both the overhead associated with thread management and the complexity of the program, which is why, in many situations, employing the multiprocessing module might be a better approach.
3.8.5.2 Multi-processing in Python

We already mentioned that multi-threading might not be sufficient in many applications and that we might need to use multiprocessing sometimes, or better to say, most of the time. That is why we dedicate this subsection to this particular module. This module provides you with an API for spawning processes the same way you spawn threads using the threading module. Moreover, there are some functionalities that are not even available in the threading module, e.g., the Pool class, which allows you to run a batch of jobs using a pool of worker processes.
3.8.5.2.1 Process

Similar to the threading module, which employs thread (a.k.a. _thread) under the hood, multiprocessing employs the Process class. Consider the following example:
from multiprocessing import Process
import os

def greeter(name):
    proc_idx = os.getpid()
    print("Process {0}: Hello {1}!".format(proc_idx, name))

if __name__ == '__main__':
    name_list = ['Harry', 'George', 'Dirk', 'David']
    process_list = []
    for name_idx, name in enumerate(name_list):
        current_process = Process(target=greeter, args=(name,))
        process_list.append(current_process)
        current_process.start()
    for process in process_list:
        process.join()
In this example, after importing Process we created a greeter() function that takes a name and greets that person. It also prints the pid (process identifier) of the process that is running it. Note that we used the os module to get the pid. At the bottom of the code, after checking the __name__ == '__main__' condition, we create a series of Processes and start them. Finally, in the last for loop, and using the join method, we tell Python to wait for the processes to terminate. This is one of the possible outputs of the code:

$ python3 process_example.py
Process 23451: Hello Harry!
Process 23452: Hello George!
Process 23453: Hello Dirk!
Process 23454: Hello David!

3.8.5.2.2 Pool

Consider the Pool class as a pool of worker processes. There are several ways of assigning jobs to the Pool class, and we will introduce the most important ones in this section. These methods are categorized as blocking or non-blocking. The former means that after calling the API, it blocks the thread/process until it has the result or answer ready, and the control returns only when the call completes. In the non-blocking case, on the other hand, the control returns immediately.

3.8.5.2.2.1 Synchronous Pool.map()

We illustrate the Pool.map method by re-implementing our previous greeter example using Pool.map:
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    sync_map = pool.map(greeter, names)
    print("Done!")
As you can see, we have seven names here, but we do not want to dedicate each greeting to a separate process. Instead, we do the whole job of "greeting seven people" using three processes. We create a pool of 3 processes with the Pool(processes=3) syntax and then we map an iterable called names to the greeter function using pool.map(greeter, names). As we expected, the greetings in the output will be printed from three different processes:

$ python poolmap_example.py
Process 30585: Hello Jenna!
Process 30586: Hello David!
Process 30587: Hello Marry!
Process 30585: Hello Ted!
Process 30585: Hello Jerry!
Process 30587: Hello Tom!
Process 30585: Hello Justin!
Done!

Note that Pool.map() is in the blocking category and does not return the control to your script until it is done calculating the results. That is why Done! is printed after all of the greetings are over.
3.8.5.2.2.2 Asynchronous Pool.map_async()

As the name implies, you can use the map_async method when you want to assign many function calls to a pool of worker processes asynchronously. Note that, unlike map, the order of the results is not guaranteed, and the control returns immediately. We now implement the previous example using map_async:
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    async_map = pool.map_async(greeter, names)
    print("Done!")
    async_map.wait()

As you probably noticed, the only difference (apart from the map_async method name) is calling the wait() method in the last line. The wait() method tells your script to wait for the result of map_async before terminating:

$ python poolmap_example.py
Done!
Process 30740: Hello Jenna!
Process 30741: Hello David!
Process 30740: Hello Ted!
Process 30742: Hello Marry!
Process 30740: Hello Jerry!
Process 30741: Hello Tom!
Process 30742: Hello Justin!

Note that the order of the results is not preserved. Moreover, Done! is printed before any of the results, meaning that if we did not use the wait() method, we probably would not see the results at all.

3.8.5.2.3 Locks

The way the multiprocessing module implements locks is almost identical to the way the threading module does. After importing Lock from multiprocessing, all you need to do is to acquire it, do some computation, and then release the lock. We will clarify the use of Lock by providing an example in the next section about process communication.

3.8.5.2.4 Process Communication
Process communication in multiprocessing is one of the most important, yet complicated, features for better use of this module. As opposed to threading, Process objects will not have access to any shared variable by default, i.e., there is no shared memory space between the processes by default. This effect is illustrated in the following example:
from multiprocessing import Process

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 1: Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 2: Counter is %d" % counter)

if __name__ == '__main__':
    t1 = Process(target=incrementer1)
    t2 = Process(target=incrementer2)
    t1.start()
    t2.start()

You probably already noticed that this is almost identical to our example in the threading section. Now, take a look at the strange output:

$ python communication_example.py
Greeter 1: Counter is 3
Greeter 1: Counter is 6
Greeter 2: Counter is 3
Greeter 2: Counter is 6

As you can see, it is as if the processes do not see each other. Instead of having two processes, one counting to 6 and the other counting from 6 to 12, we have two processes each counting to 6.

Nevertheless, there are several ways that Processes from multiprocessing can communicate with each other, including Pipe, Queue, Value, Array and Manager. Pipe and Queue are appropriate for inter-process message passing. To be more specific, Pipe is useful for process-to-process scenarios, while Queue is more appropriate for processes-to-processes ones. Value and Array are both used to provide synchronized access to shared data (very much like shared memory), and Managers can be used on different data types. In the following sub-sections, we cover both Value and Array since they are both lightweight, yet useful, approaches.
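Before turning to those, here is a complementary minimal sketch (not part of the original text) of message passing with a Queue; the worker function and the message it sends are illustrative assumptions:

from multiprocessing import Process, Queue

def worker(q):
    # the child process puts a message on the queue for the parent
    q.put("Hello from the child process")

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # blocks until the child has put a message on the queue
    p.join()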
3.8.5.2.4.1 Value

The following example re-implements the broken example from the previous section. We fix the strange output by using both Lock and Value:
from multiprocessing import Process, Lock, Value
import time

increment_by_3_lock = Lock()

def incrementer1(counter):
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        print("Greeter 1: Counter is %d" % counter.value)
        increment_by_3_lock.release()

def incrementer2(counter):
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        print("Greeter 2: Counter is %d" % counter.value)
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    t1 = Process(target=incrementer1, args=(counter,))
    t2 = Process(target=incrementer2, args=(counter,))
    t2.start()
    t1.start()

The usage of the Lock object in this example is identical to the example in the threading section. The usage of counter is, on the other hand, the novel part. First, note that counter is not a global variable anymore; instead, it is a Value, which returns a ctypes object allocated from shared memory between the processes. The first argument 'i' indicates a signed integer, and the second argument defines the initialization value, in this case 0. We then modified our two functions to take this shared variable as an argument. Finally, we changed the way we increment the counter, since counter is not a Python integer anymore but a ctypes signed integer, whose value we can access through its value attribute. The output of the code is now as we expect:

$ python mp_lock_example.py
Greeter 2: Counter is 3
Greeter 2: Counter is 6
Greeter 1: Counter is 9
Greeter 1: Counter is 12
The last example related to parallel processing illustrates the use of both Value and Array, as well as a technique to pass multiple arguments to a function. Note that map and map_async pass only a single argument to the target function, so bundling several values into one object is a common technique for them; the same technique is used here with Process as well.

In this example we create a multiprocessing.Array() object and assign it to a variable called names. As we mentioned before, the first argument is the ctypes data type, and since we want to create an array of strings with a length of 4 (the second argument), we imported c_char_p and passed it as the first argument.

Instead of passing the arguments separately, we merge both the Value and Array objects into a tuple and pass the tuple to the functions. We then modify the functions to unpack the objects in the first two lines of both functions. Finally, we change the print statement in a way that each process greets a particular name. The code and its output follow:
from multiprocessing import Process, Lock, Value, Array
import time
from ctypes import c_char_p

increment_by_3_lock = Lock()

def incrementer1(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        name_idx = counter.value // 3 - 1
        print("Greeter 1: Greeting {0}! Counter is {1}".format(
            names[name_idx].decode(), counter.value))
        increment_by_3_lock.release()

def incrementer2(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        name_idx = counter.value // 3 - 1
        print("Greeter 2: Greeting {0}! Counter is {1}".format(
            names[name_idx].decode(), counter.value))
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    # the names are byte strings under Python 3; sharing c_char_p pointers
    # works here because the array is filled before the children are started
    names = Array(c_char_p, 4)
    names[:] = [b'James', b'Tom', b'Sam', b'Larry']
    t1 = Process(target=incrementer1, args=((counter, names),))
    t2 = Process(target=incrementer2, args=((counter, names),))
    t2.start()
    t1.start()

$ python3 mp_lock_example.py
Greeter 2: Greeting James! Counter is 3
Greeter 2: Greeting Tom! Counter is 6
Greeter 1: Greeting Sam! Counter is 9
Greeter 1: Greeting Larry! Counter is 12
3.8.6 Dask - Random Forest Feature Detection ☁

3.8.6.1 Setup

First we need our tools. pandas gives us the DataFrame, very similar to R's DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.

3.8.6.2 Dataset

We are using the wine quality dataset, archived at UCI's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Now we will load our data. pandas makes it easy!

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.
import pandas as pd
import numpy as np

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)

# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# rose? other fruit wines? plum wine? :(

# for red wines
red_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467
std         1.741096          0.179060     0.194801        1.409928     0.047065
min         4.600000          0.120000     0.000000        0.900000     0.012000
25%         7.100000          0.390000     0.090000        1.900000     0.070000
50%         7.900000          0.520000     0.260000        2.200000     0.079000
75%         9.200000          0.640000     0.420000        2.600000     0.090000
max        15.900000          1.580000     1.000000       15.500000     0.611000
# for white wines
white_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    4898.000000       4898.000000  4898.000000     4898.000000  4898.000000
mean        6.854788          0.278241     0.334192        6.391415     0.045772
std         0.843868          0.100795     0.121020        5.072058     0.021848
min         3.800000          0.080000     0.000000        0.600000     0.009000
25%         6.300000          0.210000     0.270000        1.700000     0.036000
50%         6.800000          0.260000     0.320000        5.200000     0.043000
75%         7.300000          0.320000     0.390000        9.900000     0.050000
max        14.200000          1.100000     1.660000       65.800000     0.346000
Sometimes it is easier to understand the data visually. A histogram of the white wine quality data citric acid samples is shown next. You can of course visualize other columns' data or other datasets. Just replace the DataFrame and column name (see Figure 33).
import matplotlib.pyplot as plt

def extract_col(df, col_name):
    return list(df[col_name])

col = extract_col(white_df, 'citric acid')  # can replace with another dataframe or column
plt.hist(col)

# TODO: add axes and such to set a good example
plt.show()
Figure 33: Histogram
3.8.6.3 Detecting Features

Let us try out some elementary machine learning models. These models are not always for prediction; they are also useful to find what features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

3.8.6.3.1 Data Preparation
Let us assume we want to study what features are most correlated with pH. pH of course is real-valued and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).

# refresh to make Jupyter happy
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or
# less than the mean pH, respectively

M = np.mean(list(red_df['pH']))  # expect inelegant code in these mappings
Lf = lambda p: int(p < M) * 'lo' + int(p >= M) * 'hi'  # some C-style hackery

# create the new classifiable variable (list() is needed under Python 3, where map is lazy)
red_df['pH-hi-lo'] = list(map(Lf, list(red_df['pH'])))

# and remove the predecessor
del red_df['pH']
Now we specify which dataset and variable we want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets us create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn's documentation, along with many examples and tutorials. Random Forests are data science workhorses. They are the go-to method for most data scientists. Be careful relying on them, though: they tend to overfit. We try to avoid overfitting by separating the training and test datasets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df      # selected dataset
TARGET_VAR = 'pH-hi-lo'   # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR]  # no cheating

# TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# set up the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# pick a classifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

clf = RandomForestClassifier()

3.8.6.4 Random Forest

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-learn models.

# test it out
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test, pred)
var_score = clf.score(X_test, y_test)

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually decay exponentially. The first few highly-ranked features are usually the most important.

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# for the sake of clarity
num_features = X_train.shape[1]
features = [df.columns[x] for x in indices]
feature_importances = [importances[x] for x in indices]

print('Feature ranking:\n')

for i in range(num_features):
    feature_name = features[i]
    feature_importance = feature_importances[i]
    print('%s %f' % (feature_name.ljust(30), feature_importance))

Feature ranking:

fixed acidity                  0.269778
citric acid                    0.171337
density                        0.089660
volatile acidity               0.088965
chlorides                      0.082945
alcohol                        0.080437
total sulfur dioxide           0.067832
sulphates                      0.047786
free sulfur dioxide            0.042727
residual sugar                 0.037459
quality                        0.021075

Sometimes it is easier to visualize. We will use a bar chart. See Figure 34.
plt.clf()
plt.bar(range(num_features), feature_importances)
plt.xticks(range(num_features), features, rotation=90)
plt.ylabel('relative importance (a.u.)')
plt.title('Relative importances of most predictive features')
plt.show()
Figure 34: Result
3.8.6.5 Acknowledgement

This notebook was developed by Juliette Zerick and Gregor von Laszewski.
import dask.dataframe as dd

red_df = dd.read_csv('winequality-red.csv', sep=';', header=0)
white_df = dd.read_csv('winequality-white.csv', sep=';', header=0)
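As a minimal sketch (not part of the original notebook), note that Dask DataFrames are lazy; an explicit compute() call triggers the actual work, for example:

# dask builds a task graph; nothing is read until compute() is called
print(red_df.describe().compute())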
4 DEVOPS TOOLS

4.1 REFCARDS ☁
Learning Objectives

Quickly obtain information about technical aspects with the help of reference cards.

We present you with a list of useful short reference cards. These cards can be extremely useful to remind yourself about some important commands and features. Having them could simplify your interaction with the systems. We not only collected here some refcards about Linux, but also about other useful tools and services.

If you would like to add new topics, let us know via your contribution (see the contribution section).
Cheat Sheets

Editors: Emacs, Vi, Vim
Documentation: LaTeX, RST
Linux: Linux, Makefile, Git
Cloud/Virtualization: Openstack, Openstack, vagrant
SQL: SQL
Languages: R; Python: Python, Python Data, Numpy/Pandas, Python Tutorial, Python, Python, Python API Index, Python 3
4.2 VIRTUALBOX ☁

For development purposes we recommend that you use for this class an Ubuntu virtual machine that you set up with the help of VirtualBox. We recommend that you use the current version of Ubuntu and do not install or reuse a version that you have set up years ago.
As access to cloud resources requires some basic knowledge of Linux and security, we will restrict access to our cloud services to those that have demonstrated responsible use on their own computers. Naturally, as it is your own computer, you must make sure you follow proper security. We have seen in the past students carelessly working with virtual machines and introducing security vulnerabilities on our clouds just because "it was not their computer." Hence, we will allow using cloud resources only if you have demonstrated that you responsibly use a Linux virtual machine on your own computer. Only after you have successfully used Ubuntu in a virtual machine will you be allowed to use virtual machines on clouds.

A cloud drivers license test will be conducted. Only after you pass it will we let you gain access to the cloud infrastructure. We will announce this test. Before you have passed the test, you will not be able to use the clouds. Furthermore, you do not have to ask us for join requests to cloud projects before you have passed the test. Please be patient. Only students enrolled in the class can get access to the cloud.

If you, however, have access to other clouds yourself, you are welcome to use them. However, be reminded that projects need to be reproducible on our cloud. This will require you to make sure a TA can replicate it.
Let us now focus on using VirtualBox.

4.2.1 Installation

First you will need to install VirtualBox. It is easy to install and details can be found at:

https://www.virtualbox.org/wiki/Downloads

After you have installed VirtualBox you also need an image. For this class we will be using Ubuntu Desktop 18.04, which you can find at:

http://www.ubuntu.com/download/desktop
Please note that some hardware you may have may be too old or have too few resources to be useful. We have heard from students that the following is a minimal setup for the desktop machine:

multicore processor or better, allowing to run hypervisors
8GB system memory
50GB of free hard drive space

You may need multiple virtual machines, while the minimal configuration may not work for all cases.

As configuration we often use:

minimal: 1 core, 2GB memory, 5GB disk
latex: 2 cores, 4GB memory, 25GB disk
A video to showcase such an install is available at:

Using Ubuntu in Virtualbox (8:08)

Please note that the video shows version 16.04. You should, however, use the newest version, which at this time is 18.04.

If you specify your machine too small, you will not be able to install the development environment. Gregor used on his machine 8GB RAM and 25GB disk space.

Please let us know the smallest configuration that works.
4.2.2 Guest additions

The virtual guest additions allow you to easily do the following tasks:

Resize the windows of the VM
Copy and paste content between the guest operating system and the host operating system windows.

This way you can use many native programs on your host and copy contents easily into, for example, a terminal or an editor that you run in the VM.

A video is located at:

Virtualbox (4:46)

Please reboot the machine after installation and configuration.
On OSX, once you have enabled bidirectional copying in the Devices tab, you can use:

OSX to Vbox: command c, shift CTRL v
Vbox to OSX: shift CTRL c, shift CTRL v

On Windows the key combination is naturally different. Please consult your Windows manual. If you let us know, TAs will add the information here.
4.2.3 Exercises

E.Virtualbox.1:

Install Ubuntu Desktop on your computer with guest additions.

E.Virtualbox.2:

Make sure you know how to paste and copy between your host and guest operating system.

E.Virtualbox.3:

Install the programs defined by the development configuration.

E.Virtualbox.4:

Provide us with the key combination to copy and paste between Windows and Vbox.
4.3 VAGRANT ☁

Learning Objectives

Be able to experiment with virtual machines on your computer before you go on a cloud.
Simulate a virtual cluster with multiple VMs running on your computer if it is big enough.

A convenient tool to interface with VirtualBox is Vagrant. Vagrant allows us to manage virtual machines directly from the command line. It supports other providers as well and can be used to start virtual machines and even containers. The latest version of Vagrant includes the ability to automatically fetch a virtual machine image and start it on your local computer. It assumes that you have VirtualBox installed. Some key concepts and advertisement are located at:

https://www.vagrantup.com/intro/index.html

Detailed documentation for it is located at:

https://www.vagrantup.com/docs/index.html

A list of boxes is available from:

https://app.vagrantup.com/boxes/search

One image we will typically use is Ubuntu 18.04. Please note that older versions may not be suitable for class and we will not support any questions about them. This image is located at:

https://app.vagrantup.com/ubuntu/boxes/bionic64
4.3.1 Installation

Vagrant is easy to install. You can go to the download page and download and install the appropriate version:

https://www.vagrantup.com/downloads.html

4.3.1.1 macOS

On macOS, download the dmg image and click on it. You will find a pkg in it that you double click. After installation vagrant is installed in

/usr/local/bin/vagrant

Make sure /usr/local/bin is in your PATH. Start a new terminal to verify this.
Check it with:

echo $PATH

If it is not in the path, put

export PATH=/usr/local/bin:$PATH

in the terminal command or in your ~/.bash_profile.

4.3.1.2 Windows

students contribute

4.3.1.3 Linux

students contribute
4.3.2 Usage

To download, start, and log into the 18.04 image, use:

host$ vagrant init ubuntu/bionic64
host$ vagrant up
host$ vagrant ssh

Once you are logged in, you can test the version of python with

vagrant@ubuntu-bionic:~$ sudo apt-get update
vagrant@ubuntu-bionic:~$ python3 --version
Python 3.6.5

To install a newer version of python and pip, you can use

vagrant@ubuntu-bionic:~$ sudo apt-get install python3.7
vagrant@ubuntu-bionic:~$ sudo apt-get install python3-pip

To install the lightweight idle development environment, in case you do not want to use PyCharm, please use

vagrant@ubuntu-bionic:~$ sudo apt-get install idle-python

So that you do not always have to use the number 3, you can also set an alias with

alias python=python3

When you exit the virtual machine with the exit command, the VM is not terminated. You can use from your host system commands such as

host$ vagrant status
host$ vagrant destroy
host$ vagrant suspend
host$ vagrant resume

to manage the VM.

4.4 LINUX SHELL ☁
Learning Objectives

Be able to use the basic commands to work in a Linux terminal.
Get familiar with Linux commands.

In this chapter we introduce you to a number of useful shell commands. You may ask:

"Why is he so keen on telling me all about shells as I do have a beautiful GUI?"

You will soon learn that a GUI may not be that suitable if you like to manage 10, 100, 1000, 10000, … virtual machines. A command line interface could be much simpler and would allow scripting.
4.4.1 History

LINUX is a reimplementation by the community of UNIX, which was developed in 1969 by Ken Thompson and Dennis Ritchie of Bell Laboratories and rewritten in C. An important part of UNIX is what is called the kernel, which allows the software to talk to the hardware and utilize it.

In 1991 Linus Torvalds started developing a Linux kernel that was initially targeted for PCs. This made it possible to run it on laptops, and it was later further developed into a full operating system replacement for UNIX.

4.4.2 Shell

One of the most important features for us will be to access the computer with the help of a shell. The shell is typically run in what is called a terminal and allows interaction with the computer through command line programs.

There are many good tutorials out there that explain why one needs a Linux shell and not just a GUI. Randomly we picked the first one that came up in a Google query. This is not an endorsement of the material we point to, but it could be a worthwhile read for someone that has no experience in shell programming:

http://linuxcommand.org/lc3_learning_the_shell.php

Certainly you are welcome to use other resources that may suit you best. We will, however, summarize in table form a number of useful commands that you may also find as a RefCard:

http://www.cheat-sheets.org/#Linux

We provide in the next table a number of useful commands that you want to explore. For more information simply type man and the name of the command. If you find a useful command that is missing, please add it with a Git pull request.
Command                        Description

man command                    manual page for the command
apropos text                   list all commands that have text in them
ls                             directory listing
ls -lisa                       list details
tree                           list the directories in graphical form
cd dirname                     change directory to dirname
mkdir dirname                  create the directory
rmdir dirname                  delete the directory
pwd                            print working directory
rm file                        remove the file
cp a b                         copy file a to b
mv a b                         move/rename file a to b
cat a                          print content of file a
cat -n filename                print content of the file with line numbers
less a                         print paged content of file a
head -5 a                      display first 5 lines of file a
tail -5 a                      display last 5 lines of file a
du -hs .                       show in human readable form the space used by the current directory
df -h                          show the details of the disk file system
wc filename                    counts the words in a file
sort filename                  sorts the file
uniq filename                  displays only unique entries in the file
tar -xvf dir                   extracts the contents of the compressed tar archive
rsync                          faster, flexible replacement for rcp
gzip filename                  compresses the file
gunzip filename                uncompresses the file
bzip2 filename                 compresses the file with block-sorting
bunzip2 filename               uncompresses the file with block-sorting
clear                          clears the terminal screen
touch filename                 change file access and modification times or, if the file does not exist, create the file
who                            displays a list of users that are currently logged on; for each user the login name, date and time of login, tty name, and hostname if not local are displayed
whoami                         displays the user's effective id; see also id
echo -n string                 write specified arguments to standard output
date                           displays or sets date and time; when invoked without arguments the current date and time are displayed
logout                         exit a given session
exit                           when issued at the shell prompt, the shell will exit and terminate any running jobs within the shell
kill                           terminate or signal a process by sending a signal to the specified process, usually by the pid
ps                             displays a header line followed by all processes that have controlling terminals
sleep                          suspends execution for an interval of time specified in seconds
uptime                         displays how long the system has been running
time command                   times the command execution in seconds
find / [-name] file-name.txt   searches a specified path or directory with a given expression that tells the find utility what to find; if used as shown, the find utility would search the entire drive for a file named file-name.txt
diff                           compares files line by line
hostname                       prints the name of the current host system
which                          locates a program file in the user's path
tail                           displays the last part of the file
head                           displays the first lines of a file
top                            displays a sorted list of system processes
locate filename                finds the path of a file
grep 'word' filename           finds all lines with the word in them
grep -v 'word' filename        finds all lines without the word in them
chmod ug+rw filename           change file modes or Access Control Lists; in this example user and group are changed to read and write
chown                          change file owner and group
history                        a built-in command to list the past commands
sudo                           execute a command as another user
su                             substitute user identity
uname                          print the operating system name
set -o emacs                   tells the shell to use Emacs commands
chmod go-rwx file              changes the permission of the file
chown username file            changes the ownership of the file
chgrp group file               changes the group of a file
fgrep text filename            searches the text in the given file
grep -R text .                 recursively searches for text in all files
find . -name *.py              find all files with .py at the end
ps                             list the running processes
kill -9 1234                   kill the process with the id 1234
at                             queue commands for later execution
cron                           daemon to execute scheduled commands
crontab                        manage the timetable for executing commands with cron
mount /dev/cdrom /mnt/cdrom    mount a filesystem from a cdrom to /mnt/cdrom
users                          list the logged in users
who                            display who is logged in
whoami                         print the user id
dmesg                          display the system message buffer
last                           indicate last logins of users and ttys
uname                          print operating system name
date                           prints the current date and time
time command                   prints the sys, real and user times
shutdown -h "shutdown"         shut down the computer
ping                           ping a host
netstat                        show network status
hostname                       print name of current host system
traceroute                     print the route packets take to a network host
ifconfig                       configure network interface parameters
host                           DNS lookup utility
whois                          Internet domain name and network number directory service
dig                            DNS lookup utility
wget                           non-interactive network downloader
curl                           transfer a URL
ssh                            remote login program
scp                            remote file copy program
sftp                           secure file transfer program
watch command                  run any designated command at regular intervals
awk                            program that you can use to select particular records in a file and perform operations on them
sed                            stream editor used to perform basic text transformations
xargs                          program that can be used to build and execute commands from STDIN
cat some_file.json | python -m json.tool    quick and easy JSON validator
4.4.3 The command man

On Linux you find a rich set of manual pages for these commands. Try to pick one and execute:

$ man ls

You will see something like this:

LS(1)                  BSD General Commands Manual                 LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory,
     ls displays its name as well as any requested, associated
     information. For each operand that names a file of type directory,
     ls displays the names of files contained within that directory, as
     well as any requested, associated information.

     If no operands are given, the contents of the current directory are
     displayed. If more than one operand is given, non-directory
     operands are displayed first; directory and non-directory operands
     are sorted separately and in lexicographical order.

     The following options are available:

     -@      Display extended attribute keys and sizes in long (-l) output.

     -1      (The numeric digit ``one''.) Force output to be one entry
             per line. This is the default when output is not to a terminal.

     -A      List all entries except for . and ... Always set for the
             super-user.

     -a      Include directory entries whose names begin with a dot (.).

     ... on purpose cut ... instead try it yourself

4.4.4 Multi-command execution

One of the important features is that one can execute multiple commands in the shell.

To execute command2 once command1 has finished, use

command1; command2

To execute command2 as soon as command1 forwards output to stdout, use

command1 | command2

To execute command1 in the background, use

command1 &
4.4.5 Keyboard Shortcuts

These shortcuts will come in handy. Note that many overlap with emacs shortcuts.

Keys       Description

Up Arrow   Show the previous command
Ctrl + z   Stops the current command
           Resume with fg in the foreground
           Resume with bg in the background
Ctrl + c   Halts the current command
Ctrl + l   Clear the screen
Ctrl + a   Return to the start of the line
Ctrl + e   Go to the end of the line
Ctrl + k   Cut everything after the cursor to a special clipboard
Ctrl + y   Paste from the special clipboard
Ctrl + d   Log out of the current session, similar to exit

4.4.6 bashrc, bash_profile or zprofile

To see the usage of a particular command and all the attributes associated with it, use man command. Avoid using the rm -r command to delete files recursively. A good way to avoid accidental deletion is to include the following in the file .bash_profile or .zprofile on macOS, or .bashrc on other platforms:

alias rm='rm -i'
alias mv='mv -i'
alias h='history'

4.4.7 Makefile
Makefiles allow developers to coordinate the execution of code compilations. This not only includes C or C++ code, but any translation from source to a final format. For us this could include the creation of PDF files from latex sources, the creation of docker images, the creation of cloud services and their deployment through simple workflows represented in makefiles, or the coordination of execution targets.

As makefiles include a simple syntax allowing structural dependencies, they can easily be adapted to fulfill simple activities to be executed in repeated fashion by developers.

An example of how to use Makefiles for docker is provided at http://jmkhael.io/makefiles-for-your-dockerfiles/.

An example of how to use Makefiles for LaTeX is provided at https://github.com/cloudmesh/book/blob/master/Makefile.
Makefiles include a number of rules that are defined by a target name. Let us define a target called hello that prints out the string "Hello World":

hello:
	@echo "Hello World"

Important to remember is that the commands after a target are not indented just by spaces, but actually by a single TAB character. Editors such as emacs will be ideal to edit such Makefiles, while allowing syntax highlighting and easy manipulation of TABs. Naturally other editors will do that also. Please choose your editor of choice. One of the best features of targets is that they can depend on other targets. Thus, if we define

hallo: hello
	@echo "Hallo World"

our makefile will first execute hello and then all commands in hallo. As you can see, this can be very useful for defining simple dependencies.

In addition we can define variables in a makefile such as

HELLO="Hello World"

hello:
	@echo $(HELLO)

and can use them in our text with $ invocations.
Moreover, in sophisticated Makefiles, we could even make the targets dependent on files, and target rules could be defined that only compile those files that have changed since our last invocation of the Makefile, saving potentially a lot of time. However, for our work here we just use the most elementary makefiles; a small file-dependent rule is sketched next.
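As a minimal sketch (not from the original text), assuming a hypothetical LaTeX source report.tex, the following rule rebuilds report.pdf only when report.tex has changed (remember the TAB indentation):

# report.pdf is only rebuilt when report.tex is newer than it
report.pdf: report.tex
	pdflatex report.tex

clean:
	rm -f report.pdf *.aux *.log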
For more information we recommend that you find out about it on the internet. A convenient reference card is available at http://www.cs.jhu.edu/~joanne/unixRC.pdf.
4.4.8 chmod

The chmod command stands for change mode and changes the access permissions for given file system object(s). It uses the following syntax: chmod [options] mode[,mode] file1 [file2 …]. The option parameters modify how the process runs, including what information is output to the shell:
Option                  Description

-f, --silent, --quiet   Forces process to continue even if errors occur
-v, --verbose           Outputs for every file that is processed
-c, --changes           Outputs when a file is changed
--reference=RFile       Uses RFile instead of Mode values
-R, --recursive         Make changes to objects in subdirectories as well
--help                  Show help
--version               Show version information
Modes specify which rights to give to which users. Potential users include the user who owns the file, users in the file's group, other users not in the file's group, and all, abbreviated as u, g, o, and a respectively. More than one user can be specified in the same command, such as chmod -v ug(operator)(permissions) file.txt. If no user is specified, the command defaults to a. Next, a + or - indicates whether permissions should be added or removed for the selected user(s). The permissions are as follows:
Permission   Description

r            Read
w            Write
x            Execute file or access directory
X            Execute only if the object is a directory
s            Set the user or group ID when running
t            Restricted deletion flag or sticky mode
u            Specifies the permissions the user who owns the file has
g            Specifies the permissions of the group
o            Specifies the permissions of users not in the group
More than one permission can also be used in the same command as follows:

$ chmod -v o+rw file.txt

Multiple files can also be specified:

$ chmod a-x,o+r file1.txt file2.txt

4.4.9 Exercises

E.Linux.1

Familiarize yourself with the commands.

E.Linux.2

Find more commands that you find useful and add them to this page.

E.Linux.3

Use the sort command to sort all lines of a file while removing duplicates.

E.Linux.4

Should there be other commands listed in the table with the Linux commands? If so, which? Create a pull request for them.

E.Linux.5

Write a section explaining chmod. Use letters, not numbers.

E.Linux.6

Write a section explaining chown. Use letters, not numbers.

E.Linux.7

Write a section explaining su and sudo.

E.Linux.8

Write a section explaining cron, at, and crontab.
4.5 SECURE SHELL ☁

Learning Objectives

This is a very important section of the book; study it carefully.
Learn how to use SSH keys.
Learn how to use ssh-add and ssh-keychain so you only have to type in your password once.
Understand that each computer needs its own ssh key.
Secure Shell is a network protocol allowing users to securely connect to remote resources over the internet. In many services we need to use SSH to assure that we protect the messages sent between the communicating entities. Secure Shell is based on public key technology, requiring the generation of a public-private key pair on the computer. The public key will then be uploaded to the remote machine, and when a connection is established during authentication, the public-private key pair is tested. If they match, authentication is granted. As many users may have to share a computer, it is possible to add a list of public keys so that a number of computers can connect to a server that hosts such a list. This mechanism builds the basis for networked computers.

In this section we will introduce you to some of the commands to utilize secure shell. We will reuse this technology in other sections, for example to create a network of workstations to which we can log in from your laptop. For more information please also consult the SSH Manual.

Whatever others tell you, the private key should never be copied to another machine. You almost always want to have a passphrase protecting your key.
4.5.1 ssh-keygen
The first thing you will need to do is to create a public-private key pair. Before you do this, check whether there are already keys on the computer you are using:

ls ~/.ssh

If there are files named id_rsa.pub or id_dsa.pub, then the keys are set up already, and we can skip the generating keys step. However, you must know the passphrase of the key. If you forgot it, you will need to recreate the key. However, you will lose the ability to connect with the old key to the resources to which you uploaded the public key. So be careful.

To generate a key pair, use the command ssh-keygen. This program is commonly available on most UNIX systems and most recently even Windows 10.

To generate the key, please type:

$ ssh-keygen -t rsa -C <comment>

The comment will remind you where the key has been created; you could, for example, use the hostname on which you created the key.

In the following text we will use localname to indicate the username on your computer on which you execute the command.
The command requires the interaction of the user. The first question is:

Enter file in which to save the key (/home/localname/.ssh/id_rsa):

We recommend using the default location ~/.ssh/ and the default name id_rsa. To do so, just press the enter key.

The second and third questions are there to protect your ssh key with a passphrase. This passphrase will protect your key because you need to type it when you want to use the key. Thus, you can either type a passphrase or press enter to leave it without a passphrase. To avoid security problems, you MUST choose a passphrase. Make sure to not just type return for an empty passphrase:

Enter passphrase (empty for no passphrase):

and:

Enter same passphrase again:

Please use a strong passphrase to protect your key appropriately. Some may advise you (including teachers and TAs) to not use passphrases. This is WRONG, as it allows someone that gains access to your computer to also gain access to all resources that have the public key. Only for some system-related services may you create passwordless keys, but such systems need to be properly protected.

Not using passphrases poses a security risk!

If executed correctly, you will see some output similar to:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/localname/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/localname/.ssh/id_rsa.
Your public key has been saved in /home/localname/.ssh/id_rsa.pub.
The key fingerprint is:
34:87:67:ea:c2:49:ee:c2:81:d2:10:84:b1:3e:05:[email protected]
+--[RSA 2048]----+
|.+...Eo=.       |
|..=.o+o+o       |
|O.=......       |
|=...            |
+-----------------+

Once you have generated your key, you should have it in the .ssh directory. You can check it by:

$ cat ~/.ssh/id_rsa.pub

If everything is normal, you will see something like:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCXJH2iG2FMHqC6T/U7uB8kt
6KlRh4kUOjgw9sc4Uu+Uwe/kshuispauhfsjhfm,anf6787sjgdkjsgl+EwD0
thkoamyi0VvhTVZhj61pTdhyl1t8hlkoL19JVnVBPP5kIN3wVyNAJjYBrAUNW
4dXKXtmfkXp98T3OW4mxAtTH434MaT+QcPTcxims/hwsUeDAVKZY7UgZhEbiE
xxkejtnRBHTipi0W03W05TOUGRW7EuKf/4ftNVPilCO4DpfY44NFG1xPwHeim
Uk+t9h48pBQj16FrUCp0rS02Pj+4/9dNeS1kmNJu5ZYS8HVRhvuoTXuAY/UVc

The directory ~/.ssh will also contain the private key id_rsa, which you must not share or copy to another computer.

Never copy your private key to another machine or check it into a repository!

To see what is in the .ssh directory, please use

$ ls ~/.ssh

Typically you will see a list of files such as

authorized_keys
id_rsa
id_rsa.pub
known_hosts

In case you need to change your passphrase, you can simply run the ssh-keygen -p command. Then specify the location of your current key, and input the (old and) new passphrases. There is no need to re-generate keys:

$ ssh-keygen -p

You will see the following output once you have completed that step:

Enter file in which the key is (/home/localname/.ssh/id_rsa):
Enter old passphrase:
Key has comment '/home/localname/.ssh/id_rsa'
Enter new passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved with the new passphrase.
4.5.2 ssh-add

Often you will find wrong information about passphrases on the internet, with people recommending you not to use one. However, it is in almost all cases better to create a key pair and use ssh-add to add the key to the current session so it can be used on behalf of you. This is accomplished with an agent.

The ssh-add command adds SSH private keys into the SSH authentication agent for implementing single sign-on with SSH. ssh-add allows the user to use any number of servers that are spread across any number of organizations, without having to type in a password every time when connecting between servers. This is commonly used by system administrators to log into multiple servers.

ssh-add can be run without arguments. When run without arguments, it adds the following default files if they do exist:

~/.ssh/identity - Contains the protocol version 1 RSA authentication identity of the user.
~/.ssh/id_rsa - Contains the protocol version 2 RSA authentication identity of the user.
~/.ssh/id_dsa - Contains the protocol version 2 DSA authentication identity of the user.
~/.ssh/id_ecdsa - Contains the protocol version 2 ECDSA authentication identity of the user.
To add a key, you can provide the path of the key file as an argument to ssh-add. For example,

ssh-add ~/.ssh/id_rsa

would add the file ~/.ssh/id_rsa.

If the key being added has a passphrase, ssh-add will run the ssh-askpass program to obtain the passphrase from the user. If the SSH_ASKPASS environment variable is set, the program given by that environment variable is used instead.

Some people use the SSH_ASKPASS environment variable in scripts to provide a passphrase for a key. The passphrase might then be hard-coded into the script, or the script might fetch it from a password vault.
The command line options of ssh-add are as follows:

Option     Description

-c         Causes a confirmation to be requested from the user every time the added identities are used for authentication. The confirmation is requested using ssh-askpass.
-D         Deletes all identities from the agent.
-d         Deletes the given identities from the agent. The private key files for the identities to be deleted should be listed on the command line.
-e pkcs11  Remove key provided by pkcs11.
-L         Lists public key parameters of all identities currently represented by the agent.
-l         Lists fingerprints of all identities currently represented by the agent.
-s pkcs11  Add key provided by pkcs11.
-t life    Sets the maximum time the agent will keep the given key. After the timeout expires, the key will be automatically removed from the agent. The default value is in seconds, but can be suffixed with m for minutes, h for hours, d for days, or w for weeks.
-X         Unlocks the agent. This asks for a password to unlock.
-x         Locks the agent. This asks for a password; the password is required for unlocking the agent. When the agent is locked, it cannot be used for authentication.
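For example, to list the fingerprints of all keys currently held by the agent, you can use:

$ ssh-add -l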
4.5.3 SSH Add and Agent

To not always type in your password, you can use ssh-add, as previously discussed. It prompts the user for a private key passphrase and adds the key to a list of keys managed by the ssh-agent. Once it is in this list, you will not be asked for the passphrase as long as the agent is running; the agent then authenticates on your behalf with your public key. To use the key across terminal shells you can start an ssh agent.

To start the agent, please use the following command:

$ eval `ssh-agent`

or use

$ eval "$(ssh-agent -s)"

It is important that you use the backquote, located under the tilde (US keyboard), rather than the single quote. Once the agent is started, it will print a PID that you can use to interact with it later.

To add the key, use the command

$ ssh-add

To remove the agent, use the command

kill $SSH_AGENT_PID

To execute the command upon logout, place it in your .bash_logout (assuming you use bash).

On OSX you can also add the key permanently to the keychain if you do the following:

ssh-add -K ~/.ssh/id_rsa

Modify the file .ssh/config and add the following lines:

Host *
  UseKeychain yes
  AddKeysToAgent yes
  IdentityFile ~/.ssh/id_rsa

4.5.3.1 Using SSH on Mac OS X

Mac OS X comes with an ssh client. In order to use it you need to open the Terminal.app application. Go to Finder, then click Go in the menu bar at the top of the screen. Now click Utilities and then open the Terminal application.

4.5.3.2 Using SSH on Linux

All Linux versions come with ssh, and it can be used right from the terminal.
4.5.3.3 Using SSH on Raspberry Pi 3/4

SSH is available on Raspbian. However, to ssh into the Pi you have to activate it via the configuration menu.

4.5.3.4 Accessing a Remote Machine

Once the key pair is generated, you can use it to access a remote machine. To do so, the public key needs to be added to the authorized_keys file on the remote machine.
The easiest way to do this is to use the command ssh-copy-id:

$ ssh-copy-id user@host

Note that the first time you will have to authenticate with your password.

Alternatively, if ssh-copy-id is not available on your system, you can copy the file manually over SSH:

$ cat ~/.ssh/id_rsa.pub | ssh user@host 'cat >> .ssh/authorized_keys'

Now try:

$ ssh user@host

and you will not be prompted for a password. However, if you set a passphrase when creating your SSH key, you will be asked to enter the passphrase at that time (and whenever else you log in in the future). To avoid typing in the password all the time, we use the ssh-add command that we described earlier:

$ ssh-add

4.5.4 SSH Port Forwarding

TODO: Add images to illustrate the concepts.

SSH port forwarding (SSH tunneling) creates an encrypted secure connection between a local computer and a remote computer through which services can be relayed. Because the connection is encrypted, SSH tunneling is useful for transmitting information that uses an unencrypted protocol.
4.5.4.1 Prerequisites

Before you begin, you need to check if forwarding is allowed on the SSH server you will connect to. You also need to have an SSH client on the computer you are working on.

If you are using the OpenSSH server, edit:

$ vi /etc/ssh/sshd_config

and look for and change the following:

AllowTcpForwarding yes
GatewayPorts yes

Set the GatewayPorts variable only if you are going to use remote port forwarding (discussed later in this tutorial). Then, you need to restart the server for the change to take effect.

4.5.4.2 How to Restart the Server

If you are on:

Linux, depending upon the init system used by your distribution, run:

$ sudo systemctl restart sshd

or

$ sudo service sshd restart

Note that depending on your distribution, you may have to change the service to ssh instead of sshd.

Mac, you can restart the server using:

$ sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
$ sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist

Windows: if you want to set up an SSH server, have a look at MSYS2 or Cygwin.

4.5.4.3 Types of Port Forwarding

There are three types of SSH port forwarding:
4.5.4.4LocalPortForwarding
Local port forwarding lets you connect from your local computer to anotherserver. Itallowsyou to forward trafficonaportofyour localcomputer to theSSH server, which is forwarded to a destination server. To use local portforwarding,youneedtoknowyourdestinationserver,andtwoportnumbers.
Example1:
Where <host> should be replaced by the name of your laptop. The -L optionspecifies local port forwarding. For the duration of the SSH session, pointingyourbrowserathttp://localhost:8080/wouldsendyoutohttp://cloudcomputing.com
Example2:
Thisexampleopensaconnectiontothewww.cloudcomputing.comjumpserver,and forwards any connection to port 80 on the local machine to port 80 onintra.example.com.
Example3:
By default, anyone (even on different machines) can connect to the specifiedportontheSSHclientmachine.However,thiscanberestrictedtoprogramsonthesamehostbysupplyingabindaddress:
Example4:
This would forward two connections, one to www.cloudcomputing.com, the other towww.cloud.com.Pointingyourbrowserathttp://localhost:8080/woulddownloadpagesfromwww.cloudcomputing.com,andpointingyourbrowsertohttp://localhost:12345/woulddownloadpagesfromwww.cloud.com.
Example5:
$ssh-L8080:www.cloudcomputing.org:80<host>
$ssh-L80:intra.example.com:80www.cloudcomputing.com
$ssh-L127.0.0.1:80:intra.example.com:80www.cloudcomputing.com
$ssh-L8080:www.Cloudcomputing.com:80-L12345:cloud.com:80<host>
ThedestinationservercanevenbethesameastheSSHserver.
TheLocalForwardoptionintheOpenSSHclientconfigurationfilecanbeusedtoconfigureforwardingwithouthavingtospecifyitoncommandline.
4.5.4.5RemotePortForwarding
Remote port forwarding is the exact opposite of local port forwarding. Itforwardstrafficcomingtoaportonyourservertoyourlocalcomputer,andthenit is sent to adestination.The first argument shouldbe the remoteportwheretrafficwillbedirectedontheremotesystem.Thesecondargumentshouldbetheaddressandporttopointthetraffictowhenitarrivesonthelocalsystem.
SSH does not by default allow remote hosts to forwarded ports. To enableremoteforwardingaddthefollowingto:/etc/ssh/sshd_config
andrestartSSH
Aftercompletingthepreviousstepsyoushouldbeabletoconnecttotheserverremotely, even fromyour localmachine. ssh-R first creates an SSH tunnel thatforwards traffic from the server on port 9000 to your local machine on port3000.
4.5.4.6DynamicPortForwarding
Dynamic port forwarding turns your SSH client into a SOCKS proxy server.SOCKS is a little-known but widely-implemented protocol for programs torequestanyInternetconnectionthroughaproxyserver.Eachprogramthatusestheproxyserverneedstobeconfiguredspecifically,andreconfiguredwhenyoustopusingtheproxyserver.
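The command that creates such a proxy is not shown in this text; a minimal sketch using ssh's -D option (the port number and <host> are placeholders) is:

$ ssh -D 5000 <host>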
The SSH client creates a SOCKS proxy at port 5000 on your local computer. Any traffic sent to this port is sent to its destination through the SSH server.

Next, you will need to configure your applications to use this server. The Settings section of most web browsers allows you to use a SOCKS proxy.
4.5.4.7 ssh config

Defaults and other configurations can be added to a configuration file that is placed in the system. The ssh program on a host receives its configuration from

the command line options
a user-specific configuration file: ~/.ssh/config
a system-wide configuration file: /etc/ssh/ssh_config

Next we provide an example of how to use a config file.
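The original example is not reproduced here; a minimal sketch of a user-specific ~/.ssh/config entry, in which the alias, host name, user, and key file are illustrative placeholders, could look like:

Host myserver
    HostName server.example.com
    User albert
    IdentityFile ~/.ssh/id_rsa

With such an entry, typing ssh myserver expands to the full command line with the given host, user, and key.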
4.5.4.8 Tips

Use SSH keys

You will need to use ssh keys to access remote machines.

No blank passphrases

In most cases you must use a passphrase with your key. In fact, if we find that you use passwordless keys to futuresystems and to chameleon cloud resources, we may elect to give you an F for the assignment in question. There are some exceptions, but they will be clearly communicated to you in class. As part of your cloud drivers license test you will explain how you gain access to futuresystems and chameleon, explicitly address this point, and provide us with reasons for what you cannot do.

A key for each server

Under no circumstances copy the same private key on multiple servers. This violates security best practices. Create for each server a new private key and use their public keys to gain access to the appropriate server.

Use SSH agent

So as not to have to type in the passphrase for a key all the time, we recommend using ssh-agent to manage the login. This will be part of your cloud drivers license.
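As a short sketch, starting the agent and adding a key typically looks as follows (the key path shown is the default and may differ on your system):

$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_rsa

ssh-add prompts once for the passphrase and caches the unlocked key in memory for subsequent logins.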
But shut down the ssh-agent if not in use.

Keep an offline backup, but encrypt the drive

You may for some of our projects need to make backups of private keys on other servers you set up. If you like to make a backup you can do so on a USB stick, but make sure that access to the stick is encrypted. Do not store anything else on that stick and lock it in a safe place. If you lose the stick, recreate all keys on all machines.
4.5.4.9 References

The Secure Shell: The Definitive Guide, 2nd Edition (O'Reilly and Associates)
4.5.5 SSH to FutureSystems Resources ☁

Learning Objectives

Obtain a FutureSystems account so you can use kubernetes or docker swarm or other services offered by FutureSystems.

Next, you need to upload the key to the portal. You must be logged into the portal to do so.

Step 1: Log into the portal https://portal.futuresystems.org/

Step 2: Click on the "MY ACCOUNT" link.

Step 3: Click on "EDIT".

Step 4: Paste your ssh key into the box marked Public SSH Key. Use a text editor to open the id_rsa.pub. Copy the entire contents of this file into the ssh key field as part of your profile information. Many errors are introduced by users in this step as they do not copy and paste correctly.

If you need to add keys, use the Add another item button.

At this point, you have uploaded your key. However, you will still need to wait till all accounts have been set up to use the key, or, if you did not have an account, till it has been created by an administrator. Please check your email for further updates. You can also refresh this page and see if the boxes in your account status information are all green. Then you can continue.

4.5.5.1 Testing your FutureSystems ssh key

If you have had no FutureSystems account before, you need to wait for up to two business days so we can verify your identity and create the account. So please wait. Otherwise, testing your new key is almost instantaneous on india. For other clusters it can take around 30 minutes to update the ssh keys.

To log into india simply type the usual ssh command, such as (replace <username> with your portal username):

$ ssh <username>@india.futuresystems.org

The first time you ssh into a machine you will see a message like this:

The authenticity of host 'india.futuresystems.org (192.165.148.5)' cannot be established.
RSA key fingerprint is 11:96:de:b7:21:eb:64:92:ab:de:e0:79:f3:fb:86:dd.
Are you sure you want to continue connecting (yes/no)? yes

You have to type yes and press enter. Then you will be logged into india. Other FutureSystems machines can be reached in the same fashion. Just replace the name india with the appropriate FutureSystems resource name.

4.5.6 Exercises ☁

E.SSH.1:
Create an SSH key pair.

E.SSH.2:

Upload the public key to the git repository you use.

E.SSH.3:

What is the output for a key that has a passphrase when executing the following command? Test it out on your key:

$ grep ENCRYPTED ~/.ssh/id_rsa

E.SSH.4:

Get an account on futuresystems.org (if you are authorized to do so). Upload your key to https://futuresystems.org. Log in to india.futuresystems.org. Note that this could take some time as administrators need to approve you. Be patient.

E.SSH.5:

What can happen if you copy your private key to a machine on the network?

E.SSH.6:

Should I share my private key with others?

E.SSH.7:

Assume I participate in a video conference call and I accidentally share my private key. What should I do?

E.SSH.8:

Assume I participate in a video conference call and I accidentally share my public key. What should I do?
4.6 GITHUB ☁
Learning Objectives

Be able to use the github cloud services to collaboratively develop content and programs. Be able to use github as part of an open source project.

In some classes the material may be openly shared in code repositories. This includes class material, papers, and projects. Hence, we need some mechanism to share content with a large number of students.

First, we like to introduce you to git and github.com (Section 1.1). Next, we provide you with the basic commands to interact with git from the command line (Section 1.12). Then we will show you how you can contribute to this set of documentation with pull requests.

4.6.1 Overview

Github is a code repository that allows the development of code and documents with many contributors in a distributed fashion. There are many good tutorials about github. Some of them can be found on the github Web page. An interactive tutorial is for example available at

https://try.github.io/

However, although these tutorials are helpful, in many cases they do not address some situations. For example, you may already have a repository set up by your organization and you do not have to completely initialize it. Thus do not just replicate the commands in the tutorials, or the ones we present here, without evaluating their consequences. In general make sure you verify that a command does what you expect before you execute it.

A more extensive list of tutorials can be found at

https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github

The github foundation has a number of excellent videos about git. If you are unfamiliar with git and you like to watch videos in addition to reading the documentation, we recommend these videos:

https://www.youtube.com/user/GitHubGuides/videos

Next, we introduce some important concepts used in github.
4.6.2 Upload Key

Before you can work with a repository in an easy fashion you need to upload a public key in order to access your repository. Naturally, you need to generate a key first, which is explained in the section about ssh key generation (TODO: lessons-ssh-generate-key include link), before you upload one. Copy the contents of your .ssh/id_rsa.pub file and add them to your github keys.
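To display the contents so you can copy them, a simple command is (assuming the default key location):

$ cat ~/.ssh/id_rsa.pub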
More information on this topic can be found on the github Web page.
4.6.3 Fork

Forking is the first step to contributing to projects on GitHub. Forking allows you to copy a repository and work on it under your own account. Next, creating a branch, making some changes, and offering a pull request to the original repository rounds out your contribution to the open source project.

Git 1:41 Fork

4.6.4 Rebase

When you start editing your project, you diverge from the original version. During your development, the original version may be updated, or other developers may have some of their branches implementing good features that you would like to include in your current work. That is when rebase becomes useful. When you rebase onto a certain point, which could be a newer master or another custom branch, you graft all your ongoing work right onto that point.

Rebase may fail, because sometimes it is impossible to achieve what we just described, as conflicts may exist. For example, you and the to-be-rebased copy both edited some common text section. Once this happens, human intervention needs to take place to resolve the conflict.

Git 4:20 Rebase
4.6.5 Remote

Collaborating with others involves managing the remote repositories and pushing and pulling data to and from them when you need to share work. Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote branches and define them as being tracked or not, and more.

Throughout this semester, you will typically work on two remote repos. One is the official class repo, and the other is the repo you forked from the class repo. The class repo is used as the centralized, authoritative, and final version of all student submissions. The repo under your own Github account is for your personal storage. To show progress on a weekly basis you need to commit your changes on a weekly basis. However, make sure that things in the master branch are working. If not, just use another branch to conduct your changes and merge at a later time. We like you to call your development branch dev; a sketch of this workflow is shown after the following link.

https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes
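A minimal sketch of the branch-and-merge workflow described above, using the suggested dev branch (commit messages and file selection are illustrative):

$ git checkout -b dev            # create and switch to the development branch
$ git add .
$ git commit -m "work in progress"
$ git checkout master            # once things work, merge back into master
$ git merge dev
$ git push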
4.6.6 Pull Request

Pull requests are a means of starting a conversation about a proposed change back into a project. We will be taking a look at the strength of conversation, integration options for fuller information about a change, and cleanup strategy for when a pull request is finished.

Git 4:26 Pull Request

4.6.7 Branch

Branches are an excellent way to not only work safely on features or experiments, but they are also the key element in creating Pull Requests on GitHub. Let's take a look at why we want branches, how to create and delete branches, and how to switch branches in this episode.

Git 2:25 Branch

4.6.8 Checkout

Change where and what you are working on with the checkout command. Whether we are switching branches, wanting to look at the working tree at a specific commit in history, or discarding edits we want to throw away, all of these can be done with the checkout command.

Git 3:11 Checkout
4.6.9 Merge

Once you know branches, merging that work into master is the natural next step. Find out how to merge branches, identify and clean up merge conflicts, or avoid conflicts until a later date. Lastly, we will look at combining the merged feature branch into a single commit and cleaning up your feature branch after merges, as sketched next.
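One way to combine a feature branch into a single commit, as mentioned above, is a squash merge; a short sketch (the branch name feature is a placeholder):

$ git checkout master
$ git merge --squash feature
$ git commit -m "add feature as a single commit"

git merge --squash stages all changes from the branch without creating a merge commit, so the subsequent commit records them as one.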
Git 3:11 Merge
4.6.10 GUI

Using Graphical User Interfaces can supplement your use of the command line to get the best of both worlds. GitHub for Windows and GitHub for Mac allow for switching to the command line, ease of grabbing repositories from GitHub, and participating in a particular pull request. We will also see that the auto-updating functionality helps us stay up to date with stable versions of Git on the command line.

Git 3:47 GUI

There are many other git GUI tools available that directly integrate into your operating system finders, windows, ..., or PyCharm. It is up to you to identify such tools and see if they are useful for you. Most of the people we work with use git from the command line, even if they use PyCharm, eclipse, or other tools that have built-in git support. You can identify a tool that works best for you.

4.6.11 Windows

This is a quick tour of GitHub for Windows. It offers GitHub newcomers a brief overview of what this feature-loaded version control tool and an equally powerful web application can do for developers, designers, and managers using Windows in both the open source and commercial software worlds. More: http://windows.github.com

Git 1:25 Windows

4.6.12 Git from the Command line

Although github.com provides a powerful GUI and other GUI tools are available to interface with github.com, the use of git from the command line can often be faster and in many cases may be simpler.

Git command line tools can be easily installed on a variety of operating systems including Linux, macOS, and Windows. Many great tutorials exist that will allow you to complete this task easily. We found the following two tutorials sufficient to get the task accomplished:

https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
https://www.atlassian.com/git/tutorials/install-git

Although the latter is provided by an alternate repository to github, the installation instructions are very nice and are not impacted by it. Once you have installed git you need to configure it.

4.6.13 Configuration

Once you have installed Git, you need to configure it properly. This includes setting up your user name, email address, line endings, and color, along with the settings' associated configuration scopes.

Git 2:47 Configuration
It is important to make sure that you use the git config command to initialize git for the first time on each new computer system or virtual machine you use. This will ensure that you use the same name and e-mail on all resources, so that the git history and log will show your checkins consistently across all devices and computers you use. If you do not do this, your checkins in git do not show up in a consistent fashion as a single user. Thus on each computer execute the following commands (the e-mail command is implied by the text above; the values shown are placeholders):

$ git config --global user.name "Albert Zweistein"
$ git config --global user.email "albert@example.com"

where you replace the information with the information related to you. You can set the editor to emacs with:

$ git config --global core.editor emacs

Naturally, if you happen to want to use other editors you can configure them by specifying the command that starts them up. You will also need to decide if you want to push branches individually or all branches at the same time. It will be up to you to decide what will work for you best. We found that the following seems to work best:

$ git config --global push.default matching

More information about a first time setup is documented at:

* http://git-scm.com/book/en/Getting-Started-First-Time-Git-Setup

To check your setup you can say:

$ git config --list

One problem we observed is that students often simply copy and paste instructions, but do not read carefully the error that is reported back and do not fix it. The proper setting of push.default is often overlooked. Thus we remind you: please read the information on the screen when you set things up.

4.6.14 Upload your public key
Please upload your public key to the repository as documented in github, by going to your account and finding it in settings. There you will find a panel SSH key that you can click on, which brings you to the window allowing you to add a new key. If you have difficulties with this, find a video from the github foundation that explains this.

4.6.15 Working with a directory that will be provided for you

In case your course provided you with a github directory, starting and working in it is going to be real simple. Please wait till an announcement to the class is sent before you ask us questions about it.

If you are the only student working on this you still need to make sure that papers or programs you manage in the repository work and do not interfere with scripts that instructors may use to check your assignments. Thus it is good to still create a branch, work in the branch, and then merge the branch into the master once you verified things work. After you merged you can push the content to the github repository.

Tip: Please use only lowercase characters in the directory names and no special characters such as @;/_ and spaces. In general we recommend that you avoid using directory names with capital letters, spaces, and _ in them. This will simplify your documentation efforts and make the URLs from git more readable. Also note that while on some OS's the directory MyDirectory is different from mydirectory, on macOS it is considered the same, and thus renaming from capital to lowercase cannot be done without first renaming it to another directory.
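A sketch of such a two-step rename on a case-insensitive file system (the directory names are examples):

$ git mv MyDirectory tmp
$ git mv tmp mydirectory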
Your homework for submission should be organized according to folders in your clone repository. To submit a particular assignment, you must first add it using:

git add <name of the file you are adding>

Afterwards, commit it using:

git commit -m "message describing your submission"

Then push it to your remote repository using:

git push

If you want to modify your submission, you only need to commit the changed file:

git commit -m "message relating to updated file"

and afterwards push it:

git push

If you lose any documents locally, you can retrieve them from your remote repository using:

git pull
4.6.16 README.yml and notebook.md

In case you take classes e516 and e616 with us you will have to create a README.yml and notebook.md file in the top most directory of your repository. It serves the purpose of identifying your submission for homework and information about yourself.

It is important to follow the format precisely. As it is yaml, it is an easy homework to write a 4 line python script that validates if the README.yml file is valid. In addition, you can use programs such as yamllint, which is documented at

https://yamllint.readthedocs.io/en/latest/
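As a small sketch of such a validation script (the file name README.yml and an installed PyYAML package are assumptions):

import yaml

# raises an exception, and thus a nonzero exit code, if the file is not valid YAML
with open("README.yml") as f:
    yaml.safe_load(f)
print("README.yml is valid YAML")

Alternatively, after pip install yamllint, you can run yamllint README.yml from the terminal.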
This file is used to integrate your assignments into a proceedings. An example is provided at

https://github.com/cloudmesh-community/hid-sample/blob/master/README.yml

Any deviation from this format will not allow us to see your homework, as our automated scripts will use the README.yml to detect it. Make sure the file does not contain any TABs. Please also mind that all file names of all homework and the main directory must be lowercase and must not include spaces. This will simplify your task of managing the files across different operating systems.
In case you work in a team, on a submission, the document will only besubmitted in the author andhid that is listed first.All other readme files,will
gitcommit-m"messagerelatingtoupdatedfile"
gitpush
gitpull
haveforthatparticularartifactaduplicate:yesentrytoindicatethatthissubmissionismanagedelsewhere.The teamwillbe responsible tomanage theirownpullrequests, but if the team desires we can grant access for all members to arepositorybyauser.Pleasebeaware thatyoumustmakesureyoucoordinatewithyourteam.
Wewillnotacceptsubmissionofhomeworkaspdfdocumentsor tarfiles.Allassignmentsmust be submitted as code and the reports in native latex and ingithub.WehaveascriptthatwillautomaticallycreatethePDFandincludeitina proceedings. There is no exception from this rule and all reports notcompilable will be returned without review and if not submitted within thedeadlinereceiveapenalty.
PleasecheckwithyourinstructorontheformatoftheREADME.yamlfileasitcouldbedifferentforyourclass.
Toseeanexamplefor thenotebook.mdfile,youcanvisitoursamplehid,andbrowsetothenotebook.mdfile.Alternativelyyoucanvisitthefollowinglink
https://github.com/cloudmesh-community/hid-sample/blob/master/notebook.md
Thepurposeofthenotebookmdfileistorecordwhatyoudidintheclasstous.Wewillusethisfileattheendoftheclasstomakesureyouhaverecordedonaweekly basis what you did for the class. Inactivity is a valid response. Notupdatingthenotebook,isnot.
Thesampledirectorycontainsotherusefuldirectoriesandsamples,thatyoumaywant to investigate in more detail. One of the most important samples is thegithubissues(seeSection1.19).Thereisevenavideointhatsectionaboutthisandshowcasesyouhowtoorganizeyourtaskswithinthisclass,whilecopyingthe assignments from piazza into one ormore github issues.Aswe are aboutcloud computing, using the services offered by a prominent cloud computingservicesuchasgithubispartofthelearningexperienceofthiscourse.
4.6.17 Contributing to the Document

It is relatively easy to contribute to the document if you understand how to use github. The first thing you will need to do is to create a fork of the repository. The easiest way to do this is to visit the URL

https://github.com/cloudmesh-community/book

Towards the upper right corner you will find a link called Fork. Click on it and choose into which account you like to fork the original repository. Next you will create a clone from your forked repository. You will see in your fork a green clone button. You will see a URL that you can copy into your terminal. If the link does not include your username, it is the wrong link.
In your terminal you now say:

$ git clone https://github.com/<yourusername>/book

Now cd into this directory and make your changes:

$ cd book

Use the usual git commands such as git add, git commit, and git push.

Note that you will push into your own fork.

4.6.17.1 Stay up to date with the original repo

From time to time you will see that others are contributing to the original repo. To stay up to date you want to not only sync from your local copy, but also from the original repo. To link your repo with what is called the upstream, you need to do the following once, so that a git pull also pulls from the upstream.

Make sure you have the upstream repo defined:

$ git remote add upstream \
    https://github.com/cloudmesh-community/book

Now get the latest from upstream and rebase your work onto it:

$ git rebase upstream/master

In this step, the conflicting file shows up (in our case it was refs.bib):

$ git status

should show the name of the conflicting file, and

$ git diff <filename>

should show the actual differences. Maybe in some cases it is easy to simply take the latest version from upstream and reapply your changes.

So you can decide to check out one version earlier of the specific file. At this stage, the rebase should be complete. So, you need to commit and push the changes to your fork:

$ git commit
$ git rebase origin/master
$ git push

Then reapply your changes to refs.bib - simply use the backed up version and use the editor to redo the changes.

At this stage, only refs.bib is changed:

$ git status

should show the changes only in refs.bib. Commit this change using:

$ git commit -a -m "new: usr: <message>"

And finally push the last committed change:

$ git push

The changes in the file to resolve the merge conflict automatically go into the original pull request, and the pull request can be merged automatically.

You still have to issue the pull request from the Github Web page so it is registered with the upstream repository.
4.6.17.2 Resources

Pro Git book
Official tutorial
Official documentation
TutorialsPoint on git
Try git online
GitHub resources for learning git. Note: this is for github and not for gitlab. However, as it is for git, the only thing you have to do is replace github with gitlab.

In addition, the tutorials from atlassian are a good source. However, remember that you may not use bitbucket as the repository, so ignore those tutorials. We found the following useful:

What is git: https://www.atlassian.com/git/tutorials/what-is-git
Installing git: https://www.atlassian.com/git/tutorials/install-git
git config: https://www.atlassian.com/git/tutorials/setting-up-a-repository#git-config
git clone: https://www.atlassian.com/git/tutorials/setting-up-a-repository#git-clone
saving changes: https://www.atlassian.com/git/tutorials/saving-changes
collaborating with git: https://www.atlassian.com/git/tutorials/syncing

4.6.18 Exercises

E.Github.1:

How do you set your favorite editor as a default with git config?

E.Github.2:

What is the difference between merge and rebase?

E.Github.3:

Assume you have made a change in your local fork, however other users have since committed to the master branch. How can you make sure your commit works off from the latest information in the master branch?

E.Github.4:

Find a spelling error in the Web page or a contribution and create a pull request for it.

E.Github.5:

Create a README.yml in your github account directory provided for you for class.
4.6.19 Github Issues

Github 8:29 Issues

When we work in teams, or even if we work by ourselves, it is prudent to identify a system to coordinate our work. While conducting projects that use a variety of cloud services, it is important to have a cloud service that facilitates this coordination. Github provides such a feature through its issue service that is embedded in each repository.

Issues allow for the coordination of tasks, enhancements, and bugs, as well as self-defined labeled activities. Issues are shared within your team that has access to your repository. Furthermore, in an open source project the issues are visible to the community, allowing you to easily communicate the status as well as a roadmap to new features.

This enables the community to also participate in the reporting of bugs. Using such a system transforms the development of software from the traditional closed shop development to a truly open source development encouraging contributions from others. Furthermore, it is also used as a bug tracker in which not only you, but the community can communicate bugs to the project.

A good resource for learning more about issues is provided at

https://guides.github.com/features/issues/
4.6.19.1 Git Issue Features

A git issue has the following features:

title

a short description of what the issue is about

description

a more detailed description. Descriptions also allow you to conveniently add check-boxed todo's.

label

a color enhanced label that can be used to easily categorize the issue. You can define your own labels.

milestone

a milestone so you can categorically group issues as well as identify their due date. You can for example group all tasks for a week in a milestone, or you could for example put all tasks for a topic such as developing a paper in a milestone and provide a deadline for it.

assignee

an assignee is the person that is responsible for making sure the task is executed or on track if a team works on it. Often projects allow only one assignee, but in certain cases it is useful to assign a group, and the group identifies if the task can be split up and assigns the parts through check-boxed todo's.

comments

allow anyone with access to provide feedback via comments.
4.6.19.2 Github Markdown

Github uses markdown, which we introduce to you in Section [S:markdown].

As github has its own flavor of markdown, we also point you to the github markdown documentation as a reference. We like to mention the special enhancements of github's markdown that integrate well to support project management.

4.6.19.2.1 Task lists

Task lists can be added to any description or comment in github issues. To create a task list you can add [ ] to any item. This indicates a task to be done. To mark it as complete, simply change it to [x]. However, the great feature of task lists is that you do not even have to open the editor; you can simply check the task on and off via a mouse click. An example of a task list could be:

Post Bios

* [x] Post bio on piazza
* [ ] Post bio on google docs
* [ ] Post bio on github
* [ ] \(optional) integrate image in google docs bio

In case you need to use a ( at the beginning of the task text, you need to escape it with a \.
4.6.19.2.2 Team integration

A person or team on GitHub can be mentioned by typing the username preceded by the @ sign. When posting the text in the issue, it will trigger a notification to them and allow them to react to it. It is even possible to notify entire teams, which is described in more detail at

https://help.github.com/articles/about-teams/

4.6.19.2.3 Referencing Issues and Pull requests

Each issue has a number. If you use the # followed by the issue number you can refer to it in the text, which will also automatically include a hyperlink to the task. The same is valid for pull requests.
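For example, writing the following in a comment (the numbers are hypothetical) renders as hyperlinks to issue 14 and pull request 15:

This bug was reported in #14 and is fixed by pull request #15.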
4.6.19.2.4 Emojis

Although github supports emojis such as :+1:, we do not typically use them in our class.

4.6.19.3 Notifications

Github allows you to set preferences on how you like to receive notifications.
You can receive them either via e-mail or the Web. This is controlled byconfiguring it in your settings, where you can set the preferences forparticipating projects as well as projects you decide to watch. To access thenotifications you can simply look at them in the notification screen. In thisscreenwhenyoupressthe?youwillseeanumberofcommandsthatallowyoutocontrolthenotificationwhenpressingononeofthem.
4.6.19.4cc
Tocarboncopyusersinyourissuetext,simplyuse/ccfollowedbythe@signandtheirgithubusername.
4.6.19.5Interactingwithissues
Githubhastheabilitytosearchissueswithasearchqueryandasearchlanguagethatyoucanfindoutmoreaboutitat
https://guides.github.com/features/issues/#search
Adashboardgivesconvenientoverviewsoftheissuesincludingapulsethatliststodo’sstatusifyouusethemintheissuedescription.
4.6.20 Glossary

The Glossary is copied from

https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/GitTipsAndTricks#A-suggested-work-flow-for-distributed-projects-NoSY

Add

put a file (or particular changes thereto) into the index ready for a commit operation. Optional for modifications to tracked files; mandatory for hitherto un-tracked files.

Branch

a divergent change tree (e.g. a patch branch) which can be merged either wholesale or piecemeal with the master tree.

Commit

save the current state of the index and/or other specified files to the local repository.

Commit object

an object which contains the information about a particular revision, such as parents, committer, author, date, and the tree object which corresponds to the top directory of the stored revision.

Fast-forward

an update operation consisting only of the application of a linear part of the change tree in sequence.

Fetch

update your local repository database (not your working area) with the latest changes from a remote.

HEAD

the latest state of the current branch.

Index

a collection of files with stat information, whose contents are stored as objects. The index is a stored version of your working tree. Files may be staged to an index prior to committing.

Master

the main branch: known as the trunk in other SCM systems.

Merge

join two trees. A commit is made if this is not a fast-forward operation (or one is requested explicitly).

Object

the unit of storage in git. It is uniquely identified by the SHA1 hash of its contents. Consequently, an object cannot be changed.

Origin

the default remote, usually the source for the clone operation that created the local repository.

Pull

shorthand for a fetch followed by a merge (or rebase if the --rebase option is used).

Push

transfer the state of the current branch to a remote tracking branch. This must be a fast-forward operation (see merge).

Rebase

a merge-like operation in which the change tree is rewritten (see Rebasing below). Used to turn non-trivial merges into fast-forward operations.

Remote

another repository known to this one. If the local repository was created with "clone" then there is at least one remote, usually called "origin".

Stage

to add a file or selected changes therefrom to the index in preparation for a commit.

Stash

a stack onto which the current set of uncommitted changes can be put (e.g. in order to switch to or synchronize with another branch) as a patch for retrieval later. Also the act of putting changes onto this stack.

Tag

human-readable label for a particular state of the tree. Tags may be simple (in which case they are actually branches) or annotated (analogous to a CVS tag), with an associated SHA1 hash and message. Annotated tags are preferable in general.

Tracking branch

a branch on a remote which is the default source / sink for pull / push operations respectively for the current branch. For instance, origin/master is the tracking branch for the local master in a local repository.

Un-tracked

not known currently to git.
4.6.21 Example commands

To work in your local directory you can use the following commands. Please note that these commands do not upload your work to github, but only introduce version control within your local files.

The command list is copied from

https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/GitTipsAndTricks#A-suggested-work-flow-for-distributed-projects-NoSY

4.6.21.1 Local commands to version control your files
Obtain differences with:

$ git status

Move files from one part of your directory tree to another:

$ git mv <old-path> <new-path>

Delete unwanted tracked files:

$ git rm <path>

Add un-tracked files:

$ git add <un-tracked-file>

Stage a modified file for commit:

$ git add <file>

Commit currently-staged files:

$ git commit -m <log-message>

Commit only specific files (regardless of what is staged):

$ git commit -m <log-message> <file> ...

Commit all modified files:

$ git commit -a -m <log-message>

Un-stage a previously staged (but not yet committed) file:

$ git reset HEAD <file>

Get differences with respect to the committed (or staged) version of a file:

$ git diff <file>

Get differences between local file and committed version:

$ git diff --cached <file>

Create (but do not switch to) a new local branch based on the current branch:

$ git branch <new-branch>

Change to an existing local branch:

$ git checkout <branch>

Merge another branch into the current one:

$ git merge <branch>

4.6.21.2 Interacting with the remote

Get the current list of remotes (including URIs) with:

$ git remote -v

Get the current list of defined branches with:

$ git branch -a

Change to (creating if necessary) a local branch tracking an existing remote branch of the same name:

$ git checkout <branch>

Update your local repository ref database without altering the current working area:

$ git fetch <remote>

Update your current local branch with respect to your repository's current idea of a remote branch's status:

$ git merge <branch>

Pull remote ref information from all remotes and merge local branches with their remote tracking branches (if applicable):

$ git pull

Examine changes to the current local branch with respect to its tracking branch:

$ git cherry -v

Push changes to the remote tracking branch:

$ git push

Push all changes to all tracking branches:

$ git push --all
4.7 GIT PULL REQUEST ☁

4.7.1 Introduction

Git pull requests allow developers to submit work or changes they have done to a repository. The developers can then check the changes that have been proposed in the pull request, discuss, and make changes if needed. After the content of the pull request has been agreed upon, it can be merged to the repository to add the information or changes in the pull request into the repository.

4.7.2 How to create a pull request

In this document we will see how we can create a pull request for the Cloudmesh technologies repo that is located at

https://github.com/cloudmesh/technologies

However, if you do pull requests on other repositories, you just have to replace the url with that of the repository you like to use. A common one for our classes is also

https://github.com/cloudmesh-community/book

which contains this book.

You can either create a pull request through a branch or through a fork. In this document we will be looking at how we can create a pull request through a fork.

4.7.3 Fork the original repository

First you need to create a fork of the original repository. A fork is your own copy of the repository to which you can make changes. To fork the Cloudmesh technologies repo, go to the Cloudmesh technologies repo and click on the Fork button on the top right corner. Now you can notice that instead of cloudmesh/technologies the name of the repo says YOURGITUSERNAME/technologies, where YOURGITUSERNAME is indeed your github username. That is because you are now in your own copy of the cloudmesh/technologies repository. In our case the username will be pulasthi.
4.7.4 Clone your copy

Now that you have your fork created, we can go ahead and clone it into our machine. Instructions on how to clone a repository can be found in the Github documentation - Cloning a repository. Make sure that you clone your version of the technologies repo.
4.7.5 Adding an upstream

Before we can start working on our copy of the git repo it is good to add an upstream (a link to the original repo) so that we can get all the latest changes in the original repository into our copy. Use the following commands to add an upstream to cloudmesh/technologies. First go into the folder which contains your git repo that you cloned and execute the following command:

$ git remote add upstream https://github.com/cloudmesh/technologies.git

To make sure you have added it correctly, execute the following command:

$ git remote -v

You should see something similar to the following as the output:

origin   https://github.com/pulasthi/technologies.git (fetch)
origin   https://github.com/pulasthi/technologies.git (push)
upstream https://github.com/cloudmesh/technologies.git (fetch)
upstream https://github.com/cloudmesh/technologies.git (push)

4.7.6 Making changes

Now you can make changes to your repo as with any normal git repository. However, to make sure you have the latest copy from the original, execute the following command before you start making changes. This will pull the latest changes from the original cloudmesh/technologies into your local copy:

$ git pull upstream master

Now make the needed changes, commit and push; the changes will be pushed to your copy of the repo in Github, not the cloudmesh/technologies repo.
4.7.7 Creating a pull request

Once we have changes pushed, you can go into your repository in Github to create a pull request. As seen in Figure 35, you have a button named Pull request.

Figure 35: Button Pull request

Once you click on that button you will be taken to a page to create the pull request, which will look similar to Figure 36.

Figure 36: Create a pull request

Once you click on the Create pull request button you will be given an option to add a title and a comment for the pull request. Once you complete the details and submit, the pull request will appear in the original cloudmesh/technologies repo.

Note: Make sure you see the Able to merge sign before you submit the pull request, otherwise your pull request will not be able to be directly merged into the original repo. If you do not see this, that means you have not properly done the git pull upstream master command before you made the changes.

git example on CL 10:09
4.8 TIG ☁

Many browsers exist to gain insight into git repositories. In case you have Linux or Ubuntu, a tool to display information in a terminal is available at

https://jonas.github.io/tig/

On OSX it can be installed with:

$ brew install tig

Tig has many different views including views for main, log, diff, tree, blob, blame, refs, status, stage, stash, grep, and pager.

A screenshot showing some of its basic functionality is shown in Figure 37.

Figure 37: Git tig main view

Example invocations are:

$ tig
$ git show | tig
$ git log | tig
5 INTRODUCTION TO CLOUD COMPUTING AND DATA ENGINEERING FOR CLOUD COMPUTING AND MACHINE LEARNING ☁

E222 Intelligent Systems II and E516 Engineering Cloud Computing

YouTube Playlist https://www.youtube.com/playlist?list=PLy0VLh_GFyz81ZFQ6Xrd1PHHI1EzjhhVb

PowerPoint of full set A) to U) https://drive.google.com/open?id=1RQ8Q_A32ks02CSCZAzKiJJ9P9YntMRAo

5.1 A. SUMMARY OF INTRODUCTION TO CLOUD COMPUTING & DATA ENGINEERING

This lesson summarizes the component lessons CloudIntro B to CloudIntro U of the Introduction to Cloud Computing and Data Engineering Lecture.

Summary Cloud Computing 15:40
5.2 B. DEFINING CLOUDS I

* Basic definition of cloud and two very simple examples of why virtualization is important
* How clouds are situated wrt HPC and supercomputers
* Why multicore chips are important
* Typical data center

Defining Clouds I 20:22

5.3 C. DEFINING CLOUDS II

* Service-oriented architectures: Software services as Message-linked computing capabilities
* The different aaS's: Network, Infrastructure, Platform, Software
* The amazing services that Amazon AWS and Microsoft Azure have
* Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices
* 2016/2018 Infrastructure Strategies Hype Cycle and Priority Matrix

Defining Clouds II 24:22

5.4 D. DEFINING CLOUDS III

* Cloud Market Share
* How important are they? How much money do they make?

Defining Clouds III 12:23

5.5 E. VIRTUALIZATION

* Virtualization Technologies, Hypervisors and the different approaches
* KVM, Xen, Docker and Openstack

See:

https://en.wikipedia.org/wiki/Hypervisor
https://en.wikipedia.org/wiki/Xen
https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
https://en.wikipedia.org/wiki/Operating-system-level_virtualization
https://medium.com/@dbclin/aws-just-announced-a-move-from-xen-towards-kvm-so-what-is-kvm-2091f123991
https://nickjanetakis.com/blog/comparing-virtual-machines-vs-docker-containers
https://en.wikipedia.org/wiki/OpenStack

Virtualization 11:21
5.6 F. TECHNOLOGY HYPE CYCLE I

* Gartner's Hypecycles and especially that for emerging technologies in 2018, 2017 and 2016
* The phases of hypecycles
* Priority Matrix with benefits and adoption time
* Today clouds have got through the cycle (they have emerged) but features like blockchain, serverless and machine learning are on cycle
* Hypecycle and Priority Matrix for Data Center Infrastructure 2017 and 2018

Technology Hypecycle I 31:23

5.7 G. TECHNOLOGY HYPE CYCLE II

* Emerging Technologies hypecycles and Priority matrix at selected times 2008-2015
* Clouds start from 2008 to today
* They are mixed up with transformational and disruptive changes
* The route to Digital Business (2015)

Technology Hypecycle II 16:05

5.8 H. CLOUD INFRASTRUCTURE I

* Comments on trends in the data center and its technologies
* Clouds physically across the world
* Green computing
* Fraction of world's computing ecosystem in clouds and associated sizes

Cloud Infrastructure I 21:20

5.9 I. CLOUD INFRASTRUCTURE II

* Gartner hypecycle and priority matrix on Infrastructure Strategies and Compute Infrastructure
* Containers compared to virtual machines
* The emergence of artificial intelligence as a dominant force

Cloud Infrastructure II 17:52

5.10 J. CLOUD SOFTWARE

* HPC-ABDS with over 350 software packages and how to use each of 21 layers
* Google's software innovations
* MapReduce in pictures
* Cloud and HPC software stacks compared
* Components needed to support cloud/distributed systems programming
* Single Program/Instruction Multiple Data SIMD SPMD

Cloud Software 37:56

5.11 K. CLOUD APPLICATIONS I

* Big Data; a lot of the best examples have NOT been updated, so some slides are old but still make the correct points
* Some of the business usage patterns from NIST

Cloud Applications I 11:58

5.12 L. CLOUD APPLICATIONS II

* Clouds in science, in an area called cyberinfrastructure; the usage pattern from NIST
* Artificial Intelligence from Gartner

Cloud Applications II 13:03

5.13 M. CLOUD APPLICATIONS III

* Characterize Applications using NIST approach
* Internet of Things
* Different types of MapReduce

Cloud Applications III 24:12

5.14 N. CLOUDS AND PARALLEL COMPUTING

* Parallel Computing in general
* Big Data and Simulations Compared
* What is hard to do?

Clouds and Parallel Computing 35:03
5.15 O. STORAGE

* Cloud data approaches
* Repositories, File Systems, Data lakes

Storage 19:22

5.16 P. HPC AND CLOUDS

* The Branscomb Pyramid
* Supercomputers versus clouds
* Science Computing Environments

HPC and Clouds 19:29

5.17 Q. COMPARISON OF DATA ANALYTICS WITH SIMULATION

* Structure of different applications for simulations and Big Data
* Software implications
* Languages

Comparison of Data Analytics with Simulation 16:19

5.18 R. JOBS

* Computer Engineering
* Clouds
* Design
* Data Science/Engineering

Jobs 15:30

5.19 S. THE FUTURE I

* Gartner cloud computing hypecycle and priority matrix
* Hyperscale computing
* Serverless and FaaS
* Cloud Native
* Microservices

The Future I 29:29

5.20 T. THE FUTURE AND OTHER ISSUES II

* Security
* Blockchain

The Future and other Issues II 11:30

5.21 U. THE FUTURE AND OTHER ISSUES III

* Fault Tolerance

The Future and other Issues III 9:10
6 REST

6.1 INTRODUCTION TO REST ☁

Learning Objectives

Understand REST services. Understand OpenAPI. Develop REST services in Python using Eve. Develop REST services in Python using OpenAPI with swagger.

REST stands for REpresentational State Transfer. REST is an architecture style for designing networked applications. It is based on a stateless, client-server, cacheable communications protocol. In contrast to what some others write or say, REST is not a standard. Although REST is not tied to HTTP, in most cases the HTTP protocol is used. In that case, RESTful applications use HTTP requests to (a) post data while creating and/or updating it, (b) read data while making queries, and (c) delete data.

REST was first introduced in a thesis from Roy T. Fielding [40].

Hence REST can use HTTP for the four CRUD operations:

Create resources
Read resources
Update resources
Delete resources

As part of the HTTP protocol we have methods such as GET, PUT, POST, and DELETE. These methods can then be used to implement a REST service. This is not surprising, as the HTTP protocol was explicitly designed to support these operations. As REST introduces collections and items, we need to implement the CRUD functions for them. We distinguish single resources and collections of resources. The semantics for accessing them is explained next, illustrating how to implement them with HTTP methods (see REST on Wikipedia [41]).
6.1.0.1 Collection of Resources

Let us assume the following URI identifies a collection of resources:

http://.../resources/

Then we need to implement the following CRUD methods:

GET

List the URIs and perhaps other details of the collection's members.

PUT

Replace the entire collection with another collection.

POST

Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.

DELETE

Delete the entire collection.

6.1.0.2 Single Resource

Let us assume the following URI identifies a single resource in a collection of resources:

http://.../resources/item42

Then we need to implement the following CRUD methods:

GET

Retrieve a representation of the addressed member of the collection, expressed in an appropriate internet media type.

PUT

Replace the addressed member of the collection, or if it does not exist, create it.

POST

Not generally used. Treat the addressed member as a collection in its own right and create a new entry within it.

DELETE

Delete the addressed member of the collection.
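To make this mapping concrete, a hypothetical service rooted at http://localhost:8080/api (the URL, item name, and JSON payloads are illustrative assumptions, not part of the original text) could be exercised with curl as follows:

$ curl http://localhost:8080/api/resources                          # GET: list the collection
$ curl -X POST -H "Content-Type: application/json" -d '{"name": "item42"}' http://localhost:8080/api/resources
$ curl http://localhost:8080/api/resources/item42                   # GET: read a single resource
$ curl -X PUT -H "Content-Type: application/json" -d '{"name": "item42"}' http://localhost:8080/api/resources/item42
$ curl -X DELETE http://localhost:8080/api/resources/item42         # delete the single resource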
6.1.0.3 REST Tool Classification

Due to the well defined structure that REST provides, a number of tools have been created that manage the creation of the specification for REST services and their programming. We distinguish several different categories:

REST Specification Frameworks:

These are frameworks that help define REST services through specifications to generate REST services in a language and framework independent way. This includes for example Swagger 2.0 [42], OpenAPI 3.0 [43], and RAML [44].

REST programming language support:

These tools and services are targeting a particular programming language. Such tools include Flask Rest [45] and Django Rest Services [46], some of which we will explore in more detail.

REST documentation based tools:

These tools are primarily focusing on documenting REST specifications. Such tools include Swagger [47], which we will explore in more detail.

REST design support tools:

These tools are used to support the design process of developing REST services while abstracting on top of the programming languages and defining reusable specifications that can be used to create clients and servers for particular technology targets. Such tools also include swagger [47], as additional tools are available that can generate code from OpenAPI specifications [48], which we will explore in more detail.

A list of such efforts is available at OpenAPI Tools [49].
6.2 OPENAPI REST SERVICES WITH SWAGGER ☁

Swagger https://swagger.io/ is a tool for developing API specifications based on the OpenAPI Specification (OAS). It allows not only the specification, but the generation of code based on the specification in a variety of languages.

Swagger itself has a number of tools which together build a framework for developing REST services for a variety of languages.

6.2.1 Swagger Tools

The major Swagger tools of interest are:

Swagger Core

includes libraries for working with Swagger specifications https://github.com/swagger-api/swagger-core.

Swagger Codegen

allows to generate code from the specifications to develop Client SDKs, servers, and documentation. https://github.com/swagger-api/swagger-codegen

Swagger UI

is an HTML5 based UI for exploring and interacting with the specified APIs https://github.com/swagger-api/swagger-ui

Swagger Editor

is a Web-browser based editor for composing specifications using YAML https://github.com/swagger-api/swagger-editor

SwaggerHub

is a Web service to collaboratively develop and host OpenAPI specifications https://swagger.io/tools/swaggerhub/

The developed APIs can be hosted and further developed on an online repository named SwaggerHub https://app.swaggerhub.com/home. The convenient online editor is available, and it can also be installed locally on a variety of operating systems including macOS, Linux, and Windows.
6.2.2 Swagger Community Tools

Notify us about other tools that you find and would like us to mention here.

6.2.2.1 Converting Json Examples to OpenAPI YAML Models

Swagger toolbox is a utility that can convert json to swagger compatible yaml models. It is hosted online at

https://swagger-toolbox.firebaseapp.com/

The source code to this tool is available on github at

https://github.com/essuraj/swagger-toolbox

It is important to make sure that the json model is properly configured. As such, each data type must be wrapped in "quotes" and the last element must not have a , behind it.

In case you have large models, we recommend that you gradually add more and more features so that it is easier to debug in case of an error. This tool is not designed to provide back a full featured OpenAPI specification, but helps you get started deriving one.
Let us look at a small example. Let us assume we want to create a REST service to execute a command on the remote service. We know this may not be a good idea if it is not properly secured, so be extra careful. A good way to simulate this is to just use a return string instead of executing the command.

Let us assume the json schema looks like:

{
  "host": "string",
  "command": "string"
}

The output the swagger toolbox creates is:

---
required:
  - "host"
  - "command"
properties:
  host:
    type: "string"
  command:
    type: "string"

As you can see it is far from complete, but it could be used to get you started.

Based on this tool, develop a rest service to which you send a schema in JSON format and from which you get back the YAML model.
6.3 OPENAPI 2.0 SPECIFICATION ☁

Swagger provides through its specification the definition of REST services through a YAML or JSON document.

When following the API-specification-first approach to define and develop a RESTful service, the first and foremost step is to define the API conforming to the OpenAPI specification, and then to use codegen tools to conveniently generate server side stub code, client code, and documentation in the language you desire. In Section REST Service Generation with OpenAPI we have introduced the codegen tool and how to use it to generate server side and client side code and documentation. In this Section, The Virtual Cluster example API Definition, we will use a slightly more complex example to show how to define an API following the OpenAPI 2.0 specification. The example is to retrieve a virtual cluster (VC) object from the server.
The OpenAPI Specification is formerly known as Swagger RESTful APIDocumentationSpecification.ItdefinesaspecificationtodescribeanddocumentaRESTfulserviceAPI.Itisalsoknownunderversion3.0ofswagger.However,as the tools for 3.0 are not yet completed, we will continue for now to useversionswagger2.0,tillthetransitionhasbeencompleted.Thisisespeciallyofimportance,asweneedtousetheswaggercodegentool,whichcurrentlysupportonlyuptospecificationv2.HenceweareatthistimeusingOpenAPI/Swaggerv2.0inourexample.Therearesomestructureandsyntaxchangesinv3,whiletheessenceisverysimilar.Formoredetailsofthechangesbetweenv3andv2,please refer to A document published on the Web titled Difference betweenOpenAPI3.0andSwagger2.0.
You canwrite theAPI definition in json for yaml format. Let us discuss thisformatbrieflyandfocusonyamlasitiseasiertoreadandmaintain.
Ontherootleveloftheyamldocumentweseefieldslikeswagger,info,andsoon.Amongthesefields,swagger,info,andpatharerequired.Theirmeaningisasfollows:
swagger

specifies the version number. In our case a string value '2.0' is used as we are writing the definition conforming to the v2.0 specification.

info

defines metadata information related to the API. E.g., the API version, title and description, termsOfService if applicable, contact information and license, etc. Among these attributes, version and title are required while others are optional.

paths

defines the actual endpoints of the exposed RESTful API service. Each endpoint has a field pattern as the key, and a Path Item Object as the value. In this example we have defined /vcs and /vcs/{id} as the two service endpoints. They will be part of the final service URL, appended after the service host and basePath, which will be explained later.

Let us focus on the Path Item Object. It contains one or more supported operations on the service endpoint. An operation is keyed by a valid HTTP operation verb, e.g., one of get, put, post, delete, or patch. It has a value of Operation Object that describes the operations in more detail.

The Operation Object will always require a Response Object. A Response Object has an HTTP status code as the key, e.g., 200 as successful return; 40X as authentication and authorization related errors; and 50X as other server side errors. It can also have a default response keyed by default for undeclared http status return codes. The Response Object value has a required description field, and, if anything is returned, a schema indicating the object type to be returned, which could be a primitive type, e.g., string, or an array or customized object. In case of an object or an array of objects, use $ref to point to the definition of the object. In this example, we have

$ref: "#/definitions/VC"

to point to the VC definition in the definitions section in the same specification file, which will be explained later.

Besides the required field, the Operation Object can have summary and description to indicate what the operation is about; operationId to uniquely identify the operation; and consumes and produces to indicate what MIME types it expects as input and for returns, e.g., application/json in most modern RESTful APIs. It can further specify what input parameter is expected using parameters, which requires a name and an in field. name specifies the name of the parameter, and in specifies from where to get the parameter; its possible values are query, header, path, formData or body. In this example, in the /vcs/{id} path we obtain the id parameter from the URL path, so in has the path value. When in has path as its value, the required field is required and has to be set as true; when in has a value other than body, a type field is required to specify the type of the parameter.
While the three root level fields mentioned previously are required, in most cases we will also use other optional fields.

host

to indicate where the service is to be deployed, which could be localhost or a valid IP address or a DNS name of the host where the service is to be deployed. If a port number other than 80 is to be used, write the port number as well, e.g., localhost:8080.

schemes

to specify the transfer protocol, e.g., http or https.

basePath

to specify the common base URL to be appended after the host to form the base path for all the endpoints, e.g., /api or /api/1.0/. In this example, with the values specified we would have the final service endpoints http://localhost:8080/api/vcs and http://localhost:8080/api/vcs/{id} by combining the schemes, host, basePath and paths values.

consumes and produces

can also be specified on the top level to specify the default MIME types of the input and return if most paths and the defined operations have the same.

definitions

as used in the paths field, in order to point to a customized object definition with a $ref keyword.

The definitions field contains the object definitions of the customized objects involved in the API, similar to a class definition in any Object Oriented programming language. In this example, we defined a VC object, and hierarchically a Node object. Each object defined is a type of Schema Object in which many fields could be used to specify the object (see details in the REF link at the top of the document), but the most frequently used ones are:

type

to specify the type, and in the customized definition case the value is mostly object.

required

field to list the names of the required attributes of the object.

properties

has the detailed information of each attribute/property of the object, e.g., type, format. It also supports hierarchical object definitions, so a property of one object could be another customized object defined elsewhere, while using schema and the $ref keyword to point to the definition. In this example we have defined a VC, or virtual cluster, object, while it contains another object definition of

Node

as part of a cluster.
6.3.1 The Virtual Cluster example API Definition

6.3.1.1 Terminology

VC

A virtual cluster, which has one Front-End (FE) management node and multiple compute nodes. A VC object also has id and name to identify the VC, and nnodes to indicate how many compute nodes it has.

FE

A management node from which to access the compute nodes. The FE node usually connects to all the compute nodes via a private network.

Node

A compute node object that has the info ncores to indicate the number of cores it has, and ram and localdisk to show the size of the RAM and local disk storage.
6.3.1.2 Specification

swagger: "2.0"
info:
  version: "1.0.0"
  title: "A Virtual Cluster"
  description: "Virtual Cluster as a test of using swagger-2.0 specification and codegen"
  termsOfService: "http://swagger.io/terms/"
  contact:
    name: "IU ISE software and system team"
  license:
    name: "Apache"
host: "localhost:8080"
basePath: "/api"
schemes:
  - "http"
consumes:
  - "application/json"
produces:
  - "application/json"
paths:
  /vcs:
    get:
      description: "Returns all VCs from the system that the user has access to"
      produces:
        - "application/json"
      responses:
        "200":
          description: "A list of VCs."
          schema:
            type: "array"
            items:
              $ref: "#/definitions/VC"
  /vcs/{id}:
    get:
      description: "Returns all VCs from the system that the user has access to"
      operationId: getVCById
      parameters:
        - name: id
          in: path
          description: ID of VC to fetch
          required: true
          type: string
      produces:
        - "application/json"
      responses:
        "200":
          description: "The vc with the given id."
          schema:
            $ref: "#/definitions/VC"
        default:
          description: unexpected error
          schema:
            $ref: '#/definitions/Error'
definitions:
  VC:
    type: "object"
    required:
      - "id"
      - "name"
      - "nnodes"
      - "FE"
      - "computes"
    properties:
      id:
        type: "string"
      name:
        type: "string"
      nnodes:
        type: "integer"
        format: "int64"
      FE:
        type: "object"
        schema:
          $ref: "#/definitions/Node"
      computes:
        type: "array"
        items:
          $ref: "#/definitions/Node"
      tag:
        type: "string"
  Node:
    type: "object"
    required:
      - "ncores"
      - "ram"
      - "localdisk"
    properties:
      ncores:
        type: "integer"
        format: "int64"
      ram:
        type: "integer"
        format: "int64"
      localdisk:
        type: "integer"
        format: "int64"
  Error:
    required:
      - code
      - message
    properties:
      code:
        type: integer
        format: int32
      message:
        type: string
6.3.2 References

The official OpenAPI 2.0 Documentation
6.4 OPENAPI 3.0 REST SERVICE VIA INTROSPECTION ☁

The simplest way to create an OpenAPI service is to use the connexion service and read in the specification from its yaml file. It will then be introspected, and methods are dynamically created that are used for the implementation of the server.

The full example for this is available in

https://github.com/cloudmesh-community/book/tree/master/examples/rest/cpu

An extensive documentation is available at

https://media.readthedocs.org/pdf/connexion/latest/connexion.pdf

This example will dynamically return the cpu information of a computer to demonstrate how simple it is to generate a REST service in python from an OpenAPI specification.

Our requirements.txt file includes

flask
connexion[swagger-ui]
as dependencies. The server.py file simply contains the following code:

from flask import jsonify
import connexion

# Create the application instance
app = connexion.App(__name__, specification_dir="./")

# Read the yaml file to configure the endpoints
app.add_api("cpu.yaml")

# create a URL route in our application for "/"
@app.route("/")
def home():
    msg = {"msg": "It's working!"}
    return jsonify(msg)

if __name__ == "__main__":
    app.run(port=8080, debug=True)

This will run our REST service under the assumption we have a cpu.yaml and a cpu.py file, as our yaml file calls out methods from cpu.py.

The yaml file looks as follows:

openapi: 3.0.2
info:
  title: cpuinfo
  description: A simple service to get cpuinfo as an example of using OpenAPI 3.0
  license:
    name: Apache 2.0
  version: 0.0.1

servers:
  - url: http://localhost:8080/cloudmesh

paths:
  /cpu:
    get:
      summary: Returns cpu information of the hosting server
      operationId: cpu.get_processor_name
      responses:
        '200':
          description: cpu info
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/cpu"

components:
  schemas:
    cpu:
      type: "object"
      required:
        - "model"
      properties:
        model:
          type: "string"

Here we simply implement a get method and associate it with the URL /cpu. The operationId defines the method that we call, which, as we used the local directory, is included in the file cpu.py. This is controlled by the prefix in the operationId.
A very simple function to return the cpu information is defined in cpu.py, which we list next:

import os, platform, subprocess, re
from flask import jsonify

def get_processor_name():
    if platform.system() == "Windows":
        p = platform.processor()
    elif platform.system() == "Darwin":
        command = "/usr/sbin/sysctl -n machdep.cpu.brand_string"
        p = subprocess.check_output(command, shell=True).strip().decode()
    elif platform.system() == "Linux":
        command = "cat /proc/cpuinfo"
        all_info = subprocess.check_output(command, shell=True).strip().decode()
        for line in all_info.split("\n"):
            if "model name" in line:
                p = re.sub(".*model name.*:", "", line, 1)
    else:
        p = "cannot find cpuinfo"
    pinfo = {"model": p}
    return jsonify(pinfo)

We have implemented this function to return jsonified information from the dict pinfo.

To simplify working with this example, we also provide a makefile for OSX that allows us to start the server and issue the call to the server in two different terminals:

define terminal
    osascript -e 'tell application "Terminal" to do script "cd $(PWD); $1"'
endef

install:
	pip install -r requirements.txt

demo:
	$(call terminal, python server.py)
	sleep 3
	@echo "==============================================================================="
	@echo "Get the info"
	@echo "==============================================================================="
	curl http://localhost:8080/cloudmesh/cpu
	@echo
	@echo "==============================================================================="

When we call

make demo

our demo is run.
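On success, the curl call prints the jsonified cpu model; the exact value depends on your machine, for example (illustrative output only):

{"model": "Intel(R) Core(TM) i7 ..."}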
6.4.1 Verification

It is important to be able to verify if a yaml file is correct. The easiest method is to use the swagger editor. There is an online version available at:
https://editor.swagger.io/
Go to the Web site, remove the current petstore example, and simply paste your yaml file in it. Debug messages will help you to correct things.

A terminal based command may also be helpful, but is a bit difficult to read:

$ connexion run cpu.yaml --stub --debug
6.4.2Swagger-UI
SwaggercomeswithaconvenientUItoinvokeRESTAPIcallsusingthewebbrowserratherthanrelyingonthecurlcommands.
Once the requestand responsedefinitionsareproperly specified,youcanstarttheserverby,
Then the UI would also be spawned under the service URL http://[serviceurl]/ui/
Example:http://localhost:8080/cloudmesh/ui/
6.4.3Mockservice
InsomecasesitmaybeusefultodeveloptheAPIwithouthavingyetdevelopedmethodsthatyoucallwiththeOperationI.Inthiscaseitisusefultorunamockservice.YOucaninvocesuchaservicewith
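Whether the service is mocked or fully implemented, a quick check from python can confirm that the endpoint responds. This is a minimal sketch, assuming the service listens on localhost:8080 with the /cloudmesh base path from cpu.yaml:

import requests

# query the /cpu endpoint of the running (or mocked) service
r = requests.get("http://localhost:8080/cloudmesh/cpu")
print(r.status_code)  # expect 200 on success
print(r.json())       # e.g. {"model": "..."} for the real implementation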
6.4.4 Exercise

OpenAPI.Conexion.1:

Modify the makefile so it also works on ubuntu, but do not disable the ability to run it correctly on OSX. Tip: use if's in makefiles based on the OS. You can look at the makefiles that create this book as an example. Find alternatives to starting a terminal in Linux.
OpenAPI.Conexion.2:

Modify the makefile so it also works on Windows 10, but do not disable the ability to run it correctly on OSX. Tip: use if's in makefiles. You can look at the makefiles that create this book as an example. Find alternatives to start a powershell or cmd.exe in windows. Maybe you need to use gitbash.

OpenAPI.Conexion.3:

Implement a swagger specification of an issue related to the NIST BDRA. Implement it. Please remember this could prepare you for a project. Good topics include:

virtual compute service interfacing with aws, azure, google or openstack
virtual directory service interfacing with google drive, box, github, iCloud, ftp, scp, and others

As there are so many possibilities to contribute, come up in class with one specification and then implement it for different providers. The difficulty here is that it is not done for one IaaS, but for all of them, and all can be integrated.

This exercise typically grows to become part of your class project.

OpenAPI.Conexion.4:

Develop instructions on how to integrate the OpenAPI service framework in a WSGI based Web service. Chose a service you like so that the service could run in production.

OpenAPI.Conexion.5:

Develop instructions on how to integrate the OpenAPI service framework in Tornado so the service could run in production.
6.5 OPENAPI REST SERVICE VIA CODEGEN ☁

REST 36:02 Swagger

In this subsection we discuss how to use OpenAPI 2.0 and Swagger Codegen to define and develop a REST Service.

We assume you are familiar with the concept of a REST service and with OpenAPI as discussed in the section Overview of Rest.

In the next section we will further look into the Swagger/OpenAPI 2.0 specification (Swagger Specification) and use a slightly more complex example to walk you through the design of a RESTful service following the OpenAPI 2.0 specifications.

We will use a simple example to demonstrate the process of developing a REST service with the Swagger/OpenAPI 2.0 specification and the tools related to it. The general steps are:

Step 1 (Section Step 1: Define Your REST Service). Define the REST service conforming to the Swagger/OpenAPI 2.0 specification. It is a YAML document file with the basics of the REST service defined, e.g., what resources it has and what actions are supported.

Step 2 (Section Step 2: Server Side Stub Code Generation and Implementation). Use Swagger Codegen to generate the server side stub code. Fill in the actual implementation of the business logic portion in the code.

Step 3 (Section Step 3: Install and Run the REST Service). Install the server side code and run it. The service will then be available.

Step 4 (Section Step 4: Generate Client Side Code and Verify). Generate client side code. Develop code to call the REST service. Install and run it to verify.
6.5.1 Step 1: Define Your REST Service

In this example we define a simple REST service that returns the hosting server's basic CPU info. The example specification in yaml is as follows:

swagger: "2.0"
info:
  version: "0.0.1"
  title: "cpuinfo"
  description: "A simple service to get cpu info as an example of using swagger-2.0 specification and codegen"
  termsOfService: "http://swagger.io/terms/"
  contact:
    name: "Cloudmesh REST Service Example"
  license:
    name: "Apache"
host: "localhost:8080"
basePath: "/api"
schemes:
  - "http"
consumes:
  - "application/json"
produces:
  - "application/json"
paths:
  /cpu:
    get:
      description: "Returns cpu information of the hosting server"
      produces:
        - "application/json"
      responses:
        "200":
          description: "CPU info"
          schema:
            $ref: "#/definitions/CPU"
definitions:
  CPU:
    type: "object"
    required:
      - "model"
    properties:
      model:
        type: "string"

6.5.2 Step 2: Server Side Stub Code Generation and Implementation

With the REST service having been defined, we can now generate the server side stub code easily.

6.5.2.1 Setup the Codegen Environment

You will need to install the Swagger Codegen tool if you have not yet done so. For macOS we recommend that you use the homebrew install via

$ brew install swagger-codegen

On Ubuntu you can install swagger as follows (update the version as needed):

$ mkdir ~/swagger
$ cd ~/swagger
$ wget https://oss.sonatype.org/content/repositories/releases/io/swagger/swagger-codegen-cli/2.3.1/swagger-codegen-cli-2.3.1.jar
$ alias swagger-codegen="java -jar ~/swagger/swagger-codegen-cli-2.3.1.jar"

Add the alias to your .bashrc or .bash_profile file. After you start a new terminal you can use in that terminal the command

swagger-codegen

For other platforms you can just use the .jar file, which can be downloaded from this link.

Also make sure Java 7 or 8 is installed in your system. To have a well defined location we recommend that you place the code in the directory ~/cloudmesh. In this directory you will also place the cpu.yaml file.

6.5.2.2 Generate Server Stub Code

After you have the codegen tool ready, and with Java 7 or 8 installed in your system, you can run the following to generate the server side stub code:

$ swagger-codegen generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python-flask \
    -o ~/cloudmesh/swagger_example/server/cpu/flaskConnexion \
    -DsupportPython2=true

or if you have not created an alias

$ java -jar swagger-codegen-cli.jar generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python-flask \
    -o ~/cloudmesh/swagger_example/server/cpu/flaskConnexion \
    -DsupportPython2=true

In the specified directory under flaskConnexion you will find the generated python flask code, with python 2 compatibility. It is best to place the swagger code under the directory ~/cloudmesh to have a location where you can easily find it. If you want to use python 3 make sure to chose the appropriate option. To switch between python 2 and python 3 we recommend that you use pyenv as discussed in our python section.

6.5.2.3 Fill in the actual implementation
Under the flaskConnexion directory, you will find a swagger_server directory, under which you will find directories with the models defined and the controller code stubs. The models code is generated from the definition in Step 1. In the controller code, though, we will need to fill in the actual implementation. You may see a default_controller.py file under the controllers directory in which the resource and action is defined but yet to be implemented. In our example, you will find such a function definition, which we list next:

def cpu_get():  # noqa: E501
    """cpu_get

    Returns cpu info of the hosting server  # noqa: E501

    :rtype: CPU
    """
    return 'do some magic!'

Please note the do some magic! string at the return of the function. This ought to be replaced with the actual implementation of what you would like your REST call to really be doing. In reality this could be some call to a backend database or datastore, a call to another API, or even calling another REST service from another location. In this example we simply retrieve the cpu model information from the hosting server through a simple python call to illustrate this principle. Thus you can define the following code:

import os, platform, subprocess, re

def get_processor_name():
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        command = "/usr/sbin/sysctl -n machdep.cpu.brand_string"
        return subprocess.check_output(command, shell=True).strip()
    elif platform.system() == "Linux":
        command = "cat /proc/cpuinfo"
        all_info = subprocess.check_output(command, shell=True).strip()
        for line in all_info.split("\n"):
            if "model name" in line:
                return re.sub(".*model name.*:", "", line, 1)
    return "cannot find cpu info"

And then change the cpu_get() function to the following:

def cpu_get():  # noqa: E501
    """cpu_get

    Returns cpu info of the hosting server  # noqa: E501

    :rtype: CPU
    """
    return CPU(get_processor_name())

Please note we are returning a CPU object as defined in the API and later generated by the codegen tool in the models directory.
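For orientation, the generated server tree typically looks roughly like the following sketch (abbreviated; exact names can vary with the codegen version):

flaskConnexion/
    requirements.txt
    setup.py
    swagger_server/
        controllers/
            default_controller.py   # fill in the implementation here
        models/
            cpu.py                  # generated from the CPU definition
        swagger/
            swagger.yaml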
It is best not to include the definition of get_processor_name() in the same file as the definition of cpu_get(). The reason for this is that in case you need to regenerate the code, your modified code would be overwritten. Thus, to minimize the changes, we recommend to maintain that portion in a different file and import the function, so as to keep the modifications small.
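A minimal sketch of this separation, assuming we keep the hand-written code in a hypothetical file cpu_info.py next to the generated package:

# cpu_info.py -- hand-written helper, kept out of the generated files
import platform

def get_processor_name():
    # simplified variant of the function shown earlier
    return platform.processor() or "cannot find cpu info"

# default_controller.py -- only these lines need to be re-applied
# after the stub code is regenerated
from cpu_info import get_processor_name
from swagger_server.models.cpu import CPU  # generated model class

def cpu_get():  # noqa: E501
    return CPU(get_processor_name())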
At this step we have completed the server side code development.
6.5.3 Step 3: Install and Run the REST Service

Now we can install and run the REST service. It is strongly recommended that you run this in a pyenv or a virtualenv environment.

6.5.3.1 Start a virtualenv:

In case you are not using pyenv, please use virtualenv as follows:

$ virtualenv RESTServer
$ source RESTServer/bin/activate

6.5.3.2 Make sure you have the latest pip:

$ pip install -U pip

6.5.3.3 Install the requirements of the server side code:

$ cd ~/cloudmesh/swagger_example/server/cpu/flaskConnexion
$ pip install -r requirements.txt

6.5.3.4 Install the server side code package:

Under the same directory, run:

$ python setup.py install

6.5.3.5 Run the service

Under the same directory:

$ python -m swagger_server

You should see a message like this:

* Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)

6.5.3.6 Verify the service using a web browser:

Open a web browser and visit:

http://localhost:8080/api/cpu

to see if it returns a json object with the cpu model info in it.

Assignment: How would you verify that your service works with a curl call?

6.5.4 Step 4: Generate Client Side Code and Verify

In addition to the server side code, swagger can also create client side code.

6.5.4.1 Client side code generation:

Generate the client side code in a similar fashion as we did for the server side code:

$ java -jar swagger-codegen-cli.jar generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python \
    -o ~/cloudmesh/swagger_example/client/cpu \
    -DsupportPython2=true

6.5.4.2 Install the client side code package:

Although we could have installed the client in the same python pyenv or virtualenv, we showcase here that it can be installed in a completely different environment. That would even make it possible to use a python 3 based client and a python 2 based server, showcasing interoperability between python versions (although we just use python 2 here). Thus we create a new python virtual environment and conduct our install:

$ virtualenv RESTClient
$ source RESTClient/bin/activate
$ pip install -U pip
$ cd swagger_example/client/cpu
$ pip install -r requirements.txt
$ python setup.py install

6.5.4.3 Using the client API to interact with the REST service

Under the directory swagger_example/client/cpu you will find a README.md file which serves as an API documentation with example client code in it. E.g., if we save the following code into a .py file:

from __future__ import print_function
import time
import swagger_client
from swagger_client.rest import ApiException
from pprint import pprint

# create an instance of the API class
api_instance = swagger_client.DefaultApi()

try:
    api_response = api_instance.cpu_get()
    pprint(api_response)
except ApiException as e:
    print("Exception when calling DefaultApi->cpu_get: %s\n" % e)

We can then run this code to verify that calling the REST service actually works. We are expecting to see a return similar to this:

{'model': 'Intel(R)Core(TM)[email protected]'}

Obviously, we could have applied additional cleanup to the information returned by the python code, such as removing duplicated spaces.

6.5.5 Towards a Distributed Client Server

Although we develop and run the example on one localhost machine, you can separate the process onto two separate machines. E.g., on a server with an external IP or even a DNS name deploy the server side code, and on a local laptop or workstation deploy the client side code. In this case please make the changes in the API definition accordingly, e.g., the host value.

6.6 FLASK RESTFUL SERVICES ☁

Flask is a micro services framework allowing to write web services in python quickly. One of its extensions is Flask-RESTful. It adds support for building REST APIs based on a class definition, making it relatively simple. Through this interface we can then integrate with your existing Object Relational Models and libraries. As Flask-RESTful leverages the main features from Flask, an extensive set of documentation is available, allowing you to get started quickly and thoroughly. The Web page contains extensive documentation:

https://flask-restful.readthedocs.io/en/latest/
This example has not been tested. We would like the class to contribute a tested and beautiful example to this section and to explain what happens in this example.

from flask import Flask
from flask_restful import reqparse, abort
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

COMPUTERS = {
    'computer1': {
        'processor': 'iCore7'
    },
    'computer2': {
        'processor': 'iCore5'
    },
    'computer3': {
        'processor': 'iCore3'
    },
}

def abort_if_computer_doesnt_exist(computer_id):
    if computer_id not in COMPUTERS:
        abort(404, message="Computer {} does not exist".format(computer_id))

parser = reqparse.RequestParser()
parser.add_argument('processor')

class Computer(Resource):
    '''shows a single computer item and lets you delete a computer
    item.'''

    def get(self, computer_id):
        abort_if_computer_doesnt_exist(computer_id)
        return COMPUTERS[computer_id]

    def delete(self, computer_id):
        abort_if_computer_doesnt_exist(computer_id)
        del COMPUTERS[computer_id]
        return '', 204

    def put(self, computer_id):
        args = parser.parse_args()
        processor = {'processor': args['processor']}
        COMPUTERS[computer_id] = processor
        return processor, 201

# ComputerList
class ComputerList(Resource):
    '''shows a list of all computers, and lets you POST to add new computers'''

    def get(self):
        return COMPUTERS

    def post(self):
        args = parser.parse_args()
        computer_id = int(max(COMPUTERS.keys()).lstrip('computer')) + 1
        computer_id = 'computer%i' % computer_id
        COMPUTERS[computer_id] = {'processor': args['processor']}
        return COMPUTERS[computer_id], 201

##
## Setup the Api resource routing here
##
api.add_resource(ComputerList, '/computers')
api.add_resource(Computer, '/computers/<computer_id>')

if __name__ == '__main__':
    app.run(debug=True)

6.7 REST SERVICES WITH EVE ☁

Next, we will focus on how to make a RESTful web service with Python Eve. Eve makes the creation of a REST implementation in python easy. More information about Eve can be found at:

http://python-eve.org/

Although we do recommend Ubuntu 17.04, at this time there is a bug that forces us to use 16.04. Furthermore, we require you to follow the instructions on how to install pyenv and use it to set up your python environment. We recommend that you use either python 2.7.14 or 3.6.4. We do not recommend you to use anaconda as it is not suited for cloud computing but targets desktop computing. If you use pyenv you also avoid the issue of interfering with your system wide python install. We do recommend pyenv regardless of whether you use a virtual machine or are working directly on your operating system. After you have set up a proper python environment, make sure you have the newest version of pip installed with

$ pip install pip -U

To install Eve, you can say

$ pip install eve

As Eve also needs a backend database, and as MongoDB is an obvious choice for this, we will have to first install MongoDB. MongoDB is a Non-SQL database which helps to store lightweight data easily.
6.7.1 Ubuntu install of MongoDB

On Ubuntu you can install MongoDB as follows:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv 2930ADAE8CAF5059EE73BB4B58712A2291FA4AD5
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu \
    xenial/mongodb-org/3.6 multiverse" | \
    sudo tee /etc/apt/sources.list.d/mongodb-org-3.6.list
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org

6.7.2 macOS install of MongoDB

On macOS you can use the command

$ brew update
$ brew install mongodb

6.7.3 Windows 10 Installation of MongoDB

A student or student group of this class is invited to discuss on piazza how to install mongoDB on Windows 10 and come up with an easy installation solution. Naturally we have the same two different ways to run mongo: in user space or in the system. As we want to make sure your computer stays secure, the solution must have an easy way to shut down the Mongo services.

An enhancement of this task would be to integrate this function into cloudmesh cmd5 with a command mongo that allows for easily starting and stopping the service from cms.

6.7.4 Database Location

After downloading Mongo, create the db directory. This is where the Mongo data files will live. You can create the directory in the default location and assure it has the right permissions. Make sure that the /data/db directory has the right permissions by running

$ mkdir -p ~/cloudmesh/data/db

6.7.5 Verification

In order to check the MongoDB installation, please run the following commands, in one terminal:

$ mongod --dbpath ~/cloudmesh/data/db

In another terminal we try to connect to mongo and issue a mongo command to show the databases:

$ mongo --host 127.0.0.1:27017
> show databases

If they execute without errors, you have successfully installed MongoDB. In order to stop the running database instance, simply CTRL-C the running mongod process.

6.7.6 Building a simple REST Service

In this section we will focus on creating a simple rest service. To organize our work we will create the following directory:

$ mkdir -p ~/cloudmesh/eve
$ cd ~/cloudmesh/eve

As Eve needs a configuration, which it reads by default from the file settings.py, we place the following content in the file ~/cloudmesh/eve/settings.py:
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DBNAME = 'student_db'
DOMAIN = {
    'student': {
        'schema': {
            'firstname': {
                'type': 'string'
            },
            'lastname': {
                'type': 'string'
            },
            'university': {
                'type': 'string'
            },
            'email': {
                'type': 'string',
                'unique': True
            },
            'username': {
                'type': 'string',
                'unique': True
            }
        }
    }
}
RESOURCE_METHODS = ['GET', 'POST']

The DOMAIN object specifies the format of a student object that we are using as part of our REST service. In addition we can specify RESOURCE_METHODS, which lists the methods that are activated for the REST service. This way the developer can restrict the available methods for a REST service. To pass along the specification for mongoDB, we simply specify the hostname, the port, as well as the database name.
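As an illustration of the effect of RESOURCE_METHODS, once the service from the next section is running, a GET is answered while a verb that is not listed, such as DELETE, is rejected. A minimal sketch using the requests library:

import requests

# GET is listed in RESOURCE_METHODS and therefore answered
print(requests.get("http://127.0.0.1:5000/student").status_code)     # 200

# DELETE is not listed in RESOURCE_METHODS and therefore rejected
print(requests.delete("http://127.0.0.1:5000/student").status_code)  # 405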
Now that we have defined the settings for our example service, we need to start it with a simple python program. We could name that program anything we like, but often it is simply called run.py. This file is placed in the same directory where you placed the settings.py. In our case it is the file ~/cloudmesh/eve/run.py and contains the following python program:

from eve import Eve

app = Eve()

if __name__ == '__main__':
    app.run()

This is the most minimal application for Eve that uses the settings.py file for its configuration. Naturally, if we were to change the configuration file and, for example, change the DOMAIN and its schema, we would have to remove the previously created database and start the service anew. This is especially important as during the development phase we may frequently change the schema and the database. Thus it is convenient to develop the necessary cleaning actions as part of a Makefile, which we leave as an easy exercise for the students.

Next, we need to start the services, which can easily be achieved in a terminal by running the following commands. Previously we started the mongoDB service as follows:

$ mongod --dbpath ~/cloudmesh/data/db/

This is done in its own terminal, so we can observe the log messages easily. Next we start in another window the Eve service with

$ cd ~/cloudmesh/eve
$ python run.py

You can find the codes and commands up to this point in the following document.

6.7.7 Interacting with the REST service
In yet another window, we can now interact with the REST service. We can use the command line to save data in the database using the REST api. The data can be retrieved in XML or in json format. Json is often more convenient for debugging as it is easier to read than XML.

Naturally, we first need to put some data into the server. Let us assume we add the user Albert Zweistein.

To achieve this, we need to specify the header using the -H tag, saying we need the data to be saved in json format. The -X tag specifies the HTTP method, and here we use the POST method. The -d tag specifies the data; make sure you use json format to enter the data. Finally, we give the REST api endpoint to which we must save the data. This allows us to save the data in a table called student in MongoDB within a database called eve:

$ curl -H "Content-Type: application/json" -X POST \
    -d '{"firstname": "Albert", "lastname": "Zweistein", \
    "school": "ISE", "university": "Indiana University", \
    "email": "[email protected]", "username": "albert"}' \
    http://127.0.0.1:5000/student/

In order to check if the entry was accepted in mongo and included in the server, issue the following command sequence in another terminal. You can query mongo directly with its shell interface:

$ mongo
> show databases
> use student_db
> show tables                   # query the table names
> db.student.find().pretty()    # pretty will show the json in a clear way

Naturally this is not really necessary for a REST service such as eve, so we show you next how to gain access to the data via REST calls instead. We can simply retrieve the information with the help of a simple URI:

$ curl http://127.0.0.1:5000/student?firstname=Albert

Naturally, you can formulate other URLs and query attributes that are passed along after the ?.

This will now allow you to develop sophisticated REST services. We encourage you to inspect the documentation provided by Eve to showcase additional features that you could be using as part of your efforts.
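For instance, the same query can also be issued from python; this is a minimal sketch using the requests library against the service from this section:

import requests

# pass the query attributes after the ? via params
response = requests.get("http://127.0.0.1:5000/student",
                        params={"firstname": "Albert"})

# Eve returns the matching documents in the _items list
print(response.json()["_items"])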
Let us explore how to properly use additional REST API calls. We assume you have MongoDB up and running. To query the service itself we can use the URI on the Eve port:

$ curl -i http://127.0.0.1:5000

Your payload should look like the one listed next; if your output is not formatted like this, try adding ?pretty=1:

$ curl -i http://127.0.0.1:5000?pretty=1

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 150
Server: Eve/0.7.6 Werkzeug/0.11.15 Python/2.7.16
Date: Wed, 17 Jan 2018 18:34:07 GMT

{
    "_links": {
        "child": [
            {
                "href": "student",
                "title": "student"
            }
        ]
    }
}

Remember that the API entry points include additional information such as links, a child, and href.

Exercises:

Set up a python environment that works for your platform. Provide explicit reasons why anaconda and other prepackaged python versions have issues for cloud related activities. When may you use anaconda and when should you not use anaconda? Why would you want to use pyenv?

What is the meaning and purpose of links, child, and href?

In this case how many child resources are available through our API?

Develop a REST service with Eve and start and stop it.

Define curl calls to store data into the service and retrieve it.

Write a Makefile and in it a target clean that cleans the database. Develop additional targets such as start and stop, that start and stop the mongoDB but also the Eve REST service.

Issue the command

$ curl -i http://127.0.0.1:5000/people

{
    "_items": [],
    "_links": {
        "self": {
            "href": "people",
            "title": "people"
        },
        "parent": {
            "href": "/",
            "title": "home"
        }
    },
    "_meta": {
        "max_results": 25,
        "total": 0,
        "page": 1
    }
}

What does the _links section describe?

What does the _items section describe?

6.7.8 Creating REST API Endpoints

Next we want to enhance our example a bit. First, let us get back to the eve working directory with

$ cd ~/cloudmesh/eve

Add the following content to a file called run2.py:

from eve import Eve
from flask import jsonify
import os
import getpass

app = Eve()

@app.route('/student/albert')
def alberts_information():
    data = {
        'firstname': 'Albert',
        'lastname': 'Zweistein',
        'university': 'Indiana University',
        'email': '[email protected]'
    }
    try:
        data['username'] = getpass.getuser()
    except:
        data['username'] = 'not-found'
    return jsonify(**data)

if __name__ == '__main__':
    app.run(debug=True, host="127.0.0.1")

After creating and saving the file, run the following command to start the service:

$ python run2.py
After running the command, you can interact with the service by entering the following url in the web browser:

http://127.0.0.1:5000/student/albert

You can also open up a second terminal and type in it:

$ curl http://127.0.0.1:5000/student/albert

The following information will be returned:

{
    "firstname": "Albert",
    "lastname": "Zweistein",
    "university": "Indiana University",
    "email": "[email protected]",
    "username": "albert"
}

This example illustrates how easy it is to create REST services in python while combining information from a dict with information retrieved from the system. The important part is to understand the decorator app.route. The parameter specifies the route of the API endpoint, which will be the address appended to the base path, http://127.0.0.1:5000. It is important that we return a jsonified object, which can easily be done with the jsonify function provided by flask. As you can see, the name of the decorated function can be anything you like. The route specifies how we access it from the service.
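Following this pattern, further endpoints are added by simply decorating more functions; an illustrative (hypothetical) example:

@app.route('/student/albert/university')
def alberts_university():
    # the function name is arbitrary; the route determines the URL
    return jsonify(university='Indiana University')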
6.7.9 REST API Output Formats and Request Processing

Another way of managing the data is to utilize class definitions and response types that we explicitly define.

If we want to create an object like Student, we can first define a python class. Create a file called student.py. Please note the get method, which simply returns the information in the dict for the class; it is not related to the REST get function.

class Student(object):

    def __init__(self, firstname, lastname, university, email):
        self.firstname = firstname
        self.lastname = lastname
        self.university = university
        self.email = email
        self.username = 'undefined'

    def get(self):
        return self.__dict__

    def setUsername(self, name):
        self.username = name
        return name
Next we define a REST service with Eve as shown in the following listing:

from eve import Eve
from student import Student
import platform
import psutil
import json
from flask import Response
import getpass

app = Eve()

@app.route('/student/albert', methods=['GET'])
def processor():
    student = Student("Albert",
                      "Zweistein",
                      "Indiana University",
                      "albert@example.com")  # placeholder email
    response = Response()
    response.headers["Content-Type"] = "application/json; charset=utf-8"
    try:
        student.setUsername(getpass.getuser())
        response.headers["status"] = 200
    except:
        response.headers["status"] = 500
    response.data = json.dumps(student.get())
    return response

if __name__ == '__main__':
    app.run(debug=True, host='127.0.0.1')

In contrast to our earlier example, we are not using the jsonify object, but explicitly create a response that we return to the clients. The response includes a header stating that we return the information in json format, a status of 200, which means the object was returned successfully, and the actual data.

6.7.10 REST API Using a Client Application

⚠ This example is not tested. Please provide feedback and improve.

In the Section Rest Services with Eve we created our own REST API application using Python Eve. Now, once the service is running, we need to learn how to interact with it through clients.

First go back to the working folder:

$ cd ~/cloudmesh/eve

Here we create a new python file called client.py. The file includes the following content:

import requests
import json

def get_all():
    response = requests.get("http://127.0.0.1:5000/student")
    print(json.dumps(response.json(), indent=4, sort_keys=True))

def save_record():
    headers = {
        'Content-Type': 'application/json'
    }
    data = json.dumps({"firstname": "Gregor",
                       "lastname": "von Laszewski",
                       "university": "Indiana University",
                       "email": "[email protected]",
                       "username": "jane"})
    response = requests.post('http://localhost:5000/student/',
                             headers=headers,
                             data=data)
    print(response.json())

if __name__ == '__main__':
    save_record()
    get_all()

Run the following command in a new terminal to execute the simple client:

$ python client.py

When you run this class for the first time, it will run successfully, but if you try it a second time, it will give you an error, because we set the email to be a unique field in the schema when we designed the settings.py file in the beginning. So if you want to save another record you must use entries with unique emails. In order to make this dynamic you can read the student data from the terminal first and use this dynamic user input instead of the static data. But for this exercise we do not expect that or any other form data functionality.

In order to get the saved data, you can comment out the record saving function and uncomment the get all function. In python, commenting is done using #.

This client is using the requests python library to send GET, POST and other HTTP requests to the server, so you can leverage built-in methods to simplify your work.

The get_all function provides a way to print all the data in the student database to the console. The save_record function provides a way to save data in the database. You can create dynamic functions in order to save dynamic data. However, it may take some time for you to apply this as an exercise.

Write a RESTful service to determine a useful piece of information about your computer, i.e. disk space, memory, RAM, etc. In this exercise what you need to do is use a python library to extract data about the computer information mentioned previously and return this information once the user calls an API endpoint. For example, http://localhost:5000/performance/ram must return the RAM value of the given machine. For each kind of information, like disk space, RAM, etc., you can use one endpoint per feature. As a tip for this exercise, use the psutil library in python to retrieve the data, get this information into a string, populate a class called Computer, and try to save the object likewise.
6.7.11 Towards cmd5 extensions to manage eve and mongo

⚠ Part of this section related to the management of the mongo db service is done by the cm4 command. We will be developing as part of this class cms mongo admin, which does all of the things explained next and more.

Naturally it is of advantage to have in cms administration commands to manage mongo and eve from cmd instead of targets in the Makefile. Hence, we propose that the class develops such an extension. We will create in the repository the extension called admin and hope that students through collaborative work and pull requests complete such an admin command.

The proposed command is located at:

https://github.com/cloudmesh/cloudmesh.rest/blob/master/cloudmesh/admin/command/admin.py

It will be up to the class to implement such a command. Please coordinate with each other.

The implementation based on what we provided in the Makefile seems straightforward. A great extension is to load the object definitions of eve, e.g. settings.py, not from the class, but from a place in .cloudmesh. I propose to place the file at ~/.cloudmesh/db/settings.py.

The location of this file is used when the Service class is initialized with None. Prior to starting the service the file needs to be copied there. This could be achieved with a set command.
6.8 HATEOAS ☁

In the previous section we discussed the basic concepts of RESTful web services. Next we introduce you to the concept of HATEOAS.

HATEOAS stands for Hypermedia as the Engine of Application State, and this is enabled by the default configuration within Eve. It is useful to review the terminology and attributes used as part of this configuration. HATEOAS explains how REST API endpoints are defined and it provides a clear description on how the API can be consumed through these terms:

_links

Links describe the relation of the current resource being accessed to the rest of the resources. It is as if we have a set of links to the set of objects or service endpoints that we are referring to in the RESTful web service. Here an endpoint refers to a service call which is responsible for executing one of the CRUD operations on a particular object or set of objects. Furthermore, the links object contains the list of serviceable API endpoints, or list of services. When we are calling a GET request or any other request, we can use these service endpoints to execute different queries based on the user purpose. For instance, a service call can be used to insert data or retrieve data from a remote database using a REST API call. About databases we will discuss in detail in another chapter.

title

The title in the rest endpoint is the name or topic that we are trying to address. It describes the nature of the object by a single word. For instance student, bank-statement, salary, etc. can be a title.

parent

The term parent refers to the very initial link or API endpoint in a particular RESTful web service. Generally this is denoted with the primary address like http://example.com/api/v1/.

href

The term href refers to the url segment that we use to access a particular REST API endpoint. For instance "student?page=1" will return the first page of the student list by retrieving a particular number of items from a remote database or a remote data source. The full url will look like this: "http://www.exampleapi.com/student?page=1".

In addition to these fields, eve will automatically create the following information when resources are created, as showcased at

http://python-eve.org/features.html

Field     Description
_created  item creation date.
_updated  item last updated on.
_etag     ETag, to be used for concurrency control and conditional requests.
_id       unique item key, also needed to access the individual item endpoint.
Pagination information can be included in the _meta field.
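As an illustration, a single item returned by Eve carries these automatically created fields alongside the user-defined ones (all values here are made-up placeholders):

# illustrative item as returned by Eve; values are placeholders
item = {
    "firstname": "Albert",
    "_id": "5a2f...",                             # unique item key
    "_etag": "7776...",                           # concurrency control
    "_created": "Wed, 17 Jan 2018 18:34:07 GMT",  # creation date
    "_updated": "Wed, 17 Jan 2018 18:34:07 GMT",  # last update
}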
6.8.1 Filtering

Clients can submit query strings to the rest service to retrieve resources based on a filter. This also allows sorting of the results queried. One nice feature about using mongo as a backend database is that Eve not only allows python conditional expressions, but also mongo queries.

Examples of such queries include a mongo query:

$ curl -i -g http://eve-demo.herokuapp.com/people?where={%22lastname%22:%20%22Doe%22}

and a python expression:

$ curl -i http://eve-demo.herokuapp.com/people?where=lastname=="Doe"

6.8.2 Pretty Printing

Pretty printing is typically supported by adding the parameter ?pretty or ?pretty=1:

$ curl -i http://localhost/people?pretty

If this does not work you can always use python to beautify a json output with

$ curl -i http://localhost/people | python -m json.tool

6.8.3 XML

If for some reason you like to retrieve the information in XML, you can specify this, for example, through curl with an Accept header:

$ curl -H "Accept: application/xml" -i http://localhost

6.9 EXTENSIONS TO EVE ☁

A number of extensions have been developed by the community. This includes eve-swagger, eve-sqlalchemy, eve-elastic, eve-mongoengine, eve-neo4j, eve.net, eve-auth-jwt, and flask-sentinel.

Naturally there are many more.

Students have the opportunity to pick one of the community extensions and provide a section for the handbook.

Pick one of the extensions, research it, and provide a small section for the handbook so we can add it.

6.9.1 Object Management with Eve and Evegenie

http://python-eve.org/

Eve makes the creation of a REST implementation in python easy. We will provide you with an implementation example that showcases that we can create REST services without writing a single line of code. The code for this is located at https://github.com/cloudmesh/rest
This code will have a master branch but will also have a dev branch in which we will gradually add more objects. Objects in the dev branch will include:

virtual directories
virtual clusters
job sequences
inventories

You may want to check our active development work in the dev branch. However, for the purpose of this class the master branch will be sufficient.

6.9.1.1 Installation

First we have to install mongodb. The installation will depend on your operating system. For the use of the rest service it is not important to integrate mongodb into the system upon reboot, which is the focus of many online documents. However, for us it is better if we can start and stop the services explicitly for now.

On ubuntu, you need to do the following steps:

⚠ TODO: Section can be contributed by student.

On windows 10, you need to do the following steps:

⚠ TODO: Section can be contributed by student. If you elect Windows 10, you could be using the online documentation provided for starting it on Windows, or run it in a docker container.

On macOS you can use home-brew and install it with:

$ brew update
$ brew install mongodb

In the future we may want to add ssl authentication, in which case you may need to install it as follows:

$ brew install mongodb --with-openssl
6.9.1.2 Starting the service

We have provided a convenient Makefile that currently only works for macOS. It will be easy for you to adapt it to Linux. Certainly you can look at the targets in the makefile and replicate them one by one. Important targets are deploy and test.

When using the makefile you can start the services with:

$ make deploy

It will start two terminals. In one you will see the mongo service, in the other you will see the eve service. The eve service will take a file called sample.settings.py that is based on sample.json for the start of the eve service. The mongo service is configured in such a way that it only accepts incoming connections from the local host, which will be sufficient for our case. The mongo data is written into the $USER/.cloudmesh directory, so make sure it exists.

To test the services you can say:

$ make test

You will see a number of json texts being written to the screen.

6.9.1.3 Creating your own objects

The example demonstrated how easy it is to create a mongodb and an eve rest service. Now let us use this example to create your own. For this we have modified a tool called evegenie to install it onto your system.

The original documentation for evegenie is located at:

http://evegenie.readthedocs.io/en/latest/

However, we have improved evegenie while providing a commandline tool based on it. The improved code is located at:

https://github.com/cloudmesh/evegenie

You clone it and install it on your system as follows:

$ cd ~/github
$ git clone https://github.com/cloudmesh/evegenie
$ cd evegenie
$ python setup.py install
$ pip install .

This should install evegenie in your system. You can verify this by typing:

$ which evegenie

If you see the path, evegenie is installed. With evegenie installed its usage is simple:

$ evegenie

Usage:
  evegenie --help
  evegenie FILENAME

It takes a json file as input and writes out a settings file for use in eve. Let us assume the file is called sample.json, then the settings file will be called sample.settings.py. Having the evegenie program will allow us to generate the settings files easily. You can include them into your project and leverage the Makefile targets to start the services in your project. In case you generate new objects, make sure you rerun evegenie, kill all previous windows in which you run eve and mongo, and restart. In case of changes to objects that you have designed and run previously, you also need to delete the mongod database.

6.10 DJANGO REST FRAMEWORK ☁

Django REST framework is a large toolkit to develop Web APIs. The developers of the framework provide the following reasons for using it, according to the developers of that module:

1. The Web browsable API improves usability.
2. Authentication policies including packages for OAuth1a and OAuth2.
3. Serialization that supports both ORM and non-ORM data sources.
4. Customizable all the way down - just use regular function-based views if you do not need the more powerful features.
5. Extensive documentation, and great community support.
6. "Used and trusted by internationally recognised companies including Mozilla, Red Hat, Heroku, and Eventbrite."
https://www.django-rest-framework.org/

An example is provided on their Web Page at

https://www.django-rest-framework.org/#example

To document your django framework with Swagger you can look at this example:

https://www.django-rest-framework.org/topics/documenting-your-api/

However, we believe that for our purposes the approach to use connexion from an OpenAPI is much more appealing; using connexion and flask for the REST service is also easier to accomplish. Django is a large package that will take more time to get used to.
6.11 GITHUB REST SERVICES ☁

In this section we want to explore more features of REST services and how to access them. Naturally many cloud services provide such REST interfaces. This is valid for IaaS, PaaS, and SaaS.

Instead of using a REST service for IaaS, let us here inspect a REST service for the Github.com platform.

Its interfaces are documented nicely at

https://developer.github.com/v3/

We see that Github offers many resources that can be accessed by the users, which includes:

Activities
Checks
Gists
Git Data
GitHub Apps
Issues
Migrations
Miscellaneous
Organizations
Projects
Pull Requests
Reactions
Repositories
Searches
Teams
Users

Most likely we forgot the one or the other Resource that we can access via REST. It will be out of scope for us to explore all of these, so let us focus on how we, for example, access Github Issues. In fact we will use the script that we use to create issue tables for this book to showcase how easy the interaction is and to retrieve the information.

6.11.1 Issues

The REST service for issues is described in the following Web page as specification:

https://developer.github.com/v3/issues/

We see the following functionality:

List issues
List issues for a repository
Get a single issue
Create an issue
Edit an issue
Lock an issue
Unlock an issue
Custom media types

As we have learned in our REST section, we need to issue GET requests to obtain information about the issues, such as

GET /issues
GET /user/issues

As response we obtain a json object with the information we need to further process it. Unfortunately, the free tier of github has limitations in regards to the frequency with which we can issue such requests to the service, as well as in the volume in regards to the number of pages returned to us.

Let us now explore how to easily query some information. In our example we like to retrieve the list of issues for a repository as a LaTeX table but also as markdown. This way we can conveniently integrate it in documents of either format. As LaTeX has a more sophisticated table management, let us first create a LaTeX table document and then use a program to convert LaTeX to markdown. For the latter we can reuse a program called pandoc that can convert the table from LaTeX to markdown.

Let us assume we have a program called issues.py that prints the table in markdown format:

$ python issues.py

An example for such a program is listed at:

https://github.com/cloudmesh-community/book/blob/master/bin/issues.py

Although python provides the very nice module requests, which we typically use for such tasks, we have here just wrapped the command line call to curl into a system command and redirect its output to a file. However, as we only get limited information back in pages, we need to continue such a request multiple times. To keep things simple we identified that for the project at this time not more than n pages need to be fetched, so we append the output from each page to the file.

Your task is to improve this script and automate this activity so that no maximum number of fetches has to be entered.

The reason why this program is so short is that we leverage the built-in functions for json data structure manipulation, here a read and a dump. When we look in the issue.json file that is created as intermediary file, we see a list of items such as:
[
...
{
"url":"https://api.github.com/repos/cloudmesh-community/book/issues/46",
"repository_url":"https://api.github.com/repos/cloudmesh-community/book",
"labels_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/labels{/name}",
"comments_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/comments",
"events_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/events",
"html_url":"https://github.com/cloudmesh-community/book/issues/46",
"id":360613438,
"node_id":"MDU6SXNzdWUzNjA2MTM0Mzg=",
"number":46,
"title":"Taken:Virtualization",
"user":{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
},
"labels":[],
"state":"open",
"locked":false,
"assignee":{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
},
"assignees":[
{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
}
],
"milestone":null,
"comments":0,
"created_at":"2018-09-16T07:35:35Z",
"updated_at":"2018-09-16T07:35:35Z",
"closed_at":null,
"author_association":"CONTRIBUTOR",
"body":"Develop a section about Virtualization"
},
...
]

As we can see from this entry, there is a lot of information associated that for our purposes we do not need, but it certainly could be used to mine github in general.

We like to point out that github is actively mined for exploits where passwords are posted in clear text for AWS, Azure and other clouds. This is a common mistake, as many sample programs ask the student to place the password directly into their programs instead of using a configuration file that is never part of the code repository.

6.11.2 Exercise

E.github.issues.1:

Develop a new code like the one in this section, but use python requests instead of the os.system call.

E.github.issues.2:

In the simple program we hardcoded the number of page requests. How can we find out exactly how many pages we need to retrieve? Implement your solution.

E.github.issues.3:

Be inspired by the many REST interfaces. How can they be used to mine interesting things?

E.github.issues.4:

Can you create a project, author, or technology map based on information that is available in github? For example python projects may include a requirements file, or developers may work on some projects together, but do other projects with others. Can you create a network?

E.github.issues.5:

Use github to develop some cool python programs that show some statistics about github. An example would be: Given a github repository, show the checkins by date and visualize them graphically for one committer and all committers. Use bokeh or matplotlib.

E.github.issues.6:

Develop a python program that retrieves a file. Develop a python program that uploads a file. Develop a class that does this and use it in your program. Use docopt to create a manual page. Please remember this prepares you for your project, so this is very useful to do.
7 MAPREDUCE

7.1 INTRODUCTION TO MAPREDUCE ☁

In this section we discuss the background of MapReduce along with Hadoop and the core components of Hadoop.

We start out our section with a review of the python lambda expression as well as the map function. Understanding these concepts is helpful for our overall understanding of mapreduce.

So before you watch the video, we encourage you to learn Sections {#s-python-lambda} and {#s-python-map}.

Now that you have a basic understanding of the map function, we recommend to watch our videos about mapreduce, hadoop and spark, which we provide within this chapter.

MapReduce, Hadoop, and Spark (19:02) Hadoop A

MapReduce is a programming technique or processing capability which operates in a cluster or a grid on a massive data set and brings out reliable output. It works on essentially two main functions - map() and reduce(). MapReduce processes large chunks of data, so it is highly beneficial to operate in multi-threaded fashion, meaning parallel processing. MapReduce can also take advantage of data locality so that we do not lose much on communication of data from one place to another.
7.1.1 MapReduce Algorithm

MapReduce can operate on a filesystem, which holds unstructured data, or on a database with structured data. These are the following three stages of its operation (see Figure 38):

1. Map: This method processes the very initial data set. Generally, the data is in file format which can be stored in HDFS (Hadoop File System). The map function reads the data line by line and creates several chunks of data, which are again stored in HDFS. This broken set of data is in key/value pairs. So in a multi-threaded environment, there will be many worker nodes operating on the data using this map() function and writing this intermediate data in the form of key/value pairs to temporary data storage.

2. Shuffle: In this stage, worker nodes will shuffle or redistribute the data in such a way that there is only one copy for each key.

3. Reduce: This function always comes last, and it works on the data produced by the map and shuffle stages and produces an even smaller chunk of data which is used to calculate the output.

Figure 38: MapReduce Conceptual diagram

The Shuffle operation is very important here, as it is mainly responsible for reducing the communication cost. The main advantage of using the MapReduce algorithm is that it becomes very easy to scale up data processing just by adding some extra computing nodes. Building up the map and reduce methods is sometimes nontrivial, but once done, scaling up the applications is so easy that it is just a matter of changing the configuration. Scalability is a really big advantage of the MapReduce model. In the traditional way of data processing, data was moved from the nodes to the master, and the processing then happened on the master machine. In this approach, we lose bandwidth and time on moving data to the master, and parallel operation cannot happen. Also, the master can get over-burdened and fail. In the MapReduce approach, the master node distributes the data to the worker machines, which are in themselves processing units. So all workers process the data in parallel, and the time taken to process the data is reduced tremendously. (see Figure 39)

Figure 39: MapReduce Master worker diagram
7.1.1.1 MapReduce Example: Word Count

Let us understand MapReduce by an example. For example, we have a text file Sample.txt with the content: Cat, Bear, Camel, Bird, Cat, Bird, Camel, Cat, Bear, Camel, Cat, Camel

1. First we divide the input into four parts so that individual nodes can handle the load.
2. We tokenize each word and assign a weightage of value "1" to each word.
3. This way we will have a list of key-value pairs with the key being the word and the value being 1.
4. After this mapping phase, the shuffling phase starts, where all maps with the same key are sent to the corresponding reducer.
5. Now each reducer will have a unique key and a list of values for each key, which in this case is all 1s.
6. After that, each reducer will count the total number of 1s and assign the final count to each word.
7. The final output is then written to a file. (see Figure 40)

Figure 40: MapReduce Word Count [50]
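Before looking at the Hadoop version, the same word count logic can be sketched in plain python to make the three stages explicit; this is a single-machine illustration, not Hadoop code:

from collections import defaultdict
from functools import reduce

text = "Cat Bear Camel Bird Cat Bird Camel Cat Bear Camel Cat Camel"

# map stage: emit a (word, 1) pair for every token
pairs = [(word, 1) for word in text.split()]

# shuffle stage: group all values that share the same key
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# reduce stage: sum the counts for every key
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in groups.items()}

print(counts)  # {'Cat': 4, 'Bear': 2, 'Camel': 4, 'Bird': 2}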
Let us see an example of map() and reduce() methods in code for this wordcountexample.
Here we have created a class Map which extends Mapper from MapReduceframeworkandweoverridemap()methodtodeclarethekey/valuepairs.Next,therewillbeareducemethoddefinedinsideReduceclassasnextandbothinputandoutputhereisakey/valuepairs:
7.1.2HadoopMapReduceandHadoopSpark
publicstaticclassMapextendsMapper<LongWritable,
Text,
Text,
IntWritable>{
publicvoidmap(LongWritablekey,
Textvalue,
Contextcontext)
throwsIOException,InterruptedException{
Stringline=value.toString();
StringTokenizertokenizer=newStringTokenizer(line);
while(tokenizer.hasMoreTokens()){
value.set(tokenizer.nextToken());
context.write(value,newIntWritable(1));
}
}
publicstaticclassReduceextendsReducer<Text,
IntWritable,
Text,IntWritable>{
publicvoidreduce(Textkey,
Iterable<IntWritable>values,
Contextcontext)
throwsIOException,InterruptedException{
intsum=0;
for(IntWritablex:values){
sum+=x.get();
}
context.write(key,newIntWritable(sum));
}
}
In earlier versions of Hadoop we could use MapReduce with HDFS directly, but from 2.0 onwards YARN (Cluster Resource Management) was introduced, which acts as a layer between MapReduce and HDFS; using YARN, many other Big Data frameworks can connect to HDFS as well. (see Figure 41)

Figure 41: MapReduce Hadoop and Spark [51]

There are many big data frameworks available, and there is always a question as to which one is the right one. Leading frameworks are Hadoop MapReduce and Apache Spark, and the choice depends on business needs. Let us start comparing both of these frameworks with respect to their processing capability.

7.1.2.1 Apache Spark

Apache Spark is a lightning fast cluster computing framework. Spark is an in-memory system. Spark is up to 100 times faster than Hadoop MapReduce.

7.1.2.2 Hadoop MapReduce

Hadoop MapReduce reads and writes on disk; because of this it is a slow system, and that affects the volume of data being processed. But Hadoop is scalable and fault tolerant, so it is good for linear processing.

7.1.2.3 Key Differences

The key differences between them are as follows:

1. Speed: Spark is a lightning fast cluster computing framework and operates up to 100 times faster in-memory and 10 times faster than Hadoop on disk. In-memory processing reduces the disk read/write processes, which are time consuming.

2. Complexity: Spark is easy to use since there are many APIs available, but for Hadoop, developers need to code the functions, which makes it harder.

3. Application Management: Spark can perform batch processing, interactive and Machine Learning and Streaming of data, all in the same cluster, which makes it a complete framework for data analysis, whereas Hadoop is just a batch engine and requires other frameworks for other tasks, which makes it somewhat difficult to manage.

4. Real-Time Data Analysis: Spark is capable of processing real time data with great efficiency. But Hadoop was designed primarily for batch processing, so it cannot process live data.

5. Fault Tolerance: Both systems are fault tolerant, so there is no need to restart the applications from scratch.

6. Data Volume: As the data for Spark is held in memory, larger data volumes are better managed in Hadoop.

7.1.3 References

[50] https://www.edureka.co/blog/mapreduce-tutorial/?utm_source=youtube&utm_campaign=mapreduce-tutorial-161216-wr&utm_medium=description
[51] https://www.youtube.com/watch?v=SqvAaB3vK8U&list=WL&index=25&t=2547s
[52] https://www.ibm.com/analytics/hadoop/mapreduce
[53] https://en.wikipedia.org/wiki/MapReduce
[54] https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
[55] https://www.quora.com/What-is-the-difference-between-Hadoop-and-Spark
[56] https://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce
7.2 HADOOP ☁

Hadoop is an open source framework for storage and processing of large datasets on commodity clusters. Hadoop internally uses its own file system called HDFS (Hadoop Distributed File System).

The motivation for Hadoop was introduced in the Section Mapreduce.

7.2.1 Hadoop and MapReduce

In this section we discuss the Hadoop MapReduce architecture.

Hadoop 13:19 Hadoop B

7.2.2 Hadoop EcoSystem

In this section we discuss the Hadoop EcoSystem and the architecture.

Hadoop 12:57 Hadoop C

7.2.3 Hadoop Components

In this section we discuss Hadoop Components in detail.

Hadoop 15:14 Hadoop D

7.2.4 Hadoop and the Yarn Resource Manager

In this section we discuss the Yarn resource manager and the novel components added to the Hadoop framework to improve the performance and the fault tolerance.

Hadoop 14:55 Hadoop E

7.2.5 PageRank

In this section we discuss a real world problem that can be solved using the MapReduce technique. PageRank is a problem solved in the earliest stages of Google Inc. We discuss the theoretical background of this problem and how it can be solved using the mapreduce concepts.

Hadoop 25:41 Hadoop F
7.3 INSTALLATION OF HADOOP ☁

This section is using Hadoop version 3.1.1 on Ubuntu 18.04. We also describe the installation of the Yarn resource manager. We assume that you have ssh and rsync installed and use emacs as editor.

If you use a newer version, and like to update this text, please help.

7.3.1 Releases

Hadoop changes on a regular basis. Before following this section, we recommend that you visit

https://hadoop.apache.org/releases.html

where the list of downloadable files is also available. Verify that you use an up to date version. If the version of this installation is outdated, we ask you as an exercise to update it.

7.3.2 Prerequisites

sudo apt-get install ssh
sudo apt-get install rsync
sudo apt-get install emacs

7.3.3 User and User Group Creation

For security reasons we will install hadoop for a particular user and user group. We will use the following:

sudo addgroup hadoop_group
sudo adduser --ingroup hadoop_group hduser
sudo adduser hduser sudo

These steps will provide sudo privileges to the created hduser user and add the user to the group hadoop_group.
7.3.4 Configuring SSH

Here we create an SSH key for the local user to install hadoop with an ssh-key. This is different from the ssh-key you used for Github, FutureSystems, etc. Follow this section to configure it for the Hadoop installation.

The ssh content is included here because we are making an ssh key for this specific user. Next, we have to configure ssh so it can be used by the hadoop user:

sudo su - hduser
ssh-keygen -t rsa

Follow the instructions as provided in the commandline. When you see the following console input, press ENTER. Here we will only create passwordless keys. In general this is not a good idea, but for this case we make an exception.

Enter file in which to save the key (/home/hduser/.ssh/id_rsa):

Next you will be asked to enter a passphrase for the ssh configuration; here enter the same (empty) passphrase twice:

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

Finally you will see something like this after these steps are finished:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:0UBCPd6oYp7MEzCpOhMhNiJyQo6PaPCDuOT48xUDDc0 hduser@computer
The key's randomart image is:
+---[RSA 2048]----+
|.+ooo            |
|.oE.oo           |
|+.....+.         |
|X+=.o..          |
|XX.oo.S          |
|Bo++.o           |
|*o*+.            |
|*..*.            |
|+.o..            |
+----[SHA256]-----+

You have successfully configured ssh.
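Hadoop's scripts also need to be able to ssh to localhost without a password. A step that is commonly required for this, and which you may want to verify for your setup, is to append the public key to the authorized keys and test the login:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost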
7.3.5InstallationofJava
Ifyouarealreadyloggedintosu,youcanskipthenextcommand:
Nowexecutethefollowingcommandstodownloadandinstalljava
PleasenotethatusersmustacceptOracleOTNlicensebeforedownloadingJDK.
7.3.6InstallationofHadoop
FirstwewilltakealookonhowtoinstallHadoop3.1.1onUbuntu
16.04.Wemayneedapriorfolderstructuretodotheinstallationproperly.
7.3.7HadoopEnvironmentVariables
InUbuntutheenvironmentalvariablesaresetupinafilecalledbashrcatitcanbeaccessedthefollowingway
Nowaddthefollowingtoyour~/.bashrcfile
InEmacstosavethefileCtrl-X-SandCtrl-X-Ctoexit.Aftereditingyoumustupdatethevariablesinthesystem.
su - hduser
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget -c --header "Cookie: \
    oraclelicense=accept-securebackup-cookie" \
    "http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz"
tar xvzf jdk-8u191-linux-x64.tar.gz
cd ~/cloudmesh/bin/
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
emacs ~/.bashrc

export JAVA_HOME=~/cloudmesh/bin/jdk1.8.0_191
export HADOOP_HOME=~/cloudmesh/bin/hadoop-3.1.1
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH

source ~/.bashrc
java -version
If you have installed things properly there will be no errors. It will show the version as follows:

java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

Next verify the hadoop installation:

hadoop

If you have successfully installed it, a usage message such as the following will be shown:

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  fs                         run a generic filesystem user client
  version                    print the version
  jar <jar>                  run a jar file
  checknative [-a|-h]        check native hadoop and compression libraries availability
  distcp <srcurl> <desturl>  copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>  create a hadoop archive
  classpath                  prints the class path needed to get the Hadoop jar and the required libraries
  credential                 interact with credential providers
  daemonlog                  get/set the log level for each daemon
  trace                      view and modify Hadoop tracing settings
 or
  CLASSNAME                  run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
7.4 HADOOP VIRTUAL CLUSTER INSTALLATION USING CLOUDMESH ☁
Note: this version is dependent on an older version of Cloudmesh.

TODO: we need to add the installation instructions based on this version.
7.4.1 Cloudmesh Cluster Installation

Before you start this lesson, you MUST finish cm_install.

This lesson was created and tested under the newest version of the Cloudmesh client. Update yours if it is not current.
To manage a virtual cluster on the cloud, the command is cm cluster. Try cm cluster help to see what the other commands are and what options they support.

7.4.1.1 Create Cluster
To create a virtual cluster on the cloud, we must define an active cluster specification with the cm cluster define command. For example, we define a cluster with 3 nodes:

All options will use the default settings if not specified during cluster define. Try the cm cluster help command to see what options cm cluster define has and what they mean; here is part of the usage information:

Floating IPs are a valuable and limited resource on the cloud. cm cluster define will assign a floating IP to every node within the cluster by default. Cluster creation will fail if the floating IPs run out on the cloud. When you run into an error like this, use the option -I or --no-floating-ip to avoid assigning floating IPs during cluster creation:

Then manually assign a floating IP to one of the nodes. Use this node as a login node or head node to log into all the other nodes.

We can have multiple specifications defined at the same time. Every time a new cluster specification is defined, the counter of the default cluster name will increment. Hence, the default cluster names will be cluster-001, cluster-002, cluster-003 and so on.
$ cm cluster define --count 3

$ cm cluster help
usage: cluster create [-n NAME] [-c COUNT] [-C CLOUD] [-u NAME] [-i IMAGE] [-f FLAVOR] [-k KEY] [-s NAME] [-AI]
Options:
  -A --no-activate         Do not activate this cluster
  -I --no-floating-ip      Do not assign floating IPs
  -n NAME --name=NAME      Name of the cluster
  -c COUNT --count=COUNT   Number of nodes in the cluster
  -C NAME --cloud=NAME     Name of the cloud
  -u NAME --username=NAME  Name of the image login user
  -i NAME --image=NAME     Name of the image
  -f NAME --flavor=NAME    Name of the flavor
  -k NAME --key=NAME       Name of the key
  -s NAME --secgroup=NAME  NAME of the security group
  -o PATH --path=PATH      Output to this path
...

$ cm cluster define --count 3 --no-floating-ip
Use cm cluster avail to check all the available cluster specifications:

With cm cluster use [NAME], we are able to switch between different specifications with the given cluster name:

This will activate specification cluster-001, which assigns floating IPs during creation, rather than the latest one, cluster-002.

With our cluster specification ready, we create the cluster with the command cm cluster allocate. This will create a virtual cluster on the cloud with the activated specification:

Each specification can have one active cluster, which means cm cluster allocate does nothing if there is already a successfully activated cluster.
7.4.1.2 Check Created Cluster

$ cm cluster avail
cluster-001
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : True
    cloud            : chameleon
> cluster-002
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : False
    cloud            : chameleon

$ cm cluster use cluster-001
$ cm cluster avail
> cluster-001
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : True
    cloud            : chameleon
cluster-002
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : False
    cloud            : chameleon

$ cm cluster allocate
With the command cm cluster list, we can see the cluster with the default name cluster-001 we just created:

Using cm cluster nodes [NAME], we can also see the nodes of the cluster along with their assigned floating IPs:

If the option --no-floating-ip was included during the definition, you will see nodes without a floating IP:

To log into one of them, use the command cm vm ip assign [NAME] to assign a floating IP to one of them:

Then you can log into this node as a head node of your cluster with cm vm ssh [NAME]:
7.4.1.3 Delete Cluster

Using cm cluster delete [NAME], we are able to delete the cluster we created:

The option --all can delete all the clusters created, so be careful:

$ cm cluster delete --all

Then we need to undefine our cluster specification with the command cm cluster undefine [NAME]:
$ cm cluster list
cluster-001

$ cm cluster nodes cluster-001
xl41-001  129.114.33.147
xl41-002  129.114.33.148
xl41-003  129.114.33.149

$ cm cluster nodes cluster-002
xl41-004  None
xl41-005  None
xl41-006  None

$ cm vm ip assign xl41-006
$ cm cluster nodes cluster-002
xl41-004  None
xl41-005  None
xl41-006  129.114.33.150

$ cm vm ssh xl41-006
cc@xl41-006 $

$ cm cluster delete cluster-001
The option --all can delete all the cluster specifications:
7.4.2 Hadoop Cluster Installation

This section builds upon the previous one. Please finish the previous one before starting this one.

7.4.2.1 Create Hadoop Cluster

To create a Hadoop cluster, we need to first define a cluster with the cm cluster define command:

To deploy a Hadoop cluster, we only support the image CC-Ubuntu14.04 on Chameleon. DO NOT use CC-Ubuntu16.04 or any other images. You will need to specify it if it is not the default image:

$ cm cluster define --count 3 --image CC-Ubuntu14.04

Then we define the Hadoop cluster on top of the cluster we defined, using the cm hadoop define command:

As with cm cluster define, you can define multiple specifications for the Hadoop cluster and check them with cm hadoop avail:

We can use cm hadoop use [NAME] to activate the specification with the given name:
$ cm cluster undefine cluster-001
$ cm cluster undefine --all

$ cm cluster define --count 3
$ cm hadoop define
$ cm hadoop avail
> stack-001
    local_path: /Users/tony/.cloudmesh/stacks/stack-001
    addons: []
$ cm hadoop use stack-001
May not be available for the current version of the Cloudmesh Client.

Before deploying, we need to use cm hadoop sync to check out and synchronize the Big Data Stack from github.com:

To avoid errors, make sure you are able to connect to github.com using SSH:

https://help.github.com/articles/connecting-to-github-with-ssh/

Finally, we are ready to deploy our Hadoop cluster:

This process could take up to 10 minutes depending on your network.

To check whether Hadoop is working, use cm vm ssh to log into the Namenode of the Hadoop cluster. It is usually the first node of the cluster:

Switch to the user hadoop and check whether HDFS is set up:

Now the Hadoop cluster is properly installed and configured.
7.4.2.2 Delete Hadoop Cluster

To delete the Hadoop cluster we created, use the command cm cluster delete [NAME] to delete the cluster with the given name:

Then undefine the Hadoop specification and the cluster specification:

May not be available for the current version of the Cloudmesh Client.
$ cm hadoop sync
$ cm hadoop deploy

$ cm vm ssh node-001
cc@hostname $

cc@hostname $ sudo su - hadoop
hadoop@hostname $ hdfs dfs -ls /
Found 1 items
drwxrwx---   - hadoop hadoop,hadoopadmin          0 2017-02-15 17:26 /tmp

$ cm cluster delete cluster-001
$ cm hadoop undefine stack-001
$ cm cluster undefine cluster-001
7.4.3 Advanced Topics with Hadoop

7.4.3.1 Hadoop Virtual Cluster with Spark and/or Pig

To install Spark and/or Pig with a Hadoop cluster, we again use the command cm hadoop define, but with an ADDON, to define the cluster specification.

For example, we create a 3-node Spark cluster with Pig. To do that, all we need is to specify spark as an ADDON during the Hadoop definition:

Using cm hadoop addons, we are able to check the currently supported addons:

With cm hadoop avail, we can see the details of the specification for the Hadoop cluster:

Then we use cm hadoop sync and cm hadoop deploy to deploy our Spark cluster:

This process will take 15 minutes or longer.

Before we proceed to the next step, there is one more thing we need to do, which is to make sure we are able to ssh from every node to the others without a password. To achieve that, we need to execute cm cluster cross_ssh:
7.4.3.2 Word Count Example on Spark

Now with the cluster ready, let us run a simple Spark job, WordCount, on one of William Shakespeare's works. Use cm vm ssh to log into the Namenode of the Spark cluster. It should be the first node of the cluster:
$ cm cluster define --count 3
$ cm hadoop define spark pig
$ cm hadoop addons
$ cm hadoop avail
> stack-001
    local_path: /Users/tony/.cloudmesh/stacks/stack-001
    addons: [u'spark', u'pig']
$ cm hadoop sync
$ cm hadoop deploy
$ cm cluster cross_ssh

$ cm vm ssh node-001
cc@hostname $
Switch to the user hadoop and check whether HDFS is set up:

Download the input file from the Internet:

You can also use any other text file you prefer. Create a new directory wordcount within HDFS to store the input and output:

Store the input text file in the directory:

Save the following code as wordcount.py on the local filesystem of the Namenode:

Next, submit the job to Yarn and run it in distributed mode:

Finally, take a look at the result in the output directory:
cc@hostname $ sudo su - hadoop
hadoop@hostname $

wget --no-check-certificate -O inputfile.txt \
    https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

$ hdfs dfs -mkdir /wordcount
$ hdfs dfs -put inputfile.txt /wordcount/inputfile.txt
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # take two arguments, input and output
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>")
        exit(-1)
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("SparkCount")
    sc = SparkContext(conf=conf)
    # read in the text file
    text_file = sc.textFile(sys.argv[1])
    # split each line into words,
    # count the occurrence of each word,
    # and sort the output based on word
    counts = text_file.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortByKey()
    # save the result in the output text file
    counts.saveAsTextFile(sys.argv[2])
$ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
    --name wordcount --conf "spark.app.id=wordcount" wordcount.py \
    hdfs://192.168.0.236:8020/wordcount/inputfile.txt \
    hdfs://192.168.0.236:8020/wordcount/output

$ hdfs dfs -ls /wordcount/output/
Found 3 items
-rw-r--r--   1 hadoop hadoop,hadoopadmin      0 2017-03-07 21:28 /wordcount/output/_SUCCESS
-rw-r--r--   1 hadoop hadoop,hadoopadmin 483182 2017-03-07 21:28 /wordcount/output/part-00000
-rw-r--r--   1 hadoop hadoop,hadoopadmin 639649 2017-03-07 21:28 /wordcount/output/part-00001

$ hdfs dfs -cat /wordcount/output/part-00000 | less
(u'', 517065)
(u'"', 241)
(u'"\'Tis', 1)
(u'"A', 4)
(u'"AS-IS".', 1)
(u'"Air,"', 1)
(u'"Alas,', 1)
(u'"Amen"', 2)
(u'"Amen"?', 1)
(u'"Amen,"', 1)
...
7.5 SPARK

7.5.1 Spark Lectures ☁

This section covers an introduction to Spark that is split up into eight parts. We discuss the Spark background, RDD operations, Shark, Spark ML, and Spark vs. other frameworks.

7.5.1.1 Motivation for Spark

In this section we discuss the background of Spark and the core components of Spark.

Spark 15:57 Spark A

7.5.1.2 Spark RDD Operations

In this section we discuss the background of RDD operations along with other transformation functionality in Spark.

Spark 12:17 Spark B

7.5.1.3 Spark DAG

In this section we discuss the background of DAG (directed acyclic graph) operations along with other components like Shark in the earlier stages of Spark.
Spark 10:37 Spark C

7.5.1.4 Spark vs. other Frameworks

In this section we discuss real-world applications that can be built using Spark. We also discuss some comparison results obtained from experiments done with Spark along with frameworks like Harp, Harp-DAAL, etc. We discuss the benchmarks and performance obtained from such experiments.

Spark 26:18 Spark D
7.5.2 Installation of Spark ☁

In this section we will discuss how to install Spark 2.3.2 on Ubuntu 18.04.

7.5.2.1 Prerequisites

We assume that you have ssh and rsync installed and use emacs as editor.

7.5.2.2 Installation of Java

First download Java 8.

Then add the environment variables to the .bashrc file.

Source the .bashrc file after adding the environment variables.
sudo apt-get install ssh
sudo apt-get install rsync
sudo apt-get install emacs

mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz"
tar xvzf jdk-8u161-linux-x64.tar.gz

emacs ~/.bashrc

export JAVA_HOME=~/cloudmesh/bin/jdk1.8.0_161
export PATH=$JAVA_HOME/bin:$PATH

source ~/.bashrc
7.5.2.3 Install Spark with Hadoop

Here we use Spark packaged with Hadoop. In this package Spark uses Hadoop 2.7.0. Note that in the Hadoop installation section we use Hadoop 3.1.1 for the vanilla Hadoop installation.

Create the base directories and go to the directory.

Then download Spark 2.3.2 as follows.

Now extract the file.

7.5.2.4 Spark Environment Variables

Open up the .bashrc file and add the environment variables as follows.

Go to the last line and add the following content.

Source the .bashrc file.

7.5.2.5 Test Spark Installation

Open up a new terminal and then run the following command.

If it has been configured properly, it will display the following content.
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar xzf spark-2.3.2-bin-hadoop2.7.tgz
mv spark-2.3.2-bin-hadoop2.7 spark-2.3.2

emacs ~/.bashrc

export SPARK_HOME=~/cloudmesh/bin/spark-2.3.2
export PATH=$SPARK_HOME/bin:$PATH

source ~/.bashrc
spark-shell
Spark context Web UI available at http://192.168.1.66:4041
Spark context available as 'sc' (master = local[*], app id = local-1521674331361).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

Please check the console LOGS and find the port number on which the Spark Web UI is hosted. It will show something like:

Spark context Web UI available at <some url>

Then take a look at the following address in the browser:

http://localhost:4041

If you see the Spark Dashboard, then you know that you have installed Spark successfully.
7.5.2.6 Install Spark With Custom Hadoop

Installing Spark with a pre-existing Hadoop version is favorable if you want to use the latest features of the latest Hadoop version, or when you need a specific Hadoop version because of external dependencies of your project.

First we need to download the Spark package without Hadoop.

Then download Spark 2.3.2 as follows.

Now extract the file.

Then add the environment variables.

If you have already installed Spark with Hadoop by following the previous section, please update the current SPARK_HOME variable with the new path.
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-without-hadoop.tgz
tar xzf spark-2.3.2-bin-without-hadoop.tgz

emacs ~/.bashrc

Go to the last line and add the following content.

Source the .bashrc file.
7.5.2.7 Configuring Hadoop

Now we must add the current Hadoop version that we are using for Spark. Open up a new terminal and then run the following.

Now we need to add a new line pointing to the current hadoop installation. Add the following variable to the spark-env.sh file.

Spark Web UI - Hadoop Path

7.5.2.8 Test Spark Installation

Open up a new terminal and then run the following command.

If it has been configured properly, it will display the following content.
export SPARK_HOME=~/cloudmesh/bin/spark-2.3.2-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$PATH

source ~/.bashrc

cd $SPARK_HOME
cd conf
cp spark-env.sh.template spark-env.sh
emacs spark-env.sh

export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
spark-shell

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://149-160-230-133.dhcp-bl.indiana.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1521732740077).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

Then take a look at the following address in the browser:

http://localhost:4040

Please check the console LOGS and find the port number on which the Spark Web UI is hosted. It will show something like: Spark context Web UI available at <some url>.
7.5.3 Spark Streaming ☁

7.5.3.1 Streaming Concepts

Spark Streaming is one of the components extending Core Spark. Spark Streaming provides a scalable, fault-tolerant system with high throughput. For streaming data into Spark, there are many libraries like Kafka, Flume, Kinesis, etc.

7.5.3.2 Simple Streaming Example

In this section, we are going to focus on making a simple streaming application using the network on your computer. Here we are going to expose a particular port, and from that port we are going to continuously stream data entered by the user; the word count is calculated as the output.

First, create a Makefile.

Then add the following content to the Makefile.

Please add a tab when adding the corresponding command for a given instruction in the Makefile. In pdf mode the tab is not clearly shown.

Now we need to create a file called streaming.py.
mkdir -p ~/cloudmesh/spark/examples/streaming
cd ~/cloudmesh/spark/examples/streaming
emacs Makefile

SPARKHOME = ${SPARK_HOME}

run-streaming:
	${SPARKHOME}/bin/spark-submit streaming.py localhost 9999
Then add the following content.

To run the code, we need to open up two terminals.

Terminal 1: First use netcat to open up a port to start the communication.

Terminal 2: Now run the Spark program in the second terminal.

In this terminal you can see a script running that tries to read the stream coming from port 9999. You can enter text in Terminal 1, and these texts will be tokenized, the word count calculated, and the result shown in Terminal 2.
7.5.3.3 Spark Streaming For Twitter Data

In this section we are going to learn how to use Twitter data as the streaming data source and use Spark Streaming capabilities to process the data. As the first step, you must install the python packages using pip.
emacs streaming.py

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("Pyspark script logger initialized")
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start()                # Start the computation
ssc.awaitTermination(100)  # Wait for the computation to terminate
nc -lk 9999

make run-streaming

7.5.3.3.1 Step 1

sudo pip install tweepy

7.5.3.3.2 Step 2
Then you need to create an account in Twitter Apps. Go to the Twitter Apps site and sign in to your twitter account or create a new twitter account. Then you need to create a new application; let us name this application Cloudmesh-Spark-Streaming.

First you need to create an app with the app name we suggested in this section. The way to create the app is shown in Figure 42.

Figure 42: Create Twitter App

Next we need to take a look at the dashboard created for the app. You can see what your dashboard looks like in Figure 43.
Figure 43: Go To Twitter App Dashboard

Next, the generated application tokens must be reviewed; as can be seen in Figure 44, you need to go to the Keys and Access Tokens tab.

Figure 44: Create Your Twitter Settings

Now you need to generate the access tokens for the first time if you have not generated access tokens before; this can be done by clicking the Create my access token button. See Figure 45.

Figure 45: Create Your Twitter Access Tokens

The access tokens and keys are blurred in this section for privacy reasons.
7.5.3.3.3 Step 3

Let us build a simple Twitter app to see if everything is okay.

Add the following content to the file and make sure you update the corresponding token keys with your token values.

mkdir -p ~/cloudmesh/spark/streaming
cd ~/cloudmesh/spark/streaming
emacs twitterstreaming.py
import tweepy

CONSUMER_KEY = 'your_consumer_key'
CONSUMER_SECRET = 'your_consumer_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_TOKEN_SECRET = 'your_access_token_secret'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
status = "Testing!"
api.update_status(status=status)

Run the script to post the test status:

python twitterstreaming.py
7.5.3.3.4 Step 4

Let us start the twitter streaming exercise. We need to create a TweetListener in order to retrieve data from twitter regarding a topic of your choice. In this exercise, we have tried keywords like trump, indiana, messi.

Make sure to replace the strings related to the secret keys and IP addresses with values depending on your machine configuration and twitter keys.

Now add the following content.
mkdir -p ~/cloudmesh/spark/streaming
cd ~/cloudmesh/spark/streaming
emacs tweetlistener.py
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
import socket
import json

CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_SECRET = 'YOUR_SECRET_ACCESS'


class TweetListener(StreamListener):
    def __init__(self, csocket):
        self.client_socket = csocket

    def on_data(self, data):
        try:
            msg = json.loads(data)
            print(msg['text'].encode('utf-8'))
            self.client_socket.send(msg['text'].encode('utf-8'))
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True


def sendData(c_socket):
    auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    twitter_stream = Stream(auth, TweetListener(c_socket))
    twitter_stream.filter(track=['messi'])  # you can change this topic


if __name__ == "__main__":
    s = socket.socket()
    host = "YOUR_MACHINE_IP"
    port = 5555
    s.bind((host, port))
    print("Listening on port: %s" % str(port))
    s.listen(5)
    c, addr = s.accept()
    print("Received request from: " + str(addr))
    sendData(c)
7.5.3.3.5 Step 5

Please replace the local file paths mentioned in this code with a file path of your preference depending on your workstation. Also, the IP address must be replaced with your IP address. The log folder path must be created beforehand, and make sure to replace the registerTempTable name with respect to the entity that you are referring to. This will minimize the conflicts among different topics when you need to plot them in a simple manner.

Add the following content to the IPython Notebook as follows.

Open up a terminal:

Then in the browser the jupyter notebook is loaded. There, create a new IPython notebook called twittersparkstremer.
cd ~/cloudmesh/spark/streaming
jupyter notebook
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc

sc = SparkContext('local[2]', 'twittersparkstreamer')
ssc = StreamingContext(sc, 10)
sqlContext = SQLContext(sc)
ssc.checkpoint("file:///home/<your-username>/cloudmesh/spark/streaming/logs/messi")
socket_stream = ssc.socketTextStream("YOUR_IP_ADDRESS", 5555)
lines = socket_stream.window(20)

from collections import namedtuple
fields = ("tag", "count")
Tweet = namedtuple('Tweet', fields)

(lines.flatMap(lambda text: text.split(" "))
    .filter(lambda word: word.lower().startswith("#"))
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b)
    .map(lambda rec: Tweet(rec[0], rec[1]))
    .foreachRDD(lambda rdd: rdd.toDF().sort(desc("count"))
    .limit(10).registerTempTable("tweetsmessi")))  # change table name depending on your entity
7.5.3.3.6 Step 6

Open Terminal 1, then do the following:

cd ~/cloudmesh/spark/streaming
python tweetlistener.py

It will show:

Listening on port: 5555

Open Terminal 2. Now we must start the Spark app by running the content in the IPython Notebook, pressing SHIFT-ENTER in each box to run each command. Make sure not to run the starting command of the SparkContext or the initialization of the SparkContext twice.

Now you will see streams in Terminal 1, and after a while you can see plots in the IPython Notebook.

Sample outputs can be seen in Figure 46, Figure 47, Figure 48, and Figure 49.
sqlContext

<pyspark.sql.context.SQLContext at 0x7f51922ba350>

ssc.start()

import matplotlib.pyplot as plt
import seaborn as sn
import time
from IPython import display

count = 0
while count < 10:
    time.sleep(20)
    top_10_tweets = sqlContext.sql('Select tag, count from tweetsmessi')  # change table name according to your entity
    top_10_df = top_10_tweets.toPandas()
    display.clear_output(wait=True)
    # sn.figure(figsize=(10, 8))
    sn.barplot(x="count", y="tag", data=top_10_df)
    plt.show()
    count = count + 1

ssc.stop()
Figure 46: Twitter Topic Messi
Figure 47: Twitter Topic Messi
Figure 48: Twitter Topic Messi
Figure 49: Twitter Topic Messi
7.5.4 User Defined Functions in Spark ☁

Apache Spark is a fast and general cluster-computing framework which performs computational tasks up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, for high-speed large-scale streaming, machine learning and SQL workloads. Spark offers support for application development employing over 80 high-level operators using Java, Scala, Python, and R. Spark powers the combined or standalone use of a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Spark can be utilized in standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and it allows data access in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

User-defined functions (UDFs) are functions created by developers when the built-in functionalities offered in a programming language are not sufficient to do the required work. Similarly, Apache Spark UDFs allow developers to enable new functions in higher level programming languages by extending built-in functionalities. They also allow developers to experiment with a wide range of options for integrating UDFs with Spark SQL, MLlib and GraphX workflows.
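To make this concrete, here is a minimal sketch we added (not part of the original tutorial) showing how a plain Python function can be registered as a Spark SQL UDF and applied to a DataFrame column; the application name, the DataFrame contents, and the column names are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFSketch").getOrCreate()

# hypothetical one-column DataFrame of Celsius readings
df = spark.createDataFrame([(0.0,), (25.0,), (100.0,)], ["celsius"])

# register a plain Python function as a Spark SQL UDF
to_fahrenheit = udf(lambda c: c * 9.0 / 5.0 + 32.0, DoubleType())

# apply the UDF column-wise, just like a built-in function
df.withColumn("fahrenheit", to_fahrenheit(df["celsius"])).show()

The longer example later in this section performs the same kind of conversion at the RDD level with a user-defined process_data function.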
This tutorial explains the following:

- How to install Spark on Linux, Windows and MacOS.
- How to create and utilize user defined functions (UDFs) in Spark using Python.
- How to run the provided example using the provided docker file and makefile.

7.5.4.1 Resources

https://spark.apache.org/
http://www.scala-lang.org/
https://docs.databricks.com/spark/latest/spark-sql/udf-in-python.html
7.5.4.2 Instructions for Spark installation

7.5.4.2.1 Linux

First, the JDK (recommended version 8) should be installed to a path where there is no space.

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Second, set up the environment variables for the JDK by adding the bin folder path to the user path variable.

Next, download and extract the Scala pre-built version from

http://www.scala-lang.org/download/

Then, set up the environment variables for Scala by adding the bin folder path to the user path variable.

Next, download and extract the Apache Spark pre-built version.

https://spark.apache.org/downloads.html

Then, set up the environment variables for spark by adding the bin folder path to the user path variable.
$ export PATH=$PATH:/usr/local/java8/bin
$ export PATH=$PATH:/usr/local/scala/bin
$ export PATH=$PATH:/usr/local/spark/bin

Finally, for testing the installation, please type the following command.

$ spark-shell
7.5.4.3Windows
First, JDK should be installed to a pathwhere there is no space in that path.RecommendedJAVAversionis8.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Second,setupenvironmentvariablesforjdkbyaddingbinfolderpathtotouserpathvariable.
Next,downloadandextractApacheSparkpre-builtversion.
https://spark.apache.org/downloads.html
Then,setupenvironmentvaribaleforsparkbyaddingbinfolderpathtotheuserpathvariable.
Next, download the winutils.exe binary and Save winutils.exe binary to adirectory(c:\hadoop\bin).
https://github.com/steveloughran/winutils
Then,changethewinutils.exepermissionusingfollowingcommandusingCMDwithadministratorpermission.
Ifyoursystemdoesnthavehivefolder,makesuretocreateC:\tmp\hivedirectory.
Next, setupenvironmentvariables forhadoopbyaddingbin folderpath to theuserpathvariable.
set JAVA_HOME=c:\java8
set PATH=%JAVA_HOME%\bin;%PATH%
set SPARK_HOME=c:\spark
set PATH=%SPARK_HOME%\bin;%PATH%

$ winutils.exe chmod -R 777 C:\tmp\hive
set HADOOP_HOME=c:\hadoop\bin
set PATH=%HADOOP_HOME%\bin;%PATH%

Then, install Python 3.6 with anaconda (this is a bundled python installer for pyspark).

https://anaconda.org/anaconda/python

Finally, for testing the installation, please type the following command.

$ pyspark
7.5.4.4MacOS
First, JDK should be installed to a pathwhere there is no space in that path.RecommandedJAVAversionis8.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Second,setupenvironmentvariablesforjdkbyadddingbinfolderpathtotouserpathvariable.
Next,InstallApacheSparkusingHomebrewwithfollowingcommands.
Then,setupenvironmentvaribaleforsparkwithfollowingcommands.
Next, install Python 3.6with anaconda (This is a bundled python installer forpyspark)
https://anaconda.org/anaconda/python
Finally,fortestingtheinstallation,pleasetypethefollowingcommand.
$ export JAVA_HOME=$(/usr/libexec/java_home)
$ brew update
$ brew install scala
$ brew install apache-spark
$ export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
$ pyspark
7.5.4.5 Instructions for creating Spark User Defined Functions

7.5.4.5.1 Example: Temperature conversion

In this example we convert temperature data from Celsius to Fahrenheit with filtering and sorting.

7.5.4.5.1.1 Description of the dataset
The file temperature_data.csv contains temperature data of different weather stations and it has the following structure:

ITE00100554,18000101,TMAX,-75,,,E,
ITE00100554,18000101,TMIN,-148,,,E,
GM000010962,18000101,PRCP,0,,,E,
EZE00100082,18000101,TMAX,-86,,,E,
GM000010962,18000104,PRCP,0,,,E,
EZE00100082,18000104,TMAX,-55,,,E,

We will only consider the weather station ID (column 0), the entry type (column 2), and the temperature (column 3; it is stored as 10 times the Celsius value, i.e., in tenths of degrees Celsius).
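As a quick sanity check of this encoding, the following snippet we added mirrors the exact conversion used in the program below: the first TMAX value of -75 corresponds to -7.5 C, or 18.5 F.

raw = -75                                # column 3 value, tenths of degrees Celsius
celsius = raw * 0.1                      # -7.5 C
fahrenheit = celsius * 9.0 / 5.0 + 32.0  # same formula as in process_data below
print(fahrenheit)                        # 18.5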
7.5.4.5.1.2HowtowriteapythonprogramwithUDF
First, we need to import the relevent libraries to use Spark sql built infunctionalitieslistedasfollows.
Then,weneedcreateauserdefinedfuctionwhichwill read the text inputandprocessthedataandreturnasparksqlRowobject.Itcanbecreatedaslistedasfollows.
Then we need to create a Spark SQL session as listed as follows with anapplicationname.
Next, we read the raw data using spark build-in function textFile() as shown
from pyspark.sql import SparkSession
from pyspark.sql import Row

def process_data(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return Row(ID=stationID, t_type=entryType, temp=temperature)

spark = SparkSession.builder.appName("SimpleSparkSQLUDFexample").getOrCreate()
Then, we convert those read lines to a Resilient Distributed Dataset (RDD) of Row objects using the UDF (process_data) which we created, as listed as follows.

Alternatively, we could have written the UDF using a python lambda function to do the same thing, as shown next.

Now, we can convert our RDD object to a Spark SQL DataFrame, as listed as follows.

Next, we can print and see the first 20 rows of data to validate our work, as shown next.
7.5.4.5.1.3 How to execute a python spark script

You can use the spark-submit command to run a spark script as shown next.

If everything went well, you should see the following output.
lines = spark.sparkContext.textFile("temperature_data.csv")

parsedLines = lines.map(process_data)

parsedLines = lines.map(lambda line: Row(ID=line.split(',')[0],
                                         t_type=line.split(',')[2],
                                         temp=float(line.split(',')[3]) * 0.1 * (9.0 / 5.0) + 32.0))

TempDataset = spark.createDataFrame(parsedLines)

TempDataset.show()

spark-submit temperature_converter.py
+-----------+------+-----------------+
|         ID|t_type|             temp|
+-----------+------+-----------------+
|EZE00100082|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|            89.42|
|EZE00100082|  TMAX|            88.88|
|ITE00100554|  TMAX|            88.34|
|ITE00100554|  TMAX|87.80000000000001|
|ITE00100554|  TMAX|            87.62|
|ITE00100554|  TMAX|            87.62|
|EZE00100082|  TMAX|            87.26|
|EZE00100082|  TMAX|87.08000000000001|
|EZE00100082|  TMAX|87.08000000000001|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|EZE00100082|  TMAX|            86.72|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|            85.64|
|ITE00100554|  TMAX|            85.64|
+-----------+------+-----------------+
only showing top 20 rows
7.5.4.5.1.4 Filtering and sorting

Now we are trying to find the maximum temperature reported for a particular weather station and print the data in descending order. We can achieve this using the where() and orderBy() functions as shown next.

We achieve the filtering using the temperature type, which filters out all the data that is not a TMAX.

Finally, we can print the data to see whether this worked, using the following statement.

Now it is time to run the python script again using the following command.

If everything went well, you should see the following sorted and filtered output.

The complete python script is listed as follows, as well as under this directory (temperature_converter.py).
TempDatasetProcessed = TempDataset.where(TempDataset['t_type'] == 'TMAX'
                                         ).orderBy('temp', ascending=False).cache()

TempDatasetProcessed.show()

spark-submit temperature_converter.py
+-----------+------+-----------------+
|         ID|t_type|             temp|
+-----------+------+-----------------+
|EZE00100082|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|            89.42|
|EZE00100082|  TMAX|            88.88|
|ITE00100554|  TMAX|            88.34|
|ITE00100554|  TMAX|87.80000000000001|
|ITE00100554|  TMAX|            87.62|
|ITE00100554|  TMAX|            87.62|
|EZE00100082|  TMAX|            87.26|
|EZE00100082|  TMAX|87.08000000000001|
|EZE00100082|  TMAX|87.08000000000001|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|EZE00100082|  TMAX|            86.72|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|            85.64|
|ITE00100554|  TMAX|            85.64|
+-----------+------+-----------------+
only showing top 20 rows
https://github.com/cloudmesh-community/hid-sp18-409/blob/master/tutorial/spark_udfs/temperature_converter.py
from pyspark.sql import SparkSession
from pyspark.sql import Row

def process_data(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return Row(ID=stationID, t_type=entryType, temp=temperature)

# Create a SparkSQL Session
spark = SparkSession.builder.appName('SimpleSparkSQLUDFexample'
                                     ).getOrCreate()

# Get the raw data
lines = spark.sparkContext.textFile('temperature_data.csv')

# Convert it to an RDD of Row objects
parsedLines = lines.map(process_data)

# alternative lambda function
parsedLines = lines.map(lambda line: Row(ID=line.split(',')[0],
                                         t_type=line.split(',')[2],
                                         temp=float(line.split(',')[3]) * 0.1 * (9.0 / 5.0) + 32.0))

# Convert that to a DataFrame
TempDataset = spark.createDataFrame(parsedLines)

# show the first 20 rows of the temperature converted data
# TempDataset.show()

# Filter to TMAX entries and sort by temperature in descending order
TempDatasetProcessed = TempDataset.where(TempDataset['t_type'] == 'TMAX'
                                         ).orderBy('temp', ascending=False).cache()

# show the first 20 rows of the filtered and sorted data
TempDatasetProcessed.show()
7.5.4.6 Instructions to install and run the example using docker

The following link is the home directory for the example explained in this tutorial:

https://github.com/cloudmesh-community/hid-sp18-409/tree/master/tutorial/spark_udfs

It contains the following files:

- Python script which contains the example: temperature_converter.py
- Temperature data file: temperature_data.csv
- Required python dependencies: requirements.txt
- Docker file which automatically sets up the codebase with dependency installation: Dockerfile
- Makefile which will execute the example with a single command: Makefile
To install the example using docker, please do the following steps.

First, you should install docker on your computer.

Next, git clone the project. Alternatively, you can also download the docker image from docker hub; then you do not need to do a docker build:

$ docker pull kadupitiya/tutorial

Then, change the directory to the spark_udfs folder.

Next, install the service using the following make command:

$ make docker-build

Finally, start the service using the following make command:

$ make docker-start

Now you should see the same output we saw at the end of the example explanation.
7.6 ADVANCED HADOOP

7.6.1 Amazon EMR (Elastic MapReduce) ☁

Amazon EMR facilitates analyzing and processing vast amounts of data by distributing the computational work across a cluster of virtual servers running in the AWS Cloud. The EMR cluster is managed using an open-source framework called Hadoop. Amazon EMR lets you focus on crunching or analyzing your data without having to worry about the time-consuming setup, management, and tuning of Hadoop clusters or the compute capacity they rely on, unlike other Hadoop distributors like Cloudera, Hortonworks, etc.

- Easy: to maintain on an on-demand basis
- Fast: auto shrinking of the cluster and dynamically increasing memory based on need
- Cost-effective: scale out and in anytime based on the business requirement
EMR supports other distributed frameworks such as Apache Spark, HBase, Presto, Flink, etc., and interacts with data in AWS data stores such as Amazon S3, DynamoDB, etc.

Components of EMR:

- Storage
- EC2 instances
- Clusters
- Security
- KMS
7.6.1.1 Why EMR?

The following are reasons given by Amazon for using EMR:

- Easy to use: launch a cluster in 5 to 10 minutes with as many nodes as you need
- Pay as you go: pay an hourly rate (with AWS's latest pricing model, customers can choose to pay by the minute)
- Flexible: easily add/remove capacity (auto scale out and in anytime)
- Reliable: spend less time on monitoring; built-in AWS tools can be utilized, which reduces overhead
- Secure: manage the firewall (VPC, both private and subnet)
7.6.1.2UnderstandingClustersandNodes
The component of Amazon EMR is the cluster. A cluster is a collection ofAmazonElasticComputeCloud(AmazonEC2)instances.Eachinstanceintheclusteriscalledanode.Eachnodehasarolewithinthecluster,referredtoasthenode type.Amazon EMR also installs different software components on eachnode type, giving each node a role in a distributed application like ApacheHadoop.
ThenodetypesinAmazonEMRareasfollows:
Master node: A node that manages the cluster by running softwarecomponents to coordinate the distribution of data and tasks among othernodes for processing. The master node tracks the status of tasks andmonitorsthehealthofthecluster.Everyclusterhasamasternode,anditispossibletocreateasingle-nodeclusterwithonlythemasternode.
Corenode:AnodewithsoftwarecomponentsthatruntasksandstoredataintheHadoopDistributedFileSystem(HDFS)onyourcluster.Multi-nodeclustershaveatleastonecorenode.
Tasknode:AnodewithsoftwarecomponentsthatonlyrunstasksanddoesnotstoredatainHDFS.Tasknodesareoptional.
Thefollowingdiagramrepresentsaclusterwithonemasternodeandfourcorenodes.
CluserandNodes
7.6.1.2.1 Submit Work to a Cluster

When you run a cluster on Amazon EMR, you have several options as to how you specify the work that needs to be done; a scripted sketch of the step-based option follows this list:

- Provide the entire definition of the work to be done in functions that you specify as steps when you create a cluster. This is typically done for clusters that process a set amount of data and then terminate when processing is complete.
- Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more jobs.
- Create a cluster, connect to the master node and other nodes as required using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively.
7.6.1.2.2 Processing Data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or queries directly to installed applications, or you can run steps in the cluster.

Submitting Jobs Directly to Applications:

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster. To do this, you typically connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see Connect to the Cluster.

Running Steps to Process Data:

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

The following is an example process using four steps:

1. Submit an input dataset for processing.
2. Process the output of the first step by using a Pig program.
3. Process a second input dataset by using a Hive program.
4. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

Steps are run in the following sequence:

1. A request is submitted to begin processing steps.
2. The state of all steps is set to PENDING.
3. When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.
4. After the first step completes, its state changes to COMPLETED.
5. The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.
6. This pattern repeats for each step until they all complete and processing ends.

The following diagram represents the step sequence and change of state for the steps as they are processed.

Cluster and Nodes

If a step fails during processing, its state changes to TERMINATED_WITH_ERRORS. You can determine what happens next for each step. By default, any remaining steps in the sequence are set to CANCELLED and do not run. You can also choose to ignore the failure and allow the remaining steps to proceed, or to terminate the cluster immediately.

The following diagram represents the step sequence and default change of state when a step fails during processing.

Cluster and Nodes
7.6.1.3AWSStorage
S3 - Cloud based storage - Using EMRFS can directly connects s3 storage -Accessiblefromanywhere
InstanceStore-Localstorage-DatawillbelostonstartandstopEC2instances
EBS-Networkattachedstorage-Datapreservedonstartandstop-AccessibleonlythroughEC2instances
7.6.1.4 Create EMR in AWS

7.6.1.4.1 Create the buckets

Log in to the AWS console at https://aws.amazon.com/console/ and create the buckets. To create the buckets, go to Services (see Figure 50, Figure 51) and click on S3 under Storage (Figure 52, Figure 53, Figure 54). Click on the Create bucket button and then provide all the details to complete the bucket creation. A scripted alternative is sketched after the figures.

Figure 50: AWS Console

Figure 51: AWS Login

Figure 52: Amazon Storage

Figure 53: S3 buckets

Figure 54: S3 buckets 1
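The bucket creation can also be scripted with the AWS SDK for Python instead of clicking through the console; here is a minimal sketch we added, where the region and the bucket name are placeholders (bucket names must be globally unique).

import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # assumed region
# replace the placeholder with your own globally unique bucket name
s3.create_bucket(Bucket="bigdata-example-project-logs")
print("bucket created")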
7.6.1.4.2CreateKeyPairs
LogintoAWSconsole,gotoservices,clickonEC2undercompute.SelecttheKeypairsresoure,clickonCreateKeyPairandprovideKeyPairnametocompletetheKeypairscreation.SeeFigure55
Download the.pemfileonceKeyvaluepair iscreated.This isneeded toaccess AWS Hadoop environment from client machine. This need to beimportedinPuttytoaccessyourAWSenvironemnt.SeeFigure56
7.6.1.4.2.1CreateKeyValuePairScreenshots
Figure55:AMSKeyValuePair
Figure56:AMSKeyValuePair1
7.6.1.5 Create Step Execution - Hadoop Job

Log in to the AWS console, go to Services and then select EMR. Click on Create Cluster. The cluster configuration provides details to complete the step execution creation. See Figure 57, Figure 58, Figure 59, Figure 60, Figure 61.

- Cluster name (Example: HadoopJobStepExecutionCluster)
- Select the Logging check box and provide the S3 folder location (Example: s3://bigdata-raviAndOrlyiuproject/logs/)
- Select launch mode as Step execution
- Select the step type and complete the step configuration
- Complete the Software Configuration
- Complete the Hardware Configuration
- Complete Security and access
- Then click on the Create cluster button

Once the job has started, if there are no errors the output file will be created in the output directory.

7.6.1.5.0.1 Screenshots

Figure 57: AWS EMR

Figure 58: AWS Create EMR

Figure 59: AWS Config EMR

Figure 60: AWS Create Cluster

Figure 61: AWS Create Cluster 1
7.6.1.6CreateaHiveCluster
Login toAWS console, go to services and then select EMR.Click onCreateCluster.Theclusterconfigurationprovidesdetails tocomplete.See,Figure62,Figure63,Figure64
Clustername(Example:MyFirstCluster-Hive)SelectLoggingcheckboxselectedandprovideS3folderlocationSelectlaunchmodeasClusterComplete software configuration (select hive application) and click oncreatecluster
7.6.1.6.1CreateaHiveCluster-Screenshots
Figure62:HiveCluser
Figure63:HiveCluser1
Figure64:HiveCluser2
7.6.1.7 Create a Spark Cluster

Log in to the AWS console, go to Services and then select EMR. Click on Create Cluster. The cluster configuration provides details to complete. See Figure 65, Figure 66, Figure 67.

- Cluster name (Example: MyCluster-Spark)
- Select the Logging check box and provide the S3 folder location
- Select launch mode as Cluster
- Complete the software configuration and click on Create cluster
- Select the application as Spark

7.6.1.7.1 Create a Spark Cluster - Screenshots

Figure 65: Spark Cluster

Figure 66: Spark Cluster

Figure 67: Spark Cluster
7.6.2 Twister2 ☁

7.6.2.1 Introduction

Twister2 [57] provides a data analytics hosting environment that supports different data analytics tasks, including streaming, data pipelines and iterative computations. The functionality of Twister2 is similar to other big data frameworks such as Apache Spark and Apache Flink, but there are a few key differences which differentiate Twister2 from other frameworks. Unlike many other big data systems that are designed around user APIs, Twister2 is built from the bottom up to support different APIs and workloads. The aim of Twister2 is to develop a complete computing environment for data analytics.

One major goal of Twister2 is to provide independent components that can be used by other big data systems and evolve separately. To this end Twister2 supports a composable architecture where developers can easily replace a small component in the system with a new implementation. For example, the resource scheduling layer has several implementations: it supports Kubernetes, Mesos, Slurm, Nomad and a standalone implementation. If a user wants to add support for another resource scheduler such as Yarn, they can easily do so by implementing the well-defined interfaces.

Twister2 supports both batch and streaming applications. Unlike other big data frameworks, which either support batch or streaming in the core and develop the other on top of that, Twister2 natively supports both batch and streaming, which allows Twister2 to make separate optimizations for each type.

The Twister2 project is still less than 2 years old and in its early stages, going through rapid development to complete its functionality. It is an open source project which is licensed under Apache 2.0 [58].
7.6.2.2 Twister2 APIs

Twister2 provides users with 3 levels of APIs which can be used to write applications. The 3 API levels are shown in Figure 68.

Figure 68: Twister2 APIs

As shown in Figure 68, each API level has a different level of abstraction and programming complexity. The TSet API is the most high-level API in Twister2, which in some ways is similar to the RDD API in Apache Spark or the DataSet API in Apache Flink. If users want more control over the application development, they can opt to use one of the lower level APIs.

7.6.2.2.1 TSet API
The TSet API is the most abstract API provided by Twister2. It allows users to develop their programs at the data layer, similar to the programming model of Apache Spark. Similar to RDDs in Spark, users can perform operations on top of TSet objects, which will be automatically parallelized by the framework. To get a slight understanding of the TSet API, take a look at the abstract example given on how the TSet API can be used to implement the KMeans algorithm.

public class KMeansJob extends TaskWorker {
  //......
  @Override
  public void execute() {
    //.....
    TSet<double[][]> points = TSetBuilder.newBuilder(config).createSource(new Source<double[][]>() {
      // Code for source function to read data points
    }).cache();

    TSet<double[][]> centroids = TSetBuilder.newBuilder(config).createSource(new Source<double[][]>() {
      // Code for source function to read centers (or generate random centers)
    }).cache();

    for (int i = 0; i < iterations; i++) {
      TSet<double[][]> KmeansTSet = points.map(new MapFunction<double[][], double[][]>() {
        // Code for the KMeans calculation; this will have access to the centroids which are passed in
      });
      KmeansTSet.addInput("centroids", centroids);
      Link<double[][]> allReduced = KmeansTSet.allReduce();
      TSet<double[][]> newCentroids = allReduced.map(new MapFunction<double[][], Object>() {
        /* Code that produces the new centers for the next iteration. The allReduce will result in
           a sum of all the centers sent by each worker, so this map function simply needs to compute
           the average to get the new centers */
      });
      centroids.override(newCentroids);
    }
    //.....
  }
}
When programming at the TSet API level the developer does not need to handle any information related to tasks and communications.

Note: the TSet API is currently under development and has not been released yet; therefore the API may change from what was discussed in this section. Anyone who is interested can follow the development progress or contribute to the project through the GitHub repo [58].
7.6.2.2.2 Task API

The Task API allows developers to create their application at the task level. The developer is responsible for managing task-level details when developing at this API level; the upside of using the Task API is that it is more flexible than the TSet API, so it allows developers to add custom optimizations to the application code. The TSet API is built on top of the Task API, therefore the added layer of abstraction is bound to add slightly more overhead to the runtime, which you might be able to avoid by directly coding at the Task API level.

To get a better understanding of the Task API, take a look at how the classic map-reduce problem, word count, is implemented using the Task API in the following code segment. This is only a portion of the example code; you can find the complete code for the example at [59].
public class WordCountJob extends TaskWorker {
  //.....
  @Override
  public void execute() {
    // source and aggregator
    WordSource source = new WordSource();
    WordAggregator counter = new WordAggregator();
    // build the task graph
    TaskGraphBuilder builder = TaskGraphBuilder.newBuilder(config);
    builder.addSource("word-source", source, 4);
    builder.addSink("word-aggregator", counter, 4).keyedReduce("word-source", EDGE,
        new ReduceFn(Op.SUM, DataType.INTEGER), DataType.OBJECT, DataType.INTEGER);
    builder.setMode(OperationMode.BATCH);
    // execute the graph
    DataFlowTaskGraph graph = builder.build();
    ExecutionPlan plan = taskExecutor.plan(graph);
    taskExecutor.execute(graph, plan);
  }
  //.....
}

More Task API examples can be found in the Twister2 documentation [60].
7.6.2.3 Operator API

The lowest level API provided by Twister2 is the Operator API; it allows developers to develop applications at the communication level. However, since this API only abstracts out communication operations, details such as task management need to be handled by the application developer. Again, similar to the Task API, this provides the developer with more flexibility to create more optimized applications, at the cost of being harder to program. Twister2 supports a variety of communication patterns, known as collective communications in the HPC world. These communications are highly optimized using various routing patterns to reduce the number of communication calls that go through the network, to provide users with an extremely efficient Operator API. The following list shows the communication operations that are supported by Twister2; you can find more information on each of these operations in the Twister2 documentation [61]. A small sketch of the semantics of the first two operators follows the list.

- Reduce
- Gather
- AllReduce
- AllGather
- Partition
- Broadcast
- KeyedReduce
- KeyedPartition
- KeyedGather
Initial performance comparisons that are discussed in [62] show how Twister2 outperforms popular frameworks such as Apache Flink, Apache Spark and Apache Storm in many areas. For example, Figure 69 shows a comparison between the Twister2, MPI and Apache Spark versions of the KMeans algorithm; please note that the graph is in logarithmic scale.

Figure 69: KMeans Performance Comparison [63]

Notation: DFW refers to Twister2; BSP refers to MPI (OpenMPI).

This shows that Twister2 performs around ~10x faster than Apache Spark for KMeans, and that it is on par with implementations done using OpenMPI, which is a widely used HPC framework.
7.6.2.3.1 Resources
http://www.iterativemapreduce.org/
http://www.cs.allegheny.edu/sites/amohan/teaching/CMPSC441/paper10.pdf
https://twister2.gitbook.io/twister2/
http://dsc.soic.indiana.edu/publications/Twister2.pdf
https://www.computer.org/csdl/proceedings/cloud/2018/7235/00/723501a383-abs.html
7.6.3 Twister2 Installation ☁

7.6.3.1 Prerequisites

Because Twister2 is still in the early stages of development, a binary release is not available as of yet; therefore, to try out Twister2, users need to first build the binaries from the source code.

- Operating System: Twister2 is tested and known to work on Red Hat Enterprise Linux Server release 7 and on Ubuntu 14.04, Ubuntu 16.10 and Ubuntu 18.10
- Java (JDK 1.8): covered in Section [s:hadoop-local-installation]
- G++ compiler: sudo apt-get install g++
- Maven installation: explained in Section Maven
- OpenMPI installation: explained in Section OpenMPI
- Bazel build installation: explained in Section Bazel
- Additional libraries: explained in Section TwisterExtra
7.6.3.1.1 Maven Installation

Execute the following commands to install Maven locally:

mkdir -p ~/cloudmesh/bin/maven
cd ~/cloudmesh/bin/maven
wget http://mirrors.ibiblio.org/apache/maven/maven-3/3.5.2/binaries/apache-maven-3.5.2-bin.tar.gz
tar xzf apache-maven-3.5.2-bin.tar.gz

Add the environment variables:

emacs ~/.bashrc

Add the following lines at the end of the file:

MAVEN_HOME=~/cloudmesh/bin/maven/apache-maven-3.5.2
PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_HOME PATH

source ~/.bashrc
7.6.3.1.2 OpenMPI Installation

When you compile Twister2, it will automatically download and compile OpenMPI 3.1.2 with it. If you do not like this version of OpenMPI and want to use your own version, you can compile OpenMPI using the following instructions.

- We recommend using OpenMPI 3.1.2.
- Download OpenMPI 3.1.2 from https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
- Extract the archive to a folder named openmpi-3.1.2.
- Also create a directory named build in some location. We will use this to install OpenMPI.
- Set the following environment variables:

BUILD=<path-to-build-directory>
OMPI_312=<path-to-openmpi-3.1.2-directory>
PATH=$BUILD/bin:$PATH
LD_LIBRARY_PATH=$BUILD/lib:$LD_LIBRARY_PATH
export BUILD OMPI_312 PATH LD_LIBRARY_PATH

The instructions to build OpenMPI depend on the platform. Therefore, we highly recommend looking into the $OMPI_312/INSTALL file. Platform-specific build files are available in the $OMPI_312/contrib/platform directory.

In general, please specify --prefix=$BUILD and --enable-mpi-java as arguments to the configure script. If Infiniband is available (highly recommended), specify --with-verbs=<path-to-verbs-installation>. Usually, the path to the verbs installation is /usr. In summary, the following commands will build OpenMPI for a Linux system:

cd $OMPI_312
./configure --prefix=$BUILD --enable-mpi-java
make -j 8; make install

If everything goes well, mpirun --version will show mpirun (Open MPI) 3.1.2. Execute the following command to install $OMPI_312/ompi/mpi/java/java/mpi.jar as a Maven artifact:

mvn install:install-file -DcreateChecksum=true -Dpackaging=jar -Dfile=$OMPI_312/ompi/mpi/java/java/mpi.jar -DgroupId=ompi -DartifactId=ompijavabinding -Dversion=3.1.2
7.6.3.1.3 Install Extras

Install the other requirements as follows:
sudo apt-get install g++ git build-essential automake cmake libtool-bin zip libunwind-setjmp0-dev zlib1g-dev unzip pkg-config python-setuptools -y
sudo apt-get install python-dev python-pip

Now you have successfully installed the required packages. Let us compile Twister2.
7.6.3.1.4 Compiling Twister2

Now let us get a clone of the source code:

git clone https://github.com/DSC-SPIDAL/twister2.git
cd twister2

You can compile the Twister2 distribution by using the bazel target as follows:

bazel build --config=ubuntu scripts/package:tarpkgs

This will build the twister2 distribution in the file:

bazel-bin/scripts/package/twister2-client-0.1.0.tar.gz

If you would like to compile twister2 without building the distribution packages, use the command:

bazel build --config=ubuntu twister2/...

For compiling a specific target such as communications:

bazel build --config=ubuntu twister2/comms/src/java:comms-java

7.6.3.1.5 Twister2 Distribution

After you have built the Twister2 distribution, you can extract it and use it to submit jobs:

cd bazel-bin/scripts/package/
tar -xvf twister2-0.1.0.tar.gz
7.6.4 Twister2 Examples ☁

The Twister2 documentation lists several examples [64] that users can leverage to better understand the Twister2 APIs. Currently there are several
Communication API examples and Task API examples available in the Twister2 documentation. In this section we will go through how an example can be executed with Twister2.
7.6.4.1 Submitting a Job

In order to run an example, users need to submit the example to Twister2 using the twister2 command. This command is found inside the bin directory of the distribution.

Here is a description of the command:

twister2 submit cluster job-type job-file-name job-class-name [job-args]

- submit is the command to execute
- cluster: which resource manager to use, i.e. standalone or kubernetes; this should be the name of the configuration directory for that particular resource manager
- job-type: at the moment we only support jar
- job-file-name: the file path of the job file (the jar file)
- job-class-name: name of the job class with a main method to execute

Here is an example command:

./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.task.ExampleTaskMain -itr 80 -workers 4 -size 1000 -op

In this command, the cluster is standalone and it has program arguments.

For this exercise we are using the standalone mode to submit a job. However, Twister2 does support the Kubernetes, Mesos, Slurm and Nomad resource schedulers if users want to submit jobs to larger cluster deployments.
7.6.4.2 Batch WordCount Example

In this section we will run a batch word count example from Twister2. This example only uses the communication layer and the resource scheduling layer. The threads are managed by the user program.
The example code can be found in:

twister2/examples/src/java/edu/iu/dsc/tws/examples/basic/batch/wordcount/

When we install Twister2, it will compile the examples. Let us go to the installation directory and run the example:

cd bazel-bin/scripts/package/twister2-dist/
./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.batch.wordcount.WordCountJob

This will run 4 executors with 8 tasks, so each executor will have two tasks. In the first phase, the tasks 0-3 running in each executor will generate words, and after they are finished, tasks 5-8 will consume those words and create a count.
7.6.5 HADOOP RDMA ☁

Acknowledgement: this section was copied and modified with permission from https://www.chameleoncloud.org/appliances/17/docs/

In Chameleon cloud it is possible to launch a virtual Hadoop cluster on bare-metal InfiniBand nodes with SR-IOV.

The CentOS 7 SR-IOV RDMA-Hadoop appliance is based on a CentOS 7 virtual machine image, a VM startup script and a Hadoop cluster launch script, so that users can launch VMs with SR-IOV in order to run RDMA-Hadoop across these VMs on SR-IOV enabled InfiniBand clusters.

- Image name: CC-CentOS7-RDMA-Hadoop
- Default user account: cc
- Remote access: key-based SSH
- Root access: passwordless sudo from the cc account
- Chameleon admin access: enabled on the ccadmin account
- Cloud-init enabled on boot: yes
- Repositories (Yum): EPEL, RDO (OpenStack)
- Installed packages: rebuilt kernel to enable IOMMU, Mellanox SR-IOV drivers for InfiniBand, KVM hypervisor, standard development tools such as make, gcc, gfortran, etc., config management tools: Puppet, Ansible, Salt
- OpenStack command-line clients
- Included VM image name: chameleon-rdma-hadoop-appliance.qcow2
- Included VM startup script: start-vm.sh
- Included Hadoop cluster launch script: launch-hadoop-cluster.sh
- Default VM root password: nowlab

We refer to the Chameleon cloud bare-metal user guide for documentation on how to reserve and provision resources using the CC-CentOS7-RDMA-Hadoop appliance (⚠ link missing).
7.6.5.1 Launching a Virtual Hadoop Cluster on Bare-metal InfiniBand Nodes with SR-IOV on Chameleon

We provide a CentOS 7 VM image (chameleon-rdma-hadoop-appliance.qcow2) and a Hadoop cluster launch script (launch-hadoop-cluster.sh) to facilitate users setting up virtual Hadoop clusters effortlessly.

First, launch bare-metal nodes using the RDMA-Hadoop appliance and select one of the nodes as the bootstrap node. This node will serve as the host for the master node of the Hadoop cluster and will also be used to set up the entire cluster. Now, ssh to this node. Before you can launch the cluster, you have to download your OpenStack credentials file (see how to download your credentials file). Then, create a file (henceforth referred to as ips-file) with the IP addresses of the bare-metal nodes you want to launch your Hadoop cluster on (excluding the bootstrap node), each on a new line. Next, run these commands as root:

```bash
[root@host]$ cd /home/cc
[root@host]$ ./launch-hadoop-cluster.sh <num-of-vms-per-node> <num-of-MB-per-VM> <num-of-cores-per-VM> <ips-file> <openstack-credentials-file>
```

The launch cluster script will launch VMs for you, then install and configure Hadoop on these VMs. Note that when you launch the cluster for the first time, a lot of initialization is required. Depending on the size of your cluster, it may take some time to set up the cluster. After the cluster setup is complete, the script will print an output telling you that the cluster is set up and how you can connect to the Hadoop master node. Note that the minimum required memory for each VM is 8,192 MB. The Hadoop cluster will already be set up for use. For more details on how to use the RDMA-Hadoop package to run jobs, please refer to its user guide.
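As a concrete illustration, the following sketch launches a small cluster; the node IP addresses and the file names ips.txt and openrc.sh are hypothetical placeholders, and the numeric arguments simply satisfy the 8,192 MB minimum mentioned previously.

```bash
# Hypothetical example: two VMs per node, 8192 MB and 4 cores per VM
[root@host]$ cat > ips.txt <<EOF
10.140.81.11
10.140.81.12
EOF
[root@host]$ ./launch-hadoop-cluster.sh 2 8192 4 ips.txt openrc.sh
```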
7.6.5.2 Launching Virtual Machines Manually

We provide a CentOS 7 VM image (chameleon-rdma-hadoop-appliance.qcow2) and a VM startup script (start-vm.sh) to facilitate users launching VMs manually. Before you can launch a VM, you have to create a network port. To do this, source your OpenStack credentials file (see how to download your credentials file) and run this command:

```bash
[user@host]$ neutron port-create sharednet1
```

Note the MAC address and IP address in the output of this command. You should use this MAC address while launching a VM and the IP address to ssh to the VM. You also need the PCI device ID of the virtual function that you want to assign to the VM. This can be obtained by running "lspci | grep Mellanox" and looking for the device ID (with format - XX:XX.X) of one of the virtual functions as shown next:

```bash
[cc@host]$ lspci | grep Mellanox
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
03:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
...
```

The PCI device ID of the Virtual Function is 03:00:1 in the previous example.

Now, you can launch a VM on your instance with SR-IOV using the provided VM startup script and the corresponding arguments as follows with the root account:

```bash
[root@host]$ ./start-vm.sh <vm-mac> <vm-ifname> <virtual-function-device-id>
```

Please note that <vm-mac> and <virtual-function-device-id> are the ones you get from the outputs of the previous commands, and <vm-ifname> is the name of the VM virtual NIC interface. For example:

```bash
[root@host]$ ./start-vm.sh fa:16:3e:47:48:00 tap0 03:00:1
```

You can also edit the corresponding fields in the VM startup script to change the number of cores, memory size, etc.

You should now have a VM running on your bare-metal instance. If you want to run more VMs on your instance, you will have to create more network ports. You will also have to change the name of the VM virtual NIC interface to different ones (like tap1, tap2, etc.) and select different device IDs of virtual functions.
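The following sketch shows how that might look for two additional VMs; the MAC addresses and virtual-function device IDs are placeholders that you would take from the neutron and lspci outputs shown previously.

```bash
# Create one extra port per additional VM and note each printed MAC/IP
[user@host]$ neutron port-create sharednet1
[user@host]$ neutron port-create sharednet1

# Start each VM with its own MAC, a unique tap interface, and an unused VF
[root@host]$ ./start-vm.sh fa:16:3e:47:48:01 tap1 03:00:2
[root@host]$ ./start-vm.sh fa:16:3e:47:48:02 tap2 03:00:3
```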
7.6.5.3 Extra Initialization when Launching Virtual Machines

In order to run RDMA-Hadoop across VMs with SR-IOV, and to keep the size of the VM image small, extra initialization will be executed automatically when launching VMs, which includes:

- Detect Mellanox SR-IOV drivers; download and install them if non-existent
- Detect the Java package; download and install it if non-existent
- Detect the RDMA-Hadoop package; download and install it if non-existent

After finishing the extra initialization procedure, you should be able to run Hadoop jobs with SR-IOV support across VMs. Note that this initialization will be done automatically. For more details about the RDMA-Hadoop package, please refer to its user guide.
7.6.5.4 Important Note for Tearing Down Virtual Machines and Deleting Network Ports

Once you are done with your experiments, you should kill all the launched VMs and delete the created network ports. If you used the launch-hadoop-cluster.sh script to launch VMs, you can do this by running the kill-vms.sh script as shown next. This script will kill all launched VMs and also delete all the created network ports.

```bash
[root@host]$ cd /home/cc
[root@host]$ ./kill-vms.sh <ips-file> <openstack-credentials-file>
```

Please note that it is important to delete unused ports after experiments.

If you launched VMs using the start-vm.sh script, you should first manually kill all the VMs. Then, delete all the created network ports using this command:

```bash
[user@host]$ neutron port-delete PORT
```
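If you no longer remember the IDs of the ports you created, a sketch like the following can help; it assumes the standard neutron CLI, and you should review the listed ports before deleting any of them.

```bash
# List all ports to find the IDs of the ones you created
[user@host]$ neutron port-list

# Delete each of your ports by ID (placeholder ID shown)
[user@host]$ neutron port-delete 5a1b2c3d-...
```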
8 CONTAINER

8.1 INTRODUCTION TO CONTAINERS ☁

Learning Objectives

- Knowing what a container is.
- Differentiating containers from virtual machines.
- Understanding the historical aspects that led to containers.

This section covers an introduction to containers that is split up into four parts. We discuss microservices, serverless computing, Docker, and Kubernetes.
8.1.1 Motivation - Microservices

We discuss the motivation for containers and contrast them to virtual machines. Additionally we provide a motivation for containers as they can be used for microservices.

Container 11:01 Container A

8.1.2 Motivation - Serverless Computing

We enhance our motivation while contrasting containers and microservices and relating them to serverless computing. We anticipate that serverless computing will increase in importance over the next years.

Container 15:08 Container B

8.1.3 Docker

In order for us to use containers, we go beyond the historical motivation that was introduced in a previous section and focus on Docker, a predominant technology for containers on Windows, Linux, and macOS.

Container 40:09 Container C

8.1.4 Docker and Kubernetes

We continue our discussion about Docker and introduce Kubernetes, which allows us to run multiple containers on multiple servers, building a cluster of containers.

Container 50:14 Container D
8.2 DOCKER

8.2.1 Introduction to Docker ☁

Docker is the company driving the container movement and the only container platform provider to address every application across the hybrid cloud. Today's businesses are under pressure to digitally transform but are constrained by existing applications and infrastructure while rationalizing an increasingly diverse portfolio of clouds, datacenters, and application architectures. Docker enables true independence between applications and infrastructure and between developers and IT ops, unlocking their potential and creating a model for better collaboration and innovation. An overview of Docker is provided at

https://docs.docker.com/engine/docker-overview/

Figure 70: Docker Containers [Image Source] [65]

Figure 70 shows how Docker containers fit into the system.

Docker Platform

Docker provides users and developers with the tools and technologies that are needed to manage their application development using containers. Developers can easily set up different environments for development, testing, and production.
8.2.1.1 Docker Engine

The Docker engine can be thought of as the core of the Docker runtime. The Docker engine mainly provides three services. Figure 71 shows how the Docker engine is composed.

- A long-running server which manages the containers
- A REST API
- A command line interface

Figure 71: Docker Engine Component Flow [Image Source] [65]

8.2.1.2 Docker Architecture

The main concept of the Docker architecture is based on the simple client-server model. Docker clients communicate with the Docker server, also known as the Docker daemon, to request various resources and services. The daemon manages all the background tasks that need to be performed to complete client requests. Managing and distributing containers, running the containers, building containers, etc. are responsibilities of the Docker daemon. Figure 72 shows how the Docker architecture is set up. The client module and server can run either on the same machine or on separate machines. In the latter case the communication between client and server is done through the network.

Figure 72: Docker Architecture [Image Source] [65]
8.2.1.3 Docker Survey

In 2016 Docker Inc. surveyed over 500 Docker developers and operations experts in various phases of deploying container-based technologies. The result is available in the Docker Survey 2016, as seen in Figure 73.

https://www.docker.com/survey-2016

Figure 73: Docker Survey Results 2016 [Image Source] [65]
8.2.2 Running Docker Locally ☁

⚠ Please verify if the instructions are still up to date. Rapid changes could mean they can be outdated quickly. Also we assume the Ubuntu installations may have changed and may be different between 18.04 and 19.04.

The official installation documentation for Docker can be found by visiting the following Web page:

https://www.docker.com/community-edition

Here you will find a variety of packages, one of which will hopefully be suitable for your operating system. The supported operating systems currently include:

OSX, Windows, Centos, Debian, Fedora, Ubuntu, AWS, Azure

Please choose the one most suitable for you. For your convenience we provide you with installation instructions for OSX (Section Docker on OSX), Windows 10 (Section Docker on Windows), and Ubuntu (Section Docker on Ubuntu).
8.2.2.1 Installation for OSX

The Docker community edition for OSX can be found at the following link:

https://store.docker.com/editions/community/docker-ce-desktop-mac

We recommend that at this time you get the version Docker CE for Mac (stable):

https://download.docker.com/mac/stable/Docker.dmg

Clicking on the link will download a dmg file to your machine, which you then will need to install by double clicking and allowing access to the dmg file. Upon installation a whale in the top status bar shows that Docker is running, and you can access it via a terminal.

Docker integrated in the menu bar on OSX
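To confirm that the installation succeeded, a quick check from the terminal is sufficient; the version numbers shown will naturally differ from installation to installation.

```bash
# Verify that the docker client is on the PATH and the daemon responds
local$ docker --version
local$ docker info
```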
8.2.2.2 Installation for Ubuntu

In order to install the Docker community edition for Ubuntu, you first have to register the repository from where you can download it. This can be achieved as follows:

```bash
local$ sudo apt-get update
local$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
local$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
local$ sudo apt-key fingerprint 0EBFCD88
local$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
```

Now that you have configured the repository location, you can install it after you have updated the operating system. The update and install is done as follows:

```bash
local$ sudo apt-get update
local$ sudo apt-get install docker-ce
```

Once installed, execute the following command to make sure the installation is done properly:

```bash
local$ sudo systemctl status docker
```

This should give you an output similar to the next:

```
docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-10-03 13:02:04 EDT; 15min ago
     Docs: https://docs.docker.com
 Main PID: 6663 (dockerd)
    Tasks: 39
```
8.2.2.3 Installation for Windows 10

Docker needs Microsoft's Hyper-V to be enabled, but note that enabling Hyper-V will impact running virtual machines with other hypervisors such as VirtualBox.

Steps to Install

- Download Docker for Windows (Community Edition) from the following link: https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe
- Follow the wizard steps in the installer
- Launch Docker. Docker usually launches automatically during Windows startup.

8.2.2.4 Testing the Install

To test if it works, execute the following command in a terminal:

```bash
local$ docker version
```

You should see an output similar to:

```
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Tue Mar 28 00:40:02 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Fri Mar 24 00:00:50 2017
 OS/Arch:      linux/amd64
 Experimental: true
```

To see if you can run a container, use

```bash
local$ docker run hello-world
```

Once executed you should see an output similar to:

```
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
78445dd45222: Pull complete
Digest: sha256:c5515758d4c5e1e838e9cd307f6c6a.....
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs
    the executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which
    sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 local$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://cloud.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/
```
8.2.3 Dockerfile ☁

In order for us to build containers, we need to know what is in the container and how to create an image representing a container. To do this, a convenient specification format called Dockerfile can be used. Once a Dockerfile is created, we can build images from it.

We showcase the use of a Dockerfile on a simple example using a REST service.

This example is copied from the official Docker documentation hosted at

https://docs.docker.com/get-started/part2/#publish-the-image

8.2.3.1 Specification

It is best to start with an empty directory in which we create a Dockerfile:

```bash
local$ mkdir ~/cloudmesh/docker
local$ cd ~/cloudmesh/docker
```
Next, we create an empty file called Dockerfile:

```bash
local$ touch Dockerfile
```

We copy the following contents into the Dockerfile and after that create a simple REST service:

```
# Use an official Python runtime as a parent image
FROM python:3.7-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Make port 80 available
EXPOSE 80

# Run app.py when the container launches
CMD ["python", "app.py"]
```

We also create a requirements.txt file that we need for installing the necessary Python packages:

```
Flask
```

The example application we use here is a student info served via a RESTful service implemented using Python flask. It is stored in the file app.py:

```python
from flask import Flask, jsonify
import os

app = Flask(__name__)

@app.route('/student/albert')
def alberts_information():
    data = {
        'firstname': 'Albert',
        'lastname': 'Zweistsein',
        'university': 'Indiana University',
        'email': '[email protected]'
    }
    return jsonify(**data)

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=80)
```

To build the container, we can use the following command:

```bash
local$ docker build -t students .
```

To run the service, open a new window and cd into the directory where your code is located. Now say

```bash
local$ docker run -d -p 4000:80 students
```

Your Docker container will run and you can visit it by using the command

```bash
local$ curl http://localhost:4000/student/albert
```

To stop the container do a

```bash
local$ docker ps
```

and locate the id of the container, e.g., 2a19776ab812, and then run this

```bash
local$ docker stop 2a19776ab812
```

while the number is the container id.

To delete the Docker container image, you must first stop all instances using it and then remove the image. You can see the images with the command

```bash
local$ docker images
```

Then you can locate all containers using that image while looking in the IMAGE column, or using a simple fgrep in case you have many images. Stop the containers using that image, and then you can say

```bash
local$ docker rm 74b9b994c9bd
```

Once you killed all containers using that image, you can remove the image with the rmi command:

```bash
local$ docker rmi 8b3246425402
```

8.2.3.2 References

The reference documentation about Dockerfiles can be found at

https://docs.docker.com/engine/reference/builder/
8.2.4 Docker Hub ☁

Docker Hub is a cloud-based registry service which provides a "centralized resource for container image discovery, distribution and change management, user and team collaboration, and workflow automation throughout the development pipeline" [65]. There are both private and public repositories. A private repository can only be used by people within their own organization.
Docker Hub is integrated into Docker as the default registry. This means that the docker pull command will initialize the download automatically from Docker Hub [66]. It allows users to download (pull), build, test, and store their images for easy deployment on any host they may have [65].

8.2.4.1 Create Docker ID and Log In

A log-in is not necessary for pulling Docker images from the Hub, but it is necessary for pushing images to Docker Hub for sharing. Thus to store images on Docker Hub you need to create an account by visiting the Docker Hub Web page. Docker Hub offers in general a free account, but it has restrictions. The free account allows you to share images that you distribute publicly, but it only allows one private Docker Hub repository. In case you need more, you will need to upgrade to a paid plan.

For the rest of the tutorial we assume that you use the environment variable DOCKERHUB to indicate your username. It is easiest if you set it in your shell with

```bash
local$ export DOCKERHUB=<PUT YOUR DOCKER USERNAME HERE>
```

8.2.4.2 Searching for Docker Images

There are two ways to search for Docker images on Docker Hub:

One way is to use the Docker command line tool. We can open a terminal and run the docker search command. For example, the following command searches for centOS images:

```bash
local$ sudo docker search centos
```

You will see output similar to:

```
NAME             DESCRIPTION         STARS   OFFICIAL   AUTOMATED
centos           Official CentOS     4130    [OK]
ansible/centos7  Ansible on Centos7  105                [OK]
...
```
If you do not want to use sudo with the docker command each time, you need to add the current user to the docker group. You can do that using the following commands:

```bash
local$ sudo usermod -aG docker ${USER}
local$ su - ${USER}
```

This will prompt you to enter the password for the current user. Now you should be able to execute the previous command without using sudo.

Official repositories in Docker Hub are public, certified repositories from vendors and contributors to Docker. They contain Docker images from vendors like Canonical, Oracle, and Red Hat that you can use as the basis to build your applications and services. There is one official repository in this list, the first one, centos.

The other way is to search via the Web Search Box at the top of the Docker web page by typing the keyword. The search results can be sorted by number of stars, number of pulls, and whether it is an official image. Then for each search result, you can verify the information of the image by clicking the details button to make sure this is the right image that fits your needs.
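If you are only interested in certified images, the command line search can be narrowed with a filter; this is a small sketch using the standard --filter option of docker search.

```bash
# Show only official images for the keyword centos
local$ docker search --filter is-official=true centos
```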
8.2.4.3 Pulling Images

A particular image (take centos as an example) can be pulled using the following command:

```bash
local$ docker pull centos
```

Tags can be used to specify the image to pull. By default the tag is latest; therefore the previous command is the same as the following:

```bash
local$ docker pull centos:latest
```

You can use a different tag:

```bash
local$ docker pull centos:6
```

To check the existing local Docker images, run the following command:

```bash
local$ docker images
```

The results show:

```
REPOSITORY   TAG      IMAGE ID       CREATED       SIZE
centos       latest   26cb1244b171   2 weeks ago   195MB
centos       6        2d194b392dd1   2 weeks ago   195MB
```
8.2.4.4 Create Repositories

In order to push images to Docker Hub, you need to have an account and create a repository.

When you first create a Docker Hub user, you see a Get started with Docker Hub screen, from which you can click directly into Create Repository. You can also use the Create menu to Create Repository. When creating a new repository, you can choose to put it in your Docker ID namespace, or that of any organization that you are in the owners team [67].

As an example, we created a repository cloudtechnology with the namespace $DOCKERHUB (here DOCKERHUB is your Docker Hub username). Hence the full name is $DOCKERHUB/cloudtechnology.

8.2.4.5 Pushing Images

To push an image to the repository created, the following steps can be followed.

First, log into Docker Hub from the command line by specifying the username. If you encounter permission issues please use sudo in front of the command:

```bash
$ docker login --username=$DOCKERHUB
```

Enter the password when prompted. If everything worked you will get a message similar to:

```
Login Succeeded
```

Second, check the image ID using:

```bash
$ docker images
```

The result looks similar to:

```
REPOSITORY      TAG      IMAGE ID       CREATED       SIZE
cloudmesh-nlp   latest   1f26a5f7a1b4   10 days ago   1.79GB
centos          latest   26cb1244b171   2 weeks ago   195MB
centos          latest   2d194b392dd1   2 weeks ago   195MB
```

Here, the image with ID 1f26a5f7a1b4 is the one to push to Docker Hub. You can choose another image instead if you like.

Third, tag the image:

```bash
$ docker tag 1f26a5f7a1b4 $DOCKERHUB/cloudmesh:v1.0
```

Here we have used a version number as a tag. However, another good way of adding a tag is to use a keyword/tag that will help you understand what this container should be used in conjunction with, or what it represents.

Fourth, now the list of images will look something like:

```
REPOSITORY             TAG      IMAGE ID       CREATED    SIZE
cloudmesh-nlp          latest   1f26a5f7a1b4   10 d ago   1.79GB
$DOCKERHUB/cloudmesh   v1.0     1f26a5f7a1b4   10 d ago   1.79GB
centos                 latest   26cb1244b171   2 w ago    195MB
centos                 latest   2d194b392dd1   2 w ago    195MB
```

Fifth, now you can see an image under the name $DOCKERHUB/cloudmesh. We now need to push this image to the repository that we created on the Docker Hub website. For that, execute the following command:

```bash
$ docker push $DOCKERHUB/cloudmesh
```

It shows something similar to:

```
The push refers to repository [docker.io/$DOCKERHUB/cloudmesh]
18f9479cfc2c: Pushed
e9ddee98220b: Pushed
...
db584c622b50: Mounted from library/ubuntu
a94e0d5a7c40: Mounted from library/ubuntu
...
v1.0: digest: sha256:305b0f911077d9d6aab4b447b... size: 3463
```

To make sure, you can check on Docker Hub if the image that was pushed is listed in the repository that we created.

Sixth, now the image is available on Docker Hub. Everyone can pull it, since it is a public repository, by using the command:

```bash
$ docker pull USERNAME/cloudmesh
```

Please remember that USERNAME is the username of the user that makes this image publicly available. If you are the user, you will see the value being the one from $DOCKERHUB; if not, you will see here the username of the user uploading the image.

8.2.4.6 Resources

- The official Overview of Docker Hub [65]
- Information about using Docker repositories can be found at Repositories on Docker Hub [67]
- How to Use Docker Hub [66]
- Docker Tutorial Series [68]
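To double check that a push actually succeeded, one option is to remove the local copy and pull the tagged image back; a small sketch follows, where the tag v1.0 matches the example shown previously.

```bash
# Remove the local tag, then pull it back from Docker Hub
$ docker rmi $DOCKERHUB/cloudmesh:v1.0
$ docker pull $DOCKERHUB/cloudmesh:v1.0
```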
8.3 DOCKER AS PAAS

8.3.1 Docker Swarm ☁

A swarm is a group of machines that are running Docker and are joined into a cluster. Docker commands are executed on a cluster by a swarm manager. The machines in a swarm can be physical or virtual. After joining a swarm, they are referred to as nodes.

8.3.1.1 Terminology

In this section, if a command is prefixed with local$, it means the command is to be executed on your local machine. If it is prefixed with either master$ or worker$, that means the command is to be executed from within a virtual machine that was created.
8.3.1.2 Creating a Docker Swarm Cluster

A swarm is made up of multiple nodes, which can be either physical or virtual machines. We use master as the name of the host that is run as master and worker-1 as a host run as a worker, where the number indicates the i-th worker. The basic steps are:

1. run

```bash
master$ docker swarm init
```

to enable swarm mode and make your current machine a swarm manager,

2. then run

```bash
worker-1$ docker swarm join
```

on other machines to have them join the swarm as workers. Choose one of the contexts described next to see how this plays out in practice. We use VMs to quickly create a two-machine cluster and turn it into a swarm.

8.3.1.3 Create a Swarm Cluster with VirtualBox

In case you do not have access to multiple physical machines, you can create a virtual cluster on your machine with the help of VirtualBox. Instead of using vagrant, we can use the built-in docker-machine command to start several virtual machines.

If you do not have VirtualBox installed on your machine, install it. Additionally you require docker-machine to be installed on your local machine. To install docker-machine, please follow the instructions in the Docker documentation at Install Docker Machine.

To create the virtual machines you can use the command as follows:

```bash
local$ docker-machine create --driver virtualbox master
local$ docker-machine create --driver virtualbox worker-1
```

To list the VMs and get their IP addresses, use this command:

```bash
local$ docker-machine ls
```
8.3.1.4 Initialize the Swarm Manager Node and Add Worker Nodes

The first machine acts as the manager, which executes management commands and authenticates workers to join the swarm, and the second is a worker.

To instruct the first VM to become the master, first we need to log into the VM that was named master. To log in you can use ssh; execute the following command on your local machine:

```bash
local$ docker-machine ssh master
```

Now, since we are inside the master VM, we can configure this VM as the Docker swarm manager. Execute the following command within the master VM to initialize the swarm:

```bash
master$ docker swarm init
```

If you get an error stating something similar to "could not choose an IP address to advertise since this system has multiple addresses on different interfaces", use the following command instead. To find the IP address, execute the command ifconfig and pick the IP address which is most similar to 192.x.x.x.

```bash
master$ docker swarm init --advertise-addr 192.x.x.x
```

The output will look like this, where IP-myvm1 is the IP address of the first VM:

```
Swarm initialized: current node (p6hmohoeuggtwqj8xz91zbs5t) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-5c3anju1pwx94054r3vx0v7j4obyuggfu2cmesnx 192.168.99.100:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

Now that we have the Docker swarm manager up, we can add worker machines to the swarm. The command that is printed in the output shown previously can be used to join workers to the manager. Please note that you need to use the output command that is generated when you run docker swarm init, since the token values will be different.
Now we need to use a separate shell to log into the worker VM that we created. Open up a new shell (or terminal) and use the following command to ssh into the worker:

```bash
local$ docker-machine ssh worker-1
```

Once you are in the worker, execute the following command to join the worker to the swarm manager:

```bash
worker-1$ docker swarm join --token SWMTKN-1-5c3anju1pwx94054r3vx0v7j4obyuggfu2cmesnx 192.168.99.100:2377
```

The generic version of the command would be as follows; you need to fill in the correct values for the values marked as '<>' to execute the command:

```bash
worker-1$ docker swarm join --token <token> <myvm ip>:<port>
```

You will see an output stating that this machine joined the docker swarm:

```
This node joined a swarm as a worker.
```

If you want to add another node as a manager to the current swarm, you can execute the following command and follow the instructions. However, this is not needed for this exercise.

```bash
newvm$ docker swarm join-token manager
```

Run docker-machine ls to verify that worker-1 is now the active machine, as indicated by the asterisk next to it:

```bash
local$ docker-machine ls
```

If the asterisk is not present, execute the following command:

```bash
local$ sudo sh -c 'eval "$(docker-machine env worker-1)"; docker-machine ls'
```

The output will look similar to:

```
NAME       ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER        ERRORS
master     -        virtualbox   Running   tcp://192.168.99.100:2376           v18.06.1-ce
worker-1   *        virtualbox   Running   tcp://192.168.99.102:2376           v18.06.1-ce
```
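You can also verify the swarm membership from the manager side; the following sketch assumes you are logged into the master VM.

```bash
# List all nodes in the swarm; the manager is marked as Leader
master$ docker node ls
```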
8.3.1.5 Deploy the application on the swarm manager

Now we can try to deploy a test application. First we need to create a Docker configuration file, which we will name docker-compose.yml. Since we are in the VM we need to create the file using the terminal. Follow the steps given next to create and save the file. First log into the master:

```bash
local$ docker-machine ssh master
```

Then,

```bash
master$ vi docker-compose.yml
```

This command will open an editor. Press the Insert key to enable editing and then copy paste the following into the document:

```yaml
version: "3"
services:
  web:
    # replace username/repo:tag with your name and image details
    image: username/repo:tag
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
      restart_policy:
        condition: on-failure
    ports:
      - "4000:80"
    networks:
      - webnet
networks:
  webnet:
```

Then press the Esc key and enter :wq to save and close the editor.

Once we have the file we can deploy the test application using the following command, which will be executed on the master:

```bash
master$ docker stack deploy -c docker-compose.yml getstartedlab
```

To verify that the services and associated containers have been distributed between both master and worker, execute the following command:

```bash
master$ docker stack ps getstartedlab
```

The output will look similar to:

```
ID            NAME                 IMAGE              NODE      DESIRED STATE  CURRENT STATE            ERROR  PORTS
wpqtkv69qbee  getstartedlab_web.1  username/repo:tag  worker-1  Running        Preparing 4 seconds ago
whkiecyenuv0  getstartedlab_web.2  username/repo:tag  master    Running        Preparing 4 seconds ago
13obecvxohh1  getstartedlab_web.3  username/repo:tag  worker-1  Running        Preparing 5 seconds ago
76srj0nflagi  getstartedlab_web.4  username/repo:tag  worker-1  Running        Preparing 5 seconds ago
ymqoonad5c1f  getstartedlab_web.5  username/repo:tag  master    Running        Preparing 5 seconds ago
```
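If you later want more or fewer replicas, a sketch of the two common options follows: either change the replicas value in docker-compose.yml and redeploy, or scale the running service directly; the service name getstartedlab_web matches the example stack shown previously.

```bash
# Option 1: edit replicas in docker-compose.yml, then redeploy the stack
master$ docker stack deploy -c docker-compose.yml getstartedlab

# Option 2: scale the running service directly
master$ docker service scale getstartedlab_web=10
```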
8.3.2 Docker and Docker Swarm on FutureSystems ☁

This section is only for IU students that take classes with us.

This section introduces how to run Docker containers on FutureSystems. Currently we have deployed Docker swarm on Echo.

8.3.2.1 Getting Access

You will need an account on FutureSystems and be enrolled in an active project. To verify, try to see if you can log into victor.futuresystems.org. You need to be a member of a valid FutureSystems project and have submitted an ssh public key via the FutureSystems portal.

For Fall 2018 classes at IU you need to be in the following project:

https://portal.futuresystems.org/project/553
If your access to the victor host has been verified, try to login to the dockerswarmheadnode.ToconvenientlydothisletusdefinesomeLinuxenvironmentvariablestosimplifytheaccessandthematerialpresentedhere.Youcanplacethemeveninyour.bashrcor.bash_profilesotheinformationgetspopulatedwheneveryoustartanewterminal.Ifyoudirectlyedit thefilesmakesure toexecute thesourcecommandtorefreshtheenvironmentvariablesforthecurrentsessionusingsource.bashrcorsource.bash_profile.Oryoucanclosethecurrentshellandreopenanewone.
NowyoucanusethetwovariablesthatweresettologintotheEchoserer,usingthefollowingcommand
local$exportECHO=149.165.150.76
local$exportFS_USER=<putyourfutersystemaccountnamehere>
Note: If you have access to india but not the docker swarm system, yourprojectmaynothavebeenauthorizedtoaccess thedockerswarmcluster.SendatickettoFutureSystemsticketsystemtorequestthis.
Onceloggedintothedockerswarmheadnode,trytorun:
toverifydockerrunworks.
8.3.2.2 Creating a service and deploying it to the swarm cluster

While docker run can start a container and you may even attach to its console, the recommended way to use a Docker swarm cluster is to create a service and have it run on the swarm cluster. The service will be scheduled to one or more of the nodes of the swarm cluster, based on the configuration. It is also easy to scale up the service when more swarm nodes are available. Docker swarm really makes it easier for service/application developers to focus on functionality development and not worry about how and where to bind the service to some resources/servers. The deployment, access, and scaling up/down when necessary are all managed transparently, thus achieving the new paradigm of serverless computing.

As an example, the following command creates a service and deploys it to the swarm cluster; if the port is in use, the port 9001 used in the command can be changed to an available port:

```bash
echo$ docker service create --name notebook_test -p 9001:8888 \
    jupyter/datascience-notebook start-notebook.sh \
    --NotebookApp.password=NOTEBOOK_PASS_HASH
```

The NOTEBOOK_PASS_HASH can be generated in Python:

```python
>>> import IPython
>>> IPython.lib.passwd("YOUR_SELECTED_PASSWORD")
'sha1:52679cadb4c9:6762e266af44f86f3d170ca1......'
```

Then pass in the string starting with 'sha1:...'.

The command pulls a published image from the Docker cloud, starts a container, and runs a script to start the service inside the container with the necessary parameters.
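If you prefer generating the hash without entering the interactive interpreter, a one-line variant from the shell works as well; this is a sketch using the same IPython.lib.passwd helper.

```bash
# Print the sha1:... hash for a chosen password (placeholder shown)
echo$ python -c 'import IPython; print(IPython.lib.passwd("YOUR_SELECTED_PASSWORD"))'
```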
The option "-p 9001:8888" maps the service port inside the container (8888) to an external port of the cluster node (9001) so the service can be accessed from the Internet. In this example, you can then visit the URL:

```bash
local$ open http://$ECHO:9001
```

to access the Jupyter notebook. Use the password you specified when you created the service to log in.

Please note the service will be dynamically deployed to a container instance, which will be allocated to a swarm node based on the allocation policy. Docker makes this process transparent to the user and even creates mesh routing, so you can access the service using the IP address of the management head node of the swarm cluster, no matter which actual physical node the service was deployed to.

This also implies that the external port number used has to be free at the time the service is created.

Some useful related commands:

```bash
echo$ docker service ls
```

lists the currently running services.

```bash
echo$ docker service ps notebook_test
```

lists the detailed info of the container where the service is running.

```bash
echo$ docker node ps NODE
```

lists all the running containers of a node.

```bash
echo$ docker node ls
```

lists all the nodes in the swarm cluster.

To stop the service and the container:

```bash
echo$ docker service rm notebook_test
```

8.3.2.3 Create your own service
You can create your own service and run it. To do so, start from a base image, e.g., an ubuntu image from the Docker cloud. Then you could:

- Run a container from the image and attach to its console to develop the service, and create a new image from the changed instance using the command 'docker commit' (a sketch of this approach follows at the end of this section).
- Create a dockerfile, which has the step-by-step building process of the service, and then build an image from it.

In reality, the first approach is probably useful when you are in the phase of developing and debugging your application/service. Once you have the step-by-step instructions developed, the latter approach is the recommended way.

Publish the image to the Docker cloud by following this documentation:

https://docs.docker.com/docker-cloud/builds/push-images/

Please make sure no sensitive information is included in the image to be published. Alternatively you could publish the image internally to the swarm cluster.
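As a minimal sketch of the commit-based workflow, assuming a hypothetical image name myservice under your $DOCKERHUB namespace:

```bash
# Develop interactively inside a base container
echo$ docker run -it ubuntu bash
# ... install and configure your service, then exit ...

# Find the stopped container's id and freeze it into an image
echo$ docker ps -a
echo$ docker commit <container-id> $DOCKERHUB/myservice:v0.1

# Publish so the swarm nodes can pull it
echo$ docker push $DOCKERHUB/myservice:v0.1
```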
8.3.2.4 Publish an image privately within the swarm cluster

Once the image is published and available to the swarm cluster, you can start a new service from the image similar to the Jupyter Notebook example.

8.3.2.5 Exercises

E.Docker.Futuresystems.1:

Obtain an account on FutureSystems.

E.Docker.Futuresystems.2:

Create a REST service with swagger codegen and run it on the Echo cloud (see the example in this section).
8.3.3 Hadoop with Docker ☁

In this section we will explore the Map/Reduce framework using Hadoop provided through a Docker container.

We will showcase the functionality on a small example that calculates minimum, maximum, average, and standard deviation values using several input files which contain float numbers.

This section is based on the Hadoop release 3.1.1, which includes significant enhancements over the previous version of Hadoop 2.x. Changes include the use of the following software:

- CentOS 7
- systemctl
- Java SE Development Kit 8

A Dockerfile to create the Hadoop deployment is available at

https://github.com/cloudmesh-community/book/blob/master/examples/docker/hadoop/3.1.1/Dockerfile
8.3.3.1 Building Hadoop using Docker

You can build hadoop from the Dockerfile as follows:

```bash
$ mkdir cloudmesh-community
$ cd cloudmesh-community
$ git clone https://github.com/cloudmesh-community/book.git
$ cd book/examples/docker/hadoop/3.1.1
$ docker build -t cloudmesh/hadoop:3.1.1 .
```

The complete Docker image for Hadoop consumes 1.5GB:

```bash
$ docker images
REPOSITORY         TAG     IMAGE ID       CREATED      SIZE
cloudmesh/hadoop   3.1.1   ba2c51f94348   1 hour ago   1.52GB
```

To use the image interactively you can start the container as follows:

```bash
$ docker run -it cloudmesh/hadoop:3.1.1 /etc/bootstrap.sh -bash
```

It may take a few minutes at first to download the image.
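Once you are inside the container, a quick sanity check such as the following confirms that the Hadoop tooling is on the PATH; the version string printed should match the 3.1.1 release used here.

```bash
# Inside the container: print the Hadoop version
$ hadoop version
```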
8.3.3.2 Hadoop Configuration Files

The configuration files are included in the conf folder.

8.3.3.3 Virtual Memory Limit

In case you need more memory, you can increase it by changing the parameters in the file mapred-site.xml, for example:

- mapreduce.map.memory.mb to 4096
- mapreduce.reduce.memory.mb to 8192
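For illustration, such a change corresponds to property entries like the following inside the <configuration> element of mapred-site.xml; this is only a sketch of the fragment, and the file path inside the container is an assumption based on standard Hadoop layouts.

```bash
# Illustrative fragment for mapred-site.xml
# (path assumed: $HADOOP_HOME/etc/hadoop/mapred-site.xml)
cat <<'EOF'
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
EOF
```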
8.3.3.4 hdfs Safemode leave command

Safemode for HDFS is a read-only mode for the HDFS cluster, where it does not allow any modifications of files and blocks. The Namenode disables safemode automatically after starting up normally. If required, HDFS can be forced to leave safemode explicitly with this command:

```bash
$ hdfs dfsadmin -safemode leave
```
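Before forcing the Namenode out of safemode it can be useful to check the current state; the companion subcommand for that is sketched next.

```bash
# Query whether HDFS is currently in safemode
$ hdfs dfsadmin -safemode get
```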
8.3.3.5 Examples

We included a statistics and a PageRank example in the container. The examples are also available in github at

https://github.com/cloudmesh-community/book/tree/master/examples/docker/hadoop/3.1.1/examples

We explain the examples next.

8.3.3.5.1 Statistical Example with Hadoop

After we launch the container and use the interactive shell, we can run the statistics Hadoop application, which calculates the minimum, maximum, average, and standard deviation from values stored in a number of input files. Figure 74 shows the computing phases in a MapReduce job.
To achieve this, this Hadoop program reads multiple files from HDFS and provides the calculated values. We walk through every step from compiling Java source code to reading an output file from HDFS. The idea of this exercise is to get you started with Hadoop and the MapReduce concept. You may have seen the WordCount example from the Hadoop official website or documentation; this example has the same functions (Map/Reduce), except that you will be computing basic statistics such as min, max, average, and standard deviation of a given data set.

The input to the program will be text file(s) carrying exactly one floating point number per line. The result file includes the min, max, average, and standard deviation.

Figure 74: MapReduce example in Docker
8.3.3.5.1.1 Base Location

The example is available within the container at:

```bash
container$ cd /cloudmesh/examples/statistics
```

8.3.3.5.1.2 Input Files

Test input files are available under the /cloudmesh/examples/statistics/input_data directory inside the container. The statistics values for this input are Min: 0.20, Max: 19.99, Avg: 9.51, StdDev: 5.55 for all input files.

The 10 files contain 55000 lines to process, and each line is a random float point value ranging from 0.2 to 20.0.

8.3.3.5.1.3 Compilation

The source code file name is MinMaxAvgStd.java, which is available at /cloudmesh/examples/statistics/src.
There are three functions in the code: Map, Reduce, and Main. Map reads each line of a file and updates the values to calculate the minimum and maximum, and Reduce collects the mapper results to produce the average and standard deviation values at last.

```bash
$ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
$ mkdir /cloudmesh/examples/statistics/dest
$ javac -classpath $HADOOP_CLASSPATH -d /cloudmesh/examples/statistics/dest /cloudmesh/examples/statistics/src/MinMaxAvgStd.java
```

These commands simply prepare the compilation of the example code, and the compiled class files are generated at the dest location.

8.3.3.5.1.4 Archiving Class Files

The jar command tool helps archive the classes in a single file, which will be used when Hadoop runs this example. This is useful because a jar file contains all necessary files to run a program:

```bash
$ cd /cloudmesh/examples/statistics
$ jar -cvf stats.jar -C ./dest/ .
```

8.3.3.5.1.5 HDFS for Input/Output

The input files need to be uploaded to HDFS, as Hadoop runs this example by reading input files from HDFS:

```bash
$ export PATH=$PATH:$HADOOP_HOME/bin
$ hadoop fs -mkdir stats_input
$ hadoop fs -put input_data/* stats_input
$ hadoop fs -ls stats_input/
```

If uploading is completed, you may see file listings like:

```
Found 10 items
-rw-r--r--   1 root supergroup      13942 2018-02-28 23:16 stats_input/data_1000.txt
-rw-r--r--   1 root supergroup     139225 2018-02-28 23:16 stats_input/data_10000.txt
-rw-r--r--   1 root supergroup      27868 2018-02-28 23:16 stats_input/data_2000.txt
-rw-r--r--   1 root supergroup      41793 2018-02-28 23:16 stats_input/data_3000.txt
-rw-r--r--   1 root supergroup      55699 2018-02-28 23:16 stats_input/data_4000.txt
-rw-r--r--   1 root supergroup      69663 2018-02-28 23:16 stats_input/data_5000.txt
-rw-r--r--   1 root supergroup      83614 2018-02-28 23:16 stats_input/data_6000.txt
-rw-r--r--   1 root supergroup      97490 2018-02-28 23:16 stats_input/data_7000.txt
-rw-r--r--   1 root supergroup     111451 2018-02-28 23:16 stats_input/data_8000.txt
-rw-r--r--   1 root supergroup     125337 2018-02-28 23:16 stats_input/data_9000.txt
```
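To peek at the uploaded data, you can read a few lines back from HDFS; this is just a quick check, and piping -cat into head may print a harmless broken-pipe warning.

```bash
# Show the first five values of one input file stored in HDFS
$ hadoop fs -cat stats_input/data_1000.txt | head -5
```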
8.3.3.5.1.6 Run Program with a Single Input File

We are ready to run the program to calculate values from the text files. First, we simply run the program with a single input file to see how it works. data_1000.txt contains 1000 lines of floats, so we use this file here:

```bash
$ hadoop jar stats.jar exercise.MinMaxAvgStd stats_input/data_1000.txt stats_output_1000
```

The command runs with input parameters which indicate a jar file (the program, stats.jar), exercise.MinMaxAvgStd (package name.class name), an input file path (stats_input/data_1000.txt), and an output file path (stats_output_1000).

The sample results that the program produces look like this; the second line of the logs indicates that the number of input files is 1:
```
18/02/28 23:48:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/28 23:48:50 INFO input.FileInputFormat: Total input paths to process : 1
18/02/28 23:48:50 INFO mapreduce.JobSubmitter: number of splits:1
18/02/28 23:48:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1519877569596_0002
18/02/28 23:48:51 INFO impl.YarnClientImpl: Submitted application application_1519877569596_0002
18/02/28 23:48:51 INFO mapreduce.Job: The url to track the job: http://f5e82d68ba4a:8088/proxy/application_1519877569596_0002/
18/02/28 23:48:51 INFO mapreduce.Job: Running job: job_1519877569596_0002
18/02/28 23:48:56 INFO mapreduce.Job: Job job_1519877569596_0002 running in uber mode : false
18/02/28 23:48:56 INFO mapreduce.Job:  map 0% reduce 0%
18/02/28 23:49:00 INFO mapreduce.Job:  map 100% reduce 0%
18/02/28 23:49:05 INFO mapreduce.Job:  map 100% reduce 100%
18/02/28 23:49:05 INFO mapreduce.Job: Job job_1519877569596_0002 completed successfully
18/02/28 23:49:05 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=81789
        FILE: Number of bytes written=394101
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=14067
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2107
        Total time spent by all reduces in occupied slots (ms)=2316
        Total time spent by all map tasks (ms)=2107
        Total time spent by all reduce tasks (ms)=2316
        Total vcore-seconds taken by all map tasks=2107
        Total vcore-seconds taken by all reduce tasks=2316
        Total megabyte-seconds taken by all map tasks=2157568
        Total megabyte-seconds taken by all reduce tasks=2371584
    Map-Reduce Framework
        Map input records=1000
        Map output records=3000
        Map output bytes=75783
        Map output materialized bytes=81789
        Input split bytes=125
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=81789
        Reduce input records=3000
        Reduce output records=4
        Spilled Records=6000
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=31
        CPU time spent (ms)=1440
        Physical memory (bytes) snapshot=434913280
        Virtual memory (bytes) snapshot=1497260032
        Total committed heap usage (bytes)=402653184
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=13942
    File Output Format Counters
        Bytes Written=86
```
8.3.3.5.1.7 Result for a Single Input File

We read the results from HDFS by:

```bash
$ hadoop fs -cat stats_output_1000/part-r-00000
```

The sample output looks like:

```
Max: 19.9678704297
Min: 0.218880718983
Avg: 10.225467263249385
Std: 5.679809322880863
```

8.3.3.5.1.8 Run Program with Multiple Input Files

The first run completed pretty quickly (it took 1,440 milliseconds according to the previous sample result) because the input file size was small (1,000 lines) and it was a single file. We provide more input files with larger sizes (2,000 to 10,000 lines). The input files are already uploaded to HDFS. We simply run the program again with a slight change in the parameters:

```bash
$ hadoop jar stats.jar exercise.MinMaxAvgStd stats_input/ stats_output_all
```

The command is almost the same, except that the input path is a directory and a new output directory is used. Note that every time you run this program, the output directory will be created, which means that you have to provide a new directory name unless you delete it.

The sample output messages look like the following, which is almost identical to the previous run except that this time the number of input files to process is 10; see the second line next:
```
18/02/28 23:17:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/28 23:17:18 INFO input.FileInputFormat: Total input paths to process : 10
18/02/28 23:17:18 INFO mapreduce.JobSubmitter: number of splits:10
18/02/28 23:17:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1519877569596_0001
18/02/28 23:17:19 INFO impl.YarnClientImpl: Submitted application application_1519877569596_0001
18/02/28 23:17:19 INFO mapreduce.Job: The url to track the job: http://f5e82d68ba4a:8088/proxy/application_1519877569596_0001/
18/02/28 23:17:19 INFO mapreduce.Job: Running job: job_1519877569596_0001
18/02/28 23:17:24 INFO mapreduce.Job: Job job_1519877569596_0001 running in uber mode : false
18/02/28 23:17:24 INFO mapreduce.Job:  map 0% reduce 0%
18/02/28 23:17:32 INFO mapreduce.Job:  map 40% reduce 0%
18/02/28 23:17:33 INFO mapreduce.Job:  map 60% reduce 0%
18/02/28 23:17:36 INFO mapreduce.Job:  map 70% reduce 0%
18/02/28 23:17:37 INFO mapreduce.Job:  map 100% reduce 0%
18/02/28 23:17:39 INFO mapreduce.Job:  map 100% reduce 100%
18/02/28 23:17:39 INFO mapreduce.Job: Job job_1519877569596_0001 completed successfully
18/02/28 23:17:39 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=4496318
        FILE: Number of bytes written=10260627
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=767333
        HDFS: Number of bytes written=84
        HDFS: Number of read operations=33
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=50866
        Total time spent by all reduces in occupied slots (ms)=4490
        Total time spent by all map tasks (ms)=50866
        Total time spent by all reduce tasks (ms)=4490
        Total vcore-seconds taken by all map tasks=50866
        Total vcore-seconds taken by all reduce tasks=4490
        Total megabyte-seconds taken by all map tasks=52086784
        Total megabyte-seconds taken by all reduce tasks=4597760
    Map-Reduce Framework
        Map input records=55000
        Map output records=165000
        Map output bytes=4166312
        Map output materialized bytes=4496372
        Input split bytes=1251
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=4496372
        Reduce input records=165000
        Reduce output records=4
        Spilled Records=330000
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=555
        CPU time spent (ms)=16040
        Physical memory (bytes) snapshot=2837708800
        Virtual memory (bytes) snapshot=8200089600
        Total committed heap usage (bytes)=2213019648
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=766082
    File Output Format Counters
        Bytes Written=84
```
8.3.3.5.1.9 Result for Multiple Files

The expected result looks like:

```bash
$ hadoop fs -cat stats_output_all/part-r-00000
Max: 19.999191254
Min: 0.200268613863
Avg: 9.514884854468903
Std: 5.553921579413547
```

8.3.3.5.2 Conclusion

The example program, calculating statistics by reading multiple files, shows how Map/Reduce is written in the Java programming language and how Hadoop runs the program using HDFS. We also observed one of the benefits of using a Docker container, namely that the hassle of configuring and installing Hadoop is not necessary anymore.

8.3.3.6 References

The details of the new version are available from the official site at http://hadoop.apache.org/docs/r3.1.1/index.html
8.3.4 Docker Pagerank ☁

PageRank is a popular example algorithm used to display the ability of big data applications to run parallel tasks. This example will show how the Docker hadoop image can be used to execute the PageRank example, which is available in /cloudmesh/examples/pagerank.

8.3.4.1 Use the automated script

We combined the steps of compiling the Java source, archiving the class files, loading the input files, and running the program into one single script. To execute it with the input file PageRankDataGenerator/pagerank5000g50.input.0, using 5000 urls and 1 iteration:

```bash
$ cd /cloudmesh/examples/pagerank
$ ./compileAndExecHadoopPageRank.sh PageRankDataGenerator/pagerank5000g50.input.0 5000 1
```

The result will be stored in output.pagerank/part-r-00000. The head of the result will look like:

```bash
$ head output.pagerank/part-r-00000
0    2.9999999999999997E-5
1    2.9999999999999997E-5
2    2.9999999999999997E-5
3    2.9999999999999997E-5
4    2.9999999999999997E-5
5    2.9999999999999997E-5
6    2.9999999999999997E-5
7    2.9999999999999997E-5
8    2.9999999999999997E-5
9    2.9999999999999997E-5
```
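Since all ranks are identical after a single iteration, it is more interesting to sort the output after running more iterations; the following sketch shows one way to list the highest-ranked urls, assuming the two-column url/rank output format shown previously.

```bash
# Show the ten urls with the highest rank
$ sort -k 2 -gr output.pagerank/part-r-00000 | head -10
```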
8.3.4.2 Compile and run by hand

If one wants to generate the Java class files and archive them as in the previous exercise, one could use the following code (which is actually inside compileAndExecHadoopPageRank.sh):

```bash
$ export HADOOP_CLASSPATH=`$HADOOP_PREFIX/bin/hadoop classpath`
$ mkdir /cloudmesh/examples/pagerank/dist
$ find /cloudmesh/examples/pagerank/src/indiana/cgl/hadoop/pagerank/ \
    -name "*.java" | xargs javac -classpath $HADOOP_CLASSPATH \
    -d /cloudmesh/examples/pagerank/dist
$ cd /cloudmesh/examples/pagerank/dist
$ jar -cvf HadoopPageRankMooc.jar -C . .
```

Load the input files to HDFS:

```bash
$ export PATH=$PATH:$HADOOP_PREFIX/bin
$ cd /cloudmesh/examples/pagerank/
$ hadoop fs -mkdir input.pagerank
$ hadoop fs -put PageRankDataGenerator/pagerank5000g50.input.0 input.pagerank
```

Run the program with the parameters [PageRank Input File Directory] [PageRank Output Directory] [Number of Urls] [Number Of Iterations]:

```bash
$ hadoop jar dist/HadoopPageRankMooc.jar indiana.cgl.hadoop.pagerank.HadoopPageRank input.pagerank output.pagerank 5000 1
```

Result:

```bash
$ hadoop fs -cat output.pagerank/part-r-00000
```

8.3.5 Apache Spark with Docker ☁

8.3.5.1 Pull Image from Docker Repository

We use a Docker image from Docker Hub (https://hub.docker.com/r/sequenceiq/spark/). This repository contains a Dockerfile to build a Docker image with Apache Spark and Hadoop YARN.

```bash
$ docker pull sequenceiq/spark:1.6.0
```
8.3.5.2 Running the Image

In this step, we will launch a Spark container.

8.3.5.2.1 Running interactively

```bash
$ docker run -it -p 8088:8088 -p 8042:8042 -h sandbox sequenceiq/spark:1.6.0 bash
```

8.3.5.2.2 Running in the background

```bash
$ docker run -d -h sandbox sequenceiq/spark:1.6.0 -d
```

8.3.5.3 Run Spark

After a container is launched, we can run Spark in the following two modes: (1) yarn-client and (2) yarn-cluster. The differences between the two modes can be found here: https://spark.apache.org/docs/latest/running-on-yarn.html

8.3.5.3.1 Run Spark in Yarn-Client Mode

```bash
$ spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.3.2 Run Spark in Yarn-Cluster Mode

```bash
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.4 Observe Task Execution from Running Logs of SparkPi

Let us observe Spark task execution by adjusting the parameter of SparkPi and the Pi result from the following two commands:

```bash
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000
```

8.3.5.5 Write a Word-Count Application with Spark RDD

Let us write our own word count with Spark RDD. After the shell has been started, copy and paste the following code in the console line by line.
8.3.5.5.1 Launch Spark Interactive Shell

```bash
$ spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.5.2 Program in Scala

```scala
val textFile = sc.textFile("file:///etc/hosts")
val words = textFile.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.values.sum()
```

8.3.5.5.3 Launch PySpark Interactive Shell

```bash
$ pyspark --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.5.4 Program in Python

```python
textFile = sc.textFile("file:///etc/hosts")
words = textFile.flatMap(lambda line: line.split())
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
counts.map(lambda x: x[1]).sum()
```
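The same word count can also be run non-interactively; a minimal sketch follows that writes the Python program from the previous subsection to a file and submits it with spark-submit (the file name wordcount.py is an arbitrary choice).

```bash
# Save the word-count program to a file
cat > wordcount.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
textFile = sc.textFile("file:///etc/hosts")
counts = (textFile.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda x, y: x + y))
print(counts.map(lambda x: x[1]).sum())
EOF

# Submit it to YARN in client mode
$ spark-submit --master yarn-client --driver-memory 1g \
    --executor-memory 1g --executor-cores 1 wordcount.py
```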
8.3.5.6 Docker Spark Examples

8.3.5.6.1 K-Means Example

First we need to pull the image from the Docker Hub:

```bash
$ docker pull sequenceiq/spark-native-yarn
```

It will take some time to download the image. Now we have to run the Docker spark image interactively:

```bash
$ docker run -i -t -h sandbox sequenceiq/spark-native-yarn /etc/bootstrap.sh -bash
```

This will take you to the interactive mode.

Let us run a sample KMeans example. This is already built with Spark:

```bash
$ ./bin/spark-submit --class sample.KMeans \
    --master execution-context:org.apache.spark.tez.TezJobExecutionContext \
    --conf update-classpath=true \
    ./lib/spark-native-yarn-samples-1.0.jar /sample-data/kmeans_data.txt
```

Here we specify the dataset from a local folder inside the image and we run the sample class KMeans in the sample package. The sample data set used is inside the sample-data folder. Spark has its own format for machine learning datasets; here the kmeans_data.txt file contains the KMeans dataset.
If you run this successfully, you get an output as shown here:

```
Finished iteration (delta = 0.0)
Final centers:
DenseVector(0.15000000000000002, 0.15000000000000002, 0.15000000000000002)
DenseVector(9.2, 9.2, 9.2)
DenseVector(0.0, 0.0, 0.0)
DenseVector(9.05, 9.05, 9.05)
```

8.3.5.6.2 Join Example

Run the following command to do a sample join operation on a given dataset. Here we use two datasets, namely join1.txt and join2.txt. Then we perform the join operation that we discussed in the theory section:

```bash
$ ./bin/spark-submit --class sample.Join --master execution-context:org.apache.spark.tez.TezJobExecutionContext --conf update-classpath=true ./lib/spark-native-yarn-samples-1.0.jar /sample-data/join1.txt /sample-data/join2.txt
```

8.3.5.6.3 Word Count

In this example the wordcount.txt will be used to do the word count using multiple reducers. The number 1 at the end of the command determines the number of reducers. As Spark can run multiple reducers, we can specify the number as a parameter to the program:

```bash
$ ./bin/spark-submit --class sample.WordCount --master execution-context:org.apache.spark.tez.TezJobExecutionContext --conf update-classpath=true ./lib/spark-native-yarn-samples-1.0.jar /sample-data/wordcount.txt 1
```

8.3.5.7 Interactive Examples

Here we need a new image to work on. Let us run the following command. This will pull the necessary repositories from Docker Hub, as we do not have most of the dependencies related to it. This can take a few minutes to download everything:

```bash
$ docker run -it -p 8888:8888 -v $PWD:/cloudmesh/spark --name spark jupyter/pyspark-notebook
```

Here you will get the following output in the terminal:

```
Unable to find image 'jupyter/pyspark-notebook:latest' locally
latest: Pulling from jupyter/pyspark-notebook
a48c500ed24e: Pull complete
...
Digest: sha256:f8b6309cd39481de1a169143189ed0879b12b56fe286d254d03fa34ccad90734
Status: Downloaded newer image for jupyter/pyspark-notebook:latest
Container must be run with group "root" to update passwd file
Executing the command: jupyter notebook
[I 15:47:52.900 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[I 15:47:53.167 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 15:47:53.167 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 15:47:53.176 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 15:47:53.177 NotebookApp] The Jupyter Notebook is running at:
[I 15:47:53.177 NotebookApp] http://(3a3d9f7e2565 or 127.0.0.1):8888/?token=f22492fe7ab8206ac2223359e0603a0dff54d98096ab7930
[I 15:47:53.177 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:47:53.177 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://(3a3d9f7e2565 or 127.0.0.1):8888/?token=f22492fe7ab8206ac2223359e0603a0dff54d98096ab7930
```

Please copy the url shown at the end of the terminal output and go to that url in the browser.

You will see the following output in the browser (use Google Chrome):

Jupyter Notebook in Browser

First navigate to the work folder. Let us create a new Python file here. Click Python 3 in the New menu.
Create a new Python file

Now add the following content in the new file. In a Jupyter notebook, you can enter a Python command or Python code and press SHIFT + ENTER to run the code interactively.

Now let us create the following content:

```python
import os
os.getcwd()
```

Now let us do the following:

```python
import pyspark
sc = pyspark.SparkContext('local[*]')
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
os.makedirs("data")
```

In the following stage we configure the Spark context and import the necessary files:

```python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import SparseVector
sc.version
```

In the next stage we create a sample dataset in the form of an array and we train the KMeans algorithm:

```python
sparse_data = [
    SparseVector(3, {1: 1.0}),
    SparseVector(3, {1: 1.1}),
    SparseVector(3, {2: 1.0}),
    SparseVector(3, {2: 1.1})
]
model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode='k-means||',
                     seed=50, initializationSteps=5, epsilon=1e-4)
```

In the final stage we put in sample values and check the predictions on the cluster. In addition we feed the data using the SparseVector format, and we add the KMeans initialization mode, the error margin, and the parallelization. We put the step size as 5 for this example; in the previous one we did not specify any parameters.

```python
model.predict(array([0., 1., 0.]))
model.predict(array([0., 0., 1.]))
```

The predict call predicts the cluster id which the point belongs to:

```python
model.predict(sparse_data[0])
model.predict(sparse_data[2])
```

Then in the following way you can check whether two data points belong to one cluster or not:

```python
data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
model = KMeans.train(sc.parallelize(data), 2, initializationMode='random',
                     seed=50, initializationSteps=5, epsilon=1e-4)
model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
model.predict(array([8.0, 9.0]))
model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
model.k
model.computeCost(sc.parallelize(data))
isinstance(model.clusterCenters, list)
```

8.3.5.7.1 Stop Docker Container

```bash
$ docker stop spark
```

8.3.5.7.2 Start Docker Container Again

```bash
$ docker start spark
```

8.3.5.7.3 Remove Docker Container

```bash
$ docker rm spark
```
8.4 KUBERNETES

8.4.1 Introduction to Kubernetes ☁

Learning Objectives
- What is Kubernetes?
- What are containers?
- Cluster components in Kubernetes
- Basic units in Kubernetes
- Run an example with Minikube
- Interactive online tutorial
- Have a solid understanding of containers and Kubernetes
- Understand the cluster components of Kubernetes
- Understand the terminology of Kubernetes
- Gain practical experience with Kubernetes
  - with Minikube
  - with an interactive online tutorial
Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers.
https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
With Kubernetes, you can:

- Deploy your applications quickly and predictably.
- Scale your applications on the fly.
- Roll out new features seamlessly.
- Limit hardware usage to required resources only.
- Run applications in public and private clouds.
Kubernetes is

- Portable: public, private, hybrid, multi-cloud
- Extensible: modular, pluggable, hookable, composable
- Self-healing: auto-placement, auto-restart, auto-replication, auto-scaling
8.4.1.1 What are containers?

Figure 75: Kubernetes Containers [Image Source]

Figure 75 shows a depiction of the container architecture.

8.4.1.2 Terminology

In Kubernetes we are using the following terminology:
Pods:

A pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers. A pod's contents are always co-located and co-scheduled, and run in a shared context. A pod models an application-specific logical host. It contains one or more application containers which are relatively tightly coupled. In a pre-container world, they would have executed on the same physical or virtual machine.
Services:

A Service is an abstraction which defines a logical set of Pods and a policy by which to access them. Sometimes they are called a micro-service. The set of Pods targeted by a Service is (usually) determined by a Label Selector.
Deployments:

A Deployment controller provides declarative updates for Pods and ReplicaSets. You describe a desired state in a Deployment object, and the Deployment controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
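To make these three terms concrete, the following minimal sketch lists the Pods, Services, and Deployments in a cluster using the official kubernetes Python client. The client library (pip install kubernetes) and the default namespace are our assumptions for illustration; they are not part of this tutorial, which uses kubectl commands later in this section.

# Minimal sketch; assumes the `kubernetes` Python client is installed
# and a kubeconfig exists, e.g. one created by minikube.
from kubernetes import client, config

config.load_kube_config()  # read credentials from ~/.kube/config

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Pods: co-located, co-scheduled groups of containers
for pod in core.list_namespaced_pod(namespace="default").items:
    print("pod:", pod.metadata.name, pod.status.phase)

# Services: a stable access policy for a logical set of Pods
for svc in core.list_namespaced_service(namespace="default").items:
    print("service:", svc.metadata.name, svc.spec.type)

# Deployments: declarative desired state for Pods and ReplicaSets
for dep in apps.list_namespaced_deployment(namespace="default").items:
    print("deployment:", dep.metadata.name, dep.spec.replicas)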
8.4.1.3 Kubernetes Architecture

The architecture of Kubernetes is shown in Figure 76.

Figure 76: Kubernetes (Source: Google)

8.4.1.4 Minikube

To try out Kubernetes on your own computer you can download and install Minikube. It deploys and runs a single-node Kubernetes cluster inside a VM. Hence it provides a reasonable environment not only to try it out, but also for development [cite].

In this section we will first discuss how to install Minikube and then showcase an example.
8.4.1.4.1 Install Minikube

8.4.1.4.1.0.1 OSX

$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.25.0/minikube-darwin-amd64 && chmod +x minikube &&

8.4.1.4.1.0.2 Windows 10

We assume that you have installed Oracle VirtualBox on your machine, which must be a version 5.x.x.

Initially, we need to download two executables.

Download Kubectl

Download Minikube

After downloading these two executables, place them in the cloudmesh directory we created earlier. Rename minikube-windows-amd64.exe to minikube.exe. Make sure minikube.exe and kubectl.exe lie in the same directory.

8.4.1.4.1.0.3 Linux

Installing KVM2 is important for Ubuntu distributions.

We are going to run Minikube using the KVM2 libraries instead of the VirtualBox libraries that we used for the Windows installation.

Then install the drivers for KVM2:
8.4.1.4.2 Start a cluster using Minikube

8.4.1.4.2.0.1 OSX Minikube Start

8.4.1.4.2.0.2 Ubuntu Minikube Start

$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.25.0/minikube-linux-amd64 && chmod +x minikube &&
$ sudo apt install libvirt-bin qemu-kvm
$ sudo usermod -a -G libvirtd $(whoami)
$ newgrp libvirtd
$ curl -LO https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-kvm2 && chmod +x docker-machine-driver-kvm2
$ minikube start
$ minikube start --vm-driver=kvm2
8.4.1.4.2.0.3Windows10MinikubeStart
InthiscaseyoumustrunWindowsPowerShellasadministrator.Forthissearchfor the application in search and right click and clickRun as administrator. Ifyouareanadministratoritwillrunautomaticallybutifyouarenotpleasemakesureyouprovidetheadminlogininformationinthepopup.
8.4.1.4.3Createadeployment
8.4.1.4.4Exposetheservi
8.4.1.4.5Checkrunningstatus
Thisstepistomakesureyouhaveapodupandrunning.
8.4.1.4.6Callserviceapi
8.4.1.4.7TakealookfromDashboard
Ifyouwanttogetaninteractivedashboard,
Browsetohttp://192.168.99.101:30000inyourwebbrowseranditwillprovideaGUIdashboardregardingminikube.
8.4.1.4.8Deletetheserviceanddeployment
$ cd C:\Users\<username>\Documents\cloudmesh
$ .\minikube.exe start --vm-driver="virtualbox"
$ kubectl run hello-minikube --image=k8s.gcr.io/echoserver:1.4 --port=8080
$ kubectl expose deployment hello-minikube --type=NodePort
$ kubectl get pod
$ curl $(minikube service hello-minikube --url)
$ minikube dashboard
$ minikube dashboard --url=true
http://192.168.99.101:30000
$ kubectl delete service hello-minikube
$ kubectl delete deployment hello-minikube
8.4.1.4.9 Stop the cluster

For all platforms we can use the following command.

8.4.1.5 Interactive Tutorial Online

- Start cluster https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-interactive/
- Deploy app https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-interactive
- Explore https://kubernetes.io/docs/tutorials/kubernetes-basics/explore-intro/
- Expose https://kubernetes.io/docs/tutorials/kubernetes-basics/expose-intro/
- Scale https://kubernetes.io/docs/tutorials/kubernetes-basics/scale-intro/
- Update https://kubernetes.io/docs/tutorials/kubernetes-basics/update-interactive/
- MiniKube https://kubernetes.io/docs/tutorials/stateless-application/hello-minikube/
8.4.2 Using Kubernetes on FutureSystems ☁

This section introduces you to using the Kubernetes cluster on FutureSystems. Currently we have deployed Kubernetes on our cluster called echo.

8.4.2.1 Getting Access

You will need an account on FutureSystems and upload the ssh key to the FutureSystems portal from the computer from which you want to log in to echo. To verify that you have access, try to see if you can log into victor.futuresystems.org. You need to be a member of a valid FutureSystems project.

For Fall 2018 classes at IU you need to be in the following project:

https://portal.futuresystems.org/project/553
$ minikube stop
If you have verified that you have access to victor, you can now try to log in to the Kubernetes cluster head node with the same username and key. Run these first on your local machine to set the username and login host:

Then you can log into the Kubernetes head node by running:

NOTE: If you have access to victor but not the Kubernetes system, your project may not have been authorized to access the Kubernetes cluster. Send a ticket to the FutureSystems ticket system to request this.

Once you are logged in to the Kubernetes cluster head node you can run commands on the remote echo Kubernetes machine (all commands shown next except where stated otherwise) to use the Kubernetes installation there. First try to run:

This will let you know if you have access to Kubernetes and verifies if the kubectl command works for you. Naturally it will also list the pods.

8.4.2.2 Example Use

The following command runs an image called nginx with two replicas; nginx is a popular web server which is well known as a high performance load balancer.

As a result of this, one deployment was created, and two pods were created and started. If you encounter an error stating that the deployment already exists when executing the previous command, that is because the command has already been executed. To see the deployment, please use the following command; it should work even if you noticed the error mentioned.

This will result in the following output:
$ export ECHOK8S=149.165.150.85
$ export FS_USER=<put your futuresystems account name here>
$ ssh $FS_USER@$ECHOK8S

$ kubectl get pods

$ kubectl run nginx --replicas=2 --image=nginx --port=80

$ kubectl get deployment

NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx   2         2         2            2           7m
To see the pods please use the command

This will result in the following output:

NAME                     READY   STATUS    RESTARTS   AGE
nginx-7587c6fdb6-4jnh6   1/1     Running   0          7m
nginx-7587c6fdb6-pxpsz   1/1     Running   0          7m

If we want to see more detailed information we can use the command

NAME                READY   STATUS    RESTARTS   AGE   IP               NODE
nginx-75...-4jnh6   1/1     Running   0          8m    192.168.56.2     e003
nginx-75...-pxpsz   1/1     Running   0          8m    192.168.255.66   e005

Please note the IP address field. Make sure you are using the IP address that is listed when you execute the command, since the IP address may have changed. Now if we try to access the nginx homepage with wget (or curl)
we see the following output:

--2018-02-20 14:05:59--  http://192.168.56.2/
Connecting to 192.168.56.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 612 [text/html]
Saving to: 'index.html'

index.html  100%[=========>]  612  --.-KB/s  in 0s

2018-02-20 14:05:59 (38.9 MB/s) - 'index.html' saved [612/612]

This verifies that the specified image is running and that it is accessible from within the cluster.
Next we need to start thinking about how we access this web server from outside the cluster. We can explicitly expose the service with the following command. You can change the name that is set using --name to what you want, given that it adheres to the naming standards. If the name you enter is already in the system, your command will return an error saying the service already exists.

We will see the response:

service "nginx-external" exposed
$ kubectl get pods
$ kubectl get pods -o wide
$ wget 192.168.56.2
$ kubectl expose deployment nginx --type=NodePort --name=abc-nginx-ext
To find the exposed IP addresses, we simply issue the command

We see something like this:

NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP        8h
abc-nginx-ext   NodePort    10.110.177.35   <none>        80:31386/TCP   3s

Please note that we have given a unique name.
For IU students:

You could use your username, or if you use one of our classes, your hid. The number part will typically be sufficient. For class users that do not use the hid in the name we will terminate all instances without notification. In addition, we would like you to explicitly add "-ext" to every container that is exposed to the internet. Naturally we want you to shut down such services if they are not in use. Failure to do so may result in termination of the service without notice, and in the worst case revocation of your privileges to use echo.

In our example you will find the port on which our service is exposed and remapped to. We find the port 31386 in the value 80:31386/TCP in the ports column for the running container.

Now if we visit this URL, which is the public IP of the head node followed by the exposed port number, from a browser on your local machine: http://149.165.150.85:31386

you should see the 'Welcome to nginx' page.

Once you have done all the work needed using the service you can delete it using the following command.
8.4.2.3 Exercises

$ kubectl get svc
$ kubectl delete service <service-name>

E.Kubernetes.fs.1:

Explore more complex service examples.

E.Kubernetes.fs.2:

Explore constructing a complex web app with multiple services.

E.Kubernetes.fs.3:

Define a deployment with a yaml file declaratively.
8.5 SINGULARITY

8.5.1 Running Singularity Containers on Comet ☁

This section was copied from

https://www.sdsc.edu/support/user_guides/tutorials/singularity.html

and modified. To use it you will need an account on Comet, which can be obtained via XSEDE. In case you use this material as part of a class please contact your teacher for more information.

8.5.1.1 Background

What is Singularity?

"Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don't have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run."

[from the Singularity website at http://singularity.lbl.gov/]

There are numerous good tutorials on how to install and run Singularity on Linux, OSX, or Windows, so we won't go into much detail on that process here.

In this tutorial you will learn how to run Singularity on Comet. First we will review how to access a compute node on Comet and provide a simple example to help get you started. There are numerous tutorials on how to get started with Singularity, but there are some details specific to running Singularity on Comet which are not covered in those tutorials. This tutorial assumes you already have an account on Comet. You will also need access to a basic set of example files to get started. SDSC hosts a Github repository containing a "Hello world!" example which you may clone with the following command:
8.5.1.2 Tutorial Contents

- Why Singularity?
- Downloading & Installing Singularity
- Building Singularity Containers
- Running Singularity Containers on Comet
- Running Tensorflow on Comet Using Singularity

8.5.1.3 Why Singularity?

Listed next is a typical list of commands you would need to issue in order to implement a functional Python installation for scientific research:

Singularity allows you to avoid this time-consuming series of steps by packaging these commands in a re-usable and editable script, allowing you to quickly, easily, and repeatedly implement a custom container designed specifically for your analytical needs.

Figure 77 compares a VM vs. Docker vs. Singularity.
git clone https://github.com/hpcdevops/singularity-hello-world.git

COMMAND=apt-get -y install libx11-dev
COMMAND=apt-get install build-essential python-libdev
COMMAND=apt-get install build-essential openmpi-dev
COMMAND=apt-get install cmake
COMMAND=apt-get install g++
COMMAND=apt-get install git-lfs
COMMAND=apt-get install libXss.so.1
COMMAND=apt-get install libgdal1-dev libproj-dev
COMMAND=apt-get install libjsoncpp-dev libjsoncpp0
COMMAND=apt-get install libmpich-dev --user
COMMAND=apt-get install libpthread-stubs0 libpthread-stubs0-dev libx11-dev libx11-d
COMMAND=apt-get install libudev0:i386
COMMAND=apt-get install numpy
COMMAND=apt-get install python-matplotlib
COMMAND=apt-get install python3
Figure 77: Singularity Container Architecture [69]

8.5.1.4 Hands-On Tutorials

The following tutorial includes links to asciinema video tutorials created by SDSC HPC Systems Manager, Trevor Cooper, which allow you to see the console interactivity and output in detail. Look for the asciinema icon corresponding to the task you are currently working on.

8.5.1.5 Downloading & Installing Singularity

- Download & Unpack Singularity
- Configure & Build Singularity
- Install & Test Singularity

8.5.1.5.1 Download & Unpack Singularity

First we download and unpack the source using the following commands (assuming your username is test_user and you are working on your local computer with superuser privileges):

Singularity - download source and unpack in VirtualBox VM (CentOS 7)

If the file is successfully extracted, you should be able to view the results:

[test_user@localhost ~]$ wget https://github.com/singularityware/singularity/releases/download/2.5.1/singularity-2.5.1.tar.gz
[test_user@localhost ~]$ tar -zxf singularity-2.5.1.tar.gz
8.5.1.5.2 Configure & Build Singularity

Singularity - configure and build in VirtualBox VM (CentOS 7)

Next we configure and build the package. To configure, enter the following command (we will leave out the command prompts):

To build, issue the following command:

make

This may take several seconds depending on your computer.

8.5.1.5.3 Install & Test Singularity

Singularity - install and test in VirtualBox VM (CentOS 7)

To complete the installation enter:

sudo make install

You should be prompted to enter your admin password.

Once the installation is completed, you can check to see if it succeeded in a few different ways:

which singularity
singularity --version

You can also run a self test with the following command:

singularity selftest

The output should look something like:

+ sh -c test -f /usr/local/etc/singularity/singularity.conf (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/action-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/create-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/expand-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/export-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/import-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/mount-suid (retval=0) OK

[test_user@localhost ~]$ cd singularity-2.5.1/
[test_user@localhost singularity-2.5.1]$ ls
./configure
8.5.1.6 Building Singularity Containers

The process of building a Singularity container consists of a few distinct steps, as follows:

- Upgrading Singularity (if needed)
- Create an Empty Container
- Import into Container
- Shell into Container
- Write into Container
- Bootstrap Container

We will go through each of these steps in detail.

8.5.1.6.1 Upgrading Singularity

We recommend building containers using the same version of Singularity, 2.5.1, as exists on Comet. This is a 2 step process.

Step 1: run the next script to remove your existing Singularity:

Step 2: run the following script to install Singularity 2.5.1:
#!/bin/bash
#
# A cleanup script to remove Singularity

sudo rm -rf /usr/local/libexec/singularity
sudo rm -rf /usr/local/etc/singularity
sudo rm -rf /usr/local/include/singularity
sudo rm -rf /usr/local/lib/singularity
sudo rm -rf /usr/local/var/lib/singularity/
sudo rm /usr/local/bin/singularity
sudo rm /usr/local/bin/run-singularity
sudo rm /usr/local/etc/bash_completion.d/singularity
sudo rm /usr/local/man/man1/singularity.1

#!/bin/bash
#
# A build script for Singularity (http://singularity.lbl.gov/)

declare -r SINGULARITY_NAME='singularity'
declare -r SINGULARITY_VERSION='2.5.1'
declare -r SINGULARITY_PREFIX='/usr/local'
declare -r SINGULARITY_CONFIG_DIR='/etc'

sudo apt update
sudo apt install python dh-autoreconf build-essential debootstrap

cd ../
tar -xzvf "${PWD}/tarballs/${SINGULARITY_NAME}-${SINGULARITY_VERSION}.tar.gz"
cd "${SINGULARITY_NAME}-${SINGULARITY_VERSION}"
./configure --prefix="${SINGULARITY_PREFIX}" --sysconfdir="${SINGULARITY_CONFIG_DIR}"
make
8.5.1.7 Create an Empty Container

Singularity - create container

To create an empty Singularity container, you simply issue the following command:

singularity create centos7.img

This will create a CentOS 7 container with a default size of ~805 MB. Depending on what additional configurations you plan to make to the container, this size may or may not be big enough. To specify a particular size, such as ~4 GB, include the -s parameter, as shown in the following command:

singularity create -s 4096 centos7.img

To view the resulting image in a directory listing, enter the following:

ls

8.5.1.8 Import Into a Singularity Container

Singularity - import Docker image

Next, we will import a Docker image into our empty Singularity container:

singularity import centos7.img docker://centos:7

8.5.1.9 Shell Into a Singularity Container

Singularity - shell into container

Once the container actually contains a CentOS 7 installation, you can 'shell' into it with the following:

singularity shell centos7.img

Once you enter the container you should see a different command prompt. At this new prompt, try typing:

whoami

Your user id should be identical to your user id outside the container. However, the operating system will probably be different. Try issuing the following command from inside the container to see what the OS version is:

cat /etc/*-release

8.5.1.10 Write Into a Singularity Container

Singularity - write into container

Next, let's try writing into the container (as root):

sudo /usr/local/bin/singularity shell -w centos7.img

You should be prompted for your password, and then you should see something like the following:

Invoking an interactive shell within the container...

Next, let's create a script within the container so we can use it to test the ability of the container to execute shell scripts:

vi hello_world.sh

The previous command assumes you know the vi editor. Enter the following text into the script, save it, and quit the vi editor:

#!/bin/bash
echo "Hello, World!"

You may need to change the permissions on the script so it can be executable:

chmod +x hello_world.sh

Try running the script manually:

./hello_world.sh

The output should be:

Hello, World!
8.5.1.11 Bootstrapping a Singularity Container

Singularity - bootstrapping a container

Bootstrapping a Singularity container allows you to use what is called a 'definitions file' so you can reproduce the resulting container configurations on demand.

Let us say you want to create a container with Ubuntu, but you may want to create variations on the configurations without having to repeat a long list of commands manually. First, we need our definitions file. Given next is the contents of a definitions file which should suffice for our purposes.

To bootstrap your container, first we need to create an empty container:

singularity create -s 4096 ubuntu.img

Now, we simply need to issue the following command to configure our container with Ubuntu:

sudo /usr/local/bin/singularity bootstrap ./ubuntu.img ./ubuntu.def

This may take a while to complete. In principle, you can accomplish the same result by manually issuing each of the commands contained in the script file, but why do that when you can use bootstrapping to save time and avoid errors.

If all goes according to plan, you should then be able to shell into your new Ubuntu container.
Bootstrap: docker
From: ubuntu:latest

%runscript
    exec echo "The runscript is the containers default runtime command!"

%files
    /home/testuser/ubuntu.def /data/ubuntu.def

%environment
    VARIABLE=HELLOWORLD
    export VARIABLE

%labels

%post
    apt-get update && apt-get -y install python3 git wget
    mkdir /data
    echo "The post section is where you can install and configure your container."
8.5.1.12 Running Singularity Containers on Comet

Of course, the purpose of this tutorial is to enable you to use the San Diego Supercomputer Center's Comet supercomputer to run your jobs. This assumes you have an account on Comet already. If you do not have an account on Comet and you feel you can justify the need for such an account (i.e. your research is limited by the limited compute power you have in your government-funded research lab), you can request a 'Startup Allocation' through the XSEDE User Portal:

https://portal.xsede.org/allocations-overview#types-trial

You may create a free account on the XUP if you do not already have one and then proceed to submit an allocation request at the previously given link.

NOTE: SDSC provides a Comet User Guide to help get you started with Comet. Learn more about the San Diego Supercomputer Center at http://www.sdsc.edu.

This tutorial walks you through the following steps towards running your first Singularity container on Comet:

- Transfer the Container to Comet
- Run the Container on Comet
- Allocate Resources to Run the Container
- Integrate the Container with Slurm
- Use existing Comet Containers

8.5.1.12.1 Transfer the Container to Comet

Singularity - transfer container to Comet

Once you have created your container on your local system, you will need to transfer it to Comet. There are multiple ways to do this and it can take a varying amount of time depending on its size and your network connection speeds.

To do this, we will use scp (secure copy). If you have a Globus account and your containers are more than 4 GB you will probably want to use that file transfer method instead of scp.

Browse to the directory containing the container and copy it to your scratch directory on Comet by issuing the following command:

scp ./centos7.img comet.sdsc.edu:/oasis/scratch/comet/test_user/temp_project/

The container is ~805 MB so it should not take too long, hopefully.
8.5.1.12.2 Run the Container on Comet

Singularity - run container on Comet

Once the file is transferred, log in to Comet (assuming your Comet user is named test_user):

ssh test_user@comet.sdsc.edu

Navigate to your scratch directory on Comet, which should be something like:

[test_user@comet-ln3 ~]$ cd /oasis/scratch/comet/test_user/temp_project/

Next, you should submit a request for an interactive session on one of Comet's compute, debug, or shared nodes:

[test_user@comet-ln3 ~]$ srun --pty --nodes=1 --ntasks-per-node=24 -p compute -t 01:00:00 --wait=0 /bin/bash

Once your request is approved your command prompt should reflect the new node id.

Before you can run your container you will need to load the Singularity module (if you are unfamiliar with modules on Comet, you may want to review the Comet User Guide). The command to load Singularity on Comet is:

[test_user@comet-ln3 ~]$ module load singularity

You may issue the previous command from any directory on Comet. Recall that we added a hello_world.sh script to our centos7.img container. Let us try executing that script with the following command:

[test_user@comet-ln3 ~]$ singularity exec /oasis/scratch/comet/test_user/temp_project/singularity/centos7.img /hello_world.sh

If all goes well, you should see Hello, World! in the console output. You might also see some warnings pertaining to non-existent bind points. You can resolve this by adding some additional lines to your definitions file before you build your container. We did not do that for this tutorial, but you would use a command like the following in your definitions file:

# create bind points for SDSC HPC environment
mkdir -p /oasis/scratch/comet/temp_project

You will find additional examples located in the following locations on Comet:

/share/apps/examples/SI2017/Singularity

and

/share/apps/examples/SINGULARITY
8.5.1.12.3 Allocate Resources to Run the Container

Singularity - allocate resources to run container

It is best to avoid working on Comet's login nodes, since they can become a performance bottleneck not only for you but for all other users. You should rather allocate resources specifically for computationally-intensive jobs. To allocate a 'compute node' for your use on Comet, issue the following command:

[test_user@comet-ln3 ~]$ salloc -N 1 -t 00:10:00

This allocation requests a single node (-N 1) for a total time of 10 minutes (-t 00:10:00). Once your request has been approved, your compute node name should be displayed, e.g. comet-17-12.

Now you may log in to this node:

[test_user@comet-ln3 ~]$ ssh comet-17-12

Notice that the command prompt has now changed to reflect the fact that you are on a compute node and not a login node:

[test_user@comet-06-04 ~]$

Next, load the Singularity module, shell into the container, and execute the hello_world.sh script:

[test_user@comet-06-04 ~]$ module load singularity
[test_user@comet-06-04 ~]$ singularity shell centos7.img
[test_user@comet-06-04 ~]$ ./hello_world.sh

If all goes well, you should see Hello, World! in the console output.
8.5.1.12.4 Integrate the Container with Slurm

Singularity - run container on Comet via Slurm

Of course, most users simply want to submit their jobs to the Comet queue, let them run to completion, and go on to other things while waiting. Slurm is the job manager for Comet.

Given next is a job script (which we will name singularity_mvapich2_hellow.run) which will submit your Singularity container to the Comet queue and run a program, hellow.c (written in C using MPI and provided as part of the examples with the mvapich2 default installation).

#!/bin/bash
#SBATCH --job-name="singularity_mvapich2_hellow"
#SBATCH --output="singularity_mvapich2_hellow.%j.out"
#SBATCH --error="singularity_mvapich2_hellow.%j.err"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:10:00
#SBATCH --export=all

module load mvapich2_ib singularity

CONTAINER=/oasis/scratch/comet/$USER/temp_project/singularity/centos7-mvapich2.img

mpirun singularity exec ${CONTAINER} /usr/bin/hellow

The previous script requests 2 nodes and 24 tasks per node with a walltime of 10 minutes. Notice that two modules are loaded (see the line beginning with 'module'), one for Singularity and one for MPI. An environment variable 'CONTAINER' is also defined to make it a little easier to manage long reusable text strings such as file paths.

You may need to add a line specifying which allocation is to be used for this job. When you are ready to submit the job to the Comet queue, issue the following command:

[test_user@comet-06-04 ~]$ sbatch -p debug ./singularity_mvapich2_hellow.run

To view the status of your job in the Comet queue, issue the following:

[test_user@comet-06-04 ~]$ squeue -u test_user

When the job is complete, view the output, which should be written to the output file singularity_mvapich2_hellow.%j.out where %j is the job ID (let's say the job ID is 1000001):

[test_user@comet-06-04 ~]$ more singularity_mvapich2_hellow.1000001.out

The output should look something like the following:
8.5.1.12.5 Use Existing Comet Containers

SDSC User Support staff member Marty Kandes has built several custom Singularity containers designed specifically for the Comet environment.

Learn more about these containers for Comet.

An easy way to pull images from the Singularity hub on Comet is provided in the next video:

Singularity - pull from singularity-hub on Comet

Comet supports the capability to pull a container directly from any properly configured remote Singularity hub. For example, the following command can pull a container from the hpcdevops Singularity hub straight to an empty container located on Comet:

The resulting container should be named something like singularity-hello-world.img.

Learn more about Singularity hubs and container collections at:

https://singularity-hub.org/collections

That's it! Congratulations! You should now be able to run Singularity containers on Comet either interactively or through the job queue. We hope you found this tutorial useful. Please contact support@xsede.org with any questions you might have. Your Comet-related questions will be routed to the amazing SDSC
Hello world from process 28 of 48
Hello world from process 29 of 48
Hello world from process 30 of 48
Hello world from process 31 of 48
Hello world from process 32 of 48
Hello world from process 33 of 48
Hello world from process 34 of 48
Hello world from process 35 of 48
Hello world from process 36 of 48
Hello world from process 37 of 48
Hello world from process 38 of 48
comet$ singularity pull shub://hpcdevops/singularity-hello-world:master

Support Team.
8.5.1.13 Using Tensorflow With Singularity

One of the more common advantages of using Singularity is the ability to use pre-built containers for specific applications which may be difficult to install and maintain by yourself, such as Tensorflow. The most common example of a Tensorflow application is character recognition using the MNIST dataset. You can learn more about this dataset at http://yann.lecun.com/exdb/mnist/.

XSEDE's Comet supercomputer supports Singularity and provides several pre-built containers which run Tensorflow. Given next is an example batch script which runs a Tensorflow job within a Singularity container on Comet. Copy this script and paste it into a shell script named mnist_tensorflow_example.sb.
8.5.1.14 Run the job

To submit the script to Comet, first you'll need to request a compute node with the following command (replace the account with your XSEDE account number):

To submit the job to the Comet queue, issue the following command:

When the job is done you should see an output file in your output directory containing something resembling the following:
#!/bin/bash
#SBATCH --job-name="TensorFlow"
#SBATCH --output="TensorFlow.%j.%N.out"
#SBATCH --partition=gpu-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --gres=gpu:k80:1
#SBATCH -t 01:00:00

module load singularity
singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img lsb_release -a
singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img python -m tensorflow.models.image.mnist.convolutional

[test_user@comet-ln3 ~]$ srun --account=your_account_code --partition=gpu-shared --gres=gpu:1 --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 --export=ALL /bin/bash

[test_user@comet-06-04 ~]$ sbatch mnist_tensorflow_example.sb
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS

Congratulations! You have successfully trained a neural network to recognize handwritten numeric characters.
8.6 EXERCISES ☁

E.Docker.1: MongoDB Container

Develop a docker file that uses the mongo distribution from Docker Hub and starts a MongoDB database on the regular port while communicating to your container.

What are the parameters on the command line that you need to define?

E.Docker.2: MongoDB Container with authentication

Develop a MongoDB container that includes an authenticated user.
Release:        16.04
Codename:       xenial

WARNING: Non existent bind point (directory) in container: '/scratch'
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 40.0 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 12.6 ms
Minibatch loss: 3.293, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.0%
Step 8400 (epoch 9.77), 11.5 ms
Minibatch loss: 1.596, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Step 8500 (epoch 9.89), 11.5 ms
Minibatch loss: 1.593, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Test error: 0.9%
You must use the cloudmesh.yaml file for specifying the information for the admin user and password.

1. How do you add the user?
2. How do you start the container?
3. Showcase the use of the authentication with a simple script or pytest.

You are allowed to use docker compose, but make sure you read the password and username from the yaml file. You must not configure it by hand in the compose yaml file. You can use cloudmesh commands to read the username and password.
E.Docker.3: Cloudmesh Container

In this assignment we will explore the use of two containers. We will be leveraging the assignment E.Docker.2.

First, you will start the authenticated docker MongoDB container.

You will be writing an additional docker file that creates cloudmesh in a docker container. Upon start, the parameter passed to the container will be executed in the container. You will use the .ssh and .cloudmesh directory from your native file system.

For hints, please look at

https://github.com/cloudmesh/cloudmesh-cloud/blob/master/docker/ubuntu-19.04/Dockerfile
https://github.com/cloudmesh/cloudmesh-cloud/blob/master/docker/ubuntu-19.04/Makefile

To jumpstart you, try

Explore! Understand what is done in the Makefile.
cms config value cloudmesh.data.mongo.MONGO_USERNAME
cms config value cloudmesh.data.mongo.MONGO_PASSWORD
make image
make shell
Questions:

1. How would you need to modify the Dockerfile to complete it?
2. Why did we comment out the MongoDB related tasks in the Dockerfile?
3. How do we need to establish communication to the MongoDB container?
4. Could docker compose help, or would it be too complicated, e.g. what if the mongo container already runs?
5. Why would it be dangerous to store the cloudmesh.yaml file inside the container? Hint: Docker Hub.
6. Why should you at this time not upload images to Docker Hub?
E.Docker.Swarm.1: Documentation

Develop a section in the handbook that deploys a Docker Swarm cluster on a number of ubuntu machines. Note that this may actually be easier, as docker and docker swarm are distributed with recent versions of ubuntu. Just in case, we are providing a link to an effort we found to install docker swarm. However, we have not checked it or identified if it is useful.

https://rominirani.com/docker-swarm-tutorial-b67470cf8872

E.Docker.Swarm.2: Google Compute Engine

Develop a section that deploys a Docker Swarm cluster on Google Compute Engine. Note that this may actually be easier, as docker and docker swarm are distributed with recent versions of ubuntu. Just in case, we are providing a link to an effort we found to install docker swarm. However, we have not checked it or identified if it is useful.

https://rominirani.com/docker-swarm-on-google-compute-engine-364765b400ed
E.SingleNodeHadoop:

Set up a single node hadoop environment.

This includes:

1. Create a Dockerfile that deploys hadoop in a container.
2. Develop sample applications and tests to test your cluster. You can use wordcount or similar.

You will find a comprehensive installation instruction that sets up a hadoop cluster on a single node at

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

E.MultiNodeHadoop:

Set up a hadoop cluster in a distributed environment.

This includes:

1. Create docker compose and Dockerfiles that deploy hadoop in kubernetes.
2. Develop sample applications and tests to test your cluster. You can use wordcount or similar.

You will find a comprehensive installation instruction that sets up a hadoop cluster in a distributed environment at

https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

You can use this set of instructions or identify other resources on the internet that allow the creation of a hadoop cluster on kubernetes. Alternatively, you can use docker compose for this exercise.
E.SparkCluster: Documentation

Develop a high quality section that installs a spark cluster in kubernetes. Test your deployment on minikube and also on the FutureSystems echo cluster.

You may want to get inspired from the talk Scalable Spark Deployment using Kubernetes:

- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-1/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-3/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-4/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-5/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-6/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-7/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-8/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-9/

Make sure you do not plagiarize.
9 NIST

9.1 NIST BIG DATA REFERENCE ARCHITECTURE ☁

Learning Objectives

- Obtain an overview of the NIST Big Data Reference Architecture.
- Understand that you can contribute to it as part of this class.

One of the major technical areas in the cloud is to define architectures that can work with Big Data. For this reason NIST has worked for some time on identifying how to create a data interoperability framework. The idea here is that architecture designers can pick services, combine them as part of their data pipeline, and integrate them in a convenient fashion into their solution.

Besides just being a high level description, NIST also encourages the verification of the architecture through interface specifications, especially those that are currently underway in Volume 8 of the document series. You have the unique opportunity to help shape this interface and contribute to it. We will provide you not only with mechanisms on how you theoretically can do this, but also on how you practically can contribute.

As part of your projects in 516 you will need to integrate a significant service that you can contribute to the NIST document in form of a specification and in form of an implementation.
9.1.1 Pathway to the NIST-BDRA

The NIST Big Data Public Working Group (NBD-PWG) was established as a collaboration between industry, academia and government "to create a consensus-based extensible Big Data Interoperability Framework (NBDIF) which is a vendor-neutral, technology- and infrastructure-independent ecosystem" [70]. It will be helpful for Big Data stakeholders such as data architects, data scientists, researchers, and implementers to integrate and utilize "the best available analytics tools to process and derive knowledge through the use of standard interfaces between swappable architectural components" [70]. The NBDIF is being developed in three stages:

- Stage 1: "Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic," [70] introduction of the Big Data Reference Architecture (NBD-RA);
- Stage 2: "Define general interfaces between the NBD-RA components with the goals to aggregate low-level interactions into high-level general interfaces and produce set of white papers to demonstrate how NBD-RA can be used" [70];
- Stage 3: "Validate the NBD-RA by building Big Data general applications through the general interfaces." [70]

NIST has developed the volumes listed in the Table: NIST BDRA Volumes that surround the creation of the NIST-BDRA. We recommend that you take a closer look at these documents, as in this section we provide a focussed summary with the aspect of cloud computing in mind.
Table: NIST BDRA Volumes

| Document         | Volume   | Title                                      |
|------------------|----------|--------------------------------------------|
| NIST SP 1500-1r1 | Volume 1 | Definitions                                |
| NIST SP 1500-2r1 | Volume 2 | Taxonomies                                 |
| NIST SP 1500-3r1 | Volume 3 | Use Cases and Requirements                 |
| NIST SP 1500-4r1 | Volume 4 | Security and Privacy                       |
| NIST SP 1500-5   | Volume 5 | Reference Architectures White Paper Survey |
| NIST SP 1500-6r1 | Volume 6 | Reference Architecture                     |
| NIST SP 1500-7r1 | Volume 7 | Standards Roadmap                          |
| NIST SP 1500-9   | Volume 8 | Reference Architecture Interface (new)     |
| NIST SP 1500-10  | Volume 9 | Adoption and Modernization (new)           |
9.1.2 Big Data Characteristics and Definitions

Volume 1 of the series introduces the community to common definitions that are used as part of the field of Big Data. This includes the analysis of characteristics such as volume, velocity, variety, and variability, and the use of structured and unstructured data. As part of the field of data science and engineering, it lists a number of areas that are believed to be essential and that a data scientist must master, including data structures, parallelism, metadata, flow rate, and visual communication. In addition we believe that an additional skill set must be prevalent that allows a data engineer to deploy such technologies onto actual systems.

We have submitted the following proposal to NIST:

3.3.6. Deployments:

A significant challenge exists for data engineers to develop architectures and their deployment implications. The volume of data and the processing power needed to analyze them may require many thousands of distributed compute resources. They can be part of private data centers, virtualized with the help of virtual machines or containers, and even utilize serverless computing to focus on the integration of Big Data Function as a Service based architectures. As such architectures are assumed to be large, community standards such as leveraging DevOps will be necessary for the engineers to set up and manage such architectures. This is especially important with the swift development of the field, which may require rolling updates without interruption of the services offered.

This addition reflects the newest insight into what a data scientist needs to know and the newest job trends that we observed.
To identify what big data is, we find the following characteristics:

Volume: Big for data means lots of bytes. This could be achieved in many different ways. Typically we look at the actual size of a data set, but also at how this data set is stored, for example in many thousands of smaller files that are part of the data set. It is clear that in many such cases analysis of a large volume of data will impact the architectural design for storage, but also the workflow on how this data is processed.

Velocity: We often see that big data is associated with high data flow rates caused, for example, by streaming data. It can however also be caused by functions that are applied to large volumes of data and need to be integrated quickly to return the result as fast as possible. Needs for real time processing as part of the quality of service offered also contribute to this. Examples of IoT devices that integrate data not only in the cloud, but also on the edge need to be considered.

Variety: In today's world we have many different data resources that motivate sophisticated data mashup strategies. Big data hence not only deals with information from one source but from a variety of sources. The architectures and services utilized are multiple and are needed to enable automated analysis while incorporating various data sources.

Another aspect of variety is that data can be structured or unstructured. NIST finds this aspect so important that they included its own section for it.

Variability: Any data will change over time. Naturally that is not an exception in big data, where data may have a time to live or needs to be updated in order not to be stale or obsolete. Hence one of the characteristics that big data could exhibit is that its data is variable and prone to changes.

In addition to these general observations we also have to address important characteristics that are attached to the data itself. This includes:

Veracity: Veracity refers to the accuracy of the data. Accuracy can be increased by adding metadata.

Validity: Refers to data that is valid. While data can be accurately measured, it could be invalid by the time it is processed.

Volatility: Volatility refers to the change in the data values over time.

Value: Naturally we can store lots of information, but if the information is not valuable then we may not need to store it. This has recently been seen as a trend, as some companies have transitioned data sets to the community because they do not provide enough value to the service provider to justify their prolonged maintenance.

In other cases the data has become so valuable that the services offered have been reduced, for example because the community poses too many resource demands. A good example is Google Scholar, which used to allow much more liberal use, while today its services are significantly scaled back for public users.
9.1.3 Big Data and the Cloud

While looking at the characteristics of Big Data it is obvious that Big Data is on the one hand a motivator for cloud computing, but on the other hand existing Big Data frameworks are a motivator for developing Big Data architectures a certain way.

Hence we always have to look from both sides towards the creation of architectures related to a particular application of big data.

This is also motivated by the rich history we have in the field of parallel and distributed computing. For a long time engineers have dealt with the issue of horizontal scaling, which is defined by adding more nodes or other resources to a cluster. Such resources may include:

- shared disk file systems,
- distributed file systems,
- distributed data processing and concurrency frameworks, such as concurrent sequential processes, workflows, MPI, map/reduce, or shared memory,
- resource negotiation to establish quality of service,
- data movement, and
- data tiers (as showcased in high energy physics by Ligo [71] and Atlas)

In addition to the horizontal scaling issues we also have to worry about the vertical scaling issues, that is, how the overall system architecture fits together to address an end-to-end use case. In such efforts we look at:

- interface designs,
- workflows between components and services,
- privacy of data and other security issues,
- reusability within other use-cases.

Naturally the cloud offers the ability to cloudify existing relational databases as cloud services while leveraging the increased performance and special hardware and software support that may be otherwise unaffordable for an individual user. However, we also see the explosive growth of NoSQL databases, because some of them can more effectively deal with the characteristics of big data than traditional, mostly well-structured, databases. In addition, many of these frameworks are able to introduce advanced capabilities such as distributed and reliable service integration.

Although we have become used to the term cloud for virtualized resources and the term Grid for a network of supercomputers in a virtual organization, we should not forget that cloud service providers also offer high performance computing resources for some of their most advanced users.

Naturally such resources can be used not only for numerically intensive computations but also for big data applications, as the physics community has demonstrated.
9.1.4 Big Data, Edge Computing and the Cloud

When looking at the number of devices that are being added daily to the global IT infrastructure, we observe that cell phones and soon Internet of Things (IoT) devices will produce the bulk of all data. However, not all data will be moved to the cloud: lots of data will be analyzed locally on the devices, or will not even be considered for upload to the cloud because it is projected to be of too low or too high value to be moved. Still, a considerable portion will put new constraints on the services we offer in the cloud, and any architecture addressing this must properly deal with scaling early on in the architectural design process.
9.1.5 Reference Architecture

Next we present the Big Data reference architecture. It is depicted in Figure 78. According to the document (Volume 2), the five main components representing the central roles include:

- System Orchestrator: Defines and integrates the required data application activities into an operational vertical system;
- Data Provider: Introduces new data or information feeds into the Big Data system;
- Big Data Application Provider: Executes a life cycle to meet security and privacy requirements as well as System Orchestrator-defined requirements;
- Big Data Framework Provider: Establishes a computing framework in which to execute certain transformation applications while protecting the privacy and integrity of data; and
- Data Consumer: Includes end users or other systems who use the results of the Big Data Application Provider.

In addition we recognize two fabric layers:

- Security and Privacy Fabric
- Management Fabric

Figure 78: NIST-BDRA (see Volume 2)

While looking at the actors depicted in Figure 79, we need to be aware that in each of the categories a service can be added. This is an important distinction to the original depiction in the definition, as it is clear that an automated service could act on behalf of the actors listed in each of the categories.

Figure 79: NIST Roles (see Volume 2)

For a detailed definition, which is beyond the scope of this document, we refer to Volume 2 of the documents.
9.1.6 Framework Providers

Traditionally cloud computing has started with offering IaaS, followed by PaaS and SaaS. We see the IaaS reflected in three categories for big data:

1. Traditional compute and network resources including virtualization frameworks
2. Data organization and distribution systems such as offered in indexed storage and file systems
3. Processing engines offering batch, interactive, and streaming services to provide computing and analytics activities

Messaging and communication takes place between these layers, while resource management is used to address efficiency.

Frameworks such as Spark and Hadoop include components from multiple of these categories to create a vertically integrated system. Often they are offered by a service provider. However, one needs to be reminded that such offerings may not be tailored to the individual use-case, and inefficiencies could be prevalent because the service offering is outdated, or because it is not explicitly tuned to the problem at hand.
9.1.7 Application Providers

The underlying infrastructure is reused by big data application providers supporting services and tasks such as:

- Data collection
- Data curation
- Data analytics
- Data visualization
- Data access

Through the interplay between these services, data consumers and data producers can be served.

9.1.8 Fabric

Security and general management are part of the governing fabric in which such an architecture is deployed.

9.1.9 Interface definitions

The interface definitions for the BDRA are specified in Volume 8. We are in the second phase of our document specification, in which we switch from a pure resource description to an OpenAPI specification. Before we can provide more details we need to introduce you to REST, which is an essential technology for many modern cloud computing services.
10 AI

10.1 ARTIFICIAL INTELLIGENCE SERVICE WITH REST ☁ {#sec:ai}

10.1.1 Unsupervised Learning

Keywords: clustering, kNN, Markov Model

Unsupervised learning is a learning method in which the training data is not labeled. This problem can be more challenging because we have no pre-knowledge and must find patterns in the data. Unsupervised learning can be very compute intensive and rests on complicated mathematical principles, but it is very useful. In this chapter, we will illustrate some of the most popular unsupervised learning algorithms and give examples of how to apply them, including KMeans, k-NN, Markov Models and others.

It is important to know that unsupervised learning is just one way of looking at a problem. Each algorithm is just an example of how we solve a particular problem. Before you apply an algorithm, attention should be given to the reason why we apply that specific algorithm.
10.1.2 KMeans

KMeans is one of the most straightforward unsupervised learning algorithms.

slides AI (40) Unsupervised Learning
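The slides carry the algorithmic details. As a complement, the following minimal sketch, assuming scikit-learn is installed (the slides do not prescribe a library), clusters the same four toy points used in the pyspark example earlier in this chapter.

# Minimal k-means sketch with scikit-learn (our choice of library).
# The four points mirror the earlier pyspark example and form two
# obvious groups.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=50).fit(points)

print(model.cluster_centers_)                   # one center per group
print(model.predict([[0.5, 0.5], [8.5, 8.5]]))  # two different labels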
10.1.3 Lab: Practice on AI

Keywords: Docker, REST Service, Spark

slides Practice on AI (40) REST services
10.1.4 k-NN

k-NN is a non-parametric statistical method, meaning there is no assumption made about the distribution of the data. Additionally, the distribution is not assumed to be fixed, i.e. the distribution may change through time. These relaxed assumptions make non-parametric tests extremely valuable when applied to real-world data, as a vast majority of real-world data have dynamic distributions through time; climate data is an obvious example. Non-parametric data is often ordinal, which means the variables have an inherent categorical order with unknown distances between the categories. A common example of a non-parametric statistical test is the sign test, where values are assigned a positive or negative sign based on being above or below the median. In k-NN, predictions are made about unknown values by matching the unknown values with similar known values. Naturally the determination of 'similar' is of fundamental importance. This is done through the application of the Euclidean distance calculation given by Equation 1.
$$d(i, j) = d(j, i) = \sqrt{(i_1 - j_1)^2 + (i_2 - j_2)^2 + \dots + (i_n - j_n)^2} = \sqrt{\sum_{k=1}^{n} (i_k - j_k)^2}$$
To illustrate an example of calculating similarity using Equation 1 it can bedetermiendifacarisfastornotbyusingthedatainTable1.LetspretendweknownothingaboutcarsandareaskedifwethinkaChevyCorvette isfastornot.
Car make and model with associated horsepower, whether the vehicle has aracingstripeandiftheauthorthinksthecarisfastornotTable1.
Table 1: Car Data

| Car Name              | Horsepower (HP) | Racing Stripe (Yes or No) | Fast (Yes or No) |
|-----------------------|-----------------|---------------------------|------------------|
| Toyota Prius          | 120             | 0                         | 0                |
| Tesla Roadster        | 288             | 0                         | 1                |
| Bugatti Veyron        | 1200            | 1                         | 1                |
| Honda Civic           | 158             | 1                         | 0                |
| Lamborghini Aventador | 695             | 1                         | 1                |
Now let's say our friend wants to know if a Ford Mustang with a racing stripe is fast or not. This particular friend knows nothing about cars, so decided to put analytics to work. Since a Mustang has roughly 300 horsepower, the closest car in our dataset to this is the Tesla Roadster, and since the Tesla is fast we would predict the Mustang to be fast. Remember, this is completely dependent on the author's initial classification of whether a car is fast or not. Clearly the Lamborghini and the Bugatti are fast, but maybe the Tesla is not fast, therefore giving an incorrect answer. An example using the Mustang and the Tesla is given in the next calculation:
$$d(i, j) = \sqrt{(300 - 288)^2 + (1 - 0)^2} = 12.04$$
We were able to determine the closest, or first nearest neighbor, by inspection of this data; however, with a more robust dataset this may not be the case. In these situations, to find the nearest neighbor the Euclidean distance is calculated for every unique row entry and then ordered from smallest to largest distance; naturally the smallest distances are the most similar. You may notice that the values of horsepower are significantly larger in magnitude than the values associated with racing stripes. This could be problematic in many real world scenarios where the columns associated with large values do not have as direct an impact as horsepower does on the variable we are trying to predict, a car being fast. In the case where each column value has equal predictive power, data normalization should be performed. This is the process of centering each column to a mean of zero (0) and a standard deviation of one (1). This is done by subtracting the column mean from each column entry and dividing by the column standard deviation.
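As a quick illustration of this normalization, here is a minimal sketch that applies the z-score formula to the numeric columns of Table 1; the column names are our own.

# Z-score normalization of the numeric columns of Table 1
# (the column names are illustrative).
import pandas as pd

cars = pd.DataFrame({
    "horsepower":    [120, 288, 1200, 158, 695],
    "racing_stripe": [0, 0, 1, 1, 1],
})

# subtract the column mean from each entry and divide by the
# column standard deviation
cars_normalized = (cars - cars.mean()) / cars.std()
print(cars_normalized)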
Determine for yourself: if we use 2 nearest neighbors, what would the prediction about the Mustang be, given the data provided? What about 3 or 4 nearest neighbors? What is the maximum number of k-nearest neighbors we could have given the dataset in Table 1?

Calculate the Euclidean distances for all five row entries with respect to the Mustang.

Normalize the data and recalculate the first and second nearest neighbors with respect to the Mustang. Does anything change?
In order to see k-NN in action we will look at an in depth example using adataset from the National Basketball Associated (NBA) from 2013, naturallytherearemoreuptodatedatasetsbutassportsanalyticsbecomesasignificantmarketmoreandmoredataisbecomingproprietary.ThisexamplewillpickanNBAplayer and determine themost similarNBAplayer in the dataset to theselected player using k nearest neighbors. The following is set up for you toexecuteinapythoncommandpromptlinebylineforinstructionalpurposes.
Thepreviousportionofcodeusespandas toopenthedownloadedcsvfileandnameitnba,naturallyyoucouldnamethefileanything.Ifyouwanttoviewthecolumnsinthecsvfilethefollowingcommandcanbeused.
Nowwe need to select a player from the dataset, wewill then determine themostsimilarplayer toourselectedplayer.Analysis like this isbecomingmoreandmore prevalent in professional sports due to the large amounts ofmoneyinvested in players. Scoutsmay use this type of analysis to determine who agivenprospect ismost similar too.This followingbit of code selects a playerfromthedataset.Noticethatthecolumnplayerisfirstselectedfollowedbytheplayername.
Thenextstepistoremoveanynon-numericcolumnsfromouranalysissinceweare using theEuclidean distance to calculate proximity and strings can not beevaluated insuchaway.One thingyoucando ifyouhavecolumns thathavevalueslikeyesandnoisassignzerosandonesaccordingly.Inourcasewewillonlyselectthecolumnswithnumericvalues.
# This code was adapted from Dataquest - K nearest neighbors in Python:
# Written by: Vik Paruchuri

import pandas
import math

with open("/path/to/the/nba_2013.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)

print(nba.columns.values)

selected_player = nba[nba["player"] == "LastNameFirstName"].iloc[0]

numeric_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p',
                   'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.',
                   'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
We now have everything we need to calculate the Euclidean distance. There are built-in functions available in python to calculate this, however we will define our own, as it is a straightforward computation. It is also good practice to define your own functions whenever possible.

Applying our function using the following command will determine the Euclidean distance between the selected player and all other players in the dataset.

For the sake of argument we will assume that all the data columns have equal predictive capabilities, so we wish to normalize. This will often be the case with sports statistics, as total points and field goal percentage vary in magnitude significantly, but total points does not necessarily hold more predictive power than field goal percentage. In order to normalize we again must only select the numeric columns, as text columns can not be normalized in the way described previously.

We can now use built-in functions to calculate the nearest neighbors in order to compare to our results attained from the previous exercise. In case you did not notice, the selected_player_distance array is an array that lists all the Euclidean distances. We will use this later to see if the same result is obtained by using the built-in functions. First we will import the necessary libraries, shown in the next code.

If you inspected the selected_player_distance array you would have noticed that there were several NaN's present; this was due to having an incomplete dataset and must be avoided. The following bit of code will replace all NA entries with zeros.
def euclidean_distance(row):
    """
    Define our own Euclidean distance function
    """
    euc_distance = 0
    for k in numeric_columns:
        euc_distance += (row[k] - selected_player[k]) ** 2
    return math.sqrt(euc_distance)

selected_player_distance = nba.apply(euclidean_distance, axis=1)

nba_numeric = nba[numeric_columns]

# apply normalization formula using built in python math
# functions for the mean and standard deviation
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()

from scipy.spatial import distance

nba_normalized.fillna(0, inplace=True)
Using the built-in Euclidean distance, we determine the Euclidean distances of all players in the dataset to our selected player.

Here we create a dataframe to hold the distances and then sort the values from lowest to highest. Since our player will naturally be in the dataset, the selected player will have the lowest value; therefore the second lowest value is associated with the player most closely related to our selected player.

In the python prompt type:

The most similar player to your selected player should appear.
10.1.5 Machine Learning and Cloud Services

10.1.5.1 Introduction and Regression

This video lecture covers logistic and linear regression models in addition to clustering models. The algorithms for the three methods are formalized and solutions are presented. Additionally, visualization techniques are introduced, including WebPlotViz and matplotlib.

Introduction to Machine Learning for Cloud Services and Regression 10:55
10.1.5.2K-meansClustering
VideolectureandslidescoveranintroductiontoK-meansclusters.
K-meansClusters17:15
10.1.5.3Visulization
# .iloc[0] reduces the single-row frame to a 1-D vector for scipy
player_normalized = nba_normalized[nba["player"] == "LastNameFirstName"].iloc[0]
euclidean_distances = nba_normalized.apply(
    lambda row: distance.euclidean(row, player_normalized), axis=1)

distance_frame = pandas.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)

second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_player = nba.loc[int(second_smallest)]["player"]
most_similar_to_player
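The same lookup can be cross-checked with a library implementation. The following sketch uses scikit-learn's NearestNeighbors class, assuming scikit-learn is installed, that nba uses its default integer index, and that nba_normalized and player_normalized are defined as above:

from sklearn.neighbors import NearestNeighbors

# fit a 2-nearest-neighbor model; the closest neighbor of any
# player is the player itself, so we look at the second match
nn = NearestNeighbors(n_neighbors=2, metric='euclidean')
nn.fit(nba_normalized)
distances, indices = nn.kneighbors([player_normalized])
print(nba.loc[indices[0][1]]["player"])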
10.1.5 Machine Learning and Cloud Services

10.1.5.1 Introduction and Regression

This video lecture covers logistic and linear regression models in addition to clustering models. The algorithms for the three methods are formalized and solutions are presented. Additionally, visualization techniques are introduced, including WebPlotViz and matplotlib.

Introduction to Machine Learning for Cloud Services and Regression 10:55

10.1.5.2 K-means Clustering

Video lecture and slides cover an introduction to K-means clustering.

K-means Clusters 17:15

10.1.5.3 Visualization

Video lecture and slides cover data visualization techniques using state-of-the-science tools like WebPlotViz.

Data Visualization 30:10
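As a minimal illustration with matplotlib, the following sketch (made-up data, hypothetical group names) produces the kind of labeled scatter plot discussed in the lecture:

import numpy as np
import matplotlib.pyplot as plt

# two made-up point clouds to illustrate a labeled scatter plot
rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(50, 2))
b = rng.normal(4, 1, size=(50, 2))

plt.scatter(a[:, 0], a[:, 1], label="group a")
plt.scatter(b[:, 0], b[:, 1], label="group b")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.legend()
plt.show()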
10.1.5.4 Clustering Examples

Video lecture and slides cover clustering examples.

Examples of Clustering 5:48
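As a concrete illustration, K-means from scikit-learn can be applied to the normalized NBA statistics prepared in the exercise above. This is a sketch; the choice of five clusters is arbitrary:

from sklearn.cluster import KMeans

# assign every player to one of five clusters based on the
# normalized statistics from the earlier exercise
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(nba_normalized)
print(labels[:10])  # cluster labels of the first ten players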
10.1.5.5 General Clustering with Examples

Video lecture and slides take a generalized approach to clustering with examples from Dr. Geoffrey Fox's research.

General Clustering and Research Examples 22:28
10.1.5.6 In-Depth Example with Four Centers

Video lecture and slides use 1000 data points and four artificial centers to provide an in-depth example of clustering. Code is provided.

Example with four centers 20:53
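The lecture supplies its own code; as a rough stand-in, the following sketch generates 1000 points around four artificial centers with scikit-learn's make_blobs and recovers the centers with K-means:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1000 points scattered around four artificial centers
X, y = make_blobs(n_samples=1000, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.cluster_centers_)  # estimates of the four true centers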
10.1.5.7 Parallel Computing and K-means

Video lecture and slides discuss parallel computing, using K-means as an example of how to accelerate time to completion by exploiting modern computing hardware.

Parallel Computing and K-means
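As a small, hedged sketch of the idea in Python: the assignment step of K-means, the distance computation that dominates each iteration, can be spread over several cores with joblib (made-up data; the chunk and worker counts are arbitrary):

import numpy as np
from joblib import Parallel, delayed

def nearest_center(chunk, centers):
    # index of the closest center for every point in the chunk
    d = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# made-up data: 100000 points and four centers
rng = np.random.default_rng(0)
points = rng.normal(size=(100000, 2))
centers = rng.normal(size=(4, 2))

# split the points across eight workers; each labels its own chunk
chunks = np.array_split(points, 8)
labels = np.concatenate(
    Parallel(n_jobs=8)(delayed(nearest_center)(c, centers) for c in chunks))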
10.1.6 Example Project with SVM

The following code is set up as an example project and will show how to use a RESTful service to download data. Additionally, the differences between a dynamic and a static API will be showcased. First we begin by importing the appropriate libraries.

Next we define three functions required to run this example: a function to download the data; a function to partition the data; and a function to get the data into the appropriate format once downloaded.

Defining the first API endpoint with the following lines of code will allow the user to expose the API. Prove this to yourself by opening a browser, preferably Google Chrome, and following the URL shown after the code.
import requests
from flask import Flask, request
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

app = Flask(__name__)

def download_data(url, filename):
    r = requests.get(url, allow_redirects=True)
    open(filename, 'wb').write(r.content)

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train'
    test_file = filename + '_test'
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()

def get_data(filename):
    data = load_svmlight_file(filename)
    return data[0], data[1]

@app.route('/')
def index():
    return "Demo Project!"

if __name__ == '__main__':
    app.run(debug=True)
Now open the application in your browser with

http://127.0.0.1:5000/

The first API endpoint we will define is the endpoint to download the data, which is done by the following lines of code. Note that the URL of the dataset is hardcoded into this portion of the code, as passing URLs to an API is not good practice.
The following three API endpoints use the data partition and get data functions defined previously. The partition function splits the dataset into two sections, testing and training. In this example the testing portion is 20% and the training portion is 80% of the dataset. Later we will explore how to make this part dynamic, allowing the user to choose the partitioning percentage.

The last bit of code is the implementation of the SVM as a RESTful API endpoint. Again, this is static and all parameters needed to tune the algorithm are hardcoded. It will be worth your time to extrapolate from the discussion about dynamic APIs in order to make these parameters tunable by the user through the URL.
@app.route('/api/download/data')
def download():
    url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale'
    # note: the glass dataset is saved under the filename iris.scale
    download_data(url=url, filename='iris.scale')
    return "Data Downloaded"

@app.route('/api/data/partition')
def partition():
    data_partition('iris.scale', 0.8)
    return "Successfully Partitioned"

@app.route('/api/get/data/test')
def gettestdata():
    Xtest, ytest = get_data("iris.scale_test")
    return "Return Xtest and ytest arrays"

@app.route('/api/get/data/train')
def gettraindata():
    Xtrain, ytrain = get_data("iris.scale_train")
    return "Return Xtrain and ytrain arrays"

@app.route('/api/experiment/svm')
def svm():
    Xtrain, ytrain = get_data("iris.scale_train")
    Xtest, ytest = get_data("iris.scale_test")
    clf = SVC(gamma=0.001, C=100, kernel='linear')
    clf.fit(Xtrain, ytrain)
    test_size = Xtest.shape[0]
    accuracy_holder = []
    for i in range(0, test_size):
        prediction = clf.predict(Xtest[i])
        print("Prediction from SVM: " + str(prediction) +
              ", Expected Label: " + str(ytest[i]))
        accuracy_holder.append(prediction[0] == ytest[i])
    correct_predictions = sum(accuracy_holder)
    print(correct_predictions)
    total_samples = test_size
    accuracy = float(correct_predictions) / float(total_samples) * 100
    print("Prediction Accuracy: " + str(accuracy))
    return "Prediction Accuracy: " + str(accuracy)

In order to run this you need to make a directory in a location of your choice and create a file called main.py that contains the code included here. Then simply type the following command in a terminal in which you have navigated to the location of the directory that you created.

python main.py

A continuous version of main.py is provided next for ease of use. Please be careful when copying and pasting, as additional characters may show up; this was noticed in the URL sections.
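Also, once the server is running, the endpoints can be exercised from a second Python session. The following sketch simply calls the routes defined above in order:

import requests

base = "http://127.0.0.1:5000"
# download, partition, then train and score the SVM
print(requests.get(base + "/api/download/data").text)
print(requests.get(base + "/api/data/partition").text)
print(requests.get(base + "/api/experiment/svm").text)

The full main.py now follows.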
import requests
from flask import Flask, request
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

app = Flask(__name__)

def download_data(url, filename):
    r = requests.get(url, allow_redirects=True)
    open(filename, 'wb').write(r.content)

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train'
    test_file = filename + '_test'
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()

def get_data(filename):
    data = load_svmlight_file(filename)
    return data[0], data[1]

@app.route('/')
def index():
    return "Demo Project!"

@app.route('/api/download/data')
def download():
    url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale'
    download_data(url=url, filename='iris.scale')
    return "Data Downloaded"

@app.route('/api/data/partition')
def partition():
    data_partition('iris.scale', 0.8)
    return "Successfully Partitioned"

@app.route('/api/get/data/test')
def gettestdata():
    Xtest, ytest = get_data("iris.scale_test")
return"ReturnXtestandYtestarrays"
@app.route('/api/get/data/train')
defgettraindata():
Xtrain,ytrain=get_data("iris.scale_train")
return"ReturnXtrainandYtrainarrays"
@app.route('/api/experiment/svm')
defsvm():
Xtrain,ytrain=get_data("iris.scale_train")
Xtest,ytest=get_data("iris.scale_test")
clf=SVC(gamma=0.001,C=100,kernel='linear')
clf.fit(Xtrain,ytrain)
test_size=Xtest.shape[0]
accuarcy_holder=[]
foriinrange(0,test_size):
prediction=clf.predict(Xtest[i])
print("PredictionfromSVM:"+str(prediction)+",Expected
Label:"+str(ytest[i]))
accuarcy_holder.append(prediction==ytest[i])
correct_predictions=sum(accuarcy_holder)
print(correct_predictions)
total_samples=test_size
accuracy=
float(float(correct_predictions)/float(total_samples))*100
print("PredictionAccuracy:"+str(accuracy))
return"PredictionAccuracy:"+str(accuracy)
if__name__=='__main__':
app.run(debug=True)
As mentioned previously, these are examples of static API endpoints. In many scenarios having a dynamic API would be preferred. Let us explore the data partition endpoint and modify the code of the static version to make a dynamic version. The next part is the function definition for the dynamic version of the data_partition function, and not much has changed. The only change made is that strings are appended to the testing and training file names for convenience. The ratio will match the user-defined ratio entered through the URL.

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train_' + str(ratio)
    test_file = filename + '_test_' + str(ratio)
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()
Now for defining the endpoint: naturally, it starts the same way as the static version, however now we must add a part that allows the user to enter values. This is done by the use of brackets, <text>.

@app.route('/api/data/partition/<filename>/ratio/<ratio>')
def partition(filename, ratio):
    ratio = float(ratio)
    # assumes the data file lives in a directory called data
    path = 'data/' + filename
    data_partition(path, ratio)
    return "Successfully Partitioned"
11 REFERENCES
[1] L. Richardson, "Beautiful soup python package overview." Web Page, 2019 [Online]. Available: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[2] C. Wodehouse, "Should you use mongodb? A look at the leading nosql database." Web Page, 2018 [Online]. Available: https://www.upwork.com/hiring/data/should-you-use-mongodb-a-look-at-the-leading-nosql-database/

[3] Guru99, "Introduction to mongodb." Web Page, 2018 [Online]. Available: https://www.guru99.com/mongodb-tutorials.html#1

[4] MongoDB, "Https://www.mongodb.com/." Web Page, 2018 [Online]. Available: https://docs.mongodb.com/manual/introduction/

[5] M. Papiernik, "How to install mongodb on ubuntu 18.04." Web Page, Jun-2018 [Online]. Available: https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-18-04

[6] J. Ellingwood, "Initial server setup with ubuntu 18.04." Web Page, Apr-2018 [Online]. Available: https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-18-04

[7] MongoDB, Databases and collections, 4.0 ed. New York, New York, USA: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/core/databases-and-collections/

[8] J. M. Craig Buckler, "Using joins in mongodb nosql databases." Web Page, Sep-2016 [Online]. Available: https://www.sitepoint.com/using-joins-in-mongodb-nosql-databases/

[9] MongoDB, Lookup (aggregation), 3.2 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/

[10] MongoDB, MongoDB package components - mongoexport, 4.0 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/program/mongoexport/

[11] MongoDB, Security, 4.0 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/security/

[12] MongoDB, "MongoDB atlas." Web Page, 2018 [Online]. Available: https://www.mongodb.com/cloud/atlas

[13] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/api

[14] A. J. J. Davis, "Announcing pymongo 3." Web Page, Apr-2015 [Online]. Available: https://emptysqua.re/blog/announcing-pymongo-3/

[15] M. Dirolf, "PyMongo." Web Page, Jul-2018 [Online]. Available: https://github.com/mongodb/mongo-python-driver

[16] N. Leite, "MongoDB and python." Web Page, Mar-2015 [Online]. Available: https://www.slideshare.net/NorbertoLeite/mongodb-and-python

[17] V. Oleynik, "How do you use mongodb with python?" Web Page, Mar-2017 [Online]. Available: https://gearheart.io/blog/how-do-you-use-mongodb-with-python/

[18] I. MongoDB, "Installing / upgrading." Web Page, 2008 [Online]. Available: http://api.mongodb.com/python/current/installation.html

[19] R. Python, "Introduction to mongodb and python." Web Page, 2016 [Online]. Available: https://realpython.com/introduction-to-mongodb-and-python/

[20] W3Schools, "Python mongodb create database." Web Page, 1999 [Online]. Available: https://www.w3schools.com/python/python_mongodb_create_db.asp

[21] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/tutorial.html

[22] N. O'Higgins, PyMongo & python. O'Reilly, 2011 [Online]. Available: http://img105.job1001.com/upload/adminnew/2015-04-07/1428393873-MHKX3LN.pdf

[23] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/examples/aggregation.html

[24] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
[25] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://docs.mongodb.com/manual/core/map-reduce/

[26] MongoDB, "PyMongo v2.0 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/2.0/examples/map_reduce.html

[27] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/examples/copydb.html

[28] MongoEngine, "MongoEngine user documentation." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/

[29] Wikipedia, "Object-relational mapping." Web Page, May-2009 [Online]. Available: https://en.wikipedia.org/wiki/Object-relational_mapping

[30] MongoDB, "Flask-mongoengine." Web Page, 2008 [Online]. Available: http://docs.mongoengine.org/guide/defining-documents.html

[31] MongoEngine, "User guide: Document instances." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/document-instances.html

[32] MongoEngine, "2.1 installing mongoengine." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/installing.html

[33] MongoEngine, "2.2 connection to mongodb." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/connecting.html

[34] MongoEngine, "User guide 2.5: Querying the database." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/querying.html

[35] Wikipedia, "Flask (web framework)." Web Page, 2010 [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)

[36] MongoDB, "Flask-pymongo." Web Page, 2008 [Online]. Available: https://flask-pymongo.readthedocs.io/en/latest/

[37] MongoDB, "Flask mongoalchemy." Web Page, 2008 [Online]. Available: https://pythonhosted.org/Flask-MongoAlchemy/

[38] MongoDB, "Flask-mongoengine." Web Page, 2008 [Online]. Available: http://docs.mongoengine.org/projects/flask-mongoengine/en/latest/

[39] Wikipedia, "Flask (web framework)." Web Page, Oct-2018 [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)

[40] R. T. Fielding and R. N. Taylor, Architectural styles and the design of network-based software architectures, vol. 7. University of California, Irvine Doctoral dissertation, 2000.

[41] Wikipedia, "Representational state transfer." Web Page, 2019 [Online]. Available: https://en.wikipedia.org/wiki/Representational_state_transfer

[42] OpenAPI Initiative, "The openapi specification." Web Page [Online]. Available: https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md

[43] OpenAPI Initiative, "The openapi specification." Web Page [Online]. Available: https://github.com/OAI/OpenAPI-Specification

[44] RAML, "RAML version 1.0: RESTful api modeling language." Web Page [Online]. Available: https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

[45] R. H. Kevin Burke Kyle Conroy, "Flask-restful." Web Page [Online]. Available: https://flask-restful.readthedocs.io/en/latest/

[46] E. O. Ltd, "Django rest framework." Web Page [Online]. Available: https://www.django-rest-framework.org/

[47] S. Software, "API development for everyone." Web Page [Online]. Available: https://swagger.io

[48] S. Software, "Swagger codegen documentation." Web Page [Online]. Available: https://swagger.io/docs/open-source-tools/swagger-codegen/

[49] A. Y. W. Hate, "OpenAPI.Tools." Web Page [Online]. Available: https://openapi.tools/
[50] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.edureka.co/blog/mapreduce-tutorial/?utm_source=youtube&utm_campaign=mapreduce-tutorial-161216-wr&utm_medium=description
[51] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.youtube.com/watch?v=SqvAaB3vK8U&list=WL&index=25&t=2547s
[52] “Apache mapreduce.” Aug-2019 [Online]. Available:https://www.ibm.com/analytics/hadoop/mapreduce
[53] Wikipedia, “MapReduce.” Aug-2019 [Online]. Available:https://en.wikipedia.org/wiki/MapReduce
[54] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
[55] A. Khan, “Hadoop and spark.” Aug-2019 [Online]. Available:https://www.quora.com/What-is-the-difference-between-Hadoop-and-Spark.[Accessed:03-Sep-2017]
[56] “Apache spark vs hadoop mapreduce.” Aug-2019 [Online]. Available:https://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce/
[57] Twister, “Twister2: Twister2 Big Data Hosting Environment: Acomposable framework for high-performance data analytics.”Web Page, Feb-2017[Online].Available:https://twister2.gitbook.io/twister2/
[58] Twister, “Twister2: Twister2 Big Data Hosting Environment: Acomposable framework for high-performance data analytics.”Web Page, Feb-2017[Online].Available:https://github.com/DSC-SPIDAL/twister2/
[59]Twister,“Twister2wordcountexample.”Aug-2019.
[60] Twister, “Task examples.” Web Page, Feb-2017 [Online]. Available:https://twister2.gitbook.io/twister2/examples/task_examples
[61] Twister, “Communication Model.” Web Page, Feb-2017 [Online].Available:https://twister2.gitbook.io/twister2/concepts/communication/communication-model
[62]S.Kamburugamuveetal.,“Twister:Net-communicationlibraryforbigdataprocessing in hpc and cloud environments,” in 2018 ieee 11th internationalconferenceoncloudcomputing(cloud),2018,pp.383–391.
[63] Twister2, “Kmeans performance comparison.” Web Page, Jan-2019[Online].Available:https://twister2.gitbook.io/twister2/
[64] Twister, “Twister Examples.” Web Page, Feb-2017 [Online]. Available:https://twister2.gitbook.io/twister2/examples
[65] Docker, “Overview of docker hub.” Web Page, Mar-2018 [Online].Available:https://docs.docker.com/docker-hub/
[66]S.Bhartiya,“Howtousedockerhub.”Blog,Jan-2018[Online].Available:https://www.linux.com/blog/learn/intro-to-linux/2018/1/how-use-dockerhub
[67] Docker, “Repositories on docker hub.” Web Page, Mar-2018 [Online].Available:https://docs.docker.com/docker-hub/repos/
[68] R. Irani, “Docker tutorial series-part 4-docker hub.” Blog, Jul-2015[Online].Available: https://rominirani.com/docker-tutorial-series-part-4-docker-
hub-b51fb545dd8e
[69]G.M.Kurtzer,“SingularityContainersforScience.”Presentation,Jan-2019[Online]. Available: http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43
[70]NationalInstituteofStandars,“NISTbigdatapublicworkinggroup.”Aug-2019[Online].Available:https://bigdatawg.nist.gov/
[71] LIGO, “Ligo data grid.” Sep-2019 [Online]. Available: https://www.lsc-group.phys.uwm.edu/lscdatagrid/overview.html