INTELLIGENT SYSTEMS ENGINEERING
Geoffrey C. Fox, Gregor von Laszewski
(c) Gregor von Laszewski, 2018, 2019
1 PREFACE 1.1 Disclaimer ☁ 1.1.1 Acknowledgment 1.1.2 Extensions
1.2 Contributors ☁ 2 SYLLABUS 2.1 E222: Intelligent Systems Engineering II ☁ 2.1.1 Teaching and learning methods 2.1.2 Representative bibliography 2.1.3 Grading 2.1.4 Incomplete 2.1.5 Other classes I423, I523, I524, B649, E516, E616 2.1.6 Communication 2.1.6.1 How to take this class
2.1.7 Covered Topics 2.1.7.1 Week 1. Overview of this Class 2.1.7.2 Week 1 and 2. Review of Python for Intelligent Systems Engineering 2.1.7.3 Week 2. Review of Linux shell for OSX, Linux, and Windows 2.1.7.4 Week 3. Introduction to REST 2.1.7.5 Week 4. Introduction to Scientific Writing 2.1.7.6 Week 5 to 9. Introduction to Cloud Computing 2.1.7.7 Week 10: Lecture Free Time 2.1.7.8 Week 11. Introduction to Cloud Platforms 2.1.7.9 Week 12 to 16. Review of AI for AI-Cloud Computing Integration 2.1.7.10 Cloud Edge Computing 2.1.7.11 Alternative Projects
2.2 Assignments ☁ 2.2.1 Account Creation 2.2.2 Sections, Chapters, Examples 2.2.3 Project 2.2.3.1 Project Deliverables 2.2.3.2 Project Topic
2.2.4 Alternate Project: Virtual Cluster 2.2.5 Alternative Project: 100 node Raspberry Pi cluster 2.2.6 Submission of sections and chapters and projects
3 PYTHON 3.1 Introduction to Python ☁ 3.1.1 References
3.2 Python 3.7.4 Installation ☁ 3.2.1 Hardware 3.2.2 Prerequisites Ubuntu 19.04 3.2.3 Prerequisites macOS 3.2.3.1 Installation from Apple App Store 3.2.3.2 Installation from python.org 3.2.3.3 Installation from Homebrew
3.2.4 Prerequisites Ubuntu 18.04 3.2.5 Prerequisite Windows 10 3.2.5.1 Linux Subsystem Install
3.2.6 Prerequisite venv 3.2.7 Install Python 3.7 via Anaconda 3.2.7.1 Download conda installer 3.2.7.2 Install conda 3.2.7.3 Install Python 3.7.4 via conda
3.3 Interactive Python ☁ 3.3.1 REPL (Read Eval Print Loop) 3.3.2 Interpreter 3.3.3 Python 3 Features in Python 2
3.4 Editors ☁ 3.4.1 PyCharm 3.4.2 Python in 45 minutes
3.5 Language ☁ 3.5.1 Statements and Strings 3.5.2 Comments 3.5.3 Variables 3.5.4 Data Types 3.5.4.1 Booleans 3.5.4.2 Numbers
3.5.5 Module Management 3.5.5.1 Import Statement
3.5.5.2 The from … import Statement 3.5.6 Date Time in Python 3.5.7 Control Statements 3.5.7.1 Comparison 3.5.7.2 Iteration
3.5.8 Datatypes 3.5.8.1 Lists 3.5.8.2 Sets 3.5.8.3 Removal and Testing for Membership in Sets 3.5.8.4 Dictionaries 3.5.8.5 Dictionary Keys and Values 3.5.8.6 Counting with Dictionaries
3.5.9 Functions 3.5.10 Classes 3.5.11 Modules 3.5.12 Lambda Expressions 3.5.12.1 map 3.5.12.2 dictionary
3.5.13 Iterators 3.5.14 Generators 3.5.14.1 Generators with function 3.5.14.2 Generators using for loop 3.5.14.3 Generators with List Comprehension 3.5.14.4 Why to use Generators?
3.6 LIBRARIES 3.6.1 Python Modules ☁ 3.6.1.1 Updating Pip 3.6.1.2 Using pip to Install Packages 3.6.1.3 GUI 3.6.1.3.1 GUIZero 3.6.1.3.2 Kivy
3.6.1.4 Formatting and Checking Python Code 3.6.1.5 Using autopep8 3.6.1.6 Writing Python 3 Compatible Code 3.6.1.7 Using Python on FutureSystems 3.6.1.8 Ecosystem 3.6.1.8.1 pypi
3.6.1.8.2 Alternative Installations 3.6.1.9 Resources 3.6.1.9.1 Jupyter Notebook Tutorials
3.6.1.10 Exercises 3.6.2 Data Management ☁ 3.6.2.1 Formats 3.6.2.1.1 Pickle 3.6.2.1.2 Text Files 3.6.2.1.3 CSV Files 3.6.2.1.4 Excel spreadsheets 3.6.2.1.5 YAML 3.6.2.1.6 JSON 3.6.2.1.7 XML 3.6.2.1.8 RDF 3.6.2.1.9 PDF 3.6.2.1.10 HTML 3.6.2.1.11 ConfigParser 3.6.2.1.12 ConfigDict
3.6.2.2 Encryption 3.6.2.3 Database Access 3.6.2.4 SQLite 3.6.2.4.1 Exercises
3.6.3 Plotting with matplotlib ☁ 3.6.4 DocOpts ☁ 3.6.5 Cloudmesh Command Shell ☁ 3.6.5.1 CMD5 3.6.5.1.1 Resources 3.6.5.1.2 Installation from source 3.6.5.1.3 Execution 3.6.5.1.4 Create your own Extension 3.6.5.1.5 Bug: Quotes
3.6.6 cmd Module ☁ 3.6.6.1 Hello, World with cmd 3.6.6.2 A More Involved Example 3.6.6.3 Help Messages 3.6.6.4 Useful Links
3.6.7 OpenCV ☁
3.6.7.1 Overview 3.6.7.2 Installation 3.6.7.3 A Simple Example 3.6.7.3.1 Loading an image 3.6.7.3.2 Displaying the image 3.6.7.3.3 Scaling and Rotation 3.6.7.3.4 Gray-scaling 3.6.7.3.5 Image Thresholding 3.6.7.3.6 Edge Detection
3.6.7.4 Additional Features 3.6.8 Secchi Disk ☁ 3.6.8.1 Setup for OSX 3.6.8.2 Step 1: Record the video 3.6.8.3 Step 2: Analyse the images from the Video 3.6.8.3.1 Image Thresholding 3.6.8.3.2 Edge Detection 3.6.8.3.3 Black and white
3.7 DATA 3.7.1 Data Formats ☁ 3.7.1.1 YAML 3.7.1.2 JSON 3.7.1.3 XML
3.7.2 MongoDB in Python ☁ 3.7.2.1 Cloudmesh MongoDB Usage Quickstart 3.7.2.2 MongoDB 3.7.2.2.1 Installation 3.7.2.2.1.1 Installation procedure
3.7.2.2.2 Collections and Documents 3.7.2.2.2.1 Collection example 3.7.2.2.2.2 Document structure 3.7.2.2.2.3 Collection Operations
3.7.2.2.3 MongoDB Querying 3.7.2.2.3.1 Mongo Queries examples
3.7.2.2.4 MongoDB Basic Functions 3.7.2.2.4.1 Import/Export functions examples
3.7.2.2.5 Security Features 3.7.2.2.5.1 Collection based access control example
3.7.2.2.6 MongoDB Cloud Service 3.7.2.3 PyMongo 3.7.2.3.1 Installation 3.7.2.3.2 Dependencies 3.7.2.3.3 Running PyMongo with Mongo Daemon 3.7.2.3.4 Connecting to a database using MongoClient 3.7.2.3.5 Accessing Databases 3.7.2.3.6 Creating a Database 3.7.2.3.7 Inserting and Retrieving Documents (Querying) 3.7.2.3.8 Limiting Results 3.7.2.3.9 Updating Collection 3.7.2.3.10 Counting Documents 3.7.2.3.11 Indexing 3.7.2.3.12 Sorting 3.7.2.3.13 Aggregation 3.7.2.3.14 Deleting Documents from a Collection 3.7.2.3.15 Copying a Database 3.7.2.3.16 PyMongo Strengths
3.7.2.4 MongoEngine 3.7.2.4.1 Installation 3.7.2.4.2 Connecting to a database using MongoEngine 3.7.2.4.3 Querying using MongoEngine
3.7.2.5 Flask-PyMongo 3.7.2.5.1 Installation 3.7.2.5.2 Configuration 3.7.2.5.3 Connection to multiple databases/servers 3.7.2.5.4 Flask-PyMongo Methods 3.7.2.5.5 Additional Libraries 3.7.2.5.6 Classes and Wrappers
3.7.3 Mongoengine ☁ 3.7.3.1 Introduction 3.7.3.2 Install and connect 3.7.3.3 Basics
3.8 CALCULATION 3.8.1 Word Count with Parallel Python ☁ 3.8.1.1 Generating a Document Collection 3.8.1.2 Serial Implementation
3.8.1.3 Serial Implementation Using map and reduce 3.8.1.4 Parallel Implementation 3.8.1.5 Benchmarking 3.8.1.6 Exercises 3.8.1.7 References
3.8.2 NumPy ☁ 3.8.2.1 Installing NumPy 3.8.2.2 NumPy Basics 3.8.2.3 Data Types: The Basic Building Blocks 3.8.2.4 Arrays: Stringing Things Together 3.8.2.5 Matrices: An Array of Arrays 3.8.2.6 Slicing Arrays and Matrices 3.8.2.7 Useful Functions 3.8.2.8 Linear Algebra 3.8.2.9 NumPy Resources
3.8.3 Scipy ☁ 3.8.3.1 Introduction 3.8.3.2 References
3.8.4 Scikit-learn ☁ 3.8.4.1 Introduction to Scikit-learn 3.8.4.2 Installation 3.8.4.3 Supervised Learning 3.8.4.4 Unsupervised Learning 3.8.4.5 Building an end to end pipeline for Supervised machine learning using Scikit-learn 3.8.4.6 Steps for developing a machine learning model 3.8.4.7 Exploratory Data Analysis 3.8.4.7.1 Bar plot 3.8.4.7.2 Correlation between attributes 3.8.4.7.3 Histogram Analysis of dataset attributes 3.8.4.7.4 Box plot Analysis 3.8.4.7.5 Scatter plot Analysis
3.8.4.8 Data Cleansing - Removing Outliers 3.8.4.9 Pipeline Creation 3.8.4.9.1 Defining DataFrameSelector to separate Numerical and Categorical attributes 3.8.4.9.2 Feature Creation / Additional Feature Engineering
3.8.4.10 Creating Training and Testing datasets 3.8.4.11 Creating pipeline for numerical and categorical attributes 3.8.4.12 Selecting the algorithm to be applied 3.8.4.12.1 Linear Regression 3.8.4.12.2 Logistic Regression 3.8.4.12.3 Decision trees 3.8.4.12.4 KMeans 3.8.4.12.5 Support Vector Machines 3.8.4.12.6 Naive Bayes 3.8.4.12.7 Random Forest 3.8.4.12.8 Neural networks 3.8.4.12.9 Deep Learning using Keras 3.8.4.12.10 XGBoost
3.8.4.13 Scikit Cheat Sheet 3.8.4.14 Parameter Optimization 3.8.4.14.1 Hyperparameter optimization/tuning algorithms
3.8.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline) 3.8.4.15.1 Creating a parameter grid 3.8.4.15.2 Implementing Grid search with models and also creating metrics from each of the models 3.8.4.15.3 Results table from the Model evaluation with metrics 3.8.4.15.4 ROC AUC Score
3.8.4.16 K-means in scikit learn 3.8.4.16.1 Import
3.8.4.17 K-means Algorithm 3.8.4.17.1 Import 3.8.4.17.2 Create samples 3.8.4.17.3 Create samples 3.8.4.17.4 Visualize 3.8.4.17.5 Visualize
3.8.5 Parallel Computing in Python ☁ 3.8.5.1 Multi-threading in Python 3.8.5.1.1 Thread vs Threading 3.8.5.1.2 Locks
3.8.5.2 Multi-processing in Python 3.8.5.2.1 Process
3.8.5.2.2 Pool 3.8.5.2.2.1 Synchronous Pool.map() 3.8.5.2.2.2 Asynchronous Pool.map_async()
3.8.5.2.3 Locks 3.8.5.2.4 Process Communication 3.8.5.2.4.1 Value
3.8.6 Dask - Random Forest Feature Detection ☁ 3.8.6.1 Setup 3.8.6.2 Dataset 3.8.6.3 Detecting Features 3.8.6.3.1 Data Preparation
3.8.6.4 Random Forest 3.8.6.5 Acknowledgement
4 DEVOPS TOOLS 4.1 Refcards ☁ 4.2 VirtualBox ☁ 4.2.1 Installation 4.2.2 Guest additions 4.2.3 Exercises
4.3 Vagrant ☁ 4.3.1 Installation 4.3.1.1 macOS 4.3.1.2 Windows 4.3.1.3 Linux
4.3.2 Usage 4.4 Linux Shell ☁ 4.4.1 History 4.4.2 Shell 4.4.3 The command man 4.4.4 Multi-command execution 4.4.5 Keyboard Shortcuts 4.4.6 bashrc, bash_profile or zprofile 4.4.7 Makefile 4.4.8 chmod 4.4.9 Exercises
4.5 Secure Shell ☁ 4.5.1 ssh-keygen
4.5.2 ssh-add 4.5.3 SSH Add and Agent 4.5.3.1 Using SSH on Mac OS X 4.5.3.2 Using SSH on Linux 4.5.3.3 Using SSH on Raspberry Pi 3/4 4.5.3.4 Accessing a Remote Machine
4.5.4 SSH Port Forwarding 4.5.4.1 Prerequisites 4.5.4.2 How to Restart the Server 4.5.4.3 Types of Port Forwarding 4.5.4.4 Local Port Forwarding 4.5.4.5 Remote Port Forwarding 4.5.4.6 Dynamic Port Forwarding 4.5.4.7 ssh config 4.5.4.8 Tips 4.5.4.9 References
4.5.5 SSH to FutureSystems Resources ☁ 4.5.5.1 Testing your FutureSystems ssh key
4.5.6 Exercises ☁ 4.6 Github ☁ 4.6.1 Overview 4.6.2 Upload Key 4.6.3 Fork 4.6.4 Rebase 4.6.5 Remote 4.6.6 Pull Request 4.6.7 Branch 4.6.8 Checkout 4.6.9 Merge 4.6.10 GUI 4.6.11 Windows 4.6.12 Git from the Command line 4.6.13 Configuration 4.6.14 Upload your public key 4.6.15 Working with a directory that will be provided for you 4.6.16 README.yml and notebook.md 4.6.17 Contributing to the Document
4.6.17.1 Stay up to date with the original repo 4.6.17.2 Resources
4.6.18 Exercises 4.6.19 Github Issues 4.6.19.1 Git Issue Features 4.6.19.2 Github Markdown 4.6.19.2.1 Task lists 4.6.19.2.2 Team integration 4.6.19.2.3 Referencing Issues and Pull requests 4.6.19.2.4 Emojis
4.6.19.3 Notifications 4.6.19.4 cc 4.6.19.5 Interacting with issues
4.6.20 Glossary 4.6.21 Example commands 4.6.21.1 Local commands to version control your files 4.6.21.2 Interacting with the remote
4.7 Git Pull Request ☁ 4.7.1 Introduction 4.7.2 How to create a pull request 4.7.3 Fork the original repository 4.7.4 Clone your copy 4.7.5 Adding an upstream 4.7.6 Making changes 4.7.7 Creating a pull request
4.8 Tig ☁ 5 Introduction to Cloud Computing and Data Engineering for Cloud Computing and Machine Learning ☁ 5.1 A. Summary of Introduction to Cloud Computing & Data Engineering 5.2 B. Defining Clouds I 5.3 C. Defining Clouds II 5.4 D. Defining Clouds III 5.5 E. Virtualization 5.6 F. Technology Hypecycle I 5.7 G. Technology Hypecycle II 5.8 H. Cloud Infrastructure I 5.9 I. Cloud Infrastructure II
5.10 J. Cloud Software 5.11 K. Cloud Applications I 5.12 L. Cloud Applications II 5.13 M. Cloud Applications III 5.14 N. Clouds and Parallel Computing 5.15 O. Storage 5.16 P. HPC and Clouds 5.17 Q. Comparison of Data Analytics with Simulation 5.18 R. Jobs 5.19 S. The Future I 5.20 T. The Future and other Issues II 5.21 U. The Future and other Issues III
6 REST 6.1 Introduction to REST ☁ 6.1.0.1 Collection of Resources 6.1.0.2 Single Resource 6.1.0.3 REST Tool Classification
6.2 OpenAPI REST Services with Swagger ☁ 6.2.1 Swagger Tools 6.2.2 Swagger Community Tools 6.2.2.1 Converting Json Examples to OpenAPI YAML Models
6.3 OpenAPI 2.0 Specification ☁ 6.3.1 The Virtual Cluster example API Definition 6.3.1.1 Terminology 6.3.1.2 Specification
6.3.2 References 6.4 OpenAPI 3.0 REST Service via Introspection ☁ 6.4.1 Verification 6.4.2 Swagger-UI 6.4.3 Mock service 6.4.4 Exercise
6.5 OpenAPI REST Service via Codegen ☁ 6.5.1 Step 1: Define Your REST Service 6.5.2 Step 2: Server Side Stub Code Generation and Implementation 6.5.2.1 Setup the Codegen Environment 6.5.2.2 Generate Server Stub Code 6.5.2.3 Fill in the actual implementation
6.5.3 Step 3: Install and Run the REST Service: 6.5.3.1 Start a virtualenv: 6.5.3.2 Make sure you have the latest pip: 6.5.3.3 Install the requirements of the server side code: 6.5.3.4 Install the server side code package: 6.5.3.5 Run the service 6.5.3.6 Verify the service using a web browser:
6.5.4 Step 4: Generate Client Side Code and Verify 6.5.4.1 Client side code generation: 6.5.4.2 Install the client side code package: 6.5.4.3 Using the client API to interact with the REST service
6.5.5 Towards a Distributed Client Server 6.6 Flask RESTful Services ☁ 6.7 Rest Services with Eve ☁ 6.7.1 Ubuntu install of MongoDB 6.7.2 macOS install of MongoDB 6.7.3 Windows 10 Installation of MongoDB 6.7.4 Database Location 6.7.5 Verification 6.7.6 Building a simple REST Service 6.7.7 Interacting with the REST service 6.7.8 Creating REST API Endpoints 6.7.9 REST API Output Formats and Request Processing 6.7.10 REST API Using a Client Application 6.7.11 Towards cmd5 extensions to manage eve and mongo
6.8 HATEOAS ☁ 6.8.1 Filtering 6.8.2 Pretty Printing 6.8.3 XML
6.9 Extensions to Eve ☁ 6.9.1 Object Management with Eve and Evegenie 6.9.1.1 Installation 6.9.1.2 Starting the service 6.9.1.3 Creating your own objects
6.10 Django REST Framework ☁ 6.11 Github REST Services ☁ 6.11.1 Issues
6.11.2 Exercise 7 MAPREDUCE 7.1 Introduction to Mapreduce ☁ 7.1.1 MapReduce Algorithm 7.1.1.1 MapReduce Example: Word Count
7.1.2 Hadoop MapReduce and Hadoop Spark 7.1.2.1 Apache Spark 7.1.2.2 Hadoop MapReduce 7.1.2.3 Key Differences
7.1.3 References 7.2 Hadoop ☁ 7.2.1 Hadoop and MapReduce 7.2.2 Hadoop EcoSystem 7.2.3 Hadoop Components 7.2.4 Hadoop and the Yarn Resource Manager 7.2.5 PageRank
7.3 Installation of Hadoop ☁ 7.3.1 Releases 7.3.2 Prerequisites 7.3.3 User and User Group Creation 7.3.4 Configuring SSH 7.3.5 Installation of Java 7.3.6 Installation of Hadoop 7.3.7 Hadoop Environment Variables
7.4 Hadoop Virtual Cluster Installation Using Cloudmesh ☁ 7.4.1 Cloudmesh Cluster Installation 7.4.1.1 Create Cluster 7.4.1.2 Check Created Cluster 7.4.1.3 Delete Cluster
7.4.2 Hadoop Cluster Installation 7.4.2.1 Create Hadoop Cluster 7.4.2.2 Delete Hadoop Cluster
7.4.3 Advanced Topics with Hadoop 7.4.3.1 Hadoop Virtual Cluster with Spark and/or Pig 7.4.3.2 Word Count Example on Spark
7.5 SPARK 7.5.1 Spark Lectures ☁
7.5.1.1 Motivation for Spark 7.5.1.2 Spark RDD Operations 7.5.1.3 Spark DAG 7.5.1.4 Spark vs. other Frameworks
7.5.2 Installation of Spark ☁ 7.5.2.1 Prerequisites 7.5.2.2 Installation of Java 7.5.2.3 Install Spark with Hadoop 7.5.2.4 Spark Environment Variables 7.5.2.5 Test Spark Installation 7.5.2.6 Install Spark With Custom Hadoop 7.5.2.7 Configuring Hadoop 7.5.2.8 Test Spark Installation
7.5.3 Spark Streaming ☁ 7.5.3.1 Streaming Concepts 7.5.3.2 Simple Streaming Example 7.5.3.3 Spark Streaming For Twitter Data 7.5.3.3.1 Step 1 7.5.3.3.2 Step 2 7.5.3.3.3 Step 3 7.5.3.3.4 Step 4 7.5.3.3.5 Step 5 7.5.3.3.6 Step 6
7.5.4 User Defined Functions in Spark ☁ 7.5.4.1 Resources 7.5.4.2 Instructions for Spark installation 7.5.4.2.1 Linux
7.5.4.3 Windows 7.5.4.4 MacOS 7.5.4.5 Instructions for creating Spark User Defined Functions 7.5.4.5.1 Example: Temperature conversion 7.5.4.5.1.1 Description about dataset 7.5.4.5.1.2 How to write a python program with UDF 7.5.4.5.1.3 How to execute a python spark script 7.5.4.5.1.4 Filtering and sorting
7.5.4.6 Instructions to install and run the example using docker 7.6 ADVANCED HADOOP
7.6.1 Amazon EMR (Elastic MapReduce) ☁ 7.6.1.1 Why EMR? 7.6.1.2 Understanding Clusters and Nodes 7.6.1.2.1 Submit Work to a Cluster 7.6.1.2.2 Processing Data
7.6.1.3 AWS Storage 7.6.1.4 Create EMR in AWS 7.6.1.4.1 Create the buckets 7.6.1.4.2 Create Key Pairs 7.6.1.4.2.1 Create Key Value Pair Screenshots
7.6.1.5 Create Step Execution – Hadoop Job 7.6.1.5.0.1 Screenshots
7.6.1.6 Create a Hive Cluster 7.6.1.6.1 Create a Hive Cluster - Screenshots
7.6.1.7 Create a Spark Cluster 7.6.1.7.1 Create a Spark Cluster - Screenshots
7.6.2 Twister2 ☁ 7.6.2.1 Introduction 7.6.2.2 Twister2 API’s 7.6.2.2.1 TSet API 7.6.2.2.2 Task API
7.6.2.3 Operator API 7.6.2.3.1 Resources
7.6.3 Twister2 Installation ☁ 7.6.3.1 Prerequisites 7.6.3.1.1 Maven Installation 7.6.3.1.2 OpenMPI Installation 7.6.3.1.3 Install Extras 7.6.3.1.4 Compiling Twister2 7.6.3.1.5 Twister2 Distribution
7.6.4 Twister2 Examples ☁ 7.6.4.1 Submitting a Job 7.6.4.2 Batch WordCount Example
7.6.5 HADOOP RDMA ☁ 7.6.5.1 Launching a Virtual Hadoop Cluster on Bare-metal InfiniBand Nodes with SR-IOV on Chameleon 7.6.5.2 Launching Virtual Machines Manually
7.6.5.3 Extra Initialization when Launching Virtual Machines 7.6.5.4 Important Note for Tearing Down Virtual Machines and Deleting Network Ports
8 CONTAINER 8.1 Introduction to Containers ☁ 8.1.1 Motivation - Microservices 8.1.2 Motivation - Serverless Computing 8.1.3 Docker 8.1.4 Docker and Kubernetes
8.2 DOCKER 8.2.1 Introduction to Docker ☁ 8.2.1.1 Docker Engine 8.2.1.2 Docker Architecture 8.2.1.3 Docker Survey
8.2.2 Running Docker Locally ☁ 8.2.2.1 Installation for OSX 8.2.2.2 Installation for Ubuntu 8.2.2.3 Installation for Windows 10 8.2.2.4 Testing the Install
8.2.3 Dockerfile ☁ 8.2.3.1 Specification 8.2.3.2 References
8.2.4 Docker Hub ☁ 8.2.4.1 Create Docker ID and Log In 8.2.4.2 Searching for Docker Images 8.2.4.3 Pulling Images 8.2.4.4 Create Repositories 8.2.4.5 Pushing Images 8.2.4.6 Resources
8.3 DOCKER AS PAAS 8.3.1 Docker Swarm ☁ 8.3.1.1 Terminology 8.3.1.2 Creating a Docker Swarm Cluster 8.3.1.3 Create a Swarm Cluster with VirtualBox 8.3.1.4 Initialize the Swarm Manager Node and Add Worker Nodes 8.3.1.5 Deploy the application on the swarm manager
8.3.2 Docker and Docker Swarm on FutureSystems ☁
8.3.2.1 Getting Access 8.3.2.2 Creating a service and deploy to the swarm cluster 8.3.2.3 Create your own service 8.3.2.4 Publish an image privately within the swarm cluster 8.3.2.5 Exercises
8.3.3 Hadoop with Docker ☁ 8.3.3.1 Building Hadoop using Docker 8.3.3.2 Hadoop Configuration Files 8.3.3.3 Virtual Memory Limit 8.3.3.4 hdfs Safemode leave command 8.3.3.5 Examples 8.3.3.5.1 Statistical Example with Hadoop 8.3.3.5.1.1 Base Location 8.3.3.5.1.2 Input Files 8.3.3.5.1.3 Compilation 8.3.3.5.1.4 Archiving Class Files 8.3.3.5.1.5 HDFS for Input/Output 8.3.3.5.1.6 Run Program with a Single Input File 8.3.3.5.1.7 Result for Single Input File 8.3.3.5.1.8 Run Program with Multiple Input Files 8.3.3.5.1.9 Result for Multiple Files
8.3.3.5.2 Conclusion 8.3.3.6 References
8.3.4 Docker Pagerank ☁ 8.3.4.1 Use the automated script 8.3.4.2 Compile and run by hand
8.3.5 Apache Spark with Docker ☁ 8.3.5.1 Pull Image from Docker Repository 8.3.5.2 Running the Image 8.3.5.2.1 Running interactively 8.3.5.2.2 Running in the background
8.3.5.3 Run Spark 8.3.5.3.1 Run Spark in Yarn-Client Mode 8.3.5.3.2 Run Spark in Yarn-Cluster Mode
8.3.5.4 Observe Task Execution from Running Logs of SparkPi 8.3.5.5 Write a Word-Count Application with Spark RDD 8.3.5.5.1 Launch Spark Interactive Shell
8.3.5.5.2 Program in Scala 8.3.5.5.3 Launch PySpark Interactive Shell 8.3.5.5.4 Program in Python
8.3.5.6 Docker Spark Examples 8.3.5.6.1 K-Means Example 8.3.5.6.2 Join Example 8.3.5.6.3 Word Count
8.3.5.7 Interactive Examples 8.3.5.7.1 Stop Docker Container 8.3.5.7.2 Start Docker Container Again 8.3.5.7.3 Remove Docker Container
8.4 KUBERNETES 8.4.1 Introduction to Kubernetes ☁ 8.4.1.1 What are containers? 8.4.1.2 Terminology 8.4.1.3 Kubernetes Architecture 8.4.1.4 Minikube 8.4.1.4.1 Install minikube 8.4.1.4.2 Start a cluster using Minikube 8.4.1.4.3 Create a deployment 8.4.1.4.4 Expose the service 8.4.1.4.5 Check running status 8.4.1.4.6 Call service api 8.4.1.4.7 Take a look from Dashboard 8.4.1.4.8 Delete the service and deployment 8.4.1.4.9 Stop the cluster
8.4.1.5 Interactive Tutorial Online 8.4.2 Using Kubernetes on FutureSystems ☁ 8.4.2.1 Getting Access 8.4.2.2 Example Use 8.4.2.3 Exercises
8.5 SINGULARITY 8.5.1 Running Singularity Containers on Comet ☁ 8.5.1.1 Background 8.5.1.2 Tutorial Contents 8.5.1.3 Why Singularity? 8.5.1.4 Hands-On Tutorials
8.5.1.5 Downloading & Installing Singularity 8.5.1.5.1 Download & Unpack Singularity 8.5.1.5.2 Configure & Build Singularity 8.5.1.5.3 Install & Test Singularity
8.5.1.6 Building Singularity Containers 8.5.1.6.1 Upgrading Singularity
8.5.1.7 Create an Empty Container 8.5.1.8 Import Into a Singularity Container 8.5.1.9 Shell Into a Singularity Container 8.5.1.10 Write Into a Singularity Container 8.5.1.11 Bootstrapping a Singularity Container 8.5.1.12 Running Singularity Containers on Comet 8.5.1.12.1 Transfer the Container to Comet 8.5.1.12.2 Run the Container on Comet 8.5.1.12.3 Allocate Resources to Run the Container 8.5.1.12.4 Integrate the Container with Slurm 8.5.1.12.5 Use Existing Comet Containers
8.5.1.13 Using Tensorflow With Singularity 8.5.1.14 Run the job
8.6 Exercises ☁ 9 NIST 9.1 NIST Big Data Reference Architecture ☁ 9.1.1 Pathway to the NIST-BDRA 9.1.2 Big Data Characteristics and Definitions 9.1.3 Big Data and the Cloud 9.1.4 Big Data, Edge Computing and the Cloud 9.1.5 Reference Architecture 9.1.6 Framework Providers 9.1.7 Application Providers 9.1.8 Fabric 9.1.9 Interface definitions
10 AI 10.1 Artificial Intelligence Service with REST ☁ 10.1.1 Unsupervised Learning 10.1.2 KMeans 10.1.3 Lab: Practice on AI 10.1.4 k-NN
10.1.5 Machine Learning and Cloud Services 10.1.5.1 Introduction and Regression 10.1.5.2 K-means Clustering 10.1.5.3 Visualization 10.1.5.4 Clustering Examples 10.1.5.5 General Clustering with Examples 10.1.5.6 In Depth Example with four centers 10.1.5.7 Parallel Computing and K-means
10.1.6 Example Project with SVM 11 REFERENCES
1 PREFACE

Sat Nov 23 05:18:45 EST 2019 ☁
1.1 DISCLAIMER ☁

This book has been generated with Cyberaide Bookmanager.

Bookmanager is a tool to create a publication from a number of sources on the internet. It is especially useful to create customized books, lecture notes, or handouts. Content is best integrated in markdown format as it is very fast to produce the output.

Bookmanager has been developed based on our experience over the last 3 years with a more sophisticated approach. Bookmanager takes the lessons from this approach and distributes a tool that can easily be used by others.

The following shields provide some information about it. Feel free to click on them.

pypi v0.2.28 | License Apache 2.0 | python 3.7 | format wheel | status stable | build unknown
1.1.1 Acknowledgment

If you use bookmanager to produce a document you must include the following acknowledgement.

“This document was produced with Cyberaide Bookmanager developed by Gregor von Laszewski available at https://pypi.python.org/pypi/cyberaide-bookmanager. It is in the responsibility of the user to make sure an author acknowledgement section is included in your document. Copyright verification of content included in a book is responsibility of the book editor.”
The bibtex entry is:

@Misc{www-cyberaide-bookmanager,
  author = {Gregor von Laszewski},
  title = {{Cyberaide Book Manager}},
  howpublished = {pypi},
  month = apr,
  year = 2019,
  url = {https://pypi.org/project/cyberaide-bookmanager/}
}

1.1.2 Extensions

We are happy to discuss with you bugs, issues and ideas for enhancements. Please use the convenient github issues at

https://github.com/cyberaide/bookmanager/issues

Please do not file issues with us that relate to an editor's book. The editors will provide you with their own mechanism on how to correct their content.

1.2 CONTRIBUTORS ☁

Contributors are sorted by the first letter of their combined first name and last name, and if not available, by their github ID. Please note that the authors are identified through git logs in addition to some contributors added by hand. The git repository from which this document is derived contains more than the documents included in this document. Thus not everyone in this list may have directly contributed to this document. However, if you find someone missing that has contributed (they may not have used this particular git), please let us know. We will add you. The contributors that we are aware of include:

Anand Sriramulu, Ankita Rajendra Alshi, Anthony Duer, Arnav, Averill Cate, Jr, Bertolt Sobolik, Bo Feng, Brad Pope, Brijesh, Dave DeMeulenaere, De’Angelo Rutledge, Eliyah Ben Zayin, Eric Bower, Fugang Wang, Geoffrey C. Fox, Gerald Manipon, Gregor von Laszewski, Hyungro Lee, Ian Sims, IzoldaIU, Javier Diaz, Jeevan Reddy Rachepalli, Jonathan Branam, Juliette Zerick, Keith Hickman, Keli Fine, Kenneth Jones, Mallik Challa, Mani Kagita, Miao Jiang, Mihir Shanishchara, Min Chen, Murali Cheruvu, Orly Esteban, Pulasthi Supun, Pulasthi Supun Wickramasinghe, Pulkit Maloo, Qianqian Tang, Ravinder Lambadi, Richa Rastogi, Ritesh Tandon, Saber Sheybani, Sachith Withana, Sandeep Kumar Khandelwal, Sheri Sanders, Shivani Katukota, Silvia Karim, Swarnima H. Sowani, Tharak Vangalapat, Tim Whitson, Tyler Balson, Vafa Andalibi, Vibhatha Abeykoon, Vineet Barshikar, Yu Luo, ahilgenkamp, aralshi, azebrowski, bfeng, brandonfischer99, btpope, garbeandy, harshadpitkar, himanshu3jul, hrbahramian, isims1, janumudvari, joshish-iu, juaco77, karankotz, keithhickman08, kkp, mallik3006, manjunathsivan, niranda perera, qianqian tang, rajni-cs, rirasto, sahancha, shilpasingh21, swsachith, toshreyanjain, trawat87, tvangalapat, varunjoshi01, vineetb-gh, xianghangmi, zhengyili4321
2 SYLLABUS

2.1 E222: INTELLIGENT SYSTEMS ENGINEERING II ☁

In this undergraduate course students will be familiarized with different specific applications and implementations of intelligent systems and their use in desktop and cloud solutions.

Piazza: Link
Registrar: Link
Lecture Notes: ePub
Indiana University Faculty: Geoffrey C. Fox
Credits: 3
Hardware: You will need a computer to take this class; a phone, tablet, or chromebook is not sufficient.
Prerequisite(s): Knowledge of a programming language, the ability to pick up other programming languages as needed, willingness to enhance your knowledge from online resources and additional literature. You will need access to a modern computer that allows using virtual machines and/or containers. If such a system is not available to you, you can also use IU computers or cloud virtual machines. The latter have to be requested.
Course Description: Link
This is an introductory class. In case you like to do research and more advanced topics, consider taking an independent study with Dr. Fox or Dr. von Laszewski.

An introduction video is available at:

222 Class Introduction and Management
2.1.1 Teaching and learning methods

Lectures
Assignments including specific lab activities
Final project
2.1.2 Representative bibliography

1. Cloud Computing for Science and Engineering by Ian Foster and Dennis B. Gannon, https://mitpress.mit.edu/books/cloud-computing-science-and-engineering
2. (This document) Handbook of Clouds and Big Data, Gregor von Laszewski, Geoffrey C. Fox, and Judy Qiu, Fall 2017, https://tinyurl.com/vonLaszewski-handbook
3. Use Cases in Big Data Software and Analytics Vol. 1, Gregor von Laszewski, Fall 2017, https://tinyurl.com/cloudmesh/vonLaszewski-i523-v1.pdf
4. Use Cases in Big Data Software and Analytics Vol. 2, Gregor von Laszewski, Fall 2017, https://tinyurl.com/cloudmesh/vonLaszewski-i523-v2.pdf
5. Use Cases in Big Data Software and Analytics Vol. 3, Gregor von Laszewski, Fall 2017, https://tinyurl.com/vonLaszewski-projects-v3
6. Big Data Software Vol 1., Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/paper1/proceedings.pdf
7. Big Data Software Vol 2., Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/paper2/proceedings.pdf
8. Big Data Projects, Gregor von Laszewski, Spring 2017, https://github.com/cloudmesh/sp17-i524/blob/master/project/projects.pdf
9. Gregor von Laszewski, Geoffrey C. Fox, Cloud Computing and Big Data, http://cyberaide.org/papers/vonLaszewski-bigdata.pdf
10. Introduction to Python for Cloud Computing, https://laszewski.github.io/book/python/
2.1.3 Grading

Grade Item      Percentage
Assignments     30%
Final Project   60%
Participation   10%
2.1.4 Incomplete

Please see the university regulations for getting an incomplete. However, as this class uses state-of-the-art technology that changes frequently, you must expect that an incomplete may result in significant additional work on your behalf, as your project may need significant updates on infrastructure, technology, or even programming models used. It is best to complete the course within one semester.
2.1.5 Other classes I423, I523, I524, B649, E516, E616

IU offers other undergraduate classes in this topic area such as I423. If you are interested in taking it, please see when they are taught. Additional related graduate level classes, which can also be taken only by special permission, include:

CSCI B-649 Cloud Computing is the same as E516 but for computer science students.
I524 is the same as E516 but for Data Engineering students.
E516 Introduction to Cloud Computing and Cloud Engineering.

All of these classes are project based and require a significant and consistent effort of time on your side.
2.1.6 Communication

To ask for help use piazza:

Piazza Resources
Piazza Questions

Please do not use CANVAS for communicating with us. Use Piazza. Make sure you have access to Piazza, while posting your formal Bio.

2.1.6.1 How to take this class

This class is an undergraduate class that contains two sections that you must attend.
In thisdocumentwewill introduceyoutheoretically tosomeconcepts thatareimportantforthisclass.Thisisdoneeitherthroughlectures,writtenmaterial,orpointerstoWebresources.Youareresponsibleto
1. listentotheonlinelecturesandunderstandthem.2. identifyadditionalmaterialthatmayhelpyouinunderstandingthelectures.
Thiscouldincludeadditionalresourcesontheinternet3. Contributetothematerialbycorrectingerrorsandupdatesyoumayfind.
Pleasenotethatwetrytokeepthematerialuptodatewithyourhelp.However,in our field software and documentation changes quickly and if you identifyupdatedmaterialweexpectthatyouhelpusfixingit.Youwillgetcreditdoingso.
Toallowyoutobemostflexibleintakingthisclass,wecertainlyallowyoutoworkahead.Thusyoucanuseallbuttheinpersonlecturesaheadoftime.TheSyllabuswillclearlyidentifywhichmaterialisavailable.Notethatthebookmayinclude sections that are notmarked in the syllabus.You do not have to readsuchsections.
Pleasenotethatthisclassdoesnothavesmallassignmentsandanyassignmentis likelytotakeyouasignificantamountof time.Thusit isadvisablethatyoustartyourassignmentsearlyandmakesureyoudonotdotheminthelastweekbefore the assignment is due. This contrasts other undergraduate classes, thatmayfocusontheassignmentofanumberoftoyexerises.Insteadwewillworkthroughout the entire semester towards a project youwill conduct. In order tomake it earlier for you, we will introduce graded checkpoints of all largeassignments.Thegradesforthesecheckpointsarefinalandcannotbeimprovedby work done later. Also here please be advised that somemay take severalweeks to conduct and it is your responsibility to devote enough time to theseactivities.
To assure progress, you will have to manage a notebook.md file in your github directory (that we will create for you) in which you will update your weekly progress. If you miss a lecture, it is your responsibility to inform yourself what was being taught. Attendance and participation will be graded, as will the updates to notebook.md.

In the following calendar we put in the last day of the week when the assignments are typically due.
2.1.7 Covered Topics

As part of this class you will have to explore the following topics. These topics are either included in this document, or we are pointing you within this document to other documents with the information.

If we forgot anything, let us know. The order of the lectures and the lecture material are subject to change as we see fit.

This weekly agenda will be updated every week. You are required to check in every week for updates. At this time we have included an approximate weekly agenda.

To see the differences to previous versions of this document, you can look at:

https://github.com/cloudmesh-community/book/commits/master/chapters/e222-syllabus.md

To see if checkins succeed you can look at:

https://circleci.com/gh/cloudmesh-community/book

Currently, the topics covered in the class include the following.
2.1.7.1 Week 1. Overview of this Class

We will provide an overview of this class.

Logistic: Get familiar with the class structure.

Read: Preface; Class Overview; Start reviewing your python knowledge
Assignment Accounts: Find a computer you can do the class programming on (a tablet or chromebook will not suffice). Get an account on piazza.com with your ??? name. Get an account on github.com (this is NOT the IU github) and apply there for a github username. Post the username into a form that will be sent to you. Make sure that the account you send us is your github.com account. This is a graded assignment that must be completed in the first week of class.

This must be completed in the first week by Friday. (Survey will be posted on Piazza.)

Assignment notebook: Once you get your github directory, update the file notebook.md. Mind the spelling: notebook is lowercase. Use simple markdown bullet lists to record your activities, as shown below.
Assignment Development Environment: (Multi-week assignment, to be completed in the first month.) It is important that you have a development environment to conduct the class assignments. We recommend that you use virtualbox and use ubuntu. We have provided an extensive set of material for you to achieve this in this document. Please consult additional resources from the Web and utilize the Lab hours.
2.1.7.2 Week 1 and 2. Review of Python for Intelligent Systems Engineering

Theory: basic Python Language
Theory: pyenv, setup.py, modules

Practice: Living without anaconda

Python specific topics include:

Assignment: Install Python and use it throughout the semester
Why not anaconda?
Using python 3.7
pip

Language
Numpy
Scipy
OpenCV
Scikit Learn

Report: Create an empty report based on our template in github. The TAs may do this with you in the Lab.

Github Pull Requests: Find a spelling error in the class material and create a pull request to correct it.
2.1.7.3 Week 2. Review of Linux shell for OSX, Linux, and Windows

Theory: Basic Linux Shell
Practice: SSH
Assignment: ssh key generation on your computer, upload to github.com (a command sketch follows)
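On Linux, macOS, and recent Windows 10 versions the key can, for example, be generated with the following command; accept the default file location and choose a passphrase when prompted:

$ ssh-keygen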
2.1.7.4 Week 3. Introduction to REST

Theory: Overview of REST, Eve, OpenAPI
Practice: develop a REST service with OpenAPI

Theory: Learn about REST services and use Swagger OpenAPI to create a rest service that returns the CPU information about your computer

We will be starting the class with introducing you to REST services that provide a foundation for setting up services in the cloud and to interact with these services. As part of this class we will be revisiting the REST services and use them to deploy them on a cloud as well as develop our own AI based rest services in the second half of the class.

Focus on the OpenAPI example posted in the NIST github repository

Project team: Build a project team with no more than 3 people. There will not be an exception. You are allowed to work alone. Make sure your project team does the work together. E.g., you must not have 3 people on the team while the project could have been executed by a single person. In case of more than one person, the sum of the deliverables must be larger than what one team member can achieve. It is an advantage to work in a team as you can check each other.

If a team member does not contribute to the project, the team has the right to exclude the non-working team member in consultation with the instructors. We will have a joint meeting with the team to identify the best path forward. Choose your team members wisely. Ideally you should make this decision in the first 3 weeks.
2.1.7.5 Week 4. Introduction to Scientific Writing

Theory: Scientific writing with markdown and bibtex
Practice: Contribute a significant chapter to the book (as a group)
Practice: Project Report (as a group)
Practice: Introduction to Emacs
Practice: Introduction to jabref

See the separate ePub for more information: Link

Assignment Scientific Writing: Learn about markdown. See our class notes and internet resources. Note that we use pandoc markdown that may not render properly in github, especially when it comes to figure captions, references, and bibliography. (You have till the end of the month.) Install and use jabref.

Report: Learn bibtex and create references in report.bib that you use in report.md. Make sure that you do only one report per team and update your README.yml file accordingly. Check in the Lab with the TAs if you have done it correctly.
Project Idea due: A one page formal document that summarizes the project. This is not a proposal. The words I, project, and report must not be used. It is essentially a snapshot of your final report. Discuss with the TAs in the Lab how to define a project.

Github: make sure your teammates have access to your project directory.
2.1.7.6 Week 5 to 9. Introduction to Cloud Computing

Theory:

Introduction - Part A
Introduction - Part B - Defining Clouds I
Introduction - Part C - Defining Clouds II
Introduction - Part D - Defining Clouds III
Introduction - Part E - Virtualization
Introduction - Part F - Technology Hypecycle I
Introduction - Part G - Technology Hypecycle II
Introduction - Part H - IaaS I
Introduction - Part I - IaaS II
Introduction - Part J - Cloud Software
Introduction - Part K - Applications I
Introduction - Part M - Applications III
Introduction - Part N - Parallelism
Introduction - Part O - Storage (Released)
Introduction - Part P - HPC in the Cloud (Released)
Introduction - Part Q - Analytics and Simulation (Released)
Introduction - Part R - Jobs (Released)
Introduction - Part S - The Future (Released)
Introduction - Part T - Security (Released)
Introduction - Part U - Fault Tolerance

Practice: Manage virtual machines with Virtualbox
Practice: Manage virtual machines with Cloudmesh v4
Practice: Manage a container with Docker

Theory: Containers
Week 5: Project Update due: A two page formal document that summarizes the project. This is not a proposal. The words I, project, and report must not be used. It is essentially a snapshot of your final report.

Week 7: Project Update due: A multi-paragraph description about the data that you use for your project is to be added to your report. This includes details about the data. In a documented program you showcase how you download the data with python requests in an automated fashion.

Week 8: Project Update Due: Have a documented program ready that uses a REST service to obtain data for your analysis. Identify how to do benchmarks and time the execution of your project. Add planned benchmarks to your report. Do not use the word plan or will; write it in such a form as if it were done. Instead put a marker on benchmarks that you are still working on.

Week 9: Project update: Study matplotlib and Bokeh and identify how to visualize other aspects of your projects. You are also allowed to use D3.js and addons to it. You are not allowed to use Tableau.
2.1.7.7 Week 10: Lecture Free Time

March 10-17 Lecture free time, no class support. A good week to work ahead on your project.
2.1.7.8 Week 11. Introduction to Cloud Platforms

We will introduce you to the concept of Mapreduce. We will discuss systems such as Hadoop and Spark and how they differ. You will be deploying Hadoop via a container on your machine and use it to gain hands-on experience. We start with using cloudmesh on your computer to manage virtual machines that you may be able to use during your test developments.

Background about Hadoop, Spark and Twister

Theory: Background to Cloudmesh
Theory: Background to Hadoop
Theory: Background to Spark
Theory: Background to Twister

Week 11 Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data.
2.1.7.9 Week 12 to 16. Review of AI for AI-Cloud Computing Integration

Theory: Introduction to basic AI

Practice: Develop a nontrivial AI REST service

See #sec:ai

Overview of AI for this class
Theory: Unsupervised Learning
Deep Learning
Forecasting

Week 12: Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data.

Week 13: Project update: Identify analysis algorithms for your project and apply them. Experiment with what you can do with the data. Start benchmarks.

Week 14: Project update: Focus on your project report and finalize it. The project report must include references in bibtex format. Double-check integration in proceedings.

Week 15: Apr 19 - Project due date.
As the project will take time to grade, all projects are due two weeks (yes, you read correctly) before the semester ends. The project will have the following artifacts:

completed project report
completed project code
completed instructions on how to replicate your project on someone else’s computer or a cloud service
any other outstanding task.

Week 16: Apr 26

Project improvement if needed (majority should be finished)

Make sure your project report is showing up correctly in the proceedings
2.1.7.10 Cloud Edge Computing

If time allows, we may additionally cover:

Theory: Raspberry PI as Platform

2.1.7.11 Alternative Projects

If you are interested, the following could be chosen by you as a project. Participation in these projects needs to be approved by Dr. von Laszewski. The project starts in this case in week 2 or 3.

Project (if elected): Document the build of a 100 node Raspberry PI Cluster
Project: Environmental Robot Boat
2.2 ASSIGNMENTS ☁

For more details see the course syllabus and overview pages. We give here just some summary.

2.2.1 Account Creation

As part of the class you will need a number of accounts:

piazza.com
github.com
Optional accounts include (only apply for them if you know you need them.Note that applying for some accountsmay take 1 - 2weeks to complete, youshould have identified before themiddle of the semester if you need some ofthem.
futuresystems.org(optional)chameleoncloud.org(optional)
aws.com(optional)google.com(optional)azure.com(optional)watsonfromIBM(optional)googleIaas(optional)
Inourpiazzawehavedetailshowtosubmitthemtous.Wesplitthesubmissioninmultiplesub-assignmentsasthegithub.comandpiazza.comareneededwithinthefirstweek.
2.2.2 Sections, Chapters, Examples

As part of the class, we expect you to get familiar with topics related to intelligent systems engineering. Those that like to go for an A+ are also expected to contribute significantly to this document or have a truly outstanding project. This is done in Sections, Examples, and Chapters, or excellent project reports and code.

Section:

A section is a small section that explains a topic that is not yet in the handbook or improves an existing section significantly. It is typically multiple paragraphs long and can even include an example if needed. Example sections that have been provided are for example the Lambda section in the python chapter.

Samples of student contributed sections include:

Project Natick
Lambda Expressions

(please fix links)
Chapter:

A chapter is a much longer topic and is a coherent description of a topic related to cloud computing. A chapter could either be a review of a topic or a detailed technical contribution. Several Sections (10+) may be a substitute for a chapter.

You will be contributing a significant chapter that can be used by other students in the class and introduces the reader to a general topic related to the topic of the class. In addition it is expected, if applicable, to develop a practical example demonstrating how to use a technology. The chapter and the practical example can be done together. We do not like to use the term tutorial in our writeup, but sometimes we refer to it in our assignments as such. Chapters that focus on theory may not have an example and it can be substituted by a longer text.

A sample of a student contributed chapter is GraphQL.
Example:

An example is a document that showcases the use of a particular technology. Typically it is a console session or a program. Examples augment chapters and Sections.

It is expected from you that you self-identify a section or a chapter, as this shows competence in the area of cloud computing. If however you do not know what to select, you must attend an online hour with us in which we identify sections and chapters with you. The emphasis here is that we do not decide them for you, but we identify them with you.

Sample topics that could form a section or chapter are clearly marked with a special symbol. There are plenty in the handbook, but you are welcome to define your own contributions. Discuss them with us in the online hours.
A list of topics identified by students is maintained in a spreadsheet.

See https://piazza.com/class/jgxybbf5rnx5qd?cid=201 for details.

You are expected to sign up in this spreadsheet. This is done to avoid overlap and foster uniqueness of the assignment for sections and chapters.
2.2.3 Project

Project:

We refer with the term project to the major activity that you chose as part of your class. The default case is an implementation project that requires a project report and project code.

License:

All projects are developed under an open source license such as Apache 2.0 License. You will be required to add a LICENCE.txt file and, if you use other software, identify how it can be reused in your project. If your project uses different licenses, please add in a README.md file which packages are used and which license these packages have.
Project Report:

A project report is an enhanced topic paper that includes not just the analysis of a topic, but actual code, with benchmark and demonstrated application use. Obviously it is longer than a term paper and includes descriptions about reproducibility of the application. A README.md is provided that describes how others can reproduce your project and run it. Remember tables and figures do not count towards the paper length. The following length is required:

4 pages, one student in the project
6 pages, two students in the project
8 pages, three students in the project

We estimate that a single page is between 1000-1200 words. Please note that for 2018 the format will be markdown, so the word count will be used instead. How to use figures is explained in the Notation of the handbook. We use bibtex for bibliographies. Please be reminded that images and tables as well as code are excluded from the page length. Make sure that your text is mostly developed by midterm time.
Project Code:

This is the documented and reproducible code and scripts that allows a TA to replicate the project. In case you use images they must be created from scratch locally and may not be uploaded to services such as dockerhub. You can however reuse vendor uploaded images such as from ubuntu or centos. All code, scripts, and documentation must be uploaded to github.com under the class specific github directory.
Data:

Data is to be hosted on IU’s google drive if needed. If you have larger data, it should be downloaded from the internet. It is your responsibility to develop a download program. The data must not be stored in github. You will be expected to write a python program that downloads the data.
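A minimal sketch of such a download program is shown next; the URL and file name are placeholders that you would replace with the location of your own dataset:

import requests

# hypothetical location of the dataset; replace with your own
url = "https://example.com/dataset.csv"

# fetch the file and fail loudly if the server reports an error
response = requests.get(url, timeout=60)
response.raise_for_status()

# store the content locally so the analysis can read it from disk
with open("dataset.csv", "wb") as f:
    f.write(response.content)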
Work Breakdown:

This is an appendix to the document that describes in detail who did what in the project. This section comes on a new page after the references. It does not count towards the page length of the document. It also includes explicit URLs to the git history that documents the statistics to demonstrate that not only one student has worked on the project. If you cannot provide such a statistic, or all check-ins have been made by a single student, the project has shown that they have not properly used git. Thus points will be deducted from the project. Furthermore, if we detect that a student has not contributed to a project, we may invite the student to give a detailed presentation of the project.
Bibliography:

All bibliography has to be provided in a jabref/bibtex file. There is NO EXCEPTION to this rule. Please be advised that doing references right takes some time, so you want to do this early. Please note that exports of Endnote or other bibliography management tools do not lead to properly formatted bibtex files, despite them claiming to do so. You will have to clean them up, and we recommend to do it the other way around: manage your bibliography with jabref. Make sure labels only include characters from [a-zA-Z0-9-]. Use dashes and not underscore or : in the label.
2.2.3.1 Project Deliverables

The objective of the project is to define a clear problem statement and create a framework to address that problem as it relates to cloud computing. In this class it is especially important to address the reproducibility of the deployment. A test and benchmark, possibly including a dataset, must be used to verify the correctness of your approach. Projects related to NIST focus on the specification and implementation. The report here can be smaller, but the contribution must be includable in the specification document.

In general any project must be deployable by the TA. If it takes hours to deploy your project, please talk to us before final submission.

You have plenty of time to execute a wonderful project.
The deliverables include but need to be updated according to your specificproject, for example if you do Edge Computing some deliverabl;es will bedifferent:
Providebenchmarks.
Take results in two different cloud services and your local PC (ex:ChameleonCloud,echokubernetes).Makesureyoursystemcanbecreatedanddeployedbasedonyourdocumentation.
Each team member must provide a benchmark on their computer and acloudIaaS,wherethecloudisdifferentfromeachteammember.
CreateaMakefilewith the tagsdeploy,run,kill,view,clean thatdeploysyourenvironment,runsapplication,killsit,viewstheresultandcleansupafterwards.Youare allowed tohavedifferentmakefiles for thedifferentcloudsanddifferentdirectories.Keepthecodeanddirectorystructurecleananddocumenthowtoreproduceyourresults.
For python use a requirements.txt file also.

For docker use a Dockerfile also.
Write a report that includes the following sections:

Abstract
Introduction
Design
Architecture
Implementation
Technologies Used
Results
Deployment Benchmarks
Application Benchmarks
(Limitations)
Conclusion
(Work Breakdown)

Your paper will not have a Future Work section, as this implies that you will do work in the future and your paper is incomplete; instead you can use an optional “Limitations” section.
2.2.3.2 Project Topic

As part of this class you will be developing an OpenAPI based Artificial Intelligence REST service and demonstrate its use. You will be developing documentation and a report that showcases the use of the service. The OpenAPI service must be non trivial, e.g. you should show upload of data, submission of parameters including the function to be executed, and potential development of a GUI for the service.

We will work with you to solidify the project throughout the semester.
2.2.4 Alternate Project: Virtual Cluster

All students can contribute to the creation of the Virtual Cluster code that we will be using throughout the class to improve and interface with cloud and container frameworks. This project is typically done in a graduate class, but interested undergraduates can contribute also. Those that like to contribute must have significant programming experience in either Python or Javascript. This project could replace the regular AI REST service project. A weekly meeting and demonstrated progress has to be shown to Gregor von Laszewski.

https://github.com/cloudmesh-community/cm

The residential students have been assigned this task, but online students can join and contribute.
2.2.5 Alternative Project: 100 node Raspberry Pi cluster

In this project you will be developing a 100 node Raspberry PI cluster. This includes putting the hardware together, and developing software that allows using all 100 nodes as a cluster. Software is to be used to make management easy. It is not sufficient to just install the software; you must develop a framework that allows us to easily share this resource with other users.

Documentation has to be written for this project so others can replicate your cluster build. A good start for this is to look at our cm-burn command that creates Raspberry PI OS based on manipulation of the file system:

https://github.com/cloudmesh/cm-burn
https://github.com/cloudmesh-community/cm

Substantial contributions are expected beyond the hardware build. We also like to design a case with a Laser cutter for the 100 nodes. Building the cluster would take place in MESH and transportation to and from it is provided by the university. You will be able to work in an office there to put the cluster together. A weekly meeting with Gregor von Laszewski or the TAs is needed to showcase progress.
2.2.6 Submission of sections and chapters and projects

Sections and subsections are to be added to the book github repo. Do a pull request. The headline of the section needs to be marked with the appropriate status symbol: one if you still work on it, another if you want it to be graded. It must also list the hids of all people that contribute to that section.
In addition, simply add them to your README.yml file in your github repo.Addthefollowingtoit(Iamusinga18-516-18asexample).
Please look at https://github.com/cloudmesh-community/fa18-516-18 andhttps://raw.githubusercontent.com/cloudmesh-community/fa18-523-62/master/README.ymlforanexamples.Pleasenotethatincaseyouworkinagroupthecodeandreportissupposedtobeonlystoredinthefirsthidmentionedin the group field. If you store it in multiple directories your project will berejected.
YouMUST run yamllint on the README.yml file. YAML errors will givepointdeductions.
section:
  - title: title of the section 1
    url: https://github.com/cloudmesh-community/book/chapters/...
  - title: title of the section 2
    url: https://github.com/cloudmesh-community/book/chapters/...
  - title: title of the section 3
    url: https://github.com/cloudmesh-community/book/chapters/...
chapter:
  - title: title of the chapter
    url: https://github.com/cloudmesh-community/fa18-516-18/blob/master/chapter/whatever.md
    group: fa18-523-62 fa18-523-69
    keyword: whatever
project:
  - title: title of the project
    url: url in your hid space or that of your partner
    group: fa18-523-62 fa18-523-69
    keyword: kubernetes, NIST, Database
    code: the url to the code
other:
  - activity: spell checked md document
    url: put url here
3 PYTHON

3.1 INTRODUCTION TO PYTHON ☁

Learning Objectives

Learn quickly Python under the assumption you know a programming language
Work with modules
Understand docopts and cmd
Conduct some python examples to refresh your python knowledge
Learn about the map function in Python
Learn how to start subprocesses and redirect their output
Learn more advanced constructs such as multiprocessing and Queues
Understand why we do not use anaconda
Get familiar with pyenv

Portions of this lesson have been adapted from the official Python Tutorial copyright Python Software Foundation.
Python is an easy to learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation. The Python interpreter can be extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

Python is an interpreted, dynamic, high-level programming language suitable for a wide range of applications.
The philosophy of python is summarized in The Zen of Python as follows:

Explicit is better than implicit
Simple is better than complex
Complex is better than complicated
Readability counts
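These are only a few of the lines; you can print the complete Zen of Python yourself with a one-liner:

$ python -c "import this"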
The main features of Python are:

Use of indentation whitespace to indicate blocks
Object orient paradigm
Dynamic typing
Interpreted runtime
Garbage collected memory management
a large standard library
a large repository of third-party libraries

Python is used by many companies and is applied for web development, scientific computing, embedded applications, artificial intelligence, software development, and information security, to name a few.
The material collected here introduces the reader to the basic concepts andfeaturesofthePythonlanguageandsystem.Afteryouhaveworkedthroughthematerialyouwillbeableto:
usePythonusetheinteractivePythoninterfaceunderstandthebasicsyntaxofPythonwriteandrunPythonprogramshaveanoverviewofthestandardlibraryinstall Python libraries using pyenv for multipython interpreterdevelopment.
It does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules.
In order to conduct this lesson you need:

A computer with Python 2.7.16 or 3.7.4
Familiarity with command line usage
A text editor such as PyCharm, emacs, vi, or others. You should identify which works best for you and set it up.
3.1.1 References

Some important additional information can be found on the following Web pages:

Python
Pip
Virtualenv
NumPy
SciPy
Matplotlib
Pandas
pyenv
PyCharm

Python module of the week is a Web site that provides a number of short examples on how to use some elementary python modules. Not all modules are equally useful and you should decide if there are better alternatives. However, for beginners this site provides a number of good examples:

Python 2: https://pymotw.com/2/
Python 3: https://pymotw.com/3/
3.2 PYTHON 3.7.4 INSTALLATION ☁

Learning Objectives

Learn how to install python.
Find additional information about Python.
Make sure your Computer supports Python.

In this section we explain how to install python 3.7.4 on a computer. Likely much of the code will work with earlier versions, but we do the development in Python on the newest version of python available at https://www.python.org/downloads.
3.2.1 Hardware

Python does not require any special hardware. We have installed Python not only on PCs and Laptops, but also on Raspberry PIs and Lego Mindstorms.

However, there are some things to consider. If you use many programs on your desktop and run them all at the same time, you will find that in up-to-date operating systems you quickly run out of memory. This is especially true if you use editors such as PyCharm, which we highly recommend. Furthermore, as you likely have lots of disk access, make sure to use a fast HDD or better an SSD.

A typical modern developer PC or Laptop has 16GB RAM and an SSD. You can certainly do python on a $35 Raspberry PI, but you probably will not be able to run PyCharm. There are many alternative editors with less memory footprint available.
3.2.2 Prerequisites Ubuntu 19.04

Python 3.7 is installed in ubuntu 19.04. Therefore, it already fulfills the prerequisites. However, we recommend that you update to the newest version of python and pip. Please visit: https://www.python.org/downloads
3.2.3 Prerequisites macOS

3.2.3.1 Installation from Apple App Store

You want a number of useful tools on your macOS. They are not installed by default, but are available via Xcode. First you need to install xcode from

https://apps.apple.com/us/app/xcode/id497799835

Next you need to install the macOS xcode command line tools:

$ xcode-select --install

3.2.3.2 Installation from python.org

The easiest installation of Python for cloudmesh is to use the installation from https://www.python.org/downloads. Please visit the page and follow the instructions. After this install you have python3 available from the command line.

3.2.3.3 Installation from Homebrew

An alternative installation is provided from Homebrew. To use this install method, you need to install Homebrew first. Start the process by installing python3 using homebrew. Install homebrew using the instructions on their web page:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then you should be able to install Python 3.7.4 using:

$ brew install python

3.2.4 Prerequisites Ubuntu 18.04

We recommend you update your ubuntu version to 19.04 and follow the instructions for that version instead, as it is significantly easier. If you however are not able to do so, the following instructions may be helpful.

We first need to make sure that the correct version of Python3 is installed. The default version of Python on Ubuntu 18.04 is 3.6. You can get the version with:

$ python3 --version
If the version is not 3.7.4 or newer, you can update it as follows:

$ sudo apt-get update
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt-get install python3.7 python3-dev python3.7-dev

You can then check the installed version using python3.7 --version, which should be 3.7.4.

Now we will create a new virtual environment:

$ python3.7 -m venv --without-pip ~/ENV3

Then edit the ~/.bashrc file and add the following line at the end:

alias ENV3="source ~/ENV3/bin/activate"

Now activate the virtual environment using:

$ source ~/.bashrc
$ ENV3

Now you can install the pip for the virtual environment without conflicting with the native pip:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py
$ rm get-pip.py

3.2.5 Prerequisite Windows 10

Python 3.7 can be installed on Windows 10 using: https://www.python.org/downloads

For 3.7.4 you can go to the download page and download one of the different files for Windows.

Let us assume you chose the Web based installer; then you click on the file in the edge browser (make sure the account you use has administrative privileges). Follow the instructions that the installer gives. Important is that you select at one point “[x] Add to Path”. There will be an empty checkmark about this that you will click on.

Once it is installed, choose a terminal and execute

python --version
However, ifyouhave installedconda for somereasonyouneed to readuponhowtoinstall3.7.4pythonincondaoridentifyhowtoruncondaandpython.orgatthesametime.Weseeoftenothersgivingthewronginstallationinstructions.
Analternative is tousepythonfromwithin theLinuxSubsystem.But thathassomelimitationsandyouwillneedtoexplorehowtoexxessthefilesysteminthesubssytemtohaveasmoothintegrationbetweenyourWindowshostsoyoucanforexampleusePyCharm.
3.2.5.1 Linux Subsystem Install
To activate the Linux Subsystem, please follow the instructions at
https://docs.microsoft.com/en-us/windows/wsl/install-win10
A suitable distribution would be
https://www.microsoft.com/en-us/p/ubuntu-1804-lts/9n9tngvndl3q?activetab=pivot:overviewtab
However, as it uses an older version of Python, you will have to update it.
3.2.6 Prerequisite venv
This step is highly recommended if you have not already installed a venv for Python, to make sure you are not interfering with your system Python. Not using a venv could have catastrophic consequences, including the destruction of your operating system tools if they rely on Python. The use of venv is simple. For our purposes we assume that you use the directory:
~/ENV3
Follow these steps first: cd to your home directory, then execute
$ python3 -m venv ~/ENV3
$ source ~/ENV3/bin/activate
You can add at the end of your .bashrc (Ubuntu) or .bash_profile (macOS) file the line
source ~/ENV3/bin/activate
so the environment is always loaded. Now you are ready to install cloudmesh.
Check if you have the right version of Python installed with
$ python --version
To make sure you have an up-to-date version of pip, issue the command
$ pip install pip -U
3.2.7 Install Python 3.7 via Anaconda
3.2.7.1 Download conda installer
Miniconda is recommended here. Download an installer for Windows, macOS, and Linux from this page: https://docs.conda.io/en/latest/miniconda.html
3.2.7.2 Install conda
Follow the instructions to install conda for your operating system:
- Windows: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html
- macOS: https://conda.io/projects/conda/en/latest/user-guide/install/macos.html
- Linux: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
3.2.7.3 Install Python 3.7.4 via conda
$ cd ~
$ conda create -n ENV3 python=3.7.4
$ conda activate ENV3
$ conda install -c anaconda pip
$ conda deactivate ENV3
It is very important to make sure you have a newer version of pip installed. After you have installed and created the ENV3, you need to activate it. This can be done with
$ conda activate ENV3
If you like to activate it when you start a new terminal, please add this line to your .bashrc or .bash_profile.
If you use zsh, please add it to .zprofile instead.
3.3 INTERACTIVE PYTHON ☁
Python can be used interactively. You can enter the interactive mode by entering the interactive loop by executing the command:
$ python
You will see something like the following:
Python 3.7.1 (default, Nov 24 2018, 14:27:15)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
The >>> is the prompt used by the interpreter. This is similar to bash, where commonly $ is used.
Sometimes it is convenient to show the prompt when illustrating an example. This is to provide some context for what we are doing. If you are following along, you will not need to type in the prompt.
This interactive Python process does the following:
- read your input commands
- evaluate your command
- print the result of the evaluation
- loop back to the beginning
This is why you may see the interactive loop referred to as a REPL: Read-Evaluate-Print-Loop.
3.3.1 REPL (Read Eval Print Loop)
There are many different types beyond what we have seen so far, such as dictionaries, lists, and sets. One handy way of using the interactive Python is to get the type of a value using type():
>>> type(42)
<class 'int'>
>>> type('hello')
<class 'str'>
>>> type(3.14)
<class 'float'>
You can also ask for help about something using help():
>>> help(int)
>>> help(list)
>>> help(str)
Using help() opens up a help message within a pager. To navigate you can use the spacebar to go down a page, w to go up a page, the arrow keys to go up/down line-by-line, or q to exit.
3.3.2 Interpreter
Although the interactive mode provides a convenient tool to test things out, you will quickly see that for our class we want to use the Python interpreter from the command line. Let us assume the program is called prg.py. Once you have written it in that file, you simply can call it with
$ python prg.py
It is important to name the program with a meaningful name.
3.3.3 Python 3 Features in Python 2
In this course we want to be able to seamlessly switch between Python 2 and Python 3. Thus it is convenient from the start to use Python 3 syntax when it is also supported in Python 2. One of the most used functions is the print statement, which in Python 3 has parentheses. To enable it in Python 2 you just need to import this function:
from __future__ import print_function, division
The first of these imports allows us to use the print function to output text to the screen, instead of the print statement, which Python 2 uses. This is simply a design decision that better reflects Python’s underlying philosophy.
Other functions such as the division also behave differently. Thus we use
from __future__ import division
This import makes sure that the division operator behaves in a way a newcomer to the language might find more intuitive. In Python 2, division / is floor division when the arguments are integers, meaning that the following
(5 / 2 == 2) is True
In Python 3, division / is a floating point division, thus
(5 / 2 == 2.5) is True
3.4 EDITORS ☁
This section is meant to give an overview of the Python editing tools needed for this course. There are many other alternatives; however, we do recommend using PyCharm.
3.4.1 PyCharm
PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with git.
Python 8:56 PyCharm
3.4.2 Python in 45 minutes
An additional community video about the Python programming language that we found on the internet. Naturally there are many alternatives to this video, but the video is probably a good start. It also uses PyCharm, which we recommend.
Python 43:16 PyCharm
How much you want to understand of Python is actually a bit up to you. While it is good to know classes and inheritance, you may be able to get away without using them for this class. However, we do recommend that you learn them.
PyCharm Installation:
Method 1: PyCharm installation on Ubuntu using umake
sudo add-apt-repository ppa:ubuntu-desktop/ubuntu-make
sudo apt-get update
sudo apt-get install ubuntu-make
Once umake is installed, use the next command to install the PyCharm community edition:
umake ide pycharm
If you want to remove PyCharm installed using the umake command, use this:
umake -r ide pycharm
Method 2: PyCharm installation on Ubuntu using PPA
sudo add-apt-repository ppa:mystic-mirage/pycharm
sudo apt-get update
sudo apt-get install pycharm-community
PyCharm also has a Professional (paid) version, which can be installed using the following command:
sudo apt-get install pycharm
Once installed, go to your VM dashboard and search for PyCharm.
3.5 LANGUAGE ☁
3.5.1 Statements and Strings
Let us explore the syntax of Python while starting with a print statement:
print("Hello world from Python!")
This will print on the terminal:
Hello world from Python!
The print function was given a string to process. A string is a sequence of characters. A character can be alphabetic (A through Z, lower and upper
print("HelloworldfromPython!")
HelloworldfromPython!
case), numeric (any of the digits), white space (spaces, tabs, newlines, etc.), syntactic directives (comma, colon, quotation, exclamation, etc.), and so forth. A string is just a sequence of characters and is typically indicated by surrounding the characters in double quotes.
Standard output is discussed in the Section Linux.
So, what happened when you pressed Enter? The interactive Python program read the line print("Hello world from Python!"), split it into the print statement and the "Hello world from Python!" string, and then executed the line, showing you the output.
3.5.2 Comments
Comments in Python start with a #:
# This is a comment
3.5.3 Variables
You can store data into a variable to access it later. For instance:
hello = 'Hello world from Python!'
print(hello)
This will print again
Hello world from Python!
3.5.4 Data Types
3.5.4.1 Booleans
A boolean is a value that can have the values True or False. You can combine booleans with boolean operators such as and and or:
print(True and True)    # True
print(True and False)   # False
print(False and False)  # False
print(True or True)     # True
print(True or False)    # True
print(False or False)   # False
3.5.4.2 Numbers
The interactive interpreter can also be used as a calculator. For instance, say we wanted to compute a multiple of 21:
print(21 * 2)  # 42
We saw here the print statement again. We passed in the result of the operation 21 * 2. An integer (or int) in Python is a numeric value without a fractional component (those are called floating point numbers, or float for short).
The mathematical operators compute the related mathematical operation on the provided numbers. Some operators are:
Operator  Function
*         multiplication
/         division
+         addition
-         subtraction
**        exponent
Exponentiation x**y is x to the y-th power:
print(2 ** 3)  # 8
You can combine floats and ints:
print(3.14 * 42 / 11 + 4 - 2)  # 13.9890909091
print(1 + 2 * 3 - 4 / 5.0)     # 6.2
Note that operator precedence is important. Using parentheses to affect the order of operations gives different results, as expected:
print(3.14 * (42 / 11) + 4 - 2)  # 11.42
print((1 + 2) * (3 - 4) / 5.0)   # -0.6
3.5.5 Module Management
A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can bind and reference. A module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.
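To illustrate the idea, here is a minimal sketch; the file name greet.py and the function in it are made up for this example:
# greet.py -- a small module defining a single function
def say_hello(name):
    return 'Hello ' + name
Any other Python program located next to this file can then import and use it:
import greet

print(greet.say_hello('Albert'))  # Hello Albert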
3.5.5.1 Import Statement
You can use any Python source file as a module by executing an import statement in some other Python source file. When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module. It is preferred to use a separate line for each import, such as:
import numpy
import matplotlib
3.5.5.2 The from…import Statement
Python’s from statement lets you import specific attributes from a module into the current namespace. The from…import has the following syntax:
from datetime import datetime
3.5.6 Date Time in Python
The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.
from datetime import datetime
The dateutil module offers a generic date/time string parser which is able to parse most known formats to represent a date and/or time:
from dateutil.parser import parse
pandas is an open source Python library for data analysis that needs to be imported:
import pandas as pd
Create a string variable with the class start time:
fall_start = '08-21-2018'
Convert the string to datetime format:
datetime.strptime(fall_start, '%m-%d-%Y')
# datetime.datetime(2018, 8, 21, 0, 0)
Creating a list of strings as dates:
class_dates = [
    '8/25/2017',
    '9/1/2017',
    '9/8/2017',
    '9/15/2017',
    '9/22/2017',
    '9/29/2017']
Convert class_dates strings into datetime format and save the list into variable a:
a = [datetime.strptime(x, '%m/%d/%Y') for x in class_dates]
Use parse() to attempt to auto-convert common string formats. The argument to parse must be a string or character stream, not a list:
parse(fall_start)  # datetime.datetime(2018, 8, 21, 0, 0)
Use parse() on every element of the class_dates list:
[parse(x) for x in class_dates]
# [datetime.datetime(2017, 8, 25, 0, 0),
#  datetime.datetime(2017, 9, 1, 0, 0),
#  datetime.datetime(2017, 9, 8, 0, 0),
#  datetime.datetime(2017, 9, 15, 0, 0),
#  datetime.datetime(2017, 9, 22, 0, 0),
#  datetime.datetime(2017, 9, 29, 0, 0)]
Use parse, but designate that the day is first:
parse(fall_start, dayfirst=True)
# datetime.datetime(2018, 8, 21, 0, 0)
Create a dataframe. A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet or database table. Think of a DataFrame as a group of Series objects that share an index (the column names):
data = {
    'dates': [
        '8/25/2017 18:47:05.069722',
        '9/1/2017 18:47:05.119994',
        '9/8/2017 18:47:05.178768',
        '9/15/2017 18:47:05.230071',
        '9/22/2017 18:47:05.230071',
        '9/29/2017 18:47:05.280592'],
    'complete': [1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(
    data,
    columns=['dates', 'complete'])
print(df)
#                        dates  complete
# 0  8/25/2017 18:47:05.069722         1
# 1   9/1/2017 18:47:05.119994         0
# 2   9/8/2017 18:47:05.178768         1
# 3  9/15/2017 18:47:05.230071         1
# 4  9/22/2017 18:47:05.230071         0
# 5  9/29/2017 18:47:05.280592         1
Convert df['dates'] from string to datetime:
pd.to_datetime(df['dates'])
# 0   2017-08-25 18:47:05.069722
# 1   2017-09-01 18:47:05.119994
# 2   2017-09-08 18:47:05.178768
# 3   2017-09-15 18:47:05.230071
# 4   2017-09-22 18:47:05.230071
# 5   2017-09-29 18:47:05.280592
# Name: dates, dtype: datetime64[ns]
3.5.7 Control Statements
3.5.7.1 Comparison
Computer programs do not only execute instructions. Occasionally, a choice needs to be made. Such a choice is based on a condition. Python has several conditional operators:
Operator  Function
>         greater than
<         smaller than
==        equals
!=        is not
Conditions are always combined with variables. A program can make a choice using the if keyword. For example:
x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
In this example, Correct! will only be printed if the variable x equals four. Python can also execute multiple conditions using the elif and else keywords:
x = int(input("Guess x:"))
if x == 4:
    print('Correct!')
elif abs(4 - x) == 1:
    print('Wrong, but close!')
else:
    print('Wrong, way off!')
3.5.7.2 Iteration
To repeat code, the for keyword can be used. For example, to display the numbers from 1 to 10, we could write something like this:
for i in range(1, 11):
    print(i)
The second argument to range, 11, is not inclusive, meaning that the loop will only get to 10 before it finishes. Python itself starts counting from 0, so this code will also work:
for i in range(0, 10):
    print(i + 1)
In fact, the range function defaults to a starting value of 0, so it is equivalent to:
for i in range(10):
    print(i + 1)
We can also nest loops inside each other:
for i in range(0, 10):
    for j in range(0, 10):
        print(i, ' ', j)
In this case we have two nested loops. The code will iterate over the entire coordinate range (0,0) to (9,9).
3.5.8 Datatypes
3.5.8.1 Lists
see: https://www.tutorialspoint.com/python/python_lists.htm
Lists in Python are ordered sequences of elements, where each element can be accessed using a 0-based index.
To define a list, you simply list its elements between square brackets [ ]:
names = [
    'Albert',
    'Jane',
    'Liz',
    'John',
    'Abby']
# access the first element of the list
names[0]
# 'Albert'
# access the third element of the list
names[2]
# 'Liz'
You can also use a negative index if you want to start counting elements from the end of the list. Thus, the last element has index -1, the second before last element has index -2, and so on:
# access the last element of the list
names[-1]
# 'Abby'
# access the second last element of the list
names[-2]
# 'John'
Python also allows you to take whole slices of the list by specifying a beginning and end of the slice separated by a colon:
# the middle elements, excluding first and last
names[1:-1]
# ['Jane', 'Liz', 'John']
As you can see from the example, the starting index in the slice is inclusive and the ending one, exclusive.
Python provides a variety of methods for manipulating the members of a list.
You can add elements with append:
names.append('Liz')
names
# ['Albert', 'Jane', 'Liz',
#  'John', 'Abby', 'Liz']
As you can see, the elements in a list need not be unique.
Merge two lists with extend:
names.extend(['Lindsay', 'Connor'])
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Abby', 'Liz', 'Lindsay', 'Connor']
Find the index of the first occurrence of an element with index:
names.index('Liz')  # 2
Remove elements by value with remove:
names.remove('Abby')
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']
Remove elements by index with pop:
names.pop(1)
# 'Jane'
names
# ['Albert', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']
Notice that pop returns the element being removed, while remove does not.
If you are familiar with stacks from other programming languages, you can use insert and pop:
names.insert(0, 'Lincoln')
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay', 'Connor']
names.pop()
# 'Connor'
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']
The Python documentation contains a full list of list operations.
To go back to the range function you used earlier, it simply creates a list of numbers (in Python 3, wrap it in list() to see the elements):
range(10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
range(2, 10, 2)
# [2, 4, 6, 8]
3.5.8.2 Sets
Python lists can contain duplicates, as you saw previously:
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']
When we do not want this to be the case, we can use a set:
unique_names = set(names)
unique_names
# set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'])
Keep in mind that the set is an unordered collection of objects, thus we cannot access them by index:
unique_names[0]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: 'set' object does not support indexing
However, we can convert a set to a list easily:
unique_names = list(unique_names)
unique_names
# ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']
unique_names[0]
# 'Lincoln'
Notice that in this case, the order of elements in the new list matches the order in which the elements were displayed when we created the set. We had set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']) and now we have ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'].
You should not assume this is the case in general. That is, do not make any assumptions about the order of elements in a set when it is converted to any type of sequential data structure.
You can change a set’s contents using the add, remove and update methods, which correspond to the append, remove and extend methods in a list. In addition to these, set objects support the operations you may be familiar with from mathematical sets: union, intersection, difference, as well as operations to check containment. You can read about this in the Python documentation for sets.
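As a small sketch of these mathematical operations (the variables a and b are made up for this illustration):
a = {1, 2, 3}
b = {3, 4}

print(a.union(b))         # {1, 2, 3, 4}
print(a.intersection(b))  # {3}
print(a.difference(b))    # {1, 2}
print(3 in a)             # True, a containment check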
3.5.8.3 Removal and Testing for Membership in Sets
One important advantage of a set over a list is that access to elements is fast. If you are familiar with different data structures from a Computer Science class, the Python list is implemented by an array, while the set is implemented by a hash table.
We will demonstrate this with an example. Let us say we have a list and a set of the same number of elements (approximately 100 thousand):
import sys, random, timeit

nums_set = set([random.randint(0, sys.maxsize) for _ in range(10**5)])
nums_list = list(nums_set)
len(nums_set)
# 100000
We will use the timeit Python module to time 100 operations that test for the existence of a member in either the list or set:
timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_set), number=100)
# 0.0004038810729980469
timeit.timeit('random.randint(0, sys.maxsize) in nums',
    setup='import random, sys; nums=%s' % str(nums_list), number=100)
# 0.398054122924804
The exact duration of the operations on your system will be different, but the takeaway will be the same: searching for an element in a set is orders of magnitude faster than in a list. This is important to keep in mind when you work with large amounts of data.
3.5.8.4 Dictionaries
One of the very important data structures in Python is a dictionary, also referred to as dict.
A dictionary represents a key value store:
person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
print("person['Name']:", person['Name'])
# person['Name']: Albert
print("person['Age']:", person['Age'])
# person['Age']: 100
A convenient way to print by named attributes is
print('{Name} {Age}'.format(**person))
This form of printing with the format statement and a reference to data increases the readability of the print statements.
You can delete elements with the following commands:
del person['Name']  # remove entry with key 'Name'
person
# {'Age': 100, 'Class': 'Scientist'}
person.clear()      # remove all entries in dict
person
# {}
del person          # delete entire dictionary
person
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# NameError: name 'person' is not defined
You can iterate over a dict:
person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}
for item in person:
    print(item, person[item])
# Age 100
# Name Albert
# Class Scientist
3.5.8.5 Dictionary Keys and Values
You can retrieve both the keys and values of a dictionary using the keys() and values() methods of the dictionary, respectively:
person.keys()    # ['Age', 'Name', 'Class']
person.values()  # [100, 'Albert', 'Scientist']
Both methods return lists. Notice, however, that the order in which the elements appear in the returned lists (Age, Name, Class) is different from the order in which we listed the elements when we declared the dictionary initially (Name, Age, Class). It is important to keep this in mind:
You cannot make any assumptions about the order in which the elements of a dictionary will be returned by the keys() and values() methods.
However, you can assume that if you call keys() and values() in sequence, the order of elements will at least correspond in both methods. In the example Age corresponds to 100, Name to Albert, and Class to Scientist, and you will observe the same correspondence in general as long as keys() and values() are called one right after the other.
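A related convenience worth knowing is the items() method, which yields the key/value pairs together, so the correspondence never becomes an issue:
for key, value in person.items():
    print(key, value)
# Age 100
# Name Albert
# Class Scientist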
3.5.8.6 Counting with Dictionaries
One application of dictionaries that frequently comes up is counting the elements in a sequence. For example, say we have a sequence of coin flips:
import random

coin_flips = [
    random.choice(['heads', 'tails']) for _ in range(10)
]
# coin_flips
# ['heads', 'tails', 'heads',
#  'tails', 'heads', 'heads',
#  'tails', 'heads', 'heads', 'heads']
The actual list coin_flips will likely be different when you execute this on your computer, since the outcomes of the flips are random.
To compute the probabilities of heads and tails, we could count how many heads and tails we have in the list:
counts = {'heads': 0, 'tails': 0}
for outcome in coin_flips:
    assert outcome in counts
    counts[outcome] += 1
print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
# Probability of heads: 0.70
print('Probability of tails: %.2f' % (counts['tails'] / sum(counts.values())))
# Probability of tails: 0.30
In addition to how we use the dictionary counts to count the elements of coin_flips, notice a couple of things about this example:
1. We used the assert outcome in counts statement. The assert statement in Python allows you to easily insert debugging statements in your code to help you discover errors more quickly. assert statements are executed whenever the internal Python __debug__ variable is set to True, which is always the case unless you start Python with the -O option, which allows you to run optimized Python.
2. When we computed the probability of tails, we used the built-in sum function, which allowed us to quickly find the total number of coin flips. sum is one of many built-in functions you can read about here.
3.5.9 Functions
You can reuse code by putting it inside a function that you can call in other parts of your programs. Functions are also a good way of grouping code that logically belongs together in one coherent whole. A function has a unique name in the program. Once you call a function, it will execute its body, which consists of one or more lines of code:
def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

print(check_triangle(4, 5, 6))
The def keyword tells Python we are defining a function. As part of the definition, we have the function name, check_triangle, and the parameters of the function – variables that will be populated when the function is called.
We call the function with arguments 4, 5 and 6, which are passed in order into the parameters a, b and c. A function can be called several times with varying parameters. There is no limit to the number of function calls.
It is also possible to store the output of a function in a variable, so it can be reused:
def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

result = check_triangle(4, 5, 6)
print(result)
3.5.10 Classes
A class is an encapsulation of data and the processes that work on them. The data is represented in member variables, and the processes are defined in the methods of the class (methods are functions inside the class). For example, let us see how to define a Triangle class:
class Triangle(object):

    def __init__(self, length, width,
                 height, angle1, angle2, angle3):
        if not self._sides_ok(length, width, height):
            print('The sides of the triangle are invalid.')
        elif not self._angles_ok(angle1, angle2, angle3):
            print('The angles of the triangle are invalid.')
        self._length = length
        self._width = width
        self._height = height
        self._angle1 = angle1
        self._angle2 = angle2
        self._angle3 = angle3

    def _sides_ok(self, a, b, c):
        return \
            a < b + c and a > abs(b - c) and \
            b < a + c and b > abs(a - c) and \
            c < a + b and c > abs(a - b)

    def _angles_ok(self, a, b, c):
        return a + b + c == 180

triangle = Triangle(4, 5, 6, 35, 65, 80)
Python has full object-oriented programming (OOP) capabilities; however, we cannot cover all of them in this section, so if you need more information please refer to the Python docs on classes and OOP.
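As a minimal sketch of what such object-oriented features look like (the class names here are invented for this example), a subclass can override a method of its parent:
class Shape(object):
    def describe(self):
        return 'a shape'

class Circle(Shape):  # Circle inherits from Shape
    def describe(self):
        return 'a circle'

print(Circle().describe())  # a circle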
3.5.11 Modules
Now write this simple program and save it:
print("Hello world!")
As a check, make sure the file contains the expected contents on the command line:
$ cat hello.py
print("Hello world!")
To execute your program, pass the file as a parameter to the python command:
$ python hello.py
Hello world!
Files in which Python code is stored are called modules. You can execute a Python module from the command line like you just did, or you can import it in other Python code using the import statement.
Let us write a more involved Python program that will receive as input the lengths of the three sides of a triangle, and will output whether they define a valid triangle. A triangle is valid if the length of each side is less than the sum of the lengths of the other two sides and greater than the difference of the lengths of the other two sides:
"""Usage: check_triangle.py [-h] LENGTH WIDTH HEIGHT

Check if a triangle is valid.

Arguments:
  LENGTH   The length of the triangle.
  WIDTH    The width of the triangle.
  HEIGHT   The height of the triangle.

Options:
  -h --help
"""
from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__)
    a, b, c = int(arguments['LENGTH']), \
        int(arguments['WIDTH']), \
        int(arguments['HEIGHT'])
    valid_triangle = \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)
    print('Triangle with sides %d, %d and %d is valid: %r' % (
        a, b, c, valid_triangle
    ))
Assuming we save the program in a file called check_triangle.py, we can run it like so:
$ python check_triangle.py 4 5 6
Triangle with sides 4, 5 and 6 is valid: True
Let us break this down a bit.
1. We are importing the print_function and division modules from Python 3 like we did earlier in this section. It is a good idea to always include these in your programs.
2. We have defined a boolean expression that tells us if the sides that were input define a valid triangle. The result of the expression is stored in the valid_triangle variable: it is True if the conditions are met, and False otherwise.
3. We have used the backslash symbol \ to format our code nicely. The backslash simply indicates that the current line is being continued on the next line.
4. When we run the program, we do the check if __name__ == '__main__'. __name__ is an internal Python variable that allows us to tell whether the current file is being run from the command line (value '__main__'), or is being imported by a module (the value will then be the name of the module). Thus, with this statement we are just making sure the program is being run from the command line.
5. We are using the docopt module to handle command line arguments. The advantage of using this module is that it generates a usage help statement for the program and enforces command line arguments automatically. All of this is done by parsing the docstring at the top of the file.
6. In the print function, we are using Python’s string formatting capabilities to insert values into the string we are displaying.
3.5.12 Lambda Expressions
As opposed to normal functions in Python, which are defined using the def keyword, lambda functions in Python are anonymous functions which do not have a name and are defined using the lambda keyword. The generic syntax of a lambda function is of the form lambda arguments: expression, as shown in the following example:
greeter = lambda x: print('Hello %s!' % x)
greeter('Albert')
As you could probably guess, the result is:
Hello Albert!
Now consider the following example:
power2 = lambda x: x ** 2
The power2 function defined in the expression is equivalent to the following definition:
def power2(x):
    return x ** 2
Lambda functions are useful when you need a function for a short period of time. Note that they can also be very useful when passed as an argument to other built-in functions that take a function as an argument, e.g. filter() and map(). In the next example we show how a lambda function can be combined with the filter function. Consider the array all_names, which contains five words that rhyme together. We want to filter the words that contain the word name. To achieve this, we pass the function lambda x: 'name' in x as the first argument. This lambda function returns True if the word name exists as a substring in the string x. The second argument of the filter function is the array of names, i.e. all_names.
all_names = ['surname', 'rename', 'nickname', 'acclaims', 'defame']
filtered_names = list(filter(lambda x: 'name' in x, all_names))
print(filtered_names)
# ['surname', 'rename', 'nickname']
As you can see, the names are successfully filtered as we expected.
In Python 3, the filter function returns a filter object, an iterator which gets lazily evaluated. This means we can neither access the elements of the filter object with an index nor use len() to find its length. We can, however, convert the filter object to a list:
list_a = [1, 2, 3, 4, 5]
filter_obj = filter(lambda x: x % 2 == 0, list_a)
# convert the filter object to a list
even_num = list(filter_obj)
print(even_num)
# Output: [2, 4]
In Python, we can have a small, usually single-line, anonymous function called a lambda function, which can have any number of arguments just like a normal function, but with only one expression and no return statement. The result of this expression can be applied to a value.
Basic syntax:
lambda arguments: expression
For example, consider this function in Python:
def multiply(a, b):
    return a * b

# call the function
multiply(3, 5)  # outputs: 15
The same function can be written as a lambda function. This function, named multiply, has 2 arguments and returns their multiplication.
The lambda equivalent for this function would be:
multiply = lambda a, b: a * b
print(multiply(3, 5))
# outputs: 15
Here a and b are the 2 arguments and a * b is the expression whose value is returned as an output.
Also, we do not need to assign a lambda function to a variable:
(lambda a, b: a * b)(3, 5)
Lambda functions are mostly passed as a parameter to a function which expects a function object, like in map or filter.
3.5.12.1 map
The basic syntax of the map function is
map(function_object, iterable1, iterable2, ...)
The map function expects a function object and any number of iterables, like list or dictionary. It executes the function_object for each element in the sequence and returns a list of the elements modified by the function object.
Example:
def multiply(x):
    return x * 2

map(multiply, [2, 4, 6, 8])
# Output [4, 8, 12, 16]
If we want to write the same function using a lambda function:
map(lambda x: x * 2, [2, 4, 6, 8])
# Output [4, 8, 12, 16]
3.5.12.2 dictionary
Now, let us see how we can iterate over a dictionary using map and lambda. Let us say we have a list of dictionary objects:
dict_movies = [
    {'movie': 'avengers', 'comic': 'marvel'},
    {'movie': 'superman', 'comic': 'dc'}]
We can iterate over this list and read its elements using map and lambda functions in the following way:
map(lambda x: x['movie'], dict_movies)  # Output: ['avengers', 'superman']
map(lambda x: x['comic'], dict_movies)  # Output: ['marvel', 'dc']
map(lambda x: x['movie'] == "avengers", dict_movies)
# Output: [True, False]
In Python 3, the map function returns an iterator or map object which gets lazily evaluated, which means we can neither access the elements of the map object with an index nor use len() to find the length of the map object. We can force convert the map output, i.e. the map object, to a list as shown next:
map_output = map(lambda x: x * 2, [1, 2, 3, 4])
print(map_output)
# Output: map object: <map object at 0x04D6BAB0>
list_map_output = list(map_output)
print(list_map_output)  # Output: [2, 4, 6, 8]
3.5.13 Iterators
In Python, the iterator protocol is defined using two methods: __iter__() and __next__(). The former returns the iterator object and the latter returns the next element of a sequence. Some advantages of iterators are as follows:
- Readability
- Support for sequences of infinite length
- Saving resources
There are several built-in objects in Python which implement the iterator protocol, e.g. string, list, dictionary. In the following example, we create a new class that follows the iterator protocol. We then use the class to generate log2 of numbers:
from math import log2

class LogTwo:
    "Implements an iterator of log two"

    def __init__(self, last=0):
        self.last = last

    def __iter__(self):
        self.current_num = 1
        return self

    def __next__(self):
        if self.current_num <= self.last:
            result = log2(self.current_num)
            self.current_num += 1
            return result
        else:
            raise StopIteration

L = LogTwo(5)
i = iter(L)
print(next(i))
print(next(i))
print(next(i))
print(next(i))
As you can see, we first create an instance of the class and assign its __iter__() function to a variable called i. Then by calling the next() function four times, we get the following output:
$ python iterator.py
0.0
1.0
1.584962500721156
2.0
As you probably noticed, the lines are log2() of 1, 2, 3, 4 respectively.
3.5.14 Generators
Before we go to generators, please understand iterators. Generators are also iterators, but they can only be iterated over once. That is because generators do not store their values in memory; instead they generate the values on the go. If we want to print those values, we can either simply iterate over them or use a for loop.
3.5.14.1 Generators with function
For example, we have a function named multiplyBy10 which prints all the input numbers multiplied by 10:
def multiplyBy10(numbers):
    result = []
    for i in numbers:
        result.append(i * 10)
    return result

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
print(new_numbers)  # Output: [10, 20, 30, 40, 50]
Now, if we want to use a generator here, then we make the following changes:
def multiplyBy10(numbers):
    for i in numbers:
        yield(i * 10)

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
print(new_numbers)        # Output: generator object
print(next(new_numbers))  # Output: 10
In generators, we use yield in place of return. So when we try to print the new_numbers list now, it just prints the generator object. The reason for this is that generators do not hold any value in memory; they yield one result at a time. Essentially, the generator is just waiting for us to ask for the next result. To print the next result we can just say print(next(new_numbers)): it reads the first value, multiplies it by 10, and yields the value 10. In this case we can call print(next(new_numbers)) 5 times to print all numbers, and if we do it a 6th time we will get an error, StopIteration, which means the generator has exhausted its limit and has no 6th element to print.
3.5.14.2 Generators using for loop
If we now want to print the complete list of values, then we can simply do:
def multiplyBy10(numbers):
    for i in numbers:
        yield(i * 10)

new_numbers = multiplyBy10([1, 2, 3, 4, 5])
for num in new_numbers:
    print(num)
The output will be:
10
20
30
40
50
3.5.14.3 Generators with List Comprehension
Python has something called list comprehension; if we use this, then we can replace the complete function definition with just:
new_numbers = [x * 10 for x in [1, 2, 3, 4, 5]]
print(new_numbers)  # Output: [10, 20, 30, 40, 50]
Here the point to note is that the square brackets [] in the first line are very important. If we change them to (), then again we will get a generator object:
new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(new_numbers)  # Output: generator object
We can get the individual elements again from the generator if we do a for loop over new_numbers, like we did previously. Alternatively, we can convert it into a list and then print it:
new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])
print(list(new_numbers))  # Output: [10, 20, 30, 40, 50]
But if we convert this into a list, then we lose on performance, which we will look at next.
3.5.14.4 Why use Generators?
Generators are better for performance because they do not hold values in memory. With the small examples we provide here this is not a big deal, since we are dealing with a small amount of data, but consider a scenario where the records are in the millions. If we try to convert millions of data elements into a list, then that will definitely make an impact on memory and performance, because everything will be in memory.
Let us see an example of how generators help with performance. First, without generators, a normal function takes 1 million records and returns the result (people) for all of them.
We are just giving approximate values to compare with the next execution, but if we try to run it, we will see a serious consumption of memory along with a good amount of time taken.
import random
import time
import mem_profile  # helper module used in this example to report the process memory usage

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_list(people):
    result = []
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result

t1 = time.clock()
people = people_list(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))

# Output
Memory (Before): 15 Mb
Memory (After): 318 Mb
Took 1.2 seconds
import random
import time
import mem_profile  # same helper module as before

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']
majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_generator(people):
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.clock()
people = people_generator(10000000)
t2 = time.clock()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')
print('Took {time} seconds'.format(time=t2 - t1))

# Output
Memory (Before): 15 Mb
Memory (After): 15 Mb
Took 0.01 seconds
Now, after running the same code using generators, we see a significant performance boost, with almost 0 seconds. The reason behind this is that in the case of generators we do not keep anything in memory; the system just reads one element at a time and yields it.
3.6 LIBRARIES
3.6.1 Python Modules ☁
Often you may need functionality that is not present in Python’s standard library. In this case you have two options:
- implement the features yourself
- use a third-party library that has the desired features
Often you can find a previous implementation of what you need. Since this is a common situation, there is a service supporting it: the Python Package Index (or PyPI for short).
Our task here is to install the autopep8 tool from PyPI. This will allow us to illustrate the use of virtual environments using the pyenv or virtualenv command, and installing and uninstalling PyPI packages using pip.
3.6.1.1 Updating Pip
It is important that you have the newest version of pip installed for your version of Python. Let us assume your Python is registered with python and you use pyenv; then you can update pip with
pip install -U pip
without interfering with a potentially system-wide installed version of pip that may be needed by the system default version of Python. See the section about pyenv for more details.
3.6.1.2 Using pip to Install Packages
Let us now look at another important tool for Python development: the Python Package Index, or PyPI for short. PyPI provides a large set of third-party Python packages. If you want to do something in Python, first check PyPI, as odds are someone already ran into the problem and created a package solving it.
In order to install a package from PyPI, use the pip command. We can search PyPI for packages:
$ pip search --trusted-host pypi.python.org autopep8 pylint
It appears that the top two results are what we want, so install them:
$ pip install --trusted-host pypi.python.org autopep8 pylint
This will cause pip to download the packages from PyPI, extract them, check their dependencies and install those as needed, then install the requested packages.
You can skip the ‘--trusted-host pypi.python.org’ option if you have a patched urllib3 on Python 2.7.9.
3.6.1.3 GUI
3.6.1.3.1 GUIZero
Install guizero with the following command:
sudo pip install guizero
For a comprehensive tutorial on guizero, click here.
3.6.1.3.2 Kivy
You can install Kivy on macOS as follows:
brew install pkg-config sdl2 sdl2_image sdl2_ttf sdl2_mixer gstreamer
pip install -U Cython
pip install kivy
pip install pygame
A hello world program for kivy is included in the cloudmesh.robot repository, which you can find here:
https://github.com/cloudmesh/cloudmesh.robot/tree/master/projects/kivy
To run the program, please download it or execute it in cloudmesh.robot as follows:
cd cloudmesh.robot/projects/kivy
python swim.py
To create standalone packages with kivy, please see:
- https://kivy.org/docs/guide/packaging-osx.html
3.6.1.4 Formatting and Checking Python Code
First, get the bad code:
$ wget --no-check-certificate http://git.io/pXqb -O bad_code_example.py
Examine the code:
$ emacs bad_code_example.py
As you can see, this is very dense and hard to read. Cleaning it up by hand would be a time-consuming and error-prone process. Luckily, this is a common problem, so there exist a couple of packages to help in this situation.
3.6.1.5 Using autopep8
We can now run the bad code through autopep8 to fix formatting problems:
$ autopep8 bad_code_example.py > code_example_autopep8.py
Let us look at the result. This is considerably better than before. It is easy to tell what the example1 and example2 functions are doing.
It is a good idea to develop a habit of using autopep8 in your Python development workflow. For instance: use autopep8 to check a file, and if it passes, make any changes in place using the -i flag:
$ autopep8 file.py     # check output to see if it passes
$ autopep8 -i file.py  # update in place
If you use PyCharm, you have the ability to use a similar function while pressing on Inspect Code.
3.6.1.6 Writing Python 3 Compatible Code
To write Python 2 and 3 compatible code, we recommend that you take a look at: http://python-future.org/compatible_idioms.html
3.6.1.7 Using Python on FutureSystems
This is only important if you use FutureSystems resources.
In order to use Python you must log in to your FutureSystems account. Then at the shell prompt execute the following command:
$ module load python
This will make the python and virtualenv commands available to you.
The details of what the module load command does are described in the future lesson modules.
3.6.1.8 Ecosystem
3.6.1.8.1 pypi
The Python Package Index is a large repository of software for the Python programming language containing a large number of packages. The nice thing about PyPI is that many packages can be installed with the program pip.
To do so you have to locate the <package_name>, for example with the search function in PyPI, and say on the command line:
$ pip install <package_name>
where package_name is the string name of the package. An example would be the package called cloudmesh_client, which you can install with:
$ pip install cloudmesh_client
If all goes well, the package will be installed.
3.6.1.8.2 Alternative Installations
The basic installation of Python is provided by python.org. However, others claim to have alternative environments that allow you to install Python. These include
- Canopy
- Anaconda
- IronPython
Typically they include not only the Python compiler but also several useful packages. It is fine to use such environments for the class, but it should be noted that in both cases not every Python library may be available for install in the given environment. For example, if you need to use cloudmesh client, it may not be available as a conda or Canopy package. This is also the case for many other cloud-related and useful Python libraries. Hence, we do recommend, if you are new to Python, to use the distribution from python.org, and use pip and virtualenv.
Additionally, some Python versions have platform-specific libraries or dependencies; for example, Cocoa libraries, .NET, or other frameworks are examples. For the assignments and the projects such platform-dependent libraries are not to be used.
If, however, you can write platform-independent code that works on Linux, macOS and Windows while using the python.org version, but develop it with any of the other tools, that is just fine. However, it is up to you to guarantee that this independence is maintained and implemented. You do have to write requirements.txt files that will install the necessary Python libraries in a platform-independent fashion. The homework assignment PRG1 even has a requirement to do so.
In order to provide platform independence, we have given in the class a minimal Python version that we have tested with hundreds of students: python.org. If you use any other version, that is your decision. Additionally, some students not only use python.org but have used iPython, which is fine too. However, this class is not only about Python, but also about how to have your code run on any platform. The homework is designed so that you can identify a setup that works for you.
However, we have concerns if you, for example, wanted to use chameleon cloud, which we require you to access with cloudmesh. cloudmesh is not available as a conda, Canopy, or other framework package. Cloudmesh client is available from pypi, which is standard and should be supported by the frameworks. We have not tested cloudmesh on any other Python version than python.org, which is the open source community standard. None of the other versions are standard.
In fact, we had students over the summer using Canopy on their machines, and they got confused, as they now had multiple Python versions and did not know how to switch between them and activate the correct version. Certainly, if you know how to do that, then feel free to use Canopy, and if you want to use Canopy, all this is up to you. However, the homework and project require you to make your program portable to python.org. If you know how to do that, even if you use Canopy, Anaconda, or any other Python version, that is fine. Graders will test your programs on a python.org installation, and not Canopy, Anaconda, or IronPython, while using virtualenv. It is obvious why: if you do not know the answer, you may want to think about the fact that every time they test a program they need to create a new virtualenv and run vanilla Python in it. If we were to run two installs in the same system, this would not work, as we do not know if one student will cause a side effect for another. Thus, we as instructors do not just have to look at your code, but at the code of hundreds of students with different setups. This is a non-scalable solution, as every time we test out code from a student, we would have to wipe out the OS, install it anew, install a new version of whatever Python you have elected, become familiar with that version, and so on. This is the reason why the open source community is using python.org. We follow best practices. Using other versions is not a community best practice, but may work for an individual.
We have, however, in regards to using other Python versions, additional bonus projects, such as:
- deploy, run, and document cloudmesh on IronPython
- deploy, run, and document cloudmesh on Anaconda; develop a script to generate a conda package from github
- deploy, run, and document cloudmesh on Canopy; develop a script to generate a package from github
- other documentation that would be useful
3.6.1.9 Resources
If you are unfamiliar with programming in Python, we also refer you to some of the numerous online resources. You may wish to start with Learn Python or the book Learn Python the Hard Way. Other options include Tutorials Point or Code Academy, and the Python wiki page contains a long list of references for learning as well. Additional resources include:
- https://virtualenvwrapper.readthedocs.io
- https://github.com/yyuu/pyenv
- https://amaral.northwestern.edu/resources/guides/pyenv-tutorial
- https://godjango.com/96-django-and-python-3-how-to-setup-pyenv-for-multiple-pythons/
- https://www.accelebrate.com/blog/the-many-faces-of-python-and-how-to-manage-them/
- http://ivory.idyll.org/articles/advanced-swc/
- http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
- http://www.youtube.com/watch?v=0vJJlVBVTFg
- http://www.korokithakis.net/tutorials/python/
- http://www.afterhoursprogramming.com/tutorial/Python/Introduction/
- http://www.greenteapress.com/thinkpython/thinkCSpy.pdf
- https://docs.python.org/3.3/tutorial/modules.html
- https://www.learnpython.org/en/Modules_and_Packages
- https://docs.python.org/2/library/datetime.html
- https://chrisalbon.com/python/strings_to_datetime.html
A very long list of useful information is also available from
- https://github.com/vinta/awesome-python
- https://github.com/rasbt/python_reference
This list may be useful as it also contains links to data visualization and manipulation libraries, and AI tools and libraries. Please note that for this class you can reuse such libraries if not otherwise stated.
3.6.1.9.1 Jupyter Notebook Tutorials
A Short Introduction to Jupyter Notebooks and NumPy. To view the notebook, open this link in a background tab: https://nbviewer.jupyter.org/ and copy and paste the following link in the URL input area: https://cloudmesh.github.io/classes/lesson/prg/Jupyter-NumPy-tutorial-I523-F2017.ipynb Then hit Go.
3.6.1.10 Exercises
E.Python.Lib.1:
Write a Python program called iterate.py that accepts an integer n from the command line. Pass this integer to a function called iterate.
The iterate function should then iterate from 1 to n. If the i-th number is a multiple of three, print multiple of 3; if a multiple of 5, print multiple of 5; if a multiple of both, print multiple of 3 and 5; else print the value.
E:Python.Lib.2:
1. Create a pyenv or virtualenv ~/ENV
2. Modify your ~/.bashrc shell file to activate your environment upon login.
3. Install the docopt Python package using pip
4. Write a program that uses docopt to define a command line program. Hint: modify the iterate program.
5. Demonstrate the program works.
3.6.2 Data Management ☁
Obviously, when dealing with big data we may not only be dealing with data in one format but in many different formats. It is important that you are able to master such formats and seamlessly integrate them in your analysis. Thus we provide some simple examples of which different data formats exist and how to use them.
3.6.2.1 Formats
3.6.2.1.1 Pickle
Python pickle allows you to save data in a Python-native format into a file that can later be read in by other programs. However, the data format may not be portable among different Python versions, thus the format is often not suitable to store information. Instead, we recommend for standard data to use either json or yaml.
import pickle

flavor = {
    "small": 100,
    "medium": 1000,
    "large": 10000
}

pickle.dump(flavor, open("data.p", "wb"))
To read it back in, use
flavor = pickle.load(open("data.p", "rb"))
3.6.2.1.2 Text Files
To read text files into a variable called content, you can use
content = open('filename.txt', 'r').read()
You can also use the following code while using the convenient with statement:
with open('filename.txt', 'r') as file:
    content = file.read()
To split up the lines of the file into an array you can do
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()
This can also be done with the built-in readlines function:
lines = open('filename.txt', 'r').readlines()
In case the file is too big, you will want to read the file line by line:
with open('filename.txt', 'r') as file:
    for line in file:
        print(line)
3.6.2.1.3 CSV Files
Often data is contained in comma separated values (CSV) within a file. To read such files you can use the csv package:
import csv

with open('data.csv', 'r') as f:
    contents = csv.reader(f)
    for row in contents:
        print(row)
Using pandas you can read them as follows:
import pandas as pd

df = pd.read_csv("example.csv")
There are many other modules and libraries that include CSV read functions. In case you need to split a single line by comma, you may also use the split function. However, remember it will split at every comma, including those contained in quotes. So this method, although looking convenient at first, has limitations.
3.6.2.1.4 Excel spreadsheets
pandas contains a method to read Excel files:
import pandas as pd

filename = 'data.xlsx'
data = pd.ExcelFile(filename)
df = data.parse('Sheet1')
3.6.2.1.5 YAML
YAML is a very important format, as it allows you to easily structure data in hierarchical fields. It is frequently used to coordinate programs, using yaml as the specification for configuration files, but also for data files. To read in a yaml file, the following code can be used:
import yaml

with open('data.yaml', 'r') as f:
    content = yaml.load(f)
The nice part is that this code can also be used to verify if a file is valid yaml. To write data out we can use
with open('data.yml', 'w') as f:
    yaml.dump(data, f, default_flow_style=False)
The flow style set to False formats the data in a nice, readable fashion with indentations.
3.6.2.1.6 JSON
JSON (JavaScript Object Notation) is a lightweight data interchange format that is widely used by web services. To read a json file into a dictionary, you can use:
import json

with open('strings.json') as f:
    content = json.load(f)
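Writing works analogously with json.dump; the following is a small sketch where the dictionary and the file name are made up for this example:
import json

data = {"small": 100, "medium": 1000}
with open('data.json', 'w') as f:
    json.dump(data, f, indent=2)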
3.6.2.1.7 XML
The XML format is extensively used to transport data across the web. It has a hierarchical data format, and can be represented in the form of a tree.
A sample XML document looks like:
<data>
    <items>
        <item name="item-1"></item>
        <item name="item-2"></item>
        <item name="item-3"></item>
    </items>
</data>
Python provides the ElementTree XML API to parse and create XML data.
Importing XML data from a file:
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
Reading XML data from a string directly:
root = ET.fromstring(data_as_string)
Iterating over child nodes in a root:
for child in root:
    print(child.tag, child.attrib)
Modifying XML data using ElementTree:
Modifying text within a tag of an element using the .text attribute:
tag.text = new_data
tree.write('output.xml')
Adding/modifying an attribute using the .set() method:
tag.set('key', 'value')
tree.write('output.xml')
Other Python modules used for parsing XML data include
- minidom: https://docs.python.org/3/library/xml.dom.minidom.html
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
3.6.2.1.8 RDF
To read RDF files you will need to install RDFLib with
$ pip install rdflib
This will then allow you to read RDF files:
from rdflib.graph import Graph
g = Graph()
g.parse("filename.rdf", format="format")

for entry in g:
    print(entry)
Good examples of using RDF are provided on the RDFLib web page at https://github.com/RDFLib/rdflib
From the web page we also showcase how to directly process RDF data from the web:
import rdflib
g = rdflib.Graph()
g.load('http://dbpedia.org/resource/Semantic_Web')

for s, p, o in g:
    print(s, p, o)
3.6.2.1.9 PDF
The Portable Document Format (PDF) has been made available by Adobe Inc. royalty free. This has enabled PDF to become a worldwide adopted format that also has been standardized in 2008 (ISO/IEC 32000-1:2008, https://www.iso.org/standard/51502.html). A lot of research is published in papers, making PDF one of the de-facto standards for publishing. However, PDF is difficult to parse and is focused on high quality output instead of data representation. Nevertheless, tools to manipulate PDF exist:
PDFMiner
https://pypi.python.org/pypi/pdfminer/ allows the simple translation of PDF into text that can then be further mined. The manual page helps to demonstrate some examples: http://euske.github.io/pdfminer/index.html.
pdf-parser.py
https://blog.didierstevens.com/programs/pdf-tools/ parses pdf documents and identifies some structural elements that can then be further processed.
If you know about other tools, let us know.
3.6.2.1.10 HTML
A very powerful library to parse HTML web pages is provided with https://www.crummy.com/software/BeautifulSoup/
More details about it are provided in the documentation page https://www.crummy.com/software/BeautifulSoup/bs4/doc/
TODO: Students can contribute a section
BeautifulSoup is a Python library to parse, process and edit HTML documents.
To install BeautifulSoup, use the pip command as follows:
$ pip install beautifulsoup4
In order to process HTML documents, a parser is required. BeautifulSoup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers, like the lxml parser, which is commonly used [1].
The following command can be used to install the lxml parser:
$ pip install lxml
To begin with, we import the package and instantiate an object as follows for an HTML document html_handle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_handle, 'lxml')
Now we will discuss a few functions, attributes and methods of BeautifulSoup.
prettify function
The prettify() method will turn a BeautifulSoup parse tree into a nicely formatted Unicode string, with a separate line for each HTML/XML tag and string. It is analogous to the pprint() function. The object created above can be viewed by printing the prettified version of the document as follows:
print(soup.prettify())
tag object
A tag object refers to tags in the HTML document. It is possible to go down to the inner levels of the DOM tree. To access a tag div under the tag body, it can be done as follows:
body_div = soup.body.div
print(body_div.prettify())
The attrs attribute of the tag object returns a dictionary of all the defined attributes of the HTML tag as keys.
has_attr() method
To check if a tag object has a specific attribute, the has_attr() method can be used:
if body_div.has_attr('p'):
    print('The value of \'p\' attribute is:', body_div['p'])
tag object attributes
- name - This attribute returns the name of the tag selected.
- attrs - This attribute returns a dictionary of all the defined attributes of the HTML tag as keys.
- contents - This attribute returns a list of contents enclosed within the HTML tag.
- string - This attribute returns the text enclosed within the HTML tag. This returns None if there are multiple children.
- strings - This overcomes the limitation of string and returns a generator of all strings enclosed within the given tag.
The following code showcases usage of the above discussed attributes:
body_tag = soup.body
print('Name of the tag:', body_tag.name)
attrs = body_tag.attrs
print('The attributes defined for body tag are:', attrs)
print('The contents of \'body\' tag are:\n', body_tag.contents)
print('The string value enclosed in \'body\' tag is:', body_tag.string)
for s in body_tag.strings:
    print(repr(s))
Searching the Tree
- The find() function takes a filter expression as argument and returns the first match found.
- The find_all() function returns a list of all the matching elements.
search_elem = soup.find('a')
print(search_elem.prettify())
search_elems = soup.find_all("a", class_="sample")
pprint(search_elems)
- The select() function can be used to search the tree using CSS selectors.
# Select 'a' tag with class 'sample'
a_tag_elems = soup.select('a.sample')
print(a_tag_elems)
3.6.2.1.11 ConfigParser
TODO: Students can contribute a section
https://pymotw.com/2/ConfigParser/
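Until such a section is contributed, the following minimal sketch shows how the standard library configparser module reads an INI-style file (the file name example.ini and its keys are made up for this illustration):
import configparser

config = configparser.ConfigParser()
config.read('example.ini')
# print every section with its key/value pairs
for section in config.sections():
    print(section, dict(config[section]))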
3.6.2.1.12 ConfigDict
https://github.com/cloudmesh/cloudmesh-common/blob/master/cloudmesh/common/ConfigDict.py
3.6.2.2 Encryption
Often we need to protect the information stored in a file. This is achieved with encryption. There are many methods of supporting encryption, and even if a file is encrypted it may be a target of attacks. Thus it is not only important to encrypt data that you do not want others to see, but also to make sure that the system on which the data is hosted is secure. This is especially important if we talk about big data, which has a potentially large effect if it gets into the wrong hands.
To illustrate one type of encryption that is non-trivial, we have chosen to demonstrate how to encrypt a file with an ssh key. In case you have openssl installed on your system, this can be achieved as follows:
#!/bin/sh
# Step 1. Creating a file with data
echo "Big Data is the future." > file.txt
# Step 2. Create the pem
openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/id_rsa.pub.pem
# Step 3. look at the pem file to illustrate how it looks like (optional)
cat ~/.ssh/id_rsa.pub.pem
# Step 4. encrypt the file into secret.txt
openssl rsautl -encrypt -pubin -inkey ~/.ssh/id_rsa.pub.pem -in file.txt -out secret.txt
# Step 5. decrypt the file and print the contents to stdout
openssl rsautl -decrypt -inkey ~/.ssh/id_rsa -in secret.txt
Most important here are Step 4, which encrypts the file, and Step 5, which decrypts the file. Using the Python os module it is straightforward to implement this. However, we are providing in cloudmesh a convenient class that makes the use in Python very simple:
from cloudmesh.common.ssh.encrypt import EncryptFile

e = EncryptFile('file.txt', 'secret.txt')
e.encrypt()
e.decrypt()
In our class we initialize it with the locations of the file that is to be encrypted and decrypted. To initiate that action, just call the methods encrypt and decrypt.
3.6.2.3 Database Access
TODO: Students: define conventional database access section
see: https://www.tutorialspoint.com/python/python_database_access.htm
3.6.2.4 SQLite
TODO: Students can contribute to this section
https://www.sqlite.org/index.html
https://docs.python.org/3/library/sqlite3.html
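Until the section is contributed, here is a minimal sketch of the standard library sqlite3 module; the database file example.db and the table are made up for this illustration:
import sqlite3

# the file is created if it does not exist
con = sqlite3.connect('example.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS grades (name TEXT, points INTEGER)')
cur.execute('INSERT INTO grades VALUES (?, ?)', ('Albert', 100))
con.commit()
for row in cur.execute('SELECT * FROM grades'):
    print(row)
con.close()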
3.6.2.4.1 Exercises
E:Encryption.1:
Test the shell script to replicate how this example works.
E:Encryption.2:
Test the cloudmesh encryption class.
E:Encryption.3:
What other encryption methods exist? Can you provide an example and contribute to the section?
E:Encryption.4:
What is the issue of encryption that makes it challenging for Big Data?
E:Encryption.5:
Given a test dataset with many text files, how long will it take to encrypt and decrypt them on various machines? Write a benchmark that you test. Develop this benchmark as a group, and test out the time it takes to execute it on a variety of platforms.
3.6.3 Plotting with matplotlib ☁

A brief overview of plotting with matplotlib along with examples is provided. First matplotlib must be installed, which can be accomplished with pip install as follows:

We will start by plotting a simple line graph using built-in numpy functions for sine and cosine. The first step is to import the proper libraries, shown next.

Next we will define the values for the x axis; we do this with the linspace function in numpy. The first two parameters are the starting and ending points; these must be scalars. The third parameter is optional and defines the number of samples to be generated between the starting and ending points; this value must be an integer. Additional parameters for the linspace utility are documented at https://numpy.org/doc/stable/reference/generated/numpy.linspace.html

Now we will use the sine and cosine functions in order to generate y values; for this we will use the values of x as the argument of both our sine and cosine functions, i.e. cos(x).

You can display the values of the three parameters we have defined by typing them in a python shell.

Having defined x and y values we can generate a line plot, and since we imported matplotlib.pyplot as plt we simply use plt.plot.
$ pip install matplotlib

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)

x
array([-3.14159265, -2.72271363, -2.30383461, -1.88495559, -1.46607657,
       -1.04719755, -0.62831853, -0.20943951,  0.20943951,  0.62831853,
        1.04719755,  1.46607657,  1.88495559,  2.30383461,  2.72271363,
        3.14159265])
We can display the plot using plt.show(), which will pop up a figure displaying the plot defined.

Additionally we can add the sine line to our line graph by entering the following.

Invoking plt.show() now will show a figure with both sine and cosine lines displayed. Now that we have a figure generated, it would be useful to label the x and y axis and provide a title. This is done by the following three commands:

Along with axis labels and a title, another useful figure feature may be a legend. In order to create a legend you must first designate a label for the line; this label will be what shows up in the legend. The label is defined in the initial plt.plot(x,y) instance; next is an example.

Then in order to display the legend the following command is issued:

The location is specified by using upper or lower and left or right. Naturally all these commands can be combined and put in a file with the .py extension and run from the command line.
plt.plot(x, cos)
plt.show()

plt.plot(x, sin)

plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")

plt.plot(x, cos, label="cosine")

plt.legend(loc='upper right')

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 16)
cos = np.cos(x)
sin = np.sin(x)
plt.plot(x, cos, label="cosine")
plt.plot(x, sin, label="sine")
plt.xlabel("X-label (units)")
plt.ylabel("Y-label (units)")
plt.title("A clever Title for your Figure")
plt.legend(loc='upper right')
plt.show()
An example of a bar chart is presented next using data from [T:fast-cars].

You can customize plots further by using plt.style.use() in python 3. If you provide the following command inside a python command shell you will see a list of available styles.

An example of using a predefined style is shown next.

Up to this point we have only showcased how to display figures through python output; however, web browsers are a popular way to display figures. One example is Bokeh; the following lines can be entered in a python shell and the figure is output to a browser.
3.6.4 DocOpts ☁

When we want to design command line arguments for python programs, we have many options. However, as our approach is to create documentation first, docopts provides also a good approach for Python. The code for it is located at
https://github.com/docopt/docopt
import matplotlib.pyplot as plt

x = ['Toyota Prius',
     'Tesla Roadster',
     'Bugatti Veyron',
     'Honda Civic',
     'Lamborghini Aventador']
horse_power = [120, 288, 1200, 158, 695]

x_pos = [i for i, _ in enumerate(x)]

plt.bar(x_pos, horse_power, color='green')
plt.xlabel("Car Model")
plt.ylabel("Horse Power (Hp)")
plt.title("Horse Power for Selected Cars")
plt.xticks(x_pos, x)
plt.show()
print(plt.style.available)

plt.style.use('seaborn')

from bokeh.io import show
from bokeh.plotting import figure

x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]

p = figure()
p.circle(x=x_values, y=y_values)
show(p)
It can be installed with

$ pip install docopt

A sample program is located at
https://github.com/docopt/docopt/blob/master/examples/options_example.py
A sample program using docopts for our purposes looks as follows:

Another good feature of using docopts is that we can use the same verbal description in other programming languages, as showcased in this book.
3.6.5 Cloudmesh Command Shell ☁

3.6.5.1 CMD5

Python's CMD (https://docs.python.org/2/library/cmd.html) is a very useful package to create command line shells. However, it does not allow the dynamic integration of newly defined commands. Furthermore, additions to CMD need to be done within the same source tree. To simplify developing commands by a number of people and to have a dynamic plugin mechanism, we developed cmd5. It is a rewrite of our earlier efforts in cloudmesh client and cmd3.

3.6.5.1.1 Resources
"""Cloudmesh VM management

Usage:
  cm-go vm start NAME [--cloud=CLOUD]
  cm-go vm stop NAME [--cloud=CLOUD]
  cm-go set --cloud=CLOUD
  cm-go -h | --help
  cm-go --version

Options:
  -h --help      Show this screen.
  --version      Show version.
  --cloud=CLOUD  The name of the cloud.

Arguments:
  NAME  The name of the VM

"""
from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__, version='1.0.0rc2')
    print(arguments)
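When the sample program above is invoked, for example, as cm-go vm start myvm --cloud=chameleon, docopt returns the arguments as a plain dictionary, roughly like the following (the invocation and cloud name are illustrative):

{'--cloud': 'chameleon',
 '--help': False,
 '--version': False,
 'NAME': 'myvm',
 'set': False,
 'start': True,
 'stop': False,
 'vm': True}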
The source code for cmd5 is located in github:
https://github.com/cloudmesh/cmd5
We have discussed in Section ¿sec:cloudmesh-cms-install? how to install cloudmesh as a developer and have access to the source code in a directory called cm. As you read this document we assume you are a developer and can skip the next section.
3.6.5.1.2 Installation from source

WARNING: DO NOT EXECUTE THIS IF YOU ARE A DEVELOPER OR YOUR ENVIRONMENT WILL NOT PROPERLY WORK.

However, if you are a user of cloudmesh you can install it with

3.6.5.1.3 Execution

To run the shell you can activate it with the cms command. cms stands for cloudmesh shell:

It will print the banner and enter the shell:

To see the list of commands you can say:

To see the manual page for a specific command, please use:
$ pip install cloudmesh-cmd5

(ENV2) $ cms
+-------------------------------------------------------+
|            (ASCII art "cloudmesh" banner)             |
+-------------------------------------------------------+
|                  Cloudmesh CMD5 Shell                 |
+-------------------------------------------------------+
cms>

cms> help

help COMMANDNAME
3.6.5.1.4 Create your own Extension

One of the most important features of CMD5 is its ability to extend it with new commands. This is done via packaged namespaces. We recommend you name it cloudmesh-mycommand, where mycommand is the name of the command that you like to create. This can easily be done while using the sys cloudmesh command (we suggest you use a different name than gregor, maybe your firstname):

It will download a template from cloudmesh called cloudmesh-bar and generate a new directory cloudmesh-gregor with all the needed files to create your own command and register it dynamically with cloudmesh. All you have to do is to cd into the directory and install the code:

Adding your own command is easy. It is important that all objects are defined in the command itself and that no global variables be used, in order to allow each shell command to stand alone. Naturally you should develop API libraries outside of the cloudmesh shell command and reuse them, in order to keep the command code as small as possible. We place the command in:

Now you can go ahead and modify your command in that directory. It will look similar to (if you used the command name gregor):
$ cms sys command generate gregor

$ cd cloudmesh-gregor
$ python setup.py install
# pip install .

cloudmesh/mycommand/command/gregor.py
from __future__ import print_function
from cloudmesh.shell.command import command
from cloudmesh.shell.command import PluginCommand


class GregorCommand(PluginCommand):

    @command
    def do_gregor(self, args, arguments):
        """
        ::

          Usage:
                gregor -f FILE
                gregor list

          This command does some useful things.

          Arguments:
              FILE   a file name

          Options:
              -f     specify the file
        """
        print(arguments)
        if arguments.FILE:
            print("You have used file:", arguments.FILE)
        return ""
An important difference to other CMD solutions is that our commands can leverage (besides the standard definition) docopts as a way to define the manual page. This allows us to use the arguments as a dict and use simple if conditions to interpret the command. Using docopts has the advantage that contributors are forced to think about the command and its options and document them from the start. Previously we used argparse and click. However, we noticed that for our contributors both systems lead to commands that were either not properly documented, or the developers delivered ambiguous commands that resulted in confusion and wrong usage by subsequent users. Hence, we do recommend that you use docopts for documenting cmd5 commands. The transformation is enabled by the @command decorator that generates a manual page and creates a proper help message for the shell automatically. Thus there is no need to introduce a separate help method, as would normally be needed in CMD, while reducing the effort it takes to contribute new commands in a dynamic fashion.
3.6.5.1.5 Bug: Quotes

We have one bug in cmd5 that relates to the use of quotes on the command line.

For example, you need to say

$ cms gregor -f \"filename with spaces\"

If you like to help us fix this, that would be great. It requires the use of shlex. Unfortunately we did not yet have time to fix this "feature".
3.6.6 cmd Module ☁

If you consider using this module, you may instead want to use cloudmesh cmd5, as it provides some very nice features that are not included in cmd. However, to do the basics, cmd will do.

The Python cmd module is useful for any more involved command-line application. It is used in the Cloudmesh Project, for example, and students have found it helpful in their projects to quickly develop high quality command line tools with documentation so that others can replicate and use the programs. The Python cmd module contains a public class, Cmd, designed to be used as a base class for command processors such as interactive shells and other command interpreters.
return""
$cmsgregor-f\"filenamewithspaces\"
class for command processors such as interactive shells and other commandinterpreters.
3.6.6.1 Hello, World with cmd

This example shows a very simple command interpreter that simply responds to the greet command.

In order to demonstrate commands provided by cmd, let's save the following program in a file called helloworld.py.

A session with this program might look like this:

The Cmd class can be used to customize a subclass that becomes a user-defined command prompt. After you have executed your program, commands defined in your class can be used. Take note of the following in this example:
from __future__ import print_function, division
import cmd


class HelloWorld(cmd.Cmd):
    '''Simple command processor example.'''

    def do_greet(self, line):
        if line is not None and len(line.strip()) > 0:
            print('Hello, %s!' % line.strip().title())
        else:
            print('Hello!')

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    HelloWorld().cmdloop()
$ python helloworld.py
(Cmd) help

Documented commands (type help <topic>):
========================================
help

Undocumented commands:
======================
EOF  greet

(Cmd) greet
Hello!
(Cmd) greet albert
Hello, Albert!
(Cmd) <CTRL-D pressed>
bye, bye
- The methods of the class of the form do_xxx implement the shell commands, with xxx being the name of the command. For example, in the HelloWorld class, the function do_greet maps to the greet command on the command line.
- The EOF command is a special command that is executed when you press CTRL-D on your keyboard.
- As soon as any command method returns True, the shell application exits. Thus, in this example the shell is exited by pressing CTRL-D, since the do_EOF method is the only one that returns True.
- The shell application is started by calling the cmdloop method of the class.
3.6.6.2 A More Involved Example

Let us look at a little more involved example. Save the following code in a file called calculator.py.

A session with this program might look like this:
from __future__ import print_function, division
import cmd


class Calculator(cmd.Cmd):

    prompt = 'calc >>> '
    intro = 'Simple calculator that can do addition, subtraction, multiplication and division.'

    def do_add(self, line):
        args = line.split()
        total = 0
        for arg in args:
            total += float(arg.strip())
        print(total)

    def do_subtract(self, line):
        args = line.split()
        total = 0
        if len(args) > 0:
            total = float(args[0])
            for arg in args[1:]:
                total -= float(arg.strip())
        print(total)

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    Calculator().cmdloop()
$ python calculator.py
Simple calculator that can do addition, subtraction, multiplication and division.
calc >>> help
In this case we are using the prompt and intro class variables to define what the default prompt looks like and a welcome message when the command interpreter is invoked.

In the add and subtract commands we are using the strip and split methods to parse all arguments. If you want to get fancy, you can use Python modules like getopt or argparse for this, but this is not necessary in this simple example.

3.6.6.3 Help Messages

Notice that all commands presently show up as undocumented. To remedy this, we can define help_ methods for each command:
Documented commands (type help <topic>):
========================================
help

Undocumented commands:
======================
EOF  add  subtract

calc >>> add
0
calc >>> add 4 5 6
15.0
calc >>> subtract
0
calc >>> subtract 10 2
8.0
calc >>> subtract 10 2 20
-12.0
calc >>> bye, bye
from __future__ import print_function, division
import cmd


class Calculator(cmd.Cmd):

    prompt = 'calc >>> '
    intro = 'Simple calculator that can do addition, subtraction, multiplication and division.'

    def do_add(self, line):
        args = line.split()
        total = 0
        for arg in args:
            total += float(arg.strip())
        print(total)

    def help_add(self):
        print('\n'.join([
            'add [number,]',
            'Add the arguments together and display the total.'
        ]))

    def do_subtract(self, line):
        args = line.split()
        total = 0
        if len(args) > 0:
            total = float(args[0])
            for arg in args[1:]:
                total -= float(arg.strip())
        print(total)
Now, we can obtain help for the add and subtract commands:
3.6.6.4 Useful Links

- cmd Python 2 Docs
- cmd Python 3 Docs
- Python Module of the Week: cmd – Create line-oriented command processors
3.6.7 OpenCV ☁

Learning Objectives

- Provide some simple calculations so we can test cloud services.
    def help_subtract(self):
        print('\n'.join([
            'subtract [number,]',
            'Subtract all following arguments from the first argument.'
        ]))

    def do_EOF(self, line):
        print('bye, bye')
        return True

if __name__ == '__main__':
    Calculator().cmdloop()
$ python calculator.py
Simple calculator that can do addition, subtraction, multiplication and division.
calc >>> help

Documented commands (type help <topic>):
========================================
add  help  subtract

Undocumented commands:
======================
EOF

calc >>> help add
add [number,]
Add the arguments together and display the total.
calc >>> help subtract
subtract [number,]
Subtract all following arguments from the first argument.
calc >>> bye, bye
- Showcase some elementary OpenCV functions
- Show an environmental image analysis application using Secchi disks
OpenCV (Open Source Computer Vision Library) is a library of thousands of algorithms for various applications in computer vision and machine learning. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. In this section, we will explain basic features of this library, including the implementation of a simple example.
3.6.7.1 Overview

OpenCV has countless functions for image and video processing. The pipeline starts with reading the images, low-level operations on pixel values, preprocessing, e.g. denoising, and then multiple steps of higher-level operations which vary depending on the application. OpenCV covers the whole pipeline, especially providing a large set of library functions for high-level operations. A simpler library for image processing in Python is Scipy's multi-dimensional image processing package (scipy.ndimage).
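For comparison, a minimal sketch of such a low-level preprocessing step with scipy.ndimage follows; it assumes scipy is installed, and the image here is a random placeholder array:

# A minimal sketch: Gaussian smoothing (denoising) with scipy.ndimage,
# which offers a smaller API than OpenCV for such low-level steps.
import numpy as np
from scipy import ndimage

img = np.random.rand(64, 64)                    # placeholder image
smooth = ndimage.gaussian_filter(img, sigma=2)  # blur with a Gaussian kernel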
3.6.7.2 Installation

OpenCV for Python can be installed on Linux in multiple ways, namely PyPI (Python Package Index), the Linux package manager (apt-get for Ubuntu), the Conda package manager, and also building from source. You are recommended to use PyPI. Here is the command that you need to run:

This was tested on Ubuntu 16.04 with a fresh Python 3.6 virtual environment. In order to test, import the module in the Python command line:

If it does not raise an error, it is installed correctly. Otherwise, try to solve the error.

For installation on Windows, see:
$ pip install opencv-python

import cv2
https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_setup/py_setup_in_windows/py_setup_in_windows.html#install-opencv-python-in-windows
Note that building from source can take a long time and may not be feasible for deploying to limited platforms such as Raspberry Pi.
3.6.7.3 A Simple Example

In this example, an image is loaded, a simple processing step is performed, and the result is written to a new image.

3.6.7.3.1 Loading an image

The image was downloaded from the USC standard database:
http://sipi.usc.edu/database/database.php?volume=misc&image=9
3.6.7.3.2 Displaying the image

The image is saved in a numpy array. Each pixel is represented with 3 values (R,G,B). This provides you with access to manipulate the image at the level of single pixels. You can display the image using OpenCV's imshow function as well as Matplotlib's imshow function.

You can display the image using the imshow function:

or you can use Matplotlib. If you have not installed Matplotlib before, install it using:

Now you can use:
%matplotlib inline
import cv2

img = cv2.imread('images/opencv/4.2.01.tiff')

cv2.imshow('Original', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

$ pip install matplotlib

which results in Figure 1.

Figure 1: Image display
3.6.7.3.3 Scaling and Rotation

Scaling (resizing) the image relative to different axes:

which results in Figure 2.

Figure 2: Scaling and rotation
import matplotlib.pyplot as plt

plt.imshow(img)

res = cv2.resize(img,
                 None,
                 fx=1.2,
                 fy=0.7,
                 interpolation=cv2.INTER_CUBIC)
plt.imshow(res)
Rotation of the image by an angle t:

which results in Figure 3.

Figure 3: Image rotation

3.6.7.3.4 Gray-scaling

which results in Figure 4.

Figure 4: Gray-scaling
rows, cols, _ = img.shape

t = 45
M = cv2.getRotationMatrix2D((cols/2, rows/2), t, 1)
dst = cv2.warpAffine(img, M, (cols, rows))

plt.imshow(dst)

img2 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(img2, cmap='gray')
3.6.7.3.5 Image Thresholding

which results in Figure 5.

Figure 5: Image Thresholding

3.6.7.3.6 Edge Detection

Edge detection using the Canny edge detection algorithm:

which results in Figure 6.

Figure 6: Edge detection
3.6.7.4 Additional Features

OpenCV has implementations of many machine learning techniques such as KMeans and Support Vector Machines that can be put into use with only a few lines of code. It also has functions especially for video analysis, feature detection, object recognition and many more. You can find out more about them on their website.
ret, thresh = cv2.threshold(img2, 127, 255, cv2.THRESH_BINARY)
plt.subplot(1, 2, 1), plt.imshow(img2, cmap='gray')
plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

edges = cv2.Canny(img2, 100, 200)
plt.subplot(121), plt.imshow(img2, cmap='gray')
plt.subplot(122), plt.imshow(edges, cmap='gray')
[OpenCV](https://docs.opencv.org/3.0-beta/index.html) was initially developed for C++ and still has a focus on that language, but it is still one of the most valuable image processing libraries in Python.
3.6.8 Secchi Disk ☁

We are developing an autonomous robot boat that you can be part of developing within this class. The robot boat is actually measuring turbidity or water clarity. Traditionally this has been done with a Secchi disk. The use of the Secchi disk is as follows:

1. Lower the Secchi disk into the water.
2. Measure the point when you can no longer see it.
3. Record the depth at various levels and plot it in a geographical 3D map.

One of the things we can do is take a video of the measurement instead of a human recording them. Then we can analyse the video automatically to see how deep a disk was lowered. This is a classical image analysis program. You are encouraged to identify algorithms that can identify the depth. The simplest seems to be to do a histogram at a variety of depth steps, and measure when the histogram no longer changes significantly. The depth at that image will be the measurement we look for.

Thus if we analyse the images we need to look at the image and identify the numbers on the measuring tape, as well as the visibility of the disk.
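The histogram idea can be sketched as follows. This is a minimal sketch, assuming the video has been split into a list of frames (one per depth step); the stability threshold is a parameter to tune:

# A minimal sketch of the histogram idea described above; frames and
# threshold are hypothetical inputs taken from the recorded video.
import cv2

def histogram(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.calcHist([gray], [0], None, [256], [0, 256])

def disk_vanishes(frames, threshold=0.99):
    """Return the index of the first frame whose histogram no longer
    changes significantly compared to the previous one."""
    previous = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        current = histogram(frame)
        if cv2.compareHist(previous, current, cv2.HISTCMP_CORREL) > threshold:
            return i      # histogram stable: disk no longer visible
        previous = current
    return None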
To showcase how such a disk looks, we refer to the image showcasing different Secchi disks. For our purpose the black-white contrast Secchi disk works well. See Figure 7.

Figure 7: Secchi disk types. A marine style on the left and the freshwater version on the right (wikipedia).

More information about the Secchi Disk can be found at:

https://en.wikipedia.org/wiki/Secchi_disk
We have included next a couple of examples using some obviously useful OpenCV methods. Surprisingly, the use of edge detection, which comes to mind first to identify whether we can still see the disk, seems too complicated to use for analysis. We at this time believe the histogram will be sufficient.

Please inspect our examples.
3.6.8.1 Setup for OSX

First let us set up the OpenCV environment for OSX. Naturally you will have to update the versions based on your versions of python. When we tried the install of OpenCV on MacOS, the setup was slightly more complex than other packages. This may have changed by now, and if you have improved instructions, please let us know. However, we do not want to install it via Anaconda for the obvious reason that anaconda installs too many other things.

import os, sys
from os.path import expanduser
3.6.8.2 Step 1: Record the video

Record the video on the robot.

We have actually done this for you and will provide you with images and videos if you are interested in analyzing them. See Figure 8.

3.6.8.3 Step 2: Analyse the images from the Video

For now we just selected 4 images from the video:
home = expanduser("~")
sys.path.append('/usr/local/Cellar/opencv/3.3.1_1/lib/python3.6/site-packages/')
sys.path.append(home + '/.pyenv/versions/OPENCV/lib/python3.6/site-packages/')
import cv2
cv2.__version__
!pip install numpy > tmp.log
!pip install matplotlib >> tmp.log

%matplotlib inline
import cv2
import matplotlib.pyplot as plt

img1 = cv2.imread('secchi/secchi1.png')
img2 = cv2.imread('secchi/secchi2.png')
img3 = cv2.imread('secchi/secchi3.png')
img4 = cv2.imread('secchi/secchi4.png')

figures = []
fig = plt.figure(figsize=(18, 16))
for i in range(1, 13):
    figures.append(fig.add_subplot(4, 3, i))
count = 0
for img in [img1, img2, img3, img4]:
    figures[count].imshow(img)
    color = ('b', 'g', 'r')
    for i, col in enumerate(color):
        histr = cv2.calcHist([img], [i], None, [256], [0, 256])
        figures[count + 1].plot(histr, color=col)
    figures[count + 2].hist(img.ravel(), 256, [0, 256])
    count += 3

print("Legend")
print("First column = image of Secchi disk")
print("Second column = histogram of colors in image")
print("Third column = histogram of all values")

plt.show()
Figure 8: Histogram

3.6.8.3.1 Image Thresholding

See Figure 9, Figure 10, Figure 11, Figure 12.

def threshold(img):
    ret, thresh = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
    plt.subplot(1, 2, 1), plt.imshow(img, cmap='gray')
    plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

threshold(img1)

Figure 9: Threshold 1

Figure 10: Threshold 2

Figure 11: Threshold 3

Figure 12: Threshold 4
3.6.8.3.2 Edge Detection

threshold(img2)
threshold(img3)
threshold(img4)

See Figure 13, Figure 14, Figure 15, Figure 16, Figure 17. Edge detection using the Canny edge detection algorithm:

Figure 13: Edge Detection 1

Figure 14: Edge Detection 2

Figure 15: Edge Detection 3

def find_edge(img):
    edges = cv2.Canny(img, 50, 200)
    plt.subplot(121), plt.imshow(img, cmap='gray')
    plt.subplot(122), plt.imshow(edges, cmap='gray')

find_edge(img1)
find_edge(img2)
find_edge(img3)
find_edge(img4)

Figure 16: Edge Detection 4

3.6.8.3.3 Black and white

bw1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
plt.imshow(bw1, cmap='gray')

Figure 17: Black White conversion
3.7 DATA

3.7.1 Data Formats ☁

3.7.1.1 YAML

The term YAML stands for "YAML Ain't Markup Language". According to the Web Page at
http://yaml.org/
"YAML is a human friendly data serialization standard for all programming languages." There are multiple versions of YAML existing, and one needs to take care that your software supports the right version. The current version is YAML 1.2.
YAML is often used for configuration and in many cases can also be used as an XML replacement. Important is that YAML, in contrast to XML, removes the tags while replacing them with indentation. This has naturally the advantage that it is easier to read; however, the format is strict and needs to adhere to proper indentation. Thus it is important that you check your YAML files for correctness, either by writing, for example, a python program that reads your YAML file, or an online YAML checker such as provided at
http://www.yamllint.com/
An example of how to use yaml in python is provided next. Please note that YAML is a superset of JSON. Originally YAML was designed as a markup language. However, as it is not document oriented but data oriented, it has been recast and it no longer classifies itself as a markup language.

import os
import sys
import yaml

try:
    yamlFilename = os.sys.argv[1]
    yamlFile = open(yamlFilename, "r")
except:
    print("filename does not exist")
    sys.exit()
try:
    yaml.load(yamlFile.read())
except:
    print("YAML file is not valid.")

Resources:

- http://yaml.org/
- https://en.wikipedia.org/wiki/YAML
- http://www.yamllint.com/
3.7.1.2 JSON

The term JSON stands for JavaScript Object Notation. It is targeted as an open-standard file format that emphasizes the integration of human-readable text to transmit data objects. The data objects contain attribute-value pairs. Although it originates from JavaScript, the format itself is language independent. It uses brackets to allow organization of the data. Please note that YAML is a superset of JSON and not all YAML documents can be converted to JSON. Furthermore, JSON does not support comments. For these reasons we often prefer to use YAML instead of JSON. However, JSON data can easily be translated to YAML as well as XML.
Resources:

- https://en.wikipedia.org/wiki/JSON
- https://www.json.org/

3.7.1.3 XML

XML stands for Extensible Markup Language. XML allows to define documents with the help of a set of rules in order to make them machine readable. The emphasis here is on machine readable, as documents in XML can become quickly complex and difficult to understand for humans. XML is used for documents as well as data structures.

A tutorial about XML is available at

https://www.w3schools.com/xml/default.asp

Resources:

- https://en.wikipedia.org/wiki/XML
3.7.2 MongoDB in Python ☁

Learning Objectives

- Introduction to basic MongoDB knowledge
- Use of MongoDB via PyMongo
- Use of MongoEngine and its Object-Document mapper
- Use of Flask-Mongo

In today's era, NoSQL databases have developed an enormous potential to process unstructured data efficiently. Modern information is complex, extensive, and may not have pre-existing relationships. With the advent of advanced search engines, machine learning, and Artificial Intelligence, technology expectations to process, store, and analyze such data have grown tremendously [2]. NoSQL database engines such as MongoDB, Redis, and Cassandra have successfully overcome traditional relational database challenges such as scalability, performance, unstructured data growth, agile sprint cycles, and growing needs of processing data in real-time with minimal hardware processing power [3]. The NoSQL databases are a new generation of engines that do not necessarily require the SQL language and are sometimes also called Not Only SQL databases. However, most of them support various third-party open connectivity drivers that can map NoSQL queries to SQL's. It would be safe to say that although NoSQL databases are still far from replacing relational databases, they are adding immense value when used in hybrid IT environments in conjunction with relational databases, based on application specific needs [3]. We will be covering the MongoDB technology, its driver PyMongo, its object-document mapper MongoEngine, and the Flask-PyMongo micro-web framework that make MongoDB more attractive and user-friendly.
3.7.2.1 Cloudmesh MongoDB Usage Quickstart

Before you read on, we like you to read this quickstart. The easiest way for many of the activities we do to interact with MongoDB is to use our cloudmesh functionality. This prelude section is not intended to describe all the details, but to get you started quickly while leveraging cloudmesh.

This is done via the cloudmesh cmd5 and the cloudmesh_community/cm code:
https://cloudmesh-community.github.io/cm/
To install mongo on, for example, macOS you can use

$ cms admin mongo install

To start, stop, and see the status of mongo you can use

$ cms admin mongo start
$ cms admin mongo stop
$ cms admin mongo status

To add an object to Mongo, you simply have to define a dict with predefined values for kind and cloud. In the future such attributes can be passed to the function to determine the MongoDB collection.

from cloudmesh.mongo.DataBaseDecorator import DatabaseUpdate

@DatabaseUpdate
def test():
    data = {
        "kind": "test",
        "cloud": "testcloud",
        "value": "hello"
    }
    return data

result = test()

When you invoke the function, it will automatically store the information into MongoDB. Naturally this requires that the ~/.cloudmesh/cloudmesh.yaml file is properly configured.
3.7.2.2 MongoDB

Today MongoDB is one of the leading NoSQL databases, which is fully capable of handling dynamic changes, processing large volumes of complex and unstructured data, easily using object-oriented programming features, as well as distributed system challenges [4]. At its core, MongoDB is an open source, cross-platform, document database mainly written in the C++ language.

3.7.2.2.1 Installation

MongoDB can be installed on various Unix platforms, including Linux, Ubuntu, Amazon Linux, etc. [5]. This section focuses on installing MongoDB on Ubuntu 18.04 Bionic Beaver, used as a standard OS for a virtual machine used as a part of the Big Data Application Class during the 2018 Fall semester.

3.7.2.2.1.1 Installation procedure

Before installing, it is recommended to configure a non-root user and provide administrative privileges to it, in order to be able to perform general MongoDB admin tasks. This can be accomplished by login as the root user in the following manner [6].
When logged in as a regular user, one can perform actions with superuser privileges by typing sudo before each command [6].

Once the user setup is completed, one can login as a regular user (mongoadmin) and use the following instructions to install MongoDB.

To update the Ubuntu packages to the most recent versions, use the next command:

To install the MongoDB package:

To check the service and database status:

Verifying the status of a successful MongoDB installation can be confirmed with an output similar to this:

To verify the configuration, more specifically the installed version, server, and port, use the following command:

Similarly, to restart MongoDB, use the following:

To allow access to MongoDB from an outside hosted server, one can use the following command, which opens the fire-wall connections [5].

Status can be verified by using:
$ adduser mongoadmin
$ usermod -aG sudo mongoadmin

$ sudo apt update

$ sudo apt install -y mongodb

$ sudo systemctl status mongodb
mongodb.service - An object/document-oriented database
   Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-11-15 07:48:04 UTC; 2min 17s ago
     Docs: man:mongod(1)
 Main PID: 2312 (mongod)
    Tasks: 23 (limit: 1153)
   CGroup: /system.slice/mongodb.service
           └─2312 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf
$ mongo --eval 'db.runCommand({ connectionStatus: 1 })'

$ sudo systemctl restart mongodb

$ sudo ufw allow from your_other_server_ip/32 to any port 27017
Other MongoDB configurations, such as port, hostnames, and file paths, can be edited through the /etc/mongodb.conf file.

Also, to complete this step, a server's IP address must be added to the bindIP value [5].

MongoDB is now listening for a remote connection that can be accessed by anyone with appropriate credentials [5].
3.7.2.2.2 Collections and Documents

Each database within the Mongo environment contains collections which in turn contain documents. Collections and documents are analogous to tables and rows respectively in relational databases. The document structure is in a key-value form which allows storing of complex data types composed out of field and value pairs. Documents are objects which correspond to native data types in many programming languages; hence a well defined, embedded document can help reduce expensive joins and improve query performance. The _id field helps to identify each document uniquely [3].

MongoDB offers flexibility to write records that are not restricted by column types. The data storage approach is flexible as it allows one to store data as it grows and to fulfill varying needs of applications and/or users. It supports a JSON-like binary format known as BSON where data can be stored without specifying the type of data. Moreover, it can be distributed to multiple machines at high speed. It includes a sharding feature that partitions and spreads the data out across various servers. This makes MongoDB an excellent choice for cloud data processing. Its utilities can load high volumes of data at high speed, which ultimately provides greater flexibility and availability in a cloud-based environment [2].

The dynamic schema structure within MongoDB allows easy testing of the small sprints in the Agile project management life cycles and research projects that require frequent changes to the data structure with minimal downtime. Contrary to this flexible process, modifying the data structure of relational databases can be a very tedious process [2].
$ sudo ufw status

$ sudo nano /etc/mongodb.conf

logappend=true
bind_ip = 127.0.0.1,your_server_ip
port = 27017
3.7.2.2.2.1 Collection example

The following collection example for a person named Albert includes additional information such as age, status, and group [7].

{
   name: "Albert",
   age: "21",
   status: "Open",
   group: ["AI", "Machine Learning"]
}

3.7.2.2.2.2 Document structure

{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}

3.7.2.2.2.3 Collection Operations

If a collection does not exist, the MongoDB database will create the collection by default.

> db.myNewCollection1.insertOne({x: 1})
> db.myNewCollection2.createIndex({y: 1})
3.7.2.2.3 MongoDB Querying

The data retrieval patterns and the frequency of data manipulation statements such as inserts, updates, and deletes may demand the use of indexes or incorporating the sharding feature to improve query performance and efficiency of the MongoDB environment [3]. One of the significant differences between relational databases and NoSQL databases are joins. In a relational database, one can combine results from two or more tables using a common column, often called a key. The native table contains the primary key column while the referenced table contains a foreign key. This mechanism allows one to make changes in a single row instead of changing all rows in the referenced table. This action is referred to as normalization. MongoDB is a document database and mainly contains denormalized data, which means the data is repeated instead of indexed over a specific key. If the same data is required in more than one table, it needs to be repeated. This constraint has been eliminated in MongoDB's new version 3.2. The new release introduced a $lookup feature which more or less works as a left-outer-join. Lookups are restricted to aggregate functions, which means that data usually needs some type of filtering and grouping operations to be conducted beforehand. For this reason, joins in MongoDB require more complicated querying compared to traditional relational database joins. Although at this time lookups are still very far from replacing joins, this is a prominent feature that can resolve some of the relational data challenges for MongoDB [8]. MongoDB queries support regular expressions as well as range asks for specific fields that eliminate the need of returning entire documents [3]. MongoDB collections do not enforce document structure like SQL databases, which is a compelling feature. However, it is essential to keep in mind the needs of the applications [2].
3.7.2.2.3.1 Mongo Queries examples

The queries can be executed from the Mongo shell as well as through scripts.

To query the data from a MongoDB collection, one would use MongoDB's find() method.

The output can be formatted by using the pretty() command.

The MongoDB insert statements can be performed in the following manner:

"The $lookup command performs a left-outer-join to an unsharded collection in the same database to filter in documents from the joined collection for processing" [9].

> db.COLLECTION_NAME.find()

> db.mycol.find().pretty()

> db.COLLECTION_NAME.insert(document)
{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}
This operation is equivalent to the following SQL operation:

SELECT *, <output array field>
FROM collection
WHERE <output array field> IN (SELECT *
                               FROM <collection to join>
                               WHERE <foreignField> = <collection.localField>);

To perform a Like Match (Regex), one would use the following command:

> db.products.find({sku: {$regex: /789$/}})
3.7.2.2.4 MongoDB Basic Functions

When it comes to the technical elements of MongoDB, it possesses a rich interface for importing and storage of external data in various formats. By using the Mongo Import/Export tool, one can easily transfer contents from JSON, CSV, or TSV files into a database. MongoDB supports CRUD (create, read, update, delete) operations efficiently and has detailed documentation available on the product website. It can also query geospatial data, and it is capable of storing geospatial data in GeoJSON objects. The aggregation operation of MongoDB processes data records and returns computed results. The MongoDB aggregation framework is modeled on the concept of data pipelines [10].

3.7.2.2.4.1 Import/Export functions examples

To import JSON documents, one would use the following command:

$ mongoimport --db users --collection contacts --file contacts.json

The CSV import uses the input file name to import a collection, hence the collection name is optional [10].

$ mongoimport --db users --type csv --headerline --file /opt/backups/contacts.csv

"Mongoexport is a utility that produces a JSON or CSV export of data stored in a MongoDB instance" [10].

$ mongoexport --db test --collection traffic --out traffic.json
3.7.2.2.5 Security Features
Data security is a crucial aspect of enterprise infrastructure management and is the reason why MongoDB provides various security features such as role-based access control, numerous authentication options, and encryption. It supports mechanisms such as SCRAM, LDAP, and Kerberos authentication. The administrator can create role/collection-based access control; also, roles can be predefined or custom. MongoDB can audit activities such as DDL, CRUD statements, authentication and authorization operations [11].
3.7.2.2.5.1 Collection based access control example

A user defined role can contain the following privileges [11].

privileges: [
  {resource: {db: "products", collection: "inventory"}, actions: ["find", "update"]},
  {resource: {db: "products", collection: "orders"}, actions: ["find"]}
]
3.7.2.2.6 MongoDB Cloud Service

In regards to cloud technologies, MongoDB also offers a fully automated cloud service called Atlas with competitive pricing options. The Mongo Atlas Cloud interface offers an interactive GUI for managing cloud resources and deploying applications quickly. The service is equipped with geographically distributed instances to ensure no single point of failure. Also, a well-rounded performance monitoring interface allows users to promptly detect anomalies and generate index suggestions to optimize the performance and reliability of the database. Global technology leaders such as Google, Facebook, eBay, and Nokia are leveraging MongoDB and Atlas cloud services, making MongoDB one of the most popular choices among the NoSQL databases [12].

3.7.2.3 PyMongo

PyMongo is the official Python driver or distribution that allows work with a NoSQL type database called MongoDB [13]. The first version of the driver was developed in 2009 [14], only two years after the development of MongoDB was started. This driver allows developers to combine both Python's versatility and MongoDB's flexible schema nature into successful applications. Currently, this driver supports MongoDB versions 2.6, 3.0, 3.2, 3.4, 3.6, and 4.0 [15]. MongoDB and Python represent a compatible fit considering that BSON (binary JSON), used in this NoSQL database, is very similar to Python dictionaries, which makes the collaboration between the two even more appealing [16]. For this reason, dictionaries are the recommended tools to be used in PyMongo when representing documents [17].
3.7.2.3.1 Installation

Prior to being able to exploit the benefits of Python and MongoDB simultaneously, the PyMongo distribution must be installed using pip. To install it on all platforms, the following command should be used [18]:

$ python -m pip install pymongo

Specific versions of PyMongo can be installed with command lines such as in our example, where the 3.5.1 version is installed [18].

A single line of code can be used to upgrade the driver as well [18].

Furthermore, the installation process can be completed with the help of the easy_install tool, which requires users to use the following command [18].

To do an upgrade of the driver using this tool, the following command is recommended [18]:

There are many other ways of installing PyMongo directly from the source; however, they require the C extension dependencies to be installed prior to the driver installation step, as they are the ones that skim through the sources on GitHub and use the most up-to-date links to install the driver [18].

To check if the installation was completed accurately, the following command is used in the Python console [19].

If the command returns zero exceptions within the Python shell, one can consider the PyMongo installation to have been completed successfully.
$ python -m pip install pymongo==3.5.1

$ python -m pip install --upgrade pymongo

$ python -m easy_install pymongo

$ python -m easy_install -U pymongo

import pymongo
3.7.2.3.2 Dependencies

The PyMongo driver has a few dependencies that should be taken into consideration prior to its usage. Currently, it supports CPython 2.7, 3.4+, PyPy, and PyPy 3.5+ interpreters [15]. An optional dependency that requires some additional components to be installed is the GSSAPI authentication [15]. For Unix based machines, it requires pykerberos, while for Windows machines WinKerberos is needed to fulfill this requirement [15]. The automatic installation of this dependency can be done simultaneously with the driver installation, in the following manner:

Other third-party dependencies such as ipaddress, certifi, or wincerstore are necessary for connections with help of TLS/SSL and can also be simultaneously installed along with the driver installation [15].

3.7.2.3.3 Running PyMongo with the Mongo Daemon

Once PyMongo is installed, the Mongo daemon can be run with a very simple command in a new terminal window [19].

3.7.2.3.4 Connecting to a database using MongoClient

In order to be able to establish a connection with a database, a MongoClient class needs to be imported, which subsequently allows the MongoClient object to communicate with the database [19].

This command allows a connection with a default, local host through port 27017; however, depending on the programming requirements, one can also specify those by listing them in the client instance or use the same information via the Mongo URI format [19].

3.7.2.3.5 Accessing Databases
$ python -m pip install pymongo[gssapi]

$ mongod

from pymongo import MongoClient
client = MongoClient()
Since MongoClient plays a server role, it can be used to access any desired databases in an easy way. To do that, one can use two different approaches. The first approach would be doing this via the attribute method, where the name of the desired database is listed as an attribute, and the second approach would include dictionary-style access [19]. For example, to access a database called cloudmesh_community, one would use the following commands for the attribute and for the dictionary method, respectively.

3.7.2.3.6 Creating a Database

Creating a database is a straightforward process. First, one must create a MongoClient object and specify the connection (IP address) as well as the name of the database they are trying to create [20]. An example of this command is presented in the following section:

3.7.2.3.7 Inserting and Retrieving Documents (Querying)

Creating documents and storing data using PyMongo is equally easy as accessing and creating databases. In order to add new data, a collection must be specified first. In this example, a decision is made to use the cloudmesh group of documents.

Once this step is completed, data may be inserted using the insert_one() method, which means that only one document is being created. Of course, insertion of multiple documents at the same time is possible as well with the use of the insert_many() method [19]. An example of this method is as follows:

Another example of this method would be to create a collection. If we wanted to create a collection of students in the cloudmesh_community, we would do it in the following manner:
db = client.cloudmesh_community

db = client['cloudmesh_community']

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']

cloudmesh = db.cloudmesh
course_info = {
    'course': 'Big Data Applications and Analytics',
    'instructor': 'Gregor von Laszewski',
    'chapter': 'technologies'
}
result = cloudmesh.insert_one(course_info)
Retrieving documents is equally simple as creating them. The find_one() method can be used to retrieve one document [19]. An implementation of this method is given in the following example.

Similarly, to retrieve multiple documents, one would use the find() method instead of find_one(). For example, to find all courses taught by professor von Laszewski, one would use the following command:

One thing that users should be cognizant of when using the find() method is that it does not return results in an array format but as a cursor object, which is a combination of methods that work together to help with data querying [19]. In order to return individual documents, iteration over the result must be completed [19].
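A minimal sketch of such an iteration, reusing the find() query from the previous example, looks like:

# Iterate over the cursor returned by find(); each iteration yields
# one document as a dictionary.
courses = cloudmesh.find({'instructor': 'Gregor von Laszewski'})
for course in courses:
    print(course['course'])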
3.7.2.3.8 Limiting Results

When it comes to working with large databases, it is always useful to limit the number of query results. PyMongo supports this option with its limit() method [20]. This method takes in one parameter which specifies the number of documents to be returned [20]. For example, if we had a collection with a large number of cloud technologies as individual documents, one could modify the query results to return only the top 10 technologies. To do this, the following example could be utilized:
student = [{'name': 'John', 'st_id': 52642},
           {'name': 'Mercedes', 'st_id': 5717},
           {'name': 'Anna', 'st_id': 5654},
           {'name': 'Greg', 'st_id': 5423},
           {'name': 'Amaya', 'st_id': 3540},
           {'name': 'Cameron', 'st_id': 2343},
           {'name': 'Bozer', 'st_id': 4143},
           {'name': 'Cody', 'st_id': 2165}]

client = MongoClient('mongodb://localhost:27017/')

with client:
    db = client.cloudmesh
    db.students.insert_many(student)

gregors_course = cloudmesh.find_one({'instructor': 'Gregor von Laszewski'})

gregors_course = cloudmesh.find({'instructor': 'Gregor von Laszewski'})

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['technologies']
topten = col.find().limit(10)
3.7.2.3.9 Updating Collection

Updating documents is very similar to inserting and retrieving them. Depending on the number of documents to be updated, one would use the update_one() or update_many() method [20]. Two parameters need to be passed in the update_one() method for it to successfully execute. The first argument is the query object that specifies the document to be changed, and the second argument is the object that specifies the new value in the document. An example of the update_one() method in action is the following:

Updating all documents that fall under the same criteria can be done with the update_many method [20]. For example, to update all documents in which the course title starts with the letter B with different instructor information, we would do the following:

3.7.2.3.10 Counting Documents

Counting documents can be done with one simple operation called count_documents() instead of using a full query [21]. For example, we can count the documents in the cloudmesh_community by using the following command:

To create a more specific count, one would use a command similar to this:

This technology supports some more advanced querying options as well. Those advanced queries allow one to add certain constraints and narrow down the results even more. For example, to get the courses taught by professor von Laszewski after a certain date, one would use the following command:
myquery = {'course': 'Big Data Applications and Analytics'}
newvalues = {'$set': {'course': 'Cloud Computing'}}

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['courses']
query = {'course': {'$regex': '^B'}}
newvalues = {'$set': {'instructor': 'Gregor von Laszewski'}}
edited = col.update_many(query, newvalues)

cloudmesh.count_documents({})

cloudmesh.count_documents({'author': 'von Laszewski'})

import datetime
import pprint

d = datetime.datetime(2017, 11, 12, 12)
for course in cloudmesh.find({'date': {'$lt': d}}).sort('author'):
    pprint.pprint(course)
3.7.2.3.11 Indexing

Indexing is a very important part of querying. It can greatly improve query performance but also add functionality and aid in storing documents [21].

"To create a unique index on a key that rejects documents whose value for that key already exists in the index" [21].

We need to first create the index in the following manner:

This command actually creates two different indexes. The first one is the _id, created by MongoDB automatically, and the second one is the user_id, created by the user.

The purpose of those indexes is to cleverly prevent future additions of invalid user_ids into a collection.
3.7.2.3.12 Sorting

Sorting on the server-side is also available via MongoDB. The PyMongo sort() method is equivalent to the SQL order by statement, and it can be performed as pymongo.ASCENDING and pymongo.DESCENDING [22]. This method is much more efficient, as it is being completed on the server-side, compared to sorting completed on the client side. For example, to return all users with first name Gregor sorted in descending order by birthdate, we would use a command such as this:

3.7.2.3.13 Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to
result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                  unique=True)
sorted(list(db.profiles.index_information()))
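With the unique index from the previous example in place, a second insert with the same user_id is rejected. The following minimal sketch (with illustrative documents) shows the resulting DuplicateKeyError:

# The first insert succeeds; the second violates the unique index
# on user_id and raises DuplicateKeyError.
import pymongo

db.profiles.insert_one({'user_id': 211, 'name': 'Gregor'})
try:
    db.profiles.insert_one({'user_id': 211, 'name': 'Albert'})
except pymongo.errors.DuplicateKeyError:
    print('user_id 211 already exists')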
users = cloudmesh.users.find({'firstname': 'Gregor'}).sort('dateofbirth', pymongo.DESCENDING)
for user in users:
    print(user.get('email'))
"provide projection capabilities to reshape the returned data" [23].

In the aggregation pipeline, documents pass through multiple pipeline stages which convert documents into result data. The basic pipeline stages include filters. Those filters act like a document transformation by helping change the document output form. Other pipeline stages help group or sort documents with specific fields. By using native operations from MongoDB, the pipeline operators are efficient in aggregating results.

The addFields stage is used to add new fields into documents. It reshapes each document in the stream, similarly to the project stage. The output document will contain existing fields from the input documents and the newly added fields [24]. The following example shows how to add student details into a document.

The bucket stage is used to categorize incoming documents into groups based on specified expressions. Those groups are called buckets [24]. The following example shows the bucket stage in action.

In the bucketAuto stage, the boundaries are automatically determined in an attempt to evenly distribute documents into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [24].
db.cloudmesh_community.aggregate([
    {
        $addFields: {
            "document.StudentDetails": {
                $concat: ['$document.student.FirstName', '$document.student.LastName']
            }
        }
    }
])
db.user.aggregate([
    {"$group": {
        "_id": {
            "city": "$city",
            "age": {
                "$let": {
                    "vars": {
                        "age": {"$subtract": [{"$year": new Date()}, {"$year": "$birthDay"}]}},
                    "in": {
                        "$switch": {
                            "branches": [
                                {"case": {"$lt": ["$$age", 20]}, "then": 0},
                                {"case": {"$lt": ["$$age", 30]}, "then": 20},
                                {"case": {"$lt": ["$$age", 40]}, "then": 30},
                                {"case": {"$lt": ["$$age", 50]}, "then": 40},
                                {"case": {"$lt": ["$$age", 200]}, "then": 50}
                            ]}}}}},
        "count": {"$sum": 1}}}])
db.artwork.aggregate([
    {
        $bucketAuto: {
            groupBy: "$price",
            buckets: 4
        }
    }
])
The collStats stage returns statistics regarding a collection or view [24].

The count stage passes a document to the next stage that contains the number of documents that were input to the stage [24].

The facet stage helps process multiple aggregation pipelines in a single stage [24].

The geoNear stage returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [24].

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [24].
groupBy:"$price",
buckets:4
}
}
])
db.matrices.aggregate([{$collStats: {latencyStats: {histograms: true}}}])

db.scores.aggregate([
    {$match: {score: {$gt: 80}}},
    {$count: "passing_scores"}])

db.artwork.aggregate([{
    $facet: {"categorizedByTags": [{$unwind: "$tags"},
                                   {$sortByCount: "$tags"}],
             "categorizedByPrice": [
                 // Filter out documents without a price e.g., _id: 7
                 {$match: {price: {$exists: 1}}},
                 {$bucket: {groupBy: "$price",
                            boundaries: [0, 150, 200, 300, 400],
                            default: "Other",
                            output: {"count": {$sum: 1},
                                     "titles": {$push: "$title"}}}}],
             "categorizedByYears(Auto)": [
                 {$bucketAuto: {groupBy: "$year", buckets: 4}}]}}])

db.places.aggregate([
    {$geoNear: {
        near: {type: "Point", coordinates: [-73.99279, 40.719296]},
        distanceField: "dist.calculated",
        maxDistance: 2,
        query: {type: "public"},
        includeLocs: "dist.location",
        num: 5,
        spherical: true
    }}])

db.travelers.aggregate([
    {
        $graphLookup: {
            from: "airports",
            startWith: "$nearestAirport",
            connectFromField: "connects",
            connectToField: "airport",
            maxDepth: 2,
            depthField: "numConnections",
            as: "destinations"
        }
    }
])
The group stage consumes the document data per each distinct group. It has a RAM limit of 100 MB. If the stage exceeds this limit, group produces an error [24].

The indexStats stage returns statistics regarding the use of each index for a collection [24].

The limit stage is used for controlling the number of documents passed to the next stage in the pipeline [24].

The listLocalSessions stage gives the session information currently connected to the mongos or mongod instance [24].

The listSessions stage lists out all sessions that have been active long enough to propagate to the system.sessions collection [24].

The lookup stage is useful for performing outer joins to other collections in the same database [24].
connectToField:"airport",
maxDepth:2,
depthField:"numConnections",
as:"destinations"
}
}
])
db.sales.aggregate(
    [
        {
            $group: {
                _id: {month: {$month: "$date"}, day: {$dayOfMonth: "$date"},
                      year: {$year: "$date"}},
                totalPrice: {$sum: {$multiply: ["$price", "$quantity"]}},
                averageQuantity: {$avg: "$quantity"},
                count: {$sum: 1}
            }
        }
    ]
)

db.orders.aggregate([{$indexStats: {}}])

db.article.aggregate(
    {$limit: 5}
)

db.aggregate([{$listLocalSessions: {allUsers: true}}])

use config
db.system.sessions.aggregate([{$listSessions: {allUsers: true}}])

{
    $lookup:
      {
        from: <collection to join>,
        localField: <field from the input documents>,
        foreignField: <field from the documents of the "from" collection>,
        as: <output array field>
      }
}
The match stage is used to filter the document stream. Only matching documents pass to the next stage [24].

The project stage is used to reshape the documents by adding or deleting fields.

The redact stage reshapes stream documents by restricting information using information stored in the documents themselves [24].

The replaceRoot stage is used to replace a document with a specified embedded document [24].

The sample stage is used to sample out data by randomly selecting a number of documents from the input [24].

The skip stage skips a specified initial number of documents and passes the remaining documents to the pipeline [24].
db.articles.aggregate(
    [{$match: {author: "dave"}}]
)

db.books.aggregate([{$project: {title: 1, author: 1}}])

db.accounts.aggregate(
    [
        {$match: {status: "A"}},
        {
            $redact: {
                $cond: {
                    if: {$eq: ["$level", 5]},
                    then: "$$PRUNE",
                    else: "$$DESCEND"
                }}}]);

db.produce.aggregate([
    {
        $replaceRoot: {newRoot: "$in_stock"}
    }
])

db.users.aggregate(
    [{$sample: {size: 3}}]
)

db.article.aggregate(
    {$skip: 5}
);
The sort stage is useful for reordering the document stream by a specified sort key [24].

The sortByCount stage groups the incoming documents based on a specified expression value and counts the documents in each distinct group [24].

The unwind stage deconstructs an array field from the input documents to output a document for each element [24].

The out stage is used to write aggregation pipeline results into a collection. This stage should be the last stage of a pipeline [24].

Another option from the aggregation operations is the Map/Reduce framework, which essentially includes two different functions, map and reduce. The first one provides the key-value pair for each tag in the array, while the latter one

"sums over all of the emitted values for a given key" [23].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [23]. The Map/Reduce operation provides result data in a collection or returns results in-line. One can perform subsequent operations with the same input collection if the output of the same is written to a collection [25]. An operation that produces results in an in-line form must provide results within the BSON document size limit. The current limit for a BSON document is 16 MB. These types of operations are not supported by views [25]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [26]. Moreover, Map/Reduce has the ability to get more detailed results by passing the full_response=True argument to the map_reduce() function [26].
db.users.aggregate(
    [
        {$sort: {age: -1, posts: 1}}
    ]
)

db.exhibits.aggregate(
    [{$unwind: "$tags"}, {$sortByCount: "$tags"}])

db.inventory.aggregate([{$unwind: "$sizes"}])
db.inventory.aggregate([{$unwind: {path: "$sizes"}}])

db.books.aggregate([
    {$group: {_id: "$author", books: {$push: "$title"}}},
    {$out: "authors"}
])
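Based on the Map/Reduce description above, a minimal sketch may look as follows; the things collection and its tags array are illustrative:

# A minimal sketch of Map/Reduce with PyMongo: the mapper emits a
# (tag, 1) pair for each tag in the array, the reducer sums the
# emitted values for a given key.
from bson.code import Code

mapper = Code("""
    function () {
        this.tags.forEach(function (tag) {
            emit(tag, 1);
        });
    }
""")

reducer = Code("""
    function (key, values) {
        return Array.sum(values);
    }
""")

result = db.things.map_reduce(mapper, reducer, "tag_counts")
for doc in result.find():
    print(doc)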
3.7.2.3.14 Deleting Documents from a Collection

The deletion of documents with PyMongo is fairly straightforward. To do so, one would use the remove() method of the PyMongo Collection object [22]. Similarly to the reads and updates, specification of the documents to be removed is a must. For example, removal of the entire document collection with a score of 1 would require one to use the following command:

The safe parameter set to True ensures the operation was completed [22].
3.7.2.3.15 Copying a Database

Copying databases within the same mongod instance or between different mongod servers is made possible with the command() method after connecting to the desired mongod instance [27]. For example, to copy the cloudmesh database and name the new database cloudmesh_copy, one would use the command() method in the following manner:

There are two ways to copy a database between servers. If a server is not password-protected, one would not need to pass in the credentials nor to authenticate to the admin database [27]. In that case, to copy a database one would use the following command:

On the other hand, if the server where we are copying the database to is protected, one would use this command instead:
cloudmesh.users.remove({"score": 1}, safe=True)

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy')

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

client = MongoClient('target.example.com',
                     username='administrator',
                     password='pwd')
client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

3.7.2.3.16 PyMongo Strengths
One of PyMongo's strengths is that it allows document creation and querying natively

“through the use of existing language features such as nested dictionaries and lists” [22].

For moderately experienced Python developers, it is very easy to learn and one quickly feels comfortable with it.

“For these reasons, MongoDB and Python make a powerful combination for rapid, iterative development of horizontally scalable backend applications” [22].

According to [22], MongoDB is very applicable to modern applications, which makes PyMongo equally valuable [22].
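As a small sketch of this native style (the collection, field names, and values here are illustrative assumptions):

import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['cloudmesh_community']

# nested dictionaries and lists are stored as-is, with no mapping layer
db.users.insert_one({
    'name': 'Albert',                   # an illustrative document
    'courses': ['e222', 'e516'],        # a native Python list
    'address': {'city': 'Bloomington',  # a nested dictionary
                'state': 'IN'}
})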
3.7.2.4 MongoEngine
“MongoEngine is an Object-Document Mapper, written in Python for working with MongoDB” [28].

It is actually a library that allows a more advanced communication with MongoDB compared to PyMongo. As MongoEngine is technically considered to be an object-document mapper (ODM), it can also be considered to be

“equivalent to a SQL-based object relational mapper (ORM)” [19].

The primary reason why one would use an ODM is data conversion between computer systems that are not compatible with each other [29]. For the purpose of converting data to the appropriate form, a virtual object database must be created within the utilized programming language [29]. This library is also used to define schemata for documents within MongoDB, which ultimately helps with minimizing coding errors as well as defining methods on existing fields [30]. It is also very beneficial to the overall workflow as it tracks changes made to the documents and aids in the document saving process [31].
3.7.2.4.1 Installation
The installation process for this technology is fairly simple as it is considered to be a library. To install it, one would use the following command [32]:

A bleeding-edge version of MongoEngine can be installed directly from GitHub by first cloning the repository on the local machine, virtual machine, or cloud.
3.7.2.4.2 Connecting to a database using MongoEngine
Once installed, MongoEngine needs to be connected to an instance of mongod, similarly to PyMongo [33]. The connect() function must be used to successfully complete this step, and the argument that must be used in this function is the name of the desired database [33]. Prior to using this function, the function name needs to be imported from the MongoEngine library.

Similarly to the MongoClient, MongoEngine uses localhost and port 27017 by default; however, the connect() function also allows specifying other hosts and port arguments as well [33].

Other types of connections are also supported (i.e., URI) and they can be completed by providing the URI in the connect() function [33].
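For example, a URI-style connection might look as follows (the URI itself is an illustrative assumption):

from mongoengine import connect

# pass the full MongoDB URI via the host argument
connect(host='mongodb://localhost:27017/cloudmesh_community')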
3.7.2.4.3 Querying using MongoEngine
To query MongoDB using MongoEngine, an objects attribute is used, which is, technically, a part of the document class [34]. This attribute is called the QuerySetManager, which in return

“creates a new QuerySet object on access” [34].

To be able to access individual documents from a database, this object needs to be iterated over. For example, to return/print all students in the cloudmesh_community object (database), the following command would be used.
$ pip install mongoengine

from mongoengine import connect
connect('cloudmesh_community')

connect('cloudmesh_community', host='196.185.1.62', port=16758)

for user in cloudmesh_community.objects:
    print(user)
MongoEngine also has a capability of query filtering, which means that a keyword can be used within the called QuerySet object to retrieve specific information [34]. Let us say one would like to iterate over cloudmesh_community students that are natives of Indiana. To achieve this, one would use the following command:

This library also allows the use of all operators except for the equality operator in its queries, and moreover, has the capability of handling string queries, geo queries, list querying, and querying of the raw PyMongo queries [34].

The string queries are useful in performing text operations in the conditional queries. A query to find a document exactly matching the state ACTIVE can be performed in the following manner:

The query to retrieve document data for names that start with a case-sensitive AL can be written as:

To perform the same query for a case-insensitive AL, one would use the following command:

MongoEngine allows data extraction of geographical locations by using geo queries. The geo_within operator checks if a geometry is within a polygon.

The list query looks up the documents where the specified field matches exactly to the given value. To match all pages that have the word coding as an item in the tags list, one would use the following query:
indy_students = cloudmesh_community.objects(state='IN')

cloudmesh_community.objects(state__exact="ACTIVE")

cloudmesh_community.objects(name__startswith="AL")

cloudmesh_community.objects(name__istartswith="AL")
cloudmesh_community.objects(
    point__geo_within=[[[40, 5], [40, 6], [41, 6], [40, 5]]])

cloudmesh_community.objects(
    point__geo_within={"type": "Polygon",
                       "coordinates": [[[40, 5], [40, 6], [41, 6], [40, 5]]]})

class Page(Document):
    tags = ListField(StringField())

Page.objects(tags='coding')
Overall, it would be safe to say that MongoEngine has good compatibility with Python. It provides different functions to utilize Python easily with MongoDB, which makes this pair even more attractive to application developers.
3.7.2.5 Flask-PyMongo
“Flask is a micro-web framework written in Python” [35].

It was developed after Django, and it is very pythonic in nature, which implies that it is explicitly targeting the Python user community. It is lightweight as it does not require additional tools or libraries, and hence is classified as a micro-web framework. It is often used with MongoDB through the PyMongo connector, and it treats data within MongoDB as searchable Python dictionaries. Applications such as Pinterest, LinkedIn, and the community web page for Flask are using the Flask framework. Moreover, it supports various features such as RESTful request dispatching, secure cookies, Google App Engine compatibility, and integrated support for unit testing, etc. [35]. When it comes to connecting to a database, the connection details for MongoDB can be passed as a variable or configured in the PyMongo constructor with additional arguments such as username and password, if required. It is important that the versions of both Flask and MongoDB are compatible with each other to avoid functionality breaks [36].
3.7.2.5.1 Installation

Flask-PyMongo can be installed with an easy command such as this:

PyMongo can be added in the following manner:
3.7.2.5.2 Configuration

There are two ways to configure Flask-PyMongo. The first way would be to pass a MongoDB URI to the PyMongo constructor, while the second way would be to
$ pip install Flask-PyMongo

from flask import Flask
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/cloudmesh_community"
mongo = PyMongo(app)
“assign it to the MONGO_URI Flask configuration variable” [36].
3.7.2.5.3 Connection to multiple databases/servers

Multiple PyMongo instances can be used to connect to multiple databases or database servers. To achieve this, one would use a command similar to the following:
3.7.2.5.4 Flask-PyMongo Methods

Flask-PyMongo provides helpers for some common tasks. One of them is the Collection.find_one_or_404 method, shown in the following example:

This method is very similar to MongoDB's find_one() method; however, instead of returning None it causes a 404 Not Found HTTP status [36].

Similarly, the PyMongo.send_file and PyMongo.save_file methods work on file-like objects and save them to GridFS using the given filename [36].
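A minimal sketch of these two helpers, following the pattern in the Flask-PyMongo documentation (the route paths and function names are illustrative assumptions):

from flask import request, redirect, url_for

@app.route("/uploads/<path:filename>")
def get_upload(filename):
    # stream a file stored in GridFS back to the client
    return mongo.send_file(filename)

@app.route("/uploads/<path:filename>", methods=["POST"])
def save_upload(filename):
    # store the uploaded file-like object in GridFS under filename
    mongo.save_file(filename, request.files["file"])
    return redirect(url_for("get_upload", filename=filename))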
3.7.2.5.5 Additional Libraries

Flask-MongoAlchemy and Flask-MongoEngine are additional libraries that can be used to connect to a MongoDB database while using enhanced features with the Flask app. Flask-MongoAlchemy is used as a proxy between Python and MongoDB to connect. It provides options such as server- or database-based authentication to connect to MongoDB. While the default is set to server-based, to use database-based authentication, the config value MONGOALCHEMY_SERVER_AUTH parameter must be set to False [37].

Flask-MongoEngine is the Flask extension that provides integration with MongoEngine. It handles connection management for the apps. It can be installed through pip and set up very easily as well. The default configuration is
app = Flask(__name__)
mongo1 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_one")
mongo2 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_two")
mongo3 = PyMongo(app, uri=
    "mongodb://another.host:27017/cloudmesh_community_Three")

@app.route("/user/<username>")
def user_profile(username):
    user = mongo.db.cloudmesh_community.find_one_or_404({"_id": username})
    return render_template("user.html", user=user)
set to the local host and port 27017. For a custom port, and in cases where MongoDB is running on another server, the host and port must be explicitly specified in connect strings within the MONGODB_SETTINGS dictionary with app.config, along with the database username and password, in cases where database authentication is enabled. URI-style connections are also supported; supply the URI as the host in the MONGODB_SETTINGS dictionary with app.config. There are various custom QuerySets available within Flask-MongoEngine that are attached to MongoEngine's default QuerySet [38].
3.7.2.5.6 Classes and Wrappers

Attributes such as cx and db in the PyMongo objects are the ones that help provide access to the MongoDB server [36]. To achieve this, one must pass the Flask app to the constructor or call init_app() [36].

“Flask-PyMongo wraps PyMongo's MongoClient, Database, and Collection classes, and overrides their attribute and item accessors” [36].

This type of wrapping allows Flask-PyMongo to add methods to Collection while at the same time allowing MongoDB-style dotted expressions in the code [36].

Flask-PyMongo creates connectivity between Python and Flask using a MongoDB database and supports

“extensions that can add application features as if they were implemented in Flask itself” [39],

hence, it can be used as additional Flask functionality in Python code. The extensions are there for the purpose of supporting form validations, authentication technologies, object-relational mappers, and framework-related tools, which ultimately adds a lot of strength to this micro-web framework [39]. One of the main reasons and benefits why it is frequently used with MongoDB is its capability of adding more control over databases and history [39].
type(mongo.cx)
type(mongo.db)
type(mongo.db.cloudmesh_community)
3.7.3 MongoEngine ☁

3.7.3.1 Introduction

MongoEngine is a document mapper for working with MongoDB in Python. To be able to use MongoEngine, MongoDB should already be installed and running.
3.7.3.2 Install and connect

MongoEngine can be installed by running:

This will install six, pymongo, and mongoengine.

To connect to MongoDB, use the connect() function by specifying the MongoDB instance name. You do not need to go to the mongo shell; this can be done from the Unix shell or the cmd line. In this case we are connecting to a database named student_db.

If MongoDB is running on a port different from the default port, the port number and host need to be specified. If MongoDB requires authentication, the username and password need to be specified.
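A sketch of such a connection (the host, port, and credentials here are illustrative assumptions):

from mongoengine import connect

# non-default host and port
connect('student_db', host='192.168.1.35', port=12345)

# with authentication against the admin database
connect('student_db',
        username='admin',
        password='secret',
        authentication_source='admin')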
3.7.3.3 Basics

MongoDB does not enforce schemas. Compared to an RDBMS, a row in MongoDB is called a “document” and a table can be compared to a Collection. Defining a schema is helpful as it minimizes coding errors. To define a schema we create a class that inherits from Document.
$ pip install mongoengine

from mongoengine import *
connect('student_db')

from mongoengine import *

class Student(Document):
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)
Fields are not mandatory, but if needed, set the required keyword argument to True. There are multiple values available for field types. Each field can be customized by keyword arguments. If each student is sending text messages to the university's central database, these can be stored using MongoDB. Each text can have different data types; some might have images and some might have URLs. So we can create a class Text and link it to Student by using a ReferenceField (similar to a foreign key in an RDBMS).

MongoDB supports adding tags to individual texts rather than storing them separately and then having them referenced. Similarly, comments can also be stored directly in a Text, as sketched next.
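The Text class with comments shown later in this section references a Comment type; a minimal sketch of such an embedded document (the fields are illustrative assumptions) could be:

class Comment(EmbeddedDocument):
    content = StringField()              # the comment text
    name = StringField(max_length=120)   # who wrote the comment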
For accessing data: if we need to get titles.

Searching texts with tags.
3.8 CALCULATION

3.8.1 Word Count with Parallel Python ☁

We will demonstrate Python's multiprocessing API for parallel computation by writing a program that counts how many times each word in a collection of documents appears.
class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    meta = {'allow_inheritance': True}

class OnlyText(Text):
    content = StringField()

class ImagePost(Text):
    image_path = StringField()

class LinkPost(Text):
    link_url = StringField()

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    tags = ListField(StringField(max_length=30))
    comments = ListField(EmbeddedDocumentField(Comment))

for text in OnlyText.objects:
    print(text.title)

for text in Text.objects(tags='mongodb'):
    print(text.title)
3.8.1.1 Generating a Document Collection

Before we begin, let us write a script that will generate document collections by specifying the number of documents and the number of words per document. This will make benchmarking straightforward.

To keep it simple, the vocabulary of the document collection will consist of random numbers rather than the words of an actual language:

Notice that we are using the docopt module, which you should be familiar with from the Section [Python DocOpts](#s-python-docopts), to make the script easy to run from the command line.
'''Usage: generate_nums.py [-h] NUM_LISTS INTS_PER_LIST MIN_INT MAX_INT DEST_DIR

Generate random lists of integers and save them
as 1.txt, 2.txt, etc.

Arguments:
  NUM_LISTS      The number of lists to create.
  INTS_PER_LIST  The number of integers in each list.
  MIN_INT        Each generated integer will be >= MIN_INT.
  MAX_INT        Each generated integer will be <= MAX_INT.
  DEST_DIR       A directory where the generated numbers will be stored.

Options:
  -h --help
'''
from __future__ import print_function
import os, random, logging
from docopt import docopt


def generate_random_lists(num_lists,
                          ints_per_list, min_int, max_int):
    return [[random.randint(min_int, max_int)
             for i in range(ints_per_list)] for i in range(num_lists)]


if __name__ == '__main__':
    args = docopt(__doc__)
    num_lists, ints_per_list, min_int, max_int, dest_dir = [
        int(args['NUM_LISTS']),
        int(args['INTS_PER_LIST']),
        int(args['MIN_INT']),
        int(args['MAX_INT']),
        args['DEST_DIR']
    ]
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    lists = generate_random_lists(num_lists,
                                  ints_per_list,
                                  min_int,
                                  max_int)
    curr_list = 1
    for lst in lists:
        with open(os.path.join(dest_dir, '%d.txt' % curr_list), 'w') as f:
            f.write(os.linesep.join(map(str, lst)))
        curr_list += 1
    logging.debug('Numbers written.')
You can generate a document collection with this script as follows:

3.8.1.2 Serial Implementation

A first serial implementation of word count is straightforward:

3.8.1.3 Serial Implementation Using map and reduce

We can improve the serial implementation in anticipation of parallelizing the program by making use of Python's map and reduce functions.

In short, you can use map to apply the same function to the members of a collection. For example, to convert a list of numbers to strings, you could do:
python generate_nums.py 1000 10000 0 100 docs-1000-10000
'''Usage: wordcount.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)


def wordcount(files):
    counts = {}
    for filepath in files:
        with open(filepath, 'r') as f:
            words = [word.strip() for word in f.read().split()]
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return counts


if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    counts = wordcount(glob.glob(os.path.join(args['DATA_DIR'], '*.txt')))
    logging.debug(counts)
import random
nums = [random.randint(1, 2) for _ in range(10)]
print(nums)
[2, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(list(map(str, nums)))  # wrap in list() so the result displays in Python 3
We can use reduce to apply the same function cumulatively to the items of a sequence. For example, to find the total of the numbers in our list, we could use reduce as follows:

We can simplify this even more by using a lambda function:

You can read more about Python's lambda function in the docs.

With this in mind, we can reimplement the word count example as follows:
['2', '1', '1', '1', '2', '2', '2', '2', '2', '2']

from functools import reduce  # needed in Python 3; built in for Python 2

def add(x, y):
    return x + y

print(reduce(add, nums))
17

print(reduce(lambda x, y: x + y, nums))
17
'''Usage: wordcount_mapreduce.py [-h] DATA_DIR

Read a collection of .txt documents and count how
many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from functools import reduce  # needed in Python 3
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)


def count_words(filepath):
    counts = {}
    with open(filepath, 'r') as f:
        words = [word.strip() for word in f.read().split()]
    for word in words:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts


def merge_counts(counts1, counts2):
    for word, count in counts2.items():
        if word not in counts1:
            counts1[word] = 0
        counts1[word] += counts2[word]
    return counts1


if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
3.8.1.4 Parallel Implementation

Drawing on the previous implementation using map and reduce, we can parallelize the implementation using Python's multiprocessing API:

3.8.1.5 Benchmarking

To time each of the examples, enter it into its own Python file and use Linux's time command:

The output contains the real run time and the user run time. real is wall clock time, that is, the time from start to finish of the call. user is the amount of CPU time spent in user-mode code (outside the kernel) within the process, that is, only the actual CPU time used in executing the process.
    # list() so the map result can be concatenated in Python 3
    per_doc_counts = list(map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt'))))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)
'''Usage: wordcount_mapreduce_parallel.py [-h] DATA_DIR NUM_PROCESSES

Read a collection of .txt documents and count, in parallel, how many
times each word appears in the collection.

Arguments:
  DATA_DIR       A directory with documents (.txt files).
  NUM_PROCESSES  The number of parallel processes to use.

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from functools import reduce  # needed in Python 3
from docopt import docopt
from wordcount_mapreduce import count_words, merge_counts
from multiprocessing import Pool

logging.basicConfig(level=logging.DEBUG)

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])
    num_processes = int(args['NUM_PROCESSES'])

    pool = Pool(processes=num_processes)
    per_doc_counts = pool.map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt')))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)
$ time python wordcount.py docs-1000-10000
3.8.1.6 Exercises

E.python.wordcount.1:

Run the three different programs (serial, serial w/ map and reduce, parallel) and answer the following questions:

1. Is there any performance difference between the different versions of the program?
2. Does user time significantly differ from real time for any of the versions of the program?
3. Experiment with different numbers of processes for the parallel example, starting with 1. What is the performance gain when you go from 1 to 2 processes? From 2 to 3? When do you stop seeing improvement? (this will depend on your machine architecture)
3.8.1.7 References

Map, Filter and Reduce
multiprocessing API
3.8.2 NumPy ☁

NumPy is a popular library that is used by many other Python packages, such as Pandas, SciPy, and scikit-learn. It provides a fast, simple-to-use way of interacting with numerical data organized in vectors and matrices. In this section, we will provide a short introduction to NumPy.
3.8.2.1 Installing NumPy

The most common way of installing NumPy, if it wasn't included with your Python installation, is to install it via pip:

If NumPy has already been installed, you can update to the most recent version using:
$ pip install numpy
You can verify that NumPy is installed by trying to use it in a Python program:

Note that, by convention, we import NumPy using the alias np. Whenever you see np sprinkled in example Python code, it's a good bet that it is using NumPy.
3.8.2.2 NumPy Basics

At its core, NumPy is a container for n-dimensional data. Typically, 1-dimensional data is called an array and 2-dimensional data is called a matrix. Beyond 2 dimensions would be considered a multidimensional array. Examples where you'll encounter these dimensions may include:

1 Dimensional: time series data such as audio, stock prices, or a single observation in a dataset.
2 Dimensional: connectivity data between network nodes, user-product recommendations, and database tables.
3+ Dimensional: network latency between nodes over time, video (RGB+time), and version controlled datasets.

All of these data can be placed into NumPy's array object, just with varying dimensions.
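As a small sketch of this, arrays of different dimensionality are all created through the same interface:

import numpy as np

a = np.array([1, 2, 3])          # 1-dimensional array
m = np.array([[1, 2], [3, 4]])   # 2-dimensional matrix
t = np.zeros((2, 3, 4))          # 3-dimensional array (e.g. frames of data)
print(a.ndim, m.ndim, t.ndim)    # prints: 1 2 3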
3.8.2.3 Data Types: The Basic Building Blocks

Before we delve into arrays and matrices, we will start off with the most basic element of those: a single value. NumPy can represent data utilizing many different standard data types such as uint8 (an 8-bit unsigned integer), float64 (a 64-bit float), or str (a string). An exhaustive listing can be found at:

https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Before moving on, it is important to know about the tradeoff made when using different data types. For example, a uint8 can only contain values between 0 and 255. This, however, contrasts with float64, which can express any value from +/-
$ pip install -U numpy

import numpy as np
1.80e+308. So why wouldn't we just always use float64s? Though they allow us to be more expressive in terms of numbers, they also consume more memory. If we were working with a 12-megapixel image, for example, storing that image using uint8 values would require 3000*4000*8 = 96 million bits, or 11.44 MB of memory. If we were to store the same image utilizing float64, our image would consume 8 times as much memory: 768 million bits or 91.55 MB. It's important to use the right data type for the job to avoid consuming unnecessary resources or slowing down processing.
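We can check this arithmetic directly with NumPy's nbytes attribute; a quick sketch:

import numpy as np

img8 = np.zeros((3000, 4000), dtype=np.uint8)
img64 = np.zeros((3000, 4000), dtype=np.float64)
print(img8.nbytes)    # 12000000 bytes, about 11.44 MB
print(img64.nbytes)   # 96000000 bytes, about 91.55 MB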
Finally, while NumPy will conveniently convert between data types, one must be aware of overflows when using smaller data types. For example:

In this example, it makes sense that 6+7=13. But how does 13+245=2? Put simply, the object type (uint8) simply ran out of space to store the value and wrapped back around to the beginning. An 8-bit number is only capable of storing 2^8, or 256, unique values. An operation that results in a value above that range will 'overflow' and cause the value to wrap back around to zero. Likewise, anything below that range will 'underflow' and wrap back around to the end. In our example, 13+245 became 258, which was too large to store in 8 bits and wrapped back around to 0 and ended up at 2.

NumPy will, generally, try to avoid this situation by dynamically retyping to whatever data type will support the result:

Here, our addition caused our array, 'a', to be upscaled to use uint16 instead of uint8. Finally, NumPy offers convenience functions akin to Python's range() function to create arrays of sequential numbers:
a = np.array([6], dtype=np.uint8)
print(a)
>>> [6]
a = a + np.array([7], dtype=np.uint8)
print(a)
>>> [13]
a = a + np.array([245], dtype=np.uint8)
print(a)
>>> [2]

a = a + 260
print(a)
>>> [262]

X = np.arange(0.2, 1, .1)
print(X)
>>> array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=float32)
We can also use this function to generate parameter spaces that can be iterated on:
3.8.2.4 Arrays: Stringing Things Together

With our knowledge of data types in hand, we can begin to explore arrays. Simply put, arrays can be thought of as a sequence of values (not necessarily numbers). Arrays are 1-dimensional and can be created and accessed simply:

Arrays (and, later, matrices) are zero-indexed. This makes it convenient when, for example, using Python's range() function to iterate through an array:

Arrays are, also, mutable and can be changed easily:

NumPy also includes incredibly powerful broadcasting features. This makes it very simple to perform mathematical operations on arrays in a way that also makes intuitive sense:

Arrays can also interact with other arrays:
P = 10.0**np.arange(-7, 1, 1)
print(P)

for x, p in zip(X, P):
    print('%f, %f' % (x, p))

a = np.array([1, 2, 3])
print(type(a))
>>> <class 'numpy.ndarray'>
print(a)
>>> [1 2 3]
print(a.shape)
>>> (3,)
a[0]
>>> 1

for i in range(3):
    print(a[i])
>>> 1
>>> 2
>>> 3

a[0] = 42
print(a)
>>> array([42, 2, 3])

a * 3
>>> array([3, 6, 9])
a**2
>>> array([1, 4, 9], dtype=int32)

b = np.array([2, 3, 4])
print(a * b)
>>> array([2, 6, 12])
In this example, the result of multiplying together two arrays is to take the element-wise product, while multiplying by a constant will multiply each element in the array by that constant. NumPy supports all of the basic mathematical operations: addition, subtraction, multiplication, division, and powers. It also includes an extensive suite of mathematical functions, such as log() and max(), which are covered later.

3.8.2.5 Matrices: An Array of Arrays

Matrices can be thought of as an extension of arrays - rather than having one dimension, matrices have 2 (or more). Much like arrays, matrices can be created easily within NumPy:

Accessing individual elements is similar to how we did it for arrays. We simply need to pass in a number of arguments equal to the number of dimensions:

In this example, our first index selected the row and the second selected the column - giving us our result of 3. Matrices can be extended out to any number of dimensions by simply using more indices to access specific elements (though use-cases beyond 4 may be somewhat rare).

Matrices support all of the normal mathematical functions such as +, -, *, and /. A special note: the * operator will result in an element-wise multiplication. Use @ or np.matmul() for matrix multiplication:
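A quick self-contained sketch of the difference:

import numpy as np

m = np.array([[1, 2], [3, 4]])
print(m * m)   # element-wise: [[ 1  4]
               #                [ 9 16]]
print(m @ m)   # matrix product: [[ 7 10]
               #                  [15 22]]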
More complex mathematical functions can typically be found within the NumPy library itself:
A full listing can be found at:
m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]
m[1][0]
>>> 3

print(m - m)
print(m * m)
print(m / m)

print(np.sin(x))
print(np.sum(x))
https://docs.scipy.org/doc/numpy/reference/routines.math.html
3.8.2.6 Slicing Arrays and Matrices

As one can imagine, accessing elements one-at-a-time is both slow and can potentially require many lines of code to iterate over every dimension in the matrix. Thankfully, NumPy incorporates a very powerful slicing engine that allows us to access ranges of elements easily:

The ':' value tells NumPy to select all elements in the given dimension. Here, we've requested all elements in the second row (index 1). We can also use indexing to request elements within a given range:

Here, we asked NumPy to give us elements 4 through 7 (ranges in Python are inclusive at the start and non-inclusive at the end). We can even go backwards:

In the previous example, the negative value is asking NumPy to return the last 5 elements of the array. Had the argument been ':-5', NumPy would've returned everything BUT the last five elements:

Becoming more familiar with NumPy's accessor conventions will allow you to write more efficient, clearer code, as it is easier to read a simple one-line accessor than it is a multi-line, nested loop when extracting values from an array or matrix.
3.8.2.7 Useful Functions

The NumPy library provides several convenient mathematical functions that users can use. These functions provide several advantages to code written by
m[1, :]
>>> array([3, 4])

a = np.arange(0, 10, 1)
print(a)
>>> [0 1 2 3 4 5 6 7 8 9]

a[4:8]
>>> array([4, 5, 6, 7])

a[-5:]
>>> array([5, 6, 7, 8, 9])

a[:-5]
>>> array([0, 1, 2, 3, 4])
users:

They are open source and typically have multiple contributors checking for errors.
Many of them utilize a C interface and will run much faster than native Python code.
They're written to be very flexible.
NumPy arrays and matrices contain many useful aggregating functions such as max(), min(), mean(), etc. These functions are usually able to run an order of magnitude faster than looping through the object, so it's important to understand what functions are available to avoid 'reinventing the wheel.' In addition, many of the functions are able to sum or average across axes, which makes them extremely useful if your data has inherent grouping. To return to a previous example:

In this example, we created a 2x2 matrix containing the numbers 1 through 4. The sum of the matrix returned the element-wise addition of the entire matrix. Summing across axis 0 collapses the rows, returning the sum of each column. Likewise, summing across axis 1 collapses the columns, returning the sum of each row.
3.8.2.8 Linear Algebra

Perhaps one of the most important uses for NumPy is its robust support for linear algebra functions. Like the aggregation functions described in the previous section, these functions are optimized to be much faster than user implementations and can utilize processor-level features to provide very quick computations. These functions can be accessed very easily from the NumPy package:
m = np.array([[1, 2], [3, 4]])
print(m)
>>> [[1 2]
>>>  [3 4]]

m.sum()
>>> 10
m.sum(axis=1)
>>> [3, 7]
m.sum(axis=0)
>>> [4, 6]

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.matmul(a, b))
Included within np.linalg are functions for calculating the Eigendecomposition of square matrices and symmetric matrices. Finally, to give a quick example of how easy it is to implement algorithms in NumPy, we can easily use it to calculate the cost and gradient when using simple Mean-Squared-Error (MSE):

Finally, more advanced functions are easily available to users via the linalg library of NumPy as:
3.8.2.9 NumPy Resources

https://docs.scipy.org/doc/numpy
http://cs231n.github.io/python-numpy-tutorial/#numpy
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.linalg.html
https://en.wikipedia.org/wiki/Mean_squared_error
3.8.3 SciPy ☁

SciPy is a library built around NumPy and has a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (such as integration), statistics, linear algebra, image processing, signal processing, and machine learning.

To achieve this, SciPy bundles a number of useful open-source software packages for mathematics, science, and engineering. It includes the following packages:

NumPy,
for managing N-dimensional arrays
>>> [[19 22]
>>>  [43 50]]

cost = np.power(Y - np.matmul(X, weights), 2).mean(axis=1)
gradient = np.matmul(X.T, np.matmul(X, weights) - Y)

from numpy import linalg

A = np.diag((1, 2, 3))
w, v = linalg.eig(A)
print('w =', w)
print('v =', v)
SciPy library,
to access fundamental scientific computing capabilities

Matplotlib,
to conduct 2D plotting

IPython,
for an interactive console (see jupyter)

Sympy,
for symbolic mathematics

pandas,
for providing data structures and analysis
3.8.3.1 Introduction

First we add the usual scientific computing modules with the typical abbreviations, including sp for scipy. We could invoke scipy's statistical package as sp.stats, but for the sake of laziness we abbreviate that too.

Now we create some random data to play with. We generate 100 samples from a Gaussian distribution centered at zero.

How many elements are in the set?

What is the mean (average) of the set?
import numpy as np          # import numpy
import scipy as sp          # import scipy
from scipy import stats    # refer directly to stats rather than sp.stats
import matplotlib as mpl   # for visualization
from matplotlib import pyplot as plt  # refer directly to pyplot
                                      # rather than mpl.pyplot

s = sp.randn(100)

print('There are', len(s), 'elements in the set')
What is the minimum of the set?

What is the maximum of the set?

We can use the scipy functions too. What's the median?

What about the standard deviation and variance?

Isn't the variance the square of the standard deviation?

How close are the measures? The differences are close, as the following calculation shows.

How does this look as a histogram? See Figure 18, Figure 19, Figure 20
print('The mean of the set is', s.mean())

print('The minimum of the set is', s.min())

print('The maximum of the set is', s.max())

print('The median of the set is', sp.median(s))

print('The standard deviation is', sp.std(s),
      'and the variance is', sp.var(s))

print('The square of the standard deviation is', sp.std(s)**2)

print('The difference is', abs(sp.std(s)**2 - sp.var(s)))

print('And in decimal form, the difference is %0.16f' %
      (abs(sp.std(s)**2 - sp.var(s))))

plt.hist(s)  # yes, one line of code for a histogram
plt.show()
Figure 18: Histogram 1

Let us add some titles.

Figure 19: Histogram 2

Typically we do not include titles when we prepare images for inclusion in LaTeX. There we use the caption to describe what the figure is about.
plt.clf()  # clear out the previous plot
plt.hist(s)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

plt.clf()  # clear out the previous plot
plt.hist(s)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Figure 20: Histogram 3

Let us try out some linear regression, or curve fitting. See Figure 21.

Figure 21: Result 1
import random

def F(x):
    return 2 * x - 2

def add_noise(x):
    return x + random.uniform(-1, 1)

X = range(0, 10, 1)

Y = []
for i in range(len(X)):
    Y.append(add_noise(X[i]))

plt.clf()  # clear out the old figure
plt.plot(X, Y, '.')
plt.show()
Now let's try linear regression to fit the curve.

What is the slope and y-intercept of the fitted curve?

Now let's see how well the curve fits the data. We'll call the fitted curve F'.

To save images into a PDF file for inclusion into LaTeX documents, you can save the images as follows. Other formats such as png are also possible, but the quality is naturally not sufficient for inclusion in papers and documents. For that you certainly want to use PDF. The save of the figure has to occur before you use the show() command. See Figure 22.
m, b, r, p, est_std_err = stats.linregress(X, Y)

print('The slope is', m, 'and the y-intercept is', b)

def Fprime(x):  # the fitted curve
    return m * x + b

X = range(0, 10, 1)
Yprime = []
for i in range(len(X)):
    Yprime.append(Fprime(X[i]))

plt.clf()  # clear out the old figure

# the observed points, blue dots
plt.plot(X, Y, '.', label='observed points')

# the interpolated curve, connected red line
plt.plot(X, Yprime, 'r-', label='estimated points')

plt.title("Linear Regression Example")  # title
plt.xlabel("x")  # horizontal axis title
plt.ylabel("y")  # vertical axis title

# legend labels to plot
plt.legend(['observed points', 'estimated points'])

# comment out so that you can save the figure
# plt.show()

plt.savefig("regression.pdf", bbox_inches='tight')
plt.savefig('regression.png')
plt.show()
Figure 22: Result 2

3.8.3.2 References

For more information about SciPy we recommend that you visit the following link:

https://www.scipy.org/getting-started.html#learning-to-work-with-scipy

Additional material and inspiration for this section are from:

“Getting Started guide”, https://www.scipy.org/getting-started.html
Prasanth. “Simple statistics with SciPy.” Comfort at 1 AU. February 28, 2011. https://oneau.wordpress.com/2011/02/28/simple-statistics-with-scipy/
SciPy Cookbook. Last updated: 2015. http://scipy-cookbook.readthedocs.io/
3.8.4 Scikit-learn ☁

Learning Objectives

Exploratory data analysis
Pipeline to prepare data
Full learning pipeline
Fine tune the model
Significance tests

3.8.4.1 Introduction to Scikit-learn

Scikit-learn is a machine-learning-specific library used in Python. The library can be used for data mining and analysis. It is built on top of NumPy, matplotlib, and SciPy. Scikit-learn features dimensionality reduction, clustering, regression, and classification algorithms. It also features model selection using grid search, cross validation, and metrics.

Scikit-learn also enables users to preprocess the data, which can then be used for machine learning, using modules like preprocessing and feature extraction.

Later in this section we demonstrate how simple it is to use k-means in scikit-learn.
3.8.4.2 Installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:
3.8.4.3 Supervised Learning

Supervised learning is used in machine learning when we already know a set of output predictions based on input characteristics, and based on that we need to predict the target for a new input. Training data is used to train the model, which then can be used to predict the output from a bounded set.
$ pip install numpy
$ pip install scipy -U
$ pip install -U scikit-learn
Problems can be of two types:

1. Classification: Training data belongs to three or four classes/categories, and based on the label we want to predict the class/category for the unlabeled data.
2. Regression: Training data consists of input vectors with corresponding continuous target values; the task is to predict such a value for a new input. A simple example is predicting a house price from its size and location.
3.8.4.4 Unsupervised Learning

Unsupervised learning is used in machine learning when we have the training set available but without any corresponding target. The outcome of the problem is to discover groups within the provided input. It can be done in many ways.

A few of them are listed here:

1. Clustering: Discover groups of similar characteristics.
2. Density estimation: Finding the distribution of data within the provided input, or changing the data from a high dimensional space to two or three dimensions.
3.8.4.5 Building an end-to-end pipeline for supervised machine learning using Scikit-learn

A data pipeline is a set of processing components that are sequenced to produce meaningful data. Pipelines are commonly used in machine learning, since there is a lot of data transformation and manipulation that needs to be applied to make data useful for machine learning. All components are sequenced in a way that the output of one component becomes the input for the next, and each of the components is self-contained. Components interact with each other using data.

Even if a component breaks, the downstream component can run normally using the last output. Sklearn provides the ability to build pipelines that can be transformed and modeled for machine learning.
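As a minimal sketch of this idea with scikit-learn's Pipeline class (the step names and estimators are illustrative choices):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),      # component 1: standardize the features
    ('model', LogisticRegression()),  # component 2: fit a classifier
])
# calling pipe.fit(X_train, y_train) runs the components in sequence,
# feeding the output of 'scale' into 'model'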
3.8.4.6 Steps for developing a machine learning model

1. Explore the domain space
2. Extract the problem definition
3. Get the data that can be used to make the system learn to solve the problem definition.
4. Discover and visualize the data to gain insights
5. Feature engineering and prepare the data
6. Fine tune your model
7. Evaluate your solution using metrics
8. Once proven, launch and maintain the model.
3.8.4.7 Exploratory Data Analysis

Example project: Fraud detection system

The first step is to load the data into a data frame in order for a proper analysis to be done on the attributes.

Perform the basic analysis on the data shape and null value information.

Here are examples of a few of the visual data analysis methods.
3.8.4.7.1 Bar plot

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.

Bar graphs are used:

To make comparisons between variables
To visualize any trend in the data, i.e., they show the dependence of one variable on another
To estimate values of a variable
import pandas as pd  # assumed import for the examples that follow

data = pd.read_csv('dataset/data_file.csv')
data.head()

print(data.shape)
print(data.info())
data.isnull().values.any()
Figure 23: Example of scikit-learn bar plots
3.8.4.7.2 Correlation between attributes

Attributes in a dataset can be related based on different aspects.

Examples include attributes dependent on another, or loosely or tightly coupled. Another example is two variables that can be associated with a third one.

In order to understand the relationship between attributes, correlation represents the best visual way to get an insight. Positive correlation means both attributes move in the same direction. Negative correlation refers to opposite directions: an increase in one attribute's values results in a value decrease for the other. Zero correlation is when the attributes are unrelated.
plt.ylabel('Transactions')
plt.xlabel('Type')
data.type.value_counts().plot.bar()

# compute the correlation matrix
corr = data.corr()

# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))
Figure 24: scikit-learn correlation array
3.8.4.7.3 Histogram Analysis of dataset attributes

A histogram consists of a set of counts that represent the number of times some event occurred.
import seaborn as sns  # assumed import for the heatmap and box plots

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);

%matplotlib inline
data.hist(bins=30, figsize=(20, 15))
plt.show()
Figure 25: scikit-learn
3.8.4.7.4 Box plot Analysis

Box plot analysis is useful in detecting whether a distribution is skewed and in detecting outliers in the data.

fig, axs = plt.subplots(2, 2, figsize=(10, 10))
tmp = data.loc[(data.type == 'TRANSFER'), :]

a = sns.boxplot(x='isFlaggedFraud', y='amount', data=tmp, ax=axs[0][0])
axs[0][0].set_yscale('log')

b = sns.boxplot(x='isFlaggedFraud', y='oldbalanceDest', data=tmp, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))

c = sns.boxplot(x='isFlaggedFraud', y='oldbalanceOrg', data=tmp, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))

d = sns.regplot(x='oldbalanceOrg', y='amount',
                data=tmp.loc[(tmp.isFlaggedFraud == 1), :], ax=axs[1][1])
plt.show()
Figure 26: scikit-learn
3.8.4.7.5 Scatter plot Analysis

The scatter plot displays values of two numerical variables as Cartesian coordinates.

plt.figure(figsize=(12, 8))
sns.pairplot(data[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']],
             hue='isFraud')

Figure 27: scikit-learn scatter plots
3.8.4.8 Data Cleansing - Removing Outliers

If the transaction amount is lower than 5 percent of all the transactions AND does not exceed USD 3000, we will exclude it from our analysis to reduce Type 1 costs.

If the transaction amount is higher than 95 percent of all the transactions AND exceeds USD 500000, we will exclude it from our analysis, and use a blanket review process for such transactions (similar to the isFlaggedFraud column in the original dataset) to reduce Type 2 costs.

low_exclude = np.round(np.minimum(fin_samp_data.amount.quantile(0.05), 3000), 2)
high_exclude = np.round(np.maximum(fin_samp_data.amount.quantile(0.95), 500000), 2)

### Updating Data to exclude records prone to Type 1 and Type 2 costs
low_data = fin_samp_data[fin_samp_data.amount > low_exclude]
3.8.4.9 Pipeline Creation

A machine learning pipeline is used to help automate machine learning workflows. Pipelines operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

3.8.4.9.1 Defining a DataFrameSelector to separate Numerical and Categorical attributes

Sample class to separate out numerical and categorical attributes.

3.8.4.9.2 Feature Creation / Additional Feature Engineering

During EDA we identified that there are transactions where the balances do not tally after the transaction is completed. We believe these could potentially be cases where fraud is occurring. To account for this error in the transactions, we define two new features, “errorBalanceOrig” and “errorBalanceDest”, calculated by adjusting the amount with the before and after balances for the Originator and Destination accounts.

Below, we create a class that allows us to create these features in a pipeline.
data = low_data[low_data.amount < high_exclude]

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.base import BaseEstimator, TransformerMixin

# column index
amount_ix, oldbalanceOrg_ix, newbalanceOrig_ix, oldbalanceDest_ix, newbalanceDest_ix = 0, 1, 2, 3, 4

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):  # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        errorBalanceOrig = X[:, newbalanceOrig_ix] + X[:, amount_ix] - X[:, oldbalanceOrg_ix]
        errorBalanceDest = X[:, oldbalanceDest_ix] + X[:, amount_ix] - X[:, newbalanceDest_ix]
        return np.c_[X, errorBalanceOrig, errorBalanceDest]
3.8.4.10 Creating Training and Testing datasets

The training set includes the set of input examples that the model will be fit into, or trained on, by adjusting the parameters. The testing dataset is critical to test the generalizability of the model. By using this set, we can get the working accuracy of our model.

The testing set should not be exposed to the model until model training has been completed. This way the results from testing will be more reliable.
3.8.4.11 Creating a pipeline for numerical and categorical attributes

Identifying columns with numerical and categorical characteristics.

3.8.4.12 Selecting the algorithm to be applied

Algorithm selection primarily depends on the objective you are trying to solve and what kind of dataset is available. There are different types of algorithms which can be applied, and we will look into a few of them here.
3.8.4.12.1 Linear Regression

This algorithm can be applied when you want to compute some continuous value. To predict some future value of a process which is currently running, you
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig",
                       "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig",
               "oldbalanceDest", "newbalanceDest", "type"]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])
can go with a regression algorithm.

Examples where linear regression can be used are:

1. Predict the time taken to go from one place to another
2. Predict the sales for a future month
3. Predict sales data and improve yearly projections.
3.8.4.12.2 Logistic Regression

This algorithm can be used to perform binary classification. It can be used if you want a probabilistic framework, or in case you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

1. Customer churn prediction.
2. Credit scoring and fraud detection, which is the example problem we are trying to solve in this chapter.
3. Calculating the effectiveness of marketing campaigns.
3.8.4.12.3 Decision trees

Decision trees handle feature interactions and they're non-parametric. They don't support online learning, and the entire tree needs to be rebuilt when new training
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time

scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)  # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(
    X_train, y_train, stratify=y_train,
    train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(
    X_test, y_test, stratify=y_test,
    train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(
    multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)

y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LR Sklearn", np.round(acc, 3)]
results
dataset comes in. Memory consumption is very high.

Decision trees can be used for the following cases:

1. Investment decisions
2. Customer churn
3. Bank loan defaulters
4. Build vs Buy decisions
5. Sales lead qualifications
3.8.4.12.4 KMeans

This algorithm is used when we are not aware of the labels and one needs to be created based on the features of objects. An example would be to divide a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to know exactly the number of clusters or groups that is required. It takes a lot of iterations to come up with the best K.
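One common heuristic for picking K is the so-called elbow method; a minimal sketch (the data here is a random illustrative placeholder):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # illustrative data; replace with your features

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# plot k versus inertia and look for the "elbow" where the curve
# stops dropping sharply; that k is a reasonable choice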
3.8.4.12.5 Support Vector Machines

SVM is a supervised ML technique and is used for pattern recognition and
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(
    X_train, y_train, stratify=y_train,
    train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(
    X_test, y_test, stratify=y_test,
    train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)

y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)

results.loc[len(results)] = ["KNN Arbitary Sklearn", np.round(acc, 3)]
results
classification problems when your data has exactly two classes. It is popular in text classification problems.

A few cases where SVM can be used are:

1. Detecting persons with common diseases.
2. Hand-written character recognition
3. Text categorization
4. Stock market price prediction
3.8.4.12.6 Naive Bayes

Naive Bayes is used for large datasets. This algorithm works well even when we have a limited CPU and memory available. It works by calculating a bunch of counts, and it requires less training data. The algorithm can't learn interactions between features.

Naive Bayes can be used in real-world applications such as:

1. Sentiment analysis and text classification
2. Recommendation systems like Netflix, Amazon
3. To mark an email as spam or not spam
4. Face recognition
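The other algorithms in this section come with code snippets; for completeness, a minimal Naive Bayes sketch with scikit-learn's GaussianNB, trained on the same kind of X_train/y_train split as above, could look like:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

model_nb = GaussianNB()
model_nb.fit(X_train, y_train)           # fit class-conditional Gaussians

y_pred_test = model_nb.predict(X_test)   # predict the most probable class
print(accuracy_score(y_test, y_pred_test))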
3.8.4.12.7 Random Forest

Random forest is similar to a decision tree. It can be used for both regression and classification problems with large datasets.

A few cases where it can be applied:

1. Predict patients with high risks.
2. Predict parts failures in manufacturing.
3. Predict loan defaulters.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=400, criterion='mse',
                               random_state=1, n_jobs=-1)

start = time.time()
forest.fit(X_train_std, y_train)
y_train_pred = forest.predict(X_train_std)
train_time = time.time() - start

start = time.time()
3.8.4.12.8 Neural networks

A neural network works based on the weights of the connections between neurons. The weights are trained, and based on that the neural network can be utilized to predict a class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

1. Unsupervised learning tasks, such as feature extraction.
2. Extracting features from raw images or speech with much less human intervention
3.8.4.12.9 Deep Learning using Keras

Keras is one of the most powerful and easy-to-use Python libraries for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.
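A minimal sketch of a Keras model (the layer sizes and the 8-feature input are illustrative assumptions):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))  # hidden layer
model.add(Dense(1, activation='sigmoid'))             # binary output
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=10, batch_size=32)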
3.8.4.12.10 XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
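A minimal sketch using the scikit-learn-style wrapper, as imported later in this section (the hyperparameter values are illustrative):

from xgboost.sklearn import XGBClassifier

model_xgb = XGBClassifier(max_depth=3,       # depth of each boosted tree
                          n_estimators=100)  # number of boosting rounds
model_xgb.fit(X_train, y_train)
y_pred_test = model_xgb.predict(X_test)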
3.8.4.13 Scikit Cheat Sheet

Scikit-learn has published a very in-depth and well-explained flowchart to help you choose the right algorithm, which I find very handy.
y_test_pred = forest.predict(X_test_std)
test_time = time.time() - start
Figure 28: scikit-learn
3.8.4.14 Parameter Optimization

Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem.

A parameter is a configuration that is part of the model, and its values can be derived from the given data.

1. Required by the model when making predictions.
2. Values define the skill of the model on your problem.
3. Estimated or learned from data.
4. Often not set manually by the practitioner.
5. Often saved as part of the learned model.
3.8.4.14.1 Hyperparameter optimization/tuning algorithms

Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

Random search provides a statistical distribution for each hyperparameter from which values may be randomly sampled.
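A sketch contrasting the two with scikit-learn (the model and parameter ranges are illustrative assumptions):

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from scipy.stats import uniform

# grid search: every combination in the grid is evaluated
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    {'C': [0.01, 1.0, 100], 'penalty': ['l1', 'l2']}, cv=4)

# random search: C is sampled from a continuous distribution
rand = RandomizedSearchCV(LogisticRegression(solver='liblinear'),
                          {'C': uniform(0.01, 100)}, n_iter=10, cv=4,
                          random_state=42)
# grid.fit(X_train, y_train); rand.fit(X_train, y_train)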
3.8.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)

3.8.4.15.1 Creating a parameter grid

3.8.4.15.2 Implementing grid search with the models and also creating metrics from each of the models.
grid_param = [
    [{  # LogisticRegression
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.01, 1.0, 100]
    }],
    [{  # keras
        'model__optimizer': optimizer,
        'model__loss': loss
    }],
    [{  # SVM
        'model__C': [0.01, 1.0, 100],
        'model__gamma': [0.5, 1],
        'model__max_iter': [-1]
    }],
    [{  # XGBClassifier
        'model__min_child_weight': [1, 3, 5],
        'model__gamma': [0.5],
        'model__subsample': [0.6, 0.8],
        'model__colsample_bytree': [0.6],
        'model__max_depth': [3]
    }]
]

Pipeline(memory=None,
         steps=[('preparation', FeatureUnion(n_jobs=None,
                 transformer_list=[('num_pipeline', Pipeline(memory=None,
                     steps=[('selector', DataFrameSelector(attribute_names=['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest'
         tol=0.0001, verbose=0, warm_start=False))])
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC

test_scores = []

# Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    linear_model.LogisticRegression(),
    keras_model,
    SVC(),
    XGBClassifier()
]
3.8.4.15.3 Results table from the model evaluation with metrics.

Figure 29: scikit-learn
# create table to compare MLA metrics
MLA_columns = ['Name', 'Score', 'Accuracy_Score', 'ROC_AUC_score',
               'final_rmse', 'Classification_error', 'Recall_Score',
               'Precision_Score']
MLA_compare = pd.DataFrame(columns=MLA_columns)
Model_Scores = pd.DataFrame(columns=['Name', 'Score'])

row_index = 0
for alg in MLA:

    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'Name'] = MLA_name
    # MLA_compare.loc[row_index, 'Parameters'] = str(alg.get_params())

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),  # combination of numerical and categorical pipelines
        ("model", alg)
    ])

    grid_search = GridSearchCV(full_pipeline_with_predictor,
                               grid_param[row_index], cv=4,
                               verbose=2, scoring='f1',
                               return_train_score=True)
    grid_search.fit(X_train[X_model_col], y_train)
    y_pred = grid_search.predict(X_test)

    MLA_compare.loc[row_index, 'Accuracy_Score'] = np.round(accuracy_score(y_pred, y_test), 3)
    MLA_compare.loc[row_index, 'ROC_AUC_score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)
    MLA_compare.loc[row_index, 'Score'] = np.round(grid_search.score(X_test, y_test), 3)

    negative_mse = grid_search.best_score_
    scores = np.sqrt(-negative_mse)
    final_mse = mean_squared_error(y_test, y_pred)
    final_rmse = np.sqrt(final_mse)
    MLA_compare.loc[row_index, 'final_rmse'] = final_rmse

    confusion_matrix_var = confusion_matrix(y_test, y_pred)
    TP = confusion_matrix_var[1, 1]
    TN = confusion_matrix_var[0, 0]
    FP = confusion_matrix_var[0, 1]
    FN = confusion_matrix_var[1, 0]
    MLA_compare.loc[row_index, 'Classification_error'] = np.round(((FP + FN) / float(TP + TN + FP + FN)), 5)
    MLA_compare.loc[row_index, 'Recall_Score'] = np.round(metrics.recall_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'Precision_Score'] = np.round(metrics.precision_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'F1_Score'] = np.round(f1_score(y_test, y_pred), 5)

    MLA_compare.loc[row_index, 'mean_test_score'] = grid_search.cv_results_['mean_test_score'].mean()
    MLA_compare.loc[row_index, 'mean_fit_time'] = grid_search.cv_results_['mean_fit_time'].mean()

    Model_Scores.loc[row_index, 'MLA Name'] = MLA_name
    Model_Scores.loc[row_index, 'ML Score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)

    # Collect Mean Test scores for statistical significance test
    test_scores.append(grid_search.cv_results_['mean_test_score'])
    row_index += 1
3.8.4.15.4 ROC AUC Score

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
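As a small self-contained sketch of computing this score with scikit-learn (the labels and scores are a toy illustration):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]                   # true binary labels
y_scores = [0.1, 0.4, 0.35, 0.8]        # predicted probabilities
print(roc_auc_score(y_true, y_scores))  # 0.75

# the underlying curve: false positive rate vs. true positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_scores)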
Figure 30: scikit-learn

Figure 31: scikit-learn
3.8.4.16 K-means in scikit-learn

In this section we demonstrate how simple it is to use k-means in scikit-learn.

3.8.4.16.1 Import
from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
3.8.4.16.2 Create samples

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300
print("n_digits:%d,\tn_samples%d,\tn_features%d"%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init''timeinertiahomocomplv-measARIAMIsilhouette')
print("n_digits:%d,\tn_samples%d,\tn_features%d"
%(n_digits,n_samples,n_features))
print(79*'_')
print('%9s'%'init'
'timeinertiahomocomplv-measARIAMIsilhouette')
def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
3.8.4.16.3 Visualize

See Figure 32
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_,
                     n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)
print(79 * '_')
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
Figure32:Result
3.8.5 Parallel Computing in Python ☁

In this module we will review the available Python modules that can be used for parallel computing. Parallel computing can take the form of either multi-threading or multi-processing. In the multi-threading approach, the threads run in the same shared memory heap, whereas in the case of multi-processing, the memory heaps of the processes are separate and independent; the communication between processes is therefore a little more complex.
3.8.5.1 Multi-threading in Python

Threading in Python is perfect for I/O operations where the process is expected to be idle regularly, e.g., web scraping. This is a very useful feature because several applications and scripts might spend the majority of their runtime waiting for network or data I/O. In several cases, e.g., web scraping, the resources, i.e., downloads from different websites, are most of the time independent. Therefore the processor can download in parallel and join the results at the end.
3.8.5.1.1 Thread vs Threading

There are two built-in modules in Python that are related to threading, namely thread and threading. The former module has been deprecated for quite some time in Python 2, and in Python 3 it was renamed to _thread. The _thread module provides a low-level threading API for multi-threading in Python, whereas the threading module builds a high-level threading interface on top of it.
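The following minimal sketch (not from the original text) contrasts the two APIs; both simply run a small illustrative function in a separate thread:

import _thread
import threading
import time

def task(name):
    print("running", name)

# low-level API: start_new_thread takes a callable and an argument tuple
_thread.start_new_thread(task, ("low-level",))

# high-level API: Thread objects can be started and joined
t = threading.Thread(target=task, args=("high-level",))
t.start()
t.join()

time.sleep(0.1)  # give the low-level thread time to finish before the program exits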
Thread() is the main class of the threading module. Its two most important arguments are target, for specifying the callable object, and args, to pass the arguments for the target callable. We illustrate these in the following example, saved in a file called hello_thread.py (we do not name the file threading.py, as that would shadow the standard library module it imports):

import threading

def hello_thread(thread_num):
    print("Hello from Thread ", thread_num)

if __name__ == '__main__':
    for thread_num in range(5):
        t = threading.Thread(target=hello_thread, args=(thread_num,))
        t.start()

This is the output of the previous example:

In [1]: %run hello_thread.py
Hello from Thread 0
Hello from Thread 1
Hello from Thread 2
Hello from Thread 3
Hello from Thread 4

In case you are not familiar with the if __name__ == '__main__': statement, what it does is basically make sure that the code nested under this condition will be run only if you run your module as a program, and will not run in case your module is imported into another file.

3.8.5.1.2 Locks

As mentioned prior, the memory space is shared between the threads. This is at the same time beneficial and problematic: it is beneficial in the sense that the communication between the threads becomes easy; however, you might experience strange outcomes if you let several threads change the same variable without caution, e.g., thread 2 changes variable x while thread 1 is working with it. This is when lock comes into play. Using a lock, you can allow only one thread to work with a variable. In other words, only a single thread can hold the lock. If the other threads need to work with that variable, they have to wait until the other thread is done and the variable is "unlocked".
We illustrate this with a simple example:
Suppose we want to print multiples of 3 between 1 and 12, i.e., 3, 6, 9 and 12. For the sake of argument, we try to do this using 2 threads and a nested for loop. We create a global variable called counter and initialize it with 0. Then, whenever each of the incrementer1 or incrementer2 functions is called, the counter is incremented by 3 twice (the counter is incremented by 6 in each function call):

import threading

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is now %d" % counter)

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()

If you run the previous code, you should be really lucky if you get the following as part of your output:

Counter is now 3
Counter is now 6
Counter is now 9
Counter is now 12

The reason is the conflict that happens between the threads while incrementing the counter in the nested for loop. As you probably noticed, the first-level for loop is the equivalent of adding 3 to the counter, and the conflict that might happen is not effective on that level but on the nested for loop. Accordingly, the output of the previous code is different in every run. This is an example output:

Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 4
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 8
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 10
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

We can fix this issue using a lock: whenever one of the functions is going to increment the value by 3, it will acquire() the lock, and when it is done the function will release() the lock. This mechanism is illustrated in the following code:
import threading

increment_by_3_lock = threading.Lock()

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

def incrementer2():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()
No matter how many times you run this code, the output would always be in the correct order:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 3
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 6
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 9
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12
Using the threading module increases both the overhead associated with thread management and the complexity of the program, which is why, in many situations, employing the multiprocessing module might be a better approach.
3.8.5.2 Multi-processing in Python

We already mentioned that multi-threading might not be sufficient in many applications and that we might need to use multiprocessing sometimes, or better to say, most of the time. That is why we dedicate this subsection to this particular module. This module provides you with an API for spawning processes the same way you spawn threads using the threading module. Moreover, there are some functionalities that are not even available in the threading module, e.g., the Pool class, which allows you to run a batch of jobs using a pool of worker processes.
3.8.5.2.1 Process

Similar to the threading module, which employs thread (a.k.a. _thread) under the hood, multiprocessing employs the Process class. Consider the following example:
from multiprocessing import Process
import os

def greeter(name):
    proc_idx = os.getpid()
    print("Process {0}: Hello {1}!".format(proc_idx, name))

if __name__ == '__main__':
    name_list = ['Harry', 'George', 'Dirk', 'David']
    process_list = []
    for name_idx, name in enumerate(name_list):
        current_process = Process(target=greeter, args=(name,))
        process_list.append(current_process)
        current_process.start()
    for process in process_list:
        process.join()
In this example, after importing Process we created a greeter() function that takes a name and greets that person. It also prints the pid (process identifier) of the process that is running it. Note that we used the os module to get the pid. At the bottom of the code, after checking the __name__ == '__main__' condition, we create a series of Processes and start them. Finally, in the last for loop, and using the join method, we tell Python to wait for the processes to terminate. This is one of the possible outputs of the code:

$ python3 process_example.py
Process 23451: Hello Harry!
Process 23452: Hello George!
Process 23453: Hello Dirk!
Process 23454: Hello David!

3.8.5.2.2 Pool

Consider the Pool class as a pool of worker processes. There are several ways of assigning jobs to the Pool class, and we will introduce the most important ones in this section. These methods are categorized as blocking or non-blocking. The former means that after calling the API, it blocks the thread/process until it has the result or answer ready, and the control returns only when the call completes. In the non-blocking case, on the other hand, the control returns immediately.

3.8.5.2.2.1 Synchronous Pool.map()

We illustrate the Pool.map method by re-implementing our previous greeter example using Pool.map:
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    sync_map = pool.map(greeter, names)
    print("Done!")
As you can see, we have seven names here, but we do not want to dedicate each greeting to a separate process. Instead, we do the whole job of "greeting seven people" using three processes. We create a pool of 3 processes with the Pool(processes=3) syntax and then we map an iterable called names to the greeter function using pool.map(greeter, names). As we expected, the greetings in the output will be printed from three different processes:

$ python poolmap_example.py
Process 30585: Hello Jenna!
Process 30586: Hello David!
Process 30587: Hello Marry!
Process 30585: Hello Ted!
Process 30585: Hello Jerry!
Process 30587: Hello Tom!
Process 30585: Hello Justin!
Done!

Note that Pool.map() is in the blocking category and does not return the control to your script until it is done calculating the results. That is why Done! is printed after all of the greetings are over.
3.8.5.2.2.2 Asynchronous Pool.map_async()

As the name implies, you can use the map_async method when you want to assign many function calls to a pool of worker processes asynchronously. Note that, unlike map, the order of the results is not guaranteed, and the control returns immediately. We now implement the previous example using map_async:
from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    async_map = pool.map_async(greeter, names)
    print("Done!")
    async_map.wait()

As you probably noticed, the only difference (apart from the map_async method name) is calling the wait() method in the last line. The wait() method tells your script to wait for the result of map_async before terminating:

$ python poolmap_example.py
Done!
Process 30740: Hello Jenna!
Process 30741: Hello David!
Process 30740: Hello Ted!
Process 30742: Hello Marry!
Process 30740: Hello Jerry!
Process 30741: Hello Tom!
Process 30742: Hello Justin!

Note that the order of the results is not preserved. Moreover, Done! is printed before any of the results, meaning that if we did not use the wait() method, we probably would not see the results at all.

3.8.5.2.3 Locks

The way the multiprocessing module implements locks is almost identical to the way the threading module does. After importing Lock from multiprocessing, all you need to do is to acquire it, do some computation, and then release the lock. We will clarify the use of Lock by providing an example in the next section about process communication.

3.8.5.2.4 Process Communication
Process communication in multiprocessing is one of the most important, yet complicated, features for better use of this module. As opposed to threading, Process objects will not have access to any shared variable by default, i.e., there is no shared memory space between the processes by default. This effect is illustrated in the following example:
from multiprocessing import Process

counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 1: Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 2: Counter is %d" % counter)

if __name__ == '__main__':
    t1 = Process(target=incrementer1)
    t2 = Process(target=incrementer2)
    t1.start()
    t2.start()

You probably already noticed that this is almost identical to our example in the threading section. Now, take a look at the strange output:

$ python communication_example.py
Greeter 1: Counter is 3
Greeter 1: Counter is 6
Greeter 2: Counter is 3
Greeter 2: Counter is 6

As you can see, it is as if the processes do not see each other. Instead of having two processes, one counting to 6 and the other counting from 6 to 12, we have two processes each counting to 6.

Nevertheless, there are several ways that Processes from multiprocessing can communicate with each other, including Pipe, Queue, Value, Array and Manager. Pipe and Queue are appropriate for inter-process message passing. To be more specific, Pipe is useful for process-to-process scenarios, while Queue is more appropriate for processes-to-processes ones. Value and Array are both used to provide synchronized access to shared data (very much like shared memory), and Managers can be used on different data types. In the following sub-sections, we cover both Value and Array since they are both lightweight, yet useful, approaches.
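Before turning to those, here is a complementary minimal sketch (not part of the original text) of message passing with a Queue; the worker function and the message it sends are illustrative assumptions:

from multiprocessing import Process, Queue

def worker(q):
    # the child process puts a message on the queue for the parent
    q.put("Hello from the child process")

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # blocks until the child has put a message on the queue
    p.join()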
3.8.5.2.4.1 Value

The following example re-implements the broken example from the previous section. We fix the strange output by using both Lock and Value:
from multiprocessing import Process, Lock, Value
import time

increment_by_3_lock = Lock()

def incrementer1(counter):
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        print("Greeter 1: Counter is %d" % counter.value)
        increment_by_3_lock.release()

def incrementer2(counter):
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        print("Greeter 2: Counter is %d" % counter.value)
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    t1 = Process(target=incrementer1, args=(counter,))
    t2 = Process(target=incrementer2, args=(counter,))
    t2.start()
    t1.start()

The usage of the Lock object in this example is identical to the example in the threading section. The usage of counter is, on the other hand, the novel part. First, note that counter is not a global variable anymore; instead, it is a Value, which returns a ctypes object allocated from shared memory between the processes. The first argument 'i' indicates a signed integer, and the second argument defines the initialization value, in this case 0. We then modified our two functions to take this shared variable as an argument. Finally, we changed the way we increment the counter, since counter is not a Python integer anymore but a ctypes signed integer, whose value we can access through its value attribute. The output of the code is now as we expect:

$ python mp_lock_example.py
Greeter 2: Counter is 3
Greeter 2: Counter is 6
Greeter 1: Counter is 9
Greeter 1: Counter is 12
The last example related to parallel processing illustrates the use of both Value and Array, as well as a technique to pass multiple arguments to a function. Note that map and map_async pass only a single argument to the target function, so bundling several values into one object is a common technique for them; the same technique is used here with Process as well.

In this example we create a multiprocessing.Array() object and assign it to a variable called names. As we mentioned before, the first argument is the ctypes data type, and since we want to create an array of strings with a length of 4 (the second argument), we imported c_char_p and passed it as the first argument.

Instead of passing the arguments separately, we merge both the Value and Array objects into a tuple and pass the tuple to the functions. We then modify the functions to unpack the objects in the first two lines of both functions. Finally, we change the print statement in a way that each process greets a particular name. The code and its output follow:
from multiprocessing import Process, Lock, Value, Array
import time
from ctypes import c_char_p

increment_by_3_lock = Lock()

def incrementer1(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        name_idx = counter.value // 3 - 1
        print("Greeter 1: Greeting {0}! Counter is {1}".format(
            names[name_idx].decode(), counter.value))
        increment_by_3_lock.release()

def incrementer2(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        name_idx = counter.value // 3 - 1
        print("Greeter 2: Greeting {0}! Counter is {1}".format(
            names[name_idx].decode(), counter.value))
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    # the names are byte strings under Python 3; sharing c_char_p pointers
    # works here because the array is filled before the children are started
    names = Array(c_char_p, 4)
    names[:] = [b'James', b'Tom', b'Sam', b'Larry']
    t1 = Process(target=incrementer1, args=((counter, names),))
    t2 = Process(target=incrementer2, args=((counter, names),))
    t2.start()
    t1.start()

$ python3 mp_lock_example.py
Greeter 2: Greeting James! Counter is 3
Greeter 2: Greeting Tom! Counter is 6
Greeter 1: Greeting Sam! Counter is 9
Greeter 1: Greeting Larry! Counter is 12
3.8.6 Dask - Random Forest Feature Detection ☁

3.8.6.1 Setup

First we need our tools. pandas gives us the DataFrame, very similar to R's DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.

3.8.6.2 Dataset

We are using the wine quality dataset, archived at UCI's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Now we will load our data. pandas makes it easy!

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.
import pandas as pd
import numpy as np

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)

# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# rose? other fruit wines? plum wine? :(

# for red wines
red_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467
std         1.741096          0.179060     0.194801        1.409928     0.047065
min         4.600000          0.120000     0.000000        0.900000     0.012000
25%         7.100000          0.390000     0.090000        1.900000     0.070000
50%         7.900000          0.520000     0.260000        2.200000     0.079000
75%         9.200000          0.640000     0.420000        2.600000     0.090000
max        15.900000          1.580000     1.000000       15.500000     0.611000
# for white wines
white_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    4898.000000       4898.000000  4898.000000     4898.000000  4898.000000
mean        6.854788          0.278241     0.334192        6.391415     0.045772
std         0.843868          0.100795     0.121020        5.072058     0.021848
min         3.800000          0.080000     0.000000        0.600000     0.009000
25%         6.300000          0.210000     0.270000        1.700000     0.036000
50%         6.800000          0.260000     0.320000        5.200000     0.043000
75%         7.300000          0.320000     0.390000        9.900000     0.050000
max        14.200000          1.100000     1.660000       65.800000     0.346000
Sometimes it is easier to understand the data visually. A histogram of the white wine quality data citric acid samples is shown next. You can of course visualize other columns' data or other datasets. Just replace the DataFrame and column name (see Figure 33).
import matplotlib.pyplot as plt

def extract_col(df, col_name):
    return list(df[col_name])

col = extract_col(white_df, 'citric acid')  # can replace with another dataframe or column
plt.hist(col)

# TODO: add axes and such to set a good example
plt.show()
Figure 33: Histogram
3.8.6.3 Detecting Features

Let us try out some elementary machine learning models. These models are not always for prediction; they are also useful to find what features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

3.8.6.3.1 Data Preparation
Let us assume we want to study what features are most correlated with pH. pH of course is real-valued and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).

# refresh to make Jupyter happy
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or
# less than the mean pH, respectively

M = np.mean(list(red_df['pH']))  # expect inelegant code in these mappings
Lf = lambda p: int(p < M) * 'lo' + int(p >= M) * 'hi'  # some C-style hackery

# create the new classifiable variable (list() is needed under Python 3, where map is lazy)
red_df['pH-hi-lo'] = list(map(Lf, list(red_df['pH'])))

# and remove the predecessor
del red_df['pH']
Now we specify which dataset and variable we want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets us create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn's documentation, along with many examples and tutorials. Random Forests are data science workhorses. They are the go-to method for most data scientists. Be careful relying on them, though: they tend to overfit. We try to avoid overfitting by separating the training and test datasets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df      # selected dataset
TARGET_VAR = 'pH-hi-lo'   # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR]  # no cheating

# TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# set up the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# pick a classifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

clf = RandomForestClassifier()

3.8.6.4 Random Forest

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-learn models.

# test it out
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test, pred)
var_score = clf.score(X_test, y_test)

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually decay exponentially. The first few highly-ranked features are usually the most important.

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# for the sake of clarity
num_features = X_train.shape[1]
features = [df.columns[x] for x in indices]
feature_importances = [importances[x] for x in indices]

print('Feature ranking:\n')

for i in range(num_features):
    feature_name = features[i]
    feature_importance = feature_importances[i]
    print('%s %f' % (feature_name.ljust(30), feature_importance))

Feature ranking:

fixed acidity                  0.269778
citric acid                    0.171337
density                        0.089660
volatile acidity               0.088965
chlorides                      0.082945
alcohol                        0.080437
total sulfur dioxide           0.067832
sulphates                      0.047786
free sulfur dioxide            0.042727
residual sugar                 0.037459
quality                        0.021075

Sometimes it is easier to visualize. We will use a bar chart. See Figure 34.
plt.clf()
plt.bar(range(num_features), feature_importances)
plt.xticks(range(num_features), features, rotation=90)
plt.ylabel('relative importance (a.u.)')
plt.title('Relative importances of most predictive features')
plt.show()
Figure 34: Result
3.8.6.5 Acknowledgement

This notebook was developed by Juliette Zerick and Gregor von Laszewski.
import dask.dataframe as dd

red_df = dd.read_csv('winequality-red.csv', sep=';', header=0)
white_df = dd.read_csv('winequality-white.csv', sep=';', header=0)
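As a minimal sketch (not part of the original notebook), note that Dask DataFrames are lazy; an explicit compute() call triggers the actual work, for example:

# dask builds a task graph; nothing is read until compute() is called
print(red_df.describe().compute())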
4 DEVOPS TOOLS

4.1 REFCARDS ☁
Learning Objectives

Quickly obtain information about technical aspects with the help of reference cards.

We present you with a list of useful short reference cards. These cards can be extremely useful to remind yourself about some important commands and features. Having them could simplify your interaction with the systems. We not only collected here some refcards about Linux, but also about other useful tools and services.

If you would like to add new topics, let us know via your contribution (see the contribution section).
Cheat Sheets

Editors: Emacs, Vi, Vim
Documentation: LaTeX, RST
Linux: Linux, Makefile, Git
Cloud/Virtualization: Openstack, Openstack, vagrant
SQL: SQL
Languages: R; Python: Python, Python Data, Numpy/Pandas, Python Tutorial, Python, Python, Python API Index, Python 3
4.2 VIRTUALBOX ☁

For development purposes we recommend that you use for this class an Ubuntu virtual machine that you set up with the help of VirtualBox. We recommend that you use the current version of Ubuntu and do not install or reuse a version that you have set up years ago.
As access to cloud resources requires some basic knowledge of Linux and security, we will restrict access to our cloud services to those that have demonstrated responsible use on their own computers. Naturally, as it is your own computer, you must make sure you follow proper security. We have seen in the past students carelessly working with virtual machines and introducing security vulnerabilities on our clouds just because "it was not their computer." Hence, we will allow using cloud resources only if you have demonstrated that you responsibly use a Linux virtual machine on your own computer. Only after you have successfully used Ubuntu in a virtual machine will you be allowed to use virtual machines on clouds.

A cloud drivers license test will be conducted. Only after you pass it will we let you gain access to the cloud infrastructure. We will announce this test. Before you have passed the test, you will not be able to use the clouds. Furthermore, you do not have to ask us for join requests to cloud projects before you have passed the test. Please be patient. Only students enrolled in the class can get access to the cloud.

If you, however, have access to other clouds yourself, you are welcome to use them. However, be reminded that projects need to be reproducible on our cloud. This will require you to make sure a TA can replicate it.
Let us now focus on using VirtualBox.

4.2.1 Installation

First you will need to install VirtualBox. It is easy to install and details can be found at:

https://www.virtualbox.org/wiki/Downloads

After you have installed VirtualBox you also need an image. For this class we will be using Ubuntu Desktop 18.04, which you can find at:

http://www.ubuntu.com/download/desktop
Please note that some hardware you may have may be too old or have too few resources to be useful. We have heard from students that the following is a minimal setup for the desktop machine:

multicore processor or better, allowing to run hypervisors
8GB system memory
50GB of free hard drive space

You may need multiple virtual machines, while the minimal configuration may not work for all cases.

As configuration we often use:

minimal: 1 core, 2GB memory, 5GB disk
latex: 2 cores, 4GB memory, 25GB disk
A video to showcase such an install is available at:

Using Ubuntu in Virtualbox (8:08)

Please note that the video shows version 16.04. You should, however, use the newest version, which at this time is 18.04.

If you specify your machine too small, you will not be able to install the development environment. Gregor used on his machine 8GB RAM and 25GB disk space.

Please let us know the smallest configuration that works.
4.2.2 Guest additions

The virtual guest additions allow you to easily do the following tasks:

Resize the windows of the VM
Copy and paste content between the guest operating system and the host operating system windows.

This way you can use many native programs on your host and copy contents easily into, for example, a terminal or an editor that you run in the VM.

A video is located at:

Virtualbox (4:46)

Please reboot the machine after installation and configuration.
On OSX, once you have enabled bidirectional copying in the Devices tab, you can use:

OSX to Vbox: command c, shift CTRL v
Vbox to OSX: shift CTRL c, shift CTRL v

On Windows the key combination is naturally different. Please consult your Windows manual. If you let us know, TAs will add the information here.
4.2.3 Exercises

E.Virtualbox.1:

Install Ubuntu Desktop on your computer with guest additions.

E.Virtualbox.2:

Make sure you know how to paste and copy between your host and guest operating system.

E.Virtualbox.3:

Install the programs defined by the development configuration.

E.Virtualbox.4:

Provide us with the key combination to copy and paste between Windows and Vbox.
4.3 VAGRANT ☁

Learning Objectives

Be able to experiment with virtual machines on your computer before you go on a cloud.
Simulate a virtual cluster with multiple VMs running on your computer if it is big enough.

A convenient tool to interface with VirtualBox is Vagrant. Vagrant allows us to manage virtual machines directly from the command line. It supports other providers as well and can be used to start virtual machines and even containers. The latest version of Vagrant includes the ability to automatically fetch a virtual machine image and start it on your local computer. It assumes that you have VirtualBox installed. Some key concepts and advertisement are located at:

https://www.vagrantup.com/intro/index.html

Detailed documentation for it is located at:

https://www.vagrantup.com/docs/index.html

A list of boxes is available from:

https://app.vagrantup.com/boxes/search

One image we will typically use is Ubuntu 18.04. Please note that older versions may not be suitable for class and we will not support any questions about them. This image is located at:

https://app.vagrantup.com/ubuntu/boxes/bionic64
4.3.1 Installation

Vagrant is easy to install. You can go to the download page and download and install the appropriate version:

https://www.vagrantup.com/downloads.html

4.3.1.1 macOS

On macOS, download the dmg image and click on it. You will find a pkg in it that you double click. After installation vagrant is installed in

/usr/local/bin/vagrant

Make sure /usr/local/bin is in your PATH. Start a new terminal to verify this.
Check it with:

echo $PATH

If it is not in the path, put

export PATH=/usr/local/bin:$PATH

in the terminal command or in your ~/.bash_profile.

4.3.1.2 Windows

students contribute

4.3.1.3 Linux

students contribute
4.3.2 Usage

To download, start, and log into the 18.04 image, use:

host$ vagrant init ubuntu/bionic64
host$ vagrant up
host$ vagrant ssh

Once you are logged in, you can test the version of python with

vagrant@ubuntu-bionic:~$ sudo apt-get update
vagrant@ubuntu-bionic:~$ python3 --version
Python 3.6.5

To install a newer version of python and pip, you can use

vagrant@ubuntu-bionic:~$ sudo apt-get install python3.7
vagrant@ubuntu-bionic:~$ sudo apt-get install python3-pip

To install the lightweight idle development environment, in case you do not want to use PyCharm, please use

vagrant@ubuntu-bionic:~$ sudo apt-get install idle-python

So that you do not always have to use the number 3, you can also set an alias with

alias python=python3

When you exit the virtual machine with the exit command, the VM is not terminated. You can use from your host system commands such as

host$ vagrant status
host$ vagrant destroy
host$ vagrant suspend
host$ vagrant resume

to manage the VM.

4.4 LINUX SHELL ☁
Learning Objectives

Be able to use the basic commands to work in a Linux terminal.
Get familiar with Linux commands.

In this chapter we introduce you to a number of useful shell commands. You may ask:

"Why is he so keen on telling me all about shells as I do have a beautiful GUI?"

You will soon learn that a GUI may not be that suitable if you like to manage 10, 100, 1000, 10000, … virtual machines. A command line interface could be much simpler and would allow scripting.
4.4.1 History

LINUX is a reimplementation by the community of UNIX, which was developed in 1969 by Ken Thompson and Dennis Ritchie of Bell Laboratories and rewritten in C. An important part of UNIX is what is called the kernel, which allows the software to talk to the hardware and utilize it.

In 1991 Linus Torvalds started developing a Linux kernel that was initially targeted for PCs. This made it possible to run it on laptops, and it was later further developed into a full operating system replacement for UNIX.

4.4.2 Shell

One of the most important features for us will be to access the computer with the help of a shell. The shell is typically run in what is called a terminal and allows interaction with the computer through command line programs.

There are many good tutorials out there that explain why one needs a Linux shell and not just a GUI. Randomly we picked the first one that came up in a Google query. This is not an endorsement of the material we point to, but it could be a worthwhile read for someone that has no experience in shell programming:

http://linuxcommand.org/lc3_learning_the_shell.php

Certainly you are welcome to use other resources that may suit you best. We will, however, summarize in table form a number of useful commands that you may also find as a RefCard:

http://www.cheat-sheets.org/#Linux

We provide in the next table a number of useful commands that you want to explore. For more information simply type man and the name of the command. If you find a useful command that is missing, please add it with a Git pull request.
Command                        Description

man command                    manual page for the command
apropos text                   list all commands that have text in them
ls                             directory listing
ls -lisa                       list details
tree                           list the directories in graphical form
cd dirname                     change directory to dirname
mkdir dirname                  create the directory
rmdir dirname                  delete the directory
pwd                            print working directory
rm file                        remove the file
cp a b                         copy file a to b
mv a b                         move/rename file a to b
cat a                          print content of file a
cat -n filename                print content of the file with line numbers
less a                         print paged content of file a
head -5 a                      display first 5 lines of file a
tail -5 a                      display last 5 lines of file a
du -hs .                       show in human readable form the space used by the current directory
df -h                          show the details of the disk file system
wc filename                    counts the words in a file
sort filename                  sorts the file
uniq filename                  displays only unique entries in the file
tar -xvf dir                   extracts the contents of the compressed tar archive
rsync                          faster, flexible replacement for rcp
gzip filename                  compresses the file
gunzip filename                uncompresses the file
bzip2 filename                 compresses the file with block-sorting
bunzip2 filename               uncompresses the file with block-sorting
clear                          clears the terminal screen
touch filename                 change file access and modification times or, if the file does not exist, create the file
who                            displays a list of users that are currently logged on; for each user the login name, date and time of login, tty name, and hostname if not local are displayed
whoami                         displays the user's effective id; see also id
echo -n string                 write specified arguments to standard output
date                           displays or sets date and time; when invoked without arguments the current date and time are displayed
logout                         exit a given session
exit                           when issued at the shell prompt, the shell will exit and terminate any running jobs within the shell
kill                           terminate or signal a process by sending a signal to the specified process, usually by the pid
ps                             displays a header line followed by all processes that have controlling terminals
sleep                          suspends execution for an interval of time specified in seconds
uptime                         displays how long the system has been running
time command                   times the command execution in seconds
find / [-name] file-name.txt   searches a specified path or directory with a given expression that tells the find utility what to find; if used as shown, the find utility would search the entire drive for a file named file-name.txt
diff                           compares files line by line
hostname                       prints the name of the current host system
which                          locates a program file in the user's path
tail                           displays the last part of the file
head                           displays the first lines of a file
top                            displays a sorted list of system processes
locate filename                finds the path of a file
grep 'word' filename           finds all lines with the word in them
grep -v 'word' filename        finds all lines without the word in them
chmod ug+rw filename           change file modes or Access Control Lists; in this example user and group are changed to read and write
chown                          change file owner and group
history                        a built-in command to list the past commands
sudo                           execute a command as another user
su                             substitute user identity
uname                          print the operating system name
set -o emacs                   tells the shell to use Emacs commands
chmod go-rwx file              changes the permission of the file
chown username file            changes the ownership of the file
chgrp group file               changes the group of a file
fgrep text filename            searches the text in the given file
grep -R text .                 recursively searches for text in all files
find . -name *.py              find all files with .py at the end
ps                             list the running processes
kill -9 1234                   kill the process with the id 1234
at                             queue commands for later execution
cron                           daemon to execute scheduled commands
crontab                        manage the timetable for executing commands with cron
mount /dev/cdrom /mnt/cdrom    mount a filesystem from a cdrom to /mnt/cdrom
users                          list the logged in users
who                            display who is logged in
whoami                         print the user id
dmesg                          display the system message buffer
last                           indicate last logins of users and ttys
uname                          print operating system name
date                           prints the current date and time
time command                   prints the sys, real and user times
shutdown -h "shutdown"         shut down the computer
ping                           ping a host
netstat                        show network status
hostname                       print name of current host system
traceroute                     print the route packets take to a network host
ifconfig                       configure network interface parameters
host                           DNS lookup utility
whois                          Internet domain name and network number directory service
dig                            DNS lookup utility
wget                           non-interactive network downloader
curl                           transfer a URL
ssh                            remote login program
scp                            remote file copy program
sftp                           secure file transfer program
watch command                  run any designated command at regular intervals
awk                            program that you can use to select particular records in a file and perform operations on them
sed                            stream editor used to perform basic text transformations
xargs                          program that can be used to build and execute commands from STDIN
cat some_file.json | python -m json.tool    quick and easy JSON validator
4.4.3 The command man

On Linux you find a rich set of manual pages for these commands. Try to pick one and execute:

$ man ls

You will see something like this:

LS(1)                  BSD General Commands Manual                 LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory,
     ls displays its name as well as any requested, associated
     information. For each operand that names a file of type directory,
     ls displays the names of files contained within that directory, as
     well as any requested, associated information.

     If no operands are given, the contents of the current directory are
     displayed. If more than one operand is given, non-directory
     operands are displayed first; directory and non-directory operands
     are sorted separately and in lexicographical order.

     The following options are available:

     -@      Display extended attribute keys and sizes in long (-l) output.

     -1      (The numeric digit ``one''.) Force output to be one entry
             per line. This is the default when output is not to a terminal.

     -A      List all entries except for . and ... Always set for the
             super-user.

     -a      Include directory entries whose names begin with a dot (.).

     ... on purpose cut ... instead try it yourself

4.4.4 Multi-command execution

One of the important features is that one can execute multiple commands in the shell.

To execute command2 once command1 has finished, use

command1; command2

To execute command2 as soon as command1 forwards output to stdout, use

command1 | command2

To execute command1 in the background, use

command1 &
4.4.5 Keyboard Shortcuts

These shortcuts will come in handy. Note that many overlap with emacs shortcuts.

Keys       Description

Up Arrow   Show the previous command
Ctrl + z   Stops the current command
           Resume with fg in the foreground
           Resume with bg in the background
Ctrl + c   Halts the current command
Ctrl + l   Clear the screen
Ctrl + a   Return to the start of the line
Ctrl + e   Go to the end of the line
Ctrl + k   Cut everything after the cursor to a special clipboard
Ctrl + y   Paste from the special clipboard
Ctrl + d   Log out of the current session, similar to exit

4.4.6 bashrc, bash_profile or zprofile

To see the usage of a particular command and all the attributes associated with it, use man command. Avoid using the rm -r command to delete files recursively. A good way to avoid accidental deletion is to include the following in the file .bash_profile or .zprofile on macOS, or .bashrc on other platforms:

alias rm='rm -i'
alias mv='mv -i'
alias h='history'

4.4.7 Makefile
Makefiles allow developers to coordinate the execution of code compilations. This not only includes C or C++ code, but any translation from source to a final format. For us this could include the creation of PDF files from latex sources, the creation of docker images, the creation of cloud services and their deployment through simple workflows represented in makefiles, or the coordination of execution targets.

As makefiles include a simple syntax allowing structural dependencies, they can easily be adapted to fulfill simple activities to be executed in repeated fashion by developers.

An example of how to use Makefiles for docker is provided at http://jmkhael.io/makefiles-for-your-dockerfiles/.

An example of how to use Makefiles for LaTeX is provided at https://github.com/cloudmesh/book/blob/master/Makefile.
Makefiles include a number of rules that are defined by a target name. Let us define a target called hello that prints out the string "Hello World":

hello:
	@echo "Hello World"

Important to remember is that the commands after a target are not indented just by spaces, but actually by a single TAB character. Editors such as emacs will be ideal to edit such Makefiles, while allowing syntax highlighting and easy manipulation of TABs. Naturally other editors will do that also. Please choose your editor of choice. One of the best features of targets is that they can depend on other targets. Thus, if we define

hallo: hello
	@echo "Hallo World"

our makefile will first execute hello and then all commands in hallo. As you can see, this can be very useful for defining simple dependencies.

In addition we can define variables in a makefile such as

HELLO="Hello World"

hello:
	@echo $(HELLO)

and can use them in our text with $ invocations.
Moreover, in sophisticated Makefiles, we could even make the targets dependent on files, and target rules could be defined that only compile those files that have changed since our last invocation of the Makefile, saving potentially a lot of time. However, for our work here we just use the most elementary makefiles; a small file-dependent rule is sketched next.
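As a minimal sketch (not from the original text), assuming a hypothetical LaTeX source report.tex, the following rule rebuilds report.pdf only when report.tex has changed (remember the TAB indentation):

# report.pdf is only rebuilt when report.tex is newer than it
report.pdf: report.tex
	pdflatex report.tex

clean:
	rm -f report.pdf *.aux *.log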
For more information we recommend that you find out about it on the internet. A convenient reference card is available at http://www.cs.jhu.edu/~joanne/unixRC.pdf.
4.4.8 chmod

The chmod command stands for change mode and changes the access permissions for given file system object(s). It uses the following syntax: chmod [options] mode[,mode] file1 [file2 …]. The option parameters modify how the process runs, including what information is output to the shell:
Option                  Description

-f, --silent, --quiet   Forces process to continue even if errors occur
-v, --verbose           Outputs for every file that is processed
-c, --changes           Outputs when a file is changed
--reference=RFile       Uses RFile instead of Mode values
-R, --recursive         Make changes to objects in subdirectories as well
--help                  Show help
--version               Show version information
Modes specify which rights to give to which users. Potential users include the user who owns the file, users in the file's group, other users not in the file's group, and all, abbreviated as u, g, o, and a respectively. More than one user can be specified in the same command, such as chmod -v ug(operator)(permissions) file.txt. If no user is specified, the command defaults to a. Next, a + or - indicates whether permissions should be added or removed for the selected user(s). The permissions are as follows:
Permission   Description

r            Read
w            Write
x            Execute file or access directory
X            Execute only if the object is a directory
s            Set the user or group ID when running
t            Restricted deletion flag or sticky mode
u            Specifies the permissions the user who owns the file has
g            Specifies the permissions of the group
o            Specifies the permissions of users not in the group
More than one permission can also be used in the same command as follows:

$ chmod -v o+rw file.txt

Multiple files can also be specified:

$ chmod a-x,o+r file1.txt file2.txt

4.4.9 Exercises

E.Linux.1

Familiarize yourself with the commands.

E.Linux.2

Find more commands that you find useful and add them to this page.

E.Linux.3

Use the sort command to sort all lines of a file while removing duplicates.

E.Linux.4

Should there be other commands listed in the table with the Linux commands? If so, which? Create a pull request for them.

E.Linux.5

Write a section explaining chmod. Use letters, not numbers.

E.Linux.6

Write a section explaining chown. Use letters, not numbers.

E.Linux.7

Write a section explaining su and sudo.

E.Linux.8

Write a section explaining cron, at, and crontab.
4.5 SECURE SHELL ☁

Learning Objectives

This is a very important section of the book; study it carefully.
Learn how to use SSH keys.
Learn how to use ssh-add and ssh-keychain so you only have to type in your password once.
Understand that each computer needs its own ssh key.
Secure Shell is a network protocol allowing users to securely connect to remote resources over the internet. In many services we need to use SSH to assure that we protect the messages sent between the communicating entities. Secure Shell is based on public key technology, requiring the generation of a public-private key pair on the computer. The public key will then be uploaded to the remote machine, and when a connection is established during authentication, the public-private key pair is tested. If they match, authentication is granted. As many users may have to share a computer, it is possible to add a list of public keys so that a number of computers can connect to a server that hosts such a list. This mechanism builds the basis for networked computers.

In this section we will introduce you to some of the commands to utilize secure shell. We will reuse this technology in other sections, for example to create a network of workstations to which we can log in from your laptop. For more information please also consult the SSH Manual.

Whatever others tell you, the private key should never be copied to another machine. You almost always want to have a passphrase protecting your key.
4.5.1 ssh-keygen
The first thing you will need to do is to create a public-private key pair. Before you do this, check whether there are already keys on the computer you are using:

ls ~/.ssh

If there are files named id_rsa.pub or id_dsa.pub, then the keys are set up already, and we can skip the generating keys step. However, you must know the passphrase of the key. If you forgot it, you will need to recreate the key. However, you will lose the ability to connect with the old key to the resources to which you uploaded the public key. So be careful.

To generate a key pair, use the command ssh-keygen. This program is commonly available on most UNIX systems and most recently even Windows 10.

To generate the key, please type:

$ ssh-keygen -t rsa -C <comment>

The comment will remind you where the key has been created; you could, for example, use the hostname on which you created the key.

In the following text we will use localname to indicate the username on your computer on which you execute the command.
The command requires the interaction of the user. The first question is:

Enter file in which to save the key (/home/localname/.ssh/id_rsa):

We recommend using the default location ~/.ssh/ and the default name id_rsa. To do so, just press the enter key.

The second and third questions are there to protect your ssh key with a passphrase. This passphrase will protect your key because you need to type it when you want to use the key. Thus, you can either type a passphrase or press enter to leave it without a passphrase. To avoid security problems, you MUST choose a passphrase. Make sure to not just type return for an empty passphrase:

Enter passphrase (empty for no passphrase):

and:

Enter same passphrase again:

Please use a strong passphrase to protect your key appropriately. Some may advise you (including teachers and TAs) to not use passphrases. This is WRONG, as it allows someone that gains access to your computer to also gain access to all resources that have the public key. Only for some system-related services may you create passwordless keys, but such systems need to be properly protected.

Not using passphrases poses a security risk!

If executed correctly, you will see some output similar to:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/localname/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/localname/.ssh/id_rsa.
Your public key has been saved in /home/localname/.ssh/id_rsa.pub.
The key fingerprint is:
34:87:67:ea:c2:49:ee:c2:81:d2:10:84:b1:3e:05:[email protected]
+--[RSA 2048]----+
|.+...Eo=.       |
|..=.o+o+o       |
|O.=......       |
|=...            |
+-----------------+

Once you have generated your key, you should have it in the .ssh directory. You can check it by:

$ cat ~/.ssh/id_rsa.pub

If everything is normal, you will see something like:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCXJH2iG2FMHqC6T/U7uB8kt
6KlRh4kUOjgw9sc4Uu+Uwe/kshuispauhfsjhfm,anf6787sjgdkjsgl+EwD0
thkoamyi0VvhTVZhj61pTdhyl1t8hlkoL19JVnVBPP5kIN3wVyNAJjYBrAUNW
4dXKXtmfkXp98T3OW4mxAtTH434MaT+QcPTcxims/hwsUeDAVKZY7UgZhEbiE
xxkejtnRBHTipi0W03W05TOUGRW7EuKf/4ftNVPilCO4DpfY44NFG1xPwHeim
Uk+t9h48pBQj16FrUCp0rS02Pj+4/9dNeS1kmNJu5ZYS8HVRhvuoTXuAY/UVc

The directory ~/.ssh will also contain the private key id_rsa, which you must not share or copy to another computer.

Never copy your private key to another machine or check it into a repository!

To see what is in the .ssh directory, please use

$ ls ~/.ssh

Typically you will see a list of files such as

authorized_keys
id_rsa
id_rsa.pub
known_hosts

In case you need to change your passphrase, you can simply run the ssh-keygen -p command. Then specify the location of your current key, and input the (old and) new passphrases. There is no need to re-generate keys:

$ ssh-keygen -p

You will see the following output once you have completed that step:

Enter file in which the key is (/home/localname/.ssh/id_rsa):
Enter old passphrase:
Key has comment '/home/localname/.ssh/id_rsa'
Enter new passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved with the new passphrase.
4.5.2 ssh-add

Often you will find wrong information about passphrases on the internet, with people recommending you not to use one. However, it is in almost all cases better to create a key pair and use ssh-add to add the key to the current session so it can be used on behalf of you. This is accomplished with an agent.

The ssh-add command adds SSH private keys into the SSH authentication agent for implementing single sign-on with SSH. ssh-add allows the user to use any number of servers that are spread across any number of organizations, without having to type in a password every time when connecting between servers. This is commonly used by system administrators to log into multiple servers.

ssh-add can be run without arguments. When run without arguments, it adds the following default files if they do exist:

~/.ssh/identity - Contains the protocol version 1 RSA authentication identity of the user.
~/.ssh/id_rsa - Contains the protocol version 2 RSA authentication identity of the user.
~/.ssh/id_dsa - Contains the protocol version 2 DSA authentication identity of the user.
~/.ssh/id_ecdsa - Contains the protocol version 2 ECDSA authentication identity of the user.
To add a key, you can provide the path of the key file as an argument to ssh-add. For example,

ssh-add ~/.ssh/id_rsa

would add the file ~/.ssh/id_rsa.

If the key being added has a passphrase, ssh-add will run the ssh-askpass program to obtain the passphrase from the user. If the SSH_ASKPASS environment variable is set, the program given by that environment variable is used instead.

Some people use the SSH_ASKPASS environment variable in scripts to provide a passphrase for a key. The passphrase might then be hard-coded into the script, or the script might fetch it from a password vault.
The command line options of ssh-add are as follows:

Option     Description

-c         Causes a confirmation to be requested from the user every time the added identities are used for authentication. The confirmation is requested using ssh-askpass.
-D         Deletes all identities from the agent.
-d         Deletes the given identities from the agent. The private key files for the identities to be deleted should be listed on the command line.
-e pkcs11  Remove key provided by pkcs11.
-L         Lists public key parameters of all identities currently represented by the agent.
-l         Lists fingerprints of all identities currently represented by the agent.
-s pkcs11  Add key provided by pkcs11.
-t life    Sets the maximum time the agent will keep the given key. After the timeout expires, the key will be automatically removed from the agent. The default value is in seconds, but can be suffixed with m for minutes, h for hours, d for days, or w for weeks.
-X         Unlocks the agent. This asks for a password to unlock.
-x         Locks the agent. This asks for a password; the password is required for unlocking the agent. When the agent is locked, it cannot be used for authentication.
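For example, to list the fingerprints of all keys currently held by the agent, you can use:

$ ssh-add -l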
4.5.3 SSH Add and Agent

To not always type in your password, you can use ssh-add, as previously discussed. It prompts the user for a private key passphrase and adds the key to a list of keys managed by the ssh-agent. Once it is in this list, you will not be asked for the passphrase as long as the agent is running; the agent then authenticates on your behalf with your public key. To use the key across terminal shells you can start an ssh agent.

To start the agent, please use the following command:

$ eval `ssh-agent`

or use

$ eval "$(ssh-agent -s)"

It is important that you use the backquote, located under the tilde (US keyboard), rather than the single quote. Once the agent is started, it will print a PID that you can use to interact with it later.

To add the key, use the command

$ ssh-add

To remove the agent, use the command

kill $SSH_AGENT_PID

To execute the command upon logout, place it in your .bash_logout (assuming you use bash).

On OSX you can also add the key permanently to the keychain if you do the following:

ssh-add -K ~/.ssh/id_rsa

Modify the file .ssh/config and add the following lines:

Host *
  UseKeychain yes
  AddKeysToAgent yes
  IdentityFile ~/.ssh/id_rsa

4.5.3.1 Using SSH on Mac OS X

Mac OS X comes with an ssh client. In order to use it you need to open the Terminal.app application. Go to Finder, then click Go in the menu bar at the top of the screen. Now click Utilities and then open the Terminal application.

4.5.3.2 Using SSH on Linux

All Linux versions come with ssh, and it can be used right from the terminal.
4.5.3.3 Using SSH on Raspberry Pi 3/4

SSH is available on Raspbian. However, to ssh into the Pi you have to activate it via the configuration menu.

4.5.3.4 Accessing a Remote Machine

Once the key pair is generated, you can use it to access a remote machine. To do so, the public key needs to be added to the authorized_keys file on the remote machine.
The easiest way to do this is to use the command ssh-copy-id:

$ ssh-copy-id user@host

Note that the first time you will have to authenticate with your password.

Alternatively, if ssh-copy-id is not available on your system, you can copy the file manually over SSH:

$ cat ~/.ssh/id_rsa.pub | ssh user@host 'cat >> .ssh/authorized_keys'

Now try:

$ ssh user@host

and you will not be prompted for a password. However, if you set a passphrase when creating your SSH key, you will be asked to enter the passphrase at that time (and whenever else you log in in the future). To avoid typing in the password all the time, we use the ssh-add command that we described earlier:

$ ssh-add

4.5.4 SSH Port Forwarding

TODO: Add images to illustrate the concepts.

SSH port forwarding (SSH tunneling) creates an encrypted secure connection between a local computer and a remote computer through which services can be relayed. Because the connection is encrypted, SSH tunneling is useful for transmitting information that uses an unencrypted protocol.
4.5.4.1 Prerequisites

Before you begin, you need to check if forwarding is allowed on the SSH server you will connect to. You also need to have an SSH client on the computer you are working on.

If you are using the OpenSSH server, edit:

$ vi /etc/ssh/sshd_config

and look for and change the following:

AllowTcpForwarding yes
GatewayPorts yes

Set the GatewayPorts variable only if you are going to use remote port forwarding (discussed later in this tutorial). Then, you need to restart the server for the change to take effect.

4.5.4.2 How to Restart the Server

If you are on:

Linux, depending upon the init system used by your distribution, run:

$ sudo systemctl restart sshd

or

$ sudo service sshd restart

Note that depending on your distribution, you may have to change the service to ssh instead of sshd.

Mac, you can restart the server using:

$ sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
$ sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist

Windows: if you want to set up an SSH server, have a look at MSYS2 or Cygwin.

4.5.4.3 Types of Port Forwarding

There are three types of SSH port forwarding:
4.5.4.4LocalPortForwarding
Local port forwarding lets you connect from your local computer to anotherserver. Itallowsyou to forward trafficonaportofyour localcomputer to theSSH server, which is forwarded to a destination server. To use local portforwarding,youneedtoknowyourdestinationserver,andtwoportnumbers.
Example1:
Where <host> should be replaced by the name of your laptop. The -L optionspecifies local port forwarding. For the duration of the SSH session, pointingyourbrowserathttp://localhost:8080/wouldsendyoutohttp://cloudcomputing.com
Example2:
Thisexampleopensaconnectiontothewww.cloudcomputing.comjumpserver,and forwards any connection to port 80 on the local machine to port 80 onintra.example.com.
Example3:
By default, anyone (even on different machines) can connect to the specifiedportontheSSHclientmachine.However,thiscanberestrictedtoprogramsonthesamehostbysupplyingabindaddress:
Example4:
This would forward two connections, one to www.cloudcomputing.com, the other towww.cloud.com.Pointingyourbrowserathttp://localhost:8080/woulddownloadpagesfromwww.cloudcomputing.com,andpointingyourbrowsertohttp://localhost:12345/woulddownloadpagesfromwww.cloud.com.
Example5:
$ssh-L8080:www.cloudcomputing.org:80<host>
$ssh-L80:intra.example.com:80www.cloudcomputing.com
$ssh-L127.0.0.1:80:intra.example.com:80www.cloudcomputing.com
$ssh-L8080:www.Cloudcomputing.com:80-L12345:cloud.com:80<host>
ThedestinationservercanevenbethesameastheSSHserver.
TheLocalForwardoptionintheOpenSSHclientconfigurationfilecanbeusedtoconfigureforwardingwithouthavingtospecifyitoncommandline.
4.5.4.5RemotePortForwarding
Remote port forwarding is the exact opposite of local port forwarding. Itforwardstrafficcomingtoaportonyourservertoyourlocalcomputer,andthenit is sent to adestination.The first argument shouldbe the remoteportwheretrafficwillbedirectedontheremotesystem.Thesecondargumentshouldbetheaddressandporttopointthetraffictowhenitarrivesonthelocalsystem.
SSH does not by default allow remote hosts to forwarded ports. To enableremoteforwardingaddthefollowingto:/etc/ssh/sshd_config
andrestartSSH
Aftercompletingthepreviousstepsyoushouldbeabletoconnecttotheserverremotely, even fromyour localmachine. ssh-R first creates an SSH tunnel thatforwards traffic from the server on port 9000 to your local machine on port3000.
4.5.4.6DynamicPortForwarding
Dynamic port forwarding turns your SSH client into a SOCKS proxy server.SOCKS is a little-known but widely-implemented protocol for programs torequestanyInternetconnectionthroughaproxyserver.Eachprogramthatusestheproxyserverneedstobeconfiguredspecifically,andreconfiguredwhenyoustopusingtheproxyserver.
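The command that creates such a proxy is not shown in this text; a minimal sketch using ssh's -D option (the port number and <host> are placeholders) is:

$ ssh -D 5000 <host>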
The SSH client creates a SOCKS proxy at port 5000 on your local computer. Any traffic sent to this port is sent to its destination through the SSH server.

Next, you will need to configure your applications to use this server. The Settings section of most web browsers allows you to use a SOCKS proxy.
4.5.4.7 ssh config

Defaults and other configurations can be added to a configuration file that is placed in the system. The ssh program on a host receives its configuration from

the command line options
a user-specific configuration file: ~/.ssh/config
a system-wide configuration file: /etc/ssh/ssh_config

Next we provide an example of how to use a config file.
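The original example is not reproduced here; a minimal sketch of a user-specific ~/.ssh/config entry, in which the alias, host name, user, and key file are illustrative placeholders, could look like:

Host myserver
    HostName server.example.com
    User albert
    IdentityFile ~/.ssh/id_rsa

With such an entry, typing ssh myserver expands to the full command line with the given host, user, and key.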
4.5.4.8 Tips

Use SSH keys

You will need to use ssh keys to access remote machines.

No blank passphrases

In most cases you must use a passphrase with your key. In fact, if we find that you use passwordless keys to futuresystems and to chameleon cloud resources, we may elect to give you an F for the assignment in question. There are some exceptions, but they will be clearly communicated to you in class. As part of your cloud drivers license test you will explain how you gain access to futuresystems and chameleon, explicitly address this point, and provide us with reasons for what you cannot do.

A key for each server

Under no circumstances copy the same private key on multiple servers. This violates security best practices. Create for each server a new private key and use their public keys to gain access to the appropriate server.

Use SSH agent

So as not to have to type in the passphrase for a key all the time, we recommend using ssh-agent to manage the login. This will be part of your cloud drivers license.
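As a short sketch, starting the agent and adding a key typically looks as follows (the key path shown is the default and may differ on your system):

$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_rsa

ssh-add prompts once for the passphrase and caches the unlocked key in memory for subsequent logins.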
But shut down the ssh-agent if not in use.

Keep an offline backup, but encrypt the drive

You may for some of our projects need to make backups of private keys on other servers you set up. If you like to make a backup you can do so on a USB stick, but make sure that access to the stick is encrypted. Do not store anything else on that stick and lock it in a safe place. If you lose the stick, recreate all keys on all machines.
4.5.4.9 References

The Secure Shell: The Definitive Guide, 2nd Edition (O'Reilly and Associates)
4.5.5 SSH to FutureSystems Resources ☁

Learning Objectives

Obtain a FutureSystems account so you can use kubernetes or docker swarm or other services offered by FutureSystems.

Next, you need to upload the key to the portal. You must be logged into the portal to do so.

Step 1: Log into the portal https://portal.futuresystems.org/

Step 2: Click on the "MY ACCOUNT" link.

Step 3: Click on "EDIT".

Step 4: Paste your ssh key into the box marked Public SSH Key. Use a text editor to open the id_rsa.pub. Copy the entire contents of this file into the ssh key field as part of your profile information. Many errors are introduced by users in this step as they do not copy and paste correctly.

If you need to add keys, use the Add another item button.

At this point, you have uploaded your key. However, you will still need to wait till all accounts have been set up to use the key, or, if you did not have an account, till it has been created by an administrator. Please check your email for further updates. You can also refresh this page and see if the boxes in your account status information are all green. Then you can continue.

4.5.5.1 Testing your FutureSystems ssh key

If you have had no FutureSystems account before, you need to wait for up to two business days so we can verify your identity and create the account. So please wait. Otherwise, testing your new key is almost instantaneous on india. For other clusters it can take around 30 minutes to update the ssh keys.

To log into india simply type the usual ssh command, such as (replace <username> with your portal username):

$ ssh <username>@india.futuresystems.org

The first time you ssh into a machine you will see a message like this:

The authenticity of host 'india.futuresystems.org (192.165.148.5)' cannot be established.
RSA key fingerprint is 11:96:de:b7:21:eb:64:92:ab:de:e0:79:f3:fb:86:dd.
Are you sure you want to continue connecting (yes/no)? yes

You have to type yes and press enter. Then you will be logged into india. Other FutureSystems machines can be reached in the same fashion. Just replace the name india with the appropriate FutureSystems resource name.

4.5.6 Exercises ☁

E.SSH.1:
Create an SSH key pair.

E.SSH.2:

Upload the public key to the git repository you use.

E.SSH.3:

What is the output for a key that has a passphrase when executing the following command? Test it out on your key:

$ grep ENCRYPTED ~/.ssh/id_rsa

E.SSH.4:

Get an account on futuresystems.org (if you are authorized to do so). Upload your key to https://futuresystems.org. Log in to india.futuresystems.org. Note that this could take some time as administrators need to approve you. Be patient.

E.SSH.5:

What can happen if you copy your private key to a machine on the network?

E.SSH.6:

Should I share my private key with others?

E.SSH.7:

Assume I participate in a video conference call and I accidentally share my private key. What should I do?

E.SSH.8:

Assume I participate in a video conference call and I accidentally share my public key. What should I do?
4.6 GITHUB ☁
Learning Objectives

Be able to use the github cloud services to collaboratively develop content and programs. Be able to use github as part of an open source project.

In some classes the material may be openly shared in code repositories. This includes class material, papers, and projects. Hence, we need some mechanism to share content with a large number of students.

First, we like to introduce you to git and github.com (Section 1.1). Next, we provide you with the basic commands to interact with git from the command line (Section 1.12). Then we will show you how you can contribute to this set of documentation with pull requests.

4.6.1 Overview

Github is a code repository that allows the development of code and documents with many contributors in a distributed fashion. There are many good tutorials about github. Some of them can be found on the github Web page. An interactive tutorial is for example available at

https://try.github.io/

However, although these tutorials are helpful, in many cases they do not address some situations. For example, you may already have a repository set up by your organization and you do not have to completely initialize it. Thus do not just replicate the commands in the tutorials, or the ones we present here, without evaluating their consequences. In general make sure you verify that a command does what you expect before you execute it.

A more extensive list of tutorials can be found at

https://help.github.com/articles/what-are-other-good-resources-for-learning-git-and-github

The github foundation has a number of excellent videos about git. If you are unfamiliar with git and you like to watch videos in addition to reading the documentation, we recommend these videos:

https://www.youtube.com/user/GitHubGuides/videos

Next, we introduce some important concepts used in github.
4.6.2 Upload Key

Before you can work with a repository in an easy fashion you need to upload a public key in order to access your repository. Naturally, you need to generate a key first, which is explained in the section about ssh key generation (TODO: lessons-ssh-generate-key include link), before you upload one. Copy the contents of your .ssh/id_rsa.pub file and add them to your github keys.
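To display the contents so you can copy them, a simple command is (assuming the default key location):

$ cat ~/.ssh/id_rsa.pub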
More information on this topic can be found on the github Web page.
4.6.3 Fork

Forking is the first step to contributing to projects on GitHub. Forking allows you to copy a repository and work on it under your own account. Next, creating a branch, making some changes, and offering a pull request to the original repository rounds out your contribution to the open source project.

Git 1:41 Fork

4.6.4 Rebase

When you start editing your project, you diverge from the original version. During your development, the original version may be updated, or other developers may have some of their branches implementing good features that you would like to include in your current work. That is when rebase becomes useful. When you rebase onto a certain point, which could be a newer master or another custom branch, you graft all your ongoing work right onto that point.

Rebase may fail, because sometimes it is impossible to achieve what we just described, as conflicts may exist. For example, you and the to-be-rebased copy both edited some common text section. Once this happens, human intervention needs to take place to resolve the conflict.

Git 4:20 Rebase
4.6.5 Remote

Collaborating with others involves managing the remote repositories and pushing and pulling data to and from them when you need to share work. Managing remote repositories includes knowing how to add remote repositories, remove remotes that are no longer valid, manage various remote branches and define them as being tracked or not, and more.

Throughout this semester, you will typically work on two remote repos. One is the official class repo, and the other is the repo you forked from the class repo. The class repo is used as the centralized, authoritative, and final version of all student submissions. The repo under your own Github account is for your personal storage. To show progress on a weekly basis you need to commit your changes on a weekly basis. However, make sure that things in the master branch are working. If not, just use another branch to conduct your changes and merge at a later time. We like you to call your development branch dev; a sketch of this workflow is shown after the following link.

https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes
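A minimal sketch of the branch-and-merge workflow described above, using the suggested dev branch (commit messages and file selection are illustrative):

$ git checkout -b dev            # create and switch to the development branch
$ git add .
$ git commit -m "work in progress"
$ git checkout master            # once things work, merge back into master
$ git merge dev
$ git push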
4.6.6 Pull Request

Pull requests are a means of starting a conversation about a proposed change back into a project. We will be taking a look at the strength of conversation, integration options for fuller information about a change, and cleanup strategy for when a pull request is finished.

Git 4:26 Pull Request

4.6.7 Branch

Branches are an excellent way to not only work safely on features or experiments, but they are also the key element in creating Pull Requests on GitHub. Let's take a look at why we want branches, how to create and delete branches, and how to switch branches in this episode.

Git 2:25 Branch

4.6.8 Checkout

Change where and what you are working on with the checkout command. Whether we are switching branches, wanting to look at the working tree at a specific commit in history, or discarding edits we want to throw away, all of these can be done with the checkout command.

Git 3:11 Checkout
4.6.9 Merge

Once you know branches, merging that work into master is the natural next step. Find out how to merge branches, identify and clean up merge conflicts, or avoid conflicts until a later date. Lastly, we will look at combining the merged feature branch into a single commit and cleaning up your feature branch after merges, as sketched next.
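One way to combine a feature branch into a single commit, as mentioned above, is a squash merge; a short sketch (the branch name feature is a placeholder):

$ git checkout master
$ git merge --squash feature
$ git commit -m "add feature as a single commit"

git merge --squash stages all changes from the branch without creating a merge commit, so the subsequent commit records them as one.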
Git 3:11 Merge
4.6.10 GUI

Using Graphical User Interfaces can supplement your use of the command line to get the best of both worlds. GitHub for Windows and GitHub for Mac allow for switching to the command line, ease of grabbing repositories from GitHub, and participating in a particular pull request. We will also see that the auto-updating functionality helps us stay up to date with stable versions of Git on the command line.

Git 3:47 GUI

There are many other git GUI tools available that directly integrate into your operating system finders, windows, ..., or PyCharm. It is up to you to identify such tools and see if they are useful for you. Most of the people we work with use git from the command line, even if they use PyCharm, eclipse, or other tools that have built-in git support. You can identify a tool that works best for you.

4.6.11 Windows

This is a quick tour of GitHub for Windows. It offers GitHub newcomers a brief overview of what this feature-loaded version control tool and an equally powerful web application can do for developers, designers, and managers using Windows in both the open source and commercial software worlds. More: http://windows.github.com

Git 1:25 Windows

4.6.12 Git from the Command line

Although github.com provides a powerful GUI and other GUI tools are available to interface with github.com, the use of git from the command line can often be faster and in many cases may be simpler.

Git command line tools can be easily installed on a variety of operating systems including Linux, macOS, and Windows. Many great tutorials exist that will allow you to complete this task easily. We found the following two tutorials sufficient to get the task accomplished:

https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
https://www.atlassian.com/git/tutorials/install-git

Although the latter is provided by an alternate repository to github, the installation instructions are very nice and are not impacted by it. Once you have installed git you need to configure it.

4.6.13 Configuration

Once you have installed Git, you need to configure it properly. This includes setting up your user name, email address, line endings, and color, along with the settings' associated configuration scopes.

Git 2:47 Configuration
It is important to make sure that you use the git config command to initialize git for the first time on each new computer system or virtual machine you use. This will ensure that you use the same name and e-mail on all resources, so that the git history and log will show your checkins consistently across all devices and computers you use. If you do not do this, your checkins in git do not show up in a consistent fashion as a single user. Thus on each computer execute the following commands (the e-mail command is implied by the text above; the values shown are placeholders):

$ git config --global user.name "Albert Zweistein"
$ git config --global user.email "albert@example.com"

where you replace the information with the information related to you. You can set the editor to emacs with:

$ git config --global core.editor emacs

Naturally, if you happen to want to use other editors you can configure them by specifying the command that starts them up. You will also need to decide if you want to push branches individually or all branches at the same time. It will be up to you to decide what will work for you best. We found that the following seems to work best:

$ git config --global push.default matching

More information about a first time setup is documented at:

* http://git-scm.com/book/en/Getting-Started-First-Time-Git-Setup

To check your setup you can say:

$ git config --list

One problem we observed is that students often simply copy and paste instructions, but do not read carefully the error that is reported back and do not fix it. The proper setting of push.default is often overlooked. Thus we remind you: please read the information on the screen when you set things up.

4.6.14 Upload your public key
Please upload your public key to the repository as documented in github, by going to your account and finding it in settings. There you will find a panel SSH key that you can click on, which brings you to the window allowing you to add a new key. If you have difficulties with this, find a video from the github foundation that explains this.

4.6.15 Working with a directory that will be provided for you

In case your course provided you with a github directory, starting and working in it is going to be real simple. Please wait till an announcement to the class is sent before you ask us questions about it.

If you are the only student working on this you still need to make sure that papers or programs you manage in the repository work and do not interfere with scripts that instructors may use to check your assignments. Thus it is good to still create a branch, work in the branch, and then merge the branch into the master once you verified things work. After you merged you can push the content to the github repository.

Tip: Please use only lowercase characters in the directory names and no special characters such as @;/_ and spaces. In general we recommend that you avoid using directory names with capital letters, spaces, and _ in them. This will simplify your documentation efforts and make the URLs from git more readable. Also note that while on some OS's the directory MyDirectory is different from mydirectory, on macOS it is considered the same, and thus renaming from capital to lowercase cannot be done without first renaming it to another directory.
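A sketch of such a two-step rename on a case-insensitive file system (the directory names are examples):

$ git mv MyDirectory tmp
$ git mv tmp mydirectory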
Your homework for submission should be organized according to folders in your clone repository. To submit a particular assignment, you must first add it using:

git add <name of the file you are adding>

Afterwards, commit it using:

git commit -m "message describing your submission"

Then push it to your remote repository using:

git push

If you want to modify your submission, you only need to commit the changed file:

git commit -m "message relating to updated file"

and afterwards push it:

git push

If you lose any documents locally, you can retrieve them from your remote repository using:

git pull
4.6.16 README.yml and notebook.md

In case you take classes e516 and e616 with us you will have to create a README.yml and notebook.md file in the top most directory of your repository. It serves the purpose of identifying your submission for homework and information about yourself.

It is important to follow the format precisely. As it is yaml, it is an easy homework to write a 4 line python script that validates if the README.yml file is valid. In addition, you can use programs such as yamllint, which is documented at

https://yamllint.readthedocs.io/en/latest/
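As a small sketch of such a validation script (the file name README.yml and an installed PyYAML package are assumptions):

import yaml

# raises an exception, and thus a nonzero exit code, if the file is not valid YAML
with open("README.yml") as f:
    yaml.safe_load(f)
print("README.yml is valid YAML")

Alternatively, after pip install yamllint, you can run yamllint README.yml from the terminal.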
This file is used to integrate your assignments into a proceedings. An example is provided at

https://github.com/cloudmesh-community/hid-sample/blob/master/README.yml

Any deviation from this format will not allow us to see your homework, as our automated scripts will use the README.yml to detect it. Make sure the file does not contain any TABs. Please also mind that all file names of all homework and the main directory must be lowercase and must not include spaces. This will simplify your task of managing the files across different operating systems.
In case you work in a team, on a submission, the document will only besubmitted in the author andhid that is listed first.All other readme files,will
gitcommit-m"messagerelatingtoupdatedfile"
gitpush
gitpull
haveforthatparticularartifactaduplicate:yesentrytoindicatethatthissubmissionismanagedelsewhere.The teamwillbe responsible tomanage theirownpullrequests, but if the team desires we can grant access for all members to arepositorybyauser.Pleasebeaware thatyoumustmakesureyoucoordinatewithyourteam.
Wewillnotacceptsubmissionofhomeworkaspdfdocumentsor tarfiles.Allassignmentsmust be submitted as code and the reports in native latex and ingithub.WehaveascriptthatwillautomaticallycreatethePDFandincludeitina proceedings. There is no exception from this rule and all reports notcompilable will be returned without review and if not submitted within thedeadlinereceiveapenalty.
PleasecheckwithyourinstructorontheformatoftheREADME.yamlfileasitcouldbedifferentforyourclass.
Toseeanexamplefor thenotebook.mdfile,youcanvisitoursamplehid,andbrowsetothenotebook.mdfile.Alternativelyyoucanvisitthefollowinglink
https://github.com/cloudmesh-community/hid-sample/blob/master/notebook.md
Thepurposeofthenotebookmdfileistorecordwhatyoudidintheclasstous.Wewillusethisfileattheendoftheclasstomakesureyouhaverecordedonaweekly basis what you did for the class. Inactivity is a valid response. Notupdatingthenotebook,isnot.
Thesampledirectorycontainsotherusefuldirectoriesandsamples,thatyoumaywant to investigate in more detail. One of the most important samples is thegithubissues(seeSection1.19).Thereisevenavideointhatsectionaboutthisandshowcasesyouhowtoorganizeyourtaskswithinthisclass,whilecopyingthe assignments from piazza into one ormore github issues.Aswe are aboutcloud computing, using the services offered by a prominent cloud computingservicesuchasgithubispartofthelearningexperienceofthiscourse.
4.6.17 Contributing to the Document

It is relatively easy to contribute to the document if you understand how to use github. The first thing you will need to do is to create a fork of the repository. The easiest way to do this is to visit the URL

https://github.com/cloudmesh-community/book

Towards the upper right corner you will find a link called Fork. Click on it and choose into which account you like to fork the original repository. Next you will create a clone from your forked repository. You will see in your fork a green clone button. You will see a URL that you can copy into your terminal. If the link does not include your username, it is the wrong link.
In your terminal you now say:

$ git clone https://github.com/<yourusername>/book

Now cd into this directory and make your changes:

$ cd book

Use the usual git commands such as git add, git commit, and git push.

Note that you will push into your own fork.

4.6.17.1 Stay up to date with the original repo

From time to time you will see that others are contributing to the original repo. To stay up to date you want to not only sync from your local copy, but also from the original repo. To link your repo with what is called the upstream, you need to do the following once, so that a git pull also pulls from the upstream.

Make sure you have the upstream repo defined:

$ git remote add upstream \
    https://github.com/cloudmesh-community/book

Now get the latest from upstream and rebase your work onto it:

$ git rebase upstream/master

In this step, the conflicting file shows up (in our case it was refs.bib):

$ git status

should show the name of the conflicting file, and

$ git diff <filename>

should show the actual differences. Maybe in some cases it is easy to simply take the latest version from upstream and reapply your changes.

So you can decide to check out one version earlier of the specific file. At this stage, the rebase should be complete. So, you need to commit and push the changes to your fork:

$ git commit
$ git rebase origin/master
$ git push

Then reapply your changes to refs.bib - simply use the backed up version and use the editor to redo the changes.

At this stage, only refs.bib is changed:

$ git status

should show the changes only in refs.bib. Commit this change using:

$ git commit -a -m "new: usr: <message>"

And finally push the last committed change:

$ git push

The changes in the file to resolve the merge conflict automatically go into the original pull request, and the pull request can be merged automatically.

You still have to issue the pull request from the Github Web page so it is registered with the upstream repository.
4.6.17.2 Resources

Pro Git book
Official tutorial
Official documentation
TutorialsPoint on git
Try git online
GitHub resources for learning git. Note: this is for github and not for gitlab. However, as it is for git, the only thing you have to do is replace github with gitlab.

In addition, the tutorials from atlassian are a good source. However, remember that you may not use bitbucket as the repository, so ignore those tutorials. We found the following useful:

What is git: https://www.atlassian.com/git/tutorials/what-is-git
Installing git: https://www.atlassian.com/git/tutorials/install-git
git config: https://www.atlassian.com/git/tutorials/setting-up-a-repository#git-config
git clone: https://www.atlassian.com/git/tutorials/setting-up-a-repository#git-clone
saving changes: https://www.atlassian.com/git/tutorials/saving-changes
collaborating with git: https://www.atlassian.com/git/tutorials/syncing

4.6.18 Exercises

E.Github.1:

How do you set your favorite editor as a default with git config?

E.Github.2:

What is the difference between merge and rebase?

E.Github.3:

Assume you have made a change in your local fork, however other users have since committed to the master branch. How can you make sure your commit works off from the latest information in the master branch?

E.Github.4:

Find a spelling error in the Web page or a contribution and create a pull request for it.

E.Github.5:

Create a README.yml in your github account directory provided for you for class.
4.6.19 Github Issues

Github 8:29 Issues

When we work in teams, or even if we work by ourselves, it is prudent to identify a system to coordinate our work. While conducting projects that use a variety of cloud services, it is important to have a cloud service that facilitates this coordination. Github provides such a feature through its issue service that is embedded in each repository.

Issues allow for the coordination of tasks, enhancements, and bugs, as well as self-defined labeled activities. Issues are shared within your team that has access to your repository. Furthermore, in an open source project the issues are visible to the community, allowing you to easily communicate the status as well as a roadmap to new features.

This enables the community to also participate in the reporting of bugs. Using such a system transforms the development of software from the traditional closed shop development to a truly open source development encouraging contributions from others. Furthermore, it is also used as a bug tracker in which not only you, but the community can communicate bugs to the project.

A good resource for learning more about issues is provided at

https://guides.github.com/features/issues/
4.6.19.1 Git Issue Features

A git issue has the following features:

title

a short description of what the issue is about

description

a more detailed description. Descriptions also allow you to conveniently add check-boxed todo's.

label

a color enhanced label that can be used to easily categorize the issue. You can define your own labels.

milestone

a milestone so you can categorically group issues as well as identify their due date. You can for example group all tasks for a week in a milestone, or you could for example put all tasks for a topic such as developing a paper in a milestone and provide a deadline for it.

assignee

an assignee is the person that is responsible for making sure the task is executed or on track if a team works on it. Often projects allow only one assignee, but in certain cases it is useful to assign a group, and the group identifies if the task can be split up and assigns the parts through check-boxed todo's.

comments

allow anyone with access to provide feedback via comments.
4.6.19.2 Github Markdown

Github uses markdown, which we introduce to you in Section [S:markdown].

As github has its own flavor of markdown, we also point you to the github markdown documentation as a reference. We like to mention the special enhancements of github's markdown that integrate well to support project management.

4.6.19.2.1 Task lists

Task lists can be added to any description or comment in github issues. To create a task list you can add [ ] to any item. This indicates a task to be done. To mark it as complete, simply change it to [x]. However, the great feature of task lists is that you do not even have to open the editor; you can simply check the task on and off via a mouse click. An example of a task list could be:

Post Bios

* [x] Post bio on piazza
* [ ] Post bio on google docs
* [ ] Post bio on github
* [ ] \(optional) integrate image in google docs bio

In case you need to use a ( at the beginning of the task text, you need to escape it with a \.
4.6.19.2.2 Team integration

A person or team on GitHub can be mentioned by typing the username preceded by the @ sign. When posting the text in the issue, it will trigger a notification to them and allow them to react to it. It is even possible to notify entire teams, which is described in more detail at

https://help.github.com/articles/about-teams/

4.6.19.2.3 Referencing Issues and Pull requests

Each issue has a number. If you use the # followed by the issue number you can refer to it in the text, which will also automatically include a hyperlink to the task. The same is valid for pull requests.
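For example, writing the following in a comment (the numbers are hypothetical) renders as hyperlinks to issue 14 and pull request 15:

This bug was reported in #14 and is fixed by pull request #15.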
4.6.19.2.4 Emojis

Although github supports emojis such as :+1:, we do not typically use them in our class.

4.6.19.3 Notifications

Github allows you to set preferences on how you like to receive notifications.
You can receive them either via e-mail or the Web. This is controlled byconfiguring it in your settings, where you can set the preferences forparticipating projects as well as projects you decide to watch. To access thenotifications you can simply look at them in the notification screen. In thisscreenwhenyoupressthe?youwillseeanumberofcommandsthatallowyoutocontrolthenotificationwhenpressingononeofthem.
4.6.19.4cc
Tocarboncopyusersinyourissuetext,simplyuse/ccfollowedbythe@signandtheirgithubusername.
4.6.19.5Interactingwithissues
Githubhastheabilitytosearchissueswithasearchqueryandasearchlanguagethatyoucanfindoutmoreaboutitat
https://guides.github.com/features/issues/#search
Adashboardgivesconvenientoverviewsoftheissuesincludingapulsethatliststodo’sstatusifyouusethemintheissuedescription.
4.6.20 Glossary

The Glossary is copied from

https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/GitTipsAndTricks#A-suggested-work-flow-for-distributed-projects-NoSY

Add

put a file (or particular changes thereto) into the index ready for a commit operation. Optional for modifications to tracked files; mandatory for hitherto un-tracked files.

Branch

a divergent change tree (e.g. a patch branch) which can be merged either wholesale or piecemeal with the master tree.

Commit

save the current state of the index and/or other specified files to the local repository.

Commit object

an object which contains the information about a particular revision, such as parents, committer, author, date, and the tree object which corresponds to the top directory of the stored revision.

Fast-forward

an update operation consisting only of the application of a linear part of the change tree in sequence.

Fetch

update your local repository database (not your working area) with the latest changes from a remote.

HEAD

the latest state of the current branch.

Index

a collection of files with stat information, whose contents are stored as objects. The index is a stored version of your working tree. Files may be staged to an index prior to committing.

Master

the main branch: known as the trunk in other SCM systems.

Merge

join two trees. A commit is made if this is not a fast-forward operation (or one is requested explicitly).

Object

the unit of storage in git. It is uniquely identified by the SHA1 hash of its contents. Consequently, an object cannot be changed.

Origin

the default remote, usually the source for the clone operation that created the local repository.

Pull

shorthand for a fetch followed by a merge (or rebase if the --rebase option is used).

Push

transfer the state of the current branch to a remote tracking branch. This must be a fast-forward operation (see merge).

Rebase

a merge-like operation in which the change tree is rewritten (see Rebasing below). Used to turn non-trivial merges into fast-forward operations.

Remote

another repository known to this one. If the local repository was created with "clone" then there is at least one remote, usually called "origin".

Stage

to add a file or selected changes therefrom to the index in preparation for a commit.

Stash

a stack onto which the current set of uncommitted changes can be put (e.g. in order to switch to or synchronize with another branch) as a patch for retrieval later. Also the act of putting changes onto this stack.

Tag

human-readable label for a particular state of the tree. Tags may be simple (in which case they are actually branches) or annotated (analogous to a CVS tag), with an associated SHA1 hash and message. Annotated tags are preferable in general.

Tracking branch

a branch on a remote which is the default source / sink for pull / push operations respectively for the current branch. For instance, origin/master is the tracking branch for the local master in a local repository.

Un-tracked

not known currently to git.
4.6.21 Example commands

To work in your local directory you can use the following commands. Please note that these commands do not upload your work to github, but only introduce version control within your local files.

The command list is copied from

https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/GitTipsAndTricks#A-suggested-work-flow-for-distributed-projects-NoSY

4.6.21.1 Local commands to version control your files
Obtain differences with:

$ git status

Move files from one part of your directory tree to another:

$ git mv <old-path> <new-path>

Delete unwanted tracked files:

$ git rm <path>

Add un-tracked files:

$ git add <un-tracked-file>

Stage a modified file for commit:

$ git add <file>

Commit currently-staged files:

$ git commit -m <log-message>

Commit only specific files (regardless of what is staged):

$ git commit -m <log-message> <file> ...

Commit all modified files:

$ git commit -a -m <log-message>

Un-stage a previously staged (but not yet committed) file:

$ git reset HEAD <file>

Get differences with respect to the committed (or staged) version of a file:

$ git diff <file>

Get differences between local file and committed version:

$ git diff --cached <file>

Create (but do not switch to) a new local branch based on the current branch:

$ git branch <new-branch>

Change to an existing local branch:

$ git checkout <branch>

Merge another branch into the current one:

$ git merge <branch>

4.6.21.2 Interacting with the remote

Get the current list of remotes (including URIs) with:

$ git remote -v

Get the current list of defined branches with:

$ git branch -a

Change to (creating if necessary) a local branch tracking an existing remote branch of the same name:

$ git checkout <branch>

Update your local repository ref database without altering the current working area:

$ git fetch <remote>

Update your current local branch with respect to your repository's current idea of a remote branch's status:

$ git merge <branch>

Pull remote ref information from all remotes and merge local branches with their remote tracking branches (if applicable):

$ git pull

Examine changes to the current local branch with respect to its tracking branch:

$ git cherry -v

Push changes to the remote tracking branch:

$ git push

Push all changes to all tracking branches:

$ git push --all
4.7 GIT PULL REQUEST ☁

4.7.1 Introduction

Git pull requests allow developers to submit work or changes they have done to a repository. The developers can then check the changes that have been proposed in the pull request, discuss, and make changes if needed. After the content of the pull request has been agreed upon, it can be merged to the repository to add the information or changes in the pull request into the repository.

4.7.2 How to create a pull request

In this document we will see how we can create a pull request for the Cloudmesh technologies repo that is located at

https://github.com/cloudmesh/technologies

However, if you do pull requests on other repositories, you just have to replace the url with that of the repository you like to use. A common one for our classes is also

https://github.com/cloudmesh-community/book

which contains this book.

You can either create a pull request through a branch or through a fork. In this document we will be looking at how we can create a pull request through a fork.

4.7.3 Fork the original repository

First you need to create a fork of the original repository. A fork is your own copy of the repository to which you can make changes. To fork the Cloudmesh technologies repo, go to the Cloudmesh technologies repo and click on the Fork button on the top right corner. Now you can notice that instead of cloudmesh/technologies the name of the repo says YOURGITUSERNAME/technologies, where YOURGITUSERNAME is indeed your github username. That is because you are now in your own copy of the cloudmesh/technologies repository. In our case the username will be pulasthi.
4.7.4 Clone your copy

Now that you have your fork created, we can go ahead and clone it into our machine. Instructions on how to clone a repository can be found in the Github documentation - Cloning a repository. Make sure that you clone your version of the technologies repo.
4.7.5 Adding an upstream

Before we can start working on our copy of the git repo it is good to add an upstream (a link to the original repo) so that we can get all the latest changes in the original repository into our copy. Use the following commands to add an upstream to cloudmesh/technologies. First go into the folder which contains your git repo that you cloned and execute the following command:

$ git remote add upstream https://github.com/cloudmesh/technologies.git

To make sure you have added it correctly, execute the following command:

$ git remote -v

You should see something similar to the following as the output:

origin   https://github.com/pulasthi/technologies.git (fetch)
origin   https://github.com/pulasthi/technologies.git (push)
upstream https://github.com/cloudmesh/technologies.git (fetch)
upstream https://github.com/cloudmesh/technologies.git (push)

4.7.6 Making changes

Now you can make changes to your repo as with any normal git repository. However, to make sure you have the latest copy from the original, execute the following command before you start making changes. This will pull the latest changes from the original cloudmesh/technologies into your local copy:

$ git pull upstream master

Now make the needed changes, commit and push; the changes will be pushed to your copy of the repo in Github, not the cloudmesh/technologies repo.
4.7.7 Creating a pull request

Once we have changes pushed, you can go into your repository in Github to create a pull request. As seen in Figure 35, you have a button named Pull request.

Figure 35: Button Pull request

Once you click on that button you will be taken to a page to create the pull request, which will look similar to Figure 36.

Figure 36: Create a pull request

Once you click on the Create pull request button you will be given an option to add a title and a comment for the pull request. Once you complete the details and submit, the pull request will appear in the original cloudmesh/technologies repo.

Note: Make sure you see the Able to merge sign before you submit the pull request, otherwise your pull request will not be able to be directly merged into the original repo. If you do not see this, that means you have not properly done the git pull upstream master command before you made the changes.

git example on CL 10:09
4.8 TIG ☁

Many browsers exist to gain insight into git repositories. In case you have Linux or Ubuntu, a tool to display information in a terminal is available at

https://jonas.github.io/tig/

On OSX it can be installed with:

$ brew install tig

Tig has many different views including views for main, log, diff, tree, blob, blame, refs, status, stage, stash, grep, and pager.

A screenshot showing some of its basic functionality is shown in Figure 37.

Figure 37: Git tig main view

Example invocations are:

$ tig
$ git show | tig
$ git log | tig
5 INTRODUCTION TO CLOUD COMPUTING AND DATA ENGINEERING FOR CLOUD COMPUTING AND MACHINE LEARNING ☁

E222 Intelligent Systems II and E516 Engineering Cloud Computing

YouTube Playlist https://www.youtube.com/playlist?list=PLy0VLh_GFyz81ZFQ6Xrd1PHHI1EzjhhVb

PowerPoint of full set A) to U) https://drive.google.com/open?id=1RQ8Q_A32ks02CSCZAzKiJJ9P9YntMRAo

5.1 A. SUMMARY OF INTRODUCTION TO CLOUD COMPUTING & DATA ENGINEERING

This lesson summarizes the component lessons CloudIntro B to CloudIntro U of the Introduction to Cloud Computing and Data Engineering Lecture.

Summary Cloud Computing 15:40
5.2 B. DEFINING CLOUDS I

* Basic definition of cloud and two very simple examples of why virtualization is important
* How clouds are situated wrt HPC and supercomputers
* Why multicore chips are important
* Typical data center

Defining Clouds I 20:22

5.3 C. DEFINING CLOUDS II

* Service-oriented architectures: Software services as Message-linked computing capabilities
* The different aaS's: Network, Infrastructure, Platform, Software
* The amazing services that Amazon AWS and Microsoft Azure have
* Initial Gartner comments on clouds (they are now the norm) and evolution of servers; serverless and microservices
* 2016/2018 Infrastructure Strategies Hype Cycle and Priority Matrix

Defining Clouds II 24:22

5.4 D. DEFINING CLOUDS III

* Cloud Market Share
* How important are they? How much money do they make?

Defining Clouds III 12:23

5.5 E. VIRTUALIZATION

* Virtualization Technologies, Hypervisors and the different approaches
* KVM, Xen, Docker and Openstack

See:

https://en.wikipedia.org/wiki/Hypervisor
https://en.wikipedia.org/wiki/Xen
https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
https://en.wikipedia.org/wiki/Operating-system-level_virtualization
https://medium.com/@dbclin/aws-just-announced-a-move-from-xen-towards-kvm-so-what-is-kvm-2091f123991
https://nickjanetakis.com/blog/comparing-virtual-machines-vs-docker-containers
https://en.wikipedia.org/wiki/OpenStack

Virtualization 11:21
5.6 F. TECHNOLOGY HYPE CYCLE I

* Gartner's Hypecycles and especially that for emerging technologies in 2018, 2017 and 2016
* The phases of hypecycles
* Priority Matrix with benefits and adoption time
* Today clouds have got through the cycle (they have emerged) but features like blockchain, serverless and machine learning are on cycle
* Hypecycle and Priority Matrix for Data Center Infrastructure 2017 and 2018

Technology Hypecycle I 31:23

5.7 G. TECHNOLOGY HYPE CYCLE II

* Emerging Technologies hypecycles and Priority matrix at selected times 2008-2015
* Clouds start from 2008 to today
* They are mixed up with transformational and disruptive changes
* The route to Digital Business (2015)

Technology Hypecycle II 16:05

5.8 H. CLOUD INFRASTRUCTURE I

* Comments on trends in the data center and its technologies
* Clouds physically across the world
* Green computing
* Fraction of world's computing ecosystem in clouds and associated sizes

Cloud Infrastructure I 21:20

5.9 I. CLOUD INFRASTRUCTURE II

* Gartner hypecycle and priority matrix on Infrastructure Strategies and Compute Infrastructure
* Containers compared to virtual machines
* The emergence of artificial intelligence as a dominant force

Cloud Infrastructure II 17:52

5.10 J. CLOUD SOFTWARE

* HPC-ABDS with over 350 software packages and how to use each of 21 layers
* Google's software innovations
* MapReduce in pictures
* Cloud and HPC software stacks compared
* Components needed to support cloud/distributed systems programming
* Single Program/Instruction Multiple Data SIMD SPMD

Cloud Software 37:56

5.11 K. CLOUD APPLICATIONS I

* Big Data; a lot of the best examples have NOT been updated, so some slides are old but still make the correct points
* Some of the business usage patterns from NIST

Cloud Applications I 11:58

5.12 L. CLOUD APPLICATIONS II

* Clouds in science, in an area called cyberinfrastructure; the usage pattern from NIST
* Artificial Intelligence from Gartner

Cloud Applications II 13:03

5.13 M. CLOUD APPLICATIONS III

* Characterize Applications using NIST approach
* Internet of Things
* Different types of MapReduce

Cloud Applications III 24:12

5.14 N. CLOUDS AND PARALLEL COMPUTING

* Parallel Computing in general
* Big Data and Simulations Compared
* What is hard to do?

Clouds and Parallel Computing 35:03
5.15 O. STORAGE

* Cloud data approaches
* Repositories, File Systems, Data lakes

Storage 19:22

5.16 P. HPC AND CLOUDS

* The Branscomb Pyramid
* Supercomputers versus clouds
* Science Computing Environments

HPC and Clouds 19:29

5.17 Q. COMPARISON OF DATA ANALYTICS WITH SIMULATION

* Structure of different applications for simulations and Big Data
* Software implications
* Languages

Comparison of Data Analytics with Simulation 16:19

5.18 R. JOBS

* Computer Engineering
* Clouds
* Design
* Data Science/Engineering

Jobs 15:30

5.19 S. THE FUTURE I

* Gartner cloud computing hypecycle and priority matrix
* Hyperscale computing
* Serverless and FaaS
* Cloud Native
* Microservices

The Future I 29:29

5.20 T. THE FUTURE AND OTHER ISSUES II

* Security
* Blockchain

The Future and other Issues II 11:30

5.21 U. THE FUTURE AND OTHER ISSUES III

* Fault Tolerance

The Future and other Issues III 9:10
6 REST

6.1 INTRODUCTION TO REST ☁

Learning Objectives

Understand REST services. Understand OpenAPI. Develop REST services in Python using Eve. Develop REST services in Python using OpenAPI with swagger.

REST stands for REpresentational State Transfer. REST is an architecture style for designing networked applications. It is based on a stateless, client-server, cacheable communications protocol. In contrast to what some others write or say, REST is not a standard. Although REST is not tied to HTTP, in most cases the HTTP protocol is used. In that case, RESTful applications use HTTP requests to (a) post data while creating and/or updating it, (b) read data while making queries, and (c) delete data.

REST was first introduced in a thesis from Roy T. Fielding [40].

Hence REST can use HTTP for the four CRUD operations:

Create resources
Read resources
Update resources
Delete resources

As part of the HTTP protocol we have methods such as GET, PUT, POST, and DELETE. These methods can then be used to implement a REST service. This is not surprising, as the HTTP protocol was explicitly designed to support these operations. As REST introduces collections and items, we need to implement the CRUD functions for them. We distinguish single resources and collections of resources. The semantics for accessing them is explained next, illustrating how to implement them with HTTP methods (see REST on Wikipedia [41]).
6.1.0.1 Collection of Resources

Let us assume the following URI identifies a collection of resources:

http://.../resources/

Then we need to implement the following CRUD methods:

GET

List the URIs and perhaps other details of the collection's members.

PUT

Replace the entire collection with another collection.

POST

Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.

DELETE

Delete the entire collection.

6.1.0.2 Single Resource

Let us assume the following URI identifies a single resource in a collection of resources:

http://.../resources/item42

Then we need to implement the following CRUD methods:

GET

Retrieve a representation of the addressed member of the collection, expressed in an appropriate internet media type.

PUT

Replace the addressed member of the collection, or if it does not exist, create it.

POST

Not generally used. Treat the addressed member as a collection in its own right and create a new entry within it.

DELETE

Delete the addressed member of the collection.
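To make this mapping concrete, a hypothetical service rooted at http://localhost:8080/api (the URL, item name, and JSON payloads are illustrative assumptions, not part of the original text) could be exercised with curl as follows:

$ curl http://localhost:8080/api/resources                          # GET: list the collection
$ curl -X POST -H "Content-Type: application/json" -d '{"name": "item42"}' http://localhost:8080/api/resources
$ curl http://localhost:8080/api/resources/item42                   # GET: read a single resource
$ curl -X PUT -H "Content-Type: application/json" -d '{"name": "item42"}' http://localhost:8080/api/resources/item42
$ curl -X DELETE http://localhost:8080/api/resources/item42         # delete the single resource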
6.1.0.3 REST Tool Classification

Due to the well defined structure that REST provides, a number of tools have been created that manage the creation of the specification for REST services and their programming. We distinguish several different categories:

REST Specification Frameworks:

These are frameworks that help define REST services through specifications to generate REST services in a language and framework independent way. This includes for example Swagger 2.0 [42], OpenAPI 3.0 [43], and RAML [44].

REST programming language support:

These tools and services are targeting a particular programming language. Such tools include Flask Rest [45] and Django Rest Services [46], some of which we will explore in more detail.

REST documentation based tools:

These tools are primarily focusing on documenting REST specifications. Such tools include Swagger [47], which we will explore in more detail.

REST design support tools:

These tools are used to support the design process of developing REST services while abstracting on top of the programming languages and defining reusable specifications that can be used to create clients and servers for particular technology targets. Such tools also include swagger [47], as additional tools are available that can generate code from OpenAPI specifications [48], which we will explore in more detail.

A list of such efforts is available at OpenAPI Tools [49].
6.2 OPENAPI REST SERVICES WITH SWAGGER ☁

Swagger https://swagger.io/ is a tool for developing API specifications based on the OpenAPI Specification (OAS). It allows not only the specification, but the generation of code based on the specification in a variety of languages.

Swagger itself has a number of tools which together build a framework for developing REST services for a variety of languages.

6.2.1 Swagger Tools

The major Swagger tools of interest are:

Swagger Core

includes libraries for working with Swagger specifications https://github.com/swagger-api/swagger-core.

Swagger Codegen

allows to generate code from the specifications to develop Client SDKs, servers, and documentation. https://github.com/swagger-api/swagger-codegen

Swagger UI

is an HTML5 based UI for exploring and interacting with the specified APIs https://github.com/swagger-api/swagger-ui

Swagger Editor

is a Web-browser based editor for composing specifications using YAML https://github.com/swagger-api/swagger-editor

SwaggerHub

is a Web service to collaboratively develop and host OpenAPI specifications https://swagger.io/tools/swaggerhub/

The developed APIs can be hosted and further developed on an online repository named SwaggerHub https://app.swaggerhub.com/home. The convenient online editor is available, and it can also be installed locally on a variety of operating systems including macOS, Linux, and Windows.
6.2.2 Swagger Community Tools

Notify us about other tools that you find and would like us to mention here.

6.2.2.1 Converting Json Examples to OpenAPI YAML Models

Swagger toolbox is a utility that can convert json to swagger compatible yaml models. It is hosted online at

https://swagger-toolbox.firebaseapp.com/

The source code to this tool is available on github at

https://github.com/essuraj/swagger-toolbox

It is important to make sure that the json model is properly configured. As such, each data type must be wrapped in "quotes" and the last element must not have a , behind it.

In case you have large models, we recommend that you gradually add more and more features so that it is easier to debug in case of an error. This tool is not designed to provide back a full featured OpenAPI specification, but helps you get started deriving one.
Let us look at a small example. Let us assume we want to create a REST service to execute a command on the remote service. We know this may not be a good idea if it is not properly secured, so be extra careful. A good way to simulate this is to just use a return string instead of executing the command.

Let us assume the json schema looks like:

{
  "host": "string",
  "command": "string"
}

The output the swagger toolbox creates is:

---
required:
  - "host"
  - "command"
properties:
  host:
    type: "string"
  command:
    type: "string"

As you can see it is far from complete, but it could be used to get you started.

Based on this tool, develop a rest service to which you send a schema in JSON format and from which you get back the YAML model.
6.3 OPENAPI 2.0 SPECIFICATION ☁

Swagger provides through its specification the definition of REST services through a YAML or JSON document.

When following the API-specification-first approach to define and develop a RESTful service, the first and foremost step is to define the API conforming to the OpenAPI specification, and then to use codegen tools to conveniently generate server side stub code, client code, and documentation in the language you desire. In Section REST Service Generation with OpenAPI we have introduced the codegen tool and how to use it to generate server side and client side code and documentation. In this Section, The Virtual Cluster example API Definition, we will use a slightly more complex example to show how to define an API following the OpenAPI 2.0 specification. The example is to retrieve a virtual cluster (VC) object from the server.
The OpenAPI Specification is formerly known as Swagger RESTful APIDocumentationSpecification.ItdefinesaspecificationtodescribeanddocumentaRESTfulserviceAPI.Itisalsoknownunderversion3.0ofswagger.However,as the tools for 3.0 are not yet completed, we will continue for now to useversionswagger2.0,tillthetransitionhasbeencompleted.Thisisespeciallyofimportance,asweneedtousetheswaggercodegentool,whichcurrentlysupportonlyuptospecificationv2.HenceweareatthistimeusingOpenAPI/Swaggerv2.0inourexample.Therearesomestructureandsyntaxchangesinv3,whiletheessenceisverysimilar.Formoredetailsofthechangesbetweenv3andv2,please refer to A document published on the Web titled Difference betweenOpenAPI3.0andSwagger2.0.
You canwrite theAPI definition in json for yaml format. Let us discuss thisformatbrieflyandfocusonyamlasitiseasiertoreadandmaintain.
Ontherootleveloftheyamldocumentweseefieldslikeswagger,info,andsoon.Amongthesefields,swagger,info,andpatharerequired.Theirmeaningisasfollows:
swagger

specifies the version number. In our case a string value '2.0' is used as we are writing the definition conforming to the v2.0 specification.

info

defines metadata information related to the API. E.g., the API version, title and description, termsOfService if applicable, contact information and license, etc. Among these attributes, version and title are required while others are optional.

paths

defines the actual endpoints of the exposed RESTful API service. Each endpoint has a field pattern as the key, and a Path Item Object as the value. In this example we have defined /vcs and /vcs/{id} as the two service endpoints. They will be part of the final service URL, appended after the service host and basePath, which will be explained later.

Let us focus on the Path Item Object. It contains one or more supported operations on the service endpoint. An operation is keyed by a valid HTTP operation verb, e.g., one of get, put, post, delete, or patch. It has a value of Operation Object that describes the operations in more detail.

The Operation Object will always require a Response Object. A Response Object has an HTTP status code as the key, e.g., 200 as successful return; 40X as authentication and authorization related errors; and 50X as other server side errors. It can also have a default response keyed by default for undeclared http status return codes. The Response Object value has a required description field, and, if anything is returned, a schema indicating the object type to be returned, which could be a primitive type, e.g., string, or an array or customized object. In case of an object or an array of objects, use $ref to point to the definition of the object. In this example, we have

$ref: "#/definitions/VC"

to point to the VC definition in the definitions section in the same specification file, which will be explained later.

Besides the required field, the Operation Object can have summary and description to indicate what the operation is about; operationId to uniquely identify the operation; and consumes and produces to indicate what MIME types it expects as input and for returns, e.g., application/json in most modern RESTful APIs. It can further specify what input parameter is expected using parameters, which requires a name and an in field. name specifies the name of the parameter, and in specifies from where to get the parameter; its possible values are query, header, path, formData or body. In this example, in the /vcs/{id} path we obtain the id parameter from the URL path, so in has the path value. When in has path as its value, the required field is required and has to be set as true; when in has a value other than body, a type field is required to specify the type of the parameter.
While the three root level fields mentioned previously are required, in most cases we will also use other optional fields.

host

to indicate where the service is to be deployed, which could be localhost or a valid IP address or a DNS name of the host where the service is to be deployed. If a port number other than 80 is to be used, write the port number as well, e.g., localhost:8080.

schemes

to specify the transfer protocol, e.g., http or https.

basePath

to specify the common base URL to be appended after the host to form the base path for all the endpoints, e.g., /api or /api/1.0/. In this example, with the values specified we would have the final service endpoints http://localhost:8080/api/vcs and http://localhost:8080/api/vcs/{id} by combining the schemes, host, basePath and paths values.

consumes and produces

can also be specified on the top level to specify the default MIME types of the input and return if most paths and the defined operations have the same.

definitions

as used in the paths field, in order to point to a customized object definition with a $ref keyword.

The definitions field contains the object definitions of the customized objects involved in the API, similar to a class definition in any Object Oriented programming language. In this example, we defined a VC object, and hierarchically a Node object. Each object defined is a type of Schema Object in which many fields could be used to specify the object (see details in the REF link at the top of the document), but the most frequently used ones are:

type

to specify the type, and in the customized definition case the value is mostly object.

required

field to list the names of the required attributes of the object.

properties

has the detailed information of each attribute/property of the object, e.g., type, format. It also supports hierarchical object definitions, so a property of one object could be another customized object defined elsewhere, while using schema and the $ref keyword to point to the definition. In this example we have defined a VC, or virtual cluster, object, while it contains another object definition of

Node

as part of a cluster.
6.3.1 The Virtual Cluster example API Definition

6.3.1.1 Terminology

VC

A virtual cluster, which has one Front-End (FE) management node and multiple compute nodes. A VC object also has id and name to identify the VC, and nnodes to indicate how many compute nodes it has.

FE

A management node from which to access the compute nodes. The FE node usually connects to all the compute nodes via a private network.

Node

A compute node object that has the info ncores to indicate the number of cores it has, and ram and localdisk to show the size of the RAM and local disk storage.
6.3.1.2 Specification

swagger: "2.0"
info:
  version: "1.0.0"
  title: "A Virtual Cluster"
  description: "Virtual Cluster as a test of using swagger-2.0 specification and codegen"
  termsOfService: "http://swagger.io/terms/"
  contact:
    name: "IU ISE software and system team"
  license:
    name: "Apache"
host: "localhost:8080"
basePath: "/api"
schemes:
  - "http"
consumes:
  - "application/json"
produces:
  - "application/json"
paths:
  /vcs:
    get:
      description: "Returns all VCs from the system that the user has access to"
      produces:
        - "application/json"
      responses:
        "200":
          description: "A list of VCs."
          schema:
            type: "array"
            items:
              $ref: "#/definitions/VC"
  /vcs/{id}:
    get:
      description: "Returns all VCs from the system that the user has access to"
      operationId: getVCById
      parameters:
        - name: id
          in: path
          description: ID of VC to fetch
          required: true
          type: string
      produces:
        - "application/json"
      responses:
        "200":
          description: "The vc with the given id."
          schema:
            $ref: "#/definitions/VC"
        default:
          description: unexpected error
          schema:
            $ref: '#/definitions/Error'
definitions:
  VC:
    type: "object"
    required:
      - "id"
      - "name"
      - "nnodes"
      - "FE"
      - "computes"
    properties:
      id:
        type: "string"
      name:
        type: "string"
      nnodes:
        type: "integer"
        format: "int64"
      FE:
        type: "object"
        schema:
          $ref: "#/definitions/Node"
      computes:
        type: "array"
        items:
          $ref: "#/definitions/Node"
      tag:
        type: "string"
  Node:
    type: "object"
    required:
      - "ncores"
      - "ram"
      - "localdisk"
    properties:
      ncores:
        type: "integer"
        format: "int64"
      ram:
        type: "integer"
        format: "int64"
      localdisk:
        type: "integer"
        format: "int64"
  Error:
    required:
      - code
      - message
    properties:
      code:
        type: integer
        format: int32
      message:
        type: string
6.3.2 References

The official OpenAPI 2.0 Documentation
6.4 OPENAPI 3.0 REST SERVICE VIA INTROSPECTION ☁

The simplest way to create an OpenAPI service is to use the connexion service and read in the specification from its yaml file. It will then be introspected, and methods are dynamically created that are used for the implementation of the server.

The full example for this is available in

https://github.com/cloudmesh-community/book/tree/master/examples/rest/cpu

An extensive documentation is available at

https://media.readthedocs.org/pdf/connexion/latest/connexion.pdf

This example will dynamically return the cpu information of a computer to demonstrate how simple it is to generate a REST service in python from an OpenAPI specification.

Our requirements.txt file includes

flask
connexion[swagger-ui]
as dependencies. The server.py file simply contains the following code:

from flask import jsonify
import connexion

# Create the application instance
app = connexion.App(__name__, specification_dir="./")

# Read the yaml file to configure the endpoints
app.add_api("cpu.yaml")

# create a URL route in our application for "/"
@app.route("/")
def home():
    msg = {"msg": "It's working!"}
    return jsonify(msg)

if __name__ == "__main__":
    app.run(port=8080, debug=True)

This will run our REST service under the assumption we have a cpu.yaml and a cpu.py file, as our yaml file calls out methods from cpu.py.

The yaml file looks as follows:

openapi: 3.0.2
info:
  title: cpuinfo
  description: A simple service to get cpuinfo as an example of using OpenAPI 3.0
  license:
    name: Apache 2.0
  version: 0.0.1

servers:
  - url: http://localhost:8080/cloudmesh

paths:
  /cpu:
    get:
      summary: Returns cpu information of the hosting server
      operationId: cpu.get_processor_name
      responses:
        '200':
          description: cpu info
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/cpu"

components:
  schemas:
    cpu:
      type: "object"
      required:
        - "model"
      properties:
        model:
          type: "string"

Here we simply implement a get method and associate it with the URL /cpu. The operationId defines the method that we call, which, as we used the local directory, is included in the file cpu.py. This is controlled by the prefix in the operationId.
A very simple function to return the cpu information is defined in cpu.py, which we list next:

import os, platform, subprocess, re
from flask import jsonify

def get_processor_name():
    if platform.system() == "Windows":
        p = platform.processor()
    elif platform.system() == "Darwin":
        command = "/usr/sbin/sysctl -n machdep.cpu.brand_string"
        p = subprocess.check_output(command, shell=True).strip().decode()
    elif platform.system() == "Linux":
        command = "cat /proc/cpuinfo"
        all_info = subprocess.check_output(command, shell=True).strip().decode()
        for line in all_info.split("\n"):
            if "model name" in line:
                p = re.sub(".*model name.*:", "", line, 1)
    else:
        p = "cannot find cpuinfo"
    pinfo = {"model": p}
    return jsonify(pinfo)

We have implemented this function to return jsonified information from the dict pinfo.

To simplify working with this example, we also provide a makefile for OSX that allows us to start the server and issue the call to the server in two different terminals:

define terminal
    osascript -e 'tell application "Terminal" to do script "cd $(PWD); $1"'
endef

install:
	pip install -r requirements.txt

demo:
	$(call terminal, python server.py)
	sleep 3
	@echo "==============================================================================="
	@echo "Get the info"
	@echo "==============================================================================="
	curl http://localhost:8080/cloudmesh/cpu
	@echo
	@echo "==============================================================================="

When we call

make demo

our demo is run.
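On success, the curl call prints the jsonified cpu model; the exact value depends on your machine, for example (illustrative output only):

{"model": "Intel(R) Core(TM) i7 ..."}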
6.4.1 Verification

It is important to be able to verify if a yaml file is correct. The easiest method is to use the swagger editor. There is an online version available at:
https://editor.swagger.io/
Go to the Web site, remove the current petstore example, and simply paste your yaml file in it. Debug messages will help you to correct things.

A terminal based command may also be helpful, but is a bit difficult to read:

$ connexion run cpu.yaml --stub --debug
6.4.2Swagger-UI
SwaggercomeswithaconvenientUItoinvokeRESTAPIcallsusingthewebbrowserratherthanrelyingonthecurlcommands.
Once the requestand responsedefinitionsareproperly specified,youcanstarttheserverby,
Then the UI would also be spawned under the service URL http://[serviceurl]/ui/
Example:http://localhost:8080/cloudmesh/ui/
6.4.3Mockservice
InsomecasesitmaybeusefultodeveloptheAPIwithouthavingyetdevelopedmethodsthatyoucallwiththeOperationI.Inthiscaseitisusefultorunamockservice.YOucaninvocesuchaservicewith
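Whether the service is mocked or fully implemented, a quick check from python can confirm that the endpoint responds. This is a minimal sketch, assuming the service listens on localhost:8080 with the /cloudmesh base path from cpu.yaml:

import requests

# query the /cpu endpoint of the running (or mocked) service
r = requests.get("http://localhost:8080/cloudmesh/cpu")
print(r.status_code)  # expect 200 on success
print(r.json())       # e.g. {"model": "..."} for the real implementation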
6.4.4 Exercise

OpenAPI.Conexion.1:

Modify the makefile so it also works on ubuntu, but do not disable the ability to run it correctly on OSX. Tip: use if's in makefiles based on the OS. You can look at the makefiles that create this book as an example. Find alternatives to starting a terminal in Linux.
OpenAPI.Conexion.2:

Modify the makefile so it also works on Windows 10, but do not disable the ability to run it correctly on OSX. Tip: use if's in makefiles. You can look at the makefiles that create this book as an example. Find alternatives to start a powershell or cmd.exe in windows. Maybe you need to use gitbash.

OpenAPI.Conexion.3:

Implement a swagger specification of an issue related to the NIST BDRA. Implement it. Please remember this could prepare you for a project. Good topics include:

virtual compute service interfacing with aws, azure, google or openstack
virtual directory service interfacing with google drive, box, github, iCloud, ftp, scp, and others

As there are so many possibilities to contribute, come up in class with one specification and then implement it for different providers. The difficulty here is that it is not done for one IaaS, but for all of them, and all can be integrated.

This exercise typically grows to become part of your class project.

OpenAPI.Conexion.4:

Develop instructions on how to integrate the OpenAPI service framework in a WSGI based Web service. Chose a service you like so that the service could run in production.

OpenAPI.Conexion.5:

Develop instructions on how to integrate the OpenAPI service framework in Tornado so the service could run in production.
6.5 OPENAPI REST SERVICE VIA CODEGEN ☁

REST 36:02 Swagger

In this subsection we discuss how to use OpenAPI 2.0 and Swagger Codegen to define and develop a REST Service.

We assume you are familiar with the concept of a REST service and with OpenAPI as discussed in the section Overview of Rest.

In the next section we will further look into the Swagger/OpenAPI 2.0 specification (Swagger Specification) and use a slightly more complex example to walk you through the design of a RESTful service following the OpenAPI 2.0 specifications.

We will use a simple example to demonstrate the process of developing a REST service with the Swagger/OpenAPI 2.0 specification and the tools related to it. The general steps are:

Step 1 (Section Step 1: Define Your REST Service). Define the REST service conforming to the Swagger/OpenAPI 2.0 specification. It is a YAML document file with the basics of the REST service defined, e.g., what resources it has and what actions are supported.

Step 2 (Section Step 2: Server Side Stub Code Generation and Implementation). Use Swagger Codegen to generate the server side stub code. Fill in the actual implementation of the business logic portion in the code.

Step 3 (Section Step 3: Install and Run the REST Service). Install the server side code and run it. The service will then be available.

Step 4 (Section Step 4: Generate Client Side Code and Verify). Generate client side code. Develop code to call the REST service. Install and run it to verify.
6.5.1 Step 1: Define Your REST Service

In this example we define a simple REST service that returns the hosting server's basic CPU info. The example specification in yaml is as follows:

swagger: "2.0"
info:
  version: "0.0.1"
  title: "cpuinfo"
  description: "A simple service to get cpu info as an example of using swagger-2.0 specification and codegen"
  termsOfService: "http://swagger.io/terms/"
  contact:
    name: "Cloudmesh REST Service Example"
  license:
    name: "Apache"
host: "localhost:8080"
basePath: "/api"
schemes:
  - "http"
consumes:
  - "application/json"
produces:
  - "application/json"
paths:
  /cpu:
    get:
      description: "Returns cpu information of the hosting server"
      produces:
        - "application/json"
      responses:
        "200":
          description: "CPU info"
          schema:
            $ref: "#/definitions/CPU"
definitions:
  CPU:
    type: "object"
    required:
      - "model"
    properties:
      model:
        type: "string"

6.5.2 Step 2: Server Side Stub Code Generation and Implementation

With the REST service having been defined, we can now generate the server side stub code easily.

6.5.2.1 Setup the Codegen Environment

You will need to install the Swagger Codegen tool if you have not yet done so. For macOS we recommend that you use the homebrew install via

$ brew install swagger-codegen

On Ubuntu you can install swagger as follows (update the version as needed):

$ mkdir ~/swagger
$ cd ~/swagger
$ wget https://oss.sonatype.org/content/repositories/releases/io/swagger/swagger-codegen-cli/2.3.1/swagger-codegen-cli-2.3.1.jar
$ alias swagger-codegen="java -jar ~/swagger/swagger-codegen-cli-2.3.1.jar"

Add the alias to your .bashrc or .bash_profile file. After you start a new terminal you can use in that terminal the command

swagger-codegen

For other platforms you can just use the .jar file, which can be downloaded from this link.

Also make sure Java 7 or 8 is installed in your system. To have a well defined location we recommend that you place the code in the directory ~/cloudmesh. In this directory you will also place the cpu.yaml file.

6.5.2.2 Generate Server Stub Code

After you have the codegen tool ready, and with Java 7 or 8 installed in your system, you can run the following to generate the server side stub code:

$ swagger-codegen generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python-flask \
    -o ~/cloudmesh/swagger_example/server/cpu/flaskConnexion \
    -DsupportPython2=true

or if you have not created an alias

$ java -jar swagger-codegen-cli.jar generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python-flask \
    -o ~/cloudmesh/swagger_example/server/cpu/flaskConnexion \
    -DsupportPython2=true

In the specified directory under flaskConnexion you will find the generated python flask code, with python 2 compatibility. It is best to place the swagger code under the directory ~/cloudmesh to have a location where you can easily find it. If you want to use python 3 make sure to chose the appropriate option. To switch between python 2 and python 3 we recommend that you use pyenv as discussed in our python section.

6.5.2.3 Fill in the actual implementation
Under the flaskConnexion directory, you will find a swagger_server directory, under which you will find directories with the models defined and the controller code stubs. The models code is generated from the definition in Step 1. In the controller code, though, we will need to fill in the actual implementation. You may see a default_controller.py file under the controllers directory in which the resource and action is defined but yet to be implemented. In our example, you will find such a function definition, which we list next:

def cpu_get():  # noqa: E501
    """cpu_get

    Returns cpu info of the hosting server  # noqa: E501

    :rtype: CPU
    """
    return 'do some magic!'

Please note the do some magic! string at the return of the function. This ought to be replaced with the actual implementation of what you would like your REST call to really be doing. In reality this could be some call to a backend database or datastore, a call to another API, or even calling another REST service from another location. In this example we simply retrieve the cpu model information from the hosting server through a simple python call to illustrate this principle. Thus you can define the following code:

import os, platform, subprocess, re

def get_processor_name():
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        command = "/usr/sbin/sysctl -n machdep.cpu.brand_string"
        return subprocess.check_output(command, shell=True).strip()
    elif platform.system() == "Linux":
        command = "cat /proc/cpuinfo"
        all_info = subprocess.check_output(command, shell=True).strip()
        for line in all_info.split("\n"):
            if "model name" in line:
                return re.sub(".*model name.*:", "", line, 1)
    return "cannot find cpu info"

And then change the cpu_get() function to the following:

def cpu_get():  # noqa: E501
    """cpu_get

    Returns cpu info of the hosting server  # noqa: E501

    :rtype: CPU
    """
    return CPU(get_processor_name())

Please note we are returning a CPU object as defined in the API and later generated by the codegen tool in the models directory.
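For orientation, the generated server tree typically looks roughly like the following sketch (abbreviated; exact names can vary with the codegen version):

flaskConnexion/
    requirements.txt
    setup.py
    swagger_server/
        controllers/
            default_controller.py   # fill in the implementation here
        models/
            cpu.py                  # generated from the CPU definition
        swagger/
            swagger.yaml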
It is best not to include the definition of get_processor_name() in the same file as the definition of cpu_get(). The reason for this is that in case you need to regenerate the code, your modified code would be overwritten. Thus, to minimize the changes, we recommend to maintain that portion in a different file and import the function, so as to keep the modifications small.
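A minimal sketch of this separation, assuming we keep the hand-written code in a hypothetical file cpu_info.py next to the generated package:

# cpu_info.py -- hand-written helper, kept out of the generated files
import platform

def get_processor_name():
    # simplified variant of the function shown earlier
    return platform.processor() or "cannot find cpu info"

# default_controller.py -- only these lines need to be re-applied
# after the stub code is regenerated
from cpu_info import get_processor_name
from swagger_server.models.cpu import CPU  # generated model class

def cpu_get():  # noqa: E501
    return CPU(get_processor_name())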
At this step we have completed the server side code development.
6.5.3 Step 3: Install and Run the REST Service

Now we can install and run the REST service. It is strongly recommended that you run this in a pyenv or a virtualenv environment.

6.5.3.1 Start a virtualenv:

In case you are not using pyenv, please use virtualenv as follows:

$ virtualenv RESTServer
$ source RESTServer/bin/activate

6.5.3.2 Make sure you have the latest pip:

$ pip install -U pip

6.5.3.3 Install the requirements of the server side code:

$ cd ~/cloudmesh/swagger_example/server/cpu/flaskConnexion
$ pip install -r requirements.txt

6.5.3.4 Install the server side code package:

Under the same directory, run:

$ python setup.py install

6.5.3.5 Run the service

Under the same directory:

$ python -m swagger_server

You should see a message like this:

* Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)

6.5.3.6 Verify the service using a web browser:

Open a web browser and visit:

http://localhost:8080/api/cpu

to see if it returns a json object with the cpu model info in it.

Assignment: How would you verify that your service works with a curl call?

6.5.4 Step 4: Generate Client Side Code and Verify

In addition to the server side code, swagger can also create client side code.

6.5.4.1 Client side code generation:

Generate the client side code in a similar fashion as we did for the server side code:

$ java -jar swagger-codegen-cli.jar generate \
    -i ~/cloudmesh/cpu.yaml \
    -l python \
    -o ~/cloudmesh/swagger_example/client/cpu \
    -DsupportPython2=true

6.5.4.2 Install the client side code package:

Although we could have installed the client in the same python pyenv or virtualenv, we showcase here that it can be installed in a completely different environment. That would even make it possible to use a python 3 based client and a python 2 based server, showcasing interoperability between python versions (although we just use python 2 here). Thus we create a new python virtual environment and conduct our install:

$ virtualenv RESTClient
$ source RESTClient/bin/activate
$ pip install -U pip
$ cd swagger_example/client/cpu
$ pip install -r requirements.txt
$ python setup.py install

6.5.4.3 Using the client API to interact with the REST service

Under the directory swagger_example/client/cpu you will find a README.md file which serves as an API documentation with example client code in it. E.g., if we save the following code into a .py file:

from __future__ import print_function
import time
import swagger_client
from swagger_client.rest import ApiException
from pprint import pprint

# create an instance of the API class
api_instance = swagger_client.DefaultApi()

try:
    api_response = api_instance.cpu_get()
    pprint(api_response)
except ApiException as e:
    print("Exception when calling DefaultApi->cpu_get: %s\n" % e)

We can then run this code to verify that calling the REST service actually works. We are expecting to see a return similar to this:

{'model': 'Intel(R)Core(TM)[email protected]'}

Obviously, we could have applied additional cleanup to the information returned by the python code, such as removing duplicated spaces.

6.5.5 Towards a Distributed Client Server

Although we develop and run the example on one localhost machine, you can separate the process onto two separate machines. E.g., on a server with an external IP or even a DNS name deploy the server side code, and on a local laptop or workstation deploy the client side code. In this case please make the changes in the API definition accordingly, e.g., the host value.

6.6 FLASK RESTFUL SERVICES ☁

Flask is a micro services framework allowing to write web services in python quickly. One of its extensions is Flask-RESTful. It adds support for building REST APIs based on a class definition, making it relatively simple. Through this interface we can then integrate with your existing Object Relational Models and libraries. As Flask-RESTful leverages the main features from Flask, an extensive set of documentation is available, allowing you to get started quickly and thoroughly. The Web page contains extensive documentation:

https://flask-restful.readthedocs.io/en/latest/
This example has not been tested. We would like the class to contribute a tested and beautiful example to this section and to explain what happens in this example.

from flask import Flask
from flask_restful import reqparse, abort
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

COMPUTERS = {
    'computer1': {
        'processor': 'iCore7'
    },
    'computer2': {
        'processor': 'iCore5'
    },
    'computer3': {
        'processor': 'iCore3'
    },
}

def abort_if_computer_doesnt_exist(computer_id):
    if computer_id not in COMPUTERS:
        abort(404, message="Computer {} does not exist".format(computer_id))

parser = reqparse.RequestParser()
parser.add_argument('processor')

class Computer(Resource):
    '''shows a single computer item and lets you delete a computer
    item.'''

    def get(self, computer_id):
        abort_if_computer_doesnt_exist(computer_id)
        return COMPUTERS[computer_id]

    def delete(self, computer_id):
        abort_if_computer_doesnt_exist(computer_id)
        del COMPUTERS[computer_id]
        return '', 204

    def put(self, computer_id):
        args = parser.parse_args()
        processor = {'processor': args['processor']}
        COMPUTERS[computer_id] = processor
        return processor, 201

# ComputerList
class ComputerList(Resource):
    '''shows a list of all computers, and lets you POST to add new computers'''

    def get(self):
        return COMPUTERS

    def post(self):
        args = parser.parse_args()
        computer_id = int(max(COMPUTERS.keys()).lstrip('computer')) + 1
        computer_id = 'computer%i' % computer_id
        COMPUTERS[computer_id] = {'processor': args['processor']}
        return COMPUTERS[computer_id], 201

##
## Setup the Api resource routing here
##
api.add_resource(ComputerList, '/computers')
api.add_resource(Computer, '/computers/<computer_id>')

if __name__ == '__main__':
    app.run(debug=True)

6.7 REST SERVICES WITH EVE ☁

Next, we will focus on how to make a RESTful web service with Python Eve. Eve makes the creation of a REST implementation in python easy. More information about Eve can be found at:

http://python-eve.org/

Although we do recommend Ubuntu 17.04, at this time there is a bug that forces us to use 16.04. Furthermore, we require you to follow the instructions on how to install pyenv and use it to set up your python environment. We recommend that you use either python 2.7.14 or 3.6.4. We do not recommend you to use anaconda as it is not suited for cloud computing but targets desktop computing. If you use pyenv you also avoid the issue of interfering with your system wide python install. We do recommend pyenv regardless of whether you use a virtual machine or are working directly on your operating system. After you have set up a proper python environment, make sure you have the newest version of pip installed with

$ pip install pip -U

To install Eve, you can say

$ pip install eve

As Eve also needs a backend database, and as MongoDB is an obvious choice for this, we will have to first install MongoDB. MongoDB is a Non-SQL database which helps to store lightweight data easily.
6.7.1 Ubuntu install of MongoDB

On Ubuntu you can install MongoDB as follows:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv 2930ADAE8CAF5059EE73BB4B58712A2291FA4AD5
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu \
    xenial/mongodb-org/3.6 multiverse" | \
    sudo tee /etc/apt/sources.list.d/mongodb-org-3.6.list
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org

6.7.2 macOS install of MongoDB

On macOS you can use the command

$ brew update
$ brew install mongodb

6.7.3 Windows 10 Installation of MongoDB

A student or student group of this class is invited to discuss on piazza how to install mongoDB on Windows 10 and come up with an easy installation solution. Naturally we have the same two different ways to run mongo: in user space or in the system. As we want to make sure your computer stays secure, the solution must have an easy way to shut down the Mongo services.

An enhancement of this task would be to integrate this function into cloudmesh cmd5 with a command mongo that allows for easily starting and stopping the service from cms.

6.7.4 Database Location

After downloading Mongo, create the db directory. This is where the Mongo data files will live. You can create the directory in the default location and assure it has the right permissions. Make sure that the /data/db directory has the right permissions by running

$ mkdir -p ~/cloudmesh/data/db

6.7.5 Verification

In order to check the MongoDB installation, please run the following commands, in one terminal:

$ mongod --dbpath ~/cloudmesh/data/db

In another terminal we try to connect to mongo and issue a mongo command to show the databases:

$ mongo --host 127.0.0.1:27017
> show databases

If they execute without errors, you have successfully installed MongoDB. In order to stop the running database instance, simply CTRL-C the running mongod process.

6.7.6 Building a simple REST Service

In this section we will focus on creating a simple rest service. To organize our work we will create the following directory:

$ mkdir -p ~/cloudmesh/eve
$ cd ~/cloudmesh/eve

As Eve needs a configuration, which it reads by default from the file settings.py, we place the following content in the file ~/cloudmesh/eve/settings.py:
MONGO_HOST = 'localhost'
MONGO_PORT = 27017
MONGO_DBNAME = 'student_db'
DOMAIN = {
    'student': {
        'schema': {
            'firstname': {
                'type': 'string'
            },
            'lastname': {
                'type': 'string'
            },
            'university': {
                'type': 'string'
            },
            'email': {
                'type': 'string',
                'unique': True
            },
            'username': {
                'type': 'string',
                'unique': True
            }
        }
    }
}
RESOURCE_METHODS = ['GET', 'POST']

The DOMAIN object specifies the format of a student object that we are using as part of our REST service. In addition we can specify RESOURCE_METHODS, which lists the methods that are activated for the REST service. This way the developer can restrict the available methods for a REST service. To pass along the specification for mongoDB, we simply specify the hostname, the port, as well as the database name.
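As an illustration of the effect of RESOURCE_METHODS, once the service from the next section is running, a GET is answered while a verb that is not listed, such as DELETE, is rejected. A minimal sketch using the requests library:

import requests

# GET is listed in RESOURCE_METHODS and therefore answered
print(requests.get("http://127.0.0.1:5000/student").status_code)     # 200

# DELETE is not listed in RESOURCE_METHODS and therefore rejected
print(requests.delete("http://127.0.0.1:5000/student").status_code)  # 405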
Now that we have defined the settings for our example service, we need to start it with a simple python program. We could name that program anything we like, but often it is simply called run.py. This file is placed in the same directory where you placed the settings.py. In our case it is the file ~/cloudmesh/eve/run.py and contains the following python program:

from eve import Eve

app = Eve()

if __name__ == '__main__':
    app.run()

This is the most minimal application for Eve that uses the settings.py file for its configuration. Naturally, if we were to change the configuration file and, for example, change the DOMAIN and its schema, we would have to remove the previously created database and start the service anew. This is especially important as during the development phase we may frequently change the schema and the database. Thus it is convenient to develop the necessary cleaning actions as part of a Makefile, which we leave as an easy exercise for the students.

Next, we need to start the services, which can easily be achieved in a terminal by running the following commands. Previously we started the mongoDB service as follows:

$ mongod --dbpath ~/cloudmesh/data/db/

This is done in its own terminal, so we can observe the log messages easily. Next we start in another window the Eve service with

$ cd ~/cloudmesh/eve
$ python run.py

You can find the codes and commands up to this point in the following document.

6.7.7 Interacting with the REST service
In yet another window, we can now interact with the REST service. We can use the command line to save data in the database using the REST api. The data can be retrieved in XML or in json format. Json is often more convenient for debugging as it is easier to read than XML.

Naturally, we first need to put some data into the server. Let us assume we add the user Albert Zweistein.

To achieve this, we need to specify the header using the -H tag, saying we need the data to be saved in json format. The -X tag specifies the HTTP method, and here we use the POST method. The -d tag specifies the data; make sure you use json format to enter the data. Finally, we give the REST api endpoint to which we must save the data. This allows us to save the data in a table called student in MongoDB within a database called eve:

$ curl -H "Content-Type: application/json" -X POST \
    -d '{"firstname": "Albert", "lastname": "Zweistein", \
    "school": "ISE", "university": "Indiana University", \
    "email": "[email protected]", "username": "albert"}' \
    http://127.0.0.1:5000/student/

In order to check if the entry was accepted in mongo and included in the server, issue the following command sequence in another terminal. You can query mongo directly with its shell interface:

$ mongo
> show databases
> use student_db
> show tables                   # query the table names
> db.student.find().pretty()    # pretty will show the json in a clear way

Naturally this is not really necessary for a REST service such as eve, so we show you next how to gain access to the data via REST calls instead. We can simply retrieve the information with the help of a simple URI:

$ curl http://127.0.0.1:5000/student?firstname=Albert

Naturally, you can formulate other URLs and query attributes that are passed along after the ?.

This will now allow you to develop sophisticated REST services. We encourage you to inspect the documentation provided by Eve to showcase additional features that you could be using as part of your efforts.
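For instance, the same query can also be issued from python; this is a minimal sketch using the requests library against the service from this section:

import requests

# pass the query attributes after the ? via params
response = requests.get("http://127.0.0.1:5000/student",
                        params={"firstname": "Albert"})

# Eve returns the matching documents in the _items list
print(response.json()["_items"])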
Let us explore how to properly use additional REST API calls. We assume you have MongoDB up and running. To query the service itself we can use the URI on the Eve port:

$ curl -i http://127.0.0.1:5000

Your payload should look like the one listed next; if your output is not formatted like this, try adding ?pretty=1:

$ curl -i http://127.0.0.1:5000?pretty=1

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 150
Server: Eve/0.7.6 Werkzeug/0.11.15 Python/2.7.16
Date: Wed, 17 Jan 2018 18:34:07 GMT

{
    "_links": {
        "child": [
            {
                "href": "student",
                "title": "student"
            }
        ]
    }
}

Remember that the API entry points include additional information such as links, a child, and href.

Exercises:

Set up a python environment that works for your platform. Provide explicit reasons why anaconda and other prepackaged python versions have issues for cloud related activities. When may you use anaconda and when should you not use anaconda? Why would you want to use pyenv?

What is the meaning and purpose of links, child, and href?

In this case how many child resources are available through our API?

Develop a REST service with Eve and start and stop it.

Define curl calls to store data into the service and retrieve it.

Write a Makefile and in it a target clean that cleans the database. Develop additional targets such as start and stop, that start and stop the mongoDB but also the Eve REST service.

Issue the command

$ curl -i http://127.0.0.1:5000/people

{
    "_items": [],
    "_links": {
        "self": {
            "href": "people",
            "title": "people"
        },
        "parent": {
            "href": "/",
            "title": "home"
        }
    },
    "_meta": {
        "max_results": 25,
        "total": 0,
        "page": 1
    }
}

What does the _links section describe?

What does the _items section describe?

6.7.8 Creating REST API Endpoints

Next we want to enhance our example a bit. First, let us get back to the eve working directory with

$ cd ~/cloudmesh/eve

Add the following content to a file called run2.py:

from eve import Eve
from flask import jsonify
import os
import getpass

app = Eve()

@app.route('/student/albert')
def alberts_information():
    data = {
        'firstname': 'Albert',
        'lastname': 'Zweistein',
        'university': 'Indiana University',
        'email': '[email protected]'
    }
    try:
        data['username'] = getpass.getuser()
    except:
        data['username'] = 'not-found'
    return jsonify(**data)

if __name__ == '__main__':
    app.run(debug=True, host="127.0.0.1")

After creating and saving the file, run the following command to start the service:

$ python run2.py
After running the command, you can interact with the service by entering the following url in the web browser:

http://127.0.0.1:5000/student/albert

You can also open up a second terminal and type in it:

$ curl http://127.0.0.1:5000/student/albert

The following information will be returned:

{
    "firstname": "Albert",
    "lastname": "Zweistein",
    "university": "Indiana University",
    "email": "[email protected]",
    "username": "albert"
}

This example illustrates how easy it is to create REST services in python while combining information from a dict with information retrieved from the system. The important part is to understand the decorator app.route. The parameter specifies the route of the API endpoint, which will be the address appended to the base path, http://127.0.0.1:5000. It is important that we return a jsonified object, which can easily be done with the jsonify function provided by flask. As you can see, the name of the decorated function can be anything you like. The route specifies how we access it from the service.
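Following this pattern, further endpoints are added by simply decorating more functions; an illustrative (hypothetical) example:

@app.route('/student/albert/university')
def alberts_university():
    # the function name is arbitrary; the route determines the URL
    return jsonify(university='Indiana University')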
6.7.9 REST API Output Formats and Request Processing

Another way of managing the data is to utilize class definitions and response types that we explicitly define.

If we want to create an object like Student, we can first define a python class. Create a file called student.py. Please note the get method, which simply returns the information in the dict for the class; it is not related to the REST get function.

class Student(object):

    def __init__(self, firstname, lastname, university, email):
        self.firstname = firstname
        self.lastname = lastname
        self.university = university
        self.email = email
        self.username = 'undefined'

    def get(self):
        return self.__dict__

    def setUsername(self, name):
        self.username = name
        return name
Next we define a REST service with Eve as shown in the following listing:

from eve import Eve
from student import Student
import platform
import psutil
import json
from flask import Response
import getpass

app = Eve()

@app.route('/student/albert', methods=['GET'])
def processor():
    student = Student("Albert",
                      "Zweistein",
                      "Indiana University",
                      "albert@example.com")  # placeholder email
    response = Response()
    response.headers["Content-Type"] = "application/json; charset=utf-8"
    try:
        student.setUsername(getpass.getuser())
        response.headers["status"] = 200
    except:
        response.headers["status"] = 500
    response.data = json.dumps(student.get())
    return response

if __name__ == '__main__':
    app.run(debug=True, host='127.0.0.1')

In contrast to our earlier example, we are not using the jsonify object, but explicitly create a response that we return to the clients. The response includes a header stating that we return the information in json format, a status of 200, which means the object was returned successfully, and the actual data.

6.7.10 REST API Using a Client Application

⚠ This example is not tested. Please provide feedback and improve.

In the Section Rest Services with Eve we created our own REST API application using Python Eve. Now, once the service is running, we need to learn how to interact with it through clients.

First go back to the working folder:

$ cd ~/cloudmesh/eve

Here we create a new python file called client.py. The file includes the following content:

import requests
import json

def get_all():
    response = requests.get("http://127.0.0.1:5000/student")
    print(json.dumps(response.json(), indent=4, sort_keys=True))

def save_record():
    headers = {
        'Content-Type': 'application/json'
    }
    data = json.dumps({"firstname": "Gregor",
                       "lastname": "von Laszewski",
                       "university": "Indiana University",
                       "email": "[email protected]",
                       "username": "jane"})
    response = requests.post('http://localhost:5000/student/',
                             headers=headers,
                             data=data)
    print(response.json())

if __name__ == '__main__':
    save_record()
    get_all()

Run the following command in a new terminal to execute the simple client:

$ python client.py

When you run this class for the first time, it will run successfully, but if you try it a second time, it will give you an error, because we set the email to be a unique field in the schema when we designed the settings.py file in the beginning. So if you want to save another record you must use entries with unique emails. In order to make this dynamic you can read the student data from the terminal first and use this dynamic user input instead of the static data. But for this exercise we do not expect that or any other form data functionality.

In order to get the saved data, you can comment out the record saving function and uncomment the get all function. In python, commenting is done using #.

This client is using the requests python library to send GET, POST and other HTTP requests to the server, so you can leverage built-in methods to simplify your work.

The get_all function provides a way to print all the data in the student database to the console. The save_record function provides a way to save data in the database. You can create dynamic functions in order to save dynamic data. However, it may take some time for you to apply this as an exercise.

Write a RESTful service to determine a useful piece of information about your computer, i.e. disk space, memory, RAM, etc. In this exercise what you need to do is use a python library to extract data about the computer information mentioned previously and return this information once the user calls an API endpoint. For example, http://localhost:5000/performance/ram must return the RAM value of the given machine. For each kind of information, like disk space, RAM, etc., you can use one endpoint per feature. As a tip for this exercise, use the psutil library in python to retrieve the data, get this information into a string, populate a class called Computer, and try to save the object likewise.
6.7.11 Towards cmd5 extensions to manage eve and mongo

⚠ Part of this section related to the management of the mongo db service is done by the cm4 command. We will be developing as part of this class cms mongo admin, which does all of the things explained next and more.

Naturally it is of advantage to have in cms administration commands to manage mongo and eve from cmd instead of targets in the Makefile. Hence, we propose that the class develops such an extension. We will create in the repository the extension called admin and hope that students through collaborative work and pull requests complete such an admin command.

The proposed command is located at:

https://github.com/cloudmesh/cloudmesh.rest/blob/master/cloudmesh/admin/command/admin.py

It will be up to the class to implement such a command. Please coordinate with each other.

The implementation based on what we provided in the Makefile seems straightforward. A great extension is to load the object definitions of eve, e.g. settings.py, not from the class, but from a place in .cloudmesh. I propose to place the file at ~/.cloudmesh/db/settings.py.

The location of this file is used when the Service class is initialized with None. Prior to starting the service the file needs to be copied there. This could be achieved with a set command.
6.8 HATEOAS ☁

In the previous section we discussed the basic concepts of RESTful web services. Next we introduce you to the concept of HATEOAS.

HATEOAS stands for Hypermedia as the Engine of Application State, and this is enabled by the default configuration within Eve. It is useful to review the terminology and attributes used as part of this configuration. HATEOAS explains how REST API endpoints are defined and it provides a clear description on how the API can be consumed through these terms:

_links

Links describe the relation of the current resource being accessed to the rest of the resources. It is as if we have a set of links to the set of objects or service endpoints that we are referring to in the RESTful web service. Here an endpoint refers to a service call which is responsible for executing one of the CRUD operations on a particular object or set of objects. Furthermore, the links object contains the list of serviceable API endpoints, or list of services. When we are calling a GET request or any other request, we can use these service endpoints to execute different queries based on the user purpose. For instance, a service call can be used to insert data or retrieve data from a remote database using a REST API call. About databases we will discuss in detail in another chapter.

title

The title in the rest endpoint is the name or topic that we are trying to address. It describes the nature of the object by a single word. For instance student, bank-statement, salary, etc. can be a title.

parent

The term parent refers to the very initial link or API endpoint in a particular RESTful web service. Generally this is denoted with the primary address like http://example.com/api/v1/.

href

The term href refers to the url segment that we use to access a particular REST API endpoint. For instance "student?page=1" will return the first page of the student list by retrieving a particular number of items from a remote database or a remote data source. The full url will look like this: "http://www.exampleapi.com/student?page=1".

In addition to these fields, eve will automatically create the following information when resources are created, as showcased at

http://python-eve.org/features.html

Field     Description
_created  item creation date.
_updated  item last updated on.
_etag     ETag, to be used for concurrency control and conditional requests.
_id       unique item key, also needed to access the individual item endpoint.
Pagination information can be included in the _meta field.
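As an illustration, a single item returned by Eve carries these automatically created fields alongside the user-defined ones (all values here are made-up placeholders):

# illustrative item as returned by Eve; values are placeholders
item = {
    "firstname": "Albert",
    "_id": "5a2f...",                             # unique item key
    "_etag": "7776...",                           # concurrency control
    "_created": "Wed, 17 Jan 2018 18:34:07 GMT",  # creation date
    "_updated": "Wed, 17 Jan 2018 18:34:07 GMT",  # last update
}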
6.8.1 Filtering

Clients can submit query strings to the rest service to retrieve resources based on a filter. This also allows sorting of the results queried. One nice feature about using mongo as a backend database is that Eve not only allows python conditional expressions, but also mongo queries.

Examples of such queries include a mongo query:

$ curl -i -g http://eve-demo.herokuapp.com/people?where={%22lastname%22:%20%22Doe%22}

and a python expression:

$ curl -i http://eve-demo.herokuapp.com/people?where=lastname=="Doe"

6.8.2 Pretty Printing

Pretty printing is typically supported by adding the parameter ?pretty or ?pretty=1:

$ curl -i http://localhost/people?pretty

If this does not work you can always use python to beautify a json output with

$ curl -i http://localhost/people | python -m json.tool

6.8.3 XML

If for some reason you like to retrieve the information in XML, you can specify this, for example, through curl with an Accept header:

$ curl -H "Accept: application/xml" -i http://localhost

6.9 EXTENSIONS TO EVE ☁

A number of extensions have been developed by the community. This includes eve-swagger, eve-sqlalchemy, eve-elastic, eve-mongoengine, eve-neo4j, eve.net, eve-auth-jwt, and flask-sentinel.

Naturally there are many more.

Students have the opportunity to pick one of the community extensions and provide a section for the handbook.

Pick one of the extensions, research it, and provide a small section for the handbook so we can add it.

6.9.1 Object Management with Eve and Evegenie

http://python-eve.org/

Eve makes the creation of a REST implementation in python easy. We will provide you with an implementation example that showcases that we can create REST services without writing a single line of code. The code for this is located at https://github.com/cloudmesh/rest
This code will have a master branch but will also have a dev branch in which we will gradually add more objects. Objects in the dev branch will include:

virtual directories
virtual clusters
job sequences
inventories

You may want to check our active development work in the dev branch. However, for the purpose of this class the master branch will be sufficient.

6.9.1.1 Installation

First we have to install mongodb. The installation will depend on your operating system. For the use of the rest service it is not important to integrate mongodb into the system upon reboot, which is the focus of many online documents. However, for us it is better if we can start and stop the services explicitly for now.

On ubuntu, you need to do the following steps:

⚠ TODO: Section can be contributed by student.

On windows 10, you need to do the following steps:

⚠ TODO: Section can be contributed by student. If you elect Windows 10, you could be using the online documentation provided for starting it on Windows, or run it in a docker container.

On macOS you can use home-brew and install it with:

$ brew update
$ brew install mongodb

In the future we may want to add ssl authentication, in which case you may need to install it as follows:

$ brew install mongodb --with-openssl
6.9.1.2 Starting the service

We have provided a convenient Makefile that currently only works for macOS. It will be easy for you to adapt it to Linux. Certainly you can look at the targets in the makefile and replicate them one by one. Important targets are deploy and test.

When using the makefile you can start the services with:

$ make deploy

It will start two terminals. In one you will see the mongo service, in the other you will see the eve service. The eve service will take a file called sample.settings.py that is based on sample.json for the start of the eve service. The mongo service is configured in such a way that it only accepts incoming connections from the local host, which will be sufficient for our case. The mongo data is written into the $USER/.cloudmesh directory, so make sure it exists.

To test the services you can say:

$ make test

You will see a number of json texts being written to the screen.

6.9.1.3 Creating your own objects

The example demonstrated how easy it is to create a mongodb and an eve rest service. Now let us use this example to create your own. For this we have modified a tool called evegenie to install it onto your system.

The original documentation for evegenie is located at:

http://evegenie.readthedocs.io/en/latest/

However, we have improved evegenie while providing a commandline tool based on it. The improved code is located at:

https://github.com/cloudmesh/evegenie

You clone it and install it on your system as follows:

$ cd ~/github
$ git clone https://github.com/cloudmesh/evegenie
$ cd evegenie
$ python setup.py install
$ pip install .

This should install evegenie in your system. You can verify this by typing:

$ which evegenie

If you see the path, evegenie is installed. With evegenie installed its usage is simple:

$ evegenie

Usage:
  evegenie --help
  evegenie FILENAME

It takes a json file as input and writes out a settings file for use in eve. Let us assume the file is called sample.json, then the settings file will be called sample.settings.py. Having the evegenie program will allow us to generate the settings files easily. You can include them into your project and leverage the Makefile targets to start the services in your project. In case you generate new objects, make sure you rerun evegenie, kill all previous windows in which you run eve and mongo, and restart. In case of changes to objects that you have designed and run previously, you also need to delete the mongod database.

6.10 DJANGO REST FRAMEWORK ☁

Django REST framework is a large toolkit to develop Web APIs. The developers of the framework provide the following reasons for using it, according to the developers of that module:

1. The Web browsable API improves usability.
2. Authentication policies including packages for OAuth1a and OAuth2.
3. Serialization that supports both ORM and non-ORM data sources.
4. Customizable all the way down - just use regular function-based views if you do not need the more powerful features.
5. Extensive documentation, and great community support.
6. "Used and trusted by internationally recognised companies including Mozilla, Red Hat, Heroku, and Eventbrite."
https://www.django-rest-framework.org/

An example is provided on their Web Page at

https://www.django-rest-framework.org/#example

To document your django framework with Swagger you can look at this example:

https://www.django-rest-framework.org/topics/documenting-your-api/

However, we believe that for our purposes the approach to use connexion from an OpenAPI is much more appealing; using connexion and flask for the REST service is also easier to accomplish. Django is a large package that will take more time to get used to.
6.11 GITHUB REST SERVICES ☁

In this section we want to explore more features of REST services and how to access them. Naturally many cloud services provide such REST interfaces. This is valid for IaaS, PaaS, and SaaS.

Instead of using a REST service for IaaS, let us here inspect a REST service for the Github.com platform.

Its interfaces are documented nicely at

https://developer.github.com/v3/

We see that Github offers many resources that can be accessed by the users, which includes:

Activities
Checks
Gists
Git Data
GitHub Apps
Issues
Migrations
Miscellaneous
Organizations
Projects
Pull Requests
Reactions
Repositories
Searches
Teams
Users

Most likely we forgot the one or the other Resource that we can access via REST. It will be out of scope for us to explore all of these, so let us focus on how we, for example, access Github Issues. In fact we will use the script that we use to create issue tables for this book to showcase how easy the interaction is and to retrieve the information.

6.11.1 Issues

The REST service for issues is described in the following Web page as specification:

https://developer.github.com/v3/issues/

We see the following functionality:

List issues
List issues for a repository
Get a single issue
Create an issue
Edit an issue
Lock an issue
Unlock an issue
Custom media types

As we have learned in our REST section, we need to issue GET requests to obtain information about the issues, such as

GET /issues
GET /user/issues

As response we obtain a json object with the information we need to further process it. Unfortunately, the free tier of github has limitations in regards to the frequency with which we can issue such requests to the service, as well as in the volume in regards to the number of pages returned to us.

Let us now explore how to easily query some information. In our example we like to retrieve the list of issues for a repository as a LaTeX table but also as markdown. This way we can conveniently integrate it in documents of either format. As LaTeX has a more sophisticated table management, let us first create a LaTeX table document and then use a program to convert LaTeX to markdown. For the latter we can reuse a program called pandoc that can convert the table from LaTeX to markdown.

Let us assume we have a program called issues.py that prints the table in markdown format:

$ python issues.py

An example for such a program is listed at:

https://github.com/cloudmesh-community/book/blob/master/bin/issues.py

Although python provides the very nice module requests, which we typically use for such tasks, we have here just wrapped the command line call to curl into a system command and redirect its output to a file. However, as we only get limited information back in pages, we need to continue such a request multiple times. To keep things simple we identified that for the project at this time not more than n pages need to be fetched, so we append the output from each page to the file.

Your task is to improve this script and automate this activity so that no maximum number of fetches has to be entered.

The reason why this program is so short is that we leverage the built-in functions for json data structure manipulation, here a read and a dump. When we look in the issue.json file that is created as intermediary file, we see a list of items such as:
[
...
{
"url":"https://api.github.com/repos/cloudmesh-community/book/issues/46",
"repository_url":"https://api.github.com/repos/cloudmesh-community/book",
"labels_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/labels{/name}",
"comments_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/comments",
"events_url":"https://api.github.com/repos/cloudmesh-community/book/issues/46/events",
"html_url":"https://github.com/cloudmesh-community/book/issues/46",
"id":360613438,
"node_id":"MDU6SXNzdWUzNjA2MTM0Mzg=",
"number":46,
"title":"Taken:Virtualization",
"user":{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
},
"labels":[],
"state":"open",
"locked":false,
"assignee":{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
},
"assignees":[
{
"login":"laszewsk",
"id":425045,
"node_id":"MDQ6VXNlcjQyNTA0NQ==",
"avatar_url":"https://avatars1.githubusercontent.com/u/425045?v=4",
"gravatar_id":"",
"url":"https://api.github.com/users/laszewsk",
"html_url":"https://github.com/laszewsk",
"followers_url":"https://api.github.com/users/laszewsk/followers",
"following_url":"https://api.github.com/users/laszewsk/following{/other_user}",
"gists_url":"https://api.github.com/users/laszewsk/gists{/gist_id}",
"starred_url":"https://api.github.com/users/laszewsk/starred{/owner}{/repo}",
"subscriptions_url":"https://api.github.com/users/laszewsk/subscriptions",
"organizations_url":"https://api.github.com/users/laszewsk/orgs",
"repos_url":"https://api.github.com/users/laszewsk/repos",
"events_url":"https://api.github.com/users/laszewsk/events{/privacy}",
"received_events_url":"https://api.github.com/users/laszewsk/received_events",
"type":"User",
"site_admin":false
}
],
"milestone":null,
"comments":0,
"created_at":"2018-09-16T07:35:35Z",
"updated_at":"2018-09-16T07:35:35Z",
"closed_at":null,
"author_association":"CONTRIBUTOR",
"body":"Develop a section about Virtualization"
},
...
]

As we can see from this entry, there is a lot of information associated that for our purposes we do not need, but it certainly could be used to mine github in general.

We like to point out that github is actively mined for exploits where passwords are posted in clear text for AWS, Azure and other clouds. This is a common mistake, as many sample programs ask the student to place the password directly into their programs instead of using a configuration file that is never part of the code repository.

6.11.2 Exercise

E.github.issues.1:

Develop a new code like the one in this section, but use python requests instead of the os.system call.

E.github.issues.2:

In the simple program we hardcoded the number of page requests. How can we find out exactly how many pages we need to retrieve? Implement your solution.

E.github.issues.3:

Be inspired by the many REST interfaces. How can they be used to mine interesting things?

E.github.issues.4:

Can you create a project, author, or technology map based on information that is available in github? For example python projects may include a requirements file, or developers may work on some projects together, but do other projects with others. Can you create a network?

E.github.issues.5:

Use github to develop some cool python programs that show some statistics about github. An example would be: Given a github repository, show the checkins by date and visualize them graphically for one committer and all committers. Use bokeh or matplotlib.

E.github.issues.6:

Develop a python program that retrieves a file. Develop a python program that uploads a file. Develop a class that does this and use it in your program. Use docopt to create a manual page. Please remember this prepares you for your project, so this is very useful to do.
7 MAPREDUCE

7.1 INTRODUCTION TO MAPREDUCE ☁

In this section we discuss the background of MapReduce along with Hadoop and the core components of Hadoop.

We start out our section with a review of the python lambda expression as well as the map function. Understanding these concepts is helpful for our overall understanding of mapreduce.

So before you watch the video, we encourage you to learn Sections {#s-python-lambda} and {#s-python-map}.

Now that you have a basic understanding of the map function, we recommend to watch our videos about mapreduce, hadoop and spark, which we provide within this chapter.

MapReduce, Hadoop, and Spark (19:02) Hadoop A

MapReduce is a programming technique or processing capability which operates in a cluster or a grid on a massive data set and brings out reliable output. It works on essentially two main functions - map() and reduce(). MapReduce processes large chunks of data, so it is highly beneficial to operate in multi-threaded fashion, meaning parallel processing. MapReduce can also take advantage of data locality so that we do not lose much on communication of data from one place to another.
7.1.1 MapReduce Algorithm

MapReduce can operate on a filesystem, which holds unstructured data, or on a database with structured data. These are the following three stages of its operation (see Figure 38):

1. Map: This method processes the very initial data set. Generally, the data is in file format which can be stored in HDFS (Hadoop File System). The map function reads the data line by line and creates several chunks of data, which are again stored in HDFS. This broken set of data is in key/value pairs. So in a multi-threaded environment, there will be many worker nodes operating on the data using this map() function and writing this intermediate data in the form of key/value pairs to temporary data storage.

2. Shuffle: In this stage, worker nodes will shuffle or redistribute the data in such a way that there is only one copy for each key.

3. Reduce: This function always comes last, and it works on the data produced by the map and shuffle stages and produces an even smaller chunk of data which is used to calculate the output.

Figure 38: MapReduce Conceptual diagram

The Shuffle operation is very important here, as it is mainly responsible for reducing the communication cost. The main advantage of using the MapReduce algorithm is that it becomes very easy to scale up data processing just by adding some extra computing nodes. Building up the map and reduce methods is sometimes nontrivial, but once done, scaling up the applications is so easy that it is just a matter of changing the configuration. Scalability is a really big advantage of the MapReduce model. In the traditional way of data processing, data was moved from the nodes to the master, and the processing then happened on the master machine. In this approach, we lose bandwidth and time on moving data to the master, and parallel operation cannot happen. Also, the master can get over-burdened and fail. In the MapReduce approach, the master node distributes the data to the worker machines, which are in themselves processing units. So all workers process the data in parallel, and the time taken to process the data is reduced tremendously. (see Figure 39)

Figure 39: MapReduce Master worker diagram
7.1.1.1 MapReduce Example: Word Count

Let us understand MapReduce by an example. For example, we have a text file Sample.txt with the content: Cat, Bear, Camel, Bird, Cat, Bird, Camel, Cat, Bear, Camel, Cat, Camel

1. First we divide the input into four parts so that individual nodes can handle the load.
2. We tokenize each word and assign a weightage of value "1" to each word.
3. This way we will have a list of key-value pairs with the key being the word and the value being 1.
4. After this mapping phase, the shuffling phase starts, where all maps with the same key are sent to the corresponding reducer.
5. Now each reducer will have a unique key and a list of values for each key, which in this case is all 1s.
6. After that, each reducer will count the total number of 1s and assign the final count to each word.
7. The final output is then written to a file. (see Figure 40)

Figure 40: MapReduce Word Count [50]
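Before looking at the Hadoop version, the same word count logic can be sketched in plain python to make the three stages explicit; this is a single-machine illustration, not Hadoop code:

from collections import defaultdict
from functools import reduce

text = "Cat Bear Camel Bird Cat Bird Camel Cat Bear Camel Cat Camel"

# map stage: emit a (word, 1) pair for every token
pairs = [(word, 1) for word in text.split()]

# shuffle stage: group all values that share the same key
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# reduce stage: sum the counts for every key
counts = {word: reduce(lambda a, b: a + b, values)
          for word, values in groups.items()}

print(counts)  # {'Cat': 4, 'Bear': 2, 'Camel': 4, 'Bird': 2}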
Let us see an example of map() and reduce() methods in code for this wordcountexample.
Here we have created a class Map which extends Mapper from MapReduceframeworkandweoverridemap()methodtodeclarethekey/valuepairs.Next,therewillbeareducemethoddefinedinsideReduceclassasnextandbothinputandoutputhereisakey/valuepairs:
7.1.2HadoopMapReduceandHadoopSpark
publicstaticclassMapextendsMapper<LongWritable,
Text,
Text,
IntWritable>{
publicvoidmap(LongWritablekey,
Textvalue,
Contextcontext)
throwsIOException,InterruptedException{
Stringline=value.toString();
StringTokenizertokenizer=newStringTokenizer(line);
while(tokenizer.hasMoreTokens()){
value.set(tokenizer.nextToken());
context.write(value,newIntWritable(1));
}
}
publicstaticclassReduceextendsReducer<Text,
IntWritable,
Text,IntWritable>{
publicvoidreduce(Textkey,
Iterable<IntWritable>values,
Contextcontext)
throwsIOException,InterruptedException{
intsum=0;
for(IntWritablex:values){
sum+=x.get();
}
context.write(key,newIntWritable(sum));
}
}
In earlier versions of Hadoop we could use MapReduce with HDFS directly, but from 2.0 onwards YARN (Cluster Resource Management) was introduced, which acts as a layer between MapReduce and HDFS; using YARN, many other Big Data frameworks can connect to HDFS as well. (see Figure 41)

Figure 41: MapReduce Hadoop and Spark [51]

There are many big data frameworks available, and there is always a question as to which one is the right one. Leading frameworks are Hadoop MapReduce and Apache Spark, and the choice depends on business needs. Let us start comparing both of these frameworks with respect to their processing capability.

7.1.2.1 Apache Spark

Apache Spark is a lightning fast cluster computing framework. Spark is an in-memory system. Spark is up to 100 times faster than Hadoop MapReduce.

7.1.2.2 Hadoop MapReduce

Hadoop MapReduce reads and writes on disk; because of this it is a slow system, and that affects the volume of data being processed. But Hadoop is scalable and fault tolerant, so it is good for linear processing.

7.1.2.3 Key Differences

The key differences between them are as follows:

1. Speed: Spark is a lightning fast cluster computing framework and operates up to 100 times faster in-memory and 10 times faster than Hadoop on disk. In-memory processing reduces the disk read/write processes, which are time consuming.

2. Complexity: Spark is easy to use since there are many APIs available, but for Hadoop, developers need to code the functions, which makes it harder.

3. Application Management: Spark can perform batch processing, interactive and Machine Learning and Streaming of data, all in the same cluster, which makes it a complete framework for data analysis, whereas Hadoop is just a batch engine and requires other frameworks for other tasks, which makes it somewhat difficult to manage.

4. Real-Time Data Analysis: Spark is capable of processing real time data with great efficiency. But Hadoop was designed primarily for batch processing, so it cannot process live data.

5. Fault Tolerance: Both systems are fault tolerant, so there is no need to restart the applications from scratch.

6. Data Volume: As the data for Spark is held in memory, larger data volumes are better managed in Hadoop.

7.1.3 References

[50] https://www.edureka.co/blog/mapreduce-tutorial/?utm_source=youtube&utm_campaign=mapreduce-tutorial-161216-wr&utm_medium=description
[51] https://www.youtube.com/watch?v=SqvAaB3vK8U&list=WL&index=25&t=2547s
[52] https://www.ibm.com/analytics/hadoop/mapreduce
[53] https://en.wikipedia.org/wiki/MapReduce
[54] https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
[55] https://www.quora.com/What-is-the-difference-between-Hadoop-and-Spark
[56] https://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce
7.2 HADOOP ☁

Hadoop is an open source framework for storage and processing of large datasets on commodity clusters. Hadoop internally uses its own file system called HDFS (Hadoop Distributed File System).

The motivation for Hadoop was introduced in the Section Mapreduce.

7.2.1 Hadoop and MapReduce

In this section we discuss the Hadoop MapReduce architecture.

Hadoop 13:19 Hadoop B

7.2.2 Hadoop EcoSystem

In this section we discuss the Hadoop EcoSystem and the architecture.

Hadoop 12:57 Hadoop C

7.2.3 Hadoop Components

In this section we discuss Hadoop Components in detail.

Hadoop 15:14 Hadoop D

7.2.4 Hadoop and the Yarn Resource Manager

In this section we discuss the Yarn resource manager and the novel components added to the Hadoop framework to improve the performance and the fault tolerance.

Hadoop 14:55 Hadoop E

7.2.5 PageRank

In this section we discuss a real world problem that can be solved using the MapReduce technique. PageRank is a problem solved in the earliest stages of Google Inc. We discuss the theoretical background of this problem and how it can be solved using the mapreduce concepts.

Hadoop 25:41 Hadoop F
7.3 INSTALLATION OF HADOOP ☁

This section is using Hadoop version 3.1.1 on Ubuntu 18.04. We also describe the installation of the Yarn resource manager. We assume that you have ssh and rsync installed and use emacs as editor.

If you use a newer version, and like to update this text, please help.

7.3.1 Releases

Hadoop changes on a regular basis. Before following this section, we recommend that you visit

https://hadoop.apache.org/releases.html

where the list of downloadable files is also available. Verify that you use an up to date version. If the version of this installation is outdated, we ask you as an exercise to update it.

7.3.2 Prerequisites

sudo apt-get install ssh
sudo apt-get install rsync
sudo apt-get install emacs

7.3.3 User and User Group Creation

For security reasons we will install hadoop for a particular user and user group. We will use the following:

sudo addgroup hadoop_group
sudo adduser --ingroup hadoop_group hduser
sudo adduser hduser sudo

These steps will provide sudo privileges to the created hduser user and add the user to the group hadoop_group.
7.3.4 Configuring SSH

Here we create an SSH key for the local user to install hadoop with an ssh-key. This is different from the ssh-key you used for Github, FutureSystems, etc. Follow this section to configure it for the Hadoop installation.

The ssh content is included here because we are making an ssh key for this specific user. Next, we have to configure ssh so it can be used by the hadoop user:

sudo su - hduser
ssh-keygen -t rsa

Follow the instructions as provided in the commandline. When you see the following console input, press ENTER. Here we will only create passwordless keys. In general this is not a good idea, but for this case we make an exception.

Enter file in which to save the key (/home/hduser/.ssh/id_rsa):

Next you will be asked to enter a passphrase for the ssh configuration; here enter the same (empty) passphrase twice:

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

Finally you will see something like this after these steps are finished:

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:0UBCPd6oYp7MEzCpOhMhNiJyQo6PaPCDuOT48xUDDc0 hduser@computer
The key's randomart image is:
+---[RSA 2048]----+
|.+ooo            |
|.oE.oo           |
|+.....+.         |
|X+=.o..          |
|XX.oo.S          |
|Bo++.o           |
|*o*+.            |
|*..*.            |
|+.o..            |
+----[SHA256]-----+

You have successfully configured ssh.
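Hadoop's scripts also need to be able to ssh to localhost without a password. A step that is commonly required for this, and which you may want to verify for your setup, is to append the public key to the authorized keys and test the login:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost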
7.3.5InstallationofJava
Ifyouarealreadyloggedintosu,youcanskipthenextcommand:
Nowexecutethefollowingcommandstodownloadandinstalljava
PleasenotethatusersmustacceptOracleOTNlicensebeforedownloadingJDK.
7.3.6InstallationofHadoop
FirstwewilltakealookonhowtoinstallHadoop3.1.1onUbuntu
16.04.Wemayneedapriorfolderstructuretodotheinstallationproperly.
7.3.7HadoopEnvironmentVariables
InUbuntutheenvironmentalvariablesaresetupinafilecalledbashrcatitcanbeaccessedthefollowingway
Nowaddthefollowingtoyour~/.bashrcfile
InEmacstosavethefileCtrl-X-SandCtrl-X-Ctoexit.Aftereditingyoumustupdatethevariablesinthesystem.
su - hduser
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget -c --header "Cookie: \
    oraclelicense=accept-securebackup-cookie" \
    "http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz"
tar xvzf jdk-8u191-linux-x64.tar.gz
cd ~/cloudmesh/bin/
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
emacs ~/.bashrc

export JAVA_HOME=~/cloudmesh/bin/jdk1.8.0_191
export HADOOP_HOME=~/cloudmesh/bin/hadoop-3.1.1
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH

source ~/.bashrc
java -version
If you have installed things properly there will be no errors. It will show the version as follows:

java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

Next verify the hadoop installation:

hadoop

If you have successfully installed it, a usage message such as the following will be shown:

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  fs                         run a generic filesystem user client
  version                    print the version
  jar <jar>                  run a jar file
  checknative [-a|-h]        check native hadoop and compression libraries availability
  distcp <srcurl> <desturl>  copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>  create a hadoop archive
  classpath                  prints the class path needed to get the Hadoop jar and the required libraries
  credential                 interact with credential providers
  daemonlog                  get/set the log level for each daemon
  trace                      view and modify Hadoop tracing settings
 or
  CLASSNAME                  run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
7.4 HADOOP VIRTUAL CLUSTER INSTALLATION USING CLOUDMESH ☁
Note: this version is dependent on an older version of Cloudmesh.

TODO: we need to add the installation instructions based on this version.
7.4.1 Cloudmesh Cluster Installation

Before you start this lesson, you MUST finish cm_install.

This lesson was created and tested under the newest version of the Cloudmesh client. Update yours if it is not current.
To manage a virtual cluster on the cloud, the command is cm cluster. Try cm cluster help to see what the other commands are and what options they support.

7.4.1.1 Create Cluster
To create a virtual cluster on the cloud, we must define an active cluster specification with the cm cluster define command. For example, we define a cluster with 3 nodes:

All options will use the default settings if not specified during cluster define. Try the cm cluster help command to see what options cm cluster define has and what they mean; here is part of the usage information:

Floating IPs are a valuable and limited resource on the cloud. cm cluster define will assign a floating IP to every node within the cluster by default. Cluster creation will fail if the floating IPs run out on the cloud. When you run into an error like this, use the option -I or --no-floating-ip to avoid assigning floating IPs during cluster creation:

Then manually assign a floating IP to one of the nodes. Use this node as a login node or head node to log into all the other nodes.

We can have multiple specifications defined at the same time. Every time a new cluster specification is defined, the counter of the default cluster name will increment. Hence, the default cluster names will be cluster-001, cluster-002, cluster-003 and so on.
$ cm cluster define --count 3

$ cm cluster help
usage: cluster create [-n NAME] [-c COUNT] [-C CLOUD] [-u NAME] [-i IMAGE] [-f FLAVOR] [-k KEY] [-s NAME] [-AI]
Options:
  -A --no-activate         Do not activate this cluster
  -I --no-floating-ip      Do not assign floating IPs
  -n NAME --name=NAME      Name of the cluster
  -c COUNT --count=COUNT   Number of nodes in the cluster
  -C NAME --cloud=NAME     Name of the cloud
  -u NAME --username=NAME  Name of the image login user
  -i NAME --image=NAME     Name of the image
  -f NAME --flavor=NAME    Name of the flavor
  -k NAME --key=NAME       Name of the key
  -s NAME --secgroup=NAME  NAME of the security group
  -o PATH --path=PATH      Output to this path
...

$ cm cluster define --count 3 --no-floating-ip
Use cm cluster avail to check all the available cluster specifications:

With cm cluster use [NAME], we are able to switch between different specifications with the given cluster name:

This will activate specification cluster-001, which assigns floating IPs during creation, rather than the latest one, cluster-002.

With our cluster specification ready, we create the cluster with the command cm cluster allocate. This will create a virtual cluster on the cloud with the activated specification:

Each specification can have one active cluster, which means cm cluster allocate does nothing if there is already a successfully activated cluster.
7.4.1.2 Check Created Cluster

$ cm cluster avail
cluster-001
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : True
    cloud            : chameleon
> cluster-002
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : False
    cloud            : chameleon

$ cm cluster use cluster-001
$ cm cluster avail
> cluster-001
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : True
    cloud            : chameleon
cluster-002
    count            : 3
    image            : CC-Ubuntu14.04
    key              : xl41
    flavor           : m1.small
    secgroup         : default
    assignFloatingIP : False
    cloud            : chameleon

$ cm cluster allocate
With the command cm cluster list, we can see the cluster with the default name cluster-001 we just created:

Using cm cluster nodes [NAME], we can also see the nodes of the cluster along with their assigned floating IPs:

If the option --no-floating-ip was included during the definition, you will see nodes without a floating IP:

To log into one of them, use the command cm vm ip assign [NAME] to assign a floating IP to one of them:

Then you can log into this node as a head node of your cluster with cm vm ssh [NAME]:
7.4.1.3 Delete Cluster

Using cm cluster delete [NAME], we are able to delete the cluster we created:

The option --all can delete all the clusters created, so be careful:

$ cm cluster delete --all

Then we need to undefine our cluster specification with the command cm cluster undefine [NAME]:
$ cm cluster list
cluster-001

$ cm cluster nodes cluster-001
xl41-001  129.114.33.147
xl41-002  129.114.33.148
xl41-003  129.114.33.149

$ cm cluster nodes cluster-002
xl41-004  None
xl41-005  None
xl41-006  None

$ cm vm ip assign xl41-006
$ cm cluster nodes cluster-002
xl41-004  None
xl41-005  None
xl41-006  129.114.33.150

$ cm vm ssh xl41-006
cc@xl41-006 $

$ cm cluster delete cluster-001
The option --all can delete all the cluster specifications:
7.4.2 Hadoop Cluster Installation

This section builds upon the previous one. Please finish the previous one before starting this one.

7.4.2.1 Create Hadoop Cluster

To create a Hadoop cluster, we need to first define a cluster with the cm cluster define command:

To deploy a Hadoop cluster, we only support the image CC-Ubuntu14.04 on Chameleon. DO NOT use CC-Ubuntu16.04 or any other images. You will need to specify it if it is not the default image:

$ cm cluster define --count 3 --image CC-Ubuntu14.04

Then we define the Hadoop cluster on top of the cluster we defined, using the cm hadoop define command:

As with cm cluster define, you can define multiple specifications for the Hadoop cluster and check them with cm hadoop avail:

We can use cm hadoop use [NAME] to activate the specification with the given name:
$ cm cluster undefine cluster-001
$ cm cluster undefine --all

$ cm cluster define --count 3
$ cm hadoop define
$ cm hadoop avail
> stack-001
    local_path: /Users/tony/.cloudmesh/stacks/stack-001
    addons: []
$ cm hadoop use stack-001
May not be available for the current version of the Cloudmesh Client.

Before deploying, we need to use cm hadoop sync to check out and synchronize the Big Data Stack from github.com:

To avoid errors, make sure you are able to connect to github.com using SSH:

https://help.github.com/articles/connecting-to-github-with-ssh/

Finally, we are ready to deploy our Hadoop cluster:

This process could take up to 10 minutes depending on your network.

To check whether Hadoop is working, use cm vm ssh to log into the Namenode of the Hadoop cluster. It is usually the first node of the cluster:

Switch to the user hadoop and check whether HDFS is set up:

Now the Hadoop cluster is properly installed and configured.
7.4.2.2 Delete Hadoop Cluster

To delete the Hadoop cluster we created, use the command cm cluster delete [NAME] to delete the cluster with the given name:

Then undefine the Hadoop specification and the cluster specification:

May not be available for the current version of the Cloudmesh Client.
$ cm hadoop sync
$ cm hadoop deploy

$ cm vm ssh node-001
cc@hostname $

cc@hostname $ sudo su - hadoop
hadoop@hostname $ hdfs dfs -ls /
Found 1 items
drwxrwx---   - hadoop hadoop,hadoopadmin          0 2017-02-15 17:26 /tmp

$ cm cluster delete cluster-001
$ cm hadoop undefine stack-001
$ cm cluster undefine cluster-001
7.4.3 Advanced Topics with Hadoop

7.4.3.1 Hadoop Virtual Cluster with Spark and/or Pig

To install Spark and/or Pig with a Hadoop cluster, we again use the command cm hadoop define, but with an ADDON, to define the cluster specification.

For example, we create a 3-node Spark cluster with Pig. To do that, all we need is to specify spark as an ADDON during the Hadoop definition:

Using cm hadoop addons, we are able to check the currently supported addons:

With cm hadoop avail, we can see the details of the specification for the Hadoop cluster:

Then we use cm hadoop sync and cm hadoop deploy to deploy our Spark cluster:

This process will take 15 minutes or longer.

Before we proceed to the next step, there is one more thing we need to do, which is to make sure we are able to ssh from every node to the others without a password. To achieve that, we need to execute cm cluster cross_ssh:
7.4.3.2 Word Count Example on Spark

Now with the cluster ready, let us run a simple Spark job, WordCount, on one of William Shakespeare's works. Use cm vm ssh to log into the Namenode of the Spark cluster. It should be the first node of the cluster:
$ cm cluster define --count 3
$ cm hadoop define spark pig
$ cm hadoop addons
$ cm hadoop avail
> stack-001
    local_path: /Users/tony/.cloudmesh/stacks/stack-001
    addons: [u'spark', u'pig']
$ cm hadoop sync
$ cm hadoop deploy
$ cm cluster cross_ssh

$ cm vm ssh node-001
cc@hostname $
Switch to the user hadoop and check whether HDFS is set up:

Download the input file from the Internet:

You can also use any other text file you prefer. Create a new directory wordcount within HDFS to store the input and output:

Store the input text file in the directory:

Save the following code as wordcount.py on the local filesystem of the Namenode:

Next, submit the job to Yarn and run it in distributed mode:

Finally, take a look at the result in the output directory:
cc@hostname $ sudo su - hadoop
hadoop@hostname $

wget --no-check-certificate -O inputfile.txt \
    https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt

$ hdfs dfs -mkdir /wordcount
$ hdfs dfs -put inputfile.txt /wordcount/inputfile.txt
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # take two arguments, input and output
    if len(sys.argv) != 3:
        print("Usage: wordcount <input> <output>")
        exit(-1)
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("SparkCount")
    sc = SparkContext(conf=conf)
    # read in the text file
    text_file = sc.textFile(sys.argv[1])
    # split each line into words,
    # count the occurrence of each word,
    # and sort the output based on word
    counts = text_file.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .sortByKey()
    # save the result in the output text file
    counts.saveAsTextFile(sys.argv[2])
$ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
    --name wordcount --conf "spark.app.id=wordcount" wordcount.py \
    hdfs://192.168.0.236:8020/wordcount/inputfile.txt \
    hdfs://192.168.0.236:8020/wordcount/output

$ hdfs dfs -ls /wordcount/output/
Found 3 items
-rw-r--r--   1 hadoop hadoop,hadoopadmin      0 2017-03-07 21:28 /wordcount/output/_SUCCESS
-rw-r--r--   1 hadoop hadoop,hadoopadmin 483182 2017-03-07 21:28 /wordcount/output/part-00000
-rw-r--r--   1 hadoop hadoop,hadoopadmin 639649 2017-03-07 21:28 /wordcount/output/part-00001

$ hdfs dfs -cat /wordcount/output/part-00000 | less
(u'', 517065)
(u'"', 241)
(u'"\'Tis', 1)
(u'"A', 4)
(u'"AS-IS".', 1)
(u'"Air,"', 1)
(u'"Alas,', 1)
(u'"Amen"', 2)
(u'"Amen"?', 1)
(u'"Amen,"', 1)
...
7.5 SPARK

7.5.1 Spark Lectures ☁

This section covers an introduction to Spark that is split up into eight parts. We discuss the Spark background, RDD operations, Shark, Spark ML, and Spark vs. other frameworks.

7.5.1.1 Motivation for Spark

In this section we discuss the background of Spark and the core components of Spark.

Spark 15:57 Spark A

7.5.1.2 Spark RDD Operations

In this section we discuss the background of RDD operations along with other transformation functionality in Spark.

Spark 12:17 Spark B

7.5.1.3 Spark DAG

In this section we discuss the background of DAG (directed acyclic graph) operations along with other components like Shark in the earlier stages of Spark.
Spark 10:37 Spark C

7.5.1.4 Spark vs. other Frameworks

In this section we discuss real-world applications that can be built using Spark. We also discuss some comparison results obtained from experiments done with Spark along with frameworks like Harp, Harp-DAAL, etc. We discuss the benchmarks and performance obtained from such experiments.

Spark 26:18 Spark D
7.5.2 Installation of Spark ☁

In this section we will discuss how to install Spark 2.3.2 on Ubuntu 18.04.

7.5.2.1 Prerequisites

We assume that you have ssh and rsync installed and use emacs as editor.

7.5.2.2 Installation of Java

First download Java 8.

Then add the environment variables to the .bashrc file.

Source the .bashrc file after adding the environment variables.
sudo apt-get install ssh
sudo apt-get install rsync
sudo apt-get install emacs

mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz"
tar xvzf jdk-8u161-linux-x64.tar.gz

emacs ~/.bashrc

export JAVA_HOME=~/cloudmesh/bin/jdk1.8.0_161
export PATH=$JAVA_HOME/bin:$PATH

source ~/.bashrc
7.5.2.3 Install Spark with Hadoop

Here we use Spark packaged with Hadoop. In this package Spark uses Hadoop 2.7.0. Note that in the Hadoop installation section we use Hadoop 3.1.1 for the vanilla Hadoop installation.

Create the base directories and go to the directory.

Then download Spark 2.3.2 as follows.

Now extract the file.

7.5.2.4 Spark Environment Variables

Open up the .bashrc file and add the environment variables as follows.

Go to the last line and add the following content.

Source the .bashrc file.

7.5.2.5 Test Spark Installation

Open up a new terminal and then run the following command.

If it has been configured properly, it will display the following content.
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar xzf spark-2.3.2-bin-hadoop2.7.tgz
mv spark-2.3.2-bin-hadoop2.7 spark-2.3.2

emacs ~/.bashrc

export SPARK_HOME=~/cloudmesh/bin/spark-2.3.2
export PATH=$SPARK_HOME/bin:$PATH

source ~/.bashrc
spark-shell
Spark context Web UI available at http://192.168.1.66:4041
Spark context available as 'sc' (master = local[*], app id = local-1521674331361).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

Please check the console LOGS and find the port number on which the Spark Web UI is hosted. It will show something like:

Spark context Web UI available at <some url>

Then take a look at the following address in the browser:

http://localhost:4041

If you see the Spark Dashboard, then you know that you have installed Spark successfully.
7.5.2.6 Install Spark With Custom Hadoop

Installing Spark with a pre-existing Hadoop version is favorable if you want to use the latest features of the latest Hadoop version, or when you need a specific Hadoop version because of external dependencies of your project.

First we need to download the Spark package without Hadoop.

Then download Spark 2.3.2 as follows.

Now extract the file.

Then add the environment variables.

If you have already installed Spark with Hadoop by following the previous section, please update the current SPARK_HOME variable with the new path.
mkdir -p ~/cloudmesh/bin
cd ~/cloudmesh/bin
wget https://archive.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-without-hadoop.tgz
tar xzf spark-2.3.2-bin-without-hadoop.tgz

emacs ~/.bashrc

Go to the last line and add the following content.

Source the .bashrc file.
7.5.2.7 Configuring Hadoop

Now we must add the current Hadoop version that we are using for Spark. Open up a new terminal and then run the following.

Now we need to add a new line pointing to the current hadoop installation. Add the following variable to the spark-env.sh file.

Spark Web UI - Hadoop Path

7.5.2.8 Test Spark Installation

Open up a new terminal and then run the following command.

If it has been configured properly, it will display the following content.
export SPARK_HOME=~/cloudmesh/bin/spark-2.3.2-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$PATH

source ~/.bashrc

cd $SPARK_HOME
cd conf
cp spark-env.sh.template spark-env.sh
emacs spark-env.sh

export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
spark-shell

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://149-160-230-133.dhcp-bl.indiana.edu:4040
Spark context available as 'sc' (master = local[*], app id = local-1521732740077).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

Then take a look at the following address in the browser:

http://localhost:4040

Please check the console LOGS and find the port number on which the Spark Web UI is hosted. It will show something like: Spark context Web UI available at <some url>.
7.5.3 Spark Streaming ☁

7.5.3.1 Streaming Concepts

Spark Streaming is one of the components extending Core Spark. Spark Streaming provides a scalable, fault-tolerant system with high throughput. For streaming data into Spark, there are many libraries like Kafka, Flume, Kinesis, etc.

7.5.3.2 Simple Streaming Example

In this section, we are going to focus on making a simple streaming application using the network on your computer. Here we are going to expose a particular port, and from that port we are going to continuously stream data entered by the user; the word count is calculated as the output.

First, create a Makefile.

Then add the following content to the Makefile.

Please add a tab when adding the corresponding command for a given instruction in the Makefile. In pdf mode the tab is not clearly shown.

Now we need to create a file called streaming.py.
mkdir -p ~/cloudmesh/spark/examples/streaming
cd ~/cloudmesh/spark/examples/streaming
emacs Makefile

SPARKHOME = ${SPARK_HOME}

run-streaming:
	${SPARKHOME}/bin/spark-submit streaming.py localhost 9999
Then add the following content.

To run the code, we need to open up two terminals.

Terminal 1: First use netcat to open up a port to start the communication.

Terminal 2: Now run the Spark program in the second terminal.

In this terminal you can see a script running that tries to read the stream coming from port 9999. You can enter text in Terminal 1, and these texts will be tokenized, the word count calculated, and the result shown in Terminal 2.
7.5.3.3 Spark Streaming For Twitter Data

In this section we are going to learn how to use Twitter data as the streaming data source and use Spark Streaming capabilities to process the data. As the first step, you must install the python packages using pip.
emacs streaming.py

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("Pyspark script logger initialized")
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start()                # Start the computation
ssc.awaitTermination(100)  # Wait for the computation to terminate
nc -lk 9999

make run-streaming

7.5.3.3.1 Step 1

sudo pip install tweepy

7.5.3.3.2 Step 2
Then you need to create an account in Twitter Apps. Go to the Twitter Apps site and sign in to your twitter account or create a new twitter account. Then you need to create a new application; let us name this application Cloudmesh-Spark-Streaming.

First you need to create an app with the app name we suggested in this section. The way to create the app is shown in Figure 42.

Figure 42: Create Twitter App

Next we need to take a look at the dashboard created for the app. You can see what your dashboard looks like in Figure 43.
Figure 43: Go To Twitter App Dashboard

Next, the generated application tokens must be reviewed; as can be seen in Figure 44, you need to go to the Keys and Access Tokens tab.

Figure 44: Create Your Twitter Settings

Now you need to generate the access tokens for the first time if you have not generated access tokens before; this can be done by clicking the Create my access token button. See Figure 45.

Figure 45: Create Your Twitter Access Tokens

The access tokens and keys are blurred in this section for privacy reasons.
7.5.3.3.3 Step 3

Let us build a simple Twitter app to see if everything is okay.

Add the following content to the file and make sure you update the corresponding token keys with your token values.

mkdir -p ~/cloudmesh/spark/streaming
cd ~/cloudmesh/spark/streaming
emacs twitterstreaming.py
import tweepy

CONSUMER_KEY = 'your_consumer_key'
CONSUMER_SECRET = 'your_consumer_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_TOKEN_SECRET = 'your_access_token_secret'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
status = "Testing!"
api.update_status(status=status)

Run the script to post the test status:

python twitterstreaming.py
7.5.3.3.4 Step 4

Let us start the twitter streaming exercise. We need to create a TweetListener in order to retrieve data from twitter regarding a topic of your choice. In this exercise, we have tried keywords like trump, indiana, messi.

Make sure to replace the strings related to the secret keys and IP addresses with values depending on your machine configuration and twitter keys.

Now add the following content.
mkdir -p ~/cloudmesh/spark/streaming
cd ~/cloudmesh/spark/streaming
emacs tweetlistener.py
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
import socket
import json

CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_SECRET = 'YOUR_SECRET_ACCESS'


class TweetListener(StreamListener):
    def __init__(self, csocket):
        self.client_socket = csocket

    def on_data(self, data):
        try:
            msg = json.loads(data)
            print(msg['text'].encode('utf-8'))
            self.client_socket.send(msg['text'].encode('utf-8'))
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True


def sendData(c_socket):
    auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    twitter_stream = Stream(auth, TweetListener(c_socket))
    twitter_stream.filter(track=['messi'])  # you can change this topic


if __name__ == "__main__":
    s = socket.socket()
    host = "YOUR_MACHINE_IP"
    port = 5555
    s.bind((host, port))
    print("Listening on port: %s" % str(port))
    s.listen(5)
    c, addr = s.accept()
    print("Received request from: " + str(addr))
    sendData(c)
7.5.3.3.5 Step 5

Please replace the local file paths mentioned in this code with a file path of your preference depending on your workstation. Also, the IP address must be replaced with your IP address. The log folder path must be created beforehand, and make sure to replace the registerTempTable name with respect to the entity that you are referring to. This will minimize the conflicts among different topics when you need to plot them in a simple manner.

Add the following content to the IPython Notebook as follows.

Open up a terminal:

Then in the browser the jupyter notebook is loaded. There, create a new IPython notebook called twittersparkstremer.
cd ~/cloudmesh/spark/streaming
jupyter notebook
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc

sc = SparkContext('local[2]', 'twittersparkstreamer')
ssc = StreamingContext(sc, 10)
sqlContext = SQLContext(sc)
ssc.checkpoint("file:///home/<your-username>/cloudmesh/spark/streaming/logs/messi")
socket_stream = ssc.socketTextStream("YOUR_IP_ADDRESS", 5555)
lines = socket_stream.window(20)

from collections import namedtuple
fields = ("tag", "count")
Tweet = namedtuple('Tweet', fields)

(lines.flatMap(lambda text: text.split(" "))
    .filter(lambda word: word.lower().startswith("#"))
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b)
    .map(lambda rec: Tweet(rec[0], rec[1]))
    .foreachRDD(lambda rdd: rdd.toDF().sort(desc("count"))
    .limit(10).registerTempTable("tweetsmessi")))  # change table name depending on your entity
7.5.3.3.6 Step 6

Open Terminal 1, then do the following:

cd ~/cloudmesh/spark/streaming
python tweetlistener.py

It will show:

Listening on port: 5555

Open Terminal 2. Now we must start the Spark app by running the content in the IPython Notebook, pressing SHIFT-ENTER in each box to run each command. Make sure not to run the starting command of the SparkContext or the initialization of the SparkContext twice.

Now you will see streams in Terminal 1, and after a while you can see plots in the IPython Notebook.

Sample outputs can be seen in Figure 46, Figure 47, Figure 48, and Figure 49.
sqlContext

<pyspark.sql.context.SQLContext at 0x7f51922ba350>

ssc.start()

import matplotlib.pyplot as plt
import seaborn as sn
import time
from IPython import display

count = 0
while count < 10:
    time.sleep(20)
    top_10_tweets = sqlContext.sql('Select tag, count from tweetsmessi')  # change table name according to your entity
    top_10_df = top_10_tweets.toPandas()
    display.clear_output(wait=True)
    # sn.figure(figsize=(10, 8))
    sn.barplot(x="count", y="tag", data=top_10_df)
    plt.show()
    count = count + 1

ssc.stop()
Figure 46: Twitter Topic Messi
Figure 47: Twitter Topic Messi
Figure 48: Twitter Topic Messi
Figure 49: Twitter Topic Messi
7.5.4 User Defined Functions in Spark ☁

Apache Spark is a fast and general cluster-computing framework which performs computational tasks up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, for high-speed large-scale streaming, machine learning and SQL workloads. Spark offers support for application development employing over 80 high-level operators using Java, Scala, Python, and R. Spark powers the combined or standalone use of a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Spark can be utilized in standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and it allows data access in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

User-defined functions (UDFs) are functions created by developers when the built-in functionalities offered in a programming language are not sufficient to do the required work. Similarly, Apache Spark UDFs allow developers to enable new functions in higher level programming languages by extending built-in functionalities. They also allow developers to experiment with a wide range of options for integrating UDFs with Spark SQL, MLlib and GraphX workflows.
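To make this concrete, here is a minimal sketch we added (not part of the original tutorial) showing how a plain Python function can be registered as a Spark SQL UDF and applied to a DataFrame column; the application name, the DataFrame contents, and the column names are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("UDFSketch").getOrCreate()

# hypothetical one-column DataFrame of Celsius readings
df = spark.createDataFrame([(0.0,), (25.0,), (100.0,)], ["celsius"])

# register a plain Python function as a Spark SQL UDF
to_fahrenheit = udf(lambda c: c * 9.0 / 5.0 + 32.0, DoubleType())

# apply the UDF column-wise, just like a built-in function
df.withColumn("fahrenheit", to_fahrenheit(df["celsius"])).show()

The longer example later in this section performs the same kind of conversion at the RDD level with a user-defined process_data function.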
This tutorial explains the following:

- How to install Spark on Linux, Windows and MacOS.
- How to create and utilize user defined functions (UDFs) in Spark using Python.
- How to run the provided example using the provided docker file and makefile.

7.5.4.1 Resources

https://spark.apache.org/
http://www.scala-lang.org/
https://docs.databricks.com/spark/latest/spark-sql/udf-in-python.html
7.5.4.2 Instructions for Spark installation

7.5.4.2.1 Linux

First, the JDK (recommended version 8) should be installed to a path where there is no space.

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Second, set up the environment variables for the JDK by adding the bin folder path to the user path variable.

Next, download and extract the Scala pre-built version from

http://www.scala-lang.org/download/

Then, set up the environment variables for Scala by adding the bin folder path to the user path variable.

Next, download and extract the Apache Spark pre-built version.

https://spark.apache.org/downloads.html

Then, set up the environment variables for spark by adding the bin folder path to the user path variable.
$ export PATH=$PATH:/usr/local/java8/bin
$ export PATH=$PATH:/usr/local/scala/bin
$ export PATH=$PATH:/usr/local/spark/bin

Finally, for testing the installation, please type the following command.

$ spark-shell
7.5.4.3Windows
First, JDK should be installed to a pathwhere there is no space in that path.RecommendedJAVAversionis8.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Second,setupenvironmentvariablesforjdkbyaddingbinfolderpathtotouserpathvariable.
Next,downloadandextractApacheSparkpre-builtversion.
https://spark.apache.org/downloads.html
Then,setupenvironmentvaribaleforsparkbyaddingbinfolderpathtotheuserpathvariable.
Next, download the winutils.exe binary and Save winutils.exe binary to adirectory(c:\hadoop\bin).
https://github.com/steveloughran/winutils
Then,changethewinutils.exepermissionusingfollowingcommandusingCMDwithadministratorpermission.
Ifyoursystemdoesnthavehivefolder,makesuretocreateC:\tmp\hivedirectory.
Next, setupenvironmentvariables forhadoopbyaddingbin folderpath to theuserpathvariable.
set JAVA_HOME=c:\java8
set PATH=%JAVA_HOME%\bin;%PATH%
set SPARK_HOME=c:\spark
set PATH=%SPARK_HOME%\bin;%PATH%

$ winutils.exe chmod -R 777 C:\tmp\hive
set HADOOP_HOME=c:\hadoop\bin
set PATH=%HADOOP_HOME%\bin;%PATH%

Then, install Python 3.6 with anaconda (this is a bundled python installer for pyspark).

https://anaconda.org/anaconda/python

Finally, for testing the installation, please type the following command.

$ pyspark
7.5.4.4MacOS
First, JDK should be installed to a pathwhere there is no space in that path.RecommandedJAVAversionis8.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Second,setupenvironmentvariablesforjdkbyadddingbinfolderpathtotouserpathvariable.
Next,InstallApacheSparkusingHomebrewwithfollowingcommands.
Then,setupenvironmentvaribaleforsparkwithfollowingcommands.
Next, install Python 3.6with anaconda (This is a bundled python installer forpyspark)
https://anaconda.org/anaconda/python
Finally,fortestingtheinstallation,pleasetypethefollowingcommand.
$ export JAVA_HOME=$(/usr/libexec/java_home)
$ brew update
$ brew install scala
$ brew install apache-spark
$ export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec/"
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
$ pyspark
7.5.4.5 Instructions for creating Spark User Defined Functions

7.5.4.5.1 Example: Temperature conversion

In this example we convert temperature data from Celsius to Fahrenheit with filtering and sorting.

7.5.4.5.1.1 Description of the dataset
The file temperature_data.csv contains temperature data of different weather stations and it has the following structure:

ITE00100554,18000101,TMAX,-75,,,E,
ITE00100554,18000101,TMIN,-148,,,E,
GM000010962,18000101,PRCP,0,,,E,
EZE00100082,18000101,TMAX,-86,,,E,
GM000010962,18000104,PRCP,0,,,E,
EZE00100082,18000104,TMAX,-55,,,E,

We will only consider the weather station ID (column 0), the entry type (column 2), and the temperature (column 3; it is stored as 10 times the Celsius value, i.e., in tenths of degrees Celsius).
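As a quick sanity check of this encoding, the following snippet we added mirrors the exact conversion used in the program below: the first TMAX value of -75 corresponds to -7.5 C, or 18.5 F.

raw = -75                                # column 3 value, tenths of degrees Celsius
celsius = raw * 0.1                      # -7.5 C
fahrenheit = celsius * 9.0 / 5.0 + 32.0  # same formula as in process_data below
print(fahrenheit)                        # 18.5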
7.5.4.5.1.2HowtowriteapythonprogramwithUDF
First, we need to import the relevent libraries to use Spark sql built infunctionalitieslistedasfollows.
Then,weneedcreateauserdefinedfuctionwhichwill read the text inputandprocessthedataandreturnasparksqlRowobject.Itcanbecreatedaslistedasfollows.
Then we need to create a Spark SQL session as listed as follows with anapplicationname.
Next, we read the raw data using spark build-in function textFile() as shown
from pyspark.sql import SparkSession
from pyspark.sql import Row

def process_data(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return Row(ID=stationID, t_type=entryType, temp=temperature)

spark = SparkSession.builder.appName("SimpleSparkSQLUDFexample").getOrCreate()
Then, we convert those read lines to a Resilient Distributed Dataset (RDD) of Row objects using the UDF (process_data) which we created, as listed as follows.

Alternatively, we could have written the UDF using a python lambda function to do the same thing, as shown next.

Now, we can convert our RDD object to a Spark SQL DataFrame, as listed as follows.

Next, we can print and see the first 20 rows of data to validate our work, as shown next.
7.5.4.5.1.3 How to execute a python spark script

You can use the spark-submit command to run a spark script as shown next.

If everything went well, you should see the following output.
lines = spark.sparkContext.textFile("temperature_data.csv")

parsedLines = lines.map(process_data)

parsedLines = lines.map(lambda line: Row(ID=line.split(',')[0],
                                         t_type=line.split(',')[2],
                                         temp=float(line.split(',')[3]) * 0.1 * (9.0 / 5.0) + 32.0))

TempDataset = spark.createDataFrame(parsedLines)

TempDataset.show()

spark-submit temperature_converter.py
+-----------+------+-----------------+
|         ID|t_type|             temp|
+-----------+------+-----------------+
|EZE00100082|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|            89.42|
|EZE00100082|  TMAX|            88.88|
|ITE00100554|  TMAX|            88.34|
|ITE00100554|  TMAX|87.80000000000001|
|ITE00100554|  TMAX|            87.62|
|ITE00100554|  TMAX|            87.62|
|EZE00100082|  TMAX|            87.26|
|EZE00100082|  TMAX|87.08000000000001|
|EZE00100082|  TMAX|87.08000000000001|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|EZE00100082|  TMAX|            86.72|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|            85.64|
|ITE00100554|  TMAX|            85.64|
+-----------+------+-----------------+
only showing top 20 rows
7.5.4.5.1.4 Filtering and sorting

Now we are trying to find the maximum temperature reported for a particular weather station and print the data in descending order. We can achieve this using the where() and orderBy() functions as shown next.

We achieve the filtering using the temperature type, which filters out all the data that is not a TMAX.

Finally, we can print the data to see whether this worked, using the following statement.

Now it is time to run the python script again using the following command.

If everything went well, you should see the following sorted and filtered output.

The complete python script is listed as follows, as well as under this directory (temperature_converter.py).
TempDatasetProcessed = TempDataset.where(TempDataset['t_type'] == 'TMAX'
                                         ).orderBy('temp', ascending=False).cache()

TempDatasetProcessed.show()

spark-submit temperature_converter.py
+-----------+------+-----------------+
|         ID|t_type|             temp|
+-----------+------+-----------------+
|EZE00100082|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|90.14000000000001|
|ITE00100554|  TMAX|            89.42|
|EZE00100082|  TMAX|            88.88|
|ITE00100554|  TMAX|            88.34|
|ITE00100554|  TMAX|87.80000000000001|
|ITE00100554|  TMAX|            87.62|
|ITE00100554|  TMAX|            87.62|
|EZE00100082|  TMAX|            87.26|
|EZE00100082|  TMAX|87.08000000000001|
|EZE00100082|  TMAX|87.08000000000001|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|ITE00100554|  TMAX|            86.72|
|EZE00100082|  TMAX|            86.72|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|             86.0|
|ITE00100554|  TMAX|            85.64|
|ITE00100554|  TMAX|            85.64|
+-----------+------+-----------------+
only showing top 20 rows
https://github.com/cloudmesh-community/hid-sp18-409/blob/master/tutorial/spark_udfs/temperature_converter.py
from pyspark.sql import SparkSession
from pyspark.sql import Row

def process_data(line):
    fields = line.split(',')
    stationID = fields[0]
    entryType = fields[2]
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return Row(ID=stationID, t_type=entryType, temp=temperature)

# Create a SparkSQL Session
spark = SparkSession.builder.appName('SimpleSparkSQLUDFexample'
                                     ).getOrCreate()

# Get the raw data
lines = spark.sparkContext.textFile('temperature_data.csv')

# Convert it to an RDD of Row objects
parsedLines = lines.map(process_data)

# alternative lambda function
parsedLines = lines.map(lambda line: Row(ID=line.split(',')[0],
                                         t_type=line.split(',')[2],
                                         temp=float(line.split(',')[3]) * 0.1 * (9.0 / 5.0) + 32.0))

# Convert that to a DataFrame
TempDataset = spark.createDataFrame(parsedLines)

# show the first 20 rows of the temperature converted data
# TempDataset.show()

# Filter to TMAX entries and sort by temperature in descending order
TempDatasetProcessed = TempDataset.where(TempDataset['t_type'] == 'TMAX'
                                         ).orderBy('temp', ascending=False).cache()

# show the first 20 rows of the filtered and sorted data
TempDatasetProcessed.show()
7.5.4.6 Instructions to install and run the example using docker

The following link is the home directory for the example explained in this tutorial:

https://github.com/cloudmesh-community/hid-sp18-409/tree/master/tutorial/spark_udfs

It contains the following files:

- Python script which contains the example: temperature_converter.py
- Temperature data file: temperature_data.csv
- Required python dependencies: requirements.txt
- Docker file which automatically sets up the codebase with dependency installation: Dockerfile
- Makefile which will execute the example with a single command: Makefile
To install the example using docker, please do the following steps.

First, you should install docker on your computer.

Next, git clone the project. Alternatively, you can also download the docker image from docker hub; then you do not need to do a docker build:

$ docker pull kadupitiya/tutorial

Then, change the directory to the spark_udfs folder.

Next, install the service using the following make command:

$ make docker-build

Finally, start the service using the following make command:

$ make docker-start

Now you should see the same output we saw at the end of the example explanation.
7.6 ADVANCED HADOOP

7.6.1 Amazon EMR (Elastic MapReduce) ☁

Amazon EMR facilitates analyzing and processing vast amounts of data by distributing the computational work across a cluster of virtual servers running in the AWS Cloud. The EMR cluster is managed using an open-source framework called Hadoop. Amazon EMR lets you focus on crunching or analyzing your data without having to worry about the time-consuming setup, management, and tuning of Hadoop clusters or the compute capacity they rely on, unlike other Hadoop distributors like Cloudera, Hortonworks, etc.

- Easy: to maintain on an on-demand basis
- Fast: auto shrinking of the cluster and dynamically increasing memory based on need
- Cost-effective: scale out and in anytime based on the business requirement
EMR supports other distributed frameworks such as Apache Spark, HBase, Presto, Flink, etc., and interacts with data in AWS data stores such as Amazon S3, DynamoDB, etc.

Components of EMR:

- Storage
- EC2 instances
- Clusters
- Security
- KMS
7.6.1.1 Why EMR?

The following are reasons given by Amazon for using EMR:

- Easy to use: launch a cluster in 5 to 10 minutes with as many nodes as you need
- Pay as you go: pay an hourly rate (with AWS's latest pricing model, customers can choose to pay by the minute)
- Flexible: easily add/remove capacity (auto scale out and in anytime)
- Reliable: spend less time on monitoring; built-in AWS tools can be utilized, which reduces overhead
- Secure: manage the firewall (VPC, both private and subnet)
7.6.1.2UnderstandingClustersandNodes
The component of Amazon EMR is the cluster. A cluster is a collection ofAmazonElasticComputeCloud(AmazonEC2)instances.Eachinstanceintheclusteriscalledanode.Eachnodehasarolewithinthecluster,referredtoasthenode type.Amazon EMR also installs different software components on eachnode type, giving each node a role in a distributed application like ApacheHadoop.
ThenodetypesinAmazonEMRareasfollows:
Master node: A node that manages the cluster by running softwarecomponents to coordinate the distribution of data and tasks among othernodes for processing. The master node tracks the status of tasks andmonitorsthehealthofthecluster.Everyclusterhasamasternode,anditispossibletocreateasingle-nodeclusterwithonlythemasternode.
Corenode:AnodewithsoftwarecomponentsthatruntasksandstoredataintheHadoopDistributedFileSystem(HDFS)onyourcluster.Multi-nodeclustershaveatleastonecorenode.
Tasknode:AnodewithsoftwarecomponentsthatonlyrunstasksanddoesnotstoredatainHDFS.Tasknodesareoptional.
Thefollowingdiagramrepresentsaclusterwithonemasternodeandfourcorenodes.
CluserandNodes
7.6.1.2.1 Submit Work to a Cluster

When you run a cluster on Amazon EMR, you have several options as to how you specify the work that needs to be done; a scripted sketch of the step-based option follows this list:

- Provide the entire definition of the work to be done in functions that you specify as steps when you create a cluster. This is typically done for clusters that process a set amount of data and then terminate when processing is complete.
- Create a long-running cluster and use the Amazon EMR console, the Amazon EMR API, or the AWS CLI to submit steps, which may contain one or more jobs.
- Create a cluster, connect to the master node and other nodes as required using SSH, and use the interfaces that the installed applications provide to perform tasks and submit queries, either scripted or interactively.
7.6.1.2.2 Processing Data

When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. To process data in your Amazon EMR cluster, you can submit jobs or queries directly to installed applications, or you can run steps in the cluster.

Submitting Jobs Directly to Applications:

You can submit jobs and interact directly with the software that is installed in your Amazon EMR cluster. To do this, you typically connect to the master node over a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For more information, see Connect to the Cluster.

Running Steps to Process Data:

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

The following is an example process using four steps:

1. Submit an input dataset for processing.
2. Process the output of the first step by using a Pig program.
3. Process a second input dataset by using a Hive program.
4. Write an output dataset.

Generally, when you process data in Amazon EMR, the input is data stored as files in your chosen underlying file system, such as Amazon S3 or HDFS. This data passes from one step to the next in the processing sequence. The final step writes the output data to a specified location, such as an Amazon S3 bucket.

Steps are run in the following sequence:

1. A request is submitted to begin processing steps.
2. The state of all steps is set to PENDING.
3. When the first step in the sequence starts, its state changes to RUNNING. The other steps remain in the PENDING state.
4. After the first step completes, its state changes to COMPLETED.
5. The next step in the sequence starts, and its state changes to RUNNING. When it completes, its state changes to COMPLETED.
6. This pattern repeats for each step until they all complete and processing ends.

The following diagram represents the step sequence and change of state for the steps as they are processed.

Cluster and Nodes

If a step fails during processing, its state changes to TERMINATED_WITH_ERRORS. You can determine what happens next for each step. By default, any remaining steps in the sequence are set to CANCELLED and do not run. You can also choose to ignore the failure and allow the remaining steps to proceed, or to terminate the cluster immediately.

The following diagram represents the step sequence and default change of state when a step fails during processing.

Cluster and Nodes
7.6.1.3AWSStorage
S3 - Cloud based storage - Using EMRFS can directly connects s3 storage -Accessiblefromanywhere
InstanceStore-Localstorage-DatawillbelostonstartandstopEC2instances
EBS-Networkattachedstorage-Datapreservedonstartandstop-AccessibleonlythroughEC2instances
7.6.1.4 Create EMR in AWS

7.6.1.4.1 Create the buckets

Log in to the AWS console at https://aws.amazon.com/console/ and create the buckets. To create the buckets, go to Services (see Figure 50, Figure 51) and click on S3 under Storage (Figure 52, Figure 53, Figure 54). Click on the Create bucket button and then provide all the details to complete the bucket creation. A scripted alternative is sketched after the figures.

Figure 50: AWS Console

Figure 51: AWS Login

Figure 52: Amazon Storage

Figure 53: S3 buckets

Figure 54: S3 buckets 1
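The bucket creation can also be scripted with the AWS SDK for Python instead of clicking through the console; here is a minimal sketch we added, where the region and the bucket name are placeholders (bucket names must be globally unique).

import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # assumed region
# replace the placeholder with your own globally unique bucket name
s3.create_bucket(Bucket="bigdata-example-project-logs")
print("bucket created")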
7.6.1.4.2CreateKeyPairs
LogintoAWSconsole,gotoservices,clickonEC2undercompute.SelecttheKeypairsresoure,clickonCreateKeyPairandprovideKeyPairnametocompletetheKeypairscreation.SeeFigure55
Download the.pemfileonceKeyvaluepair iscreated.This isneeded toaccess AWS Hadoop environment from client machine. This need to beimportedinPuttytoaccessyourAWSenvironemnt.SeeFigure56
7.6.1.4.2.1CreateKeyValuePairScreenshots
Figure55:AMSKeyValuePair
Figure56:AMSKeyValuePair1
7.6.1.5 Create Step Execution - Hadoop Job

Log in to the AWS console, go to Services and then select EMR. Click on Create Cluster. The cluster configuration provides details to complete the step execution creation. See Figure 57, Figure 58, Figure 59, Figure 60, Figure 61.

- Cluster name (Example: HadoopJobStepExecutionCluster)
- Select the Logging check box and provide the S3 folder location (Example: s3://bigdata-raviAndOrlyiuproject/logs/)
- Select launch mode as Step execution
- Select the step type and complete the step configuration
- Complete the Software Configuration
- Complete the Hardware Configuration
- Complete Security and access
- Then click on the Create cluster button

Once the job has started, if there are no errors the output file will be created in the output directory.

7.6.1.5.0.1 Screenshots

Figure 57: AWS EMR

Figure 58: AWS Create EMR

Figure 59: AWS Config EMR

Figure 60: AWS Create Cluster

Figure 61: AWS Create Cluster 1
7.6.1.6CreateaHiveCluster
Login toAWS console, go to services and then select EMR.Click onCreateCluster.Theclusterconfigurationprovidesdetails tocomplete.See,Figure62,Figure63,Figure64
Clustername(Example:MyFirstCluster-Hive)SelectLoggingcheckboxselectedandprovideS3folderlocationSelectlaunchmodeasClusterComplete software configuration (select hive application) and click oncreatecluster
7.6.1.6.1CreateaHiveCluster-Screenshots
Figure62:HiveCluser
Figure63:HiveCluser1
Figure64:HiveCluser2
7.6.1.7 Create a Spark Cluster

Log in to the AWS console, go to Services and then select EMR. Click on Create Cluster. The cluster configuration provides details to complete. See Figure 65, Figure 66, Figure 67.

- Cluster name (Example: MyCluster-Spark)
- Select the Logging check box and provide the S3 folder location
- Select launch mode as Cluster
- Complete the software configuration and click on Create cluster
- Select the application as Spark

7.6.1.7.1 Create a Spark Cluster - Screenshots

Figure 65: Spark Cluster

Figure 66: Spark Cluster

Figure 67: Spark Cluster
7.6.2 Twister2 ☁

7.6.2.1 Introduction

Twister2 [57] provides a data analytics hosting environment that supports different data analytics tasks, including streaming, data pipelines and iterative computations. The functionality of Twister2 is similar to other big data frameworks such as Apache Spark and Apache Flink, but there are a few key differences which differentiate Twister2 from other frameworks. Unlike many other big data systems that are designed around user APIs, Twister2 is built from the bottom up to support different APIs and workloads. The aim of Twister2 is to develop a complete computing environment for data analytics.

One major goal of Twister2 is to provide independent components that can be used by other big data systems and evolve separately. To this end Twister2 supports a composable architecture where developers can easily replace a small component in the system with a new implementation. For example, the resource scheduling layer has several implementations: it supports Kubernetes, Mesos, Slurm, Nomad and a standalone implementation. If a user wants to add support for another resource scheduler such as Yarn, they can easily do so by implementing the well-defined interfaces.

Twister2 supports both batch and streaming applications. Unlike other big data frameworks, which either support batch or streaming in the core and develop the other on top of that, Twister2 natively supports both batch and streaming, which allows Twister2 to make separate optimizations for each type.

The Twister2 project is still less than 2 years old and in its early stages, going through rapid development to complete its functionality. It is an open source project which is licensed under Apache 2.0 [58].
7.6.2.2 Twister2 APIs

Twister2 provides users with 3 levels of APIs which can be used to write applications. The 3 API levels are shown in Figure 68.

Figure 68: Twister2 APIs

As shown in Figure 68, each API level has a different level of abstraction and programming complexity. The TSet API is the most high-level API in Twister2, which in some ways is similar to the RDD API in Apache Spark or the DataSet API in Apache Flink. If users want more control over the application development, they can opt to use one of the lower level APIs.

7.6.2.2.1 TSet API
The TSet API is the most abstract API provided by Twister2. It allows users to develop their programs at the data layer, similar to the programming model of Apache Spark. Similar to RDDs in Spark, users can perform operations on top of TSet objects, which will be automatically parallelized by the framework. To get a slight understanding of the TSet API, take a look at the abstract example given on how the TSet API can be used to implement the KMeans algorithm.

public class KMeansJob extends TaskWorker {
  //......
  @Override
  public void execute() {
    //.....
    TSet<double[][]> points = TSetBuilder.newBuilder(config).createSource(new Source<double[][]>() {
      // Code for source function to read data points
    }).cache();

    TSet<double[][]> centroids = TSetBuilder.newBuilder(config).createSource(new Source<double[][]>() {
      // Code for source function to read centers (or generate random centers)
    }).cache();

    for (int i = 0; i < iterations; i++) {
      TSet<double[][]> KmeansTSet = points.map(new MapFunction<double[][], double[][]>() {
        // Code for the KMeans calculation; this will have access to the centroids which are passed in
      });
      KmeansTSet.addInput("centroids", centroids);
      Link<double[][]> allReduced = KmeansTSet.allReduce();
      TSet<double[][]> newCentroids = allReduced.map(new MapFunction<double[][], Object>() {
        /* Code that produces the new centers for the next iteration. The allReduce will result in
           a sum of all the centers sent by each worker, so this map function simply needs to compute
           the average to get the new centers */
      });
      centroids.override(newCentroids);
    }
    //.....
  }
}
When programming at the TSet API level the developer does not need to handle any information related to tasks and communications.

Note: the TSet API is currently under development and has not been released yet; therefore the API may change from what was discussed in this section. Anyone who is interested can follow the development progress or contribute to the project through the GitHub repo [58].
7.6.2.2.2 Task API

The Task API allows developers to create their application at the task level. The developer is responsible for managing task-level details when developing at this API level; the upside of using the Task API is that it is more flexible than the TSet API, so it allows developers to add custom optimizations to the application code. The TSet API is built on top of the Task API, therefore the added layer of abstraction is bound to add slightly more overhead to the runtime, which you might be able to avoid by directly coding at the Task API level.

To get a better understanding of the Task API, take a look at how the classic map-reduce problem, word count, is implemented using the Task API in the following code segment. This is only a portion of the example code; you can find the complete code for the example at [59].
public class WordCountJob extends TaskWorker {
  //.....
  @Override
  public void execute() {
    // source and aggregator
    WordSource source = new WordSource();
    WordAggregator counter = new WordAggregator();
    // build the task graph
    TaskGraphBuilder builder = TaskGraphBuilder.newBuilder(config);
    builder.addSource("word-source", source, 4);
    builder.addSink("word-aggregator", counter, 4).keyedReduce("word-source", EDGE,
        new ReduceFn(Op.SUM, DataType.INTEGER), DataType.OBJECT, DataType.INTEGER);
    builder.setMode(OperationMode.BATCH);
    // execute the graph
    DataFlowTaskGraph graph = builder.build();
    ExecutionPlan plan = taskExecutor.plan(graph);
    taskExecutor.execute(graph, plan);
  }
  //.....
}

More Task API examples can be found in the Twister2 documentation [60].
7.6.2.3 Operator API

The lowest level API provided by Twister2 is the Operator API; it allows developers to develop applications at the communication level. However, since this API only abstracts out communication operations, details such as task management need to be handled by the application developer. Again, similar to the Task API, this provides the developer with more flexibility to create more optimized applications, at the cost of being harder to program. Twister2 supports a variety of communication patterns, known as collective communications in the HPC world. These communications are highly optimized using various routing patterns to reduce the number of communication calls that go through the network, to provide users with an extremely efficient Operator API. The following list shows the communication operations that are supported by Twister2; you can find more information on each of these operations in the Twister2 documentation [61]. A small sketch of the semantics of the first two operators follows the list.

- Reduce
- Gather
- AllReduce
- AllGather
- Partition
- Broadcast
- KeyedReduce
- KeyedPartition
- KeyedGather
Initial performance comparisons that are discussed in [62] show how Twister2 outperforms popular frameworks such as Apache Flink, Apache Spark and Apache Storm in many areas. For example, Figure 69 shows a comparison between the Twister2, MPI and Apache Spark versions of the KMeans algorithm; please note that the graph is in logarithmic scale.

Figure 69: KMeans Performance Comparison [63]

Notation: DFW refers to Twister2; BSP refers to MPI (OpenMPI).

This shows that Twister2 performs around ~10x faster than Apache Spark for KMeans, and that it is on par with implementations done using OpenMPI, which is a widely used HPC framework.
7.6.2.3.1 Resources
http://www.iterativemapreduce.org/
http://www.cs.allegheny.edu/sites/amohan/teaching/CMPSC441/paper10.pdf
https://twister2.gitbook.io/twister2/
http://dsc.soic.indiana.edu/publications/Twister2.pdf
https://www.computer.org/csdl/proceedings/cloud/2018/7235/00/723501a383-abs.html
7.6.3 Twister2 Installation ☁

7.6.3.1 Prerequisites

Because Twister2 is still in the early stages of development, a binary release is not available as of yet; therefore, to try out Twister2, users need to first build the binaries from the source code.

- Operating System: Twister2 is tested and known to work on Red Hat Enterprise Linux Server release 7 and on Ubuntu 14.04, Ubuntu 16.10 and Ubuntu 18.10
- Java (JDK 1.8): covered in Section [s:hadoop-local-installation]
- G++ compiler: sudo apt-get install g++
- Maven installation: explained in Section Maven
- OpenMPI installation: explained in Section OpenMPI
- Bazel build installation: explained in Section Bazel
- Additional libraries: explained in Section TwisterExtra
7.6.3.1.1 Maven Installation

Execute the following commands to install Maven locally:

mkdir -p ~/cloudmesh/bin/maven
cd ~/cloudmesh/bin/maven
wget http://mirrors.ibiblio.org/apache/maven/maven-3/3.5.2/binaries/apache-maven-3.5.2-bin.tar.gz
tar xzf apache-maven-3.5.2-bin.tar.gz

Add the environment variables:

emacs ~/.bashrc

Add the following lines at the end of the file:

MAVEN_HOME=~/cloudmesh/bin/maven/apache-maven-3.5.2
PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_HOME PATH

source ~/.bashrc
7.6.3.1.2 OpenMPI Installation

When you compile Twister2, it will automatically download and compile OpenMPI 3.1.2 with it. If you do not like this version of OpenMPI and want to use your own version, you can compile OpenMPI using the following instructions.

- We recommend using OpenMPI 3.1.2.
- Download OpenMPI 3.1.2 from https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
- Extract the archive to a folder named openmpi-3.1.2.
- Also create a directory named build in some location. We will use this to install OpenMPI.
- Set the following environment variables:

BUILD=<path-to-build-directory>
OMPI_312=<path-to-openmpi-3.1.2-directory>
PATH=$BUILD/bin:$PATH
LD_LIBRARY_PATH=$BUILD/lib:$LD_LIBRARY_PATH
export BUILD OMPI_312 PATH LD_LIBRARY_PATH

The instructions to build OpenMPI depend on the platform. Therefore, we highly recommend looking into the $OMPI_312/INSTALL file. Platform-specific build files are available in the $OMPI_312/contrib/platform directory.

In general, please specify --prefix=$BUILD and --enable-mpi-java as arguments to the configure script. If Infiniband is available (highly recommended), specify --with-verbs=<path-to-verbs-installation>. Usually, the path to the verbs installation is /usr. In summary, the following commands will build OpenMPI for a Linux system:

cd $OMPI_312
./configure --prefix=$BUILD --enable-mpi-java
make -j 8; make install

If everything goes well, mpirun --version will show mpirun (Open MPI) 3.1.2. Execute the following command to install $OMPI_312/ompi/mpi/java/java/mpi.jar as a Maven artifact:

mvn install:install-file -DcreateChecksum=true -Dpackaging=jar -Dfile=$OMPI_312/ompi/mpi/java/java/mpi.jar -DgroupId=ompi -DartifactId=ompijavabinding -Dversion=3.1.2
7.6.3.1.3 Install Extras

Install the other requirements as follows:
sudo apt-get install g++ git build-essential automake cmake libtool-bin zip libunwind-setjmp0-dev zlib1g-dev unzip pkg-config python-setuptools -y
sudo apt-get install python-dev python-pip

Now you have successfully installed the required packages. Let us compile Twister2.
7.6.3.1.4 Compiling Twister2

Now let us get a clone of the source code:

git clone https://github.com/DSC-SPIDAL/twister2.git
cd twister2

You can compile the Twister2 distribution by using the bazel target as follows:

bazel build --config=ubuntu scripts/package:tarpkgs

This will build the twister2 distribution in the file:

bazel-bin/scripts/package/twister2-client-0.1.0.tar.gz

If you would like to compile twister2 without building the distribution packages, use the command:

bazel build --config=ubuntu twister2/...

For compiling a specific target such as communications:

bazel build --config=ubuntu twister2/comms/src/java:comms-java

7.6.3.1.5 Twister2 Distribution

After you have built the Twister2 distribution, you can extract it and use it to submit jobs:

cd bazel-bin/scripts/package/
tar -xvf twister2-0.1.0.tar.gz
7.6.4 Twister2 Examples ☁

The Twister2 documentation lists several examples [64] that users can leverage to better understand the Twister2 APIs. Currently there are several
Communication API examples and Task API examples available in the Twister2 documentation. In this section we will go through how an example can be executed with Twister2.
7.6.4.1 Submitting a Job

In order to run an example, users need to submit the example to Twister2 using the twister2 command. This command is found inside the bin directory of the distribution.

Here is a description of the command:

twister2 submit cluster job-type job-file-name job-class-name [job-args]

- submit is the command to execute
- cluster: which resource manager to use, i.e. standalone or kubernetes; this should be the name of the configuration directory for that particular resource manager
- job-type: at the moment we only support jar
- job-file-name: the file path of the job file (the jar file)
- job-class-name: name of the job class with a main method to execute

Here is an example command:

./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.task.ExampleTaskMain -itr 80 -workers 4 -size 1000 -op

In this command, the cluster is standalone and it has program arguments.

For this exercise we are using the standalone mode to submit a job. However, Twister2 does support the Kubernetes, Mesos, Slurm and Nomad resource schedulers if users want to submit jobs to larger cluster deployments.
7.6.4.2 Batch WordCount Example

In this section we will run a batch word count example from Twister2. This example only uses the communication layer and the resource scheduling layer. The threads are managed by the user program.
The example code can be found in:

twister2/examples/src/java/edu/iu/dsc/tws/examples/basic/batch/wordcount/

When we install Twister2, it will compile the examples. Let us go to the installation directory and run the example:

cd bazel-bin/scripts/package/twister2-dist/
./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.batch.wordcount.WordCountJob

This will run 4 executors with 8 tasks, so each executor will have two tasks. In the first phase, the tasks 0-3 running in each executor will generate words, and after they are finished, tasks 5-8 will consume those words and create a count.
7.6.5 HADOOP RDMA ☁

Acknowledgement: this section was copied and modified with permission from https://www.chameleoncloud.org/appliances/17/docs/

In Chameleon cloud it is possible to launch a virtual Hadoop cluster on bare-metal InfiniBand nodes with SR-IOV.

The CentOS 7 SR-IOV RDMA-Hadoop appliance is based on a CentOS 7 virtual machine image, a VM startup script and a Hadoop cluster launch script, so that users can launch VMs with SR-IOV in order to run RDMA-Hadoop across these VMs on SR-IOV enabled InfiniBand clusters.

- Image name: CC-CentOS7-RDMA-Hadoop
- Default user account: cc
- Remote access: key-based SSH
- Root access: passwordless sudo from the cc account
- Chameleon admin access: enabled on the ccadmin account
- Cloud-init enabled on boot: yes
- Repositories (Yum): EPEL, RDO (OpenStack)
- Installed packages: rebuilt kernel to enable IOMMU, Mellanox SR-IOV drivers for InfiniBand, KVM hypervisor, standard development tools such as make, gcc, gfortran, etc., config management tools: Puppet, Ansible, Salt
- OpenStack command-line clients
- Included VM image name: chameleon-rdma-hadoop-appliance.qcow2
- Included VM startup script: start-vm.sh
- Included Hadoop cluster launch script: launch-hadoop-cluster.sh
- Default VM root password: nowlab

We refer to the Chameleon cloud bare-metal user guide for documentation on how to reserve and provision resources using the CC-CentOS7-RDMA-Hadoop appliance (⚠ link missing).
7.6.5.1 Launching a Virtual Hadoop Cluster on Bare-metal InfiniBand Nodes with SR-IOV on Chameleon

We provide a CentOS 7 VM image (chameleon-rdma-hadoop-appliance.qcow2) and a Hadoop cluster launch script (launch-hadoop-cluster.sh) to facilitate users setting up virtual Hadoop clusters effortlessly.

First, launch bare-metal nodes using the RDMA-Hadoop appliance and select one of the nodes as the bootstrap node. This node will serve as the host for the master node of the Hadoop cluster and will also be used to set up the entire cluster. Now, ssh to this node. Before you can launch the cluster, you have to download your OpenStack credentials file (see how to download your credentials file). Then, create a file (henceforth referred to as ips-file) with the IP addresses of the bare-metal nodes you want to launch your Hadoop cluster on (excluding the bootstrap node), each on a new line. Next, run these commands as root:

```bash
[root@host]$ cd /home/cc
[root@host]$ ./launch-hadoop-cluster.sh <num-of-vms-per-node> <num-of-MB-per-VM> <num-of-cores-per-VM> <ips-file> <openstack-credentials-file>
```

The launch cluster script will launch VMs for you, then install and configure Hadoop on these VMs. Note that when you launch the cluster for the first time, a lot of initialization is required. Depending on the size of your cluster, it may take some time to set up the cluster. After the cluster setup is complete, the script will print an output telling you that the cluster is set up and how you can connect to the Hadoop master node. Note that the minimum required memory for each VM is 8,192 MB. The Hadoop cluster will already be set up for use. For more details on how to use the RDMA-Hadoop package to run jobs, please refer to its user guide.
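As a concrete illustration, the following sketch launches a small cluster; the node IP addresses and the file names ips.txt and openrc.sh are hypothetical placeholders, and the numeric arguments simply satisfy the 8,192 MB minimum mentioned previously.

```bash
# Hypothetical example: two VMs per node, 8192 MB and 4 cores per VM
[root@host]$ cat > ips.txt <<EOF
10.140.81.11
10.140.81.12
EOF
[root@host]$ ./launch-hadoop-cluster.sh 2 8192 4 ips.txt openrc.sh
```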
7.6.5.2 Launching Virtual Machines Manually

We provide a CentOS 7 VM image (chameleon-rdma-hadoop-appliance.qcow2) and a VM startup script (start-vm.sh) to facilitate users launching VMs manually. Before you can launch a VM, you have to create a network port. To do this, source your OpenStack credentials file (see how to download your credentials file) and run this command:

```bash
[user@host]$ neutron port-create sharednet1
```

Note the MAC address and IP address in the output of this command. You should use this MAC address while launching a VM and the IP address to ssh to the VM. You also need the PCI device ID of the virtual function that you want to assign to the VM. This can be obtained by running "lspci | grep Mellanox" and looking for the device ID (with format - XX:XX.X) of one of the virtual functions as shown next:

```bash
[cc@host]$ lspci | grep Mellanox
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
03:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
...
```

The PCI device ID of the Virtual Function is 03:00:1 in the previous example.

Now, you can launch a VM on your instance with SR-IOV using the provided VM startup script and the corresponding arguments as follows with the root account:

```bash
[root@host]$ ./start-vm.sh <vm-mac> <vm-ifname> <virtual-function-device-id>
```

Please note that <vm-mac> and <virtual-function-device-id> are the ones you get from the outputs of the previous commands, and <vm-ifname> is the name of the VM virtual NIC interface. For example:

```bash
[root@host]$ ./start-vm.sh fa:16:3e:47:48:00 tap0 03:00:1
```

You can also edit the corresponding fields in the VM startup script to change the number of cores, memory size, etc.

You should now have a VM running on your bare-metal instance. If you want to run more VMs on your instance, you will have to create more network ports. You will also have to change the name of the VM virtual NIC interface to different ones (like tap1, tap2, etc.) and select different device IDs of virtual functions.
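The following sketch shows how that might look for two additional VMs; the MAC addresses and virtual-function device IDs are placeholders that you would take from the neutron and lspci outputs shown previously.

```bash
# Create one extra port per additional VM and note each printed MAC/IP
[user@host]$ neutron port-create sharednet1
[user@host]$ neutron port-create sharednet1

# Start each VM with its own MAC, a unique tap interface, and an unused VF
[root@host]$ ./start-vm.sh fa:16:3e:47:48:01 tap1 03:00:2
[root@host]$ ./start-vm.sh fa:16:3e:47:48:02 tap2 03:00:3
```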
7.6.5.3 Extra Initialization when Launching Virtual Machines

In order to run RDMA-Hadoop across VMs with SR-IOV, and to keep the size of the VM image small, extra initialization will be executed automatically when launching VMs, which includes:

- Detect Mellanox SR-IOV drivers; download and install them if non-existent
- Detect the Java package; download and install it if non-existent
- Detect the RDMA-Hadoop package; download and install it if non-existent

After finishing the extra initialization procedure, you should be able to run Hadoop jobs with SR-IOV support across VMs. Note that this initialization will be done automatically. For more details about the RDMA-Hadoop package, please refer to its user guide.
7.6.5.4 Important Note for Tearing Down Virtual Machines and Deleting Network Ports

Once you are done with your experiments, you should kill all the launched VMs and delete the created network ports. If you used the launch-hadoop-cluster.sh script to launch VMs, you can do this by running the kill-vms.sh script as shown next. This script will kill all launched VMs and also delete all the created network ports.

```bash
[root@host]$ cd /home/cc
[root@host]$ ./kill-vms.sh <ips-file> <openstack-credentials-file>
```

Please note that it is important to delete unused ports after experiments.

If you launched VMs using the start-vm.sh script, you should first manually kill all the VMs. Then, delete all the created network ports using this command:

```bash
[user@host]$ neutron port-delete PORT
```
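If you no longer remember the IDs of the ports you created, a sketch like the following can help; it assumes the standard neutron CLI, and you should review the listed ports before deleting any of them.

```bash
# List all ports to find the IDs of the ones you created
[user@host]$ neutron port-list

# Delete each of your ports by ID (placeholder ID shown)
[user@host]$ neutron port-delete 5a1b2c3d-...
```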
8 CONTAINER

8.1 INTRODUCTION TO CONTAINERS ☁

Learning Objectives

- Knowing what a container is.
- Differentiating containers from virtual machines.
- Understanding the historical aspects that led to containers.

This section covers an introduction to containers that is split up into four parts. We discuss microservices, serverless computing, Docker, and Kubernetes.
8.1.1 Motivation - Microservices

We discuss the motivation for containers and contrast them to virtual machines. Additionally we provide a motivation for containers as they can be used for microservices.

Container 11:01 Container A

8.1.2 Motivation - Serverless Computing

We enhance our motivation while contrasting containers and microservices and relating them to serverless computing. We anticipate that serverless computing will increase in importance over the next years.

Container 15:08 Container B

8.1.3 Docker

In order for us to use containers, we go beyond the historical motivation that was introduced in a previous section and focus on Docker, a predominant technology for containers on Windows, Linux, and macOS.

Container 40:09 Container C

8.1.4 Docker and Kubernetes

We continue our discussion about Docker and introduce Kubernetes, which allows us to run multiple containers on multiple servers, building a cluster of containers.

Container 50:14 Container D
8.2 DOCKER

8.2.1 Introduction to Docker ☁

Docker is the company driving the container movement and the only container platform provider to address every application across the hybrid cloud. Today's businesses are under pressure to digitally transform but are constrained by existing applications and infrastructure while rationalizing an increasingly diverse portfolio of clouds, datacenters, and application architectures. Docker enables true independence between applications and infrastructure and between developers and IT ops, unlocking their potential and creating a model for better collaboration and innovation. An overview of Docker is provided at

https://docs.docker.com/engine/docker-overview/

Figure 70: Docker Containers [Image Source] [65]

Figure 70 shows how Docker containers fit into the system.

Docker Platform

Docker provides users and developers with the tools and technologies that are needed to manage their application development using containers. Developers can easily set up different environments for development, testing, and production.
8.2.1.1 Docker Engine

The Docker engine can be thought of as the core of the Docker runtime. The Docker engine mainly provides three services. Figure 71 shows how the Docker engine is composed.

- A long-running server which manages the containers
- A REST API
- A command line interface

Figure 71: Docker Engine Component Flow [Image Source] [65]

8.2.1.2 Docker Architecture

The main concept of the Docker architecture is based on the simple client-server model. Docker clients communicate with the Docker server, also known as the Docker daemon, to request various resources and services. The daemon manages all the background tasks that need to be performed to complete client requests. Managing and distributing containers, running the containers, building containers, etc. are responsibilities of the Docker daemon. Figure 72 shows how the Docker architecture is set up. The client module and server can run either on the same machine or on separate machines. In the latter case the communication between client and server is done through the network.

Figure 72: Docker Architecture [Image Source] [65]
8.2.1.3 Docker Survey

In 2016 Docker Inc. surveyed over 500 Docker developers and operations experts in various phases of deploying container-based technologies. The result is available in the Docker Survey 2016, as seen in Figure 73.

https://www.docker.com/survey-2016

Figure 73: Docker Survey Results 2016 [Image Source] [65]
8.2.2 Running Docker Locally ☁

⚠ Please verify if the instructions are still up to date. Rapid changes could mean they can be outdated quickly. Also we assume the Ubuntu installations may have changed and may be different between 18.04 and 19.04.

The official installation documentation for Docker can be found by visiting the following Web page:

https://www.docker.com/community-edition

Here you will find a variety of packages, one of which will hopefully be suitable for your operating system. The supported operating systems currently include:

OSX, Windows, Centos, Debian, Fedora, Ubuntu, AWS, Azure

Please choose the one most suitable for you. For your convenience we provide you with installation instructions for OSX (Section Docker on OSX), Windows 10 (Section Docker on Windows), and Ubuntu (Section Docker on Ubuntu).
8.2.2.1 Installation for OSX

The Docker community edition for OSX can be found at the following link:

https://store.docker.com/editions/community/docker-ce-desktop-mac

We recommend that at this time you get the version Docker CE for Mac (stable):

https://download.docker.com/mac/stable/Docker.dmg

Clicking on the link will download a dmg file to your machine, which you then will need to install by double clicking and allowing access to the dmg file. Upon installation a whale in the top status bar shows that Docker is running, and you can access it via a terminal.

Docker integrated in the menu bar on OSX
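To confirm that the installation succeeded, a quick check from the terminal is sufficient; the version numbers shown will naturally differ from installation to installation.

```bash
# Verify that the docker client is on the PATH and the daemon responds
local$ docker --version
local$ docker info
```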
8.2.2.2 Installation for Ubuntu

In order to install the Docker community edition for Ubuntu, you first have to register the repository from where you can download it. This can be achieved as follows:

```bash
local$ sudo apt-get update
local$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
local$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
local$ sudo apt-key fingerprint 0EBFCD88
local$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
```

Now that you have configured the repository location, you can install it after you have updated the operating system. The update and install is done as follows:

```bash
local$ sudo apt-get update
local$ sudo apt-get install docker-ce
```

Once installed, execute the following command to make sure the installation is done properly:

```bash
local$ sudo systemctl status docker
```

This should give you an output similar to the next:

```
docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-10-03 13:02:04 EDT; 15min ago
     Docs: https://docs.docker.com
 Main PID: 6663 (dockerd)
    Tasks: 39
```
8.2.2.3 Installation for Windows 10

Docker needs Microsoft's Hyper-V to be enabled, but note that enabling Hyper-V will impact running virtual machines with other hypervisors such as VirtualBox.

Steps to Install

- Download Docker for Windows (Community Edition) from the following link: https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe
- Follow the wizard steps in the installer
- Launch Docker. Docker usually launches automatically during Windows startup.

8.2.2.4 Testing the Install

To test if it works, execute the following command in a terminal:

```bash
local$ docker version
```

You should see an output similar to:

```
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Tue Mar 28 00:40:02 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Fri Mar 24 00:00:50 2017
 OS/Arch:      linux/amd64
 Experimental: true
```

To see if you can run a container, use

```bash
local$ docker run hello-world
```

Once executed you should see an output similar to:

```
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
78445dd45222: Pull complete
Digest: sha256:c5515758d4c5e1e838e9cd307f6c6a.....
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs
    the executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which
    sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 local$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://cloud.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/
```
8.2.3 Dockerfile ☁

In order for us to build containers, we need to know what is in the container and how to create an image representing a container. To do this, a convenient specification format called Dockerfile can be used. Once a Dockerfile is created, we can build images from it.

We showcase the use of a Dockerfile on a simple example using a REST service.

This example is copied from the official Docker documentation hosted at

https://docs.docker.com/get-started/part2/#publish-the-image

8.2.3.1 Specification

It is best to start with an empty directory in which we create a Dockerfile:

```bash
local$ mkdir ~/cloudmesh/docker
local$ cd ~/cloudmesh/docker
```
Next, we create an empty file called Dockerfile:

```bash
local$ touch Dockerfile
```

We copy the following contents into the Dockerfile and after that create a simple REST service:

```
# Use an official Python runtime as a parent image
FROM python:3.7-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Make port 80 available
EXPOSE 80

# Run app.py when the container launches
CMD ["python", "app.py"]
```

We also create a requirements.txt file that we need for installing the necessary Python packages:

```
Flask
```

The example application we use here is a student info served via a RESTful service implemented using Python flask. It is stored in the file app.py:

```python
from flask import Flask, jsonify
import os

app = Flask(__name__)

@app.route('/student/albert')
def alberts_information():
    data = {
        'firstname': 'Albert',
        'lastname': 'Zweistsein',
        'university': 'Indiana University',
        'email': '[email protected]'
    }
    return jsonify(**data)

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=80)
```

To build the container, we can use the following command:

```bash
local$ docker build -t students .
```

To run the service, open a new window and cd into the directory where your code is located. Now say

```bash
local$ docker run -d -p 4000:80 students
```

Your Docker container will run and you can visit it by using the command

```bash
local$ curl http://localhost:4000/student/albert
```

To stop the container do a

```bash
local$ docker ps
```

and locate the id of the container, e.g., 2a19776ab812, and then run this

```bash
local$ docker stop 2a19776ab812
```

while the number is the container id.

To delete the Docker container image, you must first stop all instances using it and then remove the image. You can see the images with the command

```bash
local$ docker images
```

Then you can locate all containers using that image while looking in the IMAGE column, or using a simple fgrep in case you have many images. Stop the containers using that image, and then you can say

```bash
local$ docker rm 74b9b994c9bd
```

Once you killed all containers using that image, you can remove the image with the rmi command:

```bash
local$ docker rmi 8b3246425402
```

8.2.3.2 References

The reference documentation about Dockerfiles can be found at

https://docs.docker.com/engine/reference/builder/
8.2.4 Docker Hub ☁

Docker Hub is a cloud-based registry service which provides a "centralized resource for container image discovery, distribution and change management, user and team collaboration, and workflow automation throughout the development pipeline" [65]. There are both private and public repositories. A private repository can only be used by people within their own organization.
Docker Hub is integrated into Docker as the default registry. This means that the docker pull command will initialize the download automatically from Docker Hub [66]. It allows users to download (pull), build, test, and store their images for easy deployment on any host they may have [65].

8.2.4.1 Create Docker ID and Log In

A log-in is not necessary for pulling Docker images from the Hub, but it is necessary for pushing images to Docker Hub for sharing. Thus to store images on Docker Hub you need to create an account by visiting the Docker Hub Web page. Docker Hub offers in general a free account, but it has restrictions. The free account allows you to share images that you distribute publicly, but it only allows one private Docker Hub repository. In case you need more, you will need to upgrade to a paid plan.

For the rest of the tutorial we assume that you use the environment variable DOCKERHUB to indicate your username. It is easiest if you set it in your shell with

```bash
local$ export DOCKERHUB=<PUT YOUR DOCKER USERNAME HERE>
```

8.2.4.2 Searching for Docker Images

There are two ways to search for Docker images on Docker Hub:

One way is to use the Docker command line tool. We can open a terminal and run the docker search command. For example, the following command searches for centOS images:

```bash
local$ sudo docker search centos
```

You will see output similar to:

```
NAME             DESCRIPTION         STARS   OFFICIAL   AUTOMATED
centos           Official CentOS     4130    [OK]
ansible/centos7  Ansible on Centos7  105                [OK]
...
```
If you do not want to use sudo with the docker command each time, you need to add the current user to the docker group. You can do that using the following commands:

```bash
local$ sudo usermod -aG docker ${USER}
local$ su - ${USER}
```

This will prompt you to enter the password for the current user. Now you should be able to execute the previous command without using sudo.

Official repositories in Docker Hub are public, certified repositories from vendors and contributors to Docker. They contain Docker images from vendors like Canonical, Oracle, and Red Hat that you can use as the basis to build your applications and services. There is one official repository in this list, the first one, centos.

The other way is to search via the Web Search Box at the top of the Docker web page by typing the keyword. The search results can be sorted by number of stars, number of pulls, and whether it is an official image. Then for each search result, you can verify the information of the image by clicking the details button to make sure this is the right image that fits your needs.
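If you are only interested in certified images, the command line search can be narrowed with a filter; this is a small sketch using the standard --filter option of docker search.

```bash
# Show only official images for the keyword centos
local$ docker search --filter is-official=true centos
```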
8.2.4.3 Pulling Images

A particular image (take centos as an example) can be pulled using the following command:

```bash
local$ docker pull centos
```

Tags can be used to specify the image to pull. By default the tag is latest; therefore the previous command is the same as the following:

```bash
local$ docker pull centos:latest
```

You can use a different tag:

```bash
local$ docker pull centos:6
```

To check the existing local Docker images, run the following command:

```bash
local$ docker images
```

The results show:

```
REPOSITORY   TAG      IMAGE ID       CREATED       SIZE
centos       latest   26cb1244b171   2 weeks ago   195MB
centos       6        2d194b392dd1   2 weeks ago   195MB
```
8.2.4.4 Create Repositories

In order to push images to Docker Hub, you need to have an account and create a repository.

When you first create a Docker Hub user, you see a Get started with Docker Hub screen, from which you can click directly into Create Repository. You can also use the Create menu to Create Repository. When creating a new repository, you can choose to put it in your Docker ID namespace, or that of any organization that you are in the owners team [67].

As an example, we created a repository cloudtechnology with the namespace $DOCKERHUB (here DOCKERHUB is your Docker Hub username). Hence the full name is $DOCKERHUB/cloudtechnology.

8.2.4.5 Pushing Images

To push an image to the repository created, the following steps can be followed.

First, log into Docker Hub from the command line by specifying the username. If you encounter permission issues please use sudo in front of the command:

```bash
$ docker login --username=$DOCKERHUB
```

Enter the password when prompted. If everything worked you will get a message similar to:

```
Login Succeeded
```

Second, check the image ID using:

```bash
$ docker images
```

The result looks similar to:

```
REPOSITORY      TAG      IMAGE ID       CREATED       SIZE
cloudmesh-nlp   latest   1f26a5f7a1b4   10 days ago   1.79GB
centos          latest   26cb1244b171   2 weeks ago   195MB
centos          latest   2d194b392dd1   2 weeks ago   195MB
```

Here, the image with ID 1f26a5f7a1b4 is the one to push to Docker Hub. You can choose another image instead if you like.

Third, tag the image:

```bash
$ docker tag 1f26a5f7a1b4 $DOCKERHUB/cloudmesh:v1.0
```

Here we have used a version number as a tag. However, another good way of adding a tag is to use a keyword/tag that will help you understand what this container should be used in conjunction with, or what it represents.

Fourth, now the list of images will look something like:

```
REPOSITORY             TAG      IMAGE ID       CREATED    SIZE
cloudmesh-nlp          latest   1f26a5f7a1b4   10 d ago   1.79GB
$DOCKERHUB/cloudmesh   v1.0     1f26a5f7a1b4   10 d ago   1.79GB
centos                 latest   26cb1244b171   2 w ago    195MB
centos                 latest   2d194b392dd1   2 w ago    195MB
```

Fifth, now you can see an image under the name $DOCKERHUB/cloudmesh. We now need to push this image to the repository that we created on the Docker Hub website. For that, execute the following command:

```bash
$ docker push $DOCKERHUB/cloudmesh
```

It shows something similar to:

```
The push refers to repository [docker.io/$DOCKERHUB/cloudmesh]
18f9479cfc2c: Pushed
e9ddee98220b: Pushed
...
db584c622b50: Mounted from library/ubuntu
a94e0d5a7c40: Mounted from library/ubuntu
...
v1.0: digest: sha256:305b0f911077d9d6aab4b447b... size: 3463
```

To make sure, you can check on Docker Hub if the image that was pushed is listed in the repository that we created.

Sixth, now the image is available on Docker Hub. Everyone can pull it, since it is a public repository, by using the command:

```bash
$ docker pull USERNAME/cloudmesh
```

Please remember that USERNAME is the username of the user that makes this image publicly available. If you are the user, you will see the value being the one from $DOCKERHUB; if not, you will see here the username of the user uploading the image.

8.2.4.6 Resources

- The official Overview of Docker Hub [65]
- Information about using Docker repositories can be found at Repositories on Docker Hub [67]
- How to Use Docker Hub [66]
- Docker Tutorial Series [68]
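To double check that a push actually succeeded, one option is to remove the local copy and pull the tagged image back; a small sketch follows, where the tag v1.0 matches the example shown previously.

```bash
# Remove the local tag, then pull it back from Docker Hub
$ docker rmi $DOCKERHUB/cloudmesh:v1.0
$ docker pull $DOCKERHUB/cloudmesh:v1.0
```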
8.3 DOCKER AS PAAS

8.3.1 Docker Swarm ☁

A swarm is a group of machines that are running Docker and are joined into a cluster. Docker commands are executed on a cluster by a swarm manager. The machines in a swarm can be physical or virtual. After joining a swarm, they are referred to as nodes.

8.3.1.1 Terminology

In this section, if a command is prefixed with local$, it means the command is to be executed on your local machine. If it is prefixed with either master$ or worker$, that means the command is to be executed from within a virtual machine that was created.
8.3.1.2 Creating a Docker Swarm Cluster

A swarm is made up of multiple nodes, which can be either physical or virtual machines. We use master as the name of the host that is run as master and worker-1 as a host run as a worker, where the number indicates the i-th worker. The basic steps are:

1. run

```bash
master$ docker swarm init
```

to enable swarm mode and make your current machine a swarm manager,

2. then run

```bash
worker-1$ docker swarm join
```

on other machines to have them join the swarm as workers. Choose one of the contexts described next to see how this plays out in practice. We use VMs to quickly create a two-machine cluster and turn it into a swarm.

8.3.1.3 Create a Swarm Cluster with VirtualBox

In case you do not have access to multiple physical machines, you can create a virtual cluster on your machine with the help of VirtualBox. Instead of using vagrant, we can use the built-in docker-machine command to start several virtual machines.

If you do not have VirtualBox installed on your machine, install it. Additionally you require docker-machine to be installed on your local machine. To install docker-machine, please follow the instructions in the Docker documentation at Install Docker Machine.

To create the virtual machines you can use the command as follows:

```bash
local$ docker-machine create --driver virtualbox master
local$ docker-machine create --driver virtualbox worker-1
```

To list the VMs and get their IP addresses, use this command:

```bash
local$ docker-machine ls
```
8.3.1.4 Initialize the Swarm Manager Node and Add Worker Nodes

The first machine acts as the manager, which executes management commands and authenticates workers to join the swarm, and the second is a worker.

To instruct the first VM to become the master, first we need to log into the VM that was named master. To log in you can use ssh; execute the following command on your local machine:

```bash
local$ docker-machine ssh master
```

Now, since we are inside the master VM, we can configure this VM as the Docker swarm manager. Execute the following command within the master VM to initialize the swarm:

```bash
master$ docker swarm init
```

If you get an error stating something similar to "could not choose an IP address to advertise since this system has multiple addresses on different interfaces", use the following command instead. To find the IP address, execute the command ifconfig and pick the IP address which is most similar to 192.x.x.x.

```bash
master$ docker swarm init --advertise-addr 192.x.x.x
```

The output will look like this, where IP-myvm1 is the IP address of the first VM:

```
Swarm initialized: current node (p6hmohoeuggtwqj8xz91zbs5t) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-5c3anju1pwx94054r3vx0v7j4obyuggfu2cmesnx 192.168.99.100:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

Now that we have the Docker swarm manager up, we can add worker machines to the swarm. The command that is printed in the output shown previously can be used to join workers to the manager. Please note that you need to use the output command that is generated when you run docker swarm init, since the token values will be different.
Now we need to use a separate shell to log into the worker VM that we created. Open up a new shell (or terminal) and use the following command to ssh into the worker:

```bash
local$ docker-machine ssh worker-1
```

Once you are in the worker, execute the following command to join the worker to the swarm manager:

```bash
worker-1$ docker swarm join --token SWMTKN-1-5c3anju1pwx94054r3vx0v7j4obyuggfu2cmesnx 192.168.99.100:2377
```

The generic version of the command would be as follows; you need to fill in the correct values for the values marked as '<>' to execute the command:

```bash
worker-1$ docker swarm join --token <token> <myvm ip>:<port>
```

You will see an output stating that this machine joined the docker swarm:

```
This node joined a swarm as a worker.
```

If you want to add another node as a manager to the current swarm, you can execute the following command and follow the instructions. However, this is not needed for this exercise.

```bash
newvm$ docker swarm join-token manager
```

Run docker-machine ls to verify that worker-1 is now the active machine, as indicated by the asterisk next to it:

```bash
local$ docker-machine ls
```

If the asterisk is not present, execute the following command:

```bash
local$ sudo sh -c 'eval "$(docker-machine env worker-1)"; docker-machine ls'
```

The output will look similar to:

```
NAME       ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER        ERRORS
master     -        virtualbox   Running   tcp://192.168.99.100:2376           v18.06.1-ce
worker-1   *        virtualbox   Running   tcp://192.168.99.102:2376           v18.06.1-ce
```
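You can also verify the swarm membership from the manager side; the following sketch assumes you are logged into the master VM.

```bash
# List all nodes in the swarm; the manager is marked as Leader
master$ docker node ls
```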
8.3.1.5 Deploy the application on the swarm manager

Now we can try to deploy a test application. First we need to create a Docker configuration file, which we will name docker-compose.yml. Since we are in the VM we need to create the file using the terminal. Follow the steps given next to create and save the file. First log into the master:

```bash
local$ docker-machine ssh master
```

Then,

```bash
master$ vi docker-compose.yml
```

This command will open an editor. Press the Insert key to enable editing and then copy paste the following into the document:

```yaml
version: "3"
services:
  web:
    # replace username/repo:tag with your name and image details
    image: username/repo:tag
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
      restart_policy:
        condition: on-failure
    ports:
      - "4000:80"
    networks:
      - webnet
networks:
  webnet:
```

Then press the Esc key and enter :wq to save and close the editor.

Once we have the file we can deploy the test application using the following command, which will be executed on the master:

```bash
master$ docker stack deploy -c docker-compose.yml getstartedlab
```

To verify that the services and associated containers have been distributed between both master and worker, execute the following command:

```bash
master$ docker stack ps getstartedlab
```

The output will look similar to:

```
ID            NAME                 IMAGE              NODE      DESIRED STATE  CURRENT STATE            ERROR  PORTS
wpqtkv69qbee  getstartedlab_web.1  username/repo:tag  worker-1  Running        Preparing 4 seconds ago
whkiecyenuv0  getstartedlab_web.2  username/repo:tag  master    Running        Preparing 4 seconds ago
13obecvxohh1  getstartedlab_web.3  username/repo:tag  worker-1  Running        Preparing 5 seconds ago
76srj0nflagi  getstartedlab_web.4  username/repo:tag  worker-1  Running        Preparing 5 seconds ago
ymqoonad5c1f  getstartedlab_web.5  username/repo:tag  master    Running        Preparing 5 seconds ago
```
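If you later want more or fewer replicas, a sketch of the two common options follows: either change the replicas value in docker-compose.yml and redeploy, or scale the running service directly; the service name getstartedlab_web matches the example stack shown previously.

```bash
# Option 1: edit replicas in docker-compose.yml, then redeploy the stack
master$ docker stack deploy -c docker-compose.yml getstartedlab

# Option 2: scale the running service directly
master$ docker service scale getstartedlab_web=10
```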
8.3.2 Docker and Docker Swarm on FutureSystems ☁

This section is only for IU students that take classes with us.

This section introduces how to run Docker containers on FutureSystems. Currently we have deployed Docker swarm on Echo.

8.3.2.1 Getting Access

You will need an account on FutureSystems and be enrolled in an active project. To verify, try to see if you can log into victor.futuresystems.org. You need to be a member of a valid FutureSystems project and have submitted an ssh public key via the FutureSystems portal.

For Fall 2018 classes at IU you need to be in the following project:

https://portal.futuresystems.org/project/553
If your access to the victor host has been verified, try to login to the dockerswarmheadnode.ToconvenientlydothisletusdefinesomeLinuxenvironmentvariablestosimplifytheaccessandthematerialpresentedhere.Youcanplacethemeveninyour.bashrcor.bash_profilesotheinformationgetspopulatedwheneveryoustartanewterminal.Ifyoudirectlyedit thefilesmakesure toexecute thesourcecommandtorefreshtheenvironmentvariablesforthecurrentsessionusingsource.bashrcorsource.bash_profile.Oryoucanclosethecurrentshellandreopenanewone.
NowyoucanusethetwovariablesthatweresettologintotheEchoserer,usingthefollowingcommand
local$exportECHO=149.165.150.76
local$exportFS_USER=<putyourfutersystemaccountnamehere>
Note: If you have access to india but not the docker swarm system, yourprojectmaynothavebeenauthorizedtoaccess thedockerswarmcluster.SendatickettoFutureSystemsticketsystemtorequestthis.
Onceloggedintothedockerswarmheadnode,trytorun:
toverifydockerrunworks.
8.3.2.2 Creating a service and deploying it to the swarm cluster

While docker run can start a container and you may even attach to its console, the recommended way to use a Docker swarm cluster is to create a service and have it run on the swarm cluster. The service will be scheduled to one or more of the nodes of the swarm cluster, based on the configuration. It is also easy to scale up the service when more swarm nodes are available. Docker swarm really makes it easier for service/application developers to focus on functionality development and not worry about how and where to bind the service to some resources/servers. The deployment, access, and scaling up/down when necessary are all managed transparently, thus achieving the new paradigm of serverless computing.

As an example, the following command creates a service and deploys it to the swarm cluster; if the port is in use, the port 9001 used in the command can be changed to an available port:

```bash
echo$ docker service create --name notebook_test -p 9001:8888 \
    jupyter/datascience-notebook start-notebook.sh \
    --NotebookApp.password=NOTEBOOK_PASS_HASH
```

The NOTEBOOK_PASS_HASH can be generated in Python:

```python
>>> import IPython
>>> IPython.lib.passwd("YOUR_SELECTED_PASSWORD")
'sha1:52679cadb4c9:6762e266af44f86f3d170ca1......'
```

Then pass in the string starting with 'sha1:...'.

The command pulls a published image from the Docker cloud, starts a container, and runs a script to start the service inside the container with the necessary parameters.
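If you prefer generating the hash without entering the interactive interpreter, a one-line variant from the shell works as well; this is a sketch using the same IPython.lib.passwd helper.

```bash
# Print the sha1:... hash for a chosen password (placeholder shown)
echo$ python -c 'import IPython; print(IPython.lib.passwd("YOUR_SELECTED_PASSWORD"))'
```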
The option "-p 9001:8888" maps the service port inside the container (8888) to an external port of the cluster node (9001) so the service can be accessed from the Internet. In this example, you can then visit the URL:

```bash
local$ open http://$ECHO:9001
```

to access the Jupyter notebook. Use the password you specified when you created the service to log in.

Please note the service will be dynamically deployed to a container instance, which will be allocated to a swarm node based on the allocation policy. Docker makes this process transparent to the user and even creates mesh routing, so you can access the service using the IP address of the management head node of the swarm cluster, no matter which actual physical node the service was deployed to.

This also implies that the external port number used has to be free at the time the service is created.

Some useful related commands:

```bash
echo$ docker service ls
```

lists the currently running services.

```bash
echo$ docker service ps notebook_test
```

lists the detailed info of the container where the service is running.

```bash
echo$ docker node ps NODE
```

lists all the running containers of a node.

```bash
echo$ docker node ls
```

lists all the nodes in the swarm cluster.

To stop the service and the container:

```bash
echo$ docker service rm notebook_test
```

8.3.2.3 Create your own service
You can create your own service and run it. To do so, start from a base image, e.g., an ubuntu image from the Docker cloud. Then you could:

- Run a container from the image and attach to its console to develop the service, and create a new image from the changed instance using the command 'docker commit' (a sketch of this approach follows at the end of this section).
- Create a dockerfile, which has the step-by-step building process of the service, and then build an image from it.

In reality, the first approach is probably useful when you are in the phase of developing and debugging your application/service. Once you have the step-by-step instructions developed, the latter approach is the recommended way.

Publish the image to the Docker cloud by following this documentation:

https://docs.docker.com/docker-cloud/builds/push-images/

Please make sure no sensitive information is included in the image to be published. Alternatively you could publish the image internally to the swarm cluster.
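As a minimal sketch of the commit-based workflow, assuming a hypothetical image name myservice under your $DOCKERHUB namespace:

```bash
# Develop interactively inside a base container
echo$ docker run -it ubuntu bash
# ... install and configure your service, then exit ...

# Find the stopped container's id and freeze it into an image
echo$ docker ps -a
echo$ docker commit <container-id> $DOCKERHUB/myservice:v0.1

# Publish so the swarm nodes can pull it
echo$ docker push $DOCKERHUB/myservice:v0.1
```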
8.3.2.4 Publish an image privately within the swarm cluster

Once the image is published and available to the swarm cluster, you can start a new service from the image similar to the Jupyter Notebook example.

8.3.2.5 Exercises

E.Docker.Futuresystems.1:

Obtain an account on FutureSystems.

E.Docker.Futuresystems.2:

Create a REST service with swagger codegen and run it on the Echo cloud (see the example in this section).
8.3.3 Hadoop with Docker ☁

In this section we will explore the Map/Reduce framework using Hadoop provided through a Docker container.

We will showcase the functionality on a small example that calculates minimum, maximum, average, and standard deviation values using several input files which contain float numbers.

This section is based on the Hadoop release 3.1.1, which includes significant enhancements over the previous version of Hadoop 2.x. Changes include the use of the following software:

- CentOS 7
- systemctl
- Java SE Development Kit 8

A Dockerfile to create the Hadoop deployment is available at

https://github.com/cloudmesh-community/book/blob/master/examples/docker/hadoop/3.1.1/Dockerfile
8.3.3.1 Building Hadoop using Docker

You can build hadoop from the Dockerfile as follows:

```bash
$ mkdir cloudmesh-community
$ cd cloudmesh-community
$ git clone https://github.com/cloudmesh-community/book.git
$ cd book/examples/docker/hadoop/3.1.1
$ docker build -t cloudmesh/hadoop:3.1.1 .
```

The complete Docker image for Hadoop consumes 1.5GB:

```bash
$ docker images
REPOSITORY         TAG     IMAGE ID       CREATED      SIZE
cloudmesh/hadoop   3.1.1   ba2c51f94348   1 hour ago   1.52GB
```

To use the image interactively you can start the container as follows:

```bash
$ docker run -it cloudmesh/hadoop:3.1.1 /etc/bootstrap.sh -bash
```

It may take a few minutes at first to download the image.
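Once you are inside the container, a quick sanity check such as the following confirms that the Hadoop tooling is on the PATH; the version string printed should match the 3.1.1 release used here.

```bash
# Inside the container: print the Hadoop version
$ hadoop version
```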
8.3.3.2 Hadoop Configuration Files

The configuration files are included in the conf folder.

8.3.3.3 Virtual Memory Limit

In case you need more memory, you can increase it by changing the parameters in the file mapred-site.xml, for example:

- mapreduce.map.memory.mb to 4096
- mapreduce.reduce.memory.mb to 8192
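For illustration, such a change corresponds to property entries like the following inside the <configuration> element of mapred-site.xml; this is only a sketch of the fragment, and the file path inside the container is an assumption based on standard Hadoop layouts.

```bash
# Illustrative fragment for mapred-site.xml
# (path assumed: $HADOOP_HOME/etc/hadoop/mapred-site.xml)
cat <<'EOF'
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
EOF
```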
8.3.3.4 hdfs Safemode leave command

Safemode for HDFS is a read-only mode for the HDFS cluster, where it does not allow any modifications of files and blocks. The Namenode disables safemode automatically after starting up normally. If required, HDFS can be forced to leave safemode explicitly with this command:

```bash
$ hdfs dfsadmin -safemode leave
```
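Before forcing the Namenode out of safemode it can be useful to check the current state; the companion subcommand for that is sketched next.

```bash
# Query whether HDFS is currently in safemode
$ hdfs dfsadmin -safemode get
```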
8.3.3.5 Examples

We included a statistics and a PageRank example in the container. The examples are also available in github at

https://github.com/cloudmesh-community/book/tree/master/examples/docker/hadoop/3.1.1/examples

We explain the examples next.

8.3.3.5.1 Statistical Example with Hadoop

After we launch the container and use the interactive shell, we can run the statistics Hadoop application, which calculates the minimum, maximum, average, and standard deviation from values stored in a number of input files. Figure 74 shows the computing phases in a MapReduce job.
To achieve this, this Hadoop program reads multiple files from HDFS and provides the calculated values. We walk through every step from compiling Java source code to reading an output file from HDFS. The idea of this exercise is to get you started with Hadoop and the MapReduce concept. You may have seen the WordCount example from the Hadoop official website or documentation; this example has the same functions (Map/Reduce), except that you will be computing basic statistics such as min, max, average, and standard deviation of a given data set.

The input to the program will be text file(s) carrying exactly one floating point number per line. The result file includes the min, max, average, and standard deviation.

Figure 74: MapReduce example in Docker
8.3.3.5.1.1 Base Location

The example is available within the container at:

```bash
container$ cd /cloudmesh/examples/statistics
```

8.3.3.5.1.2 Input Files

Test input files are available under the /cloudmesh/examples/statistics/input_data directory inside the container. The statistics values for this input are Min: 0.20, Max: 19.99, Avg: 9.51, StdDev: 5.55 for all input files.

The 10 files contain 55000 lines to process, and each line is a random float point value ranging from 0.2 to 20.0.

8.3.3.5.1.3 Compilation

The source code file name is MinMaxAvgStd.java, which is available at /cloudmesh/examples/statistics/src.
There are three functions in the code: Map, Reduce, and Main. Map reads each line of a file and updates the values to calculate the minimum and maximum, and Reduce collects the mapper results to produce the average and standard deviation values at last.

```bash
$ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
$ mkdir /cloudmesh/examples/statistics/dest
$ javac -classpath $HADOOP_CLASSPATH -d /cloudmesh/examples/statistics/dest /cloudmesh/examples/statistics/src/MinMaxAvgStd.java
```

These commands simply prepare the compilation of the example code, and the compiled class files are generated at the dest location.

8.3.3.5.1.4 Archiving Class Files

The jar command tool helps archive the classes in a single file, which will be used when Hadoop runs this example. This is useful because a jar file contains all necessary files to run a program:

```bash
$ cd /cloudmesh/examples/statistics
$ jar -cvf stats.jar -C ./dest/ .
```

8.3.3.5.1.5 HDFS for Input/Output

The input files need to be uploaded to HDFS, as Hadoop runs this example by reading input files from HDFS:

```bash
$ export PATH=$PATH:$HADOOP_HOME/bin
$ hadoop fs -mkdir stats_input
$ hadoop fs -put input_data/* stats_input
$ hadoop fs -ls stats_input/
```

If uploading is completed, you may see file listings like:

```
Found 10 items
-rw-r--r--   1 root supergroup      13942 2018-02-28 23:16 stats_input/data_1000.txt
-rw-r--r--   1 root supergroup     139225 2018-02-28 23:16 stats_input/data_10000.txt
-rw-r--r--   1 root supergroup      27868 2018-02-28 23:16 stats_input/data_2000.txt
-rw-r--r--   1 root supergroup      41793 2018-02-28 23:16 stats_input/data_3000.txt
-rw-r--r--   1 root supergroup      55699 2018-02-28 23:16 stats_input/data_4000.txt
-rw-r--r--   1 root supergroup      69663 2018-02-28 23:16 stats_input/data_5000.txt
-rw-r--r--   1 root supergroup      83614 2018-02-28 23:16 stats_input/data_6000.txt
-rw-r--r--   1 root supergroup      97490 2018-02-28 23:16 stats_input/data_7000.txt
-rw-r--r--   1 root supergroup     111451 2018-02-28 23:16 stats_input/data_8000.txt
-rw-r--r--   1 root supergroup     125337 2018-02-28 23:16 stats_input/data_9000.txt
```
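To peek at the uploaded data, you can read a few lines back from HDFS; this is just a quick check, and piping -cat into head may print a harmless broken-pipe warning.

```bash
# Show the first five values of one input file stored in HDFS
$ hadoop fs -cat stats_input/data_1000.txt | head -5
```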
8.3.3.5.1.6 Run Program with a Single Input File

We are ready to run the program to calculate values from the text files. First, we simply run the program with a single input file to see how it works. data_1000.txt contains 1000 lines of floats, so we use this file here:

```bash
$ hadoop jar stats.jar exercise.MinMaxAvgStd stats_input/data_1000.txt stats_output_1000
```

The command runs with input parameters which indicate a jar file (the program, stats.jar), exercise.MinMaxAvgStd (package name.class name), an input file path (stats_input/data_1000.txt), and an output file path (stats_output_1000).

The sample results that the program produces look like this; the second line of the logs indicates that the number of input files is 1:
```
18/02/28 23:48:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/28 23:48:50 INFO input.FileInputFormat: Total input paths to process : 1
18/02/28 23:48:50 INFO mapreduce.JobSubmitter: number of splits:1
18/02/28 23:48:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1519877569596_0002
18/02/28 23:48:51 INFO impl.YarnClientImpl: Submitted application application_1519877569596_0002
18/02/28 23:48:51 INFO mapreduce.Job: The url to track the job: http://f5e82d68ba4a:8088/proxy/application_1519877569596_0002/
18/02/28 23:48:51 INFO mapreduce.Job: Running job: job_1519877569596_0002
18/02/28 23:48:56 INFO mapreduce.Job: Job job_1519877569596_0002 running in uber mode : false
18/02/28 23:48:56 INFO mapreduce.Job:  map 0% reduce 0%
18/02/28 23:49:00 INFO mapreduce.Job:  map 100% reduce 0%
18/02/28 23:49:05 INFO mapreduce.Job:  map 100% reduce 100%
18/02/28 23:49:05 INFO mapreduce.Job: Job job_1519877569596_0002 completed successfully
18/02/28 23:49:05 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=81789
        FILE: Number of bytes written=394101
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=14067
        HDFS: Number of bytes written=86
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2107
        Total time spent by all reduces in occupied slots (ms)=2316
        Total time spent by all map tasks (ms)=2107
        Total time spent by all reduce tasks (ms)=2316
        Total vcore-seconds taken by all map tasks=2107
        Total vcore-seconds taken by all reduce tasks=2316
        Total megabyte-seconds taken by all map tasks=2157568
        Total megabyte-seconds taken by all reduce tasks=2371584
    Map-Reduce Framework
        Map input records=1000
        Map output records=3000
        Map output bytes=75783
        Map output materialized bytes=81789
        Input split bytes=125
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=81789
        Reduce input records=3000
        Reduce output records=4
        Spilled Records=6000
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=31
        CPU time spent (ms)=1440
        Physical memory (bytes) snapshot=434913280
        Virtual memory (bytes) snapshot=1497260032
        Total committed heap usage (bytes)=402653184
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=13942
    File Output Format Counters
        Bytes Written=86
```
8.3.3.5.1.7 Result for a Single Input File

We read the results from HDFS by:

```bash
$ hadoop fs -cat stats_output_1000/part-r-00000
```

The sample output looks like:

```
Max: 19.9678704297
Min: 0.218880718983
Avg: 10.225467263249385
Std: 5.679809322880863
```

8.3.3.5.1.8 Run Program with Multiple Input Files

The first run completed pretty quickly (it took 1,440 milliseconds according to the previous sample result) because the input file size was small (1,000 lines) and it was a single file. We provide more input files with larger sizes (2,000 to 10,000 lines). The input files are already uploaded to HDFS. We simply run the program again with a slight change in the parameters:

```bash
$ hadoop jar stats.jar exercise.MinMaxAvgStd stats_input/ stats_output_all
```

The command is almost the same, except that the input path is a directory and a new output directory is used. Note that every time you run this program, the output directory will be created, which means that you have to provide a new directory name unless you delete it.

The sample output messages look like the following, which is almost identical to the previous run except that this time the number of input files to process is 10; see the second line next:
```
18/02/28 23:17:18 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/28 23:17:18 INFO input.FileInputFormat: Total input paths to process : 10
18/02/28 23:17:18 INFO mapreduce.JobSubmitter: number of splits:10
18/02/28 23:17:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1519877569596_0001
18/02/28 23:17:19 INFO impl.YarnClientImpl: Submitted application application_1519877569596_0001
18/02/28 23:17:19 INFO mapreduce.Job: The url to track the job: http://f5e82d68ba4a:8088/proxy/application_1519877569596_0001/
18/02/28 23:17:19 INFO mapreduce.Job: Running job: job_1519877569596_0001
18/02/28 23:17:24 INFO mapreduce.Job: Job job_1519877569596_0001 running in uber mode : false
18/02/28 23:17:24 INFO mapreduce.Job:  map 0% reduce 0%
18/02/28 23:17:32 INFO mapreduce.Job:  map 40% reduce 0%
18/02/28 23:17:33 INFO mapreduce.Job:  map 60% reduce 0%
18/02/28 23:17:36 INFO mapreduce.Job:  map 70% reduce 0%
18/02/28 23:17:37 INFO mapreduce.Job:  map 100% reduce 0%
18/02/28 23:17:39 INFO mapreduce.Job:  map 100% reduce 100%
18/02/28 23:17:39 INFO mapreduce.Job: Job job_1519877569596_0001 completed successfully
18/02/28 23:17:39 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=4496318
        FILE: Number of bytes written=10260627
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=767333
        HDFS: Number of bytes written=84
        HDFS: Number of read operations=33
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=50866
        Total time spent by all reduces in occupied slots (ms)=4490
        Total time spent by all map tasks (ms)=50866
        Total time spent by all reduce tasks (ms)=4490
        Total vcore-seconds taken by all map tasks=50866
        Total vcore-seconds taken by all reduce tasks=4490
        Total megabyte-seconds taken by all map tasks=52086784
        Total megabyte-seconds taken by all reduce tasks=4597760
    Map-Reduce Framework
        Map input records=55000
        Map output records=165000
        Map output bytes=4166312
        Map output materialized bytes=4496372
        Input split bytes=1251
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=4496372
        Reduce input records=165000
        Reduce output records=4
        Spilled Records=330000
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=555
        CPU time spent (ms)=16040
        Physical memory (bytes) snapshot=2837708800
        Virtual memory (bytes) snapshot=8200089600
        Total committed heap usage (bytes)=2213019648
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=766082
    File Output Format Counters
        Bytes Written=84
```
8.3.3.5.1.9 Result for Multiple Files

The expected result looks like:

```bash
$ hadoop fs -cat stats_output_all/part-r-00000
Max: 19.999191254
Min: 0.200268613863
Avg: 9.514884854468903
Std: 5.553921579413547
```

8.3.3.5.2 Conclusion

The example program, calculating statistics by reading multiple files, shows how Map/Reduce is written in the Java programming language and how Hadoop runs the program using HDFS. We also observed one of the benefits of using a Docker container, namely that the hassle of configuring and installing Hadoop is not necessary anymore.

8.3.3.6 References

The details of the new version are available from the official site at http://hadoop.apache.org/docs/r3.1.1/index.html
8.3.4 Docker Pagerank ☁

PageRank is a popular example algorithm used to display the ability of big data applications to run parallel tasks. This example will show how the Docker hadoop image can be used to execute the PageRank example, which is available in /cloudmesh/examples/pagerank.

8.3.4.1 Use the automated script

We combined the steps of compiling the Java source, archiving the class files, loading the input files, and running the program into one single script. To execute it with the input file PageRankDataGenerator/pagerank5000g50.input.0, using 5000 urls and 1 iteration:

```bash
$ cd /cloudmesh/examples/pagerank
$ ./compileAndExecHadoopPageRank.sh PageRankDataGenerator/pagerank5000g50.input.0 5000 1
```

The result will be stored in output.pagerank/part-r-00000. The head of the result will look like:

```bash
$ head output.pagerank/part-r-00000
0    2.9999999999999997E-5
1    2.9999999999999997E-5
2    2.9999999999999997E-5
3    2.9999999999999997E-5
4    2.9999999999999997E-5
5    2.9999999999999997E-5
6    2.9999999999999997E-5
7    2.9999999999999997E-5
8    2.9999999999999997E-5
9    2.9999999999999997E-5
```
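Since all ranks are identical after a single iteration, it is more interesting to sort the output after running more iterations; the following sketch shows one way to list the highest-ranked urls, assuming the two-column url/rank output format shown previously.

```bash
# Show the ten urls with the highest rank
$ sort -k 2 -gr output.pagerank/part-r-00000 | head -10
```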
8.3.4.2 Compile and run by hand

If one wants to generate the Java class files and archive them as in the previous exercise, one could use the following code (which is actually inside compileAndExecHadoopPageRank.sh):

```bash
$ export HADOOP_CLASSPATH=`$HADOOP_PREFIX/bin/hadoop classpath`
$ mkdir /cloudmesh/examples/pagerank/dist
$ find /cloudmesh/examples/pagerank/src/indiana/cgl/hadoop/pagerank/ \
    -name "*.java" | xargs javac -classpath $HADOOP_CLASSPATH \
    -d /cloudmesh/examples/pagerank/dist
$ cd /cloudmesh/examples/pagerank/dist
$ jar -cvf HadoopPageRankMooc.jar -C . .
```

Load the input files to HDFS:

```bash
$ export PATH=$PATH:$HADOOP_PREFIX/bin
$ cd /cloudmesh/examples/pagerank/
$ hadoop fs -mkdir input.pagerank
$ hadoop fs -put PageRankDataGenerator/pagerank5000g50.input.0 input.pagerank
```

Run the program with the parameters [PageRank Input File Directory] [PageRank Output Directory] [Number of Urls] [Number Of Iterations]:

```bash
$ hadoop jar dist/HadoopPageRankMooc.jar indiana.cgl.hadoop.pagerank.HadoopPageRank input.pagerank output.pagerank 5000 1
```

Result:

```bash
$ hadoop fs -cat output.pagerank/part-r-00000
```

8.3.5 Apache Spark with Docker ☁

8.3.5.1 Pull Image from Docker Repository

We use a Docker image from Docker Hub (https://hub.docker.com/r/sequenceiq/spark/). This repository contains a Dockerfile to build a Docker image with Apache Spark and Hadoop YARN.

```bash
$ docker pull sequenceiq/spark:1.6.0
```
8.3.5.2 Running the Image

In this step, we will launch a Spark container.

8.3.5.2.1 Running interactively

```bash
$ docker run -it -p 8088:8088 -p 8042:8042 -h sandbox sequenceiq/spark:1.6.0 bash
```

8.3.5.2.2 Running in the background

```bash
$ docker run -d -h sandbox sequenceiq/spark:1.6.0 -d
```

8.3.5.3 Run Spark

After a container is launched, we can run Spark in the following two modes: (1) yarn-client and (2) yarn-cluster. The differences between the two modes can be found here: https://spark.apache.org/docs/latest/running-on-yarn.html

8.3.5.3.1 Run Spark in Yarn-Client Mode

```bash
$ spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.3.2 Run Spark in Yarn-Cluster Mode

```bash
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.4 Observe Task Execution from Running Logs of SparkPi

Let us observe Spark task execution by adjusting the parameter of SparkPi and the Pi result from the following two commands:

```bash
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 $SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar 10000
```

8.3.5.5 Write a Word-Count Application with Spark RDD

Let us write our own word count with Spark RDD. After the shell has been started, copy and paste the following code in the console line by line.
8.3.5.5.1 Launch Spark Interactive Shell

```bash
$ spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.5.2 Program in Scala

```scala
val textFile = sc.textFile("file:///etc/hosts")
val words = textFile.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.values.sum()
```

8.3.5.5.3 Launch PySpark Interactive Shell

```bash
$ pyspark --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
```

8.3.5.5.4 Program in Python

```python
textFile = sc.textFile("file:///etc/hosts")
words = textFile.flatMap(lambda line: line.split())
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
counts.map(lambda x: x[1]).sum()
```
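The same word count can also be run non-interactively; a minimal sketch follows that writes the Python program from the previous subsection to a file and submits it with spark-submit (the file name wordcount.py is an arbitrary choice).

```bash
# Save the word-count program to a file
cat > wordcount.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
textFile = sc.textFile("file:///etc/hosts")
counts = (textFile.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda x, y: x + y))
print(counts.map(lambda x: x[1]).sum())
EOF

# Submit it to YARN in client mode
$ spark-submit --master yarn-client --driver-memory 1g \
    --executor-memory 1g --executor-cores 1 wordcount.py
```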
8.3.5.6 Docker Spark Examples

8.3.5.6.1 K-Means Example

First we need to pull the image from the Docker Hub:

```bash
$ docker pull sequenceiq/spark-native-yarn
```

It will take some time to download the image. Now we have to run the Docker spark image interactively:

```bash
$ docker run -i -t -h sandbox sequenceiq/spark-native-yarn /etc/bootstrap.sh -bash
```

This will take you to the interactive mode.

Let us run a sample KMeans example. This is already built with Spark:

```bash
$ ./bin/spark-submit --class sample.KMeans \
    --master execution-context:org.apache.spark.tez.TezJobExecutionContext \
    --conf update-classpath=true \
    ./lib/spark-native-yarn-samples-1.0.jar /sample-data/kmeans_data.txt
```

Here we specify the dataset from a local folder inside the image and we run the sample class KMeans in the sample package. The sample data set used is inside the sample-data folder. Spark has its own format for machine learning datasets; here the kmeans_data.txt file contains the KMeans dataset.
If you run this successfully, you get an output as shown here:

```
Finished iteration (delta = 0.0)
Final centers:
DenseVector(0.15000000000000002, 0.15000000000000002, 0.15000000000000002)
DenseVector(9.2, 9.2, 9.2)
DenseVector(0.0, 0.0, 0.0)
DenseVector(9.05, 9.05, 9.05)
```

8.3.5.6.2 Join Example

Run the following command to do a sample join operation on a given dataset. Here we use two datasets, namely join1.txt and join2.txt. Then we perform the join operation that we discussed in the theory section:

```bash
$ ./bin/spark-submit --class sample.Join --master execution-context:org.apache.spark.tez.TezJobExecutionContext --conf update-classpath=true ./lib/spark-native-yarn-samples-1.0.jar /sample-data/join1.txt /sample-data/join2.txt
```

8.3.5.6.3 Word Count

In this example the wordcount.txt will be used to do the word count using multiple reducers. The number 1 at the end of the command determines the number of reducers. As Spark can run multiple reducers, we can specify the number as a parameter to the program:

```bash
$ ./bin/spark-submit --class sample.WordCount --master execution-context:org.apache.spark.tez.TezJobExecutionContext --conf update-classpath=true ./lib/spark-native-yarn-samples-1.0.jar /sample-data/wordcount.txt 1
```

8.3.5.7 Interactive Examples

Here we need a new image to work on. Let us run the following command. This will pull the necessary repositories from Docker Hub, as we do not have most of the dependencies related to it. This can take a few minutes to download everything:

```bash
$ docker run -it -p 8888:8888 -v $PWD:/cloudmesh/spark --name spark jupyter/pyspark-notebook
```

Here you will get the following output in the terminal:

```
Unable to find image 'jupyter/pyspark-notebook:latest' locally
latest: Pulling from jupyter/pyspark-notebook
a48c500ed24e: Pull complete
...
Digest: sha256:f8b6309cd39481de1a169143189ed0879b12b56fe286d254d03fa34ccad90734
Status: Downloaded newer image for jupyter/pyspark-notebook:latest
Container must be run with group "root" to update passwd file
Executing the command: jupyter notebook
[I 15:47:52.900 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[I 15:47:53.167 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 15:47:53.167 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 15:47:53.176 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 15:47:53.177 NotebookApp] The Jupyter Notebook is running at:
[I 15:47:53.177 NotebookApp] http://(3a3d9f7e2565 or 127.0.0.1):8888/?token=f22492fe7ab8206ac2223359e0603a0dff54d98096ab7930
[I 15:47:53.177 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 15:47:53.177 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://(3a3d9f7e2565 or 127.0.0.1):8888/?token=f22492fe7ab8206ac2223359e0603a0dff54d98096ab7930
```

Please copy the url shown at the end of the terminal output and go to that url in the browser.

You will see the following output in the browser (use Google Chrome):

Jupyter Notebook in Browser

First navigate to the work folder. Let us create a new Python file here. Click Python 3 in the New menu.
Create a new Python file

Now add the following content in the new file. In a Jupyter notebook, you can enter a Python command or Python code and press SHIFT + ENTER to run the code interactively.

Now let us create the following content:

```python
import os
os.getcwd()
```

Now let us do the following:

```python
import pyspark
sc = pyspark.SparkContext('local[*]')
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
os.makedirs("data")
```

In the following stage we configure the Spark context and import the necessary files:

```python
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg import SparseVector
sc.version
```

In the next stage we create a sample dataset in the form of an array and we train the KMeans algorithm:

```python
sparse_data = [
    SparseVector(3, {1: 1.0}),
    SparseVector(3, {1: 1.1}),
    SparseVector(3, {2: 1.0}),
    SparseVector(3, {2: 1.1})
]
model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode='k-means||',
                     seed=50, initializationSteps=5, epsilon=1e-4)
```

In the final stage we put in sample values and check the predictions on the cluster. In addition we feed the data using the SparseVector format, and we add the KMeans initialization mode, the error margin, and the parallelization. We put the step size as 5 for this example; in the previous one we did not specify any parameters.

```python
model.predict(array([0., 1., 0.]))
model.predict(array([0., 0., 1.]))
```

The predict call predicts the cluster id which the point belongs to:

```python
model.predict(sparse_data[0])
model.predict(sparse_data[2])
```

Then in the following way you can check whether two data points belong to one cluster or not:

```python
data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
model = KMeans.train(sc.parallelize(data), 2, initializationMode='random',
                     seed=50, initializationSteps=5, epsilon=1e-4)
model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
model.predict(array([8.0, 9.0]))
model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
model.k
model.computeCost(sc.parallelize(data))
isinstance(model.clusterCenters, list)
```

8.3.5.7.1 Stop Docker Container

```bash
$ docker stop spark
```

8.3.5.7.2 Start Docker Container Again

```bash
$ docker start spark
```

8.3.5.7.3 Remove Docker Container

```bash
$ docker rm spark
```
8.4 KUBERNETES

8.4.1 Introduction to Kubernetes ☁

Learning Objectives
- What is Kubernetes?
- What are containers?
- Cluster components in Kubernetes
- Basic units in Kubernetes
- Run an example with Minikube
- Interactive online tutorial
- Have a solid understanding of containers and Kubernetes
- Understand the cluster components of Kubernetes
- Understand the terminology of Kubernetes
- Gain practical experience with Kubernetes
  - with Minikube
  - with an interactive online tutorial
Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers.
https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
With Kubernetes, you can:

- Deploy your applications quickly and predictably.
- Scale your applications on the fly.
- Roll out new features seamlessly.
- Limit hardware usage to required resources only.
- Run applications in public and private clouds.
Kubernetes is

- Portable: public, private, hybrid, multi-cloud
- Extensible: modular, pluggable, hookable, composable
- Self-healing: auto-placement, auto-restart, auto-replication, auto-scaling
8.4.1.1 What are containers?

Figure 75: Kubernetes Containers [Image Source]

Figure 75 shows a depiction of the container architecture.

8.4.1.2 Terminology

In Kubernetes we are using the following terminology:
Pods:

A pod (as in a pod of whales or pea pod) is a group of one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers. A pod's contents are always co-located and co-scheduled, and run in a shared context. A pod models an application-specific logical host. It contains one or more application containers which are relatively tightly coupled. In a pre-container world, they would have executed on the same physical or virtual machine.
Services:

A Service is an abstraction which defines a logical set of Pods and a policy by which to access them. Sometimes they are called a micro-service. The set of Pods targeted by a Service is (usually) determined by a Label Selector.
Deployments:

A Deployment controller provides declarative updates for Pods and ReplicaSets. You describe a desired state in a Deployment object, and the Deployment controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
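To make these three terms concrete, the following minimal sketch lists the Pods, Services, and Deployments in a cluster using the official kubernetes Python client. The client library (pip install kubernetes) and the default namespace are our assumptions for illustration; they are not part of this tutorial, which uses kubectl commands later in this section.

# Minimal sketch; assumes the `kubernetes` Python client is installed
# and a kubeconfig exists, e.g. one created by minikube.
from kubernetes import client, config

config.load_kube_config()  # read credentials from ~/.kube/config

core = client.CoreV1Api()
apps = client.AppsV1Api()

# Pods: co-located, co-scheduled groups of containers
for pod in core.list_namespaced_pod(namespace="default").items:
    print("pod:", pod.metadata.name, pod.status.phase)

# Services: a stable access policy for a logical set of Pods
for svc in core.list_namespaced_service(namespace="default").items:
    print("service:", svc.metadata.name, svc.spec.type)

# Deployments: declarative desired state for Pods and ReplicaSets
for dep in apps.list_namespaced_deployment(namespace="default").items:
    print("deployment:", dep.metadata.name, dep.spec.replicas)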
8.4.1.3 Kubernetes Architecture

The architecture of Kubernetes is shown in Figure 76.

Figure 76: Kubernetes (Source: Google)

8.4.1.4 Minikube

To try out Kubernetes on your own computer you can download and install Minikube. It deploys and runs a single-node Kubernetes cluster inside a VM. Hence it provides a reasonable environment not only to try it out, but also for development [cite].

In this section we will first discuss how to install Minikube and then showcase an example.
8.4.1.4.1 Install Minikube

8.4.1.4.1.0.1 OSX

$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.25.0/minikube-darwin-amd64 && chmod +x minikube &&

8.4.1.4.1.0.2 Windows 10

We assume that you have installed Oracle VirtualBox on your machine, which must be a version 5.x.x.

Initially, we need to download two executables.

Download Kubectl

Download Minikube

After downloading these two executables, place them in the cloudmesh directory we created earlier. Rename minikube-windows-amd64.exe to minikube.exe. Make sure minikube.exe and kubectl.exe lie in the same directory.

8.4.1.4.1.0.3 Linux

Installing KVM2 is important for Ubuntu distributions.

We are going to run Minikube using the KVM2 libraries instead of the VirtualBox libraries that we used for the Windows installation.

Then install the drivers for KVM2:
8.4.1.4.2 Start a cluster using Minikube

8.4.1.4.2.0.1 OSX Minikube Start

8.4.1.4.2.0.2 Ubuntu Minikube Start

$ curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.25.0/minikube-linux-amd64 && chmod +x minikube &&
$ sudo apt install libvirt-bin qemu-kvm
$ sudo usermod -a -G libvirtd $(whoami)
$ newgrp libvirtd
$ curl -LO https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-kvm2 && chmod +x docker-machine-driver-kvm2
$ minikube start
$ minikube start --vm-driver=kvm2
8.4.1.4.2.0.3Windows10MinikubeStart
InthiscaseyoumustrunWindowsPowerShellasadministrator.Forthissearchfor the application in search and right click and clickRun as administrator. Ifyouareanadministratoritwillrunautomaticallybutifyouarenotpleasemakesureyouprovidetheadminlogininformationinthepopup.
8.4.1.4.3Createadeployment
8.4.1.4.4Exposetheservi
8.4.1.4.5Checkrunningstatus
Thisstepistomakesureyouhaveapodupandrunning.
8.4.1.4.6Callserviceapi
8.4.1.4.7TakealookfromDashboard
Ifyouwanttogetaninteractivedashboard,
Browsetohttp://192.168.99.101:30000inyourwebbrowseranditwillprovideaGUIdashboardregardingminikube.
8.4.1.4.8Deletetheserviceanddeployment
$ cd C:\Users\<username>\Documents\cloudmesh
$ .\minikube.exe start --vm-driver="virtualbox"
$ kubectl run hello-minikube --image=k8s.gcr.io/echoserver:1.4 --port=8080
$ kubectl expose deployment hello-minikube --type=NodePort
$ kubectl get pod
$ curl $(minikube service hello-minikube --url)
$ minikube dashboard
$ minikube dashboard --url=true
http://192.168.99.101:30000
$ kubectl delete service hello-minikube
$ kubectl delete deployment hello-minikube
8.4.1.4.9 Stop the cluster

For all platforms we can use the following command.

8.4.1.5 Interactive Tutorial Online

- Start cluster https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-interactive/
- Deploy app https://kubernetes.io/docs/tutorials/kubernetes-basics/cluster-interactive
- Explore https://kubernetes.io/docs/tutorials/kubernetes-basics/explore-intro/
- Expose https://kubernetes.io/docs/tutorials/kubernetes-basics/expose-intro/
- Scale https://kubernetes.io/docs/tutorials/kubernetes-basics/scale-intro/
- Update https://kubernetes.io/docs/tutorials/kubernetes-basics/update-interactive/
- MiniKube https://kubernetes.io/docs/tutorials/stateless-application/hello-minikube/
8.4.2 Using Kubernetes on FutureSystems ☁

This section introduces you to using the Kubernetes cluster on FutureSystems. Currently we have deployed Kubernetes on our cluster called echo.

8.4.2.1 Getting Access

You will need an account on FutureSystems and upload the ssh key to the FutureSystems portal from the computer from which you want to log in to echo. To verify that you have access, try to see if you can log into victor.futuresystems.org. You need to be a member of a valid FutureSystems project.

For Fall 2018 classes at IU you need to be in the following project:

https://portal.futuresystems.org/project/553
$ minikube stop
If you have verified that you have access to victor, you can now try to log in to the Kubernetes cluster head node with the same username and key. Run these first on your local machine to set the username and login host:

Then you can log into the Kubernetes head node by running:

NOTE: If you have access to victor but not the Kubernetes system, your project may not have been authorized to access the Kubernetes cluster. Send a ticket to the FutureSystems ticket system to request this.

Once you are logged in to the Kubernetes cluster head node you can run commands on the remote echo Kubernetes machine (all commands shown next except where stated otherwise) to use the Kubernetes installation there. First try to run:

This will let you know if you have access to Kubernetes and verifies if the kubectl command works for you. Naturally it will also list the pods.

8.4.2.2 Example Use

The following command runs an image called nginx with two replicas; nginx is a popular web server which is well known as a high performance load balancer.

As a result of this, one deployment was created, and two pods were created and started. If you encounter an error stating that the deployment already exists when executing the previous command, that is because the command has already been executed. To see the deployment, please use the following command; it should work even if you noticed the error mentioned.

This will result in the following output:
$ export ECHOK8S=149.165.150.85
$ export FS_USER=<put your futuresystems account name here>
$ ssh $FS_USER@$ECHOK8S

$ kubectl get pods

$ kubectl run nginx --replicas=2 --image=nginx --port=80

$ kubectl get deployment

NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx   2         2         2            2           7m
To see the pods please use the command

This will result in the following output:

NAME                     READY   STATUS    RESTARTS   AGE
nginx-7587c6fdb6-4jnh6   1/1     Running   0          7m
nginx-7587c6fdb6-pxpsz   1/1     Running   0          7m

If we want to see more detailed information we can use the command

NAME                READY   STATUS    RESTARTS   AGE   IP               NODE
nginx-75...-4jnh6   1/1     Running   0          8m    192.168.56.2     e003
nginx-75...-pxpsz   1/1     Running   0          8m    192.168.255.66   e005

Please note the IP address field. Make sure you are using the IP address that is listed when you execute the command, since the IP address may have changed. Now if we try to access the nginx homepage with wget (or curl)
we see the following output:

--2018-02-20 14:05:59--  http://192.168.56.2/
Connecting to 192.168.56.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 612 [text/html]
Saving to: 'index.html'

index.html  100%[=========>]  612  --.-KB/s  in 0s

2018-02-20 14:05:59 (38.9 MB/s) - 'index.html' saved [612/612]

This verifies that the specified image is running and that it is accessible from within the cluster.
Next we need to start thinking about how we access this web server from outside the cluster. We can explicitly expose the service with the following command. You can change the name that is set using --name to what you want, given that it adheres to the naming standards. If the name you enter is already in the system, your command will return an error saying the service already exists.

We will see the response:

service "nginx-external" exposed
$ kubectl get pods
$ kubectl get pods -o wide
$ wget 192.168.56.2
$ kubectl expose deployment nginx --type=NodePort --name=abc-nginx-ext
To find the exposed IP addresses, we simply issue the command

We see something like this:

NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP        8h
abc-nginx-ext   NodePort    10.110.177.35   <none>        80:31386/TCP   3s

Please note that we have given a unique name.
For IU students:

You could use your username, or if you use one of our classes, your hid. The number part will typically be sufficient. For class users that do not use the hid in the name we will terminate all instances without notification. In addition, we would like you to explicitly add "-ext" to every container that is exposed to the internet. Naturally we want you to shut down such services if they are not in use. Failure to do so may result in termination of the service without notice, and in the worst case revocation of your privileges to use echo.

In our example you will find the port on which our service is exposed and remapped to. We find the port 31386 in the value 80:31386/TCP in the ports column for the running container.

Now if we visit this URL, which is the public IP of the head node followed by the exposed port number, from a browser on your local machine: http://149.165.150.85:31386

you should see the 'Welcome to nginx' page.

Once you have done all the work needed using the service you can delete it using the following command.
8.4.2.3 Exercises

$ kubectl get svc
$ kubectl delete service <service-name>

E.Kubernetes.fs.1:

Explore more complex service examples.

E.Kubernetes.fs.2:

Explore constructing a complex web app with multiple services.

E.Kubernetes.fs.3:

Define a deployment with a yaml file declaratively.
8.5 SINGULARITY

8.5.1 Running Singularity Containers on Comet ☁

This section was copied from

https://www.sdsc.edu/support/user_guides/tutorials/singularity.html

and modified. To use it you will need an account on Comet, which can be obtained via XSEDE. In case you use this material as part of a class please contact your teacher for more information.

8.5.1.1 Background

What is Singularity?

"Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don't have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run."

[from the Singularity website at http://singularity.lbl.gov/]

There are numerous good tutorials on how to install and run Singularity on Linux, OSX, or Windows, so we won't go into much detail on that process here.

In this tutorial you will learn how to run Singularity on Comet. First we will review how to access a compute node on Comet and provide a simple example to help get you started. There are numerous tutorials on how to get started with Singularity, but there are some details specific to running Singularity on Comet which are not covered in those tutorials. This tutorial assumes you already have an account on Comet. You will also need access to a basic set of example files to get started. SDSC hosts a Github repository containing a "Hello world!" example which you may clone with the following command:
8.5.1.2 Tutorial Contents

- Why Singularity?
- Downloading & Installing Singularity
- Building Singularity Containers
- Running Singularity Containers on Comet
- Running Tensorflow on Comet Using Singularity

8.5.1.3 Why Singularity?

Listed next is a typical list of commands you would need to issue in order to implement a functional Python installation for scientific research:

Singularity allows you to avoid this time-consuming series of steps by packaging these commands in a re-usable and editable script, allowing you to quickly, easily, and repeatedly implement a custom container designed specifically for your analytical needs.

Figure 77 compares a VM vs. Docker vs. Singularity.
git clone https://github.com/hpcdevops/singularity-hello-world.git

COMMAND=apt-get -y install libx11-dev
COMMAND=apt-get install build-essential python-libdev
COMMAND=apt-get install build-essential openmpi-dev
COMMAND=apt-get install cmake
COMMAND=apt-get install g++
COMMAND=apt-get install git-lfs
COMMAND=apt-get install libXss.so.1
COMMAND=apt-get install libgdal1-dev libproj-dev
COMMAND=apt-get install libjsoncpp-dev libjsoncpp0
COMMAND=apt-get install libmpich-dev --user
COMMAND=apt-get install libpthread-stubs0 libpthread-stubs0-dev libx11-dev libx11-d
COMMAND=apt-get install libudev0:i386
COMMAND=apt-get install numpy
COMMAND=apt-get install python-matplotlib
COMMAND=apt-get install python3
Figure 77: Singularity Container Architecture [69]

8.5.1.4 Hands-On Tutorials

The following tutorial includes links to asciinema video tutorials created by SDSC HPC Systems Manager, Trevor Cooper, which allow you to see the console interactivity and output in detail. Look for the asciinema icon corresponding to the task you are currently working on.

8.5.1.5 Downloading & Installing Singularity

- Download & Unpack Singularity
- Configure & Build Singularity
- Install & Test Singularity

8.5.1.5.1 Download & Unpack Singularity

First we download and unpack the source using the following commands (assuming your username is test_user and you are working on your local computer with superuser privileges):

Singularity - download source and unpack in VirtualBox VM (CentOS 7)

If the file is successfully extracted, you should be able to view the results:

[test_user@localhost ~]$ wget https://github.com/singularityware/singularity/releases/download/2.5.1/singularity-2.5.1.tar.gz
[test_user@localhost ~]$ tar -zxf singularity-2.5.1.tar.gz
8.5.1.5.2 Configure & Build Singularity

Singularity - configure and build in VirtualBox VM (CentOS 7)

Next we configure and build the package. To configure, enter the following command (we will leave out the command prompts):

To build, issue the following command:

make

This may take several seconds depending on your computer.

8.5.1.5.3 Install & Test Singularity

Singularity - install and test in VirtualBox VM (CentOS 7)

To complete the installation enter:

sudo make install

You should be prompted to enter your admin password.

Once the installation is completed, you can check to see if it succeeded in a few different ways:

which singularity
singularity --version

You can also run a self test with the following command:

singularity selftest

The output should look something like:

+ sh -c test -f /usr/local/etc/singularity/singularity.conf (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/action-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/create-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/expand-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/export-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/import-suid (retval=0) OK
+ test -u /usr/local/libexec/singularity/bin/mount-suid (retval=0) OK

[test_user@localhost ~]$ cd singularity-2.5.1/
[test_user@localhost singularity-2.5.1]$ ls
./configure
8.5.1.6 Building Singularity Containers

The process of building a Singularity container consists of a few distinct steps, as follows:

- Upgrading Singularity (if needed)
- Create an Empty Container
- Import into Container
- Shell into Container
- Write into Container
- Bootstrap Container

We will go through each of these steps in detail.

8.5.1.6.1 Upgrading Singularity

We recommend building containers using the same version of Singularity, 2.5.1, as exists on Comet. This is a 2 step process.

Step 1: run the next script to remove your existing Singularity:

Step 2: run the following script to install Singularity 2.5.1:
#!/bin/bash
#
# A cleanup script to remove Singularity

sudo rm -rf /usr/local/libexec/singularity
sudo rm -rf /usr/local/etc/singularity
sudo rm -rf /usr/local/include/singularity
sudo rm -rf /usr/local/lib/singularity
sudo rm -rf /usr/local/var/lib/singularity/
sudo rm /usr/local/bin/singularity
sudo rm /usr/local/bin/run-singularity
sudo rm /usr/local/etc/bash_completion.d/singularity
sudo rm /usr/local/man/man1/singularity.1

#!/bin/bash
#
# A build script for Singularity (http://singularity.lbl.gov/)

declare -r SINGULARITY_NAME='singularity'
declare -r SINGULARITY_VERSION='2.5.1'
declare -r SINGULARITY_PREFIX='/usr/local'
declare -r SINGULARITY_CONFIG_DIR='/etc'

sudo apt update
sudo apt install python dh-autoreconf build-essential debootstrap

cd ../
tar -xzvf "${PWD}/tarballs/${SINGULARITY_NAME}-${SINGULARITY_VERSION}.tar.gz"
cd "${SINGULARITY_NAME}-${SINGULARITY_VERSION}"
./configure --prefix="${SINGULARITY_PREFIX}" --sysconfdir="${SINGULARITY_CONFIG_DIR}"
make
8.5.1.7 Create an Empty Container

Singularity - create container

To create an empty Singularity container, you simply issue the following command:

singularity create centos7.img

This will create a CentOS 7 container with a default size of ~805 MB. Depending on what additional configurations you plan to make to the container, this size may or may not be big enough. To specify a particular size, such as ~4 GB, include the -s parameter, as shown in the following command:

singularity create -s 4096 centos7.img

To view the resulting image in a directory listing, enter the following:

ls

8.5.1.8 Import Into a Singularity Container

Singularity - import Docker image

Next, we will import a Docker image into our empty Singularity container:

singularity import centos7.img docker://centos:7

8.5.1.9 Shell Into a Singularity Container

Singularity - shell into container

Once the container actually contains a CentOS 7 installation, you can 'shell' into it with the following:

singularity shell centos7.img

Once you enter the container you should see a different command prompt. At this new prompt, try typing:

whoami

Your user id should be identical to your user id outside the container. However, the operating system will probably be different. Try issuing the following command from inside the container to see what the OS version is:

cat /etc/*-release

8.5.1.10 Write Into a Singularity Container

Singularity - write into container

Next, let's try writing into the container (as root):

sudo /usr/local/bin/singularity shell -w centos7.img

You should be prompted for your password, and then you should see something like the following:

Invoking an interactive shell within the container...

Next, let's create a script within the container so we can use it to test the ability of the container to execute shell scripts:

vi hello_world.sh

The previous command assumes you know the vi editor. Enter the following text into the script, save it, and quit the vi editor:

#!/bin/bash
echo "Hello, World!"

You may need to change the permissions on the script so it can be executable:

chmod +x hello_world.sh

Try running the script manually:

./hello_world.sh

The output should be:

Hello, World!
8.5.1.11 Bootstrapping a Singularity Container

Singularity - bootstrapping a container

Bootstrapping a Singularity container allows you to use what is called a 'definitions file' so you can reproduce the resulting container configurations on demand.

Let us say you want to create a container with Ubuntu, but you may want to create variations on the configurations without having to repeat a long list of commands manually. First, we need our definitions file. Given next is the contents of a definitions file which should suffice for our purposes.

To bootstrap your container, first we need to create an empty container:

singularity create -s 4096 ubuntu.img

Now, we simply need to issue the following command to configure our container with Ubuntu:

sudo /usr/local/bin/singularity bootstrap ./ubuntu.img ./ubuntu.def

This may take a while to complete. In principle, you can accomplish the same result by manually issuing each of the commands contained in the script file, but why do that when you can use bootstrapping to save time and avoid errors.

If all goes according to plan, you should then be able to shell into your new Ubuntu container.
Bootstrap: docker
From: ubuntu:latest

%runscript
    exec echo "The runscript is the containers default runtime command!"

%files
    /home/testuser/ubuntu.def /data/ubuntu.def

%environment
    VARIABLE=HELLOWORLD
    export VARIABLE

%labels

%post
    apt-get update && apt-get -y install python3 git wget
    mkdir /data
    echo "The post section is where you can install and configure your container."
8.5.1.12 Running Singularity Containers on Comet

Of course, the purpose of this tutorial is to enable you to use the San Diego Supercomputer Center's Comet supercomputer to run your jobs. This assumes you have an account on Comet already. If you do not have an account on Comet and you feel you can justify the need for such an account (i.e. your research is limited by the limited compute power you have in your government-funded research lab), you can request a 'Startup Allocation' through the XSEDE User Portal:

https://portal.xsede.org/allocations-overview#types-trial

You may create a free account on the XUP if you do not already have one and then proceed to submit an allocation request at the previously given link.

NOTE: SDSC provides a Comet User Guide to help get you started with Comet. Learn more about the San Diego Supercomputer Center at http://www.sdsc.edu.

This tutorial walks you through the following steps towards running your first Singularity container on Comet:

- Transfer the Container to Comet
- Run the Container on Comet
- Allocate Resources to Run the Container
- Integrate the Container with Slurm
- Use existing Comet Containers

8.5.1.12.1 Transfer the Container to Comet

Singularity - transfer container to Comet

Once you have created your container on your local system, you will need to transfer it to Comet. There are multiple ways to do this and it can take a varying amount of time depending on its size and your network connection speeds.

To do this, we will use scp (secure copy). If you have a Globus account and your containers are more than 4 GB you will probably want to use that file transfer method instead of scp.

Browse to the directory containing the container and copy it to your scratch directory on Comet by issuing the following command:

scp ./centos7.img comet.sdsc.edu:/oasis/scratch/comet/test_user/temp_project/

The container is ~805 MB so it should not take too long, hopefully.
8.5.1.12.2 Run the Container on Comet

Singularity - run container on Comet

Once the file is transferred, log in to Comet (assuming your Comet user is named test_user):

ssh test_user@comet.sdsc.edu

Navigate to your scratch directory on Comet, which should be something like:

[test_user@comet-ln3 ~]$ cd /oasis/scratch/comet/test_user/temp_project/

Next, you should submit a request for an interactive session on one of Comet's compute, debug, or shared nodes:

[test_user@comet-ln3 ~]$ srun --pty --nodes=1 --ntasks-per-node=24 -p compute -t 01:00:00 --wait=0 /bin/bash

Once your request is approved your command prompt should reflect the new node id.

Before you can run your container you will need to load the Singularity module (if you are unfamiliar with modules on Comet, you may want to review the Comet User Guide). The command to load Singularity on Comet is:

[test_user@comet-ln3 ~]$ module load singularity

You may issue the previous command from any directory on Comet. Recall that we added a hello_world.sh script to our centos7.img container. Let us try executing that script with the following command:

[test_user@comet-ln3 ~]$ singularity exec /oasis/scratch/comet/test_user/temp_project/singularity/centos7.img /hello_world.sh

If all goes well, you should see Hello, World! in the console output. You might also see some warnings pertaining to non-existent bind points. You can resolve this by adding some additional lines to your definitions file before you build your container. We did not do that for this tutorial, but you would use a command like the following in your definitions file:

# create bind points for SDSC HPC environment
mkdir -p /oasis/scratch/comet/temp_project

You will find additional examples located in the following locations on Comet:

/share/apps/examples/SI2017/Singularity

and

/share/apps/examples/SINGULARITY
8.5.1.12.3 Allocate Resources to Run the Container

Singularity - allocate resources to run container

It is best to avoid working on Comet's login nodes, since they can become a performance bottleneck not only for you but for all other users. You should rather allocate resources specifically for computationally-intensive jobs. To allocate a 'compute node' for your use on Comet, issue the following command:

[test_user@comet-ln3 ~]$ salloc -N 1 -t 00:10:00

This allocation requests a single node (-N 1) for a total time of 10 minutes (-t 00:10:00). Once your request has been approved, your compute node name should be displayed, e.g. comet-17-12.

Now you may log in to this node:

[test_user@comet-ln3 ~]$ ssh comet-17-12

Notice that the command prompt has now changed to reflect the fact that you are on a compute node and not a login node:

[test_user@comet-06-04 ~]$

Next, load the Singularity module, shell into the container, and execute the hello_world.sh script:

[test_user@comet-06-04 ~]$ module load singularity
[test_user@comet-06-04 ~]$ singularity shell centos7.img
[test_user@comet-06-04 ~]$ ./hello_world.sh

If all goes well, you should see Hello, World! in the console output.
8.5.1.12.4 Integrate the Container with Slurm

Singularity - run container on Comet via Slurm

Of course, most users simply want to submit their jobs to the Comet queue, let them run to completion, and go on to other things while waiting. Slurm is the job manager for Comet.

Given next is a job script (which we will name singularity_mvapich2_hellow.run) which will submit your Singularity container to the Comet queue and run a program, hellow.c (written in C using MPI and provided as part of the examples with the mvapich2 default installation).

#!/bin/bash
#SBATCH --job-name="singularity_mvapich2_hellow"
#SBATCH --output="singularity_mvapich2_hellow.%j.out"
#SBATCH --error="singularity_mvapich2_hellow.%j.err"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:10:00
#SBATCH --export=all

module load mvapich2_ib singularity

CONTAINER=/oasis/scratch/comet/$USER/temp_project/singularity/centos7-mvapich2.img

mpirun singularity exec ${CONTAINER} /usr/bin/hellow

The previous script requests 2 nodes and 24 tasks per node with a walltime of 10 minutes. Notice that two modules are loaded (see the line beginning with 'module'), one for Singularity and one for MPI. An environment variable 'CONTAINER' is also defined to make it a little easier to manage long reusable text strings such as file paths.

You may need to add a line specifying which allocation is to be used for this job. When you are ready to submit the job to the Comet queue, issue the following command:

[test_user@comet-06-04 ~]$ sbatch -p debug ./singularity_mvapich2_hellow.run

To view the status of your job in the Comet queue, issue the following:

[test_user@comet-06-04 ~]$ squeue -u test_user

When the job is complete, view the output, which should be written to the output file singularity_mvapich2_hellow.%j.out where %j is the job ID (let's say the job ID is 1000001):

[test_user@comet-06-04 ~]$ more singularity_mvapich2_hellow.1000001.out

The output should look something like the following:
8.5.1.12.5 Use Existing Comet Containers

SDSC User Support staff member Marty Kandes has built several custom Singularity containers designed specifically for the Comet environment.

Learn more about these containers for Comet.

An easy way to pull images from the Singularity hub on Comet is provided in the next video:

Singularity - pull from singularity-hub on Comet

Comet supports the capability to pull a container directly from any properly configured remote Singularity hub. For example, the following command can pull a container from the hpcdevops Singularity hub straight to an empty container located on Comet:

The resulting container should be named something like singularity-hello-world.img.

Learn more about Singularity hubs and container collections at:

https://singularity-hub.org/collections

That's it! Congratulations! You should now be able to run Singularity containers on Comet either interactively or through the job queue. We hope you found this tutorial useful. Please contact support@xsede.org with any questions you might have. Your Comet-related questions will be routed to the amazing SDSC
Hello world from process 28 of 48
Hello world from process 29 of 48
Hello world from process 30 of 48
Hello world from process 31 of 48
Hello world from process 32 of 48
Hello world from process 33 of 48
Hello world from process 34 of 48
Hello world from process 35 of 48
Hello world from process 36 of 48
Hello world from process 37 of 48
Hello world from process 38 of 48
comet$ singularity pull shub://hpcdevops/singularity-hello-world:master

Support Team.
8.5.1.13 Using Tensorflow With Singularity

One of the more common advantages of using Singularity is the ability to use pre-built containers for specific applications which may be difficult to install and maintain by yourself, such as Tensorflow. The most common example of a Tensorflow application is character recognition using the MNIST dataset. You can learn more about this dataset at http://yann.lecun.com/exdb/mnist/.

XSEDE's Comet supercomputer supports Singularity and provides several pre-built containers which run Tensorflow. Given next is an example batch script which runs a Tensorflow job within a Singularity container on Comet. Copy this script and paste it into a shell script named mnist_tensorflow_example.sb.
8.5.1.14 Run the job

To submit the script to Comet, first you'll need to request a compute node with the following command (replace the account with your XSEDE account number):

To submit the job to the Comet queue, issue the following command:

When the job is done you should see an output file in your output directory containing something resembling the following:
#!/bin/bash
#SBATCH --job-name="TensorFlow"
#SBATCH --output="TensorFlow.%j.%N.out"
#SBATCH --partition=gpu-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --gres=gpu:k80:1
#SBATCH -t 01:00:00

module load singularity
singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img lsb_release -a
singularity exec /share/apps/gpu/singularity/sdsc_ubuntu_gpu_tflow.img python -m tensorflow.models.image.mnist.convolutional

[test_user@comet-ln3 ~]$ srun --account=your_account_code --partition=gpu-shared --gres=gpu:1 --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 --export=ALL /bin/bash

[test_user@comet-06-04 ~]$ sbatch mnist_tensorflow_example.sb
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS

Congratulations! You have successfully trained a neural network to recognize handwritten numeric characters.
8.6 EXERCISES ☁

E.Docker.1: MongoDB Container

Develop a docker file that uses the mongo distribution from Docker Hub and starts a MongoDB database on the regular port while communicating to your container.

What are the parameters on the command line that you need to define?

E.Docker.2: MongoDB Container with authentication

Develop a MongoDB container that includes an authenticated user.
Release:        16.04
Codename:       xenial

WARNING: Non existent bind point (directory) in container: '/scratch'
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:85:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:85:00.0)
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 40.0 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 12.6 ms
Minibatch loss: 3.293, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.0%
Step 8400 (epoch 9.77), 11.5 ms
Minibatch loss: 1.596, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Step 8500 (epoch 9.89), 11.5 ms
Minibatch loss: 1.593, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Test error: 0.9%
You must use the cloudmesh.yaml file for specifying the information for the admin user and password.

1. How do you add the user?
2. How do you start the container?
3. Showcase the use of the authentication with a simple script or pytest.

You are allowed to use docker compose, but make sure you read the password and username from the yaml file. You must not configure it by hand in the compose yaml file. You can use cloudmesh commands to read the username and password.
E.Docker.3: Cloudmesh Container

In this assignment we will explore the use of two containers. We will be leveraging the assignment E.Docker.2.

First, you will start the authenticated docker MongoDB container.

You will be writing an additional docker file that creates cloudmesh in a docker container. Upon start, the parameter passed to the container will be executed in the container. You will use the .ssh and .cloudmesh directory from your native file system.

For hints, please look at

https://github.com/cloudmesh/cloudmesh-cloud/blob/master/docker/ubuntu-19.04/Dockerfile
https://github.com/cloudmesh/cloudmesh-cloud/blob/master/docker/ubuntu-19.04/Makefile

To jumpstart you, try

Explore! Understand what is done in the Makefile.
cms config value cloudmesh.data.mongo.MONGO_USERNAME
cms config value cloudmesh.data.mongo.MONGO_PASSWORD
make image
make shell
Questions:

1. How would you need to modify the Dockerfile to complete it?
2. Why did we comment out the MongoDB related tasks in the Dockerfile?
3. How do we need to establish communication to the MongoDB container?
4. Could docker compose help, or would it be too complicated, e.g. what if the mongo container already runs?
5. Why would it be dangerous to store the cloudmesh.yaml file inside the container? Hint: Docker Hub.
6. Why should you at this time not upload images to Docker Hub?
E.Docker.Swarm.1: Documentation

Develop a section in the handbook that deploys a Docker Swarm cluster on a number of ubuntu machines. Note that this may actually be easier, as docker and docker swarm are distributed with recent versions of ubuntu. Just in case, we are providing a link to an effort we found to install docker swarm. However, we have not checked it or identified if it is useful.

https://rominirani.com/docker-swarm-tutorial-b67470cf8872

E.Docker.Swarm.2: Google Compute Engine

Develop a section that deploys a Docker Swarm cluster on Google Compute Engine. Note that this may actually be easier, as docker and docker swarm are distributed with recent versions of ubuntu. Just in case, we are providing a link to an effort we found to install docker swarm. However, we have not checked it or identified if it is useful.

https://rominirani.com/docker-swarm-on-google-compute-engine-364765b400ed
E.SingleNodeHadoop:

Set up a single node hadoop environment.

This includes:

1. Create a Dockerfile that deploys hadoop in a container.
2. Develop sample applications and tests to test your cluster. You can use wordcount or similar.

You will find a comprehensive installation instruction that sets up a hadoop cluster on a single node at

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

E.MultiNodeHadoop:

Set up a hadoop cluster in a distributed environment.

This includes:

1. Create docker compose and Dockerfiles that deploy hadoop in kubernetes.
2. Develop sample applications and tests to test your cluster. You can use wordcount or similar.

You will find a comprehensive installation instruction that sets up a hadoop cluster in a distributed environment at

https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

You can use this set of instructions or identify other resources on the internet that allow the creation of a hadoop cluster on kubernetes. Alternatively, you can use docker compose for this exercise.
E.SparkCluster: Documentation

Develop a high quality section that installs a spark cluster in kubernetes. Test your deployment on minikube and also on the FutureSystems echo cluster.

You may want to get inspired from the talk Scalable Spark Deployment using Kubernetes:

- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-1/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-2/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-3/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-4/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-5/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-6/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-7/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-8/
- http://blog.madhukaraphatak.com/scaling-spark-with-kubernetes-part-9/

Make sure you do not plagiarize.
9 NIST

9.1 NIST BIG DATA REFERENCE ARCHITECTURE ☁

Learning Objectives

- Obtain an overview of the NIST Big Data Reference Architecture.
- Understand that you can contribute to it as part of this class.

One of the major technical areas in the cloud is to define architectures that can work with Big Data. For this reason NIST has worked for some time on identifying how to create a data interoperability framework. The idea here is that architecture designers can pick services, combine them as part of their data pipeline, and integrate them in a convenient fashion into their solution.

Besides just being a high level description, NIST also encourages the verification of the architecture through interface specifications, especially those that are currently underway in Volume 8 of the document series. You have the unique opportunity to help shape this interface and contribute to it. We will provide you not only with mechanisms on how you theoretically can do this, but also on how you practically can contribute.

As part of your projects in 516 you will need to integrate a significant service that you can contribute to the NIST document in form of a specification and in form of an implementation.
9.1.1 Pathway to the NIST-BDRA

The NIST Big Data Public Working Group (NBD-PWG) was established as a collaboration between industry, academia and government "to create a consensus-based extensible Big Data Interoperability Framework (NBDIF) which is a vendor-neutral, technology- and infrastructure-independent ecosystem" [70]. It will be helpful for Big Data stakeholders such as data architects, data scientists, researchers, and implementers to integrate and utilize "the best available analytics tools to process and derive knowledge through the use of standard interfaces between swappable architectural components" [70]. The NBDIF is being developed in three stages:

- Stage 1: "Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic," [70] introduction of the Big Data Reference Architecture (NBD-RA);
- Stage 2: "Define general interfaces between the NBD-RA components with the goals to aggregate low-level interactions into high-level general interfaces and produce set of white papers to demonstrate how NBD-RA can be used" [70];
- Stage 3: "Validate the NBD-RA by building Big Data general applications through the general interfaces." [70]

NIST has developed the volumes listed in the Table: NIST BDRA Volumes that surround the creation of the NIST-BDRA. We recommend that you take a closer look at these documents, as in this section we provide a focussed summary with the aspect of cloud computing in mind.
Table: NIST BDRA Volumes

| Document         | Volume   | Title                                      |
|------------------|----------|--------------------------------------------|
| NIST SP 1500-1r1 | Volume 1 | Definitions                                |
| NIST SP 1500-2r1 | Volume 2 | Taxonomies                                 |
| NIST SP 1500-3r1 | Volume 3 | Use Cases and Requirements                 |
| NIST SP 1500-4r1 | Volume 4 | Security and Privacy                       |
| NIST SP 1500-5   | Volume 5 | Reference Architectures White Paper Survey |
| NIST SP 1500-6r1 | Volume 6 | Reference Architecture                     |
| NIST SP 1500-7r1 | Volume 7 | Standards Roadmap                          |
| NIST SP 1500-9   | Volume 8 | Reference Architecture Interface (new)     |
| NIST SP 1500-10  | Volume 9 | Adoption and Modernization (new)           |
9.1.2 Big Data Characteristics and Definitions

Volume 1 of the series introduces the community to common definitions that are used as part of the field of Big Data. This includes the analysis of characteristics such as volume, velocity, variety, and variability, and the use of structured and unstructured data. As part of the field of data science and engineering, it lists a number of areas that are believed to be essential and that a data scientist must master, including data structures, parallelism, metadata, flow rate, and visual communication. In addition we believe that an additional skill set must be prevalent that allows a data engineer to deploy such technologies onto actual systems.

We have submitted the following proposal to NIST:

3.3.6. Deployments:

A significant challenge exists for data engineers to develop architectures and their deployment implications. The volume of data and the processing power needed to analyze them may require many thousands of distributed compute resources. They can be part of private data centers, virtualized with the help of virtual machines or containers, and even utilize serverless computing to focus on the integration of Big Data Function as a Service based architectures. As such architectures are assumed to be large, community standards such as leveraging DevOps will be necessary for the engineers to set up and manage such architectures. This is especially important with the swift development of the field, which may require rolling updates without interruption of the services offered.

This addition reflects the newest insight into what a data scientist needs to know and the newest job trends that we observed.
To identify what big data is, we find the following characteristics:

Volume: Big for data means lots of bytes. This could be achieved in many different ways. Typically we look at the actual size of a data set, but also at how this data set is stored, for example in many thousands of smaller files that are part of the data set. It is clear that in many such cases analysis of a large volume of data will impact the architectural design for storage, but also the workflow on how this data is processed.

Velocity: We often see that big data is associated with high data flow rates caused, for example, by streaming data. It can however also be caused by functions that are applied to large volumes of data and need to be integrated quickly to return the result as fast as possible. Needs for real time processing as part of the quality of service offered also contribute to this. Examples of IoT devices that integrate data not only in the cloud, but also on the edge need to be considered.

Variety: In today's world we have many different data resources that motivate sophisticated data mashup strategies. Big data hence not only deals with information from one source but from a variety of sources. The architectures and services utilized are multiple and are needed to enable automated analysis while incorporating various data sources.

Another aspect of variety is that data can be structured or unstructured. NIST finds this aspect so important that they included its own section for it.

Variability: Any data will change over time. Naturally that is not an exception in big data, where data may have a time to live or needs to be updated in order not to be stale or obsolete. Hence one of the characteristics that big data could exhibit is that its data is variable and prone to changes.

In addition to these general observations we also have to address important characteristics that are attached to the data itself. This includes:

Veracity: Veracity refers to the accuracy of the data. Accuracy can be increased by adding metadata.

Validity: Refers to data that is valid. While data can be accurately measured, it could be invalid by the time it is processed.

Volatility: Volatility refers to the change in the data values over time.

Value: Naturally we can store lots of information, but if the information is not valuable then we may not need to store it. This has recently been seen as a trend, as some companies have transitioned data sets to the community because they do not provide enough value to the service provider to justify their prolonged maintenance.

In other cases the data has become so valuable that the services offered have been reduced, for example because the community poses too many resource demands. A good example is Google Scholar, which used to allow much more liberal use, while today its services are significantly scaled back for public users.
9.1.3 Big Data and the Cloud

While looking at the characteristics of Big Data it is obvious that Big Data is on the one hand a motivator for cloud computing, but on the other hand existing Big Data frameworks are a motivator for developing Big Data architectures a certain way.

Hence we always have to look from both sides towards the creation of architectures related to a particular application of big data.

This is also motivated by the rich history we have in the field of parallel and distributed computing. For a long time engineers have dealt with the issue of horizontal scaling, which is defined by adding more nodes or other resources to a cluster. Such resources may include:

- shared disk file systems,
- distributed file systems,
- distributed data processing and concurrency frameworks, such as concurrent sequential processes, workflows, MPI, map/reduce, or shared memory,
- resource negotiation to establish quality of service,
- data movement, and
- data tiers (as showcased in high energy physics by Ligo [71] and Atlas)

In addition to the horizontal scaling issues we also have to worry about the vertical scaling issues, that is, how the overall system architecture fits together to address an end-to-end use case. In such efforts we look at:

- interface designs,
- workflows between components and services,
- privacy of data and other security issues,
- reusability within other use-cases.

Naturally the cloud offers the ability to cloudify existing relational databases as cloud services while leveraging the increased performance and special hardware and software support that may be otherwise unaffordable for an individual user. However, we also see the explosive growth of NoSQL databases, because some of them can more effectively deal with the characteristics of big data than traditional, mostly well-structured, databases. In addition, many of these frameworks are able to introduce advanced capabilities such as distributed and reliable service integration.

Although we have become used to the term cloud for virtualized resources and the term Grid for a network of supercomputers in a virtual organization, we should not forget that cloud service providers also offer high performance computing resources for some of their most advanced users.

Naturally such resources can be used not only for numerically intensive computations but also for big data applications, as the physics community has demonstrated.
9.1.4 Big Data, Edge Computing and the Cloud

When looking at the number of devices that are being added daily to the global IT infrastructure, we observe that cell phones and soon Internet of Things (IoT) devices will produce the bulk of all data. However, not all data will be moved to the cloud: lots of data will be analyzed locally on the devices, or will not even be considered for upload to the cloud because it is projected to be of too low or too high value to be moved. Still, a considerable portion will put new constraints on the services we offer in the cloud, and any architecture addressing this must properly deal with scaling early on in the architectural design process.
9.1.5 Reference Architecture

Next we present the Big Data reference architecture. It is depicted in Figure 78. According to the document (Volume 2), the five main components representing the central roles include:

- System Orchestrator: Defines and integrates the required data application activities into an operational vertical system;
- Data Provider: Introduces new data or information feeds into the Big Data system;
- Big Data Application Provider: Executes a life cycle to meet security and privacy requirements as well as System Orchestrator-defined requirements;
- Big Data Framework Provider: Establishes a computing framework in which to execute certain transformation applications while protecting the privacy and integrity of data; and
- Data Consumer: Includes end users or other systems who use the results of the Big Data Application Provider.

In addition we recognize two fabric layers:

- Security and Privacy Fabric
- Management Fabric

Figure 78: NIST-BDRA (see Volume 2)

While looking at the actors depicted in Figure 79, we need to be aware that in each of the categories a service can be added. This is an important distinction to the original depiction in the definition, as it is clear that an automated service could act on behalf of the actors listed in each of the categories.

Figure 79: NIST Roles (see Volume 2)

For a detailed definition, which is beyond the scope of this document, we refer to Volume 2 of the documents.
9.1.6 Framework Providers

Traditionally cloud computing has started with offering IaaS, followed by PaaS and SaaS. We see the IaaS reflected in three categories for big data:

1. Traditional compute and network resources including virtualization frameworks
2. Data organization and distribution systems such as offered in indexed storage and file systems
3. Processing engines offering batch, interactive, and streaming services to provide computing and analytics activities

Messaging and communication takes place between these layers, while resource management is used to address efficiency.

Frameworks such as Spark and Hadoop include components from multiple of these categories to create a vertically integrated system. Often they are offered by a service provider. However, one needs to be reminded that such offerings may not be tailored to the individual use-case, and inefficiencies could be prevalent because the service offering is outdated, or because it is not explicitly tuned to the problem at hand.
9.1.7 Application Providers

The underlying infrastructure is reused by big data application providers supporting services and tasks such as:

- Data collection
- Data curation
- Data analytics
- Data visualization
- Data access

Through the interplay between these services, data consumers and data producers can be served.

9.1.8 Fabric

Security and general management are part of the governing fabric in which such an architecture is deployed.

9.1.9 Interface definitions

The interface definitions for the BDRA are specified in Volume 8. We are in the second phase of our document specification, in which we switch from a pure resource description to an OpenAPI specification. Before we can provide more details we need to introduce you to REST, which is an essential technology for many modern cloud computing services.
10 AI

10.1 ARTIFICIAL INTELLIGENCE SERVICE WITH REST ☁ {#sec:ai}

10.1.1 Unsupervised Learning

Keywords: clustering, kNN, Markov Model

Unsupervised learning is a learning method in which the training data is not labeled. This problem can be more challenging because we have no pre-knowledge and must find patterns in the data. Unsupervised learning can be very compute intensive and rests on complicated mathematical principles, but it is very useful. In this chapter, we will illustrate some of the most popular unsupervised learning algorithms and give examples of how to apply them, including KMeans, k-NN, Markov Models and others.

It is important to know that unsupervised learning is just one way of looking at a problem. Each algorithm is just an example of how we solve a particular problem. Before you apply an algorithm, attention should be given to the reason why we apply that specific algorithm.
10.1.2 KMeans

KMeans is one of the most straightforward unsupervised learning algorithms.

slides AI (40) Unsupervised Learning
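The slides carry the algorithmic details. As a complement, the following minimal sketch, assuming scikit-learn is installed (the slides do not prescribe a library), clusters the same four toy points used in the pyspark example earlier in this chapter.

# Minimal k-means sketch with scikit-learn (our choice of library).
# The four points mirror the earlier pyspark example and form two
# obvious groups.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=50).fit(points)

print(model.cluster_centers_)                   # one center per group
print(model.predict([[0.5, 0.5], [8.5, 8.5]]))  # two different labels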
10.1.3 Lab: Practice on AI

Keywords: Docker, REST Service, Spark

slides Practice on AI (40) REST services
10.1.4 k-NN

k-NN is a non-parametric statistical method, meaning there is no assumption made about the distribution of the data. Additionally, the distribution is not assumed to be fixed, i.e. the distribution may change through time. These relaxed assumptions make non-parametric tests extremely valuable when applied to real-world data, as a vast majority of real-world data have dynamic distributions through time; climate data is an obvious example. Non-parametric data is often ordinal, which means the variables have an inherent categorical order with unknown distances between the categories. A common example of a non-parametric statistical test is the sign test, where values are assigned a positive or negative sign based on being above or below the median. In k-NN, predictions are made about unknown values by matching the unknown values with similar known values. Naturally the determination of 'similar' is of fundamental importance. This is done through the application of the Euclidean distance calculation given by Equation 1.
$$d(i, j) = d(j, i) = \sqrt{(i_1 - j_1)^2 + (i_2 - j_2)^2 + \dots + (i_n - j_n)^2} = \sqrt{\sum_{k=1}^{n} (i_k - j_k)^2}$$
To illustrate an example of calculating similarity using Equation 1 it can bedetermiendifacarisfastornotbyusingthedatainTable1.LetspretendweknownothingaboutcarsandareaskedifwethinkaChevyCorvette isfastornot.
Car make and model with associated horsepower, whether the vehicle has aracingstripeandiftheauthorthinksthecarisfastornotTable1.
Table 1: Car Data

| Car Name              | Horsepower (HP) | Racing Stripe (Yes or No) | Fast (Yes or No) |
|-----------------------|-----------------|---------------------------|------------------|
| Toyota Prius          | 120             | 0                         | 0                |
| Tesla Roadster        | 288             | 0                         | 1                |
| Bugatti Veyron        | 1200            | 1                         | 1                |
| Honda Civic           | 158             | 1                         | 0                |
| Lamborghini Aventador | 695             | 1                         | 1                |
Now let's say our friend wants to know if a Ford Mustang with a racing stripe is fast or not. This particular friend knows nothing about cars, so decided to put analytics to work. Since a Mustang has roughly 300 horsepower, the closest car in our dataset to this is the Tesla Roadster, and since the Tesla is fast we would predict the Mustang to be fast. Remember, this is completely dependent on the author's initial classification of whether a car is fast or not. Clearly the Lamborghini and the Bugatti are fast, but maybe the Tesla is not fast, therefore giving an incorrect answer. An example using the Mustang and the Tesla is given in the next calculation:
$$d(i, j) = \sqrt{(300 - 288)^2 + (1 - 0)^2} = 12.04$$
We were able to determine the closest, or first nearest neighbor, by inspection of this data; however, with a more robust dataset this may not be the case. In these situations, to find the nearest neighbor the Euclidean distance is calculated for every unique row entry and then ordered from smallest to largest distance; naturally the smallest distances are the most similar. You may notice that the values of horsepower are significantly larger in magnitude than the values associated with racing stripes. This could be problematic in many real world scenarios where the columns associated with large values do not have as direct an impact as horsepower does on the variable we are trying to predict, a car being fast. In the case where each column value has equal predictive power, data normalization should be performed. This is the process of centering each column to a mean of zero (0) and a standard deviation of one (1). This is done by subtracting the column mean from each column entry and dividing by the column standard deviation.
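As a quick illustration of this normalization, here is a minimal sketch that applies the z-score formula to the numeric columns of Table 1; the column names are our own.

# Z-score normalization of the numeric columns of Table 1
# (the column names are illustrative).
import pandas as pd

cars = pd.DataFrame({
    "horsepower":    [120, 288, 1200, 158, 695],
    "racing_stripe": [0, 0, 1, 1, 1],
})

# subtract the column mean from each entry and divide by the
# column standard deviation
cars_normalized = (cars - cars.mean()) / cars.std()
print(cars_normalized)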
Determine for yourself: if we use 2 nearest neighbors, what would the prediction about the Mustang be, given the data provided? What about 3 or 4 nearest neighbors? What is the maximum number of k-nearest neighbors we could have given the dataset in Table 1?

Calculate the Euclidean distances for all five row entries with respect to the Mustang.

Normalize the data and recalculate the first and second nearest neighbors with respect to the Mustang. Does anything change?
In order to see k-NN in action we will look at an in depth example using adataset from the National Basketball Associated (NBA) from 2013, naturallytherearemoreuptodatedatasetsbutassportsanalyticsbecomesasignificantmarketmoreandmoredataisbecomingproprietary.ThisexamplewillpickanNBAplayer and determine themost similarNBAplayer in the dataset to theselected player using k nearest neighbors. The following is set up for you toexecuteinapythoncommandpromptlinebylineforinstructionalpurposes.
Thepreviousportionofcodeusespandas toopenthedownloadedcsvfileandnameitnba,naturallyyoucouldnamethefileanything.Ifyouwanttoviewthecolumnsinthecsvfilethefollowingcommandcanbeused.
Nowwe need to select a player from the dataset, wewill then determine themostsimilarplayer toourselectedplayer.Analysis like this isbecomingmoreandmore prevalent in professional sports due to the large amounts ofmoneyinvested in players. Scoutsmay use this type of analysis to determine who agivenprospect ismost similar too.This followingbit of code selects a playerfromthedataset.Noticethatthecolumnplayerisfirstselectedfollowedbytheplayername.
Thenextstepistoremoveanynon-numericcolumnsfromouranalysissinceweare using theEuclidean distance to calculate proximity and strings can not beevaluated insuchaway.One thingyoucando ifyouhavecolumns thathavevalueslikeyesandnoisassignzerosandonesaccordingly.Inourcasewewillonlyselectthecolumnswithnumericvalues.
# This code was adapted from Dataquest - K nearest neighbors in Python:
# Written by: Vik Paruchuri

import pandas
import math

with open("/path/to/the/nba_2013.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)

print(nba.columns.values)

selected_player = nba[nba["player"] == "LastNameFirstName"].iloc[0]

numeric_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p',
                   'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.',
                   'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']
We now have everything we need to calculate the Euclidean distance. There are built-in functions available in python to calculate this, however we will define our own, as it is a straightforward computation. It is also good practice to define your own functions whenever possible.

Applying our function using the following command will determine the Euclidean distance between the selected player and all other players in the dataset.

For the sake of argument we will assume that all the data columns have equal predictive capabilities, so we wish to normalize. This will often be the case with sports statistics, as total points and field goal percentage vary in magnitude significantly, but total points does not necessarily hold more predictive power than field goal percentage. In order to normalize we again must only select the numeric columns, as text columns can not be normalized in the way described previously.

We can now use built-in functions to calculate the nearest neighbors in order to compare to our results attained from the previous exercise. In case you did not notice, the selected_player_distance array is an array that lists all the Euclidean distances. We will use this later to see if the same result is obtained by using the built-in functions. First we will import the necessary libraries, shown in the next code.

If you inspected the selected_player_distance array you would have noticed that there were several NaN's present; this was due to having an incomplete dataset and must be avoided. The following bit of code will replace all NA entries with zeros.
def euclidean_distance(row):
    """
    Define our own Euclidean distance function
    """
    euc_distance = 0
    for k in numeric_columns:
        euc_distance += (row[k] - selected_player[k]) ** 2
    return math.sqrt(euc_distance)

selected_player_distance = nba.apply(euclidean_distance, axis=1)

nba_numeric = nba[numeric_columns]

# apply normalization formula using built in python math
# functions for the mean and standard deviation
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()

from scipy.spatial import distance

nba_normalized.fillna(0, inplace=True)
Using the built-in Euclidean distance, we determine the Euclidean distances of all players in the dataset to our selected player.

Here we create a dataframe to hold the distances and then sort the values from lowest to highest. Since our player will naturally be in the dataset, the selected player will have the lowest value; therefore the second lowest value is associated with the player most closely related to our selected player.

In the python prompt type:

The most similar player to your selected player should appear.
10.1.5 Machine Learning and Cloud Services

10.1.5.1 Introduction and Regression

This video lecture covers logistic and linear regression models in addition to clustering models. The algorithms for the three methods are formalized and solutions are presented. Additionally, visualization techniques are introduced, including WebPlotViz and matplotlib.

Introduction to Machine Learning for Cloud Services and Regression 10:55
10.1.5.2K-meansClustering
VideolectureandslidescoveranintroductiontoK-meansclusters.
K-meansClusters17:15
10.1.5.3Visulization
# .iloc[0] reduces the single-row frame to a 1-D vector for scipy
player_normalized = nba_normalized[nba["player"] == "LastNameFirstName"].iloc[0]
euclidean_distances = nba_normalized.apply(
    lambda row: distance.euclidean(row, player_normalized), axis=1)

distance_frame = pandas.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)

second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_player = nba.loc[int(second_smallest)]["player"]
most_similar_to_player
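The same lookup can be cross-checked with a library implementation. The following sketch uses scikit-learn's NearestNeighbors class, assuming scikit-learn is installed, that nba uses its default integer index, and that nba_normalized and player_normalized are defined as above:

from sklearn.neighbors import NearestNeighbors

# fit a 2-nearest-neighbor model; the closest neighbor of any
# player is the player itself, so we look at the second match
nn = NearestNeighbors(n_neighbors=2, metric='euclidean')
nn.fit(nba_normalized)
distances, indices = nn.kneighbors([player_normalized])
print(nba.loc[indices[0][1]]["player"])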
10.1.5 Machine Learning and Cloud Services

10.1.5.1 Introduction and Regression

This video lecture covers logistic and linear regression models in addition to clustering models. The algorithms for the three methods are formalized and solutions are presented. Additionally, visualization techniques are introduced, including WebPlotViz and matplotlib.

Introduction to Machine Learning for Cloud Services and Regression 10:55

10.1.5.2 K-means Clustering

Video lecture and slides cover an introduction to K-means clustering.

K-means Clusters 17:15

10.1.5.3 Visualization

Video lecture and slides cover data visualization techniques using state-of-the-science tools like WebPlotViz.

Data Visualization 30:10
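As a minimal illustration with matplotlib, the following sketch (made-up data, hypothetical group names) produces the kind of labeled scatter plot discussed in the lecture:

import numpy as np
import matplotlib.pyplot as plt

# two made-up point clouds to illustrate a labeled scatter plot
rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(50, 2))
b = rng.normal(4, 1, size=(50, 2))

plt.scatter(a[:, 0], a[:, 1], label="group a")
plt.scatter(b[:, 0], b[:, 1], label="group b")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.legend()
plt.show()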
10.1.5.4 Clustering Examples

Video lecture and slides cover clustering examples.

Examples of Clustering 5:48
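As a concrete illustration, K-means from scikit-learn can be applied to the normalized NBA statistics prepared in the exercise above. This is a sketch; the choice of five clusters is arbitrary:

from sklearn.cluster import KMeans

# assign every player to one of five clusters based on the
# normalized statistics from the earlier exercise
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(nba_normalized)
print(labels[:10])  # cluster labels of the first ten players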
10.1.5.5 General Clustering with Examples

Video lecture and slides take a generalized approach to clustering with examples from Dr. Geoffrey Fox's research.

General Clustering and Research Examples 22:28
10.1.5.6 In-Depth Example with Four Centers

Video lecture and slides use 1000 data points and four artificial centers to provide an in-depth example of clustering. Code is provided.

Example with four centers 20:53
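The lecture supplies its own code; as a rough stand-in, the following sketch generates 1000 points around four artificial centers with scikit-learn's make_blobs and recovers the centers with K-means:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1000 points scattered around four artificial centers
X, y = make_blobs(n_samples=1000, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.cluster_centers_)  # estimates of the four true centers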
10.1.5.7 Parallel Computing and K-means

Video lecture and slides discuss parallel computing, using K-means as an example of how to accelerate time to completion by exploiting modern computing hardware.

Parallel Computing and K-means
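As a small, hedged sketch of the idea in Python: the assignment step of K-means, the distance computation that dominates each iteration, can be spread over several cores with joblib (made-up data; the chunk and worker counts are arbitrary):

import numpy as np
from joblib import Parallel, delayed

def nearest_center(chunk, centers):
    # index of the closest center for every point in the chunk
    d = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# made-up data: 100000 points and four centers
rng = np.random.default_rng(0)
points = rng.normal(size=(100000, 2))
centers = rng.normal(size=(4, 2))

# split the points across eight workers; each labels its own chunk
chunks = np.array_split(points, 8)
labels = np.concatenate(
    Parallel(n_jobs=8)(delayed(nearest_center)(c, centers) for c in chunks))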
10.1.6 Example Project with SVM

The following code is set up as an example project and will show how to use a RESTful service to download data. Additionally, the differences between a dynamic and a static API will be showcased. First we begin by importing the appropriate libraries.

Next we define three functions required to run this example: a function to download the data; a function to partition the data; and a function to get the data into the appropriate format once downloaded.

Defining the first API endpoint with the following lines of code will allow the user to expose the API. Prove this to yourself by opening a browser, preferably Google Chrome, and following the URL shown after the code.
import requests
from flask import Flask, request
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

app = Flask(__name__)

def download_data(url, filename):
    r = requests.get(url, allow_redirects=True)
    open(filename, 'wb').write(r.content)

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train'
    test_file = filename + '_test'
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()

def get_data(filename):
    data = load_svmlight_file(filename)
    return data[0], data[1]

@app.route('/')
def index():
    return "Demo Project!"

if __name__ == '__main__':
    app.run(debug=True)
Now open the application in your browser with

http://127.0.0.1:5000/

The first API endpoint we will define is the endpoint to download the data, which is done by the following lines of code. Note that the URL of the dataset is hardcoded into this portion of the code, as passing URLs to an API is not good practice.
The following three API endpoints use the data partition and get data functions defined previously. The partition function splits the dataset into two sections, testing and training. In this example the testing portion is 20% and the training portion is 80% of the dataset. Later we will explore how to make this part dynamic, allowing the user to choose the partitioning percentage.

The last bit of code is the implementation of the SVM as a RESTful API endpoint. Again, this is static and all parameters needed to tune the algorithm are hardcoded. It will be worth your time to extrapolate from the discussion about dynamic APIs in order to make these parameters tunable by the user through the URL.
@app.route('/api/download/data')
def download():
    url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale'
    # note: the glass dataset is saved under the filename iris.scale
    download_data(url=url, filename='iris.scale')
    return "Data Downloaded"

@app.route('/api/data/partition')
def partition():
    data_partition('iris.scale', 0.8)
    return "Successfully Partitioned"

@app.route('/api/get/data/test')
def gettestdata():
    Xtest, ytest = get_data("iris.scale_test")
    return "Return Xtest and ytest arrays"

@app.route('/api/get/data/train')
def gettraindata():
    Xtrain, ytrain = get_data("iris.scale_train")
    return "Return Xtrain and ytrain arrays"

@app.route('/api/experiment/svm')
def svm():
    Xtrain, ytrain = get_data("iris.scale_train")
    Xtest, ytest = get_data("iris.scale_test")
    clf = SVC(gamma=0.001, C=100, kernel='linear')
    clf.fit(Xtrain, ytrain)
    test_size = Xtest.shape[0]
    accuracy_holder = []
    for i in range(0, test_size):
        prediction = clf.predict(Xtest[i])
        print("Prediction from SVM: " + str(prediction) +
              ", Expected Label: " + str(ytest[i]))
        accuracy_holder.append(prediction[0] == ytest[i])
    correct_predictions = sum(accuracy_holder)
    print(correct_predictions)
    total_samples = test_size
    accuracy = float(correct_predictions) / float(total_samples) * 100
    print("Prediction Accuracy: " + str(accuracy))
    return "Prediction Accuracy: " + str(accuracy)

In order to run this you need to make a directory in a location of your choice and create a file called main.py that contains the code included here. Then simply type the following command in a terminal in which you have navigated to the location of the directory that you created.

python main.py

A continuous version of main.py is provided next for ease of use. Please be careful when copying and pasting, as additional characters may show up; this was noticed in the URL sections.
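Also, once the server is running, the endpoints can be exercised from a second Python session. The following sketch simply calls the routes defined above in order:

import requests

base = "http://127.0.0.1:5000"
# download, partition, then train and score the SVM
print(requests.get(base + "/api/download/data").text)
print(requests.get(base + "/api/data/partition").text)
print(requests.get(base + "/api/experiment/svm").text)

The full main.py now follows.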
import requests
from flask import Flask, request
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

app = Flask(__name__)

def download_data(url, filename):
    r = requests.get(url, allow_redirects=True)
    open(filename, 'wb').write(r.content)

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train'
    test_file = filename + '_test'
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()

def get_data(filename):
    data = load_svmlight_file(filename)
    return data[0], data[1]

@app.route('/')
def index():
    return "Demo Project!"

@app.route('/api/download/data')
def download():
    url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale'
    download_data(url=url, filename='iris.scale')
    return "Data Downloaded"

@app.route('/api/data/partition')
def partition():
    data_partition('iris.scale', 0.8)
    return "Successfully Partitioned"

@app.route('/api/get/data/test')
def gettestdata():
    Xtest, ytest = get_data("iris.scale_test")
return"ReturnXtestandYtestarrays"
@app.route('/api/get/data/train')
defgettraindata():
Xtrain,ytrain=get_data("iris.scale_train")
return"ReturnXtrainandYtrainarrays"
@app.route('/api/experiment/svm')
defsvm():
Xtrain,ytrain=get_data("iris.scale_train")
Xtest,ytest=get_data("iris.scale_test")
clf=SVC(gamma=0.001,C=100,kernel='linear')
clf.fit(Xtrain,ytrain)
test_size=Xtest.shape[0]
accuarcy_holder=[]
foriinrange(0,test_size):
prediction=clf.predict(Xtest[i])
print("PredictionfromSVM:"+str(prediction)+",Expected
Label:"+str(ytest[i]))
accuarcy_holder.append(prediction==ytest[i])
correct_predictions=sum(accuarcy_holder)
print(correct_predictions)
total_samples=test_size
accuracy=
float(float(correct_predictions)/float(total_samples))*100
print("PredictionAccuracy:"+str(accuracy))
return"PredictionAccuracy:"+str(accuracy)
if__name__=='__main__':
app.run(debug=True)
As mentioned previously, these are examples of static API endpoints. In many scenarios having a dynamic API would be preferred. Let us explore the data partition endpoint and modify the code of the static version to make a dynamic version. The next part is the function definition for the dynamic version of the data_partition function, and not much has changed. The only change made is that strings are appended to the testing and training file names for convenience. The ratio will match the user-defined ratio entered through the URL.

def data_partition(filename, ratio):
    file = open(filename, 'r')
    training_file = filename + '_train_' + str(ratio)
    test_file = filename + '_test_' + str(ratio)
    data = file.readlines()
    count = 0
    size = len(data)
    ftrain = open(training_file, 'w')
    ftest = open(test_file, 'w')
    for line in data:
        if count < int(size * ratio):
            ftrain.write(line)
        else:
            ftest.write(line)
        count = count + 1
    ftrain.close()
    ftest.close()
    file.close()
Now for defining the endpoint: naturally, it starts the same way as the static version, however now we must add a part that allows the user to enter values. This is done by the use of brackets, <text>.

@app.route('/api/data/partition/<filename>/ratio/<ratio>')
def partition(filename, ratio):
    ratio = float(ratio)
    # assumes the data file lives in a directory called data
    path = 'data/' + filename
    data_partition(path, ratio)
    return "Successfully Partitioned"
11 REFERENCES
[1] L. Richardson, "Beautiful soup python package overview." Web Page, 2019 [Online]. Available: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[2] C. Wodehouse, "Should you use mongodb? A look at the leading nosql database." Web Page, 2018 [Online]. Available: https://www.upwork.com/hiring/data/should-you-use-mongodb-a-look-at-the-leading-nosql-database/

[3] Guru99, "Introduction to mongodb." Web Page, 2018 [Online]. Available: https://www.guru99.com/mongodb-tutorials.html#1

[4] MongoDB, "Https://www.mongodb.com/." Web Page, 2018 [Online]. Available: https://docs.mongodb.com/manual/introduction/

[5] M. Papiernik, "How to install mongodb on ubuntu 18.04." Web Page, Jun-2018 [Online]. Available: https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-18-04

[6] J. Ellingwood, "Initial server setup with ubuntu 18.04." Web Page, Apr-2018 [Online]. Available: https://www.digitalocean.com/community/tutorials/initial-server-setup-with-ubuntu-18-04

[7] MongoDB, Databases and collections, 4.0 ed. New York, New York, USA: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/core/databases-and-collections/

[8] J. M. Craig Buckler, "Using joins in mongodb nosql databases." Web Page, Sep-2016 [Online]. Available: https://www.sitepoint.com/using-joins-in-mongodb-nosql-databases/

[9] MongoDB, Lookup (aggregation), 3.2 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/

[10] MongoDB, MongoDB package components - mongoexport, 4.0 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/program/mongoexport/

[11] MongoDB, Security, 4.0 ed. New York City, New York, United States: MongoDB Inc, 2008 [Online]. Available: https://docs.mongodb.com/manual/security/

[12] MongoDB, "MongoDB atlas." Web Page, 2018 [Online]. Available: https://www.mongodb.com/cloud/atlas

[13] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/api

[14] A. J. J. Davis, "Announcing pymongo 3." Web Page, Apr-2015 [Online]. Available: https://emptysqua.re/blog/announcing-pymongo-3/

[15] M. Dirolf, "PyMongo." Web Page, Jul-2018 [Online]. Available: https://github.com/mongodb/mongo-python-driver

[16] N. Leite, "MongoDB and python." Web Page, Mar-2015 [Online]. Available: https://www.slideshare.net/NorbertoLeite/mongodb-and-python

[17] V. Oleynik, "How do you use mongodb with python?" Web Page, Mar-2017 [Online]. Available: https://gearheart.io/blog/how-do-you-use-mongodb-with-python/

[18] I. MongoDB, "Installing / upgrading." Web Page, 2008 [Online]. Available: http://api.mongodb.com/python/current/installation.html

[19] R. Python, "Introduction to mongodb and python." Web Page, 2016 [Online]. Available: https://realpython.com/introduction-to-mongodb-and-python/

[20] W3Schools, "Python mongodb create database." Web Page, 1999 [Online]. Available: https://www.w3schools.com/python/python_mongodb_create_db.asp

[21] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/tutorial.html

[22] N. O'Higgins, PyMongo & python. O'Reilly, 2011 [Online]. Available: http://img105.job1001.com/upload/adminnew/2015-04-07/1428393873-MHKX3LN.pdf

[23] I. MongoDB, "PyMongo 3.7.1 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/examples/aggregation.html

[24] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
[25] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://docs.mongodb.com/manual/core/map-reduce/

[26] MongoDB, "PyMongo v2.0 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/2.0/examples/map_reduce.html

[27] MongoDB, "PyMongo 3.7.2 documentation." Web Page, 2008 [Online]. Available: https://api.mongodb.com/python/current/examples/copydb.html

[28] MongoEngine, "MongoEngine user documentation." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/

[29] Wikipedia, "Object-relational mapping." Web Page, May-2009 [Online]. Available: https://en.wikipedia.org/wiki/Object-relational_mapping

[30] MongoDB, "Flask-mongoengine." Web Page, 2008 [Online]. Available: http://docs.mongoengine.org/guide/defining-documents.html

[31] MongoEngine, "User guide: Document instances." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/document-instances.html

[32] MongoEngine, "2.1 installing mongoengine." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/installing.html

[33] MongoEngine, "2.2 connection to mongodb." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/connecting.html

[34] MongoEngine, "User guide 2.5: Querying the database." Web Page, 2009 [Online]. Available: http://docs.mongoengine.org/guide/querying.html

[35] Wikipedia, "Flask (web framework)." Web Page, 2010 [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)

[36] MongoDB, "Flask-pymongo." Web Page, 2008 [Online]. Available: https://flask-pymongo.readthedocs.io/en/latest/

[37] MongoDB, "Flask mongoalchemy." Web Page, 2008 [Online]. Available: https://pythonhosted.org/Flask-MongoAlchemy/

[38] MongoDB, "Flask-mongoengine." Web Page, 2008 [Online]. Available: http://docs.mongoengine.org/projects/flask-mongoengine/en/latest/

[39] Wikipedia, "Flask (web framework)." Web Page, Oct-2018 [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)

[40] R. T. Fielding and R. N. Taylor, Architectural styles and the design of network-based software architectures, vol. 7. University of California, Irvine Doctoral dissertation, 2000.

[41] Wikipedia, "Representational state transfer." Web Page, 2019 [Online]. Available: https://en.wikipedia.org/wiki/Representational_state_transfer

[42] OpenAPI Initiative, "The openapi specification." Web Page [Online]. Available: https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md

[43] OpenAPI Initiative, "The openapi specification." Web Page [Online]. Available: https://github.com/OAI/OpenAPI-Specification

[44] RAML, "RAML version 1.0: RESTful api modeling language." Web Page [Online]. Available: https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

[45] R. H. Kevin Burke Kyle Conroy, "Flask-restful." Web Page [Online]. Available: https://flask-restful.readthedocs.io/en/latest/

[46] E. O. Ltd, "Django rest framework." Web Page [Online]. Available: https://www.django-rest-framework.org/

[47] S. Software, "API development for everyone." Web Page [Online]. Available: https://swagger.io

[48] S. Software, "Swagger codegen documentation." Web Page [Online]. Available: https://swagger.io/docs/open-source-tools/swagger-codegen/

[49] A. Y. W. Hate, "OpenAPI.Tools." Web Page [Online]. Available: https://openapi.tools/
[50] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.edureka.co/blog/mapreduce-tutorial/?utm_source=youtube&utm_campaign=mapreduce-tutorial-161216-wr&utm_medium=description
[51] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.youtube.com/watch?v=SqvAaB3vK8U&list=WL&index=25&t=2547s
[52] “Apache mapreduce.” Aug-2019 [Online]. Available:https://www.ibm.com/analytics/hadoop/mapreduce
[53] Wikipedia, “MapReduce.” Aug-2019 [Online]. Available:https://en.wikipedia.org/wiki/MapReduce
[54] “Hadoop mapreduce.” Aug-2019 [Online]. Available:https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
[55] A. Khan, “Hadoop and spark.” Aug-2019 [Online]. Available:https://www.quora.com/What-is-the-difference-between-Hadoop-and-Spark.[Accessed:03-Sep-2017]
[56] “Apache spark vs hadoop mapreduce.” Aug-2019 [Online]. Available:https://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce/
[57] Twister, “Twister2: Twister2 Big Data Hosting Environment: Acomposable framework for high-performance data analytics.”Web Page, Feb-2017[Online].Available:https://twister2.gitbook.io/twister2/
[58] Twister, “Twister2: Twister2 Big Data Hosting Environment: Acomposable framework for high-performance data analytics.”Web Page, Feb-2017[Online].Available:https://github.com/DSC-SPIDAL/twister2/
[59]Twister,“Twister2wordcountexample.”Aug-2019.
[60] Twister, “Task examples.” Web Page, Feb-2017 [Online]. Available:https://twister2.gitbook.io/twister2/examples/task_examples
[61] Twister, “Communication Model.” Web Page, Feb-2017 [Online].Available:https://twister2.gitbook.io/twister2/concepts/communication/communication-model
[62]S.Kamburugamuveetal.,“Twister:Net-communicationlibraryforbigdataprocessing in hpc and cloud environments,” in 2018 ieee 11th internationalconferenceoncloudcomputing(cloud),2018,pp.383–391.
[63] Twister2, “Kmeans performance comparison.” Web Page, Jan-2019[Online].Available:https://twister2.gitbook.io/twister2/
[64] Twister, “Twister Examples.” Web Page, Feb-2017 [Online]. Available:https://twister2.gitbook.io/twister2/examples
[65] Docker, “Overview of docker hub.” Web Page, Mar-2018 [Online].Available:https://docs.docker.com/docker-hub/
[66]S.Bhartiya,“Howtousedockerhub.”Blog,Jan-2018[Online].Available:https://www.linux.com/blog/learn/intro-to-linux/2018/1/how-use-dockerhub
[67] Docker, “Repositories on docker hub.” Web Page, Mar-2018 [Online].Available:https://docs.docker.com/docker-hub/repos/
[68] R. Irani, “Docker tutorial series-part 4-docker hub.” Blog, Jul-2015[Online].Available: https://rominirani.com/docker-tutorial-series-part-4-docker-
hub-b51fb545dd8e
[69]G.M.Kurtzer,“SingularityContainersforScience.”Presentation,Jan-2019[Online]. Available: http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43
[70]NationalInstituteofStandars,“NISTbigdatapublicworkinggroup.”Aug-2019[Online].Available:https://bigdatawg.nist.gov/
[71] LIGO, “Ligo data grid.” Sep-2019 [Online]. Available: https://www.lsc-group.phys.uwm.edu/lscdatagrid/overview.html