
Post on 20-May-2020


Python for Data Analysis and Machine Learning

DENG, Yang

ydeng@se.cuhk.edu.hk

Overview

§ Environment Preparation for Python
§ Python Libraries for Data Scientists
§ Data Processing & Visualization Using Python
§ Python for Basic Machine Learning Models

Environment Preparation for Python

In this tutorial, we adopt the following for the Python environment:
◦ Anaconda (https://www.anaconda.com/)
◦ Jupyter Notebook (https://jupyter.org/)

Other alternatives:
◦ Text editor + command line
◦ IDE (Integrated Development Environment): PyCharm, VS Code, …

What is Anaconda?

The open-source Anaconda distribution is the easiest way to perform Python/R data science and machine learning on Linux, Windows, and Mac OS X. With over 19 million users worldwide, it is the industry standard for developing, testing, and training on a single machine, enabling individual data scientists to:
§ Quickly download 7,500+ Python/R data science packages
§ Analyze data with scalability and performance with Dask, NumPy, pandas, and Numba
§ Visualize results with Matplotlib, Bokeh, Datashader, and Holoviews
§ Develop and train machine learning and deep learning models with scikit-learn, TensorFlow, and Theano

Anaconda Installation

Please follow the instructions here to install Anaconda (for Python 3.7):

https://www.anaconda.com/distribution/#download-section

It provides different versions to suit different operating systems. Please select the one you are using.

Just install according to the default settings, and the environment variables will be configured automatically after installation.

What is Jupyter Notebook?

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Jupyter Notebook is included in Anaconda.

Basic Operation on Jupyter Notebook

After installing Anaconda, open Anaconda-Navigator and find Jupyter Notebook in it. Then click Launch.

Jupyter Notebook is presented as a website. Select the path, then under the button “New”, choose “Python 3” to open a new Python file.

Type the code into the input box on Jupyter.

Get started learning Python: https://www.learnpython.org/

Click “Run”. The output will be shown in the blank area right below the input box.

Jupyter Notebook will help you save your code automatically in “.ipynb” format. The code can also be saved in “.py” format; here, we just use the “.ipynb” format.

Python Libraries for Data Scientists

Python toolboxes/libraries for data processing:
◦ NumPy
◦ SciPy
◦ Pandas

Visualization libraries:
◦ matplotlib
◦ Seaborn

Machine learning & deep learning:
◦ Scikit-learn
◦ Tensorflow/Pytorch/Theano

and many more…

NumPy:
§ introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
§ provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
§ many other Python libraries are built on NumPy

Link: http://www.numpy.org/

SciPy:
§ collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
§ part of the SciPy Stack
§ built on NumPy

Link: https://www.scipy.org/scipylib/

Pandas:
§ adds data structures (Series and DataFrame) and tools designed to work with table-like data (similar to data frames in R)
§ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
§ allows handling missing data

Link: http://pandas.pydata.org/

matplotlib:
§ Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
§ a set of functionalities similar to those of MATLAB
§ line plots, scatter plots, bar charts, histograms, pie charts etc.
§ relatively low-level; some effort needed to create advanced visualizations

Link: https://matplotlib.org/

Seaborn:
§ based on matplotlib
§ provides a high-level interface for drawing attractive statistical graphics
§ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

SciKit-Learn:
§ provides machine learning algorithms: classification, regression, clustering, model validation etc.
§ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/

Loading Python Libraries

Press Shift+Enter to execute the Jupyter cell, or just click “Run”.
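The code cell from this slide did not survive extraction. A minimal sketch of a typical first cell follows, assuming the conventional aliases np, pd and plt, which are a community habit rather than something the slide confirms. Seaborn is usually imported the same way, as sns.

```python
# Conventional aliases (an assumption; any alias works)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Printing the versions is a quick check that the environment is set up
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)
```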

Reading data using pandas

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None, na_values=['NA'])

pd.read_stata('myfile.dta')

pd.read_sas('myfile.sas7bdat')

pd.read_hdf('myfile.h5','df')
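The commands above cover other formats, which suggests that the missing code cell read a CSV file. The sketch below shows that common case with pd.read_csv. The column names and values are hypothetical, and an in-memory string stands in for a file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """age,job,balance
58,management,2143
44,technician,29
33,entrepreneur,2
"""

# read_csv accepts a file path, a URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, the call would take the path instead, for example pd.read_csv('myfile.csv').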

Exploring data frames

✓ Try to read the first 10, 20, 50 records
✓ Try to view the last few records
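One way to try the two exercises above is with head() and tail(). The frame below is a hypothetical 100-row example, not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical frame with 100 rows, for illustration only
df = pd.DataFrame({"age": range(20, 120), "balance": range(0, 1000, 10)})

first_10 = df.head(10)   # first 10 records
first_50 = df.head(50)   # first 50 records
last_few = df.tail()     # last 5 records (the default when n is omitted)

print(first_10)
print(last_few)
```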

DataFrame data types

Pandas Type | Native Python Type | Description
object | string | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64 | int | Integer values. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs, pandas will default to float64, because NaN is itself a floating-point value.
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
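The types in the table can be inspected with the dtype and dtypes attributes. This is a small sketch on a hypothetical frame; the column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "technician", "services"],  # strings (object dtype in most pandas versions)
    "age": [30, 41, 25],                          # integers -> int64
    "balance": [1200.5, np.nan, 300.0],           # numbers with a NaN -> float64
})

print(df["age"].dtype)   # type of a single column
print(df.dtypes)         # types of all columns
```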


DataFrames attributes

Python objects have attributes and methods.

df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | numpy representation of the data
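The attributes in the table can be tried on any frame. The sketch below uses a hypothetical 3-by-2 frame. Note that attributes are read without parentheses.

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 41, 25], "balance": [1200.5, 80.0, 300.0]})

print(df.columns)   # column names
print(df.axes)      # [row labels, column names]
print(df.ndim)      # number of dimensions: 2
print(df.size)      # number of elements: 3 rows x 2 columns = 6
print(df.shape)     # (3, 2)
print(df.values)    # the data as a NumPy array
```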

DataFrames methods

df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values

Unlike attributes, Python methods are called with parentheses. All attributes and methods can be listed with the dir() function: dir(df)
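A short sketch of the methods listed above, run on a hypothetical frame with one missing value so that dropna() has something to remove.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "balance": [1200.0, np.nan, 300.0, 500.0],
})

print(df.describe())   # count, mean, std, min, quartiles, max per column
print(df.mean())       # column means; NaN values are skipped by default
print(df.sample(2))    # two randomly chosen rows
clean = df.dropna()    # copy of df without the row that contains NaN

# dir() lists attributes and methods; show a few public names
print([name for name in dir(df) if not name.startswith("_")][:5])
```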

Selecting a column in a DataFrame

Method 1: subset the data frame using the column name in brackets, e.g. df['column_name'].
Method 2: use the column name as an attribute, e.g. df.column_name.

Note: If the column name coincides with the name of an existing DataFrame attribute or method, use method 1.

E.g., DataFrame already has a built-in method named rank, so to select the column ‘rank’ we should use df[‘rank’]. Method 2, i.e. df.rank, returns the DataFrame's own rank method instead of the column “rank”.
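The two selection methods and the name clash described in the note can be sketched as follows. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose column names include "rank"
df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]})

salary_1 = df["salary"]   # method 1: bracket notation
salary_2 = df.salary      # method 2: attribute notation, same result here

rank_col = df["rank"]     # the column named "rank"
rank_attr = df.rank       # the built-in DataFrame.rank method, not the column

print(type(rank_col))
print(type(rank_attr))
```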

DataFrames groupby method

Using the "groupby" method we can:

• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group

Once a groupby object is created, we can calculate various statistics for each group.

Note: If single brackets are used to specify the column (e.g. age), then the output is a Pandas Series object. When double brackets are used, the output is a DataFrame (e.g. age & balance).
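A minimal sketch of the split-then-summarize pattern and of the single versus double bracket behaviour from the note. The grouping column job is an assumption. The columns age and balance follow the note, but the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "admin", "technician", "technician"],
    "age": [30, 40, 25, 35],
    "balance": [100.0, 300.0, 50.0, 150.0],
})

grouped = df.groupby("job")                       # split the rows by job

mean_age = grouped["age"].mean()                  # single brackets -> Series
mean_both = grouped[["age", "balance"]].mean()    # double brackets -> DataFrame

print(mean_age)
print(mean_both)
```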

groupby performance notes:

- no grouping/splitting occurs until it's needed. Creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may want to pass sort=False for a potential speedup.
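The effect of sort=False is visible in the order of the group keys. With the default, keys come back sorted. With sort=False they keep their order of first appearance. The data is hypothetical, and any speedup only matters on much larger frames.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["technician", "admin", "technician", "admin"],
    "balance": [50.0, 100.0, 150.0, 300.0],
})

sorted_keys = df.groupby("job")["balance"].mean()                # keys sorted alphabetically
unsorted_keys = df.groupby("job", sort=False)["balance"].mean()  # keys in order of first appearance

print(list(sorted_keys.index))
print(list(unsorted_keys.index))
```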

DataFrame: filtering

To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, to subset the rows in which the age value is greater than 50, we can write df[df['age'] > 50].

Any comparison operator can be used to subset the data: > greater; >= greater or equal; < less; <= less or equal; == equal; != not equal.
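A short sketch of Boolean indexing on a hypothetical frame. The comparison builds a True/False mask, and indexing with the mask keeps only the True rows.

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 55, 62, 45], "balance": [100.0, 900.0, 250.0, 40.0]})

mask = df["age"] > 50        # Boolean Series, one value per row
older = df[mask]             # rows where age is greater than 50
not_45 = df[df["age"] != 45] # any comparison operator works the same way

print(older)
```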

DataFrames: Slicing

There are a number of ways to subset the DataFrame:
• one or more columns
• one or more rows
• a subset of rows and columns

Rows and columns can be selected by their position or label.

When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame).

When we need to select more than one column and/or make the output a DataFrame, we should use double brackets.
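The difference between single and double brackets can be checked by looking at the type of the result. The frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 41, 25],
    "job": ["admin", "technician", "services"],
    "balance": [1200.5, 80.0, 300.0],
})

one_col = df["age"]                 # single brackets -> Series
one_col_frame = df[["age"]]         # double brackets -> DataFrame with one column
two_cols = df[["age", "balance"]]   # double brackets -> DataFrame with two columns

print(type(one_col).__name__, type(one_col_frame).__name__)
```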

DataFrames: Selecting rows

If we need to select a range of rows, we can specify the range using ":".

Notice that the first row has position 0, and the last value in the range is omitted. So for the range 0:10, the first 10 rows are returned, with positions starting at 0 and ending at 9.
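The 0:10 range described above can be verified on a hypothetical 20-row frame.

```python
import pandas as pd

df = pd.DataFrame({"value": range(100, 120)})   # 20 rows at positions 0..19

first_10 = df[0:10]   # positions 0 through 9; position 10 is excluded

print(first_10)
```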

DataFrames: method loc

If we need to select a range of rows using their labels, we can use loc.
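A sketch of label-based selection with loc on a hypothetical frame whose row labels start at 10. Unlike position-based slicing, a label slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30, 41, 25, 52, 47],
        "job": ["admin", "technician", "services", "admin", "retired"],
        "balance": [100.0, 80.0, 300.0, 500.0, 20.0],
    },
    index=[10, 11, 12, 13, 14],   # row labels, not positions
)

# Rows labelled 11 through 13 (inclusive) and two columns selected by name
subset = df.loc[11:13, ["age", "balance"]]

print(subset)
```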

DataFrames: method iloc

If we need to select a range of rows and/or columns using their positions, we can use iloc.

DataFrames: method iloc (summary)

df.iloc[0]   # First row of a data frame
df.iloc[i]   # (i+1)th row
df.iloc[-1]  # Last row

df.iloc[:, 0]   # First column
df.iloc[:, -1]  # Last column

df.iloc[0:7]            # First 7 rows
df.iloc[:, 0:2]         # First 2 columns
df.iloc[1:3, 0:2]       # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]]   # 1st and 6th rows and 2nd and 4th columns
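A few of the summary expressions above, run on a hypothetical 10-by-4 frame so the selected values can be checked by eye.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(0, 10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

first_row = df.iloc[0]                # Series holding the first row
block = df.iloc[1:3, 0:2]             # rows at positions 1-2, columns a and b
picked = df.iloc[[0, 5], [1, 3]]      # rows 0 and 5, columns b and d

print(block)
print(picked)
```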

DataFrames: Sorting

We can sort the data by a value in the column. By default the sorting will occur in ascending order and a new data frame is returned.

We can sort the data using 2 or more columns.
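Both cases can be sketched with sort_values. The frame is hypothetical. The second call sorts by job ascending and, within each job, by age descending.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "technician", "admin", "technician"],
    "age": [40, 25, 30, 35],
})

by_age = df.sort_values(by="age")   # ascending by default; df itself is unchanged
by_job_then_age = df.sort_values(by=["job", "age"], ascending=[True, False])

print(by_age)
print(by_job_then_age)
```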

Aggregation Functions in Pandas

Aggregation: computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:

min, max
count, sum, prod
mean, median, mode, mad (mad was removed in pandas 2.0)
std, var

The agg() method is useful when multiple statistics are computed per column.
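A sketch of agg() computing several statistics at once, first over whole columns and then per group. Column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "admin", "technician", "technician"],
    "age": [30, 40, 25, 35],
    "balance": [100.0, 300.0, 50.0, 150.0],
})

# Several statistics for each selected column
overall = df[["age", "balance"]].agg(["min", "mean", "max"])

# Several statistics for one column within each group
per_group = df.groupby("job")["balance"].agg(["mean", "count"])

print(overall)
print(per_group)
```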

Basic Descriptive Statistics

df.method() | description
describe | Basic statistics (count, mean, std, min, quantiles, max)
min, max | Minimum and maximum values
mean, median, mode | Arithmetic average, median and mode
var, std | Variance and standard deviation
sem | Standard error of mean
skew | Sample skewness
kurt | Kurtosis
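The less familiar entries in the table (sem, skew, kurt) can be tried on a small hypothetical Series. The same methods are available on a DataFrame, where they work column by column.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0])   # hypothetical values with one large outlier

print(s.describe())          # count, mean, std, min, quartiles, max
print(s.mean(), s.median())  # arithmetic average and median
print(s.mode().tolist())     # most frequent value(s)
print(s.var(), s.std())      # sample variance and standard deviation
print(s.sem())               # standard error of the mean: std / sqrt(n)
print(s.skew(), s.kurt())    # sample skewness and kurtosis
```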

Graphics to explore the data

To show graphs within a Python notebook, include the inline directive: %matplotlib inline

The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.

Graphics

Seaborn function | description
histplot | histogram
barplot | estimate of central tendency for a numeric variable
violinplot | similar to boxplot, also shows the probability density of the data
jointplot | scatterplot (with marginal distributions)
regplot | regression plot
pairplot | pairplot
boxplot | boxplot
swarmplot | categorical scatterplot
factorplot | general categorical plot (renamed catplot in newer Seaborn versions)

Draw Histogram Using Matplotlib
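The plot from this slide is not part of the transcript. As a stand-in, here is a minimal histogram sketch with Matplotlib. The data is randomly generated and the column meaning (age) is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 500 ages drawn from a normal distribution
rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=10, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=20)   # 20 equal-width bins
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Histogram of age")
# plt.show()  # needed when running as a script; notebooks render the figure inline
```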

Draw Histogram Using Seaborn

Draw Barplot Using Matplotlib
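A bar plot in Matplotlib needs the bar heights to be computed first. This sketch, on hypothetical data, uses a groupby mean for that and then draws one bar per group.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "job": ["admin", "admin", "technician", "technician"],
    "balance": [100.0, 300.0, 50.0, 150.0],
})

# One bar per job, with the mean balance as the bar height
mean_balance = df.groupby("job")["balance"].mean()

fig, ax = plt.subplots()
bars = ax.bar(mean_balance.index, mean_balance.values)
ax.set_xlabel("job")
ax.set_ylabel("mean balance")
# plt.show()  # needed when running as a script
```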

Draw Barplot Using Seaborn

Draw Scatterplot Using Seaborn

Draw Boxplot Using Seaborn

Python for Machine Learning

Machine learning: the problem setting

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

We can separate learning problems into a few large categories:
• Supervised Learning (https://sklearn.org/supervised_learning.html#supervised-learning)
  • Classification
  • Regression
• Unsupervised Learning (https://sklearn.org/unsupervised_learning.html#unsupervised-learning)
  • Clustering

Training set and testing set:

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets: one that we call the training set, on which we learn data properties, and one that we call the testing set, on which we test these properties.

scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the boston house prices dataset for regression (the latter was removed in scikit-learn 1.2).
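The split into a training set and a testing set can be sketched with train_test_split on the iris dataset. The 70/30 ratio and the random seed are arbitrary choices for illustration, not values from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# Hold out 30% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```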

Loading an example dataset

A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is a (n_samples, n_features) array. In the case of a supervised problem, one or more response variables are stored in the .target member.

Loading an example dataset - digits

An example showing how scikit-learn can be used to recognize images of hand-written digits.

For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples, and digits.target gives the ground truth for the digit dataset, that is, the number corresponding to each digit image that we are trying to learn.
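The .data and .target members described above can be inspected directly after loading the digits dataset with load_digits. This is a short sketch.

```python
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data.shape)     # (n_samples, n_features): 1797 images, 64 pixels each
print(digits.target.shape)   # one label per image
print(digits.data[0])        # feature vector of the first image
print(digits.target[0])      # its ground-truth digit
```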

Learning and predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit a classifier to be able to predict the classes to which unseen samples belong.

In scikit-learn, a classifier is a Python object that implements the methods fit(X, y) and predict(T).

An example of a classifier is the class sklearn.svm.SVC, which implements support vector classification. The classifier’s constructor takes as arguments the model’s parameters.

For now, we will consider the classifier as a black box.

Choosing the parameters of the model

In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools such as grid search and cross validation.
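A sketch of both options: setting gamma by hand and searching for it. The values gamma=0.001 and C=100 are borrowed from the scikit-learn introductory tutorial and may differ from the slide. The candidate grid and cv=3 are hypothetical choices.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Option 1: set the parameters manually
clf = svm.SVC(gamma=0.001, C=100.0)

# Option 2: let a cross-validated grid search choose gamma
digits = datasets.load_digits()
param_grid = {"gamma": [0.0001, 0.001, 0.01]}   # hypothetical candidate values
search = GridSearchCV(svm.SVC(C=100.0), param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)
```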

For the training set, we’ll use all the images from our dataset, except for the last image, which we’ll reserve for our predicting. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data.
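The fitting step described above can be sketched as follows. The classifier parameters are the same assumed values from the scikit-learn tutorial (gamma=0.001, C=100).

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)   # assumed parameter values

X_train = digits.data[:-1]     # every image except the last
y_train = digits.target[:-1]   # the matching labels

clf.fit(X_train, y_train)
print(X_train.shape)
```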

Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting, you’ll determine the image from the training set that best matches the last image.

[Figure: the corresponding digit image]
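A sketch of the prediction step, self-contained so that it repeats the fit on all but the last image. The parameter values are again the assumed ones from the scikit-learn tutorial.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)   # assumed parameter values
clf.fit(digits.data[:-1], digits.target[:-1])

# predict expects a 2-D array, so slice with [-1:] to keep the row dimension
prediction = clf.predict(digits.data[-1:])

print("predicted:", prediction[0])
print("ground truth:", digits.target[-1])
```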

Model persistence

It is possible to save a model in scikit-learn by using pickle.
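A minimal sketch of saving and restoring a fitted model with pickle, here on the iris dataset with default SVC parameters (both are choices made for illustration). The model is serialized to a bytes object in memory. Writing those bytes to a file works the same way with pickle.dump.

```python
import pickle
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

saved = pickle.dumps(clf)        # serialize the fitted model to bytes
restored = pickle.loads(saved)   # rebuild an equivalent model from the bytes

print(restored.predict(X[0:1]))
```

Only unpickle data from a trusted source, and restore models with the same scikit-learn version that saved them where possible.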