Python for Data Analysis and Machine Learningeclt5810/lecture/weka_tutorial/python-tutorial.pdf ·...

Date post: 20-May-2020
Python for Data Analysis and Machine Learning DENG, Yang [email protected]

Overview

• Environment Preparation for Python
• Python Libraries for Data Scientists
• Data Processing & Visualization Using Python
• Python for Basic Machine Learning Models

Environment Preparation for Python

In this tutorial, we adopt
◦ Anaconda (https://www.anaconda.com/)
◦ Jupyter Notebook (https://jupyter.org/)
for the Python environment.

Other alternatives:
◦ Text editor + command line
◦ IDE (Integrated Development Environment): PyCharm, VS Code, …

What is Anaconda?

The open-source Anaconda is the easiest way to perform Python/R data science and machine learning on Linux, Windows, and Mac OS X. With over 19 million users worldwide, it is the industry standard for developing, testing, and training on a single machine, enabling individual data scientists to:
• Quickly download 7,500+ Python/R data science packages
• Analyze data with scalability and performance with Dask, NumPy, pandas, and Numba
• Visualize results with Matplotlib, Bokeh, Datashader, and Holoviews
• Develop and train machine learning and deep learning models with scikit-learn, TensorFlow, and Theano

Anaconda Installation

Please follow the instructions here to install Anaconda (for Python 3.7):

https://www.anaconda.com/distribution/#download-section

It provides different versions to suit different operating systems. Please select the one you are using.

Just install according to the default settings, and the environment variables will be configured automatically after installation.

What is Jupyter Notebook?

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Jupyter Notebook is included in Anaconda.

Basic Operation on Jupyter Notebook

After installing Anaconda, open Anaconda Navigator and find Jupyter Notebook in its list of applications. Then click Launch.

Jupyter Notebook opens as a web page in your browser. Select the path, then under the “New” button, choose “Python 3” to open a new Python notebook.

Type the code into the input box in Jupyter.

Get started learning Python: https://www.learnpython.org/

Click “Run”.

The output will be shown in the blank area right below the input box.

Jupyter Notebook saves your code automatically in “.ipynb” format.

If you want to save the code in “.py” format, use File > Download as > Python (.py).

Here, we just use the “.ipynb” format.

Python Libraries for Data Scientists

Python toolboxes/libraries for data processing:
◦ NumPy
◦ SciPy
◦ Pandas

Visualization libraries:
◦ matplotlib
◦ Seaborn

Machine learning & deep learning:
◦ Scikit-learn
◦ TensorFlow/PyTorch/Theano

and many more…

NumPy:
• introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
• provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
• many other Python libraries are built on NumPy

Link: http://www.numpy.org/
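
As a rough illustration of the vectorization point above, the sketch below applies one arithmetic expression to a whole array instead of looping over it. The temperature values and variable names are made up for this example.

```python
import numpy as np

# Hypothetical sample data: five temperatures in degrees Celsius
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])

# Vectorized: the expression is applied to every element at once
fahrenheit = celsius * 9 / 5 + 32

# Equivalent pure-Python loop, shown only for comparison
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

print(fahrenheit)
print(celsius.mean())  # statistical helpers work on the whole array too
```

On large arrays the vectorized form is usually much faster because the loop runs inside NumPy's compiled code rather than in the Python interpreter.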

SciPy:
• collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
• part of the SciPy Stack
• built on NumPy

Link: https://www.scipy.org/scipylib/

Pandas:
• adds data structures (Series and DataFrame) and tools designed to work with table-like data (similar to data frames in R)
• provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
• allows handling missing data

Link: http://pandas.pydata.org/

matplotlib:
• Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
• a set of functionalities similar to those of MATLAB
• line plots, scatter plots, bar charts, histograms, pie charts etc.
• relatively low-level; some effort needed to create advanced visualizations

Link: https://matplotlib.org/

Seaborn:
• based on matplotlib
• provides a high-level interface for drawing attractive statistical graphics
• similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/

Scikit-learn:
• provides machine learning algorithms: classification, regression, clustering, model validation etc.
• built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/

Loading Python Libraries

Press Shift+Enter to execute the Jupyter cell, or just click “Run”.

Reading data using pandas

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])

pd.read_stata('myfile.dta')

pd.read_sas('myfile.sas7bdat')

pd.read_hdf('myfile.h5','df')
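
For plain CSV files, `pd.read_csv` is the usual entry point. The sketch below reads from an in-memory string so that it runs without any file on disk; the column names and values are hypothetical, and in practice the first argument would be a file path or URL.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """age,job,balance
58,management,2143
44,technician,29
33,entrepreneur,NA
"""

# na_values lists the strings that should be treated as missing values
df = pd.read_csv(io.StringIO(csv_text), na_values=["NA"])
print(df)
```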

Exploring data frames

✓ Try to read the first 10, 20, 50 records
✓ Try to view the last few records
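
One way to try these exercises is with `head()` and `tail()`. This is a minimal sketch on a made-up 100-row data frame, not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical data frame with 100 rows
df = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

first_10 = df.head(10)  # first 10 records
first_50 = df.head(50)  # first 50 records
last_few = df.tail()    # last 5 records (the default when n is omitted)

print(first_10)
print(last_few)
```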

DataFrame data types

Pandas type                Native Python type    Description
object                     string                The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64                      int                   Integer values. 64 refers to the number of bits allocated to hold each value.
float64                    float                 Numeric values with decimals. If a column contains numbers and NaNs, pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns]  N/A (but see the datetime module in Python’s standard library)  Values meant to hold time data. Look into these for time series experiments.
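
The dtype rules in the table can be checked on a small example. The data frame below is hypothetical; note in particular how a missing value forces an otherwise integer column to float64.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with one column per kind of data
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],         # strings
    "age": [31, 45, 27],                  # integers
    "balance": [1200.5, np.nan, 310.0],   # floats with a missing value
    "joined": pd.to_datetime(["2020-01-05", "2020-02-11", "2020-03-20"]),
})
print(df.dtypes)

# Integers plus a missing value: pandas falls back to float64
with_missing = pd.Series([1, 2, None])
print(with_missing.dtype)
```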

DataFrame attributes

Python objects have attributes and methods.

df.attribute   description
dtypes         list the types of the columns
columns        list the column names
axes           list the row labels and column names
ndim           number of dimensions
size           number of elements
shape          return a tuple representing the dimensionality
values         numpy representation of the data
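
A short sketch of these attributes on a made-up three-row data frame (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [58, 44, 33], "balance": [2143.0, 29.0, 2.0]})

print(df.dtypes)         # type of each column
print(list(df.columns))  # column names
print(df.axes)           # row labels and column names
print(df.ndim)           # number of dimensions
print(df.size)           # number of elements (rows * columns)
print(df.shape)          # (rows, columns)
print(df.values)         # underlying NumPy array
```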

DataFrame methods

df.method()          description
head([n]), tail([n]) first/last n rows
describe()           generate descriptive statistics (by default, for numeric columns only)
max(), min()         return max/min values for all numeric columns
mean(), median()     return mean/median values for all numeric columns
std()                standard deviation
sample([n])          returns a random sample of the data frame
dropna()             drop all the records with missing values

Unlike attributes, Python methods are called with parentheses. All attributes and methods can be listed with the dir() function: dir(df)
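
A few of these methods in action, as a minimal sketch on hypothetical data with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "balance": [2143.0, 29.0, np.nan, 1506.0],
})

summary = df.describe()            # count, mean, std, min, quartiles, max
mean_age = df["age"].mean()        # arithmetic mean of one column
max_balance = df["balance"].max()  # missing values are skipped
complete = df.dropna()             # rows without any missing value
has_head = "head" in dir(df)       # dir() lists attributes and methods

print(summary)
print(mean_age, max_balance, len(complete), has_head)
```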

Selecting a column in a DataFrame

Note: If we want to select a column whose name is the same as an existing DataFrame attribute or method, we should use method 1 (bracket notation).

E.g., since DataFrame already has a rank method, to select the column ‘rank’ we should use df[‘rank’]. We cannot use method 2 (dot notation), i.e., df.rank, which returns the rank method of the data frame instead of the column “rank”.
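
The name clash described in the note can be reproduced directly. The data frame below is hypothetical; only the column name `rank` matters.

```python
import pandas as pd

df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]})

col = df["rank"]    # bracket notation: always returns the column
attr = df.rank      # dot notation: this is the DataFrame.rank method, not the column
salary = df.salary  # dot notation is fine when the name does not clash

print(type(col).__name__)  # the column is a Series
print(callable(attr))      # the method is callable, so it is not column data
```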

DataFrame groupby method

Using the "groupby" method we can:

• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group

Once the groupby object is created, we can calculate various statistics for each group.

Note: If single brackets are used to specify the column (e.g. age), then the output is a pandas Series object. When double brackets are used, the output is a DataFrame (e.g. age & balance).
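
The single versus double bracket behaviour can be sketched as follows. The `job`, `age` and `balance` columns and their values are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "admin", "tech", "tech"],
    "age": [30, 40, 25, 35],
    "balance": [100.0, 300.0, 50.0, 150.0],
})

grouped = df.groupby("job")  # split the rows into one group per job

mean_age = grouped["age"].mean()                # single brackets -> Series
mean_both = grouped[["age", "balance"]].mean()  # double brackets -> DataFrame

print(mean_age)
print(mean_both)
```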

groupby performance notes:

- no grouping/splitting occurs until it's needed. Creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may want to pass sort=False for a potential speedup.
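
The effect of `sort=False` is visible in the order of the group keys. This is a minimal sketch on made-up data; on a frame this small there is no measurable speed difference.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["tech", "admin", "tech", "admin"],
    "age": [25, 30, 35, 40],
})

sorted_keys = df.groupby("job")["age"].mean()                # keys sorted alphabetically
unsorted_keys = df.groupby("job", sort=False)["age"].mean()  # keys in order of first appearance

print(list(sorted_keys.index))
print(list(unsorted_keys.index))
```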

DataFrame: filtering

To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, we can subset the rows in which the age value is greater than 50.

Any comparison operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;
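
A minimal sketch of Boolean indexing, assuming a hypothetical data frame with `age` and `job` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [58, 44, 33, 61],
    "job": ["mgmt", "tech", "tech", "retired"],
})

over_50 = df[df["age"] > 50]         # rows where age is greater than 50
tech_only = df[df["job"] == "tech"]  # rows where job equals "tech"

# Conditions are combined with & (and) or | (or); each one needs parentheses
young_tech = df[(df["age"] < 40) & (df["job"] == "tech")]

print(over_50)
print(young_tech)
```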

DataFrames: Slicing

There are a number of ways to subset the DataFrame:
• one or more columns
• one or more rows
• a subset of rows and columns

Rows and columns can be selected by their position or label.

When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame).

When we need to select more than one column and/or make the output a DataFrame, we should use double brackets.
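
The difference between the two bracket forms, sketched on hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [58, 44],
    "balance": [2143, 29],
    "job": ["mgmt", "tech"],
})

one_col_series = df["age"]         # single brackets -> Series
one_col_frame = df[["age"]]        # double brackets -> DataFrame with one column
two_cols = df[["age", "balance"]]  # double brackets -> DataFrame with two columns

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(two_cols)
```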

DataFrames: Selecting rows

If we need to select a range of rows, we can specify the range using ":".

Notice that the first row has position 0, and the last value in the range is omitted. So for the range 0:10, the first 10 rows are returned, with positions starting at 0 and ending at 9.
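
The half-open range can be verified on a small made-up data frame:

```python
import pandas as pd

# Hypothetical data frame with 20 rows
df = pd.DataFrame({"value": range(100, 120)})

first_10 = df[0:10]    # positions 0 through 9; position 10 is excluded
rows_5_to_7 = df[5:8]  # positions 5, 6 and 7

print(first_10)
print(rows_5_to_7)
```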

DataFrames: method loc

If we need to select a range of rows using their labels, we can use the loc method.

Recall that

DataFrames: method iloc

If we need to select a range of rows and/or columns using their positions, we can use the iloc method.

DataFrames: method iloc (summary)

df.iloc[0]             # First row of a data frame
df.iloc[i]             # (i+1)th row
df.iloc[-1]            # Last row

df.iloc[:, 0]          # First column
df.iloc[:, -1]         # Last column

df.iloc[0:7]           # First 7 rows
df.iloc[:, 0:2]        # First 2 columns
df.iloc[1:3, 0:2]      # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]]  # 1st and 6th rows and 2nd and 4th columns
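
To contrast label-based and position-based selection, the sketch below uses a data frame whose row labels (10, 20, 30, 40) are deliberately different from the row positions. The labels and columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [58, 44, 33, 47], "balance": [2143, 29, 2, 1506]},
    index=[10, 20, 30, 40],  # hypothetical row labels
)

by_label = df.loc[10:30, ["age"]]  # label-based: both endpoints are included
by_position = df.iloc[0:2, 0:1]    # position-based: the end position is excluded
last_row = df.iloc[-1]             # last row, returned as a Series

print(by_label)
print(by_position)
print(last_row)
```

Note that a `loc` slice returns three rows here while the `iloc` slice returns two, because only `iloc` drops the end of the range.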

DataFrames: Sorting

We can sort the data by the values in a column. By default the sorting occurs in ascending order and a new data frame is returned.

We can sort the data using 2 or more columns.
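
Both kinds of sort can be sketched with `sort_values`. The data is made up; the last assertion-style print shows that the original data frame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["tech", "admin", "tech", "admin"],
    "age": [35, 40, 25, 30],
})

by_age = df.sort_values(by="age")                        # ascending by default
by_age_desc = df.sort_values(by="age", ascending=False)  # descending

# Two columns: job ascending, then age descending within each job
by_job_then_age = df.sort_values(by=["job", "age"], ascending=[True, False])

print(by_job_then_age)
print(df["age"].tolist())  # the original order is untouched
```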

Aggregation Functions in Pandas

Aggregation - computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:

min, max
count, sum, prod
mean, median, mode, mad (mad was removed in pandas 2.0)
std, var

The agg() method is useful when multiple statistics are computed per column.
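
A minimal sketch of `agg()`, first on whole columns and then per group. The columns and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["admin", "admin", "tech", "tech"],
    "age": [30, 40, 25, 35],
    "balance": [100.0, 300.0, 50.0, 150.0],
})

# Several statistics for each selected column
overall = df[["age", "balance"]].agg(["min", "mean", "max"])

# Different statistics per column, computed within each group
per_group = df.groupby("job").agg({"age": ["min", "max"], "balance": "mean"})

print(overall)
print(per_group)
```

The per-group result has two-level column names such as `("age", "max")`, one level for the column and one for the statistic.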

Basic Descriptive Statistics

df.method()          description
describe             Basic statistics (count, mean, std, min, quantiles, max)
min, max             Minimum and maximum values
mean, median, mode   Arithmetic average, median and mode
var, std             Variance and standard deviation
sem                  Standard error of mean
skew                 Sample skewness
kurt                 Kurtosis

Graphics to explore the data

To show graphs within a Python notebook, include the inline directive (%matplotlib inline).

The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.

Graphics

function     description
histplot     histogram
barplot      estimate of central tendency for a numeric variable
violinplot   similar to boxplot, also shows the probability density of the data
jointplot    scatterplot
regplot      regression plot
pairplot     pairplot
boxplot      boxplot
swarmplot    categorical scatterplot
factorplot   general categorical plot (renamed catplot in seaborn 0.9)

Draw Histogram Using Matplotlib
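
A minimal sketch of a matplotlib histogram, assuming randomly generated ages in place of the tutorial's data. The `Agg` backend is selected only so that the code also runs outside a notebook; in Jupyter that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 500 ages drawn from a normal distribution
rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=10, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=20)  # 20 equal-width bins
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Histogram of age")
```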

Draw Histogram Using Seaborn

Draw Barplot Using Matplotlib

Draw Barplot Using Seaborn

Draw Scatterplot Using Seaborn

Draw Boxplot Using Seaborn

Python for Machine Learning

Machine learning: the problem setting

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

We can separate learning problems into a few large categories:
• Supervised Learning (https://sklearn.org/supervised_learning.html#supervised-learning)
    ◦ Classification
    ◦ Regression
• Unsupervised Learning (https://sklearn.org/unsupervised_learning.html#unsupervised-learning)
    ◦ Clustering

Training set and testing set:

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice for evaluating an algorithm in machine learning is to split the data at hand into two sets: one that we call the training set, on which we learn data properties, and one that we call the testing set, on which we test these properties.

scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the boston house prices dataset for regression (the boston dataset was removed in scikit-learn 1.2).
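
One common way to make such a split is scikit-learn's `train_test_split`. The sketch below uses the bundled iris dataset; the 25% test share and the `random_state` value are arbitrary choices for this example, not settings from the tutorial.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out a quarter of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```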

Loading an example dataset

A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is a (n_samples, n_features) array. In the case of a supervised problem, one or more response variables are stored in the .target member.

Loading an example dataset - digits

An example showing how scikit-learn can be used to recognize images of hand-written digits.

For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples, and digits.target gives the ground truth for the digit dataset, that is, the number corresponding to each digit image that we are trying to learn.
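
The `.data` and `.target` members can be inspected as follows (a short sketch using scikit-learn's bundled digits dataset):

```python
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data.shape)       # (n_samples, n_features)
print(digits.target[:10])      # ground-truth digit for the first 10 samples
print(digits.images[0].shape)  # each sample is also available as an 8x8 image
```

Each row of `digits.data` is the flattened form of the matching 8x8 entry in `digits.images`, which is why there are 64 features per sample.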

Learning and predicting

In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit a classifier to be able to predict the classes to which unseen samples belong.

In scikit-learn, a classifier is a Python object that implements the methods fit(X, y) and predict(T).

An example of a classifier is the class sklearn.svm.SVC, which implements support vector classification. The classifier’s constructor takes as arguments the model’s parameters.

For now, we will consider the classifier as a black box.

Choosing the parameters of the model

In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools such as grid search and cross validation.

For the training set, we’ll use all the images from our dataset, except for the last image, which we’ll reserve for our predicting. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data.

Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting, you’ll determine the image from the training set that best matches the last image.

The corresponding image is:
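
The fitting and prediction steps described above can be sketched as follows. The parameter values `gamma=0.001` and `C=100.0` are assumptions for this example (the transcript does not preserve the values used on the slide); the last line retrieves the pixel array of the held-out image.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Parameter values are assumed for illustration
clf = svm.SVC(gamma=0.001, C=100.0)

# Train on every sample except the last one
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict the class of the held-out last sample; [-1:] keeps the 2D shape
prediction = clf.predict(digits.data[-1:])
last_image = digits.images[-1]  # 8x8 pixel array of the held-out image

print(prediction)
print(last_image)
```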

Model persistence

It is possible to save a model in scikit-learn by using pickle.
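
A minimal sketch of the round trip with Python's standard `pickle` module, assuming a classifier trained on the iris dataset with default parameters. The model is serialized to bytes in memory here; `pickle.dump` and `pickle.load` do the same with a file.

```python
import pickle

from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC()
clf.fit(iris.data, iris.target)

saved = pickle.dumps(clf)      # serialize the fitted model to bytes
restored = pickle.loads(saved)  # rebuild an equivalent model from the bytes

# The restored model makes the same predictions as the original
same = bool((restored.predict(iris.data) == clf.predict(iris.data)).all())
print(same)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.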

