Overview
§ Environment Preparation for Python
§ Python Libraries for Data Scientists
§ Data Processing & Visualization Using Python
§ Python for Basic Machine Learning Models
Environment Preparation for Python
In this tutorial, we adopt
◦ Anaconda (https://www.anaconda.com/)
◦ Jupyter Notebook (https://jupyter.org/)
for the Python environment.
Other alternatives:
◦ Text editor + command line
◦ IDE (Integrated Development Environment): PyCharm, VS Code, …
What is Anaconda?
The open-source Anaconda is the easiest way to perform Python/R data science and machine learning on Linux, Windows, and Mac OS X. With over 19 million users worldwide, it is the industry standard for developing, testing, and training on a single machine, enabling individual data scientists to:
§ Quickly download 7,500+ Python/R data science packages
§ Analyze data with scalability and performance with Dask, NumPy, pandas, and Numba
§ Visualize results with Matplotlib, Bokeh, Datashader, and Holoviews
§ Develop and train machine learning and deep learning models with scikit-learn, TensorFlow, and Theano
Anaconda Installation
Please follow the instructions here to install Anaconda (for Python 3.7):
https://www.anaconda.com/distribution/#download-section
It provides different versions to suit different operating systems. Please select the one that matches the OS you are using.
Install with the default settings, and the environment variables will be configured automatically after installation.
What is Jupyter Notebook?
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Jupyter Notebook is included in Anaconda.
Basic Operation on Jupyter Notebook
After installing Anaconda, open Anaconda-Navigator and find Jupyter Notebook in it. Then click Launch.
Jupyter Notebook is presented as a website. Select the path, then under the “New” button choose “Python 3” to open a new Python file.
Type the code into the input box in Jupyter.
Get started learning Python: https://www.learnpython.org/
Click “Run”.
The output will be shown in the blank area right below the input box.
Jupyter Notebook saves your code automatically in “.ipynb” format.
The code can also be saved in “.py” format.
Here, we just use the “.ipynb” format.
Python Libraries for Data Scientists
Python toolboxes/libraries for data processing:
◦ NumPy
◦ SciPy
◦ Pandas
Visualization libraries:
◦ matplotlib
◦ Seaborn
Machine learning & deep learning:
◦ Scikit-learn
◦ TensorFlow / PyTorch / Theano
and many more…
NumPy:
§ introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
§ provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
§ many other Python libraries are built on NumPy
Link: http://www.numpy.org/
SciPy:
§ collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
§ part of the SciPy Stack
§ built on NumPy
Link: https://www.scipy.org/scipylib/
Pandas:
§ adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
§ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
§ allows handling missing data
Link: http://pandas.pydata.org/
matplotlib:
§ Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
§ a set of functionalities similar to those of MATLAB
§ line plots, scatter plots, bar charts, histograms, pie charts etc.
§ relatively low-level; some effort needed to create advanced visualizations
Link: https://matplotlib.org/
Seaborn:
§ based on matplotlib
§ provides a high-level interface for drawing attractive statistical graphics
§ similar (in style) to the popular ggplot2 library in R
Link: https://seaborn.pydata.org/
Scikit-learn:
§ provides machine learning algorithms: classification, regression, clustering, model validation etc.
§ built on NumPy, SciPy and matplotlib
Link: http://scikit-learn.org/
Loading Python Libraries
Press Shift+Enter to execute the Jupyter cell, or just click “Run”.
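A minimal sketch of what a library-loading cell might look like. The aliases np, pd and mpl are common conventions rather than anything fixed by this tutorial, and Seaborn is left out only to keep the sketch dependency-light.

```python
# Conventional aliases used across the Python data-science ecosystem.
import numpy as np
import pandas as pd
import matplotlib as mpl

# Printing the versions is a quick check that the imports worked.
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)
print("matplotlib:", mpl.__version__)
```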
Reading data using pandas
There are a number of pandas commands to read other data formats:
pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5', 'df')
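The readers above follow the same pattern as the basic CSV reader, pd.read_csv. The sketch below assumes a tiny in-memory CSV with hypothetical columns (rank, age, balance) so that it runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as 'myfile.csv'.
csv_text = "rank,age,balance\nProf,56,1200\nAsstProf,34,\n"

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The empty balance field in the second record is read as a missing value (NaN).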
Exploring data frames
✓ Try to read the first 10, 20, 50 records
✓ Try to view the last few records
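One way to try the exercises above, sketched on a hypothetical 60-record data frame:

```python
import pandas as pd

# Hypothetical data: 60 records with a single 'age' column.
df = pd.DataFrame({"age": range(20, 80)})

first_10 = df.head(10)   # first 10 records
first_20 = df.head(20)   # first 20 records
first_50 = df.head(50)   # first 50 records
last_few = df.tail()     # last 5 records (the default)
print(last_few)
```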
DataFrame data types
Pandas Type | Native Python Type | Description
object | string | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64 | int | Whole numbers. 64 refers to the number of bits allocated to hold each value.
float64 | float | Numbers with decimals. If a column contains numbers and NaNs, pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments.
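A short sketch of how the types in the table can be inspected, using a hypothetical data frame. The column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with a text, an integer and a float column.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "age": [56, 34, 61],
    "balance": [1200.5, np.nan, 310.0],  # NaN forces a float column
})

print(df["age"].dtype)  # type of a single column
print(df.dtypes)        # types of all columns
```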
DataFrame attributes
Python objects have attributes and methods.
df.attribute | description
dtypes | list the types of the columns
columns | list the column names
axes | list the row labels and column names
ndim | number of dimensions
size | number of elements
shape | return a tuple representing the dimensionality
values | NumPy representation of the data
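The attributes in the table can be tried on any data frame. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data frame: 3 rows, 2 columns.
df = pd.DataFrame({"age": [56, 34, 61], "balance": [1200.5, 80.0, 310.0]})

print(df.columns)  # column names
print(df.ndim)     # number of dimensions
print(df.size)     # number of elements (rows * columns)
print(df.shape)    # (rows, columns)
print(df.values)   # underlying NumPy array
```

Note that attributes are accessed without parentheses.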
DataFrame methods
df.method() | description
head([n]), tail([n]) | first/last n rows
describe() | generate descriptive statistics (for numeric columns only)
max(), min() | return max/min values for all numeric columns
mean(), median() | return mean/median values for all numeric columns
std() | standard deviation
sample([n]) | returns a random sample of the data frame
dropna() | drop all the records with missing values
Unlike attributes, Python methods have parentheses. All attributes and methods can be listed with the dir() function: dir(df)
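A brief sketch exercising several of the methods above on a hypothetical data frame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing balance.
df = pd.DataFrame({
    "age": [56, 34, 61, 45],
    "balance": [1200.0, np.nan, 310.0, 90.0],
})

summary = df.describe()      # count, mean, std, min, quartiles, max
mean_age = df["age"].mean()  # arithmetic mean of one column
cleaned = df.dropna()        # drops the record with the missing balance
one_row = df.sample(1)       # one randomly chosen record
has_head = "head" in dir(df) # dir() lists attributes and methods
print(summary)
```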
Selecting a column in a DataFrame
Note: If the column name is the same as an existing DataFrame attribute, we should use method 1 (bracket notation).
E.g., since DataFrame already has an attribute named rank, to select the column 'rank' we should use df['rank']. We cannot use method 2, i.e., df.rank, which returns the rank attribute of the data frame instead of the column 'rank'.
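The name clash described in the note can be reproduced with a small hypothetical data frame:

```python
import pandas as pd

# Hypothetical data frame whose first column is named 'rank'.
df = pd.DataFrame({"rank": ["Prof", "AsstProf"], "age": [56, 34]})

by_brackets = df["rank"]  # the column, returned as a Series
by_attribute = df.rank    # the built-in DataFrame.rank method, not the column
age_col = df.age          # works, because 'age' does not clash with anything

print(type(by_brackets))
print(callable(by_attribute))
```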
DataFrame groupby method
Using the "groupby" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
Once the groupby object is created, we can calculate various statistics for each group:
Note: If single brackets are used to specify the column (e.g. age), then the output is a pandas Series object. When double brackets are used, the output is a DataFrame (e.g. age & balance).
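A minimal sketch of the split-then-summarize pattern and of the single versus double bracket distinction. The grouping column job is a hypothetical name; age and balance follow the note above.

```python
import pandas as pd

# Hypothetical data: two job groups.
df = pd.DataFrame({
    "job": ["admin", "admin", "tech", "tech"],
    "age": [30, 40, 50, 60],
    "balance": [100.0, 300.0, 200.0, 600.0],
})

grouped = df.groupby("job")                      # split into groups
mean_age = grouped["age"].mean()                 # single brackets: Series
mean_both = grouped[["age", "balance"]].mean()   # double brackets: DataFrame
print(mean_age)
print(mean_both)
```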
groupby performance notes:
- no grouping/splitting occurs until it's needed. Creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may want to pass sort=False for potential speedup:
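The effect of sort=False can be seen in the order of the group keys. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data where 'tech' appears before 'admin'.
df = pd.DataFrame({
    "job": ["tech", "admin", "tech"],
    "balance": [200.0, 100.0, 600.0],
})

# Default: group keys are sorted.
sorted_keys = df.groupby("job")["balance"].mean()

# sort=False: group keys keep their order of first appearance.
unsorted_keys = df.groupby("job", sort=False)["balance"].mean()

print(list(sorted_keys.index))
print(list(unsorted_keys.index))
```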
DataFrame: filtering
To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, if we want to subset the rows in which the age value is greater than 50:
Any comparison operator can be used to subset the data: > greater; >= greater or equal; < less; <= less or equal; == equal; != not equal.
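A sketch of the age filter described above, on hypothetical data:

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({
    "age": [56, 34, 61, 50],
    "balance": [1200.0, 80.0, 310.0, 90.0],
})

mask = df["age"] > 50   # Boolean Series, one True/False per row
older = df[mask]        # keeps only the rows where the mask is True
print(older)
```

The same pattern works with any of the comparison operators listed above.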
DataFrames: Slicing
There are a number of ways to subset the DataFrame:
• one or more columns
• one or more rows
• a subset of rows and columns
Rows and columns can be selected by their position or label.
When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame):
When we need to select more than one column and/or make the output a DataFrame, we should use double brackets:
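Both forms of column selection, sketched on hypothetical data:

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({
    "age": [56, 34, 61],
    "balance": [1200.0, 80.0, 310.0],
    "job": ["tech", "admin", "tech"],
})

as_series = df["age"]                # single brackets: Series
as_frame = df[["age"]]               # double brackets: one-column DataFrame
two_columns = df[["age", "balance"]] # double brackets: several columns
print(type(as_series), type(as_frame))
```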
DataFrames: Selecting rows
If we need to select a range of rows, we can specify the range using ":"
Notice that the first row has position 0, and the last value in the range is omitted. So for the range 0:10 the first 10 rows are returned, with positions starting at 0 and ending at 9.
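The 0:10 range from the text, sketched on a hypothetical 20-row data frame:

```python
import pandas as pd

# Hypothetical data frame with 20 rows and the default 0..19 index.
df = pd.DataFrame({"age": range(20, 40)})

first_ten = df[0:10]  # positions 0 through 9; position 10 is excluded
print(first_ten)
```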
DataFrames: method loc
If we need to select a range of rows using their labels, we can use method loc:
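A sketch of label-based selection, assuming a hypothetical data frame whose row labels start at 10. Unlike position-based slicing, the end label of a loc range is included.

```python
import pandas as pd

# Hypothetical data frame with row labels 10..14.
df = pd.DataFrame(
    {"age": [56, 34, 61, 45, 29], "balance": [1200.0, 80.0, 310.0, 90.0, 15.0]},
    index=[10, 11, 12, 13, 14],
)

# Rows labelled 11 through 13 (inclusive) and two named columns.
subset = df.loc[11:13, ["age", "balance"]]
print(subset)
```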
Recall that
DataFrames: method iloc
If we need to select a range of rows and/or columns using their positions, we can use method iloc:
DataFrames: method iloc (summary)
df.iloc[0]  # First row of a data frame
df.iloc[i]  # (i+1)th row
df.iloc[-1]  # Last row
df.iloc[:, 0]  # First column
df.iloc[:, -1]  # Last column
df.iloc[0:7]  # First 7 rows
df.iloc[:, 0:2]  # First 2 columns
df.iloc[1:3, 0:2]  # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]]  # 1st and 6th rows and 2nd and 4th columns
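A few of the calls from the summary, made runnable on a hypothetical 10-row, 4-column data frame:

```python
import pandas as pd

# Hypothetical data frame: columns a, b, c, d with easily recognizable values.
df = pd.DataFrame({
    "a": range(0, 10),
    "b": range(10, 20),
    "c": range(20, 30),
    "d": range(30, 40),
})

first_row = df.iloc[0]            # first row, as a Series
block = df.iloc[1:3, 0:2]         # rows at positions 1-2, columns at positions 0-1
picked = df.iloc[[0, 5], [1, 3]]  # 1st and 6th rows, 2nd and 4th columns
print(picked)
```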
DataFrames: Sorting
We can sort the data by a value in the column. By default the sorting will occur in ascending order and a new data frame is returned.
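A sketch of single-column sorting with sort_values, on hypothetical data. It also shows that the original data frame is left unchanged.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({"age": [56, 34, 61]})

df_sorted = df.sort_values(by="age")  # ascending by default, returns a new frame
print(df_sorted)
```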
We can sort the data using 2 or more columns:
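Sorting by two columns might look like the sketch below. The column names and the choice of one ascending and one descending key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({
    "job": ["tech", "admin", "tech", "admin"],
    "balance": [200, 100, 600, 300],
})

# Sort by job (ascending), then by balance (descending) within each job.
df_sorted = df.sort_values(by=["job", "balance"], ascending=[True, False])
print(df_sorted)
```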
Aggregation Functions in Pandas
Aggregation: computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts
Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad
std, var
The agg() method is useful when multiple statistics are computed per column:
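A sketch of agg() computing several statistics per column at once, on hypothetical data:

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({
    "age": [30, 40, 50],
    "balance": [100.0, 300.0, 200.0],
})

# One row per statistic, one column per variable.
stats = df[["age", "balance"]].agg(["min", "mean", "max"])
print(stats)
```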
Basic Descriptive Statistics
df.method() | description
describe | Basic statistics (count, mean, std, min, quantiles, max)
min, max | Minimum and maximum values
mean, median, mode | Arithmetic average, median and mode
var, std | Variance and standard deviation
sem | Standard error of mean
skew | Sample skewness
kurt | Kurtosis
Graphics to explore the data
To show graphs within a Python notebook, include the inline directive:
%matplotlib inline
The Seaborn package is built on matplotlib but provides a high-level interface for drawing attractive statistical graphics, similar to the ggplot2 library in R. It specifically targets statistical data visualization.
Graphics
function | description
histplot | histogram
barplot | estimate of central tendency for a numeric variable
violinplot | similar to boxplot, also shows the probability density of the data
jointplot | scatter plot
regplot | regression plot
pairplot | pair plot
boxplot | box plot
swarmplot | categorical scatter plot
catplot (formerly factorplot) | general categorical plot
Draw Histogram Using Matplotlib
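A minimal sketch of a matplotlib histogram on hypothetical age values. The Agg backend is selected only so the sketch also runs outside a notebook; in Jupyter it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical sample values.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 58, 63]

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages, bins=5)  # 5 equal-width bins
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Histogram of age")
```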
Draw Histogram Using Seaborn
Draw Barplot Using Matplotlib
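A minimal matplotlib bar plot sketch. The categories and heights are hypothetical, precomputed values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical categories and their mean balances.
jobs = ["admin", "tech", "services"]
mean_balance = [200.0, 400.0, 150.0]

fig, ax = plt.subplots()
bars = ax.bar(jobs, mean_balance)  # one bar per category
ax.set_xlabel("job")
ax.set_ylabel("mean balance")
```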
Draw Barplot Using Seaborn
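With Seaborn, barplot computes the per-category estimate (the mean by default) directly from the raw records. A sketch on hypothetical data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical raw records; barplot aggregates them per job.
df = pd.DataFrame({
    "job": ["admin", "admin", "tech", "tech"],
    "balance": [100.0, 300.0, 200.0, 600.0],
})

fig, ax = plt.subplots()
sns.barplot(data=df, x="job", y="balance", ax=ax)
```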
Draw Scatterplot Using Seaborn
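A scatter plot sketch using seaborn.scatterplot on hypothetical age and balance values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data frame.
df = pd.DataFrame({
    "age": [23, 35, 41, 52, 63],
    "balance": [50.0, 300.0, 250.0, 800.0, 1200.0],
})

fig, ax = plt.subplots()
sns.scatterplot(data=df, x="age", y="balance", ax=ax)
```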
Draw Boxplot Using Seaborn
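A box plot sketch using seaborn.boxplot, with one box per hypothetical job category:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data frame with two categories.
df = pd.DataFrame({
    "job": ["admin"] * 4 + ["tech"] * 4,
    "balance": [100.0, 150.0, 300.0, 320.0, 200.0, 450.0, 600.0, 650.0],
})

fig, ax = plt.subplots()
sns.boxplot(data=df, x="job", y="balance", ax=ax)
```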
Python for Machine Learning
Machine learning: the problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
We can separate learning problems into a few large categories:
• Supervised Learning (https://sklearn.org/supervised_learning.html#supervised-learning)
  • Classification
  • Regression
• Unsupervised Learning (https://sklearn.org/unsupervised_learning.html#unsupervised-learning)
  • Clustering
Training set and testing set:
Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice for evaluating an algorithm is to split the data at hand into two sets: one that we call the training set, on which we learn data properties, and one that we call the testing set, on which we test these properties.
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the boston house prices dataset for regression (the latter was removed in scikit-learn 1.2).
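One common way to make such a split is scikit-learn's train_test_split helper. The sketch below assumes the iris dataset and an 80/20 split; both choices are illustrative and not prescribed by the tutorial.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 20% of the samples for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```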
Loading an example dataset
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of a supervised problem, one or more response variables are stored in the .target member.
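A short sketch of loading a bundled dataset and inspecting its .data and .target members, using iris as the example:

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)    # (n_samples, n_features)
print(iris.target.shape)  # one response value per sample
print(list(iris.keys()))  # dictionary-like: data, target and metadata
```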
Loading an example dataset - digits
An example showing how scikit-learn can be used to recognize images of hand-written digits.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digits samples, and digits.target gives the ground truth for the digit dataset, that is, the number corresponding to each digit image that we are trying to learn:
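A sketch of what digits.data and digits.target contain:

```python
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data.shape)  # each 8x8 image is flattened into 64 features
print(digits.data[0])     # feature vector of the first image
print(digits.target)      # the digit each image represents
```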
Learning and predicting
In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples of each of the 10 possible classes (the digits zero through nine) on which we fit a classifier to be able to predict the classes to which unseen samples belong.
In scikit-learn, a classifier is a Python object that implements the methods fit(X, y) and predict(T).
An example of a classifier is the class sklearn.svm.SVC, which implements support vector classification. The classifier’s constructor takes as arguments the model’s parameters.
For now, we will consider the classifier as a black box:
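A sketch of constructing such a classifier. The parameter values gamma=0.001 and C=100 are assumptions borrowed from the scikit-learn introductory tutorial, since the values used on the original slide are not recoverable.

```python
from sklearn import svm

# Assumed parameter values; the constructor only stores them, no training happens yet.
clf = svm.SVC(gamma=0.001, C=100.0)
print(clf)
```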
Choosing the parameters of the model
In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools such as grid search and cross validation.
For the training set, we’ll use all the images from our dataset, except for the last image, which we’ll reserve for our predicting. We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data:
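The fitting step might look like the sketch below. The SVC parameter values are assumed, as on the scikit-learn introductory tutorial.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)  # assumed parameter values

# All samples except the last one form the training set.
X_train = digits.data[:-1]
y_train = digits.target[:-1]
clf.fit(X_train, y_train)
```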
Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting, you’ll determine the image from the training set that best matches the last image.
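A sketch of predicting the held-out last image. It refits the classifier so that it runs on its own; the SVC parameter values are again assumed.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)  # assumed parameter values
clf.fit(digits.data[:-1], digits.target[:-1])

# digits.data[-1:] keeps the 2-D shape (1, 64) that predict expects.
prediction = clf.predict(digits.data[-1:])
print(prediction)
```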
The corresponding image is:
Model persistence
It is possible to save a model in scikit-learn by using pickle:
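A minimal sketch of a pickle round trip, serializing to bytes in memory rather than to a file. The classifier and its parameter values are assumptions carried over from the scikit-learn introductory tutorial.

```python
import pickle
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0).fit(digits.data, digits.target)

saved = pickle.dumps(clf)       # serialize the fitted model to bytes
restored = pickle.loads(saved)  # rebuild an equivalent model

# The restored model should give the same predictions as the original.
same = (restored.predict(digits.data[:5]) == clf.predict(digits.data[:5])).all()
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that saved them.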