Data Science for Marketing Analytics · 2020-01-08 · Adding and Removing Attributes and...

Data Science for Marketing Analytics

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar

Technical Reviewer: Dipankar Nath

Managing Editor: Neha Nair

Acquisitions Editor: Kunal Sawant

Production Editor: Samita Warang

Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: March 2019

Production Reference: 1290319

ISBN: 978-1-78995-941-3

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

Chapter 1: Data Preparation and Cleaning

Introduction

Data Models and Structured Data

pandas

Importing and Exporting Data With pandas DataFrames

Viewing and Inspecting Data in DataFrames

Exercise 1: Importing JSON Files into pandas

Exercise 2: Identifying Semi-Structured and Unstructured Data

Structure of a pandas Series

Data Manipulation

Selecting and Filtering in pandas

Creating Test DataFrames in Python

Adding and Removing Attributes and Observations

Exercise 3: Creating and Modifying Test DataFrames

Combining Data

Handling Missing Data

Exercise 4: Combining DataFrames and Handling Missing Values

Applying Functions and Operations on DataFrames

Grouping Data

Exercise 5: Applying Data Transformations

Activity 1: Addressing Data Spilling

Summary

Chapter 2: Data Exploration and Visualization

Introduction

Identifying the Right Attributes

Exercise 6: Exploring the Attributes in Sales Data

Generating Targeted Insights

Selecting and Renaming Attributes

Transforming Values

Exercise 7: Targeting Insights for Specific Use Cases

Reshaping the Data

Exercise 8: Understanding Stacking and Unstacking

Pivot Tables

Visualizing Data

Exercise 9: Visualizing Data With pandas

Visualization through Seaborn

Visualization with Matplotlib

Activity 2: Analyzing Advertisements

Summary

Chapter 3: Unsupervised Learning: Customer Segmentation

Introduction

Customer Segmentation Methods

Traditional Segmentation Methods

Unsupervised Learning (Clustering) for Customer Segmentation

Similarity and Data Standardization

Determining Similarity

Standardizing Data

Exercise 10: Standardizing Age and Income Data of Customers

Calculating Distance

Exercise 11: Calculating Distance Between Three Customers

Activity 3: Loading, Standardizing, and Calculating Distance with a Dataset

k-means Clustering

Understanding k-means Clustering

Exercise 12: k-means Clustering on Income/Age Data

High-Dimensional Data

Exercise 13: Dealing with High-Dimensional Data

Activity 4: Using k-means Clustering on Customer Behavior Data

Summary

Chapter 4: Choosing the Best Segmentation Approach

Introduction

Choosing the Number of Clusters

Simple Visual Inspection

Exercise 14: Choosing the Number of Clusters Based on Visual Inspection

The Elbow Method with Sum of Squared Errors

Exercise 15: Determining the Number of Clusters Using the Elbow Method

Activity 5: Determining Clusters for High-End Clothing Customer Data Using the Elbow Method with the Sum of Squared Errors

Different Methods of Clustering

Mean-Shift Clustering

Exercise 16: Performing Mean-Shift Clustering to Cluster Data

k-modes and k-prototypes Clustering

Exercise 17: Clustering Data Using the k-prototypes Method

Activity 6: Using Different Clustering Techniques on Customer Behavior Data

Evaluating Clustering

Silhouette Score

Exercise 18: Calculating Silhouette Score to Pick the Best k for k-means and Comparing to the Mean-Shift Algorithm

Train and Test Split

Exercise 19: Using a Train-Test Split to Evaluate Clustering Performance

Activity 7: Evaluating Clustering on Customer Behavior Data

Summary

Chapter 5: Predicting Customer Revenue Using Linear Regression

Introduction

Understanding Regression

Feature Engineering for Regression

Feature Creation

Data Cleaning

Exercise 20: Creating Features for Transaction Data

Assessing Features Using Visualizations and Correlations

Exercise 21: Examining Relationships between Predictors and Outcome

Activity 8: Examining Relationships Between Storefront Locations and Features about Their Area

Performing and Interpreting Linear Regression

Exercise 22: Building a Linear Model Predicting Customer Spend

Activity 9: Building a Regression Model to Predict Storefront Location Revenue

Summary

Chapter 6: Other Regression Techniques and Tools for Evaluation

Introduction

Evaluating the Accuracy of a Regression Model

Residuals and Errors

Mean Absolute Error

Root Mean Squared Error

Exercise 23: Evaluating Regression Models of Location Revenue Using MAE and RMSE

Activity 10: Testing Which Variables are Important for Predicting Responses to a Marketing Offer

Using Regularization for Feature Selection

Exercise 24: Using Lasso Regression for Feature Selection

Activity 11: Using Lasso Regression to Choose Features for Predicting Customer Spend

Tree-Based Regression Models

Random Forests

Exercise 25: Using Tree-Based Regression Models to Capture Non-Linear Trends

Activity 12: Building the Best Regression Model for Customer Spend Based on Demographic Data

Summary

Chapter 7: Supervised Learning: Predicting Customer Churn

Introduction

Classification Problems

Understanding Logistic Regression

Revisiting Linear Regression

Logistic Regression

Exercise 26: Plotting the Sigmoid Function

Cost Function for Logistic Regression

Assumptions of Logistic Regression

Exercise 27: Loading, Splitting, and Applying Linear and Logistic Regression to Data

Creating a Data Science Pipeline

Obtaining the Data

Exercise 28: Obtaining the Data

Scrubbing the Data

Exercise 29: Imputing Missing Values

Exercise 30: Renaming Columns and Changing the Data Type

Exploring the Data

Statistical Overview

Correlation

Exercise 31: Obtaining the Statistical Overview and Correlation Plot

Visualizing the Data

Exercise 32: Performing Exploratory Data Analysis (EDA)

Activity 13: Performing OSE of OSEMN

Modeling the Data

Feature Selection

Exercise 33: Performing Feature Selection

Model Building

Exercise 34: Building a Logistic Regression Model

Interpreting the Data

Activity 14: Performing MN of OSEMN

Summary

Chapter 8: Fine-Tuning Classification Algorithms

Introduction

Support Vector Machines

Intuition Behind Maximum Margin

Linearly Inseparable Cases

Linearly Inseparable Cases Using Kernel

Exercise 35: Training an SVM Algorithm Over a Dataset

Decision Trees

Exercise 36: Implementing a Decision Tree Algorithm Over a Dataset

Important Terminology of Decision Trees

Decision Tree Algorithm Formulation

Random Forest

Exercise 37: Implementing a Random Forest Model Over a Dataset

Activity 15: Implementing Different Classification Algorithms

Preprocessing Data for Machine Learning Models

Standardization

Exercise 38: Standardizing Data

Scaling

Exercise 39: Scaling Data After Feature Selection

Normalization

Exercise 40: Performing Normalization on Data

Model Evaluation

Exercise 41: Implementing Stratified k-fold

Fine-Tuning of the Model

Exercise 42: Fine-Tuning a Model

Activity 16: Tuning and Optimizing the Model

Performance Metrics

Precision

Recall

F1 Score

Exercise 43: Evaluating the Performance Metrics for a Model

ROC Curve

Exercise 44: Plotting the ROC Curve

Activity 17: Comparison of the Models

Summary

Chapter 9: Modeling Customer Choice

Introduction

Understanding Multiclass Classification

Classifiers in Multiclass Classification

Exercise 45: Implementing a Multiclass Classification Algorithm on a Dataset

Performance Metrics

Exercise 46: Evaluating Performance Using Multiclass Performance Metrics

Activity 18: Performing Multiclass Classification and Evaluating Performance

Class Imbalanced Data

Exercise 47: Performing Classification on Imbalanced Data

Dealing with Class-Imbalanced Data

Exercise 48: Visualizing Sampling Techniques

Exercise 49: Fitting a Random Forest Classifier Using SMOTE and Building the Confusion Matrix

Activity 19: Dealing with Imbalanced Data

Summary

Appendix

Preface

About

This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software needed to complete all of the included activities and exercises.

About the Book

Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of it based on the segments.

The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data in Python, manipulate it, and create plots using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices.

By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.

About the Authors

Tommy Blanchard earned his PhD from the University of Rochester and did his postdoctoral training at Harvard. Now, he leads the data science team at Fresenius Medical Care North America. His team performs advanced analytics and creates predictive models to solve a wide variety of problems across the company.

Debasish Behera works as a data scientist for a large Japanese corporate bank, where he applies machine learning/AI to solve complex problems. He has worked on multiple use cases involving AML, predictive analytics, customer segmentation, chat bots, and natural language processing. He currently lives in Singapore and holds a Master's in Business Analytics (MITB) from the Singapore Management University.

Pranshu Bhatnagar works as a data scientist in the telematics, insurance, and mobile software space. He has previously worked as a quantitative analyst in the FinTech industry and often writes about algorithms, time series analysis in Python, and similar topics. He graduated with honors from the Chennai Mathematical Institute with a degree in Mathematics and Computer Science and has completed certification books in Machine Learning and Artificial Intelligence from the International Institute of Information Technology, Hyderabad. He is based in Bangalore, India.

Objectives

Analyze and visualize data in Python using pandas and Matplotlib

Study clustering techniques, such as hierarchical and k-means clustering

Create customer segments based on manipulated data

Predict customer lifetime value using linear regression

Use classification algorithms to understand customer choice

Optimize classification algorithms to extract maximal information

Audience

Data Science for Marketing Analytics is designed for developers and marketing analysts looking to use new, more sophisticated tools in their marketing analytics efforts. It'll help if you have prior experience of coding in Python and knowledge of high school level mathematics. Some experience with databases, Excel, statistics, or Tableau is useful but not necessary.

Approach

Data Science for Marketing Analytics takes a hands-on approach to the practical aspects of using Python data analytics libraries to ease marketing analytics efforts. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.

Minimum Hardware Requirements

For an optimal student experience, we recommend the following hardware configuration:

Processor: Dual Core or better

Memory: 4 GB RAM

Storage: 10 GB available space

Software Requirements

You'll also need the following software installed in advance:

Any of the following operating systems: Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit, or Windows 10 32/64-bit, Ubuntu 14.04 or later, or macOS Sierra or later.

Browser: Google Chrome or Mozilla Firefox

Conda

Python 3.x

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Import the cluster module from the sklearn package."

A block of code is set as follows:

plt.xlabel('Income')

plt.ylabel('Age')

plt.show()

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The Year column appears to have matched to the right values, but the line column does not seem to make much sense."

Installation and Setup

We recommend installing Python using the Anaconda distribution, available here: https://www.anaconda.com/distribution/.

It contains most of the modules that will be used. Additional Python modules can be installed using the methods here: https://docs.python.org/3/installing/index.html. Only one module used in this book is not part of the standard Anaconda distribution; use one of the methods in the linked page to install it:

kmodes

If you do not use the Anaconda distribution, make sure you have the following modules installed:

jupyter

pandas

sklearn (installed under the package name scikit-learn)

numpy

scipy

seaborn

statsmodels

Installing the Code Bundle

Copy the code bundle for the class to the C:/Code folder.

Additional Resources

The code bundle for this book is also hosted on GitHub at: https://github.com/TrainingByPackt/Data-Science-for-Marketing-Analytics.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Chapter 1

Data Preparation and Cleaning

Learning Objectives

By the end of this chapter, you will be able to:

Create pandas DataFrames in Python

Read and write data into different file formats

Slice, aggregate, filter, and apply functions (built-in and custom) to DataFrames

Join DataFrames, handle missing values, and combine different data sources

This chapter covers basic data preparation and manipulation techniques in Python, which is the foundation of data science.

Introduction

The way we make decisions in today's world is changing. A very large proportion of our decisions (which movie to watch, which song to listen to, which item to buy, or which restaurant to visit) rely upon recommendations and ratings generated by analytics. As decision makers continue to use more of such analytics to make decisions, they themselves become data points for further improvements, and as their own custom needs for decision making continue to be met, they also keep using these analytical models frequently.

The change in consumer behavior has also influenced the way companies develop strategies to target consumers. With the increased digitization of data, greater availability of data sources, and lower storage and processing costs, firms can now crunch large volumes of increasingly granular data with the help of various data science techniques and leverage it to create complex models, perform sophisticated tasks, and derive valuable consumer insights with higher accuracy. It is because of this dramatic increase in data and computing power, and the advancement in techniques to use this data through data science algorithms, that the McKinsey Global Institute calls our age the Age of Analytics.

Several industry leaders are already using data science to make better decisions and to improve their marketing analytics. Google and Amazon have been making targeted recommendations catering to the preferences of their users from their very early years. Predictive data science algorithms tasked with generating leads from marketing campaigns at Dell reportedly converted 50% of the final leads, whereas those generated through traditional methods had a conversion rate of only 17%. Price surges on Uber for non-pass holders during rush hour also reportedly had massive positive effects on the company's profits. In fact, it was recently discovered that price management initiatives based on an evaluation of customer lifetime value tended to increase business margins by 2%–7% over a 12-month period and resulted in a 200%–350% ROI in general.

Although using data science principles in marketing analytics is a proven cost-effective, efficient way for a lot of companies to observe a customer's journey and provide a more customized experience, multiple reports suggest that it is not being used to its full potential. There is a wide gap between the possible and actual usage of these techniques by firms. This book aims to bridge that gap, and covers an array of useful techniques involving everything data science can do in terms of marketing strategies and decision-making in marketing. By the end of the book, you should be able to successfully create and manage an end-to-end marketing analytics pipeline in Python, segment customers based on the data provided, predict their lifetime value, and model their decision-making behavior on your own using data science techniques.

This chapter introduces you to cleaning and preparing data, the first step in any data-centric pipeline. Raw data coming from external sources cannot generally be used directly; it needs to be structured, filtered, combined, analyzed, and observed before it can be used for any further analyses. In this chapter, we will explore how to get the right data in the right attributes, manipulate rows and columns, and apply transformations to data. This is essential because, otherwise, we will be passing incorrect data to the pipeline, thereby making it a classic example of garbage in, garbage out.

Data Models and Structured Data

When we build an analytics pipeline, the first thing that we need to do is to build a data model. A data model is an overview of the data sources that we will be using, their relationships with other data sources, where exactly the data from a specific source is going to enter the pipeline, and in what form (such as an Excel file, a database, or a JSON from an internet source). The data model for the pipeline evolves over time as data sources and processes change. A data model can contain data of the following three types:

Structured data: This is also known as completely structured or well-structured data. This is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. Data can be queried exactly through SQL queries. Examples include data held in relational databases such as MySQL, Amazon Redshift, and so on.

Semi-structured data: This refers to data that may be of variable lengths and that may contain different data types (such as numerical or categorical) in the same column. Such data may be arranged in a nested or hierarchical tabular structure, but it still follows a fixed schema. There are no duplicate columns (attributes), but there may be duplicate rows (observations). Also, each row might not contain values for every attribute, that is, there may be missing values. Semi-structured data can be stored accurately in NoSQL databases, Apache Parquet files, JSON files, and so on.

Unstructured data: Data that is unstructured may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts. Unstructured data can be stored as text files, CSV files, Excel files, images, audio clips, and so on.

Marketing data, traditionally, comprises data of all three types. Initially, most data points originated from different (possibly manual) data sources, so the values for a field could be of different lengths, the value for one field would not match that of other fields because of different field names, some rows containing data from even the same sources could also have missing values for some of the fields, and so on. But now, because of digitization, structured and semi-structured data is also available and is increasingly being used to perform analytics. The following figure illustrates the data model of traditional marketing analytics comprising all kinds of data: structured data such as databases (top), semi-structured data such as JSONs (middle), and unstructured data such as Excel files (bottom):

Figure 1.1: Data model of traditional marketing analytics

A data model with all these different kinds of data is prone to errors and is very risky to use. If we somehow get a garbage value into one of the attributes, our entire analysis will go awry. Most of the time, the data we need is of a certain kind, and if we don't get that type of data, we might run into a bug or problem that would need to be investigated. Therefore, if we can enforce some checks to ensure that the data being passed to our model is almost always of the same kind, we can easily improve the quality of data from unstructured to at least semi-structured.

This is where programming languages such as Python come into play. Python is a general-purpose programming language that not only makes writing structure-enforcing scripts easy, but also integrates with almost every platform and automates data production, analysis, and analytics into a more reliable and predictable pipeline. Apart from understanding patterns and giving at least a basic structure to data, Python forces intelligent pipelines to accept the right value for the right attribute. The majority of analytics pipelines are exactly of this kind. The following figure illustrates how most marketing analytics today structure different kinds of data by passing it through scripts to make it at least semi-structured:

Figure 1.2: Data model of most marketing analytics that use Python

By making use of such structure-enforcing scripts, we will have a pipeline of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format to perform analytics. If we can completely structure our data (that is, arrange it in flat tables, with the right value pointing to the right attribute with no nesting or hierarchy), it will be easy for us to see how every data point individually compares to other points being considered in the common fields, and it would also make the pipeline scalable. We can easily get a feel of the data (that is, see in what range most values lie, identify the clear outliers, and so on) by simply scrolling through the data.

While there are a lot of tools that can be used to convert data from an unstructured/semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool that is most commonly used for data science, can be integrated with practically any framework, has rich functionalities and minimal costs, and is easy to use for our use case, is pandas. The following figure illustrates how a data model structures different kinds of data from being possibly unstructured to semi-structured (using Python), to completely structured (using pandas):

Figure 1.3: Data model to structure the different kinds of data

Note

For the purpose of this book, we will assume that you are more or less comfortable with NumPy.
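The move from semi-structured to completely structured data described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records and field names are hypothetical, and pd.json_normalize is one possible flattening tool (available in pandas 1.0 and later), not necessarily the approach used later in this book.

```python
import pandas as pd

# Hypothetical semi-structured records: a nested "address" field,
# and one record with a missing value (no zip code).
records = [
    {"id": 1, "name": "Ana", "address": {"city": "Pune", "zip": "411001"}},
    {"id": 2, "name": "Ben", "address": {"city": "Leeds"}},
]

# json_normalize flattens nested keys into dotted column names,
# giving a flat table with one column per attribute.
flat = pd.json_normalize(records)

print(sorted(flat.columns))
print(flat)
```

The missing zip code shows up as a null value in the flat table instead of silently shifting other values, which is the property the structure-enforcing step is meant to provide.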

pandas

pandas is a software library written in Python and is the basis for data manipulation and analysis in the language. Its name comes from "panel data," an econometrics term for datasets that include observations over multiple time periods for the same individuals.

pandas offers a collection of high-performance, easy-to-use, and intuitive data structures and analysis tools that are of great use to marketing analysts and data scientists alike. It has the following two primary object types:

DataFrame: This is the fundamental tabular relationship object that stores data in rows and columns (like a spreadsheet). To perform data analysis, functions and operations can be directly applied to DataFrames.

Series: This refers to a single column of the DataFrame. The value can be accessed through its index. As Series automatically infers a type, it automatically makes all DataFrames well-structured.

The following figure illustrates a pandas DataFrame with an automatic integer index (0, 1, 2, 3...):

Figure 1.4: A sample pandas DataFrame
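As a minimal sketch of the two object types, the following builds a small DataFrame and pulls one column out as a Series. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical campaign data; names and values are made up.
df = pd.DataFrame({
    "channel": ["email", "social", "search"],
    "views": [1200, 5400, 3100],
    "likes": [45, 310, 120],
})

views = df["views"]          # selecting one column returns a Series

print(type(df).__name__)     # DataFrame
print(type(views).__name__)  # Series
print(df.index.tolist())     # automatic integer index: [0, 1, 2]
print(views[1])              # value accessed through its index label: 5400
print(views.dtype)           # the type pandas inferred for this column
```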

Now that we understand what pandas objects are and how they can be used to automatically get structured data, let's take a look at some of the functions we can use to import and export data in pandas and see if the data we passed is ready to be used for further analyses.

Importing and Exporting Data With pandas DataFrames

Every team in a marketing group can have its own preferred data type for their specific use case. Those who have to deal with a lot more text than numbers might prefer using JSON or XML, while others may prefer CSV, XLS, or even Python objects. pandas has a lot of simple APIs (application programming interfaces) that allow it to read a large variety of data directly into DataFrames. Some of the main ones are shown here:

Figure 1.5: Ways to import and export different types of data with pandas DataFrames

Note

Remember that a well-structured DataFrame does not have hierarchical or nested data. The read_xml(), read_json(), and read_html() functions (and others) cause the data to lose its hierarchical data types/nested structure and convert it into flattened objects such as lists and lists of lists. pandas, however, does support hierarchical data for data analysis. You can save and load such data by pickling from your session and maintaining the hierarchy in such cases. When working with data pipelines, it's advised to split nested data into separate streams to maintain the structure.
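The pickling route mentioned in the note can be sketched as follows. This is an illustrative example only: the two-level index, the revenue figures, and the file name are all hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical revenue figures indexed by a two-level (region, year) hierarchy.
index = pd.MultiIndex.from_tuples(
    [("EU", 2017), ("EU", 2018), ("US", 2017), ("US", 2018)],
    names=["region", "year"],
)
sales_by_region = pd.DataFrame({"revenue": [10.5, 12.0, 20.1, 22.3]}, index=index)

# Pickle the DataFrame to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales_by_region.pkl")
    sales_by_region.to_pickle(path)
    restored = pd.read_pickle(path)

print(list(restored.index.names))        # the hierarchy levels survive
print(restored.equals(sales_by_region))  # round trip preserves the data
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.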

When loading data, pandas provides us with additional parameters that we can pass to read functions, so that we can load the data differently. Some additional parameters that are used commonly when importing data into pandas are given here:

skiprows = k: This skips the first k rows.

nrows = k: This parses only the first k rows.

names = [col1, col2...]: This lists the column names to be used in the parsed DataFrame.

header = k: This uses row k of the file (counting from 0) as the header that supplies the column names for the DataFrame. k can also be None.

index_col = col: This sets col as the index of the DataFrame being used. This can also be a list of column names (used to create a MultiIndex) or None.

usecols = [l1, l2...]: This provides either integer positional indices in the document columns or strings that correspond to column names in the DataFrame to be read. For example, [0, 1, 2] or ['foo', 'bar', 'baz'].
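The parameters listed above can be tried on a small in-memory file. This is a sketch under stated assumptions: the CSV text, the column names, and the variable names are hypothetical, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical raw export: one junk line, then a header row, then data.
raw = """exported by reporting tool
post_id,views,likes,shares
p1,1000,50,5
p2,2500,120,12
p3,400,10,1
p4,9000,700,80
"""

# skiprows=1 drops the junk line, so the next line is read as the header;
# index_col turns the post_id column into the index.
df_skip = pd.read_csv(io.StringIO(raw), skiprows=1, index_col="post_id")

# skiprows=2 drops the junk line and the file's own header, header=None says
# the remaining lines are all data, and names supplies our own labels.
# usecols keeps a subset of columns; nrows=2 parses only two data rows.
df_named = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,
    header=None,
    names=["post", "views", "likes", "shares"],
    usecols=["post", "views"],
    nrows=2,
)

print(df_skip)
print(df_named)
```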

Note

There are similar specific parameters for almost every in-built function in pandas. You can find details about them with the documentation for pandas available at the following link: https://pandas.pydata.org/pandas-docs/stable/.

Viewing and Inspecting Data in DataFrames

Once you've read the DataFrame using the API, as explained earlier, you'll notice that, unless there is something grossly wrong with the data, the API generally does not fail, and we get a DataFrame object after the call. However, we need to inspect the data ourselves to check whether the right attribute has received the right data, for which we can use several in-built functions that pandas provides. Assume that we have stored the DataFrame in a variable called df; then:

df.head(n) will return the first n rows of the DataFrame. If no n is passed, by default, the function considers n to be 5.

df.tail(n) will return the last n rows of the DataFrame. If no n is passed, by default, the function considers n to be 5.

df.shape will return a tuple of the type (number of rows, number of columns).

df.dtypes will return the type of data in each column of the pandas DataFrame (such as float64, int64, object, and so on).

df.info() will summarize the DataFrame and print its size, type of values, and the count of non-null values.
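The inspection calls above can be tried together on a small DataFrame. The data here is hypothetical and exists only to make the example self-contained.

```python
import pandas as pd

# Hypothetical DataFrame with 8 rows; the values are made up.
df = pd.DataFrame({
    "views": [100, 250, 90, 400, 320, 150, 80, 510],
    "ctr": [0.10, 0.08, 0.12, 0.05, 0.07, 0.09, 0.11, 0.04],
    "channel": ["a", "b", "a", "c", "b", "a", "c", "b"],
})

print(df.head())    # first 5 rows by default
print(df.tail(3))   # last 3 rows
print(df.shape)     # (8, 3); shape is an attribute, so no parentheses
print(df.dtypes)    # one dtype per column
df.info()           # prints its summary directly and returns None
```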

Exercise 1: Importing JSON Files into pandas

For this exercise, you need to use the user_info.json file provided to you in the Lesson01 folder. The file contains some anonymous personal user information collected from six customers through a web-based form in JSON format. You need to open a Jupyter Notebook, import the JSON file into the console as a pandas DataFrame, and see whether it has loaded correctly, with the right values being passed to the right attribute.

Note

All the exercises and activities in this chapter can be done in both the Jupyter Notebook and Python shell. While we can do them in the shell for now, it is highly recommended to use the Jupyter Notebook. To learn how to install Jupyter and set up the Jupyter Notebook, check https://jupyter.readthedocs.io/en/latest/install.html. It will be assumed that you are using a Jupyter Notebook from the next chapter onward.

1. Open a Jupyter Notebook to implement this exercise. Once you are in the console, import the pandas library using the import command, as follows:

import pandas as pd

2. Read the user_info.json JSON file into the user_info DataFrame:

user_info = pd.read_json("user_info.json")

3. Check the first few values in the DataFrame using the head command:

user_info.head()

You should see the following output:

Figure 1.6: Viewing the first few rows of user_info.json

4. As we can see, the data makes sense superficially. Let's see if the data types match too. Type in the following command:

user_info.info()

You should get the following output:

Figure 1.7: Information about the data in user_info

From the preceding figure, notice that the isActive column is Boolean, the age and index columns are integers, whereas the latitude and longitude columns are floats. The rest of the elements are Python objects, most likely to be strings. Looking at the names, they match our intuition. So, the data types seem to match. Also, the number of observations seems to be the same for all fields, which implies that there has been no data loss.

Note

The 64 displayed with the type above is an indicator of precision and varies on different platforms.

5. Let's also see the number of rows and columns in the DataFrame using the shape attribute of the DataFrame:

user_info.shape

This will give you (6, 22) as the output, indicating that the DataFrame created by the JSON has 6 rows and 22 columns.

Congratulations! You have loaded the data correctly, with the right attributes corresponding to the right columns and with no missing values. Since the data was already structured, it is now ready to be put into the pipeline to be used for further analysis.
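If the Lesson01 files are not at hand, the same checks can be rehearsed on a small stand-in. The JSON below is hypothetical (it is not the contents of user_info.json) and reuses only a few of the column names mentioned in the exercise; io.StringIO stands in for the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for user_info.json: a JSON array of flat records.
raw_json = """[
  {"index": 0, "isActive": true,  "age": 34, "latitude": 12.97, "name": "A. Rao"},
  {"index": 1, "isActive": false, "age": 27, "latitude": 51.50, "name": "B. Lee"}
]"""

users = pd.read_json(io.StringIO(raw_json))

print(users.shape)    # (rows, columns)
print(users.dtypes)   # Boolean, integer, float, and text columns

# True only if every column has a non-null value in every row.
print(users.notna().all().all())
```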

Exercise 2: Identifying Semi-Structured and Unstructured Data

In this exercise, you will be using the data.csv and sales.xlsx files provided to you in the Lesson01 folder. The data.csv file contains the views and likes of 100 different posts on Facebook in a marketing campaign, and sales.xlsx contains some historical sales data recorded in MS Excel about different customer purchases in stores in the past few years. We want to read the files into pandas DataFrames and check whether the output is ready to be added into the analytics pipeline. Let's first work with the data.csv file:

1. Import pandas into the console, as follows:

import pandas as pd

2. Use the read_csv method to read the data.csv CSV file into a campaign_data DataFrame:

campaign_data = pd.read_csv("data.csv")

3. Look at the current state of the DataFrame using the head function:

campaign_data.head()

Your output should look as follows:

Figure 1.8: Viewing raw campaign_data

From the preceding output, we can observe that the first column has an issue; we want to have "views" and "likes" as the column names and for the DataFrame to have numeric values.

4. We will read the data into campaign_data again, but this time we use the header parameter so that the column names are taken from the row that holds them. Because header counts rows from 0, header=1 selects the second line of the file and skips the line above it:

campaign_data = pd.read_csv("data.csv", header=1)

5. Let's now view campaign_data again, and see whether the attributes are okay now:

campaign_data.head()

Your DataFrame should now appear as follows:

Figure 1.9: campaign_data after being read with the header parameter

6. The values seem to make sense when we look at the first few rows, with the views being far more than the likes, but because of some misalignment or missing values, the last few rows might be different. So, let's have a look at them:

campaign_data.tail()

You will get the following output:

Figure 1.10: The last few rows of campaign_data

7. There doesn't seem to be any misalignment of data or missing values at the end. However, although we have seen the last few rows, we still can't be sure that all values in the middle (hidden) part of the DataFrame are okay too. We can check the data types of the DataFrame to be sure:

campaign_data.info()

Figure 1.11: info() of campaign_data

8. We also need to ensure that we have not lost some observations because of our cleaning. We use the shape attribute for that:

campaign_data.shape

You will get an output of (100, 2), indicating that we still have 100 observations with 2 columns. The dataset is now completely structured and can easily be a part of any further analysis or pipeline.

9. Let'snowanalyzethesales.xlsxfile.Usetheread_excelfunction

toreadthefileinaDataFramecalledsales:

sales=pd.read_excel("sales.xlsx")

10. LookatthefirstfewrowsofthesalesDataFrame:

sales.head()

Youroutputshouldlookasfollows:

Figure1.12:Firstfewrowsofsales.xlsx

Fromtheprecedingfigure,theYearcolumnappearstohavematchedtotherightvalues,butthelinecolumndoesnotseemtomakemuchsense.TheProduct.1,Product.2,columnsimplythattherearemultiplecolumnswiththesamename!EventhevaluesoftheOrderandmethodcolumnsbeingWaterandBag,respectively,makeusfeelasthoughsomethingiswrong.

11. Let'slookatgatheringsomemoreinformation,suchasnullvaluesandthedatatypesofthecolumns,andseeifwecanmakemoresenseofthedata:

sales.info()

Youroutputwilllookasfollows:

Figure1.13:Outputofsales.info()

Astherearesomecolumnswithnonon-nullvalues,thecolumnnamesseemtohavebrokenupincorrectly.Thisisprobablywhytheoutputofinfoshowedacolumnsuchasrevenueashavinganarbitrarydata

typesuchasobject(usuallyusedtorefertocolumnscontaining

strings).Itmakessenseiftheactualcolumnnamesstartwithacapitalletterandtheremainingcolumnsarecreatedasaresultofdataspillingfromtheprecedingcolumns.

12. Let's try to read the file with just the new, correct column names and see whether we get anything. Use the following code:

sales = pd.read_excel("sales.xlsx", names=[
    "Year", "Product line", "Product type", "Product",
    "Order method type", "Retailer Country", "Revenue",
    "Planned revenue", "Product cost", "Quantity",
    "Unit cost", "Unit price", "Gross Profit", "Unit sale price"])

You get the following output:

Figure 1.14: Attempting to structure sales.xlsx

Unfortunately, the issue is not just with the columns, but with the underlying values too. The value of one column is leaking into another and thus ruining the structure. Understandably, the code fails because of a length mismatch between the names we supplied and the columns found in the file. Therefore, we can conclude that the sales.xlsx data is very unstructured.

With the use of the API and what we know up till this point, we can't directly get this data to be structured. To understand how to approach structuring this kind of data, we need to dive deep into the internal structure of pandas objects and understand how data is actually stored in pandas, which we will do in the following sections. We will come back to preparing this data for further analysis in a later section.

Structure of a pandas Series

Let's say you want to store some values from a data store in a data structure. It is not necessary for every element of the data to have values, so your structure should be able to handle that. It is also a very common scenario where there is some discrepancy between two data sources on how to identify a data point. So, instead of using default numerical indices (such as 0-100) or user-given names to access it, like in a dictionary, you would like to access every value by a name that comes from within the data source. This is achieved in pandas using a pandas Series.

A pandas Series is nothing but an indexed NumPy array. To make a pandas Series, all you need to do is create an array and give it an index. If you create a Series without an index, it will create a default numeric index that starts from 0 and goes on for the length of the Series, as shown in the following figure:

Figure 1.15: Sample pandas Series

Note

Because a Series is backed by a NumPy array, most functions that work on a NumPy array work the same way on a pandas Series too.
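The following is a minimal sketch of the ideas above: a Series with the default index, a Series with labels taken from the data, and a NumPy function applied to a Series. The values and the customer labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Without an explicit index, pandas assigns 0, 1, 2, ...
default_indexed = pd.Series([10, 20, 30])

# With an explicit index, each value is addressed by a label.
# The labels below are hypothetical customer IDs.
labelled = pd.Series([10, 20, 30], index=["cust_a", "cust_b", "cust_c"])

print(list(default_indexed.index))   # [0, 1, 2]
print(labelled["cust_b"])            # 20

# A NumPy function accepts the Series and the result keeps its index.
doubled = np.multiply(labelled, 2)
print(doubled["cust_c"])             # 60
```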

Once you've created a number of Series, you might want to access the values associated with some specific indices all at once to perform an operation. This is just aggregating the Series with a specific value of the index. It is here that pandas DataFrames come into the picture. A pandas DataFrame is just a dictionary with the column names as keys and values as different pandas Series, joined together by the index:

Figure 1.16: Series joined together by the same index create a pandas DataFrame
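As a small illustration of a DataFrame being a dictionary of Series joined on their index, the sketch below builds one from two Series that do not share all their labels. The column names and values are hypothetical.

```python
import pandas as pd

# Two Series that share some, but not all, index labels.
clicks = pd.Series([120, 80, 45], index=["mon", "tue", "wed"])
spend = pd.Series([30.0, 12.5], index=["mon", "wed"])

# Building a DataFrame from a dict of Series aligns the values on the index.
frame = pd.DataFrame({"clicks": clicks, "spend": spend})
print(frame)

# A label missing from one Series ("tue" in spend) is filled with NaN.
print(frame.loc["tue", "spend"])

# Each column can still be pulled out as a Series by its key.
print(type(frame["clicks"]))
```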

This way of storing data makes it very easy to perform the operations we need on the data we want. We can easily choose the Series we want to modify by picking a column and directly slicing off indices based on the value in that column. We can also group indices with similar values in one column together and see how the values change in other columns.

Other than this one-dimensional Series structure to access the DataFrame, pandas also has the concept of axes, where an operation can be applied to both rows (or indices) and columns. You can choose which one to apply it to by specifying the axis, 0 referring to rows and 1 referring to columns, thereby making it very easy to access the underlying headers and the values associated with them:

Figure 1.17: Understanding axis=0 and axis=1 in pandas
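To make the axis convention concrete, here is a short sketch on a made-up DataFrame. It shows the same axis argument used by an aggregation (sum) and by drop.

```python
import pandas as pd

scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 works down the rows, giving one result per column.
per_column = scores.sum(axis=0)   # a -> 6, b -> 60

# axis=1 works across the columns, giving one result per row.
per_row = scores.sum(axis=1)      # 11, 22, 33

# drop follows the same convention: axis=0 looks up row labels,
# axis=1 looks up column labels.
without_first_row = scores.drop(0, axis=0)
without_b = scores.drop("b", axis=1)

print(per_column.tolist(), per_row.tolist())
print(without_first_row.shape, without_b.shape)  # (2, 2) (3, 1)
```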

Data Manipulation

Now that we have deconstructed the structure of the pandas DataFrame down to its basics, the rest of the wrangling tasks, that is, creating new DataFrames, selecting or slicing a DataFrame into its parts, filtering DataFrames for some values, joining different DataFrames, and so on, will become very intuitive.

Selecting and Filtering in pandas

It is standard convention in spreadsheets to address a cell by (column name, row name). Since data is stored in pandas in a similar manner, this is also the way to address a cell in a pandas DataFrame: the column name acts as a key to give you the pandas Series, and the row name gives you the value on that index of the DataFrame.

But if you need to access more than a single cell, such as a subset of some rows and columns from the DataFrame, or change the order of display of some columns on the DataFrame, you can make use of the syntax listed in the following table:

Figure 1.18: A table listing the syntax used for different operations on a pandas DataFrame
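The table itself is not reproduced in this text, so the sketch below only illustrates a few commonly used selection forms (label-based, position-based, boolean filtering, and column reordering). It is not necessarily the exact set the table lists, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Chennai"], "viewers": [250, 400, 150]},
    index=["r1", "r2", "r3"],
)

one_cell = df.loc["r2", "viewers"]     # single cell by row label and column name
first_two = df.iloc[:2]                # rows by integer position
busy = df[df["viewers"] > 200]         # rows filtered on a column's values
reordered = df[["viewers", "city"]]    # columns in a different display order

print(one_cell)                        # 400
print(list(busy.index))                # ['r1', 'r2']
print(list(reordered.columns))         # ['viewers', 'city']
```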

Creating Test DataFrames in Python

We frequently need to create test objects while building a data pipeline in pandas. Test objects give us a reference point to figure out what we have been able to do up till that point and make it easier to debug our scripts. Generally, test DataFrames are small in size, so that the output of every process is quick and easy to compute. There are two ways to create test DataFrames: by creating completely new DataFrames, or by duplicating or taking a slice of a previously existing DataFrame:

Creating new DataFrames: We typically use the DataFrame constructor to create a completely new DataFrame. It directly converts a Python object into a pandas DataFrame. The DataFrame constructor will, in general, work with any iterable collection of data (such as dict, list, and so on). We can also pass an empty collection or a singleton collection to it.

For example, we will get the same DataFrame through any of the following lines of code:

pd.DataFrame({'category': pd.Series([1, 2, 3])})

pd.DataFrame([1, 2, 3], columns=['category'])

pd.DataFrame.from_dict({'category': [1, 2, 3]})

The following figure shows the outputs received each time:

Figure 1.19: Output generated by all three ways to create a DataFrame

A DataFrame can also be built by passing any pandas objects to the DataFrame function. The following line of code gives the same output as the preceding figure:

pd.DataFrame(pd.Series([1, 2, 3]), columns=["category"])

Duplicating or slicing a previously existing DataFrame: The second way to create a test DataFrame is by copying a previously existing DataFrame. Python, and therefore, pandas, has shallow references. When we say obj1 = obj2, the objects share the location or the reference to the same object in memory. So, if we change obj2, obj1 also gets modified, and vice versa. This is tackled in the standard library with the deepcopy function in the copy module. The deepcopy function allows the user to recursively go through the objects being pointed to by the references and create entirely new objects.

So, when you want to copy a previously existing DataFrame and don't want the previous DataFrame to be affected by modifications in the current DataFrame, you need to use the deepcopy function. You can also slice the previously existing DataFrame and pass it to the function, and it will be considered a new DataFrame. For example, the following code snippet will recursively copy everything in df1 and not have any references to it when you make changes to df:

import pandas

import copy

df = copy.deepcopy(df1)
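The difference between sharing a reference and making a deep copy can be checked directly. The sketch below uses a made-up df1. The variable names alias and independent are illustrative only, and the final line shows pandas' own DataFrame.copy() as an alternative that the text above does not cover.

```python
import copy
import pandas as pd

df1 = pd.DataFrame({"category": [1, 2, 3]})

alias = df1                        # same object under a second name
independent = copy.deepcopy(df1)   # separate object with its own data

# Adding a column through the alias also changes df1 ...
alias["flag"] = True
print("flag" in df1.columns)          # True

# ... while the deep copy is untouched.
print("flag" in independent.columns)  # False

# DataFrame.copy() (deep=True by default) also produces an independent copy.
also_independent = df1.copy()
```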

Adding and Removing Attributes and Observations

pandas provides the following functions to add and delete rows (observations) and columns (attributes):

df['col'] = s: This adds a new column, col, to the DataFrame, df, with the Series, s.

df.assign(c1=s1, c2=s2...): This adds new columns, c1, c2, and so on, with Series, s1, s2, and so on, to the df DataFrame in one go. It returns a new DataFrame and leaves df itself unchanged.

df.append(df2 / d2, ignore_index): This adds values from the df2 DataFrame to the bottom of the df DataFrame wherever the columns of df2 match those of df. Alternatively, it also accepts a dict, d2, and if ignore_index=True, it does not use index labels. Like assign, it returns a new DataFrame. Note that append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat serves the same purpose.

df.drop(labels, axis): This removes the rows or columns specified by the labels and corresponding axis, or those specified by the index or column names directly.

df.dropna(axis, how): Depending on the parameter passed to how, this decides whether to drop rows (or columns if axis=1) with missing values in any of the fields or in all of the fields. If no parameter is passed, the default value of how is any and the default value of axis is 0.

df.drop_duplicates(keep): This removes rows with duplicate values in the DataFrame, and keeps the first (keep='first'), last (keep='last'), or no occurrence (keep=False) in the data.

We can also combine different pandas DataFrames sequentially with the concat function, as follows:

pd.concat([df1, df2..]): This creates a new DataFrame with df1, df2, and all other DataFrames combined sequentially. It will automatically combine columns having the same names in the combined DataFrames.
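The removal functions and concat can be tried on a tiny made-up DataFrame. The sketch below assumes nothing beyond the defaults described above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 are duplicates; row 2 has a missing value in column "a".
df = pd.DataFrame({"a": [1, 1, np.nan], "b": ["x", "x", "y"]})

no_missing = df.dropna()                      # how="any", axis=0 by default
first_kept = df.drop_duplicates()             # keep="first" by default
none_kept = df.drop_duplicates(keep=False)    # drop every duplicated row
without_b = df.drop("b", axis=1)              # remove a column by name
stacked = pd.concat([df, df], ignore_index=True)

print(len(no_missing), len(first_kept), len(none_kept))  # 2 2 1
print(list(without_b.columns))                           # ['a']
print(stacked.shape)                                     # (6, 2)
```

None of these calls modify df; each returns a new DataFrame, so the result has to be bound to a name to be kept.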

Exercise 3: Creating and Modifying Test DataFrames

This exercise aims to test your understanding of creating and modifying DataFrames in pandas. We will create a test DataFrame from scratch and add and remove rows and columns by making use of the functions and concepts described so far:

1. Import the pandas library and the copy module, which we will need for this task:

import pandas as pd

import copy

2. Create a DataFrame, df1, and use the head method to see the first few rows of the DataFrame. Use the following code:

df1 = pd.DataFrame({'category': pd.Series([1, 2, 3])})

df1.head()

Your output should be as follows:

Figure 1.20: The first few rows of df1

3. Create a test DataFrame, df, by duplicating df1. Use the deepcopy function:

df = copy.deepcopy(df1)

df.head()

Figure 1.21: The first few rows of df

4. Add a new column, cities, containing different kinds of city groups to the test DataFrame using the following code and take a look at the DataFrame again:

df['cities'] = pd.Series([['Delhi', 'Mumbai'],
                          ['Lucknow', 'Bhopal'],
                          ['Chennai', 'Bangalore']])

df.head()

Figure 1.22: Adding a column to df

5. Now, add multiple columns pertaining to the user viewership using the assign function and again look at the data. Because assign returns a new DataFrame rather than modifying df in place, bind the result back to df. Use the following code:

df = df.assign(
    young_viewers=pd.Series([2000000, 3000000, 1500000]),
    adult_viewers=pd.Series([2500000, 3500000, 1600000]),
    aged_viewers=pd.Series([2300000, 2800000, 2000000])
)

df.head()

Your DataFrame will now appear as follows:

Figure 1.23: Adding multiple columns to df

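6. Use the append function to add a new row to the DataFrame. As the new row is passed as a dictionary containing only partial information, we pass the ignore_index parameter as True. Like assign, append returns a new DataFrame, so bind the result back to df. Note that append is only available in pandas versions earlier than 2.0:

df = df.append({'cities': ["Kolkata", "Hyderabad"],
                'adult_viewers': 2000000,
                'aged_viewers': 2000000,
                'young_viewers': 1500000},
               ignore_index=True)

df.head()

Your DataFrame should now look as follows:

Figure 1.24: Adding another row by using the append function on df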

7. Now, use the concat function to duplicate the test DataFrame and save it as df2. Take a look at the new DataFrame:

df2 = pd.concat([df, df], sort=False)

df2

df2 will show duplicate entries of df, as shown here:

Figure 1.25: Using the concat function to duplicate a DataFrame, df2, in pandas

8. To delete a row from the df DataFrame, we will now pass the index label of the row we want to delete (in this case, label 3, which is the fourth row) to the drop function, as follows:

df.drop([3])

Figure 1.26: Using the drop function to delete a row

9. Similarly, let's delete the aged_viewers column from the DataFrame. We will pass the column name as the parameter to the drop function and specify the axis as 1:

df.drop(['aged_viewers'], axis=1)

Your output will be as follows:

Figure 1.27: Dropping the aged_viewers column in the DataFrame

10. Note that, as the result of the drop function is also a DataFrame, we can chain another function on it too. So, we drop the cities field from df2 and remove the duplicates in it as well:

df2.drop('cities', axis=1).drop_duplicates()

The resulting DataFrame will look as follows:

Figure 1.28: Dropping the cities field and then removing duplicates in df2

Congratulations! You've successfully performed some basic operations on a DataFrame. You now know how to add rows and columns to DataFrames and how to concatenate multiple DataFrames together in a big DataFrame.
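For readers on pandas 2.0 or later, where DataFrame.append is no longer available, the sketch below retraces the main steps of the exercise using pd.concat to add the row. It is an adaptation under that assumption, not the exercise's own code, and it leaves out the cities and aged_viewers columns for brevity.

```python
import pandas as pd

df = pd.DataFrame({"category": [1, 2, 3]})

# assign returns a new DataFrame, so the result is bound back to df.
df = df.assign(
    young_viewers=[2000000, 3000000, 1500000],
    adult_viewers=[2500000, 3500000, 1600000],
)

# Add one partial row by concatenating a one-row DataFrame.
new_row = pd.DataFrame([{"young_viewers": 1500000, "adult_viewers": 2000000}])
df = pd.concat([df, new_row], ignore_index=True, sort=False)

# Drop the row labelled 3 and the adult_viewers column.
# Both calls return new objects and leave df as it was.
trimmed = df.drop([3]).drop("adult_viewers", axis=1)

print(df.shape)       # (4, 3)
print(trimmed.shape)  # (3, 2)
```

The new row has no category value, so that cell is filled with NaN, just as it is when a partial dictionary is appended.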

In the next section, you will learn how to combine multiple data sources into the same DataFrame. When combining data sources, we need to make sure to include common columns from both sources but make sure that no duplication occurs. We would also need to make sure that, unlike the concat function, the combined DataFrame is smart about the index and does not duplicate rows that already exist. This feature is also covered in the next section.

Combining Data

Once the data is prepared from multiple sources in separate pandas DataFrames, we can use the pd.merge function to combine them into the same DataFrame based on a relevant key passed through the on parameter. It is possible that the joining key is named differently in the different DataFrames that are being joined. So, while calling pd.merge(df, df1), we can provide a left_on parameter to specify the column to be merged from df and a right_on parameter to specify the corresponding column in df1.

pandas provides four ways of combining DataFrames through the how parameter. Each value of this parameter is a different kind of join, and they are described as follows:

Figure 1.29: Table describing different joins

The following figure shows two sample DataFrames, df1 and df2, and the results of the various joins performed on these DataFrames:

Figure 1.30: Table showing two DataFrames and the outcomes of different joins on them

For example, we can perform a right and outer join on the DataFrames of the previous exercise using the following code:

pd.merge(df, df1, how='right')

pd.merge(df, df1, how='outer')

The following will be the output of the preceding two joins:

Figure 1.31: Examples of the different types of merges in pandas
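Since the figures are not reproduced here, the sketch below runs the four joins on two small made-up DataFrames whose key columns have different names, so that left_on and right_on are needed. The row counts show which keys each join keeps.

```python
import pandas as pd

# Hypothetical data: "Chitra" appears only on the left, "Dev" only on the right.
viewers = pd.DataFrame({"viewer": ["Aditya", "Bala", "Chitra"], "views": [3, 5, 2]})
users = pd.DataFrame({"user": ["Aditya", "Bala", "Dev"], "cost": [20, 15, 7]})

joined = {
    how: pd.merge(viewers, users, left_on="viewer", right_on="user", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, result in joined.items():
    print(how, len(result))
# inner 2, left 3, right 3, outer 4
```

An inner join keeps only keys present on both sides, left and right joins keep every key from one side, and an outer join keeps every key from either side, filling the gaps with NaN.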

Handling Missing Data

Once we have joined two datasets, it is easy to see what happens to an index present in one of the tables but not in the other. The other columns of that index get the np.nan value, which is pandas' way of telling us that data is missing in that column. Depending on where and how the values are going to be used, missing values can be treated differently. The following are various ways of treating missing values:

We can get rid of missing values completely using df.dropna, as explained in the Adding and Removing Attributes and Observations section.

We can also replace all the missing values at once using df.fillna(). The value we want to fill in will depend heavily on the context and the use case for the data. For example, we can replace all missing values with the mean or median of the data, or even some easy-to-filter value, such as -1, using df.fillna(df.mean()), df.fillna(df.median()), or df.fillna(-1), as shown here:

Figure 1.32: Using the df.fillna function

We can interpolate missing values using the interpolate function:

Figure 1.33: Using the interpolate function to predict category
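The three treatments can be compared side by side on a small Series. This is a minimal sketch with made-up numbers; interpolate is called with its default settings, which perform linear interpolation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled_mean = s.fillna(s.mean())   # mean of 1, 3, 5 is 3.0
filled_flag = s.fillna(-1)         # sentinel value that is easy to filter later
interpolated = s.interpolate()     # fills each gap from its neighbours

print(filled_mean.tolist())    # [1.0, 3.0, 3.0, 3.0, 5.0]
print(filled_flag.tolist())    # [1.0, -1.0, 3.0, -1.0, 5.0]
print(interpolated.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0]
```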

Other than using in-built operations, we can also perform different operations on DataFrames by filtering out rows with missing values in the following ways:

We can check for slices containing missing values using the pd.isnull() function, or those without them using the pd.notnull() function:

df.isnull()

Figure 1.34: Using the .isnull function

We can check whether individual elements are NA using the isna function:

df[['category']].isna()

This will give you the following output:

Figure 1.35: Using the isna function

This describes missing values only in pandas. You might come across different types of missing values in your pandas DataFrame if it gets data from different sources, for example, None in databases. You'll have to filter them out separately, as described in previous sections, and proceed.
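The boolean results of isnull and notnull are typically used as masks to slice a DataFrame. The sketch below shows this on a made-up DataFrame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"category": [1.0, np.nan, 3.0], "city": ["Delhi", "Mumbai", "Chennai"]}
)

mask = df["category"].isnull()                 # True where the value is missing
rows_missing = df[mask]                        # rows with a missing category
rows_present = df[df["category"].notnull()]    # rows with a category
missing_per_column = df.isna().sum()           # count of missing values per column

print(rows_missing["city"].tolist())   # ['Mumbai']
print(len(rows_present))               # 2
print(missing_per_column["category"])  # 1
```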

Exercise 4: Combining DataFrames and Handling Missing Values

The aim of this exercise is to get you used to combining different DataFrames and handling missing values in different contexts, as well as to revisit how to create DataFrames. The context is to get information about the users who are definitely watching a certain webcast on a website so that we can recognize patterns in their behavior:

1. Import the numpy and pandas modules, which we'll be using:

import numpy as np

import pandas as pd

2. Create two empty DataFrames, df1 and df2:

df1 = pd.DataFrame()

df2 = pd.DataFrame()

3. We will now add dummy information about the viewers of the webcast in a column named viewers in df1, and the people using the website in a column named users in df2. Use the following code:

df1['viewers'] = ["Sushmita", "Aditya", "Bala", "Anurag"]

df2['users'] = ["Aditya", "Anurag", "Bala", "Sushmita", "Apoorva"]

4. We will also add a couple of additional columns to each DataFrame. The values for these can be added manually or sampled from a distribution, such as the normal distribution, through NumPy:

np.random.seed(1729)

df1 = df1.assign(views=np.random.normal(100, 100, 4))

df2 = df2.assign(cost=[20, np.nan, 15, 2, 7])

5. View the first few rows of both DataFrames, still using the head method:

df1.head()

df2.head()

You should get the following outputs for both df1 and df2:

Figure 1.36: Contents of df1 and df2

6. Do a left join of df1 with df2 and store the output in a DataFrame, df, because we only want the user stats in df2 of those users who are viewing the webcast in df1. Therefore, we also specify the joining key as "viewers" in df1 and "users" in df2:

df = df1.merge(df2, left_on="viewers", right_on="users", how="left")

df.head()

Your output should now look as follows:

Figure 1.37: Using the merge and fillna functions

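7. You'll observe some missing values (NaN) in the preceding output. We will handle these values in the DataFrame by replacing them with the mean values in that column. Passing numeric_only=True restricts the mean to the numeric columns, which pandas 2.0 and later require when the DataFrame also holds string columns such as viewers and users. Use the following code:

df.fillna(df.mean(numeric_only=True))

Your output will now look as follows:

Figure 1.38: Imputing missing values with the mean through fillna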

Congratulations! You have successfully wrangled with data in data pipelines and transformed attributes externally. But to handle the sales.xlsx file that we saw previously, this is still not enough. We need to apply functions and operations on the data inside the DataFrame too. Let's learn how to do that and more in the next section.

Applying Functions and Operations on DataFrames

By default, operations on all pandas objects are element-wise and return the same type of pandas objects. For instance, look at the following code:

df['viewers'] = df['adult_viewers'] + df['aged_viewers'] + df['young_viewers']

This will add a viewers column to the DataFrame with the value for each observation being equal to the sum of the values in the adult_viewers, aged_viewers, and young_viewers columns.

Similarly, the following code will multiply every numerical value in the viewers column of the DataFrame by 0.03 or whatever you want to keep as your target CTR (click-through rate):

df['expectedclicks'] = 0.03 * df['viewers']

Hence, your DataFrame will look as follows once these operations are performed:

Figure 1.39: Operations on pandas DataFrames

pandas also supports several out-of-the-box built-in functions on pandas objects. These are listed in the following table:

Figure 1.40: Built-in functions used in pandas

Note

Remember that pandas objects are Python objects too. Therefore, we can write our own custom functions to perform specific tasks on them.

We can iterate through the rows and columns of pandas objects using itertuples, iterrows, or iteritems. Consider the following DataFrame, named df:

Figure 1.41: DataFrame df

The following methods can be performed on this DataFrame:

itertuples: This method iterates over the rows of the DataFrame in the form of named tuples. By setting the index parameter to False, we can remove the index as the first element of the tuple, and we can set a custom name for the yielded named tuples through the name parameter. The following screenshot illustrates this over the DataFrame shown in the preceding figure:

Figure 1.42: Testing itertuples

iterrows: This method iterates over the rows of the DataFrame in tuples of the type (label, content), where label is the index of the row and content is a pandas Series containing every item in the row. The following screenshot illustrates this:

Figure 1.43: Testing iterrows

iteritems: This method iterates over the columns of the DataFrame in tuples of the type (label, content), where label is the name of the column and content is the content in the column in the form of a pandas Series. In pandas 2.0 and later, this method is available only under the name items. The following screenshot shows how this is performed:

Figure 1.44: Checking out iteritems
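As the screenshots are not reproduced here, the following sketch runs the three kinds of iteration on a small made-up DataFrame. It uses items() for the column iteration because that name works in both older and newer pandas releases; the tuple name Viewer is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai"], "viewers": [250, 400]})

# Rows as named tuples; index=False leaves the index out of each tuple.
row_tuples = list(df.itertuples(index=False, name="Viewer"))
print(row_tuples[0].city, row_tuples[0].viewers)   # Delhi 250

# Rows as (label, Series) pairs.
row_pairs = [(label, content["city"]) for label, content in df.iterrows()]
print(row_pairs)                                   # [(0, 'Delhi'), (1, 'Mumbai')]

# Columns as (name, Series) pairs.
column_pairs = [(name, column.tolist()) for name, column in df.items()]
print(column_pairs)
```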

To apply built-in or custom functions to pandas, we can make use of the map and apply functions. We can pass any built-in, NumPy, or custom functions as parameters to these functions, and they will be applied to all elements in the column:

map: This returns an object of the same kind as the one that was passed to it. A dictionary can also be passed as input to it, as shown here:

Figure 1.45: Using the map function

apply: This applies the function to the object passed and returns a Series or a DataFrame, depending on what the function returns. It can easily take multiple columns as input. It also accepts the axis parameter, depending on how the function is to be applied, as shown:

Figure 1.46: Using the apply function
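A short sketch of both functions on made-up data follows. It shows map with a function and with a dictionary, and apply with each value of the axis parameter; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"grade": ["a", "b", "a"], "x": [1, 2, 3], "y": [10, 20, 30]})

# map on a Series, first with a function ...
upper = df["grade"].map(str.upper)
# ... and then with a dictionary that translates each value.
points = df["grade"].map({"a": 4, "b": 3})

# apply with axis=0 calls the function once per column.
col_max = df[["x", "y"]].apply(lambda col: col.max(), axis=0)
# apply with axis=1 calls it once per row, so it can combine several columns.
row_total = df[["x", "y"]].apply(lambda row: row["x"] + row["y"], axis=1)

print(upper.tolist())      # ['A', 'B', 'A']
print(points.tolist())     # [4, 3, 4]
print(col_max.tolist())    # [3, 30]
print(row_total.tolist())  # [11, 22, 33]
```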

Other than working on just DataFrames and Series, functions can also be applied to pandas GroupBy objects. Let's see how that works.

Grouping Data

Suppose you want to apply a function differently on some rows of a DataFrame, depending on the values in a particular column in that row. You can slice the DataFrame on the key(s) you want to aggregate on and then apply your function to that group, store the values, and move on to the next group.

pandas provides a much better way to do this, using the groupby function, where you can pass keys for groups as a parameter. The output of this function is a DataFrameGroupBy object that holds groups containing values of all the rows in that group. We can select the column we would like to apply a function to, and pandas will automatically aggregate the outputs on the level of the different values of its keys and return the final result with the function applied to each group.

For example, the following will collect the rows that have the same number of aged_viewers together, take their values in the expectedclicks column, and add them together:

Figure 1.47: Using the groupby function on a Series

Instead, if we were to pass [['series']] to the GroupBy object, we would have gotten a DataFrame back, as shown:

Figure 1.48: Using the groupby function on a DataFrame
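The figures are not reproduced here, so the sketch below shows the same two calls on made-up numbers. The column name expected_clicks is a stand-in chosen for this example. Selecting the column with single brackets yields a Series, and double brackets yield a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "aged_viewers": [2000000, 2800000, 2000000],
        "expected_clicks": [100.0, 250.0, 50.0],
    }
)

# Single brackets: the aggregated result is a Series indexed by the group key.
as_series = df.groupby("aged_viewers")["expected_clicks"].sum()

# Double brackets: the aggregated result is a one-column DataFrame.
as_frame = df.groupby("aged_viewers")[["expected_clicks"]].sum()

print(as_series.loc[2000000])    # 150.0
print(type(as_frame).__name__)   # DataFrame
```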

Exercise 5: Applying Data Transformations

The aim of this exercise is to get you used to performing regular and groupby operations on DataFrames and applying functions to them. You will use the user_info.json file in the Lesson02 folder on GitHub, which contains information about six customers.

1. Import the pandas module that we'll be using:

import pandas as pd

2. Read the user_info.json file into a pandas DataFrame, user_info, and look at the first few rows of the DataFrame:

user_info = pd.read_json('user_info.json')

user_info.head()

Figure 1.49: Output of the head function on user_info

3. Now, look at the attributes and the data inside them:

user_info.info()

Figure 1.50: Output of the info function on user_info

4. Let's make use of the map function to see how many friends each user in the data has. Use the following code:

user_info['friends'].map(lambda x: len(x))

Figure 1.51: Using the map function on user_info

5. We use the apply function to get a grip on the data within each column individually and apply regular Python functions to it. Let's convert all the values in the tags column of the DataFrame to capital letters using the upper function for strings in Python, as follows:

user_info['tags'].apply(lambda x: [t.upper() for t in x])

Figure 1.52: Converting values in tags

6. Use the groupby function to get the different values taken by a certain attribute. We can use the count function on each such mini pandas DataFrame generated. We'll do this first for the eye color:

user_info.groupby('eyeColor')['_id'].count()

Your output should now look as follows:

Figure 1.53: Checking distribution of eyeColor

7. Similarly, let's look at the distribution of another variable, favoriteFruit, in the data too:

user_info.groupby('favoriteFruit')['_id'].count()

Figure 1.54: Seeing the distribution in user_info

We are now sufficiently prepared to handle any sort of problem we might face when trying to structure even unstructured data into a structured format. Let's do that in the activity here.

Activity 1: Addressing Data Spilling

We will now solve the problem that we encountered in Exercise 2. We start by loading sales.xlsx, which contains some historical sales data, recorded in MS Excel, about different customer purchases in stores in the past few years. Your current team is only interested in the following product types: Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags. You need to read the files into pandas DataFrames and prepare the output so that it can be added into your analytics pipeline. Follow the steps given here:

1. Open the Python console and import pandas and the copy module.

2. Load the data from sales.xlsx into a separate DataFrame, named sales, and look at the first few rows of the generated DataFrame. You will get the following output:

Figure 1.55: Output of the head function on sales.xlsx

3. Analyze the data type of the fields and get hold of prepared values.

4. Get the column names right. In this case, every new column starts with a capital letter.

5. Look at the first column. If the value in the column matches the expected values, just correct the column name and move on to the next column.

6. Take the first column with values leaking into other columns and look at the distribution of its values. Add the values from the next column and go on to as many columns as required to get to the right values for that column.

7. Slice out the portion of the DataFrame that has the largest number of columns required to cover the value for the right column and structure the values for that column correctly in a new column with the right attribute name.

8. You can now drop all the columns from the slice that are no longer required once the field has the right values and move on to the next column.

9. Repeat steps 4–7 multiple times, until you have gotten a slice of the DataFrame completely structured with all the values correct and pointing to the intended column. Save this DataFrame slice. Your final structured DataFrame should appear as follows:

Figure 1.56: First few rows of the structured DataFrame

Note

The solution for this activity can be found on page 316.

Summary

Data processing and wrangling is the initial, and a very important, part of the data science pipeline. It is generally helpful if people preparing data have some domain knowledge about the data, since that will help them stop at the right processing point and use their intuition to build the pipeline better and more quickly. Data processing also requires coming up with innovative solutions and hacks.

In this chapter, you learned how to structure large datasets by arranging them in a tabular form. Then, we got this tabular data into pandas and distributed it between the right columns. Once we were sure that our data was arranged correctly, we combined it with other data sources. We also got rid of duplicates and needless columns, and finally, dealt with missing data. After performing these steps, our data was made ready for analysis and could be put into a data science pipeline directly.

In the next chapter, we will deepen our understanding of pandas and talk about reshaping and analyzing DataFrames for better visualizations and summarizing data. We will also see how to directly solve generic business-critical problems efficiently.
