Data Science for Marketing Analytics
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Authors: Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar
Technical Reviewer: Dipankar Nath
Managing Editor: Neha Nair
Acquisitions Editor: Kunal Sawant
Production Editor: Samita Warang
Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas
First Published: March 2019
Production Reference: 1290319
ISBN: 978-1-78995-941-3
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK
Table of Contents
Preface
Chapter 1: Data Preparation and Cleaning
Introduction
Data Models and Structured Data
pandas
Importing and Exporting Data With pandas DataFrames
Viewing and Inspecting Data in DataFrames
Exercise 1: Importing JSON Files into pandas
Exercise 2: Identifying Semi-Structured and Unstructured Data
Structure of a pandas Series
Data Manipulation
Selecting and Filtering in pandas
Creating Test DataFrames in Python
Adding and Removing Attributes and Observations
Exercise 3: Creating and Modifying Test DataFrames
Combining Data
Handling Missing Data
Exercise 4: Combining DataFrames and Handling Missing Values
Applying Functions and Operations on DataFrames
Grouping Data
Exercise 5: Applying Data Transformations
Activity 1: Addressing Data Spilling
Summary
Chapter 2: Data Exploration and Visualization
Introduction
Identifying the Right Attributes
Exercise 6: Exploring the Attributes in Sales Data
Generating Targeted Insights
Selecting and Renaming Attributes
Transforming Values
Exercise 7: Targeting Insights for Specific Use Cases
Reshaping the Data
Exercise 8: Understanding Stacking and Unstacking
Pivot Tables
Visualizing Data
Exercise 9: Visualizing Data With pandas
Visualization through Seaborn
Visualization with Matplotlib
Activity 2: Analyzing Advertisements
Summary
Chapter 3: Unsupervised Learning: Customer Segmentation
Introduction
Customer Segmentation Methods
Traditional Segmentation Methods
Unsupervised Learning (Clustering) for Customer Segmentation
Similarity and Data Standardization
Determining Similarity
Standardizing Data
Exercise 10: Standardizing Age and Income Data of Customers
Calculating Distance
Exercise 11: Calculating Distance Between Three Customers
Activity 3: Loading, Standardizing, and Calculating Distance with a Dataset
k-means Clustering
Understanding k-means Clustering
Exercise 12: k-means Clustering on Income/Age Data
High-Dimensional Data
Exercise 13: Dealing with High-Dimensional Data
Activity 4: Using k-means Clustering on Customer Behavior Data
Summary
Chapter 4: Choosing the Best Segmentation Approach
Introduction
Choosing the Number of Clusters
Simple Visual Inspection
Exercise 14: Choosing the Number of Clusters Based on Visual Inspection
The Elbow Method with Sum of Squared Errors
Exercise 15: Determining the Number of Clusters Using the Elbow Method
Activity 5: Determining Clusters for High-End Clothing Customer Data Using the Elbow Method with the Sum of Squared Errors
Different Methods of Clustering
Mean-Shift Clustering
Exercise 16: Performing Mean-Shift Clustering to Cluster Data
k-modes and k-prototypes Clustering
Exercise 17: Clustering Data Using the k-prototypes Method
Activity 6: Using Different Clustering Techniques on Customer Behavior Data
Evaluating Clustering
Silhouette Score
Exercise 18: Calculating Silhouette Score to Pick the Best k for k-means and Comparing to the Mean-Shift Algorithm
Train and Test Split
Exercise 19: Using a Train-Test Split to Evaluate Clustering Performance
Activity 7: Evaluating Clustering on Customer Behavior Data
Summary
Chapter 5: Predicting Customer Revenue Using Linear Regression
Introduction
Understanding Regression
Feature Engineering for Regression
Feature Creation
Data Cleaning
Exercise 20: Creating Features for Transaction Data
Assessing Features Using Visualizations and Correlations
Exercise 21: Examining Relationships between Predictors and Outcome
Activity 8: Examining Relationships Between Storefront Locations and Features about Their Area
Performing and Interpreting Linear Regression
Exercise 22: Building a Linear Model Predicting Customer Spend
Activity 9: Building a Regression Model to Predict Storefront Location Revenue
Summary
Chapter 6: Other Regression Techniques and Tools for Evaluation
Introduction
Evaluating the Accuracy of a Regression Model
Residuals and Errors
Mean Absolute Error
Root Mean Squared Error
Exercise 23: Evaluating Regression Models of Location Revenue Using MAE and RMSE
Activity 10: Testing Which Variables are Important for Predicting Responses to a Marketing Offer
Using Regularization for Feature Selection
Exercise 24: Using Lasso Regression for Feature Selection
Activity 11: Using Lasso Regression to Choose Features for Predicting Customer Spend
Tree-Based Regression Models
Random Forests
Exercise 25: Using Tree-Based Regression Models to Capture Non-Linear Trends
Activity 12: Building the Best Regression Model for Customer Spend Based on Demographic Data
Summary
Chapter 7: Supervised Learning: Predicting Customer Churn
Introduction
Classification Problems
Understanding Logistic Regression
Revisiting Linear Regression
Logistic Regression
Exercise 26: Plotting the Sigmoid Function
Cost Function for Logistic Regression
Assumptions of Logistic Regression
Exercise 27: Loading, Splitting, and Applying Linear and Logistic Regression to Data
Creating a Data Science Pipeline
Obtaining the Data
Exercise 28: Obtaining the Data
Scrubbing the Data
Exercise 29: Imputing Missing Values
Exercise 30: Renaming Columns and Changing the Data Type
Exploring the Data
Statistical Overview
Correlation
Exercise 31: Obtaining the Statistical Overview and Correlation Plot
Visualizing the Data
Exercise 32: Performing Exploratory Data Analysis (EDA)
Activity 13: Performing OSE of OSEMN
Modeling the Data
Feature Selection
Exercise 33: Performing Feature Selection
Model Building
Exercise 34: Building a Logistic Regression Model
Interpreting the Data
Activity 14: Performing MN of OSEMN
Summary
Chapter 8: Fine-Tuning Classification Algorithms
Introduction
Support Vector Machines
Intuition Behind Maximum Margin
Linearly Inseparable Cases
Linearly Inseparable Cases Using Kernel
Exercise 35: Training an SVM Algorithm Over a Dataset
Decision Trees
Exercise 36: Implementing a Decision Tree Algorithm Over a Dataset
Important Terminology of Decision Trees
Decision Tree Algorithm Formulation
Random Forest
Exercise 37: Implementing a Random Forest Model Over a Dataset
Activity 15: Implementing Different Classification Algorithms
Preprocessing Data for Machine Learning Models
Standardization
Exercise 38: Standardizing Data
Scaling
Exercise 39: Scaling Data After Feature Selection
Normalization
Exercise 40: Performing Normalization on Data
Model Evaluation
Exercise 41: Implementing Stratified k-fold
Fine-Tuning of the Model
Exercise 42: Fine-Tuning a Model
Activity 16: Tuning and Optimizing the Model
Performance Metrics
Precision
Recall
F1 Score
Exercise 43: Evaluating the Performance Metrics for a Model
ROC Curve
Exercise 44: Plotting the ROC Curve
Activity 17: Comparison of the Models
Summary
Chapter 9: Modeling Customer Choice
Introduction
Understanding Multiclass Classification
Classifiers in Multiclass Classification
Exercise 45: Implementing a Multiclass Classification Algorithm on a Dataset
Performance Metrics
Exercise 46: Evaluating Performance Using Multiclass Performance Metrics
Activity 18: Performing Multiclass Classification and Evaluating Performance
Class Imbalanced Data
Exercise 47: Performing Classification on Imbalanced Data
Dealing with Class-Imbalanced Data
Exercise 48: Visualizing Sampling Techniques
Exercise 49: Fitting a Random Forest Classifier Using SMOTE and Building the Confusion Matrix
Activity 19: Dealing with Imbalanced Data
Summary
Appendix
Preface
About
This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software requirements required to complete all of the included activities and exercises.
About the Book
Data Science for Marketing Analytics covers every stage of data analytics, from working with a raw dataset to segmenting a population and modeling different parts of it based on the segments.
The book starts by teaching you how to use Python libraries, such as pandas and Matplotlib, to read data from Python, manipulate it, and create plots using both categorical and continuous variables. Then, you'll learn how to segment a population into groups and use different clustering techniques to evaluate customer segmentation. As you make your way through the chapters, you'll explore ways to evaluate and select the best segmentation approach, and go on to create a linear regression model on customer value data to predict lifetime value. In the concluding chapters, you'll gain an understanding of regression techniques and tools for evaluating regression models, and explore ways to predict customer choice using classification algorithms. Finally, you'll apply these techniques to create a churn model for modeling customer product choices.
By the end of this book, you will be able to build your own marketing reporting and interactive dashboard solutions.
About the Authors
Tommy Blanchard earned his PhD from the University of Rochester and did his postdoctoral training at Harvard. Now, he leads the data science team at Fresenius Medical Care North America. His team performs advanced analytics and creates predictive models to solve a wide variety of problems across the company.
Debasish Behera works as a data scientist for a large Japanese corporate bank, where he applies machine learning/AI to solve complex problems. He has worked on multiple use cases involving AML, predictive analytics, customer segmentation, chat bots, and natural language processing. He currently lives in Singapore and holds a Master's in Business Analytics (MITB) from the Singapore Management University.
Pranshu Bhatnagar works as a data scientist in the telematics, insurance, and mobile software space. He has previously worked as a quantitative analyst in the FinTech industry and often writes about algorithms, time series analysis in Python, and similar topics. He graduated with honors from the Chennai Mathematical Institute with a degree in Mathematics and Computer Science and has completed certification courses in Machine Learning and Artificial Intelligence from the International Institute of Information Technology, Hyderabad. He is based in Bangalore, India.
Objectives
Analyze and visualize data in Python using pandas and Matplotlib
Study clustering techniques, such as hierarchical and k-means clustering
Create customer segments based on manipulated data
Predict customer lifetime value using linear regression
Use classification algorithms to understand customer choice
Optimize classification algorithms to extract maximal information
Audience
Data Science for Marketing Analytics is designed for developers and marketing analysts looking to use new, more sophisticated tools in their marketing analytics efforts. It'll help if you have prior experience of coding in Python and knowledge of high school level mathematics. Some experience with databases, Excel, statistics, or Tableau is useful but not necessary.
Approach
Data Science for Marketing Analytics takes a hands-on approach to the practical aspects of using Python data analytics libraries to ease marketing analytics efforts. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.
Minimum Hardware Requirements
For an optimal student experience, we recommend the following hardware configuration:
Processor: Dual Core or better
Memory: 4 GB RAM
Storage: 10 GB available space
Software Requirements
You'll also need the following software installed in advance:
Any of the following operating systems: Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit, or Windows 10 32/64-bit, Ubuntu 14.04 or later, or macOS Sierra or later.
Browser: Google Chrome or Mozilla Firefox
Conda
Python 3.x
Conventions
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Import the cluster module from the sklearn package."
A block of code is set as follows:
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The Year column appears to have matched to the right values, but the line column does not seem to make much sense."
Installation and Setup
We recommend installing Python using the Anaconda distribution, available here: https://www.anaconda.com/distribution/.
It contains most of the modules that will be used. Additional Python modules can be installed using the methods here: https://docs.python.org/3/installing/index.html. There is only one module that is used that is not part of the standard Anaconda distribution; use one of the methods in the linked page to install it:
kmodes
If you do not use the Anaconda distribution, make sure you have the following modules installed:
jupyter
pandas
scikit-learn (imported as sklearn)
numpy
scipy
seaborn
statsmodels
Installing the Code Bundle
Copy the code bundle for the class to the C:/Code folder.
Additional Resources
The code bundle for this book is also hosted on GitHub at: https://github.com/TrainingByPackt/Data-Science-for-Marketing-Analytics.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Chapter 1
Data Preparation and Cleaning
Learning Objectives
By the end of this chapter, you will be able to:
Create pandas DataFrames in Python
Read and write data into different file formats
Slice, aggregate, filter, and apply functions (built-in and custom) to DataFrames
Join DataFrames, handle missing values, and combine different data sources
This chapter covers basic data preparation and manipulation techniques in Python, which is the foundation of data science.
Introduction
The way we make decisions in today's world is changing. A very large proportion of our decisions, from choosing which movie to watch, which song to listen to, which item to buy, or which restaurant to visit, all rely upon recommendations and ratings generated by analytics. As decision makers continue to use more of such analytics to make decisions, they themselves become data points for further improvements, and as their own custom needs for decision making continue to be met, they also keep using these analytical models frequently.
The change in consumer behavior has also influenced the way companies develop strategies to target consumers. With the increased digitization of data, greater availability of data sources, and lower storage and processing costs, firms can now crunch large volumes of increasingly granular data with the help of various data science techniques and leverage it to create complex models, perform sophisticated tasks, and derive valuable consumer insights with higher accuracy. It is because of this dramatic increase in data and computing power, and the advancement in techniques to use this data through data science algorithms, that the McKinsey Global Institute calls our age the Age of Analytics.
Several industry leaders are already using data science to make better decisions and to improve their marketing analytics. Google and Amazon have been making targeted recommendations catering to the preferences of their users from their very early years. Predictive data science algorithms tasked with generating leads from marketing campaigns at Dell reportedly converted 50% of the final leads, whereas those generated through traditional methods had a conversion rate of only 17%. Price surges on Uber for non-pass holders during rush hour also reportedly had massive positive effects on the company's profits. In fact, it was recently discovered that price management initiatives based on an evaluation of customer lifetime value tended to increase business margins by 2%–7% over a 12-month period and resulted in a 200%–350% ROI in general.
Although using data science principles in marketing analytics is a proven cost-effective, efficient way for a lot of companies to observe a customer's journey and provide a more customized experience, multiple reports suggest that it is not being used to its full potential. There is a wide gap between the possible and actual usage of these techniques by firms. This book aims to bridge that gap, and covers an array of useful techniques involving everything data science can do in terms of marketing strategies and decision-making in marketing. By the end of the book, you should be able to successfully create and manage an end-to-end marketing analytics pipeline in Python, segment customers based on the data provided, predict their lifetime value, and model their decision-making behavior on your own using data science techniques.
This chapter introduces you to cleaning and preparing data, the first step in any data-centric pipeline. Raw data coming from external sources cannot generally be used directly; it needs to be structured, filtered, combined, analyzed, and observed before it can be used for any further analyses. In this chapter, we will explore how to get the right data in the right attributes, manipulate rows and columns, and apply transformations to data. This is essential because, otherwise, we will be passing incorrect data to the pipeline, thereby making it a classic example of garbage in, garbage out.
Data Models and Structured Data
When we build an analytics pipeline, the first thing that we need to do is to build a data model. A data model is an overview of the data sources that we will be using, their relationships with other data sources, where exactly the data from a specific source is going to enter the pipeline, and in what form (such as an Excel file, a database, or a JSON from an internet source). The data model for the pipeline evolves over time as data sources and processes change. A data model can contain data of the following three types:
Structured data: This is also known as completely structured or well-structured data. This is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. Data can be queried exactly through SQL queries, for example, data in relational databases, MySQL, Amazon Redshift, and so on.
Semi-structured data: This refers to data that may be of variable lengths and that may contain different data types (such as numerical or categorical) in the same column. Such data may be arranged in a nested or hierarchical tabular structure, but it still follows a fixed schema. There are no duplicate columns (attributes), but there may be duplicate rows (observations). Also, each row might not contain values for every attribute, that is, there may be missing values. Semi-structured data can be stored accurately in NoSQL databases, Apache Parquet files, JSON files, and so on.
Unstructured data: Data that is unstructured may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts. Unstructured data can be stored as text files, CSV files, Excel files, images, audio clips, and so on.
Marketing data, traditionally, comprises data of all three types. Initially, most data points originated from different (possibly manual) data sources, so the values for a field could be of different lengths, the value for one field would not match that of other fields because of different field names, some rows containing data from even the same sources could also have missing values for some of the fields, and so on. But now, because of digitization, structured and semi-structured data is also available and is increasingly being used to perform analytics. The following figure illustrates the data model of traditional marketing analytics comprising all kinds of data: structured data such as databases (top), semi-structured data such as JSONs (middle), and unstructured data such as Excel files (bottom):
Figure 1.1: Data model of traditional marketing analytics
A data model with all these different kinds of data is prone to errors and is very risky to use. If we somehow get a garbage value into one of the attributes, our entire analysis will go awry. Most of the time, the data we need is of a certain kind, and if we don't get that type of data, we might run into a bug or problem that would need to be investigated. Therefore, if we can enforce some checks to ensure that the data being passed to our model is almost always of the same kind, we can easily improve the quality of data from unstructured to at least semi-structured.
This is where programming languages such as Python come into play. Python is an all-purpose general programming language that not only makes writing structure-enforcing scripts easy, but also integrates with almost every platform and automates data production, analysis, and analytics into a more reliable and predictable pipeline. Apart from understanding patterns and giving at least a basic structure to data, Python forces intelligent pipelines to accept the right value for the right attribute. The majority of analytics pipelines are exactly of this kind. The following figure illustrates how most marketing analytics today structure different kinds of data by passing it through scripts to make it at least semi-structured:
Figure 1.2: Data model of most marketing analytics that use Python
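As a rough illustration of what a structure-enforcing script might look like, the following sketch checks each incoming record against an expected set of fields and types. The field names, the `EXPECTED_FIELDS` schema, and the `enforce_structure` helper are all hypothetical and are not taken from the book's code bundle.

```python
# Hypothetical schema: field name -> type the pipeline expects.
EXPECTED_FIELDS = {"customer_id": int, "age": int, "country": str}


def enforce_structure(record):
    """Return a cleaned copy of record, or None if it cannot be coerced."""
    cleaned = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            return None  # a required attribute is missing
        try:
            cleaned[field] = expected_type(record[field])
        except (TypeError, ValueError):
            return None  # the value does not fit the expected type
    return cleaned


# Made-up raw records: one valid, one with a bad value, one with a missing field.
raw_records = [
    {"customer_id": "101", "age": "34", "country": "SG"},
    {"customer_id": "102", "age": "unknown", "country": "IN"},
    {"customer_id": "103", "country": "UK"},
]

# Keep only the records that passed every check.
clean_records = [r for r in map(enforce_structure, raw_records) if r is not None]
print(clean_records)
```

Only the first record survives here. In a real pipeline you would more likely log or repair rejected records than drop them silently.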
By making use of such structure-enforcing scripts, we will have a pipeline of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format to perform analytics. If we can completely structure our data (that is, arrange it in flat tables, with the right value pointing to the right attribute with no nesting or hierarchy), it will be easy for us to see how every data point individually compares to other points being considered in the common fields, and would also make the pipeline scalable. We can easily get a feel of the data (that is, see in what range most values lie, identify the clear outliers, and so on) by simply scrolling through the data.
While there are a lot of tools that can be used to convert data from an unstructured/semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool that is most commonly used for data science, can be integrated with practically any framework, has rich functionalities, minimal costs, and is easy-to-use for our use case, is pandas. The following figure illustrates how a data model structures different kinds of data from being possibly unstructured to semi-structured (using Python), to completely structured (using pandas):
Figure 1.3: Data model to structure the different kinds of data
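To make the last step more concrete, here is a minimal sketch of turning nested (semi-structured) records into a flat table with `pd.json_normalize`. The records are made up for illustration, and this is only one of several ways pandas can flatten nested data.

```python
import pandas as pd

# Made-up semi-structured records: "address" is nested, and one zip is missing.
records = [
    {"id": 1, "name": "Ann", "address": {"city": "Pune", "zip": "411001"}},
    {"id": 2, "name": "Raj", "address": {"city": "Delhi"}},
]

# json_normalize expands nested keys into dotted column names.
flat = pd.json_normalize(records)
print(flat)
```

Each nested key becomes its own column (for example, `address.city`), and the missing zip code shows up as a missing value rather than breaking the table.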
Note
For the purpose of this book, we will assume that you are more or less comfortable with NumPy.
pandas
pandas is a software library written in Python and is the basis for data manipulation and analysis in the language. Its name comes from "panel data," an econometrics term for datasets that include observations over multiple time periods for the same individuals.
pandas offers a collection of high-performance, easy-to-use, and intuitive data structures and analysis tools that are of great use to marketing analysts and data scientists alike. It has the following two primary object types:
DataFrame: This is the fundamental tabular relationship object that stores data in rows and columns (like a spreadsheet). To perform data analysis, functions and operations can be directly applied to DataFrames.
Series: This refers to a single column of the DataFrame. The value can be accessed through its index. Because each Series infers a single data type for its values, a DataFrame built from Series gets a consistent type per column, which goes a long way toward keeping it well-structured.
The following figure illustrates a pandas DataFrame with an automatic integer index (0, 1, 2, 3...):
Figure 1.4: A sample pandas DataFrame
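The two object types can be seen side by side in a small sketch. The column names and values below are invented for illustration.

```python
import pandas as pd

# A DataFrame built from a dictionary of columns; pandas adds the index 0, 1, 2.
df = pd.DataFrame({"views": [120, 340, 95], "likes": [12, 40, 7]})

# Selecting a single column returns a Series that shares the DataFrame's index.
likes = df["likes"]

print(type(likes).__name__)
print(list(df.index))
print(likes[1])  # value stored under index label 1
```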
Now that we understand what pandas objects are and how they can be used to automatically get structured data, let's take a look at some of the functions we can use to import and export data in pandas and see if the data we passed is ready to be used for further analyses.
Importing and Exporting Data With pandas DataFrames
Every team in a marketing group can have its own preferred data type for their specific use case. Those who have to deal with a lot more text than numbers might prefer using JSON or XML, while others may prefer CSV, XLS, or even Python objects. pandas has a lot of simple APIs (application program interfaces) that allow it to read a large variety of data directly into DataFrames. Some of the main ones are shown here:
Figure 1.5: Ways to import and export different types of data with pandas DataFrames
Note
Remember that a well-structured DataFrame does not have hierarchical or nested data. The read_xml(), read_json(), and read_html() functions (and others) cause the data to lose its hierarchical datatypes/nested structure and convert it into flattened objects such as lists and lists of lists. pandas, however, does support hierarchical data for data analysis. You can save and load such data by pickling from your session and maintaining the hierarchy in such cases. When working with data pipelines, it's advised to split nested data into separate streams to maintain the structure.
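As a small sketch of the pickling route mentioned in the note, the following saves a DataFrame with a hierarchical index (a MultiIndex) and loads it back with the hierarchy intact. The data, the index level names, and the temporary file location are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up data with a two-level (hierarchical) index.
index = pd.MultiIndex.from_tuples(
    [("2018", "Q1"), ("2018", "Q2"), ("2019", "Q1")],
    names=["year", "quarter"],
)
sales = pd.DataFrame({"revenue": [100, 150, 130]}, index=index)

# Write the object to a temporary pickle file and read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales.pkl")
    sales.to_pickle(path)
    restored = pd.read_pickle(path)

print(list(restored.index.names))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.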
When loading data, pandas provides us with additional parameters that we can pass to read functions, so that we can load the data differently. Some additional parameters that are used commonly when importing data into pandas are given here:
skiprows = k: This skips the first k rows.
nrows = k: This parses only the first k rows.
names = [col1, col2...]: This lists the column names to be used in the parsed DataFrame.
header = k: This uses the row at position k (counting from 0) as the header that supplies the column names of the DataFrame. k can also be None.
index_col = col: This sets col as the index of the DataFrame being used. This can also be a list of column names (used to create a MultiIndex) or None.
usecols = [l1, l2...]: This provides either integer positional indices in the document columns or strings that correspond to column names in the DataFrame to be read. For example, [0, 1, 2] or ['foo', 'bar', 'baz'].
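The parameters above can be tried out without any files on disk by reading from an in-memory string. In this sketch, the CSV text and its column names are invented; only the `read_csv` parameters come from the list above.

```python
import io

import pandas as pd

# Made-up CSV text with a stray comment line above the real header.
csv_text = """exported by campaign tool
post_id,views,likes,shares
p1,100,10,1
p2,200,20,2
p3,300,30,3
p4,400,40,4
"""

# Skip the comment line, use the next row as the header, keep three columns,
# index the result by post_id, and parse only the first three data rows.
df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,
    header=0,
    index_col="post_id",
    usecols=["post_id", "views", "likes"],
    nrows=3,
)
print(df)

# Alternatively, skip both the comment and the header, and supply our own names.
renamed = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=2,
    header=None,
    names=["post", "n_views", "n_likes", "n_shares"],
)
print(renamed)
```

Note that when skiprows is used, header counts rows from the first line that was not skipped.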
Note
There are similar specific parameters for almost every in-built function in pandas. You can find details about them with the documentation for pandas available at the following link: https://pandas.pydata.org/pandas-docs/stable/.
Viewing and Inspecting Data in DataFrames
Once you've read the DataFrame using the API, as explained earlier, you'll notice that, unless there is something grossly wrong with the data, the API generally never fails, and we always get a DataFrame object after the call. However, we need to inspect the data ourselves to check whether the right attribute has received the right data, for which we can use several in-built functions that pandas provides. Assume that we have stored the DataFrame in a variable called df. Then:
df.head(n) will return the first n rows of the DataFrame. If no n is passed, by default, the function considers n to be 5.
df.tail(n) will return the last n rows of the DataFrame. If no n is passed, by default, the function considers n to be 5.
df.shape will return a tuple of the type (number of rows, number of columns).
df.dtypes will return the type of data in each column of the pandas DataFrame (such as int64, float64, object, and so on).
df.info() will summarize the DataFrame and print its size, type of values, and the count of non-null values.
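The inspection calls listed above can be exercised on a small throwaway DataFrame. The column names and values here are made up for the sketch.

```python
import pandas as pd

# A made-up DataFrame with ten rows and three columns of different kinds.
df = pd.DataFrame(
    {
        "views": range(100, 110),
        "rate": [0.1 * i for i in range(10)],
        "channel": ["web"] * 10,
    }
)

print(df.head())    # first 5 rows, since no n is passed
print(df.tail(3))   # last 3 rows
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # one data type per column
df.info()           # prints a summary, including non-null counts
```

Note that shape and dtypes are attributes, so they are written without parentheses, whereas head(), tail(), and info() are methods.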
Exercise 1: Importing JSON Files into pandas
For this exercise, you need to use the user_info.json file provided to you in the Lesson01 folder. The file contains some anonymous personal user information collected from six customers through a web-based form in JSON format. You need to open a Jupyter Notebook, import the JSON file into the console as a pandas DataFrame, and see whether it has loaded correctly, with the right values being passed to the right attribute.
Note
All the exercises and activities in this chapter can be done in both the Jupyter Notebook and Python shell. While we can do them in the shell for now, it is highly recommended to use the Jupyter Notebook. To learn how to install Jupyter and set up the Jupyter Notebook, check https://jupyter.readthedocs.io/en/latest/install.html. It will be assumed that you are using a Jupyter Notebook from the next chapter onward.
1. Open a Jupyter Notebook to implement this exercise. Once you are in the console, import the pandas library using the import command, as follows:
import pandas as pd
2. Read the user_info.json JSON file into the user_info DataFrame:
user_info = pd.read_json("user_info.json")
3. Check the first few values in the DataFrame using the head command:
user_info.head()
You should see the following output:
Figure 1.6: Viewing the first few rows of user_info.json
4. As we can see, the data makes sense superficially. Let's see if the data types match too. Type in the following command:
user_info.info()
You should get the following output:
Figure 1.7: Information about the data in user_info
From the preceding figure, notice that the isActive column is Boolean, the age and index columns are integers, whereas the latitude and longitude columns are floats. The rest of the elements are Python objects, most likely to be strings. Looking at the names, they match our intuition. So, the data types seem to match. Also, the number of observations seems to be the same for all fields, which implies that there has been no data loss.
Note
The 64 displayed with the type above is an indicator of precision and varies on different platforms.
5. Let's also see the number of rows and columns in the DataFrame using the shape attribute of the DataFrame:
user_info.shape
This will give you (6, 22) as the output, indicating that the DataFrame created by the JSON has 6 rows and 22 columns.
Congratulations! You have loaded the data correctly, with the right attributes corresponding to the right columns and with no missing values. Since the data was already structured, it is now ready to be put into the pipeline to be used for further analysis.
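If you do not have the code bundle at hand, the same checks can be rehearsed on a small in-memory stand-in. The JSON text below is invented and only borrows a few column names mentioned in the exercise; it is not the contents of user_info.json.

```python
import io

import pandas as pd

# Made-up stand-in for a JSON export: a list of two flat records.
json_text = (
    '[{"index": 0, "isActive": true, "age": 31, "latitude": 12.5, "name": "A"},'
    ' {"index": 1, "isActive": false, "age": 45, "latitude": -3.25, "name": "B"}]'
)

# read_json accepts a file path or a file-like object such as StringIO.
users = pd.read_json(io.StringIO(json_text))

print(users.head())
users.info()
print(users.shape)

# A quick programmatic check that no value is missing anywhere.
print(users.notna().all().all())
```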
Exercise 2: Identifying Semi-Structured and Unstructured Data
In this exercise, you will be using the data.csv and sales.xlsx files provided to you in the Lesson01 folder. The data.csv file contains the views and likes of 100 different posts on Facebook in a marketing campaign, and sales.xlsx contains some historical sales data recorded in MS Excel about different customer purchases in stores in the past few years. We want to read the files into pandas DataFrames and check whether the output is ready to be added into the analytics pipeline. Let's first work with the data.csv file:
1. Import pandas into the console, as follows:
import pandas as pd
2. Use the read_csv method to read the data.csv CSV file into a campaign_data DataFrame:
campaign_data = pd.read_csv("data.csv")
3. Look at the current state of the DataFrame using the head function:
campaign_data.head()
Your output should look as follows:
Figure 1.8: Viewing raw campaign_data
From the preceding output, we can observe that the first column has an issue; we want to have "views" and "likes" as the column names and for the DataFrame to have numeric values.
4. We will read the data into campaign_data again, but this time making sure that we take the column names from the correct row using the header parameter. Passing header=1 tells pandas to use the row at position 1 as the header (rows are counted from 0, so this is the second line of the file) and to ignore the line above it:
campaign_data = pd.read_csv("data.csv", header=1)
5. Let's now view campaign_data again, and see whether the attributes are okay now:
campaign_data.head()
Your DataFrame should now appear as follows:
Figure 1.9: campaign_data after being read with the header parameter
6. The values seem to make sense, with the views being far more than the likes, when we look at the first few rows, but because of some misalignment or missing values, the last few rows might be different. So, let's have a look at them:
campaign_data.tail()
You will get the following output:
Figure 1.10: The last few rows of campaign_data
7. There doesn't seem to be any misalignment of data or missing values at the end. However, although we have seen the last few rows, we still can't be sure that all values in the middle (hidden) part of the DataFrame are okay too. We can check the data types of the DataFrame to be sure:
campaign_data.info()
You should get the following output:
Figure 1.11: info() of campaign_data
8. We also need to ensure that we have not lost some observations because of our cleaning. We use the shape attribute for that:
campaign_data.shape
You will get an output of (100, 2), indicating that we still have 100 observations with 2 columns. The dataset is now completely structured and can easily be a part of any further analysis or pipeline.
9. Let's now analyze the sales.xlsx file. Use the read_excel function to read the file in a DataFrame called sales:
sales = pd.read_excel("sales.xlsx")
10. Look at the first few rows of the sales DataFrame:
sales.head()
Your output should look as follows:
Figure 1.12: First few rows of sales.xlsx
From the preceding figure, the Year column appears to have matched to the right values, but the line column does not seem to make much sense. Columns such as Product.1 and Product.2 imply that there are multiple columns with the same name! Even the values of the Order and method columns being Water and Bag, respectively, make us feel as though something is wrong.
11. Let's look at gathering some more information, such as null values and the data types of the columns, and see if we can make more sense of the data:
sales.info()
Your output will look as follows:
Figure 1.13: Output of sales.info()
As there are some columns with no non-null values, the column names seem to have broken up incorrectly. This is probably why the output of info showed a column such as revenue as having an arbitrary data type such as object (usually used to refer to columns containing strings). It makes sense if the actual column names start with a capital letter and the remaining columns are created as a result of data spilling from the preceding columns.
12. Let's try to read the file with just the new, correct column names and see whether we get anything. Use the following code:
sales = pd.read_excel("sales.xlsx", names=["Year", "Product line", "Product type", "Product", "Order method type", "Retailer Country", "Revenue", "Planned revenue", "Product cost", "Quantity", "Unit cost", "Unit price", "Gross Profit", "Unit sale price"])
You get the following output:
Figure 1.14: Attempting to structure sales.xlsx
Unfortunately, the issue is not just with the columns, but with the underlying values too. The value of one column is leaking into another and thus ruining the structure. Understandably, the code fails because of length mismatch. Therefore, we can conclude that the sales.xlsx data is very unstructured.
With the use of the API and what we know up till this point, we can't directly get this data to be structured. To understand how to approach structuring this kind of data, we need to dive deep into the internal structure of pandas objects and understand how data is actually stored in pandas, which we will do in the following sections. We will come back to preparing this data for further analysis in a later section.
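The effect of the header parameter used for data.csv earlier in this exercise can be reproduced on a tiny in-memory file. The CSV text below is a made-up stand-in with a stray title line above the real header; it is not the actual data.csv.

```python
import io

import pandas as pd

# Made-up CSV text: line 0 is a stray title, line 1 holds the real column names.
csv_text = "campaign export\nviews,likes\n1000,50\n2000,80\n1500,60\n"

# Default read: pandas treats the stray title line as the header.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.head())

# header=1: the row at position 1 supplies the column names instead.
fixed = pd.read_csv(io.StringIO(csv_text), header=1)
print(fixed.head())
print(fixed.dtypes)
print(fixed.shape)
```

With header=1, the columns are named views and likes and hold numeric values, which is the same kind of repair applied to campaign_data in the exercise.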
Structure of a pandas Series
Let's say you want to store some values from a data store in a data structure. It is not necessary for every element of the data to have values, so your structure should be able to handle that. It is also a very common scenario that two data sources disagree on how to identify a data point. So, instead of using default numerical indices (such as 0-100) or user-given names to access it, like in a dictionary, you would like to access every value by a name that comes from within the data source. This is achieved in pandas using a pandas Series.
A pandas Series is essentially an indexed NumPy array. To make a pandas Series, all you need to do is create an array and give it an index. If you create a Series without an index, it will create a default numeric index that starts from 0 and goes on for the length of the Series, as shown in the following figure:
Figure 1.15: Sample pandas Series
Note
As a Series is backed by a NumPy array, most functions that work on a NumPy array work the same way on a pandas Series too.
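As a minimal sketch of these ideas, the following snippet builds one Series with the default index and one with labels taken from the data itself. The customer IDs and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# No index given: pandas assigns the default numeric index 0, 1, 2
default_indexed = pd.Series(np.array([10, 20, 30]))

# Labels that come from the data source (hypothetical customer IDs)
labelled = pd.Series([10, 20, 30], index=["c101", "c102", "c103"])

# Access a value by its label rather than by its position
value = labelled["c102"]

# A NumPy function works on the values and keeps the index intact
roots = np.sqrt(labelled)
```

Here, value holds the entry stored under c102, and roots is still a Series with the same three labels.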
Once you've created a number of Series, you might want to access the values associated with some specific indices all at once to perform an operation. This is just aggregating the Series with a specific value of the index. It is here that pandas DataFrames come into the picture. A pandas DataFrame can be thought of as a dictionary with the column names as keys and different pandas Series as values, joined together by the index:
Figure 1.16: Series joined together by the same index create a pandas DataFrame
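The dictionary view can be sketched as follows. The two Series and their labels are hypothetical; the point is that pandas lines the values up by index label, not by position:

```python
import pandas as pd

revenue = pd.Series([100, 250], index=["c101", "c102"])
visits = pd.Series([3, 7, 1], index=["c101", "c102", "c103"])

# The dict keys become column names; rows are aligned on the shared index
customers = pd.DataFrame({"revenue": revenue, "visits": visits})

# c103 has no revenue entry, so pandas fills that cell with NaN
missing_revenue = customers.loc["c103", "revenue"]
```

The resulting DataFrame has three rows, one per index label seen in either Series.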
This way of storing data makes it very easy to perform the operations we need on the data we want. We can easily choose the Series we want to modify by picking a column and directly slicing off indices based on the value in that column. We can also group indices with similar values in one column together and see how the values change in other columns.
Other than this one-dimensional Series structure to access the DataFrame, pandas also has the concept of axes, where an operation can be applied to both rows (or indices) and columns. You can choose which one to apply it to by specifying the axis, 0 referring to rows and 1 referring to columns, thereby making it very easy to access the underlying headers and the values associated with them:
Figure 1.17: Understanding axis = 0 and axis = 1 in pandas
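A short sketch of the axis convention, using a small made-up DataFrame. Note that for a reduction such as sum, axis=0 works down the rows and therefore yields one result per column:

```python
import pandas as pd

scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: move along the rows, giving one result per column
per_column = scores.sum(axis=0)

# axis=1: move along the columns, giving one result per row
per_row = scores.sum(axis=1)

# drop follows the same convention: axis=1 means the label is a column name
without_b = scores.drop("b", axis=1)
```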
Data Manipulation
Now that we have deconstructed the structure of the pandas DataFrame down to its basics, the rest of the wrangling tasks, that is, creating new DataFrames, selecting or slicing a DataFrame into its parts, filtering DataFrames for some values, joining different DataFrames, and so on, will become very intuitive.
Selecting and Filtering in pandas
It is standard convention in spreadsheets to address a cell by (column name, row name). Since data is stored in pandas in a similar manner, this is also the way to address a cell in a pandas DataFrame: the column name acts as a key to give you the pandas Series, and the row name gives you the value on that index of the DataFrame.
But if you need to access more than a single cell, such as a subset of some rows and columns from the DataFrame, or change the order of display of some columns on the DataFrame, you can make use of the syntax listed in the following table:
Figure 1.18: A table listing the syntax used for different operations on a pandas DataFrame
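The table itself is not reproduced here, so the following is only a sketch of commonly used selection idioms and may not match the table entry for entry. The DataFrame and its labels are made up:

```python
import pandas as pd

df = pd.DataFrame(
    {"category": [1, 2, 3], "views": [120, 80, 200]},
    index=["r1", "r2", "r3"],
)

one_cell = df["views"]["r2"]               # column key first, then row label
same_cell = df.loc["r2", "views"]          # label-based: row first, then column
by_position = df.iloc[1, 1]                # position-based equivalent
subset = df.loc[["r1", "r3"], ["views"]]   # chosen rows and columns
reordered = df[["views", "category"]]      # change the column display order
popular = df[df["views"] > 100]            # keep rows matching a condition
```

The first three lines all address the same cell; loc and iloc are generally preferred when you also intend to assign to the selection.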
Creating Test DataFrames in Python
We frequently need to create test objects while building a data pipeline in pandas. Test objects give us a reference point to figure out what we have been able to do up till that point and make it easier to debug our scripts. Generally, test DataFrames are small in size, so that the output of every process is quick and easy to compute. There are two ways to create test DataFrames: by creating completely new DataFrames, or by duplicating or taking a slice of a previously existing DataFrame:
Creating new DataFrames: We typically use the DataFrame constructor to create a completely new DataFrame. It directly converts a Python object into a pandas DataFrame. The DataFrame constructor will, in general, work with any iterable collection of data (such as dict, list, and so on). We can also pass an empty collection or a singleton collection to it.
For example, we will get the same DataFrame through any of the following lines of code:
pd.DataFrame({'category': pd.Series([1, 2, 3])})
pd.DataFrame([1, 2, 3], columns=['category'])
pd.DataFrame.from_dict({'category': [1, 2, 3]})
The following figure shows the outputs received each time:
Figure 1.19: Output generated by all three ways to create a DataFrame
A DataFrame can also be built by passing any pandas objects to the DataFrame constructor. The following line of code gives the same output as the preceding figure:
pd.DataFrame(pd.Series([1, 2, 3]), columns=["category"])
Duplicating or slicing a previously existing DataFrame: The second way to create a test DataFrame is by copying a previously existing DataFrame. In Python, and therefore in pandas, assignment does not copy an object. When we say obj1 = obj2, both names refer to the same object in memory. So, if we change the object through obj2, the change is visible through obj1 too, and vice versa. This is tackled in the standard library with the deepcopy function in the copy module. The deepcopy function allows the user to recursively go through the objects being pointed to by the references and create entirely new objects.
So, when you want to copy a previously existing DataFrame and don't want the previous DataFrame to be affected by modifications in the current DataFrame, you need to use the deepcopy function (pandas also offers its own df1.copy() method, which makes a deep copy by default). You can also slice the previously existing DataFrame and pass it to the function, and it will be considered a new DataFrame. For example, the following code snippet will recursively copy everything in df1 into df, so that changes you make to df do not affect df1:
import pandas
import copy
df = copy.deepcopy(df1)
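The difference between a plain assignment and a copy can be checked with a small sketch. The variable names alias, independent, and also_independent are chosen here for illustration only:

```python
import copy
import pandas as pd

df1 = pd.DataFrame({"category": [1, 2, 3]})

alias = df1                        # a second name for the same object
independent = copy.deepcopy(df1)   # a new object with its own data
also_independent = df1.copy()      # pandas' own method; deep=True by default

# Changing the object through the alias changes df1 as well
alias.loc[0, "category"] = 99

changed_original = df1.loc[0, "category"]
untouched_copy = independent.loc[0, "category"]
```

After the assignment through alias, df1 shows the new value, while both copies still hold the original one.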
Adding and Removing Attributes and Observations
pandas provides the following functions to add and delete rows (observations) and columns (attributes):
df['col'] = s: This adds a new column, col, to the DataFrame, df, with the Series, s.
df.assign(c1=s1, c2=s2...): This adds new columns, c1, c2, and so on, with Series, s1, s2, and so on, to the df DataFrame in one go. It returns a new DataFrame rather than modifying df.
df.append(df2 / d2, ignore_index): This adds values from the df2 DataFrame to the bottom of the df DataFrame wherever the columns of df2 match those of df. Alternatively, it also accepts a dict, d2, in which case ignore_index=True is required. If ignore_index=True, it does not use index labels. Note that append was removed in pandas 2.0; in newer versions, use pd.concat instead.
df.drop(labels, axis): This removes the rows or columns specified by the labels and corresponding axis, or those specified by the index or column names directly.
df.dropna(axis, how): Depending on the parameter passed to how, this decides whether to drop rows (or columns if axis=1) with missing values in any of the fields or in all of the fields. If no parameter is passed, the default value of how is any and the default value of axis is 0.
df.drop_duplicates(keep): This removes rows with duplicate values in the DataFrame, and keeps the first (keep='first'), last (keep='last'), or no occurrence (keep=False) in the data.
We can also combine different pandas DataFrames sequentially with the concat function, as follows:
pd.concat([df1, df2..]): This creates a new DataFrame with df1, df2, and all other DataFrames combined sequentially. It will automatically combine columns having the same names in the combined DataFrames.
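The following sketch exercises several of these functions on a tiny made-up DataFrame. Each call returns a new DataFrame and leaves df itself unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, np.nan], "b": [2, 2, np.nan]})

any_missing = df.dropna()                    # how='any' and axis=0 by default
deduped = df.drop_duplicates(keep="first")   # keep the first of each duplicate
none_kept = df.drop_duplicates(keep=False)   # drop every duplicated row
stacked = pd.concat([df, df], ignore_index=True)
with_total = df.assign(total=df["a"] + df["b"])
```

Because assign returns a new object, the total column exists only in with_total; to keep it, you would assign the result back to df.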
Exercise 3: Creating and Modifying Test DataFrames
This exercise aims to test your understanding of creating and modifying DataFrames in pandas. We will create a test DataFrame from scratch and add and remove rows and columns by making use of the functions and concepts described so far:
1. Import the pandas library and the copy module, which we will need for this task:
import pandas as pd
import copy
2. Create a DataFrame, df1, and use the head method to see the first few rows of the DataFrame. Use the following code:
df1 = pd.DataFrame({'category': pd.Series([1, 2, 3])})
df1.head()
Your output should be as follows:
Figure 1.20: The first few rows of df1
3. Create a test DataFrame, df, by duplicating df1. Use the deepcopy function:
df = copy.deepcopy(df1)
df.head()
You should get the following output:
Figure 1.21: The first few rows of df
4. Add a new column, cities, containing different kinds of city groups to the test DataFrame using the following code and take a look at the DataFrame again:
df['cities'] = pd.Series([['Delhi', 'Mumbai'],
                          ['Lucknow', 'Bhopal'],
                          ['Chennai', 'Bangalore']])
df.head()
You should get the following output:
Figure 1.22: Adding a column to df
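5. Now, add multiple columns pertaining to the user viewership using the assign function and again look at the data. As assign returns a new DataFrame, assign the result back to df. Use the following code:
df = df.assign(
    young_viewers=pd.Series([2000000, 3000000, 1500000]),
    adult_viewers=pd.Series([2500000, 3500000, 1600000]),
    aged_viewers=pd.Series([2300000, 2800000, 2000000])
)
df.head()
Your DataFrame will now appear as follows:
Figure 1.23: Adding multiple columns to df
6. Use the append function to add a new row to the DataFrame. The new row contains only partial information (it has no value for category) and is passed as a dict, which carries no index label, so we must pass the ignore_index parameter as True. Like assign, append returns a new DataFrame, so assign the result back to df. (This method is available only in pandas versions before 2.0; later versions removed it in favor of pd.concat.)
df = df.append({'cities': ["Kolkata", "Hyderabad"],
                'adult_viewers': 2000000,
                'aged_viewers': 2000000,
                'young_viewers': 1500000},
               ignore_index=True)
df.head()
Your DataFrame should now look as follows:
Figure 1.24: Adding another row by using the append function on df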
7. Now, use the concat function to duplicate the test DataFrame and save it as df2. Take a look at the new DataFrame:
df2 = pd.concat([df, df], sort=False)
df2
df2 will show duplicate entries of df, as shown here:
Figure 1.25: Using the concat function to duplicate a DataFrame, df2, in pandas
8. To delete a row from the df DataFrame, we will now pass the index label of the row we want to delete (in this case 3, that is, the fourth row) to the drop function, as follows:
df.drop([3])
You will get the following output:
Figure 1.26: Using the drop function to delete a row
9. Similarly, let's delete the aged_viewers column from the DataFrame. We will pass the column name as the parameter to the drop function and specify the axis as 1:
df.drop(['aged_viewers'], axis=1)
Your output will be as follows:
Figure 1.27: Dropping the aged_viewers column in the DataFrame
10. Note that, as the result of the drop function is also a DataFrame, we can chain another function on it too. So, we drop the cities field from df2 and remove the duplicates in it as well:
df2.drop('cities', axis=1).drop_duplicates()
The result will look as follows:
Figure 1.28: Dropping the cities field and then removing duplicates in df2
Congratulations! You've successfully performed some basic operations on a DataFrame. You now know how to add rows and columns to DataFrames and how to concatenate multiple DataFrames together into one big DataFrame.
In the next section, you will learn how to combine multiple data sources into the same DataFrame. When combining data sources, we need to include the common columns from both sources while making sure that no duplication occurs. We also need to make sure that, unlike with the concat function, the combined DataFrame is smart about the index and does not duplicate rows that already exist. This feature is also covered in the next section.
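The append method used in step 6 exists only in pandas releases before 2.0. As a hedged sketch of how the same row addition might look with pd.concat, the following uses a smaller stand-in for the exercise DataFrame; the new_row name is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "category": [1, 2, 3],
    "young_viewers": [2000000, 3000000, 1500000],
})

# Partial information: 'category' is deliberately left out
new_row = {"young_viewers": 1500000}

# Wrap the dict in a list to get a one-row DataFrame, then stack it under df
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
```

The missing category value in the new row is filled with NaN, and ignore_index=True renumbers the rows from 0.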
Combining Data
Once the data is prepared from multiple sources in separate pandas DataFrames, we can use the pd.merge function to combine them into the same DataFrame based on a relevant key passed through the on parameter. It is possible that the joining key is named differently in the different DataFrames that are being joined. So, while calling pd.merge(df, df1), we can provide a left_on parameter to specify the column to be merged from df and a right_on parameter to specify the column in df1.
pandas provides four ways of combining DataFrames through the how parameter. Each value of how corresponds to a different type of join, described as follows:
Figure 1.29: Table describing different joins
The following figure shows two sample DataFrames, df1 and df2, and the results of the various joins performed on these DataFrames:
Figure 1.30: Table showing two DataFrames and the outcomes of different joins on them
For example, we can perform a right and an outer join on the DataFrames of the previous exercise using the following code:
pd.merge(df, df1, how='right')
pd.merge(df, df1, how='outer')
The following will be the output of the preceding two joins:
Figure 1.31: Examples of the different types of merges in pandas
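To compare the four join types side by side, here is a minimal sketch on two made-up DataFrames whose key columns have different names, which is where left_on and right_on come in:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"id": ["b", "c", "d"], "y": [20, 30, 40]})

# Run the same merge once per join type
joined = {
    how: pd.merge(left, right, left_on="key", right_on="id", how=how)
    for how in ["inner", "left", "right", "outer"]
}

# Number of rows produced by each join
row_counts = {how: len(frame) for how, frame in joined.items()}
```

Only b and c appear in both DataFrames, so the inner join is the smallest result, while the outer join keeps every key from either side and fills the gaps with NaN.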
Handling Missing Data
Once we have joined two datasets, it is easy to see what happens to an index present in one of the tables but not in the other. The other columns of that index get the np.nan value, which is pandas' way of telling us that data is missing in that column. Depending on where and how the values are going to be used, missing values can be treated differently. The following are various ways of treating missing values:
We can get rid of missing values completely using df.dropna, as explained in the Adding and Removing Attributes and Observations section.
We can also replace all the missing values at once using df.fillna(). The value we want to fill in will depend heavily on the context and the use case for the data. For example, we can replace all missing values with the mean or median of the data, or even some easy-to-filter value, such as -1, using df.fillna(df.mean()), df.fillna(df.median()), or df.fillna(-1), as shown here:
Figure 1.32: Using the df.fillna function
We can interpolate missing values using the interpolate function:
Figure 1.33: Using the interpolate function to predict category
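Since the figures are not reproduced here, the following sketch shows the same kinds of calls on a small made-up column. It assumes an all-numeric DataFrame, so df.mean() needs no extra arguments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"category": [1.0, np.nan, 3.0, np.nan, 5.0]})

with_mean = df.fillna(df.mean())   # fill gaps with the column mean
with_flag = df.fillna(-1)          # fill gaps with an easy-to-filter value
interpolated = df.interpolate()    # linear interpolation by default
```

With the mean, both gaps receive the same value; with interpolation, each gap receives a value between its two neighbors.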
Other than using in-built operations, we can also perform different operations on DataFrames by filtering out rows with missing values in the following ways:
We can check for slices containing missing values using the isnull function, or those without them using the notnull function. Both are available as top-level functions (pd.isnull, pd.notnull) and as DataFrame methods:
df.isnull()
You should get the following output:
Figure 1.34: Using the .isnull function
We can check whether individual elements are NA using the isna function:
df[['category']].isna()
This will give you the following output:
Figure 1.35: Using the isna function
These functions detect only the values that pandas itself marks as missing. If your DataFrame gets data from different sources, you might come across other representations of missing values, for example, NULL markers that arrive from a database as plain strings. You'll have to filter them out separately, as described in previous sections, and proceed.
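A minimal sketch of using these boolean masks for filtering, on a made-up DataFrame with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [1.0, np.nan, 3.0],
    "city": ["Delhi", np.nan, "Pune"],
})

mask = df.isnull()                  # same shape as df, True where missing
missing_per_column = mask.sum()     # True counts as 1, so this counts gaps

# Keep only the rows in which every field is present
complete_rows = df[df.notnull().all(axis=1)]
```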
Exercise 4: Combining DataFrames and Handling Missing Values
The aim of this exercise is to get you used to combining different DataFrames and handling missing values in different contexts, as well as to revisit how to create DataFrames. The context is to get user information about users definitely watching a certain webcast on a website so that we can recognize patterns in their behavior:
1. Import the numpy and pandas modules, which we'll be using:
import numpy as np
import pandas as pd
2. Create two empty DataFrames, df1 and df2:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
3. We will now add dummy information about the viewers of the webcast in a column named viewers in df1, and the people using the website in a column named users in df2. Use the following code:
df1['viewers'] = ["Sushmita", "Aditya", "Bala", "Anurag"]
df2['users'] = ["Aditya", "Anurag", "Bala", "Sushmita", "Apoorva"]
4. We will also add a couple of additional columns to each DataFrame. The values for these can be added manually or sampled from a distribution, such as a normal distribution through NumPy:
np.random.seed(1729)
df1 = df1.assign(views=np.random.normal(100, 100, 4))
df2 = df2.assign(cost=[20, np.nan, 15, 2, 7])
5. View the first few rows of both DataFrames, still using the head method:
df1.head()
df2.head()
You should get the following outputs for both df1 and df2:
Figure 1.36: Contents of df1 and df2
6. Do a left join of df1 with df2 and store the output in a DataFrame, df, because we only want the user stats in df2 of those users who are viewing the webcast in df1. Therefore, we also specify the joining key as "viewers" in df1 and "users" in df2:
df = df1.merge(df2, left_on="viewers", right_on="users", how="left")
df.head()
Your output should now look as follows:
Figure 1.37: Using the merge and fillna functions
7. You'll observe some missing values (NaN) in the preceding output. We will handle these values in the DataFrame by replacing them with the mean value of their column. As viewers and users contain strings, we restrict the mean to numeric columns with numeric_only=True; recent pandas versions raise an error otherwise. Use the following code:
df.fillna(df.mean(numeric_only=True))
Your output will now look as follows:
Figure 1.38: Imputing missing values with the mean through fillna
Congratulations! You have successfully wrangled with data in data pipelines and transformed attributes externally. But to handle the sales.xlsx file that we saw previously, this is still not enough. We need to apply functions and operations on the data inside the DataFrame too. Let's learn how to do that and more in the next section.
Applying Functions and Operations on DataFrames
By default, operations on all pandas objects are element-wise and return the same type of pandas objects. For instance, look at the following code:
df['viewers'] = df['adult_viewers'] + df['aged_viewers'] + df['young_viewers']
This will add a viewers column to the DataFrame with the value for each observation being equal to the sum of the values in the adult_viewers, aged_viewers, and young_viewers columns.
Similarly, the following code will multiply every numerical value in the viewers column of the DataFrame by 0.03 or whatever you want to keep as your target CTR (click-through rate):
df['expectedclicks'] = 0.03 * df['viewers']
Hence, your DataFrame will look as follows once these operations are performed:
Figure 1.39: Operations on pandas DataFrames
pandas also supports several built-in functions on pandas objects. These are listed in the following table:
Figure 1.40: Built-in functions used in pandas
Note
Remember that pandas objects are Python objects too. Therefore, we can write our own custom functions to perform specific tasks on them.
We can iterate through the rows and columns of pandas objects using itertuples, iterrows, or iteritems. Consider the following DataFrame, named df:
Figure 1.41: DataFrame df
The following methods can be performed on this DataFrame:
itertuples: This method iterates over the rows of the DataFrame in the form of named tuples. By setting the index parameter to False, we can remove the index as the first element of the tuple, and we can set a custom name for the yielded named tuples through the name parameter. The following screenshot illustrates this over the DataFrame shown in the preceding figure:
Figure 1.42: Testing itertuples
iterrows: This method iterates over the rows of the DataFrame in tuples of the type (label, content), where label is the index of the row and content is a pandas Series containing every item in the row. The following screenshot illustrates this:
Figure 1.43: Testing iterrows
iteritems: This method iterates over the columns of the DataFrame in tuples of the type (label, content), where label is the name of the column and content is the content in the column in the form of a pandas Series. In pandas 2.0 and later, iteritems has been removed and the same method is called items. The following screenshot shows how this is performed:
Figure 1.44: Checking out iteritems
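As the screenshots are not reproduced here, the following is a small sketch of the three iteration styles on a made-up DataFrame. It uses items, which is available in both older and newer pandas versions, in place of iteritems; the tuple name Viewer is chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Pune"], "views": [120, 80]})

# Rows as namedtuples; index=False drops the index, name sets the tuple type
rows = list(df.itertuples(index=False, name="Viewer"))
first_city = rows[0].city

# Rows as (label, Series) pairs
row_labels = [label for label, content in df.iterrows()]

# Columns as (name, Series) pairs
column_names = [name for name, content in df.items()]
```

Of the three, itertuples is usually the fastest way to loop over rows, since iterrows builds a new Series for every row.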
To apply built-in or custom functions to pandas objects, we can make use of the map and apply functions. We can pass any built-in, NumPy, or custom functions as parameters to these functions, and they will be applied to all elements in the column:
map: This returns an object of the same kind as the one it was called on. A dictionary can also be passed as input to it, as shown here:
Figure 1.45: Using the map function
apply: This applies the function to the object passed and returns a Series or a DataFrame, depending on what the function returns. It can easily take multiple columns as input. It also accepts the axis parameter, depending on how the function is to be applied, as shown:
Figure 1.46: Using the apply function
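A short sketch of both functions on a made-up DataFrame follows. The column names and the lookup dictionary are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"grade": ["a", "b", "a"], "x": [1, 4, 9], "y": [1, 2, 3]})

# map with a dict: look up each element of the Series
mapped = df["grade"].map({"a": "high", "b": "low"})

# map with a function: call it on each element
roots = df["x"].map(np.sqrt)

# apply with axis=0: the function receives one column at a time
col_range = df[["x", "y"]].apply(lambda col: col.max() - col.min(), axis=0)

# apply with axis=1: the function receives one row at a time
row_totals = df[["x", "y"]].apply(lambda r: r["x"] + r["y"], axis=1)
```

In both apply calls the function returns a single value per column or row, so the result is a Series rather than a DataFrame.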
Other than working on just DataFrames and Series, functions can also be applied to pandas GroupBy objects. Let's see how that works.
Grouping Data
Suppose you want to apply a function differently on some rows of a DataFrame, depending on the values in a particular column in that row. You can slice the DataFrame on the key(s) you want to aggregate on and then apply your function to that group, store the values, and move on to the next group.
pandas provides a much better way to do this, using the groupby function, where you can pass keys for groups as a parameter. The output of this function is a DataFrameGroupBy object that holds groups containing values of all the rows in that group. We can select the column we would like to apply a function to, and pandas will automatically aggregate the outputs at the level of the different values of the keys and return the final result with the function applied to each group.
For example, the following will collect the rows that have the same number of aged_viewers together, take their values in the expectedclicks column, and add them together:
Figure 1.47: Using the groupby function on a Series
If we had instead selected the column from the GroupBy object with double brackets, as in [['series']], we would have gotten a DataFrame back, as shown:
Figure 1.48: Using the groupby function on a DataFrame
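The difference between single and double brackets can be sketched as follows. The figures are not reproduced here, so the values below are made up; only the column names come from the text:

```python
import pandas as pd

df = pd.DataFrame({
    "aged_viewers": [2000000, 2000000, 2800000],
    "expectedclicks": [100.0, 50.0, 70.0],
})

grouped = df.groupby("aged_viewers")

# Single brackets select one column: the aggregate is a Series
as_series = grouped["expectedclicks"].sum()

# Double brackets select a list of columns: the aggregate is a DataFrame
as_frame = grouped[["expectedclicks"]].sum()
```

In both results, the distinct values of aged_viewers become the index, with one row per group.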
Exercise 5: Applying Data Transformations
The aim of this exercise is to get you used to performing regular and groupby operations on DataFrames and applying functions to them. You will use the user_info.json file in the Lesson02 folder on GitHub, which contains information about six customers.
1. Import the pandas module that we'll be using:
import pandas as pd
2. Read the user_info.json file into a pandas DataFrame, user_info, and look at the first few rows of the DataFrame:
user_info = pd.read_json('user_info.json')
user_info.head()
You will get the following output:
Figure 1.49: Output of the head function on user_info
3. Now, look at the attributes and the data inside them:
user_info.info()
You will get the following output:
Figure 1.50: Output of the info function on user_info
4. Let's make use of the map function to see how many friends each user in the data has. Use the following code:
user_info['friends'].map(lambda x: len(x))
You will get the following output:
Figure 1.51: Using the map function on user_info
5. We use the apply function to get a grip on the data within each column individually and apply regular Python functions to it. Let's convert all the values in the tags column of the DataFrame to capital letters using the upper function for strings in Python, as follows:
user_info['tags'].apply(lambda x: [t.upper() for t in x])
You should get the following output:
Figure 1.52: Converting values in tags
6. Use the groupby function to get the different values taken by a certain attribute. We can use the count function on each such mini pandas DataFrame generated. We'll do this first for the eye color:
user_info.groupby('eyeColor')['_id'].count()
Your output should now look as follows:
Figure 1.53: Checking distribution of eyeColor
7. Similarly, let's look at the distribution of another variable, favoriteFruit, in the data too:
user_info.groupby('favoriteFruit')['_id'].count()
Figure 1.54: Seeing the distribution in user_info
We are now sufficiently prepared to handle any sort of problem we might face when trying to structure even unstructured data into a structured format. Let's do that in the activity here.
Activity 1: Addressing Data Spilling
We will now solve the problem that we encountered in Exercise 1. We start by loading sales.xlsx, which contains some historical sales data, recorded in MS Excel, about different customer purchases in stores in the past few years. Your current team is only interested in the following product types: Climbing Accessories, Cooking Gear, First Aid, Golf Accessories, Insect Repellents, and Sleeping Bags. You need to read the files into pandas DataFrames and prepare the output so that it can be added into your analytics pipeline. Follow the steps given here:
1. Open the Python console and import pandas and the copy module.
2. Load the data from sales.xlsx into a separate DataFrame, named sales, and look at the first few rows of the generated DataFrame. You will get the following output:
Figure 1.55: Output of the head function on sales.xlsx
3. Analyze the data type of the fields and get hold of prepared values.
4. Get the column names right. In this case, every new column starts with a capital letter.
5. Look at the first column. If the value in the column matches the expected values, just correct the column name and move on to the next column.
6. Take the first column with values leaking into other columns and look at the distribution of its values. Add the values from the next column and go on to as many columns as required to get to the right values for that column.
7. Slice out the portion of the DataFrame that has the largest number of columns required to cover the value for the right column and structure the values for that column correctly in a new column with the right attribute name.
8. You can now drop all the columns from the slice that are no longer required once the field has the right values and move on to the next column.
9. Repeat steps 4 to 7 multiple times, until you have gotten a slice of the DataFrame completely structured with all the values correct and pointing to the intended column. Save this DataFrame slice. Your final structured DataFrame should appear as follows:
Figure 1.56: First few rows of the structured DataFrame
Note
The solution for this activity can be found on page 316.
Summary
Data processing and wrangling is the initial, and a very important, part of the data science pipeline. It is generally helpful if people preparing data have some domain knowledge about the data, since that will help them stop at the right processing point and use their intuition to build the pipeline better and more quickly. Data processing also requires coming up with innovative solutions and hacks.
In this chapter, you learned how to structure large datasets by arranging them in a tabular form. Then, we got this tabular data into pandas and distributed it between the right columns. Once we were sure that our data was arranged correctly, we combined it with other data sources. We also got rid of duplicates and needless columns, and finally, dealt with missing data. After performing these steps, our data was made ready for analysis and could be put into a data science pipeline directly.
In the next chapter, we will deepen our understanding of pandas and talk about reshaping and analyzing DataFrames for better visualizations and summarizing data. We will also see how to directly solve generic business-critical problems efficiently.