UTeach CS Principles, Unit 5: Big Data
UNIT TOPIC: Data Analysis, Data Mining
You will investigate the use of data mining in the discovery of patterns in large data sets.
You will apply association rule mining to discover knowledge in data sets.
UTeach Computer Science, http://uteachcs.org, © 2016 The University of Texas at Austin
Data Mining
Traditional ore mining begins with an exploration (prospecting) of a resource pool (stone), and proceeds to determining if usable resources exist (ore) and to what degree. Prospectors basically have an idea of what they are looking for, and they run small tests to see if they are correct. Sometimes they strike gold, other times they strike out. Alongside the physical mines that bring us everything from coal to diamonds, we now have a new type of mining: data mining.
Data mining is the discovery of patterns in large data sets. Like ore mining, data mining begins with an exploration (analysis) of a resource pool (data), and proceeds to determine whether usable resources exist (correlations) and to what degree (how strong they are). Not all data miners "strike it rich." Like ore mining, data mining can turn up no useful patterns at all. Sometimes, however, it leads to a bonanza of useful information.
In data mining, the emphasis is on the discovery of new knowledge. Data miners want to find new patterns that were previously unobserved. They use statistical analysis of big data to discover what the human eye can't see, just like an ore miner might use a pick, dynamite, or lab test to uncover ore that was not visible to the naked eye before. This is a form of exploratory data analysis rather than statistical hypothesis testing.
Data Mining Strategies
Data mining involves six common classes of tasks, listed below, along with examples of how these strategies can be used in recommender systems, such as those used by Netflix, Pandora, Amazon, http://www.whatshouldireadnext.com/, and many other content providers. Each description below is followed by a Netflix-related example of its usage:
Anomaly detection (outlier/change/deviation detection): The identification of unusual data records that might be interesting, or might simply be data errors that require further investigation.
Movie X is unlike any of the other movies in User Y's data set. Remove it from our calculations. (Example: The Texas Chainsaw Massacre is on a list that mostly contains titles such as Teletubbies, Barney and Friends, and Clifford.)
Association rule learning (dependency modeling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Recommender systems: Users who like Movie X tend to also like Movie Y.
Clustering: The task of discovering groups and structures in the data that are in some way "similar," without using known structures in the data.
Dynamically grouped movie categories: "Romantic Comedies in Paris starring former professional football players."
Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam."
Movie X is a romantic comedy.
Regression: Attempts to find a function that models the data with the least error.
Type X users typically increase their movie consumption rate by four movies per year.
Summarization: Providing a more compact representation of the data set, including visualization and report generation.
What type of movie does User X typically like? (That is, sum up User X's preferences in Y words.)
These strategies all have different purposes, are more effective on some data sets than on others, and often work best in conjunction with one another. Therefore, there is no one "best" way to perform data mining. Data miners use multiple strategies to uncover patterns and discover new knowledge.
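As a concrete illustration of one of the six tasks, the regression example above ("four movies per year") can be sketched as a least-squares line fit. The yearly counts below are made-up values for a hypothetical user, not real Netflix data, and `fit_line` is an illustrative helper name rather than part of any library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: movies watched per year by one user.
years = [1, 2, 3, 4, 5]
movies_watched = [20, 24, 28, 32, 36]

slope, intercept = fit_line(years, movies_watched)
print(f"Estimated increase: {slope:.1f} movies per year")
```

With real data the points would not fall exactly on a line; the fitted slope is then the rate that minimizes the total squared error.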
Common misconception: Data mining is often confused with artificial intelligence (AI).
Data mining is actually an application of techniques commonly associated with AI. "Machine learning" and "decision support" are standard AI techniques, but when we apply them to "knowledge discovery in databases," we refer to them collectively simply as "tools for data mining."
How much power lies in data mining? Read the following article to find out: "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did."
Association Rule Mining
Companies Know What You Buy
French toast is one of America's favorite breakfast foods. It's delicious and can be easily prepared at home using a variety of techniques and toppings. Even though it can be prepared a number of ways, almost all French toast recipes call for at least three things:
1. bread
2. milk
3. eggs
If you're going to make French toast, you're going to need bread, you're going to need milk, and you're going to need eggs. What does French toast have to do with big data?
Association Rule Mining
An association rule is a link between one set of items and another. Specifically, association rules identify instances in which the appearance of one set of items (the antecedent) implies that another set of items (the consequent) will also appear.
For example:
{X, Y} ⇒ {Z}
This rule can be read as, “If the antecedents (X and Y) appear, then it is likely that the consequent (Z) will also appear.”
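The word "likely" in this reading can be made precise. Two standard measures from association rule mining, which this handout does not itself name, are support and confidence. As a sketch, let $N$ be the total number of transactions and let $\operatorname{count}(S)$ be the number of transactions that contain every item in the set $S$ (both symbols are introduced here only for illustration):

```latex
\[
\operatorname{support}(\{X,Y\} \Rightarrow \{Z\})
  = \frac{\operatorname{count}(\{X,Y,Z\})}{N},
\qquad
\operatorname{confidence}(\{X,Y\} \Rightarrow \{Z\})
  = \frac{\operatorname{count}(\{X,Y,Z\})}{\operatorname{count}(\{X,Y\})}
\]
```

A confidence near 1 means Z almost always appears when X and Y do. A low support means the rule rests on only a few transactions.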
By using association rules, we can group items together logically and attempt to make predictions. By tracking each transaction, tabulating the items in it, and then discovering which pairs (or larger groups) of columns often correlate with one another, we can generate association rules that capture these correlations in the data. This applies to French toast preparation.
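The tabulating step described above can be sketched in a few lines of Python. The receipts here are invented for illustration, and `count_pairs` is a hypothetical helper name. The sketch only counts how often each pair of items appears on the same receipt, which is one simple starting point rather than a full mining algorithm.

```python
from collections import Counter
from itertools import combinations

def count_pairs(receipts):
    """Count how many receipts contain each unordered pair of items."""
    pair_counts = Counter()
    for receipt in receipts:
        # Sort so that ("bread", "milk") and ("milk", "bread") share one key;
        # set() ignores an item listed twice on the same receipt.
        for pair in combinations(sorted(set(receipt)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Made-up receipts, one list of items per transaction.
receipts = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs", "syrup"],
    ["milk", "chips"],
]

pair_counts = count_pairs(receipts)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs with high counts are candidates for association rules; larger groups can be counted the same way by changing the `2` passed to `combinations`.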
For example:
If most people who buy milk, bread, and eggs also buy maple syrup, then association rule mining might turn up the following rule:
{milk, bread, eggs} ⇒ {syrup}
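One common way to check how well such a rule holds is to compute its support (the fraction of all baskets that contain every item in the rule) and its confidence (among baskets that contain the antecedent, the fraction that also contain the consequent). These are standard measures, though this handout does not name them. The baskets below are invented, and `rule_stats` is a hypothetical helper name.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= set(b))
    n_both = sum(1 for b in baskets if both <= set(b))
    support = n_both / len(baskets)
    # Avoid dividing by zero when the antecedent never appears.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up shopping baskets.
baskets = [
    {"milk", "bread", "eggs", "syrup"},
    {"milk", "bread", "eggs", "syrup", "butter"},
    {"milk", "bread", "eggs"},
    {"milk", "cereal"},
    {"bread", "syrup"},
]

support, confidence = rule_stats(baskets, {"milk", "bread", "eggs"}, {"syrup"})
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

In this toy data, two of the three baskets containing milk, bread, and eggs also contain syrup, so the confidence is about 0.67.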
Walmart can now target store patrons who purchase milk, bread, and eggs to gently suggest that they might also like to buy syrup. The computerized storefront (or physical storefront with a layout determined by computational data mining) does not know that these patrons may be making French toast; it has merely developed association rules to guide product placement. Association rule mining is basically the process behind "How Target Figured Out a Teen Girl Was Pregnant..."
Instructions
Your group has been hired by DataMarket, a corporation seeking to open a new chain of stores in your region. Their goal is to provide customers with optimal arrangements of store products, in an attempt to minimize the time and effort required to shop.
You will design a mock store product placement scheme, driven by data collection from competitors’ stores in the area. Use the receipts provided by your teacher (1) to generate association rules that map potentially correlated products, and then (2) sketch an endcap for data-driven product placement targeting potential shoppers in the area.
As you extract data from the receipts, consider the following guiding questions:
1. What is the best way to use the provided table to organize your data collection?
2. What trends do you find in the data?
3. Are there any negative associations between products?
4. What is the ideal size for sets of antecedents/consequents?
5. What additional information might be helpful?
6. Can you imagine scenarios in which sets of products are grouped together?
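For the question about negative associations, one standard measure (not named in this handout) is lift: the fraction of receipts containing both items, divided by what that fraction would be if the two items were bought independently. A lift below 1 suggests the items appear together less often than chance would predict. The receipts and the `lift` helper below are made up for illustration and are not the receipts provided by your teacher.

```python
def lift(receipts, item_a, item_b):
    """Return the lift of item_a and item_b appearing on the same receipt."""
    n = len(receipts)
    p_a = sum(1 for r in receipts if item_a in r) / n
    p_b = sum(1 for r in receipts if item_b in r) / n
    p_both = sum(1 for r in receipts if item_a in r and item_b in r) / n
    if p_a == 0 or p_b == 0:
        return 0.0  # Lift is undefined if an item never appears; 0.0 is a placeholder.
    return p_both / (p_a * p_b)

# Made-up receipts, one set of items per transaction.
receipts = [
    {"coffee", "creamer"},
    {"coffee", "creamer", "sugar"},
    {"tea", "honey"},
    {"tea", "sugar"},
    {"coffee", "tea"},
    {"coffee", "sugar"},
]

print("coffee & creamer:", round(lift(receipts, "coffee", "creamer"), 2))
print("coffee & tea:", round(lift(receipts, "coffee", "tea"), 2))
```

In this toy data, coffee and creamer have a lift above 1 (bought together more often than chance), while coffee and tea have a lift below 1, which hints at a negative association.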