Data Analytics Made Accessible
Copyright © 2015 by Anil K. Maheshwari, Ph.D.
By purchasing this book, you agree not to copy the book by any means, mechanical or electronic.
No part of this book may be copied or transmitted without written permission.
Preface
There are many good books in the market on Data Analytics. So, why should anyone write another book on this topic? I have been teaching courses in business intelligence and data mining for a few years. More recently, I have been teaching this course to combined classes of MBA and Computer Science students. Existing textbooks seem too long, too technical, and too complex for use by students. This book fills a need for an accessible book on this topic. My goal was to write a conversational book that feels easy and informative. This is an accessible book that covers everything important, with concrete examples, and invites the reader to join this field.
The book has developed from my own class notes. It reflects my decades of IT industry experience, as well as many years of academic teaching experience. The chapters are organized for a typical one-semester graduate course. The book contains caselets from real-world stories at the beginning of every chapter. There is a running case study across the chapters as exercises.
Many thanks are in order. My father, Mr. Ratan Lal Maheshwari, encouraged me to put my thoughts in writing and make a book out of them. My wife Neerja helped me find the time and motivation to write this book. My brother, Dr. Sunil Maheshwari, was the source of many encouraging conversations about it. My colleague Dr. Edi Shivaji provided advice during my teaching of the Data Analytics courses. Another colleague, Dr. Scott Herriott, served as a role model as an author of many textbooks. Yet another colleague, Dr. Greg Guthrie, provided many ideas and ways to disseminate the book. Our department assistant, Ms. Karen Slowick at MUM, proofread the first draft of this book. Ms. Adri-Mari Vilonel in South Africa helped create an opportunity to use this book for the first time at a corporate MBA program.
Thanks are also due to my many students at MUM and elsewhere who proved good partners in my learning more about this area. Finally, thanks to Maharishi Mahesh Yogi for providing a wonderful university, MUM, where students develop their intellect as well as their consciousness.
Dr. Anil K. Maheshwari
Fairfield, IA.
November 2015
Contents
Preface
Chapter 1: Wholeness of Data Analytics
Business Intelligence
Caselet: MoneyBall - Data Mining in Sports
Pattern Recognition
Data Processing Chain
Data
Database
Data Warehouse
Data Mining
Data Visualization
Organization of the book
Review Questions
Section 1
Chapter 2: Business Intelligence Concepts and Applications
Caselet: Khan Academy – BI in Education
BI for better decisions
Decision types
BI Tools
BI Skills
BI Applications
Customer Relationship Management
Healthcare and Wellness
Education
Retail
Banking
Financial Services
Insurance
Manufacturing
Telecom
Public Sector
Conclusion
Review Questions
Liberty Stores Case Exercise: Step 1
Chapter 3: Data Warehousing
Caselet: University Health System – BI in Healthcare
Design Considerations for DW
DW Development Approaches
DW Architecture
Data Sources
Data Loading Processes
Data Warehouse Design
DW Access
DW Best Practices
Conclusion
Review Questions
Liberty Stores Case Exercise: Step 2
Chapter 4: Data Mining
Caselet: Target Corp – Data Mining in Retail
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques
Tools and Platforms for Data Mining
Data Mining Best Practices
Myths about data mining
Data Mining Mistakes
Conclusion
Review Questions
Liberty Stores Case Exercise: Step 3
Chapter 5: Data Visualization
Caselet: Dr Hans Rosling - Visualizing Global Public Health
Excellence in Visualization
Types of Charts
Visualization Example
Visualization Example phase-2
Tips for Data Visualization
Conclusion
Review Questions
Liberty Stores Case Exercise: Step 4
Section 2
Chapter 6: Decision Trees
Caselet: Predicting Heart Attacks using Decision Trees
Decision Tree problem
Decision Tree Construction
Lessons from constructing trees
Decision Tree Algorithms
Conclusion
Review Questions
Liberty Stores Case Exercise: Step 5
Chapter 7: Regression
Caselet: Data driven Prediction Markets
Correlations and Relationships
Visual look at relationships
Regression Exercise
Non-linear regression exercise
Logistic Regression
Advantages and Disadvantages of Regression Models
Conclusion
Review Exercises
Liberty Stores Case Exercise: Step 6
Chapter 8: Artificial Neural Networks
Caselet: IBM Watson - Analytics in Medicine
Business Applications of ANN
Design Principles of an Artificial Neural Network
Representation of a Neural Network
Architecting a Neural Network
Developing an ANN
Advantages and Disadvantages of using ANNs
Conclusion
Review Exercises
Chapter 9: Cluster Analysis
Caselet: Cluster Analysis
Applications of Cluster Analysis
Definition of a Cluster
Representing clusters
Clustering techniques
Clustering Exercise
K-Means Algorithm for clustering
Selecting the number of clusters
Advantages and Disadvantages of K-Means algorithm
Conclusion
Review Exercises
Liberty Stores Case Exercise: Step 7
Chapter 10: Association Rule Mining
Caselet: Netflix: Data Mining in Entertainment
Business Applications of Association Rules
Representing Association Rules
Algorithms for Association Rule
Apriori Algorithm
Association rules exercise
Creating Association Rules
Conclusion
Review Exercises
Liberty Stores Case Exercise: Step 8
Section 3
Chapter 11: Text Mining
Caselet: WhatsApp and Private Security
Text Mining Applications
Text Mining Process
Term Document Matrix
Mining the TDM
Comparing Text Mining and Data Mining
Text Mining Best Practices
Conclusion
Review Questions
Chapter 12: Web Mining
Web content mining
Web structure mining
Web usage mining
Web Mining Algorithms
Conclusion
Review Questions
Chapter 13: Big Data
Caselet: Personalized Promotions at Sears
Defining Big Data
Big Data Landscape
Business Implications of Big Data
Technology Implications of Big Data
Big Data Technologies
Management of Big Data
Conclusion
Review Questions
Chapter 14: Data Modeling Primer
Evolution of data management systems
Relational Data Model
Implementing the Relational Data Model
Database management systems (DBMS)
Structured Query Language
Conclusion
Review Questions
Appendix 1: Data Mining Tutorial with Weka
Appendix 2: Data Mining Tutorial with R
Additional Resources
Chapter 1: Wholeness of Data Analytics
Business is the act of doing something productive to serve someone’s needs, and thus earn a living and make the world a better place. Business activities are recorded on paper or using electronic media, and then these records become data. There is more data from customers’ responses and on the industry as a whole. All this data can be analyzed and mined using special tools and techniques to generate patterns and intelligence, which reflect how the business is functioning. These ideas can then be fed back into the business so that it can evolve to become more effective and efficient in serving customer needs. And the cycle continues on (Figure 1.1).
Figure 1.1: Business Intelligence and Data Mining Cycle
Business Intelligence
Any business organization needs to continually monitor its business environment and its own performance, and then rapidly adjust its future plans. This includes monitoring the industry, the competitors, the suppliers, and the customers. The organization also needs to develop a balanced scorecard to track its own health and vitality. Executives typically determine what they want to track based on their key performance indicators (KPIs) or key result areas (KRAs). Customized reports need to be designed to deliver the required information to every executive. These reports can be converted into customized dashboards that deliver the information rapidly and in easy-to-grasp formats.
Caselet: MoneyBall - Data Mining in Sports
Analytics in sports was made popular by the book and movie, Moneyball. Statistician Bill James and Oakland A's general manager, Billy Beane, placed emphasis on crunching numbers and data instead of watching an athlete's style and looks. Their goal was to make a team better while using fewer resources. The key action plan was to pick important role players at a lower cost while avoiding the famous players who demand higher salaries but may provide a low return on a team's investment. Rather than relying on the scouts' experience and intuition, Beane selected players based almost exclusively on their on-base percentage (OBP). By finding players with a high OBP but with characteristics that lead scouts to dismiss them, Beane assembled a team of undervalued players with far more potential than the A's hamstrung finances would otherwise allow.
Using this strategy, they proved that even small market teams can be competitive; the Oakland A's are a case in point. In 2004, two years after adopting the same sabermetric model, the Boston Red Sox won their first World Series since 1918. (Source: Moneyball, 2004).
Q1: Could similar techniques apply to the games of soccer or cricket? If so, how?
Q2: What are the general lessons from this story?
Business intelligence is a broad set of information technology (IT) solutions that includes tools for gathering, analyzing, and reporting information to the users about performance of the organization and its environment. These IT solutions are among the most highly prioritized solutions for investment.
Consider a retail business chain that sells many kinds of goods and services around the world, online and in physical stores. It generates data about sales, purchases, and expenses from multiple locations and time frames. Analyzing this data could help identify fast-selling items, regional-selling items, seasonal items, fast-growing customer segments, and so on. It might also help generate ideas about what products sell together, which people tend to buy which products, and so on. These insights and intelligence can help design better promotion plans, product bundles, and store layouts, which in turn lead to a better-performing business.
The vice president of sales of a retail company would want to track the sales to date against monthly targets, the performance of each store and product category, and the top store managers that month. The vice president of finance would be interested in tracking daily revenue, expense, and cash flows by store; comparing them against plans; measuring cost of capital; and so on.
Pattern Recognition
A pattern is a design or model that helps grasp something. Patterns help connect things that may not appear to be connected. Patterns help cut through complexity and reveal simpler understandable trends. Patterns can be as definitive as hard scientific rules, like the rule that the sun always rises in the east. They can also be simple generalizations, such as the Pareto principle, which states that 80 percent of effects come from 20 percent of the causes.
A perfect pattern or model is one that (a) accurately describes a situation, (b) is broadly applicable, and (c) can be described in a simple manner. E = mc² would be such a general, accurate, and simple (GAS) model. Very often, all three qualities are not achievable in a single model, and one has to settle for two of three qualities in the model.
Patterns can be temporal, which is something that regularly occurs over time. Patterns can also be spatial, such as things being organized in a certain way. Patterns can be functional, in that doing certain things leads to certain effects. Good patterns are often symmetric. They echo basic structures and patterns that we are already aware of.
A temporal rule would be that “some people are always late,” no matter what the occasion or time. Some people may be aware of this pattern and some may not be. Understanding a pattern like this would help dissipate a lot of unnecessary frustration and anger. One can just joke that some people are born “10 minutes late,” and laugh it away. Similarly, Parkinson’s law states that work expands to fill up all the time available to do it.
A spatial pattern, following the 80–20 rule, could be that the top 20 percent of customers lead to 80 percent of the business. Or 20 percent of products generate 80 percent of the business. Or 80 percent of incoming customer service calls are related to just 20 percent of the products. This last pattern may simply reveal a discrepancy between a product’s features and what the customers believe about the product. The business can then decide to invest in educating the customers better so that the customer service calls can be significantly reduced.
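Whether a business actually follows the 80–20 pattern can be checked directly against its data. The following is a minimal sketch, not a method from this book: the function name top_share and the revenue figures are invented for illustration.

```python
def top_share(revenues, top_fraction=0.2):
    """Return the share of total revenue contributed by the top fraction of customers."""
    ranked = sorted(revenues, reverse=True)
    # Number of customers in the top group (at least one).
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical revenue per customer, invented for illustration only.
customer_revenue = [500, 300, 40, 35, 30, 25, 25, 20, 15, 10]

share = top_share(customer_revenue)
print(f"Top 20% of customers generate {share:.0%} of revenue")
```

The same function could be applied to revenue per product, or to service calls per product, to test the other two versions of the rule.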
A functional pattern may involve test-taking skills. Some students perform well on essay-type questions. Others do well in multiple-choice questions. Yet other students excel in doing hands-on projects, or in oral presentations. An awareness of such a pattern in a class of students can help the teacher design a balanced testing mechanism that is fair to all.
Retaining students is an ongoing challenge for universities. Recent data-based research shows that students leave a school for social reasons more than they do for academic reasons. This pattern/insight can instigate schools to pay closer attention to students engaging in extracurricular activities and developing stronger bonds at school. The school can invest in entertainment activities, sports activities, camping trips, and other activities. The school can also begin to actively gather data about every student’s participation in those activities, to predict at-risk students and take corrective action.
However, long-established patterns can also be broken. The past cannot always predict the future. A pattern like “all swans are white” does not mean that there may not be a black swan. Once enough anomalies are discovered, the underlying pattern itself can shift. The economic meltdown of 2008 to 2009 followed the collapse of an accepted pattern, that is, “housing prices always go up.” A deregulated financial environment made markets more volatile and led to greater swings in markets, leading to the eventual collapse of the entire financial system.
Diamond mining is the act of digging into large amounts of unrefined ore to discover precious gems or nuggets. Similarly, data mining is the act of digging into large amounts of raw data to discover unique, nontrivial, useful patterns. Data is cleaned up, and then special tools and techniques can be applied to search for patterns. Diving into clean and nicely organized data from the right perspectives can increase the chances of making the right discoveries.
A skilled diamond miner knows what a diamond looks like. Similarly, a skilled data miner should know what kinds of patterns to look for. The patterns are essentially about what hangs together and what is separate. Therefore, knowing the business domain well is very important. It takes knowledge and skill to discover the patterns. It is like finding a needle in a haystack. Sometimes the pattern may be hiding in plain sight. At other times, it may take a lot of work, and looking far and wide, to find surprising useful patterns. Thus, a systematic approach to mining data is necessary to efficiently reveal valuable insights.
For instance, the attitude of employees toward their employer may be hypothesized to be determined by a large number of factors, such as level of education, income, tenure in the company, and gender. It may be surprising if the data reveals that the attitudes are determined first and foremost by their age bracket. Such a simple insight could be powerful in designing organizations effectively. The data miner has to be open to any and all possibilities.
When used in clever ways, data mining can lead to interesting insights and be a source of new ideas and initiatives. One can predict the traffic pattern on highways from the movement of cell phone (in the car) locations on the highway. If the locations of cell phones on a highway or roadway are not moving fast enough, it may be a sign of traffic congestion. Telecom companies can thus provide real-time traffic information to the drivers on their cell phones, or on their GPS devices, without the need of any video cameras or traffic reporters.
Similarly, organizations can find out an employee’s arrival time at the office by when their cell phone shows up in the parking lot. Observing the record of the swipe of the parking permit card in the company parking garage can inform the organization whether an employee is in the office building or out of the office at any moment in time.
Some patterns may be so sparse that a very large amount of diverse data has to be seen together to notice any connections. For instance, locating the debris of a flight that may have vanished midcourse would require bringing together data from many sources, such as satellites, ships, and navigation systems. The raw data may come with various levels of quality, and may even be conflicting. The data at hand may or may not be adequate for finding good patterns. Additional dimensions of data may need to be added to help solve the problem.
Data Processing Chain
Data is the new natural resource. Implicit in this statement is the recognition of hidden value in data. Data lies at the heart of business intelligence. There is a sequence of steps to be followed to benefit from the data in a systematic way. Data can be modeled and stored in a database. Relevant data can be extracted from the operational data stores according to certain reporting and analyzing purposes, and stored in a data warehouse. The data from the warehouse can be combined with other sources of data, and mined using data mining techniques to generate new insights. The insights need to be visualized and communicated to the right audience in real time for competitive advantage. Figure 1.2 explains the progression of data processing activities. The rest of this chapter will cover these five elements in the data processing chain.
Figure 1.2: Data Processing Chain
Data
Anything that is recorded is data. Observations and facts are data. Anecdotes and opinions are also data, of a different kind. Data can be numbers, like the record of daily weather, or daily sales. Data can be alphanumeric, such as the names of employees and customers.
1. Data could come from any number of sources. It could come from operational records inside an organization, and it can come from records compiled by the industry bodies and government agencies. Data could come from individuals telling stories from memory and from people’s interaction in social contexts. Data could come from machines reporting their own status or from logs of web usage.
2. Data can come in many ways. It may come as paper reports. It may come as a file stored on a computer. It may be words spoken over the phone. It may be e-mail or chat on the Internet. It may come as movies and songs in DVDs, and so on.
3. There is also data about data. It is called metadata. For example, people regularly upload videos on YouTube. The format of the video file (whether it was a high-def file or lower resolution) is metadata. The information about the time of uploading is metadata. The account from which it was uploaded is also metadata. The record of downloads of the video is also metadata.
Data can be of different types.
1. Data could be an unordered collection of values. For example, a retailer sells shirts of red, blue, and green colors. There is no intrinsic ordering among these color values. One can hardly argue that any one color is higher or lower than the other. This is called nominal (means names) data.
2. Data could be ordered values like small, medium and large. For example, the sizes of shirts could be extra-small, small, medium, and large. There is clarity that medium is bigger than small, and large is bigger than medium. But the differences may not be equal. This is called ordinal (ordered) data.
3. Another type of data has discrete numeric values defined in a certain range, with the assumption of equal distance between the values. Customer satisfaction score may be ranked on a 10-point scale with 1 being lowest and 10 being highest. This requires the respondent to carefully calibrate the entire range as objectively as possible and place their own measurement in that scale. This is called interval (equal intervals) data.
4. The highest level of numeric data is ratio data, which can take on any numeric value. The weights and heights of all employees would be exact numeric values. The price of a shirt will also take any numeric value. It is called ratio (any fraction) data.
5. There is another kind of data that does not lend itself to much mathematical analysis, at least not directly. Such data needs to be first structured and then analyzed. This includes data like audio, video, and graphics files, often called BLOBs (Binary Large Objects). These kinds of data lend themselves to different forms of analysis and mining. Songs can be described as happy or sad, fast-paced or slow, and so on. They may contain sentiment and intention, but these are not quantitatively precise.
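The first four data types differ in which operations are meaningful on them. The sketch below illustrates this with one summary per type; all sample values and variable names are invented for illustration, and the choice of summaries is an assumption rather than a rule from this chapter.

```python
from statistics import mean, mode

# Hypothetical sample values, invented for illustration only.
shirt_colors = ["red", "blue", "green", "blue"]        # nominal
shirt_sizes = ["small", "large", "medium", "small"]    # ordinal
satisfaction = [7, 9, 4, 8]                            # interval (10-point scale)
weights_kg = [61.5, 72.0, 80.25, 68.0]                 # ratio

# Nominal: values can only be counted, so the most frequent value is a valid summary.
most_common_color = mode(shirt_colors)

# Ordinal: values can be ranked once their order is stated explicitly.
SIZE_ORDER = ["extra-small", "small", "medium", "large"]
sorted_sizes = sorted(shirt_sizes, key=SIZE_ORDER.index)

# Interval: equal spacing is assumed, so an average can be computed.
average_satisfaction = mean(satisfaction)

# Ratio: any numeric value is allowed, so ratios between values are meaningful.
heaviest_to_lightest = max(weights_kg) / min(weights_kg)

print(most_common_color, sorted_sizes, average_satisfaction, round(heaviest_to_lightest, 2))
```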
The precision of analysis increases as data becomes more numeric. Ratio data could be subjected to rigorous mathematical analysis. For example, precise weather data about temperature, pressure, and humidity can be used to create rigorous mathematical models that can accurately predict future weather.
Data may be publicly available and sharable, or it may be marked private. Traditionally, the law allows the right to privacy concerning one’s personal data. There is a big debate on whether the personal data shared on social media conversations is private or can be used for commercial purposes.
Datafication is a new term that means that almost every phenomenon is now being observed and stored. More devices are connected to the Internet. More people are constantly connected to “the grid,” by their phone network or the Internet, and so on. Every click on the web, and every movement of the mobile devices, is being recorded. Machines are generating data. The “Internet of things” is growing faster than the Internet of people. All of this is generating an exponentially growing volume of data, at high velocity. Kryder’s law predicts that the density and capability of hard drive storage media will double every 18 months. As storage costs keep coming down at a rapid rate, there is a greater incentive to record and store more events and activities at a higher resolution. Data is getting stored in more detailed resolution, and many more variables are being captured and stored.
Database
A database is a modeled collection of data that is accessible in many ways. A data model can be designed to integrate the operational data of the organization. The data model abstracts the key entities involved in an action and their relationships. Most databases today follow the relational data model and its variants. Each data modeling technique imposes rigorous rules and constraints to ensure the integrity and consistency of data over time.
Take the example of a sales organization. A data model for managing customer orders will involve data about customers, orders, products, and their interrelationships. The relationship between the customers and orders would be such that one customer can place many orders, but one order will be placed by one and only one customer. It is called a one-to-many relationship. The relationship between orders and products is a little more complex. One order may contain many products. And one product may be contained in many different orders. This is called a many-to-many relationship. Different types of relationships can be modeled in a database.
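The two relationship types can be sketched in a relational schema. The example below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample rows are all hypothetical, chosen only to illustrate the idea: a foreign key on orders captures the one-to-many relationship, and a linking table captures the many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One-to-many: each order points to exactly one customer.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
-- Many-to-many: a linking table pairs orders with products.
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    PRIMARY KEY (order_id, product_id)
);
-- Hypothetical sample rows.
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO products  VALUES (1, 'Monty Python'), (2, 'Matrix');
INSERT INTO orders    VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO order_items VALUES (10, 1), (10, 2), (11, 1), (12, 2);
""")

# One customer can have many orders.
orders_per_customer = dict(conn.execute(
    "SELECT c.name, COUNT(*) FROM customers c "
    "JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.name"))

# One product can appear in many orders, and one order can hold many products.
orders_per_product = dict(conn.execute(
    "SELECT p.name, COUNT(*) FROM products p "
    "JOIN order_items i ON i.product_id = p.product_id GROUP BY p.name"))

print(orders_per_customer, orders_per_product)
```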
Databases have grown tremendously over time. They have grown in complexity in terms of the number of objects and their properties being recorded. They have also grown in the quantity of data being stored. A decade ago, a terabyte-sized database was considered big. Today databases are in petabytes and exabytes. Video and other media files have greatly contributed to the growth of databases. E-commerce and other web-based activities also generate huge amounts of data. Data generated through social media has also created large databases. The e-mail archives, including attached documents of organizations, are in similar large sizes.
Many database management software systems (DBMSs) are available to help store and manage this data. These include commercial systems, such as Oracle and DB2. There are also open-source, free DBMSs, such as MySQL and Postgres. These DBMSs help process and store millions of transactions' worth of data every second.
Here is a simple database of the sales of movies worldwide for a retail organization. It shows sales transactions of movies over three quarters. Using such a file, data can be added, accessed, and updated as needed.
Movies Transactions Database
Order # | Date sold  | Product name       | Location | Amount
1       | April 2015 | Monty Python       | US       | $9
2       | May 2015   | Gone With the Wind | US       | $15
3       | June 2015  | Monty Python       | India    | $9
4       | June 2015  | Monty Python       | UK       | $12
5       | July 2015  | Matrix             | US       | $12
6       | July 2015  | Monty Python       | US       | $12
7       | July 2015  | Gone With the Wind | US       | $15
8       | Aug 2015   | Matrix             | US       | $12
9       | Sept 2015  | Matrix             | India    | $12
10      | Sept 2015  | Monty Python       | US       | $9
11      | Sept 2015  | Gone With the Wind | US       | $15
12      | Sept 2015  | Monty Python       | India    | $9
13      | Nov 2015   | Gone With the Wind | US       | $15
14      | Dec 2015   | Monty Python       | US       | $9
15      | Dec 2015   | Monty Python       | US       | $9
Data Warehouse
A data warehouse is an organized store of data from all over the organization, specially designed to help make management decisions. Data can be extracted from operational databases to answer a particular set of queries. This data, combined with other data, can be rolled up to a consistent granularity and uploaded to a separate data store called the data warehouse. Therefore, the data warehouse is a simpler version of the operational database, with the purpose of addressing reporting and decision-making needs only. The data in the warehouse cumulatively grows as more operational data becomes available and is extracted and appended to the data warehouse. Unlike in the operational database, the data values in the warehouse are not updated.
To create a simple data warehouse for the movies sales data, assume a simple objective of tracking sales of movies and making decisions about managing inventory. In creating this data warehouse, all the sales transaction data will be extracted from the operational data files. The data will be rolled up for all combinations of time period and product number. Thus, there will be one row for every combination of time period and product. The resulting data warehouse will look like the table that follows.
Movies Sales Data Warehouse
Row # | Qtr sold | Product name       | Amount
1     | Q2       | Gone With the Wind | $15
2     | Q2       | Monty Python       | $30
3     | Q3       | Gone With the Wind | $30
4     | Q3       | Matrix             | $36
5     | Q3       | Monty Python       | $30
6     | Q4       | Gone With the Wind | $15
7     | Q4       | Monty Python       | $18
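The roll-up from transactions to warehouse rows can be sketched in a few lines. This is an illustrative sketch only: the transaction tuples are copied from the Movies Transactions Database table, while the names TRANSACTIONS, QUARTER, and roll_up are hypothetical.

```python
from collections import defaultdict

# Sales transactions copied from the Movies Transactions Database table:
# (order number, month sold in 2015, product name, location, amount in dollars)
TRANSACTIONS = [
    (1, "April", "Monty Python", "US", 9),
    (2, "May", "Gone With the Wind", "US", 15),
    (3, "June", "Monty Python", "India", 9),
    (4, "June", "Monty Python", "UK", 12),
    (5, "July", "Matrix", "US", 12),
    (6, "July", "Monty Python", "US", 12),
    (7, "July", "Gone With the Wind", "US", 15),
    (8, "Aug", "Matrix", "US", 12),
    (9, "Sept", "Matrix", "India", 12),
    (10, "Sept", "Monty Python", "US", 9),
    (11, "Sept", "Gone With the Wind", "US", 15),
    (12, "Sept", "Monty Python", "India", 9),
    (13, "Nov", "Gone With the Wind", "US", 15),
    (14, "Dec", "Monty Python", "US", 9),
    (15, "Dec", "Monty Python", "US", 9),
]

# Calendar quarter for each month that appears in the data.
QUARTER = {"April": "Q2", "May": "Q2", "June": "Q2",
           "July": "Q3", "Aug": "Q3", "Sept": "Q3",
           "Nov": "Q4", "Dec": "Q4"}


def roll_up(transactions):
    """Sum the sales amount for every (quarter, product) combination."""
    totals = defaultdict(int)
    for _order, month, product, _location, amount in transactions:
        totals[(QUARTER[month], product)] += amount
    # Sort by quarter, then product, to match the warehouse table layout.
    return dict(sorted(totals.items()))


warehouse = roll_up(TRANSACTIONS)
for row, ((quarter, product), amount) in enumerate(warehouse.items(), start=1):
    print(row, quarter, product, f"${amount}")
```

Location is dropped during the roll-up, which is why the warehouse has fewer columns as well as fewer rows than the transaction database.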
The data in the data warehouse is at much less detail than the transaction database. The data warehouse could have been designed at a lower or higher level of detail, or granularity. If the data warehouse were designed on a monthly level, instead of a quarterly level, there would be many more rows of data. When the number of transactions approaches millions and higher, with dozens of attributes in each transaction, the data warehouse can be large and rich with potential insights. One can then mine the data (slice and dice) in many different ways and discover unique meaningful patterns. Aggregating the data helps improve the speed of analysis. A separate data warehouse allows analysis to go on separately in parallel, without burdening the operational database systems (Table 1.1).
Function | Database | Data Warehouse
Purpose | Data stored in databases can be used for many purposes including day-to-day operations | Data stored in DW is cleansed data useful for reporting and analysis
Granularity | Highly granular data including all activity and transaction details | Lower granularity data; rolled up to certain key dimensions of interest
Complexity | Highly complex with dozens or hundreds of data files, linked through common data fields | Typically organized around large fact tables and many lookup tables
Size | Database grows with growing volumes of activity and transactions. Old completed transactions are deleted to reduce size. | Grows as data from operational databases is rolled up and appended every day. Data is retained for long-term trend analyses
Architectural choices | Relational, and object-oriented, databases | Star schema, or Snowflake schema
Data access mechanisms | Primarily through high-level languages such as SQL. Traditional programs access the DB through Open Database Connectivity (ODBC) interfaces | Accessed through SQL; SQL output is forwarded to reporting tools and data visualization tools
Table 1.1: Comparing Database systems with Data Warehousing systems
Data Mining
Data Mining is the art and science of discovering useful innovative patterns from data. There is a wide variety of patterns that can be found in the data. There are many techniques, simple or complex, that help with finding patterns.
In this example, a simple data analysis technique can be applied to the data in the data warehouse above. A simple cross-tabulation of results by quarter and products will reveal some easily visible patterns.
Movies Sales by Quarters – Cross-tabulation
Qtr/Product        | Gone With the Wind | Matrix | Monty Python | Total Sales Amount
Q2                 | $15                | 0      | $30          | $45
Q3                 | $30                | $36    | $30          | $96
Q4                 | $15                | 0      | $18          | $33
Total Sales Amount | $60                | $36    | $78          | $174
Based on the cross-tabulation above, one can readily answer some product sales questions, like:
1. What is the best selling movie by revenue? – Monty Python.
2. What is the best quarter by revenue this year? – Q3
3. Any other patterns? – Matrix movie sells only in Q3 (seasonal item).
These simple insights can help plan marketing promotions and manage inventory of various movies.
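A cross-tabulation like this is simple to compute. The sketch below starts from the seven rows of the Movies Sales Data Warehouse table; the variable names are hypothetical, and treating a product sold in only one quarter as "seasonal" is an assumption made for illustration.

```python
# Rows copied from the Movies Sales Data Warehouse table:
# (quarter, product name, amount in dollars)
WAREHOUSE = [
    ("Q2", "Gone With the Wind", 15),
    ("Q2", "Monty Python", 30),
    ("Q3", "Gone With the Wind", 30),
    ("Q3", "Matrix", 36),
    ("Q3", "Monty Python", 30),
    ("Q4", "Gone With the Wind", 15),
    ("Q4", "Monty Python", 18),
]

quarters = sorted({quarter for quarter, _, _ in WAREHOUSE})
products = sorted({product for _, product, _ in WAREHOUSE})

# Cross-tabulation: one row per quarter, one column per product, 0 where nothing sold.
crosstab = {q: {p: 0 for p in products} for q in quarters}
for quarter, product, amount in WAREHOUSE:
    crosstab[quarter][product] += amount

# Row and column totals.
quarter_totals = {q: sum(crosstab[q].values()) for q in quarters}
product_totals = {p: sum(crosstab[q][p] for q in quarters) for p in products}

best_movie = max(product_totals, key=product_totals.get)
best_quarter = max(quarter_totals, key=quarter_totals.get)
# Assumption: a product that sold in exactly one quarter is a seasonal candidate.
seasonal = [p for p in products if sum(crosstab[q][p] > 0 for q in quarters) == 1]

print(best_movie, best_quarter, seasonal)
```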
If a cross-tabulation was designed to include customer location data, one could answer other questions, such as:
1. What is the best selling geography? – US
2. What is the worst selling geography? – UK
3. Any other patterns? – Monty Python sells globally, while Gone With the Wind sells only in the US.
If the data mining was done at the monthly level of data, it would be easy to miss the seasonality of the movies. However, one would have observed that September is the highest selling month.
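Both the location view and the monthly view have to be computed from the transaction-level data, because the quarterly warehouse no longer holds location or month. A minimal sketch follows, using the rows of the Movies Transactions Database table; the variable names are hypothetical.

```python
from collections import Counter

# Copied from the Movies Transactions Database table:
# (month sold in 2015, product name, location, amount in dollars)
TRANSACTIONS = [
    ("April", "Monty Python", "US", 9),
    ("May", "Gone With the Wind", "US", 15),
    ("June", "Monty Python", "India", 9),
    ("June", "Monty Python", "UK", 12),
    ("July", "Matrix", "US", 12),
    ("July", "Monty Python", "US", 12),
    ("July", "Gone With the Wind", "US", 15),
    ("Aug", "Matrix", "US", 12),
    ("Sept", "Matrix", "India", 12),
    ("Sept", "Monty Python", "US", 9),
    ("Sept", "Gone With the Wind", "US", 15),
    ("Sept", "Monty Python", "India", 9),
    ("Nov", "Gone With the Wind", "US", 15),
    ("Dec", "Monty Python", "US", 9),
    ("Dec", "Monty Python", "US", 9),
]

sales_by_location = Counter()
sales_by_month = Counter()
locations_by_product = {}
for month, product, location, amount in TRANSACTIONS:
    sales_by_location[location] += amount
    sales_by_month[month] += amount
    # Track the set of locations where each product has sold.
    locations_by_product.setdefault(product, set()).add(location)

best_location = max(sales_by_location, key=sales_by_location.get)
worst_location = min(sales_by_location, key=sales_by_location.get)
best_month = max(sales_by_month, key=sales_by_month.get)

print(best_location, worst_location, best_month)
print(locations_by_product)
```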
The previous example shows that many differences and patterns can be noticed by analyzing data in different ways. However, some insights are more important than others. The value of the insight depends upon the problem being solved. The insight that there are more sales of a product in a certain quarter helps a manager plan what products to focus on. In this case, the store manager should stock up on Matrix in Quarter 3 (Q3). Similarly, knowing which quarter has the highest overall sales allows for different resource decisions in that quarter. In this case, if Q3 is bringing more than half of total sales, this requires greater attention on the e-commerce website in the third quarter.
Data mining should be done to solve high-priority, high-value problems. Much effort is required to gather data, clean and organize it, mine it with many techniques, interpret the results, and find the right insight. It is important that there be a large expected payoff from finding the insight. One should select the right data (and ignore the rest), organize it into a nice and imaginative framework that brings relevant data together, and then apply data mining techniques to deduce the right insight.
A retail company may use data mining techniques to determine which new product categories to add to which of their stores; how to increase sales of existing products; which new locations to open stores in; how to segment the customers for more effective communication; and so on.
Data can be analyzed at multiple levels of granularity and could lead to a large number of interesting combinations of data and interesting patterns. Some of the patterns may be more meaningful than the others. Such highly granular data is often used, especially in finance and high-tech areas, so that one can gain even the slightest edge over the competition.
Here are brief descriptions of some of the most important data mining techniques used to generate insights from data.
Decision Trees: They help classify populations into classes. It is said that 70% of all data mining work is about classification solutions; and that 70% of all classification work uses decision trees. Thus, decision trees are the most popular and important data mining technique. There are many popular algorithms to make decision trees. They differ in terms of their mechanisms, and each technique works well for different situations. It is possible to try multiple decision-tree algorithms on a data set and compare the predictive accuracy of each tree.
Regression: This is a well-understood technique from the field of statistics. The goal is to find a best fitting curve through the many data points. The best fitting curve is that which minimizes the (error) distance between the actual data points and the values predicted by the curve. Regression models can be projected into the future for prediction and forecasting purposes.
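As a small illustration of a best fitting curve, the sketch below fits a straight line by ordinary least squares, which is one common way to measure the error distance: it minimizes the sum of squared vertical differences between actual and predicted values. The function name fit_line and the monthly sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales, invented for illustration only.
months = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 22, 27]

slope, intercept = fit_line(months, sales)
# Project the fitted line one month into the future.
forecast_month_6 = slope * 6 + intercept
print(f"sales = {slope:.2f} * month + {intercept:.2f}; month 6 forecast: {forecast_month_6:.1f}")
```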
Artificial Neural Networks: Originating in the field of artificial intelligence and machine learning, ANNs are multi-layer non-linear information processing models that learn from past data and predict future values. These models predict well, leading to their popularity. The model’s parameters may not be very intuitive. Thus, neural networks are opaque like a black-box. These systems also require a large amount of past data to adequately train the system.
Cluster analysis: This is an important data mining technique for dividing and conquering large data sets. The data set is divided into a certain number of clusters, by discerning similarities and dissimilarities within the data. There is no one right answer for the number of clusters in the data. The user needs to make a decision by looking at how well the number of clusters chosen fit the data. This is most commonly used for market segmentation. Unlike decision trees and regression, there is no one right answer for cluster analysis.
Association Rule Mining: Also called Market Basket Analysis when used in retail industry, these techniques look for associations between data values. An analysis of items frequently found together in a market basket can help cross-sell products, and also create product bundles.
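The idea of items frequently found together can be sketched by counting how often each pair of items shares a basket. This is only a first step toward association rules, not a full rule-mining algorithm, and the baskets below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk", "jam"},
]

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in BASKETS:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
# Fraction of all baskets that contain the most common pair.
share_of_baskets = count / len(BASKETS)

print(most_common_pair, count, share_of_baskets)
```

A pair that shows up in a large share of baskets is a natural candidate for cross-selling or bundling.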
Data Visualization
As data and insights grow in number, a new requirement is the ability of the executives and decision makers to absorb this information in real time. There is a limit to human comprehension and visualization capacity. That is a good reason to prioritize and manage with fewer but key variables that relate directly to the Key Result Areas (KRAs) of a role.
Here are a few considerations when presenting data:
1. Present the conclusions and not just report the data.
2. Choose wisely from a palette of graphs to suit the data.
3. Organize the results to make the central point stand out.
4. Ensure that the visuals accurately reflect the numbers. Inappropriate visuals can create misinterpretations and misunderstandings.
5. Make the presentation unique, imaginative and memorable.
Executive dashboards are designed to provide information on a select few variables for every executive. They use graphs, dials, and lists to show the status of important parameters. These dashboards also have a drill-down capability to enable a root-cause analysis of exception situations (Figure 1.3).
Figure 1.3: Sample Executive Dashboard
Data visualization has been an interesting problem across the disciplines. Many dimensions of data can be effectively displayed on a two-dimensional surface to give a rich and more insightful description of the totality of the story.
The classic presentation of the story of Napoleon's march to Russia in 1812, by French cartographer Joseph Minard, is shown in Figure 1.4. It covers about six dimensions. Time is on the horizontal axis. The geographical coordinates and rivers are mapped in. The thickness of the bar shows the number of troops at any point of time that is mapped. One color is used for the onward march and another for the retreat. The weather temperature at each time is shown in the line graph at the bottom.
Figure 1.4: Sample Data Visualization
Organization of the book
This chapter is designed to provide the wholeness of business intelligence and data mining, to provide the reader with an intuition for this area of knowledge. The rest of the book can be considered in three sections.
Section 1 will cover high level topics. Chapter 2 will cover the field of business intelligence and its applications across industries and functions. Chapter 3 will briefly explain what data warehousing is and how it helps with data mining. Chapter 4 will then describe data mining in some detail with an overview of its major tools and techniques.
Section 2 is focused on data mining techniques. Every technique will be shown through solving an example in detail. Chapter 5 will show the power and ease of decision trees, which are the most popular data mining technique. Chapter 6 will describe statistical regression modeling techniques. Chapter 7 will provide an overview of artificial neural networks, a versatile machine learning technique. Chapter 8 will describe how Cluster Analysis can help with market segmentation. Finally, chapter 9 will describe the Association Rule Mining technique, also called Market Basket Analysis, that helps find shopping patterns.
Section 3 will cover more advanced new topics. Chapter 10 will introduce the concepts and techniques of Text Mining, which helps discover insights from text data including social media data. Chapter 11 will provide an overview of the growing field of web mining, which includes mining the structure, content and usage of web sites. Chapter 12 will provide an overview of the recent field of Big Data. Chapter 13 has been added as a primer on Data Modeling, for those who do not have any background in databases, and should be used if necessary.
Review Questions
1: Describe the Business Intelligence and Data Mining cycle.
2: Describe the data processing chain.
3: What are the similarities between diamond mining and data mining?
4: What are the different data mining techniques? Which of these would be relevant in your current work?
5: What is a dashboard? How does it help?
6: Create a visual to show the weather pattern in your city. Could you show together temperature, humidity, wind, and rain/snow over a period of time?
Section 1
This section covers three important high-level topics.
Chapter 2 will cover business intelligence concepts, and its applications in many industries.
Chapter 3 will describe data warehousing systems, and ways of creating and managing them.
Chapter 4 will describe data mining as a whole, its many techniques, and many do's and don'ts of effective data mining.
Chapter 5 will describe data visualization as a whole, with techniques and examples, and with many rules of thumb for effective data visualizations.
Chapter 2: Business Intelligence Concepts and Applications
Business intelligence (BI) is an umbrella term that includes a variety of IT applications that are used to analyze an organization's data and communicate the information to relevant users (Figure 2.1).
Figure 2.1: BIDM cycle
The nature of life and businesses is to grow. Information is the life-blood of business. Businesses use many techniques for understanding their environment and predicting the future for their own benefit and growth. Decisions are made from facts and feelings. Data-based decisions are more effective than those based on feelings alone. Actions based on accurate data, information, knowledge, experimentation, and testing, using fresh insights, can more likely succeed and lead to sustained growth. One's own data can be the most effective teacher. Therefore, organizations should gather data, sift through it, analyze and mine it, find insights, and then embed those insights into their operating procedures.
There is a new sense of importance and urgency around data as it is being viewed as a new natural resource. It can be mined for value, insights, and competitive advantage. In a hyperconnected world, where everything is potentially connected to everything else, with potentially infinite correlations, data represents the impulses of nature in the form of certain events and attributes. A skilled business person is motivated to use this cache of data to harness nature, and to find new niches of unserved opportunities that could become profitable ventures.
Caselet: Khan Academy – BI in Education
Khan Academy is an innovative non-profit educational organization that is turning the K-12 education system upside down. It provides short YouTube-based video lessons on thousands of topics for free. It shot into prominence when Bill Gates promoted it as a resource that he used to teach his own children. With this kind of a resource, classrooms are being flipped, i.e. students do their basic lecture-type learning at home using those videos, while the class time is used for more one-on-one problem solving and coaching. Students can access the lessons at any time to learn at their own pace. The students' progress is recorded, including what videos they watched how many times, which problems they stumbled on, and what scores they got on online tests.
Khan Academy has developed tools to help teachers get a pulse on what's happening in the classroom. Teachers are provided a set of real-time dashboards to give them information from the macro level ("How is my class doing on geometry?") to the micro level ("How is Jane doing on mastering polygons?"). Armed with this information, teachers can place selective focus on the students that need certain help.
(Source: KhanAcademy.org)
Q1: How does a dashboard improve the teaching experience? And the student's learning experience?
Q2: Design a dashboard for tracking your own career.
BI for better decisions
The future is inherently uncertain. Risk is the result of a probabilistic world where there are no certainties and complexities abound. People use crystal balls, astrology, palmistry, groundhogs, and also mathematics and numbers to mitigate risk in decision-making. The goal is to make effective decisions, while reducing risk. Businesses calculate risks and make decisions based on a broad set of facts and insights. Reliable knowledge about the future can help managers make the right decisions with lower levels of risk.
The speed of action has risen exponentially with the growth of the Internet. In a hypercompetitive world, the speed of a decision and the consequent action can be a key advantage. The Internet and mobile technologies allow decisions to be made anytime, anywhere. Ignoring fast-moving changes can threaten the organization's future. Research has shown that an unfavorable comment about the company and its products on social media should not go unaddressed for long. Banks have had to pay huge penalties to the Consumer Financial Protection Bureau (CFPB) in the United States in 2013 for complaints made on CFPB's websites. On the other hand, a positive sentiment expressed on social media should also be utilized as a potential sales and promotion opportunity, while the opportunity lasts.
Decision types
There are two main kinds of decisions: strategic decisions and operational decisions. BI can help make both better. Strategic decisions are those that impact the direction of the company. The decision to reach out to a new customer set would be a strategic decision. Operational decisions are more routine and tactical decisions, focused on developing greater efficiency. Updating an old website with new features will be an operational decision.
In strategic decision-making, the goal itself may or may not be clear, and the same is true for the path to reach the goal. The consequences of the decision would be apparent some time later. Thus, one is constantly scanning for new possibilities and new paths to achieve the goals. BI can help with what-if analysis of many possible scenarios. BI can also help create new ideas based on new patterns found from data mining.
Operational decisions can be made more efficient using an analysis of past data. A classification system can be created and modeled using the data of past instances to develop a good model of the domain. This model can help improve operational decisions in the future. BI can help automate operations level decision-making and improve efficiency by making millions of microlevel operational decisions in a model-driven way. For example, a bank might want to make decisions about making financial loans in a more scientific way using data-based models. A decision-tree-based model could provide consistently accurate loan decisions. Developing such decision tree models is one of the main applications of data mining techniques.
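Once a decision tree has been learned from past data, applying it is just a sequence of simple tests. The sketch below shows what a loan decision tree might look like when written out by hand. The attributes, thresholds, and outcomes are hypothetical and are not derived from any real data.

```python
def loan_decision(income, credit_score, debt_ratio):
    """Walk a small, hand-written decision tree for a loan application.

    All thresholds are illustrative assumptions, not learned values.
    """
    # Root test: a poor credit score leads straight to rejection.
    if credit_score < 600:
        return "reject"
    # Second test: a heavy existing debt load needs a closer look.
    if debt_ratio > 0.4:
        # High income can offset high debt; otherwise refer to a human.
        return "approve" if income >= 100000 else "refer"
    return "approve"


# Every application with the same attributes gets the same decision.
print(loan_decision(income=50000, credit_score=550, debt_ratio=0.2))
print(loan_decision(income=50000, credit_score=700, debt_ratio=0.5))
print(loan_decision(income=120000, credit_score=700, debt_ratio=0.5))
```

The consistency comes from the fact that the same rules are applied to every application, regardless of who processes it.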
Effective BI has an evolutionary component, as business models evolve. When people and organizations act, new facts (data) are generated. Current business models can be tested against the new data, and it is possible that those models will not hold up well. In that case, decision models should be revised and new insights should be incorporated. An unending process of generating fresh new insights in real time can help make better decisions, and thus can be a significant competitive advantage.
BI Tools
BI includes a variety of software tools and techniques to provide the managers with the information and insights needed to run the business. Information can be provided about the current state of affairs with the capability to drill down into details, and also insights about emerging patterns which lead to projections into the future. BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining.
BI tools can range from very simple tools that could be considered end-user tools, to very sophisticated tools that offer a very broad and complex set of functionality. Thus, even executives can be their own BI experts, or they can rely on BI specialists to set up the BI mechanisms for them. Large organizations invest in expensive sophisticated BI solutions that provide good information in real time.
A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. This system offers limited automation using macros and other features. The analytical features include basic statistical and financial functions. Pivot tables help do sophisticated what-if analysis. Add-on modules can be installed to enable moderately sophisticated statistical analysis.
A dashboarding system, such as IBM Cognos or Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. At the user end, modular dashboards can be designed and redesigned easily with a graphical user interface. The back-end data analytical capabilities include many statistical functions. The dashboards are linked to data warehouses at the back end to ensure that the tables and graphs and other elements of the dashboard are updated in real time (Figure 2.2).
Figure 2.2: Sample Executive Dashboard
Data mining systems, such as IBM SPSS Modeler, are industrial strength systems that provide capabilities to apply a wide range of analytical models on large data sets. Open source systems, such as Weka, are popular platforms designed to help mine large amounts of data to discover patterns.
BI Skills
As data grows and exceeds our capacity to make sense of it, the tools need to evolve, and so should the imagination of the BI specialist. “Data Scientist” has been called the hottest job of this decade.
A skilled and experienced BI specialist should be open enough to go outside the box, open the aperture and see a wider perspective that includes more dimensions and variables, in order to find important patterns and insights. The problem needs to be looked at from a wider perspective to consider many more angles that may not be immediately obvious. An imaginative solution should be proposed for the problem so that interesting and useful results can emerge.
A good data mining project begins with an interesting problem to solve. Selecting the right data mining problem is an important skill. The problem should be valuable enough that solving it would be worth the time and expense. It takes a lot of time and energy to gather, organize, cleanse, and prepare the data for mining and other analysis. The data miner needs to persist with the exploration of patterns in the data. The skill level has to be deep enough to engage with the data and make it yield new useful insights.
BI Applications
BI tools are required in almost all industries and functions. The nature of the information and the speed of action may be different across businesses, but every manager today needs access to BI tools to have up-to-date metrics about business performance. Businesses need to embed new insights into their operating processes to ensure that their activities continue to evolve with more efficient practices. The following are some areas of applications of BI and data mining.
Customer Relationship Management
A business exists to serve a customer. A happy customer becomes a repeat customer. A business should understand the needs and sentiments of the customer, sell more of its offerings to the existing customers, and also expand the pool of customers it serves. BI applications can impact many aspects of marketing.
1. Maximize the return on marketing campaigns: Understanding the customer's pain points from data-based analysis can ensure that the marketing messages are fine-tuned to better resonate with customers.
2. Improve customer retention (churn analysis): It is more difficult and expensive to win new customers than it is to retain existing customers. Scoring each customer on their likelihood to quit can help the business design effective interventions, such as discounts or free services, to retain profitable customers in a cost-effective manner.
3. Maximize customer value (cross-, up-selling): Every contact with the customer should be seen as an opportunity to gauge their current needs. Offering a customer new products and solutions based on those imputed needs can help increase revenue per customer. Even a customer complaint can be seen as an opportunity to wow the customer. Using the knowledge of the customer's history and value, the business can choose to sell a premium service to the customer.
4. Identify and delight highly-valued customers: By segmenting the customers, the best customers can be identified. They can be proactively contacted, and delighted, with greater attention and better service. Loyalty programs can be managed more effectively.
5. Manage brand image: A business can create a listening post to listen to social media chatter about itself. It can then do sentiment analysis of the text to understand the nature of comments, and respond appropriately to the prospects and customers.
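The churn analysis item in the list above can be sketched as a simple targeting rule: make a retention offer only to customers who are likely to quit and whose expected lost profit exceeds the cost of the offer. The customer records, the churn probabilities (assumed to come from some previously built scoring model), and the thresholds are all hypothetical.

```python
# Hypothetical customers: (name, estimated churn probability, annual profit).
customers = [
    ("A", 0.80, 100.0),
    ("B", 0.10, 900.0),
    ("C", 0.60, 500.0),
    ("D", 0.70, 20.0),
]


def retention_targets(customers, offer_cost, min_churn_prob=0.5):
    """Pick customers worth a retention offer, largest expected loss first."""
    targets = []
    for name, churn_prob, annual_profit in customers:
        # Expected profit lost if nothing is done about this customer.
        expected_loss = churn_prob * annual_profit
        if churn_prob >= min_churn_prob and expected_loss > offer_cost:
            targets.append((name, expected_loss))
    return sorted(targets, key=lambda t: t[1], reverse=True)


targets = retention_targets(customers, offer_cost=50.0)
print(targets)
```

Customer B is profitable but unlikely to leave, and customer D is likely to leave but not profitable enough to justify the offer, so neither is targeted.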
Healthcare and Wellness
Health care is one of the biggest sectors in advanced economies. Evidence-based medicine is the newest trend in data-based health care management. BI applications can help apply the most effective diagnoses and prescriptions for various ailments. They can also help manage public health issues, and reduce waste and fraud.
1. Diagnose disease in patients: Diagnosing the cause of a medical condition is the critical first step in a medical engagement. Accurately diagnosing cases of cancer or diabetes can be a matter of life and death for the patient. In addition to the patient's own current situation, many other factors can be considered, including the patient's health history, medication history, family's history, and other environmental factors. This makes diagnosis as much of an art form as it is science. Systems, such as IBM Watson, absorb all the medical research to date and make probabilistic diagnoses in the form of a decision tree, along with a full explanation for their recommendations. These systems take away most of the guesswork done by doctors in diagnosing ailments.
2. Treatment effectiveness: The prescription of medication and treatment is also a difficult choice out of so many possibilities. For example, there are more than 100 medications for hypertension (high blood pressure) alone. There are also interactions in terms of which drugs work well with others and which drugs do not. Decision trees can help doctors learn about and prescribe more effective treatments. Thus, the patients could recover their health faster with a lower risk of complications and cost.
3. Wellness management: This includes keeping track of patient health records, analyzing customer health trends and proactively advising them to take any needed precautions.
4. Manage fraud and abuse: Some medical practitioners have unfortunately been found to conduct unnecessary tests, and/or overbill the government and health insurance companies. Exception reporting systems can identify such providers and action can be taken against them.
5. Public health management: The management of public health is one of the important responsibilities of any government. By using effective forecasting tools and techniques, governments can better predict the onset of disease in certain areas in real time. They can thus be better prepared to fight the diseases. Google has been known to predict the movement of certain diseases by tracking the search terms (like flu, vaccine) used in different parts of the world.
Education
As higher education becomes more expensive and competitive, it becomes a great user of data-based decision-making. There is a strong need for efficiency, increasing revenue, and improving the quality of student experience at all levels of education.
1. Student Enrollment (Recruitment and Retention): Marketing to new potential students requires schools to develop profiles of the students that are most likely to attend. Schools can develop models of what kinds of students are attracted to the school, and then reach out to those students. The students at risk of not returning can be flagged, and corrective measures can be taken in time.
2. Course offerings: Schools can use the class enrollment data to develop models of which new courses are likely to be more popular with students. This can help increase class size, reduce costs, and improve student satisfaction.
3. Fund-raising from Alumni and other donors: Schools can develop predictive models of which alumni are most likely to pledge financial support to the school. Schools can create a profile for alumni more likely to pledge donations to the school. This could lead to a reduction in the cost of mailings and other forms of outreach to alumni.
Retail
Retail organizations grow by meeting customer needs with quality products, in a convenient, timely, and cost-effective manner. Understanding emerging customer shopping patterns can help retailers organize their products, inventory, store layout, and web presence in order to delight their customers, which in turn would help increase revenue and profits. Retailers generate a lot of transaction and logistics data that can be used to diagnose and solve problems.
1. Optimize inventory levels at different locations: Retailers need to manage their inventories carefully. Carrying too much inventory imposes carrying costs, while carrying too little inventory can cause stock-outs and lost sales opportunities. Predicting sales trends dynamically can help retailers move inventory to where it is most in demand. Retail organizations can provide their suppliers with real time information about sales of their items, so the suppliers can deliver their product to the right locations and minimize stock-outs.
2. Improve store layout and sales promotions: A market basket analysis can develop predictive models of which products sell together often. This knowledge of affinities between products can help retailers co-locate those products. Alternatively, those affinity products could be located farther apart to make the customer walk the length and breadth of the store, and thus be exposed to other products. Promotional discounted product bundles can be created to push a nonselling item along with a set of products that sell well together.
3. Optimize logistics for seasonal effects: Seasonal products offer tremendously profitable short-term sales opportunities, yet they also offer the risk of unsold inventories at the end of the season. Understanding which products are in season in which market can help retailers dynamically manage prices to ensure their inventory is sold during the season. If it is raining in a certain area, then the inventory of umbrellas and ponchos could be rapidly moved there from nonrainy areas to help increase sales.
4. Minimize losses due to limited shelf life: Perishable goods offer challenges in terms of disposing of the inventory in time. By tracking sales trends, the perishable products at risk of not selling before the sell-by date can be suitably discounted and promoted.
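The shelf-life item in the list above boils down to simple arithmetic: compare the stock on hand with the sales expected before the sell-by date. The sketch below assumes, purely for illustration, that sales continue at the recent average daily rate. The function name and the numbers are hypothetical.

```python
def units_at_risk(stock, avg_daily_sales, days_to_sell_by):
    """Units unlikely to sell before the sell-by date at the current pace."""
    expected_sales = avg_daily_sales * days_to_sell_by
    return max(0, stock - expected_sales)


# Hypothetical product: 100 units in stock, selling 10 a day, 4 days left.
at_risk = units_at_risk(stock=100, avg_daily_sales=10, days_to_sell_by=4)
print(at_risk)

# A product with 30 units in stock at the same pace needs no discount.
print(units_at_risk(stock=30, avg_daily_sales=10, days_to_sell_by=4))
```

Products with a large number of units at risk are the candidates for discounts and promotion.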
Banking
Banks make loans and offer credit cards to millions of customers. They are most interested in improving the quality of loans and reducing bad debts. They also want to retain more good customers, and sell more services to them.
1. Automate the loan application process: Decision models can be generated from past data that predict the likelihood of a loan proving successful. These can be inserted in business processes to automate the financial loan approval process.
2. Detect fraudulent transactions: Billions of financial transactions happen around the world every day. Exception-seeking models can identify patterns of fraudulent transactions. For example, if money is being transferred to an unrelated account for the first time, it could be a fraudulent transaction.
3. Maximize customer value (cross-, up-selling): Selling more products and services to existing customers is often the easiest way to increase revenue. A checking account customer in good standing could be offered home, auto, or educational loans on more favorable terms than other customers, and thus, the value generated from that customer could be increased.
4. Optimize cash reserves with forecasting: Banks have to maintain certain liquidity to meet the needs of depositors who may like to withdraw money. Using past data and trend analysis, banks can forecast how much to keep and invest the rest to earn interest.
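The cash reserve item in the list above can be illustrated with a very simple forecast: a moving average of recent withdrawals, plus a safety margin. The withdrawal history, the window, and the margin are hypothetical, and a real bank would use far richer models and regulatory constraints.

```python
def forecast_withdrawals(history, window=3):
    """Forecast next period's withdrawals as the mean of the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def split_cash(total_cash, history, safety_margin=1.25):
    """Return (reserve to keep, amount free to invest)."""
    # Keep the forecast amount plus a cushion, but never more than is available.
    reserve = min(total_cash, forecast_withdrawals(history) * safety_margin)
    return reserve, total_cash - reserve


# Hypothetical weekly withdrawals, in thousands (illustrative only).
weekly_withdrawals = [100, 120, 110, 130, 120]

reserve, investable = split_cash(1000, weekly_withdrawals)
print(reserve, investable)
```

A larger safety margin lowers the risk of running short of cash but leaves less money earning interest.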
Financial Services
Stock brokerages are an intensive user of BI systems. Fortunes can be made or lost based on access to accurate and timely information.
1. Predict changes in bond and stock prices: Forecasting the price of stocks and bonds is a favorite pastime of financial experts as well as lay people. Stock transaction data from the past, along with other variables, can be used to predict future price patterns. This can help traders develop long-term trading strategies.
2. Assess the effect of events on market movements: Decision models using decision trees can be created to assess the impact of events on changes in market volume and prices. Monetary policy changes (such as a Federal Reserve interest rate change) or geopolitical changes (such as war in a part of the world) can be factored into the predictive model to help take action with greater confidence and less risk.
3. Identify and prevent fraudulent activities in trading: There have unfortunately been many cases of insider trading, leading to many prominent financial industry stalwarts going to jail. Fraud detection models seek out-of-the-ordinary activities, and help identify and flag fraudulent activity patterns.
Insurance
This industry is a prolific user of prediction models in pricing insurance proposals and managing losses from claims against insured assets.
1. Forecast claim costs for better business planning: When natural disasters, such as hurricanes and earthquakes, strike, loss of life and property occurs. By using the best available data to model the likelihood (or risk) of such events happening, the insurer can plan for losses and manage resources and profits effectively.
2. Determine optimal rate plans: Pricing an insurance rate plan requires covering the potential losses and making a profit. Insurers use actuarial tables to project life spans and disease tables to project mortality rates, and thus price themselves competitively yet profitably.
3. Optimize marketing to specific customers: By micro-segmenting potential customers, a data-savvy insurer can cherry pick the best customers and leave the less profitable customers to its competitors. Progressive Insurance is a US-based company that is known to actively use data mining to cherry pick customers and increase its profitability.
4. Identify and prevent fraudulent claim activities: Patterns can be identified as to where and what kinds of fraud are more likely to occur. Decision-tree-based models can be used to identify and flag fraudulent claims.
Manufacturing
Manufacturing operations are complex systems with inter-related sub-systems. From machines working right, to workers having the right skills, to the right components arriving with the right quality at the right time, to money to source the components, many things have to go right. Toyota's famous lean manufacturing system works on just-in-time inventory systems to optimize investments in inventory and to improve flexibility in their product-mix.
1. Discover novel patterns to improve product quality: Quality of a product can also be tracked, and this data can be used to create a predictive model of product quality deteriorating. Many companies, such as automobile companies, have to recall their products if they have found defects that have a public safety implication. Data mining can help with root cause analysis that can be used to identify sources of errors and help improve product quality in the future.
2. Predict/prevent machinery failures: Statistically, all equipment is likely to break down at some point in time. Predicting which machine is likely to shut down is a complex process. Decision models to forecast machinery failures could be constructed using past data. Preventive maintenance can be planned, and manufacturing capacity can be adjusted, to account for such maintenance activities.
Telecom
BI in telecom can help with the customer side as well as the network side of the operations. Key BI applications include churn management, marketing/customer profiling, network failure, and fraud detection.
1. Churn management: Telecom customers have shown a tendency to switch their providers in search of better deals. Telecom companies tend to respond with many incentives and discounts to hold on to customers. However, they need to determine which customers are at a real risk of switching and which others are just negotiating for a better deal. The level of risk should be factored into the kind of deals and discounts that should be given. Millions of such customer calls happen every month. The telecom companies need to provide a consistent and data-based way to predict the risk of the customer switching, and then make an operational decision in real time while the customer call is taking place. A decision-tree-based or a neural network-based system can be used to guide the customer-service call operator to make the right decisions for the company, in a consistent manner.
2. Marketing and product creation: In addition to customer data, telecom companies also store call detail records (CDRs), which can be analyzed to precisely describe the calling behavior of each customer. This unique data can be used to profile customers and then can be used for creating new product/service bundles for marketing purposes. An American telecom company, MCI, created a program called Friends & Family that allowed free calls with one's friends and family on that network, and thus effectively locked many people into their network.
3. Network failure management: Failure of telecom networks due to technical failures or malicious attacks can have devastating impacts on people, businesses, and society. In telecom infrastructure, some equipment will likely fail with a certain mean time between failures. Modeling the failure pattern of various components of the network can help with preventive maintenance and capacity planning.
4. Fraud Management: There are many kinds of fraud in consumer transactions. Subscription fraud occurs when a customer opens an account with the intention of never paying for the services. Superimposition fraud involves illegitimate activity by a person other than the legitimate account holder. Decision rules can be developed to analyze each CDR in real time to identify chances of fraud and take effective action.
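The mean time between failures mentioned under network failure management above can be estimated from operating history, and then used for rough capacity planning. The sketch below assumes a constant failure rate, which is a simplification, and all figures are hypothetical.

```python
def mean_time_between_failures(operating_hours, failures):
    """MTBF = total operating time divided by the number of failures observed."""
    return operating_hours / failures


def expected_failures(units, hours, mtbf_hours):
    """Expected number of failures across `units` devices over `hours`,
    assuming a constant failure rate of 1 / MTBF per device-hour."""
    return units * hours / mtbf_hours


# Hypothetical history: a router model ran 10,000 hours with 4 failures.
mtbf = mean_time_between_failures(operating_hours=10000, failures=4)

# Rough planning figure: 50 such routers over a 720-hour month.
failures_next_month = expected_failures(units=50, hours=720, mtbf_hours=mtbf)
print(mtbf, failures_next_month)
```

An estimate like this can guide how many spare units and maintenance visits to plan for.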
Public Sector
Governments gather a large amount of data by virtue of their regulatory function. That data could be analyzed for developing models of effective functioning. There are innumerable applications that can benefit from mining that data. A couple of sample applications are shown here.
1. Law enforcement: Social behavior is a lot more patterned and predictable than one would imagine. For example, the Los Angeles Police Department (LAPD) mined the data from its 13 million crime records over 80 years and developed models of what kind of crime is going to happen when and where. By increasing patrolling in those particular areas, LAPD was able to reduce property crime by 27 percent. Internet chatter can be analyzed to learn of and prevent any evil designs.
2. Scientific research: Any large collection of research data is amenable to being mined for patterns and insights. Protein folding (microbiology), nuclear reaction analysis (sub-atomic physics), and disease control (public health) are some examples where data mining can yield powerful new insights.
Conclusion
Business Intelligence is a comprehensive set of IT tools to support decision making with imaginative solutions for a variety of problems. BI can help improve the performance in nearly all industries and applications.
Review Questions
1. Why should organizations invest in business intelligence solutions? Are these more important than IT security solutions? Why or why not?
2. List 3 business intelligence applications in the hospitality industry.
3. Describe 2 BI tools used in your organization.
4. Businesses need a ‘two-second advantage’ to succeed. What does that mean to you?
Liberty Stores Case Exercise: Step 1
Liberty Stores Inc is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old, and is growing rapidly. It now operates in 5 continents, 50 countries, 150 cities, and has 500 stores. It sells 20000 products and has 10000 employees. The company has revenues of over $5 billion and has a profit of about 5% of revenue. The company pays special attention to the conditions under which the products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.
1: Create a comprehensive dashboard for the CEO of the company.
2: Create another dashboard for a country head.
Chapter 3: Data Warehousing
A data warehouse (DW) is an organized collection of integrated, subject-oriented databases designed to support decision support functions. DW is organized at the right level of granularity to provide clean enterprise-wide data in a standardized format for reports, queries, and analysis. DW is physically and functionally separate from an operational and transactional database. Creating a DW for analysis and queries represents significant investment in time and effort. It has to be constantly kept up-to-date for it to be useful. DW offers many business and technical benefits.
DW supports business reporting and data mining activities. It can facilitate distributed access to up-to-date business knowledge for departments and functions, thus improving business efficiency and customer service. DW can present a competitive advantage by facilitating decision making and helping reform business processes.
DW enables a consolidated view of corporate data, all cleaned and organized. Thus, the entire organization can see an integrated view of itself. DW thus provides better and timely information. It simplifies data access and allows end users to perform extensive analysis. It enhances overall IT performance by not burdening the operational databases used by Enterprise Resource Planning (ERP) and other systems.
Caselet: University Health System – BI in Healthcare
Indiana University Health (IUH), a large academic health care system, decided to build an enterprise data warehouse (EDW) to foster a genuinely data-driven management culture. IUH hired a data warehousing vendor to develop an EDW which also integrates with their Electronic Health Records (EHR) system. They loaded 14 billion rows of data into the EDW, fully 10 years of clinical data from across IUH's network. Clinical events, patient encounters, lab and radiology, and other patient data were included, as were IUH's performance management, revenue cycle, and patient satisfaction data. They soon put in a new interactive dashboard using the EDW that provided IUH's leadership with the daily operational insights they need to solve the quality/cost equation. It offers visibility into key operational metrics and trends to easily track the performance measures critical to controlling costs and maintaining quality. The EDW can easily be used across IUH's departments to analyze, track and measure clinical, financial, and patient experience outcomes. (Source: healthcatalyst.com)
Q1: What are the benefits of a single large comprehensive EDW?
Q2: What kinds of data would be needed for an EDW for an airline company?
55
Design Considerations for DW
The objective of a DW is to provide business knowledge to support decision making. For a DW to serve its objective, it should be aligned around those decisions. It should be comprehensive, easy to access, and up-to-date. Here are some requirements for a good DW:
1. Subject oriented: To be effective, a DW should be designed around a subject domain, i.e. to help solve a certain category of problems.
2. Integrated: The DW should include data from many functions that can shed light on a particular subject area. Thus the organization can benefit from a comprehensive view of the subject area.
3. Time-variant (time series): The data in DW should grow at daily or other chosen intervals. That allows the latest comparisons over time.
4. Nonvolatile: DW should be persistent, that is, it should not be created on the fly from the operations databases. Thus, DW is consistently available for analysis, across the organization and over time.
5. Summarized: DW contains rolled-up data at the right level for queries and analysis. The process of rolling up the data helps create consistent granularity for effective comparisons. It also helps reduce the number of variables or dimensions of the data to make them more meaningful for the decision makers.
6. Not normalized: DW often uses a star schema, which is a rectangular central table, surrounded by some look-up tables. The single table view significantly enhances speed of queries.
7. Metadata: Many of the variables in the database are computed from other variables in the operational database. For example, total daily sales may be a computed field. The method of calculation for each such variable should be effectively documented. Every element in the DW should be sufficiently well-defined.
8. Near real-time and/or right-time (active): DWs should be updated in near real-time in many high transaction volume industries, such as airlines. The cost of implementing and updating a DW in real time could be discouraging though. Another downside of a real-time DW is the possibility of inconsistencies in reports drawn just a few minutes apart.
DW Development Approaches
There are two fundamentally different approaches to developing DW: top down and bottom up. The top-down approach is to make a comprehensive DW that covers all the reporting needs of the enterprise. The bottom-up approach is to produce small data marts, for the reporting needs of different departments or functions, as needed. The smaller data marts will eventually align to deliver comprehensive EDW capabilities. The top-down approach provides consistency but takes more time and resources. The bottom-up approach leads to healthy local ownership and maintainability of data (Table 3.1).
| | Functional Data Mart | Enterprise Data Warehouse |
|---|---|---|
| Scope | One subject or functional area | Complete enterprise data needs |
| Value | Functional area reporting and insights | Deeper insights connecting multiple functional areas |
| Target organization | Decentralized management | Centralized management |
| Time | Low to medium | High |
| Cost | Low | High |
| Size | Small to medium | Medium to large |
| Approach | Bottom up | Top down |
| Complexity | Low (fewer data transformations) | High (data standardization) |
| Technology | Smaller scale servers and databases | Industrial strength |
Table 3.1: Comparing Data Mart and Data Warehouse
DW Architecture
DW has four key elements (Figure 3.1). The first element is the data sources that provide the raw data. The second element is the process of transforming that data to meet the decision needs. The third element is the methods of regularly and accurately loading that data into the EDW or data marts. The fourth element is the data access and analysis part, where devices and applications use the data from DW to deliver insights and other benefits to users.
Figure 3.1: Data Warehousing Architecture
Data Sources
Data warehouses are created from structured data sources. Unstructured data such as text data would need to be structured before being inserted into the DW.
1. Operations data: This includes data from all business applications, including the ERP systems that form the backbone of an organization’s IT systems. The data to be extracted will depend upon the subject matter of the data warehouse. For example, for a sales/marketing data mart, only the data about customers, orders, customer service, and so on would be extracted.
2. Specialized applications: This includes applications such as Point of Sale (POS) terminals and e-commerce applications, which also provide customer-facing data. Supplier data could come from Supply Chain Management systems. Planning and budget data should also be added as needed for making comparisons against targets.
3. External syndicated data: This includes publicly available data such as weather or economic activity data. It could also be added to the DW, as needed, to provide good contextual information to decision makers.
Data Loading Processes
The heart of a useful DW is the set of processes that populate the DW with good quality data. This is called the Extract-Transform-Load (ETL) cycle.
1. Data should be extracted from the operational (transactional) database sources, as well as from other applications, on a regular basis.
2. The extracted data should be aligned together by key fields and integrated into a single data set. It should be cleansed of any irregularities or missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily sales totals, should be computed. The entire data should then be brought to the same format as the central table of DW.
3. This transformed data should then be uploaded into the DW.
This ETL process should be run at a regular frequency. Daily transaction data can be extracted from ERPs, transformed, and uploaded to the database the same night. Thus, the DW is up to date every morning. If a DW is needed for near-real-time information access, then the ETL processes would need to be executed more frequently. ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodically updating the DW.
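The three ETL steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sample rows in raw_orders, the field names, the daily_sales table, and the use of an in-memory SQLite database in place of a real DW. A production ETL script would read from actual operational systems and handle many more irregularities.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows, as they might be extracted from an operational system.
raw_orders = [
    {"order_id": 1, "store": "S1", "date": "2015-03-01", "amount": "120.50"},
    {"order_id": 2, "store": "S1", "date": "2015-03-01", "amount": "79.50"},
    {"order_id": 2, "store": "S1", "date": "2015-03-01", "amount": "79.50"},  # duplicate
    {"order_id": 3, "store": "S2", "date": "2015-03-01", "amount": None},     # missing value
    {"order_id": 4, "store": "S2", "date": "2015-03-02", "amount": "200.00"},
]

def extract():
    """Extract: return the rows pulled from the (hypothetical) source system."""
    return list(raw_orders)

def transform(rows):
    """Transform: drop duplicates and rows with missing amounts, roll up to daily totals."""
    seen = set()
    totals = defaultdict(float)
    for row in rows:
        if row["order_id"] in seen or row["amount"] is None:
            continue
        seen.add(row["order_id"])
        totals[(row["store"], row["date"])] += float(row["amount"])
    return sorted((store, date, total) for (store, date), total in totals.items())

def load(conn, daily_rows):
    """Load: write the rolled-up rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (store TEXT, sale_date TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", daily_rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real DW
load(conn, transform(extract()))
daily_sales = conn.execute(
    "SELECT store, sale_date, total FROM daily_sales ORDER BY store, sale_date"
).fetchall()
print(daily_sales)
```

In this sketch, rows with missing amounts are dropped rather than filled in; which policy is appropriate depends on the decisions the DW is meant to support.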
Data Warehouse Design
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for codes used in the central table. For example, the central table may use digits to represent a salesperson. The lookup table will help provide the name for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance (Figure 3.2).
Figure 3.2: Star Schema Architecture for DW
Other schemas include the snowflake architecture. The difference between a star and a snowflake is that in the latter, the lookup tables can have their own further lookup tables.
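As a rough illustration of the star schema idea, the sketch below builds one central fact table and two lookup tables in an in-memory SQLite database, then resolves salesperson codes to names in a query. The table names, column names, and sample rows are hypothetical and are not taken from Figure 3.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup (dimension) tables: codes mapped to descriptive values.
CREATE TABLE dim_salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

-- Central fact table: one row per sale, referring to lookup tables by code.
CREATE TABLE fact_sales (
    sale_date TEXT,
    salesperson_id INTEGER REFERENCES dim_salesperson(salesperson_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);

INSERT INTO dim_salesperson VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO dim_product VALUES (10, 'Laptop'), (20, 'Phone');
INSERT INTO fact_sales VALUES
    ('2015-03-01', 1, 10, 900.0),
    ('2015-03-01', 2, 20, 400.0),
    ('2015-03-02', 1, 20, 450.0);
""")

# A typical report query: total sales per salesperson, with codes resolved to names.
sales_by_person = conn.execute("""
    SELECT s.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_salesperson AS s ON s.salesperson_id = f.salesperson_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(sales_by_person)
```

In a snowflake variant of this sketch, dim_product could itself refer to a further lookup table, such as a product category table.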
There are many technology choices for developing DW. This includes selecting the right database management system and the right set of data management tools. There are a few big and reliable providers of DW systems. The provider of the operational DBMS may be chosen for DW also. Alternatively, a best-of-breed DW vendor could be used. There are also a variety of tools out there for data migration, data upload, data retrieval, and data analysis.
DW Access
Data from the DW could be accessed for many purposes, by many users, through many devices.
1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales performance report would show sales by many dimensions, compared with plan. A dashboarding system will use data from the warehouse and present analysis to users. The data from DW can be used to populate customized performance dashboards for executives. The dashboard could include drill-down capabilities to analyze the performance data for root cause analysis.
2. The data from the DW could be used for ad-hoc queries and any other applications that make use of the internal data.
3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and then combined with other relevant data, for data mining.
DW Best Practices
A data warehousing project reflects a significant investment in information technology (IT). All of the best practices in implementing any IT project should be followed.
1. The DW project should align with the corporate strategy. Top management should be consulted for setting objectives. Financial viability (ROI) should be established. The project must be managed by both IT and business professionals. The DW design should be carefully tested before beginning development work. It is often much more expensive to redesign after development work has begun.
2. It is important to manage user expectations. The data warehouse should be built incrementally. Users should be trained in using the system so they can absorb the many features of the system.
3. Quality and adaptability should be built in from the start. Only relevant, cleansed, and high-quality data should be loaded. The system should be able to adapt to new tools for access. As business needs change, new data marts may need to be created for new needs.
Conclusion
Data warehouses are special data management facilities intended for creating reports and analysis to support managerial decision making. They are designed to make reporting and querying simple and efficient. The sources of data are operational systems and external data sources. The DW needs to be updated with new data regularly to keep it useful. Data from DW provides a useful input for data mining activities.
Review Questions
1: What is the purpose of a data warehouse?
2: What are the key elements of a data warehouse? Describe each one.
3: What are the sources and types of data for a data warehouse?
4: How will data warehousing evolve in the age of social media?
Liberty Stores Case Exercise: Step 2
The Liberty Stores company wants to be fully informed about its sales of products and take advantage of growth opportunities as they arise. It wants to analyze sales of all its products by all store locations. The newly hired Chief Knowledge Officer has decided to build a Data Warehouse.
1. Design a DW structure for the company to monitor its sales performance. (Hint: Design the central table and look-up tables).
2. Design another DW for the company’s sustainability and charitable activities.
Chapter 4: Data Mining
Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.
Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes the knowledge of data quality and data organizing from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence) areas. It also draws the knowledge of decision-making from the field of business management.
The field of data mining emerged in the context of pattern recognition in defense, such as identifying a friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.
For example, “customers who buy cheese and milk also buy bread 90 percent of the time” would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, “people with blood pressure greater than 160 and an age greater than 65 were at a high risk of dying from a heart stroke” is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.
Past data can be of predictive value in many complex situations, especially where the pattern may not be so easily visible without the modeling technique. Here is a dramatic case of a data-driven decision-making system that beats the best of human experts. Using past data, a decision tree model was developed to predict votes for Justice Sandra Day O’Connor, who had a swing vote in a 5–4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)
Caselet: Target Corp – Data Mining in Retail
Target is a large retail chain that crunches data to develop insights that help target marketing and advertising campaigns. Target analysts managed to develop a pregnancy prediction score based on a customer's purchasing history of 25 products. In a widely publicized story, they figured out that a teenage girl was pregnant before her father did. The targeting can be quite successful and dramatic, as this example published in the New York Times illustrates.
About a year after Target created their pregnancy-prediction model, a man walked into a Target store and demanded to see the manager. He was clutching coupons that had been sent to his daughter and he was angry, according to an employee who participated in the conversation. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
On the phone, though, the father was somewhat subdued. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. I owe you an apology.” (Source: New York Times).
Q1: Do Target and other retailers have full rights to use their acquired data as they see fit, and to contact desired consumers with all legally admissible means and messages? What are the issues involved here?
Q2: Facebook and Google provide many services for free. In return they mine our email and blogs and send us targeted ads. Is that a fair deal?
Gathering and selecting data
The total amount of data in the world is doubling every 18 months. There is an ever-growing avalanche of data coming with higher velocity, volume, and variety. One has to quickly use it or lose it. Smart data mining requires choosing where to play. One has to make judicious decisions about what to gather and what to ignore, based on the purpose of the data mining exercises. It is like deciding where to fish, as not all streams of data will be equally rich in potential insights.
To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources. Most organizations develop an enterprise data model (EDM) to organize their data. An EDM is a unified, high-level model of all the data stored in an organization’s databases. The EDM is usually inclusive of the data generated from all internal systems. The EDM provides the basic menu of data to create a data warehouse for a particular decision-making purpose. DWs help organize all this data in an easy and usable manner so that it can be selected and deployed for mining. The EDM can also help imagine what relevant external data should be gathered to provide context and develop good predictive relationships with the internal data. In the United States, the various federal and local governments and their regulatory agencies make a vast variety and quantity of data available at data.gov.
Gathering and curating data takes time and effort, particularly when it is unstructured or semistructured. Unstructured data can come in many forms, such as blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining.
Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Select data could also be gathered from the data warehouse. Every industry and function will have its own requirements and constraints. The health care industry will provide a different type of data with different data names. The HR function would provide different kinds of data. There would be different issues of quality and privacy for these data.
Data cleansing and preparation
The quality of data is critical to the success and value of the data mining project. Otherwise, the situation will be of the kind of garbage in and garbage out (GIGO). The quality of incoming data varies by the source and nature of data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of business, and is less likely to be reliable.
Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed (filling missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more) before it can be ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up to 60-70% of the time needed for a data mining project.
1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average or modal or default values.
3. Data elements should be comparable. They may need to be (a) transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost/patient to allow comparability of that value. Data elements may need to be adjusted to make them (b) comparable over time also. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability. They may need to be converted to a common currency. Data should be (c) stored at the same granularity to ensure comparability. For example, sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For instance, work experience could be binned as low, medium, and high.
5. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
6. Ensure that the data is representative of the phenomena under analysis by correcting for any biases in the selection of data. For example, if the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
7. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.
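Three of the cleansing steps listed above (de-duplication, filling missing values, and binning) can be sketched in a few lines. The sample records, the field names, and the bin boundaries of 5 and 15 years are assumptions chosen only for illustration; real thresholds would come from the business domain.

```python
from statistics import mean

# Hypothetical employee records merged from two sources.
records = [
    {"id": 1, "experience_years": 2},
    {"id": 2, "experience_years": None},   # missing value
    {"id": 2, "experience_years": None},   # duplicate of the row above
    {"id": 3, "experience_years": 12},
    {"id": 4, "experience_years": 25},
]

def dedupe(rows):
    """Keep only the first row seen for each id."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))
    return unique

def fill_missing(rows, field):
    """Replace missing values in `field` with the average of the known values."""
    default = mean(r[field] for r in rows if r[field] is not None)
    for r in rows:
        if r[field] is None:
            r[field] = default
    return rows

def bin_experience(years):
    """Bin a continuous value into low / medium / high (assumed cut-offs)."""
    if years < 5:
        return "low"
    if years < 15:
        return "medium"
    return "high"

cleaned = fill_missing(dedupe(records), "experience_years")
for r in cleaned:
    r["experience_level"] = bin_experience(r["experience_years"])
print(cleaned)
```

Filling with the average is only one of the options named above; a modal or default value may be more appropriate for categorical fields.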
Outputs of Data Mining
Data mining techniques can serve different types of objectives. The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.
One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.
The output can be in the form of a regression equation or mathematical function that represents the best fitting curve to represent the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae.
Population “centroid” is a statistical measure for describing central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be “middle-aged, highly educated, high-net worth professionals, married with two children, living in the coastal areas”. Or a population of “20-something, ivy-league-educated, tech entrepreneurs based in Silicon Valley”. Or it could be a collection of “vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection”. These are typical representations of the output of a cluster analysis exercise.
Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those that buy milk and bread will also buy butter (with 80 percent probability).
Evaluating Data Mining Results
There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy.
Predictive Accuracy = (Correct Predictions) / (Total Predictions)
Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model has been created. The model is then used to predict other data instances. When a data point whose true class is positive is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a data point whose true class is negative is classified as negative, that is a true negative (TN). On the other hand, when a truly positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a truly negative data point is classified as positive, that is called a false positive (FP). This is represented using the confusion matrix (Figure 4.1).
| Predicted class | True class: Positive | True class: Negative |
|---|---|---|
| Positive | True Positive (TP) | False Positive (FP) |
| Negative | False Negative (FN) | True Negative (TN) |
Figure 4.1: Confusion Matrix
Thus the predictive accuracy can be specified by the following formula.
Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
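The formula can be checked on a small worked example. The sketch below counts the four cells of the confusion matrix from two lists of labels and applies the formula; the ten labels are made up for illustration, with True standing for the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # positive, predicted positive
    tn = sum(1 for a, p in pairs if not a and not p)  # negative, predicted negative
    fp = sum(1 for a, p in pairs if not a and p)      # negative, predicted positive
    fn = sum(1 for a, p in pairs if a and not p)      # positive, predicted negative
    return tp, tn, fp, fn

def predictive_accuracy(tp, tn, fp, fn):
    """Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical true classes and model predictions for ten data instances.
actual    = [True, True, True,  False, False, False, False, True, False, False]
predicted = [True, True, False, False, False, True,  False, True, False, False]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = predictive_accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy)  # 3 5 1 1 0.8
```

Here 8 of the 10 predictions are correct (3 true positives and 5 true negatives), so the predictive accuracy is 80%.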
All classification techniques have a predictive accuracy associated with a predictive model. The highest value can be 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.
There are no good objective measures to judge the accuracy of unsupervised learning techniques such as Cluster Analysis. There is no single right answer for the results of these techniques. For example, the value of the segmentation model depends upon the value the decision-maker sees in those results.
Data Mining Techniques
Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 4.2).
Data Mining Techniques
  Supervised Learning (predictive ability based on past data)
    Classification – Machine Learning
      Decision Trees
      Neural Networks
    Classification – Statistics
      Regression
  Unsupervised Learning (exploratory analysis to discover patterns)
    Clustering Analysis
    Association Rules
Figure 4.2: Important Data Mining Techniques
The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning as there is a way to supervise whether the model is providing the right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.
Decision trees are the most popular data mining technique, for many reasons.
1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.
2. Decision trees select the most relevant variables automatically out of all the available variables for decision making.
3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.
4. Even non-linear relationships can be handled well by decision trees.
There are many algorithms to implement decision trees. Some of the popular ones are C5, CART and CHAID.
Regression is one of the most popular statistical data mining techniques. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict the energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data very well with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. The accuracy of the regression model depends entirely upon the data set used and not at all on the algorithm or tools used.
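The energy example can be sketched with ordinary least squares. The sketch assumes, purely for illustration, that energy use grows with the squared distance of the daily temperature from a comfort temperature of 18 degrees Celsius, so that a straight-line fit on that transformed variable yields a non-linear curve in temperature. The data points and the constant COMFORT_TEMP_C are invented, and the data is noise-free so that the fit is easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

COMFORT_TEMP_C = 18.0  # assumed temperature at which energy use is lowest

def feature(temp_c):
    """Non-linear transform: squared distance from the comfort temperature."""
    return (temp_c - COMFORT_TEMP_C) ** 2

# Hypothetical daily observations: temperature (C) and energy use (kWh).
temps = [0, 6, 12, 18, 24, 30, 36]
energy = [172, 82, 28, 10, 28, 82, 172]

intercept, slope = fit_line([feature(t) for t in temps], energy)

def predict_energy(temp_c):
    """Predict energy use for a future day from the fitted equation."""
    return intercept + slope * feature(temp_c)

print(round(intercept, 3), round(slope, 3), round(predict_energy(27), 3))
```

With real, noisy data the fit would not be exact, and the choice of transform would itself need to be checked against held-out observations.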
Artificial Neural Networks (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. It mimics the behavior of human neural structure: Neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.
At some point, the neural network will have learned enough and begin to match the predictive accuracy of a human expert or alternative classification techniques. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than human experts. At that point, the ANNs can begin to be seriously considered for deployment, in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality. However, ANNs require a lot of data to train them to develop good predictive ability.
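The idea of a neuron that adjusts its internal parameters based on feedback can be illustrated with the simplest possible case: a single neuron (a perceptron) learning the logical OR of two inputs. This is a deliberately tiny sketch, not a multi-layer ANN; the training data, the learning rate, and the number of passes are assumptions chosen so the example converges quickly.

```python
def step(z):
    """Activation: the neuron fires (1) only when its weighted input is positive."""
    return 1 if z > 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the stimuli, passed through the activation function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(samples, epochs=10, rate=1.0):
    """Repeat decisions over the data, nudging parameters after each mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # feedback signal
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Hypothetical training data: the logical OR of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train(samples)
outputs = [predict(weights, bias, inputs) for inputs, _ in samples]
print(weights, bias, outputs)
```

The learned weights carry no obvious business meaning, which gives a small taste of why larger networks are regarded as black boxes.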
Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters that could be produced by the data. K-means is a popular technique that lets the user guide the selection of the right number (K) of clusters from the data.
Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definition is used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
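A bare-bones version of K-means shows how centroids and cluster allocations are produced, and how a new data instance is assigned to its cluster home. The six points and the two starting centroids are invented for illustration; practical implementations usually pick starting centroids at random and run several restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data with two visible groups; K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])

# A new data instance is assigned to the cluster with the nearest centroid.
new_instance_cluster = nearest((2, 2), centroids)
print(centroids, clusters, new_instance_cluster)
```

The number of clusters is fixed by the length of the starting centroid list, which is how the user supplies K in this sketch.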
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
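The confidence level of a rule X → Y can be computed directly from transaction data as the share of transactions containing X that also contain Y. The five shopping baskets below are invented for illustration, and the function names are hypothetical rather than taken from any particular tool.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

# Hypothetical market baskets.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

rule_support = support({"milk", "bread", "butter"}, transactions)
rule_confidence = confidence({"milk", "bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)  # 0.6 0.75
```

In this sample, the rule {milk, bread} → {butter} holds in 3 of the 4 baskets that contain milk and bread, so its confidence is 75%.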
Tools and Platforms for Data Mining
Data mining tools have existed for many decades. However, they have recently become more important as the value of data has grown and the field of big data analytics has come into prominence. There is a wide range of data mining platforms available in the market today.
1. Simple or sophisticated: There are simple end-user data mining tools such as MS Excel, and there are more sophisticated tools such as IBM SPSS Modeler.
2. Stand-alone or embedded: There are stand-alone tools and there are tools embedded in an existing transaction processing or data warehousing or ERP system.
3. Open source or commercial: There are open source and freely available tools such as Weka, and there are commercial products.
4. User interface: There are text-based tools that require some programming skills, and there are GUI-based drag-and-drop format tools.
5. Data formats: There are tools that work only on proprietary data formats, and there are those that directly accept data in the formats of a host of popular data management tools.
Here we compare three platforms that we have used extensively and effectively for many data mining projects.
Table 4.1: Comparison of Popular Data Mining Platforms
| Feature | Excel | IBM SPSS Modeler | Weka |
|---|---|---|---|
| Ownership | Commercial | Commercial, expensive | Open-source, free |
| Data mining features | Limited; extensible with add-on modules | Extensive features, unlimited data sizes | Extensive, performance issues with large data |
| Stand-alone | Stand-alone | Embedded in BI software suites | Stand-alone |
| User skills needed | End-users | For skilled BI analysts | Skilled BI analysts |
| User interface | Text and click, easy | Drag & drop use, colorful, beautiful GUI | GUI, mostly b&w text output |
| Data formats | Industry-standard | Variety of data sources accepted | Proprietary |
MS Excel is a relatively simple and easy data mining tool. It can get quite versatile once the Analysis ToolPak and some other add-on products are installed on it.
IBM’s SPSS Modeler is an industry-leading data mining platform. It offers a powerful set of tools and algorithms for most popular data mining capabilities. It has a colorful GUI format with drag-and-drop capabilities. It can accept data in multiple formats, including reading Excel files directly.
Weka is an open-source GUI-based tool that offers a large number of data mining algorithms.
ERP systems include some data analytic capabilities, too. SAP has its Business Objects (BO) software. BO is considered one of the leading BI suites in the industry, and is often used by organizations that use SAP.
DataMiningBestPracticesEffective and successful use of datamining activity requires both businessand technologyskills.Thebusinessaspectshelpunderstand thedomainandthekeyquestions.Italsohelpsoneimaginepossiblerelationshipsinthedata,andcreatehypothesestotestit.TheITaspectshelpfetchthedatafrommanysources, clean up the data, assemble it to meet the needs of the businessproblem,andthenrunthedataminingtechniquesontheplatform.
An important element is to go after the problem iteratively. It is better todivideandconquertheproblemwithsmalleramountsofdata,andgetclosertotheheartofthesolutioninaniterativesequenceofsteps.Thereareseveralbest practices learned from the use of data mining techniques over a longperiod of time. The Data Mining industry has proposed a Cross-IndustryStandard Process for Data Mining (CRISP-DM). It has six essential steps(Figure4.3):
Figure 4.3: CRISP-DM Data Mining cycle
1. Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of a proposed model as well as in the data sets available and required.
2. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring many sources for the many elements of data that could help address the hypotheses to solve a problem. Without relevant data, the hypotheses cannot be tested.
3. Data Preparation: The data should be relevant, clean and of high quality. It’s important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment and add new data elements from external sources of data that could help improve predictive accuracy.
4. Modeling: This is the actual task of running many algorithms using the available data to discover if the hypotheses are supported. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.
5. Model Evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model’s predictive accuracy with more test data. When the accuracy has reached some satisfactory level, then the model should be deployed.
6. Dissemination and Rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time and will be a setback for establishing and supporting a data-based decision-process culture in the organization. The model should eventually be embedded in the organization’s business processes.
Myths about data mining
There are many myths about this area, scaring away many business executives from using data mining. Data mining is a mindset that presupposes a faith in the ability of data to reveal insights. By itself, data mining is not too hard, nor is it too easy. It does require a disciplined approach and some cross-disciplinary skills.
Myth #1: Data Mining is about algorithms. Data mining is used by business to answer important and practical business questions. Formulating the problem statement correctly and identifying imaginative solutions for testing are far more important, and must come before the data mining algorithms get called in. Understanding the relative strengths of various algorithms is helpful but not mandatory.
Myth #2: Data Mining is about predictive accuracy. While important, predictive accuracy is a feature of the algorithm. As in myth #1, the quality of output is a strong function of the right problem, the right hypothesis, and the right data.
Myth #3: Data Mining requires a data warehouse. While the presence of a data warehouse assists in the gathering of information, sometimes the creation of the data warehouse itself can benefit from some exploratory data mining. Some data mining problems may benefit from clean data available directly from the DW, but a DW is not mandatory.
Myth #4: Data Mining requires large quantities of data. Many interesting data mining exercises are done using small or medium sized data sets, at low costs, using end-user tools.
Myth #5: Data Mining requires a technology expert. Many interesting data mining exercises are done by end-users and executives using simple everyday tools like spreadsheets.
Data Mining Mistakes
Data mining is an exercise in extracting non-trivial useful patterns from data. It requires a lot of preparation and patience to pursue the many leads that data may provide. Much domain knowledge, and many tools and skills, are required to find such patterns. Here are some of the more common mistakes in data mining that should be avoided.
Mistake #1: Selecting the wrong problem for data mining: Without the right goals, or with no goals at all, data mining leads to a waste of time. Getting the right answer to an irrelevant question could be interesting, but it would be pointless from a business perspective. A good goal would be one that would deliver a good ROI to the organization.
Mistake #2: Buried under mountains of data without clear metadata: It is more important to be engaged with the data than to have lots of data. The relevant data required may be much less than initially thought. There may be insufficient knowledge about the data, or metadata. Examine the data with a critical eye and do not naively believe everything you are told about the data.
Mistake #3: Disorganized data mining: Without clear goals, much time is wasted. Doing the same tests using the same mining algorithms repeatedly and blindly, without thinking about the next stage and without a plan, would lead to wasted time and energy. This can come from being sloppy about keeping track of the data mining procedure and results. Not leaving sufficient time for data acquisition, selection and preparation can lead to data quality issues, and GIGO (garbage in, garbage out). Similarly, not providing enough time for testing the model, training the users and deploying the system can make the project a failure.
Mistake #4: Insufficient business knowledge: Without a deep understanding of the business domain, the results would be gibberish and meaningless. Don’t make erroneous assumptions, courtesy of experts. Don’t rule out anything when observing data analysis results. Don’t ignore suspicious (good or bad) findings and quickly move on. Be open to surprises. Even when insights emerge at one level, it is important to slice and dice the data at other levels to see if more powerful insights can be extracted.
Mistake #5: Incompatibility of data mining tools and data sets: All the tools, from data gathering and preparation to mining and visualization, should work together. Use tools that can work with data from multiple sources in multiple industry-standard formats.
Mistake #6: Looking only at aggregated results and not at individual records/predictions: It is possible that the right results at the aggregate level provide absurd conclusions at an individual record level. Diving into the data at the right angle can yield insights at many levels of data.
Mistake #7: Measuring your results differently from the way your sponsor measures them: If the data mining team loses its sense of business objectives, and begins to mine data for its own sake, it will lose respect and executive support very quickly. The BIDM cycle (Figure 1.1) should be remembered.
Conclusion
Data mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important to provide imaginative solutions that can then be tested with data mining. The business objective should be well understood and should always be kept in mind to ensure that the results are beneficial to the sponsor of the exercise.
Review Questions
1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. What are the major mistakes to be avoided when doing data mining?
7. What are the key requirements for a skilled data analyst?
Liberty Stores Case Exercise: Step 3
Liberty is constantly evaluating opportunities for improving efficiencies in all its operations, including the commercial operations as well as its charitable activities.
1. What data mining techniques would you use to analyze and predict sales patterns?
2. What data mining technique would you use to categorize its customers?
Chapter 5: Data Visualization
Data visualization is the art and science of making data easy to understand and consume for the end user. An ideal visualization shows the right amount of data, in the right order, in the right visual form, to convey the high-priority information. The right visualization requires an understanding of the consumer’s needs, the nature of the data, and the many tools and techniques available to present data. The right visualization arises from a complete understanding of the totality of the situation. One should use visuals to tell a true, complete and fast-paced story.
Data visualization is the last step in the data life cycle. This is where the data is processed for presentation in an easy-to-consume manner to the right audience for the right purpose. The data should be converted into a language and format that is preferred and best understood by the consumer of the data. The presentation should aim to highlight the insights from the data in an actionable manner. If the data is presented in too much detail, then the consumer of that data might lose interest and miss the insight.
Caselet: Dr. Hans Rosling - Visualizing Global Public Health
Dr. Hans Rosling is a master at data visualization. He has perfected the art of showing data in novel ways to highlight unexpected truths. He has become an online star by using data visualizations to make serious points about global health policy and development. Using novel ways to illustrate data obtained from UN agencies, he has helped demonstrate the progress that the world has made in improving public health on many dimensions. The best way to grasp the power of his work is to watch his TED video, in which Life Expectancy is mapped along with Fertility Rate for all countries from 1962 to 2003. Figure 5.1 shows one graphic from this video.
Figure 5.1: Visualizing Global Health Data (source: ted.com)
“THE biggest myth is that if we save all the poor kids, we will destroy the planet,” says Hans Rosling, a doctor and professor of international health at the Karolinska Institute in Sweden. “But you can't stop population growth by letting poor children die.” He has the computerised graphs to prove it: colourful visuals with circles that swarm, swell and shrink like living creatures. Dr Rosling's mesmerizing graphics have been impressing audiences on the international lecture circuit, from the TED conferences to the World Economic Forum at Davos. Instead of bar charts and histograms, Dr Rosling uses Lego bricks, IKEA boxes and data-visualization software developed by his Gapminder Foundation to transform reams of economic and public-health data into gripping stories. His aim is ambitious. “I produce a road map for the modern world,” he says. “Where people want to drive is up to them. But I have the idea that if they have a proper road map and know what the global realities are, they'll make better decisions.” (source: economist.com).
Q1: What are the business and social implications of this kind of data visualization?
Q2: How could these techniques be applied in your organization and area of work?
Excellence in Visualization
Data can be presented in the form of rectangular tables, or it can be presented in colorful graphs of various types. “Small, non-comparative, highly-labeled data sets usually belong in tables” – (Ed Tufte, 2001, p. 33). However, as the amount of data grows, graphs are preferable. Graphics help give shape to data. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:
1. Show, and even reveal, the data: The data should tell a story, especially a story hidden in large masses of data. However, reveal the data in context, so the story is correctly told.
2. Induce the viewer to think of the substance of the data: The format of the graph should be so natural to the data that it hides itself and lets the data shine.
3. Avoid distorting what the data have to say: Statistics can be used to lie. In the name of simplifying, some crucial context could be removed, leading to distorted communication.
4. Make large data sets coherent: By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.
5. Encourage the eyes to compare different pieces of data: Organize the chart in ways the eyes would naturally move to derive insights from the graph.
6. Reveal the data at several levels of detail: Graphs lead to insights, which raise further curiosity, and thus presentations should help get to the root cause.
7. Serve a reasonably clear purpose: informing or decision-making.
8. Closely integrate with the statistical and verbal descriptions of the data set: There should be no separation of charts and text in presentation. Each mode should tell a complete story. Intersperse text with the map/graphic to highlight the main insights.
Context is important in interpreting graphics. Perception of the chart is as important as the actual chart. Do not ignore the intelligence or the biases of the reader. Keep the template consistent, and only show variations in data. There can be many excuses for graphical distortion, e.g. “we are just approximating.” Quality of information transmission comes prior to the aesthetics of the chart. Leaving out the contextual data can be misleading.
A lot of graphics are published because they serve a particular cause or a point of view. This is particularly important in for-profit or politically contested environments. Many related dimensions can be folded into a graph. The more dimensions that are represented in a graph, the richer and more useful the chart becomes. The data visualizer should understand the client’s objectives and present the data for accurate perception of the totality of the situation.
Types of Charts
There are many kinds of data, as seen in the caselet above. Time series data is the most popular form of data. It helps reveal patterns over time. However, data could also be organized around an alphabetical list of things, such as countries or products or salespeople. Figure 5.2 shows some of the popular chart types and their usage.
1. Line graph: This is the most basic and popular way of displaying information. It shows data as a series of points connected by straight line segments. With time-series data, time is usually shown on the x-axis. Multiple variables can be represented on the same scale on the y-axis to compare the line graphs of all the variables.
2. Scatter plot: This is another very basic and useful graphic form. It helps reveal the relationship between two variables. In the above caselet, it shows two dimensions: Life Expectancy and Fertility Rate. Unlike in a line graph, there are no line segments connecting the points.
3. Bar graph: A bar graph shows thin colorful rectangular bars with their lengths being proportional to the values represented. The bars can be plotted vertically or horizontally. Bar graphs use a lot more ink than line graphs and should be used when line graphs are inadequate.
4. Stacked bar graphs: These are a particular method of doing bar graphs. Values of multiple variables are stacked one on top of the other to tell an interesting story. Bars can also be normalized such that the total height of every bar is equal, so that the chart shows the relative composition of each bar.
5. Histograms: These are like bar graphs, except that they are useful in showing data frequencies or data values on classes (or ranges) of a numerical variable.
Figure 5.2: Many types of graphs
6. Pie charts: These are very popular to show the distribution of a variable, such as sales by region. The size of a slice is representative of the relative strength of each value.
7. Box charts: These are a special form of chart to show the distribution of variables. The box shows the middle half of the values, while whiskers on both sides extend to the extreme values in either direction.
8. Bubble graph: This is an interesting way of displaying multiple dimensions in one chart. It is a variant of a scatter plot with many data points marked on two dimensions. Now imagine that each data point on the graph is a bubble (or a circle). The size of the circle and the color fill in the circle could represent two additional dimensions.
9. Dials: These are charts like the speed dial in a car, which show whether the variable value (such as a sales number) is in the low range, medium range, or high range. These ranges could be colored red, yellow and green to give an instant view of the data.
10. Geographical data maps: These are particularly useful for denoting statistics by location. Figure 5.3 shows a tweet density map of the US. It shows where the tweets emerge from in the US.
Figure 5.3: US tweet map (Source: Slate.com)
11. Pictographs: One can use pictures to represent data. E.g. Figure 5.4 shows the number of liters of water needed to produce one pound of each of the products, where images are used to show the product for easy reference. Each droplet of water represents 50 liters of water.
Figure 5.4: Pictograph of Water footprint (source: waterfootprint.org)
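The box chart described above rests on a five-number summary. The following is a minimal sketch of how that summary could be computed with Python's standard library; the sample values are made up for illustration, and the quartile convention is the default of `statistics.quantiles`, which is only one of several conventions in use.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Made-up sample values, for illustration only
sample = [3, 7, 8, 12, 13, 14, 18, 21, 25, 40]

low, q1, median, q3, high = five_number_summary(sample)
print(f"whisker from {low} to {q1}")          # lower whisker
print(f"box from {q1} to {q3}, median {median}")  # middle half of the values
print(f"whisker from {q3} to {high}")         # upper whisker
```

The box spans the first to the third quartile, so it holds the middle half of the values, and the whiskers run out to the minimum and maximum.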
Visualization Example
To demonstrate how each of the visualization tools could be used, imagine an executive for a company who wants to analyze the sales performance of his division. Table 5.1 shows the important raw sales data for the current year, alphabetically sorted by product name.
Product   Revenue   Orders   SalesPers
AA           9731      131          23
BB            355       43           8
CC            992       32           6
DD            125       31           4
EE            933       30           7
FF            676       35           6
GG           1411      128          13
HH           5116      132          38
JJ            215        7           2
KK           3833      122          50
LL           1348       15           7
MM           1201       28          13
Table 5.1: Raw Performance Data
To reveal some meaningful pattern, a good first step would be to sort the table by product revenue, with the highest revenue first. We could total up the values of Revenue, Orders, and Salespersons for all products. We can also add some important ratios to the right of the table (Table 5.2).
Product   Revenue   Orders   SalesPers   Rev/Order   Rev/SalesP   Orders/SalesP
AA           9731      131          23        74.3        423.1             5.7
HH           5116      132          38        38.8        134.6             3.5
KK           3833      122          50        31.4         76.7             2.4
GG           1411      128          13        11.0        108.5             9.8
LL           1348       15           7        89.9        192.6             2.1
MM           1201       28          13        42.9         92.4             2.2
CC            992       32           6        31.0        165.3             5.3
EE            933       30           7        31.1        133.3             4.3
FF            676       35           6        19.3        112.7             5.8
BB            355       43           8         8.3         44.4             5.4
JJ            215        7           2        30.7        107.5             3.5
DD            125       31           4         4.0         31.3             7.8
Total       25936      734         177        35.3        146.5             4.1
Table 5.2: Sorted data, with additional ratios
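The sorting, totals and ratios in Table 5.2 can be derived mechanically from the raw data. Here is a minimal sketch in Python, assuming the raw figures of Table 5.1 are held in a dictionary; the names `raw`, `with_ratios`, `ranked` and `totals` are illustrative and not from the book.

```python
# Raw data from Table 5.1: product -> (revenue, orders, salespersons)
raw = {
    "AA": (9731, 131, 23), "BB": (355, 43, 8), "CC": (992, 32, 6),
    "DD": (125, 31, 4), "EE": (933, 30, 7), "FF": (676, 35, 6),
    "GG": (1411, 128, 13), "HH": (5116, 132, 38), "JJ": (215, 7, 2),
    "KK": (3833, 122, 50), "LL": (1348, 15, 7), "MM": (1201, 28, 13),
}

def with_ratios(revenue, orders, salespersons):
    """Return the three ratios shown in the right-hand columns of Table 5.2."""
    return (revenue / orders, revenue / salespersons, orders / salespersons)

# Sort products by revenue, highest first
ranked = sorted(raw.items(), key=lambda item: item[1][0], reverse=True)

# Column totals for revenue, orders and salespersons
totals = tuple(sum(values[i] for values in raw.values()) for i in range(3))

print(f"{'Product':<8}{'Revenue':>8}{'Orders':>8}{'SalesPers':>10}"
      f"{'Rev/Order':>10}{'Rev/SalesP':>11}{'Ord/SalesP':>11}")
for product, values in ranked + [("Total", totals)]:
    per_order, rev_per_sp, orders_per_sp = with_ratios(*values)
    print(f"{product:<8}{values[0]:>8}{values[1]:>8}{values[2]:>10}"
          f"{per_order:>10.1f}{rev_per_sp:>11.1f}{orders_per_sp:>11.1f}")
```

The ratios in the Total row are computed from the column totals, not by averaging the per-product ratios.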
There are too many numbers in this table to visualize any trends in them. The numbers are on different scales, so plotting them on the same chart would not be easy. E.g. the Revenue numbers run into the thousands, while the SalesPers numbers and the Orders/SalesP ratios are in single or double digits.
One could start by visualizing the revenue as a pie chart (Figure 5.5). The revenue proportion drops significantly from the first product to the next. It is interesting to note that the top 3 products produce about 72% of the revenue.
Figure 5.5: Revenue Share by Product
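The slices of such a pie chart are simply each product's share of total revenue. A short sketch, using the Revenue column of Table 5.2 (variable names are illustrative), computes each share and the running cumulative share:

```python
# Revenue by product, from Table 5.2
revenue = {"AA": 9731, "HH": 5116, "KK": 3833, "GG": 1411, "LL": 1348, "MM": 1201,
           "CC": 992, "EE": 933, "FF": 676, "BB": 355, "JJ": 215, "DD": 125}

total = sum(revenue.values())
shares = {product: 100 * value / total for product, value in revenue.items()}

# Print each slice, largest first, with the cumulative share so far
running = 0.0
for product, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    running += share
    print(f"{product}: {share:5.1f}%  (cumulative {running:5.1f}%)")

# Combined share of the three largest slices
top3_share = sum(sorted(shares.values(), reverse=True)[:3])
```

Reading down the cumulative column shows how quickly revenue concentrates in the first few products.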
The number of orders for each product can be plotted as a bar graph (Figure 5.6). This shows that while the revenue is widely different for the top four products, they have approximately the same number of orders.
Figure 5.6: Orders by Products
Therefore, the orders data could be investigated further to see order patterns. Suppose additional data is made available for orders by their size. Suppose the orders are chunked into 4 sizes: Tiny, Small, Medium, and Large. The additional data is shown in Table 5.3.
Product   Total Orders   Tiny   Small   Medium   Large
AA                 131      5      44       70      12
HH                 132     38      60       30       4
KK                 122     20      50       44       8
GG                 128     52      70        6       0
LL                  15      2       3        5       5
MM                  28      8      12        6       2
CC                  32      5      17       10       0
EE                  30      6      14       10       0
FF                  35     10      22        3       0
BB                  43     18      25        0       0
JJ                   7      4       2        1       0
DD                  31     21      10        0       0
Total              734    189     329      185      31
Table 5.3: Additional data on order sizes
Figure 5.7 is a stacked bar graph that shows the percentage of orders by size for each product. This chart brings a different set of insights. It shows that the product HH has a larger proportion of tiny orders than the other top-revenue products, AA and KK. The products at the far right have a large proportion of tiny orders and very few large orders.
Figure 5.7: Product Orders by Order Size
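A normalized stacked bar graph plots, for each product, the percentage of its orders in each size class. As a rough sketch, those percentages can be computed from Table 5.3 as follows (the names `orders`, `percent_mix` and `mix` are illustrative):

```python
SIZES = ("Tiny", "Small", "Medium", "Large")

# Order counts by size, from Table 5.3: product -> (tiny, small, medium, large)
orders = {
    "AA": (5, 44, 70, 12), "HH": (38, 60, 30, 4), "KK": (20, 50, 44, 8),
    "GG": (52, 70, 6, 0), "LL": (2, 3, 5, 5), "MM": (8, 12, 6, 2),
    "CC": (5, 17, 10, 0), "EE": (6, 14, 10, 0), "FF": (10, 22, 3, 0),
    "BB": (18, 25, 0, 0), "JJ": (4, 2, 1, 0), "DD": (21, 10, 0, 0),
}

def percent_mix(counts):
    """Convert raw counts into percentages of the row total."""
    total = sum(counts)
    return tuple(100 * count / total for count in counts)

mix = {product: percent_mix(counts) for product, counts in orders.items()}

for product, percentages in mix.items():
    cells = "  ".join(f"{size} {pct:5.1f}%" for size, pct in zip(SIZES, percentages))
    print(f"{product}: {cells}")
```

Because every row sums to 100%, all bars have the same height and only the composition differs from product to product.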
Visualization Example phase-2
The executive wants to understand the productivity of salespersons. This analysis could be done in terms of either the number of orders or the revenue per salesperson. There could be two separate graphs, one for the number of orders per salesperson, and the other for the revenue per salesperson. However, an interesting way is to plot both measures on the same graph to give a more complete picture. This can be done even when the two measures have different scales. The data is here re-sorted by number of orders per salesperson.
Figure 5.8 shows two line graphs superimposed upon each other. One line shows the revenue per salesperson, while the other shows the number of orders per salesperson. Going by Table 5.2, the number of orders per salesperson ranges from a high of 9.8 down to 2.1. The second line, the blue line, shows the revenue per salesperson for each of the products. The revenue per salesperson is highest at 423.1, while it is lowest at just 31.3.
Additional layers of data visualization can be built on this data set in the same way.
Figure 5.8: Salesperson productivity by product
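One simple way to place two measures with different scales on one chart is to rescale each series to the range 0 to 1 before plotting. The sketch below assumes min-max rescaling; a charting tool could equally use a secondary axis, and the book does not say which approach was used for Figure 5.8. The data comes from Table 5.1, and all names are illustrative.

```python
# Raw data from Table 5.1: product -> (revenue, orders, salespersons)
raw = {
    "AA": (9731, 131, 23), "BB": (355, 43, 8), "CC": (992, 32, 6),
    "DD": (125, 31, 4), "EE": (933, 30, 7), "FF": (676, 35, 6),
    "GG": (1411, 128, 13), "HH": (5116, 132, 38), "JJ": (215, 7, 2),
    "KK": (3833, 122, 50), "LL": (1348, 15, 7), "MM": (1201, 28, 13),
}

def rescale(values):
    """Min-max rescaling: the smallest value maps to 0, the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Re-sort products by orders per salesperson, highest first
products = sorted(raw, key=lambda p: raw[p][1] / raw[p][2], reverse=True)
orders_per_sp = [raw[p][1] / raw[p][2] for p in products]
revenue_per_sp = [raw[p][0] / raw[p][2] for p in products]

for p, o, r in zip(products, rescale(orders_per_sp), rescale(revenue_per_sp)):
    print(f"{p}: orders/salesperson {o:.2f}, revenue/salesperson {r:.2f}")
```

After rescaling, both lines share one vertical axis; the original units should then be shown in labels or a legend so that the reader is not misled.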
Tips for Data Visualization
To help the client in understanding the situation, the following considerations are important:
1. Fetch appropriate and correct data for analysis. This requires some understanding of the domain of the client and what is important for the client. E.g. in a business setting, one may need to understand the many measures of profitability and productivity.
2. Sort the data in the most appropriate manner. It could be sorted by numerical variables, or alphabetically by name.
3. Choose an appropriate method to present the data. The data could be presented as a table, or it could be presented as any of the graph types.
4. The data set could be pruned to include only the more significant elements. More data is not necessarily better, unless it makes a significant impact on the situation.
5. The visualization could show an additional dimension for reference, such as the expectations or targets with which to compare the results.
6. The numerical data may need to be binned into a few categories. E.g. the orders per person were plotted as actual values, while the order sizes were binned into 4 categorical choices.
7. High-level visualization could be backed by more detailed analysis. For the most significant results, a drill-down may be required.
8. There may be a need to present additional textual information to tell the whole story. For example, one may require notes to explain some extraordinary results.
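The binning mentioned in tip 6 can be sketched as follows. The cut-off values and the sample order values here are hypothetical, since the book does not state the thresholds behind the Tiny, Small, Medium and Large classes of Table 5.3.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical upper bounds for the first three bins; the source does not
# state the cut-offs actually used for Table 5.3.
CUTOFFS = (10, 50, 200)
LABELS = ("Tiny", "Small", "Medium", "Large")

def bin_order(value):
    """Map an order value to a size label; a value equal to a cut-off goes to the higher bin."""
    return LABELS[bisect_right(CUTOFFS, value)]

# Made-up order values, for illustration only
sample_orders = [4, 10, 35, 49.5, 120, 200, 950]

counts = Counter(bin_order(value) for value in sample_orders)
for label in LABELS:
    print(f"{label}: {counts[label]}")
```

Choosing the cut-offs is a judgment call; they should reflect categories that are meaningful to the client, not just convenient round numbers.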
Conclusion
Data visualization is the last phase of the data life cycle, and leads to the consumption of data by the end user. It should tell an accurate, complete and simple story backed by data, while keeping it insightful and engaging. There are innumerable types of visual graphing techniques available for visualizing data. The choice of the right tools requires a good understanding of the business domain, the data set and the client needs. There is ample room for creativity to design ever more compelling data visualizations that convey the insights from the data most efficiently.
Review Questions
1. What is data visualization?
2. How would you judge the quality of data visualizations?
3. What are the data visualization techniques? When would you use tables or graphs?
4. Describe some key steps in data visualization.
5. What are some key requirements for good visualization?
Liberty Stores Case Exercise: Step 4
Liberty is constantly evaluating its performance for improving efficiencies in all its operations, including the commercial operations as well as its charitable activities.
1. What data visualization techniques would you use to help understand sales patterns?
2. What data visualization technique would you use to categorize its customers?
Section 2
This section covers five important data mining techniques.
The first three techniques are examples of supervised learning, consisting of classification techniques.
Chapter 6 will cover decision trees, which are the most popular form of data mining techniques. There are many algorithms to develop decision trees.
Chapter 7 will describe regression modeling techniques. These are statistical techniques.
Chapter 8 will cover artificial neural networks, which are a machine learning technique.
The next two techniques are examples of unsupervised learning, consisting of data exploration techniques.
Chapter 9 will cover Cluster Analysis. This is also called Market Segmentation analysis.
Chapter 10 will cover the Association Rule Mining technique, also called Market Basket Analysis.
Chapter 6: Decision Trees
Decision trees are a simple way to guide one’s path to a decision. The decision may be a simple binary one, such as whether to approve a loan or not. Or it may be a complex multi-valued decision, such as the diagnosis for a particular sickness. Decision trees are hierarchically branched structures that help one come to a decision based on asking certain questions in a particular sequence. Decision trees are one of the most widely used techniques for classification. A good decision tree should be short and ask only a few meaningful questions. They are very efficient to use, easy to explain, and their classification accuracy is competitive with other methods. Decision trees can generate knowledge from a few test instances that can then be applied to a broad population. Decision trees are used mostly to answer relatively simple binary decisions.
Caselet: Predicting Heart Attacks using Decision Trees
A study was done at UC San Diego concerning heart disease patient data. The patients had been diagnosed with a heart attack on the basis of chest pain, EKG readings, high enzyme levels in their heart muscles, etc. The objective was to predict which of these patients was at risk of dying from a second heart attack within the next 30 days. The prediction would determine the treatment plan, such as whether to keep the patient in intensive care or not. For each patient, more than 100 variables were collected, including demographics, medical history and lab data. Using that data, and the CART algorithm, a decision tree was constructed.
The decision tree showed that if blood pressure was low (<=90), the chance of another heart attack was very high (70%). If the patient’s BP was ok, the next question to ask was the patient’s age. If the age was low (<=62), then the patient’s survival was almost guaranteed (98%). If the age was higher, then the next question to ask was about sinus problems. If their sinus was ok, the chances of survival were 89%. Otherwise, the chance of survival dropped to 50%. This decision tree predicts 86.5% of the cases correctly. (Source: Salford Systems).
Q1: Is a decision tree good enough for this data in terms of accuracy, design, readability, etc.?
Q2: Identify the benefits from creating such a decision tree. Can these be quantified?
Decision Tree problem
Imagine a conversation between a doctor and a patient. The doctor asks questions to determine the cause of the ailment. The doctor would continue to ask questions till she is able to arrive at a reasonable decision. If nothing seems plausible, she might recommend some tests to generate more data and options.
This is how experts in any field solve problems. They use decision trees or decision rules. For every question they ask, the potential answers create separate branches for further questioning. For each branch, the expert would know how to proceed ahead. The process continues until the end of the tree is reached, which means a leaf node is reached.
Human experts learn from past experiences or data points. Similarly, a machine can be trained to learn from the past data points and extract some knowledge or rules from them. Decision trees use machine learning algorithms to abstract knowledge from data. A decision tree would have a predictive accuracy based on how often it makes correct decisions.
1. The more training data is provided, the more accurate its knowledge extraction will be, and thus, it will make more accurate decisions.
2. The more variables the tree can choose from, the greater the likely accuracy of the decision tree.
3. In addition, a good decision tree should also be frugal, so that it takes the least number of questions, and thus the least amount of effort, to get to the right decision.
Here is an exercise to create a decision tree that helps make decisions about approving the play of an outdoor game. The objective is to predict the play decision given the atmospheric conditions out there. The decision is: Should the game be allowed or not? Here is the decision problem.
Outlook   Temp   Humidity   Windy   Play
Sunny     Hot    Normal     True    ??
To answer that question, one should look at past experience, and see what decision was made in a similar instance, if such an instance exists. One could look up the database of past decisions to find a matching instance and arrive at an answer. Here is a list of the decisions taken in 14 instances of past soccer game situations. (Dataset courtesy: Witten, Frank, and Hall, 2010).
Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No
If there were a row for the Sunny/Hot/Normal/Windy condition in the data table, it would match the current problem, and the decision from that row could be used to answer the current problem. However, there is no such past instance in this case. There are three disadvantages of looking up the data table:
1. As mentioned earlier, how does one decide if there isn’t a row that corresponds to the exact situation today? If there is no exact matching instance available in the database, the past experience cannot guide the decision.
2. Searching through the entire past database may be time consuming, depending on the number of variables and the organization of the database.
3. What if the data values are not available for all the variables? In this instance, if the data for the humidity variable was not available, looking up the past data would not help.
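The first disadvantage can be seen directly by attempting the lookup. Here is a minimal sketch in Python, with the 14 past decisions typed in from the table above; the names `PAST` and `look_up` are illustrative.

```python
# The 14 past decisions: Outlook, Temp, Humidity, Windy, Play
PAST = [tuple(line.split()) for line in """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
""".strip().splitlines()]

def look_up(outlook, temp, humidity, windy):
    """Return the past decisions recorded for exactly these conditions (may be empty)."""
    return [row[4] for row in PAST if row[:4] == (outlook, temp, humidity, windy)]

print(look_up("Sunny", "Hot", "High", "True"))    # a situation seen before
print(look_up("Sunny", "Hot", "Normal", "True"))  # today's problem: no exact match
```

The second call returns an empty list, so a pure lookup offers no guidance for the current problem. The function also scans every row and needs a value for every variable, which mirrors the other two disadvantages.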
A better way of solving the problem may be to abstract the knowledge from the past data into a decision tree or rules. These rules can be represented in a decision tree, and then that tree can be used to make the decisions. The decision tree may not need values for all the variables.
Decision Tree Construction
A decision tree is a hierarchically branched structure. What should be the first question asked in creating the tree? One should ask the more important question first, and the less important questions later. What is the most important question that should be asked to solve the problem? How is the importance of the questions determined? Thus, how should the root node of the tree be determined?
Determining root node of the tree: In this example, there are four choices based on the four variables. One could begin by asking one of the following questions: what is the outlook, what is the temperature, what is the humidity, and is it windy? A criterion should be used to evaluate these choices. The key criterion would be: which one of these questions gives the most insight about the situation? Another way to look at it would be the criterion of frugality. That is, which question will provide us the shortest ultimate decision tree? Another way to look at this is that if one is allowed to ask one and only one question, which one would one ask? In this case, the most important question should be the one that, by itself, helps make the most correct decisions with the fewest errors. The four questions can now be systematically compared, to see which variable by itself will help make the most correct decisions. One should systematically calculate the correctness of decisions based on each question. Then one can select the question with the most correct predictions, or the fewest errors.
Start with the first variable, in this case outlook. It can take three values: sunny, overcast, and rainy.
Start with the sunny value of outlook. There are five instances where the outlook is sunny. In 2 of the 5 instances the play decision was yes, and in the other three, the decision was no. Thus, if the decision rule was that Outlook: sunny → No, then 3 out of 5 decisions would be correct, while 2 out of 5 such decisions would be incorrect. There are 2 errors out of 5. This can be recorded in Row 1.
Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5
Similar analysis would be done for the other values of the outlook variable. There are four instances where the outlook is overcast. In all 4 out of 4 instances the play decision was yes. Thus, if the decision rule was that Outlook: overcast → Yes, then 4 out of 4 decisions would be correct, while none of the decisions would be incorrect. There are 0 errors out of 4. This can be recorded in the next row.
Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5
            Overcast → Yes   0/4
There are five instances where the outlook is rainy. In 3 of the 5 instances the play decision was yes, and in the other two, the decision was no. Thus, if the decision rule was that Outlook: rainy → Yes, then 3 out of 5 decisions would be correct, while 2 out of 5 decisions would be incorrect. There will be 2/5 errors. This can be recorded in the next row.
Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
Adding up errors for all values of outlook, there are 4 errors out of 14. In other words, outlook gives 10 correct decisions out of 14, and 4 incorrect ones.
A similar analysis can be done for the other three variables. At the end of that analytical exercise, the following error table will be constructed.
Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
Temp        Hot → No         2/4     5/14
            Mild → Yes       2/6
            Cool → Yes       1/4
Humidity    High → No        3/7     4/14
            Normal → Yes     1/7
Windy       False → Yes      2/8     5/14
            True → No        3/6
The variable that leads to the least number of errors (and thus the most number of correct decisions) should be chosen as the first node. In this case, two variables have the least number of errors. There is a tie between outlook and humidity, as both have 4 errors out of 14 instances. The tie can be broken using another criterion, the purity of the resulting sub-trees.
If all the errors were concentrated in a few of the sub-trees, and some of the branches were completely free of error, that is preferred from a usability perspective. Outlook has one error-free branch, for the overcast value, while there is no such pure sub-class for the humidity variable. Thus the tie is broken in favor of outlook. The decision tree will use outlook as the first node, or the first splitting variable. The first question that should be asked to solve the play problem is ‘What is the value of outlook?’
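The error table and the tie-break can be reproduced with a few lines of code. The following is a minimal sketch in Python, assuming each value of a variable predicts its majority decision; the helper names (`error_table`, `summary`, `root`) are illustrative and not from the book.

```python
from collections import Counter, defaultdict

ATTRS = ("Outlook", "Temp", "Humidity", "Windy")

# The 14 past decisions: Outlook, Temp, Humidity, Windy, Play
DATA = [tuple(line.split()) for line in """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
""".strip().splitlines()]

def error_table(rows, attr_index):
    """For one attribute, return {value: (errors, count)} under a majority-vote rule."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attr_index]][row[-1]] += 1
    return {value: (sum(c.values()) - max(c.values()), sum(c.values()))
            for value, c in by_value.items()}

summary = {}
for i, name in enumerate(ATTRS):
    table = error_table(DATA, i)
    total_errors = sum(errors for errors, _ in table.values())
    pure_branches = sum(1 for errors, _ in table.values() if errors == 0)
    summary[name] = (total_errors, pure_branches)
    print(name, table, f"total {total_errors}/{len(DATA)}")

# Fewest errors first; ties broken by the larger number of error-free branches
root = min(ATTRS, key=lambda a: (summary[a][0], -summary[a][1]))
print("Root node:", root)
```

Outlook and humidity both come out at 4 errors, and the count of error-free branches settles the tie in favor of outlook.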
Splitting the Tree: From the root node, the decision tree will be split into three branches or sub-trees, one for each of the three values of outlook. Data for the root node (the entire data) will be divided into three segments, one for each value of outlook. The sunny branch will inherit the data for the instances that had sunny as the value of outlook. These will be used for further building of that sub-tree. Similarly, the rainy branch will inherit the data for the instances that had rainy as the value of outlook. These will be used for further building of that sub-tree. The overcast branch will inherit the data for the instances that had overcast as the outlook. However, there will be no need to build further on that branch. There is a clear decision, yes, for all instances when the outlook value is overcast.

The decision tree will look like this after the first level of splitting.
Determining the next nodes of the tree: A similar recursive logic of tree building should be applied to each branch. For the sunny branch on the left, error values will be calculated for the three other variables: temp, humidity and windy. The final comparison looks like this:

Attribute | Rules | Error | Total Error
Temp | Hot → No | 0/2 | 1/5
Temp | Mild → No | 1/2 |
Temp | Cool → Yes | 0/1 |
Humidity | High → No | 0/3 | 0/5
Humidity | Normal → Yes | 0/2 |
Windy | False → No | 1/3 | 2/5
Windy | True → Yes | 1/2 |
The variable humidity shows the least amount of error, i.e. zero error. The other two variables have non-zero errors. Thus the Outlook: sunny branch on the left will use humidity as the next splitting variable.

A similar analysis should be done for the 'rainy' branch of the tree. The analysis would look like this.

Attribute | Rules | Error | Total Error
Temp | Mild → Yes | 1/3 | 2/5
Temp | Cool → Yes | 1/2 |
Humidity | High → No | 1/2 | 2/5
Humidity | Normal → Yes | 1/3 |
Windy | False → Yes | 0/3 | 0/5
Windy | True → No | 0/2 |
For the Rainy branch, it can similarly be seen that the variable Windy gives all the correct answers, while neither of the other two variables makes all the correct decisions.

This is what the final decision tree looks like. Here it is produced using the Weka open-source data mining platform (Figure 6.1). This is the model that abstracts the knowledge of the past decision data.

Figure 6.1: Decision Tree for the weather problem
This decision tree can be used to solve the current problem. Here is the problem again.

Outlook | Temp | Humidity | Windy | Play
Sunny | Hot | Normal | True | ??

According to the tree, the first question to ask is about outlook. In this problem the outlook is sunny. So, the decision problem moves to the Sunny branch of the tree. The node in that sub-tree is humidity. In the problem, Humidity is Normal. That branch leads to the answer Yes. Thus, the answer to the play problem is Yes.

Outlook | Temp | Humidity | Windy | Play
Sunny | Hot | Normal | True | Yes
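The finished tree amounts to a handful of nested questions, which can be written directly as code. The sketch below follows the tree as derived above (outlook at the root, humidity under sunny, windy under rainy); the function name and the lowercase value spellings are illustrative choices.

```python
def play_decision(outlook, humidity, windy):
    """Walk the weather decision tree from the root to a leaf."""
    if outlook == "overcast":
        return "yes"                      # pure branch: always play
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


# The current problem: Sunny, Hot, Normal humidity, Windy = True.
# Temperature is not an argument because the tree never asks about it.
answer = play_decision("sunny", "normal", True)
print(answer)
```

Only one or two questions are needed to reach any leaf, which is the point made in the comparison that follows.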
Lessons from constructing trees
Here are some benefits of using this decision tree compared with looking up the answers from the data table (Table 6.1).

Criterion | Decision Tree | Table Lookup
Accuracy | Varied level of accuracy | 100% accurate
Generality | General. Applies to all situations | Applies only when a similar case had occurred earlier
Frugality | Only three variables needed | All four variables are needed
Simple | Only one, or max two, variable values are needed | All four variable values are needed
Easy | Logical, and easy to understand | Can be cumbersome to look up; no understanding of the logic behind the decision

Table 6.1: Comparing Decision Tree with Table Look-up
Here are a few observations about how the tree was constructed:

1. The final decision tree has zero errors in mapping to the prior data. In other words, the tree has a predictive accuracy of 100%. The tree completely fits the data. In real life situations, such perfect predictive accuracy is rarely possible when making decision trees. When there are larger, complicated data sets, with many more variables, a perfect fit is usually unachievable. This is especially true in business and social contexts, where things are not always fully clear and consistent.
2. The decision tree algorithm selected the minimum number of variables that are needed to solve the problem. Thus, one can start with all available data variables, and let the decision-tree algorithm select the ones that are useful, and discard the rest.
3. This tree is almost symmetric, with all branches being of almost similar lengths. However, in real life situations, some of the branches may be much longer than the others, and the tree may need to be pruned to make it more balanced and usable.
4. It may be possible to increase predictive accuracy by making more sub-trees and making the tree longer. However, the marginal accuracy gained from each subsequent level in the tree will be less, and may not be worth the loss in ease and interpretability of the tree. If the branches are long and complicated, the tree will be difficult to understand and use. The longer branches may need to be trimmed to keep the tree easy to use.
5. A perfectly fitting tree has the danger of over-fitting the data, thus capturing all the random variations in the data. It may fit the training data well, but may not do well in predicting future real instances.
6. There was a single best tree for this data. There could however be two or more equally efficient decision trees of similar length with similar predictive accuracy for the same data set. Decision trees are based strictly on patterns within the data, and do not rely on any underlying theory of the problem domain. When multiple candidate trees are available, one could choose whichever is easier to understand, communicate or implement.
Decision Tree Algorithms
As we saw, decision trees employ the divide and conquer method. The data is branched at each node according to certain criteria until all the data is assigned to leaf nodes. The algorithm recursively divides a training set until each division consists of examples from one class.

The following is pseudocode for making decision trees:

1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute according to certain criteria.
3. Add a branch to the root node for each value of the split.
4. Split the data into mutually exclusive subsets along the lines of the specific split.
5. Repeat steps 2 to 4 for each and every leaf node until a stopping criterion is reached.
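The pseudocode above can be sketched as a short recursive function. This is a minimal illustration, not a production algorithm: it uses the least-errors criterion from the worked example, stops when a node is pure or no attributes remain, and breaks ties by column order (which happens to agree with the purity tie-break used earlier). The data rows are again the classic weather data set, assumed to match the chapter's table.

```python
from collections import Counter

COLUMNS = ["outlook", "temp", "humidity", "windy"]
# Assumption: classic weather data (outlook, temp, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def split_errors(rows, col):
    """Errors made if every value of column `col` predicts its majority class."""
    errors = 0
    for value in {r[col] for r in rows}:
        counts = Counter(r[-1] for r in rows if r[col] == value)
        errors += sum(counts.values()) - max(counts.values())
    return errors


def build_tree(rows, cols):
    classes = Counter(r[-1] for r in rows)
    # Stopping criterion: the node is pure, or there is nothing left to split on.
    if len(classes) == 1 or not cols:
        return classes.most_common(1)[0][0]
    best = min(cols, key=lambda c: split_errors(rows, c))        # step 2
    remaining = [c for c in cols if c != best]
    return {                                                      # steps 3 to 5
        (COLUMNS[best], value): build_tree(
            [r for r in rows if r[best] == value], remaining)
        for value in sorted({r[best] for r in rows}, key=str)
    }


tree = build_tree(DATA, [0, 1, 2, 3])                             # step 1
print(tree)
```

Each key of the nested dictionary is an (attribute, value) pair naming a branch; each leaf is the class decided on that branch.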
There are many algorithms for making decision trees. Decision tree algorithms differ on three key elements:

1. Splitting criteria
   1. Which variable to use for the first split? How should one determine the most important variable for the first branch, and subsequently, for each sub-tree? There are many measures like least errors, information gain, Gini's coefficient, etc.
   2. What values to use for the split? If the variables have continuous values such as for age or blood pressure, what value-ranges should be used to make bins?
   3. How many branches should be allowed for each node? There could be binary trees, with just two branches at each node. Or there could be more branches allowed.
2. Stopping criteria: When to stop building the tree? There are two major ways to make that determination. The tree building could be stopped when a certain depth of the branches has been reached and the tree becomes unreadable after that. The tree could also be stopped when the error level at any node is within predefined tolerable levels.
3. Pruning: The tree could be trimmed to make it more balanced and more easily usable. The pruning is often done after the tree is constructed, to balance out the tree and improve usability. The symptoms of an over-fitted tree are a tree too deep, with too many branches, some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned. There are two approaches to avoid over-fitting.
   - Pre-pruning means halting the tree construction early, when certain criteria are met. The downside is that it is difficult to decide what criteria to use for halting the construction, because we do not know what may happen subsequently if we keep growing the tree.
   - Post-pruning means removing branches or sub-trees from a "fully grown" tree. This method is commonly used. The C4.5 algorithm uses a statistical method to estimate the errors at each node for pruning. A validation set may be used for pruning as well.
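Two of the splitting measures named above, Gini's coefficient and information gain, have standard textbook definitions that can be sketched in a few lines. The class counts used below are those of the weather example (9 yes and 5 no overall; the outlook split gives 2 yes / 3 no for sunny, 4 yes for overcast, and 3 yes / 2 no for rainy). The function names are illustrative.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)


parent = ["yes"] * 9 + ["no"] * 5
sunny = ["yes"] * 2 + ["no"] * 3
overcast = ["yes"] * 4
rainy = ["yes"] * 3 + ["no"] * 2

print(round(gini(parent), 3))
print(round(entropy(parent), 3))
print(round(information_gain(parent, [sunny, overcast, rainy]), 3))
```

A pure node such as the overcast branch scores zero on both impurity measures, which is why a split that creates pure branches is rewarded.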
The most popular decision tree algorithms are C4.5 (and its successor C5), CART and CHAID (Table 6.2).

Table 6.2: Comparing popular Decision Tree algorithms

Decision-Tree | C4.5 | CART | CHAID
Full Name | Successor to Iterative Dichotomiser (ID3) | Classification and Regression Trees | Chi-square Automatic Interaction Detector
Basic algorithm | Hunt's algorithm | Hunt's algorithm | Adjusted significance testing
Developer | Ross Quinlan | Breiman | Gordon Kass
When developed | 1986 (ID3) | 1984 | 1980
Types of trees | Classification | Classification & Regression trees | Classification & regression
Serial implementation | Tree-growth & Tree-pruning | Tree-growth & Tree-pruning | Tree-growth & Tree-pruning
Type of data | Discrete & Continuous; Incomplete data | Discrete and Continuous | Non-normal data also accepted
Types of splits | Multi-way splits | Binary splits only; clever surrogate splits to reduce tree depth | Multi-way splits as default
Splitting criteria | Information gain | Gini's coefficient, and others | Chi-square test
Pruning Criteria | Clever bottom-up technique avoids overfitting | Remove weakest links first | Trees can become very large
Implementation | Publicly available | Publicly available in most packages | Popular in market research, for segmentation
Conclusion
Decision trees are the most popular, versatile, and easy to use data mining technique with high predictive accuracy. They are also very useful as communication tools with executives. There are many successful decision tree algorithms. All publicly available data mining software platforms offer multiple decision tree implementations.
Review Questions
1: What is a decision tree? Why are decision trees the most popular classification technique?

2: What is a splitting variable? Describe three criteria for choosing a splitting variable.

3: What is pruning? What are pre-pruning and post-pruning? Why choose one over the other?

4: What are Gini's coefficient and information gain? (Hint: google it).
Hands-on Exercise: Create a decision tree for the following data set. The objective is to predict the class category (loan approved or not).

Age | Job | House | Credit | Loan Approved
Young | False | No | Fair | No
Young | False | No | Good | No
Young | True | No | Good | Yes
Young | True | Yes | Fair | Yes
Young | False | No | Fair | No
Middle | False | No | Fair | No
Middle | False | No | Good | No
Middle | True | Yes | Good | Yes
Middle | False | Yes | Excellent | Yes
Middle | False | Yes | Excellent | Yes
Old | False | Yes | Excellent | Yes
Old | False | Yes | Good | Yes
Old | True | No | Good | Yes
Old | True | No | Excellent | Yes
Old | False | No | Fair | No
Then solve the following problem using the model.

Age | Job | House | Credit | Loan Approved
Young | False | False | Good | ??
Liberty Stores Case Exercise: Step 5
Liberty is constantly evaluating requests for opening new stores. They would like to formalize the process for handling many requests, so that the best candidates are selected for detailed evaluation.

Develop a decision tree for evaluating new store options. Here is the training data:

City-size | Avg Income | Local investors | LOHAS awareness | Decision
Big | High | Yes | High | Yes
Med | Med | No | Med | No
Small | Low | Yes | Low | No
Big | High | No | High | Yes
Small | Med | Yes | High | No
Med | High | Yes | Med | Yes
Med | Med | Yes | Med | No
Big | Med | No | Med | No
Med | High | Yes | Low | No
Small | High | No | High | Yes
Small | Med | No | High | No
Med | High | No | Med | No

Use the decision tree to answer the following question:

City-size | Avg Income | Local investors | LOHAS awareness | Decision
Med | Med | No | Med | ??
Chapter 7: Regression

Regression is a well-known statistical technique to model the predictive relationship between several independent variables (IVs) and one dependent variable (DV). The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve. The quality of fit of the curve to the data can be measured by a coefficient of correlation (r), which is the square root of the amount of variance explained by the curve.

The key steps for regression are simple:

1. List all the variables available for making the model.
2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict the DV using the other variables.
Caselet: Data driven Prediction Markets
Traditional pollsters still seem to be using methodologies that worked well a decade or two ago. Nate Silver is one of a new breed of data-based political forecasters who are steeped in big data and advanced analytics. In the 2012 elections, he predicted that Obama would win the election with 291 electoral votes, compared to 247 for Mitt Romney, giving the President a 62% lead and re-election. He stunned the political forecasting world by correctly predicting the Presidential winner in all 50 states, including all nine swing states. He also correctly predicted the winner in 31 of the 33 US Senate races.

Nate Silver brings a different view to the world of forecasting political elections, viewing it as a scientific discipline. State the hypothesis scientifically, gather all available information, analyze the data and extract insights using sophisticated models and algorithms, and finally, apply human judgment to interpret those insights. The results are likely to be much more grounded and successful. (Source: The Signal and the Noise: Why Most Predictions Fail but Some Don't, by Nate Silver, 2012)

Q1: What is the impact of this story on traditional pollsters & commentators?
Correlations and Relationships
Statistical relationships are about which elements of data hang together, and which ones hang separately. It is about categorizing variables that have a relationship with one another, and categorizing variables that are distinct and unrelated to other variables. It is about describing significant positive relationships and significant negative differences.

The first and foremost measure of the strength of a relationship is co-relation (or correlation). The strength of a correlation is a quantitative measure that is expressed in a normalized range between 0 (zero) and 1. A correlation of 1 indicates a perfect relationship, where the two variables are in perfect sync. A correlation of 0 indicates that there is no relationship between the variables.

The relationship can be positive, or it can be an inverse relationship; that is, the variables may move together in the same direction or in opposite directions. Therefore, a good measure of correlation is the correlation coefficient, which carries a sign as well as a magnitude: it is the square root of the proportion of variance explained, signed by the direction of the relationship. This coefficient, called r, can thus range from −1 to +1. An r value of 0 signifies no relationship. An r value of 1 shows a perfect relationship in the same direction, and an r value of −1 shows a perfect relationship but moving in opposite directions.

Given two numeric variables x and y, the coefficient of correlation r is mathematically computed by the following equation. x̄ (called x-bar) is the mean of x, and ȳ (y-bar) is the mean of y.
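The equation itself did not survive in this copy of the text. The standard (Pearson) definition of the correlation coefficient, written with the same symbols, is given below; it is assumed to be the formula intended here. The index i runs over the n paired observations.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to be above (or below) their means together, and negative when they move in opposite directions, which is what gives r its sign.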
Visual look at relationships
A scatter plot (or scatter diagram) is a simple exercise for plotting all data points between two variables on a two-dimensional graph. It provides a visual layout of where all the data points are placed in that two-dimensional space. The scatter plot can be useful for graphically intuiting the relationship between two variables.

Here is a picture (Figure 7.1) that shows many possible patterns in scatter diagrams.

Figure 7.1: Scatter plots showing types of relationships among two variables (Source: Groebner et al. 2013)

Chart (a) shows a very strong linear relationship between the variables x and y. That means the value of y increases proportionally with x. Chart (b) also shows a strong linear relationship between the variables x and y. Here it is an inverse relationship. That means the value of y decreases proportionally with x.

Chart (c) shows a curvilinear relationship. It is an inverse relationship, which means that the value of y decreases as x increases, though not at a constant rate. However, it seems a relatively well-defined relationship, like an arc of a circle, which can be represented by a simple quadratic equation (quadratic means the power of two, that is, using terms like x² and y²). Chart (d) shows a positive curvilinear relationship. However, it does not seem to resemble a regular shape, and thus would not be a strong relationship. Charts (e) and (f) show no relationship. That means variables x and y are independent of each other.
Charts (a) and (b) are good candidates for a simple linear regression model (the terms regression model and regression equation can be used interchangeably). Chart (c) too could be modeled with a slightly more complex, quadratic regression equation. Chart (d) might require an even higher order polynomial regression equation to represent the data. Charts (e) and (f) have no relationship; thus, they cannot be modeled together, by regression or using any other modeling tool.
Regression Exercise
The regression model is described as a linear equation that follows. y is the dependent variable, that is, the variable being predicted. x is the independent variable, or the predictor variable. There could be many predictor variables (such as x1, x2, ...) in a regression equation. However, there can be only one dependent variable (y) in the regression equation.

y = β0 + β1 x + ε

A simple example of a regression equation would be to predict a house price from the size of the house. Here is some sample house price data:
House Price | Size (sqft)
$229,500 | 1850
$273,300 | 2190
$247,000 | 2100
$195,100 | 1930
$261,000 | 2300
$179,700 | 1710
$168,500 | 1550
$234,400 | 1920
$168,800 | 1840
$180,400 | 1720
$156,200 | 1660
$288,350 | 2405
$186,750 | 1525
$202,100 | 2030
$256,800 | 2240
The two-dimensional data (one predictor, one outcome variable) can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 7.2).

Figure 7.2: Scatter plot and regression equation between House price and house size.

Visually, one can see a positive correlation between House Price and Size (sqft). However, the relationship is not perfect. Running a regression model between the two variables produces the following output (truncated).
Regression Statistics
r | 0.891
r² | 0.794

Coefficients
Intercept | -54191
Size (sqft) | 139.48
It shows the coefficient of correlation is 0.891. r², the measure of total variance explained by the equation, is 0.794, or 79%. That means the two variables are moderately and positively correlated. Regression coefficients help create the following equation for predicting house prices.

House Price ($) = 139.48 * Size (sqft) – 54191
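The regression output above can be reproduced from the fifteen data points with the ordinary least-squares formulas for a single predictor. The following is a minimal sketch in plain Python; the function and variable names are illustrative, and any statistics package would give the same numbers.

```python
from math import sqrt

sizes = [1850, 2190, 2100, 1930, 2300, 1710, 1550, 1920,
         1840, 1720, 1660, 2405, 1525, 2030, 2240]
prices = [229500, 273300, 247000, 195100, 261000, 179700, 168500, 234400,
          168800, 180400, 156200, 288350, 186750, 202100, 256800]


def fit_line(x, y):
    """Least-squares slope and intercept, plus the correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(sizes, prices)
print(f"slope={slope:.2f}  intercept={intercept:.0f}  r={r:.3f}  r2={r * r:.3f}")
```

The slope says that, in this sample, each additional square foot is associated with roughly $139 of additional price.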
This equation explains only 79% of the variance in house prices. Suppose other predictor variables are made available, such as the number of rooms in the house. It might help improve the regression model.

The house data now looks like this:
House Price | Size (sqft) | # Rooms
$229,500 | 1850 | 4
$273,300 | 2190 | 5
$247,000 | 2100 | 4
$195,100 | 1930 | 3
$261,000 | 2300 | 4
$179,700 | 1710 | 2
$168,500 | 1550 | 2
$234,400 | 1920 | 4
$168,800 | 1840 | 2
$180,400 | 1720 | 2
$156,200 | 1660 | 2
$288,350 | 2405 | 5
$186,750 | 1525 | 3
$202,100 | 2030 | 2
$256,800 | 2240 | 4
While it is possible to make a 3-dimensional scatter plot, one can alternatively examine the correlation matrix among the variables.

 | House Price | Size (sqft) | # Rooms
House Price | 1 | |
Size (sqft) | 0.891 | 1 |
# Rooms | 0.944 | 0.748 | 1
It shows that the House price has a strong correlation with the number of rooms (0.944) as well. Thus, it is likely that adding this variable to the regression model will add to the strength of the model.

Running a regression model between these three variables produces the following output (truncated).
Regression Statistics
r | 0.984
r² | 0.968

Coefficients
Intercept | 12923
Size (sqft) | 65.60
Rooms | 23613
It shows the coefficient of correlation of this regression model is 0.984. r², the total variance explained by the equation, is 0.968 or 97%. That means the variables are positively and very strongly correlated. Adding a new relevant variable has helped improve the strength of the regression model.

Using the regression coefficients helps create the following equation for predicting house prices.

House Price ($) = 65.6 * Size (sqft) + 23613 * Rooms + 12924

This equation shows a 97% goodness of fit with the data, which is very good for business and economic data. There is always some random variation in naturally occurring business data, and it is not desirable to overfit the model to the data.

This predictive equation should be used for future transactions. Given a situation as below, it will be possible to predict the price of the house with 2000 sq ft and 3 rooms.
House Price | Size (sqft) | # Rooms
?? | 2000 | 3

House Price ($) = 65.6 * 2000 (sqft) + 23613 * 3 + 12924 = $214,963
The predicted values should be compared with the actual values to see how close the model is able to predict the actual value. As new data points become available, there are opportunities to fine-tune and improve the model.

Non-linear regression exercise
The relationship between the variables may also be curvilinear. For example, given past data on electricity consumption (KWatts) and temperature (temp), the objective is to predict the electrical consumption from the temperature value. Here are a dozen past observations.
KWatts | Temp (F)
12530 | 46.8
10800 | 52.1
10180 | 55.1
9730 | 59.2
9750 | 61.9
10230 | 66.2
11160 | 69.9
13910 | 76.8
15690 | 79.3
15110 | 79.7
17020 | 80.2
17880 | 83.3
In two dimensions (one predictor, one outcome variable) the data can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph below (Figure 7.3).

Figure 7.3: Scatter plots showing regression between (a) kwatts and temp, and (b) kwatts and temp square

It is visually clear that the first line does not fit the data well. The relationship between temperature and Kwatts follows a curvilinear model, where it hits bottom at a certain value of temperature. The regression model confirms the poor linear fit, since r is only 0.77 and r² is only 60%. Thus, only 60% of the variance is explained.

The regression model can then be enhanced using a Temp² variable in the equation. The second line is the relationship between Kwatts and Temp². The scatter plot shows that the energy consumption has a strong linear relationship with the quadratic Temp² variable. Running the regression model after adding the quadratic variable leads to the following results:
Regression Statistics
r | 0.992
r² | 0.984

Coefficients
Intercept | 67245
Temp (F) | -1911
Temp-sq | 15.87
It shows that the coefficient of correlation of the regression model is now 0.992. r², the total variance explained by the equation, is 0.984, or 98.4%. That means the variables are very strongly and positively correlated. The regression coefficients help create the following equation for predicting energy consumption.

Energy Consumption (Kwatts) = 15.87 * Temp² - 1911 * Temp + 67245

This equation shows a 98.4% fit, which is very good for business and economic contexts. Now one can predict the Kwatts value for when the temperature is 72 degrees.

Energy consumption = (15.87 * 72 * 72) - (1911 * 72) + 67245 = 11923 Kwatts
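The fitted quadratic equation can be wrapped in a small function to check the arithmetic and to locate the temperature at which consumption bottoms out. The sketch below uses the coefficients reported above; the function name is illustrative, and the turning point comes from the usual vertex formula for a parabola, -b / (2a).

```python
def energy_kwatts(temp_f):
    """Predicted consumption from the fitted quadratic regression equation."""
    return 15.87 * temp_f ** 2 - 1911 * temp_f + 67245


prediction = energy_kwatts(72)

# Vertex of a*T^2 + b*T + c with a = 15.87 and b = -1911.
turning_point = 1911 / (2 * 15.87)

print(round(prediction), "Kwatts at 72 degrees")
print(round(turning_point, 1), "degrees F is where predicted consumption is lowest")
```

The turning point falls between the two lowest observations in the data (59.2 and 61.9 degrees), which is consistent with the curvilinear pattern seen in the scatter plot.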
Logistic Regression
Regression models traditionally work with continuous numeric value data for dependent and independent variables. Logistic regression models can, however, work with dependent variables with binary values, such as whether a loan is approved (yes or no). Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables. For example, logistic regression might be used to predict whether a patient has a given disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body mass index, results of blood tests, etc.).

Logistic regression models use probability scores as the predicted values of the dependent variable. Logistic regression takes the natural logarithm of the odds of the dependent variable being a case (referred to as the logit) to create a continuous criterion as a transformed version of the dependent variable. Thus the logit transformation is used in logistic regression as the dependent variable. The net effect is that although the dependent variable in logistic regression is binary (i.e. it has only two possible values), the logit is the continuous function upon which linear regression is conducted. Here is the general logistic function, with the independent variable on the horizontal axis and the logit dependent variable on the vertical axis (Figure 7.4).

Figure 7.4: General Logit function

All popular data mining platforms provide support for regular multiple regression models, as well as options for Logistic Regression.
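The logit transformation and its inverse, the logistic function, are short enough to write out. In the sketch below, the coefficients passed to `predict_probability` are hypothetical values chosen only to show the mechanics; they are not fitted to any data in this chapter.

```python
from math import exp, log


def logit(p):
    """Natural log of the odds p / (1 - p), for a probability 0 < p < 1."""
    return log(p / (1 - p))


def logistic(z):
    """Inverse of the logit: maps any real number back to a probability."""
    return 1 / (1 + exp(-z))


def predict_probability(b0, b1, x):
    """Probability of a 'yes' when the logit is the linear score b0 + b1 * x."""
    return logistic(b0 + b1 * x)


# Hypothetical coefficients, for illustration only.
p = predict_probability(b0=-4.0, b1=0.1, x=50)
print(round(p, 3), "yes" if p >= 0.5 else "no")
```

The linear equation is fitted on the logit scale, and the logistic function converts its output into a probability between 0 and 1, which can then be cut at a threshold such as 0.5 to give a yes/no decision.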
Advantages and Disadvantages of Regression Models
Regression models are very popular because they offer many advantages.

1. Regression models are easy to understand as they are built upon basic statistical principles such as correlation and least square error.
2. Regression models provide simple algebraic equations that are easy to understand and use.
3. The strength (or the goodness of fit) of the regression model is measured in terms of the correlation coefficients, and other related statistical parameters that are well understood.
4. Regression models can often match or beat the predictive power of other modeling techniques.
5. Regression models can include all the variables that one wants to include in the model.
6. Regression modeling tools are pervasive. They are found in statistical packages as well as data mining packages. MS Excel spreadsheets can also provide simple regression modeling capabilities.

Regression models can however prove inadequate under many circumstances.

1. Regression models cannot cover for poor data quality issues. If the data is not prepared well to remove missing values, or is not well-behaved in terms of a normal distribution, the validity of the model suffers.
2. Regression models suffer from collinearity problems (meaning strong linear correlations among some independent variables). If the independent variables have strong correlations among themselves, then they will eat into each other's predictive power and the regression coefficients will lose their robustness. Regression models will not automatically choose between highly collinear variables, although some packages attempt to do that.
3. Regression models can be unwieldy and unreliable if a large number of variables are included in the model. All variables entered into the model will be reflected in the regression equation, irrespective of their contribution to the predictive power of the model. There is no concept of automatic pruning of the regression model.
4. Regression models do not automatically take care of non-linearity. The user needs to imagine the kind of additional terms that might need to be added to the regression model to improve its fit.
5. Regression models work only with numeric data and not with categorical variables. There are ways to deal with categorical variables, though, by creating multiple new variables with a yes/no value.
Conclusion
Regression models are simple, versatile, visual/graphical tools with high predictive ability. They include non-linear as well as binary predictions. Regression models should be used in conjunction with other data mining techniques to confirm the findings.
***
Review Exercises:
Q1: What is a regression model?

Q2: What is a scatter plot? How does it help?

Q3: Compare and contrast decision trees with regression models.

Q4: Using the data below, create a regression model to predict the Test2 score from the Test1 score. Then predict the score for one who got a 46 in Test1.

Test1 | Test2
59 | 56
52 | 63
44 | 55
51 | 50
42 | 66
42 | 48
41 | 58
45 | 36
27 | 13
63 | 50
54 | 81
44 | 56
50 | 64
47 | 50
Liberty Stores Case Exercise: Step 6
Liberty wants to forecast its sales for next year, for financial budgeting.

Year | Global GDP index per capita | # cust serv calls ('000s) | # employees ('000) | # Items ('000) | Revenue ($M)
1 | 100 | 25 | 45 | 11 | 2000
2 | 112 | 27 | 53 | 11 | 2400
3 | 115 | 22 | 54 | 12 | 2700
4 | 123 | 27 | 58 | 14 | 2900
5 | 122 | 32 | 60 | 14 | 3200
6 | 132 | 33 | 65 | 15 | 3500
7 | 143 | 40 | 72 | 16 | 4000
8 | 126 | 30 | 65 | 16 | 4200
9 | 166 | 34 | 85 | 17 | 4500
10 | 157 | 47 | 97 | 18 | 4700
11 | 176 | 33 | 98 | 18 | 4900
12 | 180 | 45 | 100 | 20 | 5000

Check the correlations. Which variables are strongly correlated?

Create a regression model that best predicts the revenue.
Chapter 8: Artificial Neural Networks

Artificial Neural Networks (ANN) are inspired by the information processing model of the mind/brain. The human brain consists of billions of neurons that link with one another in an intricate pattern. Every neuron receives information from many other neurons, processes it, gets excited or not, and passes its state information to other neurons.

Just as the brain is a multipurpose system, so also are ANNs very versatile systems. They can be used for many kinds of pattern recognition and prediction. They are also used for classification, regression, clustering, association, and optimization activities. They are used in finance, marketing, manufacturing, operations, information systems applications, and so on.

ANNs are composed of a large number of highly interconnected processing elements (neurons) working in multi-layered structures that receive inputs, process the inputs, and produce an output. An ANN is designed for a specific application, such as pattern recognition or data classification, and trained through a learning process. Just like in biological systems, ANNs make adjustments to the synaptic connections with each learning instance.

ANNs are like a black box trained into solving a particular type of problem, and they can develop high predictive powers. Their intermediate synaptic parameter values evolve as the system obtains feedback on its predictions, and thus an ANN learns from more training data (Figure 8.1).

Figure 8.1: General ANN model
Caselet: IBM Watson - Analytics in Medicine
The amount of medical information available is doubling every five years and much of this data is unstructured. Physicians simply don't have time to read every journal that can help them keep up to date with the latest advances. Mistakes in diagnosis are likely to happen and clients have become more aware of the evidence. Analytics will transform the field of medicine into evidence-based medicine. How can healthcare providers address these problems?

IBM's Watson cognitive computing system can analyze large amounts of unstructured text and develop hypotheses based on that analysis. Physicians can use Watson to assist in diagnosing and treating patients. First, the physician might describe symptoms and other related factors to the system. Watson can then identify the key pieces of information and mine the patient's data to find relevant facts about family history, current medications and other existing conditions. It combines this information with current findings from tests, and then forms and tests hypotheses by examining a variety of data sources: treatment guidelines, electronic medical record data and doctors' and nurses' notes, as well as peer-reviewed research and clinical studies. From here, Watson can provide potential treatment options and its confidence rating for each suggestion.

Watson has been deployed at many leading healthcare institutions to improve the quality and efficiency of healthcare decisions, and to help clinicians uncover insights from patient information in electronic medical records (EMR), among other benefits.

Q1: How would IBM Watson change medical practices in the future?
Q2: In what other industries & functions could this technology be applied?
Business Applications of ANN
Neural networks are used most often when the objective function is complex, where there exists plenty of data, and where the model is expected to improve over a period of time. A few sample applications:

1. They are used in stock price prediction, where the rules of the game are extremely complicated, and a lot of data needs to be processed very quickly.
2. They are used for character recognition, as in recognizing hand-written text, or damaged or mangled text. They are used in recognizing fingerprints. These are complicated patterns and are unique for each person. Layers of neurons can progressively clarify the pattern, leading to a remarkably accurate result.
3. They are also used in traditional classification problems, like approving a financial loan application.
Design Principles of an Artificial Neural Network
1. A neuron is the basic processing unit of the network. The neuron (or processing element, PE) receives inputs from its preceding neurons (or PEs), does some nonlinear weighted computation on the basis of those inputs, transforms the result into its output value, and then passes on the output to the next neuron in the network (Figure 8.2). X's are the inputs, w's are the weights for each input, and y is the output.

Figure 8.2: Model for a single artificial neuron

2. A neural network is a multi-layered model. There is at least one input neuron, one output neuron, and at least one processing neuron. An ANN with just this basic structure would be a simple, single-stage computational unit. A simple task may be processed by just that one neuron and the result may be communicated soon. ANNs, however, may have multiple layers of processing elements in sequence. There could be many neurons involved in a sequence depending upon the complexity of the predictive action. The layers of PEs could work in sequence, or they could work in parallel (Figure 8.3).

Figure 8.3: Model for a multi-layer ANN

3. The processing logic of each neuron may assign different weights to the various incoming input streams. The processing logic may also use a non-linear transformation, such as a sigmoid function, from the processed values to the output value. This processing logic and the intermediate weights and processing functions are just what works for the system as a whole, in its objective of solving a problem collectively. Thus, neural networks are considered to be opaque, black-box systems.
4. The neural network can be trained by making similar decisions over and over again with many training cases. It will continue to learn by adjusting its internal computation and communication based on feedback about its previous decisions. Thus, neural networks become better at making a decision as they handle more and more decisions.

Depending upon the nature of the problem and the availability of good training data, at some point the neural network will learn enough and begin to match the predictive accuracy of a human expert. In many practical situations, the predictions of ANNs, trained over a long period of time with a large amount of training data, have begun to be decisively more accurate than those of human experts. At that point ANNs can begin to be seriously considered for deployment in real situations in real time.
161
Representation of a Neural Network
A neural network is a series of neurons that receive inputs from other neurons. They do a weighted summation function of all the inputs, using different weights (or importance) for each input. The weighted sum is then transformed into an output value using a transfer function.
Learning in ANN occurs when the various processing elements in the neural network adjust the underlying relationship (weights, transfer function, etc.) between input and outputs, in response to the feedback on their predictions. If the prediction made was correct, then the weights would remain the same, but if the prediction was incorrect, then the parameter values would change.
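The idea that weights stay put after a correct prediction and change after an incorrect one can be sketched with the classic perceptron learning rule. This is a deliberately simple stand-in for back-propagation, not the method of any particular data mining platform; the function names, the learning rate of 1.0, and the logical-AND training cases are all assumptions made for illustration.

```python
def step(value):
    """Threshold transfer function: output 1 for a positive sum, else 0."""
    return 1 if value > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Train a single neuron with the perceptron rule (illustrative only)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            error = target - step(total)  # 0 when the prediction was correct
            # A correct prediction (error == 0) leaves every weight unchanged;
            # an incorrect one nudges the weights toward the target output.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training cases: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [
    step(sum(w * x for w, x in zip(weights, inputs)) + bias)
    for inputs, _ in and_samples
]
print(predictions)
```

A single neuron of this kind can only learn linearly separable patterns, which is one reason practical ANNs stack several layers of processing elements.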
The Transformation (Transfer) Function is any function suitable for the task at hand. The transfer function for ANNs is usually a non-linear sigmoid function, which squeezes the computed value into the range between 0 and 1. The result can then be compared with a cut-off threshold: if the normalized computed value is less than some value (say 0.5), the output value will be zero; if the computed value is at or above the cut-off threshold, the output value will be 1. The transfer function could also be a non-linear hyperbolic tangent function, in which the output lies between -1 and 1. Many other functions could be designed for any or all of the processing elements.
Thus, in a neural network, every processing element can potentially have a different number of input values, a different set of weights for those inputs, and a different transformation function. Those values support and compensate for one another until the neural network as a whole learns to provide the correct output, as desired by the user.
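A minimal sketch of one processing element may make the weighted summation and the transfer function concrete. The input values, weights, and function names below are made up for illustration; only the standard-library `math` functions are real.

```python
import math


def sigmoid(value):
    """Sigmoid transfer function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def neuron_output(inputs, weights, bias=0.0, transfer=sigmoid):
    """Weighted sum of the inputs, passed through a transfer function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(weighted_sum)


# Illustrative values only: three input streams with different importance.
inputs = [0.5, 0.2, 0.9]
weights = [0.4, -0.6, 0.3]

out_sigmoid = neuron_output(inputs, weights)                   # lies in (0, 1)
out_tanh = neuron_output(inputs, weights, transfer=math.tanh)  # lies in (-1, 1)
print(out_sigmoid, out_tanh)
```

Passing the transfer function as a parameter mirrors the point that each processing element may use a different transformation.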
Architecting a Neural Network
There are many ways to architect the functioning of an ANN using fairly simple and open rules with a tremendous amount of flexibility at each stage. The most popular architecture is a feed-forward, multi-layered perceptron with a back-propagation learning algorithm. That means there are multiple layers of PEs in the system and the outputs of neurons are fed forward to the PEs in the next layers; and the feedback on the prediction is fed back into the neural network for learning to occur. This is essentially what was described in the earlier paragraphs. ANN architectures for different applications are shown in Table 8.1.
Classification: Feedforward networks (MLP), radial basis function, and probabilistic
Regression: Feedforward networks (MLP), radial basis function
Clustering: Adaptive resonance theory (ART), Self-organizing maps (SOMs)
Association Rule Mining: Hopfield networks
Table 8.1: ANN architectures for different applications
Developing an ANN
It takes resources, training data, skill and time to develop a neural network. Most data mining platforms offer at least the Multi-Layer Perceptron (MLP) algorithm to implement a neural network. Other neural network architectures include Probabilistic networks and Self-organizing feature maps.
The steps required to build an ANN are as follows:
1. Gather data. Divide it into training data and test data. The training data needs to be further divided into training data and validation data.
2. Select the network architecture, such as a feedforward network.
3. Select the algorithm, such as Multi-Layer Perceptron.
4. Set network parameters.
5. Train the ANN with training data.
6. Validate the model with validation data.
7. Freeze the weights and other parameters.
8. Test the trained network with test data.
9. Deploy the ANN when it achieves good predictive accuracy.
Training an ANN requires that the data be split into three parts (Table 8.2):
Training set: This data set is used to adjust the weights on the neural network (∼60%).
Validation set: This data set is used to minimize overfitting and to verify accuracy (∼20%).
Testing set: This data set is used only for testing the final solution in order to confirm the actual predictive power of the network (∼20%).
k-fold cross-validation: In this approach the data is divided into k equal pieces, and the learning process is repeated k times, with each piece in turn held out for validation while the remaining pieces form the training set. This process leads to less bias and more accuracy, but is more time consuming.
Table 8.2: ANN Training datasets
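The splits in Table 8.2 can be sketched in a few lines of plain Python. The helper names, the seed, and the stand-in list of 100 records are assumptions for illustration; data mining platforms usually provide their own utilities for this.

```python
import random


def split_dataset(records, train=0.6, validation=0.2, seed=42):
    """Shuffle and split records into training, validation and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


def k_fold_indices(n_records, k):
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


records = list(range(100))  # stand-in for 100 data rows
train_set, validation_set, test_set = split_dataset(records)
folds = list(k_fold_indices(10, 5))
print(len(train_set), len(validation_set), len(test_set))
```

Each record appears in exactly one validation fold, so every piece of data is used for both learning and checking across the k repetitions.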
Advantages and Disadvantages of using ANNs
There are many benefits of using ANNs.
1. ANNs impose very few restrictions on their use. ANNs can deal with (identify/model) highly nonlinear relationships on their own, without much work from the user or analyst. They help find practical data-driven solutions where algorithmic solutions are non-existent or too complicated.
2. There is no need to program neural networks, as they learn from examples. They get better with use, without much programming effort.
3. They can handle a variety of problem types, including classification, clustering, associations, etc.
4. ANNs are tolerant of data quality issues and they do not restrict the data to follow strict normality and/or independence assumptions.
5. They can handle both numerical and categorical variables.
6. ANNs can be much faster than other techniques.
7. Most importantly, they usually provide better results (prediction and/or clustering) compared to statistical counterparts, once they have been trained enough.
The key disadvantages arise from the fact that they are not easy to interpret, explain, or compute.
1. They are deemed to be black-box solutions, lacking explainability. Thus they are difficult to communicate about, except through the strength of their results.
2. Optimal design of an ANN is still an art: it requires expertise and extensive experimentation.
3. It can be difficult to handle a large number of variables (especially the rich nominal attributes).
4. It takes large data sets to train an ANN.
Conclusion
Artificial neural networks are complex systems that mirror the functioning of the human brain. They are versatile enough to solve many data mining tasks with high accuracy. However, they are like black boxes and they provide little guidance on the intuitive logic behind their predictions.
Review Exercises
1: What is a neural network? How does it work?
2: Compare a neural network with a decision tree.
3: What makes a neural network versatile enough for supervised as well as non-supervised learning tasks?
4: Examine the steps in developing a neural network for predicting stock prices. What kind of objective function and what kind of data would be required for a good stock price predictor system using ANN?
***
Chapter 9: Cluster Analysis
Cluster analysis is used for automatic identification of natural groupings of things. It is also known as the segmentation technique. In this technique, data instances that are similar to (or near) each other are categorized into one cluster. Similarly, data instances that are very different (or far away) from each other are moved into different clusters.
Clustering is an unsupervised learning technique, as there is no output or dependent variable for which a right or wrong answer can be computed. The correct number of clusters or the definition of those clusters is not known ahead of time. Clustering techniques can only suggest to the user how many clusters would make sense from the characteristics of the data. The user can specify a different, larger or smaller, number of desired clusters based on what makes business sense. The cluster analysis technique will then define that many distinct clusters from analysis of the data, with cluster definitions for each of those clusters. However, some cluster definitions are better than others, depending on how closely the cluster parameters fit the data.
Caselet: Cluster Analysis
A national insurance company distributes its personal and small commercial insurance products through independent agents. They wanted to increase their sales by better understanding their customers. They were interested in increasing their market share by doing some direct marketing campaigns, however without creating a channel conflict with the independent agents. They were also interested in examining different customer segments based on their needs, and the profitability of each of those segments.
They gathered attitudinal, behavioral, and demographic data using a mail survey of 2000 U.S. households that own auto insurance. Additional geo-demographic and credit information was added to the survey data. Cluster analysis of the data revealed five roughly equal segments:
- Non-Traditionals: interested in using the Internet and/or buying insurance at work.
- Direct Buyers: interested in buying via direct mail or telephone.
- Budget Conscious: interested in minimal coverage and finding the best deal.
- Agent Loyals: expressed strong loyalty to their agents and high levels of personal service.
- Hassle-Free: similar to Agent Loyals but less interested in face-to-face service.
(Source: greenbook.org)
Q1. Which customer segments would you choose for direct marketing? Will these create a channel conflict?
Q2. Could this segmentation apply to other service businesses? Which ones?
Applications of Cluster Analysis
Cluster analysis is used in almost every field where there is a large variety of transactions. It helps provide characterization, definition, and labels for populations. It can help identify natural groupings of customers, products, patients, and so on. It can also help identify outliers in a specific domain and thus decrease the size and complexity of problems. A prominent business application of cluster analysis is in market research. Customers are segmented into clusters based on their characteristics: wants and needs, geography, price sensitivity, and so on. Here are some examples of clustering:
1. Market Segmentation: Categorizing customers according to their similarities, for instance by their common wants and needs, and propensity to pay, can help with targeted marketing.
2. Product portfolio: People of similar sizes can be grouped together to make small, medium and large sizes for clothing items.
3. Text Mining: Clustering can help organize a given collection of text documents according to their content similarities into clusters of related topics.
Definition of a Cluster
An operational definition of a cluster is that, given a representation of n objects, find K groups based on a measure of similarity, such that objects within the same group are alike but the objects in different groups are not alike.
However, the notion of similarity can be interpreted in many ways. Clusters can differ in terms of their shape, size, and density. Clusters are patterns, and there can be many kinds of patterns. Some clusters are the traditional types, such as data points hanging together. However, there are other clusters, such as all points representing the circumference of a circle. There may be concentric circles with points of different circles representing different clusters. The presence of noise in the data makes the detection of the clusters even more difficult.
An ideal cluster can be defined as a set of points that is compact and isolated. In reality, a cluster is a subjective entity whose significance and interpretation requires domain knowledge. In the sample data below (Figure 9.1), how many clusters can one visualize?
Figure 9.1: Visual cluster example
It seems like there are two clusters of approximately equal sizes. However, they can be seen as three clusters, depending on how we draw the dividing lines. There is not a truly optimal way to calculate it. Heuristics are often used to define the number of clusters.
Representing clusters
The clusters can be represented by a central or modal value. A cluster can be defined as the centroid of the collection of points belonging to it. A centroid is a measure of central tendency. It is the point from where the sum total of squared distance from all the points is the minimum. A real-life equivalent would be the city center as the point that is considered the easiest to use by all constituents of the city. Thus all cities are defined by their centers or downtown areas.
A cluster can also be represented by the most frequently occurring value in the cluster, i.e. the cluster can be defined by its modal value. Thus, a particular cluster representing a social point of view could be called the 'soccer moms', even though not all members of that cluster need currently be a mom with soccer-playing children.
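Both representations are easy to compute. The sketch below, with made-up points and labels, finds a centroid as the coordinate-wise mean, checks that moving away from it increases the total squared distance, and picks a modal label; the helper names are illustrative assumptions.

```python
from collections import Counter


def centroid(points):
    """Coordinate-wise mean of a collection of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_squared_distance(center, points):
    """Total squared distance from a candidate center to all points."""
    return sum(
        sum((c - x) ** 2 for c, x in zip(center, point)) for point in points
    )


def modal_value(labels):
    """Most frequently occurring label in a cluster."""
    return Counter(labels).most_common(1)[0][0]


# Illustrative data only.
points = [(1, 2), (3, 4), (5, 0)]
center = centroid(points)
at_centroid = sum_squared_distance(center, points)
elsewhere = sum_squared_distance((3, 3), points)  # any other point does worse

label = modal_value(["soccer mom", "soccer mom", "commuter"])
print(center, at_centroid, elsewhere, label)
```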
Clustering techniques
Cluster analysis is a machine-learning technique. The quality of a clustering result depends on the algorithm, the distance function, and the application. First, consider the distance function. Most cluster analysis methods use a distance measure to calculate the closeness between pairs of items. There are two major measures of distance. Euclidean distance ("as the crow flies" or straight line) is the most intuitive measure. The other popular measure is the Manhattan (rectilinear) distance, where one can go only in orthogonal directions. The Euclidean distance is the hypotenuse of a right triangle, while the Manhattan distance is the sum of the two legs of the right triangle.
In either case, the key objective of the clustering algorithm is the same:
- Inter-cluster distance ⇒ maximized; and
- Intra-cluster distance ⇒ minimized
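The two distance measures can be compared on a single pair of points. The function names and the sample points below are illustrative; the points are chosen to form a 3-4-5 right triangle so the hypotenuse and the sum of the legs are easy to check by hand.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance: the hypotenuse of the right triangle."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Rectilinear distance: the sum of the legs of the right triangle."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (1, 1), (4, 5)  # legs of length 3 and 4
euclid = euclidean_distance(p, q)
manhattan = manhattan_distance(p, q)
print(euclid, manhattan)
```

The Manhattan distance is never smaller than the Euclidean distance, so the choice of measure can change which cluster center counts as closest.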
There are many algorithms to produce clusters. There are top-down, hierarchical methods that start with creating a given number of best-fitting clusters. There are also bottom-up methods that begin with identifying naturally occurring clusters.
The most popular clustering algorithm is the K-means algorithm. It is a top-down, statistical technique, based on the method of minimizing the squared distance from the center points of the clusters. Other techniques, such as neural networks, are also used for clustering. Comparing cluster algorithms is a difficult task as there is no single right number of clusters. However, the speed of the algorithm and its versatility in terms of different datasets are important criteria.
Here is the generic pseudocode for clustering:
1. Pick an arbitrary number of groups/segments to be created
2. Start with some initial randomly-chosen center values for groups
3. Classify instances to closest groups
4. Compute new values for the group centers
5. Repeat steps 3 & 4 till groups converge
6. If clusters are not satisfactory, go to step 1 and pick a different number of groups/segments
The clustering exercise can be continued with a different number of clusters and different locations of those points. Clusters are considered good if the cluster definitions stabilize, and the stabilized definitions prove useful for the purpose at hand. Else, repeat the clustering exercise with a different number of clusters, and different starting points for group means.
Clustering Exercise
Here is a simple exercise to visually and intuitively identify clusters from data. X and Y are two dimensions of interest. The objective is to determine the number of clusters, and the center points of those clusters.
X    Y
2    4
2    6
5    6
4    7
8    3
6    6
5    2
5    7
6    3
4    4
A scatter plot of 10 items in 2 dimensions shows them distributed fairly randomly. As a bottom-up technique, the number of clusters and their centroids can be intuited (Figure 9.2).
Figure 9.2: Initial data points and the centroid (shown as thick dot)
The points are distributed randomly enough that it could be considered one cluster. The solid circle would represent the central point (centroid) of these points.
However, there is a big distance between the points (2,6) and (8,3). So, this data could be broken into 2 clusters. The three points at the bottom right could form one cluster and the other seven could form the other cluster. The two clusters would look like this (Figure 9.3). The two circles will be the new centroids.
Figure 9.3: Dividing into two clusters (centroids shown as thick dots)
The bigger cluster seems too far apart. So, it seems like the 4 points on the top will form a separate cluster. The three clusters could look like this (Figure 9.4).
Figure 9.4: Dividing into three clusters (centroids shown as thick dots)
This solution has three clusters. The cluster on the right is far from the other two clusters. However, its centroid is not too close to all the data points. The cluster at the top looks very tight-fitting, with a nice centroid. The third cluster, at the left, is spread out and may not be very useful.
This was a bottom-up exercise in visually producing three best-fitting cluster definitions from the given data. The right number of clusters will depend on the data and the application for which the data would be used.
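The centroids of the three visually chosen clusters can be checked with a short calculation. Because the figures are not reproduced here, the cluster memberships below are an assumption read off the description: the three bottom-right points, the four points at the top, and the remaining three on the left.

```python
def centroid(points):
    """Coordinate-wise mean of a collection of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Assumed memberships, inferred from the text rather than from the figures.
clusters = {
    "bottom right": [(8, 3), (5, 2), (6, 3)],
    "top": [(5, 6), (4, 7), (6, 6), (5, 7)],
    "left": [(2, 4), (2, 6), (4, 4)],
}

centroids = {name: centroid(members) for name, members in clusters.items()}
for name, (x, y) in centroids.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```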
K-Means Algorithm for clustering
K-means is the most popular clustering algorithm. It iteratively computes the clusters and their centroids. It is a top-down approach to clustering. It starts with a given number of K clusters, say 3 clusters. Thus three random centroids will be created as starting points of the centers of three clusters. The circles are initial cluster centroids (Figure 9.5).
Figure 9.5: Randomly assigning three centroids for three data clusters
Step 1: For a data point, the distance is computed to each of the three centroids. The data point will be assigned to the cluster with the shortest distance to the centroid. All data points will thus be assigned to one centroid or another (Figure 9.6). The arrow from each data element shows the centroid that the point is assigned to.
Figure 9.6: Assigning data points to closest centroid
Step 2: The centroid for each cluster will now be recalculated such that it is closest to all the data points allocated to that cluster. The dashed arrows show the centroids being moved from their old (shaded) values to the revised new values (Figure 9.7).
Figure 9.7: Recomputing centroids for each cluster
Step 3: Once again, each data point is assigned to the closest of the three centroids (Figure 9.8).
Figure 9.8: Assigning data points to recomputed centroids
The new centroids will be computed from the data points in the cluster until finally, the centroids stabilize in their locations. These are the three clusters computed by this algorithm.
Figure 9.9: Recomputing centroids for each cluster till clusters stabilize
The three clusters shown are: a 3-data-point cluster with centroid (6.5,4.5), a 2-data-point cluster with centroid (4.5,3) and a 5-data-point cluster with centroid (3.5,3) (Figure 9.9).
These cluster definitions are different from the ones derived visually. This is a function of the random starting centroid values. The centroid points used earlier in the visual exercise were different from those chosen with the K-means clustering algorithm. The K-means clustering exercise should, therefore, be run again with this data, but with new random centroid starting values. With many runs, the cluster definitions are likely to stabilize. If the cluster definitions do not stabilize, that may be a sign that the number of clusters chosen is too high or too low. The algorithm should also be run with different values of K.
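The assign-and-recompute loop, together with the advice to rerun from different random starting centroids, can be sketched in plain Python on the ten (X, Y) points of the clustering exercise. This is one possible implementation under stated assumptions, not the code behind the figures: starting centroids are drawn from the data points themselves, seeds 0 to 9 are arbitrary, and the run with the smallest within-cluster sum of squares is kept.

```python
import random

POINTS = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    """Return (centroids, assignment) after the clusters stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stabilized
        assignment = new_assignment
        # Recompute each centroid from the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, assignment


def within_cluster_ss(points, centroids, assignment):
    return sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignment))


# Rerun with different random starts and keep the tightest clustering.
results = [k_means(POINTS, 3, seed=s) for s in range(10)]
best_centroids, best_assignment = min(
    results, key=lambda r: within_cluster_ss(POINTS, *r)
)
print(best_centroids, best_assignment)
```

Different seeds can give different cluster definitions, which is the sensitivity to starting values described above.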
Selecting the number of clusters
The correct choice of the value of k is often ambiguous. It depends on the shape and scale of the distribution points in a data set and the desired clustering resolution of the user. Heuristics are needed to pick the right number. One can graph the percentage of variance explained by the clusters against the number of clusters (Figure 9.10). The first clusters will add more information (explain a lot of variance), but at some point the marginal gain in variance will fall, giving a sharp angle to the graph, looking like an elbow. Beyond that elbow point, adding more clusters will not add much incremental value. That would be the desired K.
Figure 9.10: Elbow method for determining number of clusters in a data set
To engage with the data and to understand the clusters better, it is often better to start with a small number of clusters such as 2 or 3, depending upon the data set and the application domain. The number can be increased subsequently, as needed from an application point of view. This helps understand the data and the clusters progressively better.
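The numbers behind an elbow graph can be produced with a compact K-means routine. The sketch below reuses the ten (X, Y) points from the clustering exercise and defines variance explained as one minus the ratio of within-cluster to total sum of squares; the helper names, the 20 random restarts per K, and the range of K from 1 to 6 are assumptions for illustration.

```python
import random

POINTS = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means_wcss(points, k, seed):
    """Run a basic K-means and return its within-cluster sum of squares."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(100):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [mean_point(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # centroids have stabilized
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


total_ss = sum(sq_dist(p, mean_point(POINTS)) for p in POINTS)

# Share of variance explained for each K, using the best of 20 random starts.
explained = {
    k: 1 - min(k_means_wcss(POINTS, k, s) for s in range(20)) / total_ss
    for k in range(1, 7)
}
for k, share in explained.items():
    print(k, f"{share:.0%}")
```

Plotting these shares against K and looking for the point where the gains flatten gives the elbow.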
Advantages and Disadvantages of K-Means algorithm
There are many advantages of the K-Means algorithm:
1. The K-Means algorithm is simple, easy to understand and easy to implement.
2. It is also efficient, in that the time taken by K-means to cluster the data rises linearly with the number of data points.
3. No other clustering algorithm performs better than K-Means, in general.
There are a few disadvantages too:
1. The user needs to specify an initial value of K.
2. The process of finding the clusters may not converge.
3. It is not suitable for discovering cluster shapes that are not hyper-ellipsoids (or hyper-spheres).
Neural networks can also be deployed for clustering, using the appropriate objective function. The neural network will produce the appropriate cluster centroids and cluster population for each cluster.
Conclusion
Cluster analysis is a useful, unsupervised learning technique that is used in many business situations to segment the data into meaningful small groups. The K-Means algorithm is an easy statistical technique to iteratively segment the data. However, there is only a heuristic technique to select the right number of clusters.
Review Exercises
1: What is unsupervised learning? When is it used?
2: Describe three business applications in your industry where cluster analysis will be useful.
3: Data about height and weight for a few volunteers is available. Create a set of clusters for the following data, to decide how many sizes of T-shirts should be ordered.
Height    Weight
71        165
68        165
72        180
67        113
72        178
62        101
70        150
69        172
72        185
63        149
69        132
61        115
Liberty Stores Case Exercise: Step 7
Liberty wants to find a suitable number of segments for its customers, for targeted marketing. Here is a list of representative customers.
Cust    # of transactions    Total Purchase ($)    Income ($K)
1       5                    450                   90
2       10                   800                   82
3       15                   900                   77
4       2                    50                    30
5       18                   900                   60
6       9                    200                   45
7       14                   500                   82
8       8                    300                   22
9       7                    250                   90
10      9                    1000                  80
11      1                    30                    60
12      6                    700                   80
1. What is the right number of clusters for Liberty?
2. What are the centroids for the clusters?
Chapter 10: Association Rule Mining
Association rule mining is a popular, unsupervised learning technique, used in business to help identify shopping patterns. It is also known as market basket analysis. It helps find interesting relationships (affinities) between variables (items or events). Thus, it can help cross-sell related items and increase the size of a sale.
All data used in this technique is categorical. There is no dependent variable. It uses machine learning algorithms. The fascinating "relationship between sales of diapers and beer" is how it is often explained in popular literature. This technique accepts as input the raw point-of-sale transaction data. The output produced is the description of the most frequent affinities among items. An example of an association rule would be, "A customer who bought a flight ticket and a hotel reservation also bought a rental car plan 60 percent of the time."
Caselet: Netflix: Data Mining in Entertainment
Netflix suggestions and recommendation engines are powered by a suite of algorithms using data about millions of customer ratings about thousands of movies. Most of these algorithms are based on the premise that similar viewing patterns represent similar user tastes. This suite of algorithms, called CineMatch, instructs Netflix's servers to process information from its databases to determine which movies a customer is likely to enjoy. The algorithm takes into account many factors about the films themselves, the customers' ratings, and the combined ratings of all Netflix users. The company estimates that a whopping 75 percent of viewer activity is driven by recommendations. According to Netflix, these predictions were valid around 75 percent of the time and half of Netflix users who rented CineMatch-recommended movies gave them a five-star rating.
To make matches, a computer
1. Searches the CineMatch database for people who have rated the same movie, for example, "Return of the Jedi".
2. Determines which of those people have also rated a second movie, such as "The Matrix".
3. Calculates the statistical likelihood that people who liked "Return of the Jedi" will also like "The Matrix".
4. Continues this process to establish a pattern of correlations between subscribers' ratings of many different films.
Netflix launched a contest in 2006 to find an algorithm that could beat CineMatch. The contest, called the Netflix Prize, promised $1 million to the first person or team to meet the accuracy goals for recommending movies based on users' personal preferences. Each of these algorithm submissions was required to demonstrate a 10 percent improvement over CineMatch. Three years later, the $1 million prize was awarded to a team of seven people. (Source: http://electronics.howstuffworks.com)
1: Are Netflix customers being manipulated into seeing what Netflix wants them to see?
2: Compare this story with Amazon's personalization engine.
Business Applications of Association Rules
In business environments a pattern or knowledge can be used for many purposes. In sales and marketing, it is used for cross-marketing and cross-selling, catalog design, e-commerce site design, online advertising optimization, product pricing, and sales/promotion configurations. This analysis can suggest not to put one item on sale at a time, and instead to create a bundle of products promoted as a package to sell other non-selling items.
In retail environments, it can be used for store design. Strongly associated items can be kept close together for customer convenience. Or they could be placed far from each other so that the customer has to walk the aisles and by doing so is potentially exposed to other items.
In medicine, this technique can be used for relationships between symptoms and illnesses; diagnosis and patient characteristics/treatments; genes and their functions; etc.
Representing Association Rules
A generic Association Rule is represented between a set X and Y: X → Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together in the dataset, i.e. P(X ∪ Y), the share of transactions that contain all the items of both X and Y
C: Confidence: how often Y is found, given X, i.e. P(Y | X)
Example: {Hotel booking, Flight booking} → {Rental Car} [30%, 60%]
[Note: P(X) is the mathematical representation of the probability or chance of X occurring in the data set.]
Computation example:
Suppose there are 1000 transactions in a data set. There are 300 occurrences of X, and 150 occurrences of (X,Y) in the data set.
Support S for X → Y will be P(X ∪ Y) = 150/1000 = 15%.
Confidence for X → Y will be P(Y | X); or P(X ∪ Y) / P(X) = 150/300 = 50%
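The same computation can be written as two small functions. The function and variable names are illustrative; the counts are the ones from the computation example.

```python
def support(count_xy, total_transactions):
    """Share of all transactions that contain both X and Y."""
    return count_xy / total_transactions


def confidence(count_xy, count_x):
    """Share of the transactions containing X that also contain Y."""
    return count_xy / count_x


total_transactions = 1000
count_x = 300    # occurrences of X
count_xy = 150   # occurrences of X and Y together

s = support(count_xy, total_transactions)
c = confidence(count_xy, count_x)
print(f"Support = {s:.0%}, Confidence = {c:.0%}")
```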
Algorithms for Association Rule
Not all association rules are interesting and useful, only those that are strong rules and also those that occur frequently. In association rule mining, the goal is to find all rules that satisfy the user-specified minimum support and minimum confidence. The resulting sets of rules are all the same irrespective of the algorithm used, that is, given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
Fortunately, there is a large number of algorithms that are available for generating association rules. The most popular algorithms are Apriori, Eclat, FP-Growth, along with various derivatives and hybrids of the three. All the algorithms help identify the frequent itemsets, which are then converted to association rules.
Apriori Algorithm
This is the most popular algorithm used for association rule mining. The objective is to find subsets that are common to at least a minimum number of the itemsets. A frequent itemset is an itemset whose support is greater than or equal to the minimum support threshold. The Apriori property is a downward closure property, which means that any subsets of a frequent itemset are also frequent itemsets. Thus, if (A,B,C,D) is a frequent itemset, then any subset such as (A,B,C) or (B,D) is also a frequent itemset.
It uses a bottom-up approach; and the size of frequent subsets is gradually increased, from one-item subsets to two-item subsets, then three-item subsets, and so on. Groups of candidates at each level are tested against the data for minimum support.
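The downward closure property is what keeps the candidate groups small: a larger itemset is worth testing against the data only if every one of its smaller subsets is already known to be frequent. The sketch below illustrates that pruning step with a brute-force join over hypothetical items A to D; it is a simplification for illustration, not the optimized candidate generation of a production Apriori implementation.

```python
from itertools import combinations


def generate_candidates(frequent_itemsets, size):
    """Build size-item candidates whose (size - 1)-item subsets are all frequent."""
    frequent = set(frequent_itemsets)
    items = sorted(set().union(*frequent))
    candidates = []
    for combo in combinations(items, size):
        # Downward closure: discard the candidate if any subset is infrequent.
        if all(frozenset(sub) in frequent
               for sub in combinations(combo, size - 1)):
            candidates.append(frozenset(combo))
    return candidates


# Hypothetical frequent 2-item itemsets.
frequent_pairs = [frozenset("AB"), frozenset("AC"),
                  frozenset("BC"), frozenset("BD")]
candidates = generate_candidates(frequent_pairs, 3)
print([sorted(c) for c in candidates])
```

Only (A,B,C) survives here: (A,B,D) is dropped because (A,D) is not frequent, and (B,C,D) is dropped because (C,D) is not frequent.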
Association rules exercise
Here are a dozen sales transactions. There are six products being sold: Milk, Bread, Butter, Eggs, Cookies, and Ketchup. Transaction #1 sold Milk, Eggs, Bread and Butter. Transaction #2 sold Milk, Butter, Eggs and Ketchup. And so on. The objective is to use this transaction data to find affinities between products, i.e. which products sell together often.
The support level will be set at 33 percent; the confidence level will be set at 50 percent. That means that we have decided to consider rules from only those itemsets that occur at least 33 percent of the time in the total set of transactions. Confidence level means that within those itemsets, the rules of the form X → Y should be such that there is at least a 50 percent chance of Y occurring based on X occurring.
Transactions List
1     Milk, Egg, Bread, Butter
2     Milk, Butter, Egg, Ketchup
3     Bread, Butter, Ketchup
4     Milk, Bread, Butter
5     Bread, Butter, Cookies
6     Milk, Bread, Butter, Cookies
7     Milk, Cookies
8     Milk, Bread, Butter
9     Bread, Butter, Egg, Cookies
10    Milk, Butter, Bread
11    Milk, Bread, Butter
12    Milk, Bread, Cookies, Ketchup
The first step is to compute 1-item itemsets, i.e. how often any product sells individually.
1-item Sets    Freq
Milk           9
Bread          10
Butter         10
Egg            3
Ketchup        3
Cookies        5
Thus, Milk sells in 9 out of 12 transactions. Bread sells in 10 out of 12 transactions. And so on.
At every point, there is an opportunity to select itemsets of interest for further analysis. Other itemsets that occur very infrequently may be removed. If itemsets that occur 4 or more times out of 12 are selected, that corresponds to meeting a minimum support level of 33 percent (4 out of 12). Only 4 items make the cut. The frequent items that meet the support level of 33 percent are:
Frequent 1-item Sets    Freq
Milk                    9
Bread                   10
Butter                  10
Cookies                 5
The next step is to go for the next level of itemsets using items selected earlier: 2-item itemsets.
2-item Sets        Freq
Milk, Bread        7
Milk, Butter       7
Milk, Cookies      3
Bread, Butter      9
Butter, Cookies    3
Bread, Cookies     4
Thus (Milk, Bread) sell together 7 times out of 12. (Milk, Butter) sell together 7 times, (Bread, Butter) sell together 9 times, and (Bread, Cookies) sell together 4 times.
However, only four of these itemsets meet the minimum support level of 33%.
2-item Sets        Freq
Milk, Bread        7
Milk, Butter       7
Bread, Butter      9
Bread, Cookies     4
The next step is to list the next higher level of itemsets: 3-item itemsets.
3-item Sets               Freq
Milk, Bread, Butter       6
Milk, Bread, Cookies      2
Bread, Butter, Cookies    3
Thus (Milk, Bread, Butter) sell 6 times out of 12. (Bread, Butter, Cookies) sell 3 times out of 12. Only one 3-item itemset meets the minimum support requirements.
3-item Sets               Freq
Milk, Bread, Butter       6
There is no room to create a 4-item itemset for this support level.
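The level-by-level counting in this exercise can be reproduced with a short script over the dozen transactions. This is a minimal sketch of an Apriori-style search, with a minimum count of 4 standing in for the 33 percent support level; the function names and the brute-force candidate generation are assumptions made for readability, not an optimized implementation.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Milk", "Egg", "Bread", "Butter"},
    {"Milk", "Butter", "Egg", "Ketchup"},
    {"Bread", "Butter", "Ketchup"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Cookies"},
    {"Milk", "Bread", "Butter", "Cookies"},
    {"Milk", "Cookies"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Egg", "Cookies"},
    {"Milk", "Butter", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Cookies", "Ketchup"},
]
MIN_COUNT = 4  # 4 out of 12 transactions, i.e. the 33 percent support level


def count_support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for all itemsets meeting the minimum count."""
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items
               if count_support(frozenset([i]), transactions) >= min_count]
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = count_support(itemset, transactions)
        size += 1
        previous = set(current)
        frequent_items = sorted(set().union(*current))
        # Keep a candidate only if all its smaller subsets were frequent
        # and it meets the minimum count itself.
        current = [
            frozenset(c) for c in combinations(frequent_items, size)
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
            and count_support(frozenset(c), transactions) >= min_count
        ]
    return result


frequent = frequent_itemsets(TRANSACTIONS, MIN_COUNT)
for itemset, freq in sorted(frequent.items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(", ".join(sorted(itemset)), freq)
```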
Creating Association Rules
Rule creation starts top-down, with the largest frequent itemsets, because these yield the most interesting and complex rules. Association rules are created that meet the support level (>33%) and confidence level (>50%).
The highest level itemset that meets the support requirements is the three-item itemset. The following itemset has a support level of 50% (6 out of 12).
Milk, Bread, Butter    6
This itemset could lead to multiple candidate Association rules.
Start with the following rule: (Bread, Butter) → Milk.
There are a total of 12 transactions.
X (in this case Bread, Butter) occurs 9 times;
X, Y (in this case Bread, Butter, Milk) occurs 6 times.
The support level for this rule is 6/12 = 50%. The confidence level for this rule is 6/9 = 67%. This rule meets our thresholds for support (>33%) and confidence (>50%).
Thus, the first valid Association rule from this data is: (Bread, Butter) → Milk {S=50%, C=67%}.
In exactly the same way, other rules can be considered for their validity.
Consider the rule: (Milk, Bread) → Butter. Out of total 12 transactions, (Milk, Bread) occurs 7 times; and (Milk, Bread, Butter) occurs 6 times.
The support level for this rule is 6/12 = 50%. The confidence level for this rule is 6/7 = 86%. This rule meets our thresholds for support (>33%) and confidence (>50%).
Thus, the second valid Association rule from this data is (Milk, Bread) → Butter {S=50%, C=86%}.
Consider the rule (Milk, Butter) → Bread. Out of total 12 transactions (Milk, Butter) occurs 7 times while (Milk, Butter, Bread) occurs 6 times.
The support level for this rule is 6/12 = 50%. The confidence level for this rule is 6/7 = 86%. This rule meets our thresholds for support (>33%) and confidence (>50%).
Thus, the next valid Association rule is: (Milk, Butter) → Bread {S=50%, C=86%}.
Thus, there were only three possible rules with a two-item left-hand side at the 3-item itemset level, and all were found to be valid.
One can get to the next lower level and generate association rules at the 2-item itemset level.
Consider the rule Milk → Bread. Out of total 12 transactions Milk occurs 9 times while (Milk, Bread) occurs 7 times.
The support level for this rule is 7/12 = 58%. The confidence level for this rule is 7/9 = 78%. This rule meets our thresholds for support (>33%) and confidence (>50%).
Thus, the next valid Association rule is:
Milk → Bread {S=58%, C=78%}.
Many such rules could be derived if needed.
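Rule generation from a frequent itemset can also be automated. The sketch below, under the same assumptions as the exercise (the dozen transactions and a 50 percent confidence threshold), tries every non-empty proper subset of an itemset as the left-hand side, so it also lists rules with a single item on the left; the function names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Milk", "Egg", "Bread", "Butter"},
    {"Milk", "Butter", "Egg", "Ketchup"},
    {"Bread", "Butter", "Ketchup"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Cookies"},
    {"Milk", "Bread", "Butter", "Cookies"},
    {"Milk", "Cookies"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Egg", "Cookies"},
    {"Milk", "Butter", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Cookies", "Ketchup"},
]


def count_support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset, transactions, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting the threshold."""
    rules = []
    both = count_support(itemset, transactions)
    for size in range(1, len(itemset)):
        for lhs_items in combinations(sorted(itemset), size):
            lhs = frozenset(lhs_items)
            conf = both / count_support(lhs, transactions)
            if conf >= min_confidence:
                rules.append((lhs, itemset - lhs,
                              both / len(transactions), conf))
    return rules


rules = rules_from(frozenset({"Milk", "Bread", "Butter"}), TRANSACTIONS, 0.5)
for lhs, rhs, sup, conf in rules:
    print(f"({', '.join(sorted(lhs))}) -> ({', '.join(sorted(rhs))}) "
          f"S={sup:.0%}, C={conf:.0%}")
```

Sorting the resulting rules by confidence or support is a simple way to pick the strongest few for implementation.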
Not all such association rules are interesting. The client may be interested in only the top few rules that they want to implement. The number of association rules depends upon business need. Implementing every rule in business will require some cost and effort, with some potential of gains. The strongest of rules, with the higher support and confidence rates, should be used first, and the others should be progressively implemented later.
Conclusion
Association Rules help discover affinities between products in transactions. They help make cross-selling recommendations much more targeted and effective. The Apriori technique is the most popular technique, and it is a machine learning technique.
Review Exercises
Q1: What are association rules? How do they help?
Q2: How many association rules should be used?
Liberty Stores Case Exercise: Step 8
Here is a list of transactions from Liberty's stores. Create association rules for the following data, with a 33% support level and a 66% confidence level.
1     A, B, C
2     B, E, F
3     A, C, E
4     B, C, F
5     A, C, E
6     C, F, G
7     A, D, F
8     D, E, F
9     A, B, D
10    A, B, C
11    B, D, E
12    A, C, D
Section 3
This section covers some additional topics.
Chapter 11 will cover Text Mining, the art and science of generating insights from text. It is very important in the age of social media.
Chapter 12 will cover Web Mining, the art and science of generating insights from the world-wide web, its content and usage. It is very important in the digital age where a lot of advertising and selling is moving to the web.
Chapter 13 will cover Big Data. This is a new moniker created to describe the phenomenon of large amounts of data being generated from many data sources, and which cannot be handled with the traditional data management tools.
Chapter 14 will cover a primer on Data Modeling. This is useful as a ramp-up to data mining, especially for those who have not had much exposure to traditional data management or may need a refresher.
Chapter11:TextMining
Text mining is the art and science of discovering knowledge, insights andpatternsfromanorganizedcollectionoftextualdatabases.Textualminingcanhelp with frequency analysis of important terms, and their semanticrelationships.
Text is an important part of the growing data in the world. Social mediatechnologieshaveenableduserstobecomeproducersoftextandimagesandotherkindsofinformation.Textminingcanbeappliedtolarge-scalesocialmediadataforgatheringpreferences,andmeasuringemotionalsentiments.Itcanalsobeappliedtosocietal,organizationalandindividualscales.
Caselet: WhatsApp and Private Security
Do you think that what you post on social media remains private? Think again. A new dashboard shows how much personal information is out there, and how companies are able to construct ways to make use of it for commercial benefits. Here is a dashboard of conversations between two people, Jennifer and Nicole, over 45 days.
There is a variety of categories that Nicole and Jennifer speak about, such as computers, politics, laundry, and desserts. The polarity of Jennifer’s personal thoughts and tone is overwhelmingly positive, and Jennifer responds to Nicole much more than vice versa, identifying Nicole as the influencer in their relationship.
The data visualization reveals the waking hours of Jennifer, showing that she is most active around 8:00 pm and heads to bed around midnight. 53% of her conversation is about food, and 15% about desserts. Maybe she’s a strategic person to push restaurant or weight loss ads.
The most intimate detail exposed during this conversation is that Nicole and Jennifer discuss right-wing populism, radical parties, and conservative politics. It exemplifies that the amount of private information obtained from your WhatsApp conversations is limitless and potentially dangerous.
WhatsApp is the world’s largest messaging service, with over 450 million users. Facebook recently bought this three-year-old company for a whopping $19 billion. People share a lot of sensitive personal information on WhatsApp that they may not even share with their family members.
(Source: What Facebook Knows About You From One WhatsApp Conv, by Adi Azaria, on LinkedIn, April 10, 2014)
1: What are the business and social implications of this kind of analysis?
2: Are you worried? Should you be worried?
Text mining works on texts from practically any kind of source, from any business or non-business domain, in any format including Word documents, PDF files, XML files, text messages, etc. Here are some representative examples:
1. In the legal profession, text sources would include laws, court deliberations, court orders, etc.
2. In academic research, it would include texts of interviews, published research articles, etc.
3. In the world of finance, it would include statutory reports, internal reports, CFO statements, and more.
4. In medicine, it would include medical journals, patient histories, discharge summaries, etc.
5. In marketing, it would include advertisements, customer comments, etc.
6. In the world of technology and search, it would include patent applications, the whole of information on the World Wide Web, and more.
Text Mining Applications
Text mining is a useful tool in the hands of chief knowledge officers to extract knowledge relevant to an organization. Text mining can be used across industry sectors and application areas, including decision support, sentiment analysis, fraud detection, survey analysis, and many more.
1. Marketing: The voice of the customer can be captured in its native and raw format and then analyzed for customer preferences and complaints.
   1. Social personas are a clustering technique to develop customer segments of interest. Consumer input from social media sources, such as reviews, blogs, and tweets, contains numerous leading indicators that can be used towards anticipating and predicting consumer behavior.
   2. A ‘listening platform’ is a text mining application that, in real time, gathers social media, blogs, and other textual feedback, and filters out the chatter to extract true consumer sentiment. The insights can lead to more effective product marketing and better customer service.
   3. Customer call center conversations and records can be analyzed for patterns of customer complaints. Decision trees can organize this data to create decision choices that could help with product management activities and with becoming proactive in avoiding those complaints.
2. Business operations: Many aspects of business functioning can be accurately gauged from analyzing text.
   1. Social network analysis and text mining can be applied to emails, blogs, social media and other data to measure the emotional states and the mood of employee populations. Sentiment analysis can reveal early signs of employee dissatisfaction, which can then be proactively managed.
   2. Studying people as emotional investors and using text analysis of the social Internet to measure mass psychology can help in obtaining superior investment returns.
3. Legal: In legal applications, lawyers and paralegals can more easily search case histories and laws for relevant documents in a particular case to improve their chances of winning.
   1. Text mining is also embedded in e-discovery platforms that help in minimizing risk in the process of sharing legally mandated documents.
   2. Case histories, testimonies, and client meeting notes can reveal additional information, such as morbidities in a healthcare situation, that can help better predict high-cost injuries and prevent costs.
4. Governance and Politics: Governments can be overturned based on a tweet originating from a self-immolating fruit vendor in Tunisia.
   1. Social network analysis and text mining of large-scale social media data can be used for measuring the emotional states and the mood of constituent populations. Micro-targeting constituents with specific messages gleaned from social media analysis can be a more efficient use of resources when fighting democratic elections.
   2. In geopolitical security, internet chatter can be processed for real-time information and to connect the dots on any emerging threats.
   3. In academia, research streams could be meta-analyzed for underlying research trends.
Text Mining Process
Text mining is a rapidly evolving area of research. As the amount of social media and other text data grows, there is a need for efficient abstraction and categorization of meaningful information from the text.
The first level of analysis is identifying frequent words. This creates a bag of important words. Texts, whether documents or smaller messages, can then be ranked on how well they match a particular bag of words. However, there are challenges with this approach. For example, the words may be spelled a little differently. Or there may be different words with similar meanings.
The next level is identifying meaningful phrases from words. Thus ‘ice’ and ‘cream’ will be two different key words that often come together. However, there is a more meaningful phrase formed by combining the two words into ‘ice cream’. There might be similarly meaningful phrases like ‘apple pie’.
The next higher level is that of topics. Multiple phrases could be combined into a topic area. Thus the two phrases above could be put into a common basket, and this basket could be called ‘Desserts’.
Text mining is a semi-automated process. Text data needs to be gathered, structured, and then mined, in a 3-step process (Figure 11.1).
Figure 11.1: Text Mining Architecture
1. The text and documents are first gathered into a corpus, and organized.
2. The corpus is then analyzed for structure. The result is a matrix mapping important terms to source documents.
3. The structured data is then analyzed for word structures, sequences, and frequency.
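The three levels of analysis described above (words, phrases, topics) can be illustrated with a small sketch. It assumes a toy set of messages and a hand-made phrase-to-topic map, both invented here for illustration. Adjacent word pairs stand in for "phrases", which is a simplification of real phrase detection.

```python
from collections import Counter
import re

# Hypothetical phrase-to-topic map; a real project would curate this by hand.
TOPICS = {("ice", "cream"): "Desserts", ("apple", "pie"): "Desserts"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def count_levels(texts):
    """Count words, adjacent word pairs (phrases), and topics across texts."""
    words, phrases, topics = Counter(), Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        words.update(tokens)
        pairs = list(zip(tokens, tokens[1:]))
        phrases.update(pairs)
        topics.update(TOPICS[p] for p in pairs if p in TOPICS)
    return words, phrases, topics

# Toy corpus, invented for this example.
messages = [
    "I love ice cream",
    "Ice cream and apple pie",
    "Apple pie is my favorite",
]
words, phrases, topics = count_levels(messages)
print(words.most_common(4))
print(phrases.most_common(2))
print(topics)
```

Counting words alone treats ‘ice’ and ‘cream’ as unrelated. The pair counts reveal that they travel together, and the topic map rolls both dessert phrases into a single basket.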
Term Document Matrix
This is the heart of the structuring process. Free-flowing text can be transformed into numeric data in a term-document matrix (TDM), which can then be mined using regular data mining techniques.
1. There are several efficient techniques for identifying key terms from a text. There are less efficient techniques available for creating topics out of them. For the purpose of this discussion, keywords, phrases, and topics can all be called terms of interest. This approach measures the frequencies of select important terms occurring in each document. This creates a t x d term-by-document matrix (TDM), where t is the number of terms and d is the number of documents (Table 11.1).
2. Creating a TDM requires making choices of which terms to include. The terms chosen should reflect the stated purpose of the text mining exercise. The list of terms should be as extensive as needed, but should not include unnecessary terms that will serve to confuse the analysis or slow the computation.
Document/Terms: investment, Profit, happy, Success, …
Doc 1: 10, 4, 3, 4
Doc 2: 7, 2, 2
Doc 3: 2, 6
Doc 4: 1, 5, 3
Doc 5: 6, 2
Doc 6: 4, 2
…
Table 11.1: Term-Document Matrix
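As a rough illustration of how such a matrix might be produced, the sketch below counts a chosen list of terms in each document. The two sample documents and the function name build_tdm are made up for this example. Rows are documents and columns are terms, following the layout of Table 11.1.

```python
from collections import Counter
import re

def build_tdm(documents, terms):
    """Return one row of term counts per document, columns ordered as `terms`."""
    matrix = []
    for text in documents:
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        matrix.append([counts[term] for term in terms])
    return matrix

# Terms of interest, taken from the column headers of Table 11.1.
terms = ["investment", "profit", "happy", "success"]

# Hypothetical documents, invented for this example.
docs = [
    "Investment drives profit, and profit makes investors happy.",
    "Success follows investment.",
]

tdm = build_tdm(docs, terms)
for name, row in zip(["Doc 1", "Doc 2"], tdm):
    print(name, row)
```

Words outside the chosen term list (such as "drives" or "investors") are simply ignored, which is how the choice of terms controls the size of the matrix.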
Here are some considerations in creating a TDM.
1. A large collection of documents mapped to a large bag of words will likely lead to a very sparse matrix if they have few common words. Reducing the dimensionality of the data will help improve the speed of analysis and the meaningfulness of the results. Synonyms, or terms with similar meaning, should be combined and counted together as a common term. This would help reduce the number of distinct terms, or ‘tokens’.
2. Data should be cleaned for spelling errors. Common misspellings should be corrected and combined with the correctly spelled term. Uppercase and lowercase versions of a term should also be combined.
3. When many variants of the same term are used, just the stem of the word should be used to reduce the number of terms. For instance, terms like customer order, ordering, and order data should be combined into a single token word, called ‘Order’.
4. On the other hand, homonyms (terms with the same spelling but different meanings) should be counted separately. This would enhance the quality of analysis. For example, the term order can mean a customer order, or the ranking of certain choices. These two should be treated separately. “The boss ordered that the customer orders data analysis be presented in chronological order.” This statement shows three different meanings for the word ‘order’. Thus, there will be a need for a manual review of the TD matrix.
5. Terms with very few occurrences in very few documents should be eliminated from the matrix. This would help increase the density of the matrix and the quality of analysis.
6. The measures in each cell of the matrix could be one of several possibilities. It could be a simple count of the number of occurrences of each term in a document. It could also be the log of that number. It could be the fraction computed by dividing the frequency count by the total number of words in the document. Or there may be binary values in the matrix to represent whether a term is mentioned or not. The choice of value in the cells will depend upon the purpose of the text analysis.
At the end of this analysis and cleansing, a well-formed, densely populated, rectangular TDM will be ready for analysis. The TDM could be mined using all the available data mining techniques.
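Two of the considerations above, combining variants of a term and choosing the cell measure, might look like the following in code. This is a sketch under stated assumptions: the CANONICAL map is a hypothetical hand-built lookup rather than a real stemmer, and the log measure is computed as log(1 + count) so that a zero count stays at zero. Homonyms are not handled and would still need manual review.

```python
import math

# Hypothetical hand-built map of variants and misspellings to one token.
CANONICAL = {"ordering": "order", "orders": "order", "proft": "profit"}

def normalize(token):
    """Fold case, then map known variants to their common term."""
    token = token.lower()
    return CANONICAL.get(token, token)

def cell_value(count, doc_length, scheme="count"):
    """Compute a TDM cell from a raw term count using the chosen measure."""
    if scheme == "count":
        return count
    if scheme == "log":
        return math.log(1 + count)  # assumption: add 1 so zero counts map to 0
    if scheme == "fraction":
        return count / doc_length if doc_length else 0.0
    if scheme == "binary":
        return 1 if count > 0 else 0
    raise ValueError(f"unknown scheme: {scheme}")

print(normalize("Ordering"))
for scheme in ("count", "log", "fraction", "binary"):
    print(scheme, cell_value(3, 12, scheme))
```

The fraction measure makes long and short documents comparable, while the binary measure is usually enough when only the presence of a term matters.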
Mining the TDM
The TDM can be mined to extract patterns and new knowledge, using a variety of techniques.
Predictors of desirable terms could be discovered through predictive techniques, such as regression analysis. Suppose the word profit is a desirable word in a document. The number of occurrences of the word profit in a document could be regressed against many other terms in the TDM. The relative strengths of the coefficients of the various predictor variables would show the relative impact of those terms on creating a profit discussion.
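To make the regression idea concrete, here is a minimal sketch with a single predictor term, using ordinary least squares written out by hand. The counts are invented for illustration. A real analysis would regress profit against many terms at once using a statistics package.

```python
def simple_regression(x, y):
    """Least-squares slope and intercept for predicting y from one variable x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical TDM columns: counts of two terms in four documents.
investment = [0, 2, 4, 6]
profit = [1, 2, 3, 4]

slope, intercept = simple_regression(investment, profit)
print(f"profit = {intercept:.2f} + {slope:.2f} * investment")
```

In this toy data, each additional mention of investment goes with half an additional mention of profit. With several predictor terms, comparing such coefficients is what shows their relative impact.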
Predicting the chances of a document being liked is another form of analysis. For example, important speeches made by the CEO or the CFO to investors could be evaluated for quality. If the classification of those documents (such as good or poor speeches) was available, then the terms of the TDM could be used to predict the speech class. A decision tree could be constructed that makes a simple tree with a few decision points that predicts the success of a speech 80 percent of the time. This tree could be trained with more data to become better over time.
Clustering techniques can help categorize documents by common profile. For example, documents containing the words investment and profit more often could be bundled together. Similarly, documents containing the words customer orders and marketing more often could be bundled together. Thus, a few strongly demarcated bundles could capture the essence of the entire TDM. These bundles could thus help with further processing, such as handing over select documents to others for legal discovery.
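The bundling idea can be sketched with cosine similarity between document rows of a TDM. The grouping rule below is a deliberately simple greedy pass, not a standard algorithm such as k-means, and the matrix values and the 0.8 threshold are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bundle(rows, threshold=0.8):
    """Greedily place each document in the first bundle whose first member is similar."""
    groups = []
    for i, row in enumerate(rows):
        for group in groups:
            if cosine(rows[group[0]], row) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar bundle found, so start a new one
    return groups

# Hypothetical TDM rows; columns: investment, profit, customer, orders, marketing.
tdm = [
    [5, 4, 0, 0, 0],
    [3, 6, 0, 1, 0],
    [0, 0, 4, 5, 2],
    [0, 1, 3, 4, 3],
]
groups = bundle(tdm)
print(groups)
```

Documents 0 and 1 share a finance profile while documents 2 and 3 share a customer profile, so two bundles capture the whole matrix.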
Association rule analysis could show relationships of coexistence. Thus, one could say that the words tasty and sweet occur together often (say, 5 percent of the time); and further, when these two words are present, 70 percent of the time the word happy is also present in the document.
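The two percentages in such a statement are the support and the confidence of the rule. A small sketch, assuming a made-up five-document TDM, shows how they could be read off the matrix by treating any non-zero count as "present".

```python
# Hypothetical TDM: one row per document, columns ordered as `terms`.
terms = ["tasty", "sweet", "happy"]
tdm = [
    [2, 1, 3],
    [1, 1, 1],
    [1, 2, 0],
    [0, 0, 2],
    [0, 1, 0],
]

def docs_with(tdm, terms, wanted):
    """Count the documents in which every wanted term appears at least once."""
    cols = [terms.index(t) for t in wanted]
    return sum(1 for row in tdm if all(row[c] > 0 for c in cols))

both = docs_with(tdm, terms, ["tasty", "sweet"])
all_three = docs_with(tdm, terms, ["tasty", "sweet", "happy"])

support = both / len(tdm)      # how often tasty and sweet occur together
confidence = all_three / both  # how often happy appears when they do

print(f"support={support:.0%}, confidence={confidence:.0%}")
```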
Comparing Text Mining and Data Mining
Text mining is a form of data mining. There are many common elements between text mining and data mining. However, there are some key differences (Table 11.2). The key difference is that text mining requires conversion of text data into frequency data before data mining techniques can be applied.
Dimension: Nature of data
   Text Mining: Unstructured data: words, phrases, sentences
   Data Mining: Numbers; alphabetical and logical values
Dimension: Language used
   Text Mining: Many languages and dialects used in the world; many languages are extinct, new documents are discovered
   Data Mining: Similar numerical systems across the world
Dimension: Clarity and precision
   Text Mining: Sentences can be ambiguous; sentiment may contradict the words
   Data Mining: Numbers are precise
Dimension: Consistency
   Text Mining: Different parts of the text can contradict each other
   Data Mining: Different parts of data can be inconsistent, thus requiring statistical significance analysis
Dimension: Sentiment
   Text Mining: Text may present a clear and consistent or mixed sentiment, across a continuum. Spoken words add further sentiment
   Data Mining: Not applicable
Dimension: Quality
   Text Mining: Spelling errors. Differing values of proper nouns such as names. Varying quality of language translation
   Data Mining: Issues with missing values, outliers, etc.
Dimension: Nature of analysis
   Text Mining: Keyword-based search; co-existence of themes; sentiment mining
   Data Mining: A full wide range of statistical and machine learning analysis for relationships and differences
Table 11.2: Comparing Text Mining and Data Mining
Text Mining Best Practices
Many of the best practices that apply to the use of data mining techniques will also apply to text mining.
1. The first and most important practice is to ask the right question. A good question is one that can be answered and whose answer would lead to large payoffs for the organization. The purpose and the key question will define how, and at what level of granularity, the TDM would be made. For example, a TDM defined for simpler searches would be different from one used for complex semantic analysis or network analysis.
2. A second important practice is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in the quality of the proposed solution and in finding the high-quality data sets required to test the hypothesized solution. For example, a TDM of consumer sentiment data should be combined with customer order data in order to develop a comprehensive view of customer behavior. It’s important to assemble a team that has a healthy mix of technical and business skills.
3. Another important element is to pursue the problem iteratively. Too much data can overwhelm the infrastructure and also befuddle the mind. It is better to divide and conquer the problem with a simpler TDM, with fewer terms and fewer documents and data sources. Expand as needed, in an iterative sequence of steps. In the future, add new terms to help improve predictive accuracy.
4. A variety of data mining tools should be used to test the relationships in the TDM. Different decision tree algorithms could be run alongside cluster analysis and other techniques. Triangulating the findings with multiple techniques, and many what-if scenarios, helps build confidence in the solution. Test the solution in many ways before committing to deploy it.
Conclusion
Text mining is diving into unstructured text to discover valuable insights about the business. The text is gathered and then structured into a term-document matrix based on the frequency of a bag of words in a corpus of documents. The TDM can then be mined for useful, novel patterns and insights. While the technique is important, the business objective should be well understood and should always be kept in mind.
***
Review Questions
1: Why is text mining useful in the age of social media?
2: What kinds of problems can be addressed using text mining?
3: What kinds of sentiments can be found in the text?
Do a text mining analysis of sales speeches by three salesmen.
1. Did you know your team can build PowerPoint muscles? Yes, I help build PowerPoint muscles. I teach people how to use PowerPoint more effectively in business. Now, for instance, I’m working with a global consulting firm to train all their senior consultants to give better sales presentations so they can close more business.
2. I train people how to make sure their PowerPoint slides aren’t a complete disaster. Those who attend my workshop can create slides that are 50% more clear and 50% more convincing by the end of the training, based on scores students give each other before and after the workshop. I’m not sure if my training could work at your company. But I’d be happy to talk to you about it.
3. You know how most business people use PowerPoint but most use it pretty poorly? Well, bad PowerPoint has all kinds of consequences: sales that don’t close, good ideas that get ignored, time wasted building slides that could have been used developing or executing strategies. My company shows businesses how to use PowerPoint to capture those sales, bring attention to those great ideas and use those wasted hours on more important projects.
The purpose is to select the best speech.
1: How would you select the right bag of words?
2: If speech #1 was the best speech, use the TDM to create a rule for good speeches.
Liberty Stores Case Exercise: Step 8
Here are a few comments from customer service calls received by Liberty.
1. I loved the design of the shirt. The size fitted me very well. However, the fabric seemed flimsy. I am calling to see if you can replace the shirt with a different one. Or please refund my money.
2. I was running late from work, and I stopped by to pick up some groceries. I did not like the way the manager closed the store while I was still shopping.
3. I stopped by to pick up flowers. The checkout line was very long. The manager was polite but did not open new cashiers. I got late for my appointment.
4. The manager promised that the product will be there, but when I went there the product was not there. The visit was a waste. The manager should have compensated me for my trouble.
5. When there was a problem with my catering order, the store manager promptly contacted me and quickly got the kinks out to send me replacement food immediately. They are very courteous.
Create a TDM with not more than 6 key terms. [Hint: Treat each comment as a document]
Chapter 12: Web Mining
Web mining is the art and science of discovering patterns and insights from the World Wide Web so as to improve it. The World Wide Web is at the heart of the digital revolution. More data is posted on the web every day than was there on the whole web just 20 years ago. Billions of users are using it every day for a variety of purposes. The web is used for electronic commerce, business communication, and many other applications. Web mining analyzes data from the web and helps find insights that could optimize the web content and improve the user experience. Data for web mining is collected via web crawlers, web logs, and other means.
Here are some characteristics of optimized websites:
1. Appearance: Aesthetic design. Well-formatted content, easy to scan and navigate. Good color contrasts.
2. Content: Well-planned information architecture with useful content. Fresh content. Search-engine optimized. Links to other good sites.
3. Functionality: Accessible to all authorized users. Fast loading times. Usable forms. Mobile enabled.
This type of content and its structure is of interest to ensure the web is easy to use. The analysis of web usage provides feedback on the web content, and also on the consumer’s browsing habits. This data can be of immense use for commercial advertising, and even for social engineering.
The web could be analyzed for its structure as well as content. The usage pattern of web pages could also be analyzed. Depending upon objectives, web mining can be divided into three different types: web usage mining, web content mining, and web structure mining (Figure 12.1).
Figure 12.1: Web Mining structure
Web content mining
A website is designed in the form of pages, each with a distinct URL (uniform resource locator). A large website may contain thousands of pages. These pages and their content are managed using specialized software systems called Content Management Systems. Every page can have text, graphics, audio, video, forms, applications, and more kinds of content, including user-generated content.
Websites keep a record of all requests received for their pages/URLs, including requester information gathered using ‘cookies’. The log of these requests could be analyzed to gauge the popularity of those pages among different segments of the population. The text and application content on the pages could be analyzed for its usage by visit counts. The pages on a website themselves could be analyzed for the quality of content that attracts the most users. Thus the unwanted or unpopular pages could be weeded out, or they could be transformed with different content and style. Similarly, more resources could be assigned to keep the more popular pages fresh and inviting.
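Gauging page popularity from a request log can be sketched in a few lines. The log below is a hypothetical, already-parsed list of (visitor, page) pairs invented for this example. Real server logs carry many more fields and would need parsing first.

```python
from collections import Counter, defaultdict

# Hypothetical parsed request log: (visitor id from a cookie, page requested).
requests = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/products"),
    ("u3", "/home"), ("u2", "/contact"), ("u1", "/home"),
]

def page_popularity(requests):
    """Return {page: (total requests, unique visitors)}."""
    hits = Counter(page for _, page in requests)
    visitors = defaultdict(set)
    for visitor, page in requests:
        visitors[page].add(visitor)
    return {page: (hits[page], len(visitors[page])) for page in hits}

popularity = page_popularity(requests)
for page, (total, unique) in sorted(popularity.items(), key=lambda kv: -kv[1][0]):
    print(f"{page}: {total} requests from {unique} visitors")
```

Pages at the bottom of such a ranking are the candidates for weeding out or redesign, while pages at the top deserve the most upkeep.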
Web structure mining
The Web works through a system of hyperlinks using the Hypertext Transfer Protocol (HTTP). Any page can create a hyperlink to any other page, and it can be linked to by any other page. The intertwined or self-referral nature of the web lends itself to some unique network analytical algorithms. The structure of web pages could also be analyzed to examine the pattern of hyperlinks among pages. There are two basic strategic models for successful websites: Hubs and Authorities.
1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a gathering point, where people visit to access a variety of information. Media sites like Yahoo.com, or government sites, would serve that purpose. More focused sites like Traveladvisor.com and yelp.com could aspire to becoming hubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most complete and authoritative information on a particular subject. This could be factual information, news, advice, user reviews, etc. These websites would have the largest number of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative page for expert medical opinion. NYtimes.com would serve as an authoritative page for daily news.
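Hub and authority scores can be computed from the link structure alone. The sketch below follows the usual iterative scheme, in which a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The tiny link graph and page names are invented for illustration.

```python
import math

def normalize(scores):
    """Scale scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=20):
    """Return (hub, authority) scores for a {page: [pages it links to]} graph."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: endorsed by good hubs that link to this page.
        auth = normalize({p: sum(hub[q] for q in pages if p in links[q]) for p in pages})
        # Hub: links out to good authorities.
        hub = normalize({p: sum(auth[q] for q in links[p]) for p in pages})
    return hub, auth

# Hypothetical link graph; every link target also appears as a key.
links = {
    "portal": ["clinic", "news", "blog"],
    "directory": ["clinic", "news"],
    "blog": ["clinic"],
    "clinic": [],
    "news": [],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this toy graph the page that links out most widely emerges as the hub, and the page with the most inbound links from good hubs emerges as the authority.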
Web usage mining
As a user clicks anywhere on a webpage or application, the action is recorded by many entities in many locations. The browser at the client machine will record the click, and the web server providing the content will also make a record of the pages served and the user activity on those pages. The entities between the client and the server, such as the router, proxy server, or ad server, would also record that click.
The goal of web usage mining is to extract useful information and patterns from data generated through web page visits and transactions. The activity data comes from data stored in server access logs, referrer logs, agent logs, and client-side cookies. The user characteristics and usage profiles are also gathered directly, or indirectly through syndicated data. Further, metadata, such as page attributes, content attributes, and usage data, are also gathered.
The web content could be analyzed at multiple levels (Figure 12.2).
1. The server-side analysis would show the relative popularity of the web pages accessed. Those websites could be hubs and authorities.
2. The client-side analysis could focus on the usage pattern or the actual content consumed and created by users.
   1. Usage pattern could be analyzed using ‘clickstream’ analysis, i.e. analyzing web activity for patterns of sequence of clicks, and the location and duration of visits on websites. Clickstream analysis can be useful for web activity analysis, software testing, market research, and analyzing employee productivity.
   2. Textual information accessed on the pages retrieved by users could be analyzed using text mining techniques. The text would be gathered and structured using the bag-of-words technique to build a term-document matrix. This matrix could then be mined using cluster analysis and association rules for patterns such as popular topics, user segmentation, and sentiment analysis.
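Clickstream analysis, as described above, looks at the sequence and duration of page visits. A minimal sketch follows, assuming a hypothetical list of (user, timestamp in seconds, page) records. The field layout and values are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical click records: (user, seconds since session start, page).
clicks = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 90, "/cart"),
    ("u2", 0, "/home"), ("u2", 20, "/products"), ("u2", 50, "/home"),
]

def analyze_clickstream(clicks):
    """Count page-to-page transitions and total seconds spent on each page."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, page))
    transitions, dwell = Counter(), Counter()
    for visits in by_user.values():
        for (t0, p0), (t1, p1) in zip(visits, visits[1:]):
            transitions[(p0, p1)] += 1
            dwell[p0] += t1 - t0  # time on a page = gap until the next click
    return transitions, dwell

transitions, dwell = analyze_clickstream(clicks)
print(transitions.most_common(1))
print(dict(dwell))
```

The time spent on the last page of each session cannot be measured this way, since there is no following click to mark its end.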
Figure 12.2: Web Usage Mining architecture
Web usage mining has many business applications. It can help predict user behavior based on previously learned rules and users' profiles, and can help determine the lifetime value of clients. It can also help design cross-marketing strategies across products, by observing association rules among the pages on the website. Web usage mining can help evaluate promotional campaigns and see if the users were attracted to the website and used the pages relevant to the campaign. Web usage mining could be used to present dynamic information to users based on their interests and profiles. This includes targeting online ads and coupons at user groups based on user access patterns.
Web Mining Algorithms
Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages as being hubs or authorities. Many other HITS-based algorithms have also been published. The most famous and powerful link analysis algorithm is the PageRank algorithm. Invented by Google co-founder Larry Page, this algorithm is used by Google to organize the results of its search function. This algorithm helps determine the relative importance of any particular web page by counting the number and quality of links to a page. Websites with more links, and/or more links from higher-quality websites, will be ranked higher. It works in a way similar to determining the status of a person in a society of people. Those with relations to more people, and/or relations to people of higher status, will be accorded a higher status.
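The idea of counting both the number and the quality of inbound links can be sketched with the classic published form of PageRank, computed by repeated passes over the link graph. This is an illustrative sketch only, not the algorithm Google runs in production. The damping value of 0.85, the iteration count, and the four-page graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank for each page in a {page: [pages it links to]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                # Each page passes its rank on in equal shares to its link targets.
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank

# Hypothetical link graph; every link target also appears as a key.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page c ranks highest because three pages link to it. Page a ranks second despite having a single inbound link, because that link comes from the highly ranked page c.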
PageRank is the algorithm that helps determine the order of pages listed upon a Google Search query. The original PageRank algorithm formulation has been updated in many ways, and the latest algorithm is kept a secret so that other websites cannot take advantage of the algorithm and manipulate their website according to it. However, there are many standard elements that remain unchanged. These elements lead to the principles for a good website. The process of applying these principles is called Search Engine Optimization (SEO).
Conclusion
The web has growing resources, with more content every day and more users visiting it for many purposes. A good website should be useful, easy to use, and flexible for evolution. Using the insights gleaned from web mining, websites should be constantly optimized.
Web usage mining can help discover what content users really like and consume, and help prioritize that for improvement. Web structure mining can help improve traffic to those sites, by building authority for the sites.
Review Questions
1: What are the three types of web mining?
2: What is clickstream analysis?
3: What are the two major ways that a website can become popular?
4: What are the privacy issues in web mining?
5: A user spends 60 minutes on the web, visiting 10 web pages in all. Given the clickstream data, what kind of analysis would you do?
Chapter 13: Big Data
Big data is an umbrella term for a collection of data sets so large and complex that it becomes difficult to process them using traditional data management tools. There has been increasing democratization of the process of content creation and sharing over the Internet, using social media applications. The combination of cloud-based storage, social media applications, and mobile access devices is helping crystallize the big data phenomenon. The leading management consulting firm McKinsey & Co. created a flutter when it published a report in 2011 showing a huge impact of such big data on business and other organizations. It also reported that there will be millions of new jobs in the next decade related to the use of big data in many industries.
Big data can be used to discover new insights from a 360-degree view of a situation, which can allow for a completely new perspective on situations, new models of reality, and potentially new types of solutions. It can help spot business trends and opportunities. For example, Google is able to predict the spread of a disease by tracking the use of search terms related to the symptoms of the disease over the globe in real time. Big data can help determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. Big data is enabling evidence-based medicine, and many other innovations.
Data has become the new natural resource. Organizations have a choice in how to engage with this exponentially growing volume, variety and velocity of data. They can choose to be buried under the avalanche, or they can choose to use it for competitive advantage. Challenges in big data include the entire range of operations: capture, curation, storage, search, sharing, analysis, and visualization. Big data is more valuable when analyzed as a whole. More information is derivable from analysis of a single large set of related data than from separate smaller sets. However, special tools and skills are needed to manage such extremely large data sets.
Caselet: Personalized Promotions at Sears
A couple of years ago, Sears Holdings came to the conclusion that it needed to generate greater value from the huge amounts of customer, product, and promotion data it collected from its many brands. Sears required about eight weeks to generate personalized promotions, at which point many of them were no longer optimal for the company. It took so long mainly because the data required for these large-scale analyses were both voluminous and highly fragmented, housed in many databases and “data warehouses” maintained by the various brands. Sears turned to the technologies and practices of big data. As one of its first steps, it set up a Hadoop cluster, using a group of inexpensive commodity servers.
Sears started using the Hadoop cluster to store incoming data from all its brands and from existing data warehouses. It then conducted analyses on the cluster directly, avoiding the time-consuming complexities of pulling data from various sources and combining them so that they can be analyzed. Sears’s Hadoop cluster stores and processes several petabytes of data at a fraction of the cost of a comparable standard data warehouse. The time needed to generate a comprehensive set of promotions dropped from eight weeks to one. And these promotions are of higher quality, because they’re more timely, more granular, and more personalized. (Source: McAfee & Brynjolfsson, HBR, Oct 2012)
1: What are other ways in which Sears can benefit from Big Data?
2: What are the challenges in making use of Big Data?
Defining Big Data
In 2000, there were 800,000 petabytes of data in the world. It is expected to grow to 35 zettabytes by the year 2020. About a million books’ worth of data is being created daily on social media alone. Big data is big, fast, unstructured, and of many types. There are several unique features:
1. Variety: There are many types of data, including structured and unstructured data. Structured data consists of numeric and text fields. Unstructured data includes images, video, audio, and many other types. There are also many sources of data. The traditional sources of structured data include ERP systems and other operational systems. Sources for unstructured data include social media, the Web, RFID, machine data, and others. Unstructured data comes in a variety of sizes and resolutions, and is subject to different kinds of analysis. For example, video files can be tagged with labels, and they can be played, but video data is typically not computed upon, which is the same with audio data. Graphic data can be analyzed for network distances. Facebook texts and tweets can be analyzed for sentiments, but cannot be directly compared.
2. Velocity: The Internet greatly increases the speed of movement of data. From e-mails to social media to video files, data can move quickly. Cloud-based storage makes sharing instantaneous, and data easily accessible from anywhere. Social media applications enable people to share their data with each other instantly. Mobile access to these applications also speeds up the generation of and access to data (Figure 13.1).
Figure 13.1: Sources of Big Data (Source: Hortonworks.com)
3. Volume: Websites have become great sources and repositories for many kinds of data. User clickstreams are recorded and stored for future use. Social media applications such as Facebook, Twitter, Pinterest, and others have enabled users to become prosumers of data (producers and consumers). There is an increase in the number of data shares, and also in the size of each data element. High-definition videos can increase the total shared data. There are autonomous data streams of video, audio, text, data, and so on coming from social media sites, websites, RFID applications, and so on.
4. Sources of Data: There are several sources of data, including some new ones. Data from outside the organization may be incomplete, and of a different quality and accuracy.
   1. Social Media: All activities on the web and social media are considered stored and accessible. Email was the first major source of new data. Google searches, Facebook posts, tweets, YouTube videos, and blogs enable people to generate data for one another.
   2. Organizations: Business organizations and government are a major source of data. ERP systems, e-commerce systems, user-generated content, web-access logs, and many other sources generate valuable data for organizations.
   3. Machines: The Internet of Things is evolving. Many machines are connected to the web and autonomously generate data that is untouched by humans. RFID tags and telematics are two major applications that generate enormous amounts of data. Connected devices such as phones and refrigerators generate data about their location and status.
   4. Metadata: There is enormous data about data itself. Web crawlers and web-bots scan the web to capture new webpages, their HTML structure, and their metadata. This data is used by many applications, including web search engines.
The data also varies in quality. Quality depends upon the purpose of collecting the data, and how carefully it has been collected and curated. Data from within the organization is likely to be of a higher quality. Publicly available data would include some trustworthy data, such as that from the government.
Big Data Landscape
Big data can be understood at many levels (Figure 13.2). At the highest level are business applications to suit particular industries or to suit business intelligence for executives. A unique concept of “data as a service” is also possible for particular industries. At the next level, there are infrastructure elements for broad cross-industry applications, such as analytics and structured databases. This also includes offering this infrastructure as a service with some operational management services built in. At the core, big data is about technologies and standards to store and manipulate large, fast streams of data, and make them available for rapid data-based decision-making.
Figure 13.2: The Big Data Landscape (Source: bigdatalandscape.com)
Business Implications of Big Data
“Big data will disrupt your business. Your actions will determine whether these disruptions are positive or negative.” (Gartner, 2012).
Any industry that produces information-based products is most likely to be disrupted. Thus, the newspaper industry has taken a hit from digital distribution channels, as well as from published-on-web-only blogs. Entertainment has also been impacted by digital distribution and piracy, as well as by user-generated-and-uploaded content on the internet. The education industry is being disrupted by massive open online courses (MOOCs) and user-uploaded content. Health care delivery is impacted by electronic health records and digital medicine. The retail industry has been highly disrupted by e-commerce companies. Fashion companies are impacted by quick feedback on their designs on social media. The banking industry has been impacted by cost-effective online self-serve banking applications, and this will impact employment levels in the industry.
There is rapid change in business models enabled by big data technologies. Steve Jobs, the ex-CEO of Apple, conceded that his company’s products and business models would be disrupted. He preferred his older products to be cannibalized by his own new products rather than by those of the competition.
Every other business too will likely be disrupted. The key issue for business is how to harness big data to generate growth opportunities and to leapfrog competition. Organizations need to learn how to organize their businesses so that they do not get buried in the high volume, velocity, and variety of data, but instead use it smartly and proactively to obtain a quick but decisive advantage over their competition. Organizations need to figure out how to use big data as a strategic asset in real time, to identify opportunities, thwart threats, build new capabilities, and enhance operational efficiencies. Organizations can now effectively fuse strategy and digital business, and then strive to design an innovative “digital business strategy” around digital assets and capabilities.
Technology Implications of Big Data
"Big data" forces organizations to address the variety of information assets and how fast these new asset types are changing information management demands. (Gartner, 2012).
The growth of data is made possible in part by the advancement of storage technology. The attached graph shows the growth of disk-drive average capacities. The cost of storage is falling, the size of storage is getting smaller, and the speed of access is going up (Figure 13.3). Flash drives are becoming cheaper. Random access memory used to be expensive, but now is so inexpensive that entire databases can be loaded and processed quickly, instead of swapping sections of them into and out of high-speed memory.
New data management and processing technologies have emerged. IT professionals integrate “big data” structured assets with content and must increase their business requirement identification skills. Big data is going democratic. Business functions will be protective of their data and will begin initiatives around exploiting it. IT support teams need to find ways to support end-user-deployed big data solutions. Enterprise data warehouses will need to include big data in some form. The IT platform needs to be strengthened to help enable a “digital business strategy” around digital assets and capabilities.
Big Data Technologies
New tools and techniques have arisen in the last 10-20 years to handle this large and still growing data. There are technologies for storing and accessing this data.
1. Non-relational data structures: Big data is stored using non-traditional data structures. Large non-relational data platforms like Hadoop have emerged as a leading data management platform for big data. Hadoop’s Distributed File System (HDFS) stores large files as blocks spread across many machines, and the processing frameworks built on it work with data as ‘key and data-value’ combinations. Google’s Bigtable is another prominent technology. NoSQL databases are emerging as a popular way to store and access non-relational data. There is a matching data warehousing system called Hive, with its own SQL-like query language (HiveQL). The open-source stack of programming languages (such as Pig) and other tools helps make Hadoop a powerful and popular tool.
2. Massively parallel computing: Given the size of data, it is useful to divide and conquer the problem quickly using multiple processors simultaneously. Parallel processing allows for the data to be processed by multiple machines so that results can be achieved sooner. The Map-Reduce algorithm, originally developed at Google for doing searches faster, has emerged as a popular parallel processing mechanism. The original problem is divided into smaller problems, which are then mapped to multiple processors that can operate in parallel. The outputs of these processors are passed to an output processor that reduces the output to a single stream, which is then sent to the end user. Figure 13.4 shows an example of a Map-Reduce algorithm.
Figure 13.4 A MapReduce algorithm example (source: www.cs.uml.edu)
3. Unstructured Information Management Architecture (UIMA): This is one of the elements in the “secret sauce” behind IBM’s Watson system, which reads massive amounts of data and organizes it for just-in-time processing. Watson beat the Jeopardy (quiz program) champion in 2011 and is now used for many business applications, like diagnosis, in health care situations. Natural language processing is another capability that helps extend the power of big data technologies.
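The Map-Reduce idea described in item 2 above can be tried out on a single machine. The following is only a minimal sketch in Python, not how Hadoop or Google implement it: the function names (map_phase, shuffle, reduce_phase) and the two text chunks are made up for illustration, and the "processors" are simulated by an ordinary loop.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all the values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:  # in a real cluster, each chunk goes to a different processor
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Toy input, already divided into two smaller problems.
chunks = ["big data is big", "data is valuable"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, the map step can be spread over as many machines as there are chunks; only the grouping and reducing steps need to see results from all of them.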
Management of Big Data
Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of big data.
1. Across all industries, the business case for big data is strongly focused on addressing customer-centric objectives. The first focus in deploying big data initiatives is to protect and enhance customer relationships and customer experience.
2. Solve a real pain-point. Big data should be deployed for specific business objectives in order to avoid being overwhelmed by the sheer size of it all.
3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one’s control and where one has a superior understanding of the data.
4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.
5. Advanced analytical capabilities are required, yet lacking, for organizations to get the most value from big data. There is a growing awareness of building or hiring those skills and capabilities.
6. Use more diverse data, not just more data. This would provide a broader perspective into reality and better quality insights.
7. The faster you analyze the data, the more its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.
8. Don’t throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later in a multiplicative manner.
9. Maintain one copy of your data, not multiple. This would help avoid confusion and increase efficiency.
10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.
11. A scalable and extensible information management foundation is a prerequisite for big data advancement. Big data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.
12. Big data is transforming business, just like IT did. Big data is a new phase representing a digital world. Business and society are not immune to its strong impacts.
Conclusion
Big Data is a new natural force and natural resource. The exponentially growing volume, variety and velocity of data is constantly disrupting businesses across all industries, at multiple levels from product to business models. Organizations need to begin initiatives around big data; acquire skills, tools and technologies; and show the vision to disrupt their industry and come out ahead.
Review Questions
1: What are the 3 Vs of Big Data?
2: How does Big Data impact the business models?
3: What is Hadoop?
4: How does the Map-Reduce algorithm work?
5: What are the key issues in managing Big Data?
Chapter 14: Data Modeling Primer
Data needs to be efficiently structured and stored so that it includes all the information needed for decision making, without duplication and loss of integrity. Here are the top ten qualities of good data.
Data should be:
1. Accurate: Data should retain consistent values across data stores, users and applications. This is the most important aspect of data. Any use of inaccurate or corrupted data to do any analysis is known as the garbage-in-garbage-out (GIGO) condition.
2. Persistent: Data should be available for all times, now and later. It should thus be nonvolatile, stored and managed for later access.
3. Available: Data should be made available to authorized users, when, where, and how they want to access it, within policy constraints.
4. Accessible: Not only should data be available to the user, it should also be easy to use. Thus, data should be made available in desired formats, with easy tools. MS Excel is a popular medium to access numeric data, and then transfer it to other formats.
5. Comprehensive: Data should be gathered from all relevant sources to provide a complete and holistic view of the situation. New dimensions should be added to data as and when they become available.
6. Analyzable: Data should be available for analysis, for historical and predictive purposes. Thus, data should be organized such that it can be used by analytical tools, such as OLAP, data cubes, or data mining.
7. Flexible: Data is growing in variety of types. Thus, data stores should be able to store a variety of data types: small/large, text/video, and so on.
8. Scalable: Data is growing in volume. Data storage should be organized to meet emergent demands.
9. Secure: Data should be doubly and triply backed up, and protected against loss and damage. There is no bigger IT nightmare than corrupted data. Inconsistent data has to be manually sorted out, which leads to loss of face, loss of business, and downtime, and sometimes the business never recovers.
10. Cost-effective: The cost of collecting data and storing it is coming down rapidly. However, the total cost of gathering, organizing, and storing a type of data should still be proportional to the estimated value from its use.
Evolution of data management systems
Data management has evolved from manual filing systems to the most advanced online systems capable of handling millions of data processing and access requests every second.
The first data management systems were called file systems. These mimicked paper files and folders. Everything was stored chronologically. Access to this data was sequential.
The next step in data modeling was to find ways to access any random record quickly. Thus hierarchical database systems appeared. They were able to connect all items for an order, given an order number.
The next step was to traverse the linkages both ways, from the top of the hierarchy to the bottom, and from the bottom to the top. Given an item sold, one should be able to find its order number, and list all the other items sold in that order. Thus there were networks of links established in the data to track those relationships.
The major leap came when the relationship between data elements itself became the center of attention. The relationship between data values was the key element of storage. Relationships were established through matching values of common attributes, rather than by location of the record in a file. This led to data modeling using relational algebra. Relations could be joined and subtracted, with set operations like union and intersection. Searching the data became an easier task by declaring the values of a variable of interest.
The relational model was enhanced to include variables with non-comparable values like binary objects (such as pictures) which had to be processed differently. Thus emerged the idea of encapsulating the procedures along with the data elements they worked on. The data and its methods were encapsulated into an object. Those objects could be further specialized. For example, a vehicle is an object with certain attributes. A car and a truck are more specialized versions of a vehicle. They inherited the data structure of the vehicle, but had their own additional attributes. Similarly the specialized object inherited all the procedures and programs associated with the more general entity. This became the object-oriented model.
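The vehicle example can be made concrete with a few lines of code. This is just an illustrative sketch in Python, with made-up attribute names (vehicle_id, wheels, seats, payload_tons); it is not taken from any particular object-oriented database.

```python
class Vehicle:
    """General object: holds the data and the procedures shared by all vehicles."""

    def __init__(self, vehicle_id, wheels):
        self.vehicle_id = vehicle_id
        self.wheels = wheels

    def describe(self):
        return f"{type(self).__name__} {self.vehicle_id} with {self.wheels} wheels"


class Car(Vehicle):
    """Specialized object: inherits from Vehicle and adds its own attribute."""

    def __init__(self, vehicle_id, seats):
        super().__init__(vehicle_id, wheels=4)
        self.seats = seats


class Truck(Vehicle):
    """Another specialization, with a different additional attribute."""

    def __init__(self, vehicle_id, wheels, payload_tons):
        super().__init__(vehicle_id, wheels)
        self.payload_tons = payload_tons


car = Car("C1", seats=5)
truck = Truck("T1", wheels=6, payload_tons=10)
print(car.describe())    # describe() is inherited from Vehicle, not rewritten
print(truck.describe())
```

Neither Car nor Truck defines describe(); both pick it up from Vehicle, which is the inheritance of procedures described above.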
Relational Data Model
The first mathematical-theory-driven model for data management was designed by Ed Codd of IBM in 1970.
1. A relational database is composed of a set of relations (data tables), which can be joined using shared attributes. A “data table” is a collection of instances (or records), with a key attribute to uniquely identify each instance.
2. Data tables can be JOINed using the shared “key” attributes to create larger temporary tables, which can be queried to fetch information across tables. Joins can be simple ones, as between two tables. Joins can also be complex, with AND, OR, UNION or INTERSECTION, and more operations.
3. High-level commands in Structured Query Language (SQL) can be used to perform joins, selection, and organizing of records.
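The three points above can be seen at work in a small experiment. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the customers and orders tables, their columns, and the rows are hypothetical and chosen only to show a join on a shared key attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is saved to disk
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# JOIN the two tables on the shared key attribute, then select and order the records.
rows = conn.execute("""
SELECT c.name, o.order_id, o.amount
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.amount > 20
ORDER BY o.order_id
""").fetchall()
print(rows)
```

The join produces a temporary combined table; the WHERE clause then selects from it and ORDER BY organizes the result, without the query ever saying where the records are physically stored.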
Relational data models flow from conceptual models, to logical models, to physical implementations. Data can be conceived of as being about entities, and relationships among entities. A relationship between entities may be a hierarchy between entities, or transactions involving multiple entities. These can be graphically represented as an entity–relationship diagram (ERD).
In Figure 14.1, the rectangles represent the entities Students and Courses. The diamond shows the Enrolment relationship.
Figure 14.1: Simple relationship between two entities
Here are some fundamental concepts on ERD:
1. An entity is any object or event about which someone chooses to collect data, which may be a person, place, or thing (e.g., sales person, city, product, vehicle, employee).
2. Entities have attributes. Attributes are data items that have something in common with the entity. For example, student id, student name, and student address represent details for a student entity. Attributes can be single-valued (e.g., student name) or multi-valued (list of past addresses for the student). Attributes can be simple (e.g., student name) or composite (e.g., student address, composed of street, city, and state).
3. Every entity must have a key attribute(s) that can be used to identify an instance. E.g. Student ID can identify a student. A primary key is a unique attribute value for the instance (e.g. Student ID). Any attribute that can serve as a primary key (e.g. Student Address) is a candidate key. A secondary key, which may not be unique, may be used to select a group of records (e.g. Student city). Some entities will have a composite key, a combination of two or more attributes that together uniquely represent the key (e.g. Flight number and Flight date). A foreign key is useful in representing a one-to-many relationship. The primary key of the file at the one end of the relationship should be contained as a foreign key on the file at the many end of the relationship.
4. Relationships have many characteristics: degree, cardinality, and participation.
5. Degree of relationship depends upon the number of entities participating in a relationship. Relationships can be unary (e.g., employee and manager-as-employee), binary (e.g., student and course), and ternary (e.g., vendor, part, warehouse).
6. Cardinality represents the extent of participation of each entity in a relationship.
   1. One-to-one (e.g., employee and parking space)
   2. One-to-many (e.g., customer and orders)
   3. Many-to-many (e.g., student and course)
7. Participation indicates the optional or mandatory nature of a relationship.
   1. Customer and order (mandatory)
   2. Employee and course (optional)
8. There are also weak entities that are dependent on another entity for their existence (e.g., employees and dependents). If an employee's data is removed, then the dependent data must also be removed.
9. There are associative entities used to represent many-to-many relationships (e.g., student-course enrolment). There are two ways to implement a many-to-many relationship. It could be converted into two one-to-many relationships with an associative entity in the middle. Alternatively, the combination of primary keys of the entities participating in the relationship will form the primary key for the associative entity.
10. There are also supertype and subtype entities. These help represent additional attributes on a subset of the records. For example, vehicle is a supertype and passenger car is its subtype.
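Several of the concepts in the list above (primary key, foreign key, composite key, and an associative entity for a many-to-many relationship) can be shown together in one small schema. This is a sketch only, using Python's built-in sqlite3 module; the table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two entities, each with a primary key and a few attributes.
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE course  (course_id TEXT PRIMARY KEY, title TEXT);

-- Associative entity: turns the many-to-many relationship into two
-- one-to-many relationships. Its primary key is the composite of the
-- two foreign keys.
CREATE TABLE enrolment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  TEXT    REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);

INSERT INTO student VALUES (1, 'Asha', 'Fairfield'), (2, 'Ben', 'Ames');
INSERT INTO course VALUES ('DA101', 'Data Analytics'), ('DB201', 'Databases');
INSERT INTO enrolment VALUES (1, 'DA101'), (1, 'DB201'), (2, 'DA101');
""")

# One student can take many courses ...
courses_of_asha = [row[0] for row in conn.execute(
    "SELECT c.title FROM enrolment e JOIN course c ON c.course_id = e.course_id "
    "WHERE e.student_id = 1 ORDER BY c.title")]

# ... and one course can have many students.
students_in_da101 = [row[0] for row in conn.execute(
    "SELECT s.name FROM enrolment e JOIN student s ON s.student_id = e.student_id "
    "WHERE e.course_id = 'DA101' ORDER BY s.name")]

print(courses_of_asha, students_in_da101)
```

Here city plays the role of a secondary key: it is not unique, but it could be used to select a group of students.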
Implementing the Relational Data Model
Once the logical data model has been created, it is easy to translate it into a physical data model, which can then be implemented using any publicly available DBMS. Every entity should be implemented by creating a database table. Every table will have a specific data field (key) that uniquely identifies each record (or row) in that table. Each master table or database relation should have programs to create, read, update, and delete the records.
The databases should follow 3 Integrity Constraints.
1. Entity integrity ensures that the entity or table is healthy. The primary key cannot have a null value, and every row must have a unique key value; a row that violates this should not be allowed in the table. As a corollary, if the primary key is a composite key, none of the fields participating in the key can contain a null value.
2. Domain integrity is enforced by using rules to validate the data as being of the appropriate range and type.
3. Referential integrity governs the nature of records in a one-to-many relationship. This ensures that the value of a foreign key has a matching value in the primary keys of the table referred to by the foreign key.
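A DBMS can enforce all three constraints by itself once they are declared. The sketch below, using Python's built-in sqlite3 module, is one way to see this; the customer and orders tables and the age range of 0 to 120 are invented for the example, and the helper function accepted() is not part of any library. Note that SQLite checks foreign keys only after the PRAGMA shown here is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,                    -- entity integrity
    age         INTEGER CHECK (age BETWEEN 0 AND 120)   -- domain integrity
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id)        -- referential integrity
);
INSERT INTO customer VALUES (1, 35);
""")

def accepted(statement):
    """Hypothetical helper: True if the DBMS accepts the statement, False if a constraint rejects it."""
    try:
        conn.execute(statement)
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    "duplicate primary key":      accepted("INSERT INTO customer VALUES (1, 50)"),
    "age out of range":           accepted("INSERT INTO customer VALUES (2, 200)"),
    "order for unknown customer": accepted("INSERT INTO orders VALUES (100, 99)"),
    "valid order":                accepted("INSERT INTO orders VALUES (101, 1)"),
}
print(results)
```

Only the last statement goes through; the first three are each stopped by one of the three integrity constraints.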
Database management systems (DBMS)
There are many database management software systems that help manage the activities related to storing the data model, the data itself, and doing the operations on the data and relations. The data in the DBMS grows, and it serves many users of the data concurrently. The DBMS typically runs on a computer called a database server, in an n-tier application architecture. Thus in an airline reservation system, millions of transactions might simultaneously try to access the same set of data. The database is constantly monitored and managed to provide data access to all authorized users, securely and speedily, while keeping the database consistent and useful. Content management systems are special purpose DBMS, or just features within standard DBMS, that help people manage their own data on a web-site. There are object-oriented and other more complex ways of managing data.
Structured Query Language
SQL is a very easy and powerful language to access relational databases. There are two essential components of SQL: the Data Definition Language (DDL) and the Data Manipulation Language (DML).
DDL provides instructions to create a new database, and to create new tables within a database. Further, it provides instructions to delete a database, or just a few tables within a database. There are other ancillary commands to define indexes, etc., for efficient access to the database.
DML is the heart of SQL. It provides instructions to add, read, modify and delete data from the database and any of its tables. The data can be selectively accessed, and then formatted, to answer a specific question. For example, to find the total sales of each movie, the SQL query would be:
SELECT Product_Name, SUM(Amount)
FROM Movies_Transactions
GROUP BY Product_Name;
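A query of this kind can be tried out without installing a database server. The sketch below uses Python's built-in sqlite3 module. It writes the table and column names with underscores, because hyphens are not valid in unquoted SQL names. The Quarter column and the four sample rows are assumptions added for illustration, so that the same table can be summarized by movie or by quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies_Transactions (Product_Name TEXT, Quarter TEXT, Amount REAL);
INSERT INTO Movies_Transactions VALUES
    ('Movie A', 'Q1', 10.0),
    ('Movie A', 'Q2', 12.0),
    ('Movie B', 'Q1',  8.0),
    ('Movie B', 'Q1',  5.0);
""")

# Total sales of each movie: one output row per Product_Name.
by_product = conn.execute("""
SELECT Product_Name, SUM(Amount)
FROM Movies_Transactions
GROUP BY Product_Name
ORDER BY Product_Name
""").fetchall()

# Total sales in each quarter: same data, grouped by a different attribute.
by_quarter = conn.execute("""
SELECT Quarter, SUM(Amount)
FROM Movies_Transactions
GROUP BY Quarter
ORDER BY Quarter
""").fetchall()

print(by_product)
print(by_quarter)
```

The attribute named after GROUP BY decides what each output row stands for, so changing that one attribute changes the question being answered.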
Conclusion
Data should be modeled to achieve the business objectives. Good data should be accurate and accessible, so that it can be used for business operations. The relational data model is the most popular way of managing data today.
Review Questions
1: Who invented the relational model and when?
2: How does the relational model mark a clear break from previous database models?
3: What is an Entity-Relationship diagram?
4: What kinds of attributes can an entity have?
5: What are the different kinds of relationships?
Appendix 1: Data Mining Tutorial with Weka
Developed for academic use only
by Dr. Anil Maheshwari & Dr. Edi Shivaji
This tutorial for the WEKA software platform is designed for use by a student of a course in Data Mining applications. This tutorial will provide examples of solving certain data mining problems using the Weka tool and the sample datasets provided with it.
Step 1: Download the free Weka software
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Step 2: Download the free Weka datasets
http://www.cs.waikato.ac.nz/ml/weka/datasets.html
Step 3: Access the associated textbook to learn about data mining
http://www.cs.waikato.ac.nz/~ml/weka/book.html
This tutorial uses data from the free Weka datasets. The sample problems addressed in this tutorial are:
1. Classification models: These are the most important application of data mining. We will use Decision trees and Regression methods.
2. Clustering: Using the K-means algorithm.
3. Association Rule Mining: Using the Apriori algorithm.
Exercise 1: Classification using DECISION TREES
Problem statement: What is the best way to predict whether a game will be on or off based on weather indicators? A dataset of past decisions has been provided.
Dataset used: Weather–nominal. It describes 14 instances of weather conditions and whether an outdoor game was possible or not (Play) under those weather conditions. Here is the raw data.
Load the dataset. It is nominal. However, there is no need for nominality of data for Classification.
Analysis used: J48 decision tree algorithm (it is an implementation of the C4.5 algorithm). It is a top-down approach.
Results:
Instances: 14
Attributes: 5
    outlook
    temperature
    humidity
    windy
    play
Test mode: evaluate on training data

=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

=== Summary ===
Correctly Classified Instances      14    100 %
Incorrectly Classified Instances     0      0 %
Kappa statistic                      1
Mean absolute error                  0
Root mean squared error              0
Relative absolute error              0 %
Root relative squared error          0 %
Total Number of Instances           14

=== Detailed Accuracy By Class ===
           TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
           1        0        1          1       1          1         yes
           1        0        1          1       1          1         no
Wtd Avg.   1        0        1          1       1          1

=== Confusion Matrix ===
 a  b   <-- classified as
 9  0 | a = yes
 0  5 | b = no
Note: The model classifies 100% of the training instances correctly. The pruned tree shows the rules for making the decision in a text form.
Interpreting the tree: The first split variable is “Outlook”. If outlook is sunny, then check for humidity. If outlook is overcast, the answer is yes. If the outlook is rainy, then check for windy.
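The J48 pruned tree printed in the results can also be read as a set of nested if-then rules. As a rough sketch, here are the same rules written as a small Python function; the function name predict_play and the argument names are made up for illustration, while the attribute values follow the Weka output.

```python
def predict_play(outlook, humidity, windy):
    """Apply the rules of the J48 pruned tree shown in the results above."""
    if outlook == "overcast":
        return "yes"                                      # leaf: yes (4.0)
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"    # leaves: yes (2.0) / no (3.0)
    if outlook == "rainy":
        return "no" if windy else "yes"                   # leaves: no (2.0) / yes (3.0)
    raise ValueError(f"unknown outlook: {outlook!r}")

print(predict_play("sunny", "high", False))     # sunny and humid
print(predict_play("overcast", "high", True))   # overcast: humidity and wind do not matter
print(predict_play("rainy", "normal", True))    # rainy and windy
```

Temperature does not appear in the function at all, because the tree never splits on it.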
Visualizing the output: Weka can create a visual version of the tree.
Interpreting the Visual Tree: The visual decision tree is simple and self-explanatory.
Exercise:
1. Try different decision tree algorithms in Weka for this simple dataset.
2. Compare time taken, accuracy, and interpretability of the output.
Exercise 2: Classification using DECISION TREES
Problem statement: What is the best model to diagnose whether a breast lump is benign or malignant?
Dataset used: breast-w. This is a much larger dataset. It shows many more variables and instances. It describes 699 instances of biopsy analyses of breast cancer suspects. There are 15 variables, some of which are nominal while others are numeric. The class variable shows if the instance was judged to be benign or malignant.
Load the dataset. There is no need for nominality of data for Decision trees. For simplicity of analysis, however, only the nominal variables were kept, while others were removed from the dataset before analysis.
Analysis used: J48 decision tree algorithm.
Results:
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: wisconsin-breast-cancer
Instances: 699
Attributes: 10
    Clump_Thickness
    Cell_Size_Uniformity
    Cell_Shape_Uniformity
    Marginal_Adhesion
    Single_Epi_Cell_Size
    Bare_Nuclei
    Bland_Chromatin
    Normal_Nucleoli
    Mitoses
    Class – (Benign/Malignant)
Test mode: evaluate on training data

=== Classifier model (full training set) ===
J48 pruned tree
------------------
Cell_Size_Uniformity <= 2
|   Bare_Nuclei <= 3: benign (405.39/2.0)
|   Bare_Nuclei > 3
|   |   Clump_Thickness <= 3: benign (11.55)
|   |   Clump_Thickness > 3
|   |   |   Bland_Chromatin <= 2
|   |   |   |   Marginal_Adhesion <= 3: malignant (2.0)
|   |   |   |   Marginal_Adhesion > 3: benign (2.0)
|   |   |   Bland_Chromatin > 2: malignant (8.06/0.06)
Cell_Size_Uniformity > 2
|   Cell_Shape_Uniformity <= 2
|   |   Clump_Thickness <= 5: benign (19.0/1.0)
|   |   Clump_Thickness > 5: malignant (4.0)
|   Cell_Shape_Uniformity > 2
|   |   Cell_Size_Uniformity <= 4
|   |   |   Bare_Nuclei <= 2
|   |   |   |   Marginal_Adhesion <= 3: benign (11.41/1.21)
|   |   |   |   Marginal_Adhesion > 3: malignant (3.0)
|   |   |   Bare_Nuclei > 2
|   |   |   |   Clump_Thickness <= 6
|   |   |   |   |   Cell_Size_Uniformity <= 3: malignant (13.0/2.0)
|   |   |   |   |   Cell_Size_Uniformity > 3
|   |   |   |   |   |   Marginal_Adhesion <= 5: benign (5.79/1.0)
|   |   |   |   |   |   Marginal_Adhesion > 5: malignant (5.0)
|   |   |   |   Clump_Thickness > 6: malignant (31.79/1.0)
|   |   Cell_Size_Uniformity > 4: malignant (177.0/5.0)

Number of Leaves: 14
Size of the tree: 27
Time taken to build model: 0.07 seconds

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances     686    98.1402 %  (i.e. 98% of cases are classified correctly)
Incorrectly Classified Instances    13     1.8598 %
Kappa statistic                      0.959
Mean absolute error                  0.0355
Root mean squared error              0.1324
Relative absolute error              7.8614 %
Root relative squared error         27.8462 %
Total Number of Instances          699

=== Detailed Accuracy By Class ===
                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.983    0.021    0.989      0.983   0.986      0.989     benign
                0.979    0.017    0.967      0.979   0.973      0.989     malignant
Weighted Avg.   0.981    0.02     0.981      0.981   0.981      0.989

=== Confusion Matrix ===
   a    b   <-- classified as
 450    8 | a = benign (450 benign cases are correctly classified as benign, 8 are false positives)
   5  236 | b = malignant (236 malignant cases are correctly classified as malignant, 5 are false negatives)

Visualizing the output: The pruned tree in text form looks complex and hard to read. The visual decision tree makes it easier to grasp.
Interpreting the decision tree output:
1. The numbers on the leaf nodes show how many instances reach that node and, after the slash, how many of those are incorrectly classified. The decision rule/node on the right covers 177 instances, of which it incorrectly classifies 5.
2. Not all nodes are equally important. Some nodes explain many more instances than other nodes.
   1. E.g. a single node on the left of the tree represents a very simple rule (cell_size_uniformity <= 2 and bare_nuclei <= 3) that easily explains about 90% (405 out of 450) of the benign cases, and more than 55% of the total cases (405 out of 699).
   2. Similarly, the node on the right covers 177 instances, which is over 70% of the 241 malignant cases, and thus provides a clear rule or heuristic.
3. The tree shows a clear path for diagnosing each case. And so on and on.
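The accuracy figures in the Weka output for this exercise can be recomputed by hand from the four numbers in its confusion matrix (450, 8, 5 and 236). The short Python sketch below does that arithmetic; the dictionary layout and the function names are my own choices, not part of Weka.

```python
# Confusion matrix from the Weka output: keys are (actual class, predicted class).
confusion = {
    ("benign", "benign"): 450,
    ("benign", "malignant"): 8,
    ("malignant", "benign"): 5,
    ("malignant", "malignant"): 236,
}

def accuracy(cm):
    """Share of all instances whose predicted class equals the actual class."""
    correct = sum(n for (actual, predicted), n in cm.items() if actual == predicted)
    return correct / sum(cm.values())

def recall(cm, cls):
    """TP Rate: of the instances that actually are cls, the share predicted as cls."""
    actual_total = sum(n for (actual, _), n in cm.items() if actual == cls)
    return cm[(cls, cls)] / actual_total

def precision(cm, cls):
    """Of the instances predicted as cls, the share that actually are cls."""
    predicted_total = sum(n for (_, predicted), n in cm.items() if predicted == cls)
    return cm[(cls, cls)] / predicted_total

print(round(accuracy(confusion) * 100, 4))   # percent correctly classified
for cls in ("benign", "malignant"):
    print(cls, round(recall(confusion, cls), 3), round(precision(confusion, cls), 3))
```

Rounded to three decimals, these values agree with the TP Rate, Recall and Precision columns that Weka reports for each class.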
Exercise 3: Cluster Analysis using K-Means algorithm
Nature of problem/opportunity: Understand the underlying clusters among instances of breast cancer evaluations.
Dataset used: breast-w. It describes 699 instances of biopsy analyses of breast cancer suspects. There are 15 variables, some of which are nominal while others are numeric. The class variable shows if the instance was judged to be benign or malignant.
Data preparation: Load the dataset.
Analysis used: K-means algorithm. Choices include the number of clusters to begin with.
Output of the analysis:
Instances: 699
Attributes: 10

=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 259.92291180466714
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute                Full Data    0            1
                         (699)        (246)        (453)
=====================================================
Clump_Thickness          4.4177       7.1748       2.9205
Cell_Size_Uniformity     3.1345       6.5976       1.2539
Cell_Shape_Uniformity    3.2074       6.5732       1.3797
Marginal_Adhesion        2.8069       5.5325       1.3267
Single_Epi_Cell_Size     3.216        5.3089       2.0795
Bare_Nuclei              3.5447       7.5576       1.3654
Bland_Chromatin          3.4378       5.9634       2.0662
Normal_Nucleoli          2.867        5.8943       1.223
Mitoses                  1.5894       2.561        1.0618
Class                    benign       malignant    benign

=== Model and evaluation on training set ===
Clustered Instances
0    246 (35%)
1    453 (65%)

Interpretation: This is a very clear result. There are clearly two classes: malignant and benign.
Sensitivity analysis of Clustering: The two classes above could be unduly influenced by the binary class variable (benign, malignant). So, remove that variable and run the same analysis again.

kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 243.1478671867869
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute                Full Data    0            1
                         (699)        (233)        (466)
========================================================
Clump_Thickness          4.4177       7.1588       3.0472
Cell_Size_Uniformity     3.1345       6.7983       1.3026
Cell_Shape_Uniformity    3.2074       6.7296       1.4464
Marginal_Adhesion        2.8069       5.7339       1.3433
Single_Epi_Cell_Size     3.216        5.4721       2.088
Bare_Nuclei              3.5447       7.874        1.38
Bland_Chromatin          3.4378       6.103        2.1052
Normal_Nucleoli          2.867        6.0773       1.2618
Mitoses                  1.5894       2.5494       1.1094

=== Model and evaluation on training set ===
Clustered Instances
0    233 (33%)
1    466 (67%)
Interpretation:
1. The cluster structure has not changed.
2. However, the share of instances in each cluster has slightly changed, from 35-65% to 33-67%. So, there is more error of Type-2; i.e. more cases are marked in the ‘benign’ category than is actually the case.
Sensitivity analysis of Clustering #2: Maybe there are cases that are not fully malignant, but are not truly benign either. So, change the number of clusters to 3, instead of 2. Run the analysis again.
Results:
Within cluster sum of squared errors: 227.7071391007967
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute                Full Data    0            1            2
                         (699)        (222)        (178)        (299)
=====================================================================
Clump_Thickness          4.4177       7.1982       5.0337       1.9866
Cell_Size_Uniformity     3.1345       6.964        1.7584       1.1104
Cell_Shape_Uniformity    3.2074       6.8829       2.0169       1.1873
Marginal_Adhesion        2.8069       5.9144       1.7191       1.1472
Single_Epi_Cell_Size     3.216        5.5045       2.4494       1.9732
Bare_Nuclei              3.5447       7.9578       1.8381       1.284
Bland_Chromatin          3.4378       6.2027       2.5225       1.9298
Normal_Nucleoli          2.867        6.1937       1.7303       1.0736
Mitoses                  1.5894       2.6126       1.191        1.0669
Interpretation of results:
1. The cluster structure has obviously changed since the number of clusters has changed. It is clear that the real split has been in the benign group.
2. A significant number of benign instances seem to have fallen into an intermediate/borderline category. Some marginally malignant cases have also fallen into this same category. These cases may need to be put under extra scrutiny.
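To see what the K-means algorithm does behind Weka's output, it helps to run it on a handful of points. The following is a bare-bones sketch in plain Python, not Weka's SimpleKMeans: the six two-dimensional points and the choice of starting centroids are invented for illustration, and real tools add refinements such as random restarts and handling of missing values.

```python
def kmeans(points, centroids, max_iter=100):
    """Repeat 'assign each point to its nearest centroid, then move each
    centroid to the mean of its points' until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid where it was

        if new_centroids == centroids:      # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data: two visibly separate groups of three points each.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print([len(c) for c in clusters])
```

The number of starting centroids plays the same role as the "number of clusters" choice in Weka: passing three starting points instead of two would force a three-way split, as in the second sensitivity analysis above.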
Exercise 4: Association Rules using Apriori algorithm
ASSOCIATION RULES
Nature of problem/opportunity: Understand the underlying associations among commercial aspects of life of foreign workers in Germany.
Dataset used: Credit-g.arff. This shows data about demographics, job type, assets, and credit class of workers in Germany. It shows 17 variables for 1000 German workers.
Data preparation: Load the dataset. Ensure all non-nominal variables are removed from the analysis, because association rules work only on nominal data.
Analysis used: Apriori algorithm. Choices include changing the minimum level of confidence in a rule (say 90%), and the minimum support level (10%).
Output of the analysis:
Instances: 1000
Attributes: 11
    checking_status
    credit_history
    purpose
    savings_status
    employment
    personal_status
    property_magnitude
    housing
    job
    own_telephone
    class

=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (100 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18
……
Ten Best rules found:
1. housing=for free 108 ==> property_magnitude=no known property 104 conf:(0.96)
2. checking_status=no checking credit_history=critical/other existing credit housing=own 126 ==> class=good 120 conf:(0.95)
3. checking_status=no checking purpose=radio/tv 127 ==> class=good 120 conf:(0.94)
4. checking_status=no checking purpose=radio/tv housing=own 108 ==> class=good 102 conf:(0.94)
5. personal_status=male single property_magnitude=car job=skilled 124 ==> housing=own 117 conf:(0.94)
6. checking_status=no checking personal_status=male single housing=own job=skilled 121 ==> class=good 114 conf:(0.94)
7. checking_status=no checking credit_history=critical/other existing credit 153 ==> class=good 143 conf:(0.93)
8. checking_status=no checking employment=>=7 115 ==> class=good 107 conf:(0.93)
9. personal_status=male single property_magnitude=car class=good 129 ==> housing=own 120 conf:(0.93)
10. checking_status=no checking job=skilled own_telephone=yes 117 ==> class=good 108 conf:(0.92)
Interpreting the Output
1. Rule 1 implies that 96% of those who live in free housing do not own any property.
2. Rule 5 implies that single males that hold skilled jobs and own a car are also likely to own a house (94% chance).
3. Rule 9 implies that single males that have a good credit class and own a car are also likely to own a house (93% chance).
4. Rules 5 and 9 are highly overlapping. These are two candidates for potentially combining.
5. And so on and on.
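The confidence value printed at the end of each rule is simple arithmetic on the two counts shown in that rule: the number of instances matching the left-hand side, and the number matching both sides. The Python sketch below recomputes it for three of the rules listed above; the function names and the tuple layout are my own, and 1000 is the number of instances reported by Weka.

```python
TOTAL_INSTANCES = 1000  # from the Weka output: Instances: 1000

# (rule number, count matching the left-hand side, count matching both sides)
rules = [
    (1, 108, 104),   # housing=for free ==> property_magnitude=no known property
    (3, 127, 120),   # checking_status=no checking, purpose=radio/tv ==> class=good
    (10, 117, 108),  # ... job=skilled, own_telephone=yes ==> class=good
]

def confidence(lhs_count, both_count):
    """Of the instances matching the left-hand side, the share that also match the right-hand side."""
    return both_count / lhs_count

def support(both_count, total=TOTAL_INSTANCES):
    """Share of all instances that match both sides of the rule."""
    return both_count / total

for number, lhs_count, both_count in rules:
    print(number, round(confidence(lhs_count, both_count), 2), support(both_count))
```

A rule is reported only if its support and confidence both reach the chosen minimums, here 0.1 and 0.9.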
---***---
Appendix 2: Data Mining Tutorial with R
Developed for academic use only
by Dr. Anil Maheshwari & Mr. Tonmay Bhattacharjee
BasicRtutorialfordataminingLearnthebasic:
1. Google“codeR”andgototheRcodeschoolwebsite.Youcandirectlygotohttp://tryr.codeschool.comtoo.2. Signup/registerprovidingthesimpleinformation.3. Followthesimpleinstructionandpracticeonthegivencodewindow.4. Finishthestepandunlockthenextsteps.5. Finishallsevenstepsandyou’llseeacongratulationpagelikebellow.
Install R:
1. Go to the official R project site at http://www.r-project.org/
2. Click "download R" to choose a mirror. This takes you to a list of mirror sites.
3. Choose the link for Iowa State University or any other mirror you like.
4. Choose your operating system. In my case it was Windows.
5. Click "install R for the first time".
6. Click the download link to download the .exe file (for Windows).
7. Click "Save File" to save the .exe file to your computer.
8. Double-click the .exe file to begin the installation.
9. Click "Run" and follow the steps: click Next, accept the agreement, select your installation folder, and finish the installation.
Coding with R:
Select the R application from your Start menu. The coding style is the same as what you practiced on R Code School.
Decision tree:
1. Load the MASS library, which provides the support functions and datasets for Venables and Ripley's MASS, using library("MASS")
2. Convert your .xls or .xlsx file to a .csv file and put it in your Documents folder.
3. Load the data into a variable using read.csv("filename.csv"). In my case I loaded the data into a variable named data using data <- read.csv("height.csv")
4. Load the rpart library for decision trees using library("rpart")
5. Build the tree and assign it to a variable: tree <- rpart(gend ~ Height + age + wt, data=data, method="class"). Here gend, Height, age and wt are column names, and the tree predicts gend from Height, age and wt. data is the variable that holds the contents of your csv file, and method="class" stands for classification.
6. Plot the tree using plot(tree)
7. Put labels on the tree using text(tree). A simple decision tree should be drawn.
8. To make the tree a little fancier, install rpart.plot using install.packages('rpart.plot')
9. Select your mirror for the installation.
10. In the same way, install RColorBrewer using install.packages('RColorBrewer') and rattle using install.packages('rattle'). rattle is a free graphical interface for data mining with R.
11. Load the rattle library using library('rattle')
12. Load the rpart.plot library using library('rpart.plot')
13. Load the RColorBrewer library using library('RColorBrewer')
14. Now draw the tree using fancyRpartPlot(tree)
[Figure: example code and the resulting decision tree]
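As a minimal sketch of the decision tree steps above, the following R code builds a classification tree with rpart. Since height.csv is not available here, it assumes a small made-up data frame with the same column names (gend, Height, age, wt); the values, the seed, and the accuracy check are illustrative assumptions rather than the authors' data or results.

```r
library(rpart)

# Hypothetical stand-in for read.csv("height.csv"): 50 females and 50 males.
set.seed(1)
n <- 100
data <- data.frame(
  gend   = factor(rep(c("F", "M"), each = n / 2)),
  Height = c(rnorm(n / 2, mean = 162, sd = 5), rnorm(n / 2, mean = 176, sd = 5)),
  age    = sample(18:60, n, replace = TRUE),
  wt     = c(rnorm(n / 2, mean = 60, sd = 6), rnorm(n / 2, mean = 78, sd = 6))
)

# Classification tree predicting gend from Height, age and wt.
tree <- rpart(gend ~ Height + age + wt, data = data, method = "class")
print(tree)

# Draw and label the tree only when running interactively.
if (interactive()) {
  plot(tree)
  text(tree)
}

# Share of training records that the tree classifies correctly.
pred <- predict(tree, data, type = "class")
accuracy <- mean(pred == data$gend)
print(accuracy)
```

With real data, replace the made-up data frame with the result of read.csv, and judge accuracy on records held out from training instead of on the training set itself.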
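Correlation and regression:
1. Install the necessary libraries and load the data in the same way as described for the decision tree.
2. Use cov(data) to see the covariance between each pair of variables; cor(data) gives the corresponding correlations.
3. Use pairs(data) to draw a matrix of scatterplots, in which the relationship between each pair of variables can be inspected visually.
[Figure: two example R sessions illustrating these steps]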
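The correlation and regression steps can be sketched in R as follows. The data frame is made up for illustration (the column names Height, wt and age mirror the decision tree example, and the numbers are assumptions). The call to lm() is an addition not described in the steps above: it is R's standard function for fitting a linear regression.

```r
# Hypothetical numeric data: weight depends roughly linearly on height.
set.seed(42)
Height <- rnorm(50, mean = 170, sd = 8)
wt     <- 0.9 * Height - 85 + rnorm(50, mean = 0, sd = 5)
age    <- sample(18:60, 50, replace = TRUE)
data   <- data.frame(Height = Height, wt = wt, age = age)

# Covariance and correlation matrices (all columns must be numeric).
print(cov(data))
print(cor(data))

# Scatterplot matrix, drawn only when running interactively.
if (interactive()) pairs(data)

# Simple linear regression of weight on height.
fit <- lm(wt ~ Height, data = data)
print(coef(fit))
```

The Height coefficient printed by coef(fit) estimates how much wt changes per unit of Height in this made-up data.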
For further help, visit:
http://www.rdatamining.com/docs/introduction-to-data-mining-with-r
Additional Resources
Teradatanetwork.com: Join Teradata University Network to access tools and materials for Business Intelligence. It is completely free for students.
Advance Praise for this book:
“This book is a splendid and valuable addition to this subject. The whole book is well written and I have no hesitation to recommend that this can be adapted as a textbook for graduate courses in Business Intelligence and Data Mining.” – Dr. Edi Shivaji, Des Moines, Iowa, USA.
“Really well written and timely as the World gets in the Big Data mode! I think this can be a good bridge and primer for the uninitiated manager who knows Big Data is the future but doesn't know where to begin!” – Dr. Alok Mishra, Singapore.
“This book has done a great job of taking a complex, highly important subject area and making it accessible to everyone. It begins by simply connecting to what you know, and then bang - you've suddenly found out about Decision Trees, Regression Models and Artificial Neural Networks, not to mention cluster analysis, web mining and Big Data.” – Ms. Charmaine Oak, United Kingdom.
“As a complete novice to this area just starting out on a MBA course I found the book incredibly useful and very easy to follow and understand. The concepts are clearly explained and make it an easy task to gain an understanding of the subject matter.” – Mr. Craig Domoney, South Africa.
About the Author
Dr. Anil Maheshwari is a Professor of Management Information Systems at Maharishi University of Management, and the Director of their Center for Data Analytics. He teaches courses in data analytics, and helps researchers with extracting deep insights from their data. He worked in a variety of leadership roles at IBM in Austin TX, and has also worked at many other companies including startups. He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from Indian Institute of Technology in Delhi, an MBA from Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of Transcendental Meditation technique. He blogs interesting stuff at anilmah.wordpress.com
Table of Contents
Preface
Chapter 1: Wholeness of Data Analytics
  Business Intelligence
  Caselet: MoneyBall - Data Mining in Sports
  Pattern Recognition
  Data Processing Chain
    Data
    Database
    Data Warehouse
    Data Mining
    Data Visualization
  Organization of the book
  Review Questions
Section 1
Chapter 2: Business Intelligence Concepts and Applications
  Caselet: Khan Academy – BI in Education
  BI for better decisions
  Decision types
  BI Tools
  BI Skills
  BI Applications
    Customer Relationship Management
    Healthcare and Wellness
    Education
    Retail
    Banking
    Financial Services
    Insurance
    Manufacturing
    Telecom
    Public Sector
  Conclusion
  Review Questions
  Liberty Stores Case Exercise: Step 1
Chapter 3: Data Warehousing
  Caselet: University Health System – BI in Healthcare
  Design Considerations for DW
  DW Development Approaches
  DW Architecture
  Data Sources
  Data Loading Processes
  Data Warehouse Design
  DW Access
  DW Best Practices
  Conclusion
  Review Questions
  Liberty Stores Case Exercise: Step 2
Chapter 4: Data Mining
  Caselet: Target Corp – Data Mining in Retail
  Gathering and selecting data
  Data cleansing and preparation
  Outputs of Data Mining
  Evaluating Data Mining Results
  Data Mining Techniques
  Tools and Platforms for Data Mining
  Data Mining Best Practices
  Myths about data mining
  Data Mining Mistakes
  Conclusion
  Review Questions
  Liberty Stores Case Exercise: Step 3
Chapter 5: Data Visualization
  Caselet: Dr Hans Gosling - Visualizing Global Public Health
  Excellence in Visualization
  Types of Charts
  Visualization Example
  Visualization Example phase-2
  Tips for Data Visualization
  Conclusion
  Review Questions
  Liberty Stores Case Exercise: Step 4
Section 2
Chapter 6: Decision Trees
  Caselet: Predicting Heart Attacks using Decision Trees
  Decision Tree problem
  Decision Tree Construction
  Lessons from constructing trees
  Decision Tree Algorithms
  Conclusion
  Review Questions
  Liberty Stores Case Exercise: Step 5
Chapter 7: Regression
  Caselet: Data driven Prediction Markets
  Correlations and Relationships
  Visual look at relationships
  Regression Exercise
  Non-linear regression exercise
  Logistic Regression
  Advantages and Disadvantages of Regression Models
  Conclusion
  Review Exercises
  Liberty Stores Case Exercise: Step 6
Chapter 8: Artificial Neural Networks
  Caselet: IBM Watson - Analytics in Medicine
  Business Applications of ANN
  Design Principles of an Artificial Neural Network
  Representation of a Neural Network
  Architecting a Neural Network
  Developing an ANN
  Advantages and Disadvantages of using ANNs
  Conclusion
  Review Exercises
Chapter 9: Cluster Analysis
  Caselet: Cluster Analysis
  Applications of Cluster Analysis
  Definition of a Cluster
  Representing clusters
  Clustering techniques
  Clustering Exercise
  K-Means Algorithm for clustering
  Selecting the number of clusters
  Advantages and Disadvantages of K-Means algorithm
  Conclusion
  Review Exercises
  Liberty Stores Case Exercise: Step 7
Chapter 10: Association Rule Mining
  Caselet: Netflix: Data Mining in Entertainment
  Business Applications of Association Rules
  Representing Association Rules
  Algorithms for Association Rule
  Apriori Algorithm
  Association rules exercise
  Creating Association Rules
  Conclusion
  Review Exercises
  Liberty Stores Case Exercise: Step 8
Section 3
Chapter 11: Text Mining
  Caselet: WhatsApp and Private Security
  Text Mining Applications
  Text Mining Process
  Term Document Matrix
  Mining the TDM
  Comparing Text Mining and Data Mining
  Text Mining Best Practices
  Conclusion
  Review Questions
Chapter 12: Web Mining
  Web content mining
  Web structure mining
  Web usage mining
  Web Mining Algorithms
  Conclusion
  Review Questions
Chapter 13: Big Data
  Caselet: Personalized Promotions at Sears
  Defining Big Data
  Big Data Landscape
  Business Implications of Big Data
  Technology Implications of Big Data
  Big Data Technologies
  Management of Big Data
  Conclusion
  Review Questions
Chapter 14: Data Modeling Primer
  Evolution of data management systems
  Relational Data Model
  Implementing the Relational Data Model
  Database management systems (DBMS)
  Structured Query Language
  Conclusion
  Review Questions
Appendix 1: Data Mining Tutorial with Weka
Appendix 1: Data Mining Tutorial with R
Additional Resources