

Data Analytics Made Accessible

Copyright © 2015 by Anil K. Maheshwari, Ph.D.

By purchasing this book, you agree not to copy the book by any means, mechanical or electronic.

No part of this book may be copied or transmitted without written permission.


Preface

There are many good books in the market on Data Analytics. So, why should anyone write another book on this topic? I have been teaching courses in business intelligence and data mining for a few years. More recently, I have been teaching this course to combined classes of MBA and Computer Science students. Existing textbooks seem too long, too technical, and too complex for use by students. This book fills a need for an accessible book on this topic. My goal was to write a conversational book that feels easy and informative. This is an accessible book that covers everything important, with concrete examples, and invites the reader to join this field.

The book has developed from my own class notes. It reflects my decades of IT industry experience, as well as many years of academic teaching experience. The chapters are organized for a typical one-semester graduate course. The book contains caselets from real-world stories at the beginning of every chapter. There is a running case study across the chapters as exercises.

Many thanks are in order. My father Mr. Ratan Lal Maheshwari encouraged me to put my thoughts in writing, and make a book out of it. My wife Neerja helped me find the time and motivation to write this book. My brother Dr. Sunil Maheshwari was the source of many encouraging conversations about it. My colleague Dr. Edi Shivaji provided advice during my teaching the Data Analytics courses. Another colleague Dr. Scott Herriott served as a role model as an author of many textbooks. Yet another colleague, Dr. Greg Guthrie, provided many ideas and ways to disseminate the book. Our department assistant Ms. Karen Slowick at MUM proofread the first draft of this book. Ms. Adri-Mari Vilonel in South Africa helped create an opportunity to use this book for the first time at a corporate MBA program.

Thanks are also due to my many students at MUM and elsewhere who proved good partners in my learning more about this area. Finally, thanks to Maharishi Mahesh Yogi for providing a wonderful university, MUM, where students develop their intellect as well as their consciousness.

Dr. Anil K. Maheshwari
Fairfield, IA.

November 2015


Contents

Preface

Chapter 1: Wholeness of Data Analytics

Business Intelligence

Caselet: MoneyBall - Data Mining in Sports

Pattern Recognition

Data Processing Chain

Data

Database

Data Warehouse

Data Mining

Data Visualization

Organization of the book

Review Questions

Section 1

Chapter 2: Business Intelligence Concepts and Applications

Caselet: Khan Academy – BI in Education

BI for better decisions

Decision types

BI Tools

BI Skills

BI Applications

Customer Relationship Management

Healthcare and Wellness

Education

Retail

Banking

Financial Services

Insurance

Manufacturing

Telecom

Public Sector


Conclusion

Review Questions

Liberty Stores Case Exercise: Step 1

Chapter 3: Data Warehousing

Caselet: University Health System – BI in Healthcare

Design Considerations for DW

DW Development Approaches

DW Architecture

Data Sources

Data Loading Processes

Data Warehouse Design

DW Access

DW Best Practices

Conclusion

Review Questions

Chapter 4: Data Mining

Caselet: Target Corp – Data Mining in Retail

Gathering and selecting data

Data cleansing and preparation

Outputs of Data Mining

Evaluating Data Mining Results

Data Mining Techniques

Tools and Platforms for Data Mining

Data Mining Best Practices

Myths about data mining

Data Mining Mistakes

Conclusion

Review Questions

Chapter 5: Data Visualization

Caselet: Dr. Hans Rosling - Visualizing Global Public Health

Excellence in Visualization

Types of Charts

Visualization Example

Visualization Example phase-2

Tips for Data Visualization


Conclusion

Review Questions

Section 2

Chapter 6: Decision Trees

Caselet: Predicting Heart Attacks using Decision Trees

Decision Tree problem

Decision Tree Construction

Lessons from constructing trees

Decision Tree Algorithms

Conclusion

Review Questions

Chapter 7: Regression

Caselet: Data driven Prediction Markets

Correlations and Relationships

Visual look at relationships

Regression Exercise

Non-linear regression exercise

Logistic Regression

Advantages and Disadvantages of Regression Models

Conclusion

Review Exercises

Chapter 8: Artificial Neural Networks

Caselet: IBM Watson - Analytics in Medicine

Business Applications of ANN

Design Principles of an Artificial Neural Network

Representation of a Neural Network

Architecting a Neural Network

Developing an ANN

Advantages and Disadvantages of using ANNs

Conclusion

Review Exercises

Chapter 9: Cluster Analysis

Caselet: Cluster Analysis

Applications of Cluster Analysis


Definition of a Cluster

Representing clusters

Clustering techniques

Clustering Exercise

K-Means Algorithm for clustering

Selecting the number of clusters

Advantages and Disadvantages of K-Means algorithm

Conclusion

Review Exercises

Chapter 10: Association Rule Mining

Caselet: Netflix: Data Mining in Entertainment

Business Applications of Association Rules

Representing Association Rules

Algorithms for Association Rule

Apriori Algorithm

Association rules exercise

Creating Association Rules

Conclusion

Review Exercises

Section 3

Chapter 11: Text Mining

Caselet: WhatsApp and Private Security

Text Mining Applications

Text Mining Process

Term Document Matrix

Mining the TDM

Comparing Text Mining and Data Mining

Text Mining Best Practices

Conclusion

Review Questions

Chapter 12: Web Mining

Web content mining

Web structure mining

Web usage mining

Web Mining Algorithms


Conclusion

Review Questions

Chapter 13: Big Data

Caselet: Personalized Promotions at Sears

Defining Big Data

Big Data Landscape

Business Implications of Big Data

Technology Implications of Big Data

Big Data Technologies

Management of Big Data

Conclusion

Review Questions

Chapter 14: Data Modeling Primer

Evolution of data management systems

Relational Data Model

Implementing the Relational Data Model

Database management systems (DBMS)

Structured Query Language

Conclusion

Review Questions

Appendix 1: Data Mining Tutorial with Weka

Appendix 1: Data Mining Tutorial with R

Additional Resources


Chapter 1: Wholeness of Data Analytics

Business is the act of doing something productive to serve someone’s needs, and thus earn a living and make the world a better place. Business activities are recorded on paper or using electronic media, and then these records become data. There is more data from customers’ responses and on the industry as a whole. All this data can be analyzed and mined using special tools and techniques to generate patterns and intelligence, which reflect how the business is functioning. These ideas can then be fed back into the business so that it can evolve to become more effective and efficient in serving customer needs. And the cycle continues on (Figure 1.1).

Figure 1.1: Business Intelligence and Data Mining Cycle


Business Intelligence

Any business organization needs to continually monitor its business environment and its own performance, and then rapidly adjust its future plans. This includes monitoring the industry, the competitors, the suppliers, and the customers. The organization also needs to develop a balanced scorecard to track its own health and vitality. Executives typically determine what they want to track based on their key performance indicators (KPIs) or key result areas (KRAs). Customized reports need to be designed to deliver the required information to every executive. These reports can be converted into customized dashboards that deliver the information rapidly and in easy-to-grasp formats.


Caselet: MoneyBall - Data Mining in Sports

Analytics in sports was made popular by the book and movie, Moneyball. Statistician Bill James and Oakland A's general manager, Billy Beane, placed emphasis on crunching numbers and data instead of watching an athlete's style and looks. Their goal was to make a team better while using fewer resources. The key action plan was to pick important role players at a lower cost while avoiding the famous players who demand higher salaries but may provide a low return on a team's investment. Rather than relying on the scouts' experience and intuition, Beane selected players based almost exclusively on their on-base percentage (OBP). By finding players with a high OBP but with characteristics that lead scouts to dismiss them, Beane assembled a team of undervalued players with far more potential than the A's hamstrung finances would otherwise allow.

Using this strategy, they proved that even small market teams can be competitive, a case in point being the Oakland A's. In 2004, two years after adopting the same sabermetric model, the Boston Red Sox won their first World Series since 1918. (Source: Moneyball, 2004).


http://en.wikipedia.org/wiki/On-base_percentage

http://en.wikipedia.org/wiki/World_Series

Q1: Could similar techniques apply to the games of soccer, or cricket? If so, how?

Q2: What are the general lessons from this story?

Business intelligence is a broad set of information technology (IT) solutions that includes tools for gathering, analyzing, and reporting information to the users about performance of the organization and its environment. These IT solutions are among the most highly prioritized solutions for investment.

Consider a retail business chain that sells many kinds of goods and services around the world, online and in physical stores. It generates data about sales, purchases, and expenses from multiple locations and time frames. Analyzing this data could help identify fast-selling items, regional-selling items, seasonal items, fast-growing customer segments, and so on. It might also help generate ideas about what products sell together, which people tend to buy which products, and so on. These insights and intelligence can help design better promotion plans, product bundles, and store layouts, which in turn lead to a better-performing business.

The vice president of sales of a retail company would want to track the sales to date against monthly targets, the performance of each store and product category, and the top store managers that month. The vice president of finance would be interested in tracking daily revenue, expense, and cash flows by store; comparing them against plans; measuring cost of capital; and so on.


Pattern Recognition

A pattern is a design or model that helps grasp something. Patterns help connect things that may not appear to be connected. Patterns help cut through complexity and reveal simpler understandable trends. Patterns can be as definitive as hard scientific rules, like the rule that the sun always rises in the east. They can also be simple generalizations, such as the Pareto principle, which states that 80 percent of effects come from 20 percent of the causes.

A perfect pattern or model is one that (a) accurately describes a situation, (b) is broadly applicable, and (c) can be described in a simple manner. $E = mc^2$ would be such a general, accurate, and simple (GAS) model. Very often, all three qualities are not achievable in a single model, and one has to settle for two of three qualities in the model.

Patterns can be temporal, which is something that regularly occurs over time. Patterns can also be spatial, such as things being organized in a certain way. Patterns can be functional, in that doing certain things leads to certain effects. Good patterns are often symmetric. They echo basic structures and patterns that we are already aware of.

A temporal rule would be that “some people are always late,” no matter what the occasion or time. Some people may be aware of this pattern and some may not be. Understanding a pattern like this would help dissipate a lot of unnecessary frustration and anger. One can just joke that some people are born “10 minutes late,” and laugh it away. Similarly, Parkinson’s law states that work expands to fill up all the time available to do it.

A spatial pattern, following the 80–20 rule, could be that the top 20 percent of customers lead to 80 percent of the business. Or 20 percent of products generate 80 percent of the business. Or 80 percent of incoming customer service calls are related to just 20 percent of the products. This last pattern may simply reveal a discrepancy between a product’s features and what the customers believe about the product. The business can then decide to invest in educating the customers better so that the customer service calls can be significantly reduced.
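One way to check whether an 80–20 pattern actually holds in a given data set is to rank the customers by revenue and measure the share contributed by the top fifth. The sketch below is only an illustration: the function name `top_share` and the revenue figures are made up for this example and do not come from any real business.

```python
def top_share(revenues, top_fraction=0.2):
    """Share of total revenue contributed by the top `top_fraction` of customers."""
    ranked = sorted(revenues, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical annual revenue per customer (illustrative numbers only).
customer_revenue = [500, 300, 40, 35, 30, 25, 25, 20, 15, 10]

share = top_share(customer_revenue)
print(f"Top 20% of customers bring {share:.0%} of revenue")
```

With real data the share will rarely be exactly 80 percent; the point is to see how concentrated the business is.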

A functional pattern may involve test-taking skills. Some students perform well on essay-type questions. Others do well in multiple-choice questions. Yet other students excel in doing hands-on projects, or in oral presentations. An awareness of such a pattern in a class of students can help the teacher design a balanced testing mechanism that is fair to all.


Retaining students is an ongoing challenge for universities. Recent data-based research shows that students leave a school for social reasons more than they do for academic reasons. This pattern/insight can instigate schools to pay closer attention to students engaging in extracurricular activities and developing stronger bonds at school. The school can invest in entertainment activities, sports activities, camping trips, and other activities. The school can also begin to actively gather data about every student’s participation in those activities, to predict at-risk students and take corrective action.

However, long-established patterns can also be broken. The past cannot always predict the future. A pattern like “all swans are white” does not mean that there may not be a black swan. Once enough anomalies are discovered, the underlying pattern itself can shift. The economic meltdown in 2008 to 2009 was because of the collapse of the accepted pattern, that is, “housing prices always go up.” A deregulated financial environment made markets more volatile and led to greater swings in markets, leading to the eventual collapse of the entire financial system.

Diamond mining is the act of digging into large amounts of unrefined ore to discover precious gems or nuggets. Similarly, data mining is the act of digging into large amounts of raw data to discover unique nontrivial useful patterns. Data is cleaned up, and then special tools and techniques can be applied to search for patterns. Diving into clean and nicely organized data from the right perspectives can increase the chances of making the right discoveries.

A skilled diamond miner knows what a diamond looks like. Similarly, a skilled data miner should know what kinds of patterns to look for. The patterns are essentially about what hangs together and what is separate. Therefore, knowing the business domain well is very important. It takes knowledge and skill to discover the patterns. It is like finding a needle in a haystack. Sometimes the pattern may be hiding in plain sight. At other times, it may take a lot of work, and looking far and wide, to find surprising useful patterns. Thus, a systematic approach to mining data is necessary to efficiently reveal valuable insights.

For instance, the attitude of employees toward their employer may be hypothesized to be determined by a large number of factors, such as level of education, income, tenure in the company, and gender. It may be surprising if the data reveals that the attitudes are determined first and foremost by their age bracket. Such a simple insight could be powerful in designing organizations effectively. The data miner has to be open to any and all possibilities.

When used in clever ways, data mining can lead to interesting insights and be a source of new ideas and initiatives. One can predict the traffic pattern on highways from the movement of cell phone (in the car) locations on the highway. If the locations of cell phones on a highway or roadway are not moving fast enough, it may be a sign of traffic congestion. Telecom companies can thus provide real-time traffic information to the drivers on their cell phones, or on their GPS devices, without the need of any video cameras or traffic reporters.

Similarly, organizations can find out an employee’s arrival time at the office by when their cell phone shows up in the parking lot. Observing the record of the swipe of the parking permit card in the company parking garage can inform the organization whether an employee is in the office building or out of the office at any moment in time.

Some patterns may be so sparse that a very large amount of diverse data has to be seen together to notice any connections. For instance, locating the debris of a flight that may have vanished midcourse would require bringing together data from many sources, such as satellites, ships, and navigation systems. The raw data may come with various levels of quality, and may even be conflicting. The data at hand may or may not be adequate for finding good patterns. Additional dimensions of data may need to be added to help solve the problem.


Data Processing Chain

Data is the new natural resource. Implicit in this statement is the recognition of hidden value in data. Data lies at the heart of business intelligence. There is a sequence of steps to be followed to benefit from the data in a systematic way. Data can be modeled and stored in a database. Relevant data can be extracted from the operational data stores according to certain reporting and analyzing purposes, and stored in a data warehouse. The data from the warehouse can be combined with other sources of data, and mined using data mining techniques to generate new insights. The insights need to be visualized and communicated to the right audience in real time for competitive advantage. Figure 1.2 explains the progression of data processing activities. The rest of this chapter will cover these five elements in the data processing chain.

Figure 1.2: Data Processing Chain

Data

Anything that is recorded is data. Observations and facts are data. Anecdotes and opinions are also data, of a different kind. Data can be numbers, like the record of daily weather, or daily sales. Data can be alphanumeric, such as the names of employees and customers.

1. Data could come from any number of sources. It could come from operational records inside an organization, and it can come from records compiled by the industry bodies and government agencies. Data could come from individuals telling stories from memory and from people’s interaction in social contexts. Data could come from machines reporting their own status or from logs of web usage.

2. Data can come in many ways. It may come as paper reports. It may come as a file stored on a computer. It may be words spoken over the phone. It may be e-mail or chat on the Internet. It may come as movies and songs in DVDs, and so on.

3. There is also data about data. It is called metadata. For example, people regularly upload videos on YouTube. The format of the video file (whether it was a high-def file or lower resolution) is metadata. The information about the time of uploading is metadata. The account from which it was uploaded is also metadata. The record of downloads of the video is also metadata.

Data can be of different types.

1. Data could be an unordered collection of values. For example, a retailer sells shirts of red, blue, and green colors. There is no intrinsic ordering among these color values. One can hardly argue that any one color is higher or lower than the other. This is called nominal (means names) data.

2. Data could be ordered values like small, medium and large. For example, the sizes of shirts could be extra-small, small, medium, and large. There is clarity that medium is bigger than small, and large is bigger than medium. But the differences may not be equal. This is called ordinal (ordered) data.

3. Another type of data has discrete numeric values defined in a certain range, with the assumption of equal distance between the values. Customer satisfaction score may be rated on a 10-point scale with 1 being lowest and 10 being highest. This requires the respondent to carefully calibrate the entire range as objectively as possible and place his own measurement in that scale. This is called interval (equal intervals) data.

4. The highest level of numeric data is ratio data, which can take on any numeric value. The weights and heights of all employees would be exact numeric values. The price of a shirt will also take any numeric value. It is called ratio (any fraction) data.

5. There is another kind of data that does not lend itself to much mathematical analysis, at least not directly. Such data needs to be first structured and then analyzed. This includes data like audio, video, and graphics files, often called BLOBs (Binary Large Objects). These kinds of data lend themselves to different forms of analysis and mining. Songs can be described as happy or sad, fast-paced or slow, and so on. They may contain sentiment and intention, but these are not quantitatively precise.

The precision of analysis increases as data becomes more numeric. Ratio data could be subjected to rigorous mathematical analysis. For example, precise weather data about temperature, pressure, and humidity can be used to create rigorous mathematical models that can accurately predict future weather.
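The practical consequence of these data types is that each one supports a different kind of summary. The sketch below is a simple illustration with made-up values: the most frequent value for nominal data, a middle value for ordinal data once an order is supplied, and an arithmetic average for interval and ratio data. The variable names and sample values are assumptions for this example only.

```python
from statistics import mean, mode

shirt_colors = ["red", "blue", "red", "green"]                 # nominal
shirt_sizes = ["small", "large", "medium", "small", "large"]   # ordinal
satisfaction = [7, 9, 6, 8, 10]                                # interval (10-point scale)
weights_kg = [61.5, 70.2, 82.0, 55.3]                          # ratio

# Nominal: values can only be counted, so the most frequent value is the summary.
most_common_color = mode(shirt_colors)

# Ordinal: values can be ranked, so a middle (median) value makes sense
# once the ordering is stated explicitly.
size_order = ["extra-small", "small", "medium", "large"]
ranks = sorted(size_order.index(size) for size in shirt_sizes)
median_size = size_order[ranks[len(ranks) // 2]]

# Interval and ratio: arithmetic such as averaging is meaningful.
avg_satisfaction = mean(satisfaction)
avg_weight = mean(weights_kg)

print(most_common_color, median_size, avg_satisfaction, avg_weight)
```

Averaging the shirt colors, by contrast, would have no meaning, which is why the data type should be identified before choosing an analysis.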

Data may be publicly available and sharable, or it may be marked private. Traditionally, the law allows the right to privacy concerning one’s personal data. There is a big debate on whether the personal data shared on social media conversations is private or can be used for commercial purposes.

Datafication is a new term that means that almost every phenomenon is now being observed and stored. More devices are connected to the Internet. More people are constantly connected to “the grid,” by their phone network or the Internet, and so on. Every click on the web, and every movement of the mobile devices, is being recorded. Machines are generating data. The “Internet of things” is growing faster than the Internet of people. All of this is generating an exponentially growing volume of data, at high velocity. Kryder’s law predicts that the density and capability of hard drive storage media will double every 18 months. As storage costs keep coming down at a rapid rate, there is a greater incentive to record and store more events and activities at a higher resolution. Data is getting stored in more detailed resolution, and many more variables are being captured and stored.

Database

A database is a modeled collection of data that is accessible in many ways. A data model can be designed to integrate the operational data of the organization. The data model abstracts the key entities involved in an action and their relationships. Most databases today follow the relational data model and its variants. Each data modeling technique imposes rigorous rules and constraints to ensure the integrity and consistency of data over time.

Take the example of a sales organization. A data model for managing customer orders will involve data about customers, orders, products, and their interrelationships. The relationship between the customers and orders would be such that one customer can place many orders, but one order will be placed by one and only one customer. It is called a one-to-many relationship. The relationship between orders and products is a little more complex. One order may contain many products. And one product may be contained in many different orders. This is called a many-to-many relationship. Different types of relationships can be modeled in a database.
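These two kinds of relationships can be sketched in a small relational schema. The example below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and sample rows are assumptions chosen for illustration, not a schema from this book. The one-to-many relationship is a customer reference stored on each order, and the many-to-many relationship is a separate linking table between orders and products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One-to-many: every order row points to exactly one customer.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);

-- Many-to-many: a linking table pairs orders with products.
CREATE TABLE order_item (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    PRIMARY KEY (order_id, product_id)
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Matrix"), (2, "Monty Python")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(101, 1), (102, 1)])
conn.executemany("INSERT INTO order_item VALUES (?, ?)",
                 [(101, 1), (101, 2), (102, 2)])

# One customer has many orders.
orders_per_customer = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = 1").fetchone()[0]
# One product appears in many orders.
orders_with_monty = conn.execute(
    "SELECT COUNT(*) FROM order_item WHERE product_id = 2").fetchone()[0]
print(orders_per_customer, orders_with_monty)
```

The linking table is the usual relational device for a many-to-many relationship, since a single column on either side could hold only one reference.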

Databases have grown tremendously over time. They have grown in complexity in terms of number of the objects and their properties being recorded. They have also grown in the quantity of data being stored. A decade ago, a terabyte-sized database was considered big. Today databases are in petabytes and exabytes. Video and other media files have greatly contributed to the growth of databases. E-commerce and other web-based activities also generate huge amounts of data. Data generated through social media has also generated large databases. The e-mail archives, including attached documents of organizations, are in similar large sizes.

Many database management software systems (DBMSs) are available to help store and manage this data. These include commercial systems, such as Oracle and DB2. There are also open-source, free DBMSs, such as MySQL and Postgres. These DBMSs help process and store millions of transactions worth of data every second.

Here is a simple database of the sales of movies worldwide for a retail organization. It shows sales transactions of movies over three quarters. Using such a file, data can be added, accessed, and updated as needed.

Movies Transactions Database

| Order # | Date sold | Product name | Location | Amount |
|---|---|---|---|---|
| 1 | April 2015 | Monty Python | US | $9 |
| 2 | May 2015 | Gone With the Wind | US | $15 |
| 3 | June 2015 | Monty Python | India | $9 |
| 4 | June 2015 | Monty Python | UK | $12 |
| 5 | July 2015 | Matrix | US | $12 |
| 6 | July 2015 | Monty Python | US | $12 |
| 7 | July 2015 | Gone With the Wind | US | $15 |
| 8 | Aug 2015 | Matrix | US | $12 |
| 9 | Sept 2015 | Matrix | India | $12 |
| 10 | Sept 2015 | Monty Python | US | $9 |
| 11 | Sept 2015 | Gone With the Wind | US | $15 |
| 12 | Sept 2015 | Monty Python | India | $9 |
| 13 | Nov 2015 | Gone With the Wind | US | $15 |
| 14 | Dec 2015 | Monty Python | US | $9 |
| 15 | Dec 2015 | Monty Python | US | $9 |

Data Warehouse

A data warehouse is an organized store of data from all over the organization, specially designed to help make management decisions. Data can be extracted from the operational database to answer a particular set of queries. This data, combined with other data, can be rolled up to a consistent granularity and uploaded to a separate data store called the data warehouse. Therefore, the data warehouse is a simpler version of the operational database, with the purpose of addressing reporting and decision-making needs only. The data in the warehouse cumulatively grows as more operational data becomes available and is extracted and appended to the data warehouse. Unlike in the operational database, the data values in the warehouse are not updated.

To create a simple data warehouse for the movies sales data, assume a simple objective of tracking sales of movies and making decisions about managing inventory. In creating this data warehouse, all the sales transaction data will be extracted from the operational data files. The data will be rolled up for all combinations of time period and product number. Thus, there will be one row for every combination of time period and product. The resulting data warehouse will look like the table that follows.

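Movies Sales Data Warehouse

| Row # | Qtr sold | Product name | Amount |
|---|---|---|---|
| 1 | Q2 | Gone With the Wind | $15 |
| 2 | Q2 | Monty Python | $30 |
| 3 | Q3 | Gone With the Wind | $30 |
| 4 | Q3 | Matrix | $36 |
| 5 | Q3 | Monty Python | $30 |
| 6 | Q4 | Gone With the Wind | $15 |
| 7 | Q4 | Monty Python | $18 |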
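The roll-up from transactions to warehouse rows can be sketched in a few lines. The code below copies the fifteen transactions from the Movies Transactions Database, maps each month to its quarter, and sums the amounts for every combination of quarter and product. The variable names and the tuple layout are assumptions for this illustration. A real warehouse load would be done with database tools rather than a script like this.

```python
from collections import defaultdict

# (order, month, product, location, amount) from the transactions table.
transactions = [
    (1, "Apr", "Monty Python", "US", 9),
    (2, "May", "Gone With the Wind", "US", 15),
    (3, "Jun", "Monty Python", "India", 9),
    (4, "Jun", "Monty Python", "UK", 12),
    (5, "Jul", "Matrix", "US", 12),
    (6, "Jul", "Monty Python", "US", 12),
    (7, "Jul", "Gone With the Wind", "US", 15),
    (8, "Aug", "Matrix", "US", 12),
    (9, "Sep", "Matrix", "India", 12),
    (10, "Sep", "Monty Python", "US", 9),
    (11, "Sep", "Gone With the Wind", "US", 15),
    (12, "Sep", "Monty Python", "India", 9),
    (13, "Nov", "Gone With the Wind", "US", 15),
    (14, "Dec", "Monty Python", "US", 9),
    (15, "Dec", "Monty Python", "US", 9),
]

QUARTER = {"Apr": "Q2", "May": "Q2", "Jun": "Q2",
           "Jul": "Q3", "Aug": "Q3", "Sep": "Q3",
           "Oct": "Q4", "Nov": "Q4", "Dec": "Q4"}

# Roll up: one total per (quarter, product) combination.
warehouse = defaultdict(int)
for _, month, product, _, amount in transactions:
    warehouse[(QUARTER[month], product)] += amount

for (quarter, product), amount in sorted(warehouse.items()):
    print(quarter, product, amount)
```

The seven printed rows match the seven rows of the warehouse table, and location detail is dropped because it is not part of the chosen granularity.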

The data in the data warehouse is at much less detail than the transaction database. The data warehouse could have been designed at a lower or higher level of detail, or granularity. If the data warehouse were designed on a monthly level, instead of a quarterly level, there would be many more rows of data. When the number of transactions approaches millions and higher, with dozens of attributes in each transaction, the data warehouse can be large and rich with potential insights. One can then mine the data (slice and dice) in many different ways and discover unique meaningful patterns. Aggregating the data helps improve the speed of analysis. A separate data warehouse allows analysis to go on separately in parallel, without burdening the operational database systems (Table 1.1).

| Function | Database | Data Warehouse |
|---|---|---|
| Purpose | Data stored in databases can be used for many purposes including day-to-day operations | Data stored in DW is cleansed data useful for reporting and analysis |
| Granularity | Highly granular data including all activity and transaction details | Lower granularity data; rolled up to certain key dimensions of interest |
| Complexity | Highly complex with dozens or hundreds of data files, linked through common data fields | Typically organized around large fact tables, and many lookup tables |
| Size | Database grows with growing volumes of activity and transactions. Old completed transactions are deleted to reduce size. | Grows as data from operational databases is rolled up and appended every day. Data is retained for long-term trend analyses |
| Architectural choices | Relational, and object-oriented, databases | Star schema, or Snowflake schema |
| Data access mechanisms | Primarily through high-level languages such as SQL. Traditional programs access the DB through Open DataBase Connectivity (ODBC) interfaces | Accessed through SQL; SQL output is forwarded to reporting tools and data visualization tools |

Table 1.1: Comparing Database systems with Data Warehousing systems

Data Mining

Data Mining is the art and science of discovering useful innovative patterns from data. There is a wide variety of patterns that can be found in the data. There are many techniques, simple or complex, that help with finding patterns.

In this example, a simple data analysis technique can be applied to the data in the data warehouse above. A simple cross-tabulation of results by quarter and products will reveal some easily visible patterns.

Movies Sales by Quarters – Cross-tabulation

| Qtr/Product | Gone With the Wind | Matrix | Monty Python | Total Sales Amount |
|---|---|---|---|---|
| Q2 | $15 | 0 | $30 | $45 |
| Q3 | $30 | $36 | $30 | $96 |
| Q4 | $15 | 0 | $18 | $33 |
| Total Sales Amount | $60 | $36 | $78 | $174 |
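The cross-tabulation above can be reproduced from the seven warehouse rows with a short script. This is a minimal sketch in plain Python, with variable names chosen for this example. Spreadsheet pivot tables or statistical packages do the same job with less effort.

```python
# (quarter, product, amount) rows from the Movies Sales Data Warehouse.
warehouse = [
    ("Q2", "Gone With the Wind", 15),
    ("Q2", "Monty Python", 30),
    ("Q3", "Gone With the Wind", 30),
    ("Q3", "Matrix", 36),
    ("Q3", "Monty Python", 30),
    ("Q4", "Gone With the Wind", 15),
    ("Q4", "Monty Python", 18),
]

quarters = sorted({quarter for quarter, _, _ in warehouse})
products = sorted({product for _, product, _ in warehouse})

# Start every cell at 0 so that missing combinations show up explicitly.
table = {q: {p: 0 for p in products} for q in quarters}
for quarter, product, amount in warehouse:
    table[quarter][product] += amount

row_totals = {q: sum(table[q].values()) for q in quarters}
col_totals = {p: sum(table[q][p] for q in quarters) for p in products}

best_product = max(col_totals, key=col_totals.get)
best_quarter = max(row_totals, key=row_totals.get)
print(best_product, best_quarter, sum(row_totals.values()))
```

The zero cells for Matrix in Q2 and Q4 are what make its seasonality visible at a glance.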


Based on the cross-tabulation above, one can readily answer some product sales questions, like:

1. What is the best selling movie by revenue? – Monty Python.
2. What is the best quarter by revenue this year? – Q3
3. Any other patterns? – Matrix movie sells only in Q3 (seasonal item).

These simple insights can help plan marketing promotions and manage inventory of various movies.

If a cross tabulation was designed to include customer location data, one could answer other questions, such as

1. What is the best selling geography? – US
2. What is the worst selling geography? – UK
3. Any other patterns? – Monty Python sells globally, while Gone With the Wind sells only in the US.

If the data mining was done at the monthly level of data, it would be easy to miss the seasonality of the movies. However, one would have observed that September is the highest selling month.
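The location and monthly views come from the same fifteen transactions, grouped on a different column. The sketch below is an illustration with assumed variable names. It totals sales by location and by month, and records the locations in which each movie was sold.

```python
from collections import Counter

# (order, month, product, location, amount) from the transactions table.
transactions = [
    (1, "Apr", "Monty Python", "US", 9),
    (2, "May", "Gone With the Wind", "US", 15),
    (3, "Jun", "Monty Python", "India", 9),
    (4, "Jun", "Monty Python", "UK", 12),
    (5, "Jul", "Matrix", "US", 12),
    (6, "Jul", "Monty Python", "US", 12),
    (7, "Jul", "Gone With the Wind", "US", 15),
    (8, "Aug", "Matrix", "US", 12),
    (9, "Sep", "Matrix", "India", 12),
    (10, "Sep", "Monty Python", "US", 9),
    (11, "Sep", "Gone With the Wind", "US", 15),
    (12, "Sep", "Monty Python", "India", 9),
    (13, "Nov", "Gone With the Wind", "US", 15),
    (14, "Dec", "Monty Python", "US", 9),
    (15, "Dec", "Monty Python", "US", 9),
]

by_location = Counter()
by_month = Counter()
locations_by_product = {}
for _, month, product, location, amount in transactions:
    by_location[location] += amount
    by_month[month] += amount
    locations_by_product.setdefault(product, set()).add(location)

best_location = by_location.most_common(1)[0][0]
worst_location = min(by_location, key=by_location.get)
best_month = by_month.most_common(1)[0][0]
print(best_location, worst_location, best_month)
```

Changing the grouping column is all it takes to ask a different question of the same data, which is the "slice and dice" idea in practice.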

The previous example shows that many differences and patterns can be noticed by analyzing data in different ways. However, some insights are more important than others. The value of the insight depends upon the problem being solved. The insight that there are more sales of a product in a certain quarter helps a manager plan what products to focus on. In this case, the store manager should stock up on Matrix in Quarter 3 (Q3). Similarly, knowing which quarter has the highest overall sales allows for different resource decisions in that quarter. In this case, if Q3 is bringing more than half of total sales, this requires greater attention on the e-commerce website in the third quarter.

Data mining should be done to solve high-priority, high-value problems. Much effort is required to gather data, clean and organize it, mine it with many techniques, interpret the results, and find the right insight. It is important that there be a large expected payoff from finding the insight. One should select the right data (and ignore the rest), organize it into a nice and imaginative framework that brings relevant data together, and then apply data mining techniques to deduce the right insight.


A retail company may use data mining techniques to determine which new product categories to add to which of their stores; how to increase sales of existing products; which new locations to open stores in; how to segment the customers for more effective communication; and so on.

Data can be analyzed at multiple levels of granularity and could lead to a large number of interesting combinations of data and interesting patterns. Some of the patterns may be more meaningful than the others. Such highly granular data is often used, especially in finance and high-tech areas, so that one can gain even the slightest edge over the competition.

Here are brief descriptions of some of the most important data mining techniques used to generate insights from data.

Decision Trees: They help classify populations into classes. It is said that 70% of all data mining work is about classification solutions; and that 70% of all classification work uses decision trees. Thus, decision trees are the most popular and important data mining technique. There are many popular algorithms to make decision trees. They differ in terms of their mechanisms, and each technique works well for different situations. It is possible to try multiple decision-tree algorithms on a data set and compare the predictive accuracy of each tree.
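The core step in building a decision tree is choosing which attribute to split on first. The sketch below illustrates one common criterion, Gini impurity, on a tiny made-up data set of shoppers. The criterion, the attribute names, and the rows are all assumptions for this example, and different tree algorithms use different measures. The attribute whose split leaves the purest groups becomes the root of the tree.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, attribute):
    """Weighted Gini impurity after splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row["label"])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


# Hypothetical shoppers: did they buy or skip the product?
rows = [
    {"member": "yes", "weekend": "no", "label": "buy"},
    {"member": "yes", "weekend": "yes", "label": "buy"},
    {"member": "yes", "weekend": "no", "label": "buy"},
    {"member": "no", "weekend": "yes", "label": "skip"},
    {"member": "no", "weekend": "no", "label": "skip"},
    {"member": "no", "weekend": "yes", "label": "buy"},
]

# The best first split is the attribute with the lowest weighted impurity.
best_attribute = min(["member", "weekend"], key=lambda a: split_gini(rows, a))
print(best_attribute)
```

A full algorithm would repeat this choice within each resulting group until the groups are pure enough or too small to split further.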

Regression: This is a well-understood technique from the field of statistics. The goal is to find a best fitting curve through the many data points. The best fitting curve is that which minimizes the (error) distance between the actual data points and the values predicted by the curve. Regression models can be projected into the future for prediction and forecasting purposes.
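For the simplest case, a straight line through the data, the best fit can be computed directly. The sketch below assumes the usual least-squares definition of error, the sum of squared vertical distances, and uses made-up data points. The function name `fit_line` and the numbers are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: period number and sales in that period.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Project the fitted line one period into the future.
forecast = slope * 6 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

Projecting far beyond the observed range is risky, because the line only summarizes the pattern in the data it was fitted to.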

Artificial Neural Networks: Originating in the field of artificial intelligence and machine learning, ANNs are multi-layer non-linear information processing models that learn from past data and predict future values. These models predict well, leading to their popularity. The model’s parameters may not be very intuitive. Thus, neural networks are opaque like a black-box. These systems also require a large amount of past data to adequately train the system.

Cluster analysis: This is an important data mining technique for dividing and conquering large data sets. The data set is divided into a certain number of clusters, by discerning similarities and dissimilarities within the data. There is no one right answer for the number of clusters in the data. The user needs to make a decision by looking at how well the number of clusters chosen fit the data. This is most commonly used for market segmentation. Unlike decision trees and regression, there is no one right answer for cluster analysis.
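One widely used clustering technique is K-Means, which the cluster analysis chapter covers. The sketch below is a bare-bones version for two-dimensional points. The starting centroids are fixed by hand so the result is repeatable, and the points are made up for this example. Each pass assigns every point to its nearest centroid and then moves each centroid to the average of its points.

```python
def kmeans(points, centroids, iterations=10):
    """Very small K-Means for 2-D points; returns (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers described by two numeric attributes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Here the number of clusters, two, was chosen in advance. Trying other values and comparing how well they fit is part of the analyst's judgment.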

Association Rule Mining: Also called Market Basket Analysis when used in retail industry, these techniques look for associations between data values. An analysis of items frequently found together in a market basket can help cross-sell products, and also create product bundles.
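Two standard measures used to judge such associations are support (how often a set of items appears together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both for a handful of made-up baskets. The items and baskets are illustrative assumptions.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)


bread_and_milk = support({"bread", "milk"})
bread_then_milk = confidence({"bread"}, {"milk"})
print(f"support = {bread_and_milk:.2f}, confidence = {bread_then_milk:.2f}")
```

A rule such as "bread leads to milk" is worth acting on only when both numbers are high enough for the business purpose at hand.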

DataVisualizationAsdataandinsightsgrowinnumber,anewrequirementistheabilityoftheexecutivesanddecisionmakerstoabsorbthisinformationinrealtime.Thereisalimittohumancomprehensionandvisualizationcapacity.Thatisagoodreason to prioritize and manage with fewer but key variables that relatedirectlytotheKeyResultAreas(KRAs)ofarole.

Here are a few considerations when presenting data:

1. Present the conclusions and not just report the data.
2. Choose wisely from a palette of graphs to suit the data.
3. Organize the results to make the central point stand out.
4. Ensure that the visuals accurately reflect the numbers. Inappropriate visuals can create misinterpretations and misunderstandings.
5. Make the presentation unique, imaginative and memorable.

Executive dashboards are designed to provide information on a select few variables for every executive. They use graphs, dials, and lists to show the status of important parameters. These dashboards also have a drill-down capability to enable a root-cause analysis of exception situations (Figure 1.3).


Figure 1.3: Sample Executive Dashboard

Data visualization has been an interesting problem across the disciplines. Many dimensions of data can be effectively displayed on a two-dimensional surface to give a rich and more insightful description of the totality of the story.

The classic presentation of the story of Napoleon's march to Russia in 1812, by French cartographer Joseph Minard, is shown in Figure 1.4. It covers about six dimensions. Time is on the horizontal axis. The geographical coordinates and rivers are mapped in. The thickness of the bar shows the number of troops at any point of time that is mapped. One color is used for the onward march and another for the retreat. The temperature at each time is shown in the line graph at the bottom.

Figure 1.4: Sample Data Visualization


Organization of the book

This chapter is designed to provide the wholeness of business intelligence and data mining, to provide the reader with an intuition for this area of knowledge. The rest of the book can be considered in three sections.

Section 1 will cover high-level topics. Chapter 2 will cover the field of business intelligence and its applications across industries and functions. Chapter 3 will briefly explain what data warehousing is and how it helps with data mining. Chapter 4 will then describe data mining in some detail with an overview of its major tools and techniques.

Section 2 is focused on data mining techniques. Every technique will be shown through solving an example in detail. Chapter 5 will show the power and ease of decision trees, which are the most popular data mining technique. Chapter 6 will describe statistical regression modeling techniques. Chapter 7 will provide an overview of artificial neural networks, a versatile machine learning technique. Chapter 8 will describe how Cluster Analysis can help with market segmentation. Finally, chapter 9 will describe the Association Rule Mining technique, also called Market Basket Analysis, that helps find shopping patterns.

Section 3 will cover more advanced new topics. Chapter 10 will introduce the concepts and techniques of Text Mining, which helps discover insights from text data including social media data. Chapter 11 will provide an overview of the growing field of web mining, which includes mining the structure, content and usage of web sites. Chapter 12 will provide an overview of the recent field of Big Data. Chapter 13 has been added as a primer on Data Modeling, for those who do not have any background in databases, and should be used if necessary.


Review Questions

1: Describe the Business Intelligence and Data Mining cycle.

2: Describe the data processing chain.

3: What are the similarities between diamond mining and data mining?

4: What are the different data mining techniques? Which of these would be relevant in your current work?

5: What is a dashboard? How does it help?

6: Create a visual to show the weather pattern in your city. Could you show together temperature, humidity, wind, and rain/snow over a period of time?


Section 1

This section covers three important high-level topics.

Chapter 2 will cover business intelligence concepts and their applications in many industries.

Chapter 3 will describe data warehousing systems, and ways of creating and managing them.

Chapter 4 will describe data mining as a whole, its many techniques, and many do's and don'ts of effective data mining.

Chapter 5 will describe data visualization as a whole, with techniques and examples, and with many rules of thumb for effective data visualizations.


Chapter 2: Business Intelligence Concepts and Applications

Business intelligence (BI) is an umbrella term that includes a variety of IT applications that are used to analyze an organization's data and communicate the information to relevant users (Figure 2.1).

Figure 2.1: BIDM cycle

The nature of life and businesses is to grow. Information is the life-blood of business. Businesses use many techniques for understanding their environment and predicting the future for their own benefit and growth. Decisions are made from facts and feelings. Data-based decisions are more effective than those based on feelings alone. Actions based on accurate data, information, knowledge, experimentation, and testing, using fresh insights, are more likely to succeed and lead to sustained growth. One's own data can be the most effective teacher. Therefore, organizations should gather data, sift through it, analyze and mine it, find insights, and then embed those insights into their operating procedures.

There is a new sense of importance and urgency around data as it is being viewed as a new natural resource. It can be mined for value, insights, and competitive advantage. In a hyperconnected world, where everything is potentially connected to everything else, with potentially infinite correlations, data represents the impulses of nature in the form of certain events and attributes. A skilled business person is motivated to use this cache of data to harness nature, and to find new niches of unserved opportunities that could become profitable ventures.


Caselet: Khan Academy – BI in Education

Khan Academy is an innovative non-profit educational organization that is turning the K-12 education system upside down. It provides short YouTube-based video lessons on thousands of topics for free. It shot into prominence when Bill Gates promoted it as a resource that he used to teach his own children. With this kind of a resource, classrooms are being flipped, i.e., students do their basic lecture-type learning at home using those videos, while the class time is used for more one-on-one problem solving and coaching. Students can access the lessons at any time to learn at their own pace. The students' progress is recorded, including what videos they watched how many times, which problems they stumbled on, and what scores they got on online tests.

Khan Academy has developed tools to help teachers get a pulse on what's happening in the classroom. Teachers are provided a set of real-time dashboards to give them information from the macro level ("How is my class doing on geometry?") to the micro level ("How is Jane doing on mastering polygons?"). Armed with this information, teachers can place selective focus on the students that need certain help.


(Source: KhanAcademy.org)

Q1: How does a dashboard improve the teaching experience? And the student's learning experience?

Q2: Design a dashboard for tracking your own career.


BI for better decisions

The future is inherently uncertain. Risk is the result of a probabilistic world where there are no certainties and complexities abound. People use crystal balls, astrology, palmistry, groundhogs, and also mathematics and numbers to mitigate risk in decision-making. The goal is to make effective decisions, while reducing risk. Businesses calculate risks and make decisions based on a broad set of facts and insights. Reliable knowledge about the future can help managers make the right decisions with lower levels of risk.

The speed of action has risen exponentially with the growth of the Internet. In a hypercompetitive world, the speed of a decision and the consequent action can be a key advantage. The Internet and mobile technologies allow decisions to be made anytime, anywhere. Ignoring fast-moving changes can threaten the organization's future. Research has shown that an unfavorable comment about the company and its products on social media should not go unaddressed for long. Banks have had to pay huge penalties to the Consumer Financial Protection Bureau (CFPB) in the United States in 2013 for complaints made on CFPB's websites. On the other hand, a positive sentiment expressed on social media should also be utilized as a potential sales and promotion opportunity, while the opportunity lasts.


Decision types

There are two main kinds of decisions: strategic decisions and operational decisions. BI can help make both better. Strategic decisions are those that impact the direction of the company. The decision to reach out to a new customer set would be a strategic decision. Operational decisions are more routine and tactical decisions, focused on developing greater efficiency. Updating an old website with new features will be an operational decision.

In strategic decision-making, the goal itself may or may not be clear, and the same is true for the path to reach the goal. The consequences of the decision would be apparent some time later. Thus, one is constantly scanning for new possibilities and new paths to achieve the goals. BI can help with what-if analysis of many possible scenarios. BI can also help create new ideas based on new patterns found from data mining.

Operational decisions can be made more efficient using an analysis of past data. A classification system can be created and modeled using the data of past instances to develop a good model of the domain. This model can help improve operational decisions in the future. BI can help automate operations-level decision-making and improve efficiency by making millions of micro-level operational decisions in a model-driven way. For example, a bank might want to make decisions about making financial loans in a more scientific way using data-based models. A decision-tree-based model could provide consistently accurate loan decisions. Developing such decision tree models is one of the main applications of data mining techniques.
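The loan example can be sketched in a few lines of Python. This is a deliberately simplified illustration: it learns a one-level tree (often called a decision stump) on a single hypothetical attribute, the credit score, whereas a real decision tree would split repeatedly on many attributes. All the records are invented.

```python
# Hypothetical past loans: (credit_score, repaid)
past_loans = [
    (580, False), (600, False), (640, False), (660, True),
    (700, True), (720, True), (750, True), (610, True),
]


def learn_stump(records):
    """Pick the score threshold that misclassifies the fewest past loans."""
    best_threshold, best_errors = None, len(records) + 1
    for threshold in sorted({score for score, _ in records}):
        # Candidate rule: approve when score >= threshold.
        errors = sum((score >= threshold) != repaid for score, repaid in records)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def approve(threshold, credit_score):
    """Apply the learned rule to a new loan application."""
    return credit_score >= threshold


threshold, training_errors = learn_stump(past_loans)
decision_good = approve(threshold, 700)
decision_risky = approve(threshold, 590)
```

Once learned, the rule can be applied to every new application in the same way, which is what makes the decisions consistent. As new loan outcomes arrive, the rule can be re-learned from the larger data set.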

Effective BI has an evolutionary component, as business models evolve. When people and organizations act, new facts (data) are generated. Current business models can be tested against the new data, and it is possible that those models will not hold up well. In that case, decision models should be revised and new insights should be incorporated. An unending process of generating fresh new insights in real time can help make better decisions, and thus can be a significant competitive advantage.


BI Tools

BI includes a variety of software tools and techniques to provide the managers with the information and insights needed to run the business. Information can be provided about the current state of affairs with the capability to drill down into details, and also insights about emerging patterns which lead to projections into the future. BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining.

BI tools can range from very simple tools that could be considered end-user tools, to very sophisticated tools that offer a very broad and complex set of functionality. Thus, even executives can be their own BI experts, or they can rely on BI specialists to set up the BI mechanisms for them. Large organizations invest in expensive, sophisticated BI solutions that provide good information in real time.

A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. This system offers limited automation using macros and other features. The analytical features include basic statistical and financial functions. Pivot tables help do sophisticated what-if analysis. Add-on modules can be installed to enable moderately sophisticated statistical analysis.

A dashboarding system, such as IBM Cognos or Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. At the user end, modular dashboards can be designed and redesigned easily with a graphical user interface. The back-end data analytical capabilities include many statistical functions. The dashboards are linked to data warehouses at the back end to ensure that the tables and graphs and other elements of the dashboard are updated in real time (Figure 2.2).


Figure 2.2: Sample Executive Dashboard

Data mining systems, such as IBM SPSS Modeler, are industrial-strength systems that provide capabilities to apply a wide range of analytical models on large data sets. Open source systems, such as Weka, are popular platforms designed to help mine large amounts of data to discover patterns.


BI Skills

As data grows and exceeds our capacity to make sense of it, the tools need to evolve, and so should the imagination of the BI specialist. "Data Scientist" has been called the hottest job of this decade.

A skilled and experienced BI specialist should be open enough to go outside the box, open the aperture and see a wider perspective that includes more dimensions and variables, in order to find important patterns and insights. The problem needs to be looked at from a wider perspective to consider many more angles that may not be immediately obvious. An imaginative solution should be proposed for the problem so that interesting and useful results can emerge.

A good data mining project begins with an interesting problem to solve. Selecting the right data mining problem is an important skill. The problem should be valuable enough that solving it would be worth the time and expense. It takes a lot of time and energy to gather, organize, cleanse, and prepare the data for mining and other analysis. The data miner needs to persist with the exploration of patterns in the data. The skill level has to be deep enough to engage with the data and make it yield new useful insights.


BI Applications

BI tools are required in almost all industries and functions. The nature of the information and the speed of action may be different across businesses, but every manager today needs access to BI tools to have up-to-date metrics about business performance. Businesses need to embed new insights into their operating processes to ensure that their activities continue to evolve with more efficient practices. The following are some areas of applications of BI and data mining.

Customer Relationship Management

A business exists to serve a customer. A happy customer becomes a repeat customer. A business should understand the needs and sentiments of the customer, sell more of its offerings to the existing customers, and also expand the pool of customers it serves. BI applications can impact many aspects of marketing.

1. Maximize the return on marketing campaigns: Understanding the customer's pain points from data-based analysis can ensure that the marketing messages are fine-tuned to better resonate with customers.

2. Improve customer retention (churn analysis): It is more difficult and expensive to win new customers than it is to retain existing customers. Scoring each customer on their likelihood to quit can help the business design effective interventions, such as discounts or free services, to retain profitable customers in a cost-effective manner.

3. Maximize customer value (cross-, up-selling): Every contact with the customer should be seen as an opportunity to gauge their current needs. Offering a customer new products and solutions based on those imputed needs can help increase revenue per customer. Even a customer complaint can be seen as an opportunity to wow the customer. Using the knowledge of the customer's history and value, the business can choose to sell a premium service to the customer.

4. Identify and delight highly-valued customers: By segmenting the customers, the best customers can be identified. They can be proactively contacted, and delighted, with greater attention and better service. Loyalty programs can be managed more effectively.

5. Manage brand image: A business can create a listening post to listen to social media chatter about itself. It can then do sentiment analysis of the text to understand the nature of comments, and respond appropriately to the prospects and customers.

Healthcare and Wellness

Health care is one of the biggest sectors in advanced economies. Evidence-based medicine is the newest trend in data-based health care management. BI applications can help apply the most effective diagnoses and prescriptions for various ailments. They can also help manage public health issues, and reduce waste and fraud.

1. Diagnose disease in patients: Diagnosing the cause of a medical condition is the critical first step in a medical engagement. Accurately diagnosing cases of cancer or diabetes can be a matter of life and death for the patient. In addition to the patient's own current situation, many other factors can be considered, including the patient's health history, medication history, family's history, and other environmental factors. This makes diagnosis as much of an art form as it is science. Systems, such as IBM Watson, absorb all the medical research to date and make probabilistic diagnoses in the form of a decision tree, along with a full explanation for their recommendations. These systems take away most of the guesswork done by doctors in diagnosing ailments.

2. Treatment effectiveness: The prescription of medication and treatment is also a difficult choice out of so many possibilities. For example, there are more than 100 medications for hypertension (high blood pressure) alone. There are also interactions in terms of which drugs work well with others and which drugs do not. Decision trees can help doctors learn about and prescribe more effective treatments. Thus, the patients could recover their health faster with a lower risk of complications and cost.

3. Wellness management: This includes keeping track of patient health records, analyzing customer health trends and proactively advising them to take any needed precautions.

4. Manage fraud and abuse: Some medical practitioners have unfortunately been found to conduct unnecessary tests, and/or overbill the government and health insurance companies. Exception reporting systems can identify such providers and action can be taken against them.

5. Public health management: The management of public health is one of the important responsibilities of any government. By using effective forecasting tools and techniques, governments can better predict the onset of disease in certain areas in real time. They can thus be better prepared to fight the diseases. Google has been known to predict the movement of certain diseases by tracking the search terms (like flu, vaccine) used in different parts of the world.

Education

As higher education becomes more expensive and competitive, it becomes a great user of data-based decision-making. There is a strong need for efficiency, increasing revenue, and improving the quality of student experience at all levels of education.

1. Student Enrollment (Recruitment and Retention): Marketing to new potential students requires schools to develop profiles of the students that are most likely to attend. Schools can develop models of what kinds of students are attracted to the school, and then reach out to those students. The students at risk of not returning can be flagged, and corrective measures can be taken in time.

2. Course offerings: Schools can use the class enrollment data to develop models of which new courses are likely to be more popular with students. This can help increase class size, reduce costs, and improve student satisfaction.

3. Fund-raising from Alumni and other donors: Schools can develop predictive models of which alumni are most likely to pledge financial support to the school. Schools can create a profile for alumni more likely to pledge donations to the school. This could lead to a reduction in the cost of mailings and other forms of outreach to alumni.

Retail

Retail organizations grow by meeting customer needs with quality products, in a convenient, timely, and cost-effective manner. Understanding emerging customer shopping patterns can help retailers organize their products, inventory, store layout, and web presence in order to delight their customers, which in turn would help increase revenue and profits. Retailers generate a lot of transaction and logistics data that can be used to diagnose and solve problems.

1. Optimize inventory levels at different locations: Retailers need to manage their inventories carefully. Carrying too much inventory imposes carrying costs, while carrying too little inventory can cause stock-outs and lost sales opportunities. Predicting sales trends dynamically can help retailers move inventory to where it is most in demand. Retail organizations can provide their suppliers with real-time information about sales of their items, so the suppliers can deliver their product to the right locations and minimize stock-outs.

2. Improve store layout and sales promotions: A market basket analysis can develop predictive models of which products sell together often. This knowledge of affinities between products can help retailers co-locate those products. Alternatively, those affinity products could be located farther apart to make the customer walk the length and breadth of the store, and thus be exposed to other products. Promotional discounted product bundles can be created to push a non-selling item along with a set of products that sell well together.

3. Optimize logistics for seasonal effects: Seasonal products offer tremendously profitable short-term sales opportunities, yet they also offer the risk of unsold inventories at the end of the season. Understanding which products are in season in which market can help retailers dynamically manage prices to ensure their inventory is sold during the season. If it is raining in a certain area, then the inventory of umbrellas and ponchos could be rapidly moved there from non-rainy areas to help increase sales.

4. Minimize losses due to limited shelf life: Perishable goods offer challenges in terms of disposing of the inventory in time. By tracking sales trends, the perishable products at risk of not selling before the sell-by date can be suitably discounted and promoted.

Banking


Banks make loans and offer credit cards to millions of customers. They are most interested in improving the quality of loans and reducing bad debts. They also want to retain more good customers, and sell more services to them.

1. Automate the loan application process: Decision models can be generated from past data that predict the likelihood of a loan proving successful. These can be inserted in business processes to automate the financial loan approval process.

2. Detect fraudulent transactions: Billions of financial transactions happen around the world every day. Exception-seeking models can identify patterns of fraudulent transactions. For example, if money is being transferred to an unrelated account for the first time, it could be a fraudulent transaction.

3. Maximize customer value (cross-, up-selling): Selling more products and services to existing customers is often the easiest way to increase revenue. A checking account customer in good standing could be offered home, auto, or educational loans on more favorable terms than other customers, and thus, the value generated from that customer could be increased.

4. Optimize cash reserves with forecasting: Banks have to maintain certain liquidity to meet the needs of depositors who may like to withdraw money. Using past data and trend analysis, banks can forecast how much to keep, and invest the rest to earn interest.

Financial Services

Stock brokerages are intensive users of BI systems. Fortunes can be made or lost based on access to accurate and timely information.

1. Predict changes in bond and stock prices: Forecasting the price of stocks and bonds is a favorite pastime of financial experts as well as lay people. Stock transaction data from the past, along with other variables, can be used to predict future price patterns. This can help traders develop long-term trading strategies.

2. Assess the effect of events on market movements: Decision models using decision trees can be created to assess the impact of events on changes in market volume and prices. Monetary policy changes (such as a Federal Reserve interest rate change) or geopolitical changes (such as war in a part of the world) can be factored into the predictive model to help take action with greater confidence and less risk.

3. Identify and prevent fraudulent activities in trading: There have unfortunately been many cases of insider trading, leading to many prominent financial industry stalwarts going to jail. Fraud detection models seek out-of-the-ordinary activities, and help identify and flag fraudulent activity patterns.

Insurance

This industry is a prolific user of prediction models in pricing insurance proposals and managing losses from claims against insured assets.

1. Forecast claim costs for better business planning: When natural disasters, such as hurricanes and earthquakes, strike, loss of life and property occurs. By using the best available data to model the likelihood (or risk) of such events happening, the insurer can plan for losses and manage resources and profits effectively.

2. Determine optimal rate plans: Pricing an insurance rate plan requires covering the potential losses and making a profit. Insurers use actuarial tables to project life spans and disease tables to project mortality rates, and thus price themselves competitively yet profitably.

3. Optimize marketing to specific customers: By micro-segmenting potential customers, a data-savvy insurer can cherry-pick the best customers and leave the less profitable customers to its competitors. Progressive Insurance is a US-based company that is known to actively use data mining to cherry-pick customers and increase its profitability.

4. Identify and prevent fraudulent claim activities: Patterns can be identified as to where and what kinds of fraud are more likely to occur. Decision-tree-based models can be used to identify and flag fraudulent claims.

Manufacturing

Manufacturing operations are complex systems with inter-related sub-systems. From machines working right, to workers having the right skills, to the right components arriving with the right quality at the right time, to money to source the components, many things have to go right. Toyota, famous for lean manufacturing, works on just-in-time inventory systems to optimize investments in inventory and to improve flexibility in its product mix.

1. Discover novel patterns to improve product quality: Quality of a product can also be tracked, and this data can be used to create a predictive model of product quality deteriorating. Many companies, such as automobile companies, have to recall their products if they have found defects that have a public safety implication. Data mining can help with root cause analysis that can be used to identify sources of errors and help improve product quality in the future.

2. Predict/prevent machinery failures: Statistically, all equipment is likely to break down at some point in time. Predicting which machine is likely to shut down is a complex process. Decision models to forecast machinery failures could be constructed using past data. Preventive maintenance can be planned, and manufacturing capacity can be adjusted, to account for such maintenance activities.

Telecom

BI in telecom can help with the customer side as well as the network side of the operations. Key BI applications include churn management, marketing/customer profiling, network failure, and fraud detection.

1. Churn management: Telecom customers have shown a tendency to switch their providers in search of better deals. Telecom companies tend to respond with many incentives and discounts to hold on to customers. However, they need to determine which customers are at a real risk of switching and which others are just negotiating for a better deal. The level of risk should be factored into the kind of deals and discounts that should be given. Millions of such customer calls happen every month. The telecom companies need to provide a consistent and data-based way to predict the risk of the customer switching, and then make an operational decision in real time while the customer call is taking place. A decision-tree-based or a neural-network-based system can be used to guide the customer-service call operator to make the right decisions for the company, in a consistent manner.

2. Marketing and product creation: In addition to customer data, telecom companies also store call detail records (CDRs), which can be analyzed to precisely describe the calling behavior of each customer. This unique data can be used to profile customers and then can be used for creating new product/service bundles for marketing purposes. An American telecom company, MCI, created a program called Friends & Family that allowed free calls with one's friends and family on that network, and thus, effectively locked many people into their network.

3. Network failure management: Failure of telecom networks due to technical failures or malicious attacks can have devastating impacts on people, businesses, and society. In telecom infrastructure, some equipment will likely fail with a certain mean time between failures. Modeling the failure pattern of various components of the network can help with preventive maintenance and capacity planning.

4. Fraud Management: There are many kinds of fraud in consumer transactions. Subscription fraud occurs when a customer opens an account with the intention of never paying for the services. Superimposition fraud involves illegitimate activity by a person other than the legitimate account holder. Decision rules can be developed to analyze each CDR in real time to identify chances of fraud and take effective action.

Public Sector

Governments gather a large amount of data by virtue of their regulatory function. That data could be analyzed for developing models of effective functioning. There are innumerable applications that can benefit from mining that data. A couple of sample applications are shown here.

1. Law enforcement: Social behavior is a lot more patterned and predictable than one would imagine. For example, the Los Angeles Police Department (LAPD) mined the data from its 13 million crime records over 80 years and developed models of what kind of crime is going to happen when and where. By increasing patrolling in those particular areas, LAPD was able to reduce property crime by 27 percent. Internet chatter can be analyzed to learn of and prevent any evil designs.

2. Scientific research: Any large collection of research data is amenable to being mined for patterns and insights. Protein folding (microbiology), nuclear reaction analysis (sub-atomic physics), and disease control (public health) are some examples where data mining can yield powerful new insights.


Conclusion

Business Intelligence is a comprehensive set of IT tools to support decision making with imaginative solutions for a variety of problems. BI can help improve the performance in nearly all industries and applications.


Review Questions

1. Why should organizations invest in business intelligence solutions? Are these more important than IT security solutions? Why or why not?
2. List 3 business intelligence applications in the hospitality industry.
3. Describe 2 BI tools used in your organization.
4. Businesses need a 'two-second advantage' to succeed. What does that mean to you?


Liberty Stores Case Exercise: Step 1

Liberty Stores Inc is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old, and is growing rapidly. It now operates in 5 continents, 50 countries, 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and has a profit of about 5% of revenue. The company pays special attention to the conditions under which the products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.

1: Create a comprehensive dashboard for the CEO of the company.

2: Create another dashboard for a country head.


Chapter 3: Data Warehousing

A data warehouse (DW) is an organized collection of integrated, subject-oriented databases designed to support decision support functions. DW is organized at the right level of granularity to provide clean enterprise-wide data in a standardized format for reports, queries, and analysis. DW is physically and functionally separate from an operational and transactional database. Creating a DW for analysis and queries represents significant investment in time and effort. It has to be constantly kept up-to-date for it to be useful. DW offers many business and technical benefits.

DW supports business reporting and data mining activities. It can facilitate distributed access to up-to-date business knowledge for departments and functions, thus improving business efficiency and customer service. DW can present a competitive advantage by facilitating decision making and helping reform business processes.

DW enables a consolidated view of corporate data, all cleaned and organized. Thus, the entire organization can see an integrated view of itself. DW thus provides better and timely information. It simplifies data access and allows end users to perform extensive analysis. It enhances overall IT performance by not burdening the operational databases used by Enterprise Resource Planning (ERP) and other systems.


Caselet: University Health System – BI in Healthcare

Indiana University Health (IUH), a large academic health care system, decided to build an enterprise data warehouse (EDW) to foster a genuinely data-driven management culture. IUH hired a data warehousing vendor to develop an EDW which also integrates with their Electronic Health Records (EHR) system. They loaded 14 billion rows of data into the EDW, fully 10 years of clinical data from across IUH's network. Clinical events, patient encounters, lab and radiology, and other patient data were included, as were IUH's performance management, revenue cycle, and patient satisfaction data. They soon put in a new interactive dashboard using the EDW that provided IUH's leadership with the daily operational insights they need to solve the quality/cost equation. It offers visibility into key operational metrics and trends to easily track the performance measures critical to controlling costs and maintaining quality. The EDW can easily be used across IUH's departments to analyze, track and measure clinical, financial, and patient experience outcomes. (Source: healthcatalyst.com)

Q1: What are the benefits of asingle large comprehensive

54

EDW?

Q2:Whatkindsofdatawouldbeneeded for an EDW for anairlinecompany?


Design Considerations for DW

The objective of DW is to provide business knowledge to support decision making. For DW to serve its objective, it should be aligned around those decisions. It should be comprehensive, easy to access, and up to date. Here are some requirements for a good DW:

1. Subject oriented: To be effective, a DW should be designed around a subject domain, i.e. to help solve a certain category of problems.

2. Integrated: The DW should include data from many functions that can shed light on a particular subject area. Thus the organization can benefit from a comprehensive view of the subject area.

3. Time-variant (time series): The data in DW should grow at daily or other chosen intervals. That allows up-to-date comparisons over time.

4. Nonvolatile: DW should be persistent, that is, it should not be created on the fly from the operations databases. Thus, DW is consistently available for analysis, across the organization and over time.

5. Summarized: DW contains rolled-up data at the right level for queries and analysis. The process of rolling up the data helps create consistent granularity for effective comparisons. It also helps reduce the number of variables or dimensions of the data to make them more meaningful for the decision makers.

6. Not normalized: DW often uses a star schema, which is a rectangular central table, surrounded by some look-up tables. The single table view significantly enhances speed of queries.

7. Metadata: Many of the variables in the database are computed from other variables in the operational database. For example, total daily sales may be a computed field. The method of calculation for each such variable should be effectively documented. Every element in the DW should be sufficiently well-defined.

8. Near real-time and/or right-time (active): DWs should be updated in near real-time in many high transaction volume industries, such as airlines. The cost of implementing and updating DW in real time could be discouraging though. Another downside of real-time DW is the possibility of inconsistencies in reports drawn just a few minutes apart.


DW Development Approaches

There are two fundamentally different approaches to developing DW: top down and bottom up. The top-down approach is to make a comprehensive DW that covers all the reporting needs of the enterprise. The bottom-up approach is to produce small data marts, for the reporting needs of different departments or functions, as needed. The smaller data marts will eventually align to deliver comprehensive EDW capabilities. The top-down approach provides consistency but takes more time and resources. The bottom-up approach leads to healthy local ownership and maintainability of data (Table 3.1).

| | Functional Data Mart | Enterprise Data Warehouse |
|---|---|---|
| Scope | One subject or functional area | Complete enterprise data needs |
| Value | Functional area reporting and insights | Deeper insights connecting multiple functional areas |
| Target organization | Decentralized management | Centralized management |
| Time | Low to medium | High |
| Cost | Low | High |
| Size | Small to medium | Medium to large |
| Approach | Bottom up | Top down |
| Complexity | Low (fewer data transformations) | High (data standardization) |
| Technology | Smaller scale servers and databases | Industrial strength |

Table 3.1: Comparing Data Mart and Data Warehouse


DW Architecture

DW has four key elements (Figure 3.1). The first element is the data sources that provide the raw data. The second element is the process of transforming that data to meet the decision needs. The third element is the methods for regularly and accurately loading that data into the EDW or data marts. The fourth element is the data access and analysis part, where devices and applications use the data from DW to deliver insights and other benefits to users.

Figure 3.1: Data Warehousing Architecture


Data Sources

Data warehouses are created from structured data sources. Unstructured data such as text data would need to be structured before being inserted into the DW.

1. Operations data: This includes data from all business applications, including the ERP systems that form the backbone of an organization’s IT systems. The data to be extracted will depend upon the subject matter of the data warehouse. For example, for a sales/marketing data mart, only the data about customers, orders, customer service, and so on would be extracted.

2. Specialized applications: This includes applications such as Point of Sale (POS) terminals and e-commerce applications, which also provide customer-facing data. Supplier data could come from Supply Chain Management systems. Planning and budget data should also be added as needed for making comparisons against targets.

3. External syndicated data: This includes publicly available data such as weather or economic activity data. It could also be added to the DW, as needed, to provide good contextual information to decision makers.


Data Loading Processes

The heart of a useful DW is the set of processes that populate the DW with good quality data. This is called the Extract-Transform-Load (ETL) cycle.

1. Data should be extracted from the operational (transactional) database sources, as well as from other applications, on a regular basis.

2. The extracted data should be aligned together by key fields and integrated into a single data set. It should be cleansed of any irregularities or missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily sales totals, should be computed. The entire data should then be brought to the same format as the central table of DW.

3. This transformed data should then be uploaded into the DW.

This ETL process should be run at a regular frequency. Daily transaction data can be extracted from ERPs, transformed, and uploaded to the database the same night. Thus, the DW is up to date every morning. If a DW is needed for near-real-time information access, then the ETL processes would need to be executed more frequently. ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodically updating the DW.
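The three ETL steps above can be sketched as a small script. This is only an illustrative sketch in Python: the sample rows, the field names (`order_id`, `store`, `amount`) and the in-memory list standing in for the warehouse table are all assumptions, not part of any particular ETL tool.

```python
from collections import defaultdict

# Hypothetical raw transaction rows, as they might come from an operational system.
raw_rows = [
    {"order_id": 1, "date": "2015-03-01", "store": "S1", "amount": "120.50"},
    {"order_id": 2, "date": "2015-03-01", "store": "S1", "amount": "79.50"},
    {"order_id": 2, "date": "2015-03-01", "store": "S1", "amount": "79.50"},  # duplicate
    {"order_id": 3, "date": "2015-03-01", "store": "S2", "amount": None},     # missing value
    {"order_id": 4, "date": "2015-03-02", "store": "S2", "amount": "200.00"},
]


def extract(rows):
    """Extract: in practice this step would query the operational database."""
    return list(rows)


def transform(rows):
    """Transform: de-duplicate, drop rows with missing amounts, roll up to daily store totals."""
    seen = set()
    totals = defaultdict(float)
    for row in rows:
        if row["order_id"] in seen or row["amount"] is None:
            continue
        seen.add(row["order_id"])
        totals[(row["date"], row["store"])] += float(row["amount"])
    return [
        {"date": date, "store": store, "daily_sales": round(total, 2)}
        for (date, store), total in sorted(totals.items())
    ]


def load(warehouse, rows):
    """Load: append the transformed rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse


warehouse_table = load([], transform(extract(raw_rows)))
for row in warehouse_table:
    print(row)
```

In a real deployment a scheduler would run such a script nightly, or more often for near-real-time needs.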


Data Warehouse Design

Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for codes used in the central table. For example, the central table may use digits to represent a salesperson. The lookup table will help provide the name for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance (Figure 3.2).

Figure 3.2: Star Schema Architecture for DW

Other schemas include the snowflake architecture. The difference between a star and snowflake is that in the latter, the look-up tables can have their own further lookup tables.
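A star schema can be tried out with a few lines of SQL. The following is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`sales_fact`, `salesperson`, `product`) and the sample rows are hypothetical and are not taken from Figure 3.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Look-up tables (hypothetical names): they translate codes into descriptive values.
cur.execute("CREATE TABLE salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")

# Central fact table: stores measures plus the codes that point to the look-up tables.
cur.execute("""
    CREATE TABLE sales_fact (
        sale_date TEXT,
        salesperson_id INTEGER REFERENCES salesperson(salesperson_id),
        product_id INTEGER REFERENCES product(product_id),
        amount REAL
    )
""")

cur.executemany("INSERT INTO salesperson VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
cur.executemany("INSERT INTO product VALUES (?, ?)", [(10, "Laptop"), (20, "Phone")])
cur.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("2015-03-01", 1, 10, 1000.0),
        ("2015-03-01", 2, 20, 500.0),
        ("2015-03-02", 1, 20, 250.0),
    ],
)

# Resolve each salesperson code to a name and total the sales per person.
sales_by_person = cur.execute("""
    SELECT s.name, SUM(f.amount)
    FROM sales_fact AS f
    JOIN salesperson AS s ON s.salesperson_id = f.salesperson_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(sales_by_person)
conn.close()
```

A snowflake variant would add a further look-up table behind one of these, for example a product category table referenced from `product`.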

There are many technology choices for developing DW. This includes selecting the right database management system and the right set of data management tools. There are a few big and reliable providers of DW systems. The provider of the operational DBMS may be chosen for DW also. Alternatively, a best-of-breed DW vendor could be used. There are also a variety of tools out there for data migration, data upload, data retrieval, and data analysis.


DW Access

Data from the DW could be accessed for many purposes, by many users, through many devices.

1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales performance report would show sales by many dimensions, compared with plan. A dashboarding system will use data from the warehouse and present analysis to users. The data from DW can be used to populate customized performance dashboards for executives. The dashboard could include drill-down capabilities to analyze the performance data for root cause analysis.

2. The data from the DW could be used for ad-hoc queries and any other applications that make use of the internal data.

3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and then combined with other relevant data, for data mining.


DW Best Practices

A data warehousing project reflects a significant investment into information technology (IT). All of the best practices in implementing any IT project should be followed.

1. The DW project should align with the corporate strategy. Top management should be consulted for setting objectives. Financial viability (ROI) should be established. The project must be managed by both IT and business professionals. The DW design should be carefully tested before beginning development work. It is often much more expensive to redesign after development work has begun.

2. It is important to manage user expectations. The data warehouse should be built incrementally. Users should be trained in using the system so they can absorb the many features of the system.

3. Quality and adaptability should be built in from the start. Only relevant, cleansed, and high-quality data should be loaded. The system should be able to adapt to new tools for access. As business needs change, new data marts may need to be created for new needs.


Conclusion

Data warehouses are special data management facilities intended for creating reports and analysis to support managerial decision making. They are designed to make reporting and querying simple and efficient. The sources of data are operational systems and external data sources. The DW needs to be updated with new data regularly to keep it useful. Data from DW provides a useful input for data mining activities.


Review Questions

1: What is the purpose of a data warehouse?

2: What are the key elements of a data warehouse? Describe each one.

3: What are the sources and types of data for a data warehouse?

4: How will data warehousing evolve in the age of social media?


Liberty Stores Case Exercise: Step 2

The Liberty Stores company wants to be fully informed about its sales of products and take advantage of growth opportunities as they arise. It wants to analyze sales of all its products by all store locations. The newly hired Chief Knowledge Officer has decided to build a Data Warehouse.

1. Design a DW structure for the company to monitor its sales performance. (Hint: Design the central table and look-up tables).

2. Design another DW for the company’s sustainability and charitable activities.


Chapter 4: Data Mining

Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.

Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes the knowledge of data quality and data organizing from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence) areas. It also draws the knowledge of decision-making from the field of business management.

The field of data mining emerged in the context of pattern recognition in defense, such as identifying a friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.

For example, “customers who buy cheese and milk also buy bread 90 percent of the time” would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, “people with blood pressure greater than 160 and an age greater than 65 were at a high risk of dying from a heart stroke” is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.

Past data can be of predictive value in many complex situations, especially where the pattern may not be so easily visible without the modeling technique. Here is a dramatic case of a data-driven decision-making system that beats the best of human experts. Using past data, a decision tree model was developed to predict votes for Justice Sandra Day O’Connor, who had a swing vote in a 5–4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)


Caselet: Target Corp – Data Mining in Retail

Target is a large retail chain that crunches data to develop insights that help target marketing and advertising campaigns. Target analysts managed to develop a pregnancy prediction score based on a customer's purchasing history of 25 products. In a widely publicized story, they figured out that a teenage girl was pregnant before her father did. The targeting can be quite successful and dramatic, as this example published in the New York Times illustrates.

About a year after Target created their pregnancy-prediction model, a man walked into a Target store and demanded to see the manager. He was clutching coupons that had been sent to his daughter and he was angry, according to an employee who participated in the conversation. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat subdued. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. I owe you an apology.” (Source: New York Times).

1: Do Target and other retailers have full rights to use their acquired data as they see fit, and to contact desired consumers with all legally admissible means and messages? What are the issues involved here?

2: Facebook and Google provide many services for free. In return they mine our email and blogs and send us targeted ads. Is that a fair deal?


Gathering and selecting data

The total amount of data in the world is doubling every 18 months. There is an ever-growing avalanche of data coming with higher velocity, volume, and variety. One has to quickly use it or lose it. Smart data mining requires choosing where to play. One has to make judicious decisions about what to gather and what to ignore, based on the purpose of the data mining exercises. It is like deciding where to fish, as not all streams of data will be equally rich in potential insights.

To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources. Most organizations develop an enterprise data model (EDM) to organize their data. An EDM is a unified, high-level model of all the data stored in an organization’s databases. The EDM is usually inclusive of the data generated from all internal systems. The EDM provides the basic menu of data to create a data warehouse for a particular decision-making purpose. DWs help organize all this data in an easy and usable manner so that it can be selected and deployed for mining. The EDM can also help imagine what relevant external data should be gathered to provide context and develop good predictive relationships with the internal data. In the United States, the various federal and local governments and their regulatory agencies make a vast variety and quantity of data available at data.gov.

Gathering and curating data takes time and effort, particularly when it is unstructured or semistructured. Data can come in many forms, such as databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining.

Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Select data could also be gathered from the data warehouse. Every industry and function will have its own requirements and constraints. The health care industry will provide a different type of data with different data names. The HR function would provide different kinds of data. There would be different issues of quality and privacy for these data.


Data cleansing and preparation

The quality of data is critical to the success and value of the data mining project. Otherwise, the situation will be one of garbage in and garbage out (GIGO). The quality of incoming data varies by the source and nature of data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of business, and is less likely to be reliable.

Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed (filling missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more) before it can be ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up to 60-70% of the time needed for a data mining project.

1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.

2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average or modal or default values.

3. Data elements should be comparable. They may need to be (a) transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost/patient to allow comparability of that value. Data elements may need to be adjusted to make them (b) comparable over time also. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability. They may need to be converted to a common currency. Data should be (c) stored at the same granularity to ensure comparability. For example, sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, that is, the coarsest granularity they share, in this case, monthly.

4. Continuous values may need to be binned into a few buckets to help with some analyses. For instance, work experience could be binned as low, medium, and high.

5. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.


6. Ensure that the data is representative of the phenomena under analysis by correcting for any biases in the selection of data. For example, if the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.

7. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.
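A few of the cleansing steps in the list above (de-duplication, filling missing values, and binning) can be sketched in plain Python. The records, field names, and bin cut-offs below are hypothetical and chosen only for illustration. A real project would more likely use a data-preparation tool or library.

```python
from statistics import mean

# Hypothetical employee records merged from two sources.
records = [
    {"id": 1, "experience_years": 2, "salary": 40000},
    {"id": 2, "experience_years": 8, "salary": None},   # missing value
    {"id": 2, "experience_years": 8, "salary": None},   # duplicate row
    {"id": 3, "experience_years": 15, "salary": 80000},
]

# Step 1: de-duplicate on the key field, keeping the first occurrence.
deduped, seen_ids = [], set()
for rec in records:
    if rec["id"] not in seen_ids:
        seen_ids.add(rec["id"])
        deduped.append(dict(rec))

# Step 2: fill missing salaries with the average of the known values.
known_salaries = [r["salary"] for r in deduped if r["salary"] is not None]
average_salary = mean(known_salaries)
for rec in deduped:
    if rec["salary"] is None:
        rec["salary"] = average_salary


# Step 4: bin a continuous value into buckets (the cut-offs are assumptions).
def bin_experience(years):
    if years < 5:
        return "low"
    if years < 10:
        return "medium"
    return "high"


for rec in deduped:
    rec["experience_level"] = bin_experience(rec["experience_years"])

for rec in deduped:
    print(rec)
```

Filling with the average is only one of the options the text mentions. A modal or default value, or dropping the row, may suit other fields better.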


Outputs of Data Mining

Data mining techniques can serve different types of objectives. The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.

One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that link conditions to outcomes. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.

The output can be in the form of a regression equation or mathematical function that represents the best fitting curve to represent the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae.

Population “centroid” is a statistical measure for describing central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be “middle-aged, highly educated, high-net worth professionals, married with two children, living in the coastal areas”. Or a population of “20-something, ivy-league-educated, tech entrepreneurs based in Silicon Valley”. Or it could be a collection of “vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection”. These are typical representations of the output of a cluster analysis exercise.

Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those that buy milk and bread will also buy butter (with 80 percent probability).


Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy.

Predictive Accuracy = (Correct Predictions) / (Total Predictions)

Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model has been created. The model is then used to predict other data instances. When a true-positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a true-negative data point is classified as negative, that is a true negative (TN). On the other hand, when a true-positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a true-negative data point is classified as positive, that is an incorrect prediction, called a false positive (FP). This is represented using the confusion matrix (Figure 4.1).

| Confusion Matrix | True class: Positive | True class: Negative |
|---|---|---|
| Predicted class: Positive | True Positive (TP) | False Positive (FP) |
| Predicted class: Negative | False Negative (FN) | True Negative (TN) |

Figure 4.1: Confusion Matrix

Thus the predictive accuracy can be specified by the following formula.

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN).
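The formula can be checked on a small example. The sketch below, in Python, counts the four cells of the confusion matrix from two hypothetical lists of actual and predicted labels (`True` meaning positive) and then applies the formula. The sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN from paired boolean labels (True = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return tp, tn, fp, fn


def predictive_accuracy(tp, tn, fp, fn):
    """Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical true classes and model predictions for eight data instances.
actual = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = predictive_accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, accuracy)
```

Here six of the eight predictions are correct, so the accuracy is 0.75.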


All classification techniques have a predictive accuracy associated with a predictive model. The highest value can be 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.

There are no good objective measures to judge the accuracy of unsupervised learning techniques such as Cluster Analysis. There is no single right answer for the results of these techniques. For example, the value of the segmentation model depends upon the value the decision-maker sees in those results.


Data Mining Techniques

Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 4.2).

Data Mining Techniques

Supervised Learning (predictive ability based on past data)
    Classification – Machine Learning: Decision Trees, Neural Networks
    Classification – Statistics: Regression

Unsupervised Learning (exploratory analysis to discover patterns)
    Clustering Analysis
    Association Rules

Figure 4.2: Important Data Mining Techniques

The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning as there is a way to supervise whether the model is providing the right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.

Decision trees are the most popular data mining technique, for many reasons.

1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.

2. Decision trees select the most relevant variables automatically out of all the available variables for decision making.

3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.

4. Even non-linear relationships can be handled well by decision trees.

There are many algorithms to implement decision trees. Some of the popular ones are C5, CART and CHAID.
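To illustrate how a decision tree can pick the most relevant variable automatically, here is a minimal Python sketch. It scores each candidate variable by the weighted Gini impurity of the groups it creates, which is the split criterion commonly associated with CART. Other algorithms use different criteria, so this is one possible approach and not the only one. The customer rows and field names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, attribute, target):
    """Average impurity of the groups formed by splitting on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())


# Hypothetical past decisions: did the customer buy?
rows = [
    {"income": "high", "owns_home": "yes", "buys": "yes"},
    {"income": "high", "owns_home": "no", "buys": "yes"},
    {"income": "low", "owns_home": "yes", "buys": "no"},
    {"income": "low", "owns_home": "no", "buys": "no"},
    {"income": "high", "owns_home": "yes", "buys": "yes"},
    {"income": "low", "owns_home": "no", "buys": "no"},
]

scores = {attr: weighted_gini(rows, attr, "buys") for attr in ("income", "owns_home")}
best_attribute = min(scores, key=scores.get)  # lowest impurity = most informative split
print(scores, best_attribute)
```

In this made-up data, splitting on `income` separates buyers from non-buyers perfectly, so it would become the first branch of the tree. A full algorithm would repeat the same selection within each branch.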

Regression is a very popular statistical data mining technique. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict the energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation can then fit the data with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. For a given form of equation, the accuracy of the regression model depends on the dataset used rather than on the particular algorithm or tool used to fit it.
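As a simple illustration of fitting and then using a regression equation, the Python sketch below fits a straight line by ordinary least squares to hypothetical temperature and energy readings. A straight line is used only for brevity. The non-linear case described above would add further terms, such as temperature squared. All numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daily temperature (degrees C) and energy consumption (kWh).
temperatures = [0, 5, 10, 15, 20]
energy_use = [50, 41, 29, 21, 9]

slope, intercept = fit_line(temperatures, energy_use)

# Use the fitted equation to predict consumption on a future 12-degree day.
predicted = intercept + slope * 12
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The fitted slope is negative here, reflecting that this hypothetical building uses less energy on warmer days.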

Artificial Neural Networks (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. It mimics the behavior of human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.

At some point, the neural network will have learned enough and begin to match the predictive accuracy of a human expert or alternative classification techniques. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than human experts. At that point, the ANNs can begin to be seriously considered for deployment, in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality. However, ANNs require a lot of data to train them to develop good predictive ability.
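The idea of a neuron learning from feedback on its previous decisions can be shown with the simplest possible case: a single neuron trained with the classic perceptron rule. This Python sketch is an illustrative assumption and not a full multi-layer network. The task (learning the logical AND of two inputs), the learning rate, and the number of passes are arbitrary choices.

```python
def step(value):
    """Activation: the neuron fires (1) when its weighted input is non-negative."""
    return 1 if value >= 0 else 0


def decide(weights, bias, inputs):
    """The neuron's decision for one set of inputs."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=20, learning_rate=1):
    """Repeatedly decide, compare with the known answer, and adjust the parameters."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - decide(weights, bias, inputs)  # feedback on the decision
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training data: the output should be 1 only when both inputs are 1.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_neuron(samples)
predictions = [decide(weights, bias, inputs) for inputs, _ in samples]
print(predictions)
```

The learned weights and bias are just numbers with no obvious business meaning, which gives a small taste of why larger networks are described as black boxes.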


Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters that could be produced by the data. The K-means technique is a popular technique and allows the user to guide the selection of the right number (K) of clusters from the data.

Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definitions are then used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
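A bare-bones version of K-means can be sketched as follows. This Python example makes several simplifying assumptions: the points are made up, the starting centroids are simply the first K points, and a fixed number of passes is used in place of a convergence check. It shows the two outputs described above (centroids and cluster membership) and how a new data instance is assigned to its nearest centroid.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))


def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive start: the first k points (an assumption)
    clusters = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)

# A new data instance is assigned to the cluster with the nearest centroid.
new_point_cluster = nearest((2, 2), centroids)
print(centroids, clusters, new_point_cluster)
```

Production implementations usually choose starting centroids more carefully and stop once assignments no longer change.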

Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
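The confidence of a rule X → Y can be computed directly from a list of transactions. In the Python sketch below, the baskets are hypothetical. Support is the share of baskets containing an itemset, and confidence is the share of baskets containing X that also contain Y.

```python
# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def confidence(x, y):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support(x | y) / support(x)


rule_x = {"milk", "bread"}
rule_y = {"butter"}
rule_support = support(rule_x | rule_y)
rule_confidence = confidence(rule_x, rule_y)
print(rule_support, round(rule_confidence, 2))
```

In this made-up data, two of the three baskets with milk and bread also contain butter, so the rule holds with a confidence of about 67 percent.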


Tools and Platforms for Data Mining

Data mining tools have existed for many decades. However, they have recently become more important as the value of data has grown and the field of big data analytics has come into prominence. There is a wide range of data mining platforms available in the market today.

1. Simple or sophisticated: There are simple end-user data mining tools such as MS Excel, and there are more sophisticated tools such as IBM SPSS Modeler.

2. Stand-alone or embedded: There are stand-alone tools and there are tools embedded in an existing transaction processing or data warehousing or ERP system.

3. Open source or commercial: There are open source and freely available tools such as Weka, and there are commercial products.

4. User interface: There are text-based tools that require some programming skills, and there are GUI-based drag-and-drop format tools.

5. Data formats: There are tools that work only on proprietary data formats and there are those that directly accept data from a host of popular data management tools formats.

Here we compare three platforms that we have used extensively and effectively for many data mining projects.

Table 4.1: Comparison of Popular Data Mining Platforms

| Feature | Excel | IBM SPSS Modeler | Weka |
|---|---|---|---|
| Ownership | Commercial | Commercial, expensive | Open-source, free |
| Data mining features | Limited; extensible with add-on modules | Extensive features, unlimited data sizes | Extensive, performance issues with large data |
| Stand-alone | Stand-alone | Embedded in BI software suites | Stand-alone |
| User skills needed | End-users | For skilled BI analysts | Skilled BI analysts |
| User interface | Text and click, easy | Drag & drop use, colorful, beautiful GUI | GUI, mostly b&w text output |
| Data formats | Industry-standard | Variety of data sources accepted | Proprietary |

MS Excel is a relatively simple and easy data mining tool. It can get quite versatile once the Analysis ToolPak and some other add-on products are installed on it.

IBM’s SPSS Modeler is an industry-leading data mining platform. It offers a powerful set of tools and algorithms for most popular data mining capabilities. It has a colorful GUI format with drag-and-drop capabilities. It can accept data in multiple formats, including reading Excel files directly.

Weka is an open-source GUI-based tool that offers a large number of data mining algorithms.

ERP systems include some data analytic capabilities, too. SAP has its Business Objects (BO) software. BO is considered one of the leading BI suites in the industry, and is often used by organizations that use SAP.


Data Mining Best Practices

Effective and successful use of data mining activity requires both business and technology skills. The business aspects help understand the domain and the key questions. They also help one imagine possible relationships in the data, and create hypotheses to test them. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.

An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 4.3):

Figure 4.3: CRISP-DM Data Mining cycle

1. Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model and in terms of the data sets available and required.

2. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring many sources for the many elements of data that can help address the hypotheses to solve a problem. Without relevant data, the hypotheses cannot be tested.

3. Data Preparation: The data should be relevant, clean and of high quality. It’s important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment and add new data elements from external sources of data that could help improve predictive accuracy.

4. Modeling: This is the actual task of running many algorithms using the available data to discover if the hypotheses are supported. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.

5. Model Evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model’s predictive accuracy with more test data. When the accuracy has reached some satisfactory level, then the model should be deployed.

6. Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time and will be a setback for establishing and supporting a data-based decision-process culture in the organization. The model should be eventually embedded in the organization’s business processes.

Myths about data mining

There are many myths about this area, scaring away many business executives from using data mining. Data mining is a mindset that presupposes a faith in the ability to reveal insights. By itself, data mining is not too hard, nor is it too easy. It does require a disciplined approach and some cross-disciplinary skills.

Myth #1: Data Mining is about algorithms. Data mining is used by business to answer important and practical business questions. Formulating the problem statement correctly and identifying imaginative solutions for testing are far more important before the data mining algorithms get called in. Understanding the relative strengths of various algorithms is helpful but not mandatory.

Myth #2: Data Mining is about predictive accuracy. While important, predictive accuracy is a feature of the algorithm. As in myth #1, the quality of output is a strong function of the right problem, right hypothesis, and the right data.

Myth #3: Data Mining requires a data warehouse. While the presence of a data warehouse assists in the gathering of information, sometimes the creation of the data warehouse itself can benefit from some exploratory data mining. Some data mining problems may benefit from clean data available directly from the DW, but a DW is not mandatory.

Myth #4: Data Mining requires large quantities of data. Many interesting data mining exercises are done using small or medium sized data sets, at low costs, using end-user tools.

Myth #5: Data Mining requires a technology expert. Many interesting data mining exercises are done by end-users and executives using simple everyday tools like spreadsheets.

Data Mining Mistakes

Data mining is an exercise in extracting non-trivial useful patterns in the data. It requires a lot of preparation and patience to pursue the many leads that data may provide. Much domain knowledge, tools and skill is required to find such patterns. Here are some of the more common mistakes in doing data mining, which should be avoided.

Mistake #1: Selecting the wrong problem for data mining: Without the right goals, or with no goals at all, data mining leads to a waste of time. Getting the right answer to an irrelevant question could be interesting, but it would be pointless from a business perspective. A good goal would be one that would deliver a good ROI to the organization.

Mistake #2: Buried under mountains of data without clear metadata: It is more important to be engaged with the data, than to have lots of data. The relevant data required may be much less than initially thought. There may be insufficient knowledge about the data, or metadata. Examine the data with a critical eye and do not naively believe everything you are told about the data.

Mistake #3: Disorganized data mining: Without clear goals, much time is wasted. Doing the same tests using the same mining algorithms repeatedly and blindly, without thinking about the next stage, without a plan, would lead to wasted time and energy. This can come from being sloppy about keeping track of the data mining procedure and results. Not leaving sufficient time for data acquisition, selection and preparation can lead to data quality issues, and GIGO. Similarly, not providing enough time for testing the model, training the users and deploying the system can make the project a failure.

Mistake #4: Insufficient business knowledge: Without a deep understanding of the business domain, the results would be gibberish and meaningless. Don’t make erroneous assumptions, courtesy of experts. Don’t rule out anything when observing data analysis results. Don’t ignore suspicious (good or bad) findings and quickly move on. Be open to surprises. Even when insights emerge at one level, it is important to slice and dice the data at other levels to see if more powerful insights can be extracted.

Mistake #5: Incompatibility of data mining tools and datasets. All the tools, from data gathering and preparation to mining and visualization, should work together. Use tools that can work with data from multiple sources in multiple industry-standard formats.

Mistake #6: Looking only at aggregated results and not at individual records/predictions. It is possible that the right results at the aggregate level provide absurd conclusions at an individual record level. Diving into the data at the right angle can yield insights at many levels of data.

Mistake #7: Measuring your results differently from the way your sponsor measures them. If the data mining team loses its sense of business objectives, and begins to mine data for its own sake, it will lose respect and executive support very quickly. The BIDM cycle (Figure 1.1) should be remembered.

Conclusion

Data Mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important to provide imaginative solutions that can then be tested with data mining. The business objective should be well understood and should always be kept in mind to ensure that the results are beneficial to the sponsor of the exercise.

Review Questions

1. What is data mining? What are supervised and unsupervised learning techniques?

2. Describe the key steps in the data mining process. Why is it important to follow these processes?

3. What is a confusion matrix?

4. Why is data preparation so important and time consuming?

5. What are some of the most popular data mining techniques?

6. What are the major mistakes to be avoided when doing data mining?

7. What are the key requirements for a skilled data analyst?

Liberty Stores Case Exercise: Step 3

Liberty is constantly evaluating opportunities for improving efficiencies in all its operations, including the commercial operations as well as its charitable activities.

1. What data mining techniques would you use to analyze and predict sales patterns?

2. What data mining technique would you use to categorize its customers?

Chapter 5: Data Visualization

Data Visualization is the art and science of making data easy to understand and consume, for the end user. Ideal visualization shows the right amount of data, in the right order, in the right visual form, to convey the high priority information. The right visualization requires an understanding of the consumer’s needs, nature of the data, and the many tools and techniques available to present data. The right visualization arises from a complete understanding of the totality of the situation. One should use visuals to tell a true, complete and fast-paced story.

Data visualization is the last step in the data life cycle. This is where the data is processed for presentation in an easy-to-consume manner to the right audience for the right purpose. The data should be converted into a language and format that is best preferred and understood by the consumer of data. The presentation should aim to highlight the insights from the data in an actionable manner. If the data is presented in too much detail, then the consumer of that data might lose interest and the insight.

Caselet: Dr Hans Rosling - Visualizing Global Public Health

Dr. Hans Rosling is a master at data visualization. He has perfected the art of showing data in novel ways to highlight unexpected truths. He has become an online star by using data visualizations to make serious points about global health policy and development. Using novel ways to illustrate data obtained from UN agencies, he has helped demonstrate the progress that the world has made in improving public health on many dimensions. The best way to grasp the power of his work is to see this TED video (https://www.youtube.com/watch?v=hVimVzgtD6w), where Life Expectancy is mapped along with Fertility Rate for all countries from 1962 to 2003. Figure 5.1 shows one graphic from this video.

Figure 5.1: Visualizing Global Health Data (source: ted.com)

“The biggest myth is that if we save all the poor kids, we will destroy the planet,” says Hans Rosling, a doctor and professor of international health at the Karolinska Institute in Sweden. “But you can't stop population growth by letting poor children die.” He has the computerised graphs to prove it: colourful visuals with circles that swarm, swell and shrink like living creatures. Dr Rosling's mesmerizing graphics have been impressing audiences on the international lecture circuit, from the TED conferences to the World Economic Forum at Davos. Instead of bar charts and histograms, Dr Rosling uses Lego bricks, IKEA boxes and data-visualization software developed by his Gapminder Foundation to transform reams of economic and public-health data into gripping stories. His aim is ambitious. “I produce a road map for the modern world,” he says. “Where people want to drive is up to them. But I have the idea that if they have a proper road map and know what the global realities are, they'll make better decisions.” (source: economist.com).

Q1: What are the business and social implications of this kind of data visualization?

Q2: How could these techniques be applied in your organization and area of work?

Excellence in Visualization

Data can be presented in the form of rectangular tables, or it can be presented in colorful graphs of various types. “Small, non-comparative, highly-labeled data sets usually belong in tables” (Ed Tufte, 2001, p 33). However, as the amount of data grows, graphs are preferable. Graphics help give shape to data. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:

1. Show, and even reveal, the data: The data should tell a story, especially a story hidden in large masses of data. However, reveal the data in context, so the story is correctly told.

2. Induce the viewer to think of the substance of the data: The format of the graph should be so natural to the data, that it hides itself and lets the data shine.

3. Avoid distorting what the data have to say: Statistics can be used to lie. In the name of simplifying, some crucial context could be removed, leading to distorted communication.

4. Make large data sets coherent: By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.

5. Encourage the eyes to compare different pieces of data: Organize the chart in ways the eyes would naturally move to derive insights from the graph.

6. Reveal the data at several levels of detail: Graphs lead to insights, which raise further curiosity, and thus presentations should help get to the root cause.

7. Serve a reasonably clear purpose: informing or decision-making.

8. Closely integrate with the statistical and verbal descriptions of the dataset: There should be no separation of charts and text in presentation. Each mode should tell a complete story. Intersperse text with the map/graphic to highlight the main insights.

Context is important in interpreting graphics. Perception of the chart is as important as the actual chart. Do not ignore the intelligence or the biases of the reader. Keep the template consistent, and only show variations in data. There can be many excuses for graphical distortion, e.g. “we are just approximating.” Quality of information transmission comes prior to the aesthetics of the chart. Leaving out the contextual data can be misleading.

A lot of graphics are published because they serve a particular cause or a point of view. This is particularly important in for-profit or politically contested environments. Many related dimensions can be folded into a graph. The more dimensions that are represented in a graph, the richer and more useful the chart becomes. The data visualizer should understand the client’s objectives and present the data for accurate perception of the totality of the situation.

Types of Charts

There are many kinds of data, as seen in the caselet above. Time series data is the most popular form of data. It helps reveal patterns over time. However, data could also be organized around an alphabetical list of things, such as countries or products or salespeople. Figure 5.2 shows some of the popular chart types and their usage.

1. Line graph: This is a basic and most popular way of displaying information. It shows data as a series of points connected by straight line segments. With time-series data, time is usually shown on the x-axis. Multiple variables can be plotted on the same y-axis scale so that their line graphs can be compared.

2. Scatter plot: This is another very basic and useful graphic form. It helps reveal the relationship between two variables. In the above caselet, it shows two dimensions: Life Expectancy and Fertility Rate. Unlike in a line graph, there are no line segments connecting the points.

3. Bar graph: A bar graph shows thin colorful rectangular bars with their lengths being proportional to the values represented. The bars can be plotted vertically or horizontally. Bar graphs use a lot more ink than line graphs and should be used when line graphs are inadequate.

4. Stacked Bar graphs: These are a particular method of doing bar graphs. Values of multiple variables are stacked one on top of the other to tell an interesting story. Bars can also be normalized so that the total height of every bar is equal, which shows the relative composition of each bar.

5. Histograms: These are like bar graphs, except that they are useful in showing data frequencies or data values on classes (or ranges) of a numerical variable.


Figure 5.2: Many types of graphs

6. Pie charts: These are very popular to show the distribution of a variable, such as sales by region. The size of a slice is representative of the relative strength of each value.

7. Box charts: These are a special form of chart to show the distribution of variables. The box shows the middle half of the values, while whiskers on both sides extend to the extreme values in either direction.

8. Bubble Graph: This is an interesting way of displaying multiple dimensions in one chart. It is a variant of a scatter plot with many data points marked on two dimensions. Now imagine that each data point on the graph is a bubble (or a circle): the size of the circle and the color fill in the circle could represent two additional dimensions.

9. Dials: These are charts like the speed dial in a car, that show whether the variable value (such as sales number) is in the low range, medium range, or high range. These ranges could be colored red, yellow and green to give an instant view of the data.

10. Geographical data maps: These are particularly useful maps to denote statistics. Figure 5.3 shows a tweet density map of the US. It shows where the tweets emerge from in the US.

Figure 5.3: US tweet map (Source: Slate.com)

11. Pictographs: One can use pictures to represent data. E.g. Figure 5.4 shows the number of liters of water needed to produce one pound of each of the products, where images are used to show the product for easy reference. Each droplet of water also represents 50 liters of water.

Figure 5.4: Pictograph of Water footprint (source: waterfootprint.org)

Visualization Example

To demonstrate how each of the visualization tools could be used, imagine an executive for a company who wants to analyze the sales performance of his division. Table 5.1 shows the important raw sales data for the current year, alphabetically sorted by Product names.

Product   Revenue   Orders   SalesPers
AA           9731      131          23
BB            355       43           8
CC            992       32           6
DD            125       31           4
EE            933       30           7
FF            676       35           6
GG           1411      128          13
HH           5116      132          38
JJ            215        7           2
KK           3833      122          50
LL           1348       15           7
MM           1201       28          13

Table 5.1: Raw Performance Data

To reveal some meaningful pattern, a good first step would be to sort the table by Product revenue, with highest revenue first. We could total up the values of Revenue, Orders, and Salespersons for all products. We can also add some important ratios to the right of the table (Table 5.2).

Product   Revenue   Orders   SalesPers   Rev/Order   Rev/SalesP   Orders/SalesP
AA           9731      131          23        74.3        423.1             5.7
HH           5116      132          38        38.8        134.6             3.5
KK           3833      122          50        31.4         76.7             2.4
GG           1411      128          13        11.0        108.5             9.8
LL           1348       15           7        89.9        192.6             2.1
MM           1201       28          13        42.9         92.4             2.2
CC            992       32           6        31.0        165.3             5.3
EE            933       30           7        31.1        133.3             4.3
FF            676       35           6        19.3        112.7             5.8
BB            355       43           8         8.3         44.4             5.4
JJ            215        7           2        30.7        107.5             3.5
DD            125       31           4         4.0         31.3             7.8
Total       25936      734         177        35.3        146.5             4.1

Table 5.2: Sorted data, with additional ratios
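
The sorting, totals and ratios in Table 5.2 can be reproduced with a few lines of code. The following is only a minimal sketch in Python, using the figures from Table 5.1; the names raw, with_ratios, rows and totals are chosen here for illustration and are not part of any particular tool. Printed values may differ from the table in the last digit because of rounding.

```python
# Raw figures from Table 5.1: product -> (revenue, orders, salespersons)
raw = {
    "AA": (9731, 131, 23), "BB": (355, 43, 8), "CC": (992, 32, 6),
    "DD": (125, 31, 4), "EE": (933, 30, 7), "FF": (676, 35, 6),
    "GG": (1411, 128, 13), "HH": (5116, 132, 38), "JJ": (215, 7, 2),
    "KK": (3833, 122, 50), "LL": (1348, 15, 7), "MM": (1201, 28, 13),
}


def with_ratios(name, revenue, orders, sales_pers):
    """Return one table row with the three derived ratios appended."""
    return (name, revenue, orders, sales_pers,
            revenue / orders, revenue / sales_pers, orders / sales_pers)


# Sort by revenue, highest first, then append a Total row.
by_revenue = sorted(raw.items(), key=lambda item: item[1][0], reverse=True)
rows = [with_ratios(name, *values) for name, values in by_revenue]
totals = [sum(values[i] for values in raw.values()) for i in range(3)]
rows.append(with_ratios("Total", *totals))

for name, rev, orders, sp, rev_ord, rev_sp, ord_sp in rows:
    print(f"{name:<6}{rev:>8}{orders:>8}{sp:>6}{rev_ord:>8.1f}{rev_sp:>8.1f}{ord_sp:>6.1f}")
```

The ratios for the Total row are computed from the column totals, not by averaging the per-product ratios.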

There are too many numbers in this table to visualize any trends in them. The numbers are on different scales, so plotting them on the same chart would not be easy. E.g. the Revenue numbers are in the thousands while the SalesPers numbers and Orders/SalesPers are in single or double digits.

One could start by visualizing the revenue as a pie chart. The revenue proportion drops significantly from the first product to the next (Figure 5.5). It is interesting to note that the top 3 products produce roughly 72% of the revenue.

Figure 5.5: Revenue Share by Product
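
The slices of such a pie chart are simply each product's share of total revenue. As a small illustrative sketch (Python, with variable names chosen here for the example), the shares and the combined share of the top three products can be computed from the Revenue column of Table 5.2; the computed top-three figure is about 72%.

```python
# Revenue per product, taken from Table 5.2.
revenue = {
    "AA": 9731, "HH": 5116, "KK": 3833, "GG": 1411, "LL": 1348, "MM": 1201,
    "CC": 992, "EE": 933, "FF": 676, "BB": 355, "JJ": 215, "DD": 125,
}

total = sum(revenue.values())

# Each pie slice is the product's fraction of total revenue.
shares = {product: amount / total for product, amount in revenue.items()}

# Combined share of the three largest slices.
top3_share = sum(sorted(shares.values(), reverse=True)[:3])

for product, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{product}: {share:.1%}")
print(f"Top 3 products: {top3_share:.1%} of revenue")
```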

The number of orders for each product can be plotted as a bar graph (Figure 5.6). This shows that while the revenue is widely different for the top four products, they have approximately the same number of orders.

Figure 5.6: Orders by Products

Therefore, the orders data could be investigated further to see order patterns. Suppose additional data is made available for Orders by their size. Suppose the orders are chunked into 4 sizes: Tiny, Small, Medium, and Large. Additional data is shown in Table 5.3.

Product   Total Orders   Tiny   Small   Medium   Large
AA                 131      5      44       70      12
HH                 132     38      60       30       4
KK                 122     20      50       44       8
GG                 128     52      70        6       0
LL                  15      2       3        5       5
MM                  28      8      12        6       2
CC                  32      5      17       10       0
EE                  30      6      14       10       0
FF                  35     10      22        3       0
BB                  43     18      25        0       0
JJ                   7      4       2        1       0
DD                  31     21      10        0       0
Total              734    189     329      185      31

Table 5.3: Additional data on order sizes

Figure 5.7 is a stacked bar graph that shows the percentage of Orders by size for each product. This chart brings a different set of insights. It shows that the product HH has a larger proportion of tiny orders. The products at the far right have a large number of tiny orders and very few large orders.

Figure 5.7: Product Orders by Order Size
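
A normalized stacked bar graph of this kind plots, for each product, the percentage of its orders in each size class, so that every bar has the same total height. The sketch below (Python; the names SIZES, orders_by_size and normalize are illustrative) shows how those percentages could be derived from Table 5.3 before charting.

```python
SIZES = ("Tiny", "Small", "Medium", "Large")

# Order counts per size class, taken from Table 5.3.
orders_by_size = {
    "AA": (5, 44, 70, 12), "HH": (38, 60, 30, 4), "KK": (20, 50, 44, 8),
    "GG": (52, 70, 6, 0), "LL": (2, 3, 5, 5), "MM": (8, 12, 6, 2),
    "CC": (5, 17, 10, 0), "EE": (6, 14, 10, 0), "FF": (10, 22, 3, 0),
    "BB": (18, 25, 0, 0), "JJ": (4, 2, 1, 0), "DD": (21, 10, 0, 0),
}


def normalize(counts):
    """Convert raw counts into percentages that add up to 100."""
    total = sum(counts)
    return [100 * count / total for count in counts]


pct = {product: normalize(counts) for product, counts in orders_by_size.items()}

for product, values in pct.items():
    cells = "  ".join(f"{size}: {value:5.1f}%" for size, value in zip(SIZES, values))
    print(f"{product}  {cells}")
```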


Visualization Example phase-2

The executive wants to understand the productivity of salespersons. This analysis could be done in terms of either the number of orders or the revenue per salesperson. There could be two separate graphs, one for the number of orders per salesperson, and the other for the revenue per salesperson. However, an interesting way is to plot both measures on the same graph to give a more complete picture. This can be done even when the two data series have different scales. The data is re-sorted here by number of orders per salesperson.

Figure 5.8 shows two line graphs superimposed upon each other. One line shows the revenue per salesperson, while the other shows the number of orders per salesperson. It shows productivity ranging from a high of 5.3 orders per salesperson down to 2.1 orders per salesperson. The second line, the blue line, shows the revenue per salesperson for each of the products. The revenue per salesperson is highest at 630, while it is lowest at just 30.

And thus additional layers of data visualization can go on for this data set.

Figure 5.8: Salesperson productivity by product

Tips for Data Visualization

To help the client in understanding the situation, the following considerations are important:

1. Fetch appropriate and correct data for analysis. This requires some understanding of the domain of the client and what is important for the client. E.g. in a business setting, one may need to understand the many measures of profitability and productivity.

2. Sort the data in the most appropriate manner. It could be sorted by numerical variables, or alphabetically by name.

3. Choose an appropriate method to present the data. The data could be presented as a table, or it could be presented as any of the graph types.

4. The data set could be pruned to include only the more significant elements. More data is not necessarily better, unless it makes the most significant impact on the situation.

5. The visualization could show an additional dimension for reference, such as the expectations or targets with which to compare the results.

6. The numerical data may need to be binned into a few categories. E.g. the orders per person were plotted as actual values, while the order sizes were binned into 4 categorical choices.

7. High-level visualization could be backed by more detailed analysis. For the most significant results, a drill-down may be required.

8. There may be a need to present additional textual information to tell the whole story. For example, one may require notes to explain some extraordinary results.
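
Binning (tip 6) can be sketched in a few lines. The example below is only an illustration in Python: the cut-off values in BOUNDS and the sample order values are hypothetical, since the text does not say where the boundaries between Tiny, Small, Medium and Large orders lie.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut-offs (not from the text): values below 10 are Tiny,
# 10 up to just under 50 are Small, 50 up to just under 200 are Medium,
# and 200 or more are Large.
BOUNDS = [10, 50, 200]
LABELS = ["Tiny", "Small", "Medium", "Large"]


def bin_value(value):
    """Map a numerical order value to one of the four size categories."""
    return LABELS[bisect_right(BOUNDS, value)]


# Hypothetical order values, used only to show the binning at work.
order_values = [4, 12, 75, 48, 310, 9, 150]
counts = Counter(bin_value(value) for value in order_values)

for label in LABELS:
    print(f"{label}: {counts[label]}")
```

Once the values are binned, the category counts can be charted directly, for example as the stacked bars described earlier.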

Conclusion

Data Visualization is the last phase of the data life cycle, and leads to the consumption of data by the end user. It should tell an accurate, complete and simple story backed by data, while keeping it insightful and engaging. There are innumerable types of visual graphing techniques available for visualizing data. The choice of the right tools requires a good understanding of the business domain, the data set and the client needs. There is ample room for creativity to design ever more compelling data visualization to most efficiently convey the insights from the data.

Review Questions

1. What is data visualization?

2. How would you judge the quality of data visualizations?

3. What are the data visualization techniques? When would you use tables or graphs?

4. Describe some key steps in data visualization.

5. What are some key requirements for good visualization?

Liberty Stores Case Exercise: Step 4

Liberty is constantly evaluating its performance for improving efficiencies in all its operations, including the commercial operations as well as its charitable activities.

1. What data visualization techniques would you use to help understand sales patterns?

2. What data visualization technique would you use to categorize its customers?

Section 2

This section covers five important data mining techniques.

The first three techniques are examples of supervised learning, consisting of classification techniques.

Chapter 6 will cover decision trees, which are the most popular form of data mining techniques. There are many algorithms to develop decision trees.

Chapter 7 will describe regression modeling techniques. These are statistical techniques.

Chapter 8 will cover artificial neural networks, which are a machine learning technique.

The next two techniques are examples of unsupervised learning, consisting of data exploration techniques.

Chapter 9 will cover Cluster Analysis. This is also called Market Segmentation analysis.

Chapter 10 will cover the Association Rule Mining technique, also called Market Basket Analysis.

Chapter 6: Decision Trees

Decision trees are a simple way to guide one’s path to a decision. The decision may be a simple binary one, whether to approve a loan or not. Or it may be a complex multi-valued decision, as to what may be the diagnosis for a particular sickness. Decision trees are hierarchically branched structures that help one come to a decision based on asking certain questions in a particular sequence. Decision trees are one of the most widely used techniques for classification. A good decision tree should be short and ask only a few meaningful questions. They are very efficient to use, easy to explain, and their classification accuracy is competitive with other methods. Decision trees can generate knowledge from a few test instances that can then be applied to a broad population. Decision trees are used mostly to answer relatively simple binary decisions.

Caselet: Predicting Heart Attacks using Decision Trees

A study was done at UC San Diego concerning heart disease patient data. The patients were diagnosed with a heart attack from chest pain, diagnosed by EKG, high enzyme levels in their heart muscles, etc. The objective was to predict which of these patients was at risk of dying from a second heart attack within the next 30 days. The prediction would determine the treatment plan, such as whether to keep the patient in intensive care or not. For each patient more than 100 variables were collected, including demographics, medical history and lab data. Using that data, and the CART algorithm, a decision tree was constructed.

The decision tree showed that if Blood Pressure was low (<=90), the chance of another heart attack was very high (70%). If the patient’s BP was ok, the next question to ask was the patient’s age. If the age was low (<=62), then the patient’s survival was almost guaranteed (98%). If the age was higher, then the next question to ask was about sinus problems. If their sinus was ok, the chances of survival were 89%. Otherwise, the chance of survival dropped to 50%. This decision tree predicts 86.5% of the cases correctly. (Source: Salford Systems).

Q1: Is a decision tree good enough for this data, in terms of accuracy, design, readability, etc.?

Q2: Identify the benefits from creating such a decision tree. Can these be quantified?
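
The tree described in this caselet asks at most three questions, which can be written as a short chain of conditions. The sketch below (Python) is only an illustration of that structure; the function name assess and its parameter names are made up for this example, and the thresholds and percentages are simply those quoted in the caselet.

```python
def assess(blood_pressure, age, sinus_ok):
    """Walk the caselet's three questions in order and return the outcome."""
    # Question 1: is blood pressure low?
    if blood_pressure <= 90:
        return "high risk: 70% chance of another heart attack"
    # Question 2: is the patient young?
    if age <= 62:
        return "98% chance of survival"
    # Question 3: is the sinus reading ok?
    if sinus_ok:
        return "89% chance of survival"
    return "50% chance of survival"


print(assess(blood_pressure=85, age=50, sinus_ok=True))
print(assess(blood_pressure=120, age=70, sinus_ok=False))
```

Note that each patient is classified after one, two or three questions, even though more than 100 variables were collected.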


Decision Tree problem

Imagine a conversation between a doctor and a patient. The doctor asks questions to determine the cause of the ailment. The doctor would continue to ask questions, till she is able to arrive at a reasonable decision. If nothing seems plausible, she might recommend some tests to generate more data and options.

This is how experts in any field solve problems. They use decision trees or decision rules. For every question they ask, the potential answers create separate branches for further questioning. For each branch, the expert would know how to proceed ahead. The process continues until the end of the tree is reached, which means a leaf node is reached.

Human experts learn from past experiences or data points. Similarly, a machine can be trained to learn from the past data points and extract some knowledge or rules from it. Decision trees use machine learning algorithms to abstract knowledge from data. A decision tree would have a predictive accuracy based on how often it makes correct decisions.

1. The more training data is provided, the more accurate its knowledge extraction will be, and thus, it will make more accurate decisions.

2. The more variables the tree can choose from, the greater the likely accuracy of the decision tree.

3. In addition, a good decision tree should also be frugal so that it takes the least number of questions, and thus, the least amount of effort, to get to the right decision.

Here is an exercise to create a decision tree that helps make decisions about approving the play of an outdoor game. The objective is to predict the play decision given the atmospheric conditions out there. The decision is: Should the game be allowed or not? Here is the decision problem.

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    Normal     True    ??

To answer that question, one should look at past experience, and see what decision was made in a similar instance, if such an instance exists. One could look up the database of past decisions to try to find the answer. Here is a list of the decisions taken in 14 instances of past soccer game situations. (Dataset courtesy: Witten, Frank, and Hall, 2010).

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

If there were a row for the Sunny/Hot/Normal/Windy condition in the data table, it would match the current problem; and the decision from that row could be used to answer the current problem. However, there is no such past instance in this case. There are three disadvantages of looking up the data table:

1. As mentioned earlier, how does one decide if there isn’t a row that corresponds to the exact situation today? If there is no exact matching instance available in the database, the past experience cannot guide the decision.

2. Searching through the entire past database may be time consuming, depending on the number of variables and the organization of the database.

3. What if the data values are not available for all the variables? In this instance, if the data for the humidity variable was not available, looking up the past data would not help.
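
The first disadvantage is easy to demonstrate. The following is a minimal sketch in Python of an exact-match lookup over the 14 past instances listed above; the names DATA and lookup are chosen here for illustration. For today's conditions (Sunny, Hot, Normal, Windy) the lookup finds no matching row, so the table alone gives no answer.

```python
# The 14 past instances: (outlook, temp, humidity, windy, play)
DATA = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]


def lookup(outlook, temp, humidity, windy):
    """Return the past decisions for rows that match all four conditions."""
    query = (outlook, temp, humidity, windy)
    return [row[4] for row in DATA if row[:4] == query]


today = lookup("Sunny", "Hot", "Normal", True)
print(today if today else "No exact match in the past data")
```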

A better way of solving the problem may be to abstract the knowledge from the past data into a decision tree or rules. These rules can be represented in a decision tree, and then that tree can be used to make the decisions. The decision tree may not need values for all the variables.

Decision Tree Construction

A decision tree is a hierarchically branched structure. What should be the first question asked in creating the tree? One should ask the more important question first, and the less important questions later. What is the most important question that should be asked to solve the problem? How is the importance of the questions determined? Thus, how should the root node of the tree be determined?

Determining root node of the tree: In this example, there are four choices based on the four variables. One could begin by asking one of the following questions: what is the outlook, what is the temperature, what is the humidity, and what is the wind speed? A criterion should be used to evaluate these choices. The key criterion would be: which one of these questions gives the most insight about the situation? Another way to look at it would be the criterion of frugality. That is, which question will provide us the shortest ultimate decision tree? Another way to look at this is that if one is allowed to ask one and only one question, which one would one ask? In this case, the most important question should be the one that, by itself, helps make the most correct decisions with the fewest errors. The four questions can now be systematically compared, to see which variable by itself will help make the most correct decisions. One should systematically calculate the correctness of decisions based on each question. Then one can select the question with the most correct predictions, or the fewest errors.

Start with the first variable, in this case outlook. It can take three values: sunny, overcast, and rainy.

Start with the sunny value of outlook. There are five instances where the outlook is sunny. In 2 of the 5 instances the play decision was yes, and in the other three, the decision was no. Thus, if the decision rule was that Outlook: sunny → No, then 3 out of 5 decisions would be correct, while 2 out of 5 such decisions would be incorrect. There are 2 errors out of 5. This can be recorded in Row 1.

Attribute   Rules         Error   Total Error
Outlook     Sunny → No    2/5

Similar analysis would be done for other values of the outlook variable. There are four instances where the outlook is overcast. In all 4 of the 4 instances the play decision was yes. Thus, if the decision rule was that Outlook: overcast → Yes, then 4 out of 4 decisions would be correct, while none of the decisions would be incorrect. There are 0 errors out of 4. This can be recorded in the next row.

Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5
            Overcast → Yes   0/4

There are five instances where the outlook is rainy. In 3 of the 5 instances the play decision was yes, and in the other two, the decision was no. Thus, if the decision rule was that Outlook: rainy → Yes, then 3 out of 5 decisions would be correct, while 2 out of 5 decisions would be incorrect. There will be 2/5 errors. This can be recorded in the next row.

Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5

Adding up errors for all values of outlook, there are 4 errors out of 14. In other words, Outlook gives 10 correct decisions out of 14, and 4 incorrect ones.
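
The counting for the outlook variable can be checked mechanically. The sketch below (Python; rule_errors and the other names are illustrative, not taken from any tool) picks the majority play decision for each value of a variable as its rule and counts the instances that the rule gets wrong, using the 14 past instances listed earlier.

```python
from collections import Counter, defaultdict

# The 14 past instances: (outlook, temp, humidity, windy, play)
DATA = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]


def rule_errors(rows, column):
    """For each value in the column: (majority decision, errors, instances)."""
    decisions = defaultdict(Counter)
    for row in rows:
        decisions[row[column]][row[-1]] += 1
    result = {}
    for value, counts in decisions.items():
        decision, hits = counts.most_common(1)[0]
        instances = sum(counts.values())
        result[value] = (decision, instances - hits, instances)
    return result


outlook_rules = rule_errors(DATA, column=0)
total_errors = sum(errors for _, errors, _ in outlook_rules.values())

for value, (decision, errors, instances) in outlook_rules.items():
    print(f"{value} -> {decision}: {errors}/{instances}")
print(f"Total error: {total_errors}/{len(DATA)}")
```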

A similar analysis can be done for the other three variables. At the end of that analytical exercise, the following Error table will be constructed.

Attribute   Rules            Error   Total Error
Outlook     Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
Temp        Hot → No         2/4     5/14
            Mild → Yes       2/6
            Cool → Yes       1/4
Humidity    High → No        3/7     4/14
            Normal → Yes     1/7
Windy       False → Yes      2/8     5/14
            True → No        3/6

The variable that leads to the least number of errors (and thus the most number of correct decisions) should be chosen as the first node. In this case, two variables have the least number of errors. There is a tie between outlook and humidity, as both have 4 errors out of 14 instances. The tie can be broken using another criterion, the purity of the resulting sub-trees.

If all the errors were concentrated in a few of the sub-trees, and some of the branches were completely free of error, that is preferred from a usability perspective. Outlook has one error-free branch, for the overcast value, while there is no such pure sub-class for the humidity variable. Thus the tie is broken in favor of outlook. The decision tree will use outlook as the first node, or the first splitting variable. The first question that should be asked to solve the Play problem is 'What is the value of outlook?'
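The error table above can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the book's own implementation. It assumes the training data is the standard 14-row weather ("play") dataset, which reproduces every count quoted in the table; the names `ROWS`, `error_table` and `total_errors` are illustrative.

```python
from collections import Counter

# Assumed data: the standard 14-row weather dataset (an assumption; it matches
# the counts in the error table). Columns: outlook, temp, humidity, windy, play.
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
ATTRIBUTES = ["outlook", "temp", "humidity", "windy"]


def error_table(rows):
    """For each attribute, map each value to (majority decision, errors, count)."""
    table = {}
    for col, name in enumerate(ATTRIBUTES):
        by_value = {}
        for row in rows:
            by_value.setdefault(row[col], []).append(row[-1])
        rules = {}
        for value, decisions in by_value.items():
            # The rule for a value is its most frequent decision;
            # every other instance with that value counts as an error.
            decision, hits = Counter(decisions).most_common(1)[0]
            rules[value] = (decision, len(decisions) - hits, len(decisions))
        table[name] = rules
    return table


def total_errors(table):
    """Sum the errors over all values of each attribute."""
    return {name: sum(err for _, err, _ in rules.values())
            for name, rules in table.items()}


table = error_table(ROWS)
totals = total_errors(table)
for name in ATTRIBUTES:
    print(name, table[name], "total errors:", totals[name], "of", len(ROWS))
```

Outlook and humidity tie on total errors; counting the branches with zero errors (one for outlook, none for humidity) gives the purity tie-break described in the text.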

Splitting the Tree: From the root node, the decision tree will be split into three branches or sub-trees, one for each of the three values of outlook. Data for the root node (the entire data) will be divided into three segments, one for each value of outlook. The sunny branch will inherit the data for the instances that had sunny as the value of outlook. These will be used for further building of that sub-tree. Similarly, the rainy branch will inherit data for the instances that had rainy as the value of outlook. These will be used for further building of that sub-tree. The overcast branch will inherit the data for the instances that had overcast as the outlook. However, there will be no need to build further on that branch. There is a clear decision, yes, for all instances when the outlook value is overcast.

The decision tree will look like this after the first level of splitting.

Determining the next nodes of the tree: A similar recursive logic of tree building should be applied to each branch. For the sunny branch on the left, error values will be calculated for the three other variables: temp, humidity and windy. The final comparison looks like this:

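| Attribute | Rules | Error | Total Error |
|---|---|---|---|
| Temp | Hot → No | 0/2 | 1/5 |
| | Mild → No | 1/2 | |
| | Cool → Yes | 0/1 | |
| Humidity | High → No | 0/3 | 0/5 |
| | Normal → Yes | 0/2 | |
| Windy | False → No | 1/3 | 2/5 |
| | True → Yes | 1/2 | |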

The variable of humidity shows the least amount of error, i.e. zero error. The other two variables have non-zero errors. Thus the Outlook: sunny branch on the left will use humidity as the next splitting variable.

Similar analysis should be done for the 'rainy' branch of the tree. The analysis would look like this.

| Attribute | Rules | Error | Total Error |
|---|---|---|---|
| Temp | Mild → Yes | 1/3 | 2/5 |
| | Cool → Yes | 1/2 | |
| Humidity | High → No | 1/2 | 2/5 |
| | Normal → Yes | 1/3 | |
| Windy | False → Yes | 0/3 | 0/5 |
| | True → No | 0/2 | |

For the rainy branch, it can similarly be seen that the variable Windy gives all the correct answers, while neither of the other two variables makes all the correct decisions.

This is what the final decision tree looks like. Here it is produced using the Weka open-source data mining platform (Figure 6.1). This is the model that abstracts the knowledge of the past decision data.

Figure 6.1: Decision Tree for the weather problem

This decision tree can be used to solve the current problem. Here is the problem again.

| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | Normal | True | ?? |

According to the tree, the first question to ask is about outlook. In this problem the outlook is sunny. So, the decision problem moves to the Sunny branch of the tree. The node in that sub-tree is humidity. In the problem, Humidity is Normal. That branch leads to the answer Yes. Thus, the answer to the play problem is Yes.

| Outlook | Temp | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Hot | Normal | True | Yes |

Lessons from constructing trees

Here are some benefits of using this decision tree compared with looking up the answers from the data table (Table 6.1).

| | Decision Tree | Table Lookup |
|---|---|---|
| Accuracy | Varied level of accuracy | 100% accurate |
| Generality | General. Applies to all situations | Applies only when a similar case had occurred earlier |
| Frugality | Only three variables needed | All four variables are needed |
| Simple | Only one, or max two, variable values are needed | All four variable values are needed |
| Easy | Logical, and easy to understand | Can be cumbersome to look up; no understanding of the logic behind the decision |

Table 6.1: Comparing Decision Tree with Table Look-up

Here are a few observations about how the tree was constructed:

1. The final decision tree has zero errors in mapping to the prior data. In other words, the tree has a predictive accuracy of 100%. The tree completely fits the data. In real-life situations, such perfect predictive accuracy is not possible when making decision trees. When there are larger, complicated data sets, with many more variables, a perfect fit is unachievable. This is especially true in business and social contexts, where things are not always fully clear and consistent.

2. The decision tree algorithm selected the minimum number of variables that are needed to solve the problem. Thus, one can start with all available data variables, and let the decision-tree algorithm select the ones that are useful, and discard the rest.

3. This tree is almost symmetric, with all branches being of almost similar lengths. However, in real-life situations, some of the branches may be much longer than the others, and the tree may need to be pruned to make it more balanced and usable.

4. It may be possible to increase predictive accuracy by making more sub-trees and making the tree longer. However, the marginal accuracy gained from each subsequent level in the tree will be less, and may not be worth the loss in ease and interpretability of the tree. If the branches are long and complicated, the tree will be difficult to understand and use. The longer branches may need to be trimmed to keep the tree easy to use.

5. A perfectly fitting tree has the danger of over-fitting the data, thus capturing all the random variations in the data. It may fit the training data well, but may not do well in predicting future real instances.

6. There was a single best tree for this data. There could however be two or more equally efficient decision trees of similar length with similar predictive accuracy for the same data set. Decision trees are based strictly on patterns within the data, and do not rely on any underlying theory of the problem domain. When multiple candidate trees are available, one could choose whichever is easier to understand, communicate or implement.

Decision Tree Algorithms

As we saw, decision trees employ the divide-and-conquer method. The data is branched at each node according to certain criteria until all the data is assigned to leaf nodes. The algorithm recursively divides a training set until each division consists of examples from one class.

The following is pseudocode for making decision trees:

1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute according to certain criteria.
3. Add a branch to the root node for each value of the split.
4. Split the data into mutually exclusive subsets along the lines of the specific split.
5. Repeat steps 2 to 4 for each and every leaf node until a stopping criterion is reached.
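The pseudocode above can be sketched as a short recursive function. This is an illustrative sketch rather than a definitive implementation: it assumes the standard 14-row weather dataset, uses the least-errors criterion from the worked example with the number of error-free branches as the tie-breaker, and stops when a subset contains a single class or no attributes remain. The names `split_quality`, `build` and `classify` are hypothetical.

```python
from collections import Counter

# Assumed data: standard weather dataset. Columns 0-3 are outlook, temp,
# humidity, windy; the last column is the play decision.
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def split_quality(rows, col):
    """Return (total errors, number of error-free branches) for a split on col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    errors, pure = 0, 0
    for decisions in groups.values():
        wrong = len(decisions) - Counter(decisions).most_common(1)[0][1]
        errors += wrong
        pure += 1 if wrong == 0 else 0
    return errors, pure


def build(rows, cols):
    """Steps 1-5: a leaf is a decision string, a node is (column, branches)."""
    decisions = [row[-1] for row in rows]
    if len(set(decisions)) == 1 or not cols:          # stopping criterion
        return Counter(decisions).most_common(1)[0][0]

    def key(col):                                      # step 2: best split
        errors, pure = split_quality(rows, col)
        return (errors, -pure)                         # fewest errors, then purity

    best = min(cols, key=key)
    remaining = [c for c in cols if c != best]
    subsets = {}                                       # steps 3-4: branch and split
    for row in rows:
        subsets.setdefault(row[best], []).append(row)
    return (best, {value: build(subset, remaining)     # step 5: recurse
                   for value, subset in subsets.items()})


def classify(tree, case):
    """Walk the tree from the root until a leaf decision is reached."""
    while isinstance(tree, tuple):
        col, branches = tree
        tree = branches[case[col]]
    return tree


tree = build(ROWS, [0, 1, 2, 3])
print(tree)
print(classify(tree, ("sunny", "hot", "normal", True)))
```

On this data the sketch selects outlook at the root, humidity under sunny and windy under rainy, which matches the tree derived by hand, and it answers the sample problem (sunny, hot, normal, windy) with yes.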

There are many algorithms for making decision trees. Decision tree algorithms differ on three key elements:

1. Splitting criteria
   1. Which variable to use for the first split? How should one determine the most important variable for the first branch, and subsequently, for each sub-tree? There are many measures, like least errors, information gain, Gini's coefficient, etc.
   2. What values to use for the split? If the variables have continuous values, such as for age or blood pressure, what value-ranges should be used to make bins?
   3. How many branches should be allowed for each node? There could be binary trees, with just two branches at each node. Or there could be more branches allowed.

2. Stopping criteria: When to stop building the tree? There are two major ways to make that determination. The tree building could be stopped when a certain depth of the branches has been reached and the tree becomes unreadable after that. The tree building could also be stopped when the error level at any node is within predefined tolerable levels.

3. Pruning: The tree could be trimmed to make it more balanced and more easily usable. The pruning is often done after the tree is constructed, to balance out the tree and improve usability. The symptoms of an over-fitted tree are a tree too deep, with too many branches, some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned. There are two approaches to avoid over-fitting.

   - Pre-pruning means to halt the tree construction early, when certain criteria are met. The downside is that it is difficult to decide what criteria to use for halting the construction, because we do not know what may happen subsequently if we keep growing the tree.
   - Post-pruning: Remove branches or sub-trees from a "fully grown" tree. This method is commonly used. The C4.5 algorithm uses a statistical method to estimate the errors at each node for pruning. A validation set may be used for pruning as well.
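Two of the splitting measures named above, information gain and the Gini index, can be illustrated on the outlook split from the weather example. The sketch below uses only the class counts quoted earlier (9 yes and 5 no overall; sunny 2/3, overcast 4/0, rainy 3/2). The function names are illustrative, and the formulas are the standard textbook definitions rather than anything specific to one package.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted(measure, children):
    """Average impurity of the child nodes, weighted by their size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * measure(child) for child in children)


parent = [9, 5]                       # yes, no over all 14 instances
outlook = [[2, 3], [4, 0], [3, 2]]    # sunny, overcast, rainy

information_gain = entropy(parent) - weighted(entropy, outlook)
gini_reduction = gini(parent) - weighted(gini, outlook)

print(round(entropy(parent), 3), round(information_gain, 3))
print(round(gini(parent), 3), round(gini_reduction, 3))
```

A larger gain (or impurity reduction) means the split separates the classes better, so an algorithm using either measure would compare this figure across all candidate variables.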

The most popular decision tree algorithms are C4.5/C5, CART and CHAID (Table 6.2).

Table 6.2: Comparing popular Decision Tree algorithms

| Decision-Tree | C4.5 | CART | CHAID |
|---|---|---|---|
| Full Name | Iterative Dichotomiser (ID3) | Classification and Regression Trees | Chi-square Automatic Interaction Detector |
| Basic algorithm | Hunt's algorithm | Hunt's algorithm | Adjusted significance testing |
| Developer | Ross Quinlan | Breiman | Gordon Kass |
| When developed | 1986 | 1984 | 1980 |
| Types of trees | Classification | Classification & Regression trees | Classification & regression |
| Serial implementation | Tree-growth & Tree-pruning | | |
| Type of data | Discrete & Continuous; Incomplete data | Discrete and Continuous | Non-normal data also accepted |
| Types of splits | Multi-way splits | Binary splits only; clever surrogate splits to reduce tree depth | Multi-way splits as default |
| Splitting criteria | Information gain | Gini's coefficient, and others | Chi-square test |
| Pruning criteria | Clever bottom-up technique avoids overfitting | Remove weakest links first | Trees can become very large |
| Implementation | Publicly available | Publicly available in most packages | Popular in market research, for segmentation |

Conclusion

Decision trees are the most popular, versatile, and easy-to-use data mining technique with high predictive accuracy. They are also very useful as communication tools with executives. There are many successful decision tree algorithms. All publicly available data mining software platforms offer multiple decision tree implementations.

Review Questions

1: What is a decision tree? Why are decision trees the most popular classification technique?

2: What is a splitting variable? Describe three criteria for choosing a splitting variable.

3: What is pruning? What are pre-pruning and post-pruning? Why choose one over the other?

4: What are Gini's coefficient and information gain? (Hint: google it.)

Hands-on Exercise: Create a decision tree for the following data set. The objective is to predict the class category (loan approved or not).

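| Age | Job | House | Credit | Loan Approved |
|---|---|---|---|---|
| Young | False | No | Fair | No |
| Young | False | No | Good | No |
| Young | True | No | Good | Yes |
| Young | True | Yes | Fair | Yes |
| Young | False | No | Fair | No |
| Middle | False | No | Fair | No |
| Middle | False | No | Good | No |
| Middle | True | Yes | Good | Yes |
| Middle | False | Yes | Excellent | Yes |
| Middle | False | Yes | Excellent | Yes |
| Old | False | Yes | Excellent | Yes |
| Old | False | Yes | Good | Yes |
| Old | True | No | Good | Yes |
| Old | True | No | Excellent | Yes |
| Old | False | No | Fair | No |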

Then solve the following problem using the model.

| Age | Job | House | Credit | Loan Approved |
|---|---|---|---|---|
| Young | False | False | Good | ?? |

Liberty Stores Case Exercise: Step 5

Liberty is constantly evaluating requests for opening new stores. They would like to formalize the process for handling many requests, so that the best candidates are selected for detailed evaluation.

Develop a decision tree for evaluating new store options. Here is the training data:

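| City-size | Avg Income | Local investors | LOHAS awareness | Decision |
|---|---|---|---|---|
| Big | High | Yes | High | Yes |
| Med | Med | No | Med | No |
| Small | Low | Yes | Low | No |
| Big | High | No | High | Yes |
| Small | Med | Yes | High | No |
| Med | High | Yes | Med | Yes |
| Med | Med | Yes | Med | No |
| Big | Med | No | Med | No |
| Med | High | Yes | Low | No |
| Small | High | No | High | Yes |
| Small | Med | No | High | No |
| Med | High | No | Med | No |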

Use the decision tree to answer the following question.

| City-size | Avg Income | Local investors | LOHAS awareness | Decision |
|---|---|---|---|---|
| Med | Med | No | Med | ?? |

Chapter 7: Regression

Regression is a well-known statistical technique to model the predictive relationship between several independent variables (IVs) and one dependent variable (DV). The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve. The quality of fit of the curve to the data can be measured by a coefficient of correlation (r), which is the square root of the amount of variance explained by the curve.

The key steps for regression are simple:

1. List all the variables available for making the model.
2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict the DV using the other variables.

Caselet: Data driven Prediction Markets

Traditional pollsters still seem to be using methodologies that worked well a decade or two ago. Nate Silver is one of a new breed of data-based political forecasters who are steeped in big data and advanced analytics. In the 2012 elections, he predicted that Obama would win the election with 291 electoral votes, compared to 247 for Mitt Romney, giving the President a 62% lead and re-election. He stunned the political forecasting world by correctly predicting the Presidential winner in all 50 states, including all nine swing states. He also correctly predicted the winner in 31 of the 33 US Senate races.

Nate Silver brings a different view to the world of forecasting political elections, viewing it as a scientific discipline. State the hypothesis scientifically, gather all available information, analyze the data and extract insights using sophisticated models and algorithms and, finally, apply human judgment to interpret those insights. The results are likely to be much more grounded and successful. (Source: The Signal and the Noise: Why Most Predictions Fail but Some Don't, by Nate Silver, 2012)

Q1: What is the impact of this story on traditional pollsters & commentators?


Correlations and Relationships

Statistical relationships are about which elements of data hang together, and which ones hang separately. It is about categorizing variables that have a relationship with one another, and categorizing variables that are distinct and unrelated to other variables. It is about describing significant positive relationships and significant negative differences.

The first and foremost measure of the strength of a relationship is co-relation (or correlation). The strength of a correlation is a quantitative measure that is measured in a normalized range between 0 (zero) and 1. A correlation of 1 indicates a perfect relationship, where the two variables are in perfect sync. A correlation of 0 indicates that there is no relationship between the variables.

The relationship can be positive, or it can be an inverse relationship, that is, the variables may move together in the same direction or in the opposite direction. Therefore, a good measure of correlation is the correlation coefficient, which is the square root of correlation. This coefficient, called r, can thus range from −1 to +1. An r value of 0 signifies no relationship. An r value of 1 shows a perfect relationship in the same direction, and an r value of −1 shows a perfect relationship but moving in opposite directions.

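Given two numeric variables x and y, the coefficient of correlation r is mathematically computed by the following equation, where x̄ (called x-bar) is the mean of x, and ȳ (y-bar) is the mean of y.

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$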
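As a worked illustration, the coefficient of correlation can be computed directly from the means and deviations. The sketch below is an assumption-labeled example: it applies the standard Pearson formula to the house price and size figures used in the regression exercise of this chapter, and the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Coefficient of correlation between two equal-length numeric lists."""
    n = len(xs)
    x_bar = sum(xs) / n                      # mean of x
    y_bar = sum(ys) / n                      # mean of y
    co_variation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_variation = sum((x - x_bar) ** 2 for x in xs)
    y_variation = sum((y - y_bar) ** 2 for y in ys)
    return co_variation / math.sqrt(x_variation * y_variation)


# House size (sqft) and price ($) from the regression exercise in this chapter.
size = [1850, 2190, 2100, 1930, 2300, 1710, 1550, 1920,
        1840, 1720, 1660, 2405, 1525, 2030, 2240]
price = [229500, 273300, 247000, 195100, 261000, 179700, 168500, 234400,
         168800, 180400, 156200, 288350, 186750, 202100, 256800]

r = pearson_r(size, price)
print(round(r, 3))
```

Negating one of the two lists flips the sign of r without changing its magnitude, which is the inverse relationship described above.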


Visual look at relationships

A scatter plot (or scatter diagram) is a simple exercise for plotting all data points between two variables on a two-dimensional graph. It provides a visual layout of where all the data points are placed in that two-dimensional space. The scatter plot can be useful for graphically intuiting the relationship between two variables.

Here is a picture (Figure 7.1) that shows many possible patterns in scatter diagrams.

Figure 7.1: Scatter plots showing types of relationships among two variables (Source: Groebner et al. 2013)

Chart (a) shows a very strong linear relationship between the variables x and y. That means the value of y increases proportionally with x. Chart (b) also shows a strong linear relationship between the variables x and y. Here it is an inverse relationship. That means the value of y decreases proportionally with x.

Chart (c) shows a curvilinear relationship. It is an inverse relationship, which means that the value of y decreases as x increases, though not at a constant rate. However, it seems a relatively well-defined relationship, like an arc of a circle, which can be represented by a simple quadratic equation (quadratic means the power of two, that is, using terms like x² and y²). Chart (d) shows a positive curvilinear relationship. However, it does not seem to resemble a regular shape, and thus would not be a strong relationship. Charts (e) and (f) show no relationship. That means variables x and y are independent of each other.

Charts (a) and (b) are good candidates for a simple linear regression model (the terms regression model and regression equation can be used interchangeably). Chart (c) too could be modeled with a slightly more complex, quadratic regression equation. Chart (d) might require an even higher order polynomial regression equation to represent the data. Charts (e) and (f) have no relationship; thus, they cannot be modeled together, by regression or using any other modeling tool.

Regression Exercise

The regression model is described as a linear equation that follows. y is the dependent variable, that is, the variable being predicted. x is the independent variable, or the predictor variable. There could be many predictor variables (such as x1, x2, ...) in a regression equation. However, there can be only one dependent variable (y) in the regression equation.

y = β0 + β1x + ε

A simple example of a regression equation would be to predict a house price from the size of the house. Here is a sample of house price data:

| House Price | Size (sqft) |
|---|---|
| $229,500 | 1850 |
| $273,300 | 2190 |
| $247,000 | 2100 |
| $195,100 | 1930 |
| $261,000 | 2300 |
| $179,700 | 1710 |
| $168,500 | 1550 |
| $234,400 | 1920 |
| $168,800 | 1840 |
| $180,400 | 1720 |
| $156,200 | 1660 |
| $288,350 | 2405 |
| $186,750 | 1525 |
| $202,100 | 2030 |
| $256,800 | 2240 |

The two dimensions of the data (one predictor, one outcome variable) can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 7.2).

Figure 7.2: Scatter plot and regression equation between House price and house size.

Visually, one can see a positive correlation between House Price and Size (sqft). However, the relationship is not perfect. Running a regression model between the two variables produces the following output (truncated).

| Regression Statistics | |
|---|---|
| r | 0.891 |
| r² | 0.794 |

| Coefficients | |
|---|---|
| Intercept | -54191 |
| Size (sqft) | 139.48 |

It shows the coefficient of correlation is 0.891. r², the measure of total variance explained by the equation, is 0.794, or 79%. That means the two variables are moderately and positively correlated. Regression coefficients help create the following equation for predicting house prices.

House Price ($) = 139.48 * Size (sqft) – 54191
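The coefficients in this equation come from a least-squares fit. As a rough sketch (assuming ordinary least squares with one predictor, which is what the statistics above are consistent with), the slope and intercept can be recovered from the sample data with plain Python; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)    # share of variance explained
    return slope, intercept, r_squared


size = [1850, 2190, 2100, 1930, 2300, 1710, 1550, 1920,
        1840, 1720, 1660, 2405, 1525, 2030, 2240]
price = [229500, 273300, 247000, 195100, 261000, 179700, 168500, 234400,
         168800, 180400, 156200, 288350, 186750, 202100, 256800]

slope, intercept, r_squared = fit_line(size, price)
print(round(slope, 2), round(intercept), round(r_squared, 3))
```

The slope is the estimated price change per additional square foot, and the negative intercept has no practical meaning on its own since no house has zero size.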

This equation explains only 79% of the variance in house prices. Suppose other predictor variables are made available, such as the number of rooms in the house. It might help improve the regression model.

The house data now looks like this:

| House Price | Size (sqft) | # Rooms |
|---|---|---|
| $229,500 | 1850 | 4 |
| $273,300 | 2190 | 5 |
| $247,000 | 2100 | 4 |
| $195,100 | 1930 | 3 |
| $261,000 | 2300 | 4 |
| $179,700 | 1710 | 2 |
| $168,500 | 1550 | 2 |
| $234,400 | 1920 | 4 |
| $168,800 | 1840 | 2 |
| $180,400 | 1720 | 2 |
| $156,200 | 1660 | 2 |
| $288,350 | 2405 | 5 |
| $186,750 | 1525 | 3 |
| $202,100 | 2030 | 2 |
| $256,800 | 2240 | 4 |

While it is possible to make a 3-dimensional scatter plot, one can alternatively examine the correlation matrix among the variables.

| | House Price | Size (sqft) | # Rooms |
|---|---|---|---|
| House Price | 1 | | |
| Size (sqft) | 0.891 | 1 | |
| # Rooms | 0.944 | 0.748 | 1 |

It shows that the House price has a strong correlation with the number of rooms (0.944) as well. Thus, it is likely that adding this variable to the regression model will add to the strength of the model.

Running a regression model between these three variables produces the following output (truncated).

| Regression Statistics | |
|---|---|
| r | 0.984 |
| r² | 0.968 |

| Coefficients | |
|---|---|
| Intercept | 12923 |
| Size (sqft) | 65.60 |
| Rooms | 23613 |

It shows the coefficient of correlation of this regression model is 0.984. r², the total variance explained by the equation, is 0.968, or 97%. That means the variables are positively and very strongly correlated. Adding a new relevant variable has helped improve the strength of the regression model.

Using the regression coefficients helps create the following equation for predicting house prices.

House Price ($) = 65.6 * Size (sqft) + 23613 * Rooms + 12924

This equation shows a 97% goodness of fit with the data, which is very good for business and economic data. There is always some random variation in naturally occurring business data, and it is not desirable to overfit the model to the data.

This predictive equation should be used for future transactions. Given a situation as below, it will be possible to predict the price of a house with 2000 sqft and 3 rooms.

| House Price | Size (sqft) | # Rooms |
|---|---|---|
| ?? | 2000 | 3 |

House Price ($) = 65.6 * 2000 (sqft) + 23613 * 3 + 12924 = $214,963
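The same arithmetic can be wrapped in a small function so that any new house can be scored. This is a minimal sketch that simply hard-codes the coefficients of the equation above; the function name `predict_price` is illustrative.

```python
# Coefficients taken from the two-variable regression equation in the text.
PRICE_PER_SQFT = 65.6
PRICE_PER_ROOM = 23613
INTERCEPT = 12924


def predict_price(size_sqft, rooms):
    """Predicted house price in dollars from size and number of rooms."""
    return PRICE_PER_SQFT * size_sqft + PRICE_PER_ROOM * rooms + INTERCEPT


predicted = predict_price(2000, 3)
print(round(predicted))

# Comparing a prediction with a known sale shows how far off the model is:
# the first house in the data (1850 sqft, 4 rooms) actually sold for $229,500.
residual = 229500 - predict_price(1850, 4)
print(round(residual))
```

The difference between an actual and a predicted value (the residual) is the quantity one would track as new data points become available.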

The predicted values should be compared with the actual values to see how close the model is able to predict the actual value. As new data points become available, there are opportunities to fine-tune and improve the model.

Non-linear regression exercise

The relationship between the variables may also be curvilinear. For example, given past data on electricity consumption (KWatts) and temperature (temp), the objective is to predict the electrical consumption from the temperature value. Here are a dozen past observations.

| KWatts | Temp (F) |
|---|---|
| 12530 | 46.8 |
| 10800 | 52.1 |
| 10180 | 55.1 |
| 9730 | 59.2 |
| 9750 | 61.9 |
| 10230 | 66.2 |
| 11160 | 69.9 |
| 13910 | 76.8 |
| 15690 | 79.3 |
| 15110 | 79.7 |
| 17020 | 80.2 |
| 17880 | 83.3 |

In two dimensions (one predictor, one outcome variable), the data can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph below (Figure 7.3).

Figure 7.3: Scatter plots showing regression between (a) KWatts and temp, and (b) KWatts and temp squared

It is visually clear that the first line does not fit the data well. The relationship between temperature and KWatts follows a curvilinear model, where it hits bottom at a certain value of temperature. The regression model confirms the relationship, since r is only 0.77 and r² is also only 60%. Thus, only 60% of the variance is explained.

The regression model can then be enhanced using a Temp² variable in the equation. The second line is the relationship between KWatts and Temp². The scatter plot shows that the energy consumption shows a strong linear relationship with the quadratic Temp² variable. Running the regression model after adding the quadratic variable leads to the following results:

| Regression Statistics | |
|---|---|
| r | 0.992 |
| r² | 0.984 |

| Coefficients | |
|---|---|
| Intercept | 67245 |
| Temp (F) | -1911 |
| Temp-sq | 15.87 |

It shows that the coefficient of correlation of the regression model is now 0.99. r², the total variance explained by the equation, is 0.984, or 98.4%. That means the variables are very strongly and positively correlated. The regression coefficients help create the following equation for predicting energy consumption.

Energy Consumption (KWatts) = 15.87 * Temp² - 1911 * Temp + 67245

This equation shows a 98.4% fit, which is very good for business and economic contexts. Now one can predict the KWatts value for when the temperature is 72 degrees.

Energy consumption = (15.87 * 72 * 72) - (1911 * 72) + 67245 = 11923 KWatts
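The quadratic equation can also be evaluated in code, which makes it easy to locate the temperature at which consumption "hits bottom". The sketch below hard-codes the coefficients quoted above; the minimum follows from setting the derivative 2·a·Temp + b to zero, and the names are illustrative.

```python
# Coefficients of the quadratic regression equation quoted in the text.
A = 15.87      # coefficient of Temp squared
B = -1911      # coefficient of Temp
C = 67245      # intercept


def predict_kwatts(temp_f):
    """Predicted energy consumption (KWatts) at a temperature in degrees F."""
    return A * temp_f ** 2 + B * temp_f + C


# A parabola a*t^2 + b*t + c with a > 0 is lowest where 2*a*t + b = 0.
temp_at_minimum = -B / (2 * A)

print(round(predict_kwatts(72)))
print(round(temp_at_minimum, 1), round(predict_kwatts(temp_at_minimum)))
```

The minimum falls at roughly 60 degrees, between the two lowest observations in the data (59.2 and 61.9 degrees), which is what the scatter plot suggests.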

Logistic Regression

Regression models traditionally work with continuous numeric value data for dependent and independent variables. Logistic regression models can, however, work with dependent variables with binary values, such as whether a loan is approved (yes or no). Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables. For example, logistic regression might be used to predict whether a patient has a given disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body mass index, results of blood tests, etc.).

Logistic regression models use probability scores as the predicted values of the dependent variable. Logistic regression takes the natural logarithm of the odds of the dependent variable being a case (referred to as the logit) to create a continuous criterion as a transformed version of the dependent variable. Thus the logit transformation is used in logistic regression as the dependent variable. The net effect is that although the dependent variable in logistic regression is binomial (or categorical, i.e. has only two possible values), the logit is the continuous function upon which linear regression is conducted. Here is the general logistic function, with the independent variable on the horizontal axis and the logit dependent variable on the vertical axis (Figure 7.4).

Figure 7.4: General Logit function
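The logit transformation and its inverse are short enough to write out. The following sketch uses the standard definitions (logit as the natural logarithm of the odds, and the logistic function as its inverse); the function names are illustrative.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))


def logistic(z):
    """Inverse of the logit: maps any real number back to a probability."""
    return 1 / (1 + math.exp(-z))


for p in (0.1, 0.5, 0.8):
    z = logit(p)
    print(p, round(z, 3), round(logistic(z), 3))
```

A probability of 0.5 corresponds to a logit of 0, probabilities above 0.5 give positive logits, and those below give negative ones; this unbounded scale is what allows a linear equation to be fitted to a yes/no outcome.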

All popular data mining platforms provide support for regular multiple regression models, as well as options for logistic regression.


Advantages and Disadvantages of Regression Models

Regression models are very popular because they offer many advantages.

1. Regression models are easy to understand as they are built upon basic statistical principles such as correlation and least square error.

2. Regression models provide simple algebraic equations that are easy to understand and use.

3. The strength (or the goodness of fit) of the regression model is measured in terms of the correlation coefficients, and other related statistical parameters that are well understood.

4. Regression models can match and beat the predictive power of other modeling techniques.

5. Regression models can include all the variables that one wants to include in the model.

6. Regression modeling tools are pervasive. They are found in statistical packages as well as data mining packages. MS Excel spreadsheets can also provide simple regression modeling capabilities.

Regression models can however prove inadequate under many circumstances.

1. Regression models cannot cover for poor data quality issues. If the data is not prepared well to remove missing values, or is not well-behaved in terms of a normal distribution, the validity of the model suffers.

2. Regression models suffer from collinearity problems (meaning strong linear correlations among some independent variables). If the independent variables have strong correlations among themselves, then they will eat into each other's predictive power and the regression coefficients will lose their robustness. Regression models will not automatically choose between highly collinear variables, although some packages attempt to do that.

3. Regression models can be unwieldy and unreliable if a large number of variables are included in the model. All variables entered into the model will be reflected in the regression equation, irrespective of their contribution to the predictive power of the model. There is no concept of automatic pruning of the regression model.

4. Regression models do not automatically take care of non-linearity. The user needs to imagine the kind of additional terms that might need to be added to the regression model to improve its fit.

5. Regression models work only with numeric data and not with categorical variables. There are ways to deal with categorical variables, though, by creating multiple new variables with a yes/no value.
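The yes/no workaround in the last point is commonly called dummy (or one-hot) coding. The sketch below illustrates it on a hypothetical Credit column with the values Fair, Good and Excellent; the function name `dummy_variables` and the `is_` prefix are illustrative choices, not part of any particular package.

```python
def dummy_variables(values):
    """Turn one categorical column into several 0/1 (no/yes) columns."""
    categories = sorted(set(values))
    return [{"is_" + category: int(value == category) for category in categories}
            for value in values]


credit = ["Fair", "Good", "Good", "Excellent"]
for original, coded in zip(credit, dummy_variables(credit)):
    print(original, coded)
```

In practice one of the new columns is usually left out of the regression, because it can be inferred from the others and would otherwise add to the collinearity problem described in point 2.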


Conclusion

Regression models are simple, versatile, visual/graphical tools with high predictive ability. They include non-linear as well as binary predictions. Regression models should be used in conjunction with other data mining techniques to confirm the findings.

***


Review Exercises:

Q1: What is a regression model?

Q2: What is a scatter plot? How does it help?

Q3: Compare and contrast decision trees with regression models.

Q4: Using the data below, create a regression model to predict the Test2 score from the Test1 score. Then predict the score for one who got a 46 in Test1.

| Test1 | Test2 |
|---|---|
| 59 | 56 |
| 52 | 63 |
| 44 | 55 |
| 51 | 50 |
| 42 | 66 |
| 42 | 48 |
| 41 | 58 |
| 45 | 36 |
| 27 | 13 |
| 63 | 50 |
| 54 | 81 |
| 44 | 56 |
| 50 | 64 |
| 47 | 50 |

Liberty Stores Case Exercise: Step 6

Liberty wants to forecast its sales for next year, for financial budgeting.

| Year | Global GDP index per capita | # cust serv calls ('000s) | # employees ('000) | # Items ('000) | Revenue ($M) |
|---|---|---|---|---|---|
| 1 | 100 | 25 | 45 | 11 | 2000 |
| 2 | 112 | 27 | 53 | 11 | 2400 |
| 3 | 115 | 22 | 54 | 12 | 2700 |
| 4 | 123 | 27 | 58 | 14 | 2900 |
| 5 | 122 | 32 | 60 | 14 | 3200 |
| 6 | 132 | 33 | 65 | 15 | 3500 |
| 7 | 143 | 40 | 72 | 16 | 4000 |
| 8 | 126 | 30 | 65 | 16 | 4200 |
| 9 | 166 | 34 | 85 | 17 | 4500 |
| 10 | 157 | 47 | 97 | 18 | 4700 |
| 11 | 176 | 33 | 98 | 18 | 4900 |
| 12 | 180 | 45 | 100 | 20 | 5000 |

Check the correlations. Which variables are strongly correlated?

Create a regression model that best predicts the revenue.

Chapter 8: Artificial Neural Networks

Artificial Neural Networks (ANN) are inspired by the information processing model of the mind/brain. The human brain consists of billions of neurons that link with one another in an intricate pattern. Every neuron receives information from many other neurons, processes it, gets excited or not, and passes its state information to other neurons.

Just as the brain is a multipurpose system, ANNs too are very versatile systems. They can be used for many kinds of pattern recognition and prediction. They are also used for classification, regression, clustering, association, and optimization activities. They are used in finance, marketing, manufacturing, operations, information systems applications, and so on.

ANNs are composed of a large number of highly interconnected processing elements (neurons) working in multi-layered structures that receive inputs, process the inputs, and produce an output. An ANN is designed for a specific application, such as pattern recognition or data classification, and trained through a learning process. Just as in biological systems, ANNs make adjustments to the synaptic connections with each learning instance.

ANNs are like a black box trained into solving a particular type of problem, and they can develop high predictive powers. Their intermediate synaptic parameter values evolve as the system obtains feedback on its predictions, and thus an ANN learns from more training data (Figure 8.1).

Figure 8.1: General ANN model

Caselet: IBM Watson - Analytics in Medicine

The amount of medical information available is doubling every five years and much of this data is unstructured. Physicians simply don't have time to read every journal that can help them keep up to date with the latest advances. Mistakes in diagnosis are likely to happen and clients have become more aware of the evidence. Analytics will transform the field of medicine into evidence-based medicine. How can healthcare providers address these problems?

IBM's Watson cognitive computing system can analyze large amounts of unstructured text and develop hypotheses based on that analysis. Physicians can use Watson to assist in diagnosing and treating patients. First, the physician might describe symptoms and other related factors to the system. Watson can then identify the key pieces of information and mine the patient's data to find relevant facts about family history, current medications and other existing conditions. It combines this information with current findings from tests, and then forms and tests hypotheses by examining a variety of data sources: treatment guidelines, electronic medical record data and doctors' and nurses' notes, as well as peer-reviewed research and clinical studies. From here, Watson can provide potential treatment options and its confidence rating for each suggestion.

Watson has been deployed at many leading healthcare institutions to improve the quality and efficiency of healthcare decisions, and to help clinicians uncover insights from their patient information in electronic medical records (EMR), among other benefits.

Q1: How would IBM Watson change medical practices in the future?

Q2: In what other industries & functions could this technology be applied?

Business Applications of ANN

Neural networks are used most often when the objective function is complex, there exists plenty of data, and the model is expected to improve over a period of time. A few sample applications:

1. They are used in stock price prediction, where the rules of the game are extremely complicated, and a lot of data needs to be processed very quickly.

2. They are used for character recognition, as in recognizing hand-written text, or damaged or mangled text. They are used in recognizing fingerprints. These are complicated patterns and are unique for each person. Layers of neurons can progressively clarify the pattern, leading to a remarkably accurate result.

3. They are also used in traditional classification problems, like approving a financial loan application.

DesignPrinciplesofanArtificialNeuralNetwork1. A neuron is the basic processing unit of the network. The neuron (or

processingelement)receivesinputsfromitsprecedingneurons(orPEs),doessomenonlinearweightedcomputationonthebasisofthoseinputs,transformstheresultintoitsoutputvalue,andthenpassesontheoutputtothenextneuroninthenetwork(Figure8.2).X’saretheinputs,w’saretheweightsforeachinput,andyistheoutput.

Figure8.2:Modelforasingleartificialneuron

2. A neural network is a multi-layered model. There is at least one input neuron, one output neuron, and at least one processing neuron. An ANN with just this basic structure would be a simple, single-stage computational unit. A simple task may be processed by just that one neuron and the result may be communicated soon. ANNs, however, may have multiple layers of processing elements in sequence. There could be many neurons involved in a sequence, depending upon the complexity of the predictive action. The layers of PEs could work in sequence, or they could work in parallel (Figure 8.3).

Figure 8.3: Model for a multi-layer ANN

3. The processing logic of each neuron may assign different weights to the various incoming input streams. The processing logic may also use a non-linear transformation, such as a sigmoid function, from the processed values to the output value. This processing logic and the intermediate weights and processing functions are just what works for the system as a whole, in its objective of solving a problem collectively. Thus, neural networks are considered to be opaque, black-box systems.

4. The neural network can be trained by making similar decisions over and over again with many training cases. It will continue to learn by adjusting its internal computation and communication based on feedback about its previous decisions. Thus, neural networks become better at making a decision as they handle more and more decisions.

Depending upon the nature of the problem and the availability of good training data, at some point the neural network will learn enough and begin to match the predictive accuracy of a human expert. In many practical situations, the predictions of ANNs trained over a long period of time with a large amount of training data have become decisively more accurate than those of human experts. At that point an ANN can be seriously considered for deployment in real situations in real time.


Representation of a Neural Network

A neural network is a series of neurons that receive inputs from other neurons. They do a weighted summation of all the inputs, using different weights (or importance) for each input. The weighted sum is then transformed into an output value using a transfer function.
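The weighted summation and transfer step can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, and the optional bias term are assumptions introduced here, and the sigmoid is just one possible transfer function.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, then a sigmoid transfer function.

    The bias term is an assumption of this sketch; set it to 0 to ignore it.
    """
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Hypothetical values: three inputs and the importance given to each.
x = [0.5, 0.2, 0.9]
w = [0.4, -0.6, 0.3]
y = neuron_output(x, w)
print(round(y, 3))
```

A negative weight, as on the second input, lets an input push the output down rather than up.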

Learning in an ANN occurs when the various processing elements in the neural network adjust the underlying relationship (weights, transfer function, etc.) between inputs and outputs, in response to the feedback on their predictions. If the prediction made was correct, then the weights would remain the same, but if the prediction was incorrect, then the parameter values would change.
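One simple instance of this feedback idea is the classic perceptron learning rule, sketched below for a single neuron. It is not the back-propagation procedure used in multi-layer networks; the training cases (a logical AND), the learning rate, and the function names are assumptions chosen for illustration.

```python
def predict(weights, bias, x):
    """Step transfer function: output 1 if the weighted sum is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1.0):
    """Perceptron rule: change the weights only after a wrong prediction."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # 0 when correct
            if error != 0:
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
                bias += rate * error
    return weights, bias

# Hypothetical training cases: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(samples)
predictions = [predict(weights, bias, x) for x, _ in samples]
print(predictions)
```

When a prediction is correct the error is zero and nothing changes; only mistakes move the weights, which matches the description above.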

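The Transformation (Transfer) Function is any function suitable for the task at hand. The transfer function for ANNs is usually a non-linear sigmoid function, which maps the computed value smoothly into the range 0 to 1. A simpler alternative is a step (threshold) function: if the normalized computed value is less than some cut-off value (say 0.5), then the output value will be zero; if the computed value is at or above the cut-off threshold, then the output value will be 1. It could also be a nonlinear hyperbolic (tanh) function, in which the output ranges between -1 and 1. Many other functions could be designed for any or all of the processing elements.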
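To make the difference between these transfer functions concrete, the sketch below evaluates a step function, a sigmoid, and a hyperbolic tangent on a few sample values. The 0.5 cut-off follows the text; the sample values themselves are arbitrary.

```python
import math

def step(z, threshold=0.5):
    """Hard cut-off: 0 below the threshold, 1 at or above it."""
    return 1 if z >= threshold else 0

def sigmoid(z):
    """Smooth S-shaped curve with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent with outputs strictly between -1 and 1."""
    return math.tanh(z)

for z in (-2.0, 0.0, 0.5, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3))
```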

Thus, in a neural network, every processing element can potentially have a different number of input values, a different set of weights for those inputs, and a different transformation function. Those values support and compensate for one another until the neural network as a whole learns to provide the correct output, as desired by the user.


Architecting a Neural Network

There are many ways to architect the functioning of an ANN using fairly simple and open rules, with a tremendous amount of flexibility at each stage. The most popular architecture is a feed-forward, multi-layered perceptron with a back-propagation learning algorithm. That means there are multiple layers of PEs in the system, the outputs of neurons are fed forward to the PEs in the next layers, and the feedback on the prediction is fed back into the neural network for learning to occur. This is essentially what was described in the earlier paragraphs. ANN architectures for different applications are shown in Table 8.1.

Application               ANN architectures
Classification            Feedforward networks (MLP), radial basis function, and probabilistic
Regression                Feedforward networks (MLP), radial basis function
Clustering                Adaptive resonance theory (ART), Self-organizing maps (SOMs)
Association Rule Mining   Hopfield networks

Table 8.1: ANN architectures for different applications


Developing an ANN

It takes resources, training data, skill and time to develop a neural network. Most data mining platforms offer at least the Multi-Layer Perceptron (MLP) algorithm to implement a neural network. Other neural network architectures include probabilistic networks and self-organizing feature maps.

The steps required to build an ANN are as follows:

1. Gather data. Divide it into training data and test data. The training data needs to be further divided into training data and validation data.

2. Select the network architecture, such as a feedforward network.

3. Select the algorithm, such as Multi-Layer Perceptron.

4. Set network parameters.

5. Train the ANN with training data.

6. Validate the model with validation data.

7. Freeze the weights and other parameters.

8. Test the trained network with test data.

9. Deploy the ANN when it achieves good predictive accuracy.

Training an ANN requires that the available data be split into three parts (Table 8.2):

Training set: This data set is used to adjust the weights on the neural network (∼60%).

Validation set: This data set is used to minimize overfitting and to verify accuracy (∼20%).

Testing set: This data set is used only for testing the final solution, in order to confirm the actual predictive power of the network (∼20%).

k-fold cross-validation: This approach means that the data is divided into k equal pieces, and the learning process is repeated k times, with each piece in turn held out for validation while the remaining pieces form the training set. This process leads to less bias and more accuracy, but is more time consuming.

Table 8.2: ANN training data sets
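The split and the k-fold rotation can be sketched as follows. The record list is a stand-in for real labelled cases, and the function names and the fixed random seed are assumptions made for this illustration.

```python
import random

def split_dataset(records, seed=0):
    """Shuffle, then cut into roughly 60% / 20% / 20% pieces."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 6 // 10
    valid_end = n * 8 // 10
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

def k_fold(records, k):
    """Yield (training, validation) pairs; each fold validates exactly once."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation

records = list(range(100))  # stand-in for 100 labelled cases
train, valid, test = split_dataset(records)
print(len(train), len(valid), len(test))  # 60 20 20

fold_sizes = [(len(tr), len(va)) for tr, va in k_fold(records, 5)]
print(fold_sizes)
```

With k = 5, every case is used for validation once and for training four times, which is why the process costs roughly k times as much computation as a single split.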


Advantages and Disadvantages of using ANNs

There are many benefits of using ANNs.

1. ANNs impose very few restrictions on their use. ANNs can deal with (identify/model) highly nonlinear relationships on their own, without much work from the user or analyst. They help find practical data-driven solutions where algorithmic solutions are non-existent or too complicated.

2. There is no need to program neural networks, as they learn from examples. They get better with use, without much programming effort.

3. They can handle a variety of problem types, including classification, clustering, associations, etc.

4. ANNs are tolerant of data quality issues and they do not restrict the data to follow strict normality and/or independence assumptions.

5. They can handle both numerical and categorical variables.

6. ANNs can be much faster than other techniques.

7. Most importantly, they usually provide better results (prediction and/or clustering) compared to statistical counterparts, once they have been trained enough.

The key disadvantages arise from the fact that they are not easy to interpret, explain, or compute.

1. They are deemed to be black-box solutions, lacking explainability. Thus they are difficult to communicate about, except through the strength of their results.

2. Optimal design of an ANN is still an art: it requires expertise and extensive experimentation.

3. It can be difficult to handle a large number of variables (especially rich nominal attributes).

4. It takes large data sets to train an ANN.


Conclusion

Artificial neural networks are complex systems that mirror the functioning of the human brain. They are versatile enough to solve many data mining tasks with high accuracy. However, they are like black boxes and they provide little guidance on the intuitive logic behind their predictions.


Review Exercises

1: What is a neural network? How does it work?

2: Compare a neural network with a decision tree.

3: What makes a neural network versatile enough for supervised as well as non-supervised learning tasks?

4: Examine the steps in developing a neural network for predicting stock prices. What kind of objective function and what kind of data would be required for a good stock price predictor system using ANN?

***


Chapter 9: Cluster Analysis

Cluster analysis is used for automatic identification of natural groupings of things. It is also known as the segmentation technique. In this technique, data instances that are similar to (or near) each other are categorized into one cluster. Similarly, data instances that are very different (or far away) from each other are moved into different clusters.

Clustering is an unsupervised learning technique, as there is no output or dependent variable for which a right or wrong answer can be computed. The correct number of clusters or the definition of those clusters is not known ahead of time. Clustering techniques can only suggest to the user how many clusters would make sense from the characteristics of the data. The user can specify a different, larger or smaller, number of desired clusters based on what makes business sense. The cluster analysis technique will then define that many distinct clusters from analysis of the data, with a definition for each of those clusters. However, some cluster definitions are better than others, depending on how closely the cluster parameters fit the data.


Caselet: Cluster Analysis

A national insurance company distributes its personal and small commercial insurance products through independent agents. They wanted to increase their sales by better understanding their customers. They were interested in increasing their market share by doing some direct marketing campaigns, however without creating a channel conflict with the independent agents. They were also interested in examining different customer segments based on their needs, and the profitability of each of those segments.

They gathered attitudinal, behavioral, and demographic data using a mail survey of 2000 U.S. households that own auto insurance. Additional geo-demographic and credit information was added to the survey data. Cluster analysis of the data revealed five roughly equal segments:

Non-Traditionals: interested in using the Internet and/or buying insurance at work.

Direct Buyers: interested in buying via direct mail or telephone.

Budget Conscious: interested in minimal coverage and finding the best deal.

Agent Loyals: expressed strong loyalty to their agents and high levels of personal service.

Hassle-Free: similar to Agent Loyals but less interested in face-to-face service.

(Source: greenbook.org)

Q1. Which customer segments would you choose for direct marketing? Will these create a channel conflict?

Q2. Could this segmentation apply to other service businesses? Which ones?


Applications of Cluster Analysis

Cluster analysis is used in almost every field where there is a large variety of transactions. It helps provide characterization, definition, and labels for populations. It can help identify natural groupings of customers, products, patients, and so on. It can also help identify outliers in a specific domain and thus decrease the size and complexity of problems. A prominent business application of cluster analysis is in market research. Customers are segmented into clusters based on their characteristics: wants and needs, geography, price sensitivity, and so on. Here are some examples of clustering:

1. Market Segmentation: Categorizing customers according to their similarities, for instance by their common wants and needs and propensity to pay, can help with targeted marketing.

2. Product portfolio: People of similar sizes can be grouped together to make small, medium and large sizes for clothing items.

3. Text Mining: Clustering can help organize a given collection of text documents according to their content similarities into clusters of related topics.


Definition of a Cluster

An operational definition of a cluster is that, given a representation of n objects, find K groups based on a measure of similarity, such that objects within the same group are alike but the objects in different groups are not alike.

However, the notion of similarity can be interpreted in many ways. Clusters can differ in terms of their shape, size, and density. Clusters are patterns, and there can be many kinds of patterns. Some clusters are the traditional types, such as data points hanging together. However, there are other clusters, such as all points representing the circumference of a circle. There may be concentric circles with points of different circles representing different clusters. The presence of noise in the data makes the detection of the clusters even more difficult.

An ideal cluster can be defined as a set of points that is compact and isolated. In reality, a cluster is a subjective entity whose significance and interpretation requires domain knowledge. In the sample data below (Figure 9.1), how many clusters can one visualize?

Figure 9.1: Visual cluster example

It seems like there are two clusters of approximately equal sizes. However, they can be seen as three clusters, depending on how we draw the dividing lines. There is not a truly optimal way to calculate it. Heuristics are often used to define the number of clusters.


Representing Clusters

The clusters can be represented by a central or modal value. A cluster can be represented by the centroid of the collection of points belonging to it. A centroid is a measure of central tendency. It is the point from which the sum total of squared distances to all the points is the minimum. A real-life equivalent would be the city center, as the point that is considered the easiest to reach for all constituents of the city. Thus all cities are defined by their centers or downtown areas.

A cluster can also be represented by the most frequently occurring value in the cluster, i.e. the cluster can be defined by its modal value. Thus, a particular cluster representing a social point of view could be called the 'soccer moms', even though not all members of that cluster need currently be a mom with soccer-playing children.


Clustering Techniques

Cluster analysis is a machine-learning technique. The quality of a clustering result depends on the algorithm, the distance function, and the application. First, consider the distance function. Most cluster analysis methods use a distance measure to calculate the closeness between pairs of items. There are two major measures of distance. Euclidean distance ("as the crow flies", or straight line) is the most intuitive measure. The other popular measure is the Manhattan (rectilinear) distance, where one can go only in orthogonal directions. The Euclidean distance is the hypotenuse of a right triangle, while the Manhattan distance is the sum of the two legs of the right triangle.
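A small sketch of the two distance measures follows. The two points are hypothetical, chosen so that the legs of the right triangle are 3 and 4.

```python
import math

def euclidean(p, q):
    """Straight-line distance: the hypotenuse of the right triangle."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear distance: the sum of the legs of the right triangle."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)   # hypothetical points; the legs are 3 and 4
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

The Manhattan distance is never shorter than the Euclidean distance between the same two points, so the choice of measure can change which items count as neighbors.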

In either case, the key objective of the clustering algorithm is the same:

- Inter-cluster distance → maximized; and

- Intra-cluster distance → minimized

There are many algorithms to produce clusters. There are top-down, hierarchical methods that start with creating a given number of best-fitting clusters. There are also bottom-up methods that begin with identifying naturally occurring clusters.

The most popular clustering algorithm is the K-means algorithm. It is a top-down, statistical technique, based on minimizing the squared distance of the data points from the center points of their clusters. Other techniques, such as neural networks, are also used for clustering. Comparing cluster algorithms is a difficult task, as there is no single right number of clusters. However, the speed of the algorithm and its versatility in terms of different data sets are important criteria.

Here is the generic pseudocode for clustering:

1. Pick an arbitrary number of groups/segments to be created

2. Start with some initial randomly-chosen center values for groups

3. Classify instances to closest groups

4. Compute new values for the group centers

5. Repeat steps 3 and 4 till groups converge

6. If clusters are not satisfactory, go to step 1 and pick a different number of groups/segments

The clustering exercise can be continued with a different number of clusters and different locations of the starting points. Clusters are considered good if the cluster definitions stabilize, and the stabilized definitions prove useful for the purpose at hand. Else, repeat the clustering exercise with a different number of clusters, and different starting points for group means.
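Steps 2 to 5 of the pseudocode can be sketched in plain Python as follows. The ten points are the ones used in the clustering exercise in this chapter; the function names, the fixed random seed, and the cap on the number of rounds are assumptions of this sketch rather than part of any particular tool.

```python
import random

def closest(point, centers):
    """Index of the center nearest to the point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def cluster(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 2: random starting centers
    for _ in range(max_rounds):
        labels = [closest(p, centers) for p in points]        # step 3
        new_centers = []
        for i in range(k):                                    # step 4
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(mean(members) if members else centers[i])
        if new_centers == centers:                            # step 5: converged
            break
        centers = new_centers
    labels = [closest(p, centers) for p in points]
    return centers, labels

points = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]
centers, labels = cluster(points, k=3)
print(centers)
print(labels)
```

Changing `seed` or `k` corresponds to step 6: rerun with different starting points or a different number of groups and compare the resulting definitions.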


Clustering Exercise

Here is a simple exercise to visually and intuitively identify clusters from data. X and Y are two dimensions of interest. The objective is to determine the number of clusters, and the center points of those clusters.

X    Y
2    4
2    6
5    6
4    7
8    3
6    6
5    2
5    7
6    3
4    4

A scatter plot of 10 items in 2 dimensions shows them distributed fairly randomly. As a bottom-up technique, the number of clusters and their centroids can be intuited (Figure 9.2).


Figure 9.2: Initial data points and the centroid (shown as thick dot)

The points are distributed randomly enough that it could be considered one cluster. The solid circle would represent the central point (centroid) of these points.

However, there is a big distance between the points (2,6) and (8,3). So, this data could be broken into 2 clusters. The three points at the bottom right could form one cluster and the other seven could form the other cluster. The two clusters would look like this (Figure 9.3). The two circles will be the new centroids.

Figure 9.3: Dividing into two clusters (centroids shown as thick dots)

The bigger cluster seems too far apart. So, it seems like the 4 points on the top will form a separate cluster. The three clusters could look like this (Figure 9.4).

Figure 9.4: Dividing into three clusters (centroids shown as thick dots)

This solution has three clusters. The cluster on the right is far from the other two clusters. However, its centroid is not too close to all the data points. The cluster at the top looks very tight-fitting, with a nice centroid. The third cluster, at the left, is spread out and may not be of much usefulness.

This was a bottom-up exercise in visually producing three best-fitting cluster definitions from the given data. The right number of clusters will depend on the data and the application for which the data would be used.
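The centroids of the three visually chosen clusters can be checked with a short calculation. The membership of each cluster below is an assumption read off the description (the three bottom-right points, the four top points, and the remaining three on the left); the figures themselves are not reproduced here.

```python
def centroid(points):
    """Average of the x values and average of the y values."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Assumed membership, based on the description of the three clusters.
right = [(8, 3), (5, 2), (6, 3)]
top = [(5, 6), (4, 7), (6, 6), (5, 7)]
left = [(2, 4), (2, 6), (4, 4)]

for name, members in (("right", right), ("top", top), ("left", left)):
    cx, cy = centroid(members)
    print(name, round(cx, 2), round(cy, 2))
```

Under this assumed grouping the top cluster's centroid is (5.0, 6.5), with every member within about one unit of it, which is consistent with the observation that it is the tightest of the three.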


K-Means Algorithm for Clustering

K-means is the most popular clustering algorithm. It iteratively computes the clusters and their centroids. It is a top-down approach to clustering. It starts with a given number of clusters K, say 3. Three random centroids are then created as starting points for the centers of the three clusters. The circles are the initial cluster centroids (Figure 9.5).

Figure 9.5: Randomly assigning three centroids for three data clusters

Step 1: For each data point, the distance from each of the three centroids is computed. The data point is assigned to the cluster whose centroid is closest. All data points are thus assigned to one centroid or another (Figure 9.6). The arrow from each data point shows the centroid that the point is assigned to.

Figure 9.6: Assigning data points to closest centroid

Step 2: The centroid for each cluster is now recalculated such that it is closest to all the data points allocated to that cluster. The dashed arrows show the centroids being moved from their old (shaded) values to the revised new values (Figure 9.7).

Figure 9.7: Recomputing centroids for each cluster

Step 3: Once again, each data point is assigned to the closest of the three centroids (Figure 9.8).

Figure 9.8: Assigning data points to recomputed centroids

The new centroids are computed from the data points in each cluster, and the process repeats until finally the centroids stabilize in their locations. These are the three clusters computed by this algorithm.

Figure 9.9: Recomputing centroids for each cluster till clusters stabilize

The three clusters shown are: a 3-data point cluster with centroid (6.5,4.5), a 2-data point cluster with centroid (4.5,3) and a 5-data point cluster with centroid (3.5,3) (Figure 9.9).

These cluster definitions are different from the ones derived visually. This is a function of the random starting centroid values. The centroid points used earlier in the visual exercise were different from those chosen with the K-means clustering algorithm. The K-means clustering exercise should, therefore, be run again with this data, but with new random centroid starting values. With many runs, the cluster definitions are likely to stabilize. If the cluster definitions do not stabilize, that may be a sign that the number of clusters chosen is too high or too low. The algorithm should also be run with different values of K.


Selecting the Number of Clusters

The correct choice of the value of K is often ambiguous. It depends on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. Heuristics are needed to pick the right number. One can graph the percentage of variance explained by the clusters against the number of clusters (Figure 9.10). The first clusters will add more information (explain a lot of variance), but at some point the marginal gain in variance will fall, giving a sharp angle to the graph, looking like an elbow. Beyond that elbow point, adding more clusters will not add much incremental value. That would be the desired K.

Figure 9.10: Elbow method for determining number of clusters in a data set

To engage with the data and to understand the clusters better, it is often better to start with a small number of clusters such as 2 or 3, depending upon the data set and the application domain. The number can be increased subsequently, as needed from an application point of view. This helps understand the data and the clusters progressively better.
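The elbow heuristic can be sketched by computing the share of variance explained for each value of K. The code below is a minimal illustration on the ten points from the clustering exercise; it uses a bare-bones K-means with random restarts, and all function names, the number of restarts, and the seed are assumptions of this sketch.

```python
import random

POINTS = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, rng, rounds=50):
    """Run a basic K-means and return the within-cluster sum of squares."""
    centers = rng.sample(points, k)
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def explained_variance(points, max_k, restarts=20, seed=0):
    """Share of total variance explained, for K = 1 .. max_k."""
    rng = random.Random(seed)
    total = kmeans_sse(points, 1, rng)   # one cluster explains nothing
    curve = {}
    for k in range(1, max_k + 1):
        best = min(kmeans_sse(points, k, rng) for _ in range(restarts))
        curve[k] = 1 - best / total
    return curve

curve = explained_variance(POINTS, max_k=6)
for k, share in curve.items():
    print(k, f"{share:.0%}")
```

Plotting these shares against K and looking for the point where the gains flatten gives the elbow. With so few points the curve is only a rough guide.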


Advantages and Disadvantages of K-Means Algorithm

There are many advantages of the K-Means algorithm:

1. The K-Means algorithm is simple, easy to understand and easy to implement.

2. It is also efficient, in that the time taken to cluster with K-means rises linearly with the number of data points.

3. No other clustering algorithm performs better than K-Means, in general.

There are a few disadvantages too:

1. The user needs to specify an initial value of K.

2. The process of finding the clusters may not converge to the same clusters on every run, since the result depends on the starting centroids.

3. It is not suitable for discovering cluster shapes that are not hyper-ellipsoids (or hyper-spheres).

Neural networks can also be deployed for clustering, using the appropriate objective function. The neural network will produce the appropriate cluster centroids and cluster population for each cluster.


Conclusion

Cluster analysis is a useful, unsupervised learning technique that is used in many business situations to segment the data into meaningful small groups. The K-Means algorithm is an easy statistical technique to iteratively segment the data. However, there is only a heuristic technique to select the right number of clusters.


Review Exercises

1: What is unsupervised learning? When is it used?

2: Describe three business applications in your industry where cluster analysis will be useful.

3: Data about height and weight for a few volunteers is available. Create a set of clusters for the following data, to decide how many sizes of T-shirts should be ordered.

Height   Weight
71       165
68       165
72       180
67       113
72       178
62       101
70       150
69       172
72       185
63       149
69       132
61       115


Liberty Stores Case Exercise: Step 7

Liberty wants to find a suitable number of segments for its customers, for targeted marketing. Here is a list of representative customers.

Cust   # of transactions   Total Purchase ($)   Income ($K)
1      5                   450                  90
2      10                  800                  82
3      15                  900                  77
4      2                   50                   30
5      18                  900                  60
6      9                   200                  45
7      14                  500                  82
8      8                   300                  22
9      7                   250                  90
10     9                   1000                 80
11     1                   30                   60
12     6                   700                  80

1. What is the right number of clusters for Liberty?

2. What are the centroids for the clusters?


Chapter 10: Association Rule Mining

Association rule mining is a popular, unsupervised learning technique, used in business to help identify shopping patterns. It is also known as market basket analysis. It helps find interesting relationships (affinities) between variables (items or events). Thus, it can help cross-sell related items and increase the size of a sale.

All data used in this technique is categorical. There is no dependent variable. It uses machine learning algorithms. The fascinating "relationship between sales of diapers and beers" is how it is often explained in popular literature. This technique accepts as input the raw point-of-sale transaction data. The output produced is the description of the most frequent affinities among items. An example of an association rule would be, "A customer who bought a flight ticket and a hotel reservation also bought a rental car plan 60 percent of the time."


Caselet: Netflix: Data Mining in Entertainment

Netflix suggestions and recommendation engines are powered by a suite of algorithms using data about millions of customer ratings about thousands of movies. Most of these algorithms are based on the premise that similar viewing patterns represent similar user tastes. This suite of algorithms, called CineMatch, instructs Netflix's servers to process information from its databases to determine which movies a customer is likely to enjoy. The algorithm takes into account many factors about the films themselves, the customers' ratings, and the combined ratings of all Netflix users. The company estimates that a whopping 75 percent of viewer activity is driven by recommendations. According to Netflix, these predictions were valid around 75 percent of the time, and half of Netflix users who rented CineMatch-recommended movies gave them a five-star rating.

To make matches, a computer

1. Searches the CineMatch database for people who have rated the same movie, for example, "The Return of the Jedi".

2. Determines which of those people have also rated a second movie, such as "The Matrix".

3. Calculates the statistical likelihood that people who liked "Return of the Jedi" will also like "The Matrix".

4. Continues this process to establish a pattern of correlations between subscribers' ratings of many different films.

Netflix launched a contest in 2006 to find an algorithm that could beat CineMatch. The contest, called the Netflix Prize, promised $1 million to the first person or team to meet the accuracy goals for recommending movies based on users' personal preferences. Each of these algorithm submissions was required to demonstrate a 10 percent improvement over CineMatch. Three years later, the $1 million prize was awarded to a team of seven people. (Source: http://electronics.howstuffworks.com)

Q1: Are Netflix customers being manipulated into seeing what Netflix wants them to see?

Q2: Compare this story with Amazon's personalization engine.


Business Applications of Association Rules

In business environments a pattern or knowledge can be used for many purposes. In sales and marketing, it is used for cross-marketing and cross-selling, catalog design, e-commerce site design, online advertising optimization, product pricing, and sales/promotion configurations. This analysis can suggest not putting one item on sale at a time, and instead creating a bundle of products promoted as a package to sell other non-selling items.

In retail environments, it can be used for store design. Strongly associated items can be kept close together for customer convenience. Or they could be placed far from each other so that the customer has to walk the aisles and by doing so is potentially exposed to other items.

In medicine, this technique can be used for relationships between symptoms and illnesses; diagnosis and patient characteristics/treatments; genes and their functions; etc.


Representing Association Rules

A generic association rule is represented between a set X and a set Y: X → Y [S%, C%]

X, Y: products and/or services

X: Left-hand side (LHS)

Y: Right-hand side (RHS)

S: Support: how often X and Y go together in the data set, i.e. P(X ∪ Y)

C: Confidence: how often Y is found, given X, i.e. P(Y | X)

Example: {Hotel booking, Flight booking} → {Rental Car} [30%, 60%]

[Note: P(X) is the mathematical representation of the probability or chance of X occurring in the data set.]

Computation example:

Suppose there are 1000 transactions in a data set. There are 300 occurrences of X, and 150 occurrences of (X, Y) in the data set.

Support S for X → Y will be P(X ∪ Y) = 150/1000 = 15%.

Confidence for X → Y will be P(Y | X), or P(X ∪ Y) / P(X) = 150/300 = 50%.
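The same computation can be written as two one-line functions. The counts are the ones from the computation example; the function and variable names are illustrative.

```python
def support(count_xy, total):
    """Share of all transactions that contain both X and Y."""
    return count_xy / total

def confidence(count_xy, count_x):
    """Share of the transactions containing X that also contain Y."""
    return count_xy / count_x

total, count_x, count_xy = 1000, 300, 150
print(f"{support(count_xy, total):.0%}")       # 15%
print(f"{confidence(count_xy, count_x):.0%}")  # 50%
```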


Algorithms for Association Rules

Not all association rules are interesting and useful, only those that are strong and that occur frequently. In association rule mining, the goal is to find all rules that satisfy the user-specified minimum support and minimum confidence. The resulting set of rules is the same irrespective of the algorithm used; that is, given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.

Fortunately, there is a large number of algorithms available for generating association rules. The most popular algorithms are Apriori, Eclat, and FP-Growth, along with various derivatives and hybrids of the three. All the algorithms help identify the frequent itemsets, which are then converted to association rules.


Apriori Algorithm

This is the most popular algorithm used for association rule mining. The objective is to find subsets of items that are common to at least a minimum number of the transactions. A frequent itemset is an itemset whose support is greater than or equal to the minimum support threshold. The Apriori property is a downward closure property, which means that any subset of a frequent itemset is also a frequent itemset. Thus, if (A,B,C,D) is a frequent itemset, then any subset such as (A,B,C) or (B,D) is also a frequent itemset.

It uses a bottom-up approach, and the size of frequent subsets is gradually increased, from one-item subsets to two-item subsets, then three-item subsets, and so on. Groups of candidates at each level are tested against the data for minimum support.


Association Rules Exercise

Here are a dozen sales transactions. There are six products being sold: Milk, Bread, Butter, Eggs, Cookies, and Ketchup. Transaction #1 sold Milk, Eggs, Bread and Butter. Transaction #2 sold Milk, Butter, Eggs and Ketchup. And so on. The objective is to use this transaction data to find affinities between products, i.e. which products sell together often.

The support level will be set at 33 percent; the confidence level will be set at 50 percent. That means that we have decided to consider rules from only those itemsets that occur at least 33 percent of the time in the total set of transactions. Confidence level means that within those itemsets, the rules of the form X → Y should be such that there is at least a 50 percent chance of Y occurring based on X occurring.

Transactions List

1     Milk, Egg, Bread, Butter
2     Milk, Butter, Egg, Ketchup
3     Bread, Butter, Ketchup
4     Milk, Bread, Butter
5     Bread, Butter, Cookies
6     Milk, Bread, Butter, Cookies
7     Milk, Cookies
8     Milk, Bread, Butter
9     Bread, Butter, Egg, Cookies
10    Milk, Butter, Bread
11    Milk, Bread, Butter
12    Milk, Bread, Cookies, Ketchup

The first step is to compute 1-item itemsets, i.e. how often each product sells individually.

1-item Sets   Freq
Milk          9
Bread         10
Butter        10
Egg           3
Ketchup       3
Cookies       5

Thus, Milk sells in 9 out of 12 transactions. Bread sells in 10 out of 12 transactions. And so on.

At every point, there is an opportunity to select itemsets of interest for further analysis. Other itemsets that occur very infrequently may be removed. If itemsets that occur 4 or more times out of 12 are selected, that corresponds to meeting a minimum support level of 33 percent (4 out of 12). Only 4 items make the cut. The frequent items that meet the support level of 33 percent are:

Frequent 1-item Sets   Freq
Milk                   9
Bread                  10
Butter                 10
Cookies                5

The next step is to go for the next level of itemsets using the items selected earlier: 2-item itemsets.

2-item Sets       Freq
Milk, Bread       7
Milk, Butter      7
Milk, Cookies     3
Bread, Butter     9
Butter, Cookies   3
Bread, Cookies    4

Thus (Milk, Bread) sell together 7 times out of 12. (Milk, Butter) sell together 7 times, (Bread, Butter) sell together 9 times, and (Bread, Cookies) sell together 4 times.

However, only four of these itemsets meet the minimum support level of 33%.

2-item Sets      Freq
Milk, Bread      7
Milk, Butter     7
Bread, Butter    9
Bread, Cookies   4

The next step is to list the next higher level of itemsets: 3-item itemsets.

3-item Sets              Freq
Milk, Bread, Butter      6
Milk, Bread, Cookies     2
Bread, Butter, Cookies   3

Thus (Milk, Bread, Butter) sell together 6 times out of 12. (Milk, Bread, Cookies) sell together 2 times (transactions 6 and 12), and (Bread, Butter, Cookies) sell together 3 times out of 12. Only one 3-item itemset meets the minimum support requirement.

3-item Sets           Freq
Milk, Bread, Butter   6

There is no room to create a 4-item itemset for this support level.
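The level-by-level counting above can be reproduced with a short script. This is a minimal sketch of the Apriori idea on the twelve transactions, not an optimized implementation; the function names are illustrative. Unlike the hand calculation, it prunes a candidate as soon as one of its subsets is infrequent, so (Milk, Bread, Cookies) and (Bread, Butter, Cookies) are never counted.

```python
from itertools import combinations

TRANSACTIONS = [set(line.split()) for line in (
    "Milk Egg Bread Butter", "Milk Butter Egg Ketchup", "Bread Butter Ketchup",
    "Milk Bread Butter", "Bread Butter Cookies", "Milk Bread Butter Cookies",
    "Milk Cookies", "Milk Bread Butter", "Bread Butter Egg Cookies",
    "Milk Butter Bread", "Milk Bread Butter", "Milk Bread Cookies Ketchup",
)]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)

def frequent_itemsets(min_count):
    """Level-wise search: grow itemsets one item at a time, pruning by support."""
    items = sorted(set().union(*TRANSACTIONS))
    level = [frozenset([i]) for i in items if count([i]) >= min_count]
    found = {}
    size = 1
    while level:
        for s in level:
            found[s] = count(s)
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori property: every (size - 1)-item subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in found for sub in combinations(c, size - 1))
                 and count(c) >= min_count]
    return found

frequent = frequent_itemsets(min_count=4)   # 4 of 12 transactions = 33% support
for itemset, freq in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), freq)
```

Counting directly with `count(["Milk", "Bread", "Cookies"])` gives 2, which is still below the threshold of 4.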


Creating Association Rules

The most interesting and complex rules come from the larger itemsets, so rule creation starts top-down with the most frequent itemsets of the largest size. Association rules are created that meet the support level (at least 33%) and the confidence level (at least 50%).

The highest level itemset that meets the support requirements is the three-item itemset. The following itemset has a support level of 50% (6 out of 12).

Milk, Bread, Butter   6

This itemset could lead to multiple candidate association rules.

Start with the following rule: (Bread, Butter) → Milk.

There are a total of 12 transactions.

X (in this case Bread, Butter) occurs 9 times;

X, Y (in this case Bread, Butter, Milk) occurs 6 times.

The support level for this rule is 6/12 = 50%. The confidence level for this rule is 6/9 = 67%. This rule meets our thresholds for support (at least 33%) and confidence (at least 50%).

Thus, the first valid association rule from this data is: (Bread, Butter) → Milk {S=50%, C=67%}.

In exactly the same way, other rules can be considered for their validity.

Consider the rule: (Milk, Bread) → Butter. Out of a total of 12 transactions, (Milk, Bread) occurs 7 times, and (Milk, Bread, Butter) occurs 6 times. The confidence is therefore 6/7 = 86%.

Thus, the second valid association rule from this data is (Milk, Bread) → Butter {S=50%, C=86%}.

Consider the rule (Milk, Butter) → Bread. Out of a total of 12 transactions, (Milk, Butter) occurs 7 times while (Milk, Butter, Bread) occurs 6 times. The confidence is again 6/7 = 86%.

Thus, the next valid association rule is: (Milk, Butter) → Bread {S=50%, C=86%}.

Thus, there were only three possible rules with a two-item left-hand side at the 3-item itemset level, and all were found to be valid.

One can go to the next lower level and generate association rules at the 2-item itemset level.

Consider the rule Milk → Bread. Out of a total of 12 transactions, Milk occurs 9 times while (Milk, Bread) occurs 7 times. The support is 7/12 = 58% and the confidence is 7/9 = 78%.

Thus, the next valid association rule is:

Milk → Bread {S=58%, C=78%}.

Many such rules could be derived if needed.

Not all such association rules are interesting. The client may be interested in only the top few rules that they want to implement. The number of association rules depends upon business need. Implementing every rule in business will require some cost and effort, with some potential of gains. The strongest of rules, with the higher support and confidence rates, should be used first, and the others should be progressively implemented later.
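Rule generation from a frequent itemset can be sketched as follows. The code enumerates every way of splitting the itemset (Milk, Bread, Butter) into a left-hand side and a right-hand side, including rules with a single item on the left, and keeps those that reach the 50 percent confidence threshold. Function names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [set(line.split()) for line in (
    "Milk Egg Bread Butter", "Milk Butter Egg Ketchup", "Bread Butter Ketchup",
    "Milk Bread Butter", "Bread Butter Cookies", "Milk Bread Butter Cookies",
    "Milk Cookies", "Milk Bread Butter", "Bread Butter Egg Cookies",
    "Milk Butter Bread", "Milk Bread Butter", "Milk Bread Cookies Ketchup",
)]

def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in TRANSACTIONS if set(items) <= t)

def rules(itemset, min_conf):
    """All rules LHS -> RHS from one itemset that meet the confidence threshold."""
    ordered = sorted(itemset)
    supp = count(ordered) / len(TRANSACTIONS)
    out = []
    for size in range(1, len(ordered)):
        for lhs in combinations(ordered, size):
            rhs = tuple(i for i in ordered if i not in lhs)
            conf = count(ordered) / count(lhs)
            if conf >= min_conf:
                out.append((lhs, rhs, supp, conf))
    return out

found = rules({"Milk", "Bread", "Butter"}, min_conf=0.5)
for lhs, rhs, supp, conf in found:
    print(f"{lhs} -> {rhs}  S={supp:.0%}  C={conf:.0%}")
```

Sorting the output by confidence, then by support, is one simple way to pick the strongest rules to implement first.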

202

Conclusion

Association rules help discover affinities between products in transactions. They help make cross-selling recommendations much more targeted and effective. The Apriori technique is the most popular technique for mining association rules, and it is a machine learning technique.

Review Exercises

Q1: What are association rules? How do they help?

Q2: How many association rules should be used?

Liberty Stores Case Exercise: Step 8

Here is a list of transactions from Liberty’s stores. Create association rules for the following data, with a 33% support level and a 66% confidence level.

1 | A | B | C
2 | B | E | F
3 | A | C | E
4 | B | C | F
5 | A | C | E
6 | C | F | G
7 | A | D | F
8 | D | E | F
9 | A | B | D
10 | A | B | C
11 | B | D | E
12 | A | C | D

Section 3

This section covers some additional topics.

Chapter 11 will cover Text Mining, the art and science of generating insights from text. It is very important in the age of social media.

Chapter 12 will cover Web Mining, the art and science of generating insights from the world-wide web, its content and usage. It is very important in the digital age, where a lot of advertising and selling is moving to the web.

Chapter 13 will cover Big Data. This is a new moniker created to describe the phenomenon of large amounts of data being generated from many data sources, which cannot be handled with the traditional data management tools.

Chapter 14 will cover a primer on Data Modeling. This is useful as a ramp-up to data mining, especially for those who have not had much exposure to traditional data management or may need a refresher.

Chapter 11: Text Mining

Text mining is the art and science of discovering knowledge, insights and patterns from an organized collection of textual databases. Text mining can help with frequency analysis of important terms and their semantic relationships.

Text is an important part of the growing data in the world. Social media technologies have enabled users to become producers of text and images and other kinds of information. Text mining can be applied to large-scale social media data for gathering preferences and measuring emotional sentiments. It can also be applied at societal, organizational and individual scales.

Caselet: WhatsApp and Private Security

Do you think that what you post on social media remains private? Think again. A new dashboard shows how much personal information is out there, and how companies are able to construct ways to make use of it for commercial benefits. Here is a dashboard of conversations between two people, Jennifer and Nicole, over 45 days.

There is a variety of categories that Nicole and Jennifer speak about, such as computers, politics, laundry, and desserts. The polarity of Jennifer’s personal thoughts and tone is overwhelmingly positive, and Jennifer responds to Nicole much more than vice versa, identifying Nicole as the influencer in their relationship.

The data visualization reveals the waking hours of Jennifer, showing that she is most active around 8:00pm and heads to bed around midnight. 53% of her conversation is about food, and 15% about desserts. Maybe she’s a strategic person to push restaurant or weight loss ads.

The most intimate detail exposed during this conversation is that Nicole and Jennifer discuss right wing populism, radical parties, and conservative politics. It exemplifies that the amount of private information obtained from your WhatsApp conversations is limitless and potentially dangerous.

WhatsApp is the world’s largest messaging service, with over 450 million users. Facebook recently bought this three-year-old company for a whopping $19 billion. People share a lot of sensitive personal information on WhatsApp that they may not even share with their family members.

(Source: What Facebook Knows About You From One WhatsApp Conv, by Adi Azaria, on LinkedIn, April 10, 2014.)

1: What are the business and social implications of this kind of analysis?

2: Are you worried? Should you be worried?

Text mining works on texts from practically any kind of source, from any business or non-business domain, in any format, including Word documents, PDF files, XML files, text messages, etc. Here are some representative examples:

1. In the legal profession, text sources would include laws, court deliberations, court orders, etc.

2. In academic research, it would include texts of interviews, published research articles, etc.

3. The world of finance will include statutory reports, internal reports, CFO statements, and more.

4. In medicine, it would include medical journals, patient histories, discharge summaries, etc.

5. In marketing, it would include advertisements, customer comments, etc.

6. In the world of technology and search, it would include patent applications, the whole of information on the world-wide web, and more.

Text Mining Applications

Text mining is a useful tool in the hands of chief knowledge officers to extract knowledge relevant to an organization. Text mining can be used across industry sectors and application areas, including decision support, sentiment analysis, fraud detection, survey analysis, and many more.

1. Marketing: The voice of the customer can be captured in its native and raw format and then analyzed for customer preferences and complaints.

   1. Social personas are a clustering technique to develop customer segments of interest. Consumer input from social media sources, such as reviews, blogs, and tweets, contains numerous leading indicators that can be used towards anticipating and predicting consumer behavior.

   2. A ‘listening platform’ is a text mining application that, in real time, gathers social media, blogs, and other textual feedback, and filters out the chatter to extract true consumer sentiment. The insights can lead to more effective product marketing and better customer service.

   3. Customer call center conversations and records can be analyzed for patterns of customer complaints. Decision trees can organize this data to create decision choices that could help with product management activities and with becoming proactive in avoiding those complaints.

2. Business operations: Many aspects of business functioning can be accurately gauged from analyzing text.

   1. Social network analysis and text mining can be applied to emails, blogs, social media and other data to measure the emotional states and the mood of employee populations. Sentiment analysis can reveal early signs of employee dissatisfaction, which can then be proactively managed.

   2. Studying people as emotional investors and using text analysis of the social Internet to measure mass psychology can help in obtaining superior investment returns.

3. Legal: In legal applications, lawyers and paralegals can more easily search case histories and laws for relevant documents in a particular case to improve their chances of winning.

   1. Text mining is also embedded in e-discovery platforms that help in minimizing risk in the process of sharing legally mandated documents.

   2. Case histories, testimonies, and client meeting notes can reveal additional information, such as morbidities in a healthcare situation, that can help better predict high-cost injuries and prevent costs.

4. Governance and Politics: Governments can be overturned based on a tweet originating from a self-immolating fruit-vendor in Tunisia.

   1. Social network analysis and text mining of large-scale social media data can be used for measuring the emotional states and the mood of constituent populations. Micro-targeting constituents with specific messages gleaned from social media analysis can be a more efficient use of resources when fighting democratic elections.

   2. In geopolitical security, internet chatter can be processed for real-time information and to connect the dots on any emerging threats.

   3. In academia, research streams could be meta-analyzed for underlying research trends.

Text Mining Process

Text mining is a rapidly evolving area of research. As the amount of social media and other text data grows, there is a need for efficient abstraction and categorization of meaningful information from the text.

The first level of analysis is identifying frequent words. This creates a bag of important words. Texts, whether documents or smaller messages, can then be ranked on how well they match a particular bag-of-words. However, there are challenges with this approach. For example, the words may be spelled a little differently. Or there may be different words with similar meanings.

The next level is that of identifying meaningful phrases from words. Thus ‘ice’ and ‘cream’ will be two different key words that often come together. However, there is a more meaningful phrase formed by combining the two words into ‘ice cream’. There might be similarly meaningful phrases like ‘Apple Pie’.

The next higher level is that of topics. Multiple phrases could be combined into a topic area. Thus the two phrases above could be put into a common bucket, and this bucket could be called ‘Desserts’.
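The three levels of words, phrases, and topics can be illustrated with a short Python sketch. The messages are made up, phrases are approximated as pairs of adjacent words, and the topic definition is written by hand; all of these are assumptions made for the illustration rather than a prescribed method.

```python
import re
from collections import Counter

# Hypothetical messages, made up for illustration.
messages = [
    "I love ice cream after dinner",
    "Apple pie with ice cream is the best",
    "We had apple pie today",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Level 1: bag of words (word frequencies across all messages).
word_counts = Counter(w for m in messages for w in tokenize(m))

# Level 2: phrases, approximated here as pairs of adjacent words.
phrase_counts = Counter()
for m in messages:
    tokens = tokenize(m)
    phrase_counts.update(zip(tokens, tokens[1:]))

# Level 3: topics, defined by hand as groups of phrases (an assumption).
topics = {"Desserts": [("ice", "cream"), ("apple", "pie")]}
topic_counts = {
    topic: sum(phrase_counts[p] for p in phrases)
    for topic, phrases in topics.items()
}

print(word_counts.most_common(3))
print(phrase_counts[("ice", "cream")], topic_counts)
```

Lowercasing in `tokenize` is what lets ‘Apple pie’ and ‘apple pie’ count as the same phrase.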

Text mining is a semi-automated process. Text data needs to be gathered, structured, and then mined, in a 3-step process (Figure 11.1).

Figure 11.1: Text Mining Architecture

1. The text and documents are first gathered into a corpus, and organized.

2. The corpus is then analyzed for structure. The result is a matrix mapping important terms to source documents.

3. The structured data is then analyzed for word structures, sequences, and frequency.

Term Document Matrix

This is the heart of the structuring process. Free-flowing text can be transformed into numeric data in a TDM, which can then be mined using regular data mining techniques.

1. There are several efficient techniques for identifying key terms from a text. The techniques available for creating topics out of them are less efficient. For the purpose of this discussion, keywords, phrases, and topics are all called terms of interest. This approach measures the frequencies of select important terms occurring in each document. This creates a t x d Term-by-Document Matrix (TDM), where t is the number of terms and d is the number of documents (Table 11.1).

2. Creating a TDM requires making choices of which terms to include. The terms chosen should reflect the stated purpose of the text mining exercise. The list of terms should be as extensive as needed, but should not include unnecessary terms that will serve to confuse the analysis or slow the computation.

TermDocumentMatrix

Document/Terms

investment

Profit

happy

Success

…

Doc1

10

4

3

4

Doc2

7

2

2

Doc3

2

6

Doc4

1

5

3

Doc5

6

2

Doc6

4

2

…

Table 11.1: Term-Document Matrix
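A minimal Python sketch of building such a matrix follows. The two documents and the term list are hypothetical and are not the data behind Table 11.1; the function name `build_tdm` is illustrative. Rows are documents and columns are terms, following the layout of the table.

```python
import re
from collections import Counter

# Hypothetical documents and term list, made up for illustration.
documents = {
    "Doc1": "Investment drives profit. Profit makes investors happy.",
    "Doc2": "Success follows investment, and more investment.",
}
terms = ["investment", "profit", "happy", "success"]


def build_tdm(documents, terms):
    """Return {document name: [count of each chosen term]}."""
    tdm = {}
    for name, text in documents.items():
        # Lowercase first so that 'Profit' and 'profit' are counted together.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        tdm[name] = [counts[term] for term in terms]
    return tdm


tdm = build_tdm(documents, terms)
for name, row in tdm.items():
    print(name, row)
```

Only the chosen terms are counted, so words such as ‘investors’ are ignored unless they are first mapped to one of the terms.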


Here are some considerations in creating a TDM.

1. A large collection of documents mapped to a large bag of words will likely lead to a very sparse matrix if they have few common words. Reducing the dimensionality of the data will help improve the speed of analysis and the meaningfulness of the results. Synonyms, or terms with similar meaning, should be combined and counted together as a common term. This would help reduce the number of distinct terms, or ‘tokens’.

2. Data should be cleaned for spelling errors. Common misspellings should be treated as the intended term and counted together with it. Uppercase and lowercase versions of a term should also be combined.

3. When many variants of the same term are used, just the stem of the word would be used to reduce the number of terms. For instance, terms like customer order, ordering, and order data should be combined into a single token word, called ‘Order’.

4. On the other side, homonyms (terms with the same spelling but different meanings) should be counted separately. This would enhance the quality of analysis. For example, the term order can mean a customer order, or the ranking of certain choices. These two should be treated separately. “The boss ordered that the customer orders data analysis be presented in chronological order.” This statement shows three different meanings for the word ‘order’. Thus, there will be a need for a manual review of the TD matrix.

5. Terms with very few occurrences in very few documents should be eliminated from the matrix. This would help increase the density of the matrix and the quality of analysis.

6. The measures in each cell of the matrix could be one of several possibilities. It could be a simple count of the number of occurrences of each term in a document. It could also be the log of that number. It could be the fraction computed by dividing the frequency count by the total number of words in the document. Or there may be binary values in the matrix to represent whether a term is mentioned or not. The choice of value in the cells will depend upon the purpose of the text analysis.

At the end of this analysis and cleansing, a well-formed, densely populated, rectangular TDM will be ready for analysis. The TDM could be mined using all the available data mining techniques.
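The alternative cell measures listed above (count, log, fraction, binary) can be compared side by side in a small Python sketch. The function name `cell_values` is illustrative. One assumption is labeled in the code: the log variant uses log(1 + count) so that a term with zero occurrences maps to 0 instead of being undefined.

```python
import math


def cell_values(count, doc_length):
    """Return the alternative cell measures for one term in one document."""
    return {
        "count": count,                         # raw number of occurrences
        "log": math.log(1 + count),             # assumption: log(1 + count), so 0 stays 0
        "fraction": count / doc_length,         # share of all words in the document
        "binary": 1 if count > 0 else 0,        # mentioned or not
    }


# Example: a term occurring 10 times in a 200-word document.
for measure, value in cell_values(10, 200).items():
    print(measure, round(value, 3))
```

The fraction and log forms dampen the effect of long documents and very frequent terms, which may or may not be desirable depending on the purpose of the analysis.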


Mining the TDM

The TDM can be mined to extract patterns and knowledge. A variety of techniques could be applied to the TDM to extract new knowledge.

Predictors of desirable terms could be discovered through predictive techniques, such as regression analysis. Suppose the word profit is a desirable word in a document. The number of occurrences of the word profit in a document could be regressed against many other terms in the TDM. The relative strengths of the coefficients of various predictor variables would show the relative impact of those terms on creating a profit discussion.

Predicting the chances of a document being liked is another form of analysis. For example, important speeches made by the CEO or the CFO to investors could be evaluated for quality. If the classification of those documents (such as good or poor speeches) was available, then the terms of the TDM could be used to predict the speech class. A decision tree could be constructed that makes a simple tree with a few decision points that predicts the success of a speech 80 percent of the time. This tree could be trained with more data to become better over time.

Clustering techniques can help categorize documents by common profile. For example, documents containing the words investment and profit more often could be bundled together. Similarly, documents containing the words customer orders and marketing more often could be bundled together. Thus, a few strongly demarcated bundles could capture the essence of the entire TDM. These bundles could thus help with further processing, such as handing over select documents to others for legal discovery.
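Bundling documents by profile depends on some measure of how similar two rows of the TDM are. The sketch below uses cosine similarity, one common choice, to find each document's nearest neighbor. It is a building block that clustering methods rely on, not a full clustering algorithm, and the TDM rows and function names are hypothetical.

```python
import math

# Hypothetical TDM rows; columns are investment, profit, customer, marketing.
tdm = {
    "DocA": [8, 5, 0, 1],
    "DocB": [6, 4, 1, 0],
    "DocC": [0, 1, 7, 5],
    "DocD": [1, 0, 5, 6],
}


def cosine(u, v):
    """Cosine similarity: 1.0 for identical profiles, 0.0 for no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(name):
    """Name of the other document whose term profile is closest to this one."""
    others = (other for other in tdm if other != name)
    return max(others, key=lambda other: cosine(tdm[name], tdm[other]))


nearest = {name: most_similar(name) for name in tdm}
print(nearest)
```

In this made-up data the investment and profit documents pair up with each other, as do the customer and marketing documents, which is the kind of bundle described above.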

Association rule analysis could show relationships of coexistence. Thus, one could say that the words tasty and sweet occur together often (say 5 percent of the time); and further, when these two words are present, 70 percent of the time the word happy is also present in the document.

Comparing Text Mining and Data Mining

Text mining is a form of data mining. There are many common elements between text mining and data mining. However, there are some key differences (Table 11.2). The key difference is that text mining requires conversion of text data into frequency data before data mining techniques can be applied.

Dimension | Text Mining | Data Mining
Nature of data | Unstructured data: words, phrases, sentences | Numbers; alphabetical and logical values
Language used | Many languages and dialects used in the world; many languages are extinct, new documents are discovered | Similar numerical systems across the world
Clarity and precision | Sentences can be ambiguous; sentiment may contradict the words | Numbers are precise
Consistency | Different parts of the text can contradict each other | Different parts of data can be inconsistent, thus requiring statistical significance analysis
Sentiment | Text may present a clear and consistent or mixed sentiment, across a continuum. Spoken words add further sentiment | Not applicable
Quality | Spelling errors. Differing values of proper nouns such as names. Varying quality of language translation | Issues with missing values, outliers, etc.
Nature of analysis | Keyword-based search; co-existence of themes; sentiment mining | A full, wide range of statistical and machine learning analysis for relationships and differences

Table 11.2: Comparing Text Mining and Data Mining

Text Mining Best Practices

Many of the best practices that apply to the use of data mining techniques will also apply to text mining.

1. The first and most important practice is to ask the right question. A good question is one whose answer would lead to large payoffs for the organization. The purpose and the key question will define how, and at what levels of granularity, the TDM would be made. For example, a TDM defined for simpler searches would be different from those used for complex semantic analysis or network analysis.

2. A second important practice is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in the quality of the proposed solution as well as in finding the high-quality data sets required to test the hypothesized solution. For example, a TDM of consumer sentiment data should be combined with customer order data in order to develop a comprehensive view of customer behavior. It’s important to assemble a team that has a healthy mix of technical and business skills.

3. Another important element is to pursue the problem iteratively. Too much data can overwhelm the infrastructure and also befuddle the mind. It is better to divide and conquer the problem with a simpler TDM, with fewer terms and fewer documents and data sources. Expand as needed, in an iterative sequence of steps. In the future, add new terms to help improve predictive accuracy.

4. A variety of data mining tools should be used to test the relationships in the TDM. Different decision tree algorithms could be run alongside cluster analysis and other techniques. Triangulating the findings with multiple techniques, and many what-if scenarios, helps build confidence in the solution. Test the solution in many ways before committing to deploy it.

Conclusion

Text mining is diving into unstructured text to discover valuable insights about the business. The text is gathered and then structured into a term-document matrix based on the frequency of a bag of words in a corpus of documents. The TDM can then be mined for useful, novel patterns and insights. While the technique is important, the business objective should be well understood and should always be kept in mind.

***


Review Questions

1: Why is text mining useful in the age of social media?

2: What kinds of problems can be addressed using text mining?

3: What kinds of sentiments can be found in the text?

Do a text mining analysis of sales speeches by three salesmen.

1. Did you know your team can build PowerPoint muscles? Yes, I help build PowerPoint muscles. I teach people how to use PowerPoint more effectively in business. Now, for instance, I’m working with a global consulting firm to train all their senior consultants to give better sales presentations so they can close more business.

2. I train people how to make sure their PowerPoint slides aren’t a complete disaster. Those who attend my workshop can create slides that are 50% more clear and 50% more convincing by the end of the training, based on scores students give each other before and after the workshop. I’m not sure if my training could work at your company. But I’d be happy to talk to you about it.

3. You know how most business people use PowerPoint but most use it pretty poorly? Well, bad PowerPoint has all kinds of consequences: sales that don’t close, good ideas that get ignored, time wasted building slides that could have been used developing or executing strategies. My company shows businesses how to use PowerPoint to capture those sales, bring attention to those great ideas and use those wasted hours on more important projects.

The purpose is to select the best speech.

1: How would you select the right bag of words?

2: If speech #1 was the best speech, use the TDM to create a rule for good speeches.


Here are a few comments from customer service calls received by Liberty.

1. I loved the design of the shirt. The size fitted me very well. However, the fabric seemed flimsy. I am calling to see if you can replace the shirt with a different one. Or please refund my money.

2. I was running late from work, and I stopped by to pick up some groceries. I did not like the way the manager closed the store while I was still shopping.

3. I stopped by to pick up flowers. The checkout line was very long. The manager was polite but did not open new cashiers. I got late for my appointment.

4. The manager promised that the product will be there, but when I went there the product was not there. The visit was a waste. The manager should have compensated me for my trouble.

5. When there was a problem with my catering order, the store manager promptly contacted me and quickly got the kinks out to send me replacement food immediately. They are very courteous.

Create a TDM with not more than 6 key terms. [Hint: Treat each comment as a document]

Chapter 12: Web Mining

Web mining is the art and science of discovering patterns and insights from the World-wide web so as to improve it. The world-wide web is at the heart of the digital revolution. More data is posted on the web every day than was there on the whole web just 20 years ago. Billions of users are using it every day for a variety of purposes. The web is used for electronic commerce, business communication, and many other applications. Web mining analyzes data from the web and helps find insights that could optimize the web content and improve the user experience. Data for web mining is collected via Web crawlers, web logs, and other means.

Here are some characteristics of optimized websites:

1. Appearance: Aesthetic design. Well-formatted content, easy to scan and navigate. Good color contrasts.

2. Content: Well-planned information architecture with useful content. Fresh content. Search-engine optimized. Links to other good sites.

3. Functionality: Accessible to all authorized users. Fast loading times. Usable forms. Mobile enabled.

This type of content and its structure is of interest to ensure the web is easy to use. The analysis of web usage provides feedback on the web content, and also on the consumer’s browsing habits. This data can be of immense use for commercial advertising, and even for social engineering.

The web could be analyzed for its structure as well as content. The usage pattern of web pages could also be analyzed. Depending upon objectives, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining (Figure 12.1).

Figure 12.1: Web Mining structure

Web content mining

A website is designed in the form of pages, each with a distinct URL (uniform resource locator). A large website may contain thousands of pages. These pages and their content are managed using specialized software systems called Content Management Systems. Every page can have text, graphics, audio, video, forms, applications, and more kinds of content, including user-generated content.

A website keeps a record of all requests received for its pages/URLs, including the requester information using ‘cookies’. The log of these requests could be analyzed to gauge the popularity of those pages among different segments of the population. The text and application content on the pages could be analyzed for its usage by visit counts. The pages on a website themselves could be analyzed for the quality of content that attracts the most users. Thus the unwanted or unpopular pages could be weeded out, or they can be transformed with different content and style. Similarly, more resources could be assigned to keep the more popular pages fresh and inviting.

Web structure mining

The Web works through a system of hyperlinks using the hypertext transfer protocol (HTTP). Any page can create a hyperlink to any other page, and it can be linked to by another page. The intertwined or self-referral nature of the web lends itself to some unique network analytical algorithms. The structure of Web pages could also be analyzed to examine the pattern of hyperlinks among pages. There are two basic strategic models for successful websites: Hubs and Authorities.

1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a gathering point, where people visit to access a variety of information. Media sites like Yahoo.com, or government sites, would serve that purpose. More focused sites like Traveladvisor.com and yelp.com could aspire to become hubs for new emerging areas.

2. Authorities: Ultimately, people would gravitate towards pages that provide the most complete and authoritative information on a particular subject. This could be factual information, news, advice, user reviews, etc. These websites would have the largest number of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative page for expert medical opinion. NYtimes.com would serve as an authoritative page for daily news.
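One way to score pages as hubs and authorities is the HITS algorithm, which is named later in this chapter. The following Python sketch shows the basic iteration on a tiny, hypothetical link graph; the page names, the number of iterations, and the function names are assumptions made for illustration.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["clinic", "news", "blog"],
    "blog": ["clinic"],
    "news": ["clinic"],
    "clinic": [],
}


def normalize(scores):
    """Scale scores so that their squares sum to 1 (keeps values bounded)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        auth = normalize({
            p: sum(hub[q] for q in pages if p in links.get(q, []))
            for p in pages
        })
        # A page's hub score is the sum of authority scores of pages it links to.
        hub = normalize({
            p: sum(auth[t] for t in links.get(p, []))
            for p in pages
        })
    return hub, auth


hub, auth = hits(links)
print("top hub:", max(hub, key=hub.get))
print("top authority:", max(auth, key=auth.get))
```

The page with many outbound links to well-regarded pages ends up as the strongest hub, while the page that many others point to ends up as the strongest authority.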


Web usage mining

As a user clicks anywhere on a webpage or application, the action is recorded by many entities in many locations. The browser at the client machine will record the click, and the web server providing the content would also make a record of the pages served and the user activity on those pages. The entities between the client and the server, such as the router, proxy server, or ad server, too would record that click.

The goal of web usage mining is to extract useful information and patterns from data generated through Web page visits and transactions. The activity data comes from data stored in server access logs, referrer logs, agent logs, and client-side cookies. The user characteristics and usage profiles are also gathered directly, or indirectly, through syndicated data. Further, metadata, such as page attributes, content attributes, and usage data, are also gathered.

The web content could be analyzed at multiple levels (Figure 12.2).

1. The server-side analysis would show the relative popularity of the web pages accessed. Those websites could be hubs and authorities.

2. The client-side analysis could focus on the usage pattern or the actual content consumed and created by users.

   1. Usage pattern could be analyzed using ‘clickstream’ analysis, i.e. analyzing web activity for patterns of sequence of clicks, and the location and duration of visits on websites. Clickstream analysis can be useful for web activity analysis, software testing, market research, and analyzing employee productivity.

   2. Textual information accessed on the pages retrieved by users could be analyzed using text mining techniques. The text would be gathered and structured using the bag-of-words technique to build a term-document matrix. This matrix could then be mined using cluster analysis and association rules for patterns such as popular topics, user segmentation, and sentiment analysis.
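A minimal Python sketch of clickstream analysis follows. It assumes a hypothetical record format of (user, timestamp in seconds, page) and computes two simple patterns: how often one page leads to another, and how long users stay on each page. The data and function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical click records: (user, timestamp in seconds, page).
clicks = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 90, "/cart"),
    ("u2", 5, "/home"), ("u2", 20, "/products"), ("u2", 200, "/home"),
]


def sessions(clicks):
    """Group clicks by user, ordered by time."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append((ts, page))
    return by_user


def transitions(by_user):
    """Count how often each page is followed directly by another page."""
    counts = Counter()
    for visits in by_user.values():
        for (_, page_a), (_, page_b) in zip(visits, visits[1:]):
            counts[(page_a, page_b)] += 1
    return counts


def dwell_time(by_user):
    """Total seconds spent on each page; the last page of a session is unknown."""
    total = Counter()
    for visits in by_user.values():
        for (t0, page), (t1, _) in zip(visits, visits[1:]):
            total[page] += t1 - t0
    return total


by_user = sessions(clicks)
print(transitions(by_user).most_common(1))
print(dict(dwell_time(by_user)))
```

The time on the final page of each session cannot be computed from clicks alone, which is one reason real clickstream tools often rely on additional signals.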

Figure 12.2: Web Usage Mining architecture

Web usage mining has many business applications. It can help predict user behavior based on previously learned rules and users' profiles, and can help determine the lifetime value of clients. It can also help design cross-marketing strategies across products, by observing association rules among the pages on the website. Web usage mining can help evaluate promotional campaigns and see if the users were attracted to the website and used the pages relevant to the campaign. Web usage mining could also be used to present dynamic information to users based on their interests and profiles. This includes online ads and coupons targeted at user groups based on user access patterns.

Web Mining Algorithms

Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages as being hubs or authorities. Many other link analysis algorithms have also been published. The most famous and powerful of these is the PageRank algorithm. Developed by Google co-founders Larry Page and Sergey Brin, this algorithm is used by Google to organize the results of its search function. It helps determine the relative importance of any particular web page by counting the number and quality of links to a page. The websites with a larger number of links, and/or more links from higher-quality websites, will be ranked higher. It works in a similar way to determining the status of a person in a society of people. Those with relations to more people and/or relations to people of higher status will be accorded a higher status.
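The idea of ranking a page by the number and quality of its inbound links can be sketched with a simplified version of the originally published PageRank iteration. This is not Google's current algorithm, which is not public. The link graph, the damping factor of 0.85 (a commonly cited value), and the fixed iteration count are assumptions made for this illustration.

```python
# Hypothetical link graph: page -> pages it links to.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def pagerank(links, damping=0.85, iterations=50):
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share, independent of links.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page A receives only one link, but that link comes from the highly ranked page C, so A ends up well ahead of B and D, which have no inbound links at all.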

PageRank is the algorithm that helps determine the order of pages listed upon a Google Search query. The original PageRank algorithm formulation has been updated in many ways, and the latest algorithm is kept a secret so that other websites cannot take advantage of the algorithm and manipulate their website according to it. However, there are many standard elements that remain unchanged. These elements lead to the principles for a good website. The process of applying them is also called Search Engine Optimization (SEO).

Conclusion

The web has growing resources, with more content every day and more users visiting it for many purposes. A good website should be useful, easy to use, and flexible for evolution. From the insights gleaned using web mining, websites should be constantly optimized.

Web usage mining can help discover what content users really like and consume, and help prioritize that for improvement. Web structure mining can help improve traffic to those sites, by building authority for the sites.

Review Questions

1: What are the three types of web mining?

2: What is clickstream analysis?

3: What are the two major ways that a website can become popular?

4: What are the privacy issues in web mining?

5: A user spends 60 minutes on the web, visiting 10 webpages in all. Given the clickstream data, what kind of an analysis would you do?

Chapter 13: Big Data

Big data is an umbrella term for a collection of data sets so large and complex that it becomes difficult to process them using traditional data management tools. There has been increasing democratization of the process of content creation and sharing over the Internet, using social media applications. The combination of cloud-based storage, social media applications, and mobile access devices is helping crystallize the big data phenomenon. The leading management consulting firm, McKinsey & Co., created a flutter when it published a report in 2011 showing a huge impact of such big data on business and other organizations. They also reported that there will be millions of new jobs in the next decade related to the use of big data in many industries.

Big data can be used to discover new insights from a 360-degree view of a situation that can allow for a completely new perspective on situations, new models of reality, and potentially new types of solutions. It can help spot business trends and opportunities. For example, Google is able to predict the spread of a disease by tracking the use of search terms related to the symptoms of the disease over the globe in real time. Big data can help determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. Big data is enabling evidence-based medicine, and many other innovations.

Data has become the new natural resource. Organizations have a choice in how to engage with this exponentially growing volume, variety and velocity of data. They can choose to be buried under the avalanche, or they can choose to use it for competitive advantage. Challenges in big data include the entire range of operations: capture, curation, storage, search, sharing, analysis, and visualization. Big data is more valuable when analyzed as a whole. More and more information is derivable from analysis of a single large set of related data, as compared to separate smaller sets. However, special tools and skills are needed to manage such extremely large data sets.

Caselet: Personalized Promotions at Sears

A couple of years ago, Sears Holdings came to the conclusion that it needed to generate greater value from the huge amounts of customer, product, and promotion data it collected from its many brands. Sears required about eight weeks to generate personalized promotions, at which point many of them were no longer optimal for the company. It took so long mainly because the data required for these large-scale analyses were both voluminous and highly fragmented, housed in many databases and “data warehouses” maintained by the various brands. Sears turned to the technologies and practices of big data. As one of its first steps, it set up a Hadoop cluster, using a group of inexpensive commodity servers.

Sears started using the Hadoop cluster to store incoming data from all its brands and from existing data warehouses. It then conducted analyses on the cluster directly, avoiding the time-consuming complexities of pulling data from various sources and combining them so that they can be analyzed. Sears’s Hadoop cluster stores and processes several petabytes of data at a fraction of the cost of a comparable standard data warehouse. The time needed to generate a comprehensive set of promotions dropped from eight weeks to one. And these promotions are of higher quality, because they’re more timely, more granular, and more personalized. (Source: McAfee & Brynjolfsson, HBS, Oct 2012)

1: What are other ways in which Sears can benefit from Big Data?

2: What are the challenges in making use of Big Data?

Defining Big Data

In 2000, there were 800,000 petabytes of data in the world. It is expected to grow to 35 zettabytes by the year 2020. About a million books' worth of data is being created daily on social media alone. Big data is big, fast, unstructured, and of many types. There are several unique features:

1. Variety: There are many types of data, including structured and unstructured data. Structured data consists of numeric and text fields. Unstructured data includes images, video, audio, and many other types. There are also many sources of data. The traditional sources of structured data include data from ERP systems and other operational systems. Sources for unstructured data include social media, the Web, RFID, machine data, and others. Unstructured data comes in a variety of sizes and resolutions, and is subject to different kinds of analysis. For example, video files can be tagged with labels, and they can be played, but video data is typically not computed, which is the same with audio data. Graphic data can be analyzed for network distances. Facebook texts and tweets can be analyzed for sentiments, but cannot be directly compared.

2. Velocity: The Internet greatly increases the speed of movement of data. From e-mails to social media to video files, data can move quickly. Cloud-based storage makes sharing instantaneous, and data easily accessible from anywhere. Social media applications enable people to share their data with each other instantly. Mobile access to these applications also speeds up the generation of and access to data (Figure 13.1).

Figure 13.1: Sources of Big Data (Source: Hortonworks.com)

3. Volume:Websiteshavebecomegreatsourcedandrepositoriesformanykindsofdata.Userclickstreamsarerecordedandstoredforfutureuse.SocialmediaapplicationssuchasFacebook,Twitter,Pinterest,andotherapplicationshaveenableduserstobecomeprosumersofdata(producersandconsumers).Thereisanincreaseinthenumberofdatashares,andalso the sizeof eachdata element.High-definitionvideos can increasethetotalshareddata.Thereareautonomousdatastreamsofvideo,audio,text, data, and so on coming from social media sites, websites, RFIDapplications,andsoon.

4. Sources of Data: There are several sources of data, including some new ones. Data from outside the organization may be incomplete, and of a different quality and accuracy.

    1. Social Media: All activities on the web and social media are recorded, stored, and accessible. Email was the first major source of new data. Google searches, Facebook posts, tweets, YouTube videos, and blogs enable people to generate data for one another.

    2. Organizations: Business organizations and government are a major source of data. ERP systems, e-commerce systems, user-generated content, web-access logs, and many other sources generate valuable data for organizations.

    3. Machines: The Internet of Things is evolving. Many machines are connected to the web and autonomously generate data that is untouched by humans. RFID tags and telematics are two major applications that generate enormous amounts of data. Connected devices such as phones and refrigerators generate data about their location and status.

    4. Metadata: There is enormous data about data itself. Web crawlers and web-bots scan the web to capture new webpages, their HTML structure, and their metadata. This data is used by many applications, including web search engines.

The quality of the data also varies. It depends upon the purpose of collecting the data, and how carefully it has been collected and curated. Data from within the organization is likely to be of a higher quality. Publicly available data would include some trustworthy data, such as that from the government.

Big Data Landscape

Big data can be understood at many levels (Figure 13.2). At the highest level are business applications to suit particular industries or to suit business intelligence for executives. A unique concept of “data as a service” is also possible for particular industries. At the next level, there are infrastructure elements for broad cross-industry applications, such as analytics and structured databases. This also includes offering this infrastructure as a service with some operational management services built in. At the core, big data is about technologies and standards to store and manipulate the large, fast streams of data, and make them available for rapid data-based decision-making.

Figure 13.2 The Big Data Landscape (source: bigdatalandscape.com)

Business Implications of Big Data

“Big data will disrupt your business. Your actions will determine whether these disruptions are positive or negative.” (Gartner, 2012).

Any industry that produces information-based products is most likely to be disrupted. Thus, the newspaper industry has taken a hit from digital distribution channels, as well as from published-on-web-only blogs. Entertainment has also been impacted by digital distribution and piracy, as well as by user-generated-and-uploaded content on the internet. The education industry is being disrupted by massive open online courses (MOOCs) and user-uploaded content. Health care delivery is impacted by electronic health records and digital medicine. The retail industry has been highly disrupted by e-commerce companies. Fashion companies are impacted by quick feedback on their designs on social media. The banking industry has been impacted by cost-effective online self-serve banking applications, and this will impact employment levels in the industry.

There is rapid change in business models enabled by big data technologies. Steve Jobs, the ex-CEO of Apple, conceded that his company's products and business models would be disrupted. He preferred his older products to be cannibalized by his own new products rather than by those of the competition.

Every other business too will likely be disrupted. The key issue for business is how to harness big data to generate growth opportunities and to leapfrog the competition. Organizations need to learn how to organize their businesses so that they do not get buried in the high volume, velocity, and variety of data, but instead use it smartly and proactively to obtain a quick but decisive advantage over their competition. Organizations need to figure out how to use big data as a strategic asset in real time, to identify opportunities, thwart threats, build new capabilities, and enhance operational efficiencies. Organizations can now effectively fuse strategy and digital business, and then strive to design an innovative “digital business strategy” around digital assets and capabilities.

Technology Implications of Big Data

"Big data" forces organizations to address the variety of information assets and how fast these new asset types are changing information management demands. (Gartner, 2012).

The growth of data is made possible in part by the advancement of storage technology. The attached graph shows the growth of disk-drive average capacities. The cost of storage is falling, the size of storage is getting smaller, and the speed of access is going up (Figure 13.3). Flash drives are becoming cheaper. Random access memory used to be expensive, but is now so inexpensive that entire databases can be loaded and processed quickly, instead of swapping sections of them into and out of high-speed memory.

New data management and processing technologies have emerged. IT professionals integrate “big data” structured assets with content and must increase their business requirement identification skills. Big data is going democratic. Business functions will be protective of their data and will begin initiatives around exploiting it. IT support teams need to find ways to support end-user-deployed big data solutions. Enterprise data warehouses will need to include big data in some form. The IT platform needs to be strengthened to help enable a “digital business strategy” around digital assets and capabilities.

Big Data Technologies

New tools and techniques have arisen in the last 10-20 years to handle this large and still growing data. There are technologies for storing and accessing this data.

1. Non-relational data structures: Big data is stored using non-traditional data structures. Large non-relational platforms like Hadoop have emerged as a leading data management platform for big data. The Hadoop Distributed File System (HDFS) stores data as large files split into blocks that are spread across many machines, and Hadoop's processing jobs operate on ‘key and data-value’ pairs. Google's file system and its Bigtable store are other prominent technologies. NoSQL databases are emerging as a popular way to store and manage non-relational data. There is a matching data warehousing system called Hive, with its own SQL-like query language (HiveQL). The open-source stack of programming languages (such as Pig) and other tools helps make Hadoop a powerful and popular tool.

2. Massively parallel computing: Given the size of data, it is useful to divide and conquer the problem quickly using multiple processors simultaneously. Parallel processing allows the data to be processed by multiple machines so that results can be achieved sooner. The Map-Reduce algorithm, originally developed at Google for large-scale processing such as building its search index, has emerged as a popular parallel processing mechanism. The original problem is divided into smaller problems, which are then mapped to multiple processors that can operate in parallel. The outputs of these processors are passed to an output processor that reduces the output to a single stream, which is then sent to the end user. Figure 13.4 shows an example of a Map-Reduce algorithm.

Figure 13.4 A MapReduce algorithm example (source: www.cs.uml.edu)

3. Unstructured Information Management Architecture (UIMA): This is one of the elements in the “secret sauce” behind IBM's Watson system, which reads massive amounts of data and organizes it for just-in-time processing. Watson beat the Jeopardy (quiz program) champion in 2011 and is now used for many business applications, such as diagnosis in health care situations. Natural language processing is another capability that helps extend the power of big data technologies.
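The Map-Reduce idea described in item 2 above can be illustrated with a word count, the customary teaching example. This is only a minimal single-machine sketch: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of Hadoop or any other library, and the "parallel" map calls simply run one after another.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # in this sketch the map calls run sequentially.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return reduce_phase(shuffle(emitted))

counts = word_count(["big data is big", "data moves fast"])
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both steps could in principle be spread across many processors, which is the point of the technique.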

Management of Big Data

Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of big data.

1. Across all industries, the business case for big data is strongly focused on addressing customer-centric objectives. The first focus in deploying big data initiatives is to protect and enhance customer relationships and customer experience.

2. Solve a real pain-point. Big data should be deployed for specific business objectives in order to avoid being overwhelmed by the sheer size of it all.

3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one's control and where one has a superior understanding of the data.

4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.

5. Advanced analytical capabilities are required, yet lacking, for organizations to get the most value from big data. There is a growing awareness of the need to build or hire those skills and capabilities.

6. Use more diverse data, not just more data. This would provide a broader perspective on reality and better-quality insights.

7. The faster you analyze the data, the greater its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.

8. Don't throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later in a multiplicative manner.

9. Maintain one copy of your data, not multiple. This would help avoid confusion and increase efficiency.

10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.

11. A scalable and extensible information management foundation is a prerequisite for big data advancement. Big data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.

12. Big data is transforming business, just like IT did. Big data is a new phase representing a digital world. Business and society are not immune to its strong impacts.

Conclusion

Big Data is a new natural force and natural resource. The exponentially growing volume, variety and velocity of data is constantly disrupting businesses across all industries, at multiple levels from product to business models. Organizations need to begin initiatives around big data; acquire skills, tools and technologies; and show the vision to disrupt their industry and come out ahead.

Review Questions

1: What are the 3 Vs of Big Data?

2: How does Big Data impact business models?

3: What is Hadoop?

4: How does the Map-Reduce algorithm work?

5: What are the key issues in managing Big Data?

Chapter 14: Data Modeling Primer

Data needs to be efficiently structured and stored so that it includes all the information needed for decision making, without duplication and loss of integrity. Here are the top ten qualities of good data.

Data should be:

1. Accurate: Data should retain consistent values across data stores, users and applications. This is the most important aspect of data. Any analysis done on inaccurate or corrupted data leads to the garbage-in-garbage-out (GIGO) condition.

2. Persistent: Data should be available for all times, now and later. It should thus be nonvolatile, stored and managed for later access.

3. Available: Data should be made available to authorized users, when, where, and how they want to access it, within policy constraints.

4. Accessible: Not only should data be available to the user, it should also be easy to use. Thus, data should be made available in desired formats, with easy tools. MS Excel is a popular medium to access numeric data, and then transfer it to other formats.

5. Comprehensive: Data should be gathered from all relevant sources to provide a complete and holistic view of the situation. New dimensions should be added to data as and when they become available.

6. Analyzable: Data should be available for analysis, for historical and predictive purposes. Thus, data should be organized such that it can be used by analytical tools, such as OLAP, data cubes, or data mining.

7. Flexible: Data is growing in variety of types. Thus, data stores should be able to store a variety of data types: small/large, text/video, and so on.

8. Scalable: Data is growing in volume. Data storage should be organized to meet emergent demands.

9. Secure: Data should be doubly and triply backed up, and protected against loss and damage. There is no bigger IT nightmare than corrupted data. Inconsistent data has to be manually sorted out, which leads to loss of face, loss of business, and downtime, and sometimes the business never recovers.

10. Cost-effective: The cost of collecting data and storing it is coming down rapidly. However, the total cost of gathering, organizing, and storing a type of data should still be proportional to the estimated value from its use.

Evolution of data management systems

Data management has evolved from manual filing systems to the most advanced online systems capable of handling millions of data processing and access requests every second.

The first data management systems were called file systems. These mimicked paper files and folders. Everything was stored chronologically. Access to this data was sequential.

The next step in data modeling was to find ways to access any random record quickly. Thus hierarchical database systems appeared. They were able to connect all items for an order, given an order number.

The next step was to traverse the linkages both ways, from the top of the hierarchy to the bottom, and from the bottom to the top. Given an item sold, one should be able to find its order number, and list all the other items sold in that order. Thus there were networks of links established in the data to track those relationships.

The major leap came when the relationship between data elements itself became the center of attention. The relationship between data values was the key element of storage. Relationships were established through matching values of common attributes, rather than by the location of the record in a file. This led to data modeling using relational algebra. Relations could be joined and subtracted, with set operations like union and intersection. Searching the data became an easier task: one simply declared the values of a variable of interest.

The relational model was enhanced to include variables with non-comparable values like binary objects (such as pictures), which had to be processed differently. Thus emerged the idea of encapsulating the procedures along with the data elements they worked on. The data and its methods were encapsulated into an object. Those objects could be further specialized. For example, a vehicle is an object with certain attributes. A car and a truck are more specialized versions of a vehicle. They inherit the data structure of the vehicle, but have their own additional attributes. Similarly, the specialized object inherits all the procedures and programs associated with the more general entity. This became the object-oriented model.
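The vehicle example can be sketched in a few lines of code. This is an illustrative sketch only; the attribute names (`vin`, `wheels`, `passenger_seats`, `payload_tons`) and the sample values are assumptions chosen for the example, not taken from any particular system.

```python
class Vehicle:
    """General entity: its data (attributes) and procedures (methods) live together."""

    def __init__(self, vin, wheels):
        self.vin = vin
        self.wheels = wheels

    def describe(self):
        # Defined once here, inherited unchanged by every specialized subtype.
        return f"{type(self).__name__} {self.vin} with {self.wheels} wheels"


class Car(Vehicle):
    """Specialized vehicle: inherits vin and wheels, adds passenger_seats."""

    def __init__(self, vin, passenger_seats):
        super().__init__(vin, wheels=4)
        self.passenger_seats = passenger_seats


class Truck(Vehicle):
    """Specialized vehicle: inherits vin and wheels, adds payload_tons."""

    def __init__(self, vin, wheels, payload_tons):
        super().__init__(vin, wheels)
        self.payload_tons = payload_tons


car = Car("C-001", passenger_seats=5)
truck = Truck("T-001", wheels=6, payload_tons=10)
print(car.describe())
print(truck.describe())
```

Both subtypes reuse `describe` without redefining it, which mirrors how a specialized object inherits the procedures of the more general entity.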

Relational Data Model

The first mathematical-theory-driven model for data management was designed by Edgar F. Codd of IBM in 1970.

1. A relational database is composed of a set of relations (data tables), which can be joined using shared attributes. A “data table” is a collection of instances (or records), with a key attribute to uniquely identify each instance.

2. Data tables can be JOINed using the shared “key” attributes to create larger temporary tables, which can be queried to fetch information across tables. Joins can be simple ones, as between two tables. Joins can also be complex, with AND, OR, UNION or INTERSECTION, and more operations.

3. High-level commands in Structured Query Language (SQL) can be used to perform joins, selection, and organizing of records.

Relational data models flow from conceptual models, to logical models, to physical implementations. Data can be conceived of as being about entities, and relationships among entities. A relationship between entities may be a hierarchy between entities, or transactions involving multiple entities. These can be graphically represented as an entity-relationship diagram (ERD).

In Figure 14.1, the rectangles represent the entities Students and Courses, and the diamond shows the Enrolment relationship between them.

Figure 14.1: Simple relationship between two entities

Here are some fundamental concepts of ERDs:

1. An entity is any object or event about which someone chooses to collect data, which may be a person, place, or thing (e.g., sales person, city, product, vehicle, employee).

2. Entities have attributes. Attributes are data items that have something in common with the entity. For example, student id, student name, and student address represent details for a student entity. Attributes can be single-valued (e.g., student name) or multi-valued (e.g., list of past addresses for the student). Attributes can be simple (e.g., student name) or composite (e.g., student address, composed of street, city, and state).

3. Every entity must have key attribute(s) that can be used to identify an instance. E.g., Student ID can identify a student. A primary key is an attribute whose value is unique for each instance (e.g., Student ID). Any attribute that could serve as a primary key (e.g., Student Address, provided no two students share one) is a candidate key. A secondary key is a key which may not be unique, and may be used to select a group of records (e.g., Student city). Some entities will have a composite key, a combination of two or more attributes that together uniquely represent the key (e.g., Flight number and Flight date). A foreign key is useful in representing a one-to-many relationship. The primary key of the file at the one end of the relationship should be contained as a foreign key on the file at the many end of the relationship.

4. Relationships have many characteristics: degree, cardinality, and participation.

5. Degree of relationship depends upon the number of entities participating in a relationship. Relationships can be unary (e.g., employee and manager-as-employee), binary (e.g., student and course), and ternary (e.g., vendor, part, warehouse).

6. Cardinality represents the extent of participation of each entity in a relationship.
    1. One-to-one (e.g., employee and parking space)
    2. One-to-many (e.g., customer and orders)
    3. Many-to-many (e.g., student and course)

7. Participation indicates the optional or mandatory nature of a relationship.
    1. Customer and order (mandatory)
    2. Employee and course (optional)

8. There are also weak entities that are dependent on another entity for their existence (e.g., employees and dependents). If an employee's data is removed, then the dependent data must also be removed.

9. There are associative entities used to represent many-to-many relationships (e.g., student-course enrolment). There are two ways to implement a many-to-many relationship. It could be converted into two one-to-many relationships with an associative entity in the middle. Alternatively, the combination of primary keys of the entities participating in the relationship will form the primary key for the associative entity.

10. There are also supertype and subtype entities. These help represent additional attributes on a subset of the records. For example, vehicle is a supertype and passenger car is its subtype.
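The student-course enrolment example in item 9 can be sketched with Python's built-in sqlite3 module. The table and column names (`student`, `course`, `enrolment`, `student_id`, `course_id`) and the sample rows are assumptions made for this illustration; the sketch shows an associative entity whose primary key is the combination of the two participating primary keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Associative entity: one row per (student, course) pair.
-- Its composite primary key combines the two foreign keys.
CREATE TABLE enrolment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO course  VALUES (10, 'Data Mining'), (20, 'Databases');
INSERT INTO enrolment VALUES (1, 10), (1, 20), (2, 10);
""")

def courses_for(student_name):
    """Join through the associative table to list one student's courses."""
    rows = conn.execute(
        """SELECT c.title
           FROM student s
           JOIN enrolment e ON e.student_id = s.student_id
           JOIN course c    ON c.course_id  = e.course_id
           WHERE s.name = ?
           ORDER BY c.title""",
        (student_name,),
    )
    return [title for (title,) in rows]

print(courses_for("Asha"))
```

The many-to-many relationship has become two one-to-many relationships: each student has many enrolment rows, and each course has many enrolment rows.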

Implementing the Relational Data Model

Once the logical data model has been created, it is easy to translate it into a physical data model, which can then be implemented using any publicly available DBMS. Every entity should be implemented by creating a database table. Every table will have a specific data field (key) that uniquely identifies each record (or row) in that table. Each master table or database relation should have programs to create, read, update, and delete the records.

The databases should follow three integrity constraints.

1. Entity integrity ensures that the entity or table is healthy. The primary key cannot have a null value, and every row must have a unique key value; a row that violates this should be rejected or deleted. As a corollary, if the primary key is a composite key, none of the fields participating in the key can contain a null value.

2. Domain integrity is enforced by using rules to validate the data as being of the appropriate range and type.

3. Referential integrity governs the nature of records in a one-to-many relationship. This ensures that the value of a foreign key has a matching value in the primary keys of the table referred to by the foreign key.
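The three constraints can be seen in action with a small experiment. The sketch below uses Python's built-in sqlite3 module; the `customer` and `orders` tables, their columns, and the helper `is_rejected` are hypothetical names introduced for this example. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other DBMSs behave differently in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,       -- entity integrity: unique key per row
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id),  -- referential integrity
    amount      REAL NOT NULL CHECK (amount > 0)   -- domain integrity
);
INSERT INTO customer VALUES (1, 'Asha');
INSERT INTO orders VALUES (100, 1, 25.0);
""")

def is_rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.rollback()  # leave the sample data unchanged either way

print(is_rejected("INSERT INTO customer VALUES (1, 'Ben')"))    # duplicate primary key
print(is_rejected("INSERT INTO orders VALUES (101, 1, -5.0)"))  # amount outside its domain
print(is_rejected("INSERT INTO orders VALUES (102, 99, 5.0)"))  # no such customer
```

Declaring the rules in the table definitions lets the DBMS reject bad rows itself, rather than relying on every application program to check them.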

Database management systems (DBMS)

There are many database management software systems that help manage the activities related to storing the data model, the data itself, and doing the operations on the data and relations. The data in the DBMS grows, and it serves many users of the data concurrently. The DBMS typically runs on a computer called a database server, in an n-tier application architecture. Thus in an airline reservation system, millions of transactions might simultaneously try to access the same set of data. The database is constantly monitored and managed to provide data access to all authorized users, securely and speedily, while keeping the database consistent and useful. Content management systems are special-purpose DBMSs, or just features within standard DBMSs, that help people manage their own data on a website. There are object-oriented and other more complex ways of managing data.

Structured Query Language

SQL is a very easy and powerful language to access relational databases. There are two essential components of SQL: the Data Definition Language (DDL) and the Data Manipulation Language (DML).

DDL provides instructions to create a new database, and to create new tables within a database. Further, it provides instructions to delete a database, or just a few tables within a database. There are other ancillary commands to define indexes, etc., for efficient access to the database.

DML is the heart of SQL. It provides instructions to add, read, modify and delete data from the database and any of its tables. The data can be selectively accessed, and then formatted, to answer a specific question. For example, to find the total sales of each movie, the SQL query would be:

SELECT Product_Name, SUM(Amount)
FROM Movies_Transactions
GROUP BY Product_Name
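To try a query of this kind, one can load a few rows into an in-memory SQLite database. This is a hedged sketch: the identifiers are written with underscores because hyphens are not valid in unquoted SQL names, and the `Quarter` column and the three sample rows are invented for the illustration. Grouping by `Quarter` instead of by product gives sales by quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movies_Transactions (
    Product_Name TEXT NOT NULL,
    Quarter      TEXT NOT NULL,   -- hypothetical column, added for this example
    Amount       REAL NOT NULL
);
INSERT INTO Movies_Transactions VALUES
    ('Movie A', 'Q1', 10.0),
    ('Movie A', 'Q2', 15.0),
    ('Movie B', 'Q1', 7.5);
""")

# Total sales per product: one output row per distinct Product_Name.
sales_by_product = conn.execute(
    "SELECT Product_Name, SUM(Amount) FROM Movies_Transactions "
    "GROUP BY Product_Name ORDER BY Product_Name"
).fetchall()

# Total sales per quarter: same aggregate, different grouping column.
sales_by_quarter = conn.execute(
    "SELECT Quarter, SUM(Amount) FROM Movies_Transactions "
    "GROUP BY Quarter ORDER BY Quarter"
).fetchall()

print(sales_by_product)
print(sales_by_quarter)
```

The column named in GROUP BY decides what each output row stands for, so the same table answers both questions.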

Conclusion

Data should be modeled to achieve the business objectives. Good data should be accurate and accessible, so that it can be used for business operations. The relational data model is the most popular way of managing data today.

Review Questions

1: Who invented the relational model, and when?

2: How does the relational model mark a clear break from previous database models?

3: What is an Entity-Relationship diagram?

4: What kinds of attributes can an entity have?

5: What are the different kinds of relationships?

Appendix 1: Data Mining Tutorial with Weka

Developed for academic use only

by Dr. Anil Maheshwari & Dr. Edi Shivaji

This tutorial for the WEKA software platform is designed for use by a student of a course in Data Mining applications. This tutorial will provide examples of solving certain data mining problems using the Weka tool and the sample data sets provided with it.

Step 1: Download the free Weka software

http://www.cs.waikato.ac.nz/ml/weka/downloading.html

Step 2: Download the free Weka data sets

http://www.cs.waikato.ac.nz/ml/weka/datasets.html

Step 3: Access the associated textbook to learn about data mining

http://www.cs.waikato.ac.nz/~ml/weka/book.html

This tutorial uses data from the free Weka data sets. The sample problems addressed in this tutorial are:

1. Classification models: These are the most important application of data mining. We will use Decision trees and Regression methods.

2. Clustering: Using the K-means algorithm.

3. Association Rule Mining: Using the Apriori algorithm.

Exercise 1: Classification using DECISION TREES

Problem statement: What is the best way to predict whether a game will be on or off based on weather indicators? A data set of past decisions has been provided.

Data set used: Weather-nominal. It describes 14 instances of weather conditions and whether an outdoor game was possible or not (Play) under those weather conditions. Here is the raw data.

Load the data set. All of its attributes are nominal, although classification with decision trees does not require nominal data.

Analysis used: J48 decision tree algorithm (an implementation of the C4.5 algorithm). It is a top-down approach.

Results:

Instances: 14

Attributes: 5
    outlook
    temperature
    humidity
    windy
    play

Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

=== Summary ===

Correctly Classified Instances      14    100%
Incorrectly Classified Instances     0      0%
Kappa statistic                      1
Mean absolute error                  0
Root mean squared error              0
Relative absolute error              0%
Root relative squared error          0%
Total Number of Instances           14

=== Detailed Accuracy By Class ===

            TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
            1        0        1          1       1          1         yes
            1        0        1          1       1          1         no
Wtd Avg.    1        0        1          1       1          1

=== Confusion Matrix ===

 a  b   <-- classified as
 9  0 | a = yes
 0  5 | b = no

Note: The model classifies 100% of the training instances correctly. The pruned tree shows the rules for making the decision in a text form.

Interpreting the tree: The first split variable is “Outlook”. If the outlook is sunny, then check for humidity. If the outlook is overcast, the answer is yes. If the outlook is rainy, then check for windy.
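The pruned tree listing can be read directly as a set of nested if-then rules. The short sketch below transcribes those rules; the function name `predict_play` is made up for this illustration and is not part of Weka.

```python
def predict_play(outlook, humidity, windy):
    """Apply the rules of the J48 pruned tree from the Weka listing."""
    if outlook == "overcast":
        return "yes"                      # overcast: always play
    if outlook == "sunny":
        # sunny: the decision depends on humidity
        return "yes" if humidity == "normal" else "no"
    if outlook == "rainy":
        # rainy: the decision depends on wind
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(predict_play("sunny", "high", False))
print(predict_play("rainy", "normal", False))
```

Temperature never appears in the rules: the algorithm found that it adds nothing once outlook, humidity, and windy are known.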

Visualizing the output: Weka can create a visual version of the tree.

Interpreting the Visual Tree: The visual decision tree is simple and self-explanatory.

Exercise:

1. Try different decision tree algorithms in Weka for this simple data set.

2. Compare time taken, accuracy, and interpretability of the output.

Exercise 2: Classification using DECISION TREES

Problem statement: What is the best model to diagnose whether a breast lump is benign or malignant?

Data set used: breast-w. This is a much larger data set. It shows many more variables and instances. It describes 699 instances of biopsy analyses of breast cancer suspects. There are 15 variables, some of which are nominal while others are numeric. The class variable shows whether the instance was judged to be benign or malignant.

Load the data set. There is no need for nominality of data for decision trees. For simplicity of analysis, however, only the nominal variables were kept, while others were removed from the data set before analysis.

Analysis used: J48 decision tree algorithm.

Results:

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: wisconsin-breast-cancer
Instances: 699
Attributes: 10
    Clump_Thickness
    Cell_Size_Uniformity
    Cell_Shape_Uniformity
    Marginal_Adhesion
    Single_Epi_Cell_Size
    Bare_Nuclei
    Bland_Chromatin
    Normal_Nucleoli
    Mitoses
    Class (Benign/Malignant)
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------
Cell_Size_Uniformity <= 2
|   Bare_Nuclei <= 3: benign (405.39/2.0)
|   Bare_Nuclei > 3
|   |   Clump_Thickness <= 3: benign (11.55)
|   |   Clump_Thickness > 3
|   |   |   Bland_Chromatin <= 2
|   |   |   |   Marginal_Adhesion <= 3: malignant (2.0)
|   |   |   |   Marginal_Adhesion > 3: benign (2.0)
|   |   |   Bland_Chromatin > 2: malignant (8.06/0.06)
Cell_Size_Uniformity > 2
|   Cell_Shape_Uniformity <= 2
|   |   Clump_Thickness <= 5: benign (19.0/1.0)
|   |   Clump_Thickness > 5: malignant (4.0)
|   Cell_Shape_Uniformity > 2
|   |   Cell_Size_Uniformity <= 4
|   |   |   Bare_Nuclei <= 2
|   |   |   |   Marginal_Adhesion <= 3: benign (11.41/1.21)
|   |   |   |   Marginal_Adhesion > 3: malignant (3.0)
|   |   |   Bare_Nuclei > 2
|   |   |   |   Clump_Thickness <= 6
|   |   |   |   |   Cell_Size_Uniformity <= 3: malignant (13.0/2.0)
|   |   |   |   |   Cell_Size_Uniformity > 3
|   |   |   |   |   |   Marginal_Adhesion <= 5: benign (5.79/1.0)
|   |   |   |   |   |   Marginal_Adhesion > 5: malignant (5.0)
|   |   |   |   Clump_Thickness > 6: malignant (31.79/1.0)
|   |   Cell_Size_Uniformity > 4: malignant (177.0/5.0)

Number of Leaves: 14
Size of the tree: 27
Time taken to build model: 0.07 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances     686    98.1402%   (i.e., 98% of cases are classified correctly)
Incorrectly Classified Instances    13     1.8598%
Kappa statistic                      0.959
Mean absolute error                  0.0355
Root mean squared error              0.1324
Relative absolute error              7.8614%
Root relative squared error         27.8462%
Total Number of Instances          699

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.983    0.021    0.989      0.983   0.986      0.989     benign
                0.979    0.017    0.967      0.979   0.973      0.989     malignant
Weighted Avg.   0.981    0.02     0.981      0.981   0.981      0.989

=== Confusion Matrix ===

   a    b   <-- classified as
 450    8 | a = benign      (450 benign cases are correctly classified as benign, 8 are false positives)
   5  236 | b = malignant   (236 malignant cases are correctly classified as malignant, 5 are false negatives)

Visualizing the output: The pruned tree looks very complex and unreadable, and is therefore removed from this document. The visual decision tree makes it easier to grasp.
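The summary figures in the Weka output follow directly from the confusion matrix, and recomputing them is a useful check on one's reading of it. The sketch below does this in plain Python; the function name `class_metrics` is an assumption made for this illustration, and rows are taken to be actual classes and columns predicted classes, as in the Weka listing.

```python
def class_metrics(matrix, labels):
    """matrix[i][j] = number of instances of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    # Accuracy: the diagonal holds the correctly classified instances.
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
    per_class = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        actual = sum(matrix[k])                     # row total: true members of class k
        predicted = sum(row[k] for row in matrix)   # column total: instances labeled k
        per_class[label] = {
            "recall": tp / actual,        # reported by Weka as "TP Rate" and "Recall"
            "precision": tp / predicted,
        }
    return accuracy, per_class

accuracy, per_class = class_metrics([[450, 8], [5, 236]], ["benign", "malignant"])
print(round(accuracy * 100, 4))
print({label: {m: round(v, 3) for m, v in vals.items()} for label, vals in per_class.items()})
```

For example, the benign recall is 450 out of the 458 actual benign cases, and the malignant precision is 236 out of the 244 cases the model labeled malignant.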

Interpreting the decision tree output:

1. The numbers on the leaf nodes show the instances covered by that node and, after the slash, how many of them are incorrectly classified. The decision rule/node on the right covers 177 instances, of which it classifies 5 incorrectly.

2. Not all nodes are equally important. Some nodes explain many more instances than other nodes.
    1. E.g., a single node on the left of the tree represents a very simple rule (cell_size_uniformity <= 2 and bare_nuclei <= 3) that easily explains 90% (405 out of 450) of the benign cases, and more than 55% of the total cases (405 out of 699).
    2. Similarly, the node on the right covers 177 instances, which is over 73% of the 241 malignant cases, and thus provides a clear rule or heuristic.

3. The tree shows a clear path for diagnosing each case. And so on and on.

Exercise 3: Cluster Analysis using K-Means algorithm

Nature of problem/opportunity: Understand the underlying clusters among instances of breast cancer evaluations.

Data set used: breast-w. It describes 699 instances of biopsy analyses of breast cancer suspects. There are 15 variables, some of which are nominal while others are numeric. The class variable shows whether the instance was judged to be benign or malignant.

Data preparation: Load the data set.

Analysis used: K-means algorithm. Choices include the number of clusters to begin with.

Output of the analysis.

Instances: 699
Attributes: 10

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 259.92291180466714
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute               Full Data     0           1
                        (699)         (246)       (453)
=====================================================
Clump_Thickness         4.4177        7.1748      2.9205
Cell_Size_Uniformity    3.1345        6.5976      1.2539
Cell_Shape_Uniformity   3.2074        6.5732      1.3797
Marginal_Adhesion       2.8069        5.5325      1.3267
Single_Epi_Cell_Size    3.216         5.3089      2.0795
Bare_Nuclei             3.5447        7.5576      1.3654
Bland_Chromatin         3.4378        5.9634      2.0662
Normal_Nucleoli         2.867         5.8943      1.223
Mitoses                 1.5894        2.561       1.0618
Class                   benign        malignant   benign

=== Model and evaluation on training set ===

Clustered Instances
0    246 (35%)
1    453 (65%)

Interpretation: This is a very clear result. There are clearly two classes: malignant and benign.

Sensitivity analysis of Clustering: The two clusters above could be unduly influenced by the bipolar class variable (benign, malignant). So, remove that variable and run the same analysis again.

kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 243.1478671867869
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute               Full Data     0           1
                        (699)         (233)       (466)
========================================================
Clump_Thickness         4.4177        7.1588      3.0472
Cell_Size_Uniformity    3.1345        6.7983      1.3026
Cell_Shape_Uniformity   3.2074        6.7296      1.4464
Marginal_Adhesion       2.8069        5.7339      1.3433
Single_Epi_Cell_Size    3.216         5.4721      2.088
Bare_Nuclei             3.5447        7.874       1.38
Bland_Chromatin         3.4378        6.103       2.1052
Normal_Nucleoli         2.867         6.0773      1.2618
Mitoses                 1.5894        2.5494      1.1094

=== Model and evaluation on training set ===

Clustered Instances
0    233 (33%)
1    466 (67%)

Interpretation:

1. The cluster structure has not changed.

2. However, the share of instances in each cluster is slightly changed, from 35-65% to 33-67%. So, there is more error of Type-2; i.e., more cases are marked in the ‘benign’ category than is actually the case.

Sensitivity analysis of Clustering #2: Maybe there are cases that are not fully malignant, but are not truly benign either. So, change the number of clusters to 3, instead of 2. Run the analysis again.

Results:

Within cluster sum of squared errors: 227.7071391007967
Missing values globally replaced with mean/mode

Cluster centroids:
                                      Cluster#
Attribute               Full Data     0           1           2
                        (699)         (222)       (178)       (299)
=================================================================
Clump_Thickness         4.4177        7.1982      5.0337      1.9866
Cell_Size_Uniformity    3.1345        6.964       1.7584      1.1104
Cell_Shape_Uniformity   3.2074        6.8829      2.0169      1.1873
Marginal_Adhesion       2.8069        5.9144      1.7191      1.1472
Single_Epi_Cell_Size    3.216         5.5045      2.4494      1.9732
Bare_Nuclei             3.5447        7.9578      1.8381      1.284
Bland_Chromatin         3.4378        6.2027      2.5225      1.9298
Normal_Nucleoli         2.867         6.1937      1.7303      1.0736
Mitoses                 1.5894        2.6126      1.191       1.0669

Interpretation of results:

1. The cluster structure has obviously changed since the number of clusters has changed. It is clear that the real split has been in the benign group.

2. A significant number of benign instances seem to have fallen into an intermediate/borderline category. Some marginally malignant cases have also fallen into this same category. These cases may need to be put under extra scrutiny.
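To see what the K-means algorithm does behind the Weka screen, here is a minimal sketch of its two alternating steps in plain Python. It is not Weka's implementation: the function name `kmeans`, the six two-dimensional sample points, and the starting centroids are all assumptions chosen so that the result is easy to verify by hand.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignment)
print(centroids)
```

The number of clusters is fixed by the number of starting centroids, which is why the sensitivity analyses above rerun the algorithm with a different choice.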

Exercise 4: Association Rules using Apriori algorithm

ASSOCIATION RULES

Nature of problem/opportunity: Understand the underlying associations among commercial aspects of life of foreign workers in Germany.

Data Set used: Credit-g.arff. This shows data about demographics, job type, assets, and credit class of workers in Germany. It shows 17 variables for 1000 German workers.

Data preparation: Load the data set. Ensure all non-nominal variables are removed from the analysis, because association rules work only on nominal data.

Analysis used: Apriori algorithm. Choices include changing the minimum level of confidence in a rule (say 90%), and the minimum support level (10%).
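To make the two thresholds concrete, here is a minimal sketch in Python (not Weka) of how support and confidence are computed for a single rule. The five records are made up for illustration and are not taken from Credit-g.arff; the function names are also hypothetical.

```python
# Hypothetical toy records: each record is a set of attribute=value items.
records = [
    {"housing=own", "job=skilled", "class=good"},
    {"housing=own", "job=skilled", "class=good"},
    {"housing=rent", "job=skilled", "class=bad"},
    {"housing=own", "job=unskilled", "class=good"},
    {"housing=rent", "job=skilled", "class=good"},
]

def support(itemset, records):
    """Fraction of records that contain every item in the itemset."""
    return sum(itemset <= record for record in records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records matching the antecedent, the fraction that also match the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)

# Rule under test: housing=own ==> class=good
lhs, rhs = {"housing=own"}, {"class=good"}
sup = support(lhs | rhs, records)
conf = confidence(lhs, rhs, records)

# A rule is kept only if it clears both thresholds chosen for the analysis.
keep = sup >= 0.10 and conf >= 0.90
print(f"support={sup:.2f} confidence={conf:.2f} keep={keep}")
```

In this toy data the rule holds in 3 of 5 records (support 0.60) and in all 3 records where housing=own (confidence 1.00), so it would pass both thresholds.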

Output of the analysis:

Instances: 1000

Attributes: 11
    checking_status
    credit_history
    purpose
    savings_status
    employment
    personal_status
    property_magnitude
    housing
    job
    own_telephone
    class

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.1 (100 instances)

Minimum metric <confidence>: 0.9

Number of cycles performed: 18

……

Ten Best rules found:

1. housing=for free 108 ==> property_magnitude=no known property 104 conf:(0.96)

2. checking_status=no checking credit_history=critical/other existing credit housing=own 126 ==> class=good 120 conf:(0.95)

3. checking_status=no checking purpose=radio/tv 127 ==> class=good 120 conf:(0.94)

4. checking_status=no checking purpose=radio/tv housing=own 108 ==> class=good 102 conf:(0.94)

5. personal_status=male single property_magnitude=car job=skilled 124 ==> housing=own 117 conf:(0.94)

6. checking_status=no checking personal_status=male single housing=own job=skilled 121 ==> class=good 114 conf:(0.94)

7. checking_status=no checking credit_history=critical/other existing credit 153 ==> class=good 143 conf:(0.93)

8. checking_status=no checking employment=>=7 115 ==> class=good 107 conf:(0.93)

9. personal_status=male single property_magnitude=car class=good 129 ==> housing=own 120 conf:(0.93)

10. checking_status=no checking job=skilled own_telephone=yes 117 ==> class=good 108 conf:(0.92)

Interpreting the Output

1. Rule 1 implies that 96% of those who live in free housing have no known property.
2. Rule 5 implies that single males who hold skilled jobs and own a car are also likely to own a house (94% chance).
3. Rule 9 implies that single males who have a good credit class and own a car are also likely to own a house (93% chance).
4. Rules 5 and 9 overlap heavily. These are two candidates for potentially combining.
5. And so on.
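The percentages quoted above can be recomputed from the two counts printed in each rule: the number of instances matching the left-hand side, and the number matching the whole rule. The Python sketch below (not part of the Weka workflow) does this for all ten rules, using only counts copied from the output; the variable names are illustrative.

```python
N_INSTANCES = 1000
MIN_SUPPORT = 0.1
MIN_CONFIDENCE = 0.9

# (rule number, instances matching left-hand side, instances matching whole rule,
#  confidence as printed by Weka), copied from the output above.
rules = [
    (1, 108, 104, 0.96),
    (2, 126, 120, 0.95),
    (3, 127, 120, 0.94),
    (4, 108, 102, 0.94),
    (5, 124, 117, 0.94),
    (6, 121, 114, 0.94),
    (7, 153, 143, 0.93),
    (8, 115, 107, 0.93),
    (9, 129, 120, 0.93),
    (10, 117, 108, 0.92),
]

results = []
for number, lhs_count, rule_count, printed in rules:
    confidence = rule_count / lhs_count   # e.g. rule 1: 104 / 108
    support = rule_count / N_INSTANCES    # share of all instances covered by the rule
    results.append((number, support, confidence, printed))
    print(f"rule {number:2d}: support={support:.3f} confidence={confidence:.2f}")

# Every rule should clear both thresholds set for the analysis.
all_pass = all(s >= MIN_SUPPORT and c >= MIN_CONFIDENCE for _, s, c, _ in results)
```

For example, rule 1 gives 104 / 108, which rounds to the 0.96 reported, and its support of 0.104 is just above the 10% minimum.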

---***---


Appendix 1: Data Mining Tutorial with R

Developed for academic use only

by Dr. Anil Maheshwari & Mr. Tonmay Bhattacharjee

Basic R tutorial for data mining

Learn the basics:

1. Google “code R” and go to the R Code School website. You can also go directly to http://tryr.codeschool.com.
2. Sign up/register by providing the simple information requested.
3. Follow the simple instructions and practice in the given code window.
4. Finish the step to unlock the next steps.
5. Finish all seven steps and you’ll see a congratulations page.

Install R:

1. Click on the official R programming site or directly visit http://www.r-project.org/
2. Click download R to choose a mirror. This should take you to a page listing the available mirrors.
3. Choose the link for Iowa State University or any other mirror you like.
4. Choose your operating system. In my case it was Windows.
5. Click install R for the first time.
6. Click on download to download the .exe file (for Windows).
7. Click on Save file to save the .exe to your computer.
8. Double click the .exe file for installation.
9. Click Run to start the installation. Follow the steps: click Next, accept the agreement, select your installation folder, and finish the installation.

Coding with R:

Select the R application from your Start menu. The coding style is the same as what you practiced on R Code School.

Decision tree:

1. Load the MASS library, which provides support functions and datasets for Venables and Ripley's MASS, using library("MASS").
2. Convert your .xls or .xlsx file to a .csv file and put it in the Documents folder.
3. Load the data into a variable using read.csv("filename.csv"). In my case I’ve loaded the data into a variable named data using data <- read.csv("height.csv").
4. Load the rpart library for the decision tree using library("rpart").
5. Build the tree and assign it to a variable, e.g. tree <- rpart(gend ~ Height + age + wt, data = data, method = "class"). Here gend, Height, age and wt are column names, and I’m building a decision tree to predict gend based on Height, age and wt. data is the variable into which your csv file was loaded, and method = "class" stands for classification.
6. You can plot the tree using plot(tree).
7. To put labels on the tree you can use text(tree). A simple decision tree should be drawn.
8. To make the tree a little fancier, you can install rpart.plot using install.packages("rpart.plot").
9. Select your mirror for the installation.
10. In the same way, install RColorBrewer using install.packages("RColorBrewer"), and install rattle using install.packages("rattle"). rattle is a free graphical interface for data mining with R.
11. Load the rattle library using library("rattle").
12. Load the rpart.plot library using library("rpart.plot").
13. Load the RColorBrewer library using library("RColorBrewer").
14. Now draw the tree using fancyRpartPlot(tree).

[Figure: example code and the resulting decision tree]

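Correlation and regression:

1. In the same way as described for the decision tree, install the necessary libraries and load the data.
2. Using cov(data) you can see the covariance between each pair of variables; cor(data) gives the corresponding correlations.
3. Using pairs(data) you can draw a scatterplot matrix to see the relationships between the variables.

[Figures: two examples illustrating these steps]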

For any help visit:

http://www.rdatamining.com/docs/introduction-to-data-mining-with-r

Additional Resources

Teradatanetwork.com: Join Teradata University Network to access tools and materials for Business Intelligence. It is completely free for students.

Here are some other books and papers for a deeper dive into the topics covered in this book.

1. Ayres, I. (2007). Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart. Random House Publishing.
2. Davenport, T. & J. Harris (2007). Competing on Analytics: The New Science of Winning. HBS Press.
3. Gartner (2012). Business Implications of Big Data.
4. Gartner (2012). Technology Implications of Big Data.
5. Gordon Linoff & Michael Berry (2011). Data Mining Techniques. 3rd edition. Wiley.
6. Groebner, David F., P.W. Shannon, P.C. Fry (2013). Business Statistics (9th edition). Pearson.
7. Jain, Anil K. (2008). “Data Clustering: 50 Years Beyond K-Means.” 19th International Conference on Pattern Recognition.
8. Lewis, Michael (2004). Moneyball: The Art of Winning an Unfair Game. Norton & Co.
9. Andrew D. Martin et al. (2004). “Competing Approaches to Predicting Supreme Court Decision Making.” Perspectives on Politics.
10. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
11. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. Mckinsey.com
12. Sathi, Arvind (2011). Customer Experience Analytics: The Key to Real-Time, Adaptive Customer Relationships. Independent Publishers Group.
13. Sharda, R., D. Delen, and E. Turban (2014). Business Intelligence and Data Analytics. 10th edition. Pearson.
14. Shmueli, G., N. Patel, & P. Bruce (2010). Data Mining for Business Intelligence. Wiley.
15. Siegel, Eric (2013). Predictive Analytics. Wiley.
16. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don’t. Penguin Press.
17. Statsoft. www.statsoft/textbook
18. Taylor, James (2011). Decision Management Systems: A Practical Guide to Using Business Rules and Predictive Analytics (IBM Press). Pearson Education.
19. Weka system. http://www.cs.waikato.ac.nz/ml/weka/downloading.html
20. Witten, I., E. Frank, M. Hall (2009). Data Mining. 3rd edition. Morgan Kaufmann.


Advance Praise for this book:

“This book is a splendid and valuable addition to this subject. The whole book is well written and I have no hesitation to recommend that this can be adapted as a textbook for graduate courses in Business Intelligence and Data Mining.” Dr. Edi Shivaji, Des Moines, Iowa, USA.

“Really well written and timely as the World gets in the Big Data mode! I think this can be a good bridge and primer for the uninitiated manager who knows Big Data is the future but doesn't know where to begin!” – Dr. Alok Mishra, Singapore.

“This book has done a great job of taking a complex, highly important subject area and making it accessible to everyone. It begins by simply connecting to what you know, and then bang - you've suddenly found out about Decision Trees, Regression Models and Artificial Neural Networks, not to mention cluster analysis, web mining and Big Data.” – Ms. Charmaine Oak, United Kingdom.

“As a complete novice to this area just starting out on a MBA course I found the book incredibly useful and very easy to follow and understand. The concepts are clearly explained and make it an easy task to gain an understanding of the subject matter.” – Mr. Craig Domoney, South Africa.

About the Author

Dr. Anil Maheshwari is a Professor of Management Information Systems at Maharishi University of Management, and the Director of their Center for Data Analytics. He teaches courses in data analytics, and helps researchers with extracting deep insights from their data. He worked in a variety of leadership roles at IBM in Austin TX, and has also worked at many other companies including startups. He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from Indian Institute of Technology in Delhi, an MBA from Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of Transcendental Meditation technique. He blogs interesting stuff at anilmah.wordpress.com

Table of Contents

Preface
Chapter 1: Wholeness of Data Analytics
    Business Intelligence
    Caselet: MoneyBall - Data Mining in Sports
    Pattern Recognition
    Data Processing Chain
        Data
        Database
        Data Warehouse
        Data Mining
        Data Visualization
    Organization of the book
    Review Questions

Section 1
Chapter 2: Business Intelligence Concepts and Applications
    Caselet: Khan Academy – BI in Education
    BI for better decisions
    Decision types
    BI Tools
    BI Skills
    BI Applications
        Customer Relationship Management
        Healthcare and Wellness
        Education
        Retail
        Banking
        Financial Services
        Insurance
        Manufacturing
        Telecom
        Public Sector
    Conclusion
    Review Questions
    Liberty Stores Case Exercise: Step 1
Chapter 3: Data Warehousing
    Caselet: University Health System – BI in Healthcare
    Design Considerations for DW
    DW Development Approaches
    DW Architecture
    Data Sources
    Data Loading Processes
    Data Warehouse Design
    DW Access
    DW Best Practices
    Conclusion
    Review Questions
    Liberty Stores Case Exercise: Step 2
Chapter 4: Data Mining
    Caselet: Target Corp – Data Mining in Retail
    Gathering and selecting data
    Data cleansing and preparation
    Outputs of Data Mining
    Evaluating Data Mining Results
    Data Mining Techniques
    Tools and Platforms for Data Mining
    Data Mining Best Practices
    Myths about data mining
    Data Mining Mistakes
    Conclusion
    Review Questions
    Liberty Stores Case Exercise: Step 3
Chapter 5: Data Visualization
    Caselet: Dr Hans Gosling - Visualizing Global Public Health
    Excellence in Visualization
    Types of Charts
    Visualization Example
    Visualization Example phase-2
    Tips for Data Visualization
    Conclusion
    Review Questions
    Liberty Stores Case Exercise: Step 4

Section 2
Chapter 6: Decision Trees
    Caselet: Predicting Heart Attacks using Decision Trees
    Decision Tree problem
    Decision Tree Construction
    Lessons from constructing trees
    Decision Tree Algorithms
    Conclusion
    Review Questions
    Liberty Stores Case Exercise: Step 5
Chapter 7: Regression
    Caselet: Data driven Prediction Markets
    Correlations and Relationships
    Visual look at relationships
    Regression Exercise
    Non-linear regression exercise
    Logistic Regression
    Advantages and Disadvantages of Regression Models
    Conclusion
    Review Exercises:
    Liberty Stores Case Exercise: Step 6
Chapter 8: Artificial Neural Networks
    Caselet: IBM Watson - Analytics in Medicine
    Business Applications of ANN
    Design Principles of an Artificial Neural Network
    Representation of a Neural Network
    Architecting a Neural Network
    Developing an ANN
    Advantages and Disadvantages of using ANNs
    Conclusion
    Review Exercises
Chapter 9: Cluster Analysis
    Caselet: Cluster Analysis
    Applications of Cluster Analysis
    Definition of a Cluster
    Representing clusters
    Clustering techniques
    Clustering Exercise
    K-Means Algorithm for clustering
    Selecting the number of clusters
    Advantages and Disadvantages of K-Means algorithm
    Conclusion
    Review Exercises
    Liberty Stores Case Exercise: Step 7
Chapter 10: Association Rule Mining
    Caselet: Netflix: Data Mining in Entertainment
    Business Applications of Association Rules
    Representing Association Rules
    Algorithms for Association Rule
    Apriori Algorithm
    Association rules exercise
    Creating Association Rules
    Conclusion
    Review Exercises
    Liberty Stores Case Exercise: Step 8

Section 3
Chapter 11: Text Mining
    Caselet: WhatsApp and Private Security
    Text Mining Applications
    Text Mining Process
    Term Document Matrix
    Mining the TDM
    Comparing Text Mining and Data Mining
    Text Mining Best Practices
    Conclusion
    Review Questions
Chapter 12: Web Mining
    Web content mining
    Web structure mining
    Web usage mining
    Web Mining Algorithms
    Conclusion
    Review Questions
Chapter 13: Big Data
    Caselet: Personalized Promotions at Sears
    Defining Big Data
    Big Data Landscape
    Business Implications of Big Data
    Technology Implications of Big Data
    Big Data Technologies
    Management of Big Data
    Conclusion
    Review Questions
Chapter 14: Data Modeling Primer
    Evolution of data management systems
    Relational Data Model
    Implementing the Relational Data Model
    Database management systems (DBMS)
    Structured Query Language
    Conclusion
    Review Questions

Appendix 1: Data Mining Tutorial with Weka
Appendix 1: Data Mining Tutorial with R
Additional Resources

