Getting Started With Business Analytics

    3 Data Mining in a Nutshell

The purpose of this chapter is to introduce the reader to the main concepts and data mining tools used in business analytics. Methods are described at a non-technical level, focusing on the idea behind the method, how it is used, advantages and limitations, and when the method is likely to be of value to business objectives.

The goal is to transform data into information, and information into insight.

– Carly Fiorina (1954–), President of Hewlett Packard, 1999–2005

    3.1 What Is Data Mining?

Data mining is a field that combines methods from artificial intelligence, machine learning¹, statistics, and database systems. Machine learning and statistical tools are used for the purpose of learning through experience, in order to improve future performance. In the context of business analytics, data mining is sometimes referred to as "advanced analytics"².

¹ Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases.

² We avoid referring to analytics as "simple" or "advanced" as these terms more appropriately describe the usage level of analytics.

We want machines that are able to learn for several reasons. From large amounts of data, hidden relationships and correlations can be extracted. Scenarios such as changing environments highlight the need for machines that can learn how to cope with changing surroundings. Computer learning algorithms that are not produced by detailed human design but by automatic evolution can accommodate a constant stream of new data and information related to a task.

Data mining focuses on automatically recognizing complex patterns from data, to project likely outcomes. Learning is defined as the acquisition of knowledge or skill through experience. In data mining, we train computational methods to learn from data for the purpose of applying the knowledge to new cases.

The main challenge of data mining techniques is the ability to learn from a finite set of samples (data) and be able to generalize and produce useful output on new cases (scoring).

Within data mining, algorithms are divided into two major types of learning approaches:

Supervised learning: we know the labels or outcomes for a sample of records, and we wish to predict the outcomes of new or future records. In this case, the algorithm is trained to detect patterns that relate inputs and the outcome. This relationship is then used to predict future or new records. An example is predicting the next move in a chess game.

The fundamental approach in supervised learning is based on training the model and then evaluating its performance. Data are therefore typically segmented into three portions (a short partitioning sketch follows this list):

Training data: the data used for training the data mining algorithm or model.

Validation data: used to tweak models and to compare performance across models. The rule of thumb is 80–20 for training and validation data.

Test data (or hold-out data): used to evaluate the final model's performance, based on its ability to perform on new, previously unseen data.

Unsupervised learning: we have a sample of records, each containing a set of measurements but without any particular outcome of interest. Here the goal is to detect patterns or associations that help find groups of records or relationships between measurements, to create insights about relationships between records or between measurements. An example is Amazon's recommendation system that recommends a set of products based on browsing and purchase information.
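As a rough illustration of the three-way data partitioning described above for supervised learning, here is a minimal Python sketch; the data, the number of columns and the split proportions are hypothetical, not taken from the book:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and binary outcome y (e.g., churn labels).
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve out a 20% test (hold-out) set, then split the remainder
# into training and validation portions (roughly 80/20 again).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.2, random_state=1)
```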

    3.2 Predictive Analytics

This set of tools includes a wide variety of methods and algorithms from statistics and machine learning. We cover a few of the most popular predictive analytics tools. Interested readers can obtain information about further methods or further technical details from more specialized books.

Figure 3.1: The difference between supervised and unsupervised problems. In the supervised learning task we try to classify the object as a chair (+1) or an apple (−1). In the unsupervised learning case, we try to measure the similarity of an item to other items.

"Computers are useless. They can only give you answers."

– Pablo Picasso (1881–1973)

    Supervised Learning

In supervised learning, for each record we have a set of input measurements as well as a known target or outcome measurement. For example, in a customer database of mobile phone users, where we are interested in modeling customer churn, a record is a customer. For each customer, input measurements can include demographic information as well as call and billing history. A possible outcome measurement is whether the customer stays with the company for at least a year.

The purpose of supervised learning methods is to find a relationship between the input measurements and the outcome measurement. In the mobile customer churn example, we are looking for a relationship between customer attributes and behavior and their attrition.

Another classic example of a supervised learning task is the prediction of spam (unsolicited email) for the purpose of spam filtering. Each record is an email message, for which we have multiple input measurements such as the sender address, the title, and text length. The outcome of interest is a label of spam or non-spam.

In the above examples of customer churn and spam, the outcome measurement is categorical: whether a customer stays or not, or whether an email is spam or not. This type of outcome is called a class. Predicting the outcome is therefore called classification.

Supervised learning includes scenarios where the outcome measurement is either categorical or numerical. Some examples of a numerical outcome are predicting the duration of service calls at a call center based on input measurements that are available before the call is taken, or predicting the amount of cash withdrawn in each ATM transaction before the actual amount is keyed in by the customer. When the outcome is numerical, the supervised learning task is called prediction³.

³ In machine learning, the term used for predicting a numerical outcome is regression.


The following supervised learning techniques are used for classification and/or prediction. The various methods, each with strengths and weaknesses, approach the task of detecting potential relationships between the input and outcome measurements differently.

k-Nearest Neighbors (k-NN)

k-nearest neighbors (k-NN) algorithms are useful for both classification and prediction. They can be used to predict categorical and numerical outcomes. The algorithm identifies k records in the training set that are most similar to the record to be predicted, in terms of input measurements. These k neighbors are then used to generate a prediction of the outcome for the record of interest. If the outcome is categorical, we let the neighbors vote to determine the predicted class of the record of interest. If the outcome is numerical, we simply take an average of the neighbors' outcome measurement to obtain the prediction.

The nearest neighbors approach is what real estate agents tend to instinctively use when pricing a new property. They seek similar properties in terms of size, location and other features and then use these reference properties to price the new property.

Consider the mobile customer churn example for predicting how likely a new customer is to stay with the company for at least one year. The k-nearest-neighbors algorithm searches the customer database for a set of k customers similar to the to-be-predicted customer in terms of demographic, calling and billing profiles. The algorithm then considers the churn behavior of the k neighbors and uses the most popular class (churn/no churn) to predict the class of the new customer. If we are interested in a probability of churn, the algorithm can compute the percentage of neighbors who churned.
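A minimal sketch of this churn example in Python, using scikit-learn's k-NN classifier; the customer features, their values and the neighborhood size are invented for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: columns are age, monthly bill, call minutes.
X_train = np.array([[34, 55.0, 310], [52, 80.5, 120], [23, 35.0, 500],
                    [41, 60.0, 200], [29, 45.5, 420], [60, 90.0, 100]])
y_train = np.array([0, 1, 0, 1, 0, 1])           # 1 = churned, 0 = stayed

scaler = StandardScaler().fit(X_train)           # rescale so no feature dominates
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)

new_customer = np.array([[30, 50.0, 400]])
print(knn.predict(scaler.transform(new_customer)))        # majority vote: churn / no churn
print(knn.predict_proba(scaler.transform(new_customer)))  # share of neighbors who churned
```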

In the call-center call duration example, we want to predict the duration of an incoming call before it begins. The k-NN algorithm searches the historic database for k calls with similar features (information available on the caller, call time, etc.). The average call duration of these k similar calls is then the predicted duration for the new call.

To illustrate the k-NN algorithm graphically, consider the example of predicting whether an online auction will be competitive or not. A competitive auction is one that receives more than a single bid. Using a set of over 1,000 eBay auctions, we examine two input measurements in each auction: the seller rating (where higher ratings indicate more experience) and the opening price set by the seller.

The relationship between the auction competitiveness outcome and these two inputs is shown in Figure 3.2. Suppose that we want to predict the outcome for a new auction, given the seller rating and opening price. This new record is denoted by a question mark in the chart. The k-NN algorithm searches for the k nearest auctions. In this case k was chosen to be 7. Among the seven neighbors, five were competitive auctions; the predicted probability of this auction being competitive is therefore 5/7. If we use a majority rule to generate a classification, then the five competitive auctions are the majority of the seven neighboring auctions, and k-NN classifies the new auction as being competitive.

Figure 3.2: Competitive auctions (black circles) and non-competitive auctions (gray squares) as a function of seller rating and opening price in eBay auctions. k-nearest neighbors classifies a new auction's competitiveness based on k auctions with similar seller ratings and opening prices.

A k-nearest neighbors algorithm requires determining two factors: the number of neighbors to use (k) and the definition of similarity between records. The number of neighbors should depend on the nature of the relationship between the input and outcome measurements in terms of its global versus local nature. In a global pattern, the same relationship holds across all input values, whereas in local patterns different relationships exist for different values of the inputs.

In the mobile churn example, if churn decreases in age regardless of other demographics or billing features, then we can say that there is a global relationship between churn and age. However, if churn decreases in age only for heavy callers but increases for low-volume callers, then the relationship between churn and age is local. A small number of neighbors is better for capturing local relationships (only a small set of very close neighbors would be similar to the record of interest), whereas in global relationships a large number of neighbors leads to more precise predictions.

The choice of k is typically done automatically. The algorithm is run multiple times, each time varying the value of k (starting with k = 2) and evaluating the predictive accuracy on a validation set. The number of neighbors that produces the most accurate predictions on the validation set is chosen.
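A hedged sketch of this automated choice of k, looping over candidate values and scoring each on a validation set; the data are synthetic and accuracy is assumed as the evaluation measure:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                     # hypothetical normalized inputs
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

best_k, best_acc = None, -1.0
for k in range(2, 21):                            # try k = 2 ... 20
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    if acc > best_acc:
        best_k, best_acc = k, acc
print("chosen k:", best_k, "validation accuracy:", round(best_acc, 3))
```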

Similarity between records can be determined in many ways. Records are compared in terms of their input measurements. The similarity metric most commonly used in k-NN algorithms is Euclidean distance. To measure the distance between two records, we look at each input measurement separately and measure the squared difference between the two records. We then take a sum of all the squared differences across the various input measurements. This is the Euclidean distance between two records.

For example, the Euclidean distance between two auctions is computed by summing up the squared difference between the pair of seller ratings and the squared difference between the pair of opening prices. You may have noticed that computing a Euclidean distance in this way will produce a similarity measure that gives much more weight to input measurements with large scales (such as seller ratings, compared to opening prices). For this reason, it is essential to first normalize the input measurements before computing Euclidean distances. Normalizing can be done in different ways. Two common normalizing approaches are converting all scales to a [0,1] scale or subtracting the mean and dividing by the standard deviation. While similarity between records can be measured in different ways, Euclidean distance is appealing because of its computational efficiency.
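The following sketch illustrates why normalization matters when computing Euclidean distance; the two auction records and the assumed column minima and maxima are hypothetical:

```python
import numpy as np

# Two hypothetical auctions: [seller rating, opening price].
a = np.array([480.0, 1.50])
b = np.array([305.0, 4.99])

# Without rescaling, the rating (large scale) dominates the distance.
raw_dist = np.sqrt(np.sum((a - b) ** 2))

# Min-max normalization to [0, 1] using assumed column minima/maxima.
col_min = np.array([0.0, 0.01])
col_max = np.array([1000.0, 100.0])
a_n = (a - col_min) / (col_max - col_min)
b_n = (b - col_min) / (col_max - col_min)
norm_dist = np.sqrt(np.sum((a_n - b_n) ** 2))
print(raw_dist, norm_dist)
```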

In k-NN, computational efficiency is especially important because the algorithm computes the similarity between the to-be-predicted record and each and every record in the training data. Moreover, if we want to predict many new records (such as for a large set of potential customers), the computational task can be heavy.

The Verdict: Among supervised learning methods, k-NN is simple to explain and easy to automate. It can be used for both prediction and classification and is highly data-driven, i.e., there are no assumptions about the nature of the relationship between the outcome and inputs. While k-NN is simple to explain, it produces "black-box" predictions because it is not clear which inputs contribute to the prediction and to what degree. When transparency is needed, k-NN is not an appealing candidate.

One key requirement of k-NN algorithms is sufficient training data. k-NN must be able to find a sufficient number of close neighbors to produce accurate predictions. Unfortunately, the number of required records increases exponentially in the number of input measurements, a problem called "the curse of dimensionality". Another challenge that k-NN faces is computational: the time to find the nearest neighbors in a large training dataset can be prohibitive. While there are various tricks to try to address the curse of dimensionality and the computational burden, these two issues must be considered as inherent challenges within k-NN.

Classification and Regression Trees (CART)

Classification and regression trees are supervised learning algorithms that can be used for both classification (classification trees) and prediction (regression trees). Like k-NN, the idea is to define neighborhoods of similar records, and to use those neighborhoods to produce predictions or classifications for new records. However, the way that trees determine neighborhoods is very different from k-NN. In particular, trees create rules that split data into different zones based on input measurements, so that each zone is dominated by records with a similar outcome. In the eBay auctions example, we might have a rule that says "IF the opening price is below $1 AND the seller rating is above 100, THEN the auction is competitive."

To create rules, tree algorithms sequentially split input predictors into two ranges: above and below some value. The algorithm starts by looking for the best split, one that produces two sets of records that are each as homogeneous as possible in terms of the outcome. Finding the best split requires trying different values across all the different input measurements. Once the first split is determined, the algorithm searches for another split, which will further break down the two data groups into three sets. The next step is finding a third split, and so forth.

The splitting process is illustrated graphically in Figure 3.3 for two input measurements, using the eBay auctions example. The first split uses the opening price, and creates two zones: above and below $1.23. We can see that the lower zone is dominated by competitive auctions, while the upper zone is more balanced, but contains more non-competitive auctions. The second split further separates the high opening price zone (above $1.23) by seller rating, with a high seller rating zone (above 300) and a low seller rating zone (300 or below). This second split breaks down the upper zone into two strips, one with mostly competitive auctions and the other with mostly non-competitive auctions. The third split separates the "high opening price, high seller rating" zone further, by opening price (above/below $5).

Displaying the results in a scatter plot with splits as in Figure 3.3 is no longer possible once additional input measurements are introduced. However, there is an alternative powerful chart that can be used to display the resulting splits, in the form of a tree. The tree for this same example is shown in Figure 3.4. Starting from top to bottom, each layer of the tree represents a split, in the order it occurred. The rectangles are called "leaf nodes" or "terminal nodes", and they represent the outcome of the records that fall in that zone. For example, 89% of the auctions with an opening price below $1.23 are competitive (as can be seen in Figure 3.4).

To convert the leaf node probabilities to competitive/non-competitive classifications, we specify a majority threshold. The default threshold of 50% means that a leaf node with fewer than 50% competitive auctions will lead to a non-competitive classification (such as the right-most leaf node in Figure 3.4), whereas a leaf node with 50% or more competitive auctions will lead to a competitive classification. For a numerical outcome, such as call duration, the leaf node label is typically the average outcome of records in that leaf node.
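A small sketch of a classification tree in the spirit of the auction example, using scikit-learn; the simulated data and the toy labeling rule are assumptions, not the book's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical eBay-style data: [opening price, seller rating] -> competitive (1) or not (0).
X = np.column_stack([rng.uniform(0.01, 10, 1000), rng.integers(1, 1000, 1000)])
y = ((X[:, 0] < 1.23) | (X[:, 1] < 300)).astype(int)   # toy rule for illustration only

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["OpenPrice", "SellerRating"]))  # splits as rules

# Leaf probability for a new auction, then a 50% majority threshold.
p_competitive = tree.predict_proba([[0.99, 450]])[0, 1]
print("competitive" if p_competitive >= 0.5 else "non-competitive", p_competitive)
```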

A tree can easily be translated into a set of logical rules that relate the outcome of interest to the input measurements. In our example, using a 50% majority threshold, we have four rules:

1. IF the opening price is below $1.23, THEN the auction is competitive.

2. IF the opening price is above $1.23, AND the seller rating is below 300, THEN the auction is competitive.

3. IF the opening price is above $1.23, AND the seller rating is above 300, AND the opening price is below $5, THEN the auction is competitive.

4. IF the opening price is above $1.23, AND the seller rating is above 300, AND the opening price is above $5, THEN the auction is non-competitive.

Obviously, these rules can be further compressed into a smaller set of rules (IF the opening price is above $5 AND the seller rating is above 300, THEN the auction is non-competitive; otherwise it is competitive).

Figure 3.3: Competitive auctions (black circles) and non-competitive auctions (grey squares) as a function of seller rating and opening price in eBay auctions. The classification tree splits the auctions first by opening price (above/below $1.23), then the high opening price zone is split by seller rating (above/below 300), and then the high seller rating zone is further split by opening price (above/below $5). Each of the four zones has a majority of auctions with similar outcomes (competitive/non-competitive).

Figure 3.4: Displaying the algorithm's splitting results in the form of a tree.

Using a tree to generate predictions or classifications is straightforward: simply drop a new record at the top of the tree and find out in which leaf node it lands. The leaf node then determines the class or prediction for that record.

The two factors that must be determined in a tree algorithm are the function measuring homogeneity of a set of records in terms of the outcome, and the size of the tree in terms of splits. Typical homogeneity measures are the famous Gini index for classification tasks, and the standard deviation for prediction tasks. Most software packages will have defaults and some allow users to choose between different homogeneity measures.

Determining tree size is an important step. Too many splits result in over-fitting the training data. An extreme example is when each zone contains a single class. This will typically require many splits and the final zone will contain very few records. Most tree algorithms use an automated approach, where validation data are used to determine the tree size, thereby avoiding over-fitting. A tree that produces predictions on the training data that are much more accurate than predictions on the validation data is clearly over-fitting. Algorithms are designed to avoid this scenario.

The Verdict: CART has proven useful in a wide range of business applications. Its biggest strength among data-driven algorithms is transparency and interpretability. The ability to understand the relationship between a prediction and the input measurements is crucial in applications such as insurance underwriting, as well as for increasing the comfort level of users who are not analytics-savvy. Trees are often used in applications with a large number of potential input measurements, for the purpose of ranking the importance of the input measurements.

Trees are highly automated. They do not require any user input, and do not make assumptions about the type of relationship between the outcome and input measurements. Because of their data-driven nature, they require a large number of records for training, and a separate set for validation. Trees are robust to extreme records, and can be applied to data with missing values. However, a small change in the data can result in a different tree. In terms of computation, trees can be computationally expensive, increasingly so as the number of input measurements increases.

Finally, extensions of trees such as random forests and boosted trees improve prediction accuracy and stability. These extensions are based on creating multiple trees from the data and combining them to produce better predictions.

Regression Models

CART and k-NN are both data-driven algorithms, where users need not specify the form of the relationship between the outcome and the input measurements. These methods therefore require large amounts of training data, and are subject to over-fitting. A different approach is to model the relationship between the outcome and input measurements via a statistical model of the form:

Outcome = β0 + β1 × Input1 + β2 × Input2 + . . .

We call this type of model a regression model.

Consider the example of a car manufacturer interested in predicting the cost of the three-year warranty it provides to customers. The manufacturer has access to a large dataset of historical warranty claims, with detailed information on each claim. The challenge is to determine which input measurements are predictive of warranty cost and how to use this information to predict the warranty cost for new cars. Figure 3.5 illustrates a regression model for modeling the relationship between the warranty cost (outcome) and the vehicle odometer reading (single input measurement)⁴. The straight line denotes the regression equation, and we see that the specific regression formula estimated from the training data is:

Warranty Cost = 174.45 + 0.01 × Odometer Reading,

where cost is in USD and odometer reading is in miles. The regression tells us that the higher the odometer reading, the higher the warranty cost. Moreover, each additional mile on the odometer increases the predicted cost by one cent ($0.01). While it is difficult to display a model of warranty cost on more than a single input measurement using a chart, regression models do scale up to cases of multiple input measurements.

⁴ The data for this illustration is a sample from the "Don't Get Kicked" competition on www.kaggle.com.


Figure 3.5: A regression model for predicting warranty cost of a vehicle by its odometer reading. Each point corresponds to a vehicle.
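As a rough sketch of fitting such a regression model, the following simulates odometer/cost data around the formula quoted above and estimates the intercept and slope; the data are simulated, not the Kaggle sample used in the book:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical warranty data: odometer reading (miles) and warranty cost (USD).
odometer = rng.uniform(10_000, 100_000, size=500).reshape(-1, 1)
cost = 174.45 + 0.01 * odometer.ravel() + rng.normal(scale=50, size=500)

model = LinearRegression().fit(odometer, cost)
print(model.intercept_, model.coef_[0])          # estimated formula parameters
print(model.predict([[60_000]]))                 # predicted cost for a 60,000-mile vehicle
```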

Although a linear relationship is a rough approximation of a more complicated relationship, it may be sufficient for producing sufficiently accurate predictions. Moreover, variations of the linear model above can yield input-outcome functions that are non-linear, such as exponential or polynomial. For a categorical outcome with two or more classes, regression formulas are available that link class probability with the inputs. A popular regression model for a categorical outcome is the logistic regression model.
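A brief hedged sketch of a logistic regression for a categorical outcome, again on invented auction-style data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical auction data: [opening price, seller rating] -> competitive (1) or not (0).
X = np.column_stack([rng.uniform(0.01, 10, 500), rng.uniform(1, 1000, 500)])
y = (X[:, 0] < 1.23).astype(int)                 # toy labels for illustration only

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[0.99, 450]])[0, 1])    # estimated probability of a competitive auction
```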

By specifying a mathematical function that relates outcome to input measurements, we significantly reduce the computational burden and use the data more efficiently. Rather than using the data to detect the relationship structure, we specify the structure and use the data only to estimate the function parameters (β0, β1, β2, . . .).

The Verdict: Regression models are popular predictive tools. They can be used for classification and prediction and can capture linear, non-linear or any pre-specified relationship functions between the outcome and input measurements. In addition to providing predictions, regression models also provide information about the importance of different input measurements through quantifiable coefficients (β values). Like CART, regression models are considered transparent and interpretable. They do not require large amounts of training data. By pre-specifying an approximate formula (such as a linear function) smaller amounts of data are needed. The main weakness of regression models is the need for users to specify the formula linking the outcome to the inputs.

Ensembles

Given the variety of predictive algorithms, each with its strengths and weaknesses, one approach is to compare the results and choose the algorithm that yields the best results. The algorithm with the most precise predictions is not necessarily the best. Other considerations such as operational costs, future software availability or run time may lead to a different choice of algorithm.

An alternative approach is to combine results from several algorithms. Combining results can be done in different ways. A simple approach is to average the results. For example, for predicting warranty cost of a new vehicle, we can take the average of the cost predictions generated by different algorithms. For a classification task such as churn/no-churn, we can classify a new customer by a majority vote of class predictions by different algorithms.
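A minimal sketch of averaging and voting across algorithms; the individual algorithms' predictions here are made-up placeholders:

```python
import numpy as np

# Hypothetical warranty-cost predictions for five new vehicles from three different algorithms.
knn_pred  = np.array([510.0, 260.0, 395.0, 720.0, 180.0])
tree_pred = np.array([480.0, 300.0, 410.0, 690.0, 210.0])
reg_pred  = np.array([505.0, 280.0, 370.0, 700.0, 190.0])
ensemble_cost = (knn_pred + tree_pred + reg_pred) / 3    # simple average for a numerical outcome

# Majority vote for a classification task (1 = churn, 0 = no churn).
votes = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1]])      # rows: customers, columns: algorithms
ensemble_class = (votes.sum(axis=1) >= 2).astype(int)
print(ensemble_cost, ensemble_class)
```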

The averaging or voting operations are easily automated, thereby producing a hybrid algorithm, typically called an ensemble. Ensembles rely on the same principle as diversified financial portfolios: they reduce risk. In the context of prediction, this means that ensembles produce more precise predictions. This phenomenon has been demonstrated in many real world cases. Ensembles played a major role in the million-dollar Netflix Prize contest⁵, where teams competed in creating the most accurate predictions of movie preferences by users of the Netflix DVD rental service. Different teams ended up joining forces to create ensemble predictions, which proved more accurate than the individual predictions. The winning "BellKor's Pragmatic Chaos" team combined results from the "BellKor" and "Big Chaos" teams and additional members. In a 2010 article in Chance magazine, the Netflix Prize winners described the power of their ensemble approach [2]:

An early lesson of the competition was the value of combining sets of predictions from multiple models or algorithms. If two prediction sets achieved similar RMSEs, it was quicker and more effective to simply average the two sets than to try to develop a new model that incorporated the best of each method. Even if the RMSE for one set was much worse than the other, there was almost certainly a linear combination that improved on the better set.

⁵ www.netflixprize.com

Another way to create ensembles is to split the data into multiple sub-samples, to run a particular algorithm separately on each sample, and then to combine the results. There are also various ways to combine results. For example, instead of an average we can take a weighted average that gives more weight to more precise predictions. Lastly, some data mining algorithms are themselves ensembles. For example, random forests are an ensemble of classification or regression trees.

The Verdict: Ensembles, or combining algorithm results, are a useful way to generate more precise and more robust predictions. The only caveat with the ensemble approach is that it requires more resources than a single algorithm. It requires running each of the algorithms at the production stage and whenever new predictions are needed. Constraints such as run time or future software availability should therefore be carefully considered.

    Unsupervised Learning

In unsupervised learning, we have a set of measurements for each record. As in supervised learning, it is important to define what a record is, because different definitions will lead to different data structures. For example, from a bank's database we can extract a dataset of customers, where a set of measurements is available for each customer: demographic information, banking history (number of monthly ATM transactions, teller transactions, loans, etc.). Alternatively, we can extract a dataset of single transactions, where measurements are given on each transaction (time of day, type of transaction, amount, etc.). A third option is to define a record as a teller encounter, where measurements are the types of products or services rendered in that encounter as well as the customer data.

The purpose of unsupervised learning methods is to find relationships either between measurements or between records. Note that in contrast to supervised learning, we do not differentiate between input and outcome measurements. In the bank customer dataset, for instance, we might be interested in discovering different types of customers in terms of their banking history. This popular task is called customer segmentation. In the teller encounter example, we might want to learn which products/services tend to go together, for purposes such as providing product recommendations, determining marketing campaigns, etc. Several algorithms are popular for these various tasks, ranging from collaborative filtering methods to market basket analysis.

A third relationship of interest is figuring out the amount of information overlap in a large set of measurements, usually in an effort to reduce the number of measurements to a smaller, more manageable one. This process is called dimension reduction. In the following sections we describe several popular unsupervised data mining techniques.

Cluster Analysis (Segmentation)

Cluster analysis is an exploratory tool for discovering clusters of similar records. Humans can easily detect similarities and differences between objects in some contexts. For example, we easily distinguish between a cat and a dog, despite the fact that they share many common features (four legs, fur, etc.). Yet, when it comes to many rows of records with multiple measurements, our ability to find similarities and differences between records is challenged.

Clustering algorithms are an automated way of detecting clusters of similar records. Similarity requires defining a metric that compares all the measurements of interest. For example, the similarity between two bank accounts can be measured by comparing the different features of the two accounts such as account type, recent activity, current balance, etc. Various distance metrics, each with advantages and shortcomings, are used to measure the difference between a pair of records.

A popular metric is Euclidean distance, which aggregates the squared difference for each feature (we have discussed this metric earlier in the context of k-NN). For example, bank account #1 has a current balance of $2,500 and the most recent activity is 24 hours ago. Account #2 has a current balance of $3,000 and the most recent activity is 14 hours ago. The Euclidean distance between accounts #1 and #2 is (2,500 − 3,000)² + (24 − 14)² = 250,100. Note that using Euclidean distance requires re-scaling (normalizing) all features so that they have the same scale. Otherwise, features that have units with a wide range, such as current balance, will dominate the distance.

Figure 3.6 graphically illustrates the notion of clustering. The chart compares 821 vehicle records, each represented by a horizontal line (row). For each vehicle, we have four features (from left to right: odometer reading, acquisition cost, whether the vehicle was purchased online, and warranty price). The chart uses color to denote the value on each feature, so that darker color corresponds to higher values. We sorted the vehicles by odometer reading, from high to low. On this chart, we can visually detect a few types of vehicles. At the very top are high-mileage, inexpensive, online-purchased, low-warranty-cost vehicles. At the bottom of the chart we can spot a cluster of vehicles with very low odometer readings, acquisition costs, and warranty costs. Most vehicles were purchased offline (one cluster) and a few online (another cluster).

Note how difficult it is to clearly separate the 821 vehicles into clear-cut clusters based on all four features. For this reason, data mining algorithms are needed. The result of applying a clustering algorithm is shown in Figure 3.7. The algorithm detected four clusters (labeled 10, 6, 8, 9). Clusters 10 and 6 comprise the vehicles that were purchased offline. The main difference between the clusters is the higher warranty cost and odometer readings in cluster 10.

Figure 3.6: Heatmap comparing 821 vehicle records (rows) across four features. Darker color denotes higher values.

Figure 3.7: Results from a clustering algorithm: the 821 vehicle records are grouped into four clusters.
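A hedged sketch of running a clustering algorithm (k-means, one common choice) on four rescaled vehicle-style features; the data are simulated, and the number of clusters is fixed at four only to mirror the example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical vehicle records: odometer, acquisition cost, online purchase flag, warranty cost.
X = np.column_stack([rng.uniform(5_000, 120_000, 821),
                     rng.uniform(2_000, 15_000, 821),
                     rng.integers(0, 2, 821),
                     rng.uniform(500, 3_000, 821)])

X_scaled = StandardScaler().fit_transform(X)     # rescale so no feature dominates the distance
kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_scaled)
print(np.bincount(kmeans.labels_))               # number of vehicles assigned to each cluster
```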

The Verdict: We can think of cluster analysis as a way of compressing the rows of a dataset. Cluster analysis is a useful approach for exploring groupings of records, based on a set of measurements. It is especially advantageous in Big Data, where the number of records and the number and variety of measurements is large.

Some clustering algorithms are designed for very large datasets and can be deployed for high-performance implementation. Note that cluster analysis is an exploratory approach. There are no correct or incorrect results. The key question is always whether the resulting clusters offer insights or lead to insightful decisions. Segmenting records into clusters is useful for various reasons and applications, including differentiated customer targeting (customary in marketing campaigns), deploying different policies to different segments (as in credit card scoring), and more.

Dimension Reduction

One feature of Big Data is the large number of measurements for each record. This richness is the result of the proliferation of digitized data collection mechanisms, including scanners, RFID readers, GPS devices and automated recording of online and mobile footprints. While more measurements may increase the richness of data, they also add a lot of noise due to errors and other factors. Moreover, measurements often contain an overlap of information. For example, age and years since opening bank account are likely to be highly correlated. The challenge is to determine what part of the data contains the information and to separate it out from the noise.

Dimension reduction techniques are aimed at reducing the number of measurements (the dimension of the data) to a smaller set that contains a high ratio of information-to-noise. One approach is to choose a particular set of information-rich measurements and to discard all the other measurements. A different approach is to derive a small set of new measurements, based on the large set of original measurements. This is the basis for popular data mining algorithms such as Principal Components Analysis (PCA), Singular Value Decomposition (SVD), and Factor Analysis.

In PCA, each new measurement, called a principal component, is a weighted average of the original measurements. What is special about these principal components is that they contain no information overlap. In other words, the correlation between each pair of principal components is zero. A small number of principal components (typically less than 10) is sufficient to capture most of the variability in the dataset.

For example, in the vehicle data that we described earlier, running PCA (after rescaling the four measurements) indicates that we can capture approximately 70% of the information that is contained in the four measurements (odometer reading, acquisition cost, whether the vehicle was purchased online, and warranty price) by considering only two principal components: the average of these four measurements and whether the vehicle was purchased online⁶.

⁶ The first principal component is approximately the simple average: 0.63 VehOdo + 0.42 VehBcost + 0.30 IsOnlineSale + 0.58 WarrantyCost. The second principal component is dominated by IsOnlineSale, as can be seen from the computation 0.02 VehOdo + 0.47 VehBcost − 0.88 IsOnlineSale + 0.08 WarrantyCost.
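A short sketch of PCA on four rescaled measurements using scikit-learn; the vehicle-style data are simulated, so the variance shares and component weights will differ from the numbers quoted above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical vehicle measurements: odometer, acquisition cost, online flag, warranty cost.
X = np.column_stack([rng.uniform(5_000, 120_000, 821),
                     rng.uniform(2_000, 15_000, 821),
                     rng.integers(0, 2, 821),
                     rng.uniform(500, 3_000, 821)])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_.sum())   # share of total variability the two components capture
print(pca.components_)                       # each row: weights on the four original measurements
```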

The Verdict: We can think of dimension reduction as compressing the columns of a dataset. Dimension reduction is a crucial step in data mining, because datasets typically include tens, hundreds or even thousands of measurements. Domain knowledge can be used to eliminate measurements that are completely irrelevant. However, by eliminating measurements altogether, some useful information might be lost. Methods such as PCA and SVD approach dimension reduction in a different fashion. They help reduce information overlaps between measurements by creating a smaller set of variables that do not have information overlap. The resulting set of variables is smaller than the original set of measurements, and therefore easier to handle and faster to run in other analyses. However, computing these new variables usually requires access to the entire set of original measurements.

Association Rules (Market-Basket Analysis)

Scanners and online shopping create records of customers' baskets. Each transaction includes a combination of products that are linked to the same record in a single transaction. Services such as movie rentals and telecom plans similarly have baskets for different customers: a basket of movies rented within the same transaction; a basket of mobile telecom services (such as a monthly 300-minute talk time, unlimited SMS and unlimited data). Similar data structures arise in healthcare: a doctor visit produces a combination of symptom data; a hospitalization produces a combination of medical procedures and tests. Online education produces data on groups of courses taken by a student.

Unlike the data structure that we described earlier, where we have a set of measurements (columns) on a set of records (rows), here different records have different measurements. If we think of baskets as records, then different baskets contain different items. This data structure is illustrated in Table 3.2, which shows an example for a fictitious pizza delivery service.

We see that different orders contain a different number of items. We could potentially restructure the data so that each item on the menu is a column and then we would list the item count for each order. However, when there are many different possible items (many SKUs in a grocery store, many possible symptoms, etc.) the latter data structure will contain mostly zeros. Association rules are an algorithm that is designed especially for such circumstances, i.e., a large set of potential items.

Table 3.2: Example orders at a fictitious pizza delivery service.

Order No.   Item 1                   Item 2                   Item 3          Item 4
1           Large Margherita Pizza   2-Liter Pepsi            Small salad
2           Small Veg Pizza          Small Hawaiian Pizza
3           Large Margherita Pizza   Large Margherita Pizza   2-Liter Pepsi   2-Liter Pepsi
4           Large salad              Small Veg Pizza

The purpose of association rules algorithms is to discover relationships between items that co-appear in baskets, in very large transactional datasets. The algorithm does this by searching for large sets of baskets that contain the same item combinations. The result is a set of statements about which item goes with which other items. More accurately, association rules generate IF/THEN rules such as "IF Tylenol, THEN Kleenex".

In the pizza example, discovering a rule such as "IF Large Margherita THEN 2-Liter Pepsi" can be used for coupon offers, discount bundles, stocking, and designing the menu. In other words, the rules are useful for generating common, impersonal decisions.
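A minimal, self-contained sketch of the support/confidence counting behind association rules, applied to the four pizza orders above; the thresholds are arbitrary:

```python
from itertools import combinations
from collections import Counter

# The four pizza orders from Table 3.2, treated as baskets of items.
baskets = [
    {"Large Margherita Pizza", "2-Liter Pepsi", "Small salad"},
    {"Small Veg Pizza", "Small Hawaiian Pizza"},
    {"Large Margherita Pizza", "2-Liter Pepsi"},
    {"Large salad", "Small Veg Pizza"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                                   # share of baskets containing both items
    confidence = count / sum(a in bk for bk in baskets)   # P(b in basket | a in basket)
    if support >= 0.5:
        print(f"IF {a} THEN {b}: support={support:.2f}, confidence={confidence:.2f}")
```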

The Verdict: Association rules are useful for discovering relationships between measurements or items in large datasets. "Large" refers to the number of records as well as to a large number of potential items. The resulting association rules are easy to understand. However, two notes of caution are in order:

With a large dataset, association rules will produce a large number of rules. One danger is discovering rules that do not generalize and instead reflect combinations that are simply due to chance.

Discovered rules can be uninteresting or trivial. For example, discovering that bananas are purchased with many other products in grocery stores in the United States is unsurprising, because bananas are a typical loss item that is offered to lure customers.

While association rules originated from grocery store transactional databases, they can be used in many other contexts and applications, including healthcare, finance, education, telecom, and more. Using association rules requires a large number of baskets and potential items, and stakeholder commitment toward generating common rules.

Collaborative Filtering

Recommender systems are common in almost every domain. They are used for book recommendations on Amazon.com, for music recommendations on personalized radio stations such as LastFM.com and on video sharing sites such as YouTube, for product recommendations in online shopping, and more.

Collaborative filtering is the algorithm behind many online recommender systems. Like association rules, collaborative filtering uses large datasets on baskets that are combinations of products or services that are consumed or purchased together. However, unlike association rules that generate general, impersonal rules, collaborative filtering generates personalized recommendations. The idea is to generate user-specific associations based on information from many other users. The fundamental assumption is that if users rate items similarly, or have similar behaviors (e.g., buying, watching, listening), they will rate or behave similarly on other items [20].

Another difference is that association rules look for "popular" combinations in the sense that the combinations appear many times in the dataset. In contrast, recommender systems look for combinations that are frequent among baskets that have at least one of the items, regardless of how many such baskets there are. This means that association rules ignore non-popular items while recommender systems do not. These non-popular items are what is known as "the long tail".

Collaborative filtering algorithms search for items that have a large percentage of baskets in common. In the pizza example in Table 3.2, a 2-liter Pepsi and a large pizza Margherita share 100% of the baskets that contain either a 2-liter Pepsi or a large pizza Margherita. Even if these are the only two orders in thousands of orders, this coupling will be captured by the collaborative filtering algorithm. When a new order is placed and one of these two items is mentioned, a recommendation of the second item will automatically be generated.

Another difference between association rules and collaborative filtering is in terms of implementation. Association rule algorithms are run retrospectively on a large transactional dataset, and generate rules. The rules are then used in the decision making process for purposes of generating business rules. In contrast, collaborative filtering is run on a "live" and incrementally growing dataset, and generates real-time recommendations to the end users.

The data used in recommendation systems can be user ratings of different items, which require user input. For example, Netflix solicits movie ratings from its users in order to provide them with recommendations. User preference data may also be implicit, as in purchase or consumption of item combinations. An example is book orders on Amazon.com. When ratings are used, user-based collaborative filtering algorithms search for users who rate a set of items similarly and use this group as "neighbors" to generate recommendations for a new similar user. This is the same idea as the k-NN algorithm (Section 3.2), where the predictors are ratings of items and the outcome is an item not yet rated by the user but rated by the neighbors.

Another type of collaborative filtering is item-based. In this case similarity refers to items, so that items with similar user ratings or those purchased together are considered similar. This approach is the basis for Amazon.com's recommender system that produces recommendations of the type "Customers who bought this item also bought . . .".
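A hedged sketch of item-based similarity on a tiny invented user-by-item ratings matrix, using cosine similarity between item columns:

```python
import numpy as np

# Hypothetical user-by-item ratings matrix (0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(u, v):
    # Cosine similarity between two item columns.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

n_items = ratings.shape[1]
sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j]) for j in range(n_items)]
                for i in range(n_items)])

# Items most similar to item 0, excluding itself: candidates to recommend
# to a user who liked item 0.
print(np.argsort(-sim[0])[1:])
```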

A major challenge with collaborative filtering is that most users (rows) only have values or ratings on a small number of items. The resulting users-by-items dataset therefore typically has many missing values. This "sparsity" issue poses various challenges. Because similarity is based on common items (which may be relatively rare), results can be unreliable. One approach to dealing with the many missing values is applying dimension reduction methods.

The Verdict: The personalized nature of collaborative filtering recommendations, and the possibility of implementing them in real-time, makes them desirable techniques for online or other real-time recommendation systems. We noted the sparsity issues, for which various solutions are available. Two conceptual limitations to consider are the lack of recommendations for users who are dissimilar from all other users and the reliance of similarity on combinations of behaviors but not on combinations of non-behaviors (such as the absence of certain symptom combinations). In spite of the above, collaborative filtering is considered one of the most successful recommender system techniques.

    3.3 Forecasting

The term forecasting commonly refers to prediction of future values. While it is similar to supervised learning as described in Section 3.2, forecasting has a few distinguishing features. The first characteristic is the type of data. In forecasting, the outcome measurement is usually a series of values over time, called a time series. Examples include data on quarterly sales, average daily temperature, hourly website traffic, and annual rainfall. Second, the typical goal is to generate a forecast of future values of the time series. For example, forecasting demand in the next four quarters or forecasting weekly stock-outs.

The main assumption underlying forecasting techniques is that behaviors and patterns observed in the past will continue in the future. For example, an increasing trend is expected to increase further in the future. A seasonal pattern will recur in the future. The equivalent of this assumption in the ordinary supervised learning context is that the existing data are similar to the new data that will be scored/predicted.

While the supervised learning methods can be used for forecasting, the temporal nature of both data and goal requires modifications. Let us consider a few key modifications.

    Outcome and Input Measurements

Two forecasting approaches are extrapolation and causal regression. Extrapolation techniques use early measurements of the series as input measurements for later measurements (the outcome). Supervised learning methods such as regression can then be applied to the input and output columns. When the series contains a trend and/or seasonal cycles, special input measurements are created to capture such patterns. For example, to capture a linear trend of an annual demand series, we can model the demand values (the outcome) in a linear regression with the column of year numbers as an input variable.
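A small sketch of both ideas, fitting a linear trend on year numbers and a one-lag extrapolation, on an invented annual demand series:

```python
import numpy as np

# Hypothetical annual demand series (units sold per year).
years = np.arange(2003, 2013)
demand = np.array([120, 132, 138, 151, 160, 172, 180, 195, 204, 215], dtype=float)

# Capture a linear trend: regress demand on the year number.
slope, intercept = np.polyfit(years, demand, deg=1)
print(round(intercept + slope * 2013, 1))        # trend-based forecast for the next year

# Extrapolation with a lagged input: use last year's demand to predict this year's.
lag1 = demand[:-1]                               # input measurement: previous value
target = demand[1:]                              # outcome: current value
b, a = np.polyfit(lag1, target, deg=1)           # simple autoregressive-style fit
print(round(a + b * demand[-1], 1))              # forecast for the year after the series ends
```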

In addition to the input measurements based on the series' past, additional external information can be included as input variables, similar to ordinary supervised learning.

Asur and Huberman⁷ created a forecasting model for box-office revenue generated by a movie in its opening weekend. Their input variables are based on tweets (posts on Twitter.com) that referred to the movie prior to its release. To forecast box-office revenue for a particular movie in its opening weekend, they used the daily tweet ratio from each of the last seven days as seven input variables.

⁷ S. Asur and B. A. Huberman. Predicting the future with social media. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pages 492–499, 2010.

    Training and Holdout Data

In Section 3.2 we introduced the practice of partitioning the data into training, validation and test portions in order to evaluate the predictive accuracy of a solution on new data. In the case of time series, we cannot use random segmentation, because it will create "holes" in the series. The values in a time series are ordered chronologically and this order is of importance. Therefore, the concept of data partitioning is carried out by splitting the series into an earlier training period and a later validation period (the most recent portion of the series) that is considered the holdout period.

Data mining algorithms are trained on the training period and their performance is evaluated on the holdout period. The holdout period is the future, relative to the training period. Determining where to split the series depends on different aspects of the forecasting goal and the nature of the data, such as the forecast horizon and the types of patterns that the data exhibit. Time series partitioning is illustrated in Figure 3.8.
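A minimal sketch of chronological partitioning (no shuffling); the series and the 12-observation holdout length are arbitrary:

```python
import numpy as np

# Hypothetical monthly series of 60 observations.
series = np.arange(60, dtype=float) + np.random.default_rng(0).normal(size=60)

# Chronological partitioning: the earliest values train the model,
# the most recent 12 observations form the holdout period.
train, holdout = series[:-12], series[-12:]
```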

Figure 3.8: Illustration of partitioning a time series into training and holdout periods (from G. Shmueli, Practical Time Series Forecasting: A Hands-On Guide, CreateSpace, 2nd edition, 2011, with permission).

    Further Forecasting Methods

The temporal nature of a time series usually implies correlation between neighboring values (autocorrelation). This also implies that the most recent values in a series carry the freshest information for forecasting future values. There are various techniques that approach the forecasting task by directly capturing this fact. The simplest are smoothing methods, such as the moving average and exponential smoothing. A moving average produces forecasts by taking the average of the most recent values of the series. Exponential smoothing forecasts are a weighted average of all past values with decaying weights into the past. More sophisticated smoothing methods exist for forecasting series with trends and seasonal patterns.
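A short sketch of a moving-average forecast and simple exponential smoothing on an invented series; the window length and smoothing constant are arbitrary:

```python
import numpy as np

# Hypothetical monthly sales series.
sales = np.array([200, 215, 230, 210, 225, 240, 250, 245, 260, 270], dtype=float)

# Moving-average forecast: average of the most recent w values.
w = 3
ma_forecast = sales[-w:].mean()

# Simple exponential smoothing: weighted average of all past values
# with exponentially decaying weights (smoothing constant alpha).
alpha = 0.3
level = sales[0]
for y in sales[1:]:
    level = alpha * y + (1 - alpha) * level
ses_forecast = level
print(ma_forecast, ses_forecast)
```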

A different approach that is somewhat similar to regression models is based on fitting a formula that directly models the relationship between a value and past values. Methods such as Autoregressive Integrated Moving Average (ARIMA) models and state space models are based on this approach. The formula can be quite complicated, and automated procedures are used to find the formula's parameters that generate the best forecasts.

    Forecasting Many Series

While forecasting methods have been used for decades, a new aspect of forecasting is the need to forecast a very large number of series, sometimes in real-time. Examples include retail stores and chains that require forecasts for each Stock Keeping Unit (SKU). Websites such as Google Insights for Search produce immediate forecasts for keyword popularity according to the keywords typed by users. Farecast.com (now Bing Travel, by Microsoft: www.bing.com/travel/about/howAirPredictions.do) produces forecasts for many air travel routes, to determine the best time to purchase air tickets. Forecasting algorithms predict whether the airfare on a certain route will increase/decrease with a certain confidence level.

In this new realm, the role of automation is critical. Algorithms that can be automated and that have low memory and storage requirements are often preferred over alternatives that might produce slightly more accurate forecasts.

The Verdict: Forecasting is an important component of supervised learning, although the temporal nature of the data and the goal require some modifications to non-temporal approaches. In addition, specialized forecasting algorithms exist that take advantage of the correlation between neighboring values in the series. Today's context often requires generating forecasts for a large number of series, often in real-time. Automated forecasting algorithms are therefore at the core of many such systems.


    3.4 Optimization

Optimization is an important tool with origins in operations research. Operations research is the discipline of applying analytical techniques to arrive at better decisions. The primary focus of this field is to arrive at optimal or near-optimal solutions for complex decision-making problems. Optimization and simulation (see Section 3.5) are its two main tools. Optimization is aimed at finding the combination of parameters that optimizes a function of interest. For example, in online advertising we might ask which ad location, size, and color lead to the maximum click-through rate.

We illustrate the process of optimization through a business example. Consider a manufacturing company that produces different types of furniture. The company would like to determine which combination of items produced in one day results in maximum profit.

As a first step, it is important to understand the landscape we wish to optimize. We have three main components: the function to optimize, the parameters, and the constraints.

    Parameters (Variables)

We are interested in determining the combination of furniture item quantities to produce in a day. In this example, we consider four types of items, and plan to produce quantities x1, x2, x3 and x4 of each item, respectively. Regarding the amount produced of each item, there may be minimal requirements, such as at least one item from each type. For the sake of simplicity, let us assume that there is no required minimal quota of furniture to be produced.

The output of the optimization procedure will include the values for x1, x2, x3 and x4 that optimize the daily profit.

    Costs and Constraints

In our example, we want to capture the different manufactured items as well as their respective costs and profits. Costs include labor (man-hours worked) and materials. We list the labor and material requirements in Table 3.1.

We also need to capture the cost and availability of the labor and required materials. These are given in Table 3.2.


             Labor (hrs)   Metal (lbs)   Wood (ft3)   Price ($)
Chairs            1             1             3           79
Bookcases         3             1             4          125
Bed-frames        2             1             4          109
Desks             2             1             3           94

Table 3.1: The furniture company is able to produce four different items that each have different labor and material costs. The last column lists the selling price of each item.

              Labor (hrs)   Metal (lbs)   Wood (ft3)
Cost ($)           14            20            11
Availability      225           117           420

Table 3.2: The cost and availability of labor, metal and wood for the production of chairs, bookcases, bed-frames and desks.

    Objective Function

The goal stated by the company is maximizing daily profit. Assuming that all of the produced items can be sold, the business question to be addressed is: how many desks, chairs, bookcases and bed-frames should the company produce per day in order to achieve maximum profit? To rephrase, how can the company maximize profit given the above constraints?

Computing profit requires computing costs and revenues. For a given set of furniture item quantities x1, x2, x3 and x4, we can compute costs, revenues and resulting profits. The spreadsheet in Figure 3.9 illustrates this computation. Ordinary cells include numbers, while highlighted cells include formulas (for example, the formula for cell F11 is shown in the function bar).
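
The same computation can be written outside a spreadsheet. The sketch below, in Python with names of our own choosing, encodes the per-item requirements from Table 3.1 and the resource costs from Table 3.2, and returns the daily profit for any candidate set of quantities.

# Per-unit requirements (labor hrs, metal lbs, wood ft3) and selling price,
# taken from Table 3.1; resource costs per unit are taken from Table 3.2.
REQUIREMENTS = {
    "chairs":     ((1, 1, 3), 79),
    "bookcases":  ((3, 1, 4), 125),
    "bed-frames": ((2, 1, 4), 109),
    "desks":      ((2, 1, 3), 94),
}
RESOURCE_COST = (14, 20, 11)   # $ per labor hour, per lb of metal, per ft3 of wood

def daily_profit(quantities):
    """quantities: dict mapping item name to the number produced in a day."""
    revenue = sum(REQUIREMENTS[item][1] * q for item, q in quantities.items())
    cost = sum(
        q * sum(r * c for r, c in zip(REQUIREMENTS[item][0], RESOURCE_COST))
        for item, q in quantities.items()
    )
    return revenue - cost

# The optimal mix reported later in this section yields a profit of 1827:
print(daily_profit({"chairs": 48, "bookcases": 39, "bed-frames": 30, "desks": 0}))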

    Optimization Model

The optimization model can be represented with the optimization objective

Maximize { Profit = Total Revenue - Total Cost }

with the parameters on which the decision will be made

#Desks (x1), #Chairs (x2), #Bookcases (x3), and #Bed-frames (x4)

and the constraints (limitations) to achieve the objective

Availability of labor, metal, and wood.

This optimization problem can be solved using Excel's Solver add-in8. Given the spreadsheet in Figure 3.9, we can specify the objective function, parameters and constraints as shown in Figure 3.10, which yields the following solution (shown in the spreadsheet): the maximum achievable daily profit is $1,827, obtained by manufacturing 0 desks, 48 chairs, 39 bookcases and 30 bed-frames. Examining the pre-determined constraints, we find that we exactly and fully utilize the available labor (225 hours), metal (117 lbs) and wood (420 ft3), with zero daily surplus.

8 For more examples of using Excel's Solver for optimization problems, see G. Shmueli. Practical Time Series Forecasting: A Hands-On Guide. CreateSpace, 2nd edition, 2011.

Figure 3.9: Spreadsheet with furniture example. Ordinary cells include numbers; highlighted cells include formulas. All formulas are based on the four values in the Quantity column.

Figure 3.10: Using Excel's Solver add-in to solve the optimization problem. The optimal solution (parameters and profit) is shown in Figure 3.9 in bold.

The optimization model above, together with the information in Tables 3.1 and 3.2, can also be translated into the following optimization code9:

9 The code is written and executed in SAS/OR. Optmodel indicates that the type of optimization solver would be automatically selected.

proc optmodel;
/* declare variables */
var desks >= 0, chairs >= 0, bookcases >= 0,
    bedframes >= 0;
/* declare constraints (availability figures from Table 3.2) */
con Labor: 2*desks + 1*chairs + 3*bookcases
    + 2*bedframes <= 225;
con Metal: 1*desks + 1*chairs + 1*bookcases
    + 1*bedframes <= 117;
con Wood: 3*desks + 3*chairs + 4*bookcases
    + 4*bedframes <= 420;
/* objective and remaining statements are reconstructed from Tables 3.1
   and 3.2: per-item profit = selling price minus labor, metal and wood costs */
max Profit = 9*desks + 12*chairs + 19*bookcases + 17*bedframes;
solve;
print desks chairs bookcases bedframes;
quit;


The solution given in Figures 3.9 and 3.11 is optimal; it is not possible to do better given the current information. However, the solution might not be operational. Suppose that our manufacturer cannot produce zero desks because of a minimum requirement for desks, or that employees are unable to work non-stop and require fixed rest periods. An optimization model will provide the best possible solution (if one is available) based on the information provided. Subsequent information that needs to be taken into account (such as labor rest periods or shift changes) would need to be included in the overall formulation.

Perhaps most importantly, the optimization framework allows for examining multiple scenarios, real and potential. We can use optimization not only to identify the feasibility of current business objectives, but also to identify the outcome if changes are incorporated. For example, we can explore the maximum possible profit and quantity of manufactured furniture items if we increase/decrease the selling price, labor hours and/or the material costs. Optimization gives us the ability to identify the best possible case under varying conditions.

    Optimization Methods and Algorithms

As illustrated above, the simplest case of optimization is the maximization or minimization of an objective function by systematically choosing values from within an allowed set and computing the value of the objective function. Optimization problems fall into various categories, depending on the form of the objective function and constraints.

The simplest case is a linear objective function and linear equality or inequality constraints. Our furniture example had exactly this structure: profit was a linear function of the four item quantities and the constraints were all linear inequalities. The common optimization technique for solving such problems is Linear Programming (LP).
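
As an illustration, the furniture problem can be written as an LP and handed to an off-the-shelf solver. The sketch below uses SciPy's linprog function in Python (our choice of tool, not the one used in the text); the profit coefficients 9, 12, 19 and 17 are derived from Tables 3.1 and 3.2, and linprog minimizes, so the profits are negated.

from scipy.optimize import linprog

profit = [9, 12, 19, 17]          # desks, chairs, bookcases, bed-frames
A_ub = [
    [2, 1, 3, 2],                 # labor hours per item
    [1, 1, 1, 1],                 # metal (lbs) per item
    [3, 3, 4, 4],                 # wood (ft3) per item
]
b_ub = [225, 117, 420]            # available labor, metal, wood

res = linprog(c=[-p for p in profit], A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 4)
print(res.x, -res.fun)            # should recover 0, 48, 39, 30 and a profit of 1827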

Optimization problems are generally divided into two types, convex and non-convex. Convex problems are those with a single optimal solution: the objective function has a single minimum or maximum point.

Geometrically, a function is convex if every point on a straight line joining two points on the function is also within the region above the curve. This is illustrated in Figure 3.12, where any straight line joining two points on the convex curve stays within the region, whereas for the non-convex function we can create a line external to the region by connecting two points.
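
In standard notation (our addition, for readers who prefer a formula), a function f is convex when the chord between any two points lies on or above the curve:

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y),   for all x, y and all λ between 0 and 1.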

Figure 3.12: A plot of a non-convex (top line) and convex (bottom line) function. The top curve is non-convex because the black straight line segment connecting two points on the curve is below the curve (outside the region).

In general, the difficulty of solving a problem is affected by its convexity. A variety of optimization methods exist, each designed for particular characteristics of the problem. For example, convex programming is used when the objective function is convex or concave (a minimization or maximization goal, respectively) and the constraint set is convex. LP is an example of convex programming. Nonlinear programming is needed when the objective function and/or the constraints are nonlinear.

A possible constraint is that parameter values must be integers. In our furniture example, we might require the item quantities to be integers. Integer programming is used when some or all of the parameters are constrained to take on integer values. This is a non-convex case, and in general much more difficult than regular linear programming.
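
If integer quantities are required, many solvers accept an integrality setting. Below is a hedged sketch assuming a recent version of SciPy (1.9 or later), in which linprog exposes an integrality argument; the coefficients are the same as in the LP sketch above.

from scipy.optimize import linprog

profit = [9, 12, 19, 17]                       # desks, chairs, bookcases, bed-frames
A_ub = [[2, 1, 3, 2],                          # labor hours per item
        [1, 1, 1, 1],                          # metal (lbs) per item
        [3, 3, 4, 4]]                          # wood (ft3) per item
b_ub = [225, 117, 420]

res = linprog(c=[-p for p in profit], A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 4, integrality=[1, 1, 1, 1])
print(res.x, -res.fun)                         # all four quantities forced to integers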

A comprehensive list of different optimization problems and methods is given in Wikipedia's Mathematical Optimization article10.

10 en.wikipedia.org/wiki/Mathematical_optimization

Many computational algorithms exist for solving optimization problems. Examples of popular algorithms include Branch and Bound, the Expectation-Maximization (EM) Algorithm, the Genetic Algorithm, and the Simplex Algorithm11. We note that most commercial software packages are able to automatically select the relevant technique for the defined optimization problem. For a theoretical understanding of optimization methodologies see12.

11 For a comprehensive listing see en.wikipedia.org/wiki/Category:Optimization_algorithms_and_methods

12 Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004; R. T. Rockafellar. Lagrange multipliers and optimality. SIAM Review, 21(183), 1993; and T. Ui. A note on discrete convexity and local optimality. Japan J. Indust. Appl. Math., 23(21), 2006.

    3.5 Simulation

Simulation is a tool for scenario building and assessment, which originates in operations research and is used in a wide range of fields and applications. The purpose of simulation is to assess the impact of various factors on outcomes of interest. In particular, simulation is the process of imitating the operation of a process over time. We can then change various process parameters and conditions and see how they affect the process outcomes of interest.

Simulation and optimization are sometimes confused. The main difference is that in optimization we search for combinations of inputs that lead to some optimal outcome, whereas simulation imitates a prescribed process flow over time; it does not optimize the process.

Let us continue our example of the furniture company. We used optimization to find the optimal combination of furniture items produced daily in order to maximize profit, given various constraints. We can use simulation to assess how the optimal solution (daily production process) will be affected by factors not considered in the optimization. For example, what if we over-estimated the number of work hours? What if a large order comes in? Moreover, although we stated the different costs and requirements as fixed figures (such as 1 hour for producing a chair), there is always some random fluctuation in these figures. For instance, a chair might take anywhere between 45 minutes and 1.5 hours to produce. We therefore may wish to simulate the furniture assembly line by capturing the steps required to assemble the different types of furniture. Such a simulation can help us evaluate the outcome under a broader context, taking into account random fluctuation. The simulation can help identify bottlenecks in the process as well as whether the process is sustainable over time.

    Monte Carlo Simulation

One approach to incorporating random variability into the simulation is using Monte Carlo experiments. The idea behind this approach is to compute the relationship of interest multiple times, each time varying parameters according to some random process. By examining the different runs, we get an idea of possible fluctuations in the outcomes of interest and their sensitivity to changes in the parameters. In our furniture example, costs of labor and material were stated as fixed figures. Suppose that the number of hours to produce each of the furniture items can vary within half an hour of the values in Table 3.1 due to causes not under our control (differences in skill level, effects of temperature on workers, etc.). How does this extra variability affect the daily profit? How much variation does it introduce into our setup?

To take into account the above-mentioned random variation in labor time, we replace the values in cells C7:C10 in the spreadsheet shown in Figure 3.9 with a formula that adheres to the type of randomness that we envisage. If we assume that building a chair can take anywhere between half an hour and 1.5 hours, and that any value within this range is equally likely, then we can use the Excel formula =0.5+RAND(), where the function RAND() generates a random number between 0 and 1. Figure 3.13 shows the result of a single run. We see that the RAND() functions that were inserted into the labor values lead to a slightly different daily profit. Compared to the original profit of $1,827, the profit in this screenshot is $1,918. However, this is only a single run, representing one possible outcome. In Excel, we can regenerate random numbers by pressing the F9 key. The next step is to aggregate the results from multiple runs, and to examine the variation in profit across the different runs. This is typically done by examining a chart of profit across the different runs. An example is shown in Figure 3.14, where we see the likelihood of obtaining different profit levels.
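
The same experiment can also be run outside Excel. The sketch below is our own Python rendition, not the book's spreadsheet: it fixes the production mix at the optimal solution, draws each item's labor time uniformly within half an hour of its Table 3.1 value, and recomputes the daily profit one thousand times.

import random

QUANTITY = {"desks": 0, "chairs": 48, "bookcases": 39, "bed-frames": 30}
LABOR    = {"desks": 2, "chairs": 1, "bookcases": 3, "bed-frames": 2}      # hrs
METAL    = {"desks": 1, "chairs": 1, "bookcases": 1, "bed-frames": 1}      # lbs
WOOD     = {"desks": 3, "chairs": 3, "bookcases": 4, "bed-frames": 4}      # ft3
PRICE    = {"desks": 94, "chairs": 79, "bookcases": 125, "bed-frames": 109}
LABOR_COST, METAL_COST, WOOD_COST = 14, 20, 11

def one_run():
    # one Monte Carlo draw of the daily profit with noisy labor times
    profit = 0.0
    for item, q in QUANTITY.items():
        hours = random.uniform(LABOR[item] - 0.5, LABOR[item] + 0.5)
        cost = hours * LABOR_COST + METAL[item] * METAL_COST + WOOD[item] * WOOD_COST
        profit += q * (PRICE[item] - cost)
    return profit

profits = [one_run() for _ in range(1000)]
print(min(profits), sum(profits) / len(profits), max(profits))   # spread around $1,827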

In short, simulation helps us assess the uncertainty associated with the relationship between the process parameters and the objectives. Monte Carlo simulation does this by adding random variation into the process parameters. In this example, we chose a uniform distribution (the RAND() function) to represent the randomness structure. However, many other probability distributions can be used instead.

Figure 3.13: Furniture example with randomness inserted into the labor time (cells C7:C10). The effect can be seen on the revenue, cost, and profit values.

Let us expand on our previous example and assume that the furniture company also provides repair and maintenance services for the items it manufactures. We would now want to know whether the current inventory holding level sufficiently adheres to the company's service agreements, which target a certain customer service level. Suppose that the company requires a 95% service level, where 95% of all spare parts are available at all times, while the remaining 5% require special order or are on queue for repair. We simulate the demand for repair parts, assuming that each part is completely repairable.

Let us consider two types of replacement parts: Part A (wood frame) and Part B (metal frame). Part A can be replaced with either another Part A or with a Part B. However, Part B can be replaced only by another Part B.

To evaluate the service level for our simulation, we aim to capture a number of operational considerations (as illustrated in Figure 3.15, left to right; a simplified code sketch follows the list):

Simulate the daily number of furniture items coming in for repair over time, and the respective required part (Part A or Part B).

Assign the requested part if available; otherwise place the item into the queue and calculate the service level.

If requesting Part A, check availability of Part A or B and send the replaced Part A for repair.

If requesting Part B, check availability of Part B and send the replaced Part B for repair.

    Update inventory holding to reflect outgoing parts.
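
A heavily simplified sketch of such a simulation is shown below, in Python. The daily arrival pattern, starting stock levels and three-day repair turnaround are illustrative assumptions, and unmet requests are simply counted against the service level rather than queued.

import random
from collections import deque

random.seed(1)
stock = {"A": 20, "B": 20}              # starting inventory of spare parts (assumed)
in_repair = deque()                     # (day_ready, part_type) for parts being fixed
TURNAROUND = 3                          # assumed days to repair a returned part
served = demanded = 0

for day in range(365):
    # parts that have finished repair return to stock
    while in_repair and in_repair[0][0] <= day:
        stock[in_repair.popleft()[1]] += 1
    for _ in range(random.randint(0, 5)):               # today's repair requests
        part = random.choice(["A", "B"])
        demanded += 1
        # a Part A request may be served from A or B stock; Part B only from B
        options = ["A", "B"] if part == "A" else ["B"]
        source = next((s for s in options if stock[s] > 0), None)
        if source is not None:
            stock[source] -= 1
            in_repair.append((day + TURNAROUND, part))  # broken part goes to repair
            served += 1

print("service level:", served / demanded)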

Such a simulation will ascertain the ongoing service level when given the number of parts held in inventory (at some starting point, as the service level will change as parts are consumed and furniture is repaired), the time taken to repair each item, the service time, etc.

Figure 3.14: Results from 1,000 Monte Carlo runs of the furniture example with randomness inserted into the labor time. The y-axis shows the percent of runs obtained for a particular profit range. The peak indicates that a profit in the range of $1,800 is most likely, but values can range between $1,100 and $2,500.

Figure 3.16 presents a schematic view of a recommended holistic operational solution flow. It combines forecasting, optimization and simulation in a manner that is common in practice. Given forecasts, the outcome of interest is optimized while simulating the ongoing operational variability. In this scenario, we try to capture the knowns and to quantify the unknowns. In the context of our furniture example, this means:

    1. Forecast the future demand for furniture items.

2. Optimize the production mix, given constraints and forecasted demand.

3. Simulate the system given the forecasted demand, optimal holding level and domain knowledge.

Such a process will take into consideration future requirements, find the best possible configuration and simulate the solution over time.


Figure 3.15: Visualization of a simulation flow of repairing furniture with two interchangeable parts.


Figure 3.16: Visualization of a holistic approach towards a complete operational system that incorporates forecasting, optimization and simulation.

