Introduction to Data Mining

Date post: 22-Feb-2016
Upload: beata
Transcript
Page 1: Introduction to Data Mining
Page 2: Introduction to Data Mining

WHAT IS DATA MINING?

Page 3: Introduction to Data Mining

WHAT IS DATA MINING?

Data mining is the process of discovering:

Meaningful new correlations

Patterns

Trends

by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Page 4: Introduction to Data Mining

WHAT IS DATA MINING?

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Page 5: Introduction to Data Mining

WHAT IS DATA MINING?

Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases.

Page 6: Introduction to Data Mining

Introduction

Data mining skills are in high demand as organizations increasingly put data repositories online.

Effectively analyzing information from customers, partners, and suppliers has become important to more companies.

Many companies have implemented a data warehouse strategy and are now starting to look at what they can do with all that data.

Page 7: Introduction to Data Mining

WHY DATA MINING?

The ongoing remarkable growth in the field of data mining and knowledge discovery has been fueled by a fortunate confluence of a variety of factors:

The storing of the data in data warehouses, so that the entire enterprise has access to a reliable current database

The availability of increased access to data from Web navigation and intranets

The competitive pressure to increase market share in a globalized economy

The development of off-the-shelf commercial data mining software suites

The tremendous growth in computing power and storage capacity

Page 8: Introduction to Data Mining

FALLACIES OF DATA MINING

Fallacy 1. There are data mining tools that we can turn loose on our data repositories and use to find answers to our problems.

Reality. There are no automatic data mining tools that will solve your problems mechanically “while you wait.” Data mining is a process.

Fallacy 2. The data mining process is autonomous, requiring little or no human oversight.

Reality. As we saw above, the data mining process requires significant human interactivity at each stage. Even after the model is deployed, the introduction of new data often requires an updating of the model. Continuous quality monitoring and other evaluative measures must be assessed by human analysts.

Page 9: Introduction to Data Mining

FALLACIES OF DATA MINING

Fallacy 3. Data mining pays for itself quite quickly.

Reality. The return rates vary, depending on the startup costs, analysis personnel costs, data warehousing preparation costs, and so on.

Fallacy 4. Data mining software packages are intuitive and easy to use.

Reality. Again, ease of use varies. However, data analysts must combine subject matter knowledge with an analytical mind and a familiarity with the overall business or research model.

Page 10: Introduction to Data Mining

FALLACIES OF DATA MINING

Fallacy 5. Data mining will identify the causes of our business or research problems.

Reality. The knowledge discovery process will help you to uncover patterns of behavior. Again, it is up to humans to identify the causes.

Fallacy 6. Data mining will clean up a messy database automatically.

Reality. Well, not automatically. As a preliminary phase in the data mining process, data preparation often deals with data that has not been examined or used in years. Therefore, organizations beginning a new data mining operation will often be confronted with the problem of data that has been lying around for years, is stale, and needs considerable updating.

Page 11: Introduction to Data Mining

Knowledge Discovery Process

Page 12: Introduction to Data Mining

Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.

Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.

Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.

Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

Page 13: Introduction to Data Mining

Knowledge Discovery Process

Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.

Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.

Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
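
The knowledge discovery steps can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, age cutoff, and the "gap" threshold used for pattern evaluation are all hypothetical, and the mining step is a toy stand-in (a comparison of group means) rather than a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
STORE_A = [
    {"customer": "c1", "age": 34, "spend": 120.0, "note": "web"},
    {"customer": "c2", "age": None, "spend": 80.0, "note": "web"},
]
STORE_B = [
    {"customer": "c3", "age": 51, "spend": 300.0, "note": "shop"},
    {"customer": "c4", "age": 27, "spend": 40.0, "note": "shop"},
]


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records, field):
    """Data transformation: rescale one numeric field to the range 0..1."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [{**r, field: (r[field] - low) / span} for r in records]


def mine(records, age_cutoff=40):
    """Data mining (toy stand-in): mean rescaled spend per age group."""
    groups = {"younger": [], "older": []}
    for r in records:
        groups["older" if r["age"] >= age_cutoff else "younger"].append(r["spend"])
    return {name: mean(vals) for name, vals in groups.items() if vals}


def evaluate(pattern, min_gap=0.5):
    """Pattern evaluation: keep the pattern only if the group gap is large."""
    if len(pattern) < 2:
        return False
    return max(pattern.values()) - min(pattern.values()) >= min_gap


def run_pipeline():
    """Run the steps in order and return the pattern and its evaluation."""
    data = integrate(clean(STORE_A), clean(STORE_B))
    data = select(data, ["age", "spend"])
    data = transform(data, "spend")
    pattern = mine(data)
    return pattern, evaluate(pattern)


if __name__ == "__main__":
    pattern, interesting = run_pipeline()
    # Knowledge representation: here just a plain-text report.
    for group, value in pattern.items():
        print(f"{group}: mean rescaled spend {value:.2f}")
    print("interesting:", interesting)
```

In practice the process is iterative, so a weak result at the evaluation step would send the analyst back to an earlier step rather than ending the run.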

Page 14: Introduction to Data Mining

WHAT TASKS CAN DATA MINING ACCOMPLISH?

The following list shows the most common data mining tasks.

Estimation

Prediction

Classification

Clustering

Association

Page 15: Introduction to Data Mining

Classification

There is a target categorical variable.

Example: income bracket, partitioned into 3 classes or categories: high income, middle income, and low income

Page 16: Introduction to Data Mining

Classification

Classify the income brackets of persons not currently in the database, based on other characteristics associated with that person, such as age, gender, and occupation

Classification Task

Page 17: Introduction to Data Mining

Classification - Steps

First: examine the data set containing both the predictor variables and the (already classified) target variable, i.e., income bracket.

The algorithm (software) “learns about” which combinations of variables are associated with which income brackets. For example, older females may be associated with the high-income bracket. This data set is called the training set.

Second: the algorithm would look at new records, for which no information about income bracket is available.

Based on the classifications in the training set, the algorithm would assign classifications to the new records. For example, a 63-year-old female professor might be classified in the high-income bracket.
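
The two steps can be sketched with a very simple lookup classifier. It is one possible illustration, not the method of any particular software package: the training records, the age cutoff of 55, and the choice to use only age band and gender as predictors are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical training set: each record already carries its income bracket.
TRAINING_SET = [
    {"age": 61, "gender": "F", "bracket": "high"},
    {"age": 58, "gender": "F", "bracket": "high"},
    {"age": 66, "gender": "F", "bracket": "middle"},
    {"age": 24, "gender": "M", "bracket": "low"},
    {"age": 22, "gender": "M", "bracket": "low"},
    {"age": 41, "gender": "M", "bracket": "middle"},
    {"age": 35, "gender": "F", "bracket": "middle"},
    {"age": 29, "gender": "F", "bracket": "middle"},
]


def features(record):
    """Reduce a record to the combination of predictors used by the model."""
    age_band = "older" if record["age"] >= 55 else "younger"  # assumed cutoff
    return (age_band, record["gender"])


def train(records):
    """Step 1: count which brackets go with which predictor combinations."""
    counts = defaultdict(Counter)
    overall = Counter()
    for record in records:
        counts[features(record)][record["bracket"]] += 1
        overall[record["bracket"]] += 1
    return counts, overall


def classify(model, record):
    """Step 2: assign the most common bracket seen for this combination."""
    counts, overall = model
    seen = counts.get(features(record))
    source = seen if seen else overall  # fall back for unseen combinations
    return source.most_common(1)[0][0]


if __name__ == "__main__":
    model = train(TRAINING_SET)
    new_record = {"age": 63, "gender": "F"}  # no income bracket known
    print(classify(model, new_record))
```

With this toy training set, the 63-year-old female record falls in the (older, F) group, where "high" is the most frequent bracket, so that is the label assigned.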

Page 18: Introduction to Data Mining

Classification

For example, in the medical field, suppose that we are interested in classifying the type of drug a patient should be prescribed, based on certain patient characteristics, such as the age of the patient and the patient’s sodium/potassium ratio. Figure 1.4 is a scatter plot of patients’ sodium/potassium ratio against patients’ ages for a sample of 200 patients. The particular drug prescribed is symbolized by the shade of the points. Light gray points indicate drug Y; medium gray points indicate drug A or X; dark gray points indicate drug B or C.

Page 19: Introduction to Data Mining

Which drug should be prescribed for a young patient with a high sodium/potassium ratio?

Which drug should be prescribed for older patients with low sodium/potassium ratios?

Page 20: Introduction to Data Mining

Other Classification examples

Determining whether a particular credit card transaction is fraudulent

Assessing whether a mortgage application is a good or bad credit risk

Diagnosing whether a particular disease is present

Determining whether a will was written by the actual deceased, or fraudulently by someone else

Identifying whether or not certain financial or personal behavior indicates a possible terrorist threat

Page 21: Introduction to Data Mining

Estimation

Estimation is similar to classification except that the target variable is numerical rather than categorical.

Example: estimating the systolic blood pressure reading of a hospital patient, based on the patient’s age, gender, body-mass index, and blood sodium levels

The relationship between systolic blood pressure and the predictor variables in the training set would provide us with an estimation model.

We can then apply that model to new cases.
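
As a minimal sketch of an estimation model, the code below fits a straight line by ordinary least squares and applies it to a new case. Simple linear regression is just one possible choice of model; the single predictor (age) and the four training readings are made up for illustration and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(model, x):
    """Apply the fitted model to a new case."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training set: patient age and systolic blood pressure.
AGES = [30, 40, 50, 60]
SYSTOLIC = [115, 120, 130, 135]

if __name__ == "__main__":
    model = fit_line(AGES, SYSTOLIC)
    print(f"estimated systolic pressure at age 45: {estimate(model, 45):.1f}")
```

A realistic model for the blood pressure example would use all of the listed predictors (age, gender, body-mass index, blood sodium levels), which calls for multiple regression rather than a single-predictor line.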

Page 22: Introduction to Data Mining

Examples of Estimation

Estimating the amount of money a randomly chosen family of four will spend for back-to-school shopping this new school season.

Estimating the grade-point average (GPA) of a master's student, based on that student’s undergraduate GPA.

Estimating the number of goals per game that Ronaldinho will score in the next UEFA competition, based on his previous goals.

Page 23: Introduction to Data Mining

Prediction

Prediction is similar to classification and estimation, except that for prediction, the results lie in the future, regardless of whether the target variable is numerical or categorical.

Any of the methods and techniques used for classification and estimation may also be used for prediction. These include:

Traditional statistical methods: point estimation and confidence interval estimations, simple linear regression and correlation, multiple regression

Data mining and knowledge discovery methods such as neural networks

Page 24: Introduction to Data Mining

Examples of Prediction

Predicting the price of a stock three months into the future

Predicting the percentage increase in traffic deaths next year if the speed limit is increased

Predicting the winner of this fall’s baseball World Series, based on a comparison of team statistics

Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company

Page 25: Introduction to Data Mining

Examples of Prediction

Predicting the price of a stock three months into the future

Page 26: Introduction to Data Mining

Clustering

Clustering refers to the grouping of records, observations, or cases into classes of similar objects.

A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters.

Clustering differs from classification in that there is no target variable for clustering.

The clustering task does not try to classify, estimate, or predict the value of a target variable.

Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters.
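
To make the idea concrete, here is a minimal sketch of k-means, one widely used clustering algorithm (the slides do not name a specific algorithm, so this choice is an assumption). The two-dimensional points and the starting centroids are hypothetical. Note that no target variable appears anywhere: records are grouped only by their distance to one another.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Group points around centroids; returns (labels, final centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clusters have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical records with two numeric attributes each.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

if __name__ == "__main__":
    labels, centroids = k_means(POINTS, centroids=[(1.0, 1.0), (8.0, 8.0)])
    print(labels)
    print(centroids)
```

The result depends on the starting centroids and on the number of clusters chosen, which is why clustering output still needs review by an analyst.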

Page 27: Introduction to Data Mining
Page 28: Introduction to Data Mining

demographic profile of each of the geographic areas in the country, as defined by zip code. One of the clustering mechanisms they use is the PRIZM segmentation system, which describes every U.S. zip code area in terms of distinct lifestyle types

Page 29: Introduction to Data Mining

Clustering Example

Clusters for zip code 90210, Beverly Hills, California.

Beverly Hills, CA 90210's most common PRIZM NE Segments are:

Number  Name
16      Bohemian Mix
07      Money & Brains
03      Movers & Shakers
01      Upper Crust
04      Young Digerati

Page 30: Introduction to Data Mining

Examples of Clustering

Grouping residents according to zip codes.

For accounting auditing purposes, to segment financial behavior into benign and suspicious categories

As a dimension-reduction tool when the data set has hundreds of attributes

For gene expression clustering, where very large quantities of genes may exhibit similar behavior

Page 31: Introduction to Data Mining

Association

The association task for data mining is the job of finding which attributes “go together.”

Most prevalent in the business world, where it is known as affinity analysis or market basket analysis

The task of association seeks to uncover rules for quantifying the relationship between two or more attributes.

Association rules are of the form “If antecedent, then consequent,” together with a measure of the support and confidence associated with the rule.

Page 32: Introduction to Data Mining

Association

For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night, 200 bought diapers, and of those 200 who bought diapers, 50 bought Coke.

Thus, the association rule would be “If buy diapers, then buy Coke” with a confidence of 50/200 = 25%. The antecedent (diapers) appears in 200/1000 = 20% of transactions; under the usual definition, the support of the rule itself is the share of transactions containing both items, 50/1000 = 5%.
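
The diaper and Coke figures can be checked with a few lines of code. The sketch below rebuilds the 1000 Thursday-night transactions from the counts in the example (the "bread" basket for the other 800 customers is a made-up filler) and reports three numbers: the share of transactions containing the antecedent (the 20% figure), the share containing both items (the more common definition of a rule's support, 5% here), and the confidence.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Measures for the rule 'if antecedent, then consequent'."""
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return {
        "antecedent_support": len(with_antecedent) / n,
        "rule_support": len(with_both) / n,
        "confidence": confidence,
    }


# 1000 transactions: 50 with diapers and Coke, 150 with diapers only,
# and 800 hypothetical baskets containing neither item.
TRANSACTIONS = (
    [{"diapers", "coke"}] * 50
    + [{"diapers"}] * 150
    + [{"bread"}] * 800
)

if __name__ == "__main__":
    metrics = rule_metrics(TRANSACTIONS, {"diapers"}, {"coke"})
    for name, value in metrics.items():
        print(f"{name}: {value:.0%}")
```

Passing sets for the antecedent and consequent lets the same function handle rules with more than one item on either side.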

Page 33: Introduction to Data Mining

Examples of Association

Investigating the proportion of subscribers to a company’s cell phone plan that respond positively to an offer of a service upgrade

Examining the proportion of children whose parents read to them who are themselves good readers

Predicting degradation in telecommunications networks

Finding out which items in a supermarket are purchased together and which items are never purchased together

Determining the proportion of cases in which a new drug will exhibit dangerous side effects

Page 34: Introduction to Data Mining

Bohemian Mix

Midscale, Middle Age Mix

A collection of mobile urbanites, Bohemian Mix represents the nation's most liberal lifestyles. Its residents are an ethnically diverse, progressive mix of young singles, couples, and families ranging from students to professionals. In their funky row houses and apartments, Bohemian Mixers are the early adopters who are quick to check out the latest movie, nightclub, laptop, and microbrew.

Page 35: Introduction to Data Mining

Money & Brains

Upscale, Older Mix

The residents of Money & Brains seem to have it all: high incomes, advanced degrees, and sophisticated tastes to match their credentials. Many of these city dwellers are married couples with few children who live in fashionable homes on small, manicured lots.

Page 36: Introduction to Data Mining

Movers & Shakers

Wealthy, Middle Age w/o Kids

Movers & Shakers is home to America's up-and-coming business class: a wealthy suburban world of dual-income couples who are highly educated, typically between the ages of 35 and 54. Given its high percentage of executives and white-collar professionals, there's a decided business bent to this segment: members of Movers & Shakers rank number one for owning a small business and having a home office.

Page 37: Introduction to Data Mining

Upper Crust

Wealthy, Older w/o Kids

The nation's most exclusive address, Upper Crust is the wealthiest lifestyle in America--a haven for empty-nesting couples between the ages of 45 and 64. No segment has a higher concentration of residents earning over $100,000 a year or possessing a postgraduate degree. And none has a more opulent standard of living.

Page 38: Introduction to Data Mining

Young Digerati

Upscale, Younger Mix

Young Digerati are tech-savvy and live in fashionable neighborhoods on the urban fringe. Affluent, highly educated, and ethnically mixed, Young Digerati communities are typically filled with trendy apartments and condos, fitness clubs and clothing boutiques, casual restaurants and all types of bars--from juice to coffee to microbrew.

