1
UNESCO courses: Module on Knowledge Discovery and Data Mining
Prof. Ho Tu Bao
Prof. Bach Hung Khang
Institute of Information Technology
Japan Advanced Institute of Science and Technology
2
Outline of the presentation
Objectives, Prerequisite and Content
Brief Introduction to Lectures
Discussion and Conclusion
This presentation summarizes the content and organization of lectures in the module “Knowledge Discovery and Data Mining”.
3
Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications
4
Prerequisite for the course
Nothing special, but the following are expected:
• experience with computers
• basics of databases and statistics
• programming skills for advanced levels
5
Content of the course
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
7
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
8
Lecture 1: Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
9
KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
Large volumes of data: 10^6 to 10^12 bytes; we never see the whole data set or put it in the memory of computers.
What knowledge? How to represent and use it?
Data mining algorithms?
10
Data, Information, Knowledge
We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.
Knowledge can be considered data at a high level of abstraction and generalization.
11
From Data to Knowledge
...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS...
Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
Legend: numerical attribute, categorical attribute, missing values, class labels
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]
[confidence, predictive accuracy]
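The rule on this slide can be read as a predicate over records, and its confidence as the share of matching records that carry the predicted class. The sketch below illustrates that reading only. The four records and the `diagnosis` key are invented for the example; they are not taken from Dr. Tsumoto's data set, and the slide does not say how its 87.5% figure was computed.

```python
# Hypothetical records: only the four attributes used by the rule are kept.
# All values are invented for illustration.
records = [
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "diagnosis": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 18, "diagnosis": "VIRUS"},
    {"cell_poly": 90, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "diagnosis": "BACTERIA"},
    {"cell_poly": 700, "Risk": "n", "Loc_dat": "-", "Nausea": 0, "diagnosis": "BACTERIA"},
]


def rule_matches(record):
    """Condition part (IF ...) of the rule shown on the slide."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )


def rule_confidence(records, predicted="VIRUS"):
    """Return (number of covered records, fraction of them labeled `predicted`)."""
    covered = [r for r in records if rule_matches(r)]
    if not covered:
        return 0, 0.0
    correct = sum(1 for r in covered if r["diagnosis"] == predicted)
    return len(covered), correct / len(covered)


covered, confidence = rule_confidence(records)
print(f"covered={covered}, confidence={confidence:.1%}")
```

On the invented records the rule covers three cases, two of them labeled VIRUS, so the confidence is 2/3.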
12
Data Rich, Knowledge Poor
People gather and store so much data because they think some valuable assets are implicitly coded within it.
Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
Manual data analysis is impractical.
How to acquire knowledge for knowledge-based systems (knowledge base plus inference engine) remains the main difficult and crucial problem.
Tradition: via knowledge engineers
New trend: via automatic programs
13
Benefits of Knowledge Discovery
[Figure: Value versus Volume for EDP, MIS and DSS, with the labels Generate, Rapid Response and Disseminate]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
15
The KDD process
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
non-trivial process: multiple processes
valid: justified patterns/models
novel: previously unknown
useful: can be used
understandable: by human and machine
16
The Knowledge Discovery Process
KDD is inherently interactive and iterative.
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
Data mining: a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
17
The KDD Process
Data organized by function (data warehousing)
• Create/select target database
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Transform to different representation
• Query & report generation, aggregation & sequences, advanced methods
[Figure: these activities are arranged along steps 1 to 5 of the KDD process]
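Two of the preprocessing activities named on this slide, "Supply missing values" and "Normalize values", can be sketched in a few lines. The slide does not say which techniques are meant, so mean imputation and min-max scaling are assumptions chosen for illustration. The function names and the sample column are hypothetical.

```python
def supply_missing(values, missing=None):
    """Replace missing entries with the mean of the observed ones (mean imputation)."""
    observed = [v for v in values if v is not missing]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]


def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing value.
temperature = [37.0, 38.0, None, 39.0]
filled = supply_missing(temperature)
scaled = normalize(filled)
print(filled)  # [37.0, 38.0, 38.0, 39.0]
print(scaled)  # [0.0, 0.5, 0.5, 1.0]
```

Other choices (median imputation, z-score scaling, dropping incomplete records) fit the same two steps; which one is appropriate depends on the data and on the mining method that follows.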
18
Main Contributing Areas of KDD
Databases: store, access, search, update data (deduction)
Statistics: infer info from data (deduction & induction, mainly numeric data)
Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
[data warehouses: integrated data]
[OLAP: On-Line Analytical Processing]
20
Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.
Personal information
21
KDD: Opportunity and Challenges
Data rich, knowledge poor (the resource)
Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
Competitive pressure
Data mining technology mature
[Figure: these four factors converge on KDD]
22
KDD: A New and Fast Growing Area
KDD workshops: 1989, 1991, 1993, 1994.
International conferences: KDD’95, 96, 97, 98, 99 (USA); PAKDD’97, 98, 99 (Asia); PKDD’97, 98, 99 (Europe); PAKDD’00 (Kyoto, 2000.4.18-20, deadline 99.10.10).
Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
80% of the Fortune 500 companies are currently involved in data mining pilot projects or using data mining systems.
Japan: FGCS Project (logic programming and reasoning, recently more attention on knowledge acquisition and machine learning). Interest in KDD: Special Issue on KDD of JSAI, July 1997.
“Knowledge Discovery is the most desirable end-product of computing.” Wiederhold, Stanford Univ.
24
Primary Tasks of Data Mining
Classification: finding the description of several predefined classes and classifying a data item into one of them.
Regression: mapping a data item to a real-valued prediction variable.
Clustering: identifying a finite set of categories or clusters to describe the data.
Summarization: finding a compact description for a subset of data.
Dependency modeling: finding a model which describes significant dependencies between variables.
Deviation and change detection: discovering the most significant changes in the data.
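To make one of these tasks concrete, the sketch below illustrates clustering, that is, identifying a finite set of clusters to describe the data. It uses a one-dimensional k-means procedure, which is an assumption made for the example; the slide does not name a clustering method. The data points and starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given starting centers (basic k-means)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # Converged: the assignments can no longer change.
        centers = new_centers
    return centers, clusters


# Hypothetical data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

Unlike classification, no class labels are given in advance: the two groups are found from the values alone.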
25
Classification
“What factors determine cancerous cells?”
[Figure: Cancerous Cell Data (examples) is fed to a classification (mining) algorithm, which produces general patterns]
Mining algorithms:
- Rule induction
- Decision tree
- Neural network
26
Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
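Once rules like the two on this slide have been induced, using them is a simple matter of matching. The sketch below stores the two rules as data and returns the label of the first rule whose conditions all hold. The first-match strategy, the `classify` function and the fallback for unmatched cells are assumptions made for illustration; the slide only shows the rules themselves.

```python
# The two rules from the slide: (conditions, predicted class, certainty).
RULES = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy Cell", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous Cell", 0.87),
]


def classify(cell, rules=RULES):
    """Return (label, certainty) of the first rule that matches the cell.

    Returns (None, 0.0) when no rule applies (an assumed fallback).
    """
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None, 0.0


print(classify({"Color": "dark", "Tails": 2, "Nuclei": 2}))
# ('Cancerous Cell', 0.87)
print(classify({"Color": "dark", "Tails": 1, "Nuclei": 1}))
# (None, 0.0)
```

The second call shows that two rules do not cover every cell; a complete rule set would need more rules or a default class.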
27
Classification: Decision Trees
[Figure: decision tree with tests on Color (dark / light), #nuclei (1 / 2) and #tails (1 / 2), and leaves labeled healthy or cancerous]
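A decision tree classifies a cell by testing one attribute at each internal node and following the matching branch until a leaf is reached. The sketch below shows that traversal. The exact shape of the tree on the slide cannot be recovered from the text, so `TREE` is a hypothetical tree over the same three attributes, chosen only so that it agrees with the two rules on the rule induction slide.

```python
# Hypothetical tree: an internal node is (attribute, {value: subtree});
# a leaf is a class label string.
TREE = ("Color", {
    "light": ("Tails", {1: "healthy", 2: "cancerous"}),
    "dark": ("Nuclei", {
        1: "healthy",
        2: ("Tails", {1: "healthy", 2: "cancerous"}),
    }),
})


def predict(tree, cell):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[cell[attribute]]
    return tree


print(predict(TREE, {"Color": "dark", "Nuclei": 2, "Tails": 2}))   # cancerous
print(predict(TREE, {"Color": "light", "Nuclei": 2, "Tails": 1}))  # healthy
```

Each root-to-leaf path corresponds to one IF ... THEN rule, which is why trees and rule sets are often presented side by side.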
28
Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]
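A neural network classifier turns the cell attributes into numbers, passes them through weighted sums and nonlinear units, and reads the class from the output. The sketch below shows only this forward pass for a tiny network with one hidden layer. The input encoding, the layer sizes and every weight are hypothetical values picked by hand for illustration; a real network would learn its weights from training examples.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def encode(cell):
    """Assumed encoding: dark = 1.0, light = 0.0, counts used as-is."""
    return [1.0 if cell["Color"] == "dark" else 0.0,
            float(cell["Nuclei"]), float(cell["Tails"])]


# Hand-picked (not trained) weights for 3 inputs, 2 hidden units, 1 output.
HIDDEN_WEIGHTS = [[2.0, 0.5, 1.0], [-1.0, 0.5, -0.5]]
HIDDEN_BIASES = [-3.0, 0.5]
OUTPUT_WEIGHTS = [3.0, -2.0]
OUTPUT_BIAS = -0.5


def cancerous_score(cell):
    """Forward pass: returns a score in (0, 1); above 0.5 is read as cancerous."""
    inputs = encode(cell)
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]
    return sigmoid(sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS)


def classify_cell(cell):
    return "Cancerous" if cancerous_score(cell) > 0.5 else "Healthy"


print(classify_cell({"Color": "dark", "Nuclei": 2, "Tails": 2}))
print(classify_cell({"Color": "light", "Nuclei": 2, "Tails": 1}))
```

Unlike the rules and the tree, the weights give no direct answer to "what factors determine a cell is cancerous?"; the network trades readability for flexibility.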