Date post: 28-Mar-2015
COMP3740 CR32: Technologies for Knowledge Management Introduction to Knowledge Discovery By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)
Transcript
Page 1: COMP3740 CR32: Technologies for Knowledge Management Introduction to Knowledge Discovery By Eric Atwell, School of Computing, University of Leeds (including.

COMP3740 CR32: Technologies for Knowledge Management

Introduction to Knowledge Discovery

By Eric Atwell, School of Computing, University of Leeds

(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)


What has Machine Learning got to do with Computing / Information Systems?

“Most international organizations produce more information in a week than many people could read in a lifetime”

Adriaans and Zantinge


Objectives of knowledge discovery or data mining

• Data mining is about discovering patterns in data.

• For this we need:
– KD/DM techniques, algorithms, tools, e.g. BootCat, WEKA
– A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM


Data Mining, Knowledge Discovery, Text Mining

• Data Mining was originally about “learning” patterns from DataBases, data structured as Records, Fields

• Knowledge Discovery is an “exotic term” for DM???

• Increasingly, data is unstructured text (WWW), so

• Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data


define: data mining

• Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from artificial intelligence, statistics and pattern recognition. en.wikipedia.org/wiki/Data_mining


define: text mining

• Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. ... en.wikipedia.org/wiki/Text_mining


define: knowledge discovery

• Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test. www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html

• Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from AI, statistics and pattern recognition. en.wikipedia.org/wiki/Knowledge_discovery


Data Mining: Overview

Concepts, Instances or examples, Attributes

Data Mining

Concept Descriptions

Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.


Instances

• Input to a data mining algorithm is in the form of a set of examples, or instances.

• Each instance is represented as a set of features or attributes.

• Usually in DB Data-Mining this set takes the form of a flat file; each instance is a record in the file, each attribute is a field in the record.

• In text mining, an instance may be a word/term in its context (surrounding words/document).

• The concepts to be learned are formed from patterns discovered within the set of instances.
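As a rough illustration of the flat-file view described above, the sketch below reads a tiny comma-separated file in which each record is one instance and each field is one attribute. The three sample rows are borrowed from the weather example used later in these slides; the helper names `load_instances` and `attribute_values` are hypothetical, introduced only for this illustration.

```python
import csv
import io

# A tiny flat file: one record per instance, one field per attribute.
# The rows are borrowed from the weather example later in these slides.
FLAT_FILE = """outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
"""


def load_instances(text):
    """Read a CSV flat file into a list of attribute -> value dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def attribute_values(instances, attribute):
    """Hypothetical helper: the set of values that one attribute takes."""
    return {instance[attribute] for instance in instances}


instances = load_instances(FLAT_FILE)
print(len(instances), "instances")
print(sorted(attribute_values(instances, "outlook")))
```

A text-mining instance could be stored the same way, with fields such as the term and its surrounding words, although the slides do not prescribe a particular layout.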


Concepts

The types of concepts we try to ‘learn’ include:

• Key “indicators” – features or terms specific to our domain

• Clusters or ‘natural’ partitions
– E.g. we might cluster customers according to their shopping habits.
– E.g. is this web page British or American English?

• Rules for classifying examples into pre-defined classes
– E.g. “Mature students studying information systems with a high grade for General Studies A level are likely to get a 1st class degree”

• General associations
– E.g. “People who buy nappies are in general likely also to buy beer”
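One common way to quantify an association such as "nappies implies beer" is with support and confidence. The slides do not name these measures, so treat the sketch below as an assumption about how such a rule might be scored; the baskets are invented purely for illustration.

```python
# Hypothetical shopping baskets, invented purely for illustration.
BASKETS = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from basket counts."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"nappies", "beer"}, BASKETS)
rule_confidence = confidence({"nappies"}, {"beer"}, BASKETS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here two of the five baskets contain both items, and two of the three baskets with nappies also contain beer, so the rule holds for most but not all nappy buyers in this toy data.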


More concepts

The types of concepts we try to ‘learn’ include:

• Unexpected (suspicious?) associations or coincidences
– E.g. known suspects A, B, C all phoned D last week

• Numerical prediction
– E.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result – this may give us an equation:

Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree

(but are Gender, Programme really numbers???)
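One possible answer to the question of whether Gender and Programme are really numbers is to replace each nominal attribute with a set of 0/1 indicator variables (often called one-hot encoding) before fitting a linear equation. This is a minimal sketch of that idea, not the method used in the course: the category lists, the sample graduate, and the assumption that degree result is a numeric mark are all hypothetical.

```python
def one_hot(value, categories):
    """Encode a nominal value as a list of 0/1 indicator variables."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical category lists, not taken from the slides.
GENDERS = ["female", "male"]
PROGRAMMES = ["computing", "information systems", "maths"]


def encode_graduate(a_level, age, gender, programme, degree):
    """Build a purely numeric feature vector for one graduate."""
    return (
        [float(a_level), float(age)]
        + one_hot(gender, GENDERS)
        + one_hot(programme, PROGRAMMES)
        + [float(degree)]
    )


def predict_salary(features, weights, intercept=0.0):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical graduate: A-level points, age, gender, programme, degree mark.
features = encode_graduate(300, 22, "female", "information systems", 65)
print(features)
```

With this encoding each category gets its own coefficient, so the equation no longer pretends that one programme is numerically "larger" than another.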


DB Example: weather to play?


/usr/local/weka-3-4-13/data/weather.nominal.arff

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}

@attribute temperature {hot, mild, cool}

@attribute humidity {high, normal}

@attribute windy {TRUE, FALSE}

@attribute play {yes, no}

@data

sunny,hot,high,FALSE,no

sunny,hot,high,TRUE,no

overcast,hot,high,FALSE,yes

rainy,mild,high,FALSE,yes

rainy,cool,normal,FALSE,yes

rainy,cool,normal,TRUE,no

overcast,cool,normal,TRUE,yes

sunny,mild,high,FALSE,no

sunny,cool,normal,FALSE,yes

rainy,mild,normal,FALSE,yes

sunny,mild,normal,TRUE,yes

overcast,mild,high,TRUE,yes

overcast,hot,normal,FALSE,yes

rainy,mild,high,TRUE,no
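To make "learning a rule" from this listing concrete, the sketch below applies a very simple one-attribute rule learner (in the spirit of the 1R method, which the slides do not mention, so treat the choice as an assumption). For each attribute it predicts the most common value of play for each attribute value and counts how many of the 14 training instances that rule gets wrong. The data rows are copied from the listing above; the function name `one_r_rule` is hypothetical.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# The 14 instances from weather.nominal.arff, copied from the listing above.
DATA = """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()

ROWS = [dict(zip(ATTRIBUTES, line.split(","))) for line in DATA]


def one_r_rule(rows, attribute, target="play"):
    """For each value of attribute, predict the most common target class.

    Returns (rule, errors): rule maps attribute value -> predicted class,
    errors counts the training instances that the rule misclassifies.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return rule, errors


for attribute in ATTRIBUTES[:-1]:
    rule, errors = one_r_rule(ROWS, attribute)
    print(attribute, rule, f"{errors}/{len(ROWS)} errors")
```

On this data the rules based on outlook and on humidity each misclassify 4 of the 14 instances, while temperature and windy each misclassify 5. Error on the training data is only a rough guide; a tool such as WEKA would normally evaluate rules on held-out data.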


/usr/local/weka-3-4-13/data/weather.arff

@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes


Text mining example: Which English dominates the WWW, UK or US?

• “First catch your rabbit” (Mrs Beeton’s cookbook): Other tools are possible, but WWW-BootCat was easier to use …

• First: sign up for Domain, SketchEngine account, Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html

• (see cw1 specifications and lecture notes …)


Example 2: Data Mining for an ontology

• Ontology: the “concepts” in a discipline, and meaning-relationships between these concepts (01.ppt)

• “concepts” roughly equates to “terminology” – specialist words and phrases in a discipline

• WordNet is freely available for general English

• What about other languages? – EuroWordNet, BalkaNet, … (but not ALL languages!)

• What about specific domains? Domain-specific ONTOLOGIES have to be devised (by experts)

• What about my own specific domain/language?

• Automatic extraction of key words / concepts from example documents (machine learning / knowledge discovery)


Automatic terminology extraction

Terminology extraction = thesaurus construction

based on documents (either retrieved set or the whole collection) as Corpus – training text set

define a ‘measure’ of how close one index term is to another – in meaning-space, ?or literal distance?

for each term, form a ‘neighbourhood’ comprising the nearest ‘n’ terms

treat these neighbourhoods like ‘related’ thesaurus classes

terms with similar neighbourhoods are treated as synonyms.


Finding “coordinate terms”

One attempt to define how close a term is to another:

• If two terms are both used to index the same document many times in the collection, then they are deemed to be close.

• From document-term matrix, compute term-correlation matrix

• The term correlation matrix can be normalised so that terms that index a lot of documents don’t have an unfair chance – reduce weight of common words
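The steps above can be sketched as follows: count how many documents two terms index together, normalise that count so that very common terms are not favoured, and take the nearest n terms as a neighbourhood. The slides do not say which normalisation to use, so the cosine-style formula here is an assumption, and the tiny index is invented for illustration.

```python
import math

# Hypothetical index: document id -> set of index terms (invented data).
INDEX = {
    "d1": {"corpus", "text", "mining"},
    "d2": {"corpus", "text"},
    "d3": {"text", "mining", "data"},
    "d4": {"text", "data"},
}

TERMS = sorted(set().union(*INDEX.values()))


def cooccurrence(term_a, term_b, index):
    """Number of documents indexed by both terms."""
    return sum(1 for terms in index.values() if term_a in terms and term_b in terms)


def normalised_closeness(term_a, term_b, index):
    """Co-occurrence count divided by the geometric mean of the two
    terms' document frequencies (an assumed, cosine-style normalisation)."""
    df_a = cooccurrence(term_a, term_a, index)
    df_b = cooccurrence(term_b, term_b, index)
    if df_a == 0 or df_b == 0:
        return 0.0
    return cooccurrence(term_a, term_b, index) / math.sqrt(df_a * df_b)


def neighbourhood(term, index, n=2):
    """The n terms closest to term; ties are broken alphabetically."""
    others = [t for t in TERMS if t != term]
    ranked = sorted(others, key=lambda t: (-normalised_closeness(term, t, index), t))
    return ranked[:n]


for term in TERMS:
    print(term, neighbourhood(term, INDEX))
```

In this toy index "text" occurs in every document, so its raw co-occurrence counts are high with everything; dividing by document frequency reduces that advantage, which is the point of the normalisation bullet.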


Other ways to find specialist terms

Other ways to find domain-specific terms and relations:

• Collect a domain corpus, find terms “different” from a generic “gold standard” corpus: British National Corpus

• Collocation-groups: For each term, collect its collocations in the Corpus: other words it appears next to (or near to). If two terms have similar collocation-sets, then they are deemed to be close.

• Association matrix based on proximity: compute average distance between pairs of terms (no. of words between them, literally), use this as closeness metric
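As a sketch of the collocation-group idea above, the code below collects the words that appear within a fixed window of each term and compares two terms by the overlap of their collocation sets. The Jaccard overlap measure, the window size and the toy sentence are assumptions made for illustration; the slides only say that the sets should be "similar".

```python
def collocation_set(term, tokens, window=1):
    """Words appearing within `window` positions of any occurrence of term."""
    neighbours = set()
    for i, token in enumerate(tokens):
        if token == term:
            start = max(0, i - window)
            neighbours.update(tokens[start:i])
            neighbours.update(tokens[i + 1:i + 1 + window])
    neighbours.discard(term)
    return neighbours


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical toy corpus, invented for illustration.
TOKENS = "the big car stopped and the big automobile stopped near a red car".split()

car = collocation_set("car", TOKENS)
automobile = collocation_set("automobile", TOKENS)
print(sorted(car), sorted(automobile), round(jaccard(car, automobile), 2))
```

The proximity-based association matrix in the last bullet could be built along the same lines by recording the distance between token positions instead of set membership.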


Why build a thesaurus?

• a thesaurus or ontology can be used to normalise a vocabulary and queries (?or documents?)

• it can be used (with some human intervention) to increase recall and precision

• generic thesaurus/ontology may not be effective in specialized collections and/or queries

• Semi-automatic construction of thesaurus/ontology based on the retrieved set of documents has produced some promising results, e.g. Semantic Web


Knowledge Discovery: Key points

• Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.

• Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules)…

• … and in terms of the output they provide (eg clustering algorithms provide a set of subclasses)

• Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered


A Data Mining consultant…

• You should be able to:

– Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.

– Decide which is the most appropriate form of input (which attributes/features will be “useful” for learning) and output (what does your client want to see?)

