COMP3410 DB32: Technologies for Knowledge Management


10: Introduction to Knowledge Discovery

By Eric Atwell, School of Computing, University of Leeds

(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

What has Machine Learning got to do with Computing / Information Systems?

“Most international organizations produce more information in a week than many people could read in a lifetime”

Adriaans and Zantinge

Objectives of knowledge discovery or data mining

• Data mining is about discovering patterns in data.

• For this we need:

– KD/DM techniques, algorithms, tools, eg BootCat, WEKA

– A methodological framework to guide us, in collecting data and applying the best algorithms: CRISP-DM

Data Mining, Knowledge Discovery, Text Mining

• Data Mining was originally about “learning” patterns from DataBases, data structured as Records, Fields

• Knowledge Discovery is an “exotic term” for DM???

• Increasingly, data is unstructured text (WWW), so

• Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data

define: data mining

• Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition. en.wikipedia.org/wiki/Data_mining

define: text mining

• Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. ... en.wikipedia.org/wiki/Text_mining

define: knowledge discovery

• Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test. www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html

• Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition. en.wikipedia.org/wiki/Knowledge_discovery

Data Mining: Overview

Concepts, Instances or examples, Attributes

Data Mining

Concept Descriptions

Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.

Instances

• Input to a data mining algorithm is in the form of a set of examples, or instances.

• Each instance is represented as a set of features or attributes.

• Usually in DB Data-Mining this set takes the form of a flat file; each instance is a record in the file, each attribute is a field in the record.

• In text-mining, an instance is a word/term in a corpus.

• The concepts to be learned are formed from patterns discovered within the set of instances.

Concepts

The types of concepts we try to ‘learn’ include:

• Key “differences” – terms specific to our domain corpus

• Clusters or ‘Natural’ partitions;

– Eg we might cluster customers according to their shopping habits.

• Rules for classifying examples into pre-defined classes.

– Eg “Mature students studying information systems with high grade for General Studies A level are likely to get a 1st class degree”

• General Associations

– Eg “People who buy nappies are in general likely also to buy beer”
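
A "general association" such as the nappies-and-beer example is usually judged by how often the items occur together (support) and how often the second item appears given the first (confidence). The sketch below is only an illustration: the baskets are invented, and the function names `support` and `confidence` are our own, not taken from WEKA or any other tool.

```python
# Toy shopping baskets: invented purely for illustration, not real data.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"nappies", "beer"}, baskets)
rule_confidence = confidence({"nappies"}, {"beer"}, baskets)
print(f"nappies -> beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is normally reported only if both numbers exceed thresholds chosen by the analyst; those thresholds depend on the business objective.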

More concepts

The types of concepts we try to ‘learn’ include:

• Numerical prediction

– Eg look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result – this may give us an equation:

Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree

(but are Gender, Programme really numbers???)
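
One common way round the "are these really numbers?" problem is to replace each categorical attribute with a set of 0/1 indicator columns before applying the linear equation. The sketch below assumes this one-hot encoding; the category lists, field names and weights are all made up for illustration and are not fitted to any data.

```python
# Hypothetical category values, chosen only for illustration.
GENDERS = ["female", "male"]
PROGRAMMES = ["computing", "information systems"]


def one_hot(value, categories):
    """Return a list with 1.0 in the position of value and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == category else 0.0 for category in categories]


def encode(graduate):
    """Turn one graduate record (a dict) into a purely numeric feature vector."""
    return (
        [float(graduate["a_level_points"]), float(graduate["age"])]
        + one_hot(graduate["gender"], GENDERS)
        + one_hot(graduate["programme"], PROGRAMMES)
        + [float(graduate["degree_mark"])]
    )


def predict_salary(weights, features):
    """Linear model: weighted sum of the numeric features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))


graduate = {
    "a_level_points": 300,
    "age": 22,
    "gender": "female",
    "programme": "information systems",
    "degree_mark": 68,
}
features = encode(graduate)
# Made-up weights, one per feature; a real model would learn these from data.
weights = [10.0, 100.0, 0.0, 0.0, 500.0, 1000.0, 200.0]
print(features)
print(predict_salary(weights, features))
```

Each category now gets its own coefficient, so the model no longer pretends that one programme is "twice" another.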

DB Example: weather to play?

/usr/local/weka-3-4-5/data/weather.arff

@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
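
To make the instance/attribute idea concrete, the sketch below reads the ten weather instances shown on this slide into one dictionary per record, then builds a very simple one-attribute rule: for each outlook value, predict the majority value of play. The parser is a hypothetical helper that handles only this simple ARFF subset (no quoted values, no sparse format); it is not WEKA's own loader, and the rule is only in the spirit of a one-attribute rule learner.

```python
from collections import Counter, defaultdict

ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
"""


def parse_arff(text):
    """Parse a simple ARFF text into (attribute names, list of instance dicts)."""
    names, instances, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [value.strip() for value in line.split(",")]
            instances.append(dict(zip(names, values)))
    return names, instances


names, instances = parse_arff(ARFF_TEXT)

# Count how often each class value occurs for each outlook value.
by_outlook = defaultdict(Counter)
for instance in instances:
    by_outlook[instance["outlook"]][instance["play"]] += 1

# One rule per outlook value: predict the majority class.
rule = {outlook: counts.most_common(1)[0][0] for outlook, counts in by_outlook.items()}
print(rule)
```

All values are kept as strings here; a fuller loader would convert the `real` attributes to numbers.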

Text mining example: discovering terms in a domain, using WWW-BootCat

• “First catch your rabbit” (Mrs Beeton’s cookbook): Other tools are possible, but WWW-BootCat *should* be easier to use …

• First: sign up for Domain, SketchEngine account, Google key; download seeds-en from http://corpus.leeds.ac.uk/internet.html

• (see coursework spec for URLs)

First collect your corpus

• Advanced Search option with parameter settings:

– using Serge Sharoff's seed-en http://corpus.leeds.ac.uk/internet/seeds-en list of typical medium-frequency English words as seed-words,

– Google key set to the Key which I set up beforehand at https://www.google.com/accounts/NewAccount

– Language set to English

– Select URLs ticked, so I can cut-and-paste the list of urls to a textfile (TO HAND IN WITH CW)

– Corpus name set to EnglishUK (in my case), or English?? (change ?? to your Domain)

– email address set to [email protected]

– Query Extension set to site:.uk (in my case), or site:.?? (change ?? to your Domain)

– other Advanced Options left at default values...???

– ... then click on Build a corpus!, follow instructions as they appear, and (after some wait) download the corpus in raw and vertical formats (either direct from URL or wait for email to tell you URL…)

Problems?

• WWW-Bootcat: log in, Advanced options: upload seed-en, check URLs, site:.??; Build Corpus

• If it crashes, ?bad HTML in website?, try again

• Download your corpus, because…

• 500,000-word quota – room for 2 corpuses (only), so you can only compare 2 at a time in WWWBootCat

• Or compare on your linux account…

• /home/www/db32/cw/EnglishUS , EnglishUK

Comparing text corpora

• Aim: to find terms in C1 not in C2?

• and terms in C2 not in C1?

• Sort C1, C2 in Vertical format (1 word per line) to give C1termlist, C2termlist:

– sort C1 > C1termlist; sort C2 > C2termlist

– diff C1termlist C2termlist

• BUT this shows LOTS of differences

• many “not significant”: 1 example (hapax legomena)
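
The same comparison can be sketched in Python, with a threshold that drops the hapax legomena. This is an illustrative alternative to the sort/diff commands, not part of BootCat; the function name `terms_only_in` and the two tiny token lists are invented for the example.

```python
from collections import Counter


def terms_only_in(c1_tokens, c2_tokens, min_count=2):
    """Terms occurring at least min_count times in c1 and never in c2."""
    c1_counts = Counter(c1_tokens)
    c2_vocab = set(c2_tokens)
    return sorted(
        term
        for term, count in c1_counts.items()
        if count >= min_count and term not in c2_vocab
    )


# Tiny invented token lists standing in for two vertical-format corpora.
c1 = "the colour of the lorry and the colour of the pavement".split()
c2 = "the color of the truck and the color of the sidewalk".split()

all_differences = terms_only_in(c1, c2, min_count=1)  # hapaxes included
without_hapaxes = terms_only_in(c1, c2, min_count=2)  # hapax legomena dropped
print(all_differences)
print(without_hapaxes)
```

Swapping the two arguments gives the terms in C2 that are not in C1.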

Comparing “significant” terms

• Better: to find “significant” terms in C1 not in C2

• sort C1 | uniq -c | sort -n -r > C1termlist

• Terms with frequencies – most common first

• Can be compared “OLAP-style” – you can spot high-freq words in one list but not the other

• ? No need for further processing?
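
For those who prefer a script to the command line, a frequency list like the one produced by sort | uniq -c | sort -n -r can be sketched in Python as below. The helper names are our own, and the order of equally frequent terms is alphabetical here, which may differ from the shell pipeline. `read_vertical` assumes one token per line and a UTF-8 file.

```python
from collections import Counter


def read_vertical(path):
    """Read a vertical-format corpus file: one token per line, blanks skipped."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def frequency_list(tokens):
    """(term, count) pairs, most frequent first; ties broken alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def top_terms_missing_from(c1_tokens, c2_tokens, top_n=10):
    """Terms among the top_n of c1 that are not among the top_n of c2."""
    top1 = [term for term, _ in frequency_list(c1_tokens)[:top_n]]
    top2 = {term for term, _ in frequency_list(c2_tokens)[:top_n]}
    return [term for term in top1 if term not in top2]


# Tiny invented token lists; real use would call read_vertical() on each corpus.
c1 = "the colour of the lorry and the colour of the pavement".split()
c2 = "the color of the truck and the color of the sidewalk".split()

c1_termlist = frequency_list(c1)
print(c1_termlist)
print(top_terms_missing_from(c1, c2, top_n=3))
```

This automates the "spot high-freq words in one list but not the other" step, at the cost of having to choose a cut-off `top_n`.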

Comparing word-frequencies

• BootCat (and others, eg Paul Rayson) offer tools to compare frequencies of words – to find words used MUCH MORE in one corpus than another

• Several different metrics available, eg “mutual information”, “normalised frequency difference”,…

• Not necessary for DB32 coursework (probably)

• … BUT I will be impressed if you do use these advanced metrics!
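
As a rough idea of what such a metric involves, the sketch below computes one plausible reading of "normalised frequency difference": each term's frequency per million tokens in C1 minus its frequency per million tokens in C2. This interpretation is an assumption; the exact definitions used by BootCat or by Paul Rayson's tools may differ. The token lists are invented.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Relative frequency scaled to occurrences per million tokens."""
    return 1_000_000 * count / corpus_size


def normalised_differences(c1_tokens, c2_tokens):
    """Map each term to (freq per million in c1) - (freq per million in c2)."""
    if not c1_tokens or not c2_tokens:
        raise ValueError("both corpora must be non-empty")
    f1, f2 = Counter(c1_tokens), Counter(c2_tokens)
    n1, n2 = len(c1_tokens), len(c2_tokens)
    return {
        term: per_million(f1[term], n1) - per_million(f2[term], n2)
        for term in set(f1) | set(f2)
    }


# Invented corpora of different sizes, to show why normalising matters.
c1 = ["colour"] * 3 + ["the"] * 7    # 10 tokens
c2 = ["color"] * 6 + ["the"] * 14    # 20 tokens

diffs = normalised_differences(c1, c2)
for term, diff in sorted(diffs.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {diff:+.0f}")
```

Raw counts would suggest "the" is twice as common in C2; after normalising by corpus size the difference disappears, leaving only the terms that genuinely distinguish the two corpora.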

Knowledge Discovery: Key points

• Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.

• Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules)…

• … and in terms of the output they provide (eg clustering algorithms provide a set of subclasses)

• Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered

Self-test

• You should be able to:

– Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.

– Decide which is the most appropriate form of output.

Date post:	06-Jan-2016