  • HG8003 Technologically Speaking: The intersection of language and technology.

    Text Mining and Knowledge Acquisition

    Francis Bond
    Division of Linguistics and Multilingual Studies
    http://www3.ntu.edu.sg/home/fcbond/

    [email protected]

    Lecture 5, Location: LT8

    HG8003 (2014)

  • Schedule

    Lec. Date  Topic
    1    01-16 Introduction, Organization: Overview of NLP; Main Issues
    2    01-23 Representing Language
    3    02-06 Representing Meaning
    4    02-13 Words, Lexicons and Ontologies
    5    02-20 Text Mining and Knowledge Acquisition (Quiz)
    6    02-27 Structured Text and the Semantic Web
         Recess
    7    03-13 Citation, Reputation and PageRank
    8    03-20 Introduction to MT, Empirical NLP
    9    03-27 Analysis, Tagging, Parsing and Generation (Quiz)
    10   Video Statistical and Example-based MT
    11   04-03 Transfer and Word Sense Disambiguation
    12   04-10 Review and Conclusions

    Exam 05-06 17:00

    ➣ Video week 10

    Text Mining and Knowledge Acquisition 1

  • Introduction

    ➣ Review

    ➣ Text Mining and Knowledge Acquisition

    ➣ Homework


  • Review of Lexicons and Ontologies


  • Review

    ➣ Storing information on machines allows us to manipulate it in many ways

    ➣ Information for humans can be made easier to search and validate

    ➢ Machine Readable Dictionaries

    ➣ Information for machines must be made explicit

    ➢ Dictionaries for various processors
    ➢ Ontologies

    ➣ We can reuse knowledge to make new resources


  • Machine Readable Lexicon

    definition (n) a concise explanation of the meaning of a word or phrase or symbol

    ➣ Headword: definition

    ➣ Part of Speech: n (noun)

    ➣ Definition:

    ➢ genus: explanation
    ➢ differentia: concise; of the meaning of a word or phrase or symbol

    ? Implied: countable (a), regular plural


  • Erin McKean’s TED Talk

    ➣ Redefining the dictionary (by Erin McKean; TED Talk 2007)(http://blog.ted.com/2007/08/30/redefining_the/)

    ➣ Dictionaries still don’t cover all words
    many, many new words are undefined
    as many as one per book?

    ➣ We need to define these words in context

    ➣ On-line dictionaries allow us to do this without space limitations

    ➢ Dictionaries can describe usage with real examples


  • Ontology Example (WordNet)

    Synset 06744396-n: definition

    Def: ‘a concise explanation of the meaning of a word or phrase or symbol.’

    Hype: account
    Hypo: redefinition, explicit definition, recursive definition, stipulative definition, contextual definition, ostensive definition, dictionary definition

    SUMO: = equivalentContentInstance

    Has-Part: genus
    Has-Part: differentia


  • What is an Ontology?

    ➣ A set of statements in a formal language that describes/conceptualizes knowledge in a given domain

    ➢ What kinds of entities exist (in that domain)
    ➢ What kinds of relationships hold among them

    ➣ Ontologies usually assume a particular level of granularity

    ➢ doesn’t capture all details


  • Text Mining


  • Overview of Text Mining

    ➣ Text Mining

    ➣ Template Filling

    ➢ Named Entity Recognition

    ➣ Relation Detection

    ➣ Learning Lexical Knowledge


  • Why Text Mining?

    ➣ Too much (textual) information

    ➣ We now have electronic books, documents, web pages, emails, blogs, news, chats, memos, research papers, . . .

    . . . much of it immediately accessible, thanks to databases and Information Retrieval (IR)

    ➣ An estimated 80–85% of all data stored in databases is natural language

    ➣ But humans do not scale so well . . .

    ➣ This results in the common perception of Information Overload


  • Example: The BioTech Industry

    ➣ Access to information is a serious problem

    ➢ 80% of biological knowledge is only in research papers
    ➢ finding the information you need is prohibitively expensive

    ➣ Humans do not scale well

    ➢ if you scan 60 research papers/week
    ➢ and read 10% of those which are interesting
    ➢ a scientist manages 6/week, or 300/year

    ➣ This is not good enough

    ➢ MedLine adds more than 10,000 abstracts each month
    ➢ Chemical Abstracts Registry (CAS) registers 4,000 entities each day


  • The growth in PubMed articles


  • What is Text Mining?

    ➣ The discovery by computer

    ➢ of new, previously unknown information,
    ➢ by automatically extracting information
    ➢ from a usually large amount
    ➢ of different unstructured textual resources.


  • ➣ What does previously unknown mean?

    ➢ Implies discovering genuinely new information.
    ➢ Marti Hearst’s analogy:

    Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.

    ➣ What about unstructured?

    ➢ Naturally occurring text.
    ➢ As opposed to HTML, XML, databases, . . .


  • Text Mining Process

    1. Document Collection

    2. Preprocessing

    3. Mining

    ➣ Template Filling
    ➣ Relation Extraction

    4. Presentation/Visualization


  • Document Collection

    ➣ This is basically information retrieval

    ➣ Normally want to restrict the text domain in some way

    ➢ Existing document collections
    ∗ Research Papers
    ∗ Newspapers
    ∗ Phone conversations

    ➢ Induced document collections
    ∗ Find all documents similar to a seed

    ➣ What you do depends on the goal


  • Issues with documents

    ➣ Large document collections contain erroneous data

    ➢ Mistaken analyses
    ➢ Deliberately erroneous data
    ➢ Out-of-date data
    ➢ Fictional data

    ➣ Text is typically noisy

    ➢ Spelling errors
    ➢ Conversion errors (hyphens, headers, footers)


  • Study: 58 Percent Of U.S. Exercise Televised

    WASHINGTON, DC—According to a new Department of Health and Human Services study, 58 percent of all exercise performed in the U.S. is broadcast on television. “Of the 3.5 billion push-ups performed in 2003, 2.03 billion took place on exercise shows on the Lifetime Network and ESPN3 or fitness segments on Good Morning America,” the study read. “The abundance of TV exercise would create the impression that America is a healthy society, if everyone didn’t already know that we’re a bunch of disgusting, near-immobile spectators.” The DHHS study also indicated that 99.3 percent of the nation’s Soloflex workouts are televised.

    The ONION America’s Finest News Source

    http://www.theonion.com/articles/study-58-percent-of-us-exercise-televised,4623/

  • MSNBC News 2004-03-12

    NORVILLE: Finally tonight, if you were watching the show earlier this week, you heard Health and Human Services Tommy Thompson encourage Americans to work out and watch what they eat.

    Good advice, because it turns out most Americans are watching their workouts. Yes, according to a new study by Thompson’s department, 58 percent of all the exercise done in America is broadcast on television. For instance, of the 3.5 billion sit-ups done during 2003, two million, 30,000 of them were on exercise shows on Lifetime or one of the ESPN channels. Put it another way, according to the study, 99 percent of the time that someone is using one of those Soloflex machines, it’s when it’s being broadcast on one of those late-night commercials. [. . . ]

    We want to hear from you, so send us your e-mails and ideas to us at [email protected].

    Thanks for watching. I’m Deborah Norville.

    http://www.msnbc.msn.com/id/4533441/

  • Text Preprocessing

    ➣ Text cleanup

    ➢ remove ads from web pages
    ➢ normalize text converted from binary formats
    ➢ deal with tables, figures and formulas

    ➣ Tokenization

    ➢ Splitting up a string of characters into a set of tokens.
    ➢ Need to deal with issues like:

    ∗ Apostrophes, e.g., “John’s sick”: is it 2 or 3 tokens?
    ∗ Hyphens, e.g., database vs. data-base vs. data base.
    ∗ How should we deal with ‘C++’, ‘A/C’, ‘:-)’, ‘. . . ’?
    ∗ Is the amount of white space significant?
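The tokenization choices above can be made concrete with a toy tokenizer (a sketch, not the course's tool; the pattern and its choices, such as splitting off 's but keeping hyphenated words whole, are illustrative):

```python
import re

def tokenize(text):
    """Toy tokenizer: split off 's as its own token (so "John's sick" is
    3 tokens), keep hyphenated words together, and make every other
    non-letter character a token of its own (so 'C++' fragments)."""
    pattern = r"[A-Za-z]+(?:-[A-Za-z]+)*|'s|[^\sA-Za-z']"
    return re.findall(pattern, text)
```

With this pattern "John's sick" yields three tokens and data-base stays one token; choosing differently is exactly the design question the slide raises.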


  • Text Processing

    ➣ Sentence Splitting (split into sentences)

    ➣ Part of Speech Tagging (annotate POS)

    ➣ Chunking (find constituents, typically noun phrases)

    ➣ Lemmatization

    ➢ try to find the root form (mice → mouse)

    ➣ Stemming

    ➢ try to find the stem (computing, computer → comput)

    ➣ Parsing
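The difference between lemmatization and stemming can be sketched in a few lines (the rules below are illustrative only; real systems use dictionary-based lemmatizers or e.g. the Porter stemmer):

```python
# Lemmatization: map to the dictionary root form (mice -> mouse).
IRREGULAR = {"mice": "mouse", "geese": "goose", "children": "child"}

def lemmatize(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                      # naive plural stripping
    return word

# Stemming: strip suffixes down to a stem that need not be a word
# (computing, computer -> comput).
def stem(word):
    for suffix in ("ing", "er", "ers", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```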


  • Take out your clickers!


  • What to Mine?

    ➣ Email, Instant Messages, Blogs, Twitter, . . .

    ➢ Entities (Persons, Companies, Organizations, . . . )
    ➢ Events (Inventions, Offers, Attacks, . . . )

    Biggest existing system: ECHELON (UK/USA)

    ➣ News: Newspaper articles, Newswires, . . .

    ➢ Collections of articles (e.g., from different agencies, describing the same event)

    ➢ Contrastive summaries (e.g., an event described by a U.S. newspaper vs. an Arabic newspaper)


  • ➣ (Scientific) Books, Papers, . . .

    ➢ detect new trends in research
    ➢ automatic curation of research results in Bioinformatics

    need to deal with highly specific language

    ➣ Software Requirement Specifications, Documentation, . . .

    ➢ extract requirements from software specifications
    ➢ detect conflicts between source code and its documentation

    ➣ Web Mining

    ➢ extract and analyse information from web sites
    ➢ mine companies’ web pages (detect new products & trends)
    ➢ mine Intranets (gather knowledge, find ‘illegal’ content, . . . )


  • Typical Text Mining Tasks

    ➣ Classification and Clustering

    ➢ Email Spam-Detection, Classification (Orders, Offers, . . . )
    ➢ Clustering of large document sets (vivisimo.com)
    ➢ Creation of topic maps (www.leximancer.com)

    ➣ Web Mining

    ➢ Trend Mining, Opinion Mining, Novelty Detection
    ➢ Ontology Creation, Entity Tracking, Information Extraction

    ➣ Summarization


  • Template Filling

    ➣ Look for a fixed template

    ➢ Seminar Announcement

    Title       The Artificial Boundary of Humanities and Science:
                Arguments from Linguistics and Literature
    Speaker     A/P Wee Lian Hee
    Institution Hong Kong Baptist University
    Date        24 August 2009, Thursday
    Place       HSS SEM RM 3 (HSS-B1-10)

    ➣ This can be used to fill in a calendar


  • Extract from email

    You are cordially invited to the CLASS Seminar on ”The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature” by A/P Wee Lian Hee of Hong Kong Baptist University on 24 August 2009, Thursday at HSS SEM RM 3 (HSS-B1-10). Please disseminate this email to your colleagues and students who may be interested to attend.

    Note: 24 August 2009 was a Monday — easier to check if the date is extracted


  • Template Extraction

    ➣ Identify entities

    ➢ Named Entity Recognition

    ➣ Look for patterns that match slots

    ➢ Relation Extraction


  • Named Entity Recognition

    ➣ Identify interesting things, typically

    PER  People
    ORG  Organization
    LOC  Location
    GPE  Geo-Political Entity
    FAC  Facility
    TIM  Time
    MON  Money

    ➣ Task Dependent

    TIT Talk Title


  • Named Entity Recognition

    You are cordially invited to the CLASS Seminar on
    [TIT ”The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature”] by
    [PER A/P Wee Lian Hee] of
    [ORG Hong Kong Baptist University] on
    [TIM 24 August 2009, Thursday] at
    [LOC HSS SEM RM 3 (HSS-B1-10)]. Please disseminate this email to your colleagues and students who may be interested to attend.


  • Named Entity Recognition (NER) as Sequence Labeling


  • NER as Sequence Labeling

    ➣ Typically learn a series of classifiers

    ➢ one for each NE type
    ➢ choose the one with the highest score

    ➣ IOB encoding
    B  Beginning
    I  Inside
    O  Outside

    ➢ by/O Wee/B-PER Lian/I-PER Hee/I-PER of/O Hong/B-ORG Kong/I-ORG Baptist/I-ORG University/I-ORG on/O
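Producing IOB labels from entity spans can be sketched as follows (a hypothetical helper, not from the slides; spans are half-open token indices):

```python
def bio_tags(tokens, entities):
    """Label tokens with B-X/I-X/O tags given (start, end, type) spans."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype           # continuation tokens
    return tags
```

For the fragment above, a PER span covers Wee Lian Hee and an ORG span covers Hong Kong Baptist University.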


  • Typical Features for NER

    ➣ Words

    ➣ Stemmed words (or lemmatized)

    ➣ Shape (the orthographic form)

    ➣ Part of Speech

    ➣ Chunks (constituents: e.g., noun phrases)

    ➣ Gazetteer (Name List)

    ➣ n-gram bag-of-words


  • Typical Shape Features

    Feature                    Example   Comment
    Lower Case                 cummings
    Capitalized                Nanyang   Name
    All Caps                   NTU
    Mixed Case                 eBay
    Capital letter and Period  F.        Person Name
    Ends in digit              A9
    Hyphenated                 H-P
    Four numbers               1967      Year
    Eight numbers              64561967  Phone Number
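A shape feature extractor in the spirit of the table might look like this (a sketch; the class names and the order of the checks are my own):

```python
import re

def shape(token):
    """Map a token to a coarse orthographic shape class."""
    if re.fullmatch(r"\d{4}", token):
        return "four-digits"      # 1967: often a year
    if re.fullmatch(r"\d{8}", token):
        return "eight-digits"     # 64561967: often a phone number
    if re.fullmatch(r"[A-Z]\.", token):
        return "cap-period"       # F.: often a person-name initial
    if "-" in token:
        return "hyphenated"       # H-P
    if re.fullmatch(r"[A-Za-z]+\d", token):
        return "ends-digit"       # A9 (checked before all-caps)
    if token.isupper():
        return "all-caps"         # NTU
    if token[:1].isupper() and token[1:].islower():
        return "capitalized"      # Nanyang
    if token.islower():
        return "lower"            # cummings
    return "mixed"                # eBay
```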


  • Gazetteer

    ➣ Geographical dictionary or directory (Original Meaning)

    ➣ Names from US Census

    ➣ Companies from stock market lists

    ➣ International Organization for Standardization (ISO) lists, e.g., ISO 3166-2 Region Names

    ISO 3166-2 regions for AU
    AU-NSW : New South Wales
    AU-QLD : Queensland
    AU-SA  : South Australia

    ➣ Facebook names, Student Lists, Phonebooks, . . .


  • Training a Classifier

    ➣ Training Input

    ➢ Labelled examples (IOB)
    ➢ Features (extracted)
    ➢ Gazetteers

    ➣ Classifier

    ➢ Takes text labelled with features
    ➢ Labels Named Entities
    ➢ Many machine learning methods (HMM, SVM, EM, kNN, . . . )

    ➣ Typically high precision (80-90%), low recall (30-40%)

    ➢ Better on restricted text


  • Using a Classifier


  • Different Ways of Tagging Chunks

    Tokens      IO     BIO    BMEWO  BMEWO+
    Yesterday   O      O      O      BOS_O
    afternoon   O      O      O      O
    ,           O      O      O      O_PER
    John        I_PER  B_PER  B_PER  B_PER
    J           I_PER  I_PER  M_PER  M_PER
    .           I_PER  I_PER  M_PER  M_PER
    Smith       I_PER  I_PER  E_PER  E_PER
    traveled    O      O      O      PER_O
    to          O      O      O      O_LOC
    Washington  I_LOC  B_LOC  W_LOC  W_LOC
    .           O      O      O      O_EOS

    From Bob Carpenter’s LingPipe blog:
    lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/


  • IO, BIO Encoding

    ➣ IO encoding

    ➢ Tags:
    I_X  token is in named entity X
    O    token is outside a named entity

    ⊗ Can’t represent two entities next to each other

    ➣ BIO encoding

    ➢ Tags: O and:
    B_X  token is the beginning of named entity X
    I_X  token is a continuation of named entity X

    ➢ Industry standard


  • BMEWO Encoding

    ➣ BMEWO encoding

    ➢ Tags: B, O and:
    M_X  token is in the middle of named entity X (sometimes I_X)
    E_X  token is at the end of named entity X
    W_X  single-token named entity X (sometimes S_X)

    ➢ Useful with more powerful machine learning (e.g., max entropy)
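Since BMEWO only refines BIO, the conversion is deterministic; a sketch (the function name is mine):

```python
def bio_to_bmewo(tags):
    """Convert BIO tags to BMEWO: B stays B (or becomes W for a
    single-token entity), I becomes M (or E at the end of an entity),
    O stays O."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append("O")
            continue
        etype = tag[2:]
        cont = (nxt == "I-" + etype)          # does the entity continue?
        if tag.startswith("B-"):
            out.append(("B-" if cont else "W-") + etype)
        else:                                 # I-X
            out.append(("M-" if cont else "E-") + etype)
    return out
```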


  • BMEWO+ Encoding

    ➣ BMEWO+ encoding (Bob Carpenter)

    ➢ Tags: B, M, E, W, O and:
    O_X    O token before named entity X
    X_O    O token after named entity X
    BOS_O  O token at the beginning of the sentence
    O_EOS  O token at the end of the sentence

    ➢ Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information about preceding or following words: John said, in Boston.

    ➢ Begin and end of sentence tags help to reduce the confusion between English sentence capitalization and proper name capitalization.


  • Try it!

    Tag the following:

    ➣ Eric Raymond was a GNU contributor in the mid-1980s.

    ➣ Of the 3.5 billion push-ups performed in 2003, 2.03 billion took place on exercise shows on the Lifetime Network and ESPN3 or fitness segments on Good Morning America.

    ➣ Use BIO with the following tags:

    PER  People
    ORG  Organization
    LOC  Location
    TIM  Time
    MON  Money
    OTH  Other


  • Evaluation Metrics

    Precision  Ratio of correctly labeled to all labeled (P; a kind of accuracy)

    Recall  Ratio of correctly labeled to all that should have been labeled (R)

    F-measure  A measure of overall goodness: F = 2PR / (P + R)

    More generally, the F-measure is F_β = (1 + β²)PR / (β²P + R).

    Most often we set β = 1. If Recall is more important, increase β; if Precision is more important, decrease β.
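The formula can be checked directly with a minimal implementation of the definition above:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    return ((1 + beta ** 2) * precision * recall
            / (beta ** 2 * precision + recall))
```

With beta = 2, a high-recall system scores above a high-precision system with the same numbers swapped, which is why you raise β when recall matters more.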


  • Confidence in Evaluation

    ➣ We train on one part of the data (training set)

    ➣ We tune the algorithm on another part (development set)

    ➣ We test on a third part (test set)

    ➢ Ideally this is unseen by the developers

    ➣ However, if you split the data in different ways, you may get different results

    ➢ Some parts may be more difficult, or less similar to the test set


  • n-fold cross validation

    ➣ We can average over differences in data using n-fold cross validation

    ➢ divide the data into n parts (folds)
    ➢ train on n − 1 folds and test on the remaining fold
    ➢ do this for each fold and use the average
    ➢ we can also check the variation
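The fold bookkeeping can be sketched as follows (assigning items to folds by stride is one arbitrary choice; a real run would shuffle or stratify first):

```python
def cross_validation_folds(data, n=10):
    """Yield (train, test) pairs for n-fold cross validation."""
    folds = [data[i::n] for i in range(n)]        # n roughly equal parts
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```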


  • Factors in Classifier Performance

    ➣ More training data improves performance

    ➣ Similar training data improves performance

    ➢ Out of domain performance is often a problem

    ➣ Orthogonal knowledge sources improve performance

    ➢ Different sources allow cross-checking➢ e.g., Use gazetteers and shape-based features

    ➣ Better machine learning algorithms improve performance

    ➢ Typically by allowing more or better features


  • Relation Detection


  • Relation Detection (Learning Lexical Knowledge)

    ➣ Try to find relations

    ➢ Hypernymy
    ➢ Synonymy
    ➢ Speaker
    ➢ . . .

    ➣ Can also find lexical knowledge

    ➢ Syntactic structure
    ➢ Countability

    We will give two examples:
    ∗ Bootstrapping from an ontology
    ∗ Learning from text


  • The basic approach

    ➣ Similar relations behave the same
    ➢ Look for patterns
    ➢ Look for contexts

    ➣ Overcome noise by looking at multiple examples


  • Acquisition: Patterns

    ➣ Why disease carrying animals such as rat and cockroach didn’t get disease from the bacteria they carried?

    ➣ Certain species of birds, such as the Phainopepla, a slim, glossy, black bird with a slender crest, breed during the relatively cool spring, then leave the desert for cooler areas at higher elevations or along the Pacific coast.

    ➣ A few desert animals, such as the Round-tailed Ground Squirrel, a diurnal mammal, enter a state of estivation when the days become too hot and the vegetation too dry.

    ➣ Skeptics paint a picture of Noah going to countries remote from the Middle East to gather animals such as kangaroos and koalas from Australia, and kiwis from New Zealand.



  • Acquisition: Patterns

    ➣ Hypernyms

    ➢ S (such as|like|e.g.) A, B and C   (S ⊃ A, B, C)
    ➢ A, B, C and other S
    ➢ S (including|especially) A, B, C
    ➢ the A, an S,

    ➣ Synonyms

    ➢ both A and B   (A ≈ B)
    ➢ either A or B
    ➢ neither A nor B
    ➢ A (B)


  • ➣ Templates

    ➢ (seminar|talk) on TIT
    ➢ PER of ORG ⇒ Institution
    ➢ (seminar|talk) . . . by PER ⇒ Speaker

    Seminar on “The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature” by A/P Wee Lian Hee of Hong Kong Baptist University


  • Example: Hypernym

    As the world looks around anxiously for an alternative to oil, energy sources such as biofuels, solar, and nuclear seem like they could be the magic ticket

    ➣ Extract

    ➢ energy source ⊃ biofuel
    ➢ energy source ⊃ solar
    ➢ energy source ⊃ nuclear

    ➣ Note: use of lemmatization and chunking.

    ➣ Need to find multiple examples (can be different patterns)
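One "such as" pattern can be sketched with a regular expression (a toy version of a Hearst-style pattern; real systems use chunking and lemmatize the hyponyms too):

```python
import re

# S such as A, B(,) and C  =>  S ⊃ A, B, C
PATTERN = re.compile(
    r"(\w+(?: \w+)?)s such as "
    r"((?:[\w-]+, )*[\w-]+,? (?:and|or) [\w-]+|[\w-]+)")

def hearst_such_as(sentence):
    pairs = []
    m = PATTERN.search(sentence)
    if m:
        # The trailing -s is consumed by the pattern: crude singularization.
        hypernym = m.group(1)
        for item in re.split(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+",
                             m.group(2)):
            if item:
                pairs.append((hypernym, item))
    return pairs
```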


  • Acquisition: Learning Patterns!

    ➣ If you know some relations, look for patterns they occur in

    ➢ n-grams: dog w1 w2 w3 animal
    ➢ dependencies: SUBJ(dog, w1), OBJ(animal, w1)

    ➣ Then use the learned patterns to find more relational pairs

    ➢ e.g., train on positive and negative WordNet pairs
    positive: 〈dog, animal〉, 〈food, pizza〉, . . .
    negative: 〈dog, truck〉, 〈food, trust〉, . . .

    ➣ Used to add 30,000 entries to WordNet (Snow et al., 2006)


  • Example of a discovered pattern

    ➣ S called A (S ⊃ A)

    ➣ Learned from cases (in WordNet) such as:

    ➢ 〈sarcoma, cancer〉: . . . an uncommon bone cancer called osteogenic sarcoma and to . . .

    ➢ 〈deuterium, atom〉: . . . heavy water rich in the doubly heavy hydrogen atom called deuterium

    ➣ Finds cases (not in WordNet):

    ➢ 〈efflorescence, condition〉: . . . and a condition called efflorescence are other reasons for . . .

    ➢ 〈hat creek outfit, ranch〉: . . . run a small ranch called the Hat Creek Outfit.

    ➢ 〈tardive dyskinesia, problem〉: . . . irreversible problem called tardive dyskinesia . . .


  • Pattern and Bootstrapping-based Relation Extraction


  • Bootstrapping

    ➣ “help oneself, often through improvised means”

    ➣ In machine learning: “any method that takes a few seed examples andlearns patterns from them”. (There are other more technical meanings)


  • Obscene patterns

    (Cartoon from http://xkcd.com/798/)

  • Case Studies


  • Two Examples of Knowledge Acquisition

    ➣ Attempt to find the countability of English nouns

    ➢ Countability and Semantics (knowledge based)(Bond and Vatikiotis-Bateson, 2002)

    ➢ Countability and Distribution (text based)(Baldwin and Bond, 2003)


  • Why should we care?

    ➣ In generation need to decide between:

    ➢ a cake, cakes, a piece of cake
    ➢ Especially important in machine translation

    ➣ In analysis, helps to resolve ambiguity:

    ➢ I like dogs (in general)
    ➢ I like a dog (a specific dog)
    ➢ I like dog (dog meat)

    ➣ Useful in teaching English (yet not marked in dictionaries)


  • Countability from an Ontology

    ➣ Many grammatical phenomena are both:

    ➢ semantically motivated
    ➢ arbitrarily marked in different languages

    ➣ For example information is

    ➢ Countable in French; Uncountable in English

    ➣ How much of syntax is predictable from meaning?


  • Outline

    ➣ How far is English countability predictable from meaning?

    Short Answer: 78%
    Long Answer: It depends
    ➢ Definition of countability: five classes for noun countability preferences
    ➢ Definition of meaning: a hierarchical ontology of 2,710 semantic classes


  • Noun Phrase Countability

    ➣ Semantically motivated:

    ➢ bounded, indivisible individuals (things): prototypically COUNTABLE: a dog, two dogs

    ➢ unbounded, divisible substances (stuff): prototypically UNCOUNTABLE: gold


  • Noun Phrase Countability

    ➣ Knowing the referent is not enough (Wierzbicka 1996), e.g., scales:

    1. Thought of as being made of two arms (British): a pair of scales
    2. Thought of as a set of numbers (Australian): a set of scales
    3. Thought of as discrete whole objects (American): one scale/two scales


  • ➣ Also varies from language to language

    ➢ [a flash of] lightning (English)
    ➢ ein Blitz (German)
    ➢ un éclair (French)

    ➣ A well known problem for non-native speakers

    ➣ How often can we predict it from the referent’s meaning?

    ➢ There must be some connection


  • Noun Countability Preferences

    Noun Countability   Code  Example    Default  Default     #       %
    Preference                           Number   Classifier
    fully countable     CO    knife      sg       —           47,255  65.8
    strongly countable  BC    cake       sg       —            3,110   4.3
    weakly countable    BU    beer       sg       —            3,377   4.7
    uncountable         UC    furniture  sg       piece       15,435  21.5
    plural only         PO    scissors   pl       pair         2,107   2.9


  • Lexicon

    ➣ ALT-J/E’s semantic transfer lexicon:

    Index                    usagi
    Sense                    1
    English Translation      rabbit
    Part of Speech           noun
    Noun Countability Pref.  strongly countable
    Default Number           singular
    Semantic Classes         [common noun: animal, meat]

    ➣ 71,833 linked Japanese-English noun pairs

    ➣ 41,285 are multiword expressions (57.4%)


  • The Goi-Taikei Ontology

    ➣ A rich ontology with wide coverage of Japanese

    ➣ Used in many NLP applications such as MT

    ➣ Several hierarchies of concepts:

    ➢ 2,710 semantic classes (12-level tree structure) for common nouns
    ➢ 200 classes (9-level tree structure) for proper nouns
    ➢ Not designed with countability in mind


  • Experiment

    ➣ How well do the semantic classes predict the countability preferences?

    ➣ Treat every combination of semantic classes as a different semantic class.

    ➣ Most frequent NCP is assigned to all members of a class.

    ➢ Ties are resolved as follows: fully countable beats strongly countable beats weakly countable beats uncountable beats plural only.

    ➣ Baseline (all fully countable) = 65.8%
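The majority assignment with tie-breaking can be sketched as follows (labels as in the slides; the function name is mine):

```python
from collections import Counter

# Tie-break priority from the slides: earlier entries win.
PRIORITY = ["fully countable", "strongly countable", "weakly countable",
            "uncountable", "plural only"]

def class_ncp(member_ncps):
    """Most frequent NCP among a semantic class's members,
    with ties resolved by the PRIORITY order."""
    counts = Counter(member_ncps)
    return max(counts, key=lambda ncp: (counts[ncp], -PRIORITY.index(ncp)))
```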


  • Example

    ➣ Semantic Class 910:tableware

    ➢ crockery ↔ toukirui (UC)
    ➢ dinner set ↔ youshokki (CO)
    ➢ tableware ↔ shokki (UC)
    ➢ Western-style tableware ↔ youshokki (UC)

    ➣ The most common NCP is UC, so associate uncountable with 910:tableware.

    ➣ This predicts the NCP correctly 75% of the time.


  • Results

    Conditions                Entries  %     Range      Baseline
    Training=Test             all      77.9  76.8–78.6  65.8
    Tenfold Cross Validation  all      71.2  69.8–72.1  65.8

    ➣ Tested using stratified ten-fold validation

    ➣ 11.6% given the default value (fully countable), i.e., we couldn’t decide


  • Discussion

    ➣ Problems of granularity: Where should cutlery go?

    ➢ WordNet: tableware → cutlery (→ table knife, fork, spoon), chopsticks, crockery, dishware, dinnerware, tea set, . . .

    ➢ ALT-J/E: tableware → crockery, cutlery/chopsticks, cookware, other tableware

    cutlery is uncountable, but knives, forks and spoons are countable.


  • Other Discussion

    ➣ pair plural only: almost all wrong! (binoculars, trousers, headphones)
    Need some spatial representation!

    ➣ 7% or so errors in the ontology

    ➢ ソフトカラー sofuto karā
    soft colour  clothing  BC
    soft collar  hue       CO

    ➣ It is hard for Japanese speakers to judge countability


  • Applications

    ➣ Adding a checker to the dictionary

    ➢ Warn if the semantic class does not predict the assigned countability
    ➢ Check both semantic class and countability

    ➣ Predict countability for unknown words

    ➢ If we know their semantics


  • Examples

    ➣ totoro is a monster (∈ 222:monster)

    ⇒ totoro is fully countable

    ➣ gavagai is an edible animal (∈ 537:animal,810:meat)

    ⇒ gavagai is strongly countable

    ➣ ununquadium is an element (Uuq, 114) (∈ 710:element)

    ⇒ ununquadium is uncountable


  • Conclusion

    ➣ With a limited ontology and a noisy lexicon, semantics predicts countability around 78% of the time; therefore countability is semantically motivated

    ➣ If we can find the semantic class of a word, we can predict something about its syntactic properties


  • Countability from Corpus Data

    ➣ Acquire lexical knowledge from corpora

    ➢ English noun countability preferences
    ➢ Precision of 94.6% (for words with freq. > 10)

    ➣ Extract features in three ways

    ➢ POS tagging
    ➢ Full text chunking
    ➢ Robust parsing

    ➣ Combine in a memory-based learner (TiMBL)


  • Background

    ➣ Countability is a syntactic property of English, not marked morphologically

    ➣ In generation used to decide between:

    ➢ a cake, cakes, a piece of cake
    ➢ Especially important in machine translation (J-E, . . . )

    ➣ In analysis, helps to resolve ambiguity:

    ➢ I need a paper by this evening (academic/newspaper)
    ➢ I need some paper by this evening (material)
    ➢ I need the paper by this evening (ambiguous)


  • Noun Countability Classes

    Countable: one dog, two dogs, many dogs, a dog kennel
    #one piece of dog, #much dog, ∗a dogs kennel

    Uncountable: much butter, a bit of butter, a butter knife
    #butters, #one butter, #two butters

    Plural Only: some goods, a goods train
    ∗good, ∗one good, ?two goods

    Bipartite: a pair of scissors, a scissor kick, ?some scissors
    ∗a scissor, ∗one scissors, ∗two scissors


  • Resources

    ➣ Gold standard data created by comparing two lexicons

    ➢ ALT-J/E’s Japanese-English Lexicon: 56,000 noun-countability combinations
    ➢ COMLEX 3.0: 14,000 noun-countability combinations

    ➣ Inter-resource agreement of 93.8%.

    ➢ Few actual errors
    ➢ Almost half of the disagreements came from words with two countabilities in ALT-J/E but only one in COMLEX.


  • Learning Countability

    ➣ Identify lexical and/or constructional features associated with each countability class

    ➣ Determine the relative corpus occurrence of the features for each noun

    ➣ Use the noun feature vectors to classify the noun as a member of each of the countability classes

➢ paper +countable, +uncountable
➢ uranium +uncountable
➢ tanuki +countable


  • Feature space

Head noun number: 1D target noun number as head of NP (e.g. a shaggy dog = SINGULAR)

Modifier noun number: 1D target noun number as modifier in NP (e.g. dog food = SINGULAR)

Subject–verb agreement: 2D target noun number as subject vs. verb number agreement (e.g. the dog barks = 〈SINGULAR,SINGULAR〉)

Coordinate noun number: 2D target noun number vs. the number of the head nouns of conjuncts (e.g. dogs and mud = 〈PLURAL,SINGULAR〉)

N1 of N2 constructions: 2D number of N2 vs. type of N1 (e.g. the type of dog = 〈TYPE,SINGULAR〉). We have identified a total of 11 N1 types for use in this feature cluster (e.g. COLLECTIVE, LACK, TEMPORAL).


• Occurrence in PPs: 2D the presence or absence of a determiner (±DET) in singular head complement of PP (e.g. per dog = 〈per,−DET〉).

Pronoun co-occurrence: 2D what pronouns occur in the same sentence as singular and plural instances (e.g. The dog ate its dinner = 〈its,SINGULAR〉). Approximation of pronoun co-indexation.

Singular determiners: 1D singular-selecting dependents (e.g. a dog = a). Two types: countable (e.g. another, each), uncountable (e.g. much, little).

    Plural determiners: 1D plural-selecting dependents (e.g. few dogs = few).

Non-bounded determiners: 2D which non-bounded dependents vs. target noun number (e.g. more dogs = 〈more,PLURAL〉).
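Each of these features comes down to counting how often the target noun occurs in a particular configuration in tagged text. A sketch of the simplest one (head noun number) over a toy POS-tagged corpus; the toy data and the crude lemmatiser are illustrative only, not the actual extraction pipeline:

```python
from collections import Counter

# Toy POS-tagged tokens: NN = singular common noun, NNS = plural.
tagged = [("a", "DT"), ("shaggy", "JJ"), ("dog", "NN"), ("barks", "VBZ"),
          ("two", "CD"), ("dogs", "NNS"), ("bark", "VBP")]

counts = Counter()
for token, tag in tagged:
    if tag in ("NN", "NNS"):
        lemma = token[:-1] if tag == "NNS" else token  # crude lemmatiser
        counts[(lemma, "SINGULAR" if tag == "NN" else "PLURAL")] += 1

# Relative frequency of singular occurrences of "dog" as a noun.
total = counts[("dog", "SINGULAR")] + counts[("dog", "PLURAL")]
sing_ratio = counts[("dog", "SINGULAR")] / total
```

The real system extracted such counts in three ways (POS tagging, chunking, parsing) precisely because a tag sequence alone cannot reliably find NP heads.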


  • Feature extraction

➣ Features extracted from the written portion of the BNC (tagging redone)

    ➢ British National Corpus (Balanced Corpus of English)

    ➣ Data considered: nouns with ≥ 10 instances for all 3 methods

    ➢ 20,530 common nouns


  • Classifier architecture

    ➣ Four parallel supervised classifiers (all use TiMBL k = 9)

➣ A noun may be in multiple classes

➣ Train on nouns in the BNC with

➢ positive examples in both ALT-J/E and COMLEX
➢ negative examples in either ALT-J/E or COMLEX

Class        Positive data  Negative data  Baseline
Countable    4,342          1,476          .746
Uncountable  1,519          5,471          .783
Bipartite    35             5,639          .994
Plural only  84             5,639          .985
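TiMBL is a memory-based (k-nearest-neighbour) learner: it stores all training instances and labels a query by majority vote among its k closest neighbours. A self-contained sketch with hypothetical two-dimensional feature vectors; TiMBL itself uses weighted overlap metrics over far more dimensions:

```python
import math
from collections import Counter

def knn_classify(train, query, k=9):
    """Label query by majority vote among its k nearest training instances."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# One binary classifier per countability class; hypothetical feature
# vectors (singular ratio, plural ratio) for a few training nouns.
train_countable = [
    ((0.6, 0.4), True),     # dog
    ((0.5, 0.5), True),     # cake
    ((0.98, 0.02), False),  # butter: almost never plural
]
label = knn_classify(train_countable, (0.55, 0.45), k=3)
```

Running the four binary classifiers independently is what lets a noun such as paper come out positive for more than one class.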


  • Cross-validated results (1)

Class        Accuracy (e.r.)  F-score
Countable    .939 (.759)      .960
Uncountable  .952 (.779)      .892
Bipartite    .996 (.403)      .722
Plural Only  .990 (.323)      .582

    ➣ Performs well for countable and uncountable

➣ Much harder for small classes: easier to always say NO
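The e.r. column appears to be relative error-rate reduction over the baseline; computing it that way reproduces the countable and uncountable figures in the table (the small-class figures do not match this formula exactly, so they may be computed against a different baseline):

```python
def error_rate_reduction(accuracy, baseline):
    # Fraction of the baseline's errors that the classifier eliminates.
    return (accuracy - baseline) / (1 - baseline)

er_countable = error_rate_reduction(0.939, 0.746)    # ~ .759
er_uncountable = error_rate_reduction(0.952, 0.783)  # ~ .779
```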


  • Open data results

    ➣ 11,499 unseen feature-mapped common nouns

    ➣ Classified 10,355 (90.0%):

Countable    7,974  77.0%  alchemist
Uncountable  2,588  25.0%  ingenuity
Bipartite        9   0.1%  headphones
Plural only     80   0.8%  damages

    ➣ 139 nouns assigned to multiple countability classes

➣ Combined lexicon contained 4,982 of the nouns: precision for these nouns is 94.6% (baseline 89.7%)


  • Hand evaluation

    ➣ 100 nouns from test data

➢ Precision of 92.4% (37.7% e.r.)
➢ Baseline (87.8%: all countable)
➢ Agreement with lexicons 92.4%

    ➣ 100 nouns from training data

➢ Baseline (80.5%: all countable)
➢ Agreement with lexicons 86.8%

    ➣ Classifiers agree with corpus better than lexicons


  • Corpus Data

    ➣ Able to classify nouns with a precision of 94.6%

➣ Need to assign nouns to multiple classes more often

    ➣ Can classify more finely

➢ ideally a continuum from countable to uncountable
➢ at least the noun countability preferences (ALT-J/E)

Fully, Strongly, Weakly, Un-Countable, Plural Only, Bipartite

    ➣ Final precision comparable with existing lexicons

⇒ We can automatically acquire English noun countability information from text


  • Meta-Conclusion

➣ The distributional approach works well (better than the ontology-based approach)

    “You shall know a word by the company it keeps.”

Firth, J.R. Modes of Meaning. Papers in Linguistics, 1934–1951. London: Oxford University Press, 1957, p. 11.

    ➣ However, it only works for words with > 10 examples

    ➣ It does not find countability per sense

    ➣ Ideally we should combine syntactic distribution with semantic distribution

    ➣ Should use a more fine grained ontology


  • Conclusions

    ➣ There is a lot of information out there

    ➣ Much of it is unstructured text

    ➣ Using NLP techniques we can extract this information

    ➢ But we can’t trust it all


  • Readings and Acknowledgments

➣ Jurafsky and Martin (2008) Chapter 22, esp. 22.2, 22.4

    ➣ Some of the text mining slides are based on www.rene-witte.net/system/files/IntroductionToTextMining.pdf

    ➣ Some figures are from Jurafsky and Martin (2008)

➣ Great survey on NER: Nadeau, David and Satoshi Sekine (2007) A survey of named entity recognition and classification. Linguisticae Investigationes 30(1):3–26. nlp.cs.nyu.edu/sekine/papers/li07.pdf


  • Bibliography


• References

Timothy Baldwin and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In 41st Annual Meeting of the Association for Computational Linguistics: ACL-2003, pages 463–470. Sapporo, Japan.

Francis Bond and Caitlin Vatikiotis-Bateson. 2002. Using an ontology to determine English countability. In 19th International Conference on Computational Linguistics: COLING-2002, volume 1, pages 99–105. Taipei.

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, second edition.


• Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics, Sydney, Australia. URL http://www.aclweb.org/anthology/P/P06/P06-1101.


