
Introduction to Text Mining and Natural Language Processing BIF-30806 January 2010 Judith Risse.


Introduction to Text Mining and Natural Language Processing

BIF-30806, January 2010

Judith Risse


Outline

• Literature and Databases
• Natural Language Processing
  • Information Retrieval
  • Question Answering
  • Information Extraction
• Indexing
• Document Classification
• Exercises


Definitions

• Natural Language Processing (NLP)
  • the study of automated generation and understanding of natural human languages (Wikipedia)
• Text Mining
  • extract high-quality (previously unknown) information from large amounts of unstructured text


Biomedical Literature

• communication of scientific discoveries
• peer-reviewed and community-reviewed
• provides additional information on experimental results
• base for annotation of biological databases


Literature Databases

• NCBI Bookshelf
• PubMed Central
• PubMed
  • currently 19476540 citations (Jan 27, 2010)
  • 5414 journals in Medline
  • unique identifier PMID
  • entries contain author, journal and title info
  • more than 50% also contain abstracts
  • links to full-text articles
  • Medical Subject Headings (MeSH)


PubMed

PubMed growth

[Figure: PubMed growth, 1950 to 2007. x-axis: year (1950 to 2007 in steps of 3); y-axis: No of publications in millions (0 to 21); series: entries per year, total No of entries.]


PubMed (3)

© NLM 2008


A scientific article

• journal specific format
  • sections
  • print style
• type of article
  • review
  • letter
• document format
  • html
  • pdf


Article content

• Full-text
  • title
  • authors
  • abstract
  • body
• Tables
• Figures
• References


Biomedical Language

• domain specific terminology
  • cytosolic, erythroid precursor
• polysemic words
  • e.g. Drosophila gene names: coitus interruptus, lost in space
• acronyms
  • APC (activated protein C), mdh (malate dehydrogenase)
• low frequency words
• anaphora (references)
  • Overexpression of FumRs and Frds1 resulted in the best citrate-producing strain in the presence of trace manganese concentrations. This strain gave a maximum yield of ….


Biomedical Language (2)

• synonyms/creating new terms
• typographical variants
  • malic dehydrogenase
  • L-malate dehydrogenase
  • NAD-L-malate dehydrogenase
  • malic acid dehydrogenase
  • NAD-dependent malic dehydrogenase
  • NAD-malate dehydrogenase
  • NAD-malic dehydrogenase
  • malate (NAD) dehydrogenase
  • MDH
  • L-malate-NAD+ oxidoreductase


Natural Language Processing

• create computational models of language
• multi-disciplinary
  • information technology, linguistics, artificial intelligence, statistics ….
• statistical properties of language
• machine learning, rule-based, regular expressions
• grammatical, morphological, syntactic and semantic features


Grammatical Features

• Grammar
  • rules governing a language
  • syntax and morphology
• Part of speech (POS)
  • noun, verb, adjective, adverb, preposition
  • depends on context in sentence
• Brill tagger (Eric Brill, PhD thesis, 1993)
  • http://www.cst.dk/online/pos_tagger/uk/index.html
  • http://en.wikipedia.org/wiki/Brill_Tagger


Morphological Features

• structure of words
• inflection
  • enzyme and enzymes (plural form)
  • catalyse, catalyses, catalysing (verb inflection)
• word-formation
  • earth, earthworm (compounding)
  • dependent, independent (derivation)
• stemming and lemmatisation
  • reduction of words to a common base form
  • am, are, is → be
  • catalyse, catalyses, catalysing → catalys
• Porter Stemmer (tartarus.org/martin/PorterStemmer)
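
As a rough illustration of what stemming does, the sketch below strips a handful of common suffixes. It is not the Porter algorithm, only a much simpler stand-in: the function name `naive_stem`, the suffix list and the minimum stem length are all assumptions made for this example. It does reproduce the reduction of catalyse, catalyses and catalysing to a shared stem.

```python
# A deliberately naive suffix stripper; NOT the Porter algorithm.
# SUFFIXES and min_stem are illustrative assumptions.
SUFFIXES = ("ing", "es", "e", "s")  # tried in this order


def naive_stem(word, min_stem=3):
    """Lower-case the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least min_stem characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["catalyse", "catalyses", "catalysing", "enzyme", "enzymes"]:
    print(w, "->", naive_stem(w))
```

Lemmatisation (am, are, is reduced to be) cannot be done by suffix stripping alone; it generally needs a dictionary of irregular forms.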


Syntactic Features

• relationships between words in a sentence
  • noun-phrase, verb-phrase
  • subject – object relationships


POS Tagged Sentence

(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were) (VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a) (NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)

Pain - Proper singular noun
vanished - Verb, past tense
for - Preposition
at - Preposition
least - Superlative adjective
three - Cardinal number
months - Plural noun
in - Preposition
rats - Plural noun
who - wh-pronoun
were - Verb, past tense
injected - Verb, past participle
in - Preposition
the - Determiner
spine - Singular noun
with - Preposition
a - Determiner
gene - Singular noun
that - Preposition
triggers - Plural noun
endorphins - Verb, 3rd ps. sing. present
. - Final punctuation
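
Tagger output in this bracketed form is easy to post-process with a regular expression. The following is a minimal sketch, assuming every token appears as `(TAG word)` exactly as in the sentence above; the names `PAIR` and `parse_tagged` are made up for this example.

```python
import re
from collections import Counter

tagged = ("(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) "
          "(CD three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were) "
          "(VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a) "
          "(NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)")

# One "(TAG word)" unit: tag and word are runs of non-space characters.
PAIR = re.compile(r"\((\S+) (\S+)\)")


def parse_tagged(text):
    """Return a list of (word, tag) tuples in sentence order."""
    return [(word, tag) for tag, word in PAIR.findall(text)]


pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[:3])
print(tag_counts.most_common(3))
```

Counting tags this way makes it easier to spot questionable assignments, such as "triggers" being tagged as a plural noun here.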


Semantic Features

• meaning of words given the context
• dictionaries, thesauri
  • Gene Ontology


Contextual Analysis

• Guilt by association
  • Co-occurrence analysis
• Word frequency
  • bag of words
  • statistical analysis of word frequency
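
Both ideas can be sketched in a few lines. The example below is only an illustration under simple assumptions: words are lower-cased runs of letters and digits, and two words "co-occur" when they appear in the same sentence. The function names and the example sentences are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Word frequencies, ignoring word order (the 'bag of words')."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cooccurrence(sentences):
    """Count how often two different words share a sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(bag_of_words(sentence)))
        pairs.update(combinations(words, 2))  # pairs in alphabetical order
    return pairs


sentences = [
    "mdh encodes malate dehydrogenase",
    "Malate dehydrogenase converts malate",
]
print(bag_of_words(sentences[1]))
print(cooccurrence(sentences)[("dehydrogenase", "malate")])
```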


Exercise 1

• Take a gene/protein name of your interest
• Query PubMed and retrieve 1 abstract
• Take a look at what the Porter stemmer does using the abstract
• Describe what problems might occur from stemming
• Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html


Coffee Break


Tasks of NLP

• Information Extraction (IE)
• Question Answering (QA)
• Information Retrieval (IR)
• machine translation
• text proofing
• speech recognition
• optical character recognition (OCR)


Information Retrieval

"Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)."
Introduction to Information Retrieval (Cambridge University Press, 2008)

• Indexing
  • Tokenization
  • Case Folding (TNFalpha, Tnfalpha → tnfalpha)
  • Stemming
  • Stop-word removal (e.g. at, be, from, this …)
• Boolean Queries
• Vector Space Model queries
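
The indexing steps can be chained into a small preprocessing function. This is a minimal sketch, assuming tokens are runs of letters and digits and using only the four stop words named on the slide; stemming is left out to keep it short, and the name `preprocess` is hypothetical.

```python
import re

# Only the slide's examples; a real stop-word list is much longer.
STOP_WORDS = {"at", "be", "from", "this"}


def preprocess(text):
    """Tokenize, case-fold and remove stop words, in that order."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)      # tokenization
    tokens = [t.lower() for t in tokens]            # case folding
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


print(preprocess("TNFalpha and Tnfalpha from this study"))
```

After case folding, TNFalpha and Tnfalpha become the same index term, which raises recall but loses any distinction that the capitalisation carried.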


Zipf’s Law

• A small number of words occur very often
• Those high frequency words are often function words (e.g. prepositions)
• Most words have a low frequency


Boolean Queries

• Combination of query terms with boolean operators
  • AND
  • OR
  • NOT
• Google, PubMed
• high recall, low precision
• unranked result
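
Boolean retrieval maps directly onto set operations on the document ids that contain each term. The sketch below assumes a tiny, made-up collection of three documents and a hypothetical helper `postings`; it is not how Google or PubMed are implemented.

```python
# Hypothetical mini-collection: document id -> text.
docs = {
    1: "prostaglandin inhibits platelet aggregation",
    2: "aspirin inhibits prostaglandin synthesis",
    3: "aspirin reduces fever",
}


def postings(term):
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, text in docs.items()
            if term in text.lower().split()}


both = postings("prostaglandin") & postings("aspirin")      # AND
either = postings("prostaglandin") | postings("aspirin")    # OR
without = postings("aspirin") - postings("prostaglandin")   # AND NOT
print(both, either, without)
```

The results are plain sets, which mirrors the "unranked result" property: a document either matches the query or it does not.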


The vector space model

• term weight
  • term frequency (TF)
  • inverse document frequency (IDF)
  • corpus size (N)
  • (1 + log TF) × log(N / DF)
• the vector points in 'word space'
• each dimension corresponds to a word or phrase

© Nat Rev Gen(2002):3 pp 601-610
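
The term weight can be computed directly from the three quantities. A minimal sketch follows; the slide does not state the base of the logarithm, so base 10 is an assumption here, as are the function name and the convention of returning 0 for a term that does not occur.

```python
import math


def term_weight(tf, df, n_docs):
    """(1 + log TF) * log(N / DF), with base-10 logs (an assumption).

    tf: occurrences of the term in the document
    df: number of documents containing the term
    n_docs: corpus size N
    """
    if tf == 0 or df == 0:
        return 0.0  # term absent: no weight
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


# A term that occurs in every document carries no weight.
print(term_weight(tf=5, df=1000, n_docs=1000))
# A rarer term gets a higher weight for the same term frequency.
print(term_weight(tf=5, df=10, n_docs=1000))
```

A document vector then holds one such weight per dimension (word or phrase) of the 'word space'.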


IR Evaluation

"A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query."
Introduction to Information Retrieval (Cambridge University Press, 2008)

• document collection
• test cases of information need, as queries
• measure of relevance


Evaluation (2)

• Precision
  • What fraction of the returned results are relevant to the information need?
• Recall
  • What fraction of the relevant documents in the collection were returned by the system?
• F-score
  • harmonic mean of precision and recall
  • (2×p×r)/(p+r)
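
These three measures are straightforward to compute once the returned and the relevant documents are known. The sketch below assumes both are given as collections of document ids; the function name `evaluate` and the example ids are made up.

```python
def evaluate(returned, relevant):
    """Return (precision, recall, F-score) for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = (2 * p * r) / (p + r) if (p + r) else 0.0
    return p, r, f


# 4 documents returned, 6 relevant in the collection, 2 in common.
p, r, f = evaluate(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p, r, f)
```

Because the F-score is a harmonic mean, it stays low unless precision and recall are both reasonably high.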


Exercise 2

Compare the retrieval of abstracts between PubMed and Phasar (www.bioinformatics.nl/biometa/applet.html or twoquid.cs.ru.nl/applet.html) given the question: What does prostaglandin inhibit?

• How many results do you get?
• Give examples of answers to the question.
• Give 5 PMIDs of papers you would read given the results in each search.
• Which of the systems was more helpful and why?


Coffee Break


Question Answering

• question posed in human language
• answer extracted from unstructured text
• more developed in generic domain
• difficult in biomedical domain


Information Extraction & Text Mining

• extract structured information from unstructured text
• Named Entity Recognition
• identify relationships
  • e.g. protein-protein interactions


Information Extraction

• extract meaning from a text
• combines:
  • pos-tagging
  • ontologies
  • regular expressions

© Nat Rev Gen(2002):3 pp 601-610


Named Entity Recognition

• tagging of biological entities
• high precision in generic NLP (0.9 F-score)
• difficult in biology
  • complex terms, synonyms, disambiguation
  • gene symbols
    • typographical variations
    • no use of official symbols
  • gene/protein names


Challenges of NLP

• Abbreviation
  • punctuation can be confused with end of sentence
  • Wash. (Washington) with wash.
• Decimal points
• Apostrophes: to split or not to split?
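
The abbreviation problem shows up as soon as sentences are split on full stops. The sketch below contrasts a naive splitter with one that consults a small abbreviation list; the list, the function names and the example text are assumptions for illustration, and real sentence splitters are considerably more elaborate.

```python
import re

# Hypothetical, case-sensitive abbreviation list.
ABBREVIATIONS = {"Wash.", "Mr.", "e.g."}


def naive_split(text):
    """Split after every full stop that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text.strip())


def split_sentences(text):
    """Like naive_split, but re-join chunks that end in an abbreviation."""
    sentences = []
    for chunk in naive_split(text):
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + chunk
        else:
            sentences.append(chunk)
    return sentences


text = "He moved to Wash. in 2009. The yield was 3.5 g."
print(naive_split(text))
print(split_sentences(text))
```

The decimal point in 3.5 is not split because no whitespace follows it. Note that the abbreviation check relies on capitalisation: after case folding, Wash. (Washington) and wash. could no longer be told apart.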


Challenges (2)

• hyphens
  • single or multiple words?
  • data-base vs. data base vs. database
  • carry-over?
• simple stemming
  • operate, operating, operates, operation, operative, operatives, operational → oper
• case folding
  • brown car vs Mr. Brown


Anaphora

• co-references
  • one expression referring to another
  • The monkey took the banana and ate it.
  • strictly only local antecedent statements
• Sortal anaphora
  • this gene, the virus
• resolution required for increased recall


Exercise 3

• compare NER programmes
• retrieve one PubMed abstract
• http://biocreative.sourceforge.net/bionlp_tools_links.html
  • NLProt
  • TerMine
  • Whatizit (http://www.ebi.ac.uk/webservices/whatizit/info.jsf)
• What are the differences in recognized entities?
• Do they miss any obvious entities?


Indexing

• Inverted Index (Inverted File)
  • for each word in the collection (dictionary) list occurrence and frequency
• size of index is proportional to size of corpus
• remove stopwords, use stemming for more efficient index
• classic version is a boolean index
• can also contain positional information
• sparse matrix


Example

deterministic
  number of docs containing the term: 20
  document ids: 73 89 90 106 173 194 233 243 251 252 255 257 258 267 276 281 304 312 315 326
  total # of occurrences: 27
  term position in counted words: 36822 44643 45285 53003 53061 86740 86743 97082 116618 121984 125750 125952 125968 126039 127633 128882 128978 129048 133781 133789 138493 140946 140947 152011 156191 157881 163490

deterrence
  number of docs containing the term: 1
  document ids: 60
  total # of occurrences: 4
  term position in counted words: 30309 30345 30444 30452

detonation
  number of docs containing the term: 2
  document ids: 263 264
  total # of occurrences: 4
  term position in counted words: 131781 131956 131995 132303
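
An index with this kind of information can be built in one pass over the collection. The sketch below is an illustration only: the three documents are invented, and positions are counted per document rather than across the whole collection as in the example above.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [word positions within that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def summarize(index, term):
    """The fields shown in the example: doc count, doc ids, occurrences."""
    postings = index.get(term, {})
    return {
        "n_docs": len(postings),
        "doc_ids": sorted(postings),
        "n_occurrences": sum(len(p) for p in postings.values()),
    }


# Hypothetical documents keyed by document id.
docs = {
    60: "nuclear deterrence and deterrence theory",
    263: "detonation velocity",
    264: "detonation of the charge after detonation",
}
index = build_index(docs)
print(summarize(index, "detonation"))
print(index["detonation"][264])
```

Dropping the position lists and keeping only the document ids gives the classic boolean index.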


Suffix Array

A suffix array is an array that contains all the pointers to the text suffixes listed in lexicographical order.

• Text is seen as one long string
• A text suffix is a substring from a given position till the end of the string
• position refers to beginning of word
• return all occurrences of string W in large text A


Example:

Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring

the word: abracadabra

1. create all suffixes
2. sort suffixes on alphabet
3. resulting suffix array
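
The three steps can be sketched for the word abracadabra as follows. This is a minimal illustration that sorts the suffixes directly, which is fine for a short string but too slow for a large text; positions are 0-based character offsets (not word starts), and the function names are made up.

```python
def build_suffix_array(text):
    """Steps 1-3: start positions of all suffixes, sorted alphabetically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, w):
    """All start positions of w: the suffixes that begin with w."""
    def prefix(k):
        start = suffix_array[k]
        return text[start:start + len(w)]

    # Binary search for the first suffix whose prefix is >= w.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Binary search for the first suffix whose prefix is > w.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "abracadabra"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "abra"))
```

Because the suffixes are sorted, all suffixes starting with the query form one contiguous block of the array, which is why two binary searches are enough.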


Document Classification

• assign a document to a class given its content
• manual (ad hoc)
• rule-based
• decision tree
• machine learning approaches


Statistical Text Classification

• training documents for each class
• supervised learning
• test data or new data
• training data and test data have to be similar


Naïve Bayes

Naïve: all words in text are considered independent

Bayes: uses Bayes theorem

P(A | B) = P(B | A) P(A) / P(B)

prior probability: P(A)

posterior probability: P(A | B)


Basic Probability Theory

• Given that A represents an event, the probability of A occurring is 0 ≤ P(A) ≤ 1
• Joint probability P(A,B) = P(A∩B)
• Conditional probability P(A | B)
• Chain rule P(A,B) = P(A | B)P(B) = P(B | A)P(A)


Application to Document Classification

wikipedia.org

probability of a word belonging to category C

probability of a document belonging to category C given its words
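
A small classifier along these lines might look like the sketch below. It is an illustration under stated assumptions, not the CPAN module or any particular published implementation: words are split on whitespace, add-one (Laplace) smoothing is used so that unseen words do not give a zero probability, scores are kept as logarithms, and the class name, method names and training sentences are all made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word frequencies
        self.doc_counts = Counter()              # class -> training documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """log P(C) + sum of log P(word | C) for each class C."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            n_words = sum(self.word_counts[label].values())
            for w in text.lower().split():
                # Add-one smoothing (an assumption of this sketch).
                count = self.word_counts[label][w] + 1
                score += math.log(count / (n_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("the scrum won the ball", "rugby")
nb.train("a try was scored after the lineout", "rugby")
nb.train("she served an ace", "tennis")
nb.train("the match went to a tie break", "tennis")
print(nb.classify("scrum and lineout"))
print(nb.classify("an ace on serve"))
```

The scores are proportional to the posterior rather than equal to it, because the denominator P(B) is the same for every class and can be ignored when only the best class is needed.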


Coffee Break


Exercise 4

• Try to apply naïve Bayes to a selection of sentences using http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes/
• Use rugby.txt and tennis.txt as training and test data.
• If you have it implemented, try using this in combination with the Porter Stemmer (http://bionlp.stanford.edu/bionlp.pl)


Added Challenge

From sequence to abstract to NER

MSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQR DEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGM DLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL

• retrieve UniprotID via BLAST (take best hit)
• retrieve gene name using getz (GeneName field)
• retrieve relevant abstracts from PubMed in Medline format using eSearch and eFetch with the gene name
• extract all protein/gene names from these abstracts http://bionlp.stanford.edu/webservices.html
• how do they relate to the original protein?
• compare to the output of ebiMed using the gene name (http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp)
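
For the eSearch and eFetch step, the request URLs can be assembled as in the sketch below. The base URL and the parameter names (`db`, `term`, `retmax`, `id`, `rettype`, `retmode`) are given here as I understand the NCBI E-utilities interface and should be checked against the current NCBI documentation; the helper names, the gene name `mdh` and the PMIDs are placeholders. The code only builds URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(gene_name, retmax=20):
    """URL that searches PubMed for a gene name and returns PMIDs."""
    params = {"db": "pubmed", "term": gene_name, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """URL that fetches the given PMIDs as Medline-format text."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "medline",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("mdh"))
print(efetch_url([1, 2]))  # placeholder PMIDs
```

The PMIDs returned by the first request would be passed to the second; the resulting Medline records can then be fed to the NER step.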


Helpful resources

http://www-nlp.stanford.edu/links/statnlp.html

http://nlp.stanford.edu/IR-book/html/htmledition/mybook.html

www.biocreative.org

Drosophila gene names: http://www.curioustaxonomy.net/gene/fly.html


Further Reading

• Introduction to Information Retrieval, Cambridge University Press, ISBN-13 978-0-521-86571-5
• The Text Mining Handbook, Cambridge University Press, ISBN-13 978-0-521-83657-9

