
Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Date post: 28-May-2015
Upload: vladimir-kulyukin
Transcript
Page 1: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Speech & NLP

www.vkedco.blogspot.com

Part-of-Speech Tagging, Sentence Splitting & Parsing

Vladimir Kulyukin

Department of Computer Science

Utah State University

Page 2: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Outline

● Parts of Speech

● Approaches to POS Tagging

● Splitting Text into Sentences with OpenNLP and Parsing them with Stanford Parser

Page 3: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Closed & Open Classes of Words

● Parts of speech are divided into two broad classes: closed

class types & open class types

● Prepositions are a closed class type

● Nouns are an open class type

● Verbs are an open class type

● In many human languages, nouns, verbs, and adjectives are open classes

Page 4: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Closed Classes

● Prepositions: at, from, by, to, with, over, near

● Determiners: the, a, an

● Pronouns: he, she, we, I, they, it

● Conjunctions: but, if, and, then, as, or

● Auxiliary verbs: may, might, can, could, should

● Particles: up, down, on, off, in, out, at, by (go on, stop by,

pick up, turn off, etc.)

● Numerals: one, two, three, first, second, third
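
Because closed classes are small, they can be enumerated directly. The following is a minimal sketch, assuming only the word lists on this slide (the class and method names are illustrative). It stores the lists in a map and reports words that occur in more than one closed class, which is one reason a tagger needs context rather than a plain lookup:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Closed-class word lists copied from the slide; names are illustrative.
class ClosedClasses {
    static final Map<String, List<String>> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("preposition", Arrays.asList("at", "from", "by", "to", "with", "over", "near"));
        CLASSES.put("determiner", Arrays.asList("the", "a", "an"));
        CLASSES.put("pronoun", Arrays.asList("he", "she", "we", "I", "they", "it"));
        CLASSES.put("conjunction", Arrays.asList("but", "if", "and", "then", "as", "or"));
        CLASSES.put("auxiliary", Arrays.asList("may", "might", "can", "could", "should"));
        CLASSES.put("particle", Arrays.asList("up", "down", "on", "off", "in", "out", "at", "by"));
        CLASSES.put("numeral", Arrays.asList("one", "two", "three", "first", "second", "third"));
    }

    // Returns the name of every closed class whose list contains the word.
    static List<String> classesOf(String word) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : CLASSES.entrySet()) {
            if (e.getValue().contains(word)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Returns the words that appear in more than one list, in first-seen order.
    static Set<String> ambiguousWords() {
        Set<String> result = new LinkedHashSet<>();
        for (List<String> words : CLASSES.values()) {
            for (String w : words) {
                if (classesOf(w).size() > 1) {
                    result.add(w);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ambiguousWords());   // prints [at, by]
        System.out.println(classesOf("by"));    // prints [preposition, particle]
    }
}
```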

Page 5: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

English POS Tagsets

● Statistical part-of-speech (POS) taggers require the existence

of tagsets

● Brown corpus (http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html) uses an 87-tag tagset

● Penn treebank (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html)

uses a 45-tag tagset

● There are other smaller and larger tagsets (see Ch. 8 in

Jurafsky & Martin’s book for references)

Page 6: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Example

Original sentence: The grand jury commented on a

number of other topics.

Sentence tagged with the Penn Treebank tagset:

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT

number/NN of/IN other/JJ topics/NNS ./.
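
The word/TAG notation above is easy to read back into a program. As a minimal sketch (the class and method names are hypothetical), the following splits each token at its last slash into a word and a tag:

```java
import java.util.ArrayList;
import java.util.List;

class TaggedSentence {
    // Splits a "word/TAG word/TAG ..." string into {word, tag} pairs.
    // The last '/' in each token is the separator, so "./." yields word "." and tag ".".
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                throw new IllegalArgumentException("Token has no tag: " + token);
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
                + "number/NN of/IN other/JJ topics/NNS ./.";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```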

Page 7: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Part-of-Speech Tagging

● POS tagging is the process of assigning a POS tag from a

specific tag set to each wordform and punctuation mark in the

input

● In programming language compilation, the analogous step is lexical analysis (tokenization), in which each lexeme is assigned a token class

● POS tagging can be done probabilistically; POS tagging

models are trained on large manually tagged data sets

● POS tagging can also be rule-based

Page 8: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Rule-Based POS Tagging

● The earliest algorithms for POS tagging were rule-

based

● The first stage used dictionary lookups to assign all

possible POS tags to each wordform

● The second stage used a large handcrafted rule

database to choose the most appropriate tag for each

wordform
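
The two stages can be sketched in a few lines of Java. This is a toy illustration, not a historical tagger: the five-word lexicon, the two rules, the default tag NN for unknown words, and the fallback of taking the first remaining candidate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RuleBasedTagger {
    // Stage 1 resource: a toy dictionary listing every possible tag for each wordform.
    static final Map<String, List<String>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", Arrays.asList("DT"));
        LEXICON.put("they", Arrays.asList("PRP"));
        LEXICON.put("can", Arrays.asList("MD", "NN", "VB"));
        LEXICON.put("fish", Arrays.asList("NN", "VB"));
        LEXICON.put("rusted", Arrays.asList("VBD"));
    }

    static List<String> tag(String[] words) {
        List<String> tags = new ArrayList<>();
        String prev = "";
        for (String word : words) {
            // Stage 1: dictionary lookup assigns all possible tags (unknown words default to NN).
            List<String> candidates = new ArrayList<>(
                    LEXICON.getOrDefault(word.toLowerCase(), Arrays.asList("NN")));

            // Stage 2: hand-written rules prune candidates using the previous tag.
            List<String> pruned = new ArrayList<>(candidates);
            if (prev.equals("DT")) {
                pruned.removeAll(Arrays.asList("MD", "VB")); // no modal or base verb right after a determiner
            } else if (prev.equals("MD")) {
                pruned.retainAll(Arrays.asList("VB"));       // a modal is followed by a base-form verb
            }
            if (!pruned.isEmpty()) {
                candidates = pruned;                         // apply a rule only if a candidate survives
            }

            String chosen = candidates.get(0);               // fallback: first remaining candidate
            tags.add(chosen);
            prev = chosen;
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag(new String[] { "the", "can", "rusted" }));  // prints [DT, NN, VBD]
        System.out.println(tag(new String[] { "they", "can", "fish" }));   // prints [PRP, MD, VB]
    }
}
```

The same wordform "can" receives different tags in the two sentences because the rules look at the preceding tag.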

Page 9: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Stochastic POS Tagging

● Stochastic POS tagging is based on choosing, for each wordform, the TAG that maximizes the following product:

$$P(\text{wordform} \mid \text{TAG}) \times P(\text{TAG} \mid \text{previous } n \text{ tags})$$
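
To make the scoring concrete, here is a minimal sketch for n = 1 that picks the tag with the largest value of P(wordform | TAG) multiplied by P(TAG | previous tag). The probabilities are illustrative numbers, not estimates from any corpus, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BigramTagChooser {
    // wordGivenTag: candidate tag -> P(wordform | tag) for the wordform being tagged.
    // tagGivenPrev: candidate tag -> P(tag | previous tag) for the tag already chosen to the left.
    // Returns the candidate tag with the largest product of the two probabilities.
    static String bestTag(Map<String, Double> wordGivenTag, Map<String, Double> tagGivenPrev) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : wordGivenTag.entrySet()) {
            double score = e.getValue() * tagGivenPrev.getOrDefault(e.getKey(), 0.0);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative values for the wordform "race" following a word tagged TO.
        Map<String, Double> wordGivenTag = new HashMap<>();
        wordGivenTag.put("VB", 0.00003);
        wordGivenTag.put("NN", 0.00041);

        Map<String, Double> tagGivenPrev = new HashMap<>();
        tagGivenPrev.put("VB", 0.34);
        tagGivenPrev.put("NN", 0.021);

        // VB scores 0.00003 * 0.34 = 1.02e-5; NN scores 0.00041 * 0.021 = 8.61e-6.
        System.out.println(bestTag(wordGivenTag, tagGivenPrev)); // prints VB
    }
}
```

Although "race" is more likely as a noun in isolation under these numbers, the transition probability after TO tips the product toward the verb reading.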

Page 10: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Splitting Text into Sentences with OpenNLP

&

Parsing them with Stanford Parser

Page 11: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Defining Variables

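// Imports for the members below and the splitting code (place at the top of the file)
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

// 1. Set the path to the OpenNLP sentence model en-sent.bin
final static String OPEN_NL_BIN = "Directory to en-sent.bin";

// 2. Define your text
static String small_route_01 = "Put the ATIA registration desk on your "
    + "right side, and walk forward. In 25 feet, you will notice the Antigua "
    + "hallway opening on the right side.";

// 3. Define a member for a Stanford Parser object
public static LexicalizedParser mLexParser = null;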

Page 12: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Splitting Text into Sentences

mLexParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

// 1. Open the OpenNLP model file; try-with-resources closes the stream
try (InputStream modelIn = new FileInputStream(OPEN_NL_BIN)) {
    // 2. Create an OpenNLP SentenceModel object
    SentenceModel model = new SentenceModel(modelIn);
    // 3. Create a Sentence Detector from the Sentence Model
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    // 4. Split text into sentences
    String[] sentences = sentenceDetector.sentDetect(small_route_01);
    // 5. Parse each sentence
    for (String sentence : sentences) {
        parseSentence(mLexParser, sentence);
    }
} catch (Exception ex) {
    // handle exceptions
    ex.printStackTrace();
}
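
For a quick check of what the splitter should return when the en-sent.bin model is not at hand, the JDK's rule-based `java.text.BreakIterator` can serve as a rough stand-in. This is a different technique from OpenNLP's trained sentence detector, and the class name below is hypothetical; it is a sketch for comparison only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class JdkSentenceSplitter {
    // Splits text at the sentence boundaries reported by the JDK's rule-based BreakIterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String route = "Put the ATIA registration desk on your right side, and walk forward. "
                + "In 25 feet, you will notice the Antigua hallway opening on the right side.";
        for (String sentence : split(route)) {
            System.out.println(sentence);
        }
    }
}
```

A rule-based splitter like this tends to stumble on abbreviations such as "Dr." or "U.S.", which is a common reason to prefer a trained model.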

Page 13: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Parsing & Displaying Sentences

// Imports used by parseSentence (place at the top of the file)
import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.*;
import edu.stanford.nlp.trees.*;

public static void parseSentence(LexicalizedParser lp, String sent) {
    // 1. Get a tokenizer
    TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    Tokenizer<CoreLabel> tok = tokenizerFactory.getTokenizer(new StringReader(sent));
    // 2. Tokenize words
    List<CoreLabel> rawWords2 = tok.tokenize();
    // 3. Parse
    Tree parse = lp.apply(rawWords2);
    // 4. Get dependencies (tdl is not printed here; it is available for further processing)
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    // 5. Use TreePrint object to print trees and dependencies
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
}

Page 14: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Sample Parser Output: Parse Tree

Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.

Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]

Parse:

(ROOT
  (S
    (PP (IN In)
      (NP (CD 25) (NNS feet)))
    (, ,)
    (NP (PRP you))
    (VP (MD will)
      (VP (VB notice)
        (NP (DT the) (NNP Antigua) (NN hallway) (NN opening))
        (PP (IN on)
          (NP (DT the) (JJ right) (NN side)))))
    (. .)))
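
The innermost brackets of a Penn-style tree are the POS-tagged words, so the tagging can be read straight off the parse. The following is a minimal sketch using only a regular expression; the class name is hypothetical. It assumes, as is the Penn Treebank convention, that literal parentheses in the text are escaped (for example as -LRB- and -RRB-).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PennLeaves {
    // Matches a preterminal such as "(NNS feet)": a tag and a word with no nested brackets.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    // Returns the leaves of a bracketed parse as "word/TAG" strings, in sentence order.
    static List<String> taggedWords(String tree) {
        List<String> result = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(tree);
        while (m.find()) {
            result.add(m.group(2) + "/" + m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (PP (IN In) (NP (CD 25) (NNS feet))) (, ,) (NP (PRP you)) "
                + "(VP (MD will) (VP (VB notice) "
                + "(NP (DT the) (NNP Antigua) (NN hallway) (NN opening)) "
                + "(PP (IN on) (NP (DT the) (JJ right) (NN side))))) (. .)))";
        System.out.println(String.join(" ", taggedWords(tree)));
    }
}
```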

Page 15: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Stanford Parser Dependencies

● Syntactic trees allow us to analyze sentence structure

● Dependencies allow us to split the sentence into binary relations

● Each dependency can be viewed as a triplet: relation name, governor of the relation, and dependent

● See the Stanford dependencies manual for more details and a comprehensive list of dependencies: http://nlp.stanford.edu/software/dependencies_manual.pdf

Page 16: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Sample Parser Output: Dependencies

Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.

Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]

Dependencies:

num(feet-3, 25-2)

prep_in(notice-7, feet-3)

nsubj(notice-7, you-5)

aux(notice-7, will-6)

root(ROOT-0, notice-7)

det(opening-11, the-8)

nn(opening-11, Antigua-9)

nn(opening-11, hallway-10)

dobj(notice-7, opening-11)

det(side-15, the-13)

amod(side-15, right-14)

prep_on(notice-7, side-15)
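
Each printed line above has the shape relation(governor-index, dependent-index), so it maps directly onto the triplet view of a dependency. As a minimal sketch (the class is hypothetical and is not part of the Stanford API, which exposes the same information through `TypedDependency`), one line can be parsed back into its parts like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyTriple {
    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    // Shape: relation(governor-index, dependent-index), e.g. nsubj(notice-7, you-5)
    private static final Pattern LINE =
            Pattern.compile("(\\w+)\\((.+)-(\\d+),\\s*(.+)-(\\d+)\\)");

    private DependencyTriple(Matcher m) {
        relation = m.group(1);
        governor = m.group(2);
        governorIndex = Integer.parseInt(m.group(3));
        dependent = m.group(4);
        dependentIndex = Integer.parseInt(m.group(5));
    }

    // Parses one printed dependency line; rejects lines that do not have the expected shape.
    static DependencyTriple parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency line: " + line);
        }
        return new DependencyTriple(m);
    }

    public static void main(String[] args) {
        DependencyTriple d = parse("prep_in(notice-7, feet-3)");
        System.out.println(d.relation + ": " + d.governor + " -> " + d.dependent);
        // prints prep_in: notice -> feet
    }
}
```

The indices are token positions in the sentence, with 0 reserved for the artificial ROOT node.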

Page 17: Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

References

● Ch. 8 in Jurafsky and Martin’s “Speech and Language Processing”

● http://nlp.stanford.edu/software/lex-parser.shtml

● https://opennlp.apache.org/

● http://nlp.stanford.edu/software/dependencies_manual.pdf

