Speech & NLP (Fall 2014): POS Tagging, Sentence Splitting & Parsing

Speech & NLP

www.vkedco.blogspot.com

Part-of-Speech Tagging, Sentence Splitting & Parsing

Vladimir Kulyukin

Department of Computer Science

Utah State University

Outline

● Parts of Speech

● Approaches to POS Tagging

● Splitting Text into Sentences with OpenNLP and Parsing Them with Stanford Parser

Closed & Open Classes of Words

● Parts of speech are divided into two broad classes: closed

class types & open class types

● Prepositions are a closed class type

● Nouns are an open class type

● Verbs are an open class type

● In many human languages, nouns, verbs, and adjectives are open classes

Closed Classes

● Prepositions: at, from, by, to, with, over, near

● Determiners: the, a, an

● Pronouns: he, she, we, I, they, it

● Conjunctions: but, if, and, then, as, or

● Auxiliary verbs: may, might, can, could, should

● Particles: up, down, on, off, in, out, at, by (go on, stop by,

pick up, turn off, etc.)

● Numerals: one, two, three, first, second, third
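Because closed classes are small, they can be stored as explicit word lists. The sketch below is only an illustration: the class name ClosedClassLexicon and the method classesOf are hypothetical, and the lists are the example words above, not complete inventories. It also shows that one wordform (e.g., "by") can belong to more than one closed class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ClosedClassLexicon {
    // Example word lists from the slide above; not complete inventories.
    private static final Map<String, List<String>> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("preposition", Arrays.asList("at", "from", "by", "to", "with", "over", "near"));
        CLASSES.put("determiner", Arrays.asList("the", "a", "an"));
        CLASSES.put("pronoun", Arrays.asList("he", "she", "we", "i", "they", "it"));
        CLASSES.put("conjunction", Arrays.asList("but", "if", "and", "then", "as", "or"));
        CLASSES.put("auxiliary", Arrays.asList("may", "might", "can", "could", "should"));
        CLASSES.put("particle", Arrays.asList("up", "down", "on", "off", "in", "out", "at", "by"));
        CLASSES.put("numeral", Arrays.asList("one", "two", "three", "first", "second", "third"));
    }

    /** Returns every closed class that lists the word; empty if none does. */
    static List<String> classesOf(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : CLASSES.entrySet()) {
            if (entry.getValue().contains(key)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classesOf("by"));   // [preposition, particle]
        System.out.println(classesOf("jury")); // []
    }
}
```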

English POS Tagsets

● Statistical part-of-speech (POS) taggers require the existence

of tagsets

● Brown corpus (http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html)

uses an 87-tag tagset

● Penn treebank (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html)

uses a 45-tag tagset

● There are other smaller and larger tagsets (see Ch. 8 in

Jurafsky & Martin’s book for references)

Example

Original sentence: The grand jury commented on a

number of other topics.

Sentence tagged with the Penn Treebank tagset:

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT

number/NN of/IN other/JJ topics/NNS ./.
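The word/TAG notation above is easy to read back into a program. The following is a minimal sketch, assuming whitespace-separated tokens with the tag after the last slash; the class name TaggedSentenceReader is hypothetical and is not part of any tagger's API.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedSentenceReader {
    /** Splits "word/TAG word/TAG ..." into {word, tag} pairs. */
    static List<String[]> read(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so wordforms that contain '/' stay intact.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("Not in word/TAG form: " + token);
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
                + "number/NN of/IN other/JJ topics/NNS ./.";
        for (String[] pair : read(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```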

Part-of-Speech Tagging

● POS tagging is the process of assigning a POS tag from a specific tag set to each wordform and punctuation mark in the text

● The analogous step in programming language compilation is tokenization (lexical analysis), although natural-language tags are far more ambiguous

● POS tagging can be done probabilistically; POS tagging

models are trained on large manually tagged data sets

● POS tagging can also be rule-based

Rule-Based POS Tagging

● The earliest algorithms for POS tagging were rule-based and worked in two stages

● The first stage used dictionary lookups to assign all

possible POS tags to each wordform

● The second stage used a large handcrafted rule

database to choose the most appropriate tag for each

wordform
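The two stages can be sketched as follows. This is a toy illustration, not a historical tagger: the lexicon, the two disambiguation rules, the NN default for unknown words, and the class name TwoStageRuleTagger are all assumptions made for the example. Real rule-based taggers relied on large dictionaries and far larger handcrafted rule sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TwoStageRuleTagger {
    // Toy dictionary (hypothetical): all possible tags for each wordform.
    private static final Map<String, List<String>> LEXICON = new HashMap<>();
    static {
        LEXICON.put("the", Arrays.asList("DT"));
        LEXICON.put("to", Arrays.asList("TO"));
        LEXICON.put("race", Arrays.asList("NN", "VB"));
        LEXICON.put("can", Arrays.asList("MD", "NN", "VB"));
    }

    static List<String> tag(List<String> words) {
        List<String> tags = new ArrayList<>();
        String prevTag = "<s>"; // sentence-start marker
        for (String word : words) {
            // Stage 1: dictionary lookup (unknown words default to NN in this sketch).
            List<String> candidates =
                    LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), Arrays.asList("NN"));
            // Stage 2: handcrafted rules pick one candidate.
            String chosen = disambiguate(prevTag, candidates);
            tags.add(chosen);
            prevTag = chosen;
        }
        return tags;
    }

    static String disambiguate(String prevTag, List<String> candidates) {
        if (candidates.size() == 1) return candidates.get(0);
        // Toy rule 1: after a determiner, prefer a noun reading.
        if (prevTag.equals("DT") && candidates.contains("NN")) return "NN";
        // Toy rule 2: after "to", prefer a base-form verb reading.
        if (prevTag.equals("TO") && candidates.contains("VB")) return "VB";
        return candidates.get(0); // fallback: first tag listed in the dictionary
    }

    public static void main(String[] args) {
        System.out.println(tag(Arrays.asList("the", "race"))); // [DT, NN]
        System.out.println(tag(Arrays.asList("to", "race")));  // [TO, VB]
    }
}
```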

Stochastic POS Tagging

● Stochastic POS taggers choose, for each wordform, the tag that maximizes the following product:

$$P(\text{wordform} \mid \text{TAG}) \cdot P(\text{TAG} \mid \text{previous } n \text{ tags})$$
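A minimal sketch of this idea for n = 1 (a bigram model) is given below. It scores each candidate tag by the product of P(wordform | TAG) and P(TAG | previous tag) and keeps the best one. The probabilities are made up for illustration and are not estimated from any corpus; the class name BigramTagScorer is hypothetical. A full HMM tagger would maximize over the whole tag sequence (e.g., with the Viterbi algorithm) rather than one word at a time.

```java
import java.util.HashMap;
import java.util.Map;

final class BigramTagScorer {
    // Made-up probabilities for illustration only.
    private static final Map<String, Double> EMISSION = new HashMap<>();   // key: "word|TAG"
    private static final Map<String, Double> TRANSITION = new HashMap<>(); // key: "TAG|PREV_TAG"
    static {
        EMISSION.put("race|NN", 0.0004);
        EMISSION.put("race|VB", 0.0001);
        TRANSITION.put("NN|TO", 0.02);
        TRANSITION.put("VB|TO", 0.80);
        TRANSITION.put("NN|DT", 0.50);
        TRANSITION.put("VB|DT", 0.001);
    }

    /** P(word | tag) * P(tag | prevTag); unseen events get probability 0 in this sketch. */
    static double score(String word, String tag, String prevTag) {
        double emission = EMISSION.getOrDefault(word + "|" + tag, 0.0);
        double transition = TRANSITION.getOrDefault(tag + "|" + prevTag, 0.0);
        return emission * transition;
    }

    /** Returns the candidate tag with the highest score. */
    static String bestTag(String word, String prevTag, String... candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String tag : candidates) {
            double s = score(word, tag, prevTag);
            if (s > bestScore) {
                bestScore = s;
                best = tag;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestTag("race", "TO", "NN", "VB")); // VB
        System.out.println(bestTag("race", "DT", "NN", "VB")); // NN
    }
}
```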

Splitting Text into Sentences with OpenNLP & Parsing Them with Stanford Parser

Defining Variables

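// 0. Imports needed by the code in this section
import java.io.*;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.*;
import edu.stanford.nlp.trees.*;
import opennlp.tools.sentdetect.*;

// 1. Set the path to the OpenNLP sentence model en-sent.bin
final static String OPEN_NL_BIN = "Directory to en-sent.bin";

// 2. Define your text
static String small_route_01 = "Put the ATIA registration desk on your "
    + "right side, and walk forward. In 25 feet, you will notice the Antigua "
    + "hallway opening on the right side.";

// 3. Define a member for a Stanford Parser object
public static LexicalizedParser mLexParser = null;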

Splitting Text into Sentences

mLexParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

// 1. Open the OpenNLP model file (try-with-resources closes the stream)
try (InputStream modelIn = new FileInputStream(OPEN_NL_BIN)) {

    // 2. Create an OpenNLP SentenceModel object
    SentenceModel model = new SentenceModel(modelIn);

    // 3. Create a sentence detector from the sentence model
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

    // 4. Split text into sentences
    String[] sentences = sentenceDetector.sentDetect(small_route_01);

    // 5. Parse each sentence
    for (String sentence : sentences) {
        parseSentence(mLexParser, sentence);
    }

} catch (Exception ex) {
    // handle exceptions
    ex.printStackTrace();
}
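To experiment with sentence splitting without downloading the OpenNLP model, the JDK's java.text.BreakIterator can serve as a rough stand-in. Note that this swaps in a different technique: BreakIterator applies built-in boundary rules, whereas OpenNLP's SentenceDetectorME uses a trained statistical model, so the two may disagree on harder input (e.g., abbreviations). The class name JdkSentenceSplitter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class JdkSentenceSplitter {
    /** Splits text into trimmed sentences using the JDK's rule-based sentence iterator. */
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String route = "Put the ATIA registration desk on your right side, and walk forward. "
                + "In 25 feet, you will notice the Antigua hallway opening on the right side.";
        for (String sentence : split(route)) {
            System.out.println(sentence);
        }
    }
}
```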

Parsing & Displaying Sentences

public static void parseSentence(LexicalizedParser lp, String sent) {

// 1. Get a tokenizer

TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

Tokenizer<CoreLabel> tok = tokenizerFactory.getTokenizer(new StringReader(sent));

// 2. Tokenize words

List<CoreLabel> rawWords2 = tok.tokenize();

// 3. Parse

Tree parse = lp.apply(rawWords2);

// 4. Get dependencies

TreebankLanguagePack tlp = new PennTreebankLanguagePack();

GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();

// 5. Use TreePrint object to print trees and dependencies

TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");

tp.printTree(parse);

}

Sample Parser Output: Parse Tree

Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.

Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]

Parse:

(ROOT
  (S
    (PP (IN In)
      (NP (CD 25) (NNS feet)))
    (, ,)
    (NP (PRP you))
    (VP (MD will)
      (VP (VB notice)
        (NP (DT the) (NNP Antigua) (NN hallway) (NN opening))
        (PP (IN on)
          (NP (DT the) (JJ right) (NN side)))))
    (. .)))
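In the bracketed Penn-style output above, each innermost pair "(TAG word)" is a POS-tagged leaf. The sketch below pulls those pairs out of the printed tree with a regular expression. It works on the text only and does not use the Stanford Parser API; the class name PennTreeLeaves is hypothetical, and the pattern assumes that literal parentheses in the text are escaped (e.g., as -LRB- and -RRB-), as is usual in Penn Treebank output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PennTreeLeaves {
    // A tagged leaf is "(TAG word)" with no nested parentheses inside.
    private static final Pattern TAGGED_LEAF =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns the leaves of a bracketed tree as "word/TAG" strings, left to right. */
    static List<String> taggedLeaves(String tree) {
        List<String> leaves = new ArrayList<>();
        Matcher m = TAGGED_LEAF.matcher(tree);
        while (m.find()) {
            leaves.add(m.group(2) + "/" + m.group(1));
        }
        return leaves;
    }

    public static void main(String[] args) {
        String fragment = "(PP (IN on) (NP (DT the) (JJ right) (NN side)))";
        System.out.println(taggedLeaves(fragment)); // [on/IN, the/DT, right/JJ, side/NN]
    }
}
```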

Stanford Parser Dependencies

● Syntactic trees allow us to analyze sentence structure

● Dependencies allow us to split the sentence into binary relations

● Each dependency can be viewed as a triplet: relation name, governor of the relation, and dependent; e.g., in nsubj(notice-7, you-5) the relation is nsubj, the governor is notice, and the dependent is you

● See the following link for more details and a comprehensive list of dependencies:

http://nlp.stanford.edu/software/dependencies_manual.pdf

Sample Parser Output: Dependencies

Sentence: In 25 feet, you will notice the Antigua hallway opening on the right side.

Tokens: [In, 25, feet, ,, you, will, notice, the, Antigua, hallway, opening, on, the, right, side, .]

Dependencies:

num(feet-3, 25-2)

prep_in(notice-7, feet-3)

nsubj(notice-7, you-5)

aux(notice-7, will-6)

root(ROOT-0, notice-7)

det(opening-11, the-8)

nn(opening-11, Antigua-9)

nn(opening-11, hallway-10)

dobj(notice-7, opening-11)

det(side-15, the-13)

amod(side-15, right-14)

prep_on(notice-7, side-15)
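Each printed line above follows the pattern relation(governor-index, dependent-index), so it can be read back into a triplet. The sketch below parses that textual form only; it is not the Stanford TypedDependency class, and the class name DependencyTriplet and its fields are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DependencyTriplet {
    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    // Matches lines such as "nsubj(notice-7, you-5)".
    private static final Pattern LINE =
            Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    private DependencyTriplet(String relation, String governor, int governorIndex,
                              String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    /** Parses one printed dependency line into its relation, governor, and dependent. */
    static DependencyTriplet parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency line: " + line);
        }
        return new DependencyTriplet(m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        DependencyTriplet d = parse("prep_in(notice-7, feet-3)");
        // prints: prep_in: notice -> feet
        System.out.println(d.relation + ": " + d.governor + " -> " + d.dependent);
    }
}
```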

References

● Ch. 8 in Jurafsky and Martin’s “Speech & Language Processing”

● http://nlp.stanford.edu/software/lex-parser.shtml

● https://opennlp.apache.org/

● http://nlp.stanford.edu/software/dependencies_manual.pdf