Part-of-Speech Tagging for Bengali with Hidden Markov Model

Date post: 19-Jan-2016
Description:
Part-of-Speech Tagging for Bengali with Hidden Markov Model. Sandipan Dandapat, Sudeshna Sarkar, Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur.
Transcript
Page 1: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Dept. of Computer Science & Engg.

Indian Institute of Technology Kharagpur

Part-of-Speech Tagging for Bengali with Hidden Markov Model

Sandipan Dandapat, Sudeshna Sarkar

Department of Computer Science & Engineering

Indian Institute of Technology Kharagpur

Page 2: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Machine Learning to Resolve POS Tagging

HMM
  Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)

Maximum Entropy (Ratnaparkhi, 96; etc.)

TB(ED)L (Brill, 92, 94, 95; etc.)

Decision Tree (Black, 92; Marquez, 97; etc.)

Page 3: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Our Approach

HMM based
  Advantages: simplicity of the model; language independence; reasonably good accuracy
  Drawbacks: data intensive; sparseness problem when extending the order

We adopt a first-order HMM

Page 4: Part-of-Speech Tagging for Bengali with Hidden Markov Model

POS Tagging Schema

[Diagram: Raw text → POS tagging (Language Model; Possible POS Class Restriction; Disambiguation Algorithm) → Tagged text]

Page 5: Part-of-Speech Tagging for Bengali with Hidden Markov Model

POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (First-order HMM; Possible POS Class Restriction; Disambiguation Algorithm) → Tagged text]

First-order HMM: the current state depends on the previous state

$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

Page 6: Part-of-Speech Tagging for Bengali with Hidden Markov Model

POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (First-order HMM, µ = (π, A, B); Possible POS Class Restriction; Disambiguation Algorithm) → Tagged text]

$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

Model parameters of the first-order HMM, µ = (π, A, B):

$$\pi = \{P_{start}(t_i)\}, \qquad A = \{P(t_i \mid t_{i-1})\}, \qquad B = \{P(w_i \mid t_i)\}$$
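As a rough illustration of how the three parameter sets combine, the sketch below scores one tagged sentence under µ = (π, A, B), using the start probability for the first tag. The function name `sequence_log_prob`, the dictionary layout, and all probability values are assumptions made for this example; they do not come from the slides.

```python
import math

def sequence_log_prob(words, tags, pi, A, B):
    """Log of P_start(t_1) P(w_1|t_1) * prod_{i>=2} P(t_i|t_{i-1}) P(w_i|t_i)."""
    logp = math.log(pi[tags[0]]) + math.log(B[tags[0]][words[0]])
    for i in range(1, len(words)):
        logp += math.log(A[tags[i - 1]][tags[i]])   # transition t_{i-1} -> t_i
        logp += math.log(B[tags[i]][words[i]])      # emission of w_i from t_i
    return logp

# Toy parameters (hypothetical values; tag names borrowed from the tagset in the slides)
pi = {"NN": 0.6, "VFM": 0.4}
A = {"NN": {"NN": 0.3, "VFM": 0.7}, "VFM": {"NN": 0.8, "VFM": 0.2}}
B = {"NN": {"w1": 0.5, "w2": 0.5}, "VFM": {"w1": 0.1, "w2": 0.9}}

score = math.exp(sequence_log_prob(["w1", "w2"], ["NN", "VFM"], pi, A, B))
```

Working in log space keeps the product from underflowing on long sentences; the tagger's job is then to find the tag sequence that maximises this score.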

Page 7: Part-of-Speech Tagging for Bengali with Hidden Markov Model

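POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (First-order HMM, µ = (π, A, B); tag restriction for each word $w_i$; Disambiguation Algorithm) → Tagged text]

Possible tags for each word $w_i$: $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$

$\{T\}$: set of all tags

$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer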

Page 8: Part-of-Speech Tagging for Bengali with Hidden Markov Model

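POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (First-order HMM, µ = (π, A, B); tag restriction for each word $w_i$; Viterbi Algorithm) → Tagged text]

Possible tags for each word $w_i$: $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$

$\{T\}$: set of all tags

$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer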
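The decoding step can be sketched as a log-space Viterbi search in which each word's candidate tags are either the full set {T} or a restricted set standing in for T_MA(w_i). This is a minimal sketch under stated assumptions: the `allowed` dictionary is a hypothetical stand-in for the Morphological Analyzer output, and the toy probabilities are invented for the example.

```python
import math

def viterbi(words, tags, pi, A, B, allowed=None):
    """Most likely tag sequence under a first-order HMM.

    allowed: optional dict word -> set of tags (stand-in for T_MA(w_i));
    words missing from it fall back to the full tag set {T}.
    """
    def cands(w):
        return sorted(allowed[w]) if allowed and w in allowed else list(tags)

    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[t]: best log score of any path that ends in tag t at the current word
    delta = {t: lg(pi.get(t, 0)) + lg(B[t].get(words[0], 0)) for t in cands(words[0])}
    back = []
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for t in cands(w):
            prev = max(delta, key=lambda s: delta[s] + lg(A[s].get(t, 0)))
            new_delta[t] = delta[prev] + lg(A[prev].get(t, 0)) + lg(B[t].get(w, 0))
            ptr[t] = prev
        back.append(ptr)
        delta = new_delta
    # Follow the back-pointers from the best final tag
    path = [max(delta, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy model (hypothetical values)
tags = ["NN", "VFM"]
pi = {"NN": 0.6, "VFM": 0.4}
A = {"NN": {"NN": 0.3, "VFM": 0.7}, "VFM": {"NN": 0.8, "VFM": 0.2}}
B = {"NN": {"a": 0.5, "b": 0.5}, "VFM": {"a": 0.1, "b": 0.9}}

unrestricted = viterbi(["a", "b"], tags, pi, A, B)
restricted = viterbi(["a", "b"], tags, pi, A, B, allowed={"b": {"NN"}})
```

Restricting the candidates shrinks the lattice column for that word, so the search never considers tags the analyzer rules out.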

Page 9: Part-of-Speech Tagging for Bengali with Hidden Markov Model

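Disambiguation Algorithm

$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

Tags: [Diagram: lattice with a column of candidate tags under each word]

Where $t_i \in \{T\}\ \forall\, w_i$; $\{T\}$ = set of tags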

Page 10: Part-of-Speech Tagging for Bengali with Hidden Markov Model

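Disambiguation Algorithm

$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

Tags: [Diagram: lattice with a reduced column of candidate tags under each word]

Where $t_i \in T_{MA}(w_i)\ \forall\, w_i$; $\{T\}$ = set of tags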

Page 11: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Learning HMM Parameters

Supervised Learning (HMM-S)

Estimates the three parameters directly from the tagged corpus

$$P_{start}(t_i) = \frac{\text{no. of sentences which begin with } t_i}{\text{no. of sentences}}$$

$$P(t_i \mid t_{i-1}) = \frac{count(t_{i-1}\, t_i)}{count(t_{i-1})}$$

$$P(w_i \mid t_i) = \frac{count(w_i \text{ with } t_i)}{count(t_i)}$$
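The three relative-frequency estimates can be sketched with plain counters. The function name `estimate_hmm`, the corpus layout (a list of sentences, each a list of `(word, tag)` pairs), and the toy corpus are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of (pi, A, B) from non-empty tagged sentences."""
    start, tag_count = Counter(), Counter()
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        start[sent[0][1]] += 1                      # tag that begins the sentence
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1
            emit[tag][word] += 1                    # count(w_i with t_i)
            if i > 0:
                trans[sent[i - 1][1]][tag] += 1     # count(t_{i-1} t_i)
    n = len(tagged_sentences)
    pi = {t: c / n for t, c in start.items()}
    # Denominator is count(t_{i-1}) over all positions, as written on the slide,
    # so a row of A need not sum to 1 when the tag also ends sentences.
    A = {p: {t: c / tag_count[p] for t, c in row.items()} for p, row in trans.items()}
    B = {t: {w: c / tag_count[t] for w, c in row.items()} for t, row in emit.items()}
    return pi, A, B

# Toy corpus (hypothetical words w1..w3)
corpus = [
    [("w1", "NN"), ("w2", "VFM")],
    [("w3", "NN"), ("w1", "NN"), ("w2", "VFM")],
]
pi, A, B = estimate_hmm(corpus)
```

These raw estimates assign zero probability to anything unseen in the tagged data, which is why a smoothing step is needed before decoding.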

Page 12: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Learning HMM Parameters

Semi-supervised Learning (HMM-SS)

Untagged data (observations) are used to find the model that most likely produces the observation sequence

An initial model is created from the tagged training data
Based on the initial model and the untagged data, the model parameters are updated

$$\hat{\mu} = \arg\max_{\mu} P(O_{untagged} \mid \mu)$$

New model parameters are estimated using the Baum-Welch algorithm

$$P(O \mid \hat{\mu}) \geq P(O \mid \mu)$$
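A single Baum-Welch re-estimation step can be sketched as follows, to show why the likelihood of the untagged observations does not decrease. This is a simplified illustration, not the authors' implementation: it assumes one observation sequence, integer-coded tags and words, row-stochastic A, and invented starting values (in the slides the starting model comes from the tagged data).

```python
import math

def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes for one observation sequence."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prior * B[j][obs[t]]
        scale[t] = sum(alpha[t])                     # rescale to avoid underflow
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    return alpha, beta, scale

def log_likelihood(obs, pi, A, B):
    """log P(O | mu), recovered from the scaling factors."""
    return sum(math.log(c) for c in forward_backward(obs, pi, A, B)[2])

def baum_welch_step(obs, pi, A, B):
    """One re-estimation step; returns the updated (pi, A, B)."""
    n, T, m = len(pi), len(obs), len(B[0])
    alpha, beta, scale = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of state i at position t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T - 1))
        for j in range(n):
            # expected number of i -> j transitions
            num = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                      for t in range(T - 1))
            new_A[i][j] = num / denom
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(m)] for j in range(n)]
    return gamma[0][:], new_A, new_B

# Toy initial model and untagged observations (hypothetical values)
obs = [0, 1, 0, 0, 1, 1, 0, 1]
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.8, 0.2], [0.3, 0.7]]

ll_before = log_likelihood(obs, pi0, A0, B0)
pi1, A1, B1 = baum_welch_step(obs, pi0, A0, B0)
ll_after = log_likelihood(obs, pi1, A1, B1)
```

Repeating the step until `ll_after - ll_before` becomes negligible gives a local maximum of the likelihood; nothing in the procedure guarantees that tagging accuracy improves along with it.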

Page 13: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Smoothing and Unknown Word Hypothesis

Not all emissions and transitions are observed in the training data
Add-one smoothing is used to estimate both emission and transition probabilities

Not all words are known to the Morphological Analyzer
Such words are assumed to belong to the open-class grammatical categories
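Both ideas can be sketched briefly: add-one smoothing over a fixed outcome set, and a fallback to open-class tags for words the analyzer does not know. The helper names, the toy counts, and in particular the choice of which tags count as open class are hypothetical; the slides do not list them.

```python
from collections import Counter

def add_one(counts, outcomes):
    """Add-one (Laplace) estimate over a fixed set of outcomes."""
    total = sum(counts[o] for o in outcomes) + len(outcomes)
    return {o: (counts[o] + 1) / total for o in outcomes}

def candidate_tags(word, ma_lexicon, open_class_tags):
    """Tags proposed by the analyzer, or the open-class tags if the word is unknown to it."""
    return ma_lexicon.get(word, open_class_tags)

# Toy transition counts out of tag NN (hypothetical); JJ was never seen after NN
tags = ["NN", "VFM", "JJ"]
after_nn = Counter({"NN": 3, "VFM": 1})
smoothed = add_one(after_nn, tags)

# Toy analyzer lexicon and open-class set (both hypothetical)
ma_lexicon = {"w1": {"VFM"}}
open_class = {"NN", "JJ"}
known = candidate_tags("w1", ma_lexicon, open_class)
unknown = candidate_tags("w2", ma_lexicon, open_class)
```

The unseen transition now receives 1/7 instead of zero, so a single missing bigram can no longer rule out an otherwise plausible tag sequence.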

Page 14: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Experiments

Baseline model

Supervised bigram HMM (HMM-S)
  HMM-S
  HMM-S + IMA
  HMM-S + CMA

Semi-supervised bigram HMM (HMM-SS)
  HMM-SS
  HMM-SS + IMA
  HMM-SS + CMA

Page 15: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Data Used

Tagged data: 3085 sentences (~41,000 words)
  Includes the data in both non-privileged and privileged mode

Untagged corpus from CIIL: 11,000 sentences (100,000 words), unclean
  Used to re-estimate the model parameters with the Baum-Welch algorithm

Page 16: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Tagset and Corpus Ambiguity

Tagset consists of 27 grammatical classes

Corpus ambiguity
  Mean number of possible tags for each word
  Measured in the tagged training data

Language   Dutch  Spanish  German  English  French  Bengali
Ambiguity  1.11   1.19     1.3     1.34     1.69    2.09

(Dermatas et al 1995)
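One way to compute such an ambiguity figure is sketched below. The slides do not say whether the mean is taken over word tokens or word types; this sketch assumes tokens, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Mean number of distinct tags seen for the word at each token position."""
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    return sum(len(tags_of[word]) for word, _ in tagged_tokens) / len(tagged_tokens)

# Toy tagged data (hypothetical): "a" is seen with two tags, "b" with one
tokens = [("a", "NN"), ("a", "VFM"), ("b", "NN"), ("a", "NN")]
ambiguity = corpus_ambiguity(tokens)
```

Averaging over types instead would replace the final line of the function with a mean over `tags_of.values()`, which generally gives a lower number because frequent words tend to be the ambiguous ones.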

Page 17: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Results on Development set

[Figure: four panels plotting tagging accuracy (%, axis 30 to 100) against the size of the training corpus (in thousands of words, 5 to 40). Panels: Baseline; HMM-S, HMM-S + IMA, HMM-S + CMA; ACOPOST; HMM-SS, HMM-SS + IMA, HMM-SS + CMA.]

Page 18: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Results on Development set

Method         Accuracy (%)
Baseline       69.11
ACOPOST        83.45
HMM-S          74.53
HMM-S + IMA    78.65
HMM-S + CMA    88.83
HMM-SS         73.77
HMM-SS + IMA   77.98
HMM-SS + CMA   89.65

[Figure: bar chart of tagging accuracy (%, axis 85.5 to 90) for HMM-S + CMA and HMM-SS + CMA on known data, seen data, and unknown data. Bar labels as extracted: 89.61, 89.03, 87.09, 87.4, 89.36, 88.92.]

Page 19: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Error Analysis

Actual Class  Predicted Class  % of total error  % of class error
NNC           NN               14.2              4.0
VRB           VFM              7.1               8.7
JJ            NN               5.9               1.7
QF            JJ               5.1               3.7
RB            JJ               5.0               3.6
NLOC          NN               4.5               1.3
VNN           VFM              3.7               4.5

Page 20: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Results on Test Set

Tested on 458 sentences (5127 words)

Precision: 84.32%  Recall: 84.36%  Fβ=1: 84.34%

Type    Precision (%)  Recall (%)  Fβ=1   Frequency
SYM     100            99.78       99.89  911
NEG     95.45          100         97.67  44
PRP     95.72          93.18       94.43  257
QFNUM   94.70          91.24       92.94  132

Top 4 classes in terms of F-measure
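The per-class figures of this kind can be computed as sketched below. The function names and the toy gold/predicted tag lists are hypothetical; only the standard definitions of precision, recall, and the balanced F-measure are assumed.

```python
from collections import Counter

def f1(precision, recall):
    """Balanced F-measure (beta = 1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def per_class_prf(gold, pred):
    """Map each tag to (precision, recall, F1) given aligned gold and predicted tags."""
    tp, gold_n, pred_n = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_n[g] += 1
        pred_n[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for c in set(gold_n) | set(pred_n):
        prec = tp[c] / pred_n[c] if pred_n[c] else 0.0
        rec = tp[c] / gold_n[c] if gold_n[c] else 0.0
        scores[c] = (prec, rec, f1(prec, rec))
    return scores

# Toy gold and predicted tags (hypothetical)
gold = ["NN", "NN", "VFM", "JJ"]
pred = ["NN", "VFM", "VFM", "NN"]
scores = per_class_prf(gold, pred)

# Overall figures reported on the slide: P = 84.32, R = 84.36
overall_f = f1(84.32, 84.36)
```

Plugging the reported overall precision and recall into `f1` reproduces the reported 84.34 after rounding to two decimals.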

Page 21: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Results on Test Set

Tested on 458 sentences (5127 words)

Precision: 84.32%  Recall: 84.36%  Fβ=1: 84.34%

Type    Precision (%)  Recall (%)  Fβ=1   Frequency
VJJ     0              0           0      0
NVB     0              0           0      28
JVB     0              0           0      12
INF     100            12.5        22.22  1

Bottom 4 classes in terms of F-measure

Page 22: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Further Improvement

Uses suffix information to handle unknown words
Calculates the probability of a tag given the last m letters (suffix) of a word
Each symbol emission probability of an unknown word is normalized

$$P(Unknown\_word \mid t_i) = \frac{P(t_i \mid Unknown\_word)\, P(Unknown\_word)}{P(t_i)} \approx \frac{P(t_i \mid l_{n-m+1}, \ldots, l_n)\, P(Unknown\_word)}{P(t_i)}$$
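The suffix idea can be sketched as follows: estimate P(t | last m letters) from the tagged words, then divide by P(t) to obtain a score proportional to the emission probability of an unknown word (the factor P(Unknown_word) is the same for every tag and is left out). The function names, the absence of smoothing, and the toy word list are assumptions; the slides do not specify how the final normalization is done.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, m):
    """Count tags per m-letter suffix, plus overall tag counts."""
    suffix_tag, tag_count = defaultdict(Counter), Counter()
    for word, tag in tagged_words:
        suffix_tag[word[-m:]][tag] += 1
        tag_count[tag] += 1
    return suffix_tag, tag_count

def unknown_emission_scores(word, suffix_tag, tag_count, m):
    """Scores proportional to P(Unknown_word | t) = P(t | suffix) P(Unknown_word) / P(t)."""
    counts = suffix_tag.get(word[-m:], Counter())   # empty if the suffix was never seen
    total_suffix = sum(counts.values())
    total_tags = sum(tag_count.values())
    scores = {}
    for tag, c in counts.items():
        p_tag_given_suffix = c / total_suffix
        p_tag = tag_count[tag] / total_tags
        scores[tag] = p_tag_given_suffix / p_tag
    return scores

# Toy training words (hypothetical strings; three end in "che")
train = [("aache", "VFM"), ("bbche", "VFM"), ("ccche", "NN"), ("ddd", "NN")]
suffix_tag, tag_count = train_suffix_model(train, m=3)
scores = unknown_emission_scores("zzche", suffix_tag, tag_count, m=3)
```

In the toy data the suffix "che" favours VFM two to one, so the unseen word "zzche" receives a VFM score twice as large as its NN score.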

Page 23: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Further Improvement

[Figure: two bar charts of tagging accuracy (%, axis 70 to 100); legend: IMA, CMA. Panel 1: HMM-SS, HMM-SS+IMA, HMM-SS+CMA; bar labels as extracted: 73.77, 89.65, 77.98, 90.33, 84.61, 83.33. Panel 2: HMM-S, HMM-S+IMA, HMM-S+CMA; bar labels as extracted: 90.17, 78.65, 88.83, 74.53, 85.04, 85.95.]

Accuracy reflected on development set

Page 24: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Conclusion and Future Scope

Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available

Semi-supervised learning performs better than supervised learning

Better adjustment of emission probabilities can be adopted for both unknown words and less frequent words

A higher-order Markov model can be adopted

Page 25: Part-of-Speech Tagging for Bengali with Hidden Markov Model

Thank You

