
A Report of IJCNLP 2011 #TokyoNLP

TokyoNLP is a meetup about natural language processing in Tokyo. These slides were the 5th presentation at the 8th TokyoNLP event.

Transcript
Page 1

A Report of IJCNLP 2011
@nokuno #tokyonlp

Page 2

About the presenter

• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/

Page 3

Recent nokuno (1)

Page 4

Recent nokuno (2)

Page 5

Recent nokuno (3)

#emnlpreading, December 23, 2011, at Cybozu Labs

Page 6

Today's Topic

• Japanese Pronunciation Prediction as Statistical Machine Translation
• Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
• Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs

Page 7

Japanese Pronunciation Prediction as Statistical Machine Translation

Jun Hatori and Hisami Suzuki
University of Tokyo / Microsoft Research

IJCNLP 2011

Page 8

Motivation

• Japanese words and sentences have multiple possible pronunciations
• The proposed method predicts the pronunciations of out-of-vocabulary (OOV) words [Hatori+ 11] and of known words in a sentence simultaneously
• Uses a statistical machine translation (SMT) framework at the word and character level

Page 9

An Example

• Input: 東京都美術館の狩野探幽展に行った
• Output: とうきょうとびじゅつかんのかのうたんゆうてんにいった
  (the input means "I went to the Kanō Tan'yū exhibition at the Tokyo Metropolitan Art Museum"; the output is its pronunciation in hiragana)
• Training corpus: a Japanese dictionary and a corpus annotated with pronunciations

Page 10

Discriminative Model

• Similar to phrase-based SMT, with monotone alignment and no insertions or deletions
• Uses averaged perceptron training

λ: parameters, f: features
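
Assuming the standard linear model that averaged-perceptron training implies (a reconstruction, not the slide's exact equation), decoding picks the highest-scoring output:

\hat{y} = \arg\max_{y} \; \boldsymbol{\lambda} \cdot \mathbf{f}(x, y)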

Page 11

Features

• Bidirectional translation probability
• Target character n-gram model
• Target character length
• Joint n-gram model
  – probability of (source, target) pairs

Page 12

Translation Process

Page 13

Training

• Produces the translation table and the language model

Page 14

Experimental Result

• The dictionary-based approach outperformed the substring-based approach [Hatori+ 11]

Page 15

References

• [Hatori+ 11] Predicting Word Pronunciation in Japanese
• [Kudo+ 04] Applying Conditional Random Fields to Japanese Morphological Analysis (MeCab)
• [Neubig+ 10] Word-based Partial Annotation for Efficient Corpus Construction (KyTea)
• [Suzuki+ 05] Microsoft Research IME Corpus
• [Maekawa+ 08] Compilation of the KOTONOHA-BCCWJ Corpus (in Japanese)

Page 16

Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System

Andrew Finch and Eiichiro Sumita (NICT)

NEWS 2011

Page 17

Transliteration Task

• Transliteration is defined as the phonetic translation of names across languages [Zhang+ 11]

Page 18

Nonparametric Co-segmentation

• Extends monolingual word segmentation [Mochihashi+ 09] [Goldwater+ 06] to the bilingual setting
• Uses a unigram Dirichlet process model as the language model and a Poisson distribution as the base measure (no character-level LM)
• Simple Gibbs sampling with forward-backward [Finch+ 10]

Page 19

Joint Source-Channel Model

• Models the parallel corpus as bilingual sequence-pairs [Finch+ 10]
• Bilingual sequence-pairs do not cross word boundaries

s: sources, t: targets, w: words, γ: bilingual segmentation
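
A plausible reconstruction of the slide's equation (1), assuming the unigram joint source-channel form of [Finch+ 10], in which each word is generated as a sequence of bilingual sequence-pairs:

P(\mathbf{s}, \mathbf{t}) = \sum_{\gamma} \prod_{k=1}^{|\gamma|} P\big((s, t)_k\big)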

Page 20

Unigram Dirichlet Process Model

• Bilingual sequence-pairs are generated from a unigram Dirichlet process
• Uses the Chinese Restaurant Process representation
• A bilingual sequence-pair is generated as either:
  1. an existing type, with probability proportional to its count, or
  2. a new type, with probability governed by the concentration parameter (α = 0.3 in this case)

[Finch+ 10]

Page 21

The Base Measure

• A double Poisson distribution over bilingual sequence-pairs
• Characters are generated uniformly

[Finch+ 10]

v: vocabulary size, λ: Poisson parameter (= 2 in this case)
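
Combining the two bullets, the base measure presumably factors into a Poisson length distribution and uniform character draws on each side (a sketch; v_s and v_t denote the source and target vocabulary sizes):

G_0\big((s, t)\big) = \frac{\lambda^{|s|} e^{-\lambda}}{|s|!}\, v_s^{-|s|} \times \frac{\lambda^{|t|} e^{-\lambda}}{|t|!}\, v_t^{-|t|}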

Page 22

The Generative Model

• Each pair is generated from the history of bilingual sequence-pairs

−k: "up to but not including k"; α = 0.3: weight of a new bilingual sequence-pair

[Finch+ 10]
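
This history-based generation matches the standard Dirichlet process predictive distribution, with N_{-k}(w) the count of pair w in the history excluding position k and N_{-k} the total count:

P\big(w_k = w \mid \mathbf{w}_{-k}\big) = \frac{N_{-k}(w) + \alpha\, G_0(w)}{N_{-k} + \alpha}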

Page 23

The Generative Process [Finch+ 10]

Flowchart: decide whether to generate a new pair. If no, sample a pair from the multinomial distribution over existing pairs; if yes, sample the length of each side of the pair, then sample each character uniformly. A sketch follows.
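
A minimal Python sketch of this process; the alphabets, the function names, and the use of Knuth's Poisson sampler are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def sample_poisson(lam):
    # Knuth's inverse-transform Poisson sampler; fine for small lam.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def generate_pair(history, alpha=0.3, lam=2.0,
                  src_chars="abcdefghijklmnopqrstuvwxyz",
                  trg_chars="アイウエオカキクケコサシスセソ"):
    """One draw from the CRP over bilingual sequence-pairs,
    following the flowchart above. history maps pair -> count."""
    n = sum(history.values())
    if random.random() < n / (n + alpha):
        # Existing pair: sample from the multinomial distribution,
        # proportional to each pair's count in the history.
        r = random.uniform(0, n)
        for pair, count in history.items():
            r -= count
            if r <= 0:
                return pair
    # New pair: sample the length of each side from Poisson(lam),
    # then sample each character uniformly.
    src = "".join(random.choice(src_chars)
                  for _ in range(sample_poisson(lam)))
    trg = "".join(random.choice(trg_chars)
                  for _ in range(sample_poisson(lam)))
    return (src, trg)
```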

Page 24

Gibbs Sampling

• Uses the blocked version of Forward-Filtering Backward-Sampling (FFBS) [Mochihashi+ 09] [Finch+ 10]; a sketch follows
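
A sketch of one blocked FFBS step over the co-segmentation lattice (the graph shown on Page 25); pair_prob stands in for the CRP probability above, and the restriction to at least one character per side is a simplifying assumption:

```python
import random

def ffbs_cosegment(src, trg, pair_prob, max_len=4):
    """Forward-filter over the co-segmentation lattice of (src, trg),
    then backward-sample one co-segmentation. Assumes every bilingual
    sequence-pair covers >= 1 character on each side, and that the two
    lengths are within a factor of max_len of each other."""
    S, T = len(src), len(trg)
    # Forward pass: alpha[i][j] = total probability of all
    # co-segmentations of the prefixes (src[:i], trg[:j]).
    alpha = [[0.0] * (T + 1) for _ in range(S + 1)]
    alpha[0][0] = 1.0
    for i in range(S):
        for j in range(T):
            if alpha[i][j] == 0.0:
                continue
            for di in range(1, min(max_len, S - i) + 1):
                for dj in range(1, min(max_len, T - j) + 1):
                    p = pair_prob(src[i:i + di], trg[j:j + dj])
                    alpha[i + di][j + dj] += alpha[i][j] * p
    # Backward pass: walk from (S, T) back to (0, 0), sampling each
    # final pair proportional to alpha at its start times its probability.
    i, j, pairs = S, T, []
    while (i, j) != (0, 0):
        choices, weights = [], []
        for di in range(1, min(max_len, i) + 1):
            for dj in range(1, min(max_len, j) + 1):
                w = alpha[i - di][j - dj] * pair_prob(src[i - di:i],
                                                      trg[j - dj:j])
                if w > 0.0:
                    choices.append((di, dj))
                    weights.append(w)
        di, dj = random.choices(choices, weights=weights)[0]
        pairs.append((src[i - di:i], trg[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```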

Page 25

Graph of all co-segmentations of (abba, アッバ)

Page 26

Experimental Result

• Outperformed the m2m baseline on all language pairs

Page 27

Translation table example

Page 28

References

• [Zhang+ 11] Whitepaper of NEWS 2011 Shared Task on Machine Transliteration
• [Finch+ 10] A Bayesian Model of Bilingual Segmentation for Transliteration
• [Mochihashi+ 09] Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
• [Goldwater+ 06] Contextual Dependencies in Unsupervised Word Segmentation

Page 29

Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs

Wang Ling, João Graça, David Martins de Matos, Isabel Trancoso and Alan Black
Carnegie Mellon University

IJCNLP 2011

Page 30

Reordering in Phrase-based SMT

• The reordering model plays an important role for language pairs like Japanese-English

P(e|f) = P(e) \prod_{i=1}^{I} P(f_i \mid e_i)\, P(p_i, o_i)

P(e): language model; P(f_i | e_i): translation; P(p_i, o_i): reordering [Koehn+ 03]

Page 31

History of Reordering Models

• Distance-based reordering model [Koehn+ 03]
• Word-based lexicalized reordering [Koehn+ 05]
• Phrase-based lexicalized reordering [Tillmann+ 04]
• Weighted word-based lexicalized reordering [Ling+ 11]
  – Weighted alignment matrices [Liu+ 09]
  – Reordering graph representation [Su+ 10]
• This paper proposes weighted phrase-based lexicalized reordering

Page 32

Three Types of "Orientation"

• Orientations are categorized into 3 types (see the sketch after this list):
  – monotone (m)
  – swap (s)
  – discontinuous (d)

[Koehn+ 05]
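
A minimal sketch of the classification: with target phrases visited left to right, a phrase's orientation follows from how its source span touches the previous phrase's source span (the span convention here is an assumption):

```python
def orientation(prev_span, cur_span):
    """Classify reordering orientation from source-side spans,
    given as end-exclusive (start, end) index pairs."""
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end:
        return "monotone"        # continues directly to the right
    if cur_end == prev_start:
        return "swap"            # sits directly to the left
    return "discontinuous"       # separated by a gap either way
```

For example, orientation((0, 2), (2, 5)) is monotone, while orientation((3, 5), (0, 3)) is swap.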

Page 33

Word-based Reordering

• Currently the most popular reordering model
• This work extends its counts to weighted sums of probabilities, as sketched below

[Koehn+ 05]
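
In the count-based form, the orientation distribution for a phrase pair (f, e) is a relative frequency over extracted instances; the weighted variants presumably replace each count C(o, f, e) with a sum of instance probabilities (a sketch of the idea, not the paper's exact estimator):

p(o \mid f, e) = \frac{C(o, f, e)}{\sum_{o'} C(o', f, e)}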

Page 34

Weighted alignment matrices [Liu+ 09]

Page 35

Weighted reordering graph [Su+ 10]

Page 36

Forward-Backward Algorithm

• Used to calculate the reordering probability P(p, o)
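
A generic sketch of that computation: over a weighted DAG such as a reordering graph, forward and backward path sums give each edge's posterior probability (the graph construction itself follows [Su+ 10] and is not shown; all names here are illustrative):

```python
from collections import defaultdict, deque

def edge_posteriors(edges, source, sink):
    """Forward-backward over a weighted DAG: returns the posterior
    probability that a path passes through each edge.
    edges: list of (u, v, weight) with an edge u -> v."""
    out_edges, in_deg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v, w in edges:
        out_edges[u].append((v, w))
        in_deg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm: a topological order of the DAG.
    order = []
    queue = deque(n for n in nodes if in_deg[n] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in out_edges[u]:
            in_deg[v] -= 1
            if in_deg[v] == 0:
                queue.append(v)
    # Forward scores: total weight of paths from source to each node.
    fwd = defaultdict(float)
    fwd[source] = 1.0
    for u in order:
        for v, w in out_edges[u]:
            fwd[v] += fwd[u] * w
    # Backward scores: total weight of paths from each node to sink.
    bwd = defaultdict(float)
    bwd[sink] = 1.0
    for u in reversed(order):
        for v, w in out_edges[u]:
            bwd[u] += w * bwd[v]
    z = fwd[sink]  # total weight of all source-to-sink paths
    return {(u, v): fwd[u] * w * bwd[v] / z for u, v, w in edges}
```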

Page 37

Choosing the Weight Matrix

• Weighted alignment matrix
• Distance-based edge weights

Both come from [Liu+ 09].

Page 38

Experimental Result

Page 39

References

• [Koehn+ 03] Statistical Phrase-based Translation
• [Koehn+ 05] Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation
• [Liu+ 09] Weighted Alignment Matrices for Statistical Machine Translation
• [Su+ 10] Learning Lexicalized Reordering Models from Reordering Graphs
• [Ling+ 11] Reordering Modeling using Weighted Alignment Matrices

Page 40

Phrase Extraction for Japanese Predictive Input Method as Post-Processing

Yoh Okuno
Yahoo Japan Corporation

IJCNLP 2011

Page 41

Call for Papers: TokyoNLP #9
EMNLP 2011 Reading

Page 42

Any Questions?

