Machine Translation: Breaking the Communication Barrier
By Dr. Gurpreet S. Josan, Punjabi University, Patiala
Transcript
Slide 1
By Dr. Gurpreet S. Josan, Punjabi University, Patiala
Slide 2
Communication is the activity of conveying meaningful information. Communication requires a sender, a message, and an intended recipient. The communication process is complete once the receiver has understood the sender.
Slide 3
Nonverbal communication includes gesture, body language or posture, and facial expression. Visual communication includes signs, typography, drawing, colours, etc. Oral communication is spoken verbal communication. Written communication includes alphabets, symbols, grammar, etc.
Slide 4
Slide 5
Language is a barrier to information dissemination. All the major sources of information and discoveries are in English. We are unable to reach the masses in rural areas who do not know English. Suppose you are a scientist who has just hit upon a revolutionary new idea. How do you find out whether a scientist anywhere in the world has already filed a patent on a similar idea in their native language?
Slide 6
A translator can be manual or machine.
Manual: too slow, of limited availability, costly, but accurate.
Machine: fast and economical, but not accurate.
Computers lack knowledge! Computers see text in English the same way you saw the previous text. People have no trouble understanding language: they have common-sense knowledge, reasoning capacity, and experience. Computers have no common-sense knowledge and no reasoning capacity.
Slide 9
Which ones are going to be difficult for computers to deal with: grammar or lexicon? Grammar (rules for putting words together into sentences): how many rules are there? 100, 1,000, 10,000, more? Do we have all the rules written down somewhere? Lexicon (dictionary): how many words do we need to know? 1,000, 10,000, 100,000?
Slide 10
"The dog ate my homework." Who did what to whom? 1. Identify the part of speech (POS): dog = noun; ate = verb; homework = noun. English POS tagging accuracy is about 95%. Try to tag this text manually: "I can, can the can." 2. Identify collocations: mother-in-law, hot dog.
Slide 11
Seemingly similar sentences may differ radically in meaning: "The CEO was fired up about his new role." vs. "The CEO was fired from his new role." Seemingly different sentences can have the same meaning: "IBM's PC division was acquired by Lenovo." vs. "Lenovo bought the PC division of IBM."
Slide 12
Ambiguity. Structural ambiguity: "I saw the man with the telescope." Word-level ambiguity.
Slide 13
Various meanings of a word in Punjabi.
Slide 14
If more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence explodes: the number of interpretations is the product of the numbers of possible meanings of the words. Consider a sentence in which only {va} and {pa} are ambiguous, each with 4 senses. This brings the number of possible interpretations to 16.
Slide 15
Imagine what happens if there are more senses to be taken into account or if the sentence gets longer. (Word senses listed on the slide: Sukhbir, has, twine, thigh, crease, footpath, sultriness, destroy, door-leaf, silk.)
Slide 16
Anaphora resolution: "The dog ate the bird and it died." Gender conversion. Idioms and phrases.
Slide 17
Named entity recognition: Dr. Plant Singh vs. Dr. Buta Singh. Foreign words vs. spelling variation.
Slide 18
Rhyming reduplication. Other issues: in Indian languages there is no fixed font encoding, so the same word can be written in several ways, composed from different combinations of base characters and vowel signs. (The character combinations shown on the slide are not recoverable from this transcript.)
Slide 19
The transfer pyramid:
Source text → morphological analysis → word structure → syntactic analysis → syntactic structure → semantic analysis → semantic structure → semantic composition → Interlingua.
Interlingua → semantic decomposition → semantic structure → semantic generation → syntactic structure → syntactic generation → word structure → morphological generation → target text.
Shortcuts across the pyramid: direct translation (at the word-structure level), syntactic transfer, and semantic transfer.
Slide 20
Direct approach.
Pros: fast, simple, inexpensive, robust; no translation rules hidden in the lexicon.
Cons: unreliable, not powerful, rule proliferation, requires too much context, major restructuring after lexical substitution.
Slide 21
Transfer approach.
Pros: no need to find a language-neutral representation; relatively fast.
Cons: large number of transfer rules, difficult to extend; proliferation of language-specific rules in lexicon and syntax.
Slide 22
Interlingua approach.
Pros: portable; lexical rules and structural transformations are stated more simply on a normalized representation; explanatory adequacy.
Cons: difficult to deal with terms; deciding what should be added is difficult; what will the universal knowledge format be, and how do we encode it? Must decompose and reassemble concepts.
Slide 23
Corpus-based approaches. Statistical Machine Translation (SMT): every target-language string e is a possible translation of the source-language string f. Every string is assigned a number called its probability, and we select the string with the maximum probability:

e* = argmax_e [Pr(e) · Pr(f | e)]

where f is the source-language string and e is the target-language string. Estimating Pr(e), estimating Pr(f | e), and computing the argmax are known as the language modeling problem, the translation modeling problem, and the search problem, respectively.
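The argmax selection rule above can be sketched in a few lines. This is a minimal illustration with hand-assigned toy probabilities; the candidate strings and scores are invented for the example, not outputs of a real model:

```python
# Noisy-channel SMT selection: e* = argmax_e Pr(e) * Pr(f | e)

def best_translation(candidates, lm, tm, source):
    """Pick the target string maximizing Pr(e) * Pr(source | e)."""
    return max(candidates, key=lambda e: lm[e] * tm[(source, e)])

# Hypothetical language-model and translation-model scores.
lm = {"the house": 0.6, "house the": 0.1}
tm = {("la casa", "the house"): 0.5, ("la casa", "house the"): 0.5}

print(best_translation(["the house", "house the"], lm, tm, "la casa"))
# With equal translation scores, the language model decides: "the house"
```

Here both candidates explain the source equally well, so the language model Pr(e) breaks the tie in favour of the fluent word order.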
Slide 24
Corpus-based approaches. Example-Based Machine Translation (EBMT): translation by analogy. The system is given a set of sentences in the source language and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences into the target language. Hybrid methods combine rule-based and statistical methods.
Slide 25
The Punjabi-to-Hindi machine translation system is a direct translation system based on various lexical resources and a rule base. The system is modular, with a clear separation of data from process. The central idea is to select words from the source language and do the minimal analysis required, such as extracting the root word, the lexical category, and contextual information, i.e. the tokens to the left and right of the current token.
Slide 26
The word sense disambiguation module is called for ambiguous words. Equivalents of each source token in the target language are found in the lexicon and substituted to produce the target text. Rules are then applied to the output to make it appropriate for the target language.
Slide 27
System architecture:
Pre-processing: normalize the source text, then perform tokenization, named entity recognition, and repetitive-construct handling.
Translation engine: look each token up in the lexicon (root word and inflectional form databases). On a miss, transliterate the token. On a hit, if the entry is ambiguous, invoke ambiguity resolution (using the bigram, trigram, and ambiguous-word databases). Append the result to the output and retrieve the next token.
Post-processing: post-editing produces the target text.
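The translation engine's token loop can be sketched as below, under the assumption that the lexicons are modeled as simple dictionaries. All table entries and the two helper modules are hypothetical placeholders; the real system uses Punjabi/Hindi lexicons and the modules described in later slides:

```python
# Sketch of the direct-translation token loop (placeholder data).

ROOT_DB = {"pw1": "hw1"}        # root-word lexicon
INFLECTED_DB = {"pw2": "hw2"}   # inflectional-form lexicon
AMBIGUOUS = {"pw3"}             # words flagged "amb" in the lexicon

def resolve_ambiguity(token, left, right):
    # placeholder for the bigram/trigram disambiguation module
    return "hw3-default"

def transliterate(token):
    # placeholder for the transliteration module
    return token.upper()

def translate(tokens):
    out = []
    for i, tok in enumerate(tokens):
        if tok in AMBIGUOUS:
            out.append(resolve_ambiguity(tok, tokens[:i], tokens[i+1:]))
        elif tok in ROOT_DB:
            out.append(ROOT_DB[tok])
        elif tok in INFLECTED_DB:
            out.append(INFLECTED_DB[tok])
        else:
            out.append(transliterate(tok))  # out-of-vocabulary word
    return out

print(translate(["pw1", "pw2", "pw3", "name"]))
# ['hw1', 'hw2', 'hw3-default', 'NAME']
```

The loop mirrors the architecture: lexicon hit, ambiguity check, and transliteration as the out-of-vocabulary fallback.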
Slide 28
For a given language pair and text type, what kind of system is required is largely an empirical and practical question. General requirements on MT systems, such as modularity, separation of data from processes, reusability of resources and modules, robustness, corpus-based derivation of data, and so on, do not provide conclusive arguments for any one of the models. The available resources are one of the key factors in deciding the approach.
Slide 29
In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology, and word order, the case for abstract syntactic analysis is less convincing. Keeping in view the similarity of the Punjabi-Hindi language pair, a simpler, direct model is the obvious choice for a Punjabi-to-Hindi machine translation system.
Slide 30
The lexicon contains information about the primary component of language, i.e. words. Most NLP applications use dictionaries. For example, morphological analyzers use a lexicon containing morphemes, tagging systems use probability data, parsers use lexical/semantic or co-occurrence information, and MT systems use translation memories and transfer dictionaries.
Slide 31
The bilingual dictionary prepared by the LTRC department of IIIT Hyderabad in ISCII format contains about 22,000 entries. It was adopted and extended for our system and converted into Unicode format. The entries were extended to about 33,000, covering almost all the root words of the Punjabi language.

Root Table {
  Field name: PW   Field Type: Text
  Field name: gnp  Field Type: Text
  Field name: cat  Field Type: Text
  Field name: HW   Field Type: Text
}

(The sample rows shown on the slide are not recoverable in this transcript; ambiguous entries carry the marker "Amb" in the HW field.)
Slide 32
Inflectional Form Table: a table of all the inflected forms of Punjabi root words, along with their roots. The corresponding Hindi words were entered manually. It comprises about 65,000 entries.

Inflectional Form Table {
  Field name: PW    Field Type: Text
  Field name: ROOT  Field Type: Text
  Field name: HW    Field Type: Text
}

where ROOT is an entry from the Root table.
Slide 33
For all ambiguous words in the root table as well as the inflectional form table, the entry for the target word contains the symbol "amb". It triggers the disambiguation process for the given word. A table of ambiguous words is prepared for this purpose, containing the most frequent meaning followed by all other possible meanings of a given word.

Ambiguous Word Table {
  Field name: PW  Field Type: Text
  Field name: s1  Field Type: Text
  Field name: s2  Field Type: Text
}
Slide 34
To help the disambiguation module, bigram and trigram tables are created. They contain the context of ambiguous words along with their meaning in that context and the frequency obtained from a corpus of about 30 lakh (3 million) words.

Bigram Table {
  Field name: PREV1  Field Type: Text
  Field name: PW     Field Type: Text
  Field name: HW     Field Type: Text
  Field name: COUNT  Field Type: Number
}

Trigram Table {
  Field name: PREV2  Field Type: Text
  Field name: PREV1  Field Type: Text
  Field name: PW     Field Type: Text
  Field name: HW     Field Type: Text
  Field name: COUNT  Field Type: Number
}

(The sample rows on the slide carry counts 15, 18, 25, 22, 7 for bigrams and 2, 3, 2, 19, 10, 54 for trigrams; the word forms themselves were lost in this transcript.)
Slide 35
The lexicon also contains a rule base with all the rules needed to handle the grammatical dissimilarities between the two languages during post-processing.

Replacement Table {
  Field name: orgtxt  Field Type: Text
  Field name: reptxt  Field Type: Text
}
Slide 36
The text should be normalized, i.e. there should be only one way to represent a syllable. Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult. For example, the byte for the Latin character A renders as one Gurmukhi glyph under the AnmolLipi font and as a different glyph under the DrChatrikWeb font. This causes a problem while scanning a text.
Slide 37
So the source text is normalized by converting it into Unicode format. This gives us a threefold advantage: first, it reduces the text-scanning complexity; second, it helps in internationalizing the system; third, it eases the transliteration task.
Slide 38
Spelling normalization: a word may be present in the database but with different spellings, like {prkhia} [examination]. Only one variant may appear in the database while the other does not. The purpose of spelling normalization is to find the missing variant. The Soundex technique is used for spelling normalization.
Slide 39
In this technique, a number is assigned to each character of the alphabet, with similar-sounding letters getting the same number. A code is then generated for each string. All strings with the same code are spelling variants of the same string.
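The Soundex-style coding can be sketched as follows. The letter groups here are illustrative Latin stand-ins, not the actual Gurmukhi table from the slides, but the mechanism is the same: similar-sounding letters share a code, so spelling variants collapse to the same code string:

```python
# Soundex-style coding sketch (Latin stand-in letter groups).

GROUPS = {"bp": "c1", "dt": "c2", "kg": "c3", "aeiou": "s"}

def letter_code(ch):
    # Similar-sounding letters share one code.
    for letters, code in GROUPS.items():
        if ch in letters:
            return code
    return ch  # letters outside the table map to themselves

def word_code(word):
    return "".join(letter_code(ch) for ch in word)

# Two spelling variants receive the same code, so either can be
# looked up in the database as a variant of the other.
print(word_code("bat"), word_code("pad"))  # both print 'c1sc2'
```

In the real system the code string (e.g. c31c37sc13sc4 in the next slide) is stored against each database word, so an unseen variant can be matched by code.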
Slide 40
(The slide shows the character-to-code table: each Gurmukhi letter is assigned a code from c1 to c41, with similar-sounding letters sharing a code; the halant carries no code. The Gurmukhi characters themselves were lost in this transcript.)
Slide 41
With this table, the code for the word came out to be c31c37sc13sc4, enabling the system to detect the variant present in the database. For example, if the database contains {prkhia} as a Punjabi word, then the code c31c37sc13sc4 is stored against it. If a user enters a variant spelling as input that is not present in the database, its code is generated on the fly by the system and checked against the database. If the code appears in the database, the corresponding Punjabi word is selected as the spelling variant.
Slide 42
To achieve this, we make use of the information contained in the context, similar to what humans do: a standalone word sense disambiguation module capable of performing its work without any outside help. To start with, all we have is a raw corpus of Punjabi text, so the statistical approach is the obvious choice for us.
Slide 43
We use the words surrounding the ambiguous word to build a statistical language model. This model is then used to determine the meaning of that ambiguous word in new contexts. The basic idea of statistical methodologies is that, given a sentence with ambiguous words, it is possible to determine the most likely sense of each word. One such statistical model is the n-gram model.
Slide 44
An n-gram is simply a sequence of n successive words along with its count, i.e. its number of occurrences in the training data. An n-gram of size 2 is a bigram; size 3 is a trigram; size 4 or more is simply called an n-gram, corresponding to an (n-1)-order Markov model. N-grams are used as probability estimators that estimate the likelihood of a word (or words) following a certain point in a document. What is the optimum value of n?
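Counting n-grams from a token stream can be sketched in a few lines. The tokens here are illustrative English words; the system's actual tables are built from a Punjabi corpus:

```python
# Building n-gram counts from a token stream.
from collections import Counter

def ngrams(tokens, n):
    # Slide n staggered views of the token list against each other.
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the dog ate the homework the dog slept".split()
bigrams = Counter(ngrams(tokens, 2))

print(bigrams[("the", "dog")])  # the bigram "the dog" occurs twice
```

The same `ngrams` helper with n = 3 yields the trigram counts used by the disambiguation tables.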
Slide 45
Consider predicting a word from three sentences. In (1), the prediction can be done with a bigram (2-gram) language model (n = 2), but (2) requires n = 4 and (3) requires n > 9. (The Punjabi example sentences were lost in this transcript.)
Slide 46
The number of words to be considered at the n positions is important. The factors of concern are: the larger the value of n, the higher the probability of getting the correct word sense; i.e., for a general domain, more training data will always improve the result. But on the other hand, most higher-order n-grams do not occur in the training data. This is the problem of data sparseness.
Slide 47
As the training data grows, the size of the model also grows, which can lead to models that are too large for practical use. The total number of potential n-grams scales exponentially with n, and a large n requires a huge amount of memory and time. Does the model get much better if we use a longer word history? Do we have enough data to estimate the probabilities for the longer history?
Slide 48
An experiment to find the optimum value of n for the Punjabi language was performed. Different n-gram models were generated, with n ranging from 1 to 6. It was observed that as the value of n increases, the model's ability to disambiguate a word decreases. This is due to data sparseness.
Slide 49
Another interesting observation is that instead of building and using higher-order n-gram models, we can improve the efficiency of the system tremendously by using the lower-order models jointly. We use the trigram model first to disambiguate a word. If it fails, we move to the lower-order model, i.e. the bigram model. If that also fails, we use the unigram model. With this technique, only 7.96% of words are incorrectly disambiguated. This approach is adopted for the word sense disambiguation module.
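The trigram-then-bigram-then-default backoff described above can be sketched as follows. The table contents are hypothetical stand-ins for the Punjabi entries:

```python
# Backoff word-sense disambiguation: trigram, then bigram, then the
# default (most frequent) meaning. Table entries are placeholders.

TRIGRAM = {("p2", "p1", "amb"): "sense-verb"}
BIGRAM = {("p1", "amb"): "sense-verb"}
DEFAULT = {"amb": "sense-postposition"}  # most frequent meaning

def disambiguate(word, prev1, prev2):
    if (prev2, prev1, word) in TRIGRAM:
        return TRIGRAM[(prev2, prev1, word)]
    if (prev1, word) in BIGRAM:
        return BIGRAM[(prev1, word)]
    return DEFAULT[word]

print(disambiguate("amb", "p1", "p2"))  # trigram hit: 'sense-verb'
print(disambiguate("amb", "x", "y"))    # backoff: 'sense-postposition'
```

Because the tables store only contexts of the less frequent meaning (next slides), a lookup miss falls through to the frequent default, keeping the tables small.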
Slide 50
Three models, viz. unigram, bigram, and trigram, of the ambiguous words, capturing the words in the context of each ambiguous word, are created from a corpus of about 3 million words drawn from different types of articles: essays, stories, editorials, news, novels, office letters, court orders, etc. To reduce the size of the n-gram tables, we retain only those contexts which lead to a less frequent meaning of an ambiguous word.
Slide 51
The idea is to check the contextual information for the least frequent meaning. If that fails to disambiguate, we use the most frequent meaning by default. For example, {d} in Punjabi can be used as a postposition as well as a verb, but its use as a verb is much less frequent. So we place in the database all those bigrams and trigrams that lead to the less frequent meaning, i.e. {d} as a verb.
Slide 52
The table contains all those entries in which the word takes its less frequent meaning. All such meanings are entered manually into the trigram and bigram models. If a word cannot be disambiguated by the bigram and trigram tables, the most frequent meaning is selected by default. There are cases where the previous words in the context lead to one sense but the following words produce the other sense. In such cases, again, the sense with the higher probability is selected. (The sample bigram rows on the slide carry counts 4, 2, 12, 2, and 2; the word forms were lost in this transcript.)
Slide 53
Transliteration is a solution for out-of-vocabulary words. Transliteration is a process wherein an input string in one alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word. If the target language contains all the phonemes used in the source language, the transliteration is straightforward; e.g., the Hindi transliteration of the Punjabi word for "room" is pronounced in essentially the same way.
Slide 54
Sounds missing from or extra in the target language are generally mapped to the most phonetically similar letter. For example, Hindi has letters that carry a double sound, formed as a combination of two sounds; in Punjabi, a single letter is generally used to denote such sounds, as in the word for "alphabet". A single foreign word can also have many different transliterations; e.g., "Mehfooz" can be transliterated in several ways.
Slide 55
Transliteration approaches: direct mapping, rule based, and Soundex based.
Slide 56
Vowel mapping: Punjabi contains 10 vowel symbols and nine dependent vowel sounds. Hindi has one-to-one representations of all the Punjabi vowel symbols and sounds. (The Gurmukhi-to-Devanagari mapping table on the slide was lost in this transcript.)
Slide 57
Consonant mapping: the consonant mapping is shown in the table on the slide (the Gurmukhi and Devanagari characters were lost in this transcript). Some Hindi letters have no corresponding letter in Punjabi, which means those letters can never be produced by a letter-to-letter approach. The same is true of some double-sound-producing letters.
Slide 58
Sub-join mapping: there are three sub-joins (PAIREEN) in Gurmukhi: Haahaa, Vaavaa, and Raaraa. In Punjabi they are represented by the virama (or halant) character before the consonant. A similar virama character is also present in Hindi, indicating that the inherent vowel is omitted (or "killed"). PAIREEN Haahaa and Vaavaa are replaced with full consonants while the preceding consonant is shown in half form, whereas PAIREEN Raaraa takes the position below the preceding consonant in Hindi, similar to Punjabi.
Slide 59
Other symbols: the adhak is used to duplicate the sound of a consonant in Punjabi. No such character is present in Hindi, where sound duplication is represented by the half form of the consonant. Punctuation marks and digits are the same in both scripts. A special character called the visarga is present in Hindi but not in Punjabi, so it can never be produced by a letter-to-letter scheme. Besides this, Gurmukhi has two separate nasal characters, bindi and tippi; Hindi also has two nasal characters, bindi and chandrabindu. Both nasal characters of Punjabi are mapped to a single nasal character in Hindi.
Slide 60
Letter-to-letter mapping produces quite good results, but we can improve them, making the output closer to the target language in terms of spellings and choice of letters, by applying a set of rules.

Rule-based approach: two Punjabi letters have no mapping available in Hindi; they are replaced by their most phonetically equivalent characters. The adhak character in Punjabi shows stress on the next character; no Hindi letter represents it, so a half character is placed before the stressed character instead. There are exceptions to this rule: depending on which character follows the adhak, a different half character is substituted. (The specific Gurmukhi and Devanagari characters in these rules and examples were lost in this transcript.)
Slide 61
Rule for tippi: depending on the following character, the tippi is replaced by the appropriate half nasal consonant. Rule for a character that is transliterated differently when followed or preceded by another: when followed by it, it is omitted from the transliterated text; when preceded by it, it is mapped to a different character. Miscellaneous rules: certain letters and combinations are mapped differently when they appear at the last (or second-to-last) position in a word. (The specific characters in these rules and examples were lost in this transcript.)
Slide 62
The Soundex concept is extended to search for the correct spelling variant of a given transliteration. The transliterations are produced by the methods discussed earlier. We developed a unigram table from a corpus of about 10 million words, and codes are generated for each word in the unigram table. For comparison, the letters of a word are converted into a phonetic code.
Slide 63
This code is then looked up in the unigram table, and the entry with the maximum frequency is selected as the correct variant of the given input. For example, consider the word {arpha} (draft) written in Punjabi. The string {arpha} is produced by the baseline module, and the code 2541483623 is generated for it. This code is looked up in the unigram database, which contains two entries against this code, with frequencies 12 and 8. The string with the higher frequency is selected as the correct output for this input.
Slide 64
(The slide shows an example word transliterated by each method: direct mapping, rule-based enhancement, and Soundex-based enhancement. The word forms were lost in this transcript.)
Slide 65
Rules are applied to cover the minor grammatical differences between the languages. The general structure of the rules is context-dependent replacement: a phrase or word is replaced by its counterpart in a given context. The ordering of the rules does not matter.
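Context-dependent replacement over the rule base can be sketched as below. The rules here are hypothetical Latin placeholders standing in for the orgtxt/reptxt pairs of the replacement table:

```python
# Post-processing via a context-dependent replacement table: each rule
# replaces a source phrase, in context, with its target counterpart.

RULES = [
    ("word-a word-b", "rep-ab"),  # phrase rule: word-a differs before word-b
]

def post_edit(text):
    for org, rep in RULES:
        text = text.replace(org, rep)
    return text

print(post_edit("x word-a word-b y"))  # 'x rep-ab y'
```

Storing the context inside the pattern itself (a whole phrase rather than a single word) is what makes the rules order-independent, as the slide notes.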
Slide 66
(The slide shows an input sentence and its output translation; the text was lost in this transcript.)
Slide 67
Working of the target word generator (the selected word is shown in red on the slide): select the next word; if it needs transliteration, check the vicinity for direct transliteration and transliterate it. Otherwise, look it up in the root database and, if not found, in the inflectional database; on a miss, transliterate. If the entry is ambiguous, look into the trigram table, then the bigram table; if neither resolves it, use the default meaning. Put the Hindi word into the output and repeat until the sentence is complete, then stop.
Slide 68
(The same flowchart, with the next word selected.)
Slide 69
(The same flowchart, with the next word selected.) Three trigrams are possible for this word. Only the 2nd and 3rd trigrams can resolve the ambiguity; the meaning with the higher count is selected. In this case both trigrams produce the same result.
Slide 70
(The same flowchart, with the next word selected.)
Slide 71
Slide 72
Slide 73
Slide 74
Slide 75
The ultimate goal of the MT fraternity is to develop MT systems and integrate them in order to break the language barrier, so that any person can obtain the information they need in their own native language and share their knowledge across the world without learning other languages.
Slide 76
Slide 77
THANKS!!