
Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Date post: 26-Jun-2015
Upload: content-savvy
Description:
Analyzing language usage on the internet with data mining, natural language processing and text analytics and the challenges ahead.
Transcript
Page 1: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

3 June, 200[?]

Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Rohini K. Srihari, The State University of New York at Buffalo

Janya Inc.

Multilingual 2.0, Tucson, AZ, April 13, 2012

Page 2: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Language Usage on the Internet

Volume on Chinese microblogging site Weibo has surpassed Twitter on multiple occasions

Page 3: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Language Usage on the Internet

http://www.internetworldstats.com/stats7.htm

Page 4: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Trending Phrases/Meme Detection

Monitored Urdu news throughout 2011

Compared news from around the days of the Osama Bin Laden killing (May 1-[?]) with the articles year to date. Based on this, extract significant new phrases.

Illustrates other news besides the Bin Laden killing, e.g. killing of Farooq Baig, leader of Muttahida Qaumi Movement.

Note: currently using only content analysis to detect trending phrases. True meme detection requires social network features as well.
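The comparison described on this slide (a short news window against the year-to-date articles) could be sketched roughly as follows. This is a minimal illustration, not the system used in the talk: the bigram unit, the add-one smoothing, the `min_count` threshold and the function names are all assumptions, and the sample sentences are made-up English stand-ins for Urdu news.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs as space-joined strings."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def phrase_counts(docs):
    """Count bigrams over whitespace-tokenized, lowercased documents."""
    counts = Counter()
    for doc in docs:
        counts.update(bigrams(doc.lower().split()))
    return counts


def trending_phrases(window_docs, background_docs, min_count=2, top_k=5):
    """Rank window phrases by smoothed frequency ratio against the background.

    Assumption: a phrase is "significant and new" when its add-one smoothed
    relative frequency in the window is much higher than in the background.
    """
    win = phrase_counts(window_docs)
    bg = phrase_counts(background_docs)
    win_total = sum(win.values())
    bg_total = sum(bg.values())
    vocab = len(set(win) | set(bg)) or 1
    scored = []
    for phrase, count in win.items():
        if count < min_count:
            continue  # ignore phrases seen too rarely to be reliable
        p_win = (count + 1) / (win_total + vocab)
        p_bg = (bg[phrase] + 1) / (bg_total + vocab)
        scored.append((p_win / p_bg, phrase))
    # Highest ratio first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:top_k]]


# Made-up sample data (hypothetical, for illustration only).
window = [
    "bin laden killed in raid",
    "bin laden raid confirmed",
    "leaders react to bin laden news",
]
background = [
    "cricket match played in lahore",
    "prime minister visits lahore",
    "cricket match postponed",
]
print(trending_phrases(window, background))
```

As the slide notes, this uses content only; true meme detection would also need social network features, which this sketch does not model.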

Page 5: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Factual/Topical Analysis of News

Page 6: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Analyzing Potential Bias in Media

Non-Topical text analysis: characterizations are sought of the opinions, feelings, and attitudes expressed in a text, rather than just of the topics the text is about

Facilitates analysis of how media reports events:

• Summarize how the press has shifted its attitude towards India over the past year
• Show how different regions (Sindh, Peshawar, Frontier territory) differ in their perception of the current administration
• What are the main issues being reported?

Page 7: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

[anSary nE kha myry ray^E myN eamr shyl ayk bd dmaG awr Zdy XKS hyN]

[Ansari said, “according to me Aamir Sohail is one crazy and stubborn man”]

TARGET Attributes ID:t1

AGENT Attributes ID:a1 Nested-source: “w” TargetID:t1

ATTITUDE Attributes ID:a1[?]1, AttitudeType: negative Intensity: high, Negative-toward: t1 Positive-toward: null

EXPRESSIVE ELEMENT Attributes ID:ex1, TargetID:t1, Emotion:anger Intensity:high, Nested-Source: “w”, a1, Polarity:negative

Non-Topical Analysis

• Agent: Opinion holder
• Target: Target of the opinion being expressed (a topic, a person, an organization, etc.)
• Attitude: includes Expressive Element
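The agent/target/attitude annotation above could be held in a small data structure such as the one below. This is a sketch only: the class and field names are assumptions that mirror the attribute labels on the slide (ID, TargetID, Nested-source, Emotion, Intensity, Polarity), and which text span each ID covers ("Ansari" for a1, "Aamir Sohail" for t1) is inferred from the example sentence rather than stated in the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Target:
    id: str
    text: str  # assumed span text; the slide shows only the ID


@dataclass(frozen=True)
class Agent:
    id: str
    text: str  # assumed span text; the slide shows only the ID
    nested_source: Tuple[str, ...]
    target_id: str


@dataclass(frozen=True)
class ExpressiveElement:
    id: str
    target_id: str
    emotion: str
    intensity: str
    polarity: str
    nested_source: Tuple[str, ...]


def opinion_triple(agent, target, element):
    """Return (holder, polarity, target) after checking that the IDs link up."""
    if agent.target_id != target.id or element.target_id != target.id:
        raise ValueError("agent and expressive element must point at the same target")
    return (agent.text, element.polarity, target.text)


# Attribute values copied from the slide; "w" denotes the writer.
target = Target(id="t1", text="Aamir Sohail")
agent = Agent(id="a1", text="Ansari", nested_source=("w",), target_id="t1")
element = ExpressiveElement(
    id="ex1",
    target_id="t1",
    emotion="anger",
    intensity="high",
    polarity="negative",
    nested_source=("w", "a1"),
)
print(opinion_triple(agent, target, element))
```

Keeping the nested source as an ordered tuple records that the writer is reporting what Ansari said, rather than expressing the opinion directly.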

Page 8: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

www.janyainc.com

FACETED SEARCH: DRILL DOWN TO RELEVANT CONTENT/DATA

People are filled with anger and sorrow because of the policies made by Musharaf.

OPINION HOLDER – Writer, People

TARGET – Musharaf’s policies (Musharaf is an implied target)

Page 9: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Human Behavior Analysis

• Process social media content, provide tools for analysts to:
  • Identify social networks: groups, members
  • Identify topics of discussion and sentiment
    • E.g. angry at govt., wanting retaliation, peacemakers
  • Thought influencers
  • Identify social goals through analysis of verbal communication
    • Manipulation: persuasion, threats, coercion
    • Religious supremacy: religious analogues
    • Recruitment

Social Media Content

Link Diagrams

Predictive Modeling

Page 10: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Challenges

Page 11: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Can’t just translate everything to English!

Google Translation:

“However, Education Minister Mohammad Hanif half-dead recently in Afghanistan, [?] million children of school textbooks have been distributed to some hope.”

Context Aware Translation (name translation output: Google translation augmented by Semantex™ translation of names):

“However, Education Minister Mohammad Haneef Atmar recently in Afghanistan, [?] million children of school textbooks have been distributed to some hope.”

Native language processing required for fine-grained analysis!

Page 12: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”

Arabic dialects are not handled well in current machine translation systems.

COLABA enables MSA (Modern Standard Arabic) tools to interpret dialects correctly.

Tools made for MSA fail on Arabic dialects

Arabic Variant | Arabic Source Text | Google Translate
Egyptian | [Arabic text not recoverable] | Atqtat electrical wires, Why are Posted?
Levantine | [Arabic text not recoverable] | Cklo Mafeesh [untranslated Arabic word], Lech heck?
Iraqi | [Arabic text not recoverable] | [?]u MACON electricity, good?
MSA | [Arabic text not recoverable] | Does not have electricity, what happened?

Page 13: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Web as Corpus: Mining Wikipedia for Language Resources

• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing
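The lexicon extraction described above could look roughly like this once the cross-language links have been pulled out of a Wikipedia dump. It is a sketch under assumptions: the input record shape (`title` plus a `langlinks` dictionary), the function names, and the choice to strip parenthetical disambiguators are illustrative and not taken from the talk. The sample pages are hand-written stand-ins, not real dump output.

```python
import re


def strip_disambiguation(title):
    """Drop a trailing parenthetical such as ' (planet)' from an article title."""
    return re.sub(r"\s*\([^)]*\)$", "", title)


def build_lexicon(pages, target_lang="en"):
    """Map each source-language title to its linked target-language title.

    `pages` is assumed to be an iterable of dicts with a 'title' string and a
    'langlinks' dict keyed by language code; pages without a link are skipped.
    """
    lexicon = {}
    for page in pages:
        target_title = page["langlinks"].get(target_lang)
        if target_title:
            lexicon[page["title"]] = strip_disambiguation(target_title)
    return lexicon


# Hand-written sample records (hypothetical input format).
sample_pages = [
    {"title": "北京市", "langlinks": {"en": "Beijing"}},
    {"title": "水星", "langlinks": {"en": "Mercury (planet)"}},
    {"title": "无英文链接的条目", "langlinks": {}},
]
print(build_lexicon(sample_pages))
```

Because the lexicon is a pure function of the dump, regenerating it for a newer version of Wikipedia means rerunning `build_lexicon` on the new records.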

Page 14: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Code Mixing, Switching

• Use of Latin script: lack of transliteration standards makes it difficult to process

• Urdish, Spanglish, Hinglish etc.

Afsoos key baat hai. kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open. [It is sad to see that those words that even a non muslim would fear to utter until yesterday, this man has brought it out in the open]

Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system
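The second solution (route longer English spans to an English NLP system) could be sketched as below. This is only an illustration: a small wordlist stands in for the learned language ID that the slide describes, and the pipeline names `english_nlp` and `romanized_tagger`, the `min_english_len` threshold, and all function names are hypothetical.

```python
# Tiny stand-in for a learned language identifier (assumption, not the talk's model).
ENGLISH_WORDS = {
    "this", "man", "has", "brought", "it", "out", "in", "the", "open",
    "non", "muslim",
}


def token_languages(tokens, english_words=ENGLISH_WORDS):
    """Label each token 'en' if it is in the wordlist, otherwise 'ur'."""
    return ["en" if tok.lower().strip(".,") in english_words else "ur" for tok in tokens]


def language_spans(tokens, langs):
    """Group consecutive tokens that share a language label."""
    spans = []
    for tok, lang in zip(tokens, langs):
        if spans and spans[-1][0] == lang:
            spans[-1][1].append(tok)
        else:
            spans.append((lang, [tok]))
    return spans


def route_spans(spans, min_english_len=4):
    """Send long English spans to the English pipeline, the rest to the romanized tagger."""
    routed = []
    for lang, toks in spans:
        long_english = lang == "en" and len(toks) >= min_english_len
        pipeline = "english_nlp" if long_english else "romanized_tagger"
        routed.append((pipeline, " ".join(toks)))
    return routed


tokens = "kartay hoay dartay thay abhi this man has brought it out in the open".split()
spans = language_spans(tokens, token_languages(tokens))
for pipeline, text in route_spans(spans):
    print(pipeline, "->", text)
```

Short English insertions such as "Non Muslim" fall below the length threshold and stay with the romanized tagger, which is where combining evidence from both taggers would matter most.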

Page 15: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

Language Resource Acquisition

Less Commonly Taught Languages (LCTL)
• Yoruba, Russian, Swahili
• Dialects

Very few linguistic resources available
• electronic lexicons
• translation lexicons
• part-of-speech taggers, chunkers

• Typically, very expensive to produce these resources by hand

• The web provides a new opportunity to automatically acquire these resources: “web as corpus”

Page 16: Multilingual Text Mining: Lost in (Machine) Translation, Found in Native Language Mining

The Road Ahead?

Context, Context, Context!

blog.gts-translation.com

