Date post: | 01-Nov-2014 |
Upload: | shuyo-nakatani |
Short Text Language Detection with Infinity-Gram
2012-05-14, NAIST Seminar
Nakatani Shuyo, Cybozu Labs, Inc.
Agenda
• Language Detection
• Proposed Method
– Maximal Substring
• Corpus
• Implementation and Evaluation
• Conclusions
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 4
Language Detection
In What Language?
• Ik kan er nooit tegen als mensen me negeren
• Aha ich seh angeblich süß aus
• Czy mógłbym zasnąć w przedmieściach Twoich myśli
• Ah Tak Så skal jeg bare finde ud af hvordan
• Det er ikke så digg nei å vi som har finale til helgaSkrekk og gru Takk )
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen
• Çok doğru En büyük hatayı yaptım
• Încântat de cunoștință
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung
Hints
• Dutch if there is "ik"
• German if there is "ich" or the letter ß
• Polish if there is "czy" or the letters Ł, ń, ś or ź
• Scandinavian if there is the letter å
– Danish if there is "af"; "Tak" means thanks
– Norwegian if there is "nei"; "Takk" means thanks
– Swedish if there is "och"; "Tack" means thanks
• Turkish if there is the letter ı (i without a dot) or ğ
• Romanian if there is the letter ă, ș or ț
– Although ă is also used in Vietnamese, it is easy to distinguish them
– Although ş is also used in Turkish, it is easy to distinguish them
• Vietnamese if there are many unreadable letters on WinXP :P
In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli → Polish
• Ah Tak Så skal jeg bare finde ud af hvordan → Danish
• Det er ikke så digg nei å vi som har finale til helgaSkrekk og gru Takk ) → Norwegian
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen → Swedish
• Çok doğru En büyük hatayı yaptım → Turkish
• Încântat de cunoștință → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung → Vietnamese
What's Language Detection?
• To detect what language the input text is written in
– "Time flies like an arrow" → English
– "Buona sera" → Italian
• It is a prerequisite for many language processing tasks
– A language model is built for each language
– Text search, classification, extraction, translation
• Long enough, noiseless text can be detected with more than 99% accuracy [Cavnar+ 94]
– A 3-gram model is used in many methods
SPAM or not?
• To decide, it is necessary to know that the text is written in Polish
Document Categorization with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
– where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
– $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
– where $p(C_k)$ is the prior for the category
Language Detection with Naive Bayes Classifier
• Document categorization with language labels
– Categorize documents into English, Japanese and so on
• Use character n-grams as features
– Unicode code point n-grams, strictly speaking
– Assume the character encoding of the document is already known
• Most applications know the encoding of their internal text data
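
The classifier described on the last two slides can be sketched in a few lines of Python. This is only an illustration, not the langdetect implementation: the class name `NaiveBayesLanguageDetector`, the space padding at word boundaries and the additive smoothing constant `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Toy naive Bayes classifier over character n-grams (hypothetical helper)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # additive smoothing: an assumption, not from the slides
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> number of documents
        self.vocabulary = set()

    def train(self, text, lang):
        grams = char_ngrams(text.lower(), self.n)
        self.ngram_counts[lang].update(grams)
        self.vocabulary.update(grams)
        self.doc_counts[lang] += 1

    def detect(self, text):
        grams = char_ngrams(text.lower(), self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log p(C_k) + sum_i log p(X_i | C_k)
            score = math.log(self.doc_counts[lang] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


detector = NaiveBayesLanguageDetector()
detector.train("this is the thing that they think", "en")
detector.train("ich bin der Meinung dass das nicht geht", "de")
```

With this toy training data, `detector.detect("the thing")` picks `en` and `detector.detect("das ich nicht")` picks `de`. A real detector needs far more training text per language.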
Why Use n-Grams to Detect Language?
• Each language has its own characters and spelling rules
– "é" is often used in Spanish, Italian and so on, but not in English in principle
– There are many words which start with "Z" in German but not in English
– There are many words which start with "C" in English but not in German
– The spelling "Th" is often used in English but not in other languages
n-grams of the word "This" (_ denotes a word boundary):
T, h, i, s ← 1-gram
_T, Th, hi, is, s_ ← 2-gram
_Th, Thi, his, is_ ← 3-gram
         C     L     Z     Th
English  0.75  0.47  0.02  0.74
German   0.10  0.37  0.53  0.03
French   0.38  0.69  0.01  0.01
language-detection (langdetect) (Nakatani 2010)
• Language detection library for Java
– http://code.google.com/p/language-detection/
– Apache License 2.0
– Character 3-gram + Bayesian filter
– Various normalizations + feature sampling
• Over 99% precision for 53 languages
– Trained with Wikipedia abstracts
– Wide support, including Asian languages
– Adopted by Apache Solr
Evaluation with News Text
• Test on news text crawled from the web in 49 languages
Language                   Size  Correct  Accuracy
af Afrikaans               200   199      99.50%
ar Arabic                  200   200      100.00%
bg Bulgarian               200   200      100.00%
bn Bengali                 200   200      100.00%
cs Czech                   200   200      100.00%
da Danish                  200   179      89.50%
de German                  200   200      100.00%
el Greek                   200   200      100.00%
en English                 200   200      100.00%
es Spanish                 200   200      100.00%
fa Persian                 200   200      100.00%
fi Finnish                 200   200      100.00%
fr French                  200   200      100.00%
gu Gujarati                200   200      100.00%
he Hebrew                  200   200      100.00%
hi Hindi                   200   200      100.00%
hr Croatian                200   200      100.00%
hu Hungarian               200   200      100.00%
id Indonesian              200   200      100.00%
it Italian                 200   200      100.00%
ja Japanese                200   200      100.00%
kn Kannada                 200   200      100.00%
ko Korean                  200   200      100.00%
mk Macedonian              200   200      100.00%
ml Malayalam               200   200      100.00%
mr Marathi                 200   200      100.00%
ne Nepali                  200   200      100.00%
nl Dutch                   200   200      100.00%
no Norwegian               200   199      99.50%
pa Punjabi                 200   200      100.00%
pl Polish                  200   200      100.00%
pt Portuguese              200   200      100.00%
ro Romanian                200   200      100.00%
ru Russian                 200   200      100.00%
sk Slovak                  200   200      100.00%
so Somali                  200   200      100.00%
sq Albanian                200   200      100.00%
sv Swedish                 200   200      100.00%
sw Swahili                 200   200      100.00%
ta Tamil                   200   200      100.00%
te Telugu                  200   200      100.00%
th Thai                    200   200      100.00%
tl Tagalog                 200   200      100.00%
tr Turkish                 200   200      100.00%
uk Ukrainian               200   200      100.00%
ur Urdu                    200   200      100.00%
vi Vietnamese              200   200      100.00%
zh-cn Simplified Chinese   200   200      100.00%
zh-tw Traditional Chinese  200   200      100.00%
total                      9800  9777     99.77%
Evaluation with Europarl Datasets
• Test on 1000 samples for each language from the Europarl Parallel Corpus
– from the proceedings of the European Parliament
– http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
Language       Size   Correct  Accuracy
bg Bulgarian   1000   988      98.8%
cs Czech       1000   994      99.4%
da Danish      1000   968      96.8%
de German      1000   998      99.8%
el Greek       1000   1000     100.0%
en English     1000   996      99.6%
es Spanish     1000   996      99.6%
et Estonian    1000   996      99.6%
fi Finnish     1000   998      99.8%
fr French      1000   999      99.9%
hu Hungarian   1000   999      99.9%
it Italian     1000   999      99.9%
lt Lithuanian  1000   997      99.7%
lv Latvian     1000   999      99.9%
nl Dutch       1000   974      97.4%
pl Polish      1000   999      99.9%
pt Portuguese  1000   996      99.6%
ro Romanian    1000   999      99.9%
sk Slovak      1000   988      98.8%
sl Slovene     1000   976      97.6%
sv Swedish     1000   991      99.1%
total          21000  20850    99.3%
Language detection is already solved, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
– http://code.google.com/p/chromium-compact-language-detector/
– regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
– http://tika.apache.org/
– Evaluated on the 15 languages in our tweet corpus which Tika supports
Language       LD    CLD   Tika
ca Catalan     95.3  93.0  83.8
cs Czech       96.3  96.6  ----
da Danish      94.5  90.7  58.7
de German      86.6  96.8  73.1
en English     88.3  97.4  54.7
es Spanish     91.5  90.5  44.4
fi Finnish     98.9  99.4  94.8
fr French      95.0  94.5  67.4
hu Hungarian   85.8  89.0  76.2
id Indonesian  89.7  92.8  ----
it Italian     96.2  93.8  87.1
nl Dutch       69.5  93.2  65.0
no Norwegian   96.0  74.9  68.6
pl Polish      98.0  97.8  88.8
pt Portuguese  88.0  88.6  47.4
ro Romanian    92.8  96.1  82.6
sv Swedish     96.0  96.4  75.6
tr Turkish     97.6  97.4  ----
vi Vietnamese  98.7  98.9  ----
total          92.2  93.8  70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
– http://code.google.com/p/chromium-compact-language-detector/
– Implementation in C++, with a Python binding
– Number of supported languages: CLD = 76, langdetect = 53
– Accuracy: CLD = 98.82%, langdetect = 99.22%
• for 17 languages on the Europarl datasets
• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
– At most 140 characters on twitter
– URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
– Graph features based on 3-grams
• Adds long-distance features
• 95-98% accuracy for twitter language detection
• 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection Difficult? (2)
• Tweets are too noisy
– Spellings that violate the language's orthography often appear
– Acronyms, abbreviations, lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be small under a normal language model
OMG: Oh My God
LOL: Laughing Out Loud
LMAO: Laughing My Ass Off
F4F: Follow for Follow
MDR: Mort de Rire (French)
TKT: Ne t'inquiète pas (French)
u: you
ur: your
4: for
i0u: I love you
k: che (Italian)
anke: anche (Italian)
The letter k isn't used in Italian
Motivation to Detect the Language of Short Text
• There are many small chunks of text besides twitter
– Schedules, search queries, bulletin boards and so on
– There are many questions about short text detection on the issue tracker of the langdetect project
• http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed multi-language text
– Cut the target document into paragraphs or lines
– Detect the language of each short text
Our Goal
• Over 99% accuracy
– However, a one-word sentence is too difficult to detect
– Our goal is 99%+ accurate detection for sentences with more than 3 words
We Need
• A model that can extract rich features from short text
– Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to construct it
– a corpus of about 700K tweets with language labels
Proposed Method
How to Increase Features beyond 3-Grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
– But the number of features is of order $O(T^2)$
Cumulative distribution of feature length for 5090 normalized English tweets (300KB):
n of n-gram  freq≥1  freq≥2  freq≥10
1            79      72      57
2            1896    1533    902
3            15970   10369   4525
4            64966   33941   10534
5            167543  69719   15538
6            323749  107861  18970
7            524634  142954  21093
8            760719  171995  22159
9            921361  193995  22696
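
A table like the one above can be reproduced by brute force on a small corpus. The sketch below is an assumption about how such counts could be computed, not the script used for the slides: it counts every n-gram up to `max_n` and reports, for each n, how many distinct n-grams of length at most n reach a minimum frequency. The function name `cumulative_ngram_counts` is hypothetical.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct n-grams of length <= n with frequency >= min_freq."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for gram, c in counts.items() if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    ]
```

For the single text "abab", the 1-grams are a and b (twice each) and the 2-grams are ab (twice) and ba (once), so the cumulative counts are 2 and 4 at `min_freq=1` and 2 and 3 at `min_freq=2`. The work grows quickly with `max_n`, which is why the slides move on to maximal substrings instead of enumerating everything.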
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
– Maximal substrings give an equivalent model that can be constructed in linear time
– Features are stored in a trie for fast prediction
Maximal Substring (1)
• Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra"
– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
Strictly speaking, the relation is defined together with the position inside the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
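
The definition can be checked by brute force on a short string. The following sketch is a quadratic-time illustration, not the linear-time construction of [Okanohara+ 09]: it treats a substring as maximal when no one-character extension to the left or right occurs with the same frequency. The function names are hypothetical.

```python
from collections import Counter


def substring_counts(text):
    """Count every occurrence of every non-empty substring (overlaps included)."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            counts[text[i:j]] += 1
    return counts


def maximal_substrings(text):
    """Return substrings that cannot be extended without losing occurrences."""
    counts = substring_counts(text)
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # If some one-character extension has the same frequency,
        # sub always occurs inside that longer string and is not maximal.
        extendable = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet
        )
        if not extendable:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))
```

`maximal_substrings("abracadabra")` returns `['a', 'abra', 'abracadabra']`, matching the slide. Checking one-character extensions is enough because frequency can only decrease as a substring grows.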
Maximal Substring and Infinity-Gram
• Substrings in a containment relation always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Hence logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Although the equivalence breaks down on the test set, we assume that it holds approximately given a sufficiently large training set
Extended Suffix Array
• An extended suffix array consists of
– SA = suffix array
– L = longest common prefixes
– B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L > 0 whose BWT has more than one character type
– They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/
via [Okanohara+ 09]
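
To make the three arrays concrete, here is a naive Python sketch that builds them by sorting suffixes. It is far from the linear-time construction that esaxx provides and does not use the esaxx API; the function names, and the choice to let the BWT wrap around instead of appending a sentinel, are assumptions made for illustration.

```python
def suffix_array(text):
    """SA: start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """L: lcp[k] is the common prefix length of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp


def bwt(text, sa):
    """B: the character preceding each sorted suffix (wraps around for suffix 0)."""
    return "".join(text[i - 1] for i in sa)


text = "abracadabra"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
b = bwt(text, sa)
```

For "abracadabra" the suffixes "abra" and "abracadabra" are adjacent in SA with a common prefix of length 4, and the characters preceding them differ ("d" versus the wrapped-around "a"), which is the condition that marks "abra" as a repeated maximal substring.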
Corpus and Normalization
Target Languages
• Limit the character type to detect
– In short text detection, mixed text can be split by character type
• Latin alphabet languages
– The most difficult script in which to detect the language
– There are more than 25 languages with over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
– å, ą, æ, ð, Ħ, ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range        Name                         Notes
U+0000-007F  Basic Latin                  ASCII
U+0080-00FF  Latin-1 Supplement           Most languages are covered by this block and Latin Extended-A
U+0100-017F  Latin Extended-A             Most languages are covered by this block and Latin-1 Supplement
U+0180-024F  Latin Extended-B             Romanian
U+0250-02AF  IPA Extensions
U+0300-036F  Combining Diacritical Marks  for composing tone marks
U+1E00-1EFF  Latin Extended Additional    Vietnamese
U+2C60-2C7F  Latin Extended-C             Used by almost no present-day language
U+A720-A7FF  Latin Extended-D             Used by almost no present-day language
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin letters; legend: used often, used sometimes, used for Vietnamese only]
How to Create the Corpus
• Collect tweets with the sample method of the twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
– French tweets account for about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone
(20% of all tweets have no timezone)
• Annotate tentative labels to the tweets using langdetect
– Remove non-French tweets from those labeled 'fr'
– Recover French tweets from those not labeled 'fr'
How to Annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
Language       Training  Test
ca Catalan     9089      5082
cs Czech       9082      7682
da Danish      7388      5524
de German      44448     10065
en English     44520     10168
es Spanish     44118     10265
fi Finnish     8087      7050
fr French      44339     10098
hu Hungarian   10030     4904
id Indonesian  44722     10181
it Italian     43366     10152
nl Dutch       44682     10007
no Norwegian   10124     8496
pl Polish      16771     10152
pt Portuguese  44215     10208
ro Romanian    10021     5911
sv Swedish     44054     10032
tr Turkish     44703     10308
vi Vietnamese  15030     10488
total          538789    166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
– It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
– data size bias
– language-specific bias
– twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest one
– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
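
The approximation in the last bullet can be sketched as follows. This is an illustrative reading of the slide, not code from ldig: `level_out` is a hypothetical helper that takes a dict from language code to a non-empty list of tweets, copies each list a whole number of times and fills the remainder by sampling without replacement.

```python
import random


def level_out(corpus_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    target = max(len(tweets) for tweets in corpus_by_lang.values())
    leveled = {}
    for lang, tweets in corpus_by_lang.items():
        copies, rest = divmod(target, len(tweets))
        # whole copies of the data, plus `rest` distinct tweets for the remainder
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {
    "en": ["tweet %d" % i for i in range(10)],
    "da": ["dansk %d" % i for i in range(3)],
}
leveled = level_out(corpus)
```

Here the Danish list is copied three times and one extra tweet is drawn, so both languages end up with 10 entries.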
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'
Language              Upper case  Lower case
Turkish, Azerbaijani  I (U+0049)  ı (U+0131)
                      İ (U+0130)  i (U+0069)
Others                I (U+0049)  i (U+0069)
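
A minimal sketch of this rule in Python, assuming that only the capital letter I (U+0049) is exempted from lowercasing; how ldig itself treats İ (U+0130) is not stated on the slide, so the helper below (a hypothetical name) makes no special case for it.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower case is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

For example, `lower_except_capital_i("ISTANBUL Is Big")` gives `"Istanbul Is big"`: the ambiguous I is left for the model to interpret, while all other letters are folded.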
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of characters for s/t with a "beard"
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 code points have the same glyph in some fonts
– Indistinguishable
ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
– 1989: Democratization of Romania
– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Romania joined the EU
– 2007: Windows Vista supported 's/t with comma' (available for everyone)
's/t with cedilla' is used on an advertisement board in Bucharest
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
– But they are more popular than the correct ones
– with cedilla : with comma = 2 : 1
– The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' さ さ
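
The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. The sketch below shows one way to express it with `str.translate`; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical and this is not taken from the ldig source.

```python
# Map the comma-below letters onto the cedilla letters that dominate real text.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Fold the two encodings of Romanian s/t into the cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)
```

After this step, "cunoștință" typed with either pair of code points yields the same features.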
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
– Wikipedia uses only ی (U+06CC, Farsi yeh)
– News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone marks are also used in general documents like news
• The tone marks can be attached to all vowels
– 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
1. Use U+1EA0 - U+1EF9
• ẵ = U+1EB5
2. Combine with diacritical marks
• ẵ = U+0103 U+0303
– Half and half on news and tweets
• Normalize 2 into 1
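
In Python, turning representation 2 into representation 1 matches Unicode canonical composition (NFC). The sketch below uses the standard `unicodedata` module; whether ldig uses NFC or its own mapping table is not stated on the slide, so treat this as one possible implementation with a hypothetical function name.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed code points."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # a with breve (U+0103) followed by combining tilde (U+0303)
composed = normalize_vietnamese(combined)
```

The two-code-point sequence becomes the single code point U+1EB5, so both spellings of ẵ produce identical n-grams.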
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
• e.g. 谢谢, Kanji for person names
– Common frequent characters are too strong
• e.g. a text which has "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K = 50 (size of ASCII alphabet = 52)
– (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
• Simplified Chinese: 现代汉语常用字表 (3500)
• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
• Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
– 常用漢字 doesn't contain many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons made of letters, like XD and :p
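
The removals listed above can be sketched with a few regular expressions. The patterns below are simplified assumptions (real URLs, mentions and emoticons are more varied) and are not the expressions used in ldig; `strip_twitter_noise` is a hypothetical name.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, ;D (a deliberately small set)
FACE_RE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and letter emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind
```

For example, `strip_twitter_noise("RT @user check http://t.co/abc #fun XD so cool")` leaves only `"check so cool"`.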
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
– Unsupervised normalization like coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
– There is no language whose orthography has 3 or more repetitions of the same Latin letter
• In Japanese there are "かたたたき", "かわいいいぬ", "あわててて" and so on
• Acronyms (like WWW, СССР) are not useful for language detection
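
Case 2 fits in a single regular expression with a backreference. The sketch below reads "more than 3" as "3 or more" repetitions, which is an interpretation of the slide; the name `shrink_repeats` is hypothetical.

```python
import re

# A character followed by at least two more copies of itself
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT_RE.sub(r"\1\1", text)
```

`shrink_repeats("coooooooollllll")` gives `"cooll"`, which is not a dictionary word but is at least the same string for every lengthened variant of "cool", while `"cool"` itself is left unchanged.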
Laugh Normalization
• There are various laughs in each language
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi ) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
– hahahha ⇒ haha
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
– ∞-gram LR (maximal substring) [Okanohara+ 09]
– L1 SGD (cumulative penalty) [Tsuruoka+ 09]
– Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
– Extract features from the corpus and initialize the model
– -m: model directory
– -x: path of the maximal substring extractor (executed as an external process)
– --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
– Input is multi-line text
• Replace TABs with " " and line feeds with U+0001 in it
– Output as "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
– Learn the model from the corpus with 1 cycle of SGD
– -e: learning rate of SGD
– -r: coefficient of L1 regularization
– --wr: how many times to regularize all the parameters
• There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
– Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
– Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
– [correct label]\t[meta data]\t[text]
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
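
A reader for this format can be sketched as below. It assumes that fields are separated by tabs, that the label is the first field and the text the last, and that everything in between is metadata; `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # may be empty strings or several columns
    return label, metadata, text


# Hypothetical example lines: one with empty metadata, one with several columns.
simple = parse_line("en\t\tu should just enjoy ur vacation sadly\n")
detailed = parse_line("ca\t123\t2012-05-14\tuser1\ten\tno mextranya\n")
```

Taking the text from the last field keeps the parser working whether the metadata is a single empty column or several columns such as status ID, datetime, user ID and UI language.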
Usage (5) Evaluation Tool
• server.py -m [model] -p [port number]
– Open http://localhost:[port]/ after it is started
– For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters
Evaluation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
Language       Size    Detect  Correct  Precision  Recall  LD53  LDsm
ca Catalan     5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech       7681    7668    7663     99.93      99.77   96.3  99.7
da Danish      5516    5472    5310     97.04      96.27   94.5  92.4
de German      10060   10069   10006    99.37      99.46   86.6  93.8
en English     10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish     10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish     7051    7038    7024     99.80      99.62   98.9  99.6
fr French      10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian   4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian  10178   10225   10160    99.36      99.82   89.7  98.9
it Italian     10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch       10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian   8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish      10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese  10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian    5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish     10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish     10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese  10487   10480   10474    99.94      99.88   98.7  99.2
total          166711  -       165053   -          99.01   92.2  97.4
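
The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, and the note about undetectable texts corresponds to a probability threshold. The sketch below restates both; the function names and the dict-of-probabilities input are assumptions for illustration, not the ldig interface.

```python
def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang, prob = max(probabilities.items(), key=lambda kv: kv[1])
    return lang if prob >= threshold else None


def precision_recall(size, detect, correct):
    """Precision and recall in percent, rounded to 2 decimals as in the table."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


catalan = precision_recall(size=5093, detect=4923, correct=4857)
```

Plugging in the Catalan row reproduces the table values 98.66 and 95.37, and a text scored as `{"en": 0.55, "nl": 0.45}` would count as undetectable.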
Evaluation on the LIGA Dataset
• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
Uses the 19-language model
Language    Size  Detect  Correct  Precision  Recall
de German   1479  1476    1469     99.5       99.3
en English  1505  1502    1490     99.2       99.0
es Spanish  1562  1548    1541     99.6       98.7
fr French   1551  1549    1540     99.4       99.3
it Italian  1539  1531    1528     99.8       99.3
nl Dutch    1430  1429    1424     99.7       99.6
total       9066  -       8992     -          99.2
Evaluation on the Europarl Dataset
ldig is evaluated only on the languages it supports
                      ldig            langdetect      CLD
Language       Size   Correct  Rate   Correct  Rate   Correct  Rate
bg Bulgarian   1000   -        -      988      98.8   991      99.1
cs Czech       1000   1000     100.0  994      99.4   995      99.5
da Danish      1000   976      97.6   968      96.8   932      93.2
de German      1000   999      99.9   998      99.8   1000     100.0
el Greek       1000   -        -      1000     100.0  1000     100.0
en English     1000   999      99.9   996      99.6   1000     100.0
es Spanish     1000   1000     100.0  996      99.6   989      98.9
et Estonian    1000   -        -      996      99.6   998      99.8
fi Finnish     1000   997      99.7   998      99.8   1000     100.0
fr French      1000   999      99.9   999      99.9   992      99.2
hu Hungarian   1000   1000     100.0  999      99.9   999      99.9
it Italian     1000   999      99.9   999      99.9   996      99.6
lt Lithuanian  1000   -        -      997      99.7   999      99.9
lv Latvian     1000   -        -      999      99.9   998      99.8
nl Dutch       1000   1000     100.0  974      97.4   995      99.5
pl Polish      1000   998      99.8   999      99.9   997      99.7
pt Portuguese  1000   995      99.5   996      99.6   989      98.9
ro Romanian    1000   1000     100.0  999      99.9   998      99.8
sk Slovak      1000   -        -      988      98.8   990      99.0
sl Slovene     1000   -        -      976      97.6   963      96.3
sv Swedish     1000   995      99.5   991      99.1   993      99.3
total          21000  13957    99.7   20850    99.3   20814    99.1
Conclusions
• A language detector using the maximal substring model
– Detects with over 99% accuracy for 19 languages
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, precision will improve further
– There are still many mistakes (in particular for da and no)
• If metadata is added to the features, precision will improve further
– How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
– Too large for applications (> 100MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Agenda
bull Language Detection
bull Proposal Method
ndash Maximal Substring
bull Corpus
bull Implementation and Estimations
bull Conclusions
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 4
Language Detection
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 5
In What Language
bull Ik kan er nooit tegen als mensen me negeren
bull Aha ich seh angeblich suumlszlig aus
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli
bull Ah Tak Saring skal jeg bare finde ud af hvordan
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk )
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen
bull Ccedilok doğru En buumlyuumlk hatayı yaptım
bull Icircncacircntat de cunoștință
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 6
Hints
• Dutch if there is "ik"
• German if there is "ich" or the letter ß
• Polish if there is "czy" or the letters Ł, ń, ś or ź
• Scandinavian if there is the letter å
  – Danish if there is "af"; "Tak" means thanks
  – Norwegian if there is "nei"; "Takk" means thanks
  – Swedish if there is "och"; "Tack" means thanks
• Turkish if there is the letter ı (i without a dot) or ğ
• Romanian if there is the letter ă, ș or ț
  – Although ă is also used in Vietnamese, the two languages are easy to distinguish
  – Although ş is also used in Turkish, the two languages are easy to distinguish
• Vietnamese if there are many unreadable letters on WinXP :P
In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli → Polish
• Ah Tak Så skal jeg bare finde ud af hvordan → Danish
• Det er ikke så digg nei å vi som har finale til helga Skrekk og gru Takk :) → Norwegian
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen → Swedish
• Çok doğru En büyük hatayı yaptım → Turkish
• Încântat de cunoștință → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung → Vietnamese
What's Language Detection?
• To detect what language the input text is written in
  – "Time flies like an arrow" → English
  – "Buona sera" → Italian
• It is a prerequisite for many language processing tasks
  – A language model is built for each language
  – Text search, classification, extraction, translation
• For sufficiently long and noiseless text, detection with more than 99% accuracy is possible [Cavnar+ 94]
  – A 3-gram model is used in many methods
SPAM or not?
• It is necessary to know that it is written in Polish
Document Categorization with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Words are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  – where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior for the category
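The posterior maximization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `train` and `classify`, the add-one smoothing, and the toy data are not from the slides, and log-probabilities are used simply to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: iterable of (label, list_of_features) pairs."""
    prior = Counter()               # number of documents per category
    counts = defaultdict(Counter)   # feature counts per category
    vocab = set()
    for label, feats in samples:
        prior[label] += 1
        counts[label].update(feats)
        vocab.update(feats)
    return prior, counts, vocab

def classify(model, feats):
    """Return the category maximizing log p(C_k) + sum_i log p(X_i | C_k)."""
    prior, counts, vocab = model
    total_docs = sum(prior.values())
    best, best_score = None, -math.inf
    for label in prior:
        denom = sum(counts[label].values()) + len(vocab)
        score = math.log(prior[label] / total_docs)
        for f in feats:
            # add-one smoothing (an assumption, so unseen features do not give log 0)
            score += math.log((counts[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([("en", ["the", "cat"]), ("de", ["der", "hund"])])
print(classify(model, ["the"]))
```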
Language Detection with Naive Bayes Classifier
• Document categorization with language labels
  – Categorize documents into English, Japanese and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data
Why Use n-Grams to Detect Language?
• Each language has its own characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but in principle not in English
  – Many words start with "Z" in German but not in English
  – Many words start with "C" in English but not in German
  – The spelling "Th" is often used in English but not in other languages

Character n-grams of "This" (_ marks a word boundary):
  T h i s           ← 1-gram
  _T Th hi is s_    ← 2-gram
  _Th Thi his is_   ← 3-gram

         |  C   |  L   |  Z   |  Th
English  | 0.75 | 0.47 | 0.02 | 0.74
German   | 0.10 | 0.37 | 0.53 | 0.03
French   | 0.38 | 0.69 | 0.01 | 0.01
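As a small sketch, character n-grams like the ones shown for "This" could be extracted as follows. The helper name `char_ngrams` and the use of a single space as the word-boundary padding are assumptions made for illustration.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("This", 2))
print(char_ngrams("This", 3))
```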
language-detection (langdetect) (Nakatani 2010)
• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr
Evaluation on News Text
• Tested on news text crawled from the web in 49 languages

language | size | correct (accuracy %)
af Afrikaans | 200 | 199 (99.50)
ar Arabic | 200 | 200 (100.00)
bg Bulgarian | 200 | 200 (100.00)
bn Bengali | 200 | 200 (100.00)
cs Czech | 200 | 200 (100.00)
da Danish | 200 | 179 (89.50)
de German | 200 | 200 (100.00)
el Greek | 200 | 200 (100.00)
en English | 200 | 200 (100.00)
es Spanish | 200 | 200 (100.00)
fa Persian | 200 | 200 (100.00)
fi Finnish | 200 | 200 (100.00)
fr French | 200 | 200 (100.00)
gu Gujarati | 200 | 200 (100.00)
he Hebrew | 200 | 200 (100.00)
hi Hindi | 200 | 200 (100.00)
hr Croatian | 200 | 200 (100.00)
hu Hungarian | 200 | 200 (100.00)
id Indonesian | 200 | 200 (100.00)
it Italian | 200 | 200 (100.00)
ja Japanese | 200 | 200 (100.00)
kn Kannada | 200 | 200 (100.00)
ko Korean | 200 | 200 (100.00)
mk Macedonian | 200 | 200 (100.00)
ml Malayalam | 200 | 200 (100.00)
mr Marathi | 200 | 200 (100.00)
ne Nepali | 200 | 200 (100.00)
nl Dutch | 200 | 200 (100.00)
no Norwegian | 200 | 199 (99.50)
pa Punjabi | 200 | 200 (100.00)
pl Polish | 200 | 200 (100.00)
pt Portuguese | 200 | 200 (100.00)
ro Romanian | 200 | 200 (100.00)
ru Russian | 200 | 200 (100.00)
sk Slovak | 200 | 200 (100.00)
so Somali | 200 | 200 (100.00)
sq Albanian | 200 | 200 (100.00)
sv Swedish | 200 | 200 (100.00)
sw Swahili | 200 | 200 (100.00)
ta Tamil | 200 | 200 (100.00)
te Telugu | 200 | 200 (100.00)
th Thai | 200 | 200 (100.00)
tl Tagalog | 200 | 200 (100.00)
tr Turkish | 200 | 200 (100.00)
uk Ukrainian | 200 | 200 (100.00)
ur Urdu | 200 | 200 (100.00)
vi Vietnamese | 200 | 200 (100.00)
zh-cn Simplified Chinese | 200 | 200 (100.00)
zh-tw Traditional Chinese | 200 | 200 (100.00)
total | 9800 | 9777 (99.77)
Evaluation on Europarl Datasets
• Tested on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language | size | correct | accuracy (%)
bg Bulgarian | 1000 | 988 | 98.8
cs Czech | 1000 | 994 | 99.4
da Danish | 1000 | 968 | 96.8
de German | 1000 | 998 | 99.8
el Greek | 1000 | 1000 | 100.0
en English | 1000 | 996 | 99.6
es Spanish | 1000 | 996 | 99.6
et Estonian | 1000 | 996 | 99.6
fi Finnish | 1000 | 998 | 99.8
fr French | 1000 | 999 | 99.9
hu Hungarian | 1000 | 999 | 99.9
it Italian | 1000 | 999 | 99.9
lt Lithuanian | 1000 | 997 | 99.7
lv Latvian | 1000 | 999 | 99.9
nl Dutch | 1000 | 974 | 97.4
pl Polish | 1000 | 999 | 99.9
pt Portuguese | 1000 | 996 | 99.6
ro Romanian | 1000 | 999 | 99.9
sk Slovak | 1000 | 988 | 98.8
sl Slovene | 1000 | 976 | 97.6
sv Swedish | 1000 | 991 | 99.1
total | 21000 | 20850 | 99.3
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):
language | LD | CLD | Tika
ca Catalan | 95.3 | 93.0 | 83.8
cs Czech | 96.3 | 96.6 | ----
da Danish | 94.5 | 90.7 | 58.7
de German | 86.6 | 96.8 | 73.1
en English | 88.3 | 97.4 | 54.7
es Spanish | 91.5 | 90.5 | 44.4
fi Finnish | 98.9 | 99.4 | 94.8
fr French | 95.0 | 94.5 | 67.4
hu Hungarian | 85.8 | 89.0 | 76.2
id Indonesian | 89.7 | 92.8 | ----
it Italian | 96.2 | 93.8 | 87.1
nl Dutch | 69.5 | 93.2 | 65.0
no Norwegian | 96.0 | 74.9 | 68.6
pl Polish | 98.0 | 97.8 | 88.8
pt Portuguese | 88.0 | 88.6 | 47.4
ro Romanian | 92.8 | 96.1 | 82.6
sv Swedish | 96.0 | 96.4 | 75.6
tr Turkish | 97.6 | 97.4 | ----
vi Vietnamese | 98.7 | 98.9 | ----
total | 92.2 | 93.8 | 70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – # of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Spellings that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be low under a normal language model

  OMG = Oh My God
  LOL = Laughing Out Loud
  LMAO = Laughing My Ass Off
  F4F = Follow for Follow
  MDR = Mort de Rire (French)
  TKT = Ne t'Inquiète Pas (French)
  u = you
  ur = your
  4 = for
  i0u = I love you
  k = che (Italian)
  anke = anche (Italian)
  (The letter k isn't used in Italian)
Motivation to Detect Short Text Language
• There are many small chunks of text besides Twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We Need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels
Proposed Method

How to Get More Features than 3-Grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is O(T^2)

Cumulative distribution of feature length (number of distinct n-grams of length up to n) for 5090 normalized English tweets (300 KB):
n | freq ≥ 1 | freq ≥ 2 | freq ≥ 10
1 | 79 | 72 | 57
2 | 1896 | 1533 | 902
3 | 15970 | 10369 | 4525
4 | 64966 | 33941 | 10534
5 | 167543 | 69719 | 15538
6 | 323749 | 107861 | 18970
7 | 524634 | 142954 | 21093
8 | 760719 | 171995 | 22159
9 | 921361 | 193995 | 22696
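A count of this kind could be produced with a sketch like the one below. It is an illustration, not the script behind the slide: the function name `cumulative_feature_counts` is hypothetical, and the counting is brute force.

```python
from collections import Counter

def cumulative_feature_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n
    whose total frequency in the corpus is at least min_freq."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for s, c in counts.items() if len(s) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    ]

print(cumulative_feature_counts(["abab"], 2))
print(cumulative_feature_counts(["abab"], 2, min_freq=2))
```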
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction
Maximal Substring (1)
• Define a containment relation (semi-order) among the non-empty substrings of a text, e.g. "abracadabra"
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  (Strictly speaking, the relation also takes the position inside the containing substring into account.)

Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
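The definition can be checked on the "abracadabra" example with a brute-force sketch. It assumes an equivalent characterization: a substring is maximal when extending it by one character on either side always lowers its occurrence count. The function name is hypothetical, and the running time is quadratic or worse, unlike the linear-time construction cited on the slides.

```python
from collections import Counter

def maximal_substrings(text):
    """Brute-force maximal substrings: no one-character extension keeps the same count."""
    # Count every occurrence of every substring (overlaps included).
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for s, c in counts.items():
        same_count_extension = any(
            counts.get(ch + s, 0) == c or counts.get(s + ch, 0) == c
            for ch in alphabet
        )
        if not same_count_extension:
            result.append(s)
    return sorted(result, key=len)

print(maximal_substrings("abracadabra"))
```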
Maximal Substring and Infinity-Gram
• The frequencies of substrings in the same equivalence class are always equal
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
Note: the equivalence breaks down on the test set, but we assume that it holds approximately given a sufficiently large training set
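The merging argument can be written out as one line of algebra. The notation is introduced here for illustration and is not from the slides: $f_s(x)$ is the frequency of substring $s$ in text $x$, $w_s$ its weight, $M$ the set of maximal substrings, and $[m]$ the equivalence class whose maximal element is $m$. Within a class all frequencies agree on the training text, so $f_s(x) = f_m(x)$ for $s \in [m]$.

```latex
\sum_{s} w_s f_s(x)
  = \sum_{m \in M} \sum_{s \in [m]} w_s f_s(x)
  = \sum_{m \in M} \Bigl( \sum_{s \in [m]} w_s \Bigr) f_m(x)
  = \sum_{m \in M} \tilde{w}_m f_m(x),
\qquad \tilde{w}_m := \sum_{s \in [m]} w_s
```

A linear model over all substrings therefore has the same score as a model with one weight $\tilde{w}_m$ per maximal substring, as long as the class frequencies coincide.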
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
  – They can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
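To make the three arrays concrete, here is a naive sketch that builds them by sorting suffixes. It is not linear time and is not how esaxx works internally; the function name and the choice to wrap the BWT around at position 0 (instead of using a sentinel) are assumptions.

```python
def extended_suffix_array(text):
    """Naive SA, LCP and BWT construction by sorting all suffixes."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def lcp(a, b):
        # Length of the common prefix of the suffixes starting at a and b.
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    lcps = [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]
    # Character preceding each suffix; text[-1] wraps around for the suffix at 0.
    bwt = "".join(text[i - 1] for i in sa)
    return sa, lcps, bwt

print(extended_suffix_array("banana"))
```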
Corpus and Normalization

Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Languages written in the Latin alphabet
  – The most difficult script to detect
  – More than 25 of them have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å, ą, æ, ð, Ħ, ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range | Name | Notes
U+0000-007F | Basic Latin | ASCII
U+0080-00FF | Latin-1 Supplement | Most languages are covered by these two blocks
U+0100-017F | Latin Extended-A | (as above)
U+0180-024F | Latin Extended-B | Romanian
U+0250-02AF | IPA Extensions |
U+0300-036F | Combining Diacritical Marks | for tone symbol composition
U+1E00-1EFF | Latin Extended Additional | Vietnamese
U+2C60-2C7F | Latin Extended-C | Used by almost no present-day language
U+A720-A7FF | Latin Extended-D | (as above)
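A membership test against the nine blocks listed above might look like the following sketch. The names `LATIN_BLOCKS` and `in_latin_blocks` are hypothetical, and the check works on whole blocks, so digits and punctuation in Basic Latin also pass.

```python
# Code block ranges taken from the table of Latin-related Unicode blocks.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]

def in_latin_blocks(ch):
    """True if the single character ch falls inside one of the Latin code blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)

print([in_latin_blocks(c) for c in "åŋẵあ"])
```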
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin letters. Legend: used for Vietnamese only / used often / used sometimes]
How to Create the Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  (20% of all tweets have no timezone)
• Annotate tentative labels on tweets using langdetect
  – Remove non-French tweets from the ones labeled 'fr'
  – Recover French tweets from the ones not labeled 'fr'

How to Annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language | training | test
ca Catalan | 9089 | 5082
cs Czech | 9082 | 7682
da Danish | 7388 | 5524
de German | 44448 | 10065
en English | 44520 | 10168
es Spanish | 44118 | 10265
fi Finnish | 8087 | 7050
fr French | 44339 | 10098
hu Hungarian | 10030 | 4904
id Indonesian | 44722 | 10181
it Italian | 43366 | 10152
nl Dutch | 44682 | 10007
no Norwegian | 10124 | 8496
pl Polish | 16771 | 10152
pt Portuguese | 44215 | 10208
ro Romanian | 10021 | 5911
sv Swedish | 44054 | 10032
tr Turkish | 44703 | 10308
vi Vietnamese | 15030 | 10488
total | 538789 | 166773

Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
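The approximation described above (whole copies plus a remainder drawn without replacement) can be sketched as follows. The function name `level_out` and its `rng` parameter are illustrative assumptions, not part of ldig.

```python
import random

def level_out(samples, target_size, rng=random):
    """Grow a list of samples to target_size: repeat it an integer number
    of times, then draw the remainder without replacement."""
    copies, rest = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, rest)

balanced = level_out(["t1", "t2", "t3"], 8, random.Random(0))
print(len(balanced))
```

With this scheme every original tweet appears either `copies` or `copies + 1` times, so no tweet is over-represented by chance.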
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the amount of corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding 'I'

Language | Upper case | Lower case
Turkish, Azerbaijani | I (U+0049) | ı (U+0131)
Turkish, Azerbaijani | İ (U+0130) | i (U+0069)
Others | I (U+0049) | i (U+0069)
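A minimal sketch of "lower case everything except 'I'" is shown below. The function name is hypothetical, and only the ASCII capital I (U+0049) is left untouched, as the slide describes; how ldig treats other characters is not shown here.

```python
def lower_except_i(text):
    """Lower-case every character except the ASCII capital I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)

print(lower_except_i("ISTANBUL Is Big"))
```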
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more common in news, on Twitter and on Wikipedia
• The 2 code points have the same design in some fonts
  – Indistinguishable

  ș (U+0219)   ş (U+015F)
  ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more common than the correct ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ / さ)
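Mapping 's/t with comma below' onto 's/t with cedilla' is a plain character translation. The sketch below uses the code points named on the slides; the table and function names are illustrative.

```python
# Comma-below forms (U+0218-021B) mapped to cedilla forms (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})

def normalize_romanian(text):
    """Replace s/t with comma below by the more common cedilla variants."""
    return text.translate(COMMA_TO_CEDILLA)

print(normalize_romanian("cuno\u0219tin\u021b\u0103"))
```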
Arabic Character Normalization (in language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is very frequent in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
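One way to turn representation 2 into representation 1 is Unicode NFC normalization from the Python standard library. This is an assumption about how the step could be done, not a description of ldig's own conversion table; the function name is hypothetical.

```python
import unicodedata

def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed))
```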
CJK-Kanji Normalization (1) (in language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than in Chinese

CJK-Kanji Normalization (2) (in language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japan and China
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD or :p
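The removals listed above can be approximated with a few regular expressions. The patterns below are deliberately simple assumptions (for example, the emoticon pattern only covers forms such as XD, :p and :D) and are not ldig's own expressions.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D (illustrative, not exhaustive).
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;][DPp])(?!\w)")

def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace

print(strip_twitter_noise("RT @bob check http://t.co/abc #fun XD so cool"))
```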
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language written in the Latin alphabet has 3 or more repetitions of the same letter in its orthography
    • Japanese, on the other hand, has "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
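Case 2 fits in a single regular expression with a backreference. The sketch assumes "3 or more repetitions" as the threshold and applies to any character, which may be broader than what ldig does.

```python
import re

# Any character followed by at least two more copies of itself.
REPEAT = re.compile(r"(.)\1{2,}")

def shrink_repeats(text):
    """Shrink runs of 3 or more identical characters down to 2."""
    return REPEAT.sub(r"\1\1", text)

print(shrink_repeats("coooooooollllll"))
```

Note that the result is "cooll" rather than "cool": the rule only bounds run length, which is enough to make lengthened variants of a word share the same features.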
Laugh Normalization
• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
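A possible regular expression for shrinking laughs is sketched below. The choice of consonants (h, j, k) and vowels is an assumption based on the examples above, and the pattern is not taken from ldig.

```python
import re

# A consonant-vowel pair (such as "ha", "ja", "ke") repeated at least twice,
# tolerating doubled letters as in "hahahha".
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+", re.IGNORECASE)

def shrink_laugh(text):
    """Reduce a repeated laugh to exactly two consonant-vowel pairs."""
    return LAUGH.sub(r"\1\2\1\2", text)

print(shrink_laugh("hahahha"))
print(shrink_laugh("tafil con eso jajajajajajaja"))
```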
Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (cumulative penalty) [Tsuruoka+ 09]
  – Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with " " and line feeds with U+0001
  – Output is "[feature]\t[frequency]"

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove useless features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
  Examples:
  en  u should just enjoy ur vacation sadly
  en  D im online but you arent RT that much
  en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
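Reading one line of such data could look like the sketch below. It assumes tab-separated fields with the label first and the text last, treating everything in between as metadata; the function name and the sample metadata values are hypothetical.

```python
def parse_line(line):
    """Split a data line into (label, metadata fields, text).
    Assumes tab-separated fields: label first, text last."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text

print(parse_line("en\t\tu should just enjoy ur vacation sadly\n"))
```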
Usage (5) Evaluation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it is executed
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Evaluation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"

language | size | detect | correct | precision (%) | recall (%) | LD53 (%) | LDsm (%)
ca Catalan | 5093 | 4923 | 4857 | 98.66 | 95.37 | 95.3 | 97.0
cs Czech | 7681 | 7668 | 7663 | 99.93 | 99.77 | 96.3 | 99.7
da Danish | 5516 | 5472 | 5310 | 97.04 | 96.27 | 94.5 | 92.4
de German | 10060 | 10069 | 10006 | 99.37 | 99.46 | 86.6 | 93.8
en English | 10162 | 10133 | 10029 | 98.97 | 98.69 | 88.3 | 95.0
es Spanish | 10244 | 10284 | 10120 | 98.41 | 98.79 | 91.5 | 96.0
fi Finnish | 7051 | 7038 | 7024 | 99.80 | 99.62 | 98.9 | 99.6
fr French | 10074 | 10134 | 10051 | 99.18 | 99.77 | 95.0 | 98.1
hu Hungarian | 4904 | 4892 | 4858 | 99.30 | 99.06 | 85.8 | 95.5
id Indonesian | 10178 | 10225 | 10160 | 99.36 | 99.82 | 89.7 | 98.9
it Italian | 10143 | 10205 | 10103 | 99.00 | 99.61 | 96.2 | 98.0
nl Dutch | 10005 | 9916 | 9858 | 99.42 | 98.53 | 69.5 | 97.4
no Norwegian | 8504 | 8432 | 8201 | 97.26 | 96.44 | 96.0 | 96.3
pl Polish | 10151 | 10149 | 10130 | 99.81 | 99.79 | 98.0 | 99.7
pt Portuguese | 10212 | 10201 | 10119 | 99.20 | 99.09 | 88.0 | 96.9
ro Romanian | 5913 | 5867 | 5850 | 99.71 | 98.93 | 92.8 | 97.4
sv Swedish | 10025 | 10093 | 9942 | 98.50 | 99.17 | 96.0 | 97.9
tr Turkish | 10308 | 10317 | 10298 | 99.82 | 99.90 | 97.6 | 99.5
vi Vietnamese | 10487 | 10480 | 10474 | 99.94 | 99.88 | 98.7 | 99.2
total | 166711 | | 165053 | | 99.01 | 92.2 | 97.4
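The precision and recall columns follow from the size, detect and correct counts: precision = correct / detect and recall = correct / size. A short sketch (the function name is illustrative) reproduces the Catalan row:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals."""
    precision = round(100.0 * correct / detect, 2)
    recall = round(100.0 * correct / size, 2)
    return precision, recall

# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(5093, 4923, 4857))
```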
Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language | size | detect | correct | precision (%) | recall (%)
de German | 1479 | 1476 | 1469 | 99.5 | 99.3
en English | 1505 | 1502 | 1490 | 99.2 | 99.0
es Spanish | 1562 | 1548 | 1541 | 99.6 | 98.7
fr French | 1551 | 1549 | 1540 | 99.4 | 99.3
it Italian | 1539 | 1531 | 1528 | 99.8 | 99.3
nl Dutch | 1430 | 1429 | 1424 | 99.7 | 99.6
total | 9066 | | 8992 | | 99.2
Evaluation on the Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported)

language | size | ldig correct | ldig rate (%) | langdetect correct | langdetect rate (%) | CLD correct | CLD rate (%)
bg Bulgarian | 1000 | - | - | 988 | 98.8 | 991 | 99.1
cs Czech | 1000 | 1000 | 100.0 | 994 | 99.4 | 995 | 99.5
da Danish | 1000 | 976 | 97.6 | 968 | 96.8 | 932 | 93.2
de German | 1000 | 999 | 99.9 | 998 | 99.8 | 1000 | 100.0
el Greek | 1000 | - | - | 1000 | 100.0 | 1000 | 100.0
en English | 1000 | 999 | 99.9 | 996 | 99.6 | 1000 | 100.0
es Spanish | 1000 | 1000 | 100.0 | 996 | 99.6 | 989 | 98.9
et Estonian | 1000 | - | - | 996 | 99.6 | 998 | 99.8
fi Finnish | 1000 | 997 | 99.7 | 998 | 99.8 | 1000 | 100.0
fr French | 1000 | 999 | 99.9 | 999 | 99.9 | 992 | 99.2
hu Hungarian | 1000 | 1000 | 100.0 | 999 | 99.9 | 999 | 99.9
it Italian | 1000 | 999 | 99.9 | 999 | 99.9 | 996 | 99.6
lt Lithuanian | 1000 | - | - | 997 | 99.7 | 999 | 99.9
lv Latvian | 1000 | - | - | 999 | 99.9 | 998 | 99.8
nl Dutch | 1000 | 1000 | 100.0 | 974 | 97.4 | 995 | 99.5
pl Polish | 1000 | 998 | 99.8 | 999 | 99.9 | 997 | 99.7
pt Portuguese | 1000 | 995 | 99.5 | 996 | 99.6 | 989 | 98.9
ro Romanian | 1000 | 1000 | 100.0 | 999 | 99.9 | 998 | 99.8
sk Slovak | 1000 | - | - | 988 | 98.8 | 990 | 99.0
sl Slovene | 1000 | - | - | 976 | 97.6 | 963 | 96.3
sv Swedish | 1000 | 995 | 99.5 | 991 | 99.1 | 993 | 99.3
total | 21000 | 13957 | 99.7 | 20850 | 99.3 | 20814 | 99.1
Conclusions
• A language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy when trained on the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular between da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (>100 MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Language Detection
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 5
In What Language
bull Ik kan er nooit tegen als mensen me negeren
bull Aha ich seh angeblich suumlszlig aus
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli
bull Ah Tak Saring skal jeg bare finde ud af hvordan
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk )
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen
bull Ccedilok doğru En buumlyuumlk hatayı yaptım
bull Icircncacircntat de cunoștință
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 6
Hints
bull Dutch if there is ik
bull German if there is ich or a letter szlig
bull Polish if there is czy or letters Ł ń ś or ź
bull Scandinavian if there is a letter aring
ndash Danish if there is af Tak means thanks
ndash Norwegian if there is nei Takk means thanks
ndash Swedish if there is och Tack means thanks
bull Turkish if there is a letter ı ( i without point) or ğ
bull Romanian if there is a letter ă or ș or ț
ndash Although ă is also used in Vietnamese it is easy to distinguish them
ndash Although ş is also used in Turkish it is easy to distinguish them
bull Vietnamese if there are many unreadable letters on WinXP P
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 7
In What Language (Solution)
bull Ik kan er nooit tegen als mensen me negeren Dutch
bull Aha ich seh angeblich suumlszlig aus German
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli Polish
bull Ah Tak Saring skal jeg bare finde ud af hvordan Danish
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk ) Norwegian
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen Swedish
bull Ccedilok doğru En buumlyuumlk hatayı yaptım Turkish
bull Icircncacircntat de cunoștință Rumanian
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung Vietnamese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 8
Whats Language Detection
bull To detect what language the input text written in
ndash Time fries like arrow rarr English
ndash Buona sera rarr Italian
bull It is prior for many language processing tasks
ndash Language model is built for each language
ndash Text search classification extraction translation
bull It is possible to detect for long enough and noiseless text with more than 99 accuracy [Cavnar+ 94]
ndash 3-gram model is used in many methods
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 9
SPAM or not
bull It is necessary to know that it is written in Polish
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 10
Document Categorization with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Words are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  – where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior for the category
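The classifier above can be sketched in a few lines of Python. This is only an illustration, not langdetect's implementation: the class name, the add-one smoothing and the toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words features (illustrative sketch)."""

    def __init__(self):
        self.doc_counts = Counter()              # number of documents per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, words, category):
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def log_posterior(self, words, category):
        # log p(C_k) + sum_i log p(X_i | C_k); add-one smoothing is an assumption
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[category] / total_docs)
        denom = sum(self.word_counts[category].values()) + len(self.vocabulary)
        for w in words:
            score += math.log((self.word_counts[category][w] + 1) / denom)
        return score

    def classify(self, words):
        # argmax over categories of the (unnormalized) log posterior
        return max(self.doc_counts, key=lambda c: self.log_posterior(words, c))


nb = NaiveBayes()
nb.train("the cat sat on the mat".split(), "en")
nb.train("der hund sitzt auf der matte".split(), "de")
print(nb.classify("the dog".split()))
```

Working in log space avoids underflow when many small probabilities are multiplied, and $p(X)$ can be dropped because it is the same for every category.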
Language Detection with Naive Bayes Classifier
• Document categorization with language labels
  – Categorize documents into English, Japanese, and so on
• Use character n-grams as features
  – Strictly speaking, Unicode code point n-grams
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data
Why Use n-Grams to Detect Language?
• Each language has its own characteristic characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but in principle not in English
  – Many words start with "Z" in German, but not in English
  – Many words start with "C" in English, but not in German
  – The spelling "Th" is frequent in English, but not in the other languages
n-grams of "This" ("_" marks the word boundary):
  1-gram: T h i s
  2-gram: _T Th hi is s_
  3-gram: _Th Thi his is_
           C     L     Z     Th
English   0.75  0.47  0.02  0.74
German    0.10  0.37  0.53  0.03
French    0.38  0.69  0.01  0.01
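The n-gram extraction illustrated above can be sketched as follows. The function name and the "_" padding character are assumptions for the example; the slide does not say which boundary marker langdetect uses internally.

```python
def char_ngrams(text, n, pad="_"):
    """Return the character n-grams of `text`, padded at both ends when n > 1."""
    padded = text if n == 1 else pad + text + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```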
language-detection (langdetect) (Nakatani 2010)
• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr
Evaluation with News Text
• Test on news text crawled from the web in 49 languages
Language                  size  correct (accuracy)
af Afrikaans               200  199 (99.50%)
ar Arabic                  200  200 (100.00%)
bg Bulgarian               200  200 (100.00%)
bn Bengali                 200  200 (100.00%)
cs Czech                   200  200 (100.00%)
da Danish                  200  179 (89.50%)
de German                  200  200 (100.00%)
el Greek                   200  200 (100.00%)
en English                 200  200 (100.00%)
es Spanish                 200  200 (100.00%)
fa Persian                 200  200 (100.00%)
fi Finnish                 200  200 (100.00%)
fr French                  200  200 (100.00%)
gu Gujarati                200  200 (100.00%)
he Hebrew                  200  200 (100.00%)
hi Hindi                   200  200 (100.00%)
hr Croatian                200  200 (100.00%)
hu Hungarian               200  200 (100.00%)
id Indonesian              200  200 (100.00%)
it Italian                 200  200 (100.00%)
ja Japanese                200  200 (100.00%)
kn Kannada                 200  200 (100.00%)
ko Korean                  200  200 (100.00%)
mk Macedonian              200  200 (100.00%)
ml Malayalam               200  200 (100.00%)
mr Marathi                 200  200 (100.00%)
ne Nepali                  200  200 (100.00%)
nl Dutch                   200  200 (100.00%)
no Norwegian               200  199 (99.50%)
pa Punjabi                 200  200 (100.00%)
pl Polish                  200  200 (100.00%)
pt Portuguese              200  200 (100.00%)
ro Romanian                200  200 (100.00%)
ru Russian                 200  200 (100.00%)
sk Slovak                  200  200 (100.00%)
so Somali                  200  200 (100.00%)
sq Albanian                200  200 (100.00%)
sv Swedish                 200  200 (100.00%)
sw Swahili                 200  200 (100.00%)
ta Tamil                   200  200 (100.00%)
te Telugu                  200  200 (100.00%)
th Thai                    200  200 (100.00%)
tl Tagalog                 200  200 (100.00%)
tr Turkish                 200  200 (100.00%)
uk Ukrainian               200  200 (100.00%)
ur Urdu                    200  200 (100.00%)
vi Vietnamese              200  200 (100.00%)
zh-cn Simplified Chinese   200  200 (100.00%)
zh-tw Traditional Chinese  200  200 (100.00%)
total                     9800  9777 (99.77%)
Evaluation with Europarl Datasets
• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language        size  correct  accuracy
bg Bulgarian    1000      988     98.8%
cs Czech        1000      994     99.4%
da Danish       1000      968     96.8%
de German       1000      998     99.8%
el Greek        1000     1000    100.0%
en English      1000      996     99.6%
es Spanish      1000      996     99.6%
et Estonian     1000      996     99.6%
fi Finnish      1000      998     99.8%
fr French       1000      999     99.9%
hu Hungarian    1000      999     99.9%
it Italian      1000      999     99.9%
lt Lithuanian   1000      997     99.7%
lv Latvian      1000      999     99.9%
nl Dutch        1000      974     97.4%
pl Polish       1000      999     99.9%
pt Portuguese   1000      996     99.6%
ro Romanian     1000      999     99.9%
sk Slovak       1000      988     98.8%
sl Slovene      1000      976     97.6%
sv Swedish      1000      991     99.1%
total          21000    20850     99.3%
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports
language          LD    CLD   Tika   (accuracy, %)
ca Catalan       95.3  93.0  83.8
cs Czech         96.3  96.6  ----
da Danish        94.5  90.7  58.7
de German        86.6  96.8  73.1
en English       88.3  97.4  54.7
es Spanish       91.5  90.5  44.4
fi Finnish       98.9  99.4  94.8
fr French        95.0  94.5  67.4
hu Hungarian     85.8  89.0  76.2
id Indonesian    89.7  92.8  ----
it Italian       96.2  93.8  87.1
nl Dutch         69.5  93.2  65.0
no Norwegian     96.0  74.9  68.6
pl Polish        98.0  97.8  88.8
pt Portuguese    88.0  88.6  47.4
ro Romanian      92.8  96.1  82.6
sv Swedish       96.0  96.4  75.6
tr Turkish       97.6  97.4  ----
vi Vietnamese    98.7  98.9  ----
total            92.2  93.8  70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – # of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Spellings that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be small under a normal language model
  OMG   Oh My God
  LOL   Laughing Out Loud
  LMAO  Laughing My Ass Off
  F4F   Follow for Follow
  MDR   Mort de Rire (French)
  TKT   Ne t'Inquiète Pas (French)
  u     you
  ur    your
  4     for
  i0u   I love you
  k     che (Italian)
  anke  anche (Italian)
  (The letter k isn't used in Italian)
Motivation to Detect the Language of Short Text
• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words
We Need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels
Proposal Method
How to Increase Features beyond 3-Grams
• The larger n, the more features
• Maximum at n = ∞, that is, all substrings
  – But the number of substrings is of order O(T^2) for a text of length T
# of n-grams
n    freq ≥ 1   freq ≥ 2   freq ≥ 10
1          79         72          57
2        1896       1533         902
3       15970      10369        4525
4       64966      33941       10534
5      167543      69719       15538
6      323749     107861       18970
7      524634     142954       21093
8      760719     171995       22159
9      921361     193995       22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
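A table of this kind could be produced roughly as follows. This is a brute-force sketch under the assumption that each row counts the distinct substrings of length at most n whose frequency reaches the threshold; the function name and the one-word toy corpus are made up for the example.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n with frequency >= min_freq."""
    freq = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                freq[text[i:i + n]] += 1
    return [
        sum(1 for s, f in freq.items() if len(s) <= n and f >= min_freq)
        for n in range(1, max_n + 1)
    ]


corpus = ["abracadabra"]
print(cumulative_ngram_counts(corpus, 3))
print(cumulative_ngram_counts(corpus, 3, min_freq=2))
```

The nested loops make the quadratic growth visible: with no upper bound on n, every one of the O(T^2) substrings becomes a candidate feature.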
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction
Maximal Substring (1)
• Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra"
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly, the relation is defined together with the position inside the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
(figure via http://d.hatena.ne.jp/nokuno/20120203/1328237067)
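The definition can be checked with a brute-force sketch. It uses the observation that a substring is maximal exactly when extending it by one character to the left or right always lowers its frequency. The function names are made up for the example, and the approach is far slower than the linear-time construction used in practice.

```python
def count_occurrences(text, sub):
    """Count occurrences of `sub` in `text`, including overlapping ones."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def maximal_substrings(text):
    """Return substrings that cannot be extended by one character without losing occurrences."""
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    freq = {s: count_occurrences(text, s) for s in subs}
    alphabet = set(text)
    result = []
    for s, c in freq.items():
        extensions = [ch + s for ch in alphabet] + [s + ch for ch in alphabet]
        if all(freq.get(e, 0) < c for e in extensions):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```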
Maximal Substring and Infinity-Gram
• Substrings in a containment relation always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Hence logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
• Although the equivalence breaks down on the test set, we assume it holds approximately given a sufficiently large training set
Extended Suffix Array
• An Extended Suffix Array (ESA) consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
  – These can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
(figure via [Okanohara+ 09])
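To make the three arrays concrete, here is a deliberately naive sketch that builds them by sorting suffixes. It is not how esaxx works (esaxx runs in linear time); the function names and the wrap-around convention for the BWT at position 0 are assumptions made for the example.

```python
def suffix_array(text):
    """Naive suffix array: start indices of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    def common(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def bwt(text, sa):
    """Character preceding each sorted suffix (wrapping around at position 0)."""
    return "".join(text[i - 1] for i in sa)


text = "abracadabra"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
b = bwt(text, sa)
for k, i in enumerate(sa):
    print(f"{i:2d}  lcp={lcp[k]}  bwt={b[k]}  {text[i:]}")
```

In the printed rows, the suffixes sharing the prefix "abra" have LCP 4 and different preceding characters, which is the signature of a repeated maximal substring described above.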
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 languages have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å, ą, æ, ð, Ħ, ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range          Name                          Note
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
Latin Alphabets in Unicode Codepoint Chart
[Figure: code point chart; legend: used for Vietnamese only / used often / used sometimes]
How to Create the Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – A 1% sample of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
How to Annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish language guides, in that order]
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart: share of tweets by language; legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
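The leveling step described above might look like this. It is a sketch of the "copy an integer multiple, then sample the rest" approximation; the function names, the fixed seed and the toy corpus are assumptions for the example.

```python
import random


def level_by_copy_and_sample(items, target_size, rng):
    """Copy `items` a whole number of times, then sample the remainder without replacement."""
    copies, remainder = divmod(target_size, len(items))
    return items * copies + rng.sample(items, remainder)


def level_corpus(corpus, seed=0):
    """corpus: dict mapping a language code to its list of tweets."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    return {lang: level_by_copy_and_sample(tweets, target, rng)
            for lang, tweets in corpus.items()}


corpus = {"en": ["a", "b", "c", "d", "e", "f", "g"], "da": ["x", "y", "z"]}
leveled = level_corpus(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```

Compared with pure sampling with replacement, this keeps every original tweet in the leveled data at least as often as the number of whole copies.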
Converting to Lowercase across Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• So convert to lower case, excluding 'I'
Language              Upper case    Lower case
Turkish, Azerbaijani  I (U+0049)    ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)    i (U+0069)
Others                I (U+0049)    i (U+0069)
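A minimal sketch of "lower case excluding 'I'" is shown below. The function name is made up, and only the capital letter I (U+0049) is exempted, as the slide states; how other special cases are handled in ldig is not covered here.

```python
def lower_except_i(text):
    """Lowercase every character except the capital letter I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL Is Big"))
```

Leaving 'I' untouched avoids deciding between i and ı before the language is known.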
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of characters for s and t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, Twitter and Wikipedia
• The two kinds have the same glyph design in some fonts
  – Indistinguishable
  – ș (U+0219) vs ş (U+015F)
  – ț (U+021B) vs ţ (U+0163)
Romanian Character Affairs on PCs
• Although Romanian orthography specifies that 's/t with comma' must be used, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  – ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ さ)
Arabic Character Normalization (in language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
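Both substitute-character rules (Romanian s/t and Farsi yeh) amount to a fixed character mapping. A sketch using Python's `str.translate` follows; the table and function names are made up for the example, and the real libraries may implement the mapping differently.

```python
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Map each character to the variant that is more common in real-world text."""
    return text.translate(SUBSTITUTES)


print(normalize_substitutes("cuno\u0219tin\u021b\u0103"))
```

Mapping toward the more frequent variant means training and test text end up with the same code points regardless of which keyboard or charset produced them.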
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Combining Diacritical Marks
    • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
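One way to fold representation 2 into representation 1 is Unicode NFC normalization, sketched below with the standard `unicodedata` module. Using NFC here is an assumption made for illustration; the slide does not say whether ldig uses NFC or its own conversion table.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a with breve followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(composed), f"U+{ord(composed):04X}")
```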
CJK-Kanji Normalization (1) (in language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (in language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japan and China
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
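The removal step can be sketched with a few regular expressions. All patterns and names below are assumptions written for this example; they are not ldig's actual rules and cover only the simplest forms of each item.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+:?")
RETWEET = re.compile(r"\bRT\b:?")
# Letter-based emoticons such as XD, :p, ;D standing alone between spaces
EMOTICON = re.compile(r"(?<!\S)(?:[xX]D+|[:;]-?[pPdD()])(?!\S)")


def clean_tweet(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION_OR_HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


print(clean_tweet("RT @shuyo: check http://example.com #langdetect so cool XD"))
```

URLs are removed first so that a "#" fragment inside a link is not mistaken for a hashtag.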
Normalization for Twitter-Specific Representations
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats more than 3 times in a row, shrink it to 2
  – No language's orthography has more than 3 consecutive identical Latin letters
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection anyway
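Case 2 is a one-line regular expression. The sketch below assumes that a run of three or more identical letters is shrunk to two (the slide's "more than 3" could also be read as four or more); the names are made up for the example.

```python
import re

# A letter (any script, no digits or underscore) followed by 2+ copies of itself
REPEATED = re.compile(r"([^\W\d_])\1{2,}")


def shrink_repeats(text):
    """Shrink any run of three or more identical letters to exactly two."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

The result "cooll" is not a dictionary word, but that does not matter here: the aim is only to map all lengthened variants of a word to the same feature string.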
Laugh Normalization
• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
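Laugh shrinking can also be approximated with a backreference pattern. This is a hypothetical sketch: the consonant set (h, j, k) and vowel set are guesses based on the examples above, and the input is assumed to be lowercased already.

```python
import re

# One laugh syllable (consonant run + vowel run) repeated at least once more
LAUGH = re.compile(r"([hjk])+([aeio])+(\1+\2+)+")


def normalize_laugh(text):
    """Shrink repeated laugh syllables such as 'hahahha' to two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jejejeje lol", "kekeke"):
    print(normalize_laugh(sample))
```

Allowing repeated consonants inside the pattern lets mistyped laughs like "hahahha" collapse to the same form as clean ones.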
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1): Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output: "[feature]\t[frequency]"
Usage (2): Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
Usage (3): Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4): Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
• Examples
  – en u should just enjoy ur vacation sadly
  – en D im online but you arent RT that much
  – en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  – ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
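Reading one record of this tab-separated format could look like the following. The function name is made up, and treating the first field as the label, the last field as the text and everything in between as metadata is an assumption based on the Catalan example, which carries several metadata fields.

```python
def parse_record(line):
    """Split a tab-separated record into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:-1], fields[-1]


# "123" is a hypothetical metadata value used only for this example
label, meta, text = parse_record("en\t123\tu should just enjoy ur vacation\n")
print(label, meta, text)
```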
Usage (5): Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters
Evaluation
• LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
• Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
language         size  detect  correct  precision  recall  LD53  LDsm
ca Catalan       5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech         7681    7668     7663      99.93   99.77  96.3  99.7
da Danish        5516    5472     5310      97.04   96.27  94.5  92.4
de German       10060   10069    10006      99.37   99.46  86.6  93.8
en English      10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish      10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish       7051    7038     7024      99.80   99.62  98.9  99.6
fr French       10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian     4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian   10178   10225    10160      99.36   99.82  89.7  98.9
it Italian      10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch        10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian     8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish       10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese   10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian      5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish      10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish      10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese   10487   10480    10474      99.94   99.88  98.7  99.2
total          166711       -   165053          -   99.01  92.2  97.4
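The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. A short check against the Catalan row follows; the function name is made up, and the formulas are inferred from the numbers rather than stated on the slide.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals."""
    precision = round(100 * correct / detect, 2)
    recall = round(100 * correct / size, 2)
    return precision, recall


# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(size=5093, detect=4923, correct=4857))
```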
Evaluation on the LIGA Dataset
• Evaluate using the LIGA [Tromp+ 11] dataset of 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language     size  detect  correct  precision  recall
de German    1479    1476     1469       99.5    99.3
en English   1505    1502     1490       99.2    99.0
es Spanish   1562    1548     1541       99.6    98.7
fr French    1551    1549     1540       99.4    99.3
it Italian   1539    1531     1528       99.8    99.3
nl Dutch     1430    1429     1424       99.7    99.6
total        9066       -     8992          -    99.2
Evaluation on the Europarl Dataset
• ldig is evaluated only on the languages it supports ("-" = not supported)
                       ldig            langdetect      CLD
language        size   correct  rate   correct  rate   correct  rate
bg Bulgarian    1000         -     -       988  98.8       991  99.1
cs Czech        1000      1000 100.0       994  99.4       995  99.5
da Danish       1000       976  97.6       968  96.8       932  93.2
de German       1000       999  99.9       998  99.8      1000 100.0
el Greek        1000         -     -      1000 100.0      1000 100.0
en English      1000       999  99.9       996  99.6      1000 100.0
es Spanish      1000      1000 100.0       996  99.6       989  98.9
et Estonian     1000         -     -       996  99.6       998  99.8
fi Finnish      1000       997  99.7       998  99.8      1000 100.0
fr French       1000       999  99.9       999  99.9       992  99.2
hu Hungarian    1000      1000 100.0       999  99.9       999  99.9
it Italian      1000       999  99.9       999  99.9       996  99.6
lt Lithuanian   1000         -     -       997  99.7       999  99.9
lv Latvian      1000         -     -       999  99.9       998  99.8
nl Dutch        1000      1000 100.0       974  97.4       995  99.5
pl Polish       1000       998  99.8       999  99.9       997  99.7
pt Portuguese   1000       995  99.5       996  99.6       989  98.9
ro Romanian     1000      1000 100.0       999  99.9       998  99.8
sk Slovak       1000         -     -       988  98.8       990  99.0
sl Slovene      1000         -     -       976  97.6       963  96.3
sv Swedish      1000       995  99.5       991  99.1       993  99.3
total          21000     13957  99.7     20850  99.3     20814  99.1
Conclusions
• A language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy when trained on the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular in da and no)
• If metadata is added to the features, the precision will rise further
  – Open question: how to add and train metadata at low cost
• The model should be shrunk without loss of precision
  – It is too large for applications (> 100MB)
References
• [中谷 (Nakatani) NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
In What Language
bull Ik kan er nooit tegen als mensen me negeren
bull Aha ich seh angeblich suumlszlig aus
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli
bull Ah Tak Saring skal jeg bare finde ud af hvordan
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk )
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen
bull Ccedilok doğru En buumlyuumlk hatayı yaptım
bull Icircncacircntat de cunoștință
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 6
Hints
bull Dutch if there is ik
bull German if there is ich or a letter szlig
bull Polish if there is czy or letters Ł ń ś or ź
bull Scandinavian if there is a letter aring
ndash Danish if there is af Tak means thanks
ndash Norwegian if there is nei Takk means thanks
ndash Swedish if there is och Tack means thanks
bull Turkish if there is a letter ı ( i without point) or ğ
bull Romanian if there is a letter ă or ș or ț
ndash Although ă is also used in Vietnamese it is easy to distinguish them
ndash Although ş is also used in Turkish it is easy to distinguish them
bull Vietnamese if there are many unreadable letters on WinXP P
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 7
In What Language (Solution)
bull Ik kan er nooit tegen als mensen me negeren Dutch
bull Aha ich seh angeblich suumlszlig aus German
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli Polish
bull Ah Tak Saring skal jeg bare finde ud af hvordan Danish
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk ) Norwegian
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen Swedish
bull Ccedilok doğru En buumlyuumlk hatayı yaptım Turkish
bull Icircncacircntat de cunoștință Rumanian
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung Vietnamese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 8
Whats Language Detection
bull To detect what language the input text written in
ndash Time fries like arrow rarr English
ndash Buona sera rarr Italian
bull It is prior for many language processing tasks
ndash Language model is built for each language
ndash Text search classification extraction translation
bull It is possible to detect for long enough and noiseless text with more than 99 accuracy [Cavnar+ 94]
ndash 3-gram model is used in many methods
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 9
SPAM or not
bull It is necessary to know that it is written in Polish
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 10
Document Categorization with Naive Bayes Classifier
bull Categorize a document 119883 = (119883119894) into category 119862119896
ndash A document 119883 is represented as collection of words 119883119894 (bag-of-words)
bull Word probability assumes conditionally independent on each category
ndash 119901 119883 119862119896 = 119901 119883119894 119862k119894 (from independent hypothesis)
ndash where 119901(119883119894|119862) rate of word frequency for category
bull Estimate the category 119862k to maximize posterior
ndash 119901 119862k 119883 =119901 119883 119862k 119901 119862k
119901 119883prop 119901(119862k) 119901(119883119894|119862k)119894
ndash where 119901(119862k) prior for category
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 11
Language Detection with Naive Bayes Classifier
bull Document categorization with language
labels
ndash Categorize documents into English Japanese
and so on
bull Use character n-gram as features
ndash Unicode code point n-gram strictly speaking
ndash Assume character encoding of the document is
already known
bull Most applications know encoding of inside text data
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 12
Why Use n-Gram to Detect Language
bull Each language has proper characters and spelling rules
ndash ldquoeacuterdquo is often used in Spanish Italian and so on but not in English in principle
ndash There are many words which start with ldquoZrdquo in German but not in English
ndash There are many words which start with ldquoCrdquo in English but not in German
ndash Spelling ldquoThrdquo is often used in English but not in the other languages
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 13
T h i s T h i s larr1-gram
T Th hi is s larr2-gram
Th Thi his is larr3-gram
C L Z Th
English 075 047 002 074
German 010 037 053 003
French 038 069 001 001
language-detection(langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash httpcodegooglecomplanguage-detection
ndash Apache License 20
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + Feature sampling
bull 99 over precision for 53 languages
ndash Training with Wikipedia abstract
ndash Widely support including Asian languages
ndash Adopted by Apache Solr
Evaluation with News Text
• Test on news text crawled from the web in 49 languages

Language                    size  correct  accuracy
af Afrikaans                 200      199    99.50%
ar Arabic                    200      200   100.00%
bg Bulgarian                 200      200   100.00%
bn Bengali                   200      200   100.00%
cs Czech                     200      200   100.00%
da Danish                    200      179    89.50%
de German                    200      200   100.00%
el Greek                     200      200   100.00%
en English                   200      200   100.00%
es Spanish                   200      200   100.00%
fa Persian                   200      200   100.00%
fi Finnish                   200      200   100.00%
fr French                    200      200   100.00%
gu Gujarati                  200      200   100.00%
he Hebrew                    200      200   100.00%
hi Hindi                     200      200   100.00%
hr Croatian                  200      200   100.00%
hu Hungarian                 200      200   100.00%
id Indonesian                200      200   100.00%
it Italian                   200      200   100.00%
ja Japanese                  200      200   100.00%
kn Kannada                   200      200   100.00%
ko Korean                    200      200   100.00%
mk Macedonian                200      200   100.00%
ml Malayalam                 200      200   100.00%
mr Marathi                   200      200   100.00%
ne Nepali                    200      200   100.00%
nl Dutch                     200      200   100.00%
no Norwegian                 200      199    99.50%
pa Punjabi                   200      200   100.00%
pl Polish                    200      200   100.00%
pt Portuguese                200      200   100.00%
ro Romanian                  200      200   100.00%
ru Russian                   200      200   100.00%
sk Slovak                    200      200   100.00%
so Somali                    200      200   100.00%
sq Albanian                  200      200   100.00%
sv Swedish                   200      200   100.00%
sw Swahili                   200      200   100.00%
ta Tamil                     200      200   100.00%
te Telugu                    200      200   100.00%
th Thai                      200      200   100.00%
tl Tagalog                   200      200   100.00%
tr Turkish                   200      200   100.00%
uk Ukrainian                 200      200   100.00%
ur Urdu                      200      200   100.00%
vi Vietnamese                200      200   100.00%
zh-cn Simplified Chinese     200      200   100.00%
zh-tw Traditional Chinese    200      200   100.00%
total                       9800     9777    99.77%
Evaluation with Europarl Datasets
• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – From the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language        size  correct  accuracy
bg Bulgarian    1000      988     98.8%
cs Czech        1000      994     99.4%
da Danish       1000      968     96.8%
de German       1000      998     99.8%
el Greek        1000     1000    100.0%
en English      1000      996     99.6%
es Spanish      1000      996     99.6%
et Estonian     1000      996     99.6%
fi Finnish      1000      998     99.8%
fr French       1000      999     99.9%
hu Hungarian    1000      999     99.9%
it Italian      1000      999     99.9%
lt Lithuanian   1000      997     99.7%
lv Latvian      1000      999     99.9%
nl Dutch        1000      974     97.4%
pl Polish       1000      999     99.9%
pt Portuguese   1000      996     99.6%
ro Romanian     1000      999     99.9%
sk Slovak       1000      988     98.8%
sl Slovene      1000      976     97.6%
sv Swedish      1000      991     99.1%
total          21000    20850     99.3%
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is counted as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):
language        LD     CLD    Tika
ca Catalan      95.3   93.0   83.8
cs Czech        96.3   96.6   ----
da Danish       94.5   90.7   58.7
de German       86.6   96.8   73.1
en English      88.3   97.4   54.7
es Spanish      91.5   90.5   44.4
fi Finnish      98.9   99.4   94.8
fr French       95.0   94.5   67.4
hu Hungarian    85.8   89.0   76.2
id Indonesian   89.7   92.8   ----
it Italian      96.2   93.8   87.1
nl Dutch        69.5   93.2   65.0
no Norwegian    96.0   74.9   68.6
pl Polish       98.0   97.8   88.8
pt Portuguese   88.0   88.6   47.4
ro Romanian     92.8   96.1   82.6
sv Swedish      96.0   96.4   75.6
tr Turkish      97.6   97.4   ----
vi Vietnamese   98.7   98.9   ----
total           92.2   93.8   70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • For 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
• A tweet is too short to yield enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a language model built from normal text

  OMG    Oh My God
  LOL    Laughing Out Loud
  LMAO   Laughing My Ass Off
  F4F    Follow for Follow
  MDR    Mort de Rire (French)
  TKT    Ne t'inquiète pas (French)
  u      you
  ur     your
  4      for
  i0u    I love you
  k      che (Italian)
  anke   anche (Italian)
  (The letter k isn't used in Italian)
Motivation to Detect Short Text Language
• There are many small chunks of text besides Twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes several languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accuracy for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to build one
  – A corpus of about 700K tweets with language labels
Proposed Method
How to Get More Features than 3-grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of substrings is O(T^2)

n    # of n-grams
     freq ≥ 1   freq ≥ 2   freq ≥ 10
1          79         72          57
2        1896       1533         902
3       15970      10369        4525
4       64966      33941       10534
5      167543      69719       15538
6      323749     107861       18970
7      524634     142954       21093
8      760719     171995       22159
9      921361     193995       22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
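A table of this kind can be computed by brute force on a small corpus. The sketch below is illustrative only: `cumulative_feature_counts` is a hypothetical helper, and it assumes each column counts the distinct substrings of length at most n whose corpus frequency reaches the threshold.

```python
from collections import Counter


def cumulative_feature_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n
    that occur at least `min_freq` times in `texts`."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for gram, c in counts.items() if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    ]


print(cumulative_feature_counts(["abab"], 3))              # all substrings
print(cumulative_feature_counts(["abab"], 3, min_freq=2))  # frequent ones only
```

The nested loops make the quadratic growth in the text length T visible, which is why the all-substring model needs a more compact representation.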
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction
Maximal Substring (1)
• Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. abracadabra
  – “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”
  (Strictly speaking, the definition also takes the position within the substring into account.)
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, called a maximal substring
• The maximal substrings of abracadabra are a, abra and abracadabra
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
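The abracadabra example can be checked by brute force. The sketch below uses one common characterisation, stated here as an assumption: a substring is maximal when every one-character extension to the left or right occurs strictly less often than the substring itself. The function names are illustrative, and the cubic-time loop is for demonstration only, not the linear-time construction.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of `sub` in `text`."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text):
    """Return substrings that cannot be extended by one character
    on either side without losing occurrences, shortest first."""
    alphabet = set(text)
    seen = set()
    result = []
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            freq = count_occurrences(text, sub)
            extensions = [c + sub for c in alphabet] + [sub + c for c in alphabet]
            if all(count_occurrences(text, ext) < freq for ext in extensions):
                result.append(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))
```

For example, "bra" is not maximal because "abra" occurs just as often (twice), so the two always appear together.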
Maximal Substring and Infinity-Gram
• Substrings in a containment relation always have equal frequencies
• In a model that combines features linearly, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
Note: the equivalence breaks down on the test set, but we assume that a sufficiently large training set approximates it
Extended Suffix Array
• An extended suffix array (ESA) consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
• Limit the character type (script) to detect
  – In short text detection, mixed text can be split by character type
• Languages written in the Latin alphabet
  – The most difficult character type to detect
  – More than 25 of these languages have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Note
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   For composing tone symbols
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day language
U+A720-A7FF    Latin Extended-D              (same as above)
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code chart of Latin letters. Legend: used for Vietnamese only / used often / used sometimes]
How to Create the Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773
Simple Language Detection
• A language detector can be built from the maximal substring model and the Twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – Data size bias
  – Language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level the sizes by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Pie chart: share of tweets by language. Labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
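The levelling step can be sketched as follows. This is a minimal illustration, not ldig's code: the function name `level_by_oversampling`, the dict-of-lists layout and the fixed seed are assumptions, and every language is assumed to have at least one tweet.

```python
import random


def level_by_oversampling(corpus, seed=0):
    """Grow every language's list to the size of the largest one.

    corpus: dict mapping a language label to a non-empty list of texts.
    Each list is copied a whole number of times, and the remainder is
    sampled without replacement, as described on the slide.
    """
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    leveled = {}
    for label, texts in corpus.items():
        copies, rest = divmod(target, len(texts))
        leveled[label] = texts * copies + rng.sample(texts, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_by_oversampling(corpus)
print({label: len(texts) for label, texts in leveled.items()})
```

Compared with plain sampling with replacement, this keeps every original tweet in the levelled data at least once.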
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• So convert to lower case, excluding ‘I’

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
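A minimal sketch of lowercasing everything except ‘I’, assuming the rule is applied character by character (the function name is illustrative and this is not necessarily how ldig implements it):

```python
def lower_except_capital_i(text):
    """Lowercase `text` but leave 'I' (U+0049) untouched, since its
    lower case is 'ı' in Turkish/Azerbaijani and 'i' elsewhere."""
    # Note: Python lowercases 'İ' (U+0130) to 'i' followed by U+0307,
    # so a real normalizer may want to map that character explicitly.
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("HI Ankara"))
```

Keeping ‘I’ as its own symbol lets the model learn for itself which lower-case letter it stands for in each language.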
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 sets of code points for s and t with a “beard”
  – U+015E-015F, U+0162-0163: s/t with cedilla
  – U+0218-021B: s/t with comma below
• ‘s/t with cedilla’ is more common in news, on Twitter and on Wikipedia
• The 2 sets have the same design in some fonts
  – Indistinguishable

  comma below   cedilla
  ș (U+0219)    ş (U+015F)
  ț (U+021B)    ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography prescribes ‘s/t with comma’, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)
[Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
  – But they are more common than the official ones
  – with cedilla : with comma = 2 : 1
  – The “Rumanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two forms of the Japanese character ‘SA’: さ さ
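The mapping from ‘s/t with comma’ to ‘s/t with cedilla’ can be expressed as a translation table. This is a sketch under the slide's description; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are illustrative.

```python
# Map s/t with comma below (U+0218-021B) to s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below letters with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021b\u0103"))
```

After this step both spellings of a Romanian word produce identical features.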
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9 (precomposed)
    • ẵ = U+1EB5
  2. Combine with diacritical marks
    • ẵ = U+0103 U+0303
  – Both are used about half and half in news and tweets
• Normalize 2 into 1
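Python's standard Unicode NFC normalization performs this kind of composition, so it can stand in for the slide's "normalize 2 into 1" step in a sketch. Whether ldig uses NFC or its own mapping table is not stated here, so treat the choice as an assumption; the function name is illustrative.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed
    characters (Unicode normalization form NFC)."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"          # ă + combining tilde
composed = compose_vietnamese(combined)
print(len(combined), len(composed), composed == "\u1eb5")
```

NFC also composes sequences written from the plain base letter, such as a + U+0306 + U+0303.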
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists defined for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – Mentions
  – Hashtags
  – RT
  – Emoticons made of letters, like XD or :p
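A minimal sketch of this removal step with regular expressions. The patterns are illustrative assumptions, not ldig's actual rules, and emoticons are left out because they need a more careful pattern.

```python
import re

# Hypothetical patterns for the Twitter-specific tokens listed above.
TWITTER_PATTERNS = (
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"@\w+"),          # mentions
    re.compile(r"#\w+"),          # hashtags
    re.compile(r"\bRT\b"),        # retweet marker
)


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and RT, then squeeze whitespace."""
    for pattern in TWITTER_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_markup("RT @bob check http://t.co/abc #nlp so cool"))
```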
Normalization for Twitter-Specific Representation
• How should words like ‘coooooooollllll’ be handled?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • In Japanese, by contrast, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
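Case 2 fits in a single regular expression. This sketch restricts the rule to Latin letters using approximate Unicode ranges; the function name and the exact ranges are assumptions, not ldig's implementation.

```python
import re

# A Latin letter (approximate ranges) repeated three or more times in a row.
LATIN_RUN = re.compile(r"([A-Za-z\u00C0-\u024F\u1E00-\u1EFF])\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3+ identical Latin letters down to 2."""
    return LATIN_RUN.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Unlike the dictionary approach of Case 1, this does not recover "cool", but it maps all lengthened variants of a word to one form without needing a dictionary.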
Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
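One possible way to shrink laughs is sketched below. The pattern is an illustrative assumption (laughs built from h, j or k plus a vowel, in already lowercased text) and is not taken from ldig.

```python
import re

# A consonant-vowel pair such as "ha", "ja" or "ke", followed by one or
# more sloppy repetitions of the same pair ("hha", "haa", ...).
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text):
    """Reduce a repeated laugh to exactly two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jajajajajajaja", "kekeke"):
    print(sample, "->", shrink_laugh(sample))
```

The pattern tolerates typos inside the laugh, which is why "hahahha" collapses to "haha".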
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (cumulative penalty) [Tsuruoka+ 09]
  – Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
    • In it, TABs are replaced with “ ” and line feeds with U+0001
  – Output: “[feature]\t[frequency]”
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[metadata]\t[text]
  – Examples:
    en u should just enjoy ur vacation sadly
    en D im online but you arent RT that much
    en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
    ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
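A line in this format could be read as below. The sketch assumes fields are separated by tabs, with the label first, the text last and any number of metadata fields in between; `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text


label, metadata, text = parse_line("en\t\tu should just enjoy ur vacation\n")
print(label, metadata, text)
```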
Usage (5) Evaluation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after starting it
  – For text entered in the text area, it shows the language probabilities, the features found in the text and their parameters
Evaluation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of "detect" is less than the sum of "size"

language         size  detect  correct  precision  recall   LD53   LDsm
ca Catalan       5093    4923     4857      98.66   95.37   95.3   97.0
cs Czech         7681    7668     7663      99.93   99.77   96.3   99.7
da Danish        5516    5472     5310      97.04   96.27   94.5   92.4
de German       10060   10069    10006      99.37   99.46   86.6   93.8
en English      10162   10133    10029      98.97   98.69   88.3   95.0
es Spanish      10244   10284    10120      98.41   98.79   91.5   96.0
fi Finnish       7051    7038     7024      99.80   99.62   98.9   99.6
fr French       10074   10134    10051      99.18   99.77   95.0   98.1
hu Hungarian     4904    4892     4858      99.30   99.06   85.8   95.5
id Indonesian   10178   10225    10160      99.36   99.82   89.7   98.9
it Italian      10143   10205    10103      99.00   99.61   96.2   98.0
nl Dutch        10005    9916     9858      99.42   98.53   69.5   97.4
no Norwegian     8504    8432     8201      97.26   96.44   96.0   96.3
pl Polish       10151   10149    10130      99.81   99.79   98.0   99.7
pt Portuguese   10212   10201    10119      99.20   99.09   88.0   96.9
ro Romanian      5913    5867     5850      99.71   98.93   92.8   97.4
sv Swedish      10025   10093     9942      98.50   99.17   96.0   97.9
tr Turkish      10308   10317    10298      99.82   99.90   97.6   99.5
vi Vietnamese   10487   10480    10474      99.94   99.88   98.7   99.2
total          166711       -   165053          -   99.01   92.2   97.4
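The precision and recall columns follow from the counts, assuming "detect" is the number of texts assigned to a language and "correct" is how many of those are right. A small check against the Catalan row (the function name is illustrative):

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    precision = correct / detect  (of the texts labeled X, how many are X)
    recall    = correct / size    (of the texts that are X, how many were found)
    """
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row: size 5093, detect 4923, correct 4857
precision, recall = precision_recall(5093, 4923, 4857)
print(f"{precision:.2f} {recall:.2f}")  # 98.66 95.37
```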
Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066       -     8992          -    99.2
Evaluation on the Europarl Dataset
(ldig results only for the languages it supports)

                         ldig            langdetect      CLD
language        size     correct  rate   correct  rate   correct  rate
bg Bulgarian    1000           -     -       988  98.8       991  99.1
cs Czech        1000        1000 100.0       994  99.4       995  99.5
da Danish       1000         976  97.6       968  96.8       932  93.2
de German       1000         999  99.9       998  99.8      1000 100.0
el Greek        1000           -     -      1000 100.0      1000 100.0
en English      1000         999  99.9       996  99.6      1000 100.0
es Spanish      1000        1000 100.0       996  99.6       989  98.9
et Estonian     1000           -     -       996  99.6       998  99.8
fi Finnish      1000         997  99.7       998  99.8      1000 100.0
fr French       1000         999  99.9       999  99.9       992  99.2
hu Hungarian    1000        1000 100.0       999  99.9       999  99.9
it Italian      1000         999  99.9       999  99.9       996  99.6
lt Lithuanian   1000           -     -       997  99.7       999  99.9
lv Latvian      1000           -     -       999  99.9       998  99.8
nl Dutch        1000        1000 100.0       974  97.4       995  99.5
pl Polish       1000         998  99.8       999  99.9       997  99.7
pt Portuguese   1000         995  99.5       996  99.6       989  98.9
ro Romanian     1000        1000 100.0       999  99.9       998  99.8
sk Slovak       1000           -     -       988  98.8       990  99.0
sl Slovene      1000           -     -       976  97.6       963  96.3
sv Swedish      1000         995  99.5       991  99.1       993  99.3
total          21000       13957  99.7     20850  99.3     20814  99.1
Conclusions
• A language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision will rise further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (> 100MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese; original title: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Hints
bull Dutch if there is ik
bull German if there is ich or a letter szlig
bull Polish if there is czy or letters Ł ń ś or ź
bull Scandinavian if there is a letter aring
ndash Danish if there is af Tak means thanks
ndash Norwegian if there is nei Takk means thanks
ndash Swedish if there is och Tack means thanks
bull Turkish if there is a letter ı ( i without point) or ğ
bull Romanian if there is a letter ă or ș or ț
ndash Although ă is also used in Vietnamese it is easy to distinguish them
ndash Although ş is also used in Turkish it is easy to distinguish them
bull Vietnamese if there are many unreadable letters on WinXP P
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 7
In What Language (Solution)
bull Ik kan er nooit tegen als mensen me negeren Dutch
bull Aha ich seh angeblich suumlszlig aus German
bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli Polish
bull Ah Tak Saring skal jeg bare finde ud af hvordan Danish
bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk ) Norwegian
bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen Swedish
bull Ccedilok doğru En buumlyuumlk hatayı yaptım Turkish
bull Icircncacircntat de cunoștință Rumanian
bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung Vietnamese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 8
Whats Language Detection
bull To detect what language the input text written in
ndash Time fries like arrow rarr English
ndash Buona sera rarr Italian
bull It is prior for many language processing tasks
ndash Language model is built for each language
ndash Text search classification extraction translation
bull It is possible to detect for long enough and noiseless text with more than 99 accuracy [Cavnar+ 94]
ndash 3-gram model is used in many methods
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 9
SPAM or not
bull It is necessary to know that it is written in Polish
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 10
Document Categorization with Naive Bayes Classifier
bull Categorize a document 119883 = (119883119894) into category 119862119896
ndash A document 119883 is represented as collection of words 119883119894 (bag-of-words)
bull Word probability assumes conditionally independent on each category
ndash 119901 119883 119862119896 = 119901 119883119894 119862k119894 (from independent hypothesis)
ndash where 119901(119883119894|119862) rate of word frequency for category
bull Estimate the category 119862k to maximize posterior
ndash 119901 119862k 119883 =119901 119883 119862k 119901 119862k
119901 119883prop 119901(119862k) 119901(119883119894|119862k)119894
ndash where 119901(119862k) prior for category
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 11
Language Detection with Naive Bayes Classifier
bull Document categorization with language
labels
ndash Categorize documents into English Japanese
and so on
bull Use character n-gram as features
ndash Unicode code point n-gram strictly speaking
ndash Assume character encoding of the document is
already known
bull Most applications know encoding of inside text data
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 12
Why Use n-Gram to Detect Language
bull Each language has proper characters and spelling rules
ndash ldquoeacuterdquo is often used in Spanish Italian and so on but not in English in principle
ndash There are many words which start with ldquoZrdquo in German but not in English
ndash There are many words which start with ldquoCrdquo in English but not in German
ndash Spelling ldquoThrdquo is often used in English but not in the other languages
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 13
T h i s T h i s larr1-gram
T Th hi is s larr2-gram
Th Thi his is larr3-gram
C L Z Th
English 075 047 002 074
German 010 037 053 003
French 038 069 001 001
language-detection(langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash httpcodegooglecomplanguage-detection
ndash Apache License 20
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + Feature sampling
bull 99 over precision for 53 languages
ndash Training with Wikipedia abstract
ndash Widely support including Asian languages
ndash Adopted by Apache Solr
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 14
Estimation with News Text
bull Test for crawled news text from web in 49 languages Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 15
Language size accuracyaf Afrikaans 200 199 (9950)ar Arabic 200 200 (10000)bg Bulgarian 200 200 (10000)bn Bengali 200 200 (10000)cs Czech 200 200 (10000)da Dannish 200 179 (8950)de German 200 200 (10000)el Greek 200 200 (10000)en English 200 200 (10000)es Spanish 200 200 (10000)fa Persian 200 200 (10000)fi Finnish 200 200 (10000)fr French 200 200 (10000)gu Gujarati 200 200 (10000)he Hebrew 200 200 (10000)hi Hindi 200 200 (10000)hr Croatian 200 200 (10000)hu Hungarian 200 200 (10000)id Indonesian 200 200 (10000)it Italian 200 200 (10000)ja Japanese 200 200 (10000)kn Kannada 200 200 (10000)ko Korean 200 200 (10000)mk Macedonian 200 200 (10000)ml Malayalam 200 200 (10000)
Language size accuracymr Marathi 200 200 (10000)ne Nepali 200 200 (10000)nl Dutch 200 200 (10000)no Norwegian 200 199 (9950)pa Punjabi 200 200 (10000)pl Polish 200 200 (10000)pt Portuguese 200 200 (10000)ro Romanian 200 200 (10000)ru Russian 200 200 (10000)sk Slovak 200 200 (10000)so Somali 200 200 (10000)sq Albanian 200 200 (10000)sv Swedish 200 200 (10000)sw Swahili 200 200 (10000)ta Tamil 200 200 (10000)te Telugu 200 200 (10000)th Thai 200 200 (10000)tl Tagalog 200 200 (10000)tr Turkish 200 200 (10000)uk Ukrainian 200 200 (10000)ur Urdu 200 200 (10000)vi Vietnamese 200 200 (10000)
zh-cn Simplified Chinese 200 200 (10000)zh-tw Traditional Chinese 200 200 (10000)
total 9800 9777 (9977)
Estimation with Europarl datasets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 16
bull Test for 1000 samples for each
language from Europarl Parallel Corpus
ndash from the proceedings of the European Parliament
ndash httpwwwstatmtorgeuroparl
bull httpcodegooglecomplanguage-
detectiondownloadsdetailname=eur
oparl-testzip
language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991
total 21000 20850 993
Language Detection has been over isnt it
17
We still have ENEMY to beat
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 18
Twitter Language Detection with the Existing Methods
bull Only 90-95 accuracy
for tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language
Detection
ndash httpcodegooglecompchromium-
compact-language-detector
ndash regard ms(Malay) as id(Indonesian)
bull Tika = Apache Tika
ndash httptikaapacheorg
ndash Estimate on 15 languages which Tika
supports in our tweet corpus
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 19
language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----
total 922 938 700
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  - http://code.google.com/p/chromium-compact-language-detector/
  - Implemented in C++, with a Python binding
  - # of supported languages: CLD = 76, langdetect = 53
  - Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
• A tweet is too short to extract enough 3-gram features
  - At most 140 characters on twitter
  - URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  - Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
• Tweets are too noisy
  - Spellings that violate the language's orthography often appear
  - Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be lower under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
(The letter k isn't used in Italian)
Motivation to Detect Short Text Language
• There are many small chunks of text besides twitter
  - Schedules, search queries, bulletin boards and so on
  - There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  - Cut the target document into paragraphs or lines
  - Detect the language of each short text
Our Goal
• Over 99% accuracy
  - However, it is too difficult to detect the language of a one-word sentence
  - Our goal is 99+% accurate detection for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  - Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  - a corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  - But the number of substrings is of order O(T^2)

     # of n-grams
n    freq≥1    freq≥2    freq≥10
1        79        72        57
2      1896      1533       902
3     15970     10369      4525
4     64966     33941     10534
5    167543     69719     15538
6    323749    107861     18970
7    524634    142954     21093
8    760719    171995     22159
9    921361    193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  - Maximal substrings give an equivalent model that can be constructed in linear time
  - Features are stored in a TRIE for fast prediction
Maximal Substring (1)
• Define a containment relation (a semi-order) among the non-empty substrings of a text, e.g. "abracadabra"
  - "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  - "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  - Strictly, the relation is defined together with the position within the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
(via http://d.hatena.ne.jp/nokuno/20120203/1328237067)
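The "abracadabra" example can be checked with a brute-force sketch. It assumes one reading of the definition: a substring is maximal when no one-character extension, to the left or to the right, occurs exactly as often as the substring itself. The function names are illustrative only, and the quadratic enumeration is for small examples, not the linear-time construction used in practice.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def maximal_substrings(text):
    """Brute-force maximal substrings (assumed definition, see lead-in)."""
    substrings = {text[i:j]
                  for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    alphabet = set(text)
    result = []
    for sub in substrings:
        freq = count_occurrences(text, sub)
        # If some extension is exactly as frequent, sub always appears
        # inside that longer string, so sub is not maximal.
        absorbed = any(
            count_occurrences(text, ch + sub) == freq
            or count_occurrences(text, sub + ch) == freq
            for ch in alphabet
        )
        if not absorbed:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```

Under this reading, "bra" is dropped because it occurs exactly as often as "abra", which matches the slide's list.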
Maximal Substring and Infinity-Gram
• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
• The equivalence does not hold exactly on a test set, but we assume that a sufficiently large training set approximates it
Extended Suffix Array
• An Extended Suffix Array (ESA) consists of
  - SA = Suffix Array
  - L = Longest Common Prefixes
  - B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a range of suffixes with L>0 whose BWT has more than 1 character type
  - They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  - http://code.google.com/p/esaxx/
(via [Okanohara+ 09])
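To make the three arrays concrete, here is a naive sketch that builds SA, L and B by plain sorting and then applies the criterion above (common prefix longer than 0, more than one character type in the BWT) to list the repeated maximal substrings. It is not esaxx and not linear-time; the function names and the "$" end marker are assumptions for illustration.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = common prefix length of the suffixes at sa[k] and sa[k + 1]."""
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


def bwt(text, sa):
    """Character preceding each sorted suffix (wraps around at position 0)."""
    return "".join(text[i - 1] for i in sa)


def repeated_maximal_substrings(text):
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    b = bwt(text, sa)
    found = set()
    for k, depth in enumerate(lcp):
        if depth == 0:
            continue
        # Widen to every suffix that shares this prefix of length `depth`.
        lo = k
        while lo > 0 and lcp[lo - 1] >= depth:
            lo -= 1
        hi = k + 1
        while hi < len(lcp) and lcp[hi] >= depth:
            hi += 1
        # More than one preceding character: cannot be extended to the left.
        if len(set(b[lo:hi + 1])) > 1:
            found.add(text[sa[k]:sa[k] + depth])
    return sorted(found, key=len)


text = "abracadabra$"  # "$" is an assumed end marker, smaller than any letter
print(suffix_array(text))
print(bwt(text, suffix_array(text)))
print(repeated_maximal_substrings(text))
```

For this input the sketch reports "a" and "abra"; the whole string "abracadabra" is also maximal but occurs only once, so it is outside the "occurs more than once" criterion.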
Corpus and Normalization
Target Languages
• Limit the character types to detect
  - In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  - The most difficult alphabet type to detect
  - There are more than 25 languages with over 5 million speakers
What's Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  - å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart. Legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  - It samples 1% of all tweets (about 2 million tweets)
  - Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  - French tweets account for only about 1% of all tweets
  - But they account for 50% of the tweets in the Paris timezone
  - (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  - Remove non-French tweets from those labeled 'fr'
  - Recover French tweets from those not labeled 'fr'
How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language         training     test
ca Catalan           9089     5082
cs Czech             9082     7682
da Danish            7388     5524
de German           44448    10065
en English          44520    10168
es Spanish          44118    10265
fi Finnish           8087     7050
fr French           44339    10098
hu Hungarian        10030     4904
id Indonesian       44722    10181
it Italian          43366    10152
nl Dutch            44682    10007
no Norwegian        10124     8496
pl Polish           16771    10152
pt Portuguese       44215    10208
ro Romanian         10021     5911
sv Swedish          44054    10032
tr Turkish          44703    10308
vi Vietnamese       15030    10488
total              538789   166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  - It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  - data size bias
  - language-specific bias
  - twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest language
  - In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
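The approximation described above (copy whole multiples, then sample the remainder without replacement) could look like the following sketch. The function name, the dictionary layout and the fixed seed are assumptions for illustration; every language is assumed to have at least one tweet.

```python
import random


def level_out(corpus, seed=0):
    """Oversample every language up to the size of the largest one.

    corpus: dict mapping a language code to a non-empty list of tweets.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # Whole copies first, then the remainder sampled without replacement.
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_out(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```

Copying whole multiples keeps every original tweet represented equally often, which plain sampling with replacement only achieves on average.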
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the amount of corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

                        Upper case    Lower case
Turkish, Azerbaijani    I (U+0049)    ı (U+0131)
                        İ (U+0130)    i (U+0069)
Others                  I (U+0049)    i (U+0069)
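A minimal sketch of "lower case excluding 'I'" follows. The slide only names 'I'; also leaving İ (U+0130) untouched is an assumption made here, because Python's default lowercasing of İ yields two code points. The function name is illustrative.

```python
# Characters left as they are. U+0049 comes from the slide; U+0130 is an
# added assumption (see the lead-in).
KEEP_AS_IS = {"\u0049", "\u0130"}


def lower_except_i(text):
    """Lowercase every character except the ambiguous capital I forms."""
    return "".join(ch if ch in KEEP_AS_IS else ch.lower() for ch in text)


print(lower_except_i("I AM IN \u0130stanbul"))
```

Keeping 'I' unchanged means the model sees the same code point in Turkish and non-Turkish text instead of a guess at its lower-case form.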
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  - U+015E-F, U+0162-3: s/t with cedilla
  - U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
  - Indistinguishable

comma below    cedilla
ș U+0219       ş U+015F
ț U+021B       ţ U+0163
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  - 1989: Democratization of Romania
  - 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  - But they are more popular than the correct ones
  - with cedilla : with comma = 2 : 1
  - The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  - ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ さ)
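The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. A minimal sketch, with an illustrative function name:

```python
# Comma-below letters (U+0218-U+021B) mapped to their cedilla counterparts.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Fold both spellings of Romanian s/t onto the cedilla code points."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("cuno\u0219tin\u021b\u0103"))
```

Folding toward the cedilla forms follows the slide's choice of the more frequent variant, so less of the corpus needs rewriting.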
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  - Wikipedia uses ی (U+06CC, Farsi yeh) only
  - News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
  - The popular Arabic charset CP-1256 has no character mapped to U+06CC
  - As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  - a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  - a ả à ã á ạ
  - These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  - 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  - The two are used half and half in news and tweets
• Normalize 2 into 1
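One way to normalize representation 2 into representation 1 is Unicode canonical composition (NFC). This is a sketch of the idea using Python's standard library, not necessarily how ldig does it; the function name is illustrative.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a-breve followed by a combining tilde
composed = compose_vietnamese(decomposed)
print([f"U+{ord(ch):04X}" for ch in composed])
```

After composition both spellings of a toned vowel share one code point, so they contribute to the same substring features.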
CJK-Kanji Normalization (1) (on language-detection)
• CJK Kanji has too many characters (more than 20K)
  - Other character types have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus have no probability
    • e.g. 谢谢, Kanji for person names
  - Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  - (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first standard (2965) = 2998
      - 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  - URLs
  - mentions
  - hashtags
  - RT
  - emoticons written with letters, like XD or :p
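The removals listed above can be sketched with a few regular expressions. The patterns below are simplified assumptions for illustration (for example, only a handful of letter-based emoticons are covered) and are not taken from ldig.

```python
import re

# Simplified, assumed patterns for the items listed on the slide.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
EMOTICON = re.compile(r"(?<!\w)(?:[:;=]-?[DPp]|[xX]D+)(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the gaps left behind


print(strip_twitter_noise("RT @bob check http://t.co/abc #fun XD great"))
```

URLs are removed first so that the later patterns do not match fragments inside them.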
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  - Unsupervised normalization like coooollll → cool
  - It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated 3 or more times in a row, shrink the run to 2
  - No language's orthography repeats the same Latin letter 3 or more times in a row
    • In Japanese, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection anyway
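Case 2 fits in a single regular expression. The sketch below reads the rule as "a run of 3 or more identical Latin letters becomes 2" and restricts it to an assumed Latin range (A-Z, a-z, U+00C0 to U+024F) so that Japanese text is left alone; the function name is illustrative.

```python
import re

# A Latin letter followed by two or more copies of itself (assumed range).
REPEATED_LATIN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3 or more identical Latin letters down to 2."""
    return REPEATED_LATIN.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, this does not recover "cool", but it maps every lengthened variant of a word to the same string.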
Laugh Normalization
• Each language has its own ways of writing laughter
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi ) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to a double repetition
  - hahahha ⇒ haha
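A rough sketch of laugh shrinking follows. It assumes a laugh is a whole word made of repeated syllables built from the consonants h, j, k and one vowel out of a, e, i, which covers the examples above but is an illustrative guess rather than ldig's actual rule; real words with that shape would be shrunk too.

```python
import re

# Assumed laugh shape: consonant(s) + vowel, then the same vowel again
# after further consonant(s), repeated at least once.
LAUGH = re.compile(r"\b([hjk])[hjk]*([aei])(?:[hjk]+\2)+[hjk]*\b", re.IGNORECASE)


def shrink_laugh(text):
    """Replace a laugh word by its first syllable written twice."""
    return LAUGH.sub(lambda m: (m.group(1) + m.group(2)) * 2, text)


print(shrink_laugh("hahahha"))
print(shrink_laugh("tafil con eso jajajajajajaja"))
```

Requiring the same vowel in every syllable keeps words such as "kaki" out of the pattern.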
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  - Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  - Extract features from the corpus and initialize the model
  - -m: model directory
  - -x: path of the maximal substring extractor (executed as an external process)
  - --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  - Input: multi-line text
    • TABs are replaced with " " and line feeds with U+0001
  - Output: "[feature]\t[frequency]" per line
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Learn the model from the corpus in 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: coefficient of L1 regularization
  - --wr: number of times to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  - Remove useless features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  - Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  - [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
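A line in this format can be split as in the sketch below. It assumes that fields are TAB-separated, that the first field is the label, the last is the text, and everything in between is metadata (as the Catalan example suggests). The `Sample` type and function name are hypothetical, not part of ldig.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    label: str
    metadata: List[str]
    text: str


def parse_line(line):
    """Split one corpus line into label, metadata fields and text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Sample(label=fields[0], metadata=fields[1:-1], text=fields[-1])


sample = parse_line("en\t\tu should just enjoy ur vacation sadly\n")
print(sample.label, sample.text)
```

Treating all middle fields as metadata lets the same parser read both the bare English examples and the richer Catalan line.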
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  - Open http://localhost:[port]/ after it is started
  - For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters
Estimation
• LD53 = langdetect + standard bundled profiles
• LDsm = langdetect + profiles based on the twitter corpus
• As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language          size   detect  correct  precision  recall   LD53    LDsm
ca Catalan        5093     4923     4857    98.66%   95.37%   95.3%   97.0%
cs Czech          7681     7668     7663    99.93%   99.77%   96.3%   99.7%
da Danish         5516     5472     5310    97.04%   96.27%   94.5%   92.4%
de German        10060    10069    10006    99.37%   99.46%   86.6%   93.8%
en English       10162    10133    10029    98.97%   98.69%   88.3%   95.0%
es Spanish       10244    10284    10120    98.41%   98.79%   91.5%   96.0%
fi Finnish        7051     7038     7024    99.80%   99.62%   98.9%   99.6%
fr French        10074    10134    10051    99.18%   99.77%   95.0%   98.1%
hu Hungarian      4904     4892     4858    99.30%   99.06%   85.8%   95.5%
id Indonesian    10178    10225    10160    99.36%   99.82%   89.7%   98.9%
it Italian       10143    10205    10103    99.00%   99.61%   96.2%   98.0%
nl Dutch         10005     9916     9858    99.42%   98.53%   69.5%   97.4%
no Norwegian      8504     8432     8201    97.26%   96.44%   96.0%   96.3%
pl Polish        10151    10149    10130    99.81%   99.79%   98.0%   99.7%
pt Portuguese    10212    10201    10119    99.20%   99.09%   88.0%   96.9%
ro Romanian       5913     5867     5850    99.71%   98.93%   92.8%   97.4%
sv Swedish       10025    10093     9942    98.50%   99.17%   96.0%   97.9%
tr Turkish       10308    10317    10298    99.82%   99.90%   97.6%   99.5%
vi Vietnamese    10487    10480    10474    99.94%   99.88%   98.7%   99.2%
total           166711        -   165053        -    99.01%   92.2%   97.4%
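The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, and the "undetectable" rule is a threshold on the maximum probability. A small sketch of both, with illustrative function names:

```python
def precision_recall(size, detect, correct):
    """Precision and recall in percent from the table's three counts."""
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None below the threshold."""
    lang, p = max(probabilities.items(), key=lambda item: item[1])
    return lang if p >= threshold else None


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(round(precision, 2), round(recall, 2))  # Catalan row of the table
print(decide({"da": 0.5, "no": 0.5}))
```

Texts rejected by the threshold lower recall but do not count against precision for the language they belong to.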
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language      size   detect  correct  precision  recall
de German     1479     1476     1469     99.5%    99.3%
en English    1505     1502     1490     99.2%    99.0%
es Spanish    1562     1548     1541     99.6%    98.7%
fr French     1551     1549     1540     99.4%    99.3%
it Italian    1539     1531     1528     99.8%    99.3%
nl Dutch      1430     1429     1424     99.7%    99.6%
total         9066        -     8992        -     99.2%
Estimation for Europarl Dataset
• ldig is evaluated only on the languages it supports

                        ldig              langdetect        CLD
language         size   correct  rate     correct  rate     correct  rate
bg Bulgarian     1000         -       -       988   98.8%       991   99.1%
cs Czech         1000      1000  100.0%       994   99.4%       995   99.5%
da Danish        1000       976   97.6%       968   96.8%       932   93.2%
de German        1000       999   99.9%       998   99.8%      1000  100.0%
el Greek         1000         -       -      1000  100.0%      1000  100.0%
en English       1000       999   99.9%       996   99.6%      1000  100.0%
es Spanish       1000      1000  100.0%       996   99.6%       989   98.9%
et Estonian      1000         -       -       996   99.6%       998   99.8%
fi Finnish       1000       997   99.7%       998   99.8%      1000  100.0%
fr French        1000       999   99.9%       999   99.9%       992   99.2%
hu Hungarian     1000      1000  100.0%       999   99.9%       999   99.9%
it Italian       1000       999   99.9%       999   99.9%       996   99.6%
lt Lithuanian    1000         -       -       997   99.7%       999   99.9%
lv Latvian       1000         -       -       999   99.9%       998   99.8%
nl Dutch         1000      1000  100.0%       974   97.4%       995   99.5%
pl Polish        1000       998   99.8%       999   99.9%       997   99.7%
pt Portuguese    1000       995   99.5%       996   99.6%       989   98.9%
ro Romanian      1000      1000  100.0%       999   99.9%       998   99.8%
sk Slovak        1000         -       -       988   98.8%       990   99.0%
sl Slovene       1000         -       -       976   97.6%       963   96.3%
sv Swedish       1000       995   99.5%       991   99.1%       993   99.3%
total           21000     13957   99.7%     20850   99.3%     20814   99.1%
Conclusions
• Language detector using the maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy when trained on the tweet corpus
• If the corpus is maintained, the precision will rise further
  - There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  - How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  - It is too large for applications (>100MB)
References
• [Nakatani NLP12] Language Detection for twitter Using Maximal Substrings (in Japanese; original title: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli → Polish
• Ah Tak Så skal jeg bare finde ud af hvordan → Danish
• Det er ikke så digg nei å vi som har finale til helgaSkrekk og gru Takk ) → Norwegian
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen → Swedish
• Çok doğru En büyük hatayı yaptım → Turkish
• Încântat de cunoștință → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung → Vietnamese
What's Language Detection?
• To detect which language the input text is written in
  - Time flies like an arrow → English
  - Buona sera → Italian
• It is a prerequisite for many language processing tasks
  - A language model is built for each language
  - Text search, classification, extraction, translation
• For sufficiently long and noiseless text, detection is possible with more than 99% accuracy [Cavnar+ 94]
  - A 3-gram model is used in many methods
SPAM or not?
• To decide, it is necessary to know that the text is written in Polish
Document Categorization with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
  - A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
  - $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  - where $p(X_i \mid C_k)$ is the relative frequency of the word in the category
• Estimate the category $C_k$ that maximizes the posterior
  - $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  - where $p(C_k)$ is the prior of the category
Language Detection with Naive Bayes Classifier
• Document categorization with language labels
  - Categorize documents into English, Japanese and so on
• Use character n-grams as features
  - Unicode code point n-grams, strictly speaking
  - Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data
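The two slides above combine into a very small classifier: character n-grams as the features $X_i$ and the posterior $p(C_k) \prod_i p(X_i \mid C_k)$ computed in log space. The sketch below uses add-one smoothing and a two-sentence toy corpus, both of which are assumptions for illustration and not how langdetect is trained.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams; a space marks the text boundary (assumption)."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples: iterable of (language label, text) pairs."""
    counts = defaultdict(Counter)  # n-gram counts per language
    docs = Counter()               # document counts per language (prior)
    for label, text in samples:
        docs[label] += 1
        counts[label].update(char_ngrams(text, n))
    return counts, docs


def predict(model, text, n=3):
    counts, docs = model
    vocab_size = len({g for grams in counts.values() for g in grams})
    total_docs = sum(docs.values())
    scores = {}
    for label, grams in counts.items():
        total = sum(grams.values())
        score = math.log(docs[label] / total_docs)  # log p(C_k)
        for g in char_ngrams(text, n):
            # log p(X_i | C_k) with add-one smoothing (assumption)
            score += math.log((grams[g] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


model = train([("en", "this is the thing"), ("de", "das ist nicht das ding")])
print(predict(model, "the thing is"), predict(model, "nicht das"))
```

Summing logarithms instead of multiplying probabilities avoids numeric underflow when a text contains many n-grams.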
Why Use n-Gram to Detect Language?
• Each language has its own characters and spelling rules
  - "é" is often used in Spanish, Italian and so on, but in principle not in English
  - Many words start with "Z" in German, but not in English
  - Many words start with "C" in English, but not in German
  - The spelling "Th" is often used in English, but not in other languages

"This" →  T  h  i  s          ← 1-gram
          _T Th hi is s_      ← 2-gram
          _Th Thi his is_     ← 3-gram
(_ marks a word boundary)

           C      L      Z      Th
English    0.75   0.47   0.02   0.74
German     0.10   0.37   0.53   0.03
French     0.38   0.69   0.01   0.01
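The 1-, 2- and 3-grams of "This" shown above can be reproduced with a few lines. The sketch assumes that a single space stands for the word boundary and that 1-grams are taken without boundaries, which matches the counts in the figure; the function name is illustrative.

```python
def char_ngrams(word, n):
    """Character n-grams of a word, padded with a boundary space for n > 1."""
    padded = f" {word} " if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

The boundary-padded grams such as " Th" are what let a model learn which letters tend to start or end words in a language.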
language-detection (langdetect) (Nakatani 2010)
• Language detection library for Java
  - http://code.google.com/p/language-detection/
  - Apache License 2.0
  - Character 3-gram + Bayesian filter
  - Various normalizations + feature sampling
• Over 99% precision for 53 languages
  - Trained on Wikipedia abstracts
  - Wide support, including Asian languages
  - Adopted by Apache Solr
Estimation with News Text
• Test on news text crawled from the web in 49 languages

language                    size  correct  accuracy
af Afrikaans                 200      199    99.50%
ar Arabic                    200      200   100.00%
bg Bulgarian                 200      200   100.00%
bn Bengali                   200      200   100.00%
cs Czech                     200      200   100.00%
da Danish                    200      179    89.50%
de German                    200      200   100.00%
el Greek                     200      200   100.00%
en English                   200      200   100.00%
es Spanish                   200      200   100.00%
fa Persian                   200      200   100.00%
fi Finnish                   200      200   100.00%
fr French                    200      200   100.00%
gu Gujarati                  200      200   100.00%
he Hebrew                    200      200   100.00%
hi Hindi                     200      200   100.00%
hr Croatian                  200      200   100.00%
hu Hungarian                 200      200   100.00%
id Indonesian                200      200   100.00%
it Italian                   200      200   100.00%
ja Japanese                  200      200   100.00%
kn Kannada                   200      200   100.00%
ko Korean                    200      200   100.00%
mk Macedonian                200      200   100.00%
ml Malayalam                 200      200   100.00%
mr Marathi                   200      200   100.00%
ne Nepali                    200      200   100.00%
nl Dutch                     200      200   100.00%
no Norwegian                 200      199    99.50%
pa Punjabi                   200      200   100.00%
pl Polish                    200      200   100.00%
pt Portuguese                200      200   100.00%
ro Romanian                  200      200   100.00%
ru Russian                   200      200   100.00%
sk Slovak                    200      200   100.00%
so Somali                    200      200   100.00%
sq Albanian                  200      200   100.00%
sv Swedish                   200      200   100.00%
sw Swahili                   200      200   100.00%
ta Tamil                     200      200   100.00%
te Telugu                    200      200   100.00%
th Thai                      200      200   100.00%
tl Tagalog                   200      200   100.00%
tr Turkish                   200      200   100.00%
uk Ukrainian                 200      200   100.00%
ur Urdu                      200      200   100.00%
vi Vietnamese                200      200   100.00%
zh-cn Simplified Chinese     200      200   100.00%
zh-tw Traditional Chinese    200      200   100.00%
total                       9800     9777    99.77%
Estimation with Europarl datasets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 16
bull Test for 1000 samples for each
language from Europarl Parallel Corpus
ndash from the proceedings of the European Parliament
ndash httpwwwstatmtorgeuroparl
bull httpcodegooglecomplanguage-
detectiondownloadsdetailname=eur
oparl-testzip
language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991
total 21000 20850 993
Language Detection has been over isnt it
17
We still have ENEMY to beat
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 18
Twitter Language Detection with the Existing Methods
bull Only 90-95 accuracy
for tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language
Detection
ndash httpcodegooglecompchromium-
compact-language-detector
ndash regard ms(Malay) as id(Indonesian)
bull Tika = Apache Tika
ndash httptikaapacheorg
ndash Estimate on 15 languages which Tika
supports in our tweet corpus
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 19
language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----
total 922 938 700
Chromium Compact Language Detection (CLD)
bull Porting the language detector from
Google Chromium ndash httpcodegooglecompchromium-compact-language-detector
ndash Implementation in C++ Python binding
ndash of supported languages CLD = 76
langdetect = 53
ndash Accuracy CLD = 9882 langdetect =
9922
bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 20
Is twitter Language Detection difficult (1)
bull Tweet is too short to extract 3-gram features
ndash At most 140 characters on twitter
ndash URLs mentions and hashtags are not useful to
detect
bull LIGA [Tromp+ 11]
ndash Graph-features based on 3-gram
bull Add long distance features
bull 95~98 accuracy for twitter Language Detection
bull 6 languages (de en es fr it nl)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 21
Is twitter Language Detection difficult (2)
bull Tweet is too noisy
ndash Representations against the languages orthography often appear
ndash Acronym Abbreviation lengthened word (like Cooooolll)
bull Likelihood of tweet tends to get smaller on normal language model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
OMG Oh My God
LOL Laughing Out Loud
LMAO Laughing My Ass Out
F4F Follow for Follow
MDR Mort de Rire (French)
TKT Ne tlsquoInquiegravete Pas (Fr)
u you
ur your
4 for
i0u I love you
k che (Italian)
anke anche(Italian)
Letter k isnt used in Italian
22
Motivation to Detect Short Text Language
bull There are many small chunks of text in addition to twitter
ndash Schedule search query bulletin board and so on
ndash There are many questions about short text detection in the Issues Board of langdetect Project
bull httpcodegooglecomplanguage-detectionissuesdetailid=10
bull Detection for multi-language mixed text
ndash Cut the target document in paragraphs or lines
ndash Detect for each short text
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 23
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
• Define a containment relation (semi-order) among non-empty substrings
  – Example text: abracadabra
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly speaking, the relation is defined together with the position within the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
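The definition can be checked by brute force on the example. The sketch below is an illustration only (the function name `maximal_substrings` is an assumption, and this is not the linear-time method): it treats a substring as maximal when no one-character extension to the left or right has the same number of occurrences.

```python
from collections import Counter


def maximal_substrings(text):
    """Return the maximal substrings of text by brute force (O(T^2) substrings)."""
    # Count every occurrence of every substring, overlaps included.
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # If an extension occurs equally often, sub is contained in it.
        extended = any(
            counts.get(ch + sub) == freq or counts.get(sub + ch) == freq
            for ch in alphabet
        )
        if not extended:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))  # ['a', 'abra', 'abracadabra']
```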
Maximal Substring and Infinity-Gram
• Substrings that are in a containment relationship always have equal frequencies
• In a model with a linear combination of features, features with common values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with ∞-grams
• Although the equivalence breaks down on the test set, we assume that it holds approximately when the training set is sufficiently large
Extended Suffix Array
• An Extended Suffix Array (ESA) consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
  – These can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers
What's Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these
U+0100-017F   Latin Extended-A              (as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              (as above)
Latin Alphabets in Unicode Codepoint Chart
[Chart legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides in turn]
Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Work with Raúl Velaz and Hiroshi Manabe for Catalan corpus creation

language         training     test
ca Catalan           9089     5082
cs Czech             9082     7682
da Danish            7388     5524
de German           44448    10065
en English          44520    10168
es Spanish          44118    10265
fi Finnish           8087     7050
fr French           44339    10098
hu Hungarian        10030     4904
id Indonesian       44722    10181
it Italian          43366    10152
nl Dutch            44682    10007
no Norwegian        10124     8496
pl Polish           16771    10152
pt Portuguese       44215    10208
ro Romanian         10021     5911
sv Swedish          44054    10032
tr Turkish          44703    10308
vi Vietnamese       15030    10488
total              538789   166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – But it still reaches at most 98% accuracy
• We guess that it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart of tweet share by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
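The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. This is an illustration under assumptions, not ldig's actual code: the function name `level_out`, the dictionary layout and the fixed seed are all hypothetical, and every language is assumed to have at least one tweet.

```python
import random


def level_out(corpus_by_lang, seed=0):
    """Oversample each language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus_by_lang.values())
    leveled = {}
    for lang, tweets in corpus_by_lang.items():
        # Whole copies of the data, then the remainder without replacement.
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
balanced = level_out(corpus)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```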
Convert to Lowercase on Multiple Languages
• Conversion to lower case saves corpus size and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
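A minimal sketch of this rule, assuming only the ASCII capital 'I' is exempted as the slide states (the function name `lower_except_i` is hypothetical, and how İ (U+0130) should be handled is left to Python's default `str.lower`):

```python
def lower_except_i(text):
    """Lowercase every character except the ASCII capital 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_i("ISPARTA Is Here"))  # Isparta Is here
```

Keeping 'I' untouched avoids committing to either ı or i before the language is known.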
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 character types for s and t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
  – Indistinguishable

Comma below    Cedilla
ș (U+0219)     ş (U+015F)
ț (U+021B)     ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' is used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the others
  – with cedilla : with comma = 2 : 1
  – "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  – ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ さ)
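The mapping can be written as a small translation table. This is a sketch with assumed names (`COMMA_TO_CEDILLA`, `normalize_romanian`), covering the four code points U+0218-B listed on the earlier slide; it is not taken from ldig's source.

```python
# Map 's/t with comma below' to the more common 's/t with cedilla'.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```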
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Combining Diacritical Marks
    • ẵ = U+0103 U+0303
  – Both are used about half and half in news and tweets
• Normalize 2 into 1
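One way to normalize representation 2 into representation 1 is Unicode NFC composition. This is an assumption for illustration; the slides do not say which mechanism ldig uses, and the function name `normalize_vietnamese` is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(combined)
print(len(combined), len(composed), hex(ord(composed)))
```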
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 has few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – face marks made of letters, like XD and :p
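The removal step can be sketched with a few regular expressions. The patterns below are assumptions chosen for illustration (they are not ldig's actual rules), and the function name `strip_twitter_noise` is hypothetical; the face-mark pattern covers only a handful of alphabetic emoticons.

```python
import re

# Each pattern is a rough approximation of one item in the list above.
_NOISE_PATTERNS = [
    re.compile(r"https?://\S+"),   # URLs
    re.compile(r"@\w+:?"),         # mentions, with an optional trailing colon
    re.compile(r"#\w+"),           # hashtags
    re.compile(r"\bRT\b"),         # retweet marker
    re.compile(r"(?<!\w)(?:[xX]D+|[:;=][-']?[pPdD)(])(?!\w)"),  # face marks
]


def strip_twitter_noise(text):
    """Remove twitter-specific tokens and collapse the remaining whitespace."""
    for pattern in _NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_noise("RT @bob: so cool http://t.co/abc #fun XD"))  # so cool
```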
Normalization for twitter-Specific Representation
• How to handle lengthened words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
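Case 2 fits in a single regular expression. A minimal sketch, assuming "more than 3" means a run of three or more identical characters (the function name `shrink_runs` is hypothetical):

```python
import re

# A character followed by two or more copies of itself.
_RUN = re.compile(r"(.)\1{2,}")


def shrink_runs(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return _RUN.sub(r"\1\1", text)


print(shrink_runs("coooooooollllll"))  # cooll
```

The result "cooll" is not a dictionary word, but lengthened and plain spellings now share most of their substrings, which is what matters for substring features.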
Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE? HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
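For regular laughs, the shrinking can be sketched as a regular expression over repeated two-letter syllables. This is an assumption-laden illustration, not ldig's rule: the function name `shrink_laugh` is hypothetical, the input is assumed to be lowercased already, and irregular laughs such as the "hahahha" example on the slide are not covered by this simple pattern.

```python
import re

# A two-letter syllable repeated at least three times in a row.
_LAUGH = re.compile(r"([a-z]{2})\1{2,}")


def shrink_laugh(text):
    """Shrink a syllable repeated 3 or more times to exactly 2 repetitions."""
    return _LAUGH.sub(r"\1\1", text)


print(shrink_laugh("tafil con eso jajajajajajaja"))  # tafil con eso jaja
```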
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multiple-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output is "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
• Examples
  – en u should just enjoy ur vacation sadly
  – en D im online but you arent RT that much
  – en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  – ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
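A line in this format can be split as sketched below. This is a hedged illustration rather than ldig's own loader: the function name `parse_line` is hypothetical, and it assumes the label is the first TAB-separated field, the text is the last one, and anything in between is meta data. The sample lines are made up.

```python
def parse_line(line):
    """Split one corpus line into (label, meta data, text)."""
    label, rest = line.rstrip("\n").split("\t", 1)
    # The text is assumed to be the last field; meta data may contain TABs.
    meta, text = rest.rsplit("\t", 1)
    return label, meta, text


print(parse_line("en\t\tim gettin attacked for a tweet\n"))
```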
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after starting it
  – For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters
Estimation
• LD53 = langdetect + standard bundled profiles
• LDsm = langdetect + profiles based on the twitter corpus
• A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of detect is less than the sum of size

language         size   detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan       5093     4923     4857         98.66      95.37     95.3     97.0
cs Czech         7681     7668     7663         99.93      99.77     96.3     99.7
da Danish        5516     5472     5310         97.04      96.27     94.5     92.4
de German       10060    10069    10006         99.37      99.46     86.6     93.8
en English      10162    10133    10029         98.97      98.69     88.3     95.0
es Spanish      10244    10284    10120         98.41      98.79     91.5     96.0
fi Finnish       7051     7038     7024         99.80      99.62     98.9     99.6
fr French       10074    10134    10051         99.18      99.77     95.0     98.1
hu Hungarian     4904     4892     4858         99.30      99.06     85.8     95.5
id Indonesian   10178    10225    10160         99.36      99.82     89.7     98.9
it Italian      10143    10205    10103         99.00      99.61     96.2     98.0
nl Dutch        10005     9916     9858         99.42      98.53     69.5     97.4
no Norwegian     8504     8432     8201         97.26      96.44     96.0     96.3
pl Polish       10151    10149    10130         99.81      99.79     98.0     99.7
pt Portuguese   10212    10201    10119         99.20      99.09     88.0     96.9
ro Romanian      5913     5867     5850         99.71      98.93     92.8     97.4
sv Swedish      10025    10093     9942         98.50      99.17     96.0     97.9
tr Turkish      10308    10317    10298         99.82      99.90     97.6     99.5
vi Vietnamese   10487    10480    10474         99.94      99.88     98.7     99.2
total: size 166711, correct 165053, accuracy 99.01%; LD53 92.2%, LDsm 97.4%
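The precision and recall columns are consistent with the usual definitions, precision = correct / detect and recall = correct / size. The slides do not spell these formulas out, so treat them as an assumption; the function name `precision_recall` is hypothetical. The Catalan row can be checked as follows:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to 2 decimals."""
    precision = 100.0 * correct / detect  # share of detections that are right
    recall = 100.0 * correct / size       # share of true texts that are found
    return round(precision, 2), round(recall, 2)


print(precision_recall(5093, 4923, 4857))  # (98.66, 95.37)
```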
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language      size   detect  correct  precision(%)  recall(%)
de German     1479     1476     1469          99.5       99.3
en English    1505     1502     1490          99.2       99.0
es Spanish    1562     1548     1541          99.6       98.7
fr French     1551     1549     1540          99.4       99.3
it Italian    1539     1531     1528          99.8       99.3
nl Dutch      1430     1429     1424          99.7       99.6
total: size 9066, correct 8992, accuracy 99.2%
Estimation for Europarl Dataset
• ldig results are given only for the languages that ldig supports

                         ldig              langdetect        CLD
language         size    correct  rate(%)  correct  rate(%)  correct  rate(%)
bg Bulgarian     1000       -       -        988     98.8      991     99.1
cs Czech         1000     1000    100.0      994     99.4      995     99.5
da Danish        1000      976     97.6      968     96.8      932     93.2
de German        1000      999     99.9      998     99.8     1000    100.0
el Greek         1000       -       -       1000    100.0     1000    100.0
en English       1000      999     99.9      996     99.6     1000    100.0
es Spanish       1000     1000    100.0      996     99.6      989     98.9
et Estonian      1000       -       -        996     99.6      998     99.8
fi Finnish       1000      997     99.7      998     99.8     1000    100.0
fr French        1000      999     99.9      999     99.9      992     99.2
hu Hungarian     1000     1000    100.0      999     99.9      999     99.9
it Italian       1000      999     99.9      999     99.9      996     99.6
lt Lithuanian    1000       -       -        997     99.7      999     99.9
lv Latvian       1000       -       -        999     99.9      998     99.8
nl Dutch         1000     1000    100.0      974     97.4      995     99.5
pl Polish        1000      998     99.8      999     99.9      997     99.7
pt Portuguese    1000      995     99.5      996     99.6      989     98.9
ro Romanian      1000     1000    100.0      999     99.9      998     99.8
sk Slovak        1000       -       -        988     98.8      990     99.0
sl Slovene       1000       -       -        976     97.6      963     96.3
sv Swedish       1000      995     99.5      991     99.1      993     99.3
total           21000    13957     99.7    20850     99.3    20814     99.1
Conclusions
• Language detector using the maximal substring model
  – It detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (>100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll! Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
What's Language Detection?
• To detect what language the input text is written in
  – Time flies like an arrow → English
  – Buona sera → Italian
• It is a prerequisite for many language processing tasks
  – Language models are built for each language
  – Text search, classification, extraction, translation
• Long enough and noiseless text can be detected with more than 99% accuracy [Cavnar+ 94]
  – A 3-gram model is used in many methods
SPAM or not
bull It is necessary to know that it is written in Polish
Document Categorization with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence hypothesis)
  – where $p(X_i \mid C_k)$ is the rate of word frequency for the category
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \dfrac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior for the category
Language Detection with Naive Bayes Classifier
• Document categorization with language labels
  – Categorize documents into English, Japanese and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume that the character encoding of the document is already known
    • Most applications know the encoding of their internal text data
Why Use n-Gram to Detect Language?
• Each language has its own characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but in principle not in English
  – Many words start with "Z" in German, but not in English
  – Many words start with "C" in English, but not in German
  – The spelling "Th" is often used in English, but not in the other languages

n-grams of "This" (_ marks the word boundary)
T h i s              ← 1-gram
_T Th hi is s_       ← 2-gram
_Th Thi his is_      ← 3-gram

          C      L      Z      Th
English   0.75   0.47   0.02   0.74
German    0.10   0.37   0.53   0.03
French    0.38   0.69   0.01   0.01
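The n-gram extraction in the figure can be sketched in a few lines. The function name `char_ngrams` and the "_" boundary marker are assumptions for illustration; the word is padded on both sides for n ≥ 2 so that word-initial and word-final n-grams become distinct features.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of word, padded at both ends for n >= 2."""
    padded = pad + word + pad if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

A word-initial feature such as "_Z" is what lets a model use the "starts with Z" hint from the list above.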
language-detection (langdetect) (Nakatani 2010)
• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained with Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr
Estimation with News Text
• Test with news text crawled from the web in 49 languages

language                  size   correct (accuracy %)
af Afrikaans               200   199 (99.50)
ar Arabic                  200   200 (100.00)
bg Bulgarian               200   200 (100.00)
bn Bengali                 200   200 (100.00)
cs Czech                   200   200 (100.00)
da Danish                  200   179 (89.50)
de German                  200   200 (100.00)
el Greek                   200   200 (100.00)
en English                 200   200 (100.00)
es Spanish                 200   200 (100.00)
fa Persian                 200   200 (100.00)
fi Finnish                 200   200 (100.00)
fr French                  200   200 (100.00)
gu Gujarati                200   200 (100.00)
he Hebrew                  200   200 (100.00)
hi Hindi                   200   200 (100.00)
hr Croatian                200   200 (100.00)
hu Hungarian               200   200 (100.00)
id Indonesian              200   200 (100.00)
it Italian                 200   200 (100.00)
ja Japanese                200   200 (100.00)
kn Kannada                 200   200 (100.00)
ko Korean                  200   200 (100.00)
mk Macedonian              200   200 (100.00)
ml Malayalam               200   200 (100.00)
mr Marathi                 200   200 (100.00)
ne Nepali                  200   200 (100.00)
nl Dutch                   200   200 (100.00)
no Norwegian               200   199 (99.50)
pa Punjabi                 200   200 (100.00)
pl Polish                  200   200 (100.00)
pt Portuguese              200   200 (100.00)
ro Romanian                200   200 (100.00)
ru Russian                 200   200 (100.00)
sk Slovak                  200   200 (100.00)
so Somali                  200   200 (100.00)
sq Albanian                200   200 (100.00)
sv Swedish                 200   200 (100.00)
sw Swahili                 200   200 (100.00)
ta Tamil                   200   200 (100.00)
te Telugu                  200   200 (100.00)
th Thai                    200   200 (100.00)
tl Tagalog                 200   200 (100.00)
tr Turkish                 200   200 (100.00)
uk Ukrainian               200   200 (100.00)
ur Urdu                    200   200 (100.00)
vi Vietnamese              200   200 (100.00)
zh-cn Simplified Chinese   200   200 (100.00)
zh-tw Traditional Chinese  200   200 (100.00)
total                     9800   9777 (99.77)
Estimation with Europarl datasets
• Test with 1000 samples for each language from the Europarl Parallel Corpus
  – Taken from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language         size   correct  accuracy(%)
bg Bulgarian     1000       988         98.8
cs Czech         1000       994         99.4
da Danish        1000       968         96.8
de German        1000       998         99.8
el Greek         1000      1000        100.0
en English       1000       996         99.6
es Spanish       1000       996         99.6
et Estonian      1000       996         99.6
fi Finnish       1000       998         99.8
fr French        1000       999         99.9
hu Hungarian     1000       999         99.9
it Italian       1000       999         99.9
lt Lithuanian    1000       997         99.7
lv Latvian       1000       999         99.9
nl Dutch         1000       974         97.4
pl Polish        1000       999         99.9
pt Portuguese    1000       996         99.6
ro Romanian      1000       999         99.9
sk Slovak        1000       988         98.8
sl Slovene       1000       976         97.6
sv Swedish       1000       991         99.1
total           21000     20850         99.3
Language Detection is over, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy for the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Estimated on the 15 languages in our tweet corpus that Tika supports

language         LD(%)   CLD(%)   Tika(%)
ca Catalan        95.3     93.0      83.8
cs Czech          96.3     96.6      ----
da Danish         94.5     90.7      58.7
de German         86.6     96.8      73.1
en English        88.3     97.4      54.7
es Spanish        91.5     90.5      44.4
fi Finnish        98.9     99.4      94.8
fr French         95.0     94.5      67.4
hu Hungarian      85.8     89.0      76.2
id Indonesian     89.7     92.8      ----
it Italian        96.2     93.8      87.1
nl Dutch          69.5     93.2      65.0
no Norwegian      96.0     74.9      68.6
pl Polish         98.0     97.8      88.8
pt Portuguese     88.0     88.6      47.4
ro Romanian       92.8     96.1      82.6
sv Swedish        96.0     96.4      75.6
tr Turkish        97.6     97.4      ----
vi Vietnamese     98.7     98.9      ----
total             92.2     93.8      70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
• Tweets are too noisy
  – Expressions that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be small under a normal language model

Expression   Meaning
OMG          Oh My God
LOL          Laughing Out Loud
LMAO         Laughing My Ass Off
F4F          Follow for Follow
MDR          Mort de Rire (French)
TKT          Ne t'Inquiète Pas (French)
u            you
ur           your
4            for
i0u          I love you
k            che (Italian)
anke         anche (Italian)
• The letter k isn't used in Italian
Motivation to Detect Short Text Language
bull There are many small chunks of text in addition to twitter
ndash Schedule search query bulletin board and so on
ndash There are many questions about short text detection in the Issues Board of langdetect Project
bull httpcodegooglecomplanguage-detectionissuesdetailid=10
bull Detection for multi-language mixed text
ndash Cut the target document in paragraphs or lines
ndash Detect for each short text
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 23
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noise-free tweets as training data
bull Noisy tweets with more than 3 words as test data
bull Worked with Raúl Velaz and Hiroshi Manabe on creating the Catalan corpus
language          training     test
ca Catalan            9089     5082
cs Czech              9082     7682
da Danish             7388     5524
de German            44448    10065
en English           44520    10168
es Spanish           44118    10265
fi Finnish            8087     7050
fr French            44339    10098
hu Hungarian         10030     4904
id Indonesian        44722    10181
it Italian           43366    10152
nl Dutch             44682    10007
no Norwegian         10124     8496
pl Polish            16771    10152
pt Portuguese        44215    10208
ro Romanian          10021     5911
sv Swedish           44054    10032
tr Turkish           44703    10308
vi Vietnamese        15030    10488
total               538789   166773
Simple Language Detection
bull A language detector can be constructed from the maximal substring model and the Twitter corpus
ndash But on its own it reaches at most 98% accuracy
bull We guess that the following biases need to be reduced
ndash data size bias
ndash language-specific bias
ndash Twitter-specific bias
Bias by Data Size
bull The number of tweets varies hugely across languages
bull Level them out by sampling with replacement from each language, up to the size of the largest one
ndash In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
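The approximation described above (whole copies plus a remainder drawn without replacement) could look roughly like this. The function name `level_by_resampling`, the dictionary layout and the `seed` parameter are assumptions made for this sketch, not ldig's actual code.

```python
import random


def level_by_resampling(corpus, seed=0):
    """Grow every language's tweet list to the size of the largest one.

    corpus maps a language code to a list of tweets. Each list is copied
    an integer number of times, and the remainder is sampled from it
    without replacement.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled
```

With this scheme every tweet of a small language appears at least `copies` times, so no training example is lost to sampling noise.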
[Chart: number of tweets per language; labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
Convert to Lowercase on Multiple Languages
bull Converting to lower case reduces the amount of corpus needed and compresses the model
bull But the lower case of I (U+0049) in Turkish differs from that in other languages
bull So convert to lower case, excluding 'I'
Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
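A minimal sketch of "lower case excluding 'I'", assuming plain per-character lowercasing is acceptable for every other letter; the helper name is hypothetical.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is kept as is.

    Leaving 'I' untouched avoids choosing between 'i' (most languages)
    and dotless 'ı' (Turkish, Azerbaijani) before the language is known.
    """
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

For example, `lower_except_capital_i("ISTANBUL Kitap")` keeps the leading `I` and lowercases the rest.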
Normalization for Romanian
bull Romanian uses â ă î ș ț in addition to a-z
bull There are 2 sets of code points for s and t with a "beard"
ndash U+015E-F, U+0162-3: s/t with cedilla
ndash U+0218-B: s/t with comma below
bull 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
bull The 2 forms have the same design in some fonts
ndash Indistinguishable
comma below    cedilla
ș (U+0219)     ş (U+015F)
ț (U+021B)     ţ (U+0163)
Romanian Character Affairs on PC
bull Although Romanian orthography requires 's/t with comma below', these characters were not available on PCs until recently
ndash 1989: Democratization of Romania
ndash 2001: 's/t with comma below' was provided by ISO 8859-16 (Latin-10) and Unicode
ndash 2007: Romania joined the EU
ndash 2007: Windows Vista supported 's/t with comma below' (available to everyone)
[Photo: 's/t with cedilla' used on an advertising board in Bucharest]
Normalization for Substitute Characters
bull 's/t with cedilla' are substitute characters
ndash But they are more popular than the correct ones
ndash with cedilla : with comma below = 2 : 1
ndash The "Romanian IME" outputs the substitutes too :D
bull Regard 's/t with comma below' as 's/t with cedilla'
ndash ț (U+021B) → ţ (U+0163)
bull This is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ)
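Mapping the comma-below letters onto their cedilla substitutes is a plain character translation. The table and function names below are assumptions of this sketch.

```python
# Map s/t with comma below onto the more common s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # capital S with comma below -> S with cedilla
    "\u0219": "\u015F",  # small s with comma below   -> s with cedilla
    "\u021A": "\u0162",  # capital T with comma below -> T with cedilla
    "\u021B": "\u0163",  # small t with comma below   -> t with cedilla
})


def normalize_romanian(text):
    """Replace comma-below s/t with the cedilla forms; leave the rest as is."""
    return text.translate(COMMA_TO_CEDILLA)
```

Applying the same function to both training and test text makes the two spellings indistinguishable to the model.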
Arabic Character Normalization (in language-detection)
bull Arabic and Persian have a similar problem
bull The character 'yeh' in Persian (Farsi) corresponds to 2 code points
ndash Wikipedia uses only ی (U+06CC, Farsi yeh)
ndash News sites use only ي (U+064A, Arabic yeh)
bull U+064A is a substitute in Persian
ndash The popular Arabic charset CP-1256 has no character mapped to U+06CC
ndash As 'yeh' is very frequent in both languages, detection of almost all Persian text fails
bull Regard U+06CC as U+064A
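The same idea in code, as a one-line sketch; the constant and function names are assumptions of this example, not langdetect's API.

```python
FARSI_YEH = "\u06CC"   # yeh as used on Persian Wikipedia
ARABIC_YEH = "\u064A"  # yeh as used in Persian news text


def normalize_yeh(text):
    """Fold Farsi yeh into Arabic yeh so both spellings share features."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```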
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă â e ê i y o ô ơ u ư
bull Vietnamese has 6 tones
ndash a ả à ã á ạ
ndash These tone symbols are also used in general documents such as news
bull The tone symbols can be attached to all vowels
ndash 12 × 6 = 72
Normalization for Vietnamese (2)
bull There are 2 representations of vowels with tones
1 Use the precomposed characters U+1EA0 - U+1EF9
bull ẵ = U+1EB5
2 Combine a vowel with Combining Diacritical Marks
bull ẵ = U+0103 U+0303
ndash The two are used about half and half in news and tweets
bull Normalize 2 into 1
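One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is a stand-in for illustration; the slides do not say that ldig uses `unicodedata` for this step.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowels and combining tone marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)
```

For the example on the slide, `compose_vietnamese("\u0103\u0303")` yields the single code point U+1EB5.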
CJK-Kanji Normalization (1) (in language-detection)
bull CJK kanji have too many characters (more than 20K)
ndash Other scripts have only 30-50 characters
bull The character space is very sparse
ndash Characters that don't occur in the training corpus have no probability
bull e.g. 谢谢, kanji used in personal names
ndash Common frequent characters are too strong
bull e.g. a text containing "的" tends to be detected as Traditional Chinese
bull Because kana is also used in Japanese, the probabilities of kanji in Japanese are lower than in Chinese
CJK-Kanji Normalization (2) (in language-detection)
bull Group kanji by frequency and normalize each group to a representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (cf. size of the ASCII alphabet = 52)
ndash (2) "Commonly used kanji" lists defined for Japanese and Chinese
bull Simplified Chinese: 现代汉语常用字表 (3500)
bull Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
bull Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
ndash 常用漢字 contains few kanji for personal and place names
bull Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
bull Simply remove
ndash URLs
ndash mentions
ndash hashtags
ndash RT
ndash emoticons made of letters, such as XD and :p
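A rough sketch of this removal step with regular expressions. The patterns below are assumptions of this example and are simpler than what a production tweet cleaner would need.

```python
import re

# Illustrative patterns, applied in this order.
PATTERNS = [
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"@\w+"),          # mentions
    re.compile(r"#\w+"),          # hashtags
    re.compile(r"\bRT\b"),        # retweet marker
    # Emoticons written with letters, e.g. XD, xDDD, :p, :D
    re.compile(r"(?<!\w)(?:[xX]D+|[:;=]-?[pPdD)(]+)(?!\w)"),
]


def normalize_tweet(text):
    """Strip Twitter-specific tokens and collapse the remaining whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```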
Normalization for Twitter-Specific Representation
bull How should words like 'coooooooollllll' be handled?
bull Case 1: Make a normalization dictionary using [Brody+ 11]
ndash Unsupervised normalization like coooollll → cool
ndash It can't handle words that are not in the dictionary
bull Case 2: If the same character is repeated 3 or more times, shrink it to 2
ndash No language has 3 or more consecutive repetitions of the same Latin letter in its orthography
bull Japanese, in contrast, has "かたたたき", "かわいいいぬ", "あわててて" and so on
bull Acronyms (like WWW, СССР) are not useful for language detection
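Case 2 can be written as a single substitution. The function name is hypothetical, and the threshold of 3 follows the reading "3 or more repetitions".

```python
import re

REPEATS = re.compile(r"(.)\1{2,}")  # the same character 3 or more times in a row


def shrink_repeats(text):
    """Shrink any run of 3+ identical characters to exactly 2."""
    return REPEATS.sub(r"\1\1", text)
```

Note that this maps 'coooooooollllll' to 'cooll', not 'cool': the goal is a stable form for feature extraction, not spelling correction.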
Laugh Normalization
bull Each language has its own ways of writing laughter
ndash HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
ndash Hihihihi :) Habe ich regulär 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đó làm áo được ko em
bull Shrink them to two repetitions
ndash hahahha ⇒ haha
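A possible regular expression for this step, assuming laughs are built from one consonant out of h/j/k and one vowel, as in the examples above. The pattern is an illustration, not the one shipped with ldig.

```python
import re

# A consonant (h, j or k) and a vowel, repeated at least twice; stray
# doubled letters as in "hahahha" are tolerated through the backreferences.
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a laugh such as 'hahahha' or 'jajajaja' to two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)
```

Because the repeated syllable must reuse the same consonant and vowel, ordinary words such as "joke" are left alone.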
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
bull Tweet language detection for Latin-alphabet languages
ndash https://github.com/shuyo/ldig
bull MIT license
bull The trained model is also distributed there
ndash ∞-gram LR (maximal substrings) [Okanohara+ 09]
ndash L1 SGD (cumulative penalty) [Tsuruoka+ 09]
ndash Double array
Usage (1) Model Initialization
bull ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
ndash Extract features from the corpus and initialize the model
ndash -m: model directory
ndash -x: path of the maximal substring extractor (executed as an external process)
ndash --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
bull maxsubst [input file] [output file]
ndash Input is multi-line text
bull TABs in it are replaced with " " and line feeds with U+0001
ndash Output lines have the form "[feature]\t[frequency]" (\t = TAB)
Usage (2) Learn
bull ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Train the model on the corpus with 1 cycle of SGD
ndash -e: learning rate of SGD
ndash -r: regularizer of L1 regularization
ndash --wr: how often to apply regularization to all parameters
bull There are too many parameters to regularize all of them at every step
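The "cumulative penalty" idea of [Tsuruoka+ 09] can be sketched for a single weight as follows. This shows only the clipping step, under the assumption that `u` (the total penalty every weight could have received so far) grows by learning rate times regularizer at each step; the function name and signature are hypothetical and do not mirror ldig's internals.

```python
def apply_cumulative_penalty(w, q, u):
    """Clip one weight toward zero after its gradient update.

    w: weight after the gradient step
    q: total L1 penalty this weight has actually received so far
    u: total L1 penalty every weight could have received so far
    Returns the clipped weight and the updated q.
    """
    z = w
    if w > 0:
        w = max(0.0, w - (u + q))
    elif w < 0:
        w = min(0.0, w + (u - q))
    q += w - z
    return w, q
```

Because the pending penalty is tracked per weight, it can be applied lazily when a feature fires, which fits the remark above that regularizing all parameters at every step is too expensive.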
Usage (3) Shrink Model
bull ldig.py -m [model] --shrink
ndash Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
bull ldig.py -m [model] [test data]
ndash Detect the languages of the test data and output the results and a summary
Data Format
bull Training and test data
ndash [correct label]\t[metadata]\t[text] (\t = TAB)
ndash Examples:
en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
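A line in this format could be read as below. The sketch assumes TAB-separated fields with the label first and the text last; everything in between is treated as metadata. The function name is hypothetical.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text
```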
Usage (5) Evaluation Tool
bull server.py -m [model] -p [port number]
ndash Open http://localhost:[port] after it starts
ndash For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Evaluation
language         size    detect   correct   precision   recall   LD53   LDsm
ca Catalan        5093     4923     4857      98.66      95.37    95.3   97.0
cs Czech          7681     7668     7663      99.93      99.77    96.3   99.7
da Danish         5516     5472     5310      97.04      96.27    94.5   92.4
de German        10060    10069    10006      99.37      99.46    86.6   93.8
en English       10162    10133    10029      98.97      98.69    88.3   95.0
es Spanish       10244    10284    10120      98.41      98.79    91.5   96.0
fi Finnish        7051     7038     7024      99.80      99.62    98.9   99.6
fr French        10074    10134    10051      99.18      99.77    95.0   98.1
hu Hungarian      4904     4892     4858      99.30      99.06    85.8   95.5
id Indonesian    10178    10225    10160      99.36      99.82    89.7   98.9
it Italian       10143    10205    10103      99.00      99.61    96.2   98.0
nl Dutch         10005     9916     9858      99.42      98.53    69.5   97.4
no Norwegian      8504     8432     8201      97.26      96.44    96.0   96.3
pl Polish        10151    10149    10130      99.81      99.79    98.0   99.7
pt Portuguese    10212    10201    10119      99.20      99.09    88.0   96.9
ro Romanian       5913     5867     5850      99.71      98.93    92.8   97.4
sv Swedish       10025    10093     9942      98.50      99.17    96.0   97.9
tr Turkish       10308    10317    10298      99.82      99.90    97.6   99.5
vi Vietnamese    10487    10480    10474      99.94      99.88    98.7   99.2
total: size 166711, correct 165053 (99.01%); LD53 92.2, LDsm 97.4
All rates are percentages.
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of 'detect' is less than the sum of 'size'
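The precision and recall columns follow directly from the three counts. A small helper, with a hypothetical name, makes the relationship explicit.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size:    number of test tweets whose true label is the language
    detect:  number of tweets the detector assigned to the language
    correct: number of tweets assigned to the language correctly
    """
    return 100.0 * correct / detect, 100.0 * correct / size
```

For the Catalan row, `precision_recall(5093, 4923, 4857)` rounds to 98.66 and 95.37, matching the table.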
Evaluation on the LIGA Dataset
bull Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
ndash http://www.win.tue.nl/~mpechen/projects/smm/
bull The 19-language model is used
language      size   detect   correct   precision   recall
de German     1479     1476     1469      99.5       99.3
en English    1505     1502     1490      99.2       99.0
es Spanish    1562     1548     1541      99.6       98.7
fr French     1551     1549     1540      99.4       99.3
it Italian    1539     1531     1528      99.8       99.3
nl Dutch      1430     1429     1424      99.7       99.6
total: size 9066, correct 8992 (99.2%)
Evaluation on the Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported by ldig)
                         ldig             langdetect       CLD
language         size    correct  rate    correct  rate    correct  rate
bg Bulgarian     1000       -       -       988    98.8      991    99.1
cs Czech         1000     1000   100.0      994    99.4      995    99.5
da Danish        1000      976    97.6      968    96.8      932    93.2
de German        1000      999    99.9      998    99.8     1000   100.0
el Greek         1000       -       -      1000   100.0     1000   100.0
en English       1000      999    99.9      996    99.6     1000   100.0
es Spanish       1000     1000   100.0      996    99.6      989    98.9
et Estonian      1000       -       -       996    99.6      998    99.8
fi Finnish       1000      997    99.7      998    99.8     1000   100.0
fr French        1000      999    99.9      999    99.9      992    99.2
hu Hungarian     1000     1000   100.0      999    99.9      999    99.9
it Italian       1000      999    99.9      999    99.9      996    99.6
lt Lithuanian    1000       -       -       997    99.7      999    99.9
lv Latvian       1000       -       -       999    99.9      998    99.8
nl Dutch         1000     1000   100.0      974    97.4      995    99.5
pl Polish        1000      998    99.8      999    99.9      997    99.7
pt Portuguese    1000      995    99.5      996    99.6      989    98.9
ro Romanian      1000     1000   100.0      999    99.9      998    99.8
sk Slovak        1000       -       -       988    98.8      990    99.0
sl Slovene       1000       -       -       976    97.6      963    96.3
sv Swedish       1000      995    99.5      991    99.1      993    99.3
total           21000    13957    99.7    20850    99.3    20814    99.1
Conclusions
bull Language detector using the maximal substring model
ndash Detects 19 languages with over 99% accuracy
ndash Even langdetect reaches 97% accuracy with profiles built from the tweet corpus
bull If the corpus is maintained, precision will improve further
ndash There are still many mistakes (in particular between da and no)
bull If metadata is added to the features, precision will improve further
ndash How can metadata be added and trained at low cost?
bull We want to shrink the model without loss of precision
ndash It is too large for applications (>100MB)
References
bull [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
SPAM or Not?
bull To decide, it is first necessary to know that the text is written in Polish
Document Categorization with Naive Bayes Classifier
bull Categorize a document $X = (X_i)$ into a category $C_k$
ndash A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
bull Words are assumed to be conditionally independent given the category
ndash $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
ndash where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
bull Choose the category $C_k$ that maximizes the posterior
ndash $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
ndash where $p(C_k)$ is the prior of category $C_k$
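The posterior above can be computed in log space. The sketch below is a minimal bag-of-words Naive Bayes; the add-one smoothing and all function names are assumptions of this example and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (category, list of words). Returns counts for classify()."""
    prior = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, words in docs:
        prior[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return prior, word_counts, vocab


def classify(words, prior, word_counts, vocab):
    """Return the category maximizing log p(C_k) + sum_i log p(X_i | C_k)."""
    total_docs = sum(prior.values())
    best, best_score = None, -math.inf
    for category in prior:
        total_words = sum(word_counts[category].values())
        score = math.log(prior[category] / total_docs)
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            p = (word_counts[category][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best, best_score = category, score
    return best
```

Replacing the words with character n-grams gives the language detector described on the next slide.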
Language Detection with Naive Bayes Classifier
bull Document categorization with language labels
ndash Categorize documents into English, Japanese and so on
bull Use character n-grams as features
ndash Unicode code point n-grams, strictly speaking
ndash Assume the character encoding of the document is already known
bull Most applications know the encoding of their internal text data
Why Use n-Grams to Detect Language?
bull Each language has its own characters and spelling rules
ndash "é" is often used in Spanish, Italian and so on, but in principle not in English
ndash Many words start with "Z" in German but not in English
ndash Many words start with "C" in English but not in German
ndash The spelling "Th" is often used in English but not in the other languages
n-grams of the word "This" (␣ marks the word boundary):
1-gram: T h i s
2-gram: ␣T Th hi is s␣
3-gram: ␣Th Thi his is␣
           C      L      Z      Th
English    0.75   0.47   0.02   0.74
German     0.10   0.37   0.53   0.03
French     0.38   0.69   0.01   0.01
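The n-grams of "This" shown above can be produced with a few lines. Padding with a space at the word boundary for n > 1 is an assumption read off the slide's figure; the function name is hypothetical.

```python
def char_ngrams(word, n):
    """Return the character n-grams of word, padding boundaries for n > 1."""
    padded = " " + word + " " if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Here `char_ngrams("This", 2)` gives `[" T", "Th", "hi", "is", "s "]`, so a word-initial "Th" and a word-final "s" become features of their own.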
language-detection (langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash http://code.google.com/p/language-detection/
ndash Apache License 2.0
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + feature sampling
bull Over 99% precision for 53 languages
ndash Trained on Wikipedia abstracts
ndash Wide language support, including Asian languages
ndash Adopted by Apache Solr
Evaluation with News Text
bull Tested on news text crawled from the web in 49 languages
language                     size   correct (accuracy)
af Afrikaans                  200   199 (99.50%)
ar Arabic                     200   200 (100.00%)
bg Bulgarian                  200   200 (100.00%)
bn Bengali                    200   200 (100.00%)
cs Czech                      200   200 (100.00%)
da Danish                     200   179 (89.50%)
de German                     200   200 (100.00%)
el Greek                      200   200 (100.00%)
en English                    200   200 (100.00%)
es Spanish                    200   200 (100.00%)
fa Persian                    200   200 (100.00%)
fi Finnish                    200   200 (100.00%)
fr French                     200   200 (100.00%)
gu Gujarati                   200   200 (100.00%)
he Hebrew                     200   200 (100.00%)
hi Hindi                      200   200 (100.00%)
hr Croatian                   200   200 (100.00%)
hu Hungarian                  200   200 (100.00%)
id Indonesian                 200   200 (100.00%)
it Italian                    200   200 (100.00%)
ja Japanese                   200   200 (100.00%)
kn Kannada                    200   200 (100.00%)
ko Korean                     200   200 (100.00%)
mk Macedonian                 200   200 (100.00%)
ml Malayalam                  200   200 (100.00%)
mr Marathi                    200   200 (100.00%)
ne Nepali                     200   200 (100.00%)
nl Dutch                      200   200 (100.00%)
no Norwegian                  200   199 (99.50%)
pa Punjabi                    200   200 (100.00%)
pl Polish                     200   200 (100.00%)
pt Portuguese                 200   200 (100.00%)
ro Romanian                   200   200 (100.00%)
ru Russian                    200   200 (100.00%)
sk Slovak                     200   200 (100.00%)
so Somali                     200   200 (100.00%)
sq Albanian                   200   200 (100.00%)
sv Swedish                    200   200 (100.00%)
sw Swahili                    200   200 (100.00%)
ta Tamil                      200   200 (100.00%)
te Telugu                     200   200 (100.00%)
th Thai                       200   200 (100.00%)
tl Tagalog                    200   200 (100.00%)
tr Turkish                    200   200 (100.00%)
uk Ukrainian                  200   200 (100.00%)
ur Urdu                       200   200 (100.00%)
vi Vietnamese                 200   200 (100.00%)
zh-cn Simplified Chinese      200   200 (100.00%)
zh-tw Traditional Chinese     200   200 (100.00%)
total                        9800   9777 (99.77%)
Evaluation with Europarl Datasets
bull Tested on 1000 samples for each language from the Europarl Parallel Corpus
ndash from the proceedings of the European Parliament
ndash http://www.statmt.org/europarl/
bull http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language         size   correct   accuracy (%)
bg Bulgarian     1000     988       98.8
cs Czech         1000     994       99.4
da Danish        1000     968       96.8
de German        1000     998       99.8
el Greek         1000    1000      100.0
en English       1000     996       99.6
es Spanish       1000     996       99.6
et Estonian      1000     996       99.6
fi Finnish       1000     998       99.8
fr French        1000     999       99.9
hu Hungarian     1000     999       99.9
it Italian       1000     999       99.9
lt Lithuanian    1000     997       99.7
lv Latvian       1000     999       99.9
nl Dutch         1000     974       97.4
pl Polish        1000     999       99.9
pt Portuguese    1000     996       99.6
ro Romanian      1000     999       99.9
sk Slovak        1000     988       98.8
sl Slovene       1000     976       97.6
sv Swedish       1000     991       99.1
total           21000   20850       99.3
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
bull Only 90-95% accuracy on the tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language Detection
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash ms (Malay) is regarded as id (Indonesian)
bull Tika = Apache Tika
ndash http://tika.apache.org/
ndash Evaluated on the 15 languages in our tweet corpus that Tika supports
language          LD     CLD    Tika
ca Catalan       95.3   93.0   83.8
cs Czech         96.3   96.6   ----
da Danish        94.5   90.7   58.7
de German        86.6   96.8   73.1
en English       88.3   97.4   54.7
es Spanish       91.5   90.5   44.4
fi Finnish       98.9   99.4   94.8
fr French        95.0   94.5   67.4
hu Hungarian     85.8   89.0   76.2
id Indonesian    89.7   92.8   ----
it Italian       96.2   93.8   87.1
nl Dutch         69.5   93.2   65.0
no Norwegian     96.0   74.9   68.6
pl Polish        98.0   97.8   88.8
pt Portuguese    88.0   88.6   47.4
ro Romanian      92.8   96.1   82.6
sv Swedish       96.0   96.4   75.6
tr Turkish       97.6   97.4   ----
vi Vietnamese    98.7   98.9   ----
total            92.2   93.8   70.0
(accuracy in %)
Chromium Compact Language Detection (CLD)
bull A port of the language detector from Google Chromium
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash Implemented in C++, with a Python binding
ndash Number of supported languages: CLD = 76, langdetect = 53
ndash Accuracy: CLD = 98.82%, langdetect = 99.22%
bull for 17 languages on Europarl datasets
bull http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
bull A tweet is too short to extract enough 3-gram features
ndash At most 140 characters on Twitter
ndash URLs, mentions and hashtags are not useful for detection
bull LIGA [Tromp+ 11]
ndash Graph features based on 3-grams
bull Adds long-distance features
bull 95~98% accuracy for Twitter language detection
bull 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
bull Tweets are too noisy
ndash Spellings that violate the language's orthography often appear
ndash Acronyms, abbreviations, lengthened words (like Cooooolll)
bull The likelihood of a tweet tends to be low under a normal language model
OMG     Oh My God
LOL     Laughing Out Loud
LMAO    Laughing My Ass Off
F4F     Follow for Follow
MDR     Mort de Rire (French)
TKT     Ne t'inquiète pas (French)
u       you
ur      your
4       for
i0u     I love you
k       che (Italian)
anke    anche (Italian)
The letter k isn't used in Italian
Motivation to Detect Short Text Language
bull There are many small chunks of text besides tweets
ndash Schedules, search queries, bulletin boards and so on
ndash There are many questions about short text detection on the issue tracker of the langdetect project
bull http://code.google.com/p/language-detection/issues/detail?id=10
bull Detection for text that mixes multiple languages
ndash Cut the target document into paragraphs or lines
ndash Detect the language of each short text
Our Goal
bull Over 99% accuracy
ndash However, it is too difficult to detect the language of a one-word sentence
ndash Our goal is 99+% accurate detection for sentences with more than 3 words
We Need
bull A model that can extract rich features from short text
ndash Maximal substring model (∞-gram logistic regression)
bull and a Twitter-specific language model, or a corpus to construct it
ndash a corpus of about 700K tweets with language labels
Proposed Method
How to Increase Features from 3-Grams
bull The larger n is, the more features there are
bull The maximum is at n=∞, that is, all substrings
ndash But the number of substrings is $O(T^2)$ for a text of length $T$
Number of n-grams (cumulative over lengths up to n)
n    freq≥1    freq≥2    freq≥10
1        79        72        57
2      1896      1533       902
3     15970     10369      4525
4     64966     33941     10534
5    167543     69719     15538
6    323749    107861     18970
7    524634    142954     21093
8    760719    171995     22159
9    921361    193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
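The growth shown in the table can be reproduced on any small corpus by counting distinct substrings up to each length. This is a brute-force sketch for illustration; the function name and the cumulative reading of the table are assumptions of this example.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n
    that occur at least min_freq times in texts."""
    counts = Counter()
    result = []
    for n in range(1, max_n + 1):
        for text in texts:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        result.append(sum(1 for c in counts.values() if c >= min_freq))
    return result
```

Raising `min_freq` shows the same effect as the freq≥2 and freq≥10 columns: most long n-grams are rare, so a frequency cutoff removes the bulk of them.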
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass logistic regression using all substrings as features
ndash Using maximal substrings gives an equivalent model that can be constructed in linear time
ndash Storing the features in a trie makes prediction fast
Maximal Substring (1)
bull Define a containment relation (a partial order) among the non-empty substrings of a text
ndash Example text: abracadabra
ndash "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"
ndash "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
ndash Strictly, the relation is defined together with the position within the containing substring
Maximal Substring (2)
bull Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
bull The maximal substrings of abracadabra are a, abra and abracadabra
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
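The abracadabra example can be checked by brute force. The sketch below assumes that a substring is maximal exactly when extending it by one character on either side always loses occurrences; it is far slower than the linear-time construction discussed in the talk and is meant only to make the definition concrete.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


def maximal_substrings(text):
    """Return substrings that cannot be extended without losing occurrences."""
    alphabet = set(text)
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = []
    for s in subs:
        freq = count_occurrences(text, s)
        if all(count_occurrences(text, c + s) < freq
               and count_occurrences(text, s + c) < freq
               for c in alphabet):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))
```

For instance, "bra" is not maximal because "abra" occurs just as often, whereas "a" occurs 5 times and every extension of it occurs fewer times.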
Maximal Substring and Infinity-Gram
bull Substrings related by containment always have equal frequencies
bull In a model that is a linear combination of features, features with identical values can be merged into one
bull So logistic regression with maximal substrings is equivalent to logistic regression with ∞-grams
bull Although the equivalence breaks down on the test set, we assume it is approximated well by a sufficiently large training set
Extended Suffix Array
bull An Extended Suffix Array consists of
ndash SA = suffix array
ndash L = longest common prefixes
ndash B = Burrows-Wheeler transformed text
bull A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
ndash These can be computed in linear time
bull esaxx: Okanohara's implementation of ESA
ndash http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
bull Limit the character types (scripts) to detect
ndash In short-text detection, mixed text can be split by character type
bull Latin-alphabet languages
ndash The most difficult script to detect
ndash More than 25 such languages have over 5 million speakers
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters (more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that don't occur in the training corpus have no probabilities
bull e.g. 谢谢, Kanji for person names
ndash Common frequent characters are too strong
bull e.g. a text that contains "的" tends to be detected as Traditional Chinese
bull Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanji by frequency and normalize each group to a representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
bull Simplified Chinese: 现代汉语常用字表 (3500)
bull Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
bull Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
ndash 常用漢字 contains few Kanji for person names and place names
bull Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
bull Simply remove
ndash URLs
ndash mentions
ndash hashtags
ndash RT
ndash emoticons written with letters, like XD and :p
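The removal step can be sketched with a few regular expressions. The patterns below are assumptions made for illustration (the slides give no exact rules), and `strip_twitter_noise` is a hypothetical name, not a function of ldig.

```python
import re

# Illustrative patterns only; real tweets need more careful rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D (not preceded or followed by a word character)
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")

def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RT, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())
```

Each pattern replaces its match with a space before whitespace is collapsed, so neighbouring words are not glued together.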
Normalization for twitter-Specific Representation
bull How to handle lengthened words like 'coooooooollllll'
bull Case 1: Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll → cool
ndash It can't handle words that are not in the dictionary
bull Case 2: If the same character repeats 3 or more times, shrink the run to 2
ndash No language has a run of 3 or more of the same Latin letter in its orthography
bull In Japanese there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
bull Acronyms (like WWW, СССР) are not useful for language detection
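Case 2 can be sketched as a single regular expression. The code below assumes that "more than 3" means a run of three or more identical letters, and it restricts the rule to Latin letters with approximate Unicode ranges; both points, and the name `shrink_runs`, are assumptions of this sketch.

```python
import re

# A Latin letter followed by two or more copies of itself.
# The ranges (Basic Latin, Latin-1/Extended-A/B, Latin Extended Additional) are approximate.
_RUN = re.compile(r"([A-Za-z\u00c0-\u024f\u1e00-\u1eff])\1{2,}")

def shrink_runs(text: str) -> str:
    """Cut every run of 3+ identical Latin letters down to exactly 2."""
    return _RUN.sub(r"\1\1", text)

print(shrink_runs("coooooooollllll"))
```

Note that the result for the slide's example is "cooll" rather than "cool": the rule only caps the run length, which is why it needs no dictionary. Japanese text such as "かたたたき" is left untouched because the pattern matches Latin letters only.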
Laugh Normalization
bull Each language has its own ways of writing laughter
ndash HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
ndash Hihihihi :) Habe ich regulär 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đó làm áo được ko em
bull Shrink them to two repetitions
ndash hahahha ⇒ haha
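A rough sketch of the shrinking rule, assuming the text has already been lower-cased and treating a laugh as any two-letter unit repeated three or more times. The pattern and the name `shrink_laugh` are hypothetical, and irregular spellings such as "hahahha" are not covered by this simple rule.

```python
import re

# A two-letter unit (e.g. "ha", "ja", "ke") followed by two or more copies of itself
_LAUGH = re.compile(r"([a-z]{2})\1{2,}")

def shrink_laugh(text: str) -> str:
    """Reduce a repeated two-letter unit to exactly two repetitions."""
    return _LAUGH.sub(r"\1\1", text)

print(shrink_laugh("tafil con eso jajajajajajaja"))
```

Ordinary words rarely repeat the same two letters three times in a row, so this pattern mostly touches laughter, but it remains a heuristic.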
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
bull Tweet language detection for Latin-alphabet languages
ndash https://github.com/shuyo/ldig
bull MIT license
bull The trained model is also distributed there
ndash ∞-gram LR (maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Usage (1) Model Initialization
bull ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
ndash Extract features from the corpus and initialize the model
ndash -m : model directory
ndash -x : path of the maximal substring extractor (executed as an external process)
ndash --ff : ignore features whose frequency is less than the specified value
Maximal Substring Extractor
bull maxsubst [input file] [output file]
ndash Input: multi-line text
bull Replace TABs with " " and line feeds with U+0001 in it
ndash Output: "[feature]\t[frequency]"
Usage (2) Learn
bull ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Train the model on the corpus for 1 cycle of SGD
ndash -e : learning rate of SGD
ndash -r : coefficient of L1 regularization
ndash --wr : number of times to regularize all parameters
bull There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
bull ldig.py -m [model] --shrink
ndash Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
bull ldig.py -m [model] [test data]
ndash Detect the languages of the test data and output the results and a summary
Data Format
bull Training and test data
ndash [correct label]\t[meta data]\t[text]
bull Examples
ndash en u should just enjoy ur vacation sadly
ndash en :D im online but you arent RT that much
ndash en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ndash ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
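A line in this format could be read as sketched below. The sketch assumes that the fields are separated by tab characters, that the label is the first field and the text the last, and that everything in between is metadata; `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line: str):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]        # correct language label, e.g. "en"
    text = fields[-1]        # the tweet text is taken to be the last field
    metadata = fields[1:-1]  # anything in between (status ID, datetime, ...)
    return label, metadata, text

print(parse_line("en\t123\tu should just enjoy ur vacation\n"))
```

Taking the text from the last field keeps the helper usable whether a line carries one metadata field or several, as in the Catalan example.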
Usage (5) Estimation Tool
bull server.py -m [model] -p [port number]
ndash Open http://localhost:[port]/ after it starts
ndash For a text entered in the text area, it shows the language probabilities, the features the text contains, and their parameters
Estimation
bull LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on twitter corpus
bull Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
language        size    detect  correct  precision  recall  LD53  LDsm
ca Catalan      5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech        7681    7668    7663     99.93      99.77   96.3  99.7
da Danish       5516    5472    5310     97.04      96.27   94.5  92.4
de German       10060   10069   10006    99.37      99.46   86.6  93.8
en English      10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish      10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish      7051    7038    7024     99.80      99.62   98.9  99.6
fr French       10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian    4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian   10178   10225   10160    99.36      99.82   89.7  98.9
it Italian      10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch        10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian    8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish       10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese   10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian     5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish      10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish      10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese   10487   10480   10474    99.94      99.88   98.7  99.2
total           166711          165053              99.01   92.2  97.4
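The precision and recall columns can be reproduced from the counts. The sketch below reads the Catalan row as size 5093, detect 4923 and correct 4857, and assumes precision = correct / detect and recall = correct / size; the function name is hypothetical.

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent, rounded to two decimals."""
    precision = 100.0 * correct / detect   # share of detected texts that are right
    recall = 100.0 * correct / size        # share of all texts of the language found
    return round(precision, 2), round(recall, 2)

print(precision_recall(5093, 4923, 4857))
```

Under these definitions the Catalan row gives 98.66 and 95.37, which matches the table once the stripped decimal points are restored.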
Estimation for LIGA dataset
bull Evaluate on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
ndash http://www.win.tue.nl/~mpechen/projects/smm/
bull Use the 19-language model
language     size  detect  correct  precision  recall
de German    1479  1476    1469     99.5       99.3
en English   1505  1502    1490     99.2       99.0
es Spanish   1562  1548    1541     99.6       98.7
fr French    1551  1549    1540     99.4       99.3
it Italian   1539  1531    1528     99.8       99.3
nl Dutch     1430  1429    1424     99.7       99.6
total        9066          8992                99.2
Estimation for Europarl Dataset
bull ldig is evaluated only on the languages it supports (- = not supported)
                        ldig             langdetect       CLD
language        size    correct  rate    correct  rate    correct  rate
bg Bulgarian    1000    -        -       988      98.8    991      99.1
cs Czech        1000    1000     100.0   994      99.4    995      99.5
da Danish       1000    976      97.6    968      96.8    932      93.2
de German       1000    999      99.9    998      99.8    1000     100.0
el Greek        1000    -        -       1000     100.0   1000     100.0
en English      1000    999      99.9    996      99.6    1000     100.0
es Spanish      1000    1000     100.0   996      99.6    989      98.9
et Estonian     1000    -        -       996      99.6    998      99.8
fi Finnish      1000    997      99.7    998      99.8    1000     100.0
fr French       1000    999      99.9    999      99.9    992      99.2
hu Hungarian    1000    1000     100.0   999      99.9    999      99.9
it Italian      1000    999      99.9    999      99.9    996      99.6
lt Lithuanian   1000    -        -       997      99.7    999      99.9
lv Latvian      1000    -        -       999      99.9    998      99.8
nl Dutch        1000    1000     100.0   974      97.4    995      99.5
pl Polish       1000    998      99.8    999      99.9    997      99.7
pt Portuguese   1000    995      99.5    996      99.6    989      98.9
ro Romanian     1000    1000     100.0   999      99.9    998      99.8
sk Slovak       1000    -        -       988      98.8    990      99.0
sl Slovene      1000    -        -       976      97.6    963      96.3
sv Swedish      1000    995      99.5    991      99.1    993      99.3
total           21000   13957    99.7    20850    99.3    20814    99.1
Conclusions
bull Language detector using the maximal substring model
ndash Detects 19 languages with over 99% accuracy
ndash Even langdetect reaches 97% accuracy when trained on the tweet corpus
bull If the corpus is maintained, the precision will improve further
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to the features, the precision will improve further
ndash Open question: how to add and train metadata at low cost
bull We want to shrink the model without loss of precision
ndash It is too large for applications (> 100MB)
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Document Categorization with Naive Bayes Classifier
bull Categorize a document $X = (X_i)$ into a category $C_k$
ndash A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
bull Words are assumed to be conditionally independent given the category
ndash $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
ndash where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
bull Choose the category $C_k$ that maximizes the posterior
ndash $p(C_k \mid X) = \dfrac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
ndash where $p(C_k)$ is the prior of the category
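The classifier on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the training data is made up, add-one smoothing is added so that unseen words do not produce a zero probability (the slide only mentions relative frequencies), and all function names are hypothetical.

```python
import math
from collections import Counter

def train(samples):
    """samples: iterable of (category, list of words)."""
    doc_counts = Counter()
    word_counts = {}
    for category, words in samples:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(words)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocabulary

def log_posterior(words, category, model):
    """log p(C_k) + sum_i log p(X_i | C_k), up to the constant log p(X)."""
    doc_counts, word_counts, vocabulary = model
    counts = word_counts[category]
    score = math.log(doc_counts[category] / sum(doc_counts.values()))
    # Add-one smoothing is an assumption of this sketch, not taken from the slides
    denominator = sum(counts.values()) + len(vocabulary)
    for word in words:
        score += math.log((counts[word] + 1) / denominator)
    return score

def classify(words, model):
    """Return the category with the largest (unnormalized) log posterior."""
    return max(model[0], key=lambda c: log_posterior(words, c, model))

# Made-up two-document training set, for illustration only
model = train([("en", ["the", "cat"]), ("de", ["der", "hund"])])
print(classify(["the"], model))
```

Working in log space turns the product over words into a sum, which avoids numeric underflow for longer documents.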
Language Detection with Naive Bayes Classifier
bull Document categorization with language labels
ndash Categorize documents into English, Japanese and so on
bull Use character n-grams as features
ndash Unicode code point n-grams, strictly speaking
ndash Assume the character encoding of the document is already known
bull Most applications know the encoding of their internal text data
Why Use n-Gram to Detect Language
bull Each language has its own characters and spelling rules
ndash "é" is often used in Spanish, Italian and so on, but in principle not in English
ndash Many words start with "Z" in German, but not in English
ndash Many words start with "C" in English, but not in German
ndash The spelling "Th" is often used in English, but not in other languages
bull n-grams of the word "This" (_ marks a word boundary)
ndash T, h, i, s ← 1-gram
ndash _T, Th, hi, is, s_ ← 2-gram
ndash _Th, Thi, his, is_ ← 3-gram
         C     L     Z     Th
English  0.75  0.47  0.02  0.74
German   0.10  0.37  0.53  0.03
French   0.38  0.69  0.01  0.01
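The n-gram lists for "This" can be generated as sketched below. The sketch assumes that a single space marks each word boundary for n greater than 1, which is what the item counts on the slide suggest; `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word: str, n: int, pad: str = " "):
    """Return the character n-grams of a word, padded with a boundary marker for n > 1."""
    padded = pad + word + pad if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

With the padding, word-initial and word-final letters become features of their own, which is what lets a model learn facts like "many German words start with Z".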
language-detection (langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash http://code.google.com/p/language-detection/
ndash Apache License 2.0
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + feature sampling
bull Over 99% precision for 53 languages
ndash Trained on Wikipedia abstracts
ndash Wide language support, including Asian languages
ndash Adopted by Apache Solr
Estimation with News Text
bull Test on news text crawled from the web in 49 languages
language                   size  correct  accuracy
af Afrikaans               200   199      99.50
ar Arabic                  200   200      100.00
bg Bulgarian               200   200      100.00
bn Bengali                 200   200      100.00
cs Czech                   200   200      100.00
da Danish                  200   179      89.50
de German                  200   200      100.00
el Greek                   200   200      100.00
en English                 200   200      100.00
es Spanish                 200   200      100.00
fa Persian                 200   200      100.00
fi Finnish                 200   200      100.00
fr French                  200   200      100.00
gu Gujarati                200   200      100.00
he Hebrew                  200   200      100.00
hi Hindi                   200   200      100.00
hr Croatian                200   200      100.00
hu Hungarian               200   200      100.00
id Indonesian              200   200      100.00
it Italian                 200   200      100.00
ja Japanese                200   200      100.00
kn Kannada                 200   200      100.00
ko Korean                  200   200      100.00
mk Macedonian              200   200      100.00
ml Malayalam               200   200      100.00
mr Marathi                 200   200      100.00
ne Nepali                  200   200      100.00
nl Dutch                   200   200      100.00
no Norwegian               200   199      99.50
pa Punjabi                 200   200      100.00
pl Polish                  200   200      100.00
pt Portuguese              200   200      100.00
ro Romanian                200   200      100.00
ru Russian                 200   200      100.00
sk Slovak                  200   200      100.00
so Somali                  200   200      100.00
sq Albanian                200   200      100.00
sv Swedish                 200   200      100.00
sw Swahili                 200   200      100.00
ta Tamil                   200   200      100.00
te Telugu                  200   200      100.00
th Thai                    200   200      100.00
tl Tagalog                 200   200      100.00
tr Turkish                 200   200      100.00
uk Ukrainian               200   200      100.00
ur Urdu                    200   200      100.00
vi Vietnamese              200   200      100.00
zh-cn Simplified Chinese   200   200      100.00
zh-tw Traditional Chinese  200   200      100.00
total                      9800  9777     99.77
Estimation with Europarl datasets
bull Test on 1000 samples per language from the Europarl Parallel Corpus
ndash from the proceedings of the European Parliament
ndash http://www.statmt.org/europarl/
bull http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language        size   correct  accuracy
bg Bulgarian    1000   988      98.8
cs Czech        1000   994      99.4
da Danish       1000   968      96.8
de German       1000   998      99.8
el Greek        1000   1000     100.0
en English      1000   996      99.6
es Spanish      1000   996      99.6
et Estonian     1000   996      99.6
fi Finnish      1000   998      99.8
fr French       1000   999      99.9
hu Hungarian    1000   999      99.9
it Italian      1000   999      99.9
lt Lithuanian   1000   997      99.7
lv Latvian      1000   999      99.9
nl Dutch        1000   974      97.4
pl Polish       1000   999      99.9
pt Portuguese   1000   996      99.6
ro Romanian     1000   999      99.9
sk Slovak       1000   988      98.8
sl Slovene      1000   976      97.6
sv Swedish      1000   991      99.1
total           21000  20850    99.3
Language detection is over, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
bull Only 90-95% accuracy on the tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language Detection
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash ms (Malay) is regarded as id (Indonesian)
bull Tika = Apache Tika
ndash http://tika.apache.org/
ndash Evaluated on the 15 languages in our tweet corpus that Tika supports
language        LD    CLD   Tika
ca Catalan      95.3  93.0  83.8
cs Czech        96.3  96.6  ----
da Danish       94.5  90.7  58.7
de German       86.6  96.8  73.1
en English      88.3  97.4  54.7
es Spanish      91.5  90.5  44.4
fi Finnish      98.9  99.4  94.8
fr French       95.0  94.5  67.4
hu Hungarian    85.8  89.0  76.2
id Indonesian   89.7  92.8  ----
it Italian      96.2  93.8  87.1
nl Dutch        69.5  93.2  65.0
no Norwegian    96.0  74.9  68.6
pl Polish       98.0  97.8  88.8
pt Portuguese   88.0  88.6  47.4
ro Romanian     92.8  96.1  82.6
sv Swedish      96.0  96.4  75.6
tr Turkish      97.6  97.4  ----
vi Vietnamese   98.7  98.9  ----
total           92.2  93.8  70.0
Chromium Compact Language Detection (CLD)
bull A port of the language detector from Google Chromium
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash Implemented in C++, with a Python binding
ndash Number of supported languages: CLD = 76, langdetect = 53
ndash Accuracy: CLD = 98.82%, langdetect = 99.22%
bull for 17 languages on Europarl datasets
bull http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
bull A tweet is too short to extract enough 3-gram features
ndash At most 140 characters on twitter
ndash URLs, mentions and hashtags are not useful for detection
bull LIGA [Tromp+ 11]
ndash Graph features based on 3-grams
bull Adds long-distance features
bull 95-98% accuracy for twitter language detection
bull 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
bull Tweets are too noisy
ndash Expressions that violate the language's orthography often appear
ndash Acronyms, abbreviations, lengthened words (like Cooooolll)
bull The likelihood of a tweet tends to be small under a normal language model
OMG   Oh My God
LOL   Laughing Out Loud
LMAO  Laughing My Ass Off
F4F   Follow for Follow
MDR   Mort de Rire (French)
TKT   Ne t'Inquiète Pas (French)
u     you
ur    your
4     for
i0u   I love you
k     che (Italian)
anke  anche (Italian)
bull The letter k isn't used in Italian
Motivation to Detect Short Text Language
bull There are many small chunks of text besides tweets
ndash Schedules, search queries, bulletin boards and so on
ndash The issue tracker of the langdetect project has many questions about short text detection
bull http://code.google.com/p/language-detection/issues/detail?id=10
bull Detection for multi-language mixed text
ndash Cut the target document into paragraphs or lines
ndash Detect the language of each short text
Our Goal
bull Over 99% accuracy
ndash However, a one-word sentence is too difficult to detect
ndash Our goal is 99+% accurate detection for sentences with more than 3 words
We need
bull A model that can extract rich features from short text
ndash Maximal substring model (∞-gram Logistic Regression)
bull and a twitter-specific language model, or a corpus to construct it
ndash a corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
bull The larger n is, the more features there are
bull The maximum is at n=∞, that is, all substrings
ndash But their number is of order O(T^2)
bull Cumulative distribution of feature length (number of n-grams up to length n) for 5090 normalized English tweets (300KB)
n   freq≥1   freq≥2   freq≥10
1   79       72       57
2   1896     1533     902
3   15970    10369    4525
4   64966    33941    10534
5   167543   69719    15538
6   323749   107861   18970
7   524634   142954   21093
8   760719   171995   22159
9   921361   193995   22696
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all substrings as features
ndash Maximal substrings give an equivalent model that can be constructed in linear time
ndash Features are stored in a TRIE for fast prediction
Maximal Substring (1)
bull Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. abracadabra
ndash "ra" ⊂ "bra" ⇔ every occurrence of "ra" is a substring of an occurrence of "bra"
ndash "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
bull Strictly, the definition also fixes the position within the containing substring
Maximal Substring (2)
bull Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
bull The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
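The abracadabra example can be checked with a brute-force sketch. It is not the linear-time construction used in practice: it simply treats a substring as maximal when no single fixed character can be added on its left or right without losing an occurrence, which is an assumed reading of the definition above. The function names are hypothetical.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    return sum(text.startswith(pattern, i) for i in range(len(text) - len(pattern) + 1))

def maximal_substrings(text: str):
    """Brute-force search; far slower than a suffix-array approach."""
    alphabet = set(text)
    found = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in found:
                continue
            freq = count_occurrences(text, candidate)
            # If one fixed character can be added on either side without losing
            # any occurrence, a longer string belongs to the same class.
            extendable = any(
                count_occurrences(text, c + candidate) == freq
                or count_occurrences(text, candidate + c) == freq
                for c in alphabet
            )
            if not extendable:
                found.add(candidate)
    return found

print(sorted(maximal_substrings("abracadabra"), key=len))
```

For "abracadabra" this yields the three strings named on the slide. For example, "bra" is not maximal because both of its occurrences are preceded by "a", so it falls into the class of "abra".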
Maximal Substring and Infinity-Gram
bull Substrings in a containment relationship always have equal frequencies
bull In a model that is a linear combination of features, features with identical values can be merged into one
bull Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
bull Although the equivalence breaks down on a test set, we assume that a sufficiently large training set approximates it
Extended Suffix Array
bull An Extended Suffix Array consists of
ndash SA = Suffix Array
ndash L = Longest Common Prefixes
ndash B = Burrows-Wheeler transformed text
bull A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
ndash They can be calculated in linear time
bull esaxx: Okanohara's implementation of ESA
ndash http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
bull Limit the character types to detect
ndash In short text detection, mixed text can be divided by character type
bull Latin-alphabet languages
ndash The most difficult alphabet type to detect
ndash More than 25 of these languages have over 5 million speakers
What's Latin Alphabet?
bull Latin alphabet ≠ ascii alphabet
ndash å ą æ ð Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range         Name                         Supplement
U+0000-007F   Basic Latin                  ascii
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             (same as above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D             (same as above)
Latin Alphabets in Unicode Codepoint Chart
(Chart legend: used for Vietnamese only / used often / used sometimes)
How to Create Corpus
bull Collect tweets with the sample method of the twitter Streaming API
ndash It samples 1% of all tweets (about 2 million tweets)
ndash Tweets in Latin-alphabet languages account for 60% of them
bull All that remains is to annotate these tweets with language labels
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1% of all tweets
ndash But they account for 50% of the tweets in the Paris timezone
bull Annotate tentative labels to tweets using langdetect
ndash Remove non-French tweets from those labeled 'fr'
ndash Recover French tweets from those not labeled 'fr'
bull (20% of all tweets have no timezone)
How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
Created Corpus
bull Noiseless tweets as training data
bull Noisy tweets with more than 3 words as test data
bull Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation
language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773
Simple Language Detection
bull A language detector can be constructed from the maximal substring model and the twitter corpus
ndash But it reaches at most 98% accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Bias by Data Size
bull The number of tweets per language is hugely biased
bull Level them out by sampling with replacement from each language up to the size of the largest one
ndash In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement
(Chart of tweet volume by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
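The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. The corpus layout (a dict from language code to a non-empty list of tweets), the fixed seed and the name `level_out` are assumptions made for illustration.

```python
import random

def level_out(corpus, seed=0):
    """Grow every language's tweet list to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        # Whole copies of the data, then the remainder sampled without replacement
        copies, remainder = divmod(target, len(tweets))
        balanced[lang] = tweets * copies + rng.sample(tweets, remainder)
    return balanced

corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
balanced = level_out(corpus)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```

Compared with plain sampling with replacement, this variant guarantees that every original tweet appears at least once and that no tweet is over-represented by more than one copy.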
Convert to Lowercase on Multiple Languages
bull Converting to lower case reduces the amount of corpus needed and compresses the model
bull But the lower case of I (U+0049) in Turkish differs from that in other languages
bull Convert to lower case, excluding 'I'
Language              Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
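A minimal sketch of lower-casing everything except the capital 'I' (U+0049). The helper name is hypothetical, and the handling of other characters is simply whatever Python's `str.lower` does, which may differ from the actual ldig implementation.

```python
def lower_except_capital_i(text: str) -> str:
    """Lower-case every character except the ASCII capital I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)

print(lower_except_capital_i("ISTANBUL Is Big"))
```

Leaving 'I' untouched avoids deciding between 'i' and 'ı' before the language is known, at the cost of keeping one upper-case letter among the features.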
Normalization for Romanian
bull Romanian uses â ă î ș ț in addition to a-z
bull There are 2 kinds of s/t with a "beard"
ndash U+015E-F, U+0162-3 : s/t with cedilla
ndash U+0218-B : s/t with comma below
bull 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
bull The 2 kinds have the same design in some fonts
ndash Indistinguishable
ndash ș ş (U+0219, U+015F)
ndash ț ţ (U+021B, U+0163)
Romanian Character Affairs on PC
bull Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
ndash 1989 Democratization of Romania
ndash 2001 's/t with comma' was provided by ISO8859-16 (Latin-10) and Unicode
ndash 2007 Romania joined the EU
ndash 2007 Windows Vista supported 's/t with comma' (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Data Format
bull Training and test data
ndash [correct label]\t[metadata]\t[text]
Examples:
en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
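A minimal sketch of reading one record in this format follows. It assumes that the first tab-separated field is the label, the last is the text, and anything in between is metadata; the function name `parse_record` is hypothetical and this is not ldig's own loader.

```python
def parse_record(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # may be empty strings when metadata is omitted
    return label, metadata, text
```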
Usage (5) Estimation Tool
bull server.py -m [model] -p [port number]
ndash Open http://localhost:[port] after starting it
ndash For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Estimation
language        size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan      5093    4923    4857     98.66         95.37      95.3     97.0
cs Czech        7681    7668    7663     99.93         99.77      96.3     99.7
da Danish       5516    5472    5310     97.04         96.27      94.5     92.4
de German       10060   10069   10006    99.37         99.46      86.6     93.8
en English      10162   10133   10029    98.97         98.69      88.3     95.0
es Spanish      10244   10284   10120    98.41         98.79      91.5     96.0
fi Finnish      7051    7038    7024     99.80         99.62      98.9     99.6
fr French       10074   10134   10051    99.18         99.77      95.0     98.1
hu Hungarian    4904    4892    4858     99.30         99.06      85.8     95.5
id Indonesian   10178   10225   10160    99.36         99.82      89.7     98.9
it Italian      10143   10205   10103    99.00         99.61      96.2     98.0
nl Dutch        10005   9916    9858     99.42         98.53      69.5     97.4
no Norwegian    8504    8432    8201     97.26         96.44      96.0     96.3
pl Polish       10151   10149   10130    99.81         99.79      98.0     99.7
pt Portuguese   10212   10201   10119    99.20         99.09      88.0     96.9
ro Romanian     5913    5867    5850     99.71         98.93      92.8     97.4
sv Swedish      10025   10093   9942     98.50         99.17      96.0     97.9
tr Turkish      10308   10317   10298    99.82         99.90      97.6     99.5
vi Vietnamese   10487   10480   10474    99.94         99.88      98.7     99.2
total           166711          165053                 99.01      92.2     97.4
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
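The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. The slide does not state these definitions; they are inferred here from the numbers in the table, and the function name is hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals.

    Assumed definitions: precision = correct / detect, recall = correct / size.
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


# Catalan row of the table: size 5093, detect 4923, correct 4857.
print(precision_recall(5093, 4923, 4857))  # (98.66, 95.37)
```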
Estimation for LIGA dataset
bull Estimated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
ndash http://www.win.tue.nl/~mpechen/projects/smm/
bull The 19-language model is used
language     size   detect  correct  precision(%)  recall(%)
de German    1479   1476    1469     99.5          99.3
en English   1505   1502    1490     99.2          99.0
es Spanish   1562   1548    1541     99.6          98.7
fr French    1551   1549    1540     99.4          99.3
it Italian   1539   1531    1528     99.8          99.3
nl Dutch     1430   1429    1424     99.7          99.6
total        9066           8992                   99.2
Estimation for Europarl Dataset
                        ldig             langdetect       CLD
language        size    correct rate(%)  correct rate(%)  correct rate(%)
bg Bulgarian    1000    -       -        988     98.8     991     99.1
cs Czech        1000    1000    100.0    994     99.4     995     99.5
da Danish       1000    976     97.6     968     96.8     932     93.2
de German       1000    999     99.9     998     99.8     1000    100.0
el Greek        1000    -       -        1000    100.0    1000    100.0
en English      1000    999     99.9     996     99.6     1000    100.0
es Spanish      1000    1000    100.0    996     99.6     989     98.9
et Estonian     1000    -       -        996     99.6     998     99.8
fi Finnish      1000    997     99.7     998     99.8     1000    100.0
fr French       1000    999     99.9     999     99.9     992     99.2
hu Hungarian    1000    1000    100.0    999     99.9     999     99.9
it Italian      1000    999     99.9     999     99.9     996     99.6
lt Lithuanian   1000    -       -        997     99.7     999     99.9
lv Latvian      1000    -       -        999     99.9     998     99.8
nl Dutch        1000    1000    100.0    974     97.4     995     99.5
pl Polish       1000    998     99.8     999     99.9     997     99.7
pt Portuguese   1000    995     99.5     996     99.6     989     98.9
ro Romanian     1000    1000    100.0    999     99.9     998     99.8
sk Slovak       1000    -       -        988     98.8     990     99.0
sl Slovene      1000    -       -        976     97.6     963     96.3
sv Swedish      1000    995     99.5     991     99.1     993     99.3
total           21000   13957   99.7     20850   99.3     20814   99.1
The ldig total covers only the languages that ldig supports (- = not supported by ldig)
Conclusions
bull Language detector using the maximal substring model
ndash Detects 19 languages with over 99% accuracy
ndash Even langdetect reaches 97% accuracy when trained on the tweet corpus
bull If the corpus is maintained further, the precision will rise further
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to the features, the precision will rise further
ndash Open question: how to add and train on metadata at low cost
bull We want to shrink the model without loss of precision
ndash It is too large for applications (>100MB)
References
bull [Nakatani NLP12] Language Detection for Twitter Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Language Detection with Naive Bayes Classifier
bull Document categorization with language labels
ndash Categorize documents into English, Japanese and so on
bull Use character n-grams as features
ndash Strictly speaking, n-grams of Unicode code points
ndash Assume the character encoding of the document is already known
bull Most applications know the encoding of their internal text data
Why Use n-Gram to Detect Language
bull Each language has its own characters and spelling rules
ndash "é" is often used in Spanish, Italian and so on, but in principle not in English
ndash Many words start with "Z" in German, but not in English
ndash Many words start with "C" in English, but not in German
ndash The spelling "Th" is often used in English, but not in the other languages
Example: n-grams of "This" ("_" marks a word boundary)
1-gram: T h i s
2-gram: _T Th hi is s_
3-gram: _Th Thi his is_
          C      L      Z      Th
English   0.75   0.47   0.02   0.74
German    0.10   0.37   0.53   0.03
French    0.38   0.69   0.01   0.01
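The n-gram example for "This" can be reproduced with a few lines of Python. This is a sketch, not langdetect's implementation: the padding character `_` and the choice to pad only for n ≥ 2 are assumptions made to match the figure, and `char_ngrams` is a hypothetical name.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of a word, padding both ends for n >= 2."""
    if n == 1:
        return list(word)
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

The boundary padding is what lets a feature such as "word starts with Th" be distinguished from "Th" in the middle of a word.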
language-detection (langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash http://code.google.com/p/language-detection/
ndash Apache License 2.0
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + feature sampling
bull Over 99% precision for 53 languages
ndash Trained on Wikipedia abstracts
ndash Broad language support, including Asian languages
ndash Adopted by Apache Solr
Estimation with News Text
bull Tested on news text crawled from the web in 49 languages
language                    size   correct  accuracy(%)
af Afrikaans                200    199      99.50
ar Arabic                   200    200      100.00
bg Bulgarian                200    200      100.00
bn Bengali                  200    200      100.00
cs Czech                    200    200      100.00
da Danish                   200    179      89.50
de German                   200    200      100.00
el Greek                    200    200      100.00
en English                  200    200      100.00
es Spanish                  200    200      100.00
fa Persian                  200    200      100.00
fi Finnish                  200    200      100.00
fr French                   200    200      100.00
gu Gujarati                 200    200      100.00
he Hebrew                   200    200      100.00
hi Hindi                    200    200      100.00
hr Croatian                 200    200      100.00
hu Hungarian                200    200      100.00
id Indonesian               200    200      100.00
it Italian                  200    200      100.00
ja Japanese                 200    200      100.00
kn Kannada                  200    200      100.00
ko Korean                   200    200      100.00
mk Macedonian               200    200      100.00
ml Malayalam                200    200      100.00
mr Marathi                  200    200      100.00
ne Nepali                   200    200      100.00
nl Dutch                    200    200      100.00
no Norwegian                200    199      99.50
pa Punjabi                  200    200      100.00
pl Polish                   200    200      100.00
pt Portuguese               200    200      100.00
ro Romanian                 200    200      100.00
ru Russian                  200    200      100.00
sk Slovak                   200    200      100.00
so Somali                   200    200      100.00
sq Albanian                 200    200      100.00
sv Swedish                  200    200      100.00
sw Swahili                  200    200      100.00
ta Tamil                    200    200      100.00
te Telugu                   200    200      100.00
th Thai                     200    200      100.00
tl Tagalog                  200    200      100.00
tr Turkish                  200    200      100.00
uk Ukrainian                200    200      100.00
ur Urdu                     200    200      100.00
vi Vietnamese               200    200      100.00
zh-cn Simplified Chinese    200    200      100.00
zh-tw Traditional Chinese   200    200      100.00
total                       9800   9777     99.77
Estimation with Europarl datasets
bull Tested on 1000 samples per language from the Europarl Parallel Corpus
ndash Taken from the proceedings of the European Parliament
ndash http://www.statmt.org/europarl/
bull http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language        size    correct  accuracy(%)
bg Bulgarian    1000    988      98.8
cs Czech        1000    994      99.4
da Danish       1000    968      96.8
de German       1000    998      99.8
el Greek        1000    1000     100.0
en English      1000    996      99.6
es Spanish      1000    996      99.6
et Estonian     1000    996      99.6
fi Finnish      1000    998      99.8
fr French       1000    999      99.9
hu Hungarian    1000    999      99.9
it Italian      1000    999      99.9
lt Lithuanian   1000    997      99.7
lv Latvian      1000    999      99.9
nl Dutch        1000    974      97.4
pl Polish       1000    999      99.9
pt Portuguese   1000    996      99.6
ro Romanian     1000    999      99.9
sk Slovak       1000    988      98.8
sl Slovene      1000    976      97.6
sv Swedish      1000    991      99.1
total           21000   20850    99.3
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
bull Only 90-95% accuracy on the tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language Detection
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash ms (Malay) is regarded as id (Indonesian)
bull Tika = Apache Tika
ndash http://tika.apache.org/
ndash Estimated on the 15 languages in our tweet corpus that Tika supports
language        LD(%)   CLD(%)  Tika(%)
ca Catalan      95.3    93.0    83.8
cs Czech        96.3    96.6    ----
da Danish       94.5    90.7    58.7
de German       86.6    96.8    73.1
en English      88.3    97.4    54.7
es Spanish      91.5    90.5    44.4
fi Finnish      98.9    99.4    94.8
fr French       95.0    94.5    67.4
hu Hungarian    85.8    89.0    76.2
id Indonesian   89.7    92.8    ----
it Italian      96.2    93.8    87.1
nl Dutch        69.5    93.2    65.0
no Norwegian    96.0    74.9    68.6
pl Polish       98.0    97.8    88.8
pt Portuguese   88.0    88.6    47.4
ro Romanian     92.8    96.1    82.6
sv Swedish      96.0    96.4    75.6
tr Turkish      97.6    97.4    ----
vi Vietnamese   98.7    98.9    ----
total           92.2    93.8    70.0
Chromium Compact Language Detection (CLD)
bull A port of the language detector from Google Chromium
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash Implemented in C++, with a Python binding
ndash Number of supported languages: CLD = 76, langdetect = 53
ndash Accuracy: CLD = 98.82%, langdetect = 99.22%
bull for 17 languages on the Europarl datasets
bull http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
bull A tweet is too short to extract enough 3-gram features
ndash At most 140 characters on twitter
ndash URLs, mentions and hashtags are not useful for detection
bull LIGA [Tromp+ 11]
ndash Graph features based on 3-grams
bull Adds long-distance features
bull 95-98% accuracy for twitter language detection
bull 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
bull Tweets are too noisy
ndash Spellings that violate the language's orthography appear often
ndash Acronyms, abbreviations, lengthened words (like Cooooolll)
bull The likelihood of a tweet tends to be low under a normal language model
Examples:
OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
(The letter k isn't used in Italian)
Motivation to Detect Short Text Language
bull There are many small chunks of text besides tweets
ndash Schedules, search queries, bulletin boards and so on
ndash There are many questions about short text detection on the issue tracker of the langdetect project
bull http://code.google.com/p/language-detection/issues/detail?id=10
bull Detection for text that mixes multiple languages
ndash Split the target document into paragraphs or lines
ndash Detect the language of each short text
Our Goal
bull Over 99% accuracy
ndash However, detecting a one-word sentence is too difficult
ndash Our goal is 99+% accurate detection for sentences with more than 3 words
We need
bull A model that can extract rich features from short text
ndash Maximal substring model (∞-gram logistic regression)
bull and a twitter-specific language model, or a corpus to construct it
ndash A corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
bull The larger n is, the more features there are
bull The maximum is at n=∞, that is, all substrings
ndash But their number is of order O(T^2)
n-gram   # of n-grams
         freq≥1   freq≥2   freq≥10
1        79       72       57
2        1896     1533     902
3        15970    10369    4525
4        64966    33941    10534
5        167543   69719    15538
6        323749   107861   18970
7        524634   142954   21093
8        760719   171995   22159
9        921361   193995   22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass logistic regression using all substrings as features
ndash Using maximal substrings yields an equivalent model that can be constructed in linear time
ndash Features are stored in a TRIE for fast prediction
Maximal Substring (1)
bull Define a containment relation (a partial order) among non-empty substrings
Example: abracadabra
ndash "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"
ndash "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca". Strictly speaking, the definition also takes into account the position within the substring
Maximal Substring (2)
bull Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
bull The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
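The abracadabra example can be checked with a brute-force sketch. It uses a simplified reading of the definition: a substring is maximal when its occurrences are not all preceded by the same character and not all followed by the same character. This is cubic-time illustration code with a hypothetical function name, not the linear-time extended suffix array method that the talk relies on.

```python
def maximal_substrings(text):
    """Brute-force maximal substrings of text (illustration only, O(n^3))."""
    n = len(text)
    result = set()
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            sub = text[start:start + length]
            if sub in result:
                continue
            # All (possibly overlapping) occurrence positions of sub.
            positions = [i for i in range(n - length + 1) if text.startswith(sub, i)]
            # Neighbouring characters; None marks the start or end of the text.
            left = {text[i - 1] if i > 0 else None for i in positions}
            right = {text[i + length] if i + length < n else None for i in positions}
            left_extendable = len(left) == 1 and None not in left
            right_extendable = len(right) == 1 and None not in right
            if not left_extendable and not right_extendable:
                result.add(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))  # ['a', 'abra', 'abracadabra']
```

For instance "bra" is dropped because every occurrence is preceded by "a", so it always co-occurs with "abra" and carries no extra information.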
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
Maximal Substring and Infinity-Gram
bull Substrings related by containment (in the same equivalence class) always have equal frequencies
bull In a model that is a linear combination of features, features that always take the same value can be merged into one
bull Hence logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Although the equivalence breaks down on the test set, we assume that it holds approximately given a sufficiently large training set
Extended Suffix Array
bull An extended suffix array consists of
ndash SA = suffix array
ndash L = longest common prefixes
ndash B = Burrows-Wheeler transformed text
bull A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
ndash They can be computed in linear time
bull esaxx: Okanohara's implementation of ESA
ndash http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
bull Limit the character types to detect
ndash In short text detection, mixed text can be split by character type
bull Latin-alphabet languages
ndash The most difficult script to detect
ndash More than 25 of them have over 5 million speakers
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
How to Create Corpus
bull Collect tweets with the sample method of the twitter Streaming API
ndash It samples 1% of all tweets (about 2 million tweets)
ndash Tweets in Latin-alphabet languages account for 60% of them
bull All that remains is to annotate these tweets with language labels
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for only about 1% of all tweets
ndash But they account for 50% of the tweets in the Paris timezone
ndash (20% of all tweets have no timezone)
bull Annotate tweets with tentative labels using langdetect
ndash Remove non-French tweets from those labeled 'fr'
ndash Recover French tweets from those not labeled 'fr'
How to annotate
[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]
Created Corpus
bull Noiseless tweets as training data
bull Noisy tweets with more than 3 words as test data
bull Worked with Raúl Velaz and Hiroshi Manabe on creating the Catalan corpus
language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773
Simple Language Detection
bull A language detector can be constructed from the maximal substring model and the twitter corpus
ndash But it still reaches at most 98% accuracy
bull We guess that it is necessary to reduce the following biases
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Bias by Data Size
bull The number of tweets varies hugely between languages
bull Level them out by sampling with replacement from each language up to the size of the largest one
ndash In practice this is approximated by copying the data a whole number of times and sampling the remainder without replacement
[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
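The approximation described above (whole copies plus a remainder sampled without replacement) might look like the following. The function name `level_out` and its signature are assumptions for illustration; the slides do not show the actual code.

```python
import random


def level_out(texts, target_size, rng):
    """Grow a language's data to target_size by copying it a whole number
    of times and sampling the remainder without replacement."""
    copies, remainder = divmod(target_size, len(texts))
    return texts * copies + rng.sample(texts, remainder)


rng = random.Random(0)
balanced = level_out(["a", "b", "c"], 8, rng)
print(len(balanced))  # 8
```

Compared with plain sampling with replacement, this keeps every original tweet represented almost equally often.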
Convert to Lowercase on Multiple Languages
bull Converting to lower case reduces the corpus needed and compresses the model
bull But the lower case of I (U+0049) in Turkish differs from that in other languages
bull So convert to lower case, excluding 'I'
Language               Upper case   Lower case
Turkish, Azerbaijani   I (U+0049)   ı (U+0131)
                       İ (U+0130)   i (U+0069)
Others                 I (U+0049)   i (U+0069)
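A minimal sketch of lower-casing everything except 'I' (U+0049) is shown below. The function name is hypothetical, and how ldig treats İ (U+0130) is not stated on the slide, so that character is simply passed to Python's default lower-casing here.

```python
def lower_except_capital_i(text):
    """Lower-case every character except the ASCII capital I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))  # Istanbul Is big
```

Leaving 'I' untouched avoids committing to either the Turkish mapping (ı) or the default mapping (i) before the language is known.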
Normalization for Romanian
bull Romanian uses â ă î ș ț in addition to a-z
bull There are 2 kinds of s/t with a "beard"
ndash U+015E-F, U+0162-3: s/t with cedilla
ndash U+0218-B: s/t with comma below
bull 's/t with cedilla' is more common in news, on twitter and on Wikipedia
bull The 2 kinds have the same glyph in some fonts
ndash Indistinguishable
comma below   cedilla
ș U+0219      ş U+015F
ț U+021B      ţ U+0163
Romanian Character Affairs on PC
bull Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
ndash 1989: democratization of Romania
ndash 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
ndash 2007: Romania joined the EU
ndash 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Figure: 's/t with cedilla' used on an advertising board in Bucharest]
Normalization for Substitute Characters
bull 's/t with cedilla' are substitute characters
ndash But they are more common than the correct ones
ndash with cedilla : with comma = 2 : 1
ndash The "Romanian IME" outputs the substitutes too :D
bull Regard 's/t with comma' as 's/t with cedilla'
ndash ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ)
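The mapping from 's/t with comma' to 's/t with cedilla' can be written as a translation table. This is a sketch of the rule stated above using the code points listed on the earlier slide; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical.

```python
# Map s/t with comma below (U+0218-021B) to s/t with cedilla (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # capital S
    "\u0219": "\u015f",  # small s
    "\u021a": "\u0162",  # capital T
    "\u021b": "\u0163",  # small t
})


def normalize_romanian(text):
    """Replace the comma-below letters with their cedilla substitutes."""
    return text.translate(COMMA_TO_CEDILLA)
```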
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have a similar problem
bull The character 'yeh' in Farsi corresponds to 2 code points
ndash Wikipedia uses only ی (U+06CC, Farsi yeh)
ndash News uses only ي (U+064A, Arabic yeh)
bull U+064A is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped to U+06CC
ndash As 'yeh' is used very often in both languages, almost all Persian text detection fails
bull Regard U+06CC as U+064A
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă â e ê i y o ô ơ u ư
bull Vietnamese has 6 tones
ndash a ả à ã á ạ
ndash These tone marks are also used in general documents such as news
bull A tone mark can be attached to any vowel
ndash 12 × 6 = 72
Normalization for Vietnamese (2)
bull Representation of vowels with tones
1 Use U+1EA0 - U+1EF9
bull ẵ = U+1EB5
2 Combine with diacritical marks
bull ẵ = U+0103 U+0303
ndash News and tweets use the two about half and half
bull Normalize 2 into 1
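One standard way to turn representation 2 into representation 1 is Unicode NFC normalization, which composes a base letter and its combining marks into a precomposed code point when one exists. Whether ldig uses NFC or its own mapping table is not stated in the slides, so treat this as an illustration of the idea only.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed letters (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a with breve (U+0103) + combining tilde (U+0303)
print(compose_vietnamese(decomposed) == "\u1eb5")  # True
```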
CJK-Kanji Normalization (1) (on language-detection)
bull CJK kanji have too many characters (more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that don't occur in the training corpus have no probability
bull e.g. 谢谢, kanji for personal names
ndash Common frequent characters are too strong
bull e.g. a text containing "的" tends to be detected as Traditional Chinese
bull Because Japanese also uses kana, the probabilities of kanji are lower in Japanese than in Chinese
CJK-Kanji Normalization (2) (on language-detection)
bull Group kanji by frequency and normalize each group to a representative character
ndash (1) K-means clustering
bull Uses tf-idf on Wikipedia and Google News
bull K=50 (cf. the size of the ASCII alphabet = 52)
ndash (2) "Commonly used kanji" lists defined for Japanese and Chinese
bull Simplified Chinese: 现代汉语常用字表 (3500)
bull Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
bull Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
ndash 常用漢字 contains few kanji for personal names and place names
bull Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
bull Simply remove
ndash URLs
ndash mentions
ndash hashtags
ndash RT
ndash emoticons written with letters, like XD and :p
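A rough sketch of this removal step with regular expressions follows. The patterns are assumptions chosen for illustration (ldig's actual patterns are not shown in the slides), and emoticon removal is left out because the slide does not list the full set.

```python
import re

# Illustrative patterns; these are assumptions, not ldig's own.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_tweet_noise(text):
    """Remove URLs, mentions, hashtags and the RT marker, then squeeze spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_tweet_noise("RT @bob: check http://t.co/abc #cool stuff"))  # check stuff
```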
Normalization for twitter-Specific Representation
bull How to handle words like 'coooooooollllll'?
bull Case 1: make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll → cool
ndash It can't handle words that are not in the dictionary
bull Case 2: if the same character repeats 3 or more times, shrink the run to 2
ndash No language has 3 or more consecutive identical Latin letters in its orthography
bull Japanese does have such words: "かたたたき", "かわいいいぬ", "あわててて" and so on
bull Acronyms (like WWW, СССР) are not useful for language detection
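Case 2 fits in a single regular expression. The sketch below shrinks any run of 3 or more identical characters to exactly 2; the function name is hypothetical.

```python
import re

REPEATED = re.compile(r"(.)\1{2,}")  # one character followed by 2+ copies of itself


def shrink_repeats(text):
    """Shrink runs of 3 or more identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))  # cooll
```

The result "cooll" is not a dictionary word, but lengthened and normal spellings now share most of their substrings, which is what matters for substring features.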
Laugh Normalization
bull Each language has its own variety of laughs
ndash HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulär 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đó làm áo được ko em
bull Shrink them to two repetitions
ndash hahahha ⇒ haha
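For regular laughs, a two-character unit repeated 3 or more times can be shrunk to two repetitions with a backreference. This sketch assumes lower-cased input and does not cover irregular spellings such as "hahahha" from the example above, which would need a looser pattern; the function name is hypothetical.

```python
import re

LAUGH = re.compile(r"(\w\w)\1{2,}")  # a two-character unit repeated 3+ times


def shrink_laughs(text):
    """Shrink a regularly repeated two-character unit to two repetitions."""
    return LAUGH.sub(r"\1\1", text)


print(shrink_laughs("tafil con eso jajajajajajaja"))  # tafil con eso jaja
```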
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
bull Tweet language detection for Latin-alphabet languages
ndash https://github.com/shuyo/ldig
bull MIT license
bull The trained model is also distributed here
ndash ∞-gram LR (maximal substring) [Okanohara+ 09]
ndash L1 SGD (cumulative penalty) [Tsuruoka+ 09]
ndash Double array
Why Use n-Gram to Detect Language
bull Each language has proper characters and spelling rules
ndash ldquoeacuterdquo is often used in Spanish Italian and so on but not in English in principle
ndash There are many words which start with ldquoZrdquo in German but not in English
ndash There are many words which start with ldquoCrdquo in English but not in German
ndash Spelling ldquoThrdquo is often used in English but not in the other languages
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 13
T h i s T h i s larr1-gram
T Th hi is s larr2-gram
Th Thi his is larr3-gram
C L Z Th
English 075 047 002 074
German 010 037 053 003
French 038 069 001 001
language-detection(langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash httpcodegooglecomplanguage-detection
ndash Apache License 20
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + Feature sampling
bull 99 over precision for 53 languages
ndash Training with Wikipedia abstract
ndash Widely support including Asian languages
ndash Adopted by Apache Solr
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 14
Estimation with News Text
• Test on news text crawled from the web in 49 languages
code   language             size  correct  (accuracy)
af     Afrikaans            200   199      (99.50%)
ar     Arabic               200   200      (100.00%)
bg     Bulgarian            200   200      (100.00%)
bn     Bengali              200   200      (100.00%)
cs     Czech                200   200      (100.00%)
da     Danish               200   179      (89.50%)
de     German               200   200      (100.00%)
el     Greek                200   200      (100.00%)
en     English              200   200      (100.00%)
es     Spanish              200   200      (100.00%)
fa     Persian              200   200      (100.00%)
fi     Finnish              200   200      (100.00%)
fr     French               200   200      (100.00%)
gu     Gujarati             200   200      (100.00%)
he     Hebrew               200   200      (100.00%)
hi     Hindi                200   200      (100.00%)
hr     Croatian             200   200      (100.00%)
hu     Hungarian            200   200      (100.00%)
id     Indonesian           200   200      (100.00%)
it     Italian              200   200      (100.00%)
ja     Japanese             200   200      (100.00%)
kn     Kannada              200   200      (100.00%)
ko     Korean               200   200      (100.00%)
mk     Macedonian           200   200      (100.00%)
ml     Malayalam            200   200      (100.00%)
mr     Marathi              200   200      (100.00%)
ne     Nepali               200   200      (100.00%)
nl     Dutch                200   200      (100.00%)
no     Norwegian            200   199      (99.50%)
pa     Punjabi              200   200      (100.00%)
pl     Polish               200   200      (100.00%)
pt     Portuguese           200   200      (100.00%)
ro     Romanian             200   200      (100.00%)
ru     Russian              200   200      (100.00%)
sk     Slovak               200   200      (100.00%)
so     Somali               200   200      (100.00%)
sq     Albanian             200   200      (100.00%)
sv     Swedish              200   200      (100.00%)
sw     Swahili              200   200      (100.00%)
ta     Tamil                200   200      (100.00%)
te     Telugu               200   200      (100.00%)
th     Thai                 200   200      (100.00%)
tl     Tagalog              200   200      (100.00%)
tr     Turkish              200   200      (100.00%)
uk     Ukrainian            200   200      (100.00%)
ur     Urdu                 200   200      (100.00%)
vi     Vietnamese           200   200      (100.00%)
zh-cn  Simplified Chinese   200   200      (100.00%)
zh-tw  Traditional Chinese  200   200      (100.00%)
total                       9800  9777     (99.77%)
Estimation with Europarl datasets
• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
code  language    size   correct  accuracy (%)
bg    Bulgarian   1000   988      98.8
cs    Czech       1000   994      99.4
da    Danish      1000   968      96.8
de    German      1000   998      99.8
el    Greek       1000   1000     100.0
en    English     1000   996      99.6
es    Spanish     1000   996      99.6
et    Estonian    1000   996      99.6
fi    Finnish     1000   998      99.8
fr    French      1000   999      99.9
hu    Hungarian   1000   999      99.9
it    Italian     1000   999      99.9
lt    Lithuanian  1000   997      99.7
lv    Latvian     1000   999      99.9
nl    Dutch       1000   974      97.4
pl    Polish      1000   999      99.9
pt    Portuguese  1000   996      99.6
ro    Romanian    1000   999      99.9
sk    Slovak      1000   988      98.8
sl    Slovene     1000   976      97.6
sv    Swedish     1000   991      99.1
total             21000  20850    99.3
Language detection is a solved problem, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports
Accuracy (%):
code  language    LD    CLD   Tika
ca    Catalan     95.3  93.0  83.8
cs    Czech       96.3  96.6  ----
da    Danish      94.5  90.7  58.7
de    German      86.6  96.8  73.1
en    English     88.3  97.4  54.7
es    Spanish     91.5  90.5  44.4
fi    Finnish     98.9  99.4  94.8
fr    French      95.0  94.5  67.4
hu    Hungarian   85.8  89.0  76.2
id    Indonesian  89.7  92.8  ----
it    Italian     96.2  93.8  87.1
nl    Dutch       69.5  93.2  65.0
no    Norwegian   96.0  74.9  68.6
pl    Polish      98.0  97.8  88.8
pt    Portuguese  88.0  88.6  47.4
ro    Romanian    92.8  96.1  82.6
sv    Swedish     96.0  96.4  75.6
tr    Turkish     97.6  97.4  ----
vi    Vietnamese  98.7  98.9  ----
total             92.2  93.8  70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be small under a normal language model
Examples:
  OMG   Oh My God
  LOL   Laughing Out Loud
  LMAO  Laughing My Ass Off
  F4F   Follow for Follow
  MDR   Mort de Rire (French)
  TKT   Ne t'Inquiète Pas (French)
  u     you
  ur    your
  4     for
  i0u   I love you
  k     che (Italian)
  anke  anche (Italian)
  (The letter k isn't used in Italian)
Motivation to Detect Short Text Language
• There are many small chunks of text besides twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is of order O(T^2)
n   number of n-grams (cumulative)
    freq≥1   freq≥2   freq≥10
1   79       72       57
2   1896     1533     902
3   15970    10369    4525
4   64966    33941    10534
5   167543   69719    15538
6   323749   107861   18970
7   524634   142954   21093
8   760719   171995   22159
9   921361   193995   22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
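A count like the one in the table above can be reproduced on any small corpus with the following sketch. The function name `cumulative_ngram_counts` and the thresholds passed in are assumptions for illustration; the slide does not say how the original numbers were computed.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """For n = 1..max_n, count the distinct character n-grams of length <= n
    whose total frequency reaches each threshold in min_freqs."""
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        # Add all n-grams of exactly this length to the running counter.
        for text in texts:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        # Cumulative: the counter now holds every length from 1 to n.
        row = tuple(sum(1 for c in counts.values() if c >= f) for f in min_freqs)
        rows.append((n, row))
    return rows


if __name__ == "__main__":
    sample = ["u should just enjoy ur vacation", "im online but you arent"]
    for n, row in cumulative_ngram_counts(sample, 4):
        print(n, row)
```

The number of distinct features keeps growing with n, which is why enumerating every substring directly becomes impractical.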
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction
Maximal Substring (1)
• Define a containment relation (semi-order) among non-empty substrings
Example: abracadabra
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is a substring of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly speaking, the definition also takes the position within the substring into account
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
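The abracadabra example can be checked by brute force. The sketch below is not the linear-time construction referred to in these slides; it is a deliberately naive check that uses one assumed but standard characterisation: a substring is maximal when no one-character extension to the left or right occurs as often as the substring itself. The function names are made up for the example.

```python
def count_occurrences(text, s):
    """Count occurrences of s in text, allowing overlaps."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def maximal_substrings(text):
    """Return the maximal substrings of text by brute force (tiny inputs only)."""
    substrings = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    counts = {s: count_occurrences(text, s) for s in substrings}
    alphabet = set(text)
    result = []
    for s, n in counts.items():
        extensions = [c + s for c in alphabet] + [s + c for c in alphabet]
        # An extension with the same count means every occurrence of s sits
        # inside that longer string, so s is not the maximal element.
        if all(counts.get(e, 0) < n for e in extensions):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


if __name__ == "__main__":
    print(maximal_substrings("abracadabra"))
```

For example, "bra" is rejected because "abra" occurs exactly as often as "bra" does, so the two fall in the same equivalence class and "abra" is its maximal element.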
Maximal Substring and Infinity-Gram
• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Note: the equivalence breaks down on the test set, but we assume it can be approximated with a sufficiently large training set
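The merging step can be written out explicitly. The notation below is an assumption introduced for this note and does not appear in the slides: $f_s(x)$ is the feature value of substring $s$ in text $x$, $w_{s,y}$ its weight for language $y$, and $m_C$ the maximal substring of equivalence class $C$. On the training data every $s \in C$ satisfies $f_s(x) = f_{m_C}(x)$, so the linear score over all substrings collapses to a score over maximal substrings only:

```latex
\sum_{s} w_{s,y}\, f_s(x)
  = \sum_{C} \sum_{s \in C} w_{s,y}\, f_{m_C}(x)
  = \sum_{C} \Bigl( \sum_{s \in C} w_{s,y} \Bigr) f_{m_C}(x)
  = \sum_{C} \tilde{w}_{C,y}\, f_{m_C}(x),
\qquad
\tilde{w}_{C,y} := \sum_{s \in C} w_{s,y}.
```

On unseen text the identity $f_s(x) = f_{m_C}(x)$ is no longer guaranteed, which is the caveat the slide states about the test set.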
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
(via [Okanohara+ 09])
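To make the three arrays concrete, the sketch below builds them naively for a short string. It is not esaxx and is not linear time (sorting whole suffixes is far slower); the function names and the `$` sentinel used for the first suffix are assumptions made for the example.

```python
def suffix_array(text):
    """SA: start positions of all suffixes, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """L: longest common prefix length of each pair of adjacent suffixes in SA."""
    def lcp(i, j):
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n
    return [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def bwt(text, sa, sentinel="$"):
    """B: the character preceding each suffix (sentinel for the whole string)."""
    return "".join(text[i - 1] if i > 0 else sentinel for i in sa)


if __name__ == "__main__":
    text = "abracadabra"
    sa = suffix_array(text)
    print(sa)
    print(lcp_array(text, sa))
    print(bwt(text, sa))
```

In the output for abracadabra, the suffixes "abra" and "abracadabra" are adjacent with a common prefix of length 4 and are preceded by different characters ("d" and the sentinel), which matches the condition on the slide for the repeated maximal substring "abra".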
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Languages written in the Latin alphabet
  – The most difficult alphabet type to detect
  – More than 25 of these languages have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• Latin letters are assigned to 9 code blocks in Unicode
Range          Name                         Note
U+0000-007F    Basic Latin                  ASCII
U+0080-00FF    Latin-1 Supplement           Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A             (see above)
U+0180-024F    Latin Extended-B             Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF    Latin Extended Additional    Vietnamese
U+2C60-2C7F    Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D             (see above)
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – Samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
code  language    training  test
ca    Catalan     9089      5082
cs    Czech       9082      7682
da    Danish      7388      5524
de    German      44448     10065
en    English     44520     10168
es    Spanish     44118     10265
fi    Finnish     8087      7050
fr    French      44339     10098
hu    Hungarian   10030     4904
id    Indonesian  44722     10181
it    Italian     43366     10152
nl    Dutch       44682     10007
no    Norwegian   10124     8496
pl    Polish      16771     10152
pt    Portuguese  44215     10208
ro    Romanian    10021     5911
sv    Swedish     44054     10032
tr    Turkish     44703     10308
vi    Vietnamese  15030     10488
total             538789    166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still reaches at most 98% accuracy
• We guess that it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest language
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets by language. Legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
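The levelling step can be sketched as follows, using the approximation described above (whole copies plus a remainder sampled without replacement). The function name, the dictionary layout `{language: [tweets]}` and the fixed seed are assumptions for the example, and every language is assumed to have at least one tweet.

```python
import random


def level_by_oversampling(corpus, seed=0):
    """Grow every language's tweet list to the size of the largest one.

    Each list is repeated a whole number of times and the remainder is
    drawn without replacement, approximating sampling with replacement.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


if __name__ == "__main__":
    corpus = {"en": ["tweet %d" % i for i in range(10)], "da": ["a", "b", "c"]}
    leveled = level_by_oversampling(corpus)
    print({lang: len(tweets) for lang, tweets in leveled.items()})
```

Compared with plain sampling with replacement, this guarantees that every original tweet of a small language appears at least once in the levelled data.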
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'
Language              Upper case    Lower case
Turkish, Azerbaijani  I (U+0049)    ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)    i (U+0069)
Others                I (U+0049)    i (U+0069)
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
  – Indistinguishable
with comma below   with cedilla
ș (U+0219)         ş (U+015F)
ț (U+021B)         ţ (U+0163)
Romanian Characters on PCs
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertising board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the official ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  – ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ)
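The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. The sketch below uses the code points listed on the slides; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are made up for the example.

```python
# Map s/t with comma below (U+0218-U+021B) to s/t with cedilla
# (U+015E, U+015F, U+0162, U+0163), the more common substitute forms.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with the cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)


if __name__ == "__main__":
    print(normalize_romanian("\u0219i \u021bar\u0103"))
```

Normalizing in this direction follows the observed 2 : 1 usage ratio, so the majority of real text needs no change at all.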
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Diacritical Marks
    • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
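One way to sketch "normalize 2 into 1" is Unicode canonical composition (NFC) from Python's standard library, which turns a base vowel followed by a combining tone mark into the precomposed letter. Using NFC here is an assumption of this example; the slides do not say which mechanism ldig uses.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letter + combining diacritical marks into single code points."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "\u0103\u0303"          # ă followed by a combining tilde
    composed = compose_vietnamese(decomposed)
    print(len(decomposed), len(composed), hex(ord(composed)))
```

After this step both representations of a toned vowel become the same character sequence, so they produce the same substring features.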
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Uses tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
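The removals listed above can be sketched with regular expressions. The patterns below are rough assumptions written for this example, not the ones used by ldig; in particular the emoticon pattern only covers a few letter-based forms such as XD, :p or =P.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Rough pattern for letter-based emoticons (XD, xDDD, :p, =P, ...).
EMOTICON = re.compile(r"(?<!\w)[xX:;=]-?[dDpP]+(?!\w)")


def normalize_tweet(text):
    """Strip URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_tweet("RT @user check http://t.co/abc #nlp so cool XD"))
```

URLs are removed first so that any "@" or "#" inside them is not picked up by the mention and hashtag patterns.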
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated 3 or more times, shrink the run to 2
  – No language's orthography has 3 or more consecutive repetitions of the same Latin letter
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
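Case 2 fits in a single regular expression. In this sketch the threshold is read as "3 or more repeats become 2", which is an interpretation of the slide; the function name is made up for the example.

```python
import re

# A character followed by two or more copies of itself (a run of 3+).
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)


if __name__ == "__main__":
    print(shrink_repeats("coooooooollllll"))
    print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, this does not recover the correct spelling ("cooll" rather than "cool"), but it maps all lengthened variants of a word to one form without needing any vocabulary.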
Laugh Normalization
• Each language has various ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to a double
  – hahahha ⇒ haha
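A simple version of this shrinking can be sketched for regular laughs, where one two-letter syllable is repeated. The syllable pattern and function name are assumptions for this example, the input is assumed to be lowercased already, and irregular laughs such as "hahahha" from the slide are not handled by this simple pattern.

```python
import re

# One of h/j/k plus a vowel, repeated at least 3 times in a row.
LAUGH = re.compile(r"([hjk][aeiou])\1{2,}")


def shrink_laugh(text):
    """Shrink a laugh made of 3+ repeated syllables to exactly 2 syllables."""
    return LAUGH.sub(r"\1\1", text)


if __name__ == "__main__":
    for sample in ("malo jejejeje", "jajajajajajaja", "kekeke", "haha"):
        print(sample, "->", shrink_laugh(sample))
```

Keeping two syllables preserves the language-specific cue (ja/je in Spanish, ke in Vietnamese, hi/ha elsewhere) while removing the arbitrary length.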
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output is "[features]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
Examples:
  en u should just enjoy ur vacation sadly
  en D im online but you arent RT that much
  en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
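A line in this tab-separated format can be read with a few lines of Python. This sketch assumes that the first field is the label, the last field is the text, and everything in between is metadata, which is consistent with the Catalan example above but is not spelled out on the slide; the function name is made up.

```python
def parse_corpus_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    text = fields[-1]
    metadata = fields[1:-1]   # may be empty strings or several fields
    return label, metadata, text


if __name__ == "__main__":
    # Hypothetical sample lines; the metadata values are placeholders.
    print(parse_corpus_line("en\t\tu should just enjoy ur vacation\n"))
    print(parse_corpus_line("ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\txDDD you know"))
```

Taking the text from the last field keeps the parser working whether a line carries one metadata field or several.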
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
code  language    size    detect  correct  precision  recall  LD53  LDsm
ca    Catalan     5093    4923    4857     98.66      95.37   95.3  97.0
cs    Czech       7681    7668    7663     99.93      99.77   96.3  99.7
da    Danish      5516    5472    5310     97.04      96.27   94.5  92.4
de    German      10060   10069   10006    99.37      99.46   86.6  93.8
en    English     10162   10133   10029    98.97      98.69   88.3  95.0
es    Spanish     10244   10284   10120    98.41      98.79   91.5  96.0
fi    Finnish     7051    7038    7024     99.80      99.62   98.9  99.6
fr    French      10074   10134   10051    99.18      99.77   95.0  98.1
hu    Hungarian   4904    4892    4858     99.30      99.06   85.8  95.5
id    Indonesian  10178   10225   10160    99.36      99.82   89.7  98.9
it    Italian     10143   10205   10103    99.00      99.61   96.2  98.0
nl    Dutch       10005   9916    9858     99.42      98.53   69.5  97.4
no    Norwegian   8504    8432    8201     97.26      96.44   96.0  96.3
pl    Polish      10151   10149   10130    99.81      99.79   98.0  99.7
pt    Portuguese  10212   10201   10119    99.20      99.09   88.0  96.9
ro    Romanian    5913    5867    5850     99.71      98.93   92.8  97.4
sv    Swedish     10025   10093   9942     98.50      99.17   96.0  97.9
tr    Turkish     10308   10317   10298    99.82      99.90   97.6  99.5
vi    Vietnamese  10487   10480   10474    99.94      99.88   98.7  99.2
total: size 166711, correct 165053 (99.01%); LD53 92.2, LDsm 97.4
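The precision and recall columns follow from the size, detect and correct counts, and the 0.6 cut-off explains why detect and size differ. The sketch below recomputes the Catalan row; the function names are made up, and treating a probability of exactly 0.6 as detectable is an assumption.

```python
def detect(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang = max(probabilities, key=probabilities.get)
    return lang if probabilities[lang] >= threshold else None


def precision_recall(size, detected, correct):
    """precision = correct / detected, recall = correct / size, in percent."""
    return 100.0 * correct / detected, 100.0 * correct / size


if __name__ == "__main__":
    precision, recall = precision_recall(size=5093, detected=4923, correct=4857)
    print(round(precision, 2), round(recall, 2))
    print(detect({"da": 0.55, "no": 0.45}))
```

A tweet scored 0.55 Danish and 0.45 Norwegian is left undetected rather than guessed, which keeps precision up at the expense of recall.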
Estimation for LIGA dataset
• Evaluated on the LIGA [Tromp+ 11] dataset of 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model
code  language  size  detect  correct  precision  recall
de    German    1479  1476    1469     99.5       99.3
en    English   1505  1502    1490     99.2       99.0
es    Spanish   1562  1548    1541     99.6       98.7
fr    French    1551  1549    1540     99.4       99.3
it    Italian   1539  1531    1528     99.8       99.3
nl    Dutch     1430  1429    1424     99.7       99.6
total: size 9066, correct 8992 (99.2%)
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported)
                         ldig            langdetect      CLD
code  language    size   correct  rate   correct  rate   correct  rate
bg    Bulgarian   1000   -        -      988      98.8   991      99.1
cs    Czech       1000   1000     100.0  994      99.4   995      99.5
da    Danish      1000   976      97.6   968      96.8   932      93.2
de    German      1000   999      99.9   998      99.8   1000     100.0
el    Greek       1000   -        -      1000     100.0  1000     100.0
en    English     1000   999      99.9   996      99.6   1000     100.0
es    Spanish     1000   1000     100.0  996      99.6   989      98.9
et    Estonian    1000   -        -      996      99.6   998      99.8
fi    Finnish     1000   997      99.7   998      99.8   1000     100.0
fr    French      1000   999      99.9   999      99.9   992      99.2
hu    Hungarian   1000   1000     100.0  999      99.9   999      99.9
it    Italian     1000   999      99.9   999      99.9   996      99.6
lt    Lithuanian  1000   -        -      997      99.7   999      99.9
lv    Latvian     1000   -        -      999      99.9   998      99.8
nl    Dutch       1000   1000     100.0  974      97.4   995      99.5
pl    Polish      1000   998      99.8   999      99.9   997      99.7
pt    Portuguese  1000   995      99.5   996      99.6   989      98.9
ro    Romanian    1000   1000     100.0  999      99.9   998      99.8
sk    Slovak      1000   -        -      988      98.8   990      99.0
sl    Slovene     1000   -        -      976      97.6   963      96.3
sv    Swedish     1000   995      99.5   991      99.1   993      99.3
total             21000  13957    99.7   20850    99.3   20814    99.1
Conclusions
• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
language-detection(langdetect) (Nakatani 2010)
bull Language detection library for Java
ndash httpcodegooglecomplanguage-detection
ndash Apache License 20
ndash Character 3-gram + Bayesian filter
ndash Various normalizations + Feature sampling
bull 99 over precision for 53 languages
ndash Training with Wikipedia abstract
ndash Widely support including Asian languages
ndash Adopted by Apache Solr
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 14
Estimation with News Text
bull Test for crawled news text from web in 49 languages Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 15
Language size accuracyaf Afrikaans 200 199 (9950)ar Arabic 200 200 (10000)bg Bulgarian 200 200 (10000)bn Bengali 200 200 (10000)cs Czech 200 200 (10000)da Dannish 200 179 (8950)de German 200 200 (10000)el Greek 200 200 (10000)en English 200 200 (10000)es Spanish 200 200 (10000)fa Persian 200 200 (10000)fi Finnish 200 200 (10000)fr French 200 200 (10000)gu Gujarati 200 200 (10000)he Hebrew 200 200 (10000)hi Hindi 200 200 (10000)hr Croatian 200 200 (10000)hu Hungarian 200 200 (10000)id Indonesian 200 200 (10000)it Italian 200 200 (10000)ja Japanese 200 200 (10000)kn Kannada 200 200 (10000)ko Korean 200 200 (10000)mk Macedonian 200 200 (10000)ml Malayalam 200 200 (10000)
Language size accuracymr Marathi 200 200 (10000)ne Nepali 200 200 (10000)nl Dutch 200 200 (10000)no Norwegian 200 199 (9950)pa Punjabi 200 200 (10000)pl Polish 200 200 (10000)pt Portuguese 200 200 (10000)ro Romanian 200 200 (10000)ru Russian 200 200 (10000)sk Slovak 200 200 (10000)so Somali 200 200 (10000)sq Albanian 200 200 (10000)sv Swedish 200 200 (10000)sw Swahili 200 200 (10000)ta Tamil 200 200 (10000)te Telugu 200 200 (10000)th Thai 200 200 (10000)tl Tagalog 200 200 (10000)tr Turkish 200 200 (10000)uk Ukrainian 200 200 (10000)ur Urdu 200 200 (10000)vi Vietnamese 200 200 (10000)
zh-cn Simplified Chinese 200 200 (10000)zh-tw Traditional Chinese 200 200 (10000)
total 9800 9777 (9977)
Estimation with Europarl datasets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 16
bull Test for 1000 samples for each
language from Europarl Parallel Corpus
ndash from the proceedings of the European Parliament
ndash httpwwwstatmtorgeuroparl
bull httpcodegooglecomplanguage-
detectiondownloadsdetailname=eur
oparl-testzip
language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991
total 21000 20850 993
Language Detection has been over isnt it
17
We still have ENEMY to beat
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 18
Twitter Language Detection with the Existing Methods
bull Only 90-95 accuracy
for tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language
Detection
ndash httpcodegooglecompchromium-
compact-language-detector
ndash regard ms(Malay) as id(Indonesian)
bull Tika = Apache Tika
ndash httptikaapacheorg
ndash Estimate on 15 languages which Tika
supports in our tweet corpus
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 19
language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----
total 922 938 700
Chromium Compact Language Detection (CLD)
bull Porting the language detector from
Google Chromium ndash httpcodegooglecompchromium-compact-language-detector
ndash Implementation in C++ Python binding
ndash of supported languages CLD = 76
langdetect = 53
ndash Accuracy CLD = 9882 langdetect =
9922
bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 20
Is twitter Language Detection difficult (1)
bull Tweet is too short to extract 3-gram features
ndash At most 140 characters on twitter
ndash URLs mentions and hashtags are not useful to
detect
bull LIGA [Tromp+ 11]
ndash Graph-features based on 3-gram
bull Add long distance features
bull 95~98 accuracy for twitter Language Detection
bull 6 languages (de en es fr it nl)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 21
Is twitter Language Detection difficult (2)
bull Tweet is too noisy
ndash Representations against the languages orthography often appear
ndash Acronym Abbreviation lengthened word (like Cooooolll)
bull Likelihood of tweet tends to get smaller on normal language model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
OMG Oh My God
LOL Laughing Out Loud
LMAO Laughing My Ass Out
F4F Follow for Follow
MDR Mort de Rire (French)
TKT Ne tlsquoInquiegravete Pas (Fr)
u you
ur your
4 for
i0u I love you
k che (Italian)
anke anche(Italian)
Letter k isnt used in Italian
22
Motivation to Detect Short Text Language
bull There are many small chunks of text in addition to twitter
ndash Schedule search query bulletin board and so on
ndash There are many questions about short text detection in the Issues Board of langdetect Project
bull httpcodegooglecomplanguage-detectionissuesdetailid=10
bull Detection for multi-language mixed text
ndash Cut the target document in paragraphs or lines
ndash Detect for each short text
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 23
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called the maximal substring
• The maximal substrings of abracadabra are “a”, “abra” and “abracadabra”
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
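The abracadabra example can be checked by brute force. The sketch below assumes the usual characterization of a maximal substring: a substring is maximal when every one-character extension to the left or to the right occurs strictly less often. It is quadratic or worse and only meant for tiny inputs; the function names are illustrative.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text):
    """Return the set of maximal substrings of text (brute force)."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            if sub in result:
                continue
            freq = count_occurrences(text, sub)
            # Every left and right extension must lose at least one occurrence.
            extensions = [c + sub for c in alphabet] + [sub + c for c in alphabet]
            if all(count_occurrences(text, ext) < freq for ext in extensions):
                result.add(sub)
    return result


if __name__ == "__main__":
    print(sorted(maximal_substrings("abracadabra")))
```

For example, “bra” is not maximal because “abra” occurs just as often (twice), so both fall into the class represented by “abra”.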
Maximal Substring and Infinity-Gram
• Substrings in a containment relationship always have equal frequencies
• In a model with a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
Although the equivalence breaks down on the test set, we assume that it holds approximately given a sufficiently large training set
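The merging step can be written out in one line. The notation here is an assumption made for illustration (it does not appear on the slides): $C$ is an equivalence class with maximal substring $m$, $\phi_s(x)$ is the frequency of substring $s$ in a training text $x$, and $w_s$ is its weight.

```latex
% Within a class C, all substrings have the same frequency as the
% maximal substring m on the training data: \phi_s(x) = \phi_m(x).
\sum_{s \in C} w_s \, \phi_s(x)
  = \Bigl( \sum_{s \in C} w_s \Bigr) \phi_m(x)
  = \tilde{w}_m \, \phi_m(x)
```

So one weight $\tilde{w}_m$ per maximal substring gives the same linear score as separate weights for every substring in the class, as long as the frequencies agree.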
Extended Suffix Array
• An extended suffix array consists of
– SA = suffix array
– L = longest common prefixes
– B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
– These can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/
via [Okanohara+ 09]
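To make the three arrays concrete, here is a naive sketch that builds SA, L and B for a small string. It sorts suffixes directly, so it is far from the linear-time construction that esaxx provides; it only illustrates what the arrays contain. The function name and the wrap-around convention for the first BWT character are assumptions of this sketch.

```python
def extended_suffix_array(text):
    """Build (SA, LCP, BWT) naively for a short string."""
    n = len(text)
    # SA: start positions of all suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp_len(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    # LCP[i]: common prefix length of suffix SA[i] and suffix SA[i-1].
    lcp = [0 if i == 0 else lcp_len(sa[i - 1], sa[i]) for i in range(n)]
    # BWT[i]: the character preceding suffix SA[i] (wraps around for 0).
    bwt = "".join(text[i - 1] for i in sa)
    return sa, lcp, bwt


if __name__ == "__main__":
    sa, lcp, bwt = extended_suffix_array("abracadabra")
    for pos, l, b in zip(sa, lcp, bwt):
        print(pos, l, b, "abracadabra"[pos:])
```

In the output, the suffixes starting with “abra” are adjacent with L=4 and different preceding characters, which matches the slide's condition for a repeated maximal substring.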
Corpus and Normalization
Target Languages
• Limit the character types to detect
– In short text detection, mixed text can be divided by character type
• Latin alphabet languages
– The most difficult alphabet type to detect
– There are more than 25 languages with over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
– å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range  Name  Supplement
U+0000-007F  Basic Latin  ASCII
U+0080-00FF  Latin-1 Supplement  Most languages are covered with these
U+0100-017F  Latin Extended-A  Most languages are covered with these
U+0180-024F  Latin Extended-B  Romanian
U+0250-02AF  IPA Extensions
U+0300-036F  Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF  Latin Extended Additional  Vietnamese
U+2C60-2C7F  Latin Extended-C  Used by almost no present-day languages
U+A720-A7FF  Latin Extended-D  Used by almost no present-day languages
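The block table translates directly into a lookup. This is a minimal sketch with an illustrative function name; the ranges are copied from the table above.

```python
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]


def latin_block(ch):
    """Return the name of the Latin-related block containing ch, or None."""
    cp = ord(ch)
    for low, high, name in LATIN_BLOCKS:
        if low <= cp <= high:
            return name
    return None


if __name__ == "__main__":
    for ch in "a\u00e5\u0105\u0219\u1eb5\u3042":
        print("U+%04X" % ord(ch), latin_block(ch))
```

Note that Basic Latin and Latin-1 Supplement also contain digits and punctuation, so a real filter would need to treat those separately.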
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin letters. Legend: used only for Vietnamese / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
– French tweets account for only about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone
• Annotate tentative labels on the tweets using langdetect
– Remove non-French tweets from those labeled ‘fr’
– Recover French tweets from those not labeled ‘fr’
(20% of all tweets have no timezone)
How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Catalan corpus created together with Raúl Velaz and Hiroshi Manabe
language  training  test
ca  Catalan  9089  5082
cs  Czech  9082  7682
da  Danish  7388  5524
de  German  44448  10065
en  English  44520  10168
es  Spanish  44118  10265
fi  Finnish  8087  7050
fr  French  44339  10098
hu  Hungarian  10030  4904
id  Indonesian  44722  10181
it  Italian  43366  10152
nl  Dutch  44682  10007
no  Norwegian  10124  8496
pl  Polish  16771  10152
pt  Portuguese  44215  10208
ro  Romanian  10021  5911
sv  Swedish  44054  10032
tr  Turkish  44703  10308
vi  Vietnamese  15030  10488
total  538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
– But on its own it reaches at most 98% accuracy
• We guess it is necessary to reduce bias
– data size bias
– language-specific bias
– twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily biased
• Level them out by sampling with replacement from each language, up to the size of the largest one
– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
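The approximation described above (whole copies plus a sampled remainder) might look like the following sketch. The function name and the explicit `rng` parameter are illustrative choices, not taken from ldig.

```python
import random
from collections import Counter


def level_by_copy_and_sample(items, target_size, rng):
    """Grow items to target_size: copy the whole list as many times as fits,
    then draw the remainder without replacement."""
    full_copies, remainder = divmod(target_size, len(items))
    return items * full_copies + rng.sample(items, remainder)


if __name__ == "__main__":
    rng = random.Random(0)
    small_language = ["tweet a", "tweet b", "tweet c"]
    leveled = level_by_copy_and_sample(small_language, 8, rng)
    print(Counter(leveled))
```

Compared with plain sampling with replacement, every tweet is guaranteed to appear either `full_copies` or `full_copies + 1` times, so no training example is dropped by chance.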
Convert to Lowercase on Multiple Languages
• Conversion into lower case reduces the required corpus size and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’
Language  Upper case  Lower case
Turkish, Azerbaijani  I (U+0049)  ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)  i (U+0069)
Others  I (U+0049)  i (U+0069)
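A minimal sketch of "lower case, excluding ‘I’" follows. It only leaves the ASCII capital I (U+0049) untouched, as the slide states; how ldig itself treats İ (U+0130) is not shown on the slide, so this sketch simply passes it to Python's default `str.lower()`. The function name is illustrative.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital 'I' (U+0049),
    whose lower case depends on the language (i vs. dotless i)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


if __name__ == "__main__":
    print(lower_except_capital_i("I LOVE IT"))
    print(lower_except_capital_i("ISTANBUL"))
```

Keeping ‘I’ as a distinct symbol means the model never has to guess whether it should have become ‘i’ or ‘ı’ before the language is known.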
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s and t with a “beard”
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, on twitter and on Wikipedia
• The 2 code points have the same design in some fonts
– Indistinguishable
ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)
Romanian Character Affairs on PCs
• Although Romanian orthography prescribes that ‘s/t with comma’ must be used, they were not available on PCs until recently
– 1989: Democratization in Romania
– 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Romania joined the EU
– 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)
[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
– But they are more popular than the proper ones
– with cedilla : with comma = 2 : 1
– The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants of the Japanese character ‘SA’: さ さ
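Regarding ‘s/t with comma’ as ‘s/t with cedilla’ is a four-entry character mapping. The sketch below uses `str.translate`; the table and function names are illustrative.

```python
# Map comma-below forms to the cedilla forms that are more common in practice.
COMMA_TO_CEDILLA = {
    0x0218: 0x015E,  # Ș -> Ş
    0x0219: 0x015F,  # ș -> ş
    0x021A: 0x0162,  # Ț -> Ţ
    0x021B: 0x0163,  # ț -> ţ
}


def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)


if __name__ == "__main__":
    word = "\u021bar\u0103"  # "țară" spelled with comma-below t
    print([hex(ord(c)) for c in normalize_romanian(word)])
```

The direction is the opposite of what the orthography prescribes, but it follows the slide: the substitute forms dominate real text, so mapping toward them changes fewer characters.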
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
– Wikipedia uses only ی (U+06CC, Farsi yeh)
– News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
– 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
1. Use U+1EA0 - U+1EF9
• ẵ = U+1EB5
2. Combine with diacritical marks
• ẵ = U+0103 U+0303
– News and tweets use the two about half and half
• Normalize 2 into 1
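Turning representation 2 into representation 1 is what Unicode canonical composition (NFC) does. The sketch below uses the standard library for this; whether ldig uses NFC or its own mapping table is not stated on the slide, so treat this as a stand-in. The function name is illustrative.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowels and combining tone marks into single code points."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "\u0103\u0303"  # ă followed by a combining tilde
    composed = compose_vietnamese(decomposed)
    print(len(decomposed), len(composed), "U+%04X" % ord(composed))
```

NFC also composes sequences in other languages, which is harmless here but slightly broader than a Vietnamese-only table.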
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
• e.g. 谢谢, Kanji for person names
– Common frequent characters are too strong
• e.g. a text containing “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K=50 (size of ASCII alphabet = 52)
– (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
• Simplified Chinese: 现代汉语常用字表 (3500)
• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
• Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
– 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons made of alphabet letters, like XD or :p
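The removals listed above can be sketched with a few regular expressions. The patterns are simplified assumptions (for example, a mention is taken to be `@` followed by word characters), not the expressions used in ldig, and emoticon removal is left out because it needs a longer pattern list.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and the RT marker, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_twitter_markup("RT @bob check http://t.co/abc #cool stuff"))
```

URLs are removed first so that a `#fragment` or `@` inside a link is not treated as a hashtag or mention.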
Normalization for twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
– Unsupervised normalization like coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
– There is no language whose orthography has the same Latin letter 3 or more times in a row
• Japanese, by contrast, has “かたたたき”, “かわいいいぬ”, “あわてててて” and so on
• Acronyms (like WWW, СССР) are not useful for language detection
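Case 2 fits in a single regular expression. This is a minimal sketch assuming "3 or more repetitions become 2"; the function name is illustrative.

```python
import re

# One character followed by two or more copies of itself.
REPEATED_CHAR = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEATED_CHAR.sub(r"\1\1", text)


if __name__ == "__main__":
    for word in ["coooooooollllll", "cool", "WWW"]:
        print(word, "->", shrink_repeats(word))
```

The result for ‘coooooooollllll’ is ‘cooll’ rather than ‘cool’; the rule does not restore the dictionary spelling, it only maps all lengthened variants to the same string.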
Laugh Normalization
• Each language has its own variety of laughs
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi :) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
– hahahha ⇒ haha
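For regular laughs, a repeated two-letter syllable can be shrunk with a backreference. The sketch below is an assumption about how this could be done, not ldig's rule: it expects lowercased input and does not handle irregular spellings such as ‘hahahha’.

```python
import re

# A two-character unit followed by two or more copies of itself.
REPEATED_SYLLABLE = re.compile(r"(\w\w)\1{2,}")


def shrink_laughs(text):
    """Shrink 3 or more repetitions of a two-character unit to 2."""
    return REPEATED_SYLLABLE.sub(r"\1\1", text)


if __name__ == "__main__":
    for laugh in ["jajajajajajaja", "kekeke", "hihihihi", "haha"]:
        print(laugh, "->", shrink_laughs(laugh))
```

Because the unit is any two word characters, this also shortens non-laughs such as repeated digit pairs, which is acceptable only if those carry no language signal.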
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
– ∞-gram LR (maximal substring) [Okanohara+ 09]
– L1 SGD (cumulative penalty) [Tsuruoka+ 09]
– Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
– Extract features from the corpus and initialize the model
– -m: model directory
– -x: path of the maximal substring extractor (executed as an external process)
– --ff: ignore features whose frequency is lower than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
– Input is multi-line text
• TABs in it are replaced with “ ” and line feeds with U+0001
– Output lines have the form “[feature]\t[frequency]”
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
– Train the model on the corpus with 1 cycle of SGD
– -e: learning rate of SGD
– -r: coefficient of L1 regularization
– --wr: how many times to regularize all of the parameters
• There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
– Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
– Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
– [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
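A line in this format can be split on tabs. The sketch below assumes that the first field is the label, the last field is the text and everything in between is metadata (the ‘ca’ example suggests several metadata fields); this reading and the function name are assumptions, not ldig's documented behavior.

```python
def parse_corpus_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]


if __name__ == "__main__":
    sample = "ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\txDDD no mextranya\n"
    label, metadata, text = parse_corpus_line(sample)
    print(label, metadata, text)
```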
Usage (5) Evaluation Tool
• server.py -m [model] -p [port number]
– Open http://localhost:[port]/ after it is started
– For a text entered in the text area, it outputs the language probabilities, the features it contains and their parameters
Evaluation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of ‘detect’ is less than the sum of ‘size’
language  size  detect  correct  precision  recall  LD53  LDsm
ca  Catalan  5093  4923  4857  98.66  95.37  95.3  97.0
cs  Czech  7681  7668  7663  99.93  99.77  96.3  99.7
da  Danish  5516  5472  5310  97.04  96.27  94.5  92.4
de  German  10060  10069  10006  99.37  99.46  86.6  93.8
en  English  10162  10133  10029  98.97  98.69  88.3  95.0
es  Spanish  10244  10284  10120  98.41  98.79  91.5  96.0
fi  Finnish  7051  7038  7024  99.80  99.62  98.9  99.6
fr  French  10074  10134  10051  99.18  99.77  95.0  98.1
hu  Hungarian  4904  4892  4858  99.30  99.06  85.8  95.5
id  Indonesian  10178  10225  10160  99.36  99.82  89.7  98.9
it  Italian  10143  10205  10103  99.00  99.61  96.2  98.0
nl  Dutch  10005  9916  9858  99.42  98.53  69.5  97.4
no  Norwegian  8504  8432  8201  97.26  96.44  96.0  96.3
pl  Polish  10151  10149  10130  99.81  99.79  98.0  99.7
pt  Portuguese  10212  10201  10119  99.20  99.09  88.0  96.9
ro  Romanian  5913  5867  5850  99.71  98.93  92.8  97.4
sv  Swedish  10025  10093  9942  98.50  99.17  96.0  97.9
tr  Turkish  10308  10317  10298  99.82  99.90  97.6  99.5
vi  Vietnamese  10487  10480  10474  99.94  99.88  98.7  99.2
total  166711  -  165053  -  99.01  92.2  97.4
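The precision and recall columns follow from the three counts, and the 0.6 rule explains why ‘detect’ can be smaller than ‘size’. The sketch below reproduces the Catalan row under the assumption that precision = correct / detect and recall = correct / size (both consistent with the table); the function names are illustrative.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


def detect_or_none(probabilities, threshold=0.6):
    """Return the most probable label, or None if its probability is
    below the threshold (the text is treated as undetectable)."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None


if __name__ == "__main__":
    precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
    print(round(precision, 2), round(recall, 2))
    print(detect_or_none({"da": 0.5, "no": 0.5}))
```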
Evaluation on the LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
Uses the 19-language model
language  size  detect  correct  precision  recall
de  German  1479  1476  1469  99.5  99.3
en  English  1505  1502  1490  99.2  99.0
es  Spanish  1562  1548  1541  99.6  98.7
fr  French  1551  1549  1540  99.4  99.3
it  Italian  1539  1531  1528  99.8  99.3
nl  Dutch  1430  1429  1424  99.7  99.6
total  9066  -  8992  -  99.2
Evaluation on the Europarl Dataset
(ldig is evaluated only on the languages it supports; “-” marks unsupported languages)
language  size  ldig correct  ldig rate  langdetect correct  langdetect rate  CLD correct  CLD rate
bg  Bulgarian  1000  -  -  988  98.8  991  99.1
cs  Czech  1000  1000  100.0  994  99.4  995  99.5
da  Danish  1000  976  97.6  968  96.8  932  93.2
de  German  1000  999  99.9  998  99.8  1000  100.0
el  Greek  1000  -  -  1000  100.0  1000  100.0
en  English  1000  999  99.9  996  99.6  1000  100.0
es  Spanish  1000  1000  100.0  996  99.6  989  98.9
et  Estonian  1000  -  -  996  99.6  998  99.8
fi  Finnish  1000  997  99.7  998  99.8  1000  100.0
fr  French  1000  999  99.9  999  99.9  992  99.2
hu  Hungarian  1000  1000  100.0  999  99.9  999  99.9
it  Italian  1000  999  99.9  999  99.9  996  99.6
lt  Lithuanian  1000  -  -  997  99.7  999  99.9
lv  Latvian  1000  -  -  999  99.9  998  99.8
nl  Dutch  1000  1000  100.0  974  97.4  995  99.5
pl  Polish  1000  998  99.8  999  99.9  997  99.7
pt  Portuguese  1000  995  99.5  996  99.6  989  98.9
ro  Romanian  1000  1000  100.0  999  99.9  998  99.8
sk  Slovak  1000  -  -  988  98.8  990  99.0
sl  Slovene  1000  -  -  976  97.6  963  96.3
sv  Swedish  1000  995  99.5  991  99.1  993  99.3
total  21000  13957  99.7  20850  99.3  20814  99.1
Conclusions
• A language detector using the maximal substring model
– Detects with over 99% accuracy for 19 languages
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, the precision will rise further
– There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
– How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
– Too large for applications (>100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Evaluation with News Text
• Test on news text crawled from the web, in 49 languages
language  size  correct  (accuracy %)
af  Afrikaans  200  199  (99.50)
ar  Arabic  200  200  (100.00)
bg  Bulgarian  200  200  (100.00)
bn  Bengali  200  200  (100.00)
cs  Czech  200  200  (100.00)
da  Danish  200  179  (89.50)
de  German  200  200  (100.00)
el  Greek  200  200  (100.00)
en  English  200  200  (100.00)
es  Spanish  200  200  (100.00)
fa  Persian  200  200  (100.00)
fi  Finnish  200  200  (100.00)
fr  French  200  200  (100.00)
gu  Gujarati  200  200  (100.00)
he  Hebrew  200  200  (100.00)
hi  Hindi  200  200  (100.00)
hr  Croatian  200  200  (100.00)
hu  Hungarian  200  200  (100.00)
id  Indonesian  200  200  (100.00)
it  Italian  200  200  (100.00)
ja  Japanese  200  200  (100.00)
kn  Kannada  200  200  (100.00)
ko  Korean  200  200  (100.00)
mk  Macedonian  200  200  (100.00)
ml  Malayalam  200  200  (100.00)
mr  Marathi  200  200  (100.00)
ne  Nepali  200  200  (100.00)
nl  Dutch  200  200  (100.00)
no  Norwegian  200  199  (99.50)
pa  Punjabi  200  200  (100.00)
pl  Polish  200  200  (100.00)
pt  Portuguese  200  200  (100.00)
ro  Romanian  200  200  (100.00)
ru  Russian  200  200  (100.00)
sk  Slovak  200  200  (100.00)
so  Somali  200  200  (100.00)
sq  Albanian  200  200  (100.00)
sv  Swedish  200  200  (100.00)
sw  Swahili  200  200  (100.00)
ta  Tamil  200  200  (100.00)
te  Telugu  200  200  (100.00)
th  Thai  200  200  (100.00)
tl  Tagalog  200  200  (100.00)
tr  Turkish  200  200  (100.00)
uk  Ukrainian  200  200  (100.00)
ur  Urdu  200  200  (100.00)
vi  Vietnamese  200  200  (100.00)
zh-cn  Simplified Chinese  200  200  (100.00)
zh-tw  Traditional Chinese  200  200  (100.00)
total  9800  9777  (99.77)
Evaluation with Europarl datasets
• Test with 1000 samples for each language from the Europarl Parallel Corpus
– from the proceedings of the European Parliament
– http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language  size  correct  accuracy
bg  Bulgarian  1000  988  98.8
cs  Czech  1000  994  99.4
da  Danish  1000  968  96.8
de  German  1000  998  99.8
el  Greek  1000  1000  100.0
en  English  1000  996  99.6
es  Spanish  1000  996  99.6
et  Estonian  1000  996  99.6
fi  Finnish  1000  998  99.8
fr  French  1000  999  99.9
hu  Hungarian  1000  999  99.9
it  Italian  1000  999  99.9
lt  Lithuanian  1000  997  99.7
lv  Latvian  1000  999  99.9
nl  Dutch  1000  974  97.4
pl  Polish  1000  999  99.9
pt  Portuguese  1000  996  99.6
ro  Romanian  1000  999  99.9
sk  Slovak  1000  988  98.8
sl  Slovene  1000  976  97.6
sv  Swedish  1000  991  99.1
total  21000  20850  99.3
Language detection is already solved, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
– http://code.google.com/p/chromium-compact-language-detector/
– regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
– http://tika.apache.org/
– Evaluated on the 15 languages in our tweet corpus that Tika supports
language  LD  CLD  Tika
ca  Catalan  95.3  93.0  83.8
cs  Czech  96.3  96.6  ----
da  Danish  94.5  90.7  58.7
de  German  86.6  96.8  73.1
en  English  88.3  97.4  54.7
es  Spanish  91.5  90.5  44.4
fi  Finnish  98.9  99.4  94.8
fr  French  95.0  94.5  67.4
hu  Hungarian  85.8  89.0  76.2
id  Indonesian  89.7  92.8  ----
it  Italian  96.2  93.8  87.1
nl  Dutch  69.5  93.2  65.0
no  Norwegian  96.0  74.9  68.6
pl  Polish  98.0  97.8  88.8
pt  Portuguese  88.0  88.6  47.4
ro  Romanian  92.8  96.1  82.6
sv  Swedish  96.0  96.4  75.6
tr  Turkish  97.6  97.4  ----
vi  Vietnamese  98.7  98.9  ----
total  92.2  93.8  70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
– http://code.google.com/p/chromium-compact-language-detector/
– Implemented in C++, with a Python binding
– Number of supported languages: CLD = 76, langdetect = 53
– Accuracy: CLD = 98.82%, langdetect = 99.22%
• for 17 languages on Europarl datasets
• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
• Tweets are too short to extract enough 3-gram features
– At most 140 characters on twitter
– URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
– Graph features based on 3-grams
• Adds long-distance features
• 95~98% accuracy for twitter language detection
• 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
• Tweets are too noisy
– Expressions that violate the language's orthography often appear
– Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be small under a normal language model
OMG = Oh My God
LOL = Laughing Out Loud
LMAO = Laughing My Ass Off
F4F = Follow for Follow
MDR = Mort de Rire (French)
TKT = Ne t'Inquiète Pas (French)
u = you
ur = your
4 = for
i0u = I love you
k = che (Italian)
anke = anche (Italian)
(The letter k isn't used in Italian)
Motivation to Detect Short Text Language
• There are many small chunks of text besides tweets
– Schedules, search queries, bulletin boards and so on
– There are many questions about short text detection on the issue board of the langdetect project
• http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed-language text
– Cut the target document into paragraphs or lines
– Detect the language of each short text
Our Goal
• Over 99% accuracy
– However, it is too difficult to detect the language of a one-word sentence
– Our goal is 99%+ accurate detection for sentences with more than 3 words
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character "yeh" in Farsi corresponds to two code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – Because "yeh" is very frequent in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
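A minimal sketch of the "regard U+06CC as U+064A" rule; the function name `normalize_yeh` is a hypothetical helper, not part of language-detection.

```python
FARSI_YEH = "\u06cc"   # Farsi yeh
ARABIC_YEH = "\u064a"  # Arabic yeh


def normalize_yeh(text: str) -> str:
    """Fold Farsi yeh into Arabic yeh so both spellings share one code point."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```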
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents such as news
• A tone mark can be attached to any vowel
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with diacritical marks
    • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
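One way to sketch "normalize 2 into 1" is Unicode NFC composition from Python's standard library. This is a stand-in for illustration; the slides do not say how the conversion is actually implemented, and `to_precomposed` is a hypothetical name.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


# The slide's example: U+0103 (a with breve) followed by U+0303 (combining tilde)
combined = "\u0103\u0303"
precomposed = to_precomposed(combined)
```

For the slide's example the composed result is the single code point U+1EB5.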
CJK-Kanji Normalization (1) (on language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, such as XD and :p
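The removal steps above could be sketched with regular expressions as follows. All patterns here are illustrative assumptions (the slides do not give the actual expressions), and `strip_twitter_noise` is a hypothetical name.

```python
import re

# Assumed patterns; real tweets may need broader ones.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as xD, XDDD, :p, ;P, :D
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD)(])(?!\w)")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

URLs are removed first so that the "://" inside them is not mistaken for an emoticon.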
Normalization for Twitter-Specific Representation
• How to handle words like "coooooooollllll"?
• Option 1: Build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization, e.g. coooollll → cool
  – It can't handle words that are not in the dictionary
• Option 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • In Japanese, by contrast, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
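Option 2 fits in a single regular expression. A minimal sketch, assuming "3 or more" as the threshold; `shrink_repeats` is a hypothetical name.

```python
import re

# A character followed by two or more copies of itself (a run of 3+).
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEAT.sub(r"\1\1", text)
```

Unlike the dictionary approach, this does not recover "cool" from "coooollll" (it yields "cooll"), but it maps all lengthened variants of a word to one form without needing a dictionary.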
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
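A rough sketch of laugh shrinking for regular laughs such as "jajajaja" or "kekeke". The pattern is an assumption for illustration, works on lowercased text only, and does not handle irregular spellings like "hahahha"; `shrink_laughs` is a hypothetical name.

```python
import re

# A two-letter unit followed by two or more copies of itself (3+ units in total).
LAUGH = re.compile(r"([a-z]{2})\1{2,}")


def shrink_laughs(text: str) -> str:
    """Lowercase the text and shrink a repeated two-letter unit to two repetitions."""
    return LAUGH.sub(r"\1\1", text.lower())
```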
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output is "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
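Reading one line of this format could look like the sketch below. It assumes the fields are tab-separated, that the first field is the label and the last field is the text, and that anything in between is metadata; `parse_line` is a hypothetical helper, not ldig's own parser.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str], str]:
    """Split a tab-separated record into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    text = fields[-1]
    meta = fields[1:-1]  # may be empty or hold several metadata columns
    return label, meta, text
```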
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".
language         size   detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan       5093    4923     4857      98.66        95.37     95.3     97.0
cs Czech         7681    7668     7663      99.93        99.77     96.3     99.7
da Danish        5516    5472     5310      97.04        96.27     94.5     92.4
de German       10060   10069    10006      99.37        99.46     86.6     93.8
en English      10162   10133    10029      98.97        98.69     88.3     95.0
es Spanish      10244   10284    10120      98.41        98.79     91.5     96.0
fi Finnish       7051    7038     7024      99.80        99.62     98.9     99.6
fr French       10074   10134    10051      99.18        99.77     95.0     98.1
hu Hungarian     4904    4892     4858      99.30        99.06     85.8     95.5
id Indonesian   10178   10225    10160      99.36        99.82     89.7     98.9
it Italian      10143   10205    10103      99.00        99.61     96.2     98.0
nl Dutch        10005    9916     9858      99.42        98.53     69.5     97.4
no Norwegian     8504    8432     8201      97.26        96.44     96.0     96.3
pl Polish       10151   10149    10130      99.81        99.79     98.0     99.7
pt Portuguese   10212   10201    10119      99.20        99.09     88.0     96.9
ro Romanian      5913    5867     5850      99.71        98.93     92.8     97.4
sv Swedish      10025   10093     9942      98.50        99.17     96.0     97.9
tr Turkish      10308   10317    10298      99.82        99.90     97.6     99.5
vi Vietnamese   10487   10480    10474      99.94        99.88     98.7     99.2
total: size 166711, correct 165053, accuracy 99.01%; LD53 92.2%, LDsm 97.4%
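The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. A small sketch that checks this against the Catalan row; the function names are hypothetical.

```python
def precision(correct: int, detect: int) -> float:
    """Share (in %) of texts detected as a language that really are that language."""
    return 100.0 * correct / detect


def recall(correct: int, size: int) -> float:
    """Share (in %) of a language's test texts that were detected correctly."""
    return 100.0 * correct / size


# Catalan row of the table: size 5093, detect 4923, correct 4857
ca_precision = round(precision(4857, 4923), 2)
ca_recall = round(recall(4857, 5093), 2)
```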
Estimation for LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset of 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• Uses the 19-language model
language     size   detect  correct  precision(%)  recall(%)
de German    1479    1476     1469      99.5         99.3
en English   1505    1502     1490      99.2         99.0
es Spanish   1562    1548     1541      99.6         98.7
fr French    1551    1549     1540      99.4         99.3
it Italian   1539    1531     1528      99.8         99.3
nl Dutch     1430    1429     1424      99.7         99.6
total: size 9066, correct 8992, accuracy 99.2%
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported).
                        ldig              langdetect        CLD
language         size   correct  rate(%)  correct  rate(%)  correct  rate(%)
bg Bulgarian     1000      -       -        988     98.8      991     99.1
cs Czech         1000    1000    100.0      994     99.4      995     99.5
da Danish        1000     976     97.6      968     96.8      932     93.2
de German        1000     999     99.9      998     99.8     1000    100.0
el Greek         1000      -       -       1000    100.0     1000    100.0
en English       1000     999     99.9      996     99.6     1000    100.0
es Spanish       1000    1000    100.0      996     99.6      989     98.9
et Estonian      1000      -       -        996     99.6      998     99.8
fi Finnish       1000     997     99.7      998     99.8     1000    100.0
fr French        1000     999     99.9      999     99.9      992     99.2
hu Hungarian     1000    1000    100.0      999     99.9      999     99.9
it Italian       1000     999     99.9      999     99.9      996     99.6
lt Lithuanian    1000      -       -        997     99.7      999     99.9
lv Latvian       1000      -       -        999     99.9      998     99.8
nl Dutch         1000    1000    100.0      974     97.4      995     99.5
pl Polish        1000     998     99.8      999     99.9      997     99.7
pt Portuguese    1000     995     99.5      996     99.6      989     98.9
ro Romanian      1000    1000    100.0      999     99.9      998     99.8
sk Slovak        1000      -       -        988     98.8      990     99.0
sl Slovene       1000      -       -        976     97.6      963     96.3
sv Swedish       1000     995     99.5      991     99.1      993     99.3
total           21000   13957     99.7    20850     99.3    20814     99.1
Conclusions
• A language detector using the maximal substring model
  – Achieves over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (> 100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings; in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Estimation with Europarl datasets
• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language         size   correct  accuracy(%)
bg Bulgarian     1000     988      98.8
cs Czech         1000     994      99.4
da Danish        1000     968      96.8
de German        1000     998      99.8
el Greek         1000    1000     100.0
en English       1000     996      99.6
es Spanish       1000     996      99.6
et Estonian      1000     996      99.6
fi Finnish       1000     998      99.8
fr French        1000     999      99.9
hu Hungarian     1000     999      99.9
it Italian       1000     999      99.9
lt Lithuanian    1000     997      99.7
lv Latvian       1000     999      99.9
nl Dutch         1000     974      97.4
pl Polish        1000     999      99.9
pt Portuguese    1000     996      99.6
ro Romanian      1000     999      99.9
sk Slovak        1000     988      98.8
sl Slovene       1000     976      97.6
sv Swedish       1000     991      99.1
total           21000   20850      99.3
Language detection is already solved, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – Regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports
language        LD(%)  CLD(%)  Tika(%)
ca Catalan      95.3   93.0    83.8
cs Czech        96.3   96.6    ----
da Danish       94.5   90.7    58.7
de German       86.6   96.8    73.1
en English      88.3   97.4    54.7
es Spanish      91.5   90.5    44.4
fi Finnish      98.9   99.4    94.8
fr French       95.0   94.5    67.4
hu Hungarian    85.8   89.0    76.2
id Indonesian   89.7   92.8    ----
it Italian      96.2   93.8    87.1
nl Dutch        69.5   93.2    65.0
no Norwegian    96.0   74.9    68.6
pl Polish       98.0   97.8    88.8
pt Portuguese   88.0   88.6    47.4
ro Romanian     92.8   96.1    82.6
sv Swedish      96.0   96.4    75.6
tr Turkish      97.6   97.4    ----
vi Vietnamese   98.7   98.9    ----
total           92.2   93.8    70.0
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions, and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
  – 95~98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Expressions that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a normal language model
OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
The letter k isn't used in Italian
Motivation to Detect Short Text Language
• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards, and so on
  – There are many questions about short-text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But their number is of order O(T^2)
Number of n-grams:
n    freq≧1   freq≧2   freq≧10
1        79       72        57
2      1896     1533       902
3     15970    10369      4525
4     64966    33941     10534
5    167543    69719     15538
6    323749   107861     18970
7    524634   142954     21093
8    760719   171995     22159
9    921361   193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
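A table of this shape could be produced along the following lines. This is a sketch under assumptions: frequency is taken as the total number of occurrences across the corpus, and each row counts distinct substrings of length up to n. The function name `cumulative_ngram_counts` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def cumulative_ngram_counts(texts: Iterable[str], max_n: int, min_freq: int = 1) -> List[int]:
    """For n = 1..max_n, count distinct substrings of length <= n with frequency >= min_freq."""
    counts: Counter = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for gram, freq in counts.items() if len(gram) <= n and freq >= min_freq)
        for n in range(1, max_n + 1)
    ]
```

The nested loops make the quadratic growth in the number of candidate substrings explicit, which is what motivates restricting features to maximal substrings.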
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction
Maximal Substring (1)
• Define a containment relation (a partial order) among non-empty substrings
Example: abracadabra
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
Strictly speaking, the definition also takes the position within the substring into account.
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra", and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
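The "abracadabra" example can be reproduced with a brute-force check. The sketch below assumes this characterization: a substring is maximal when no one-character extension to the left or right occurs exactly as often as the substring itself. It is for illustration only and is far slower than the linear-time construction; the function names are hypothetical.

```python
from typing import Set


def count_occurrences(text: str, sub: str) -> int:
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text: str) -> Set[str]:
    """Return substrings that cannot be extended by one character without losing occurrences."""
    alphabet = set(text)
    substrings = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = set()
    for sub in substrings:
        freq = count_occurrences(text, sub)
        extendable = any(
            count_occurrences(text, ch + sub) == freq or count_occurrences(text, sub + ch) == freq
            for ch in alphabet
        )
        if not extendable:
            result.add(sub)
    return result
```

For "abracadabra" this returns "a", "abra", and "abracadabra", matching the slide.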
Maximal Substring and Infinity-Gram
• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Note: the equivalence breaks down on the test set, but we assume that a sufficiently large training set approximates it.
Extended Suffix Array
• An extended suffix array consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short-text detection, mixed text can be divided by character type
• Latin-alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 of these languages have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
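The nine ranges above can be turned into a simple lookup, for example to decide whether a character belongs to the Latin-alphabet target at all. A minimal sketch; `latin_block` is a hypothetical helper, and note that Basic Latin also contains digits and punctuation.

```python
from typing import Optional

# The nine Unicode blocks listed in the table above.
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]


def latin_block(ch: str) -> Optional[str]:
    """Return the name of the Latin-related block containing ch, or None."""
    code_point = ord(ch)
    for low, high, name in LATIN_BLOCKS:
        if low <= code_point <= high:
            return name
    return None
```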
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin alphabets; legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled "fr"
  – Recover French tweets from those not labeled "fr"
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian, and Polish guides, in that order]
Created Corpus
• Noise-free tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus
language        training    test
ca Catalan         9089     5082
cs Czech           9082     7682
da Danish          7388     5524
de German         44448    10065
en English        44520    10168
es Spanish        44118    10265
fi Finnish         8087     7050
fr French         44339    10098
hu Hungarian      10030     4904
id Indonesian     44722    10181
it Italian        43366    10152
nl Dutch          44682    10007
no Norwegian      10124     8496
pl Polish         16771    10152
pt Portuguese     44215    10208
ro Romanian       10021     5911
sv Swedish        44054    10032
tr Turkish        44703    10308
vi Vietnamese     15030    10488
total            538789   166773
Simple Language Detection
• A language detector can be built from the maximal substring model and the Twitter corpus
  – But it still reaches at most 98% accuracy
• We suspect that it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets per language is hugely imbalanced
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
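The approximation described above (whole copies plus a remainder sampled without replacement) might look like this sketch. The function name `level_out`, the dictionary layout, and the fixed seed are assumptions for illustration; every language is assumed to have at least one tweet.

```python
import random
from typing import Dict, List


def level_out(corpus_by_lang: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus_by_lang.values())
    leveled = {}
    for lang, tweets in corpus_by_lang.items():
        copies, remainder = divmod(target, len(tweets))
        # Whole copies of the data, plus a remainder drawn without replacement.
        leveled[lang] = tweets * copies + rng.sample(tweets, remainder)
    return leveled
```

Compared with pure sampling with replacement, this guarantees that every original tweet appears at least `copies` times.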
Convert to Lowercase on Multiple Languages
• Converting to lowercase reduces the amount of corpus needed and compresses the model
• But the lowercase of I (U+0049) in Turkish differs from that in other languages
• Convert to lowercase except for "I"
                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
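"Convert to lowercase except for I" can be sketched as a per-character mapping. This is an illustrative assumption about how the rule could be applied, not the actual implementation; `lower_except_i` is a hypothetical name.

```python
def lower_except_i(text: str) -> str:
    """Lowercase every character except capital I (U+0049), whose lowercase is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Leaving "I" untouched avoids committing to either "i" or "ı" before the language is known.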
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Evaluation on the Europarl Dataset
                       ldig            langdetect      CLD
language        size   correct  rate   correct  rate   correct  rate
bg Bulgarian    1000   ----     ----   988      98.8   991      99.1
cs Czech        1000   1000     100.0  994      99.4   995      99.5
da Danish       1000   976      97.6   968      96.8   932      93.2
de German       1000   999      99.9   998      99.8   1000     100.0
el Greek        1000   ----     ----   1000     100.0  1000     100.0
en English      1000   999      99.9   996      99.6   1000     100.0
es Spanish      1000   1000     100.0  996      99.6   989      98.9
et Estonian     1000   ----     ----   996      99.6   998      99.8
fi Finnish      1000   997      99.7   998      99.8   1000     100.0
fr French       1000   999      99.9   999      99.9   992      99.2
hu Hungarian    1000   1000     100.0  999      99.9   999      99.9
it Italian      1000   999      99.9   999      99.9   996      99.6
lt Lithuanian   1000   ----     ----   997      99.7   999      99.9
lv Latvian      1000   ----     ----   999      99.9   998      99.8
nl Dutch        1000   1000     100.0  974      97.4   995      99.5
pl Polish       1000   998      99.8   999      99.9   997      99.7
pt Portuguese   1000   995      99.5   996      99.6   989      98.9
ro Romanian     1000   1000     100.0  999      99.9   998      99.8
sk Slovak       1000   ----     ----   988      98.8   990      99.0
sl Slovene      1000   ----     ----   976      97.6   963      96.3
sv Swedish      1000   995      99.5   991      99.1   993      99.3
total           21000  13957    99.7   20850    99.3   20814    99.1
The ldig column and its total cover only the languages that ldig supports.
Conclusions
• We built a language detector using the maximal substring model
  – It detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy when its profiles are built from the tweet corpus
• Maintaining the corpus should raise precision further
  – There are still many mistakes (in particular for da and no)
• Adding metadata as features should raise precision further
  – Open question: how to add and train on metadata at low cost
• We want to shrink the model without losing precision
  – It is too large for applications (> 100 MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Language detection is already solved, isn't it?
We still have an ENEMY to beat
Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection (langdetect)
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – Treats ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports
language        LD     CLD    Tika
ca Catalan      95.3   93.0   83.8
cs Czech        96.3   96.6   ----
da Danish       94.5   90.7   58.7
de German       86.6   96.8   73.1
en English      88.3   97.4   54.7
es Spanish      91.5   90.5   44.4
fi Finnish      98.9   99.4   94.8
fr French       95.0   94.5   67.4
hu Hungarian    85.8   89.0   76.2
id Indonesian   89.7   92.8   ----
it Italian      96.2   93.8   87.1
nl Dutch        69.5   93.2   65.0
no Norwegian    96.0   74.9   68.6
pl Polish       98.0   97.8   88.8
pt Portuguese   88.0   88.6   47.4
ro Romanian     92.8   96.1   82.6
sv Swedish      96.0   96.4   75.6
tr Turkish      97.6   97.4   ----
vi Vietnamese   98.7   98.9   ----
total           92.2   93.8   70.0
All values are accuracy in percent; ---- marks a language the tool does not support.
Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl dataset
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Spellings that violate the language's orthography often appear
  – Acronyms, abbreviations and lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be small under a normal language model
Examples:
OMG = Oh My God
LOL = Laughing Out Loud
LMAO = Laughing My Ass Off
F4F = Follow for Follow
MDR = Mort de Rire (French)
TKT = Ne t'Inquiète Pas (French)
u = you
ur = your
4 = for
i0u = I love you
k = che (Italian)
anke = anche (Italian)
The letter k isn't used in Italian
Motivation to Detect the Language of Short Text
• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short-text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, a one-word sentence is too difficult to detect
  – Our goal is 99%+ accurate detection for sentences of more than 3 words
We Need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels
Proposed Method
How to Get More Features than 3-grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is of order O(T^2) for a text of length T
Number of distinct n-grams by frequency threshold:
n   freq≥1   freq≥2   freq≥10
1   79       72       57
2   1896     1533     902
3   15970    10369    4525
4   64966    33941    10534
5   167543   69719    15538
6   323749   107861   18970
7   524634   142954   21093
8   760719   171995   22159
9   921361   193995   22696
Cumulative distribution of feature length for 5090 normalized English tweets (300 KB)
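A table like the one above can be produced by counting character n-grams. This is a minimal sketch, assuming that "freq" means the total number of occurrences over the whole corpus and that each row counts all lengths from 1 up to n; the function name and the toy input are illustrative, not taken from ldig.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """Count distinct character n-grams of length 1..max_n that occur at least min_freq times."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy corpus: "abab" contains a (2), b (2), ab (2), ba (1)
sample = ["abab"]
print(cumulative_ngram_counts(sample, max_n=2, min_freq=1))
print(cumulative_ngram_counts(sample, max_n=2, min_freq=2))
```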
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Using only maximal substrings gives an equivalent model that can be constructed in linear time
  – Storing the features in a trie makes prediction fast
Maximal Substring (1)
• Define a containment relation (a partial order) among non-empty substrings
  Example text: abracadabra
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly, the relation is defined together with the position within the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, called the maximal substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
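The abracadabra example can be checked with a brute-force sketch. It assumes a substring is maximal when every one-character extension, to the left or to the right, occurs strictly less often; this frequency test stands in for the containment relation and reproduces the slide's answer. It enumerates all substrings, so it is only meant for tiny inputs; the function name is illustrative.

```python
from collections import Counter


def maximal_substrings(text):
    """Brute force: keep substrings whose every one-character extension is rarer."""
    # Count every occurrence (overlaps included) of every substring.
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # If some extension occurs equally often, sub always appears inside it.
        has_equal_extension = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet
        )
        if not has_equal_extension:
            result.append(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))
```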
Maximal Substring and Infinity-Gram
• Substrings in a containment relation always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• So logistic regression with maximal substrings is equivalent to logistic regression with ∞-grams
Note: the equivalence breaks down on the test set, but we assume it holds approximately when the training set is sufficiently large
Extended Suffix Array
• An extended suffix array (ESA) consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L > 0 whose BWT characters are of more than one kind
  – These can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
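To make the three arrays concrete, here is a naive sketch that builds them for a short string. It is not esaxx and it is not linear time: it sorts suffixes directly, which is quadratic or worse. It also uses no end-of-text sentinel and lets the BWT wrap around for the suffix starting at position 0; both are simplifying assumptions.

```python
def naive_esa(text):
    """Return (suffix array, LCP array, BWT characters) built the slow, obvious way."""
    n = len(text)
    # Suffix array: start positions of the suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0.
    lcp = [0] * n
    for k in range(1, n):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        m = 0
        while m < min(len(a), len(b)) and a[m] == b[m]:
            m += 1
        lcp[k] = m
    # BWT: the character preceding each suffix (index -1 wraps to the last character).
    bwt = [text[i - 1] for i in sa]
    return sa, lcp, bwt


sa, lcp, bwt = naive_esa("banana")
```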
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short-text detection, mixed text can be divided by character type
• Languages written in the Latin alphabet
  – The most difficult alphabet type to detect
  – There are more than 25 such languages with over 5 million speakers
What Is the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range          Name                          Note
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for composing tone marks
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
Latin Alphabets in the Unicode Codepoint Chart
[Figure: codepoint chart; legend: for Vietnamese only / used often / used sometimes]
How to Create the Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
How to Annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish language guides, in turn]
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773
Simple Language Detection
• A language detector can be built from the maximal substring model and the Twitter corpus
  – But on its own it reaches at most 98% accuracy
• We guess that it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest language
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Chart: share of tweets by language, labeled English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
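The leveling step can be sketched as follows. This follows the approximation described on the slide (whole copies plus a remainder sampled without replacement); the dictionary layout, the function name and the fixed seed are assumptions for illustration, and every language is assumed to have at least one tweet.

```python
import random


def level_by_oversampling(corpus, seed=0):
    """corpus maps a language code to a list of tweets; returns an equally sized copy."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, remainder = divmod(target, len(tweets))
        # Whole copies of the data, then the remainder sampled without replacement.
        leveled[lang] = tweets * copies + rng.sample(tweets, remainder)
    return leveled


toy = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_by_oversampling(toy)
```

Compared with plain sampling with replacement, this variant guarantees that every original tweet appears at least once.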
Converting to Lowercase across Multiple Languages
• Converting to lowercase reduces the amount of corpus needed and compresses the model
• But the lowercase of I (U+0049) in Turkish differs from that in other languages
• So convert to lowercase, excluding 'I'
Languages              Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
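A minimal sketch of the rule "convert to lower case excluding 'I'", assuming it is applied character by character. The function name is illustrative and this is not necessarily how ldig implements it.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is kept as-is."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

Keeping 'I' untouched avoids deciding between i and ı before the language is known.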
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of characters for s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 kinds look the same in some fonts
  – Indistinguishable
  ș (U+0219) vs ş (U+015F)
  ț (U+021B) vs ţ (U+0163)
Romanian Character Situation on PCs
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: democratization of Romania
  – 2001: 's/t with comma' were provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – A "Romanian IME" outputs the substitutes too :D
• Treat 's/t with comma' as 's/t with cedilla'
  ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ)
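The mapping "treat s/t with comma as s/t with cedilla" can be written as a small translation table. The code points come from the slides; the table and function names are illustrative.

```python
# Comma-below forms (U+0218-021B) mapped to the cedilla forms (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace s/t with comma below by the more common s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u021Bi \u0219i"))
```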
Arabic Character Normalization (in language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Treat U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents such as news
• The tone marks can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with diacritical marks
    • ẵ = U+0103 U+0303
  – About half and half in news and tweets
• Normalize 2 into 1
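One way to fold the combining form into the precomposed form is Unicode NFC normalization from the Python standard library. This is a swapped-in technique for illustration; the slides do not say that ldig uses NFC rather than its own mapping table.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


# Representation 2 from the slide: U+0103 (ă) followed by U+0303 (combining tilde)
decomposed = "\u0103\u0303"
composed = compose_vietnamese(decomposed)
```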
CJK Kanji Normalization (1) (in language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probability
    • e.g. 谢谢, Kanji for personal names
  – Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK Kanji Normalization (2) (in language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (the size of the ASCII alphabet is 52)
  – (2) Lists of "commonly used Kanji" published for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for personal and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
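The removals listed above can be sketched with regular expressions. The patterns below are simplified assumptions, not ldig's actual rules, and emoticons are deliberately left out because a pattern for them easily damages ordinary words.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and the RT marker, then collapse whitespace."""
    # URLs go first so that a '#' or '@' inside a URL is not treated separately.
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_markup("RT @foo check this http://t.co/abc #nice photo"))
```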
Normalization for Twitter-Specific Representations
• How to handle words like 'coooooooollllll'?
• Option 1: build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization, like coooollll → cool
  – It can't handle words that are not in the dictionary
• Option 2: if the same character repeats 3 or more times in a row, shrink the run to 2
  – No language has 3 or more consecutive repetitions of the same Latin letter in its orthography
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection anyway
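Option 2 fits in a single regular expression. This sketch assumes "3 or more in a row" as the threshold; the names are illustrative.

```python
import re

# A character followed by at least two more copies of itself.
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Note that the result is "cooll" rather than "cool": the rule only bounds the run length, it does not restore the dictionary spelling.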
Laugh Normalization
• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
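A possible regular expression for laugh shrinking is sketched below. The choice of consonants (h, j, k) and vowels is an assumption based on the examples on the slide, and the pattern is not taken from ldig. It tolerates doubled letters inside the laugh, as in "hahahha".

```python
import re

# One consonant-vowel unit, then one or more repeats of the same consonant and vowel.
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a repeated laugh such as 'jajajaja' to two units ('jaja')."""
    return LAUGH.sub(r"\1\2\1\2", text)


print(shrink_laugh("hahahha"))
print(shrink_laugh("tafil con eso jajajajajajaja"))
```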
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (cumulative penalty) [Tsuruoka+ 09]
  – Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path to the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with " " and line feeds with U+0001 in it
  – Output format is "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to regularize all the parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes useless features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
Examples:
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
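A line in this format can be split as sketched below. It assumes that fields are separated by TAB characters, that the first field is the label and the last field is the text, and that anything in between is metadata; the function name is illustrative and this is not ldig's own loader.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]        # correct language label, e.g. "en"
    text = fields[-1]        # the tweet text is assumed to be the last field
    metadata = fields[1:-1]  # zero or more metadata fields in between
    return label, metadata, text


label, metadata, text = parse_line("ca\t12345\t2012-05-14\tuser1\tes\tyou know xDD\n")
```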
Usage (5) Evaluation Tool
• server.py -m [model] -p [port number]
  – After starting it, open http://localhost:[port]/
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters
Evaluation
LD53 = langdetect + the standard bundled profiles; LDsm = langdetect + profiles built from the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
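The "undetectable" rule can be sketched as a threshold on the most probable language. The 0.6 threshold is from the slide; the function name and the dictionary of probabilities are illustrative assumptions.

```python
def decide_language(probabilities, threshold=0.6):
    """Return the most probable language, or None when its probability is below the threshold."""
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else None


print(decide_language({"en": 0.7, "nl": 0.3}))
print(decide_language({"da": 0.55, "no": 0.45}))
```

Texts for which this returns None are not counted in "detect", which is why "detect" can be smaller than "size".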
We still have ENEMY to beat
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 18
Twitter Language Detection with the Existing Methods
bull Only 90-95 accuracy
for tweet corpus
bull LD = language-detection
bull CLD = Chromium Compact Language
Detection
ndash httpcodegooglecompchromium-
compact-language-detector
ndash regard ms(Malay) as id(Indonesian)
bull Tika = Apache Tika
ndash httptikaapacheorg
ndash Estimate on 15 languages which Tika
supports in our tweet corpus
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 19
language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----
total 922 938 700
Chromium Compact Language Detection (CLD)
bull Porting the language detector from
Google Chromium ndash httpcodegooglecompchromium-compact-language-detector
ndash Implementation in C++ Python binding
ndash of supported languages CLD = 76
langdetect = 53
ndash Accuracy CLD = 9882 langdetect =
9922
bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 20
Is twitter Language Detection difficult (1)
bull Tweet is too short to extract 3-gram features
ndash At most 140 characters on twitter
ndash URLs mentions and hashtags are not useful to
detect
bull LIGA [Tromp+ 11]
ndash Graph-features based on 3-gram
bull Add long distance features
bull 95~98 accuracy for twitter Language Detection
bull 6 languages (de en es fr it nl)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 21
Is twitter Language Detection difficult (2)
bull Tweet is too noisy
ndash Representations against the languages orthography often appear
ndash Acronym Abbreviation lengthened word (like Cooooolll)
bull Likelihood of tweet tends to get smaller on normal language model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
OMG Oh My God
LOL Laughing Out Loud
LMAO Laughing My Ass Out
F4F Follow for Follow
MDR Mort de Rire (French)
TKT Ne tlsquoInquiegravete Pas (Fr)
u you
ur your
4 for
i0u I love you
k che (Italian)
anke anche(Italian)
Letter k isnt used in Italian
22
Motivation to Detect Short Text Language
bull There are many small chunks of text in addition to twitter
ndash Schedule search query bulletin board and so on
ndash There are many questions about short text detection in the Issues Board of langdetect Project
bull httpcodegooglecomplanguage-detectionissuesdetailid=10
bull Detection for multi-language mixed text
ndash Cut the target document in paragraphs or lines
ndash Detect for each short text
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 23
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate?
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
language         training     test
ca Catalan           9089     5082
cs Czech             9082     7682
da Danish            7388     5524
de German           44448    10065
en English          44520    10168
es Spanish          44118    10265
fi Finnish           8087     7050
fr French           44339    10098
hu Hungarian        10030     4904
id Indonesian       44722    10181
it Italian          43366    10152
nl Dutch            44682    10007
no Norwegian        10124     8496
pl Polish           16771    10152
pt Portuguese       44215    10208
ro Romanian         10021     5911
sv Swedish          44054    10032
tr Turkish          44703    10308
vi Vietnamese       15030    10488
total              538789   166773

Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – As is, it reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias

Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]

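The two leveling strategies described above can be sketched as follows. This is an illustration under assumptions, not ldig's code: the corpus is assumed to be a dict from language code to a non-empty list of tweets, and both function names are hypothetical.

```python
import random


def level_by_resampling(corpus, seed=0):
    """Sample with replacement so every language reaches the largest size."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    return {lang: [rng.choice(tweets) for _ in range(target)]
            for lang, tweets in corpus.items()}


def level_by_copying(corpus, seed=0):
    """Approximation: copy the data a whole number of times, then sample
    the remainder without replacement."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled
```

With the second variant every tweet of a small language is guaranteed to appear, which plain sampling with replacement does not ensure.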
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• So convert to lower case, excluding 'I'
Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)

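A minimal sketch of "lower case excluding 'I'", assuming plain Python strings; the function name is hypothetical.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower case
    depends on the language (i in most languages, ı in Turkish)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

One caveat when using Python for this: `str.lower()` maps İ (U+0130) to i followed by the combining dot U+0307, so a real pipeline may want to handle that letter explicitly as well.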
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of characters for s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
  – Indistinguishable
ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography prescribes 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Figure: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – Even the "Romanian IME" outputs the substitutes :D
• Regard 's/t with comma' as 's/t with cedilla'
ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ さ)

Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is very frequent in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

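Both substitute-character rules above (Romanian s/t and Farsi yeh) amount to a fixed character mapping. The following is a sketch with a hypothetical table name and function name; the upper-case Romanian pairs are included on the assumption that they should be treated the same way as the lower-case ones.

```python
# Map each character to the form the slides choose as canonical.
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Replace every character listed in SUBSTITUTES, leave the rest as is."""
    return text.translate(SUBSTITUTES)
```

The mapping deliberately goes toward the more frequent form in the corpus, not toward the orthographically correct one, as the slides explain.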
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents like news
• The tone marks can be attached to every vowel
  – 12 × 6 = 72

Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Diacritical Marks
    • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1

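One readily available way to turn representation 2 into representation 1 is Unicode NFC normalization. The slides do not say which mechanism is used, so treat this as an assumption; note also that NFC composes every composable sequence, not only Vietnamese vowels. The function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text):
    """Turn base letter + combining tone mark into one precomposed code point."""
    return unicodedata.normalize("NFC", text)
```

For example, the sequence U+0103 U+0303 from the slide becomes the single code point U+1EB5.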
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists defined for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons written with letters, like XD and :p

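The removals listed above can be sketched with regular expressions. All patterns below are assumptions made for illustration (the slides do not give the actual expressions), and the emoticon pattern only covers a few forms such as XD, :p and :D.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons that stand alone, e.g. XD, xDDD, :p, :D
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD)(])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons,
    then collapse the leftover whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

URLs are removed first so that the colon in "http://" is never mistaken for the start of an emoticon.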
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats more than 3 times, shrink the run to 2
  – No language has a run of over 3 identical Latin letters in its orthography
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection

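Case 2 above can be sketched with one regular expression. Two assumptions are made here: the threshold is read as "three or more identical letters", and the rule is restricted to Latin letters (roughly U+00C0 to U+024F plus a-z) so that Japanese words like かたたたき are left alone. The function name is hypothetical.

```python
import re

# A Latin letter followed by at least two more copies of itself
REPEATED_LATIN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")


def shrink_repeats(text):
    """Collapse runs of three or more identical Latin letters to two."""
    return REPEATED_LATIN.sub(r"\1\1", text)
```

Unlike the dictionary of Case 1, this does not recover the correct spelling (coooollll becomes cooll, not cool); it only bounds the number of distinct lengthened forms.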
Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha

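A sketch of the laugh shrinking above, assuming the text has already been lowercased. The choice of consonants (h, j, k) and vowels is an assumption drawn from the examples on the slide, and the function name is hypothetical.

```python
import re

# consonant(s) + vowel(s), followed by one or more repetitions of the same pair
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+")


def shrink_laugh(text):
    """Reduce laughs such as 'hahahha' or 'jajajaja' to two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)
```

The backreferences make sure only a repeated identical syllable is shrunk, so a word like "hello" is not touched.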
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value

Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output format is "[feature]\t[frequency]"

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

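Reading this format can be sketched as below. It assumes that fields are tab-separated, that the first field is the label and the last one the text, and that anything in between is metadata (possibly empty), which is what the examples suggest. The function name is hypothetical and this is not ldig's own loader.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text
```

The metadata values in the usage below are invented placeholders, not real tweet data.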
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after starting it
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation
LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus
Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of 'detect' is less than the sum of 'size'
language          size   detect  correct  precision  recall   LD53   LDsm
ca Catalan        5093     4923     4857      98.66   95.37   95.3   97.0
cs Czech          7681     7668     7663      99.93   99.77   96.3   99.7
da Danish         5516     5472     5310      97.04   96.27   94.5   92.4
de German        10060    10069    10006      99.37   99.46   86.6   93.8
en English       10162    10133    10029      98.97   98.69   88.3   95.0
es Spanish       10244    10284    10120      98.41   98.79   91.5   96.0
fi Finnish        7051     7038     7024      99.80   99.62   98.9   99.6
fr French        10074    10134    10051      99.18   99.77   95.0   98.1
hu Hungarian      4904     4892     4858      99.30   99.06   85.8   95.5
id Indonesian    10178    10225    10160      99.36   99.82   89.7   98.9
it Italian       10143    10205    10103      99.00   99.61   96.2   98.0
nl Dutch         10005     9916     9858      99.42   98.53   69.5   97.4
no Norwegian      8504     8432     8201      97.26   96.44   96.0   96.3
pl Polish        10151    10149    10130      99.81   99.79   98.0   99.7
pt Portuguese    10212    10201    10119      99.20   99.09   88.0   96.9
ro Romanian       5913     5867     5850      99.71   98.93   92.8   97.4
sv Swedish       10025    10093     9942      98.50   99.17   96.0   97.9
tr Turkish       10308    10317    10298      99.82   99.90   97.6   99.5
vi Vietnamese    10487    10480    10474      99.94   99.88   98.7   99.2
total: size 166711, correct 165053, accuracy 99.01, LD53 92.2, LDsm 97.4

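The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. The sketch below reproduces the Catalan row and shows the 0.6 cut-off mentioned above; both function names are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to 2 decimals."""
    return round(100.0 * correct / detect, 2), round(100.0 * correct / size, 2)


def detect_or_none(probabilities, threshold=0.6):
    """Return the most probable language, or None when its probability is
    below the threshold (the text then counts as undetectable)."""
    lang = max(probabilities, key=probabilities.get)
    return lang if probabilities[lang] >= threshold else None
```

For Catalan, `precision_recall(5093, 4923, 4857)` gives the 98.66 and 95.37 shown in the table.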
Estimation for LIGA dataset
• Evaluate on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language       size   detect  correct  precision  recall
de German      1479     1476     1469       99.5    99.3
en English     1505     1502     1490       99.2    99.0
es Spanish     1562     1548     1541       99.6    98.7
fr French      1551     1549     1540       99.4    99.3
it Italian     1539     1531     1528       99.8    99.3
nl Dutch       1430     1429     1424       99.7    99.6
total: size 9066, correct 8992, accuracy 99.2

Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported)
                         ldig            langdetect      CLD
language         size    correct  rate   correct  rate   correct  rate
bg Bulgarian     1000          -     -       988  98.8       991  99.1
cs Czech         1000       1000 100.0       994  99.4       995  99.5
da Danish        1000        976  97.6       968  96.8       932  93.2
de German        1000        999  99.9       998  99.8      1000 100.0
el Greek         1000          -     -      1000 100.0      1000 100.0
en English       1000        999  99.9       996  99.6      1000 100.0
es Spanish       1000       1000 100.0       996  99.6       989  98.9
et Estonian      1000          -     -       996  99.6       998  99.8
fi Finnish       1000        997  99.7       998  99.8      1000 100.0
fr French        1000        999  99.9       999  99.9       992  99.2
hu Hungarian     1000       1000 100.0       999  99.9       999  99.9
it Italian       1000        999  99.9       999  99.9       996  99.6
lt Lithuanian    1000          -     -       997  99.7       999  99.9
lv Latvian       1000          -     -       999  99.9       998  99.8
nl Dutch         1000       1000 100.0       974  97.4       995  99.5
pl Polish        1000        998  99.8       999  99.9       997  99.7
pt Portuguese    1000        995  99.5       996  99.6       989  98.9
ro Romanian      1000       1000 100.0       999  99.9       998  99.8
sk Slovak        1000          -     -       988  98.8       990  99.0
sl Slovene       1000          -     -       976  97.6       963  96.3
sv Swedish       1000        995  99.5       991  99.1       993  99.3
total           21000      13957  99.7     20850  99.3     20814  99.1

Conclusions
• Language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with profiles from the tweet corpus
• If the corpus is cleaned up further, precision will still improve
  – There are still many mistakes in it (in particular da and no)
• If metadata is added to the features, precision will still improve
  – How to add and train on metadata at low cost?
• The model should be shrunk without loss of precision
  – Too large for applications (> 100MB)

References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese; original title: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Twitter Language Detection with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages of our tweet corpus that Tika supports
language          LD    CLD   Tika
ca Catalan      95.3   93.0   83.8
cs Czech        96.3   96.6   ----
da Danish       94.5   90.7   58.7
de German       86.6   96.8   73.1
en English      88.3   97.4   54.7
es Spanish      91.5   90.5   44.4
fi Finnish      98.9   99.4   94.8
fr French       95.0   94.5   67.4
hu Hungarian    85.8   89.0   76.2
id Indonesian   89.7   92.8   ----
it Italian      96.2   93.8   87.1
nl Dutch        69.5   93.2   65.0
no Norwegian    96.0   74.9   68.6
pl Polish       98.0   97.8   88.8
pt Portuguese   88.0   88.6   47.4
ro Romanian     92.8   96.1   82.6
sv Swedish      96.0   96.4   75.6
tr Turkish      97.6   97.4   ----
vi Vietnamese   98.7   98.9   ----
total           92.2   93.8   70.0

Chromium Compact Language Detection (CLD)
• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl dataset
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Expressions that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a normal language model
OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
The letter k isn't used in Italian

Motivation to Detect Short Text Language
• There are many small chunks of text besides Twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a Twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels

Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of features is of order O(T^2)
n of n-gram    freq≥1    freq≥2   freq≥10
1                  79        72        57
2                1896      1533       902
3               15970     10369      4525
4               64966     33941     10534
5              167543     69719     15538
6              323749    107861     18970
7              524634    142954     21093
8              760719    171995     22159
9              921361    193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)

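A count like the one in this table can be sketched as follows. It assumes that the table counts distinct substrings of length 1 to n and that frequency means total occurrences over all tweets; the slides do not spell this out, and the function name is hypothetical.

```python
from collections import Counter


def count_features(texts, max_n, min_freq=1):
    """Number of distinct substrings of length 1..max_n that occur at
    least min_freq times in total across all texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return sum(1 for c in counts.values() if c >= min_freq)
```

Enumerating every length this way is exactly the quadratic blow-up the slide warns about, which motivates the maximal substring model.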
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction

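Prediction with such a model can be sketched as a softmax over summed feature weights. Everything below is illustrative: the weights are invented, the feature value is assumed to be the occurrence count, and a plain scan is used where the slides mention a TRIE.

```python
import math


def predict(text, weights, languages):
    """weights: dict mapping substring feature -> {language: weight}.
    Returns a dict mapping language -> probability."""
    scores = {lang: 0.0 for lang in languages}
    for feature, per_lang in weights.items():
        # Count (possibly overlapping) occurrences of the feature in the text
        occurrences = sum(1 for i in range(len(text)) if text.startswith(feature, i))
        for lang, w in per_lang.items():
            scores[lang] += w * occurrences
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp = {lang: math.exp(s - top) for lang, s in scores.items()}
    total = sum(exp.values())
    return {lang: e / total for lang, e in exp.items()}
```

With hypothetical weights such as `{"ik ": {"nl": 2.0}, "ich": {"de": 2.0}}`, the Dutch example sentence from the beginning of the talk scores highest for nl.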
Maximal Substring (1)
• Define a containment relation (a partial order) among non-empty substrings
Example text: abracadabra
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
Strictly, the relation is defined together with the position inside the containing substring

Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of abracadabra are a, abra and abracadabra
via http://d.hatena.ne.jp/nokuno/20120203/1328237067

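The abracadabra example can be checked with a naive enumeration. The sketch below uses the characterization that a substring is maximal when extending it by one character on either side always loses occurrences; it is cubic in the text length, unlike the linear-time construction the talk relies on, and the function name is hypothetical.

```python
from collections import Counter


def maximal_substrings(text):
    """Substrings that lose occurrences whenever extended by one character."""
    counts = Counter(text[i:j]
                     for i in range(len(text))
                     for j in range(i + 1, len(text) + 1))
    alphabet = set(text)
    return {s for s, c in counts.items()
            if all(counts.get(x + s, 0) < c and counts.get(s + x, 0) < c
                   for x in alphabet)}
```

For example, "bra" is rejected because "abra" occurs just as often (twice), while "abra" itself is kept because both "abrac" and "dabra" occur only once.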
Maximal Substring and Infinity-Gram
• Substrings in a containment relation always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to one with infinity-grams
Although the equivalence breaks down on the test set, we assume that it is approximated well with a sufficiently large training set

Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a range of suffixes with L>0 whose BWT contains more than 1 character type
  – These can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]

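The three arrays can be illustrated with a naive construction. This sketch sorts suffixes directly, so it is not linear time and is not how esaxx works; the BWT is taken cyclically (the character before position 0 is the last character), which is an assumption of this example, and the function name is hypothetical.

```python
def extended_suffix_array(text):
    """Return (SA, LCP, BWT) for text, built naively."""
    n = len(text)
    # Suffix array: start positions of the suffixes in sorted order
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(a, b):
        # Length of the common prefix of the suffixes starting at a and b
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    # LCP between each pair of neighbouring suffixes in sorted order
    lcps = [lcp(sa[r - 1], sa[r]) for r in range(1, n)]
    # BWT: the character preceding each suffix (wrapping around at 0)
    bwt = "".join(text[i - 1] for i in sa)
    return sa, lcps, bwt
```

For abracadabra, the suffixes "abra" and "abracadabra" are neighbours with LCP 4 and different preceding characters (d and a), so "abra" is maximal; "bra" and "bracadabra" have LCP 3 but the same preceding character a, so "bra" is not.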
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]\t[metadata]\t[text] (tab-separated)
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
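A minimal sketch of reading this format in Python. It assumes the fields are separated by tabs, the first field is the label, the last field is the text, and everything in between is metadata. `parse_line` is a hypothetical helper for illustration, not a function of ldig.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # may be empty or hold several fields
    return label, metadata, text
```

Taking the text from the last field keeps the sketch usable whether the metadata is one field or several, as in the Catalan example.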
Usage (5) Estimation Tool
bull server.py -m [model] -p [port number]
ndash Open http://localhost:[port] after it starts
ndash For a text entered in the text area, it shows the language probabilities, the features the text contains, and their parameters
Estimation
language           size  detect  correct  precision  recall  LD53  LDsm
ca Catalan         5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech           7681    7668     7663      99.93   99.77  96.3  99.7
da Danish          5516    5472     5310      97.04   96.27  94.5  92.4
de German         10060   10069    10006      99.37   99.46  86.6  93.8
en English        10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish        10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish         7051    7038     7024      99.80   99.62  98.9  99.6
fr French         10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian       4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian     10178   10225    10160      99.36   99.82  89.7  98.9
it Italian        10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch          10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian       8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish         10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese     10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian        5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish        10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish        10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese     10487   10480    10474      99.94   99.88  98.7  99.2
total            166711           165053              99.01  92.2  97.4
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
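The precision and recall columns follow from the three counts: precision = correct / detect and recall = correct / size. The sketch below recomputes them and shows the 0.6 cut-off for undetectable texts. The function names are hypothetical, and treating a probability of exactly 0.6 as detectable is an assumption.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to 2 decimals."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


def decide(probabilities, threshold=0.6):
    """Return the most probable label, or None if it is below the threshold."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else None
```

For the Catalan row, `precision_recall(5093, 4923, 4857)` gives the 98.66 and 95.37 listed in the table.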
Estimation for LIGA dataset
bull Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
ndash http://www.win.tue.nl/~mpechen/projects/smm/
bull Uses the 19-language model
language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066             8992               99.2
Estimation for Europarl Dataset
                        ldig           langdetect     CLD
language          size  correct  rate  correct  rate  correct  rate
bg Bulgarian      1000                     988  98.8      991  99.1
cs Czech          1000     1000 100.0      994  99.4      995  99.5
da Danish         1000      976  97.6      968  96.8      932  93.2
de German         1000      999  99.9      998  99.8     1000 100.0
el Greek          1000                    1000 100.0     1000 100.0
en English        1000      999  99.9      996  99.6     1000 100.0
es Spanish        1000     1000 100.0      996  99.6      989  98.9
et Estonian       1000                     996  99.6      998  99.8
fi Finnish        1000      997  99.7      998  99.8     1000 100.0
fr French         1000      999  99.9      999  99.9      992  99.2
hu Hungarian      1000     1000 100.0      999  99.9      999  99.9
it Italian        1000      999  99.9      999  99.9      996  99.6
lt Lithuanian     1000                     997  99.7      999  99.9
lv Latvian        1000                     999  99.9      998  99.8
nl Dutch          1000     1000 100.0      974  97.4      995  99.5
pl Polish         1000      998  99.8      999  99.9      997  99.7
pt Portuguese     1000      995  99.5      996  99.6      989  98.9
ro Romanian       1000     1000 100.0      999  99.9      998  99.8
sk Slovak         1000                     988  98.8      990  99.0
sl Slovene        1000                     976  97.6      963  96.3
sv Swedish        1000      995  99.5      991  99.1      993  99.3
total            21000    13957  99.7    20850  99.3    20814  99.1
The ldig columns cover only the languages that ldig supports
Conclusions
bull Language detector using the maximal substring model
ndash Achieves over 99% accuracy for 19 languages
ndash Even langdetect reaches 97% accuracy with the tweet corpus
bull If the corpus is maintained, the precision will improve further
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to the features, the precision will improve further
ndash Open question: how to add and train on metadata at low cost
bull We want to shrink the model without loss of precision
ndash It is too large for applications (> 100MB)
References
bull [Nakatani NLP12] Twitter Language Identification Using Maximal Substrings (in Japanese)
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Chromium Compact Language Detection (CLD)
bull A port of the language detector from Google Chromium
ndash http://code.google.com/p/chromium-compact-language-detector/
ndash Implemented in C++ with a Python binding
ndash Number of supported languages: CLD = 76, langdetect = 53
ndash Accuracy: CLD = 98.82%, langdetect = 99.22%
bull for 17 languages on Europarl datasets
bull http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is twitter Language Detection difficult? (1)
bull A tweet is too short to extract enough 3-gram features
ndash At most 140 characters on twitter
ndash URLs, mentions and hashtags are not useful for detection
bull LIGA [Tromp+ 11]
ndash Graph-based features built on 3-grams
bull Adds long-distance features
bull 95~98% accuracy for twitter language detection
bull 6 languages (de, en, es, fr, it, nl)
Is twitter Language Detection difficult? (2)
bull Tweets are too noisy
ndash Spellings that violate the language's orthography often appear
ndash Acronyms, abbreviations and lengthened words (like Cooooolll)
bull The likelihood of a tweet tends to be small under a normal language model
OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
The letter k is not used in Italian
Motivation to Detect Short Text Language
bull There are many small chunks of text besides twitter
ndash Schedules, search queries, bulletin boards and so on
ndash There are many questions about short text detection on the issue tracker of the langdetect project
bull http://code.google.com/p/language-detection/issues/detail?id=10
bull Detection for mixed multi-language text
ndash Cut the target document into paragraphs or lines
ndash Detect the language of each short text
Our Goal
bull Over 99% accuracy
ndash However, it is too difficult to detect the language of a one-word sentence
ndash Our goal is 99%+ accurate detection for sentences with more than 3 words
We need
bull A model that can extract rich features from short text
ndash Maximal substring model (∞-gram logistic regression)
bull and a twitter-specific language model, or a corpus to construct it
ndash A corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
bull The larger n is, the more features there are
bull The maximum is at n=∞, that is, all substrings
ndash But the number of substrings is of order O(T^2)
Number of n-gram features up to length n
n    freq≧1   freq≧2   freq≧10
1        79       72        57
2      1896     1533       902
3     15970    10369      4525
4     64966    33941     10534
5    167543    69719     15538
6    323749   107861     18970
7    524634   142954     21093
8    760719   171995     22159
9    921361   193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
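A brute-force sketch of how such a cumulative table can be counted. It is meant only for small inputs, and the function name and the choice to count every occurrence (rather than once per tweet) are assumptions, not the procedure used for the slide.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n
    that occur at least min_freq times over all texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for gram, c in counts.items() if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    ]
```

Each text contributes on the order of max_n × T substrings here, which is why the number of features grows so quickly as n increases.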
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass logistic regression using all substrings as features
ndash Maximal substrings give an equivalent model that can be constructed in linear time
ndash Features are stored in a trie for fast prediction
Maximal Substring (1)
bull Define a containment relation (semi-order) among the non-empty substrings of a text, e.g. "abracadabra"
ndash "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
ndash "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
bull Strictly, the relation is defined together with the position inside the containing substring
Maximal Substring (2)
bull Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
bull The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
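The "abracadabra" example can be checked with a brute-force sketch. It uses the characterization that a substring is maximal when no one-character extension to the left or right occurs equally often. This is quadratic or worse and only for illustration; it is not the linear-time method of [Okanohara+ 09], and both function names are hypothetical.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, overlaps included."""
    count, start = 0, 0
    while True:
        idx = text.find(sub, start)
        if idx < 0:
            return count
        count += 1
        start = idx + 1


def maximal_substrings(text):
    """Return substrings that no single-character extension matches in frequency."""
    alphabet = set(text)
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = []
    for s in subs:
        freq = count_occurrences(text, s)
        extendable = any(
            count_occurrences(text, c + s) == freq
            or count_occurrences(text, s + c) == freq
            for c in alphabet
        )
        if not extendable:
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))
```

For example, "bra" is not maximal because "abra" occurs exactly as often (twice), while "abra" itself is maximal because "abrac" and "dabra" each occur only once.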
Maximal Substring and Infinity-Gram
bull Substrings that have a containment relationship always have equal frequencies
bull In a model that is a linear combination of features, features with identical values can be merged into one
bull Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Although the equivalence breaks down on the test set, we assume it holds approximately given a sufficiently large training set
Extended Suffix Array
bull An extended suffix array consists of
ndash SA = suffix array
ndash L = longest common prefixes
ndash B = Burrows-Wheeler transformed text
bull A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to suffixes with L > 0 whose BWT characters are of more than one type
ndash They can be calculated in linear time
bull esaxx: Okanohara's implementation of ESA
ndash http://code.google.com/p/esaxx/
via [Okanohara+ 09]
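To make the three arrays concrete, here is a naive construction in Python. It sorts the suffixes directly, so it is far from the linear-time algorithm in esaxx, and it does not reproduce the esaxx API. Wrapping around to the last character for the BWT entry of the first suffix is a convention chosen for this sketch.

```python
def extended_suffix_array(text):
    """Naively build (SA, LCP, BWT) for a short text."""
    # SA: start positions of all suffixes in lexicographic order
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # LCP: common prefix length of each suffix with the previous one in SA
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    # BWT: the character preceding each suffix (wraps around for suffix 0)
    bwt = [text[i - 1] for i in sa]
    return sa, lcp, bwt
```

In the output for "banana", the suffixes "ana" and "anana" share a prefix of length 3 and are preceded by different characters ("n" and "b"), which is the situation the slide describes for a repeated maximal substring.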
Corpus and Normalization
Target Languages
bull Limit the character type to detect
ndash In short text detection, mixed text can be divided by character type
bull Latin alphabet languages
ndash The most difficult alphabet type to detect
ndash More than 25 of these languages have over 5 million speakers
What's the Latin Alphabet?
bull Latin alphabet ≠ ASCII alphabet
ndash å ą æ ð Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Rumanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              (same as above)
Latin Alphabets in Unicode Codepoint Chart
[Chart legend: used for Vietnamese only / used often / used sometimes]
How to Create Corpus
bull Collect tweets with the sample method of the twitter Streaming API
ndash It samples 1% of all tweets (about 2 million tweets)
ndash Tweets in Latin alphabet languages account for 60% of them
bull All that remains is to annotate these tweets with language labels
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1% of all tweets
ndash But they account for 50% of the tweets in the Paris timezone
ndash (20% of all tweets have no timezone)
bull Annotate tentative labels to the tweets using langdetect
ndash Remove non-French tweets from the ones labeled 'fr'
ndash Recover French tweets from the ones not labeled 'fr'
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides, in that order]
Created Corpus
bull Noiseless tweets as training data
bull Noisy tweets with more than 3 words as test data
bull Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus
language         training    test
ca Catalan           9089    5082
cs Czech             9082    7682
da Danish            7388    5524
de German           44448   10065
en English          44520   10168
es Spanish          44118   10265
fi Finnish           8087    7050
fr French           44339   10098
hu Hungarian        10030    4904
id Indonesian       44722   10181
it Italian          43366   10152
nl Dutch            44682   10007
no Norwegian        10124    8496
pl Polish           16771   10152
pt Portuguese       44215   10208
ro Romanian         10021    5911
sv Swedish          44054   10032
tr Turkish          44703   10308
vi Vietnamese       15030   10488
total              538789  166773
Simple Language Detection
bull A language detector can be constructed from the maximal substring model and the twitter corpus
ndash But by itself it reaches at most 98% accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Bias by Data Size
bull The number of tweets varies hugely between languages
bull Level them out by sampling with replacement from each language, up to the size of the largest one
ndash In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement
[Chart: share of tweets by language; legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
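The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. `corpus` is assumed to map a language code to a non-empty list of tweets; the function name and the fixed seed are illustrative choices.

```python
import random


def balance_by_oversampling(corpus, seed=0):
    """Grow every language to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # whole copies of the data, then the remainder without replacement
        balanced[lang] = tweets * copies + rng.sample(tweets, rest)
    return balanced
```

Compared with plain sampling with replacement, this guarantees that every original tweet appears at least `copies` times in the levelled data.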
Convert to Lowercase on Multiple Languages
bull Converting to lower case reduces the amount of corpus needed and compresses the model
bull But the lower case of I (U+0049) in Turkish differs from that in other languages
bull So convert to lower case, excluding 'I'
                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
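A minimal sketch of this rule in Python: every character is lower-cased except the capital letter I (U+0049). The function name is hypothetical and the handling of İ (U+0130) is left to Python's default `str.lower`, which is an assumption rather than something the slide specifies.

```python
def lower_except_capital_i(text):
    """Lower-case text but keep 'I' (U+0049) as it is."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Keeping 'I' untouched avoids deciding between i and ı before the language is known.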
Normalization for Rumanian
bull Rumanian uses â ă î ș ț in addition to a-z
bull There are 2 kinds of "s/t with a beard"
ndash U+015E-F, U+0162-3: s/t with cedilla
ndash U+0218-B: s/t with comma below
bull 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
bull The 2 codes have the same design in some fonts
ndash Indistinguishable
comma below   cedilla
ș U+0219      ş U+015F
ț U+021B      ţ U+0163
Rumanian Character Affairs on PC
bull Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
ndash 1989: Democratization of Rumania
ndash 2001: 's/t with comma' was provided by ISO8859-16 (Latin-10) and Unicode
ndash 2007: Rumania joined the EU
ndash 2007: Windows Vista supported 's/t with comma' (available for everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
bull 's/t with cedilla' are substitute characters
ndash But they are more popular than the proper ones
ndash with cedilla : with comma = 2 : 1
ndash "Rumanian IME" outputs the substitutes too :D
bull Regard 's/t with comma' as 's/t with cedilla'
ț → ţ (U+021B → U+0163)
I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ)
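This mapping is a plain code point substitution. A sketch using `str.translate`, with the four code point pairs taken from the slides above; the table and function names are hypothetical.

```python
# comma-below forms -> cedilla forms (upper and lower case s and t)
COMMA_TO_CEDILLA = {
    0x0218: 0x015E,  # Ș -> Ş
    0x0219: 0x015F,  # ș -> ş
    0x021A: 0x0162,  # Ț -> Ţ
    0x021B: 0x0163,  # ț -> ţ
}


def normalize_romanian(text):
    """Regard 's/t with comma below' as 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)
```

The direction (comma to cedilla) follows the slide: it maps the rarer form onto the more frequent one so that both spellings yield the same features.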
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have a similar problem
bull The character 'yeh' in Farsi corresponds to 2 code points
ndash Wikipedia uses only ی (U+06CC, Farsi yeh)
ndash News uses only ي (U+064A, Arabic yeh)
bull U+064A is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped to U+06CC
ndash As 'yeh' is used very often in both languages, detection fails for almost all Persian text
bull Regard U+06CC as U+064A
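In Python this normalization is a one-line replacement. The sketch below is only an illustration of the rule on the slide; the function name is hypothetical and it is not the code of language-detection, which is a separate Java library.

```python
def normalize_yeh(text):
    """Regard Farsi yeh (U+06CC) as Arabic yeh (U+064A)."""
    return text.replace("\u06cc", "\u064a")
```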
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă â e ê i y o ô ơ u ư
bull Vietnamese has 6 tones
ndash a ả à ã á ạ
ndash These tone symbols are also used in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 × 6 = 72
Normalization for Vietnamese (2)
bull Two representations of vowels with tones
1. Use the precomposed characters U+1EA0 - U+1EF9
bull ẵ = U+1EB5
2. Combine with diacritical marks
bull ẵ = U+0103 U+0303
ndash News and tweets use the two about half and half
bull Normalize 2 into 1
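One way to turn representation 2 into representation 1 is Unicode canonical composition (NFC), available in Python's standard `unicodedata` module. Whether ldig uses NFC or its own mapping table is not stated in the slides, so treat this as an assumed stand-in; note that NFC also composes non-Vietnamese sequences.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)
```

With this, the sequence U+0103 U+0303 from the slide becomes the single code point U+1EB5.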
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters (more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that don't occur in the training corpus have no probability
bull e.g. 谢谢, Kanji for person names
ndash Common frequent characters are too strong
bull e.g. a text that contains "的" tends to be detected as Traditional Chinese
bull Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanji by frequency and normalize each group to a representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of the ASCII alphabet = 52)
ndash (2) "Commonly Used Kanji" lists provided in Japan and China
bull Simplified Chinese: 现代汉语常用字表 (3500)
bull Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
bull Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
ndash 常用漢字 contains few Kanji for person names and place names
bull Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
bull Simply remove
ndash URLs
ndash mentions
ndash hashtags
ndash RT
ndash emoticons written with letters, like XD and :p
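A sketch of this removal step with regular expressions. The patterns are simplified assumptions (for example, a URL is anything starting with http:// or https:// up to the next whitespace) and are not the patterns used by ldig; emoticon removal is left out.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_tokens(text):
    """Remove URLs, mentions, hashtags and RT markers, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

URLs are removed first so that a "#fragment" or "@" inside a URL is not treated as a hashtag or mention.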
Normalization for twitter-Specific Representation
bull How to handle words like 'coooooooollllll'?
bull Case 1: Make a normalization dictionary using [Brody+ 11]
ndash Unsupervised normalization, like coooollll → cool
ndash It can't handle words that are not in the dictionary
bull Case 2: If the same character repeats more than 3 times, shrink it to 2
ndash No language has an orthography in which the same Latin letter repeats more than 3 times
bull In Japanese, by contrast, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
bull Acronyms (like WWW, СССР) are not useful for language detection anyway
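Case 2 fits in one regular expression with a backreference. The sketch below reads the threshold as "three or more identical characters in a row", which is an assumption (exposed as `min_run`), and it applies to any character rather than only to Latin letters.

```python
import re


def shrink_repeats(text, min_run=3):
    """Shrink runs of min_run or more identical characters to two."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(r"\1\1", text)
```

Unlike the dictionary approach of Case 1, this does not recover "cool" from "coooollll"; it only bounds the run length, so the lengthened and the plain spelling still share most of their substrings.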
Laugh Normalization
bull Each language has its own ways of writing laughter
ndash HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulär 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đó làm áo được ko em
bull Shrink them to a double
ndash hahahha ⇒ haha
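A simplified sketch of laugh shrinking: a two-letter unit repeated three or more times is reduced to two repetitions. It assumes lower-cased input and regular repetition, so it does not handle an irregular spelling such as "hahahha" from the slide; the pattern is an illustration, not the rule implemented in ldig.

```python
import re

# a two-letter unit followed by at least two more copies of itself
LAUGH = re.compile(r"([a-z]{2})\1{2,}")


def shrink_laughs(text):
    """Reduce 'hahahaha', 'jajajaja', 'kekeke' and the like to a double."""
    return LAUGH.sub(r"\1\1", text)
```

Requiring at least three repetitions leaves ordinary words with a doubled syllable (such as "mama") unchanged.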
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
bull Tweet language detection for Latin alphabet languages
ndash https://github.com/shuyo/ldig
bull MIT license
bull The trained model is also distributed there
ndash ∞-gram LR (maximal substring) [Okanohara+ 09]
ndash L1 SGD (cumulative penalty) [Tsuruoka+ 09]
ndash Double array
Is twitter Language Detection difficult (1)
bull Tweet is too short to extract 3-gram features
ndash At most 140 characters on twitter
ndash URLs mentions and hashtags are not useful to
detect
bull LIGA [Tromp+ 11]
ndash Graph-features based on 3-gram
bull Add long distance features
bull 95~98 accuracy for twitter Language Detection
bull 6 languages (de en es fr it nl)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 21
Is twitter Language Detection difficult (2)
bull Tweet is too noisy
ndash Representations against the languages orthography often appear
ndash Acronym Abbreviation lengthened word (like Cooooolll)
bull Likelihood of tweet tends to get smaller on normal language model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
OMG Oh My God
LOL Laughing Out Loud
LMAO Laughing My Ass Out
F4F Follow for Follow
MDR Mort de Rire (French)
TKT Ne tlsquoInquiegravete Pas (Fr)
u you
ur your
4 for
i0u I love you
k che (Italian)
anke anche(Italian)
Letter k isnt used in Italian
22
Motivation to Detect Short Text Language
bull There are many small chunks of text in addition to twitter
ndash Schedule search query bulletin board and so on
ndash There are many questions about short text detection in the Issues Board of langdetect Project
bull httpcodegooglecomplanguage-detectionissuesdetailid=10
bull Detection for multi-language mixed text
ndash Cut the target document in paragraphs or lines
ndash Detect for each short text
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 23
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
• Group tweets by their timezone (20% of all tweets have no timezone)
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from the ones labeled 'fr'
  – Recover French tweets from the ones not labeled 'fr'
How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order
Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

     language     training    test
ca   Catalan          9089    5082
cs   Czech            9082    7682
da   Danish           7388    5524
de   German          44448   10065
en   English         44520   10168
es   Spanish         44118   10265
fi   Finnish          8087    7050
fr   French          44339   10098
hu   Hungarian       10030    4904
id   Indonesian      44722   10181
it   Italian         43366   10152
nl   Dutch           44682   10007
no   Norwegian       10124    8496
pl   Polish          16771   10152
pt   Portuguese      44215   10208
ro   Romanian        10021    5911
sv   Swedish         44054   10032
tr   Turkish         44703   10308
vi   Vietnamese      15030   10488
     total          538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily biased
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets by language. Labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
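The levelling step described on this slide can be sketched as follows. This is a minimal illustration, not ldig's code: the function name `level_by_resampling`, the dict-of-lists corpus layout and the fixed seed are assumptions made for the example, and every language is assumed to have at least one text.

```python
import random


def level_by_resampling(corpus, seed=0):
    """Return a copy of `corpus` in which every language has as many
    texts as the largest one (hypothetical helper, not ldig's code)."""
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    leveled = {}
    for lang, texts in corpus.items():
        copies, rest = divmod(target, len(texts))
        # whole copies of the data, plus a remainder drawn without replacement
        leveled[lang] = texts * copies + rng.sample(texts, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_by_resampling(corpus)
print({lang: len(texts) for lang, texts in leveled.items()})
```

Copying whole multiples keeps every original text represented equally often, so only the remainder is left to chance.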
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
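A minimal sketch of lowercasing that leaves 'I' untouched, as the slide proposes. The function name is hypothetical, and the slide does not say how İ (U+0130) is treated, so that part is left to Python's default behaviour here.

```python
def lower_except_i(text):
    """Lowercase every character except 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL Is Big"))
```

Note that Python's `str.lower()` maps İ (U+0130) to "i" followed by U+0307 (combining dot above), so a real normalizer may want an explicit rule for that letter as well.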
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of characters for s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, Twitter and Wikipedia
• The two kinds have the same design in some fonts
  – Indistinguishable
    ș (U+0219)   ş (U+015F)
    ț (U+021B)   ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)
's/t with cedilla' is used on an advertisement board in Bucharest
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
    ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two written forms of the Japanese character 'SA' (さ)
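The mapping proposed above (comma-below forms folded into the cedilla forms) is small enough to write as a translation table. The names `COMMA_TO_CEDILLA` and `normalize_romanian` are illustrative only.

```python
# Comma-below forms -> cedilla forms (the direction proposed on the slide).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Fold 's/t with comma' into 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("cuno\u0219tin\u021b\u0103"))
```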
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use the precomposed characters in U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – The two appear about half and half in news and tweets
• Normalize 2 into 1
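One way to fold representation 2 into representation 1 is Unicode NFC normalization from Python's standard library. The slides do not say which mechanism ldig uses, so NFC is a swapped-in standard technique here and `compose_vietnamese` is a hypothetical name.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a-breve followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

For the slide's example the two code points U+0103 U+0303 become the single code point U+1EB5.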
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to its representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists defined for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
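The removal rules listed above can be sketched with a few regular expressions. The exact patterns are assumptions made for illustration (the slides only name the categories), and `strip_twitter_markup` is a hypothetical helper, not ldig's function.

```python
import re

# Illustrative patterns only; the slides name the categories, not the regexes.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;=]-?[dDpP])(?!\w)")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags, RT and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace


print(strip_twitter_markup("RT @user check http://t.co/abc #fun XD"))
```

URLs are removed first so that a `#fragment` inside a link is not treated as a hashtag.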
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character continues 3 or more times, shrink it to 2
  – No language has 3 or more of the same Latin letter in a row in its orthography
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
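Case 2 fits in a single regular expression. In this sketch the run threshold is a parameter because the slide's wording leaves the exact cut-off open; `min_run=3` (runs of three or more) is an assumption, and `shrink_runs` is a hypothetical name.

```python
import re


def shrink_runs(text, min_run=3, keep=2):
    """Shrink every run of `min_run` or more identical characters to `keep`."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(1) * keep, text)


print(shrink_runs("coooooooollllll"))
print(shrink_runs("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary of Case 1, this yields "cooll" rather than "cool", but it needs no vocabulary and maps all lengthened variants of a word to the same string.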
Laugh Normalization
• Each language has various ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
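A rough sketch of laugh shrinking with one regular expression. The consonant set `[hjk]` and vowel set `[aeiou]` are assumptions chosen to cover the examples above, and the text is assumed to be lowercased already; this is not presented as ldig's actual rule.

```python
import re

# A consonant, a vowel, then one or more repetitions of the same pair.
# Doubled letters inside the laugh (as in "hahahha") are tolerated.
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a repeated laugh syllable to exactly two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jajajajajajaja", "malo jejejeje xp", "kekeke"):
    print(sample, "=>", shrink_laugh(sample))
```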
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output is "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to apply regularization to all the parameters
    • There are too many parameters to regularize all of them at every step
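The cumulative-penalty idea of [Tsuruoka+ 09] is what makes it possible to skip most parameters at each step: a running total `u` of the penalty every weight could have received is kept, and a weight only catches up when it is touched. The sketch below shows that per-weight update; the function name and the plain-float interface are assumptions for illustration, not ldig's code.

```python
def apply_cumulative_penalty(w, q, u):
    """Clip one weight with the cumulative L1 penalty.

    w: current weight (after the gradient step)
    q: penalty this weight has actually received so far
    u: total penalty any weight could have received so far
    Returns the updated (w, q).
    """
    z = w
    if w > 0:
        w = max(0.0, w - (u + q))
    elif w < 0:
        w = min(0.0, w + (u - q))
    return w, q + (w - z)


w, q = apply_cumulative_penalty(0.5, 0.0, 0.1)   # first time the weight is seen
w, q = apply_cumulative_penalty(w, q, 0.2)       # later, after u has grown
print(w, q)
```

In a training loop `u` would grow by (learning rate × regularizer) at every step, with the exact scaling depending on the implementation; only the weights of features present in the current example need this call.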
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
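A small sketch of reading one line in this tab-separated format. It assumes that the label is the first field, the text is the last field, and everything in between is metadata; `parse_line` is a hypothetical helper.

```python
def parse_line(line):
    """Split '[label]<TAB>[meta data ...]<TAB>[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text


print(parse_line("en\t\tu should just enjoy ur vacation\n"))
```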
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it is executed
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Estimation
LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"

     language      size   detect  correct  precision  recall  LD53  LDsm
ca   Catalan       5093     4923     4857      98.66   95.37  95.3  97.0
cs   Czech         7681     7668     7663      99.93   99.77  96.3  99.7
da   Danish        5516     5472     5310      97.04   96.27  94.5  92.4
de   German       10060    10069    10006      99.37   99.46  86.6  93.8
en   English      10162    10133    10029      98.97   98.69  88.3  95.0
es   Spanish      10244    10284    10120      98.41   98.79  91.5  96.0
fi   Finnish       7051     7038     7024      99.80   99.62  98.9  99.6
fr   French       10074    10134    10051      99.18   99.77  95.0  98.1
hu   Hungarian     4904     4892     4858      99.30   99.06  85.8  95.5
id   Indonesian   10178    10225    10160      99.36   99.82  89.7  98.9
it   Italian      10143    10205    10103      99.00   99.61  96.2  98.0
nl   Dutch        10005     9916     9858      99.42   98.53  69.5  97.4
no   Norwegian     8504     8432     8201      97.26   96.44  96.0  96.3
pl   Polish       10151    10149    10130      99.81   99.79  98.0  99.7
pt   Portuguese   10212    10201    10119      99.20   99.09  88.0  96.9
ro   Romanian      5913     5867     5850      99.71   98.93  92.8  97.4
sv   Swedish      10025    10093     9942      98.50   99.17  96.0  97.9
tr   Turkish      10308    10317    10298      99.82   99.90  97.6  99.5
vi   Vietnamese   10487    10480    10474      99.94   99.88  98.7  99.2
     total       166711            165053      99.01 (accuracy)  92.2  97.4
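The columns of this table are related in a simple way: precision is correct/detect and recall is correct/size, and a text is left undetected when its best probability is below 0.6. The sketch below reproduces the Catalan row; both function names are hypothetical.

```python
def decide(probabilities, threshold=0.6):
    """Return the most probable label, or None if it is below `threshold`."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None


def precision_recall(size, detect, correct):
    """Percentages as in the table: correct/detect and correct/size."""
    return round(100.0 * correct / detect, 2), round(100.0 * correct / size, 2)


print(decide({"da": 0.55, "no": 0.45}))      # undetectable: best is below 0.6
print(precision_recall(5093, 4923, 4857))    # Catalan row
```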
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

     language    size  detect  correct  precision  recall
de   German      1479    1476     1469       99.5    99.3
en   English     1505    1502     1490       99.2    99.0
es   Spanish     1562    1548     1541       99.6    98.7
fr   French      1551    1549     1540       99.4    99.3
it   Italian     1539    1531     1528       99.8    99.3
nl   Dutch       1430    1429     1424       99.7    99.6
     total       9066             8992       99.2 (accuracy)
Estimation for Europarl Dataset
The ldig columns are filled only for the languages that ldig supports

                           ldig            langdetect      CLD
     language      size    correct  rate   correct  rate   correct  rate
bg   Bulgarian     1000                        988  98.8       991  99.1
cs   Czech         1000       1000 100.0       994  99.4       995  99.5
da   Danish        1000        976  97.6       968  96.8       932  93.2
de   German        1000        999  99.9       998  99.8      1000 100.0
el   Greek         1000                       1000 100.0      1000 100.0
en   English       1000        999  99.9       996  99.6      1000 100.0
es   Spanish       1000       1000 100.0       996  99.6       989  98.9
et   Estonian      1000                        996  99.6       998  99.8
fi   Finnish       1000        997  99.7       998  99.8      1000 100.0
fr   French        1000        999  99.9       999  99.9       992  99.2
hu   Hungarian     1000       1000 100.0       999  99.9       999  99.9
it   Italian       1000        999  99.9       999  99.9       996  99.6
lt   Lithuanian    1000                        997  99.7       999  99.9
lv   Latvian       1000                        999  99.9       998  99.8
nl   Dutch         1000       1000 100.0       974  97.4       995  99.5
pl   Polish        1000        998  99.8       999  99.9       997  99.7
pt   Portuguese    1000        995  99.5       996  99.6       989  98.9
ro   Romanian      1000       1000 100.0       999  99.9       998  99.8
sk   Slovak        1000                        988  98.8       990  99.0
sl   Slovene       1000                        976  97.6       963  96.3
sv   Swedish       1000        995  99.5       991  99.1       993  99.3
     total        21000      13957  99.7     20850  99.3     20814  99.1
Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will still go up
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will still go up
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (original title in Japanese: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Is Twitter Language Detection Difficult? (2)
• Tweets are too noisy
  – Expressions that go against the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be lower under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'inquiète pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)
Note: the letter k isn't used in Italian
Motivation to Detect Short Text Language
• There are many small chunks of text besides Twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is of order O(T^2)

Number of n-gram features: cumulative distribution over feature length for 5090 normalized English tweets (300KB)

n    freq≥1    freq≥2   freq≥10
1        79        72        57
2      1896      1533       902
3     15970     10369      4525
4     64966     33941     10534
5    167543     69719     15538
6    323749    107861     18970
7    524634    142954     21093
8    760719    171995     22159
9    921361    193995     22696
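A table like the one above can be produced by counting character n-grams cumulatively over n and filtering by a frequency threshold. This is a minimal sketch with a hypothetical function name; it counts occurrences over the whole corpus, which is an assumption about how the frequencies on the slide were measured.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count the distinct character k-grams (k <= n)
    that occur at least `min_freq` times in `texts`."""
    counts = Counter()
    totals = []
    for n in range(1, max_n + 1):
        for text in texts:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        totals.append(sum(1 for f in counts.values() if f >= min_freq))
    return totals


print(cumulative_ngram_counts(["abab"], 3))
print(cumulative_ngram_counts(["abab"], 3, min_freq=2))
```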
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Store the features in a TRIE for fast prediction
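To make the model concrete, here is a toy prediction routine for multiclass logistic regression over substring features. It is only a sketch: a plain dict stands in for the TRIE, the weights are made up, and `predict` is a hypothetical name.

```python
import math


def predict(text, weights, labels):
    """Score `text` with substring features; return (label, probabilities).

    `weights` maps a substring feature to one weight per label.
    """
    max_len = max(len(feature) for feature in weights)
    scores = [0.0] * len(labels)
    for i in range(len(text)):
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            w = weights.get(text[i:j])
            if w is not None:  # each occurrence of a feature adds its weights
                scores = [s + x for s, x in zip(scores, w)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # softmax, shifted for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))], probs


labels = ["nl", "de"]
weights = {"ik": [2.0, 0.0], "ich": [0.0, 2.0]}  # made-up toy weights
print(predict("ik kan er nooit tegen", weights, labels)[0])
print(predict("aha ich seh", weights, labels)[0])
```

A TRIE or double array replaces the inner dict lookups with a single walk from each start position, which is what makes prediction fast with millions of features.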
Maximal Substring (1)
• Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra"
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly speaking, the relation is defined together with the position inside the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
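The "abracadabra" example can be checked by brute force. The sketch below uses the observation that a substring is maximal exactly when extending it by one character on either side lowers its frequency; it is quadratic and meant only for tiny inputs, and the function names are hypothetical.

```python
def count_overlapping(text, s):
    """Number of (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def maximal_substrings(text):
    """Substrings whose frequency drops under every one-character extension."""
    alphabet = set(text)
    substrings = {text[i:j] for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    result = []
    for s in substrings:
        freq = count_overlapping(text, s)
        if all(count_overlapping(text, c + s) < freq and
               count_overlapping(text, s + c) < freq for c in alphabet):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```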
Maximal Substring and Infinity-Gram
• Frequencies of substrings that are in a containment relation are always equal
• In a model with a linear combination of features, features with common values can be merged into one
• Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set
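The merging argument can be written out as follows. The notation is an assumption introduced for this note: C is one equivalence class with maximal substring m, φ_s(x) is the frequency of substring s in text x, and w_{s,y} is the weight of s for language y. On the training data φ_s(x) = φ_m(x) for every s in C, so the class contributes a single term to the linear score:

```latex
\sum_{s \in C} w_{s,y}\,\phi_s(x)
  = \Bigl(\sum_{s \in C} w_{s,y}\Bigr)\,\phi_m(x)
  = \tilde{w}_{m,y}\,\phi_m(x),
\qquad
P(y \mid x)
  = \frac{\exp\bigl(\sum_{m} \tilde{w}_{m,y}\,\phi_m(x)\bigr)}
         {\sum_{y'} \exp\bigl(\sum_{m} \tilde{w}_{m,y'}\,\phi_m(x)\bigr)}
```

Here the sum over m runs over maximal substrings only, which is why the model with maximal substrings gives the same probabilities as the one with all substrings on texts where the frequencies coincide.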
Extended Suffix Array
• An Extended Suffix Array (ESA) consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to suffixes with L > 0 whose BWT has more than one character type
  – These can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
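The criterion above can be tried out with a deliberately naive construction. The sketch below builds SA, LCP and BWT by sorting suffixes (not in linear time, unlike esaxx), then reports each LCP interval whose BWT characters are of more than one type. The sentinel "$" is assumed not to occur in the text and to sort before every other character; all names are hypothetical.

```python
def enhanced_suffix_array(text, sentinel="$"):
    """Naive construction of SA, LCP and BWT (illustration only)."""
    t = text + sentinel
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = [0] * len(t)  # lcp[k]: common prefix of suffixes sa[k-1] and sa[k]
    for k in range(1, len(t)):
        a, b = t[sa[k - 1]:], t[sa[k]:]
        while lcp[k] < min(len(a), len(b)) and a[lcp[k]] == b[lcp[k]]:
            lcp[k] += 1
    bwt = "".join(t[i - 1] for i in sa)  # t[-1] is the sentinel when i == 0
    return t, sa, lcp, bwt


def repeated_maximal_substrings(text):
    """Maximal substrings that occur more than once."""
    t, sa, lcp, bwt = enhanced_suffix_array(text)
    found = set()
    for i in range(1, len(t)):
        depth = lcp[i]
        if depth == 0:
            continue
        lo, hi = i - 1, i  # grow the interval sharing a prefix of length depth
        while lo > 0 and lcp[lo] >= depth:
            lo -= 1
        while hi + 1 < len(t) and lcp[hi + 1] >= depth:
            hi += 1
        if len(set(bwt[lo:hi + 1])) > 1:  # preceded by different characters
            found.add(t[sa[i]:sa[i] + depth])
    return sorted(found)


print(repeated_maximal_substrings("abracadabra"))
```

For "abracadabra" this finds "a" and "abra"; the whole text, which occurs only once, is the remaining maximal substring.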
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

    language      size  detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca  Catalan       5093    4923     4857         98.66      95.37     95.3     97.0
cs  Czech         7681    7668     7663         99.93      99.77     96.3     99.7
da  Danish        5516    5472     5310         97.04      96.27     94.5     92.4
de  German       10060   10069    10006         99.37      99.46     86.6     93.8
en  English      10162   10133    10029         98.97      98.69     88.3     95.0
es  Spanish      10244   10284    10120         98.41      98.79     91.5     96.0
fi  Finnish       7051    7038     7024         99.80      99.62     98.9     99.6
fr  French       10074   10134    10051         99.18      99.77     95.0     98.1
hu  Hungarian     4904    4892     4858         99.30      99.06     85.8     95.5
id  Indonesian   10178   10225    10160         99.36      99.82     89.7     98.9
it  Italian      10143   10205    10103         99.00      99.61     96.2     98.0
nl  Dutch        10005    9916     9858         99.42      98.53     69.5     97.4
no  Norwegian     8504    8432     8201         97.26      96.44     96.0     96.3
pl  Polish       10151   10149    10130         99.81      99.79     98.0     99.7
pt  Portuguese   10212   10201    10119         99.20      99.09     88.0     96.9
ro  Romanian      5913    5867     5850         99.71      98.93     92.8     97.4
sv  Swedish      10025   10093     9942         98.50      99.17     96.0     97.9
tr  Turkish      10308   10317    10298         99.82      99.90     97.6     99.5
vi  Vietnamese   10487   10480    10474         99.94      99.88     98.7     99.2
total: size 166711, correct 165053, accuracy 99.01%, LD53 92.2%, LDsm 97.4%
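A sketch of how the columns relate, assuming precision = correct / detect and recall = correct / size (this matches the Catalan row). The helper names are hypothetical, and the 0.6 cut-off is the one stated in the note on the slide.

```python
def pick_language(probabilities, threshold=0.6):
    """Return the most probable label, or None when the text is undetectable.

    `probabilities` maps language code -> probability. A text whose best
    probability is below `threshold` is treated as undetectable.
    """
    label, prob = max(probabilities.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None


def precision_recall(size, detect, correct):
    """Precision and recall in percent.

    size:    number of test texts whose true label is this language
    detect:  number of texts the detector labeled as this language
    correct: number of texts in both groups
    """
    return 100.0 * correct / detect, 100.0 * correct / size


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(round(precision, 2), round(recall, 2))
```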
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

    language   size  detect  correct  precision(%)  recall(%)
de  German     1479    1476     1469          99.5       99.3
en  English    1505    1502     1490          99.2       99.0
es  Spanish    1562    1548     1541          99.6       98.7
fr  French     1551    1549     1540          99.4       99.3
it  Italian    1539    1531     1528          99.8       99.3
nl  Dutch      1430    1429     1424          99.7       99.6
total: size 9066, correct 8992, accuracy 99.2%
Estimation for Europarl Dataset
("-" marks a language that ldig does not support; the ldig total covers only the languages ldig supports)

                        ldig              langdetect        CLD
    language     size   correct  rate(%)  correct  rate(%)  correct  rate(%)
bg  Bulgarian    1000         -        -      988     98.8      991     99.1
cs  Czech        1000      1000    100.0      994     99.4      995     99.5
da  Danish       1000       976     97.6      968     96.8      932     93.2
de  German       1000       999     99.9      998     99.8     1000    100.0
el  Greek        1000         -        -     1000    100.0     1000    100.0
en  English      1000       999     99.9      996     99.6     1000    100.0
es  Spanish      1000      1000    100.0      996     99.6      989     98.9
et  Estonian     1000         -        -      996     99.6      998     99.8
fi  Finnish      1000       997     99.7      998     99.8     1000    100.0
fr  French       1000       999     99.9      999     99.9      992     99.2
hu  Hungarian    1000      1000    100.0      999     99.9      999     99.9
it  Italian      1000       999     99.9      999     99.9      996     99.6
lt  Lithuanian   1000         -        -      997     99.7      999     99.9
lv  Latvian      1000         -        -      999     99.9      998     99.8
nl  Dutch        1000      1000    100.0      974     97.4      995     99.5
pl  Polish       1000       998     99.8      999     99.9      997     99.7
pt  Portuguese   1000       995     99.5      996     99.6      989     98.9
ro  Romanian     1000      1000    100.0      999     99.9      998     99.8
sk  Slovak       1000         -        -      988     98.8      990     99.0
sl  Slovene      1000         -        -      976     97.6      963     96.3
sv  Swedish      1000       995     99.5      991     99.1      993     99.3
total           21000     13957     99.7    20850     99.3    20814     99.1
Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How to add and train metadata at low cost?
• Desire to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Motivation to Detect Short Text Language
• There are many small chunks of text besides twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text
Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of features is of order O(T²)

n of n-gram    freq≧1    freq≧2   freq≧10
1                  79        72        57
2                1896      1533       902
3               15970     10369      4525
4               64966     33941     10534
5              167543     69719     15538
6              323749    107861     18970
7              524634    142954     21093
8              760719    171995     22159
9              921361    193995     22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
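A small sketch of how such a cumulative count could be produced: count every character n-gram up to a maximum length and report, for each n, how many distinct n-grams of length ≤ n reach a frequency threshold. The function name and the toy input are illustrative; the corpus behind the table is not reproduced here.

```python
from collections import Counter


def cumulative_ngram_features(texts, max_n, min_freq=1):
    """Number of distinct character n-grams of length <= n, for n = 1..max_n.

    Only n-grams occurring at least `min_freq` times in total are counted.
    """
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1

    totals, running = [], 0
    for n in range(1, max_n + 1):
        running += sum(1 for gram, c in counts.items()
                       if len(gram) == n and c >= min_freq)
        totals.append(running)
    return totals


print(cumulative_ngram_features(["abab"], max_n=2))              # freq >= 1
print(cumulative_ngram_features(["abab"], max_n=2, min_freq=2))  # freq >= 2
```

The nested loops make the quadratic growth visible: letting max_n reach the text length enumerates every substring.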
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Using maximal substrings yields an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction
Maximal Substring (1)
• Define a containment relation (semi-order) among non-empty substrings
  Example: abracadabra
  – “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”
  – Strictly, the relation is defined together with the position within the containing substring
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of abracadabra are “a”, “abra” and “abracadabra”
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
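The abracadabra example can be checked with a brute-force sketch. It uses an equivalent characterization that is an assumption of this sketch rather than wording from the slides: a substring is maximal when extending it by one character on either side always lowers its frequency. This enumerates all substrings, so it is quadratic and only meant for tiny inputs.

```python
from collections import Counter


def maximal_substrings(text):
    """Brute-force maximal substrings of `text` (illustrative, O(n^2) substrings)."""
    # Frequency of every substring, counting overlapping occurrences.
    counts = Counter(text[i:j]
                     for i in range(len(text))
                     for j in range(i + 1, len(text) + 1))
    alphabet = set(text)
    result = set()
    for sub, freq in counts.items():
        extensions = [c + sub for c in alphabet] + [sub + c for c in alphabet]
        # Maximal: no one-character extension occurs as often as `sub` itself.
        if all(counts.get(ext, 0) < freq for ext in extensions):
            result.add(sub)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```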
Maximal Substring and Infinity-Gram
• Substrings in the same containment class always have equal frequencies
• In a model with a linear combination of features, features that always share the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Although the equivalence breaks down on the test set, we assume it is approximated well given a sufficiently large training set
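The merging step can be written out as follows. The notation is introduced here for illustration and does not appear on the slides: $x_s(d)$ is the frequency of substring $s$ in document $d$, $w_s$ its weight, and $C(m)$ the equivalence class whose maximal substring is $m$. On the training set $x_s(d) = x_m(d)$ for every $s \in C(m)$, so

```latex
\sum_{s} w_s \, x_s(d)
  = \sum_{m} \sum_{s \in C(m)} w_s \, x_s(d)
  = \sum_{m} \Bigl( \sum_{s \in C(m)} w_s \Bigr) x_m(d)
  = \sum_{m} \tilde{w}_m \, x_m(d),
\qquad
\tilde{w}_m = \sum_{s \in C(m)} w_s .
```

The linear score over all substrings therefore equals a linear score over maximal substrings only, with one merged weight per class. For a test document the equality $x_s(d) = x_m(d)$ is no longer guaranteed, which is why the equivalence holds only approximately there.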
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
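To make the three arrays concrete, here is a deliberately naive sketch that builds them for a short string. It is not the linear-time construction that esaxx uses; the function names and the "$" marker for the position before the first character are assumptions of this sketch.

```python
def suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Length of the common prefix of each pair of adjacent suffixes in `sa`."""
    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [common_prefix(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]


def bwt(text, sa, sentinel="$"):
    """Character preceding each suffix, in suffix-array order."""
    return "".join(text[i - 1] if i > 0 else sentinel for i in sa)


text = "abracadabra"
sa = suffix_array(text)
print(sa)
print(lcp_array(text, sa))
print(bwt(text, sa))
```

In the output, the suffixes "abra" and "abracadabra" are adjacent with a common prefix of length 4 and are preceded by different characters ("d" and the start marker), which is the signature of the repeated maximal substring "abra".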
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers
What's Latin Alphabet
• Latin alphabet ≠ ascii alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ascii
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these (with Latin Extended-A)
U+0100-017F    Latin Extended-A              Most languages are covered with these (with Latin-1 Supplement)
U+0180-024F    Latin Extended-B              Rumanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              These aren’t used by almost all present languages (with Latin Extended-D)
U+A720-A7FF    Latin Extended-D              These aren’t used by almost all present languages (with Latin Extended-C)
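A short sketch that looks up which of these nine blocks a character falls into. The ranges are copied from the table; `latin_block` is a hypothetical helper name.

```python
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]


def latin_block(ch):
    """Name of the Latin-related Unicode block containing `ch`, or None."""
    code_point = ord(ch)
    for low, high, name in LATIN_BLOCKS:
        if low <= code_point <= high:
            return name
    return None


for ch in "aåąșẵあ":
    print(repr(ch), latin_block(ch))
```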
Latin Alphabets in Unicode Codepoint Chart
[Figure: codepoint chart; legend: for Vietnamese only / use often / use sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – Samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
How to annotate
[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides, in turn]
Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Catalan corpus created together with Raúl Velaz and Hiroshi Manabe

    language     training    test
ca  Catalan          9089    5082
cs  Czech            9082    7682
da  Danish           7388    5524
de  German          44448   10065
en  English         44520   10168
es  Spanish         44118   10265
fi  Finnish          8087    7050
fr  French          44339   10098
hu  Hungarian       10030    4904
id  Indonesian      44722   10181
it  Italian         43366   10152
nl  Dutch           44682   10007
no  Norwegian       10124    8496
pl  Polish          16771   10152
pt  Portuguese      44215   10208
ro  Romanian        10021    5911
sv  Swedish         44054   10032
tr  Turkish         44703   10308
vi  Vietnamese      15030   10488
total              538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
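A minimal sketch of that approximation, assuming each language's training data is a non-empty list of tweets. `level_out` is a hypothetical helper, not the ldig implementation.

```python
import random


def level_out(samples, target_size, rng=None):
    """Grow `samples` (non-empty) to exactly `target_size` items.

    Whole copies of the data cover the integer multiple; the remainder
    is drawn without replacement, so no item is over-represented by
    more than one extra copy.
    """
    rng = rng or random.Random()
    samples = list(samples)
    copies, remainder = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, remainder)


small = ["tweet-a", "tweet-b", "tweet-c"]
leveled = level_out(small, target_size=8, rng=random.Random(0))
print(len(leveled))
```

Compared with plain sampling with replacement, every original tweet is guaranteed to appear, and the counts differ by at most one.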
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
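A sketch of lowercasing that leaves ‘I’ untouched. Mapping İ (U+0130) straight to i follows the table above but is handled explicitly here as an assumption, because Python's `str.lower()` turns İ into "i" plus a combining dot (U+0307).

```python
def lower_except_capital_i(text):
    """Lowercase `text`, but keep 'I' (U+0049) as it is."""
    out = []
    for ch in text:
        if ch == "I":
            out.append(ch)          # ambiguous: i in most languages, ı in Turkish
        elif ch == "\u0130":
            out.append("i")         # İ -> i, avoiding the extra combining dot
        else:
            out.append(ch.lower())
    return "".join(out)


print(lower_except_capital_i("IĞDIR Kitap"))
print(lower_except_capital_i("İstanbul"))
```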
Normalization for Rumanian
• Rumanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of characters for s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular on news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
  – Indistinguishable
ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)
Rumanian Character Affairs on PC
• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently
  – 1989: Democratization in Rumania
  – 2001: ‘s/t with comma’ was provided by ISO8859-16 (Latin-10) and Unicode
  – 2007: Rumania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)
[Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
  – But they are more popular than the official ones
  – with cedilla : with comma = 2 : 1
  – The “Rumanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two glyph forms of the Japanese character ‘SA’ (さ さ)
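A sketch of this mapping with a translation table covering the code points listed on the slides (U+0218-B to U+015E-F and U+0162-3). The names are illustrative.

```python
# comma-below forms -> cedilla forms (the more popular substitutes)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Regard 's/t with comma below' as 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```

Mapping toward the cedilla forms follows the slide's choice of the more frequent variant, so most existing text needs no change.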
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06cc, Farsi yeh)
  – News uses only ي (U+064a, Arabic yeh)
• U+064a is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06cc
  – As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06cc as U+064a
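The same idea as a one-entry translation table; a sketch with illustrative names, not the code of language-detection itself.

```python
FARSI_YEH_TO_ARABIC_YEH = str.maketrans({"\u06CC": "\u064A"})


def normalize_yeh(text):
    """Regard Farsi yeh (U+06CC) as Arabic yeh (U+064A)."""
    return text.translate(FARSI_YEH_TO_ARABIC_YEH)


print([hex(ord(ch)) for ch in normalize_yeh("\u06CC\u064A")])
```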
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1ea0 - U+1ef9
     • ẵ = U+1eb5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
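One way to sketch "normalize 2 into 1" is Unicode NFC composition from Python's standard library, which turns a base vowel followed by combining marks into the precomposed character where one exists. Using NFC here is an assumption of this sketch; the slides do not say how the normalization is implemented.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining diacritical marks (NFC)."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"            # ă followed by a combining tilde
composed = normalize_vietnamese(combined)
print(len(combined), len(composed), hex(ord(composed)))
```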
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ascii alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 doesn’t include many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URL
  – mention
  – hash tag
  – RT
  – face marks using alphabet letters, like XD :p
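A rough sketch of these removals with regular expressions. The patterns are illustrative guesses at what counts as a URL, mention, hash tag, RT marker or alphabetic face mark; they are not taken from ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b:?")
# Face marks written with letters, such as XD, xDDD, :p, ;D
FACE = re.compile(r"(?<!\w)(?:[:;]-?[dp]|xd+)(?!\w)", re.IGNORECASE)


def normalize_tweet(text):
    """Strip twitter-specific tokens and collapse the leftover whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(normalize_tweet("RT @bob: so true XD http://t.co/abc #fun"))
print(normalize_tweet("lol :p see you"))
```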
Normalization for twitter-Specific Representation
• How to handle expressions like ‘coooooooollllll’
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has an orthography in which the same Latin letter repeats 3 or more times
    • In Japanese there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
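Case 2 fits in a single regular expression. This sketch assumes "3 or more repetitions" is the intended threshold and applies the rule to every character, not only Latin letters.

```python
import re

REPEATED = re.compile(r"(.)\1{2,}")   # one character followed by 2+ copies of itself


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

The result is not always a dictionary word ("cooll" rather than "cool"), but training and test text are mapped the same way, which is what matters for the features.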
Laugh Normalization
• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
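A sketch of laugh shrinking that also tolerates typos such as "hahahha". The choice of consonants (h, j, k) and vowels (a, e, i, o) is an assumption based on the examples above, not a rule stated in the slides.

```python
import re

# consonant(s) + vowel(s), followed by one or more sloppy repetitions of the pair
LAUGH = re.compile(r"([hjk])+([aeio])+(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a repeated laugh such as 'jajajaja' to two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ["hahahha", "jejejeje", "hihihihi", "kekeke"]:
    print(sample, "->", shrink_laugh(sample))
```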
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • Replace TABs with “ ” and line feeds with U+0001 in it
  – Output lines have the form “[feature]\t[frequency]”
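A sketch of the two conversions around the extractor, based only on the format described above. Both helper names are hypothetical, and the output parser assumes one "[feature]\t[frequency]" record per line.

```python
def to_extractor_input(text):
    """Replace TABs with a space and line feeds with U+0001."""
    return text.replace("\t", " ").replace("\n", "\u0001")


def parse_extractor_output(output):
    """Read '[feature]\\t[frequency]' lines into a dict of feature -> count."""
    features = {}
    for row in output.split("\n"):
        if not row:
            continue
        # Split on the last TAB; the feature itself contains none after conversion.
        feature, frequency = row.rsplit("\t", 1)
        features[feature] = int(frequency)
    return features


print(repr(to_extractor_input("a\tb\nc")))
print(parse_extractor_output("ab\t3\nc d\t12\n"))
```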
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove useless features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Our Goal
bull Over 99 accuracy
ndash However it is too difficult to detect one
word sentence
ndash Our Goal is 99+ accurate detection for
sentence with more than 3 words
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 24
We need
bull Rich feature extractable model from
short text
ndash Maximal substring model
(infin-gram Logistic Regression)
bull and twitter-specific Language model
or Corpus to construct it
ndash about 700K tweet corpus with language
label
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 25
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Diacritical Marks
    • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
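One way to try "normalize 2 into 1" is Unicode NFC composition from the Python standard library. This is a stand-in chosen for the sketch: the slides do not say how the conversion is implemented, and NFC also composes non-Vietnamese sequences. `normalize_vietnamese` is a hypothetical name.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose base letters and combining tone marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)


# The example from the slide: U+0103 (a with breve) + U+0303 (combining tilde).
decomposed = "\u0103\u0303"
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

Text that is already precomposed passes through unchanged, so the function can be applied to every input without checking its form first.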
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
• As Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ascii alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 doesn't include many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hash tags
  – RT
  – face marks using alphabet like XD :p
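The removal list above could look roughly like this in Python. The regular expressions are assumptions made for this sketch and are not the patterns used by ldig; in particular, the face-mark pattern covers only a few emoticons such as XD and :p.

```python
import re

# Hypothetical patterns, one per item of the removal list.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b:?")
FACE_RE = re.compile(r"(?<!\S)(?:[:;]-?[()DPp]|[xX]D+)(?!\S)")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hash tags, RT markers and simple face marks."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(strip_twitter_noise("RT @foo: check http://t.co/abc #nlp so cool XD"))
```

URLs are removed first so that a '#' or '@' inside a URL is not mistaken for a hash tag or a mention.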
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character continues 3 or more times, shrink it to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • In Japanese, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
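Case 2 fits in a single regular expression. The sketch below assumes that "3 or more" is the intended threshold and applies the rule to every character, not only to Latin letters; `shrink_repeats` is a hypothetical name.

```python
import re

# A character followed by two or more copies of itself (a run of 3+).
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

The result is 'cooll' instead of the dictionary form 'cool', but every lengthened variant of the same word collapses to the same string, which is what matters for feature extraction.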
Laugh Normalization
• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to double
  – hahahha ⇒ haha
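A rough sketch of laugh shrinking, assuming the text has already been lower-cased and that laughs are built from one of the consonants h, j, k plus a vowel. The pattern is an illustration written for this example; it tolerates typos such as the doubled 'h' in 'hahahha'.

```python
import re

# consonant(s) + vowel(s), followed by one or more repetitions of the same pair.
LAUGH_RE = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text: str) -> str:
    """Shrink a repeated laugh such as 'hahahha' to two syllables ('haha')."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jajajajajajaja", "kekeke"):
    print(sample, "=>", shrink_laugh(sample))
```

The consonant set is a design choice: widening it catches more laughs but also raises the chance of rewriting ordinary words.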
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output as "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
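A line in this format could be read as follows. The sketch assumes that fields are separated by TAB characters, that the first field is the label, that the last field is the text, and that everything in between is metadata (possibly empty, as in the 'en' examples). `Example` and `parse_line` are hypothetical names.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    label: str
    metadata: List[str]
    text: str


def parse_line(line: str) -> Example:
    """Split one '[label]\\t[meta data]\\t[text]' line into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    # First field is the label, last field is the text, the rest is metadata.
    return Example(fields[0], fields[1:-1], fields[-1])


print(parse_line("ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\txDDD no"))
```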
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Estimation
language        size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan      5093    4923    4857     98.66         95.37      95.3     97.0
cs Czech        7681    7668    7663     99.93         99.77      96.3     99.7
da Danish       5516    5472    5310     97.04         96.27      94.5     92.4
de German       10060   10069   10006    99.37         99.46      86.6     93.8
en English      10162   10133   10029    98.97         98.69      88.3     95.0
es Spanish      10244   10284   10120    98.41         98.79      91.5     96.0
fi Finnish      7051    7038    7024     99.80         99.62      98.9     99.6
fr French       10074   10134   10051    99.18         99.77      95.0     98.1
hu Hungarian    4904    4892    4858     99.30         99.06      85.8     95.5
id Indonesian   10178   10225   10160    99.36         99.82      89.7     98.9
it Italian      10143   10205   10103    99.00         99.61      96.2     98.0
nl Dutch        10005   9916    9858     99.42         98.53      69.5     97.4
no Norwegian    8504    8432    8201     97.26         96.44      96.0     96.3
pl Polish       10151   10149   10130    99.81         99.79      98.0     99.7
pt Portuguese   10212   10201   10119    99.20         99.09      88.0     96.9
ro Romanian     5913    5867    5850     99.71         98.93      92.8     97.4
sv Swedish      10025   10093   9942     98.50         99.17      96.0     97.9
tr Turkish      10308   10317   10298    99.82         99.90      97.6     99.5
vi Vietnamese   10487   10480   10474    99.94         99.88      98.7     99.2
total           166711          165053                 99.01      92.2     97.4
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text with maximum probability < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
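The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. The following sketch recomputes the Catalan row and applies the 0.6 threshold from the note; the function names are hypothetical, and treating a probability of exactly 0.6 as detectable is an assumption.

```python
from typing import Dict, Optional, Tuple


def precision_recall(size: int, detect: int, correct: int) -> Tuple[float, float]:
    """Return (precision, recall) in percent, rounded to 2 decimals."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


def pick_label(probabilities: Dict[str, float], threshold: float = 0.6) -> Optional[str]:
    """Return the most probable language, or None if it is below the threshold."""
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


print(precision_recall(size=5093, detect=4923, correct=4857))  # Catalan row
print(pick_label({"da": 0.5, "no": 0.5}))
```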
Estimation for LIGA dataset
• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language     size   detect  correct  precision(%)  recall(%)
de German    1479   1476    1469     99.5          99.3
en English   1505   1502    1490     99.2          99.0
es Spanish   1562   1548    1541     99.6          98.7
fr French    1551   1549    1540     99.4          99.3
it Italian   1539   1531    1528     99.8          99.3
nl Dutch     1430   1429    1424     99.7          99.6
total        9066           8992                   99.2
Estimation for Europarl Dataset
ldig values are given only for the languages that ldig supports ("-" = not supported)
language        size    ldig correct  rate(%)  langdetect correct  rate(%)  CLD correct  rate(%)
bg Bulgarian    1000    -             -        988                 98.8     991          99.1
cs Czech        1000    1000          100.0    994                 99.4     995          99.5
da Danish       1000    976           97.6     968                 96.8     932          93.2
de German       1000    999           99.9     998                 99.8     1000         100.0
el Greek        1000    -             -        1000                100.0    1000         100.0
en English      1000    999           99.9     996                 99.6     1000         100.0
es Spanish      1000    1000          100.0    996                 99.6     989          98.9
et Estonian     1000    -             -        996                 99.6     998          99.8
fi Finnish      1000    997           99.7     998                 99.8     1000         100.0
fr French       1000    999           99.9     999                 99.9     992          99.2
hu Hungarian    1000    1000          100.0    999                 99.9     999          99.9
it Italian      1000    999           99.9     999                 99.9     996          99.6
lt Lithuanian   1000    -             -        997                 99.7     999          99.9
lv Latvian      1000    -             -        999                 99.9     998          99.8
nl Dutch        1000    1000          100.0    974                 97.4     995          99.5
pl Polish       1000    998           99.8     999                 99.9     997          99.7
pt Portuguese   1000    995           99.5     996                 99.6     989          98.9
ro Romanian     1000    1000          100.0    999                 99.9     998          99.8
sk Slovak       1000    -             -        988                 98.8     990          99.0
sl Slovene      1000    -             -        976                 97.6     963          96.3
sv Swedish      1000    995           99.5     991                 99.1     993          99.3
total           21000   13957         99.7     20850               99.3     20814        99.1
Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect with the tweet corpus reaches 97% accuracy
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels
Proposal Method
How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of substrings is of order O(T^2)
n of n-gram   freq≧1    freq≧2    freq≧10
1             79        72        57
2             1896      1533      902
3             15970     10369     4525
4             64966     33941     10534
5             167543    69719     15538
6             323749    107861    18970
7             524634    142954    21093
8             760719    171995    22159
9             921361    193995    22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
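A table of this shape can be produced by counting distinct character n-grams up to each length. The sketch below is a brute-force illustration under two assumptions: frequency means the total number of occurrences over all texts, and each row is cumulative over lengths 1..n. `cumulative_ngram_counts` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable


def cumulative_ngram_counts(texts: Iterable[str], max_n: int, min_freq: int = 1) -> Dict[int, int]:
    """For each n, count distinct k-grams (k <= n) occurring at least min_freq times."""
    counts: Counter = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {
        n: sum(1 for gram, c in counts.items() if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    }


print(cumulative_ngram_counts(["abab"], max_n=3, min_freq=1))
print(cumulative_ngram_counts(["abab"], max_n=3, min_freq=2))
```

A text of length T has T(T+1)/2 substring occurrences, which is where the O(T^2) growth comes from when n is unbounded.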
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass Logistic Regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Store the features in a TRIE for fast prediction
Maximal Substring (1)
• Define a containment relation (semi-order) among non-empty substrings
Example: abracadabra
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
(Strictly, the relation is defined together with the position within the substring)
Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
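The abracadabra example can be checked with a brute-force sketch. It relies on an equivalent formulation that is assumed here: a substring is maximal when no extension by one character, to the left or to the right, occurs as often as the substring itself. The running time is far from linear, so this is only for small inputs; the function names are hypothetical.

```python
from typing import List


def count_occurrences(text: str, sub: str) -> int:
    """Count occurrences of sub in text, including overlapping ones."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text: str) -> List[str]:
    """Return substrings that cannot be extended without losing occurrences."""
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    freq = {s: count_occurrences(text, s) for s in subs}
    alphabet = set(text)
    result = []
    for s, f in freq.items():
        extensions = [c + s for c in alphabet] + [s + c for c in alphabet]
        # Maximal: every one-character extension is strictly less frequent.
        if all(freq.get(e, 0) < f for e in extensions):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```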
Maximal Substring and Infinity-Gram
• Frequencies of substrings that have a containment relationship are always equal
• In a model with a linear combination of features, features that share the same values can be merged into one
• Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the Suffix Tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
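To make the three arrays concrete, here is a naive sketch that builds them for a short string by plain sorting. It is not linear time and it is not how esaxx works; it only shows what SA, L and B contain. No end-of-text sentinel is added, which is an assumption of this sketch.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: List[int]) -> List[int]:
    """Length of the common prefix of each pair of adjacent suffixes in sa."""
    def common_prefix(a: int, b: int) -> int:
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [common_prefix(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def bwt(text: str, sa: List[int]) -> str:
    """Character preceding each sorted suffix (wraps around for position 0)."""
    return "".join(text[i - 1] for i in sa)


text = "abracadabra"
sa = suffix_array(text)
print(sa)
print(lcp_array(text, sa))
print(bwt(text, sa))
```

In the output for abracadabra, the adjacent suffixes "abra" and "abracadabra" share a prefix of length 4 and are preceded by different characters ('d' and 'a'), which matches the condition above for the maximal substring "abra". The suffixes "bra" and "bracadabra" are both preceded by 'a', so "bra" is not maximal.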
Corpus and Normalization
Target Languages
• Limit the character type to detect
  – In short text detection, mixed text can be divided by type of characters
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 languages have over 5 million speakers
What's the Latin Alphabet?
• Latin alphabet ≠ ascii alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range          Name                          Notes
U+0000-007F    Basic Latin                   ascii
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Rumanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
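The nine ranges in the table can be turned into a simple membership test, for example to decide whether a character belongs to the Latin-alphabet part of a mixed text. The names `LATIN_BLOCKS` and `in_latin_blocks` are hypothetical; note that Basic Latin also contains digits and ASCII punctuation, so the test is broader than "is a letter".

```python
# Code point ranges copied from the table of 9 Unicode blocks.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch: str) -> bool:
    """True if the single character ch falls into one of the 9 blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


print([in_latin_blocks(c) for c in "a\u00e5\u1eb5\u3055"])
```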
Latin Alphabets in Unicode Codepoint Chart
(Figure: Unicode code chart of Latin letters; legend: for Vietnamese only / used often / used sometimes)
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of those in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
(20% of all tweets have no timezone)
How to annotate
(Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides, in that order)
Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely among languages
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
(Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
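The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. The function name, the dictionary layout and the fixed seed are assumptions made for this example; every language is assumed to have at least one tweet.

```python
import random
from typing import Dict, List


def level_by_size(corpus: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Grow every language's tweet list to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # Whole copies of the data, plus a remainder drawn without replacement.
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": [f"en{i}" for i in range(10)], "da": ["da0", "da1", "da2"]}
print({lang: len(tweets) for lang, tweets in level_by_size(corpus).items()})
```

Compared with plain sampling with replacement, this keeps every original tweet in the leveled data at least once.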
Convert to Lowercase on Multiple Languages
• Conversion to lower case saves corpus and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding 'I'
Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
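"Lower case excluding 'I'" can be expressed by lower-casing only the stretches of text that contain no capital I. This is a sketch under that reading of the slide; `lower_except_capital_i` is a hypothetical name.

```python
import re

# One or more characters that are not the capital letter I (U+0049).
NOT_CAPITAL_I = re.compile(r"[^I]+")


def lower_except_capital_i(text: str) -> str:
    """Lower-case everything except 'I', whose lower case depends on the language."""
    return NOT_CAPITAL_I.sub(lambda m: m.group(0).lower(), text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

One caveat of this sketch: Python's `str.lower()` turns İ (U+0130) into 'i' followed by a combining dot above (U+0307), so a real pipeline may want to handle U+0130 explicitly.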
Normalization for Rumanian
• Rumanian uses â ă î ș ț in addition to a-z
Proposal Method
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 26
How to increase features from 3-grams
bull The more n the more features
bull Maximum at n=infin that is all substring
ndash But it has O(T2) order
gram of n-gram
freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
cumulative distributuion of feature length for 5090 normalized English tweets (300KB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 27
Text Categorization with All Substring Features [Okanohara+ 09]
bull Multiclass Logistic Regression using all
substrings as features
ndash Maximal Substring makes the equivalent
model that can be constructed in linear
time
ndash Store features into TRIE fast prediction
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 28
Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (see Latin-1 Supplement)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (see Latin Extended-C)

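
As a small illustration of the table, the nine ranges can be turned into a membership test. This is only a sketch: the constant and function names are hypothetical, and note that Basic Latin also contains digits and punctuation, so this is a block test rather than a letter test.

```python
# The nine code block ranges listed on the slide (inclusive).
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch):
    """True if the single character ch falls in one of the nine blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


print(in_latin_blocks("ŋ"), in_latin_blocks("ẵ"), in_latin_blocks("さ"))  # True True False
```
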
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode codepoint chart of Latin letters; legend: used only for Vietnamese / used often / used sometimes]

How to Create Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
(20% of all tweets have no timezone)

How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus
• Clean (noise-free) tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation

language          training     test
ca Catalan            9089     5082
cs Czech              9082     7682
da Danish             7388     5524
de German            44448    10065
en English           44520    10168
es Spanish           44118    10265
fi Finnish            8087     7050
fr French            44339    10098
hu Hungarian         10030     4904
id Indonesian        44722    10181
it Italian           43366    10152
nl Dutch             44682    10007
no Norwegian         10124     8496
pl Polish            16771    10152
pt Portuguese        44215    10208
ro Romanian          10021     5911
sv Swedish           44054    10032
tr Turkish           44703    10308
vi Vietnamese        15030    10488
total               538789   166773

Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – As is, it reaches at most 98% accuracy
• We guess that it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias

Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest language's data
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Pie chart of tweet counts by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]

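
The approximation in the sub-bullet can be sketched as follows. The function name, the dictionary layout and the fixed seed are assumptions for this example, and every language is assumed to have at least one sample.

```python
import random


def level_out(samples_by_lang, seed=0):
    """Grow every language to the size of the largest one by copying the
    data an integer number of times and sampling the remainder without
    replacement."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_lang.values())
    leveled = {}
    for lang, samples in samples_by_lang.items():
        copies, rest = divmod(target, len(samples))
        leveled[lang] = samples * copies + rng.sample(samples, rest)
    return leveled


corpus = {"en": ["e%d" % i for i in range(10)], "ca": ["c0", "c1", "c2"]}
leveled = level_out(corpus)
print({lang: len(samples) for lang, samples in leveled.items()})  # {'en': 10, 'ca': 10}
```
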
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the amount of corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)

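
A minimal sketch of "convert to lower case, excluding 'I'". It assumes that only U+0049 is left untouched, as the third bullet says, and the function name is hypothetical.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower-case
    form depends on the language."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))  # Istanbul Is big
```
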
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of code points for s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, Twitter and Wikipedia
• The 2 kinds have the same design in some fonts
  – Indistinguishable

comma below    cedilla
ș (U+0219)     ş (U+015F)
ț (U+021B)     ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography prescribes 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)
's/t with cedilla' is used on an advertisement board in Bucharest

Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two forms of the Japanese character 'SA': さ さ

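
The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. A sketch, with hypothetical names, using the code points from the slides:

```python
# Comma-below forms (U+0218-B) mapped to the cedilla forms (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Fold the comma-below letters onto the more common cedilla letters."""
    return text.translate(COMMA_TO_CEDILLA)


assert normalize_romanian("\u0219i \u021Bi") == "\u015Fi \u0163i"
```
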
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, nearly all Persian text detection fails
• Regard U+06CC as U+064A

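
The last bullet is a single substitution. A sketch with a hypothetical function name:

```python
def normalize_yeh(text):
    """Replace Farsi yeh (U+06CC) with Arabic yeh (U+064A)."""
    return text.replace("\u06CC", "\u064A")


assert normalize_yeh("\u06CC") == "\u064A"
```
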
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Combining Diacritical Marks
    • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1

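
One way to normalize representation 2 into representation 1 is Unicode NFC composition from the Python standard library. This is a stand-in chosen for the example: the slides do not say which mechanism is used, and NFC also composes sequences that have nothing to do with Vietnamese.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letter + combining mark sequences into single code points."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a-breve followed by combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))  # 2 1 0x1eb5
```
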
CJK-Kanji Normalization (1) (on language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of alphabet letters, like XD and :p

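
The removal list could look like this with regular expressions. The patterns are written for this example and are assumptions, not the ones shipped with ldig. In particular the emoticon pattern only covers a few shapes such as XD and :p.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b:?")
# A few alphabet-based emoticons standing alone between spaces, e.g. XD, xDDD, :p, :D
EMOTICON = re.compile(r"(?<!\S)(?:[:;]-?[dDpP()]|[xX]D+)(?!\S)")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RT, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace


print(strip_twitter_markup("RT @user: check http://t.co/abc #fun XD really :p"))
# check really
```
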
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times in a row, shrink it to 2
  – No language written in the Latin alphabet has 3 or more consecutive repetitions of the same letter in its orthography
    • In Japanese, though, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection

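
Case 2 fits in one regular expression. This sketch reads the rule as "three or more identical letters in a row become two", which is an interpretation of the slide wording, and the names are hypothetical.

```python
import re

# A letter (no digits, no underscore) followed by two or more copies of itself.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3 or more identical letters down to 2."""
    return REPEATED_LETTER.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))  # cooll
print(shrink_repeats("cool 1000"))        # cool 1000
```

Note that the result is "cooll", not "cool": the rule only bounds the run length, which is enough to make all lengthened spellings of a word collapse to the same string. Because the pattern accepts any Unicode letter, it would also shrink the Japanese examples above, so it is only appropriate for Latin-alphabet text.
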
Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha

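
A sketch of the laugh rule, assuming lower-cased input and a laugh made of one of h/j/k plus one vowel. The consonant set and the function name are choices made for this example. The pattern tolerates irregular typing such as the doubled "h" in "hahahha".

```python
import re

# consonant(s) + vowel(s), followed by one or more repetitions of the same pair
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text):
    """Shrink a repeated laugh syllable to exactly two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ["hahahha", "jajajajajajaja", "jejejeje", "kekeke", "hihihihi"]:
    print(sample, "=>", shrink_laugh(sample))
# hahahha => haha
# jajajajajajaja => jaja
# jejejeje => jeje
# kekeke => keke
# hihihihi => hihi
```
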
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output as "[features]\t[frequency]"

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

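
A line in this format can be split as sketched below. It assumes that the fields are tab-separated, that the label is the first field, the text the last, and that any fields in between are metadata. The function name and the sample values are hypothetical.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # empty when the line has only label and text
    return label, metadata, text


# Hypothetical sample line: label, four metadata fields, then the text.
sample = "ca\t123\t2012-05-14\tuser1\tca\txDDD no mextranya\n"
print(parse_line(sample))
# ('ca', ['123', '2012-05-14', 'user1', 'ca'], 'xDDD no mextranya')
```
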
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it has started
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters

Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
As a text with maximum probability < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language          size   detect  correct  precision  recall   LD53   LDsm
ca Catalan        5093     4923     4857      98.66   95.37   95.3   97.0
cs Czech          7681     7668     7663      99.93   99.77   96.3   99.7
da Danish         5516     5472     5310      97.04   96.27   94.5   92.4
de German        10060    10069    10006      99.37   99.46   86.6   93.8
en English       10162    10133    10029      98.97   98.69   88.3   95.0
es Spanish       10244    10284    10120      98.41   98.79   91.5   96.0
fi Finnish        7051     7038     7024      99.80   99.62   98.9   99.6
fr French        10074    10134    10051      99.18   99.77   95.0   98.1
hu Hungarian      4904     4892     4858      99.30   99.06   85.8   95.5
id Indonesian    10178    10225    10160      99.36   99.82   89.7   98.9
it Italian       10143    10205    10103      99.00   99.61   96.2   98.0
nl Dutch         10005     9916     9858      99.42   98.53   69.5   97.4
no Norwegian      8504     8432     8201      97.26   96.44   96.0   96.3
pl Polish        10151    10149    10130      99.81   99.79   98.0   99.7
pt Portuguese    10212    10201    10119      99.20   99.09   88.0   96.9
ro Romanian       5913     5867     5850      99.71   98.93   92.8   97.4
sv Swedish       10025    10093     9942      98.50   99.17   96.0   97.9
tr Turkish       10308    10317    10298      99.82   99.90   97.6   99.5
vi Vietnamese    10487    10480    10474      99.94   99.88   98.7   99.2
total           166711        -   165053          -   99.01   92.2   97.4

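
The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, which the sketch below checks against the Catalan row. It also shows the 0.6 cut-off from the note. The function names are hypothetical and the probability dictionaries are made-up inputs.

```python
def precision_recall(size, detect, correct):
    """Precision and recall in percent for one language."""
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None when its probability is
    below the threshold (the text is treated as undetectable)."""
    lang, p = max(probabilities.items(), key=lambda kv: kv[1])
    return lang if p >= threshold else None


p, r = precision_recall(size=5093, detect=4923, correct=4857)  # Catalan row
print(f"{p:.2f} {r:.2f}")                 # 98.66 95.37
print(decide({"da": 0.55, "no": 0.45}))   # None
print(decide({"fr": 0.90, "ca": 0.10}))   # fr
```
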
Estimation for LIGA dataset
• Evaluate on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
The 19-language model is used

language       size  detect  correct  precision  recall
de German      1479    1476     1469       99.5    99.3
en English     1505    1502     1490       99.2    99.0
es Spanish     1562    1548     1541       99.6    98.7
fr French      1551    1549     1540       99.4    99.3
it Italian     1539    1531     1528       99.8    99.3
nl Dutch       1430    1429     1424       99.7    99.6
total          9066       -     8992          -    99.2

Estimation for Europarl Dataset
The ldig columns cover only the languages that ldig supports

                         ldig             langdetect       CLD
language         size    correct  rate    correct  rate    correct  rate
bg Bulgarian     1000          -     -        988  98.8        991  99.1
cs Czech         1000       1000 100.0        994  99.4        995  99.5
da Danish        1000        976  97.6        968  96.8        932  93.2
de German        1000        999  99.9        998  99.8       1000 100.0
el Greek         1000          -     -       1000 100.0       1000 100.0
en English       1000        999  99.9        996  99.6       1000 100.0
es Spanish       1000       1000 100.0        996  99.6        989  98.9
et Estonian      1000          -     -        996  99.6        998  99.8
fi Finnish       1000        997  99.7        998  99.8       1000 100.0
fr French        1000        999  99.9        999  99.9        992  99.2
hu Hungarian     1000       1000 100.0        999  99.9        999  99.9
it Italian       1000        999  99.9        999  99.9        996  99.6
lt Lithuanian    1000          -     -        997  99.7        999  99.9
lv Latvian       1000          -     -        999  99.9        998  99.8
nl Dutch         1000       1000 100.0        974  97.4        995  99.5
pl Polish        1000        998  99.8        999  99.9        997  99.7
pt Portuguese    1000        995  99.5        996  99.6        989  98.9
ro Romanian      1000       1000 100.0        999  99.9        998  99.8
sk Slovak        1000          -     -        988  98.8        990  99.0
sl Slovene       1000          -     -        976  97.6        963  96.3
sv Swedish       1000        995  99.5        991  99.1        993  99.3
total           21000      13957  99.7      20850  99.3      20814  99.1

Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (> 100 MB)

References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
– Maximal substrings yield an equivalent model that can be constructed in linear time
– Features are stored in a trie for fast prediction

Maximal Substring (1)
• Define a containment relation (a partial order) among non-empty substrings, taking "abracadabra" as the example text
– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
– Strictly speaking, the definition also takes into account the position within the containing substring

Maximal Substring (2)
• Each equivalence class formed by the containment relation has a unique maximal element, called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
via http://d.hatena.ne.jp/nokuno/20120203/1328237067

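The definition can be checked by brute force on the example word. The sketch below is an illustration only, not the linear-time construction used in the talk: it assumes that a substring is maximal exactly when it cannot be extended by one fixed character on the left or on the right in all of its occurrences. The function names are hypothetical.

```python
def occurrences(text, sub):
    """Start positions of every (possibly overlapping) occurrence of sub."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]


def maximal_substrings(text):
    """Brute-force maximal substrings; fine for short examples only."""
    n = len(text)
    result = set()
    for sub in {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}:
        starts = occurrences(text, sub)
        # Characters just before / after each occurrence (None at a text boundary).
        left = {text[i - 1] if i > 0 else None for i in starts}
        right = {text[i + len(sub)] if i + len(sub) < n else None for i in starts}
        left_extendable = len(left) == 1 and None not in left
        right_extendable = len(right) == 1 and None not in right
        if not left_extendable and not right_extendable:
            result.add(sub)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```

For "abracadabra" this yields the three substrings listed on the slide.
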
Maximal Substring and Infinity-Gram
• Substrings in the same equivalence class of the containment relation always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
• Although the equivalence does not hold exactly on the test set, we assume that a sufficiently large training set approximates it

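The merging argument can be written as a one-line derivation. The notation is introduced here for illustration and does not appear on the slides: C is one equivalence class, m its maximal substring, f_s(x) the count of substring s in text x, and w_s its weight.

```latex
\sum_{s \in C} w_s \, f_s(x)
  = \Bigl( \sum_{s \in C} w_s \Bigr) f_m(x)
  = \tilde{w}_m \, f_m(x),
\qquad \text{since } f_s(x) = f_m(x) \text{ for all } s \in C .
```

So a single weight on the maximal substring can stand in for the weights of every substring in its class, at least on the training text.
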
Extended Suffix Array
• An Extended Suffix Array (ESA) consists of
– SA = Suffix Array
– L = Longest Common Prefixes
– B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
– They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/
via [Okanohara+ 09]

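The criterion above (L > 0 and more than one BWT character) can be tried out with a deliberately naive construction. This is a sketch for small inputs, assuming quadratic-time sorting is acceptable; it is not esaxx and not linear time, and the function name is hypothetical.

```python
def repeated_maximal_substrings(text):
    """Maximal substrings occurring more than once, via naive SA / LCP / BWT."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # suffix array (naive sort)

    def common_prefix(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    # lcp[i]: common prefix length of the suffixes ranked i-1 and i.
    lcp = [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, n)]
    # bwt[i]: character preceding the suffix ranked i ("" marks the text start).
    bwt = [text[i - 1] if i > 0 else "" for i in sa]

    found = set()
    for i in range(1, n):
        depth = lcp[i]
        if depth == 0:
            continue
        prefix = text[sa[i]:sa[i] + depth]  # label of an internal suffix-tree node
        rows = [k for k in range(n) if text.startswith(prefix, sa[k])]
        if len({bwt[k] for k in rows}) > 1:  # preceded by differing characters
            found.add(prefix)
    return found


print(repeated_maximal_substrings("abracadabra"))
```

On "abracadabra" this keeps "a" and "abra" and rejects "bra" and "ra", which are always preceded by the same character.
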
Corpus and Normalization
Target Languages
• Limit the character types to detect
– In short-text detection, mixed text can be split by character type
• Languages written in the Latin alphabet
– The most difficult alphabet type to detect
– More than 25 such languages have over 5 million speakers

What's Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
– å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these
U+0100-017F    Latin Extended-A              Most languages are covered with these
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day language
U+A720-A7FF    Latin Extended-D              Used by almost no present-day language

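The table can be turned into a small lookup, for example to inspect which blocks the characters of a tweet fall into. This is a minimal sketch; the table constant and function name are hypothetical, and the ranges are copied from the slide.

```python
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]


def latin_block(ch):
    """Name of the block from the table containing ch, or None if outside all nine."""
    code_point = ord(ch)
    for start, end, name in LATIN_BLOCKS:
        if start <= code_point <= end:
            return name
    return None


for ch in "aåąșẵ":
    print(ch, latin_block(ch))
```
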
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus
• Collect tweets with the sample method of the Twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation
• Group tweets by their timezone (20% of all tweets have no timezone)
– French tweets account for only about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone
• Annotate tentative labels to tweets using langdetect
– Remove non-French tweets from those labeled 'fr'
– Recover French tweets from those not labeled 'fr'

How to annotate
[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

lang  language     training  test
ca    Catalan      9089      5082
cs    Czech        9082      7682
da    Danish       7388      5524
de    German       44448     10065
en    English      44520     10168
es    Spanish      44118     10265
fi    Finnish      8087      7050
fr    French       44339     10098
hu    Hungarian    10030     4904
id    Indonesian   44722     10181
it    Italian      43366     10152
nl    Dutch        44682     10007
no    Norwegian    10124     8496
pl    Polish       16771     10152
pt    Portuguese   44215     10208
ro    Romanian     10021     5911
sv    Swedish      44054     10032
tr    Turkish      44703     10308
vi    Vietnamese   15030     10488
total              538789    166773

Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
– By itself it reaches at most 98% accuracy
• We suppose it is necessary to reduce bias
– data size bias
– language-specific bias
– Twitter-specific bias

Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest one
– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Figure: share of tweets by language. Labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]

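The approximation described above (whole copies plus a remainder drawn without replacement) might look like the sketch below. The corpus layout, a dict from language code to a list of tweets, and both function names are assumptions made for illustration.

```python
import random


def oversample(items, target, rng):
    """Whole copies of items, plus a remainder sampled without replacement."""
    copies, remainder = divmod(target, len(items))
    return items * copies + rng.sample(items, remainder)


def level_corpus(corpus, seed=0):
    """Bring every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    return {lang: oversample(tweets, target, rng) for lang, tweets in corpus.items()}


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_corpus(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```

Every tweet of a small language then appears at least floor(target / size) times, so no example is lost to sampling luck.
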
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding 'I'

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)

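A minimal sketch of lower-casing everything except capital 'I' follows. It leaves only U+0049 untouched, as the slide states; how İ (U+0130) should be treated is not specified there, so this sketch simply passes it to `str.lower()`. The function name is hypothetical.

```python
def lower_except_capital_i(text):
    """Lower-case every character except 'I' (U+0049), which is language dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is"))
```

Keeping 'I' as its own symbol lets the model learn from context whether it stands for Turkish 'ı' or for 'i' elsewhere.
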
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
– Indistinguishable

with comma below   with cedilla
ș (U+0219)         ş (U+015F)
ț (U+021B)         ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
– 1989: Democratization of Romania
– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Romania joined the EU
– 2007: Windows Vista supported 's/t with comma' (available for everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
– But they are more popular than the correct ones
– with cedilla : with comma = 2 : 1
– "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
– e.g. ț (U+021B) is treated as ţ (U+0163)
• I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ)

Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
– Wikipedia uses only ی (U+06CC, Farsi yeh)
– News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

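Both substitutions (Romanian comma-below letters to their cedilla forms, Farsi yeh to Arabic yeh) are plain one-to-one code point mappings. The sketch below puts them in a single translation table purely for illustration; in the talk they belong to two different tools, and the names used here are hypothetical.

```python
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Map each character to the more widely used substitute code point."""
    return text.translate(SUBSTITUTES)


print(normalize_substitutes("cuno\u0219tin\u021b\u0103"))
```

Mapping toward the more frequent form means the training corpus and the input text agree on one code point, whichever keyboard produced them.
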
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone marks are also used in general documents such as news
• The tone marks can be attached to every vowel
– 12 × 6 = 72

Normalization for Vietnamese (2)
• There are two representations of vowels with tones
1. Use the precomposed characters U+1EA0 - U+1EF9
• ẵ = U+1EB5
2. Combine a vowel with Combining Diacritical Marks
• ẵ = U+0103 U+0303
– The two are used about half and half in news and tweets
• Normalize 2 into 1

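Converting representation 2 into representation 1 is what Unicode canonical composition (NFC) does. The sketch below uses Python's standard `unicodedata` module as a stand-in; whether the talk's implementation uses NFC or its own mapping table is not stated, so treat this as an assumption.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a with breve, followed by combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```
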
CJK-Kanji Normalization (1) (on language-detection)
• CJK Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
• e.g. 谢谢, Kanji for personal names
– Common frequent characters are too strong
• e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than in Chinese

CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K = 50 (size of the ASCII alphabet = 52)
– (2) "Commonly Used Kanji" lists provided for Japanese and Chinese
• Simplified Chinese: 现代汉语常用字表 (3500)
• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
• Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
– 常用漢字 contains few Kanji for personal and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter
• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons made of letters, like XD and :p

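The removals listed above can be sketched with a handful of regular expressions. The patterns are assumptions written for this illustration, not the ones used by ldig, and the emoticon pattern covers only a few letter-based faces.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based faces such as XD, :p, :D, ;) that stand alone as a token.
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;=]-?[DPp()]+)(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


print(strip_twitter_noise("RT @bob so cool http://t.co/abc #fun XD"))
```

URLs are removed first so that the "://" inside them is never mistaken for an emoticon.
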
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
– Unsupervised normalization like coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times in a row, shrink it to 2
– No language has 3 or more consecutive repeats of the same Latin letter in its orthography
• In Japanese, by contrast, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
• Acronyms (like WWW, СССР) are not useful for language detection

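Case 2 fits in a single regular expression. This sketch assumes the threshold is three or more consecutive identical characters; the function name is hypothetical.

```python
import re

REPEAT = re.compile(r"(.)\1{2,}")  # one character followed by 2+ copies of itself


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Note that this gives "cooll" rather than the dictionary form "cool": the rule only bounds run lengths, which is enough to keep the feature space small.
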
Laugh Normalization
• Every language has its own variety of laughs
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi :) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
– hahahha ⇒ haha

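A laugh is roughly a consonant-vowel pair repeated several times, possibly with typos such as a doubled letter. The pattern below is an assumption made for this sketch (consonants h, j, k and the five plain vowels, on already lower-cased text); it is not claimed to be the pattern used by ldig.

```python
import re

# A consonant/vowel pair, followed by at least one more (possibly sloppy) repeat.
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text):
    """Shrink laughs such as 'hahahha' or 'jajajaja' to two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jajajajajajaja", "kekeke", "hihihihi"):
    print(sample, "->", shrink_laugh(sample))
```
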
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
– ∞-gram LR (maximal substring) [Okanohara+ 09]
– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
– Double Array

Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
– Extract features from the corpus and initialize the model
– -m: model directory
– -x: path of the maximal substring extractor (executed as an external process)
– --ff: ignore features whose frequency is below the specified value

Maximal String Extractor
• maxsubst [input file] [output file]
– Input is multi-line text
• TABs in it are replaced with " " and line feeds with U+0001
– Output is "[feature]\t[frequency]"

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
– Train the model on the corpus for 1 cycle of SGD
– -e: learning rate of SGD
– -r: coefficient of L1 regularization
– --wr: how many times to regularize all parameters
• There are too many parameters to regularize all of them at every step

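The reason not every parameter needs regularizing at every step is the cumulative penalty idea of [Tsuruoka+ 09]: keep track of the total L1 penalty each weight could have received so far, and apply the outstanding part only when a weight is touched. The sketch below shows that clipping step as described in the paper; the variable and function names are hypothetical and it is not taken from ldig's source.

```python
def apply_cumulative_penalty(weights, applied, total_penalty, indices):
    """Clip the selected weights toward zero by the L1 penalty still owed to them.

    weights:       current parameter values (modified in place)
    applied:       signed penalty each weight has actually received so far
    total_penalty: penalty every weight could have received so far
    indices:       the weights to update (e.g. the features in one tweet)
    """
    for i in indices:
        before = weights[i]
        if before > 0:
            weights[i] = max(0.0, before - (total_penalty + applied[i]))
        elif before < 0:
            weights[i] = min(0.0, before + (total_penalty - applied[i]))
        applied[i] += weights[i] - before


weights = [0.5, -0.05, 0.0]
applied = [0.0, 0.0, 0.0]
apply_cumulative_penalty(weights, applied, total_penalty=0.1, indices=range(3))
print(weights, applied)
```

Because a weight can never cross zero in this step, small weights end up exactly 0, which is what makes the later shrink step possible.
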
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
– Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
– Detect the languages of the test data and output the results and a summary

Data Format
• Training and test data
– [correct label]\t[meta data]\t[text]
• Examples without metadata
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
• Example with metadata
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

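Reading one line of this format could look like the sketch below. It assumes, based on the two kinds of example above, that the label is the first tab-separated field, the text is the last, and any fields in between are metadata; the function name is hypothetical.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, *metadata, text = fields
    return label, metadata, text


print(parse_line("en\tu should just enjoy ur vacation sadly\n"))
```
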
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
– Open http://localhost:[port] after it starts
– For text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters

Estimation
• LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
• Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
• precision, recall, LD53 and LDsm are percentages

lang  language     size    detect  correct  precision  recall  LD53  LDsm
ca    Catalan      5093    4923    4857     98.66      95.37   95.3  97.0
cs    Czech        7681    7668    7663     99.93      99.77   96.3  99.7
da    Danish       5516    5472    5310     97.04      96.27   94.5  92.4
de    German       10060   10069   10006    99.37      99.46   86.6  93.8
en    English      10162   10133   10029    98.97      98.69   88.3  95.0
es    Spanish      10244   10284   10120    98.41      98.79   91.5  96.0
fi    Finnish      7051    7038    7024     99.80      99.62   98.9  99.6
fr    French       10074   10134   10051    99.18      99.77   95.0  98.1
hu    Hungarian    4904    4892    4858     99.30      99.06   85.8  95.5
id    Indonesian   10178   10225   10160    99.36      99.82   89.7  98.9
it    Italian      10143   10205   10103    99.00      99.61   96.2  98.0
nl    Dutch        10005   9916    9858     99.42      98.53   69.5  97.4
no    Norwegian    8504    8432    8201     97.26      96.44   96.0  96.3
pl    Polish       10151   10149   10130    99.81      99.79   98.0  99.7
pt    Portuguese   10212   10201   10119    99.20      99.09   88.0  96.9
ro    Romanian     5913    5867    5850     99.71      98.93   92.8  97.4
sv    Swedish      10025   10093   9942     98.50      99.17   96.0  97.9
tr    Turkish      10308   10317   10298    99.82      99.90   97.6  99.5
vi    Vietnamese   10487   10480   10474    99.94      99.88   98.7  99.2
total: size 166711, correct 165053, accuracy 99.01; LD53 92.2; LDsm 97.4

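The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, as the short check below shows for the Catalan row. The function name is hypothetical.

```python
def precision_recall(size, detect, correct):
    """Precision and recall in percent, rounded to two decimals."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(5093, 4923, 4857))
```
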
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

lang  language  size  detect  correct  precision  recall
de    German    1479  1476    1469     99.5       99.3
en    English   1505  1502    1490     99.2       99.0
es    Spanish   1562  1548    1541     99.6       98.7
fr    French    1551  1549    1540     99.4       99.3
it    Italian   1539  1531    1528     99.8       99.3
nl    Dutch     1430  1429    1424     99.7       99.6
total: size 9066, correct 8992, accuracy 99.2

Estimation for Europarl Dataset
• ldig is evaluated only on the languages it supports ("-" marks an unsupported language); rates are percentages

                           ldig            langdetect      CLD
lang  language     size    correct  rate   correct  rate   correct  rate
bg    Bulgarian    1000    -        -      988      98.8   991      99.1
cs    Czech        1000    1000     100.0  994      99.4   995      99.5
da    Danish       1000    976      97.6   968      96.8   932      93.2
de    German       1000    999      99.9   998      99.8   1000     100.0
el    Greek        1000    -        -      1000     100.0  1000     100.0
en    English      1000    999      99.9   996      99.6   1000     100.0
es    Spanish      1000    1000     100.0  996      99.6   989      98.9
et    Estonian     1000    -        -      996      99.6   998      99.8
fi    Finnish      1000    997      99.7   998      99.8   1000     100.0
fr    French       1000    999      99.9   999      99.9   992      99.2
hu    Hungarian    1000    1000     100.0  999      99.9   999      99.9
it    Italian      1000    999      99.9   999      99.9   996      99.6
lt    Lithuanian   1000    -        -      997      99.7   999      99.9
lv    Latvian      1000    -        -      999      99.9   998      99.8
nl    Dutch        1000    1000     100.0  974      97.4   995      99.5
pl    Polish       1000    998      99.8   999      99.9   997      99.7
pt    Portuguese   1000    995      99.5   996      99.6   989      98.9
ro    Romanian     1000    1000     100.0  999      99.9   998      99.8
sk    Slovak       1000    -        -      988      98.8   990      99.0
sl    Slovene      1000    -        -      976      97.6   963      96.3
sv    Swedish      1000    995      99.5   991      99.1   993      99.3
total              21000   13957    99.7   20850    99.3   20814    99.1

Conclusions
• A language detector using the maximal substring model
– It detects 19 languages with over 99% accuracy
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, precision should still improve
– There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision should still improve
– Open question: how to add and train on metadata at low cost
• We want to shrink the model without loss of precision
– It is too large for applications (> 100 MB)

References
• [Nakatani NLP12] Language Detection for Twitter Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Maximal Substring (1)
bull Define a containment(semi-order)
among non empty substrings
abracadabra
ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur
as the substring of ldquobrardquo
ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo
but also ldquocardquo It is strictly defined with also its position in the substring
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 29
Maximal Substring (2)
bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring
bull Maximal substrings of abracadabra are a abra and abracadabra
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 30
via httpdhatenanejpnokuno201202031328237067
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output format is “[feature]\t[frequency]”
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize the whole parameter set
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
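A record in this format can be read with a plain tab split. The sketch below is illustrative only: `Record` and `parse_line` are hypothetical names, and it assumes the first field is the label, the last field is the text, and anything in between is metadata.

```python
from collections import namedtuple

# Hypothetical container for one training/test line.
Record = namedtuple("Record", ["label", "metadata", "text"])


def parse_line(line):
    """Split '[label]\\t[meta data]\\t[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    # First field: label. Last field: text. The rest: metadata fields.
    return Record(fields[0], fields[1:-1], fields[-1])
```

Keeping the metadata as a list lets the same function read lines with an empty metadata field and lines with several metadata fields (status ID, datetime, and so on).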
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters
Evaluation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus.
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size.
(precision, recall, LD53 and LDsm are in %)
language        size    detect  correct  precision  recall  LD53  LDsm
ca Catalan      5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech        7681    7668    7663     99.93      99.77   96.3  99.7
da Danish       5516    5472    5310     97.04      96.27   94.5  92.4
de German       10060   10069   10006    99.37      99.46   86.6  93.8
en English      10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish      10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish      7051    7038    7024     99.80      99.62   98.9  99.6
fr French       10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian    4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian   10178   10225   10160    99.36      99.82   89.7  98.9
it Italian      10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch        10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian    8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish       10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese   10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian     5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish      10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish      10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese   10487   10480   10474    99.94      99.88   98.7  99.2
total           166711          165053              99.01   92.2  97.4
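The precision and recall columns follow from the size, detect and correct counts, and the "undetectable" rule is a threshold on the top probability. A small sketch; the function names are hypothetical, and reading precision as correct/detect and recall as correct/size is inferred from the Catalan row rather than stated on the slide.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to 2 places."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


def detect_label(probabilities, threshold=0.6):
    """Return the most probable label, or None if it is below threshold."""
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(5093, 4923, 4857))  # (98.66, 95.37)
```

A text for which `detect_label` returns None counts toward size but not toward detect, which is why recall can be lower than precision.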
Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model
(precision and recall are in %)
language     size   detect  correct  precision  recall
de German    1479   1476    1469     99.5       99.3
en English   1505   1502    1490     99.2       99.0
es Spanish   1562   1548    1541     99.6       98.7
fr French    1551   1549    1540     99.4       99.3
it Italian   1539   1531    1528     99.8       99.3
nl Dutch     1430   1429    1424     99.7       99.6
total        9066           8992                99.2
Evaluation on the Europarl Dataset
ldig is evaluated only on the languages it supports (“-” = not supported). Rates are in %.
language        size    ldig           langdetect     CLD
                        correct rate   correct rate   correct rate
bg Bulgarian    1000    -       -      988     98.8   991     99.1
cs Czech        1000    1000    100.0  994     99.4   995     99.5
da Danish       1000    976     97.6   968     96.8   932     93.2
de German       1000    999     99.9   998     99.8   1000    100.0
el Greek        1000    -       -      1000    100.0  1000    100.0
en English      1000    999     99.9   996     99.6   1000    100.0
es Spanish      1000    1000    100.0  996     99.6   989     98.9
et Estonian     1000    -       -      996     99.6   998     99.8
fi Finnish      1000    997     99.7   998     99.8   1000    100.0
fr French       1000    999     99.9   999     99.9   992     99.2
hu Hungarian    1000    1000    100.0  999     99.9   999     99.9
it Italian      1000    999     99.9   999     99.9   996     99.6
lt Lithuanian   1000    -       -      997     99.7   999     99.9
lv Latvian      1000    -       -      999     99.9   998     99.8
nl Dutch        1000    1000    100.0  974     97.4   995     99.5
pl Polish       1000    998     99.8   999     99.9   997     99.7
pt Portuguese   1000    995     99.5   996     99.6   989     98.9
ro Romanian     1000    1000    100.0  999     99.9   998     99.8
sk Slovak       1000    -       -      988     98.8   990     99.0
sl Slovene      1000    -       -      976     97.6   963     96.3
sv Swedish      1000    995     99.5   991     99.1   993     99.3
total           21000   13957   99.7   20850   99.3   20814   99.1
Conclusions
• Built a language detector using the maximal substring model
  – It detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – Open question: how to add and train metadata at low cost
• Want to shrink the model without loss of precision
  – It is too large for applications (>100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Maximal Substring (2)
• Each equivalence class formed by the containment relationship has a unique maximal element, which is called a maximal substring
• The maximal substrings of “abracadabra” are “a”, “abra” and “abracadabra”
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
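The abracadabra example can be checked by brute force. The sketch below is not the suffix-array method used by ldig; it enumerates every substring (quadratic in the text length) and keeps those that cannot be extended by one character on either side without losing occurrences. The name `maximal_substrings` is hypothetical.

```python
from collections import Counter


def maximal_substrings(text):
    """Brute-force maximal substrings; only suitable for short strings."""
    n = len(text)
    # Frequency of every non-empty substring (overlaps counted).
    freq = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = set()
    for s, f in freq.items():
        extensions = [c + s for c in alphabet] + [s + c for c in alphabet]
        # s is maximal if every one-character extension occurs less often.
        if all(freq.get(e, 0) < f for e in extensions):
            result.add(s)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```

For example, `bra` occurs twice but so does `abra`, so `bra` belongs to the class whose maximal element is `abra`.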
Maximal Substring and Infinity-Gram
• Substrings in the same equivalence class (related by containment) always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams
  – Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a range of suffixes with L > 0 whose BWT has more than 1 character type
  – They can be computed in linear time
• esaxx: Okanohara’s implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
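To make the three arrays concrete, here is a naive construction for a short string. It is a sketch for illustration only: it sorts suffixes directly instead of using the linear-time algorithm of esaxx, the function names are hypothetical, and the BWT wraps around for the first suffix (an assumed convention).

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Longest common prefix length of each pair of adjacent suffixes."""
    def common_prefix(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [common_prefix(text[sa[k - 1]:], text[sa[k]:])
            for k in range(1, len(sa))]


def bwt(text, sa):
    """Character preceding each suffix; wraps around for suffix 0."""
    return "".join(text[i - 1] for i in sa)


text = "abracadabra"
sa = suffix_array(text)
print(sa, lcp_array(text, sa), bwt(text, sa))
```

In the output, the suffixes "abra" and "abracadabra" are adjacent with a common prefix of length 4 and are preceded by two different characters (d and a), so "abra" is maximal. The suffixes "bra" and "bracadabra" share a prefix of length 3 but are both preceded by a, so "bra" is not.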
Corpus and Normalization
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin-alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 of these languages have over 5 million speakers
What’s the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these two blocks
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              (same as above)
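The nine ranges in the table can be turned directly into a membership test. A minimal sketch; `LATIN_BLOCKS` and `in_latin_blocks` are hypothetical names.

```python
# The 9 Unicode ranges listed in the table above.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch):
    """True if the single character ch falls in one of the 9 ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)
```

Basic Latin also contains digits and punctuation, so this test says which block a character lives in, not whether it is a letter.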
Latin Alphabets in Unicode Codepoint Chart
(Figure: code point chart. Legend: for Vietnamese only, used often, used sometimes)
How to Create Corpus
• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
How to annotate
• Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation
language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – Twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
(Chart legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
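The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. The name `level_out` and its arguments are assumptions for illustration.

```python
import random


def level_out(items, target, rng):
    """Grow the list `items` to exactly `target` elements.

    Copies the whole list as many times as fits, then fills the rest
    by sampling the remainder without replacement.
    """
    copies, remainder = divmod(target, len(items))
    return items * copies + rng.sample(items, remainder)


rng = random.Random(0)
balanced = level_out(["t1", "t2", "t3"], 8, rng)
```

Every tweet then appears either `copies` or `copies + 1` times, so no single tweet is over-represented the way it could be under pure sampling with replacement.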
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’
Language              Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
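A minimal sketch of lowercasing everything except the capital ‘I’. The name `lower_except_capital_i` is hypothetical, and the sketch handles only U+0049; it makes no special provision for İ (U+0130).

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Leaving ‘I’ untouched avoids having to guess, before the language is known, whether its lower case should be i or ı.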
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more common on news, Twitter and Wikipedia
• The 2 kinds have the same glyph design in some fonts
  – Indistinguishable
    with comma below   with cedilla
s   ș (U+0219)         ş (U+015F)
t   ț (U+021B)         ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)
(Photo caption: ‘s/t with cedilla’ is used on an advertisement board in Bucharest)
Normalization for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
  – But they are more common than the proper ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the two glyph forms of the Japanese character ‘SA’ (さ)
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
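Both substitutions (Romanian comma-below letters to their cedilla forms, Farsi yeh to Arabic yeh) are one-to-one character mappings, so a translation table is enough. A minimal sketch; `SUBSTITUTES` and `normalize_substitutes` are hypothetical names, and the upper-case Romanian pairs are included on the assumption that they are treated like the lower-case ones.

```python
# Map each character to the variant that is more common in practice.
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
    "\u06CC": "\u064A",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    return text.translate(SUBSTITUTES)
```

The direction follows the slides: the orthographically proper character is folded into the substitute because the substitute is what most text actually contains.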
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
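Folding representation 2 into representation 1 is what Unicode canonical composition (NFC) does for these characters. The sketch below uses Python's standard `unicodedata` module; using NFC here is an assumption made for illustration, and the slides do not say how ldig implements this step.

```python
import unicodedata


def compose(text):
    """Compose base letter + combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


# ă (U+0103) followed by a combining tilde (U+0303) becomes ẵ (U+1EB5).
assert compose("\u0103\u0303") == "\u1eb5"
```

After this step both spellings of a toned vowel produce the same code point, so they no longer split into separate features.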
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for personal names
  – Common frequent characters are too strong
    • e.g. a text that contains “的” tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from the product of (1) and (2)
Maximal Substring and Infinity-Gram
bull Frequencies of substrings that have a containment relationship always equal
bull In the model with linear combination of features it is possible to enclose the common feature values
bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Although the equivalence collapses for test set
we assumes that it can be approximated by a sufficiently large training set
31
Extended Suffix Array
bull Extended Suffix Array consists of
ndash SA=Suffix Array
ndash L=Longest Common Prefixes
ndash B=Burrows-Wheelers Transformed text
bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type
ndash They can be calculated on linear time
bull esaxx Okanoharas implement of ESA
ndash httpcodegooglecompesaxx
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 32
via [Okanohara+ 09]
Corpus and Normalization
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 33
Target Languages
bull Limit character type to detect
ndash In short text detection mixed text can be
divided to type of characters
bull Latin alphabet language
ndash The most difficult alphabet type to detect
ndash Languages which speakers are over 5
million are more than 25
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 34
Whats Latin Alphabet
bull Latin alphabet ne ascii alphabet
ndash aring ą aelig eth Ħ ŋ and so on
bull They are assigned to 9 code blocks in Unicode
Range Name Supplement
U+0000-007F Basic Latin ascii
U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A
U+0180-024F Latin Extended-B Rumanian
U+0250-02AF IPA Extensions
U+0300-036F Combining Diacritical Marks for tone symbol composition
U+1E00-1EFF Latin Extended Additional Vietnamese
U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 35
Latin Alphabets in Unicode Codepoint Chart
for Vietnamese only use often use sometimes
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 36
How to Create Corpus
bull Collect tweets with sample method of
twitter Streaming API
ndash Sampling 1 of all tweets (about 2
million tweets)
ndash Tweets in Latin alphabet language
account for 60 of them
bull The rest is only to annotate language
labels to these tweets
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 37
Language Label Annotation
bull Group tweets by their timezone
ndash French tweets account for about 1 of all ones
ndash But they account for 50 of ones in Paris
timezone only
bull Annotate tentative labels to tweets using
langdetect
ndash Remove non-French tweets from ones labeled lsquofrrsquo
ndash Recover French tweets from ones not labeled lsquofrrsquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 38
( 20 of the whole tweets have no timezone)
How to annotate
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 39
Swedish Norwegian Danish Vietnamese Lithuanian
Czech Hungarian Catalan Rumanian and Polish guides in turn
Created Corpus
bull Noiseless tweets for training data
bull Noiseful tweets with more than 3 words as test data
bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 40
language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488
total 538789 166773
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 doesn’t include many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – face marks made of letters, like XD and :p
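The removal step can be sketched with a few regular expressions. The patterns below are assumptions made for illustration (ldig's actual patterns are not shown in the slides), and `strip_twitter_noise` is a hypothetical name.

```python
import re

# Illustrative patterns only; real tweets need more careful rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
FACE = re.compile(r"(?<!\w)(?:[xX]D+|:[pP])(?!\w)")  # XD, xDDD, :p
SPACES = re.compile(r"\s+")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT and letter-based face marks."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return SPACES.sub(" ", text).strip()
```

For example, `strip_twitter_noise("RT @shuyo nice talk XD http://example.com/x #nlp")` leaves only `nice talk`.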
Normalization for twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive repetitions of the same Latin letter in its orthography
    • In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
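Case 2 fits in a single regular expression. This is a sketch, not ldig's code: `shrink_repeats` is a hypothetical name, and for brevity it shrinks any repeated character rather than Latin letters only.

```python
import re

# A character followed by 2 or more copies of itself (3+ in a row).
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink every run of 3 or more identical characters to 2."""
    return REPEATED.sub(r"\1\1", text)
```

Note that the result for ‘coooooooollllll’ is ‘cooll’, not ‘cool’: the aim is a stable feature string for detection, not spelling correction.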
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to double
  – hahahha ⇒ haha
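A rough sketch of laugh shrinking, assuming lowercased input and assuming a laugh is a word made of 3 or more repetitions of one consonant from `h`, `j`, `k` plus one vowel. Both the pattern and the name `shrink_laugh` are illustrative guesses, not ldig's implementation.

```python
import re

# consonant(s) + vowel(s), followed by at least 2 more such syllables
# using the same consonant and the same vowel; sloppy doubling such as
# the "hh" in "hahahha" is tolerated.
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+){2,}\1*\b")


def shrink_laugh(text: str) -> str:
    """Shrink a long laugh to two syllables, e.g. hahahha -> haha."""
    return LAUGH.sub(r"\1\2\1\2", text)
```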
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input: multiple-line text
    • Replace TABs with “ ” and line feeds with U+0001 in it
  – Output: “[feature]\t[frequency]”
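The input and output conventions above can be sketched as two small helpers. This is a reading of the slide, not the actual ldig or maxsubst code: `encode_for_extractor` and `parse_extractor_output` are hypothetical names, and `min_freq` only mirrors the idea of the `--ff` lower limit.

```python
def encode_for_extractor(texts):
    """Join texts into one string: TAB -> space, line feed -> U+0001."""
    joined = "\n".join(texts)
    return joined.replace("\t", " ").replace("\n", "\u0001")


def parse_extractor_output(lines, min_freq=1):
    """Read '[feature]<TAB>[frequency]' lines into a dict, dropping rare ones."""
    features = {}
    for line in lines:
        # The frequency is the part after the last TAB on the line.
        feature, _, freq = line.rstrip("\n").rpartition("\t")
        if feature and int(freq) >= min_freq:
            features[feature] = int(freq)
    return features
```

Because TABs are replaced before extraction, a feature never contains a TAB, so splitting on the last TAB is unambiguous.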
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize all the parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
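Reading one line of this format might look like the sketch below. It assumes the label is the first TAB-separated field, the text is the last one, and everything in between is metadata (as the Catalan example suggests); `parse_line` is a hypothetical name, not an ldig function.

```python
def parse_line(line: str):
    """Split '[label]<TAB>[meta data]<TAB>[text]' into its three parts."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # zero or more metadata fields
    return label, metadata, text
```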
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
language         size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan       5093    4923    4857     98.66         95.37      95.3     97.0
cs Czech         7681    7668    7663     99.93         99.77      96.3     99.7
da Danish        5516    5472    5310     97.04         96.27      94.5     92.4
de German        10060   10069   10006    99.37         99.46      86.6     93.8
en English       10162   10133   10029    98.97         98.69      88.3     95.0
es Spanish       10244   10284   10120    98.41         98.79      91.5     96.0
fi Finnish       7051    7038    7024     99.80         99.62      98.9     99.6
fr French        10074   10134   10051    99.18         99.77      95.0     98.1
hu Hungarian     4904    4892    4858     99.30         99.06      85.8     95.5
id Indonesian    10178   10225   10160    99.36         99.82      89.7     98.9
it Italian       10143   10205   10103    99.00         99.61      96.2     98.0
nl Dutch         10005   9916    9858     99.42         98.53      69.5     97.4
no Norwegian     8504    8432    8201     97.26         96.44      96.0     96.3
pl Polish        10151   10149   10130    99.81         99.79      98.0     99.7
pt Portuguese    10212   10201   10119    99.20         99.09      88.0     96.9
ro Romanian      5913    5867    5850     99.71         98.93      92.8     97.4
sv Swedish       10025   10093   9942     98.50         99.17      96.0     97.9
tr Turkish       10308   10317   10298    99.82         99.90      97.6     99.5
vi Vietnamese    10487   10480   10474    99.94         99.88      98.7     99.2
total            166711  -       165053   -             99.01      92.2     97.4
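The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. The sketch below recomputes them and shows the 0.6 rejection threshold; `precision_recall` and `decide` are illustrative names, and the probability dictionary is an assumed input format.

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probabilities: dict, threshold: float = 0.6):
    """Return the most probable label, or None when it is below the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None
```

For the Catalan row, `precision_recall(5093, 4923, 4857)` rounds to 98.66 and 95.37, matching the table.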
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language      size   detect  correct  precision(%)  recall(%)
de German     1479   1476    1469     99.5          99.3
en English    1505   1502    1490     99.2          99.0
es Spanish    1562   1548    1541     99.6          98.7
fr French     1551   1549    1540     99.4          99.3
it Italian    1539   1531    1528     99.8          99.3
nl Dutch      1430   1429    1424     99.7          99.6
total         9066   -       8992     -             99.2
Estimation for Europarl Dataset
(ldig is evaluated only on the languages it supports)
                         ldig              langdetect        CLD
language         size    correct  rate(%)  correct  rate(%)  correct  rate(%)
bg Bulgarian     1000    -        -        988      98.8     991      99.1
cs Czech         1000    1000     100.0    994      99.4     995      99.5
da Danish        1000    976      97.6     968      96.8     932      93.2
de German        1000    999      99.9     998      99.8     1000     100.0
el Greek         1000    -        -        1000     100.0    1000     100.0
en English       1000    999      99.9     996      99.6     1000     100.0
es Spanish       1000    1000     100.0    996      99.6     989      98.9
et Estonian      1000    -        -        996      99.6     998      99.8
fi Finnish       1000    997      99.7     998      99.8     1000     100.0
fr French        1000    999      99.9     999      99.9     992      99.2
hu Hungarian     1000    1000     100.0    999      99.9     999      99.9
it Italian       1000    999      99.9     999      99.9     996      99.6
lt Lithuanian    1000    -        -        997      99.7     999      99.9
lv Latvian       1000    -        -        999      99.9     998      99.8
nl Dutch         1000    1000     100.0    974      97.4     995      99.5
pl Polish        1000    998      99.8     999      99.9     997      99.7
pt Portuguese    1000    995      99.5     996      99.6     989      98.9
ro Romanian      1000    1000     100.0    999      99.9     998      99.8
sk Slovak        1000    -        -        988      98.8     990      99.0
sl Slovene       1000    -        -        976      97.6     963      96.3
sv Swedish       1000    995      99.5     991      99.1     993      99.3
total            21000   13957    99.7     20850    99.3     20814    99.1
Conclusions
• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will go up further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will go up further
  – How to add and train metadata at low cost?
• Desire to shrink the model without loss of precision
  – Too large for applications (> 100MB)
References
• [中谷 (Nakatani) NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the Suffix Tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara’s implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]
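To make the notion of a maximal substring concrete, here is a brute-force Python sketch. It is not the linear-time ESA algorithm and not esaxx: it simply counts every substring and keeps those that occur at least twice and lose occurrences whenever they are extended by one character on either side. `maximal_substrings` is a hypothetical name, and the quadratic counting is only suitable for short strings.

```python
from collections import Counter


def maximal_substrings(text: str, min_count: int = 2) -> dict:
    """Return {substring: count} for maximal substrings of text."""
    n = len(text)
    # Count every occurrence of every substring (overlaps included).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = {}
    for sub, count in counts.items():
        if count < min_count:
            continue
        # Not maximal if some one-character extension occurs just as often.
        extended = any(
            counts.get(c + sub, 0) == count or counts.get(sub + c, 0) == count
            for c in alphabet
        )
        if not extended:
            result[sub] = count
    return result
```

For "abracadabra" only "a" (5 occurrences) and "abra" (2 occurrences) remain; "b", "br", "bra" and the like are absorbed because they always appear inside "abra".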
Corpus and Normalization
Target Languages
• Limit the character type to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers
What’s Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range          Name                         Supplement
U+0000-007F    Basic Latin                  ASCII
U+0080-00FF    Latin-1 Supplement           Most languages are covered with these two blocks
U+0100-017F    Latin Extended-A             (as above)
U+0180-024F    Latin Extended-B             Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF    Latin Extended Additional    Vietnamese
U+2C60-2C7F    Latin Extended-C             Hardly used by any present-day language
U+A720-A7FF    Latin Extended-D             (as above)
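The block table translates directly into a lookup, which is one way to decide whether a character belongs to the Latin alphabet in the sense used here. The ranges are copied from the table; `LATIN_BLOCKS` and `latin_block` are illustrative names.

```python
# (first code point, last code point, block name), as listed in the table.
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]


def latin_block(ch: str):
    """Return the name of the Latin block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in LATIN_BLOCKS:
        if start <= cp <= end:
            return name
    return None
```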
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code point chart of Latin letters; legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
(20% of all tweets have no timezone)
How to annotate
[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation
language         training  test
ca Catalan       9089      5082
cs Czech         9082      7682
da Danish        7388      5524
de German        44448     10065
en English       44520     10168
es Spanish       44118     10265
fi Finnish       8087      7050
fr French        44339     10098
hu Hungarian     10030     4904
id Indonesian    44722     10181
it Italian       43366     10152
nl Dutch         44682     10007
no Norwegian     10124     8496
pl Polish        16771     10152
pt Portuguese    44215     10208
ro Romanian      10021     5911
sv Swedish       44054     10032
tr Turkish       44703     10308
vi Vietnamese    15030     10488
total            538789    166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still gets at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest data
  – In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement
[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
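The approximation described above (whole copies plus a remainder drawn without replacement) might be sketched as follows. `level_out` is a hypothetical name, and the sketch assumes a non-empty list of samples for each language.

```python
import random


def level_out(samples, target_size, rng=None):
    """Grow samples to target_size: whole copies plus a random remainder."""
    rng = rng or random.Random()
    samples = list(samples)
    copies, rest = divmod(target_size, len(samples))
    # Whole copies keep every tweet equally represented; only the
    # remainder is drawn at random, without replacement.
    return samples * copies + rng.sample(samples, rest)
```

Compared with true sampling with replacement, every original tweet appears either `copies` or `copies + 1` times, so no tweet is dropped by chance.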
Convert to Lowercase on Multiple Languages
• Conversion into lower case saves corpus and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding ‘I’
                       Upper case   Lower case
Turkish, Azerbaijani   I (U+0049)   ı (U+0131)
                       İ (U+0130)   i (U+0069)
Others                 I (U+0049)   i (U+0069)
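A minimal sketch of lowercasing everything except ‘I’, assuming character-by-character conversion is acceptable; `lower_except_capital_i` is an illustrative name, not ldig's function.

```python
def lower_except_capital_i(text: str) -> str:
    """Lowercase every character except the capital letter I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Python's `str.lower()` maps İ (U+0130) to `i` followed by U+0307 (combining dot above), so a real implementation may want to treat that letter explicitly as well.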
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 types of characters for s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
  – Indistinguishable
ș ş  (U+0219, U+015F)
ț ţ  (U+021B, U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires ‘s/t with comma’, they were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)
[Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest]
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
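One way to normalize representation 2 into representation 1 is Unicode NFC composition from the Python standard library. This is a sketch under the assumption that composing every combining sequence (not only Vietnamese ones) is acceptable; the slides do not say which mechanism the author used, and compose_vietnamese is an illustrative name.

```python
import unicodedata


def compose_vietnamese(text: str) -> str:
    """Merge base letters and combining marks into precomposed code points.

    NFC composes every sequence for which Unicode defines a precomposed
    character, so it also affects non-Vietnamese text.
    """
    return unicodedata.normalize("NFC", text)
```

With the example from the slide, the two-code-point form U+0103 U+0303 becomes the single code point U+1EB5.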
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to its representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ascii alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 lacks many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, such as XD or :p
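The removals listed above could be sketched with a few regular expressions. The patterns below are assumptions chosen for illustration (they are not copied from ldig), and the emoticon pattern covers only a handful of letter-based faces.

```python
import re

# All patterns are illustrative assumptions, not ldig's own expressions.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b:?")
# Letter-based faces such as XD, xDDD, :p, ;D standing alone between spaces.
EMOTICON = re.compile(r"(?<!\S)(?:[xX]D+|[:;]-?[pPdD)(])(?!\S)")
SPACES = re.compile(r"\s+")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return SPACES.sub(" ", text).strip()
```

URLs are removed first so that a ‘#’ or ‘@’ inside a link is not treated as a hashtag or mention.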
Normalization for twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
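Case 2 can be sketched as a single substitution. The assumptions here are that “3 or more” is the intended threshold and that Latin letters can be approximated by a-z, A-Z and the range U+00C0-U+024F; shrink_repeats is an illustrative name.

```python
import re

# A Latin letter (approximated by ASCII letters plus U+00C0-U+024F)
# followed by at least two more copies of itself.
REPEATED_LATIN = re.compile(r"([a-zA-Z\u00c0-\u024f])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink runs of 3 or more identical Latin letters to 2."""
    return REPEATED_LATIN.sub(r"\1\1", text)
```

Note that this yields ‘cooll’ rather than ‘cool’ for the example above; unlike the dictionary of Case 1 it does not recover the original spelling, it only makes lengthened variants collapse to the same string. Japanese runs such as かたたたき are left untouched because the pattern is limited to Latin letters.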
Laugh Normalization
• Laughs are written differently in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to a double repetition
  – hahahha ⇒ haha
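A rough sketch of laugh shrinking follows. It assumes lowercase input and that a laugh is a repeated “consonant + vowel” syllable starting with h, j or k, tolerating doubled letters as in ‘hahahha’. The pattern is an illustration, not ldig’s own expression.

```python
import re

# One h/j/k syllable followed by one or more repeats of the same syllable;
# "+" after each part tolerates typos such as "hha" or "haa".
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text: str) -> str:
    """Reduce laughs such as 'hahahha' or 'jajajaja' to two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)
```

A laugh that is already doubled (‘haha’) matches and is rewritten to itself, so applying the function twice gives the same result as applying it once.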
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output lines have the form “[feature]\t[frequency]”
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize the whole parameter set
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
• Examples (fields are TAB-separated)
  – en  u should just enjoy ur vacation sadly
  – en  D im online but you arent RT that much
  – en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  – ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
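A line in this format could be read as follows. This is a sketch that assumes the first TAB-separated field is the label, the last is the text and everything in between is metadata (the Catalan example suggests several metadata fields). Record and parse_line are illustrative names, not part of ldig.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    label: str           # correct language label, e.g. "en"
    metadata: List[str]  # zero or more metadata fields
    text: str            # the tweet itself


def parse_line(line: str) -> Record:
    """Split one TAB-separated corpus line into label, metadata and text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Record(fields[0], fields[1:-1], fields[-1])
```

For instance, parse_line("en\t\thello") gives the label "en", one empty metadata field and the text "hello".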
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Estimation
language          size  detect  correct  precision  recall  LD53  LDsm
ca Catalan        5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech          7681    7668     7663      99.93   99.77  96.3  99.7
da Danish         5516    5472     5310      97.04   96.27  94.5  92.4
de German        10060   10069    10006      99.37   99.46  86.6  93.8
en English       10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish       10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish        7051    7038     7024      99.80   99.62  98.9  99.6
fr French        10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian      4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian    10178   10225    10160      99.36   99.82  89.7  98.9
it Italian       10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch         10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian      8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish        10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese    10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian       5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish       10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish       10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese    10487   10480    10474      99.94   99.88  98.7  99.2
total           166711            165053                 99.01  92.2  97.4
• precision, recall, LD53 and LDsm are in %
• LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
• Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of ‘detect’ is less than the sum of ‘size’
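The precision and recall columns are consistent with precision = correct / detect and recall = correct / size; the slides do not state these formulas, so treat them as an inference from the numbers. The sketch below reproduces the Catalan row and illustrates the 0.6 threshold. The function names and the probability dictionaries are made up for the example.

```python
from typing import Dict, Optional


def precision(correct: int, detect: int) -> float:
    """Share (in %) of the texts detected as a language that are right."""
    return 100.0 * correct / detect


def recall(correct: int, size: int) -> float:
    """Share (in %) of the texts of a language that are detected correctly."""
    return 100.0 * correct / size


def detect(probabilities: Dict[str, float],
           threshold: float = 0.6) -> Optional[str]:
    """Return the most probable language, or None when it is below 0.6."""
    lang, prob = max(probabilities.items(), key=lambda item: item[1])
    return lang if prob >= threshold else None
```

For Catalan (size 5093, detect 4923, correct 4857) this gives 98.66 and 95.37, matching the table. A text that splits its probability between two close languages, say 0.55 and 0.45, is counted as undetectable.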
Estimation for LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used
language          size  detect  correct  precision  recall
de German         1479    1476     1469       99.5    99.3
en English        1505    1502     1490       99.2    99.0
es Spanish        1562    1548     1541       99.6    98.7
fr French         1551    1549     1540       99.4    99.3
it Italian        1539    1531     1528       99.8    99.3
nl Dutch          1430    1429     1424       99.7    99.6
total             9066             8992               99.2
• precision and recall are in %
Estimation for Europarl Dataset
                        ldig            langdetect      CLD
language          size  correct   rate  correct   rate  correct   rate
bg Bulgarian      1000        -      -      988   98.8      991   99.1
cs Czech          1000     1000  100.0      994   99.4      995   99.5
da Danish         1000      976   97.6      968   96.8      932   93.2
de German         1000      999   99.9      998   99.8     1000  100.0
el Greek          1000        -      -     1000  100.0     1000  100.0
en English        1000      999   99.9      996   99.6     1000  100.0
es Spanish        1000     1000  100.0      996   99.6      989   98.9
et Estonian       1000        -      -      996   99.6      998   99.8
fi Finnish        1000      997   99.7      998   99.8     1000  100.0
fr French         1000      999   99.9      999   99.9      992   99.2
hu Hungarian      1000     1000  100.0      999   99.9      999   99.9
it Italian        1000      999   99.9      999   99.9      996   99.6
lt Lithuanian     1000        -      -      997   99.7      999   99.9
lv Latvian        1000        -      -      999   99.9      998   99.8
nl Dutch          1000     1000  100.0      974   97.4      995   99.5
pl Polish         1000      998   99.8      999   99.9      997   99.7
pt Portuguese     1000      995   99.5      996   99.6      989   98.9
ro Romanian       1000     1000  100.0      999   99.9      998   99.8
sk Slovak         1000        -      -      988   98.8      990   99.0
sl Slovene        1000        -      -      976   97.6      963   96.3
sv Swedish        1000      995   99.5      991   99.1      993   99.3
total            21000    13957   99.7    20850   99.3    20814   99.1
• rate is in %
• The ldig columns cover only the languages ldig supports (14 of the 21), so its total is out of 14000 texts
Conclusions
• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect with the tweet corpus reaches 97% accuracy
• If the corpus is better maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How to add and train metadata at low cost?
• Desire to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese; original title: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers
What’s Latin Alphabet?
• Latin alphabet ≠ ascii alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode
Range         Name                         Supplement
U+0000-007F   Basic Latin                  ascii
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             Most languages are covered with these
U+0180-024F   Latin Extended-B             Rumanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D             Used by almost no present-day languages
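The table can be turned into a simple membership test. This is a sketch: the ranges are copied from the table, while LATIN_BLOCKS and in_latin_blocks are illustrative names. It tests block membership only, so digits and punctuation in Basic Latin also count.

```python
# (first, last) code point of each of the 9 blocks in the table above.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch: str) -> bool:
    """True if the single character ch lies in one of the 9 blocks."""
    code_point = ord(ch)
    return any(first <= code_point <= last for first, last in LATIN_BLOCKS)
```

A check like this is one way to split mixed text by character type before running a Latin-only detector.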
Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code chart of Latin letters; legend: for Vietnamese only / used often / used sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
(20% of all tweets have no timezone)
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides, in that order]
Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe
language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – On its own it reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement
[Chart: share of tweets by language; legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
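The approximation in the sub-bullet (whole copies plus a remainder sampled without replacement) can be sketched as follows. level_out is a hypothetical helper; it assumes the corpus is a dictionary from language code to a non-empty list of tweets.

```python
import random
from typing import Dict, List, Optional


def level_out(corpus: Dict[str, List[str]],
              rng: Optional[random.Random] = None) -> Dict[str, List[str]]:
    """Grow every language to the size of the largest one."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # Whole copies first, then the remainder drawn without replacement.
        balanced[lang] = tweets * copies + rng.sample(tweets, rest)
    return balanced
```

Compared with plain sampling with replacement, this guarantees that every original tweet appears at least once in the levelled data.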
Convert to Lowercase on Multiple Languages
• Conversion to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding ‘I’
Language              Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
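A minimal sketch of “lower case excluding ‘I’”, assuming that only U+0049 is exempted and every other character goes through Python’s default lowercasing; lower_except_i is an illustrative name.

```python
def lower_except_i(text: str) -> str:
    """Lowercase every character except the capital letter I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Leaving ‘I’ alone avoids deciding between i and ı before the language is known. Note that Python’s default lowercasing turns İ (U+0130) into ‘i’ followed by a combining dot (U+0307), so a real implementation may want to handle that letter explicitly as well.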
Normalization for Rumanian
• Rumanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, twitter and Wikipedia
• The 2 code points have the same design in some fonts
  – Indistinguishable
  – ș ş (U+0219, U+015F)
  – ț ţ (U+021B, U+0163)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URL
  – mention
  – hash tag
  – RT
  – face marks made of letters, like XD and :p
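The removal list above could be implemented with a few regular expressions. This is a rough sketch; the patterns are my own assumptions and certainly narrower than real tweets need, and `strip_twitter_noise` is a hypothetical name.

```python
import re

# Assumed patterns, one per item of the slide's list.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
FACE_RE = re.compile(r"(?<!\w)(?:[xX]D+|:[pPdD])(?!\w)")  # XD, xDDD, :p, :D


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hash tags, RT markers and letter-based face marks."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


cleaned = strip_twitter_noise("RT @bob check http://t.co/abc #fun XD so cool :p")
```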
Normalization for twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character continues 3 times or more, shrink it to 2
  – No language has 3 or more consecutive occurrences of the same Latin letter in its orthography
    • Japanese does have them: “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
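Case 2 fits in one regular expression. A minimal sketch, assuming "more than 3" on the slide means a run of three or more and that only Latin letters are shrunk; the character range and the name `shrink_repeats` are my assumptions.

```python
import re

# A Latin letter (ASCII or Latin-1 Supplement / Latin Extended-A/B) repeated 3+ times.
REPEAT_RE = re.compile(r"([a-zA-Z\u00C0-\u024F])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3 or more identical Latin letters down to 2."""
    return REPEAT_RE.sub(r"\1\1", text)
```

Restricting the pattern to Latin letters leaves the Japanese examples from the slide untouched.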
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
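A possible sketch of the laugh rule: treat a word built only from one of h/j/k and one of a/e/i, alternating at least twice, as laughter and rewrite it as two syllables. The letter sets, the lower-casing of the result and the name `normalize_laugh` are assumptions made for illustration; the slides do not give the actual rule.

```python
import re

# consonant run, vowel run, then at least one more consonant+vowel run, optional tail
LAUGH_RE = re.compile(r"\b([hjk])\1*([aei])\2*(?:\1+\2+)+\1*\b", re.IGNORECASE)


def normalize_laugh(text: str) -> str:
    """Shrink laughs such as 'hahahha' or 'Jajajajaja' to two syllables."""
    def two_syllables(match):
        syllable = (match.group(1) + match.group(2)).lower()
        return syllable * 2
    return LAUGH_RE.sub(two_syllables, text)
```

Because the pattern tolerates doubled letters inside the run, irregular spellings like the slide's `hahahha` are also covered, while ordinary words such as the Dutch `keek` do not match.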
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
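For readers unfamiliar with the cumulative penalty of [Tsuruoka+ 09], the per-weight clipping step can be sketched as follows. This is my reading of that paper, not code from ldig: `u` is the total L1 penalty every weight could have received so far, `q` is the penalty this particular weight has actually received, and all names are illustrative.

```python
def apply_cumulative_penalty(w: float, q: float, u: float):
    """Clip weight w toward zero by the penalty it has not yet received.

    Returns the new weight and the updated per-weight total q.
    """
    z = w
    if w > 0:
        w = max(0.0, w - (u + q))
    elif w < 0:
        w = min(0.0, w + (u - q))
    return w, q + (w - z)


# After each gradient step on a sample, u grows by (learning rate * regularizer)
# and only the weights of features active in that sample are clipped.
w1, q1 = apply_cumulative_penalty(0.5, 0.0, 0.1)
w2, q2 = apply_cumulative_penalty(0.05, 0.0, 0.1)
```

The weight is never pushed across zero, which is what drives many parameters to exactly 0.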
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output is “[feature]\t[frequency]”
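The input and output conventions of the extractor can be illustrated like this. It is a sketch of the format only; both function names are hypothetical, and whether the replacement happens inside `maxsubst` or before calling it is not something the slide settles.

```python
def prepare_extractor_input(text: str) -> str:
    """Replace TABs with a space and line feeds with U+0001, as the slide describes."""
    return text.replace("\t", " ").replace("\n", "\u0001")


def parse_extractor_output(line: str):
    """Split one '[feature]\\t[frequency]' output line into (feature, frequency)."""
    feature, frequency = line.rstrip("\n").rsplit("\t", 1)
    return feature, int(frequency)
```

Since TABs are removed from the input, a TAB can safely separate the feature from its frequency in the output.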
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[metadata]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
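A record in this format can be read with a simple split on TAB. This sketch assumes the first field is the label and the last field is the text, with anything in between treated as metadata; that reading of the Catalan example, and the name `parse_record`, are my assumptions.

```python
def parse_record(line: str):
    """Split '[correct label]\\t[metadata]\\t[text]' into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # may hold several fields, or a single empty one
    return label, metadata, text
```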
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
(precision, recall, LD53 and LDsm are in %)
language        size   detect  correct  precision  recall  LD53  LDsm
ca Catalan      5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech        7681    7668     7663      99.93   99.77  96.3  99.7
da Danish       5516    5472     5310      97.04   96.27  94.5  92.4
de German      10060   10069    10006      99.37   99.46  86.6  93.8
en English     10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish     10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish      7051    7038     7024      99.80   99.62  98.9  99.6
fr French      10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian    4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian  10178   10225    10160      99.36   99.82  89.7  98.9
it Italian     10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch       10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian    8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish      10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese  10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian     5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish     10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish     10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese  10487   10480    10474      99.94   99.88  98.7  99.2
total         166711           165053              99.01  92.2  97.4
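The precision and recall columns follow from the three count columns: precision = correct / detect and recall = correct / size. A small check against the Catalan row (the function name is illustrative):

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent from the table's count columns."""
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row: size 5093, detect 4923, correct 4857
precision, recall = precision_recall(5093, 4923, 4857)
print(f"{precision:.2f} {recall:.2f}")  # prints "98.66 95.37"
```

The total row works the same way: 165053 correct out of 166711 texts gives 99.01%.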
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
Using the 19-language model (precision and recall in %)
language     size  detect  correct  precision  recall
de German    1479   1476     1469       99.5    99.3
en English   1505   1502     1490       99.2    99.0
es Spanish   1562   1548     1541       99.6    98.7
fr French    1551   1549     1540       99.4    99.3
it Italian   1539   1531     1528       99.8    99.3
nl Dutch     1430   1429     1424       99.7    99.6
total        9066            8992               99.2
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports (rate in %)
                       ldig            langdetect      CLD
language       size    correct  rate   correct  rate   correct  rate
bg Bulgarian   1000                      988    98.8     991    99.1
cs Czech       1000     1000   100.0     994    99.4     995    99.5
da Danish      1000      976    97.6     968    96.8     932    93.2
de German      1000      999    99.9     998    99.8    1000   100.0
el Greek       1000                     1000   100.0    1000   100.0
en English     1000      999    99.9     996    99.6    1000   100.0
es Spanish     1000     1000   100.0     996    99.6     989    98.9
et Estonian    1000                      996    99.6     998    99.8
fi Finnish     1000      997    99.7     998    99.8    1000   100.0
fr French      1000      999    99.9     999    99.9     992    99.2
hu Hungarian   1000     1000   100.0     999    99.9     999    99.9
it Italian     1000      999    99.9     999    99.9     996    99.6
lt Lithuanian  1000                      997    99.7     999    99.9
lv Latvian     1000                      999    99.9     998    99.8
nl Dutch       1000     1000   100.0     974    97.4     995    99.5
pl Polish      1000      998    99.8     999    99.9     997    99.7
pt Portuguese  1000      995    99.5     996    99.6     989    98.9
ro Romanian    1000     1000   100.0     999    99.9     998    99.8
sk Slovak      1000                      988    98.8     990    99.0
sl Slovene     1000                      976    97.6     963    96.3
sv Swedish     1000      995    99.5     991    99.1     993    99.3
total         21000    13957    99.7   20850    99.3   20814    99.1
Conclusions
• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Latin Alphabets in Unicode Codepoint Chart
[Chart legend: for Vietnamese only / use often / use sometimes]
How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’
How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides, in turn]
Created Corpus
• Noiseless tweets for training data
• Noiseful tweets with more than 3 words as test data
• Work with Raúl Velaz and Hiroshi Manabe for Catalan corpus creation
language        training    test
ca Catalan         9089    5082
cs Czech           9082    7682
da Danish          7388    5524
de German         44448   10065
en English        44520   10168
es Spanish        44118   10265
fi Finnish         8087    7050
fr French         44339   10098
hu Hungarian      10030    4904
id Indonesian     44722   10181
it Italian        43366   10152
nl Dutch          44682   10007
no Norwegian      10124    8496
pl Polish         16771   10152
pt Portuguese     44215   10208
ro Romanian       10021    5911
sv Swedish        44054   10032
tr Turkish        44703   10308
vi Vietnamese     15030   10488
total            538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – But it reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement
[Chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
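Levelling the data sizes by sampling with replacement might look like this. It is a minimal sketch with a hypothetical corpus layout (a dict from language code to a list of tweets); it implements the plain version, not the copy-and-sample approximation the slide mentions.

```python
import random


def balance_by_resampling(corpus: dict, seed: int = 0) -> dict:
    """Resample every language with replacement up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    return {
        lang: rng.choices(tweets, k=target)  # choices() samples with replacement
        for lang, tweets in corpus.items()
    }


balanced = balance_by_resampling({"en": ["a", "b", "c", "d"], "ca": ["x", "y"]})
```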
Convert to Lowercase on Multiple Languages
• Conversion to lower case saves corpus and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding ‘I’
                      Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
                      İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
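"Convert to lower case, excluding ‘I’" can be sketched character by character. The function name is hypothetical, and leaving only U+0049 untouched is my literal reading of the slide.

```python
def lower_except_i(text: str) -> str:
    """Lower-case every character except 'I' (U+0049), whose lower case is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


converted = lower_except_i("ISTANBUL Is Big")
```

Note that Python's own `str.lower()` turns İ (U+0130) into `i` followed by a combining dot (U+0307), so a real implementation would probably need an explicit rule for that character as well.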
Normalization for Rumanian
• Rumanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of characters for s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular on news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
  – Indistinguishable
ș ş  U+0219 U+015F
ț ţ  U+021B U+0163
Rumanian Character Affairs on PC
• Although Romanian orthography requires ‘s/t with comma’, they were not available on PCs until recently
  – 1989: Democratization in Rumania
  – 2001: ‘s/t with comma’ was provided by ISO8859-16 (Latin-10) and Unicode
  – 2007: Rumania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Estimation for Europarl Dataset
Only supported languages for ldig
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 66
ldig langdetect CLDlanguage size correct rate correct rate correct rate
bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993
total 21000 13957 997 20850 993 20814 991
Conclusions
bull Language detector using maximal substring model
ndash Detect over 99 accuracy for 19 languages
ndash langdetect with tweet corpus even has 97 accuracy
bull If the corpus is maintained the precision will be still up
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to features the precision will be still up
ndash How to add and train metadata at low cost
bull Desire to shrink the model without loss of precision
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
    (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on creating the Catalan corpus

language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773
Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
Chart legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others
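The levelling step can be sketched in Python as follows. This is a minimal illustration, not ldig's actual code: the function name `balance_by_oversampling` and the dict-of-lists corpus layout are assumptions made for the example. It follows the approximation described on the slide: copy each language's data an integer number of times, then draw the remainder without replacement.

```python
import random

def balance_by_oversampling(corpus, seed=0):
    """corpus: dict mapping a language label to a non-empty list of texts
    (a hypothetical layout chosen for this sketch)."""
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    balanced = {}
    for lang, texts in corpus.items():
        copies, rest = divmod(target, len(texts))
        # Whole copies of the data, plus a remainder drawn without replacement.
        balanced[lang] = texts * copies + rng.sample(texts, rest)
    return balanced

corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
balanced = balance_by_oversampling(corpus)
print({lang: len(texts) for lang, texts in balanced.items()})
```

After balancing, every language contributes as many samples as the largest one, so no language dominates the training simply because it has more tweets.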
Convert to Lowercase on Multiple Languages
• Converting to lower case saves on corpus size and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
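A minimal Python sketch of the rule "convert to lower case, excluding 'I'". The function name `lower_except_capital_i` is hypothetical, and only U+0049 is excluded, as the slide states; how ldig treats İ (U+0130) is not covered here.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower case
    depends on the language (ı in Turkish, i elsewhere)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)

print(lower_except_capital_i("ISTANBUL Is Big"))
```

Leaving 'I' untouched keeps the Turkish/non-Turkish distinction in the features instead of forcing one language's casing rule onto all of them.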
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of characters for s and t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 code points have the same design in some fonts
  – Indistinguishable
ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)
's/t with cedilla' is used on an advertisement board in Bucharest
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
ț → ţ (U+021B → U+0163)
I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ さ)
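The mapping from the comma-below forms to the cedilla forms can be sketched with a translation table. This is an illustrative Python snippet; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical and not taken from ldig.

```python
# Comma-below forms mapped to the more widely used cedilla forms.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})

def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)

print(normalize_romanian("\u0219i \u021bar\u0103"))
```

Normalizing toward the more frequent form means less of the corpus has to be rewritten, and both spellings end up sharing the same features.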
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
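In code, the rule "regard U+06CC as U+064A" is a single replacement. A minimal Python sketch, with hypothetical names:

```python
FARSI_YEH = "\u06CC"   # ی
ARABIC_YEH = "\u064A"  # ي

def normalize_yeh(text):
    """Map Farsi yeh to Arabic yeh so both spellings share one code point."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```

Applying the same function to both the training corpus and the input text keeps the profile and the text to be detected consistent.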
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents like news
• The tone marks can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with diacritical marks
    • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
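One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is a swapped-in technique used for illustration: the slides do not say how the normalization is implemented, and the function name `normalize_vietnamese` is hypothetical.

```python
import unicodedata

def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed code points
    wherever Unicode defines one (NFC)."""
    return unicodedata.normalize("NFC", text)

combined = "\u0103\u0303"  # ă followed by a combining tilde
print(normalize_vietnamese(combined) == "\u1eb5")
```

With the slide's example, U+0103 U+0303 is composed into the single code point U+1EB5.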
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
    • Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists published for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hash tags
  – RT
  – emoticons made of letters, like XD and :p
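The removal step can be sketched with regular expressions. The patterns below are assumptions made for illustration (in particular the emoticon pattern only covers forms like XD, :p and :D); they are not the expressions used by ldig, and the function name `strip_twitter_noise` is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D (a deliberately small set).
EMOTICON_RE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")

def strip_twitter_noise(text):
    """Remove URLs, mentions, hash tags, RT markers and letter emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

print(strip_twitter_noise("RT @bob so true XD http://t.co/abc #fun"))
```

URLs are removed first so that a '#' or '@' inside a URL is not treated as a hash tag or mention.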
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • In Japanese there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
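Case 2 can be sketched with a back-reference in a regular expression. The threshold (a run of 3 or more) and the character range are assumptions for this example, and `shrink_repeats` is a hypothetical name rather than ldig's own function.

```python
import re

# A Latin letter followed by 2 or more copies of itself, i.e. a run of 3+.
# The range U+00C0-U+024F is a rough cover of accented Latin letters
# (it also contains the signs × and ÷).
REPEAT_RE = re.compile(r"([A-Za-z\u00C0-\u024F])\1{2,}")

def shrink_repeats(text):
    """Shrink any run of 3 or more identical Latin letters to 2."""
    return REPEAT_RE.sub(r"\1\1", text)

print(shrink_repeats("coooooooollllll"))
```

Because the pattern is restricted to Latin letters, Japanese words such as かたたたき pass through unchanged.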
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
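A possible sketch of laugh shrinking in Python. The pattern is an assumption built from the examples on the slide (one consonant out of h, j, k alternating with one vowel, tolerating doubled letters as in "hahahha"); it is not ldig's pattern, and `shrink_laugh` is a hypothetical name.

```python
import re

# One consonant (h, j or k) and one vowel alternating at least twice,
# allowing stray repeats of either letter, as in "hahahha" or "HHAHAHAH".
LAUGH_RE = re.compile(
    r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*\b", re.IGNORECASE
)

def shrink_laugh(text):
    """Replace a laugh of any length by two lowercase repetitions."""
    return LAUGH_RE.sub(lambda m: (m.group(1) + m.group(2)).lower() * 2, text)

print(shrink_laugh("Tafil con eso Jajajajajajaja"))
```

The shrunken form keeps the language cue (ja, je, hi, ke, ha) while removing the arbitrary length of the laugh.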
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output as "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID]  [datetime]  [userID]  [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
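Reading one record of this format can be sketched as below. The sketch assumes that fields are separated by TABs, that the label is the first field and the text the last, and that the metadata may itself span several TAB-separated fields (as the 'ca' example suggests). The function name and the sample values are hypothetical.

```python
def parse_record(line):
    """Split one '[correct label]\\t[meta data]\\t[text]' line.

    Returns (label, metadata_fields, text); everything between the first
    and the last field is treated as metadata (an assumption of this sketch).
    """
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text

# Hypothetical record with an empty metadata field.
print(parse_record("en\t\tim gettin attacked for a tweet\n"))
```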
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
(precision, recall, LD53 and LDsm in %)

language         size  detect  correct  precision  recall  LD53  LDsm
ca Catalan       5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech         7681    7668     7663      99.93   99.77  96.3  99.7
da Danish        5516    5472     5310      97.04   96.27  94.5  92.4
de German       10060   10069    10006      99.37   99.46  86.6  93.8
en English      10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish      10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish       7051    7038     7024      99.80   99.62  98.9  99.6
fr French       10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian     4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian   10178   10225    10160      99.36   99.82  89.7  98.9
it Italian      10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch        10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian     8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish       10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese   10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian      5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish      10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish      10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese   10487   10480    10474      99.94   99.88  98.7  99.2
total          166711           165053              99.01  92.2  97.4
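The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. A short Python check on the Catalan row; the function name is hypothetical:

```python
def precision_recall(detect, correct, size):
    """Return (precision, recall) in percent, rounded to 2 decimals.

    detect:  number of texts the detector labeled with this language
    correct: number of those that really are in this language
    size:    number of texts of this language in the test data
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)

# Catalan row: size 5093, detect 4923, correct 4857.
print(precision_recall(detect=4923, correct=4857, size=5093))  # (98.66, 95.37)
```

The same formula applied to the totals (165053 correct out of 166711) gives the overall figure of 99.01.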
Estimation for LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
(precision and recall in %)

language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066             8992               99.2
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports (rate in %)

                          ldig            langdetect      CLD
language         size     correct  rate   correct  rate   correct  rate
bg Bulgarian     1000     -        -      988      98.8   991      99.1
cs Czech         1000     1000     100.0  994      99.4   995      99.5
da Danish        1000     976      97.6   968      96.8   932      93.2
de German        1000     999      99.9   998      99.8   1000     100.0
el Greek         1000     -        -      1000     100.0  1000     100.0
en English       1000     999      99.9   996      99.6   1000     100.0
es Spanish       1000     1000     100.0  996      99.6   989      98.9
et Estonian      1000     -        -      996      99.6   998      99.8
fi Finnish       1000     997      99.7   998      99.8   1000     100.0
fr French        1000     999      99.9   999      99.9   992      99.2
hu Hungarian     1000     1000     100.0  999      99.9   999      99.9
it Italian       1000     999      99.9   999      99.9   996      99.6
lt Lithuanian    1000     -        -      997      99.7   999      99.9
lv Latvian       1000     -        -      999      99.9   998      99.8
nl Dutch         1000     1000     100.0  974      97.4   995      99.5
pl Polish        1000     998      99.8   999      99.9   997      99.7
pt Portuguese    1000     995      99.5   996      99.6   989      98.9
ro Romanian      1000     1000     100.0  999      99.9   998      99.8
sk Slovak        1000     -        -      988      98.8   990      99.0
sl Slovene       1000     -        -      976      97.6   963      96.3
sv Swedish       1000     995      99.5   991      99.1   993      99.3
total            21000    13957    99.7   20850    99.3   20814    99.1
Conclusions
• A language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  – How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)
References
• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on creating the Catalan corpus
language         training    test
ca Catalan           9089    5082
cs Czech             9082    7682
da Danish            7388    5524
de German           44448   10065
en English          44520   10168
es Spanish          44118   10265
fi Finnish           8087    7050
fr French           44339   10098
hu Hungarian        10030    4904
id Indonesian       44722   10181
it Italian          43366   10152
nl Dutch            44682   10007
no Norwegian        10124    8496
pl Polish           16771   10152
pt Portuguese       44215   10208
ro Romanian         10021    5911
sv Swedish          44054   10032
tr Turkish          44703   10308
vi Vietnamese       15030   10488
total              538789  166773
Simple Language Detection
• A language detector can be built from a maximal substring model and a twitter corpus
  - It still reaches at most 98% accuracy
• We guess that it is necessary to reduce bias
  - data size bias
  - language-specific bias
  - twitter-specific bias
Bias by Data Size
• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest one
  - In practice this is approximated by copying the data a whole number of times and sampling the remainder without replacement
[Chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
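The levelling described on the "Bias by Data Size" slide can be sketched roughly as follows. This is only an illustration and not the code used for ldig: the function name `level_out`, the dictionary layout and the fixed seed are assumptions, and every language is assumed to have at least one tweet.

```python
import random


def level_out(corpus, seed=0):
    """Oversample every language up to the size of the largest one.

    corpus: dict mapping a language label to a non-empty list of tweets
    (hypothetical layout, chosen for this sketch).
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        # Copy the data a whole number of times ...
        copies, rest = divmod(target, len(tweets))
        # ... and sample the remainder without replacement.
        balanced[lang] = tweets * copies + rng.sample(tweets, rest)
    return balanced
```

After the call, every language has as many tweets as the largest one, and no tweet of a small language is dropped.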
Convert to Lowercase on Multiple Languages
• Converting to lower case reduces the amount of corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'
                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
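A minimal sketch of "convert to lower case, excluding 'I'", assuming a plain per-character pass. The helper name `lower_except_i` is hypothetical and this is not taken from the ldig source.

```python
def lower_except_i(text):
    """Lowercase every character except the capital letter I (U+0049)."""
    # 'I' is left alone because its lower case depends on the language:
    # ı (U+0131) in Turkish and Azerbaijani, i (U+0069) elsewhere.
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

The capital 'I' then stays in the features as it is, so the model does not have to guess which lower-case form was meant.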
Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  - U+015E-F, U+0162-3: s/t with cedilla
  - U+0218-B: s/t with comma below
• 's/t with cedilla' is more common on news, twitter and Wikipedia
• The 2 kinds are drawn identically in some fonts
  - Indistinguishable
ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)
Romanian Character Affairs on PC
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  - 1989: Democratization of Romania
  - 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
Normalization for Substitute Characters
• 's/t with cedilla' are substitute characters
  - But they are more common than the proper ones
  - with cedilla : with comma = 2 : 1
  - The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  ț (U+021B) → ţ (U+0163)
I reckon this is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ)
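Regarding 's/t with comma' as 's/t with cedilla' amounts to a fixed character mapping. A sketch using Python's `str.translate` is below. The table and function names are hypothetical. The same pattern would apply to the Farsi/Arabic 'yeh' case by mapping U+06CC to U+064A.

```python
# Map the comma-below letters onto the more common cedilla substitutes.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)
```

Both spellings of a word then map to the same features, whichever keyboard or IME produced it.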
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  - Wikipedia uses only ی (U+06CC, Farsi yeh)
  - News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  - The popular Arabic charset CP-1256 has no character mapped to U+06CC
  - As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  - a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  - a ả à ã á ạ
  - These tone marks are also used in general documents such as news
• A tone mark can be attached to every vowel
  - 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  - Half and half on news and tweets
• Normalize 2 into 1
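One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard `unicodedata` module. Note that this is a stand-in for the slide's rule rather than the author's implementation: NFC composes every canonical sequence, not only the Vietnamese range U+1EA0 - U+1EF9. The function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text):
    """Turn base letter + combining mark sequences into precomposed letters."""
    # NFC is broader than the Vietnamese block: it composes any canonical
    # sequence, which is an assumption accepted for this sketch.
    return unicodedata.normalize("NFC", text)
```

With this, the slide's example ẵ written as U+0103 U+0303 becomes the single code point U+1EB5.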
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  - Other scripts have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  - Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  - (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      - 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)
Normalization for twitter
• Simply remove
  - URLs
  - mentions
  - hash tags
  - RT
  - emoticons made of letters, like XD and :p
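The removals listed on the "Normalization for twitter" slide can be sketched with a few regular expressions. The patterns below are assumptions made for illustration (in particular the emoticon pattern covers only a handful of forms) and are not the expressions used in ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as xD, XDDD, XP, :D, :p (a small, assumed set).
FACE = re.compile(r"(?<!\w)(?:[xX][DP]+|[:;=][DPp])(?!\w)")


def strip_twitter(text):
    """Remove URLs, mentions, hash tags, RT and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

URLs are removed first so that the later patterns never see fragments such as "://".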
Normalization for twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  - Unsupervised normalization like coooollll → cool
  - It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  - No language has runs of 3 or more of the same Latin letter in its orthography
    • In Japanese, though, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
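Case 2 can be written as a single substitution with a backreference. This sketch assumes the rule means "3 or more repetitions" and applies it to every character, not only Latin letters. The names are hypothetical.

```python
import re

# A character followed by at least two more copies of itself.
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT.sub(r"\1\1", text)
```

Note that the result is not the dictionary form: 'coooooooollllll' becomes 'cooll', not 'cool', which is acceptable here because the goal is only to stop arbitrarily long runs from producing endless distinct features.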
Laugh Normalization
• Each language has its own ways of writing laughter
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi :) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  - hahahha ⇒ haha
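A rough sketch of the laugh shrinking, assuming that a laugh is one of the consonants h, j or k followed by one vowel, repeated at least twice. The pattern and names are guesses made for illustration and not ldig's actual rule. The output is the captured consonant and vowel written twice, so mixed-case input such as "Hihihihi" comes out as "HiHi".

```python
import re

# ([hjk]) and ([aeiou]) capture the laugh's consonant and vowel; the
# backreferences require the same two letters throughout, while still
# tolerating stray doubled letters such as the "hh" in "hahahha".
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a repeated laugh to two repetitions of its syllable."""
    return LAUGH.sub(lambda m: (m.group(1) + m.group(2)) * 2, text)
```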
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  - Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  - Extract features from the corpus and initialize the model
  - -m: model directory
  - -x: path of the maximal substring extractor (executed as an external process)
  - --ff: ignore features whose frequency is below the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  - Input is multi-line text
    • TABs are replaced with " " and line feeds with U+0001 in it
  - Output is "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Train the model on the corpus for 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: coefficient of L1 regularization
  - --wr: number of times all parameters are regularized
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  - Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  - Detect the languages of the test data and output the results and a summary
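Usage (1) to (4) form a pipeline: initialize, learn, shrink, detect. The sketch below only builds the four command lines as argument lists, using the flags shown on the slides. The helper name, the `python` launcher, and the default values for `--ff`, `-e`, `-r` and `--wr` are placeholders invented for illustration, not recommended settings.

```python
def ldig_commands(model, corpus, extractor, test_data,
                  min_freq=2, eta=0.1, reg=0.00001, whole_reg=2):
    """Return the argument lists for init, learning, shrink and detection.

    All default values are hypothetical placeholders.
    """
    base = ["python", "ldig.py", "-m", model]
    return [
        # (1) extract features from the corpus and initialize the model
        base + ["--init", corpus, "-x", extractor, f"--ff={min_freq}"],
        # (2) one cycle of SGD over the corpus
        base + ["--learning", corpus, "-e", str(eta), "-r", str(reg),
                f"--wr={whole_reg}"],
        # (3) drop features whose parameters are all 0
        base + ["--shrink"],
        # (4) detect the languages of the test data
        base + [test_data],
    ]
```

Each list could be passed to `subprocess.run` in turn. Step (2) would be repeated if more than one cycle of SGD is wanted.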
Data Format
• Training and test data
  - [correct label]\t[meta data]\t[text]
  - Examples
    en u should just enjoy ur vacation sadly
    en D im online but you arent RT that much
    en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
    ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
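Reading one line of this format is a split on TAB. The sketch assumes exactly three TAB-separated fields, that the metadata field may be empty, and that the text itself may contain further TABs. The function name is hypothetical.

```python
def parse_line(line):
    """Split '[correct label]\\t[meta data]\\t[text]' into its three fields."""
    # maxsplit=2 keeps any TAB inside the text field intact.
    label, metadata, text = line.rstrip("\n").split("\t", 2)
    return label, metadata, text
```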
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  - Open http://localhost:[port] after it starts
  - For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of detect is less than the sum of size
language          size  detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca Catalan        5093    4923     4857          98.66       95.37      95.3      97.0
cs Czech          7681    7668     7663          99.93       99.77      96.3      99.7
da Danish         5516    5472     5310          97.04       96.27      94.5      92.4
de German        10060   10069    10006          99.37       99.46      86.6      93.8
en English       10162   10133    10029          98.97       98.69      88.3      95.0
es Spanish       10244   10284    10120          98.41       98.79      91.5      96.0
fi Finnish        7051    7038     7024          99.80       99.62      98.9      99.6
fr French        10074   10134    10051          99.18       99.77      95.0      98.1
hu Hungarian      4904    4892     4858          99.30       99.06      85.8      95.5
id Indonesian    10178   10225    10160          99.36       99.82      89.7      98.9
it Italian       10143   10205    10103          99.00       99.61      96.2      98.0
nl Dutch         10005    9916     9858          99.42       98.53      69.5      97.4
no Norwegian      8504    8432     8201          97.26       96.44      96.0      96.3
pl Polish        10151   10149    10130          99.81       99.79      98.0      99.7
pt Portuguese    10212   10201    10119          99.20       99.09      88.0      96.9
ro Romanian       5913    5867     5850          99.71       98.93      92.8      97.4
sv Swedish       10025   10093     9942          98.50       99.17      96.0      97.9
tr Turkish       10308   10317    10298          99.82       99.90      97.6      99.5
vi Vietnamese    10487   10480    10474          99.94       99.88      98.7      99.2
total           166711           165053                      99.01      92.2      97.4
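The precision and recall columns appear to follow from the three counts as precision = correct / detect and recall = correct / size. The Catalan row reproduces under this reading. The sketch below also shows the 0.6 threshold rule. Both function names are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent for one language."""
    return 100.0 * correct / detect, 100.0 * correct / size


def detect_label(probabilities, threshold=0.6):
    """Pick the most probable language, or None if it is below the threshold."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else None
```

For Catalan, `precision_recall(5093, 4923, 4857)` rounds to 98.66 and 95.37, matching the table.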
Estimation for LIGA Dataset
• Estimate using the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language      size  detect  correct  precision (%)  recall (%)
de German     1479    1476     1469           99.5        99.3
en English    1505    1502     1490           99.2        99.0
es Spanish    1562    1548     1541           99.6        98.7
fr French     1551    1549     1540           99.4        99.3
it Italian    1539    1531     1528           99.8        99.3
nl Dutch      1430    1429     1424           99.7        99.6
total         9066             8992                       99.2
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports
                         ldig               langdetect         CLD
language         size    correct  rate (%)  correct  rate (%)  correct  rate (%)
bg Bulgarian     1000                           988      98.8      991      99.1
cs Czech         1000       1000     100.0      994      99.4      995      99.5
da Danish        1000        976      97.6      968      96.8      932      93.2
de German        1000        999      99.9      998      99.8     1000     100.0
el Greek         1000                          1000     100.0     1000     100.0
en English       1000        999      99.9      996      99.6     1000     100.0
es Spanish       1000       1000     100.0      996      99.6      989      98.9
et Estonian      1000                           996      99.6      998      99.8
fi Finnish       1000        997      99.7      998      99.8     1000     100.0
fr French        1000        999      99.9      999      99.9      992      99.2
hu Hungarian     1000       1000     100.0      999      99.9      999      99.9
it Italian       1000        999      99.9      999      99.9      996      99.6
lt Lithuanian    1000                           997      99.7      999      99.9
lv Latvian       1000                           999      99.9      998      99.8
nl Dutch         1000       1000     100.0      974      97.4      995      99.5
pl Polish        1000        998      99.8      999      99.9      997      99.7
pt Portuguese    1000        995      99.5      996      99.6      989      98.9
ro Romanian      1000       1000     100.0      999      99.9      998      99.8
sk Slovak        1000                           988      98.8      990      99.0
sl Slovene       1000                           976      97.6      963      96.3
sv Swedish       1000        995      99.5      991      99.1      993      99.3
total           21000      13957      99.7    20850      99.3    20814      99.1
Conclusions
• Language detector using a maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, the precision will improve further
  - There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  - How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
  - Too large for applications (>100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Nakatani, NLP2012: "Twitter Language Detection Using Maximal Substrings", in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Simple Language Detection
bull Language detector can be constructed
from maximal substring model and
twitter corpus
ndash It still gets at most 98 accuracy
bull We guess it is necessary to reduce bias
ndash data size bias
ndash language-specific bias
ndash twitter-specific bias
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 41
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
bull Arabic and Persian have the similar trouble too
bull Character lsquoyehrsquo in Farsi corresponds to 2 code points
ndash Wikipedia uses ی (U+06cc Farsi yeh) only
ndash News uses ي(U+064a Arabic yeh) only
bull U+064a is a substitute in Farsi
ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc
ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails
bull Regard U+06cc as U+064a
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 47
Normalization for Vietnamese (1)
bull Vietnamese has 12 vowels
ndash a ă acirc e ecirc i y o ocirc ơ u ư
bull Vietnamese has 6 tones
ndash a ả agrave atilde aacute ạ
ndash These tone symbols are used also in general documents like news
bull The tone symbols can be appended to all vowels
ndash 12 6 = 72
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 48
Normalization for Vietnamese (2)
bull Representation of vowels with
tones
1 Use U+1ea0 - U+1ef9
bull ẵ = U+1eb5
2 Combine with Diacritical Marks
bull ẵ = U+0103 U+0303
ndash Half and half on news and tweet
bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 49
CJK-Kanji Normalization (1) (on language-detection)
bull CJK-Kanji has too many characters(more than 20K)
ndash Other character types have only 30-50 characters
bull The character space is very sparse
ndash Characters that donrsquot occur in the training corpus have no probabilities
bull eg 谢谢 Kanji for person name
ndash Common frequent characters are too strong
bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese
bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 50
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Estimation for Europarl Dataset
Only supported languages for ldig
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 66
ldig langdetect CLDlanguage size correct rate correct rate correct rate
bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993
total 21000 13957 997 20850 993 20814 991
Conclusions
bull Language detector using maximal substring model
ndash Detect over 99 accuracy for 19 languages
ndash langdetect with tweet corpus even has 97 accuracy
bull If the corpus is maintained the precision will be still up
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to features the precision will be still up
ndash How to add and train metadata at low cost
bull Desire to shrink the model without loss of precision
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Bias by Data Size
bull Tweet size in each language has huge bias
bull Level them out by sampling with replacement from each language up to the largest data
ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement
English
Portuguese
Spanish
Indonesian
Dutch
French
German
Turkish
Italian
Swedish
othersShort Text Language Detection with Infinity-Gram
(NAIST Seminar) 42
Convert to Lowercase on Multiple Languages
bull Conversion into lower case saves corpus and compresses model
bull But the lower case of I (U+0049) in Turkish differs from others
bull Convert to lower case excluding lsquoIrsquo
Upper case Lower case
Turkish
Azerbaijani
I (U+0049) ı (U+0131)
İ (U+0130) i (U+0069)
Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 43
Normalization for Rumanian
bull Rumanian uses acirc ă icirc ș ț in addition to a-z
bull There are 2 character type as st with a ldquobeardrdquo
ndash U+015E-F U+0162-3 st with cedilla
ndash U+0218-B st with comma below
bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia
bull The 2 code has the same design in some fonts
ndash Indistinguishable
ș ş U+0219 U+015F
ț ţ U+021B U+0163
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
44
Rumanian Character Affairs on PC
bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently
ndash 1989 Democratization in Rumania
ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode
ndash 2007 Rumania seated in the EU
ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 45
lsquost with cedillarsquo is used
on an advertisement board
in Bucharest
Normalization for Substitute Characters
bull lsquost with cedillarsquo are substitute characters
ndash But they are more popular than the others
ndash with cedilla with comma = 2 1
ndash ldquoRumanian IMErdquo outputs the substitutes too D
bull Regard lsquost with commarsquo as lsquost with cedillarsquo
ț ţ U+021B U+0163
I reckon it is similar to the relationship of
Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 46
Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to two code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
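The same idea for ‘yeh’, as a short Python sketch. The names are hypothetical and this is not the language-detection library's own code.

```python
FARSI_YEH = "\u06cc"   # ی
ARABIC_YEH = "\u064a"  # ي


def normalize_yeh(text: str) -> str:
    """Regard Farsi yeh (U+06CC) as Arabic yeh (U+064A)."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```

After this step, Persian text from Wikipedia and Persian text from news share the same code point for ‘yeh’, so a profile trained on one source still matches the other.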
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72
Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  – About half and half in news and tweets
• Normalize 2 into 1
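One way to "normalize 2 into 1" in Python is Unicode NFC composition from the standard library. This is a swapped-in technique for illustration; the slides do not say how ldig implements the step, and the function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text: str) -> str:
    """Compose base vowel + combining tone mark into one code point (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))  # 2 1 0x1eb5
```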
CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for personal names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to its representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 doesn’t include many Kanji for personal and place names
• Generate 130 clusters from the product of (1) and (2)
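Once the clusters exist, "normalize each group to the representative character" is a simple table lookup. The sketch below shows only that last step. The clusters are toy data invented for illustration (the real 130 clusters are not given in the slides), and taking the first character as representative is an assumption.

```python
def build_kanji_table(clusters):
    """Map every character of a cluster to the cluster's first character."""
    table = {}
    for cluster in clusters:
        representative = cluster[0]
        for ch in cluster:
            table[ord(ch)] = representative
    return table


# Toy clusters, for illustration only; not the clusters used in practice.
toy_clusters = ["的是不", "谢請"]
table = build_kanji_table(toy_clusters)
print("不是".translate(table))  # 的的
```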
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
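A rough Python sketch of this removal step using the standard `re` module. All patterns and names here are assumptions chosen for illustration; they are not ldig's actual rules and will miss edge cases.

```python
import re

# Hypothetical patterns, one per item in the list above.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
EMOTICON_RE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPD])(?!\w)")


def strip_twitter_markup(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT and letter emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(strip_twitter_markup("RT @foo check http://t.co/abc #wow nice XD"))
# check nice
```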
Normalization for Twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • In Japanese, though, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
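Case 2 fits in a single regular expression. This is a sketch with a hypothetical function name, and it reads the rule as "three or more repetitions".

```python
import re

# A character followed by two or more copies of itself.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3+ identical characters down to 2."""
    return REPEAT_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))     # cooll
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))  # LOOL
```

The result is not necessarily a dictionary word ("cooll"), but every lengthened variant collapses to the same string, which is what matters for feature extraction. As the slide notes, the rule also alters legitimate Japanese words such as かたたたき, so it is only safe for Latin-alphabet text.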
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
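A possible Python sketch of laugh shrinking. The pattern, the letter sets and the function name are assumptions, and the input is assumed to be lower-cased already. The pattern tolerates doubled letters so that irregular laughs like "hahahha" are also caught.

```python
import re

# consonant(s) + vowel(s), followed by one or more repeats of the same pair.
LAUGH_RE = re.compile(r"([hjk])+([aeio])+(?:\1+\2+)+")


def shrink_laugh(text: str) -> str:
    """Shrink a repeated laugh syllable to exactly two repetitions."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)


print(shrink_laugh("hahahha"))         # haha
print(shrink_laugh("jajajajajajaja"))  # jaja
```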
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value
Maximal String Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
    • Replace TABs with “ ” and line feeds with U+0001 in it
  – Output: “[feature]\t[frequency]”
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: L1 regularization coefficient
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID] [datetime] [userID] [language of UI] xxx  xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
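Reading one record of this tab-separated format could look like the following Python sketch. The function name is hypothetical, and it assumes every line carries all three fields, with the metadata field possibly empty.

```python
def parse_record(line: str):
    """Split one line into (label, metadata, text).

    Only the first two TABs are treated as separators, so the text
    field is returned unchanged.
    """
    label, metadata, text = line.rstrip("\n").split("\t", 2)
    return label, metadata, text


print(parse_record("en\t\tu should just enjoy ur vacation sadly\n"))
# ('en', '', 'u should just enjoy ur vacation sadly')
```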
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on a Twitter corpus
A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of “detect” is less than the sum of “size”
language         size  detect correct  precision(%) recall(%) LD53(%) LDsm(%)
ca Catalan       5093    4923    4857         98.66     95.37    95.3    97.0
cs Czech         7681    7668    7663         99.93     99.77    96.3    99.7
da Danish        5516    5472    5310         97.04     96.27    94.5    92.4
de German       10060   10069   10006         99.37     99.46    86.6    93.8
en English      10162   10133   10029         98.97     98.69    88.3    95.0
es Spanish      10244   10284   10120         98.41     98.79    91.5    96.0
fi Finnish       7051    7038    7024         99.80     99.62    98.9    99.6
fr French       10074   10134   10051         99.18     99.77    95.0    98.1
hu Hungarian     4904    4892    4858         99.30     99.06    85.8    95.5
id Indonesian   10178   10225   10160         99.36     99.82    89.7    98.9
it Italian      10143   10205   10103         99.00     99.61    96.2    98.0
nl Dutch        10005    9916    9858         99.42     98.53    69.5    97.4
no Norwegian     8504    8432    8201         97.26     96.44    96.0    96.3
pl Polish       10151   10149   10130         99.81     99.79    98.0    99.7
pt Portuguese   10212   10201   10119         99.20     99.09    88.0    96.9
ro Romanian      5913    5867    5850         99.71     98.93    92.8    97.4
sv Swedish      10025   10093    9942         98.50     99.17    96.0    97.9
tr Turkish      10308   10317   10298         99.82     99.90    97.6    99.5
vi Vietnamese   10487   10480   10474         99.94     99.88    98.7    99.2
total          166711          165053                   99.01    92.2    97.4
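The precision and recall columns follow from the three count columns: precision = correct / detect and recall = correct / size. A small Python check using the Catalan row (the function name is hypothetical):

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent from the table's count columns."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return precision, recall


p, r = precision_recall(size=5093, detect=4923, correct=4857)
print(f"precision={p:.2f}% recall={r:.2f}%")  # precision=98.66% recall=95.37%
```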
Estimation for LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used
language      size  detect correct  precision(%) recall(%)
de German     1479    1476    1469          99.5      99.3
en English    1505    1502    1490          99.2      99.0
es Spanish    1562    1548    1541          99.6      98.7
fr French     1551    1549    1540          99.4      99.3
it Italian    1539    1531    1528          99.8      99.3
nl Dutch      1430    1429    1424          99.7      99.6
total         9066            8992                    99.2
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports
                       ldig             langdetect       CLD
language        size   correct rate(%)  correct rate(%)  correct rate(%)
bg Bulgarian    1000         -       -      988    98.8      991    99.1
cs Czech        1000      1000   100.0      994    99.4      995    99.5
da Danish       1000       976    97.6      968    96.8      932    93.2
de German       1000       999    99.9      998    99.8     1000   100.0
el Greek        1000         -       -     1000   100.0     1000   100.0
en English      1000       999    99.9      996    99.6     1000   100.0
es Spanish      1000      1000   100.0      996    99.6      989    98.9
et Estonian     1000         -       -      996    99.6      998    99.8
fi Finnish      1000       997    99.7      998    99.8     1000   100.0
fr French       1000       999    99.9      999    99.9      992    99.2
hu Hungarian    1000      1000   100.0      999    99.9      999    99.9
it Italian      1000       999    99.9      999    99.9      996    99.6
lt Lithuanian   1000         -       -      997    99.7      999    99.9
lv Latvian      1000         -       -      999    99.9      998    99.8
nl Dutch        1000      1000   100.0      974    97.4      995    99.5
pl Polish       1000       998    99.8      999    99.9      997    99.7
pt Portuguese   1000       995    99.5      996    99.6      989    98.9
ro Romanian     1000      1000   100.0      999    99.9      998    99.8
sk Slovak       1000         -       -      988    98.8      990    99.0
sl Slovene      1000         -       -      976    97.6      963    96.3
sv Swedish      1000       995    99.5      991    99.1      993    99.3
total          21000     13957    99.7    20850    99.3    20814    99.1
Conclusions
• Language detector using a maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy when trained on a tweet corpus
• If the corpus is better maintained, precision should improve further
  – There are still many mistakes (in particular for da and no)
• If metadata is added to the features, precision should improve further
  – How can metadata be added and trained at low cost?
• The model should be shrunk without loss of precision
  – Too large for applications (> 100 MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68

Romanian Character Issues on PCs
• Although Romanian orthography requires ‘s/t with comma’ (ș/ț), these letters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)
(Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest)

Normalization for Substitute Characters
• ‘s/t with cedilla’ (ş/ţ) are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the two forms of the Japanese character ‘SA’ (さ, さ)
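
The comma-to-cedilla folding above can be sketched as a simple character translation. This is only an illustration, not ldig's code: the slide gives the ț → ţ pair explicitly, while the other three pairs (ș and the upper-case letters) and the name normalize_romanian are assumptions added here.

```python
# Hypothetical helper: fold the comma-below letters into the cedilla forms.
# Only U+021B -> U+0163 is stated on the slide; the other pairs are assumed
# by analogy ("s/t with comma" -> "s/t with cedilla").
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
})


def normalize_romanian(text: str) -> str:
    """Return text with comma-below s/t replaced by the cedilla variants."""
    return text.translate(COMMA_TO_CEDILLA)
```

Folding toward the more frequent form keeps one spelling per word, so the corpus statistics are not split between two code points.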

Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
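
A minimal sketch of the rule "regard U+06CC as U+064A", assuming it is applied as a plain replacement before feature extraction. The function name normalize_yeh is hypothetical.

```python
FARSI_YEH = "\u06cc"   # Farsi yeh, as used on Wikipedia
ARABIC_YEH = "\u064a"  # Arabic yeh, as used in news text


def normalize_yeh(text: str) -> str:
    """Map Farsi yeh onto Arabic yeh so both sources share one code point."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```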

Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Diacritical Marks
    • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
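
The two representations and the 12 × 6 = 72 count can be checked with Unicode NFC composition. Using NFC here is an assumption made for illustration: the slides do not say how ldig performs the normalization, and the names VOWELS, TONE_MARKS and normalize_vietnamese are hypothetical.

```python
import unicodedata

# The 12 vowels from the slide: a ă â e ê i y o ô ơ u ư
VOWELS = "a\u0103\u00e2e\u00eaiyo\u00f4\u01a1u\u01b0"
# The 6 tones: no mark, hook above, grave, tilde, acute, dot below
TONE_MARKS = ["", "\u0309", "\u0300", "\u0303", "\u0301", "\u0323"]


def normalize_vietnamese(text: str) -> str:
    """Compose 'vowel + combining mark' sequences into single code points."""
    return unicodedata.normalize("NFC", text)


# Every vowel/tone pair composes to exactly one character, 72 in total.
COMBINATIONS = {normalize_vietnamese(v + t) for v in VOWELS for t in TONE_MARKS}
```

With this, representation 2 (U+0103 U+0303) becomes representation 1 (U+1EB5), so news text and tweets end up with the same character sequence.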

CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons written with alphabet letters, like XD or :p
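
The removals listed above can be sketched with a few regular expressions. The patterns and the name strip_twitter_markup are assumptions chosen for illustration; they are not taken from ldig and cover only simple cases.

```python
import re

# Illustrative patterns only; real tweets need broader coverage.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b:?")
# Alphabet-based emoticons such as XD, xDDD, XP, :p, :D, ;)
EMOTICON = re.compile(r"(?<!\S)(?:[:;=]-?[)(DPpOo]|[xX]D+|XP)(?!\S)")


def strip_twitter_markup(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, strip_twitter_markup("RT @user: check http://example.com #fun XD so cool :p") leaves only the words that carry language information.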

Normalization for twitter-Specific Representation
• How to handle words like ‘coooooooollllll’
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats more than 3 times in a row, shrink the run to 2
  – No language has more than 3 repetitions of the same Latin letter in its orthography
    • In Japanese there are “かたたたき”, “かわいいいぬ”, “あわてててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
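
Case 2 can be sketched with a backreference. The threshold is an assumption: the slide wording can be read as "3 or more" or "more than 3", so it is exposed as the parameter min_run (default 3 here). Restricting the rule to ASCII letters is also an assumption, made so that Japanese words such as かたたたき stay untouched. The name shrink_repeats is hypothetical.

```python
import re


def shrink_repeats(text: str, min_run: int = 3) -> str:
    """Shrink runs of min_run or more identical ASCII letters to two letters."""
    # ([A-Za-z]) captures one letter; \1{n,} requires n more copies of it.
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % (min_run - 1))
    return pattern.sub(r"\1\1", text)
```

Unlike the dictionary approach of Case 1, this does not recover the original word (coooollll becomes cooll, not cool), but it maps all lengthened variants to one form without needing a dictionary.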

Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to double
  – hahahha ⇒ haha
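
A rough sketch of "shrink them to double" for laughs built from one consonant and one vowel (haha, jaja, jeje, hihi, keke). The pattern, the choice of the consonants h/j/k and the name shrink_laugh are assumptions for illustration, not ldig's actual rule.

```python
import re

# One consonant (h, j or k) and one vowel, repeated at least twice.
# Extra copies of either letter (as in "hahahha") are tolerated.
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*", re.IGNORECASE)


def shrink_laugh(text: str) -> str:
    """Replace a repeated laugh with exactly two consonant-vowel pairs."""
    return LAUGH.sub(lambda m: (m.group(1) + m.group(2)) * 2, text)
```

The syllable itself is kept (haha vs. jaja vs. keke), so the laugh can still act as a hint for the language while its length no longer matters.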
Implementation and Estimation

Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram logistic regression (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
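
The L1 SGD with cumulative penalty named above can be sketched for a simplified case. This is not ldig's code: ldig trains a multi-class model over maximal-substring features, whereas this sketch is a two-class logistic regression over feature dictionaries. The name train_l1_sgd and the parameters eta, c and epochs are hypothetical.

```python
import math
from collections import defaultdict


def train_l1_sgd(data, eta=0.1, c=0.01, epochs=20):
    """Binary logistic regression trained by SGD with cumulative L1 penalty.

    data: list of (features, label), features is a dict name -> value,
    label is 0 or 1. Returns the non-zero weights.
    """
    w = defaultdict(float)  # weights
    q = defaultdict(float)  # L1 penalty each weight has actually received
    u = 0.0                 # L1 penalty each weight could have received
    n = len(data)
    for _ in range(epochs):
        for feats, y in data:
            u += eta * c / n
            z = sum(w[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            # Only the features present in this sample are touched.
            for f, v in feats.items():
                w[f] += eta * (y - p) * v  # gradient step
                half = w[f]
                # Apply the penalty this weight still owes, clipping at zero.
                if half > 0:
                    w[f] = max(0.0, half - (u + q[f]))
                elif half < 0:
                    w[f] = min(0.0, half + (u - q[f]))
                q[f] += w[f] - half
    return {f: v for f, v in w.items() if v != 0.0}
```

Because the penalty is applied lazily and only to features that occur in the current sample, the cost per step depends on the sample size rather than on the total number of features, which matters when the feature set is very large.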

Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output is ”[feature]\t[frequency]”
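
The input and output conventions above can be sketched as two small helpers. Both function names are hypothetical, and joining the lines with U+0001 (rather than appending it to every line) is an assumption; the slide only says that line feeds become U+0001.

```python
def to_maxsubst_input(lines):
    """Join corpus lines into one string: TAB -> space, line feed -> U+0001."""
    return "\u0001".join(line.replace("\t", " ") for line in lines)


def parse_maxsubst_output(output):
    """Parse '[feature]<TAB>[frequency]' rows into a dict feature -> count."""
    features = {}
    for row in output.split("\n"):
        if not row:
            continue
        # Split on the last TAB only; the feature is everything before it.
        feature, freq = row.rsplit("\t", 1)
        features[feature] = int(freq)
    return features
```

Replacing TABs in the input keeps the TAB free to act as the field separator in the output.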

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learns the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: number of times to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
• Examples
  en  u should just enjoy ur vacation sadly
  en  :D im online but you arent RT that much
  en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  ca  [status ID] [datetime] [userID] [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
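
A minimal sketch of reading one line in this format. It assumes the fields are TAB-separated, that the label is the first field and the text the last, and that anything in between is metadata; the name parse_line is hypothetical.

```python
def parse_line(line):
    """Split '[label]<TAB>[meta data]<TAB>[text]' into its parts.

    Returns (label, metadata fields as a list, text).
    """
    label, *meta, text = line.rstrip("\n").split("\t")
    return label, meta, text
```

Taking the text from the last field means the same function works whether the metadata is empty, a single field, or several fields such as status ID, datetime, user ID and UI language.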

Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Estimation
• LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
• Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"

language          size  detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan        5093    4923     4857         98.66      95.37     95.3     97.0
cs Czech          7681    7668     7663         99.93      99.77     96.3     99.7
da Danish         5516    5472     5310         97.04      96.27     94.5     92.4
de German        10060   10069    10006         99.37      99.46     86.6     93.8
en English       10162   10133    10029         98.97      98.69     88.3     95.0
es Spanish       10244   10284    10120         98.41      98.79     91.5     96.0
fi Finnish        7051    7038     7024         99.80      99.62     98.9     99.6
fr French        10074   10134    10051         99.18      99.77     95.0     98.1
hu Hungarian      4904    4892     4858         99.30      99.06     85.8     95.5
id Indonesian    10178   10225    10160         99.36      99.82     89.7     98.9
it Italian       10143   10205    10103         99.00      99.61     96.2     98.0
nl Dutch         10005    9916     9858         99.42      98.53     69.5     97.4
no Norwegian      8504    8432     8201         97.26      96.44     96.0     96.3
pl Polish        10151   10149    10130         99.81      99.79     98.0     99.7
pt Portuguese    10212   10201    10119         99.20      99.09     88.0     96.9
ro Romanian       5913    5867     5850         99.71      98.93     92.8     97.4
sv Swedish       10025   10093     9942         98.50      99.17     96.0     97.9
tr Turkish       10308   10317    10298         99.82      99.90     97.6     99.5
vi Vietnamese    10487   10480    10474         99.94      99.88     98.7     99.2
total: size 166711, correct 165053, accuracy 99.01% (LD53 92.2%, LDsm 97.4%)
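
The precision and recall columns are consistent with precision = correct / detect and recall = correct / size, which can be checked against the Catalan row. The sketch below also shows the 0.6 threshold rule; the function names and the dict-of-probabilities input are assumptions for illustration.

```python
def detect_or_none(probs, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold.

    probs: dict language -> probability.
    """
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= threshold else None


def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size: texts whose true label is the language,
    detect: texts the detector assigned to the language,
    correct: texts counted in both.
    """
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row: size 5093, detect 4923, correct 4857
precision, recall = precision_recall(5093, 4923, 4857)
```

Rounded to two decimals these give 98.66 and 95.37, matching the table.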

Estimation for LIGA dataset
• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used

language      size  detect  correct  precision(%)  recall(%)
de German     1479    1476     1469          99.5       99.3
en English    1505    1502     1490          99.2       99.0
es Spanish    1562    1548     1541          99.6       98.7
fr French     1551    1549     1540          99.4       99.3
it Italian    1539    1531     1528          99.8       99.3
nl Dutch      1430    1429     1424          99.7       99.6
total: size 9066, correct 8992, accuracy 99.2%

Estimation for Europarl Dataset
• ldig is evaluated only on the languages it supports ("-" = not supported)

                       ldig              langdetect        CLD
language        size   correct  rate(%)  correct  rate(%)  correct  rate(%)
bg Bulgarian    1000         -        -      988     98.8      991     99.1
cs Czech        1000      1000    100.0      994     99.4      995     99.5
da Danish       1000       976     97.6      968     96.8      932     93.2
de German       1000       999     99.9      998     99.8     1000    100.0
el Greek        1000         -        -     1000    100.0     1000    100.0
en English      1000       999     99.9      996     99.6     1000    100.0
es Spanish      1000      1000    100.0      996     99.6      989     98.9
et Estonian     1000         -        -      996     99.6      998     99.8
fi Finnish      1000       997     99.7      998     99.8     1000    100.0
fr French       1000       999     99.9      999     99.9      992     99.2
hu Hungarian    1000      1000    100.0      999     99.9      999     99.9
it Italian      1000       999     99.9      999     99.9      996     99.6
lt Lithuanian   1000         -        -      997     99.7      999     99.9
lv Latvian      1000         -        -      999     99.9      998     99.8
nl Dutch        1000      1000    100.0      974     97.4      995     99.5
pl Polish       1000       998     99.8      999     99.9      997     99.7
pt Portuguese   1000       995     99.5      996     99.6      989     98.9
ro Romanian     1000      1000    100.0      999     99.9      998     99.8
sk Slovak       1000         -        -      988     98.8      990     99.0
sl Slovene      1000         -        -      976     97.6      963     96.3
sv Swedish      1000       995     99.5      991     99.1      993     99.3
total          21000     13957     99.7    20850     99.3    20814     99.1

Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, the precision will go up further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will go up further
  – Open question: how to add and train metadata at low cost
• Desire to shrink the model without loss of precision
  – Too large for applications (> 100 MB)

References
• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Language Detection for Twitter Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in ordinary documents such as news
• A tone mark can be attached to any vowel
  – 12 × 6 = 72 combinations
Normalization for Vietnamese (2)
• Two representations of vowels with tones
  1. Precomposed characters in U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. A vowel combined with a combining diacritical mark
    • ẵ = U+0103 U+0303
  – News and tweets use the two forms about half and half
• Normalize form 2 into form 1
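Folding form 2 into form 1 can be sketched with Unicode canonical composition (NFC). This is only an illustration: using NFC is an assumption here, and the function name is hypothetical; the slides do not say how ldig implements the step.

```python
import unicodedata

def compose_vietnamese(text: str) -> str:
    """Fold 'vowel + combining tone mark' sequences into precomposed characters."""
    # Assumption: standard NFC composition stands in for the normalization
    # described on the slide; ldig itself may use its own mapping table.
    return unicodedata.normalize("NFC", text)

combined = "\u0103\u0303"                 # ă followed by COMBINING TILDE (form 2)
precomposed = compose_vietnamese(combined)
print(len(combined), len(precomposed))    # 2 1
print(precomposed == "\u1eb5")            # True: the precomposed ẵ (form 1)
```

After this step a vowel with a tone is always a single code point, so the same word yields the same substrings whether it came from news or from a tweet.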
CJK-Kanji Normalization (1) (on language-detection)
• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for personal names
  – Common, frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese
CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists published for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from the product of (1) and (2)
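One way to read "product of (1) and (2)" is that two Kanji fall into the same group only when they share a K-means cluster and also belong to the same "commonly used" lists. The sketch below follows that reading, which is an assumption; the function name and the toy inputs are invented for illustration and are not the real clusters or lists.

```python
def build_kanji_normalizer(chars, kmeans_cluster, common_sets):
    """Map each character to the representative of its (cluster, membership) class."""
    representative = {}   # class key -> first character seen in that class
    mapping = {}          # character -> representative character
    for ch in chars:
        key = (kmeans_cluster[ch], tuple(ch in s for s in common_sets))
        mapping[ch] = representative.setdefault(key, ch)
    return mapping

# Toy data only: four characters, two k-means clusters, two membership lists.
toy_clusters = {"一": 0, "二": 0, "三": 0, "四": 1}
toy_sets = [set("一二四"), set("一二三")]
table = build_kanji_normalizer("一二三四", toy_clusters, toy_sets)

def normalize_kanji(text, table):
    """Replace every known Kanji by its representative; leave other characters alone."""
    return "".join(table.get(ch, ch) for ch in text)

print(normalize_kanji("二三四x", table))   # 一三四x
```

With about 130 classes instead of more than 20K characters, the Kanji alphabet becomes comparable in size to the other scripts.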
Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, such as XD and :p
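The removals above could be sketched with a few regular expressions. The patterns and the function name are hypothetical and chosen for illustration; the rules actually used by ldig may differ.

```python
import re

# Hypothetical patterns for illustration only.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")

def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT and letter emoticons, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

print(strip_twitter_noise("RT @bob hello http://t.co/abc #fun XD"))   # hello
```

URLs are removed first so that a "#" or "@" inside a URL is not mistaken for a hashtag or a mention.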
Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization such as coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times in a row, shrink the run to 2
  – No language has 3 or more consecutive occurrences of the same Latin letter in its orthography
    • In Japanese there are words such as "かたたたき", "かわいいいぬ", "あわててて"
    • Acronyms (like WWW, СССР) are not useful for language detection
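Case 2 fits in a single regular expression. The sketch below assumes that "more than 3" means a run of three or more identical letters, and it restricts the rule to a rough range of Latin letters; both choices are assumptions rather than the documented behaviour of ldig.

```python
import re

# One Latin letter (ASCII plus roughly the Latin-1/Extended range)
# followed by at least two more copies of itself.
LATIN_RUN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")

def shrink_repeats(text: str) -> str:
    """Shrink every run of 3+ identical Latin letters to exactly 2."""
    return LATIN_RUN.sub(r"\1\1", text)

print(shrink_repeats("coooooooollllll"))   # cooll
print(shrink_repeats("かたたたき"))          # かたたたき (non-Latin text is untouched)
```

The result is "cooll" rather than "cool": unlike the dictionary of case 1, the rule does not recover the original word, it only maps all lengthened variants to one form.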
Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
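A minimal sketch of the shrinking step, assuming lower-cased input and a laugh made of one two-character unit repeated regularly. The pattern is hypothetical; irregular laughs such as "hahahha" would need extra rules that are not shown here.

```python
import re

# A two-character unit followed by at least two more copies of itself.
LAUGH = re.compile(r"(\w{2})\1{2,}")

def shrink_laugh(text: str) -> str:
    """Shrink a regularly repeated two-character unit to two repetitions."""
    return LAUGH.sub(r"\1\1", text)

print(shrink_laugh("hahahaha"))                      # haha
print(shrink_laugh("tafil con eso jajajajajajaja"))  # tafil con eso jaja
print(shrink_laugh("kekeke"))                        # keke
```

Ordinary words are left alone because a unit must occur at least three times in a row: "banana" contains "an" only twice.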
Implementation and Estimation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for languages written in the Latin alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path to the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output: "[feature]\t[frequency]"
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with one cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of the L1 regularization
  – --wr: number of times to regularize the whole parameter set
    • There are too many parameters to regularize all of them at every step
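The reason not every parameter has to be regularized at every step can be seen in the cumulative-penalty update of [Tsuruoka+ 09]. The sketch below follows that paper's update rule as a plain-Python illustration; it is not ldig's code, and the function names and the per-step penalty `eta * c` are assumptions.

```python
def apply_penalty(w, q, i, u):
    """Clip weight w[i] toward zero by the L1 penalty it has not yet received."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z          # q[i]: penalty actually applied to w[i] so far

def sgd_step(w, q, u, grad, eta, c):
    """One SGD step; grad is a sparse dict {feature index: gradient}."""
    u += eta * c              # u: total penalty any weight could have received
    for i, g in grad.items():
        w[i] -= eta * g       # ordinary gradient step
        apply_penalty(w, q, i, u)   # catch up on the penalty for active features only
    return u

w, q = [0.5, -0.2, 0.0], [0.0, 0.0, 0.0]
u = sgd_step(w, q, 0.0, {0: 1.0, 1: -1.0}, eta=0.1, c=1.0)
print(w)   # second weight is clipped to exactly 0.0; third is never touched
```

Only the features present in the current text are updated, and each of them catches up on the penalty it missed; a pass over all parameters is needed only occasionally.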
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary
Data Format
• Training and test data
  – [correct label]\t[metadata]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
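A line in this format can be split as follows. The sketch assumes that the first tab-separated field is the label, the last field is the text, and everything in between is metadata (possibly empty); the `Sample` type and the sample values are made up for illustration.

```python
from typing import List, NamedTuple

class Sample(NamedTuple):
    label: str
    metadata: List[str]
    text: str

def parse_line(line: str) -> Sample:
    """Split '[label]\\t[metadata...]\\t[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Sample(label=fields[0], metadata=fields[1:-1], text=fields[-1])

# Hypothetical lines; the metadata values are invented.
plain = parse_line("en\tu should just enjoy ur vacation sadly\n")
rich = parse_line("ca\t123\t2012-05-14\t42\tes\txDDD no mextranya")
print(plain.label, plain.metadata)   # en []
print(rich.label, rich.metadata)     # ca ['123', '2012-05-14', '42', 'es']
```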
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after starting it
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters
Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
precision, recall, LD53 and LDsm are in %

language        size    detect  correct  precision  recall  LD53  LDsm
ca Catalan      5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech        7681    7668    7663     99.93      99.77   96.3  99.7
da Danish       5516    5472    5310     97.04      96.27   94.5  92.4
de German       10060   10069   10006    99.37      99.46   86.6  93.8
en English      10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish      10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish      7051    7038    7024     99.80      99.62   98.9  99.6
fr French       10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian    4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian   10178   10225   10160    99.36      99.82   89.7  98.9
it Italian      10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch        10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian    8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish       10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese   10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian     5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish      10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish      10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese   10487   10480   10474    99.94      99.88   98.7  99.2
total: size 166711, correct 165053 (99.01), LD53 92.2, LDsm 97.4
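The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. A short check under that reading (the function name is hypothetical):

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent, rounded to two decimals."""
    precision = 100.0 * correct / detect   # share of detected texts that are right
    recall = 100.0 * correct / size        # share of all texts of the language found
    return round(precision, 2), round(recall, 2)

print(precision_recall(5093, 4923, 4857))    # Catalan row: (98.66, 95.37)
print(precision_recall(7681, 7668, 7663))    # Czech row:   (99.93, 99.77)
```

detect can exceed size for a language (as for German) when texts of other languages are wrongly assigned to it.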
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
precision and recall are in %

language      size   detect  correct  precision  recall
de German     1479   1476    1469     99.5       99.3
en English    1505   1502    1490     99.2       99.0
es Spanish    1562   1548    1541     99.6       98.7
fr French     1551   1549    1540     99.4       99.3
it Italian    1539   1531    1528     99.8       99.3
nl Dutch      1430   1429    1424     99.7       99.6
total: size 9066, correct 8992 (99.2)
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports ("-" = not supported); rates are in %

                        ldig            langdetect      CLD
language        size    correct  rate   correct  rate   correct  rate
bg Bulgarian    1000    -        -      988      98.8   991      99.1
cs Czech        1000    1000     100.0  994      99.4   995      99.5
da Danish       1000    976      97.6   968      96.8   932      93.2
de German       1000    999      99.9   998      99.8   1000     100.0
el Greek        1000    -        -      1000     100.0  1000     100.0
en English      1000    999      99.9   996      99.6   1000     100.0
es Spanish      1000    1000     100.0  996      99.6   989      98.9
et Estonian     1000    -        -      996      99.6   998      99.8
fi Finnish      1000    997      99.7   998      99.8   1000     100.0
fr French       1000    999      99.9   999      99.9   992      99.2
hu Hungarian    1000    1000     100.0  999      99.9   999      99.9
it Italian      1000    999      99.9   999      99.9   996      99.6
lt Lithuanian   1000    -        -      997      99.7   999      99.9
lv Latvian      1000    -        -      999      99.9   998      99.8
nl Dutch        1000    1000     100.0  974      97.4   995      99.5
pl Polish       1000    998      99.8   999      99.9   997      99.7
pt Portuguese   1000    995      99.5   996      99.6   989      98.9
ro Romanian     1000    1000     100.0  999      99.9   998      99.8
sk Slovak       1000    -        -      988      98.8   990      99.0
sl Slovene      1000    -        -      976      97.6   963      96.3
sv Swedish      1000    995      99.5   991      99.1   993      99.3
total           21000   13957    99.7   20850    99.3   20814    99.1
Conclusions
• Language detector using a maximal substring model
  – Detects 19 languages with over 99% accuracy
  – langdetect with profiles from the tweet corpus also reaches 97% accuracy
• Maintaining the corpus should raise the precision further
  – There are still many mistakes (in particular da and no)
• Adding metadata to the features should raise the precision further
  – How can metadata be added and trained at low cost?
• Desire to shrink the model without loss of precision
  – It is too large for applications (> 100MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Nakatani, NLP2012: Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
CJK-Kanji Normalization (2) (on language-detection)
bull Group Kanjis by frequency and normalize each group to the representative character
ndash (1) K-means clustering
bull Use tf-idf on Wikipedia and Google News
bull K=50 (size of ascii alphabet = 52)
ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese
bull Simplified Chinese 现代汉语常用字表(3500)
bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)
bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998
ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much
bull Generate 130 clusters from product of (1) and (2)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 51
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for Twitter-Specific Representation
• How should lengthened words like ‘coooooooollllll’ be handled?
• Case 1: Build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization, e.g. coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • Japanese, by contrast, has “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
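Case 2 amounts to a single substitution. A minimal sketch, assuming that "run" means three or more identical characters in a row (the function name is made up for this example):

```python
import re

# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink every run of 3+ identical characters to exactly 2."""
    return _REPEATS.sub(r"\1\1", text)
```

Note that the result for ‘coooooooollllll’ is "cooll", not "cool": the rule does not recover the dictionary form, it only maps all lengthened variants to one string. Applied to Japanese it would also damage legitimate words such as かたたたき, which is why the rule is stated for the Latin alphabet.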
Laugh Normalization
• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
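One way to express this shrinking is a regular expression over consonant-vowel repetitions. The sketch below is an assumption for illustration: the consonant set (h, j, k) is chosen only to cover the examples above, and the function name is hypothetical.

```python
import re

# A consonant from an assumed set (h, j, k) followed by a vowel,
# repeated at least twice; doubled letters such as "hha" are tolerated.
_LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text: str) -> str:
    """Collapse a laugh such as 'hahahha' to two repetitions ('haha')."""
    return _LAUGH.sub(r"\1\2\1\2", text)
```

Because the pattern needs at least two repetitions, an already normalized "haha" is mapped to itself.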
Implementation and Evaluation
Language Detection with Infinity-Gram (ldig)
• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram logistic regression (maximal substrings) [Okanohara+ 09]
  – L1-regularized SGD (cumulative penalty) [Tsuruoka+ 09]
  – Double array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path to the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value
Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
    • Tabs are replaced with “ ” (a space) and line feeds with U+0001
  – Output: “[feature]\t[frequency]” per line
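The input and output conventions above can be illustrated as follows. This is a sketch of the stated format only, not of maxsubst itself; both function names and the `min_freq` parameter (mirroring the role of --ff) are assumptions.

```python
def prepare_input(text: str) -> str:
    """Apply the stated replacements: tab -> space, line feed -> U+0001."""
    return text.replace("\t", " ").replace("\n", "\u0001")


def parse_features(lines, min_freq=1):
    """Read '[feature]\\t[frequency]' lines into a dict, dropping rare features."""
    features = {}
    for line in lines:
        # Split on the last tab; lines without a tab are skipped.
        feature, _, freq = line.rstrip("\n").rpartition("\t")
        if feature and int(freq) >= min_freq:
            features[feature] = int(freq)
    return features
```

Since tabs are removed from the input, a tab can safely serve as the field separator in the output.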
Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for one cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of the L1 regularization
  – --wr: number of times to regularize the whole parameter set
    • There are too many parameters to regularize all of them at every step
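The cumulative penalty of [Tsuruoka+ 09] explains why not every parameter has to be regularized at every step: each weight remembers how much penalty it has received so far and catches up when it is next touched. A minimal sketch of the clipping step, assuming `u` is the total penalty any weight could have received so far (it grows with the learning rate times the regularization coefficient) and `q` is the penalty this weight has actually received; the function name is made up here and this is not ldig's code.

```python
def apply_cumulative_penalty(w: float, q: float, u: float):
    """Clip weight w toward zero and return (new_w, new_q).

    u: total L1 penalty any weight could have received so far
    q: total penalty this particular weight has received so far (signed)
    """
    z = w
    if w > 0:
        w = max(0.0, w - (u + q))
    elif w < 0:
        w = min(0.0, w + (u - q))
    # Record how much penalty was really applied in this call.
    return w, q + (w - z)
```

A weight that cannot absorb the full penalty is clipped to exactly 0, which is what makes the trained model sparse.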
Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model
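Conceptually, shrinking drops every feature whose weights are zero for all languages. A small sketch under the assumption that the model is a list of feature strings with one weight row per feature; the real ldig model layout (double array plus parameter matrix) is not shown on the slide, so the data structures and the function name here are hypothetical.

```python
def shrink_model(features, weights):
    """Keep only features that have at least one non-zero weight."""
    kept = [
        (feature, row)
        for feature, row in zip(features, weights)
        if any(value != 0 for value in row)
    ]
    return [f for f, _ in kept], [row for _, row in kept]
```

Because L1 regularization drives many weights to exactly 0, this step can remove a large share of the features without changing any prediction.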
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the language of each text in the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[metadata]\t[text]
  – Examples:
    en u should just enjoy ur vacation sadly
    en D im online but you arent RT that much
    en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
    ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
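A line in this format can be split on tabs. The sketch below assumes that the label is the first field, the text is the last field, and everything in between is metadata; this reading of the format and the function name are assumptions, not taken from ldig.

```python
def parse_line(line: str):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    # Assumed layout: label first, text last, metadata in between.
    return fields[0], fields[1:-1], fields[-1]
```

Taking the text from the last field keeps the parser indifferent to how many metadata columns a corpus carries.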
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – After it starts, open http://localhost:[port]/
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Evaluation
language        size    detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca Catalan      5093    4923    4857     98.66          95.37       95.3      97.0
cs Czech        7681    7668    7663     99.93          99.77       96.3      99.7
da Danish       5516    5472    5310     97.04          96.27       94.5      92.4
de German       10060   10069   10006    99.37          99.46       86.6      93.8
en English      10162   10133   10029    98.97          98.69       88.3      95.0
es Spanish      10244   10284   10120    98.41          98.79       91.5      96.0
fi Finnish      7051    7038    7024     99.80          99.62       98.9      99.6
fr French       10074   10134   10051    99.18          99.77       95.0      98.1
hu Hungarian    4904    4892    4858     99.30          99.06       85.8      95.5
id Indonesian   10178   10225   10160    99.36          99.82       89.7      98.9
it Italian      10143   10205   10103    99.00          99.61       96.2      98.0
nl Dutch        10005   9916    9858     99.42          98.53       69.5      97.4
no Norwegian    8504    8432    8201     97.26          96.44       96.0      96.3
pl Polish       10151   10149   10130    99.81          99.79       98.0      99.7
pt Portuguese   10212   10201   10119    99.20          99.09       88.0      96.9
ro Romanian     5913    5867    5850     99.71          98.93       92.8      97.4
sv Swedish      10025   10093   9942     98.50          99.17       96.0      97.9
tr Turkish      10308   10317   10298    99.82          99.90       97.6      99.5
vi Vietnamese   10487   10480   10474    99.94          99.88       98.7      99.2
total           166711          165053                  99.01       92.2      97.4
• LD53 = langdetect + standard bundled profiles
• LDsm = langdetect + profiles based on the Twitter corpus
• Because a text whose maximum probability is < 0.6 is treated as undetectable, the total of “detect” is less than the total of “size”
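The precision and recall columns follow from the three counts, and the “detect” count depends on the 0.6 threshold. A short sketch of both, with function names chosen for this example only:

```python
THRESHOLD = 0.6  # texts below this maximum probability count as undetectable


def detect(probabilities: dict, threshold: float = THRESHOLD):
    """Return the most probable language, or None if it is below the threshold."""
    language, p = max(probabilities.items(), key=lambda kv: kv[1])
    return language if p >= threshold else None


def precision_recall(size: int, detected: int, correct: int):
    """Precision = correct / detected, recall = correct / size (both in %)."""
    return 100.0 * correct / detected, 100.0 * correct / size
```

For the Catalan row, `precision_recall(5093, 4923, 4857)` reproduces the 98.66 and 95.37 shown in the table after rounding to two decimals.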
Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset, which has 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used
language      size   detect  correct  precision (%)  recall (%)
de German     1479   1476    1469     99.5           99.3
en English    1505   1502    1490     99.2           99.0
es Spanish    1562   1548    1541     99.6           98.7
fr French     1551   1549    1540     99.4           99.3
it Italian    1539   1531    1528     99.8           99.3
nl Dutch      1430   1429    1424     99.7           99.6
total         9066           8992                    99.2
Evaluation on the Europarl Dataset
• ldig is evaluated only on the languages it supports; rates are in %
                        ldig             langdetect       CLD
language        size    correct  rate    correct  rate    correct  rate
bg Bulgarian    1000                     988      98.8    991      99.1
cs Czech        1000    1000     100.0   994      99.4    995      99.5
da Danish       1000    976      97.6    968      96.8    932      93.2
de German       1000    999      99.9    998      99.8    1000     100.0
el Greek        1000                     1000     100.0   1000     100.0
en English      1000    999      99.9    996      99.6    1000     100.0
es Spanish      1000    1000     100.0   996      99.6    989      98.9
et Estonian     1000                     996      99.6    998      99.8
fi Finnish      1000    997      99.7    998      99.8    1000     100.0
fr French       1000    999      99.9    999      99.9    992      99.2
hu Hungarian    1000    1000     100.0   999      99.9    999      99.9
it Italian      1000    999      99.9    999      99.9    996      99.6
lt Lithuanian   1000                     997      99.7    999      99.9
lv Latvian      1000                     999      99.9    998      99.8
nl Dutch        1000    1000     100.0   974      97.4    995      99.5
pl Polish       1000    998      99.8    999      99.9    997      99.7
pt Portuguese   1000    995      99.5    996      99.6    989      98.9
ro Romanian     1000    1000     100.0   999      99.9    998      99.8
sk Slovak       1000                     988      98.8    990      99.0
sl Slovene      1000                     976      97.6    963      96.3
sv Swedish      1000    995      99.5    991      99.1    993      99.3
total           21000   13957    99.7    20850    99.3    20814    99.1
Conclusions
• Built a language detector using a maximal substring model
  – It achieves over 99% accuracy on 19 languages
  – Even langdetect reaches 97% accuracy when its profiles are built from the tweet corpus
• Cleaning up the corpus should raise precision further
  – It still contains many mistakes (in particular for da and no)
• Adding metadata to the features should raise precision further
  – Open question: how to add and train on metadata at low cost
• The model should be shrunk without loss of precision
  – It is too large for applications (> 100 MB)
References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Normalization for twitter
bull Remove simply
ndash URL
ndash mention
ndash hash tag
ndash RT
ndash face mark using alphabet like XD p
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 52
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Estimation for Europarl Dataset
Only supported languages for ldig
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 66
ldig langdetect CLDlanguage size correct rate correct rate correct rate
bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993
total 21000 13957 997 20850 993 20814 991
Conclusions
bull Language detector using maximal substring model
ndash Detect over 99 accuracy for 19 languages
ndash langdetect with tweet corpus even has 97 accuracy
bull If the corpus is maintained the precision will be still up
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to features the precision will be still up
ndash How to add and train metadata at low cost
bull Desire to shrink the model without loss of precision
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Normalization for twitter-Specific Representation
bull How to Like lsquocoooooooollllllrsquo
bull Case 1 Make a normalization dictionary using [Brody+ 2011]
ndash Unsupervised normalization like coooollll rarr cool
ndash It canrsquot handle words that are not in the dictionary
bull Case 2 If the same character continues in more than 3 Shrink it to 2
ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of
bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on
bull Acronym (like WWW СССР) is not useful for language detection
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 53
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Estimation for Europarl Dataset
Only supported languages for ldig
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 66
ldig langdetect CLDlanguage size correct rate correct rate correct rate
bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993
total 21000 13957 997 20850 993 20814 991
Conclusions
bull Language detector using maximal substring model
ndash Detect over 99 accuracy for 19 languages
ndash langdetect with tweet corpus even has 97 accuracy
bull If the corpus is maintained the precision will be still up
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to features the precision will be still up
ndash How to add and train metadata at low cost
bull Desire to shrink the model without loss of precision
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Laugh Normalization
bull There are various laughs on each language
ndash HOW MUCH DO YOU LOVE COACH BEISTE
HHAHAHAHAHAH
ndash Hihihihi ) Habe ich regulaumlr 2x die Woche
ndash Tafil con eso Jajajajajajaja
ndash Malo Jejejeje XP
ndash kekeke chỗ đoacute lagravem aacuteo được ko em
bull Shrink them to double
ndash hahahha rArr haha
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 54
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 61
Data Format
bull Training and test data
ndash [correct label]yent[meta data]yent[text]
en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
62
Usage (5) Estimation Tool
bull serverpy -m [model] -p [port number]
ndash Open httplocalhost[port] after it is executed
ndash Output their language probabilities contained features and their parameters for a text inputed in the text area
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 63
Estimation
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus
As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size
64
language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992
total 166711 165053 9901 922 974
Estimation for LIGA dataset
bull Estimate using LIGA[Tromp+ 11] dataset
with 9066 tweets for 6 languages
ndash httpwwwwintuenl~mpechenprojectssmm
Short Text Language Detection with Infinity-Gram
(NAIST Seminar)
Use 19 language model
65
Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996
total 9066 8992 992
Estimation for Europarl Dataset
Only supported languages for ldig
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 66
ldig langdetect CLDlanguage size correct rate correct rate correct rate
bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993
total 21000 13957 997 20850 993 20814 991
Conclusions
bull Language detector using maximal substring model
ndash Detect over 99 accuracy for 19 languages
ndash langdetect with tweet corpus even has 97 accuracy
bull If the corpus is maintained the precision will be still up
ndash There are still many mistakes (in particular da and no)
bull If metadata is added to features the precision will be still up
ndash How to add and train metadata at low cost
bull Desire to shrink the model without loss of precision
ndash Too large for application (gt100MB)
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 67
References
bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定
bull [Okanohara+ 09] Text Categorization with All Substring Features
bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
bull [Cavnar+ 94] N-Gram-Based Text Categorization
bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 68
Implementation and Estimation
Short Text Language Detection with
Infinity-Gram (NAIST Seminar) 55
Language Detection with Infinity-Gram (ldig)
bull tweet language detection for Latin
alphabet
ndash httpsgithubcomshuyoldig
bull MIT license
bull Distribute also the trained model here
ndash infin-gram LR(maximal substring) [Okanohara+ 09]
ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
ndash Double Array
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 56
Usage (1) Model Initialization
bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
ndash Extract features from corpus and initialize model
ndash -m model directory
ndash -x path of maximal substring extractor (execute as external process)
ndash --ff Ignore less than the specified value
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 57
Maximal String Extractor
bull maxsubst [input file] [output file]
ndash Input as multiple line text
bull Replace TABs to ldquo ldquo line feeds to U+0001 in it
ndash Output as rdquo[features]yent[frequency]rdquo
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 58
Usage (2) Learn
bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
ndash Learn the model using the corpus on 1 cycle of SGD
ndash -e learning rate of SGD
ndash -r regularizer of L1 regularization
ndash --wr what times to regularize for whole parameters
bull Parameters are too many to regularize the whle ones every step
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 59
Usage (3) Shrink Model
bull ldigpy -m [model] --shrink
ndash Remove Unefficient features(all
parameters of which are 0) from the
model
Short Text Language Detection with Infinity-Gram
(NAIST Seminar) 60
Usage (4) Detect Language
bull ldigpy -m [model] [test data]
ndash Detect languages of test data and output
its result and summary
Data Format
• Training and test data
  - [correct label]\t[metadata]\t[text]
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
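Reading one line of this format might look like the sketch below. It assumes the fields are TAB-separated, that the label comes first and the text last, and that everything in between is metadata; the function name is hypothetical.

```python
def parse_line(line):
    # Split "<label><TAB><metadata...><TAB><text>" into its parts.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, *metadata, text = fields
    return label, metadata, text
```

Taking the text from the last field keeps the parser working whether the metadata is empty, a single field, or several fields such as status ID, datetime, user ID and UI language.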
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  - Open http://localhost:[port] after it has started
  - For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters
Estimation
LD53 = langdetect with its standard bundled profiles; LDsm = langdetect with profiles built from the Twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"
language         size    detect  correct  precision  recall  LD53  LDsm
ca Catalan       5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech         7681    7668    7663     99.93      99.77   96.3  99.7
da Danish        5516    5472    5310     97.04      96.27   94.5  92.4
de German        10060   10069   10006    99.37      99.46   86.6  93.8
en English       10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish       10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish       7051    7038    7024     99.80      99.62   98.9  99.6
fr French        10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian     4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian    10178   10225   10160    99.36      99.82   89.7  98.9
it Italian       10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch         10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian     8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish        10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese    10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian      5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish       10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish       10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese    10487   10480   10474    99.94      99.88   98.7  99.2
total            166711          165053              99.01   92.2  97.4
(precision, recall, LD53 and LDsm are percentages)
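The columns of this table can be related as follows: precision = correct / detect and recall = correct / size, which reproduces the Catalan row (4857 / 4923 and 4857 / 5093). The sketch below also applies the 0.6 threshold when counting. It assumes "detect" counts texts by their predicted language, which is consistent with detect exceeding size for some languages; the function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.6  # texts whose best probability is below this are undetected


def evaluate(samples):
    # samples: iterable of (correct_label, predicted_label, max_probability).
    size, detect, correct = Counter(), Counter(), Counter()
    for gold, predicted, prob in samples:
        size[gold] += 1
        if prob < THRESHOLD:
            continue  # undetectable: counted in size but not in detect
        detect[predicted] += 1
        if predicted == gold:
            correct[gold] += 1
    return size, detect, correct


def precision_recall(size, detect, correct):
    # Both values are returned as percentages rounded to two decimals.
    precision = 100.0 * correct / detect if detect else 0.0
    recall = 100.0 * correct / size if size else 0.0
    return round(precision, 2), round(recall, 2)


print(precision_recall(5093, 4923, 4857))  # Catalan row: (98.66, 95.37)
```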
Estimation for LIGA dataset
• Evaluated on the LIGA [Tromp+ 11] dataset of 9066 tweets in 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm/
The 19-language model is used
language      size   detect  correct  precision  recall
de German     1479   1476    1469     99.5       99.3
en English    1505   1502    1490     99.2       99.0
es Spanish    1562   1548    1541     99.6       98.7
fr French     1551   1549    1540     99.4       99.3
it Italian    1539   1531    1528     99.8       99.3
nl Dutch      1430   1429    1424     99.7       99.6
total         9066           8992                99.2
(precision and recall are percentages)
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports
                        ldig            langdetect      CLD
language        size    correct  rate   correct  rate   correct  rate
bg Bulgarian    1000    -        -      988      98.8   991      99.1
cs Czech        1000    1000     100.0  994      99.4   995      99.5
da Danish       1000    976      97.6   968      96.8   932      93.2
de German       1000    999      99.9   998      99.8   1000     100.0
el Greek        1000    -        -      1000     100.0  1000     100.0
en English      1000    999      99.9   996      99.6   1000     100.0
es Spanish      1000    1000     100.0  996      99.6   989      98.9
et Estonian     1000    -        -      996      99.6   998      99.8
fi Finnish      1000    997      99.7   998      99.8   1000     100.0
fr French       1000    999      99.9   999      99.9   992      99.2
hu Hungarian    1000    1000     100.0  999      99.9   999      99.9
it Italian      1000    999      99.9   999      99.9   996      99.6
lt Lithuanian   1000    -        -      997      99.7   999      99.9
lv Latvian      1000    -        -      999      99.9   998      99.8
nl Dutch        1000    1000     100.0  974      97.4   995      99.5
pl Polish       1000    998      99.8   999      99.9   997      99.7
pt Portuguese   1000    995      99.5   996      99.6   989      98.9
ro Romanian     1000    1000     100.0  999      99.9   998      99.8
sk Slovak       1000    -        -      988      98.8   990      99.0
sl Slovene      1000    -        -      976      97.6   963      96.3
sv Swedish      1000    995      99.5   991      99.1   993      99.3
total           21000   13957    99.7   20850    99.3   20814    99.1
(rate is a percentage)
Conclusions
• A language detector based on a maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy when trained on the tweet corpus
• If the corpus is maintained, the precision will rise further
  - There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will rise further
  - Open issue: how to add and train on metadata at low cost
• The model should be shrunk without loss of precision
  - It is too large for applications (> 100 MB)
References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary
Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
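A minimal sketch of reading one line in the tab-separated format above. It assumes the first field is the label, the last field is the text, and everything in between is metadata (as in the Catalan example, which carries several metadata fields). The names `Record` and `parse_line` are hypothetical and are not part of ldig.

```python
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    label: str            # correct language label, e.g. "en"
    metadata: List[str]   # zero or more metadata fields
    text: str             # the text to classify


def parse_line(line: str) -> Optional[Record]:
    """Split one tab-separated line into label, metadata and text.

    Assumption: label is the first field and text is the last one.
    Returns None for lines that do not have at least a label and a text.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2 or not fields[0]:
        return None
    return Record(label=fields[0], metadata=fields[1:-1], text=fields[-1])


if __name__ == "__main__":
    sample = "ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\txxx xDDD no mextranya"
    print(parse_line(sample))
```

Keeping the text as the last field means the number of metadata fields can vary between data sets without changing the parser.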
Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
Estimation
LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on twitter corpus
Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
Rates are in percent.
language          size  detect  correct  precision  recall  LD53  LDsm
ca Catalan        5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech          7681    7668     7663      99.93   99.77  96.3  99.7
da Danish         5516    5472     5310      97.04   96.27  94.5  92.4
de German        10060   10069    10006      99.37   99.46  86.6  93.8
en English       10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish       10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish        7051    7038     7024      99.80   99.62  98.9  99.6
fr French        10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian      4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian    10178   10225    10160      99.36   99.82  89.7  98.9
it Italian       10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch         10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian      8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish        10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese    10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian       5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish       10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish       10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese    10487   10480    10474      99.94   99.88  98.7  99.2
total           166711           165053              99.01  92.2  97.4
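The columns of this table can be related by a short sketch, given here under stated assumptions rather than as the evaluation script that was actually used. It assumes that detect counts the texts assigned to a language, that precision = correct / detect and recall = correct / size (which reproduces the Catalan row: 4857 / 4923 and 4857 / 5093), and that a text whose top probability is below 0.6 gets no label. The names `decide` and `evaluate` are hypothetical.

```python
from collections import Counter

THRESHOLD = 0.6  # below this top probability a text counts as undetectable


def decide(probs, threshold=THRESHOLD):
    """Return the most probable language, or None if it falls below the threshold."""
    lang, p = max(probs.items(), key=lambda kv: kv[1])
    return lang if p >= threshold else None


def evaluate(gold, predicted):
    """Per-language (size, detect, correct, precision %, recall %)."""
    size, detect, correct = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        size[g] += 1
        if p is not None:          # undetectable texts add to size only
            detect[p] += 1
            if p == g:
                correct[g] += 1
    rows = {}
    for lang in size:
        precision = 100.0 * correct[lang] / detect[lang] if detect[lang] else 0.0
        recall = 100.0 * correct[lang] / size[lang]
        rows[lang] = (size[lang], detect[lang], correct[lang], precision, recall)
    return rows


if __name__ == "__main__":
    # Catalan row of the table: size 5093, detect 4923, correct 4857
    print(round(100.0 * 4857 / 4923, 2), round(100.0 * 4857 / 5093, 2))
```

Because undetectable texts are counted in size but not in detect, a high threshold tends to raise precision at the cost of recall.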
Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
Uses the 19-language model
Language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066             8992               99.2
Estimation for Europarl Dataset
ldig is evaluated only on the languages it supports
                      ldig            langdetect      CLD
language        size  correct   rate  correct   rate  correct   rate
bg Bulgarian    1000                      988   98.8      991   99.1
cs Czech        1000     1000  100.0      994   99.4      995   99.5
da Danish       1000      976   97.6      968   96.8      932   93.2
de German       1000      999   99.9      998   99.8     1000  100.0
el Greek        1000                     1000  100.0     1000  100.0
en English      1000      999   99.9      996   99.6     1000  100.0
es Spanish      1000     1000  100.0      996   99.6      989   98.9
et Estonian     1000                      996   99.6      998   99.8
fi Finnish      1000      997   99.7      998   99.8     1000  100.0
fr French       1000      999   99.9      999   99.9      992   99.2
hu Hungarian    1000     1000  100.0      999   99.9      999   99.9
it Italian      1000      999   99.9      999   99.9      996   99.6
lt Lithuanian   1000                      997   99.7      999   99.9
lv Latvian      1000                      999   99.9      998   99.8
nl Dutch        1000     1000  100.0      974   97.4      995   99.5
pl Polish       1000      998   99.8      999   99.9      997   99.7
pt Portuguese   1000      995   99.5      996   99.6      989   98.9
ro Romanian     1000     1000  100.0      999   99.9      998   99.8
sk Slovak       1000                      988   98.8      990   99.0
sl Slovene      1000                      976   97.6      963   96.3
sv Swedish      1000      995   99.5      991   99.1      993   99.3
total          21000    13957   99.7    20850   99.3    20814   99.1
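The ldig total in the Europarl table appears to be a rate over only the languages ldig has results for, not over all 21000 texts. The sketch below checks that reading with the ldig counts copied from the table. It assumes 1000 texts per language, and the helper name `total` is hypothetical.

```python
# Correct counts (out of 1000 texts each) for ldig, copied from the Europarl table.
# None marks a language with no ldig result.
LDIG_CORRECT = {
    "bg": None, "cs": 1000, "da": 976, "de": 999, "el": None, "en": 999,
    "es": 1000, "et": None, "fi": 997, "fr": 999, "hu": 1000, "it": 999,
    "lt": None, "lv": None, "nl": 1000, "pl": 998, "pt": 995, "ro": 1000,
    "sk": None, "sl": None, "sv": 995,
}


def total(results, size_per_lang=1000):
    """Sum correct counts over evaluated languages and return (correct, size, rate %)."""
    counted = [c for c in results.values() if c is not None]
    evaluated_size = size_per_lang * len(counted)
    correct = sum(counted)
    return correct, evaluated_size, round(100.0 * correct / evaluated_size, 1)


if __name__ == "__main__":
    print(total(LDIG_CORRECT))  # (13957, 14000, 99.7)
```

Under this reading, the ldig rate of 99.7 is not directly comparable with the langdetect and CLD rates, which cover all 21 languages.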
Conclusions
• Language detector using a maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• Maintaining the corpus should raise precision further
  – There are still many mistakes (in particular for da and no)
• Adding metadata to the features should raise precision further
  – How can metadata be added and trained at low cost?
• The model should be shrunk without loss of precision
  – Too large for applications (>100MB)
References
• [中谷 NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty