Language Identification of Web Data for Building Linguistic Corpora
Marija Stupar, Tereza Jurić, Nikola Ljubešić
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
INFuture2011: “Information Sciences and e-Society”
Zagreb, 10 November 2011
Overview
• Introduction
• Experimental setup
  ▫ Languages observed
• Methods used
  ▫ Main approaches
  ▫ Hybrid approaches
• Results
  ▫ Document level
  ▫ Paragraph level
• Conclusion
Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora
Introduction
• Web as a rich source of linguistic material
• More than one natural language within such sources
• Defining the method for language identification of the data collected from the Web
  ▫ Comparison of two main and two hybrid approaches
• Ultimate goal
  ▫ Using Web resources as a basis for constructing corpora: building hrWaC, the Croatian Web corpus
Experimental setup
     cs  de  en  es  fr  hr  hu  it  pl  sk  sl  sv
cs    -  18  22  26  22  53  25  31  42  70  54  23
de   18   -  34  34  35  12  17  31  20  17  18  53
en   22  34   -  27  33  16  16  35  15  17  19  35
es   26  34  27   -  62  22  18  56  18  23  28  38
fr   22  35  33  62   -  18  15  48  15  18  22  35
hr   53  12  16  22  18   -  11  31  39  51  74  24
hu   25  17  16  18  15  11   -  14  10  22  13  21
it   31  31  35  56  48  31  14   -  22  28  38  32
pl   42  20  15  18  15  39  10  22   -  50  40  18
sk   70  17  17  23  18  51  22  28  50   -  55  22
sl   54  18  19  28  22  74  13  38  40  55   -  26
sv   23  53  35  38  35  24  21  32  18  22  26   -
•Twelve languages observed
Table 1: A snippet from Language Similarity Table (Scannell, 2007)
Methods used
• Main approaches
  ▫ Function word distributions
  ▫ Second-order Markov models
• Hybrid approaches
  ▫ Harmonic balance
  ▫ Sophisticated method
• Language identification on document and paragraph level
Language    Function words   Character count
Czech       210              150601
German      334              150156
English     230              150041
Spanish     217              150926
French      260              150083
Croatian    204              157366
Hungarian   223              152202
Italian     219              150459
Polish      268              150198
Slovak      168              150046
Slovenian   256              143841
Swedish     256              150762
Table 2: Amount of data collected for each basic method
Methods used - main approaches
• Function word distributions
  ▫ Lists of function words from all languages in question
  ▫ The algorithm chooses the language for which the highest percentage of words in the text can be identified as function words of that language
• Second-order Markov models
  ▫ Conditional probability of a character given the two preceding characters, estimated from character bigram and trigram distributions calculated on a training set
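The two main approaches can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word lists and training texts are tiny hypothetical stand-ins, whitespace tokenization is assumed, and the add-one smoothing with `alphabet_size=256` is an assumption, since the slides do not name a smoothing scheme.

```python
import math
from collections import Counter

# Hypothetical toy resources; the real lists and training texts are not in the slides.
FUNCTION_WORDS = {
    "en": {"the", "and", "of", "is", "on"},
    "hr": {"i", "je", "u", "na", "se"},
}
TRAINING_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "hr": "brza smeđa lisica skače preko lijenog psa i mačke",
}


def function_word_scores(text, function_words):
    """Share of tokens recognised as function words, per language."""
    tokens = text.lower().split()
    if not tokens:
        return {lang: 0.0 for lang in function_words}
    return {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in function_words.items()
    }


def identify_by_function_words(text):
    """Pick the language with the highest function-word share."""
    scores = function_word_scores(text, FUNCTION_WORDS)
    return max(scores, key=scores.get)


def train_markov(text):
    """Count character trigrams and the two-character contexts they start with."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    contexts = Counter(text[i:i + 2] for i in range(len(text) - 2))
    return trigrams, contexts


def markov_log_prob(text, model, alphabet_size=256):
    """Log-probability of text under a second-order character model.

    Add-one smoothing is an assumption made for this sketch.
    """
    trigrams, contexts = model
    total = 0.0
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + alphabet_size))
    return total


def identify_by_markov(text, models):
    """Pick the language whose model gives the text the highest log-probability."""
    return max(models, key=lambda lang: markov_log_prob(text, models[lang]))


MODELS = {lang: train_markov(text) for lang, text in TRAINING_TEXT.items()}
```

Calling `identify_by_function_words(text)` or `identify_by_markov(text, MODELS)` returns a language code from the toy resources; with realistic word lists and training data of the sizes in Table 2, the same structure applies.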
Methods used - hybrid approaches
• Harmonic balance
  ▫ Harmonic mean of the certainty of the function words method and the certainty of the Markov model method
  ▫ Certainty is calculated as a/(a+b), where a is the best result and b the second-best result
• Sophisticated hybrid method
  ▫ Takes into account the strengths of each main method
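As a small worked illustration of the certainty formula and the harmonic mean, the sketch below assumes each method yields a non-negative score per language (raw log-probabilities would first need rescaling). The function names are illustrative, and the slides do not spell out how the combined value is turned into a final decision, so that step is left out.

```python
def certainty(scores):
    """a / (a + b), with a the best and b the second-best score.

    Assumes at least two languages and non-negative scores.
    """
    ranked = sorted(scores.values(), reverse=True)
    a, b = ranked[0], ranked[1]
    return a / (a + b) if a + b > 0 else 0.0


def harmonic_mean(x, y):
    """Harmonic mean of two certainties; 0.0 if both are zero."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0


# Hypothetical per-language scores from the two main methods.
function_word_scores = {"hr": 0.75, "sl": 0.25, "cs": 0.0}
markov_scores = {"hr": 0.5, "sl": 0.5, "cs": 0.25}

combined = harmonic_mean(certainty(function_word_scores), certainty(markov_scores))
```

The harmonic mean stays low unless both methods are confident, which is presumably why it is used to balance the two.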
Methods used - hybrid approaches
• Sophisticated hybrid method algorithm
  ▫ If the Markov model and the function words method give the same result, that result is accepted
  ▫ If the results differ, but the second-best result of the Markov model method is identical to the first result of the function words method and its certainty is over 0.6, the result of the function words method is accepted
  ▫ Otherwise, the result of the Markov model method is accepted
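The three rules above could be sketched like this. The sketch assumes each method returns non-negative scores per language, and it reads "its certainty" as the certainty of the function words method, which is one interpretation of the slide; all names are illustrative.

```python
def certainty(scores):
    """a / (a + b), with a the best and b the second-best score."""
    ranked = sorted(scores.values(), reverse=True)
    a, b = ranked[0], ranked[1]
    return a / (a + b) if a + b > 0 else 0.0


def ranked_languages(scores):
    """Language codes ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def sophisticated_hybrid(markov_scores, fw_scores, threshold=0.6):
    """Combine the two main methods following the three rules on the slide."""
    markov_ranked = ranked_languages(markov_scores)
    fw_ranked = ranked_languages(fw_scores)

    # Rule 1: both methods agree on the best language.
    if markov_ranked[0] == fw_ranked[0]:
        return markov_ranked[0]

    # Rule 2: Markov's runner-up is the function-word winner, and the
    # function words method is certain enough (assumed reading of "its").
    if markov_ranked[1] == fw_ranked[0] and certainty(fw_scores) > threshold:
        return fw_ranked[0]

    # Rule 3: fall back to the Markov model.
    return markov_ranked[0]
```

For example, with Markov scores `{"sl": 0.5, "hr": 0.4, "cs": 0.1}` and function-word scores `{"hr": 0.75, "sl": 0.25, "cs": 0.0}`, the second rule applies and the function words result is accepted.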
Methods used - evaluation
• Document level
  ▫ 20 documents per language
  ▫ Documents containing less than 70% of any language are considered unsolvable
• Paragraph level
  ▫ Paragraphs in 50 documents were labeled by the language they are written in
  ▫ 750 paragraphs in total
• Evaluation measure is accuracy
  ▫ (a+d)/(a+b+c+d)
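The accuracy measure can be written out as a short function. The slide does not define the letters; the sketch assumes the usual confusion-matrix reading, with a and d counting correct decisions and b and c counting incorrect ones. When only correct and incorrect counts are available, it reduces to correct / (correct + incorrect).

```python
def accuracy(a, b, c, d):
    """(a + d) / (a + b + c + d); a and d are assumed to be the correct counts."""
    total = a + b + c + d
    return (a + d) / total if total else 0.0


def accuracy_from_counts(correct, incorrect):
    """Accuracy when only correct and incorrect decision counts are known."""
    total = correct + incorrect
    return correct / total if total else 0.0


# 12 languages x 20 documents gives 240 document-level decisions;
# 234 correct and 6 incorrect is the function words result in Table 3.
document_level = round(accuracy_from_counts(234, 6), 3)
```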
Results
• Main approaches

            Document level                   Paragraph level
            Function words   Markov model    Function words   Markov model
Positive    234              239             745              747
Negative    6                1               5                3
Accuracy    0.975            0.996           0.993            0.996

Table 3: Results of the evaluation of the traditional approaches
Results
• Hybrid approaches

            Document level                           Paragraph level
            Harmonic balance   Sophisticated method  Harmonic balance   Sophisticated method
Positive    239                240                   746                747
Negative    1                  0                     4                  3
Accuracy    0.996              1.0                   0.995              0.996

Table 4: Results of the evaluation of hybrid methods
Conclusion
• Markov model outperforms the function words method
• Hybrid approaches proved more effective on the document level (mixed language content)
• Power-law-like distribution of languages
• Three languages account for 99% of the data
• Around 96% of documents are written in only one language
  ▫ 4% have mixed content