Date post: | 06-Aug-2015 |
Upload: | chamila-wijayarathna |
Sinmin - Corpus for Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors:
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva
Outline
● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages
What is a Corpus?
“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennet (2010)
Uses of a Corpus
● Implementing translators, spell checkers, and grammar checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of use and over time.
● Retrieving statistical details of a language.
● Providing backend support for tools such as OCR and POS taggers.
Sinmin is a corpus for the Sinhala language that is:
➢ Continuously updated
➢ Dynamic (scalable)
➢ Broad in coverage (structured and unstructured language)
Identified Sinhala Resources
● News: News Paper, News Items
● Academic: Text books, Religious, Wikipedia
● Creative Writing: Fiction, Blogs, Magazine
● Spoken: Subtitle
● Gazette: Gazette
● Mahawansa
Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing, and storing them in a manageable format.
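One way a crawler could decide whether a page contains Sinhala content is to measure how much of its text falls in the Unicode Sinhala block (U+0D80 to U+0DFF). The sketch below is only an illustration of that idea; the function names and the 0.5 threshold are assumptions, not taken from the Sinmin crawlers.

```python
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF  # Unicode Sinhala block


def sinhala_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Sinhala block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for c in chars if SINHALA_START <= ord(c) <= SINHALA_END)
    return sinhala / len(chars)


def looks_sinhala(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep a page if enough of its text is Sinhala."""
    return sinhala_ratio(text) >= threshold
```

A crawler built this way would call `looks_sinhala` on the extracted page text before storing it.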
Identified Issues
● Erroneous characters in the texts
● Short forms
● Fixing the consecutive Sinhala vowel sign problem
Erroneous Characters Of The Texts
● Invalid Unicode characters
E.g.: characters in a Private Use Area, the replacement character (U+FFFD)
● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”
Erroneous Characters Of The Texts
● Unwanted non-Sinhala characters
E.g.: U+200C, Á, À, ®, ¡, ª, º
● Non-symbolic characters that were terminating words
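The character classes listed on these slides can be filtered with a small pass over the text. This is a minimal sketch, assuming that dropping the characters is the desired fix; the sets below contain only the examples named on the slides and are not the project's actual lists.

```python
import unicodedata

# Examples of unwanted characters named on the slides (assumed representative,
# not exhaustive): zero width non-joiner, replacement character, stray Latin-1.
UNWANTED = {"\u200c", "\ufffd", "Á", "À", "®", "¡", "ª", "º"}
# Symbols named on the slides.
SYMBOLS = set(",.{(?")


def clean_text(text: str, strip_symbols: bool = False) -> str:
    """Drop private-use and other unwanted characters, optionally symbols too."""
    kept = []
    for ch in text:
        if ch in UNWANTED:
            continue
        if unicodedata.category(ch) == "Co":  # "Co" = private use
            continue
        if strip_symbols and ch in SYMBOLS:
            continue
        kept.append(ch)
    return "".join(kept)
```

Symbol stripping is optional here because full stops are still needed later for sentence splitting.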
Short Forms
● Short forms contain full stops.
● But those full stops separate neither sentences nor words.
E.g.: පෙ. ව. (a.m.), රු. (Rupees)
Identified Common Short Forms
"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."
"පෙ.", "ව.", "�.", "රු."
"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
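A sentence splitter can use such a list to avoid breaking at full stops that belong to short forms. The following is a rough sketch, assuming whitespace-separated tokens and exact matching against a subset of the short forms listed above; the function name is hypothetical.

```python
# Subset of the short forms from the slide, plus single digits followed by a
# full stop. Tokens are compared exactly, so ordinary words ending in "." still
# end a sentence.
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "ව.", "රු."}
SHORT_FORMS |= {f"{d}." for d in "0123456789"}


def split_sentences(text: str) -> list:
    """Split on tokens ending with a full stop, except known short forms."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in SHORT_FORMS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final full stop
        sentences.append(" ".join(current))
    return sentences
```

With this rule, "1." at the start of a numbered item stays attached to the sentence that follows it.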
Consecutive Sinhala Vowel Sign Problem
● Solution: Mapping them into one format
● Convention: Only one vowel sign to a Sinhala letter
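The mapping can be pictured as a table from sequences of two signs to the single Unicode sign with the same appearance. The table below is an assumed example built from the Unicode Sinhala block, not the project's actual mapping, and the function name is hypothetical.

```python
# Two-sign sequences and the single sign assumed to be equivalent.
VOWEL_SIGN_MAP = {
    "\u0dd9\u0dd9": "\u0ddb",  # kombuva + kombuva -> kombu deka
    "\u0dd9\u0dcf": "\u0ddc",  # kombuva + aela-pilla -> kombuva haa aela-pilla
    "\u0dd9\u0dca": "\u0dda",  # kombuva + al-lakuna -> diga kombuva
    "\u0ddc\u0dca": "\u0ddd",  # U+0DDC + al-lakuna -> U+0DDD
    "\u0dd9\u0ddf": "\u0dde",  # kombuva + gayanukitta -> U+0DDE
    "\u0dd8\u0dd8": "\u0df2",  # gaetta-pilla twice -> diga gaetta-pilla
}


def normalize_vowel_signs(text: str) -> str:
    """Replace known sign sequences until each letter carries one sign."""
    changed = True
    while changed:  # repeat so chained cases (three signs) also collapse
        changed = False
        for sequence, single in VOWEL_SIGN_MAP.items():
            if sequence in text:
                text = text.replace(sequence, single)
                changed = True
    return text
```

Each replacement shortens the string, so the loop always terminates.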
The performance of data insertion and retrieval depends mainly on the data storage mechanism used for the corpus.
We evaluated the performance of inserting data and of retrieving data for 12 different information needs.
Data set and source code: https://github.com/madurangasiriwardena/performance-test
Cassandra performed better than the other systems in most scenarios, and its insertion time increased linearly, so we chose it for implementing the corpus.
● We used Cassandra as the main storage system of Sinmin.
● Apache Cassandra version 2.1.2 was used.
● cqlsh version 5.0.1 was used.
Cassandra
● Most API queries are served from the Cassandra database.
● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
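The idea of one column family per information need can be illustrated without Cassandra itself. In this sketch each dictionary stands in for one column family keyed the way its query needs; the names and keys are hypothetical and do not reflect the actual Sinmin schema.

```python
from collections import defaultdict


class CorpusStore:
    """In-memory stand-in for a query-per-table layout (not Cassandra)."""

    def __init__(self):
        # One "column family" per information need, keyed for direct lookup.
        self.word_frequency = defaultdict(int)           # key: word
        self.word_year_frequency = defaultdict(int)      # key: (word, year)
        self.word_category_frequency = defaultdict(int)  # key: (word, category)

    def insert_word(self, word, year, category):
        # The same occurrence is written to every table that needs it.
        self.word_frequency[word] += 1
        self.word_year_frequency[(word, year)] += 1
        self.word_category_frequency[(word, category)] += 1

    def frequency(self, word):
        return self.word_frequency.get(word, 0)

    def frequency_in_year(self, word, year):
        return self.word_year_frequency.get((word, year), 0)
```

The trade-off shown here is more work at insertion time in exchange for reads that need only a single key lookup.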
Wildcard Search Feature
The wildcard search feature enables users to run wildcard queries on the corpus.
E.g.: පෙ ? හ*
Wildcard Search Feature
● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10 question marks
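The semantics of such queries can be shown with a small pattern translator. This sketch is not the Solr-based implementation; it assumes `?` matches exactly one character and `*` matches any run of characters, and it enforces the limits stated above. The function names are hypothetical.

```python
import re

MAX_WILDCARDS = 10  # limit per wildcard type, as stated on the slide


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a wildcard pattern into a compiled regular expression."""
    if pattern.count("*") > MAX_WILDCARDS or pattern.count("?") > MAX_WILDCARDS:
        raise ValueError("too many wildcards in pattern")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any run of characters, including none
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_search(pattern: str, words: list) -> list:
    """Return the words that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]
```

Matching is done on whole words with `fullmatch`, so a pattern without wildcards behaves as an exact lookup.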
Sinhala Vowel Sign Problem At Wildcard Search
In Unicode, Sinhala vowel signs are separate code points from the letters they modify, so a single-character wildcard could match a vowel sign on its own.
Sinhala Vowel Sign Problem At Wildcard Search
Solution: Represent a Sinhala letter and its vowel sign as one entity
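One way to picture this solution is to group each letter with the dependent signs that follow it before matching. The code point ranges below (U+0DCA to U+0DDF and U+0DF2 to U+0DF3) are an assumption about which signs to attach, and the function names are hypothetical.

```python
def is_dependent_sign(ch: str) -> bool:
    """True for al-lakuna and vowel signs in the Sinhala block (assumed ranges)."""
    cp = ord(ch)
    return 0x0DCA <= cp <= 0x0DDF or 0x0DF2 <= cp <= 0x0DF3


def to_entities(word: str) -> list:
    """Split a word into entities: a letter plus any signs attached to it."""
    entities = []
    for ch in word:
        if entities and is_dependent_sign(ch):
            entities[-1] += ch  # attach the sign to the preceding letter
        else:
            entities.append(ch)
    return entities
```

A `?` wildcard can then be matched against one entity instead of one code point.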
● The web interface of Sinmin is designed for users who prefer a visualised and summarized view of Sinmin's statistical data.
● The visual design lets a user with no prior experience of the interface meet their information requirements with little effort.
The Sinmin user interface allows users to:
● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Run wildcard searches
● Find the latest articles for an n-gram
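The first two items can be computed from plain n-gram counts. This is a minimal sketch on a token list, assuming relative frequency as the probability estimate; it does not show how Sinmin computes these values, and the function names are hypothetical.

```python
from collections import Counter


def ngram_probability(tokens: list, ngram: list) -> float:
    """Relative frequency of an n-gram among all n-grams of the same length."""
    n = len(ngram)
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return grams[tuple(ngram)] / total if total else 0.0


def next_word_probabilities(tokens: list, context: list) -> dict:
    """Estimate P(word | context) from the words that follow the context."""
    n = len(context)
    followers = Counter()
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == list(context):
            followers[tokens[i + n]] += 1
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


def most_probable_next(tokens: list, context: list):
    """Return the most likely following word, or None if the context is unseen."""
    probabilities = next_word_probabilities(tokens, context)
    return max(probabilities, key=probabilities.get) if probabilities else None
```

For a large corpus the counts would be precomputed and stored rather than recounted per query.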
REST API
● A REST API exposes the corpus services
● Supports more complex and customizable data retrieval and filtering
● Provides an interface for third-party applications to consume
REST API
● Depends on backend databases (Cassandra, Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions
API Functions
● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount
Full Stop Predictor For OCR
● One challenge in OCR development is identifying full stops.
● This tool is a consumer application of Sinmin that predicts the full stops in Sinhala texts.
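The slides do not describe how the tool makes its predictions. As one possible approach, a predictor could estimate how often each word ends a sentence in the corpus and insert a full stop after words that usually do. Everything in this sketch, including the 0.5 threshold and the function names, is an assumption for illustration.

```python
from collections import Counter


def train(sentences: list) -> tuple:
    """Count how often each word occurs and how often it ends a sentence."""
    total, final = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        total.update(words)
        final[words[-1]] += 1
    return total, final


def predict_full_stops(words: list, total: Counter, final: Counter,
                       threshold: float = 0.5) -> list:
    """Insert "." after words that end sentences at least `threshold` of the time."""
    output = []
    for word in words:
        output.append(word)
        if total[word] and final[word] / total[word] >= threshold:
            output.append(".")
    return output
```

A practical predictor would likely also use the surrounding words, for example through bigram statistics, rather than a single word in isolation.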
Publications
● Implementing a Corpus for Sinhala Language, Symposium on Language Technology for South Asia (Presented)
● Comparison between performance of various database systems for implementing a language corpus, 11th International Beyond Databases, Architectures and Structures conference (Accepted)
Future Work
● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of the corpus.