
Sinmin final presentation

Date post: 06-Aug-2015
Upload: chamila-wijayarathna
Sinmin - Corpus for Sinhala Language

Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.

Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. de Silva

Outline

● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages

What is a Corpus?

“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)

Usages of a Corpus

● Implementing translators, spell checkers and grammar checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of usage and time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS Tagger, etc.

Sinmin is a corpus for the Sinhala language that:

➢ Is continuously updated
➢ Is dynamic (scalable)
➢ Covers a wide range of language (structured and unstructured)

Architecture of Sinmin


Identified Sinhala Resources


News Academic Creative Writing

Spoken Gazette

News Paper Text books Fiction Subtitle Gazette

News Items Religious Blogs

Wikipedia Magazine

Mahawansa


Crawler Implementation and Design


Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing and storing them in a manageable format.
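One plausible way for a crawler to decide whether a fetched page contains Sinhala content is to measure how many of its characters fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The sketch below assumes that approach; the function names and the 0.5 threshold are illustrative assumptions, not details of the Sinmin crawlers.

```python
def sinhala_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Sinhala block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for ch in chars if 0x0D80 <= ord(ch) <= 0x0DFF)
    return sinhala / len(chars)


def has_sinhala_content(text: str, threshold: float = 0.5) -> bool:
    """Assumed rule: keep a page when enough of its text is Sinhala."""
    return sinhala_ratio(text) >= threshold
```

A ratio is used rather than a simple "contains any Sinhala character" check so that pages with only a stray Sinhala word in a menu are not collected.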

Crawler Architecture


Sample XML File With One Article Stored In It
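The sample file itself did not survive in this transcript. As a rough illustration only, the sketch below shows how one crawled article could be written to XML with Python's standard library. Every element name here (`articles`, `article`, `url`, `title`, `date`, `content`) is a hypothetical placeholder and not the schema Sinmin actually uses.

```python
import xml.etree.ElementTree as ET


def article_to_xml(url: str, title: str, date: str, content: str) -> str:
    """Serialize one article into an XML string (hypothetical element names)."""
    root = ET.Element("articles")
    article = ET.SubElement(root, "article")
    ET.SubElement(article, "url").text = url
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "date").text = date
    ET.SubElement(article, "content").text = content
    # encoding="unicode" returns a str, so Sinhala text stays readable.
    return ET.tostring(root, encoding="unicode")
```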

Crawler Controller

The crawler controller monitors and handles the status of the web crawlers.

Data Cleaning and Tokenizing Mechanisms Used

Identified Issues

● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel sign problem

Erroneous Characters in the Texts

● Invalid Unicode characters
E.g.: characters in a Private Use Area, the replacement character (U+FFFD)

● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”

● Unwanted non-Sinhala characters
E.g.: U+200C, Á, À, ®, ¡, ª, º

● Non-symbolic characters that were terminating words
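A minimal sketch of how the invalid and unwanted characters listed above could be stripped, assuming a simple regular-expression filter. The character classes cover only the examples named on the slides (the Basic Multilingual Plane Private Use Area, U+FFFD, U+200C and the listed Latin-1 characters). The function name is an assumption, and the actual Sinmin cleaning code may differ.

```python
import re

# Private Use Area (U+E000 to U+F8FF) and the replacement character U+FFFD.
INVALID = re.compile("[\uE000-\uF8FF\uFFFD]")
# Zero-width non-joiner plus the Latin-1 characters named on the slide.
UNWANTED = re.compile("[\u200CÁÀ®¡ªº]")


def clean_text(text: str) -> str:
    """Drop invalid and unwanted characters, then collapse whitespace."""
    text = INVALID.sub("", text)
    text = UNWANTED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Symbols such as full stops are deliberately left in place in this sketch, since a later tokenizing step needs them to find sentence boundaries.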

Short Forms

● Short forms contain full stops.
● These full stops separate neither sentences nor words.

E.g.: පෙ. ව. (a.m.), රු. (Rupees)

Identified Common Short Forms

"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."

"පෙ.", "ව.", "�.", "රු."

"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
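The short forms above can be used to avoid false sentence breaks. The following is a sketch under stated assumptions: text is tokenized on whitespace, a token ending in a full stop closes a sentence unless it is a known short form, and any token ending in a digit followed by a full stop is treated like the "0." to "9." entries. The function names are illustrative, and only the short forms that are legible in the slide are included.

```python
SHORT_FORMS = {
    "ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.",
    "පෙ.", "ව.", "රු.",
}


def is_short_form(token: str) -> bool:
    """True if the trailing full stop belongs to a short form or a number."""
    if token in SHORT_FORMS:
        return True
    return len(token) >= 2 and token[-1] == "." and token[-2].isdigit()


def split_sentences(text: str) -> list[str]:
    """Split on full stops that are not part of a short form."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and not is_short_form(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

A known limitation of this convention is that a sentence that really ends in a number is not split at that point.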

Consecutive Sinhala Vowel Sign Problem

● Solution: Map consecutive vowel signs to a single, consistent form
● Convention: Only one vowel sign per Sinhala letter
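A minimal sketch of the "one vowel sign per letter" convention, assuming it is implemented as a table of sign pairs that are replaced by a single precomposed sign. The four pairs involving U+0DCA, U+0DCF and U+0DDF are canonical equivalences in Unicode. Mapping two kombuva signs to kombu deka is an assumed convention. The table is illustrative and may not match the mapping Sinmin uses.

```python
# (consecutive signs, single sign). Order matters: U+0DD9 U+0DCF must be
# merged into U+0DDC before the U+0DDC U+0DCA pair is considered.
SIGN_PAIRS = [
    ("\u0DD9\u0DD9", "\u0DDB"),  # kombuva + kombuva -> kombu deka (assumed)
    ("\u0DD9\u0DCF", "\u0DDC"),  # kombuva + aela-pilla
    ("\u0DD9\u0DDF", "\u0DDE"),  # kombuva + gayanukitta
    ("\u0DD9\u0DCA", "\u0DDA"),  # kombuva + al-lakuna -> diga kombuva
    ("\u0DDC\u0DCA", "\u0DDD"),  # kombuva haa aela-pilla + al-lakuna
]


def normalize_vowel_signs(text: str) -> str:
    """Replace each listed pair of signs with its single-sign form."""
    for pair, single in SIGN_PAIRS:
        text = text.replace(pair, single)
    return text
```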

Selecting Data Storage Mechanism for Sinmin


The performance of data insertion and retrieval depends mainly on the data storage mechanism used for the corpus.

We tested the performance of several database systems to determine which one to use for storing the data.

We Considered the Following Data Storage Systems

We measured performance for inserting data and for retrieving data for 12 different information needs.

Data set and source code: https://github.com/madurangasiriwardena/performance-test

Data Insertion Time Comparison


Information Retrieval Performance Comparison - Part 1


Information Retrieval Performance Comparison - Part 2


Cassandra performed better than the other systems in most of the scenarios, and its insertion time increased linearly, so we chose it to implement the corpus.

Data Storage Model of Sinmin


Cassandra

● Cassandra is the main storage system of Sinmin.
● Apache Cassandra version 2.1.2 is used.
● cqlsh version 5.0.1 is used.
● Most API queries are served from the Cassandra database.
● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
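Keeping one column family per information need usually means storing the same data several times, each copy keyed for one kind of query. The sketch below illustrates that idea with plain Python dictionaries standing in for column families. The two tables, their keys and the function name are assumptions made for illustration and do not reproduce Sinmin's schema.

```python
from collections import Counter, defaultdict


def build_tables(articles):
    """Build one lookup table per query from (year, tokens) pairs."""
    word_freq = Counter()                     # query: total frequency of a word
    word_freq_by_year = defaultdict(Counter)  # query: frequency of a word in a year
    for year, tokens in articles:
        for token in tokens:
            word_freq[token] += 1
            word_freq_by_year[year][token] += 1
    return word_freq, word_freq_by_year
```

Each query then becomes a single key lookup, at the cost of writing every token into several tables at insertion time.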

Oracle

● Oracle is used as a backup storage server.

Oracle Schema

Wildcard Search Feature

The wildcard search feature enables users to run wildcard queries on the corpus.

E.g.: පෙ ? හ*

● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10 question marks
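To make the wildcard semantics concrete, here is a small pure-Python stand-in that does not use Solr. It assumes "?" matches exactly one character and "*" matches any run of characters, and it enforces the limit of 10 of each. The function names are illustrative.

```python
import re

MAX_WILDCARDS = 10  # limit stated on the slide, applied to '*' and '?' separately


def compile_wildcard(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression."""
    if pattern.count("*") > MAX_WILDCARDS or pattern.count("?") > MAX_WILDCARDS:
        raise ValueError("too many wildcards in pattern")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole pattern."""
    regex = compile_wildcard(pattern)
    return [word for word in words if regex.fullmatch(word)]
```

In this sketch "?" matches a single Unicode code point, which is not always the same thing as a single Sinhala letter.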

Sinhala Vowel Sign Problem At Wildcard Search

In Sinhala Unicode, vowel signs are separate Unicode characters, so a letter carrying a vowel sign spans more than one code point.

Solution: Represent each Sinhala letter and its vowel sign as one entity.
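A minimal sketch of the "one entity" idea, assuming a word is split into units made of a base character plus any dependent signs that follow it (U+0DCA to U+0DDF, U+0DF2 and U+0DF3). The helper names are hypothetical, and the matcher handles only "?" to keep the example short.

```python
def is_dependent_sign(ch: str) -> bool:
    """True for al-lakuna and the Sinhala dependent vowel signs."""
    cp = ord(ch)
    return 0x0DCA <= cp <= 0x0DDF or cp in (0x0DF2, 0x0DF3)


def to_units(word: str) -> list[str]:
    """Group each base character with the signs attached to it."""
    units: list[str] = []
    for ch in word:
        if units and is_dependent_sign(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def matches(pattern: str, word: str) -> bool:
    """Match where '?' stands for exactly one unit ('*' is not handled)."""
    p, w = to_units(pattern), to_units(word)
    return len(p) == len(w) and all(a == "?" or a == b for a, b in zip(p, w))
```

With this grouping, a pattern of a letter followed by "?" matches a two-letter word even when the second letter carries a vowel sign.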

User Interface Design and Implementation


● The web interface of Sinmin is designed for users who prefer a visualised and summarised view of Sinmin's statistical data.

● The visual design lets a user without prior experience of the interface fulfil their information requirements with little effort.

The Sinmin user interface allows users to:

● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Run wildcard searches
● Find the latest articles for an n-gram
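The first two items can be illustrated with simple counting. The sketch below assumes that the probability of an n-gram is its relative frequency among all n-grams of the same length, and that the most probable next word is the most frequent follower of the given context. Both definitions and the function names are assumptions, since the slides do not state the exact formulas.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens: list[str], ngram: list[str]) -> float:
    """Relative frequency of ngram among n-grams of the same length."""
    counts = ngram_counts(tokens, len(ngram))
    total = sum(counts.values())
    return counts[tuple(ngram)] / total if total else 0.0


def most_probable_next(tokens: list[str], context: list[str]):
    """Most frequent token that follows context, or None if it never occurs."""
    n = len(context)
    followers = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == list(context)
    )
    return followers.most_common(1)[0][0] if followers else None
```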

API Design and Implementation


REST API

● Exposes the corpus services
● Supports more complex and customizable data retrieval and filtering
● Provides an interface for third-party applications to consume
● Depends on backend databases (Cassandra, Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions

Architecture


API Functions

● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount

Performance Testing of the API


Throughput Under Different Load Conditions


Time Taken To Process Requests Under Different Load Conditions


Full Stop Predictor For OCR

● One challenge in OCR development is identifying full stops.
● This tool is a consumer application of Sinmin that predicts the full stop marks in Sinhala texts.

Publications

● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented)
● Comparison between performance of various database systems for implementing a language corpus - 11th International Beyond Databases, Architectures and Structures conference (Accepted)

Future Work

● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of the corpus.

Q & A


Thank You!
