Sinmin Literature Review Presentation

Date post: 18-Jul-2015
Upload: chamila-wijayarathna
SINMIN CORPUS FOR SINHALA LANGUAGE: Literature Review. Upeksha W. D., Wijayarathna D. G. C. D., Siriwardena M. P., Lasandun K. H. L. Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. De Silva
Transcript
Page 1: Sinmin Literature Review Presentation

SINMIN CORPUS FOR SINHALA LANGUAGE

Literature Review

Upeksha W. D.

Wijayarathna D. G. C. D.

Siriwardena M. P.

Lasandun K. H. L.

Supervisors :

Dr. Chinthana Wimalasuriya

Prof. Gihan Dias

Mr. N. H. N. D. De Silva

Page 2: Sinmin Literature Review Presentation

Sinmin is a corpus for the Sinhala language that

➢ is continuously updated

➢ is dynamic (scalable)

➢ covers a wide range of language (structured and unstructured)

Page 3: Sinmin Literature Review Presentation

OUTLINE

● Literature Review

● Introduction to corpus linguistics and What is a Corpus

● Usages of a corpus

● Existing Corpus Implementations

● Identifying Sinhala Sources and Crawling

● Data Storage and Information Retrieval from Corpus

● Information Visualization

● Extracting Linguistic Features

● Current Progress

Page 4: Sinmin Literature Review Presentation

INTRODUCTION TO CORPUS LINGUISTICS

AND WHAT IS A CORPUS

Handford, M. and McCarthy, M. J. (2004) “Invisible to us” - A preliminary corpus-based study of spoken business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201

Page 5: Sinmin Literature Review Presentation

WHAT IS A CORPUS?

“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)

Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.

Page 6: Sinmin Literature Review Presentation

● There are eight main kinds of corpora.

● They are generalized corpora, specialized corpora, learner corpora, pedagogic corpora, historical corpora, parallel corpora, comparable corpora, and monitor corpora.

● The broadest type of corpus is the generalized corpus.

Page 7: Sinmin Literature Review Presentation

“Sinmin” will

● be a generalized corpus.

● cover all types of Sinhala language.

Page 8: Sinmin Literature Review Presentation

USAGES OF A CORPUS

● Implementing translators, spell checkers and grammar checkers.

● Identifying lexical and grammatical features of a language.

● Identifying how language varies with context of usage and over time.

● Retrieving statistical details of a language.

● Providing backend support for tools like OCR, POS taggers, etc.
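
As a rough illustration of the "statistical details" use case above, the following Python sketch counts word and bigram frequencies over a tiny in-memory text. The sample sentence and the whitespace tokenizer are assumptions made for illustration; real Sinhala text would need a proper tokenizer, and nothing here is taken from the Sinmin code base.

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; not suitable for real Sinhala text).
    return text.lower().split()

def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample text standing in for corpus content.
sample = "the corpus stores text and the corpus stores statistics"
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams[("the", "corpus")])
```

The same counts are the raw material for tools such as spell checkers or POS taggers, which typically need word and n-gram frequencies.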

Page 9: Sinmin Literature Review Presentation

EXISTING CORPUS IMPLEMENTATIONS

Page 10: Sinmin Literature Review Presentation

CORPUS FOR SINHALA LANGUAGE?

● There is an existing corpus for the Sinhala language, known as the UCSC Text Corpus of Contemporary Sinhala.

● It consists of about 10 million words, but it covers only a small part of the language and it is not being updated.

Page 11: Sinmin Literature Review Presentation

COMPOSITION OF THE CORPUS

● The language that makes up the corpus cannot be random; it must be chosen according to specific characteristics.

● It must use authentic texts. The language it contains is not made up for the sole purpose of creating the corpus.

Page 12: Sinmin Literature Review Presentation

EXAMPLE - COMPOSITION OF COCA

● COCA contains more than 385 million words from 1990–2008 (20 million words each year).

● Texts are evenly divided among 5 genres: spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%) and academic journals (20%).
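
The composition figures above can be turned into per-genre word counts with a little arithmetic. The Python sketch below uses only the numbers quoted on the slide (20 million words per year, five genres at 20% each, 1990 to 2008); the variable names are illustrative.

```python
WORDS_PER_YEAR = 20_000_000  # figure quoted on the slide

# Genre shares in percent, as listed on the slide.
GENRE_PERCENT = {
    "spoken": 20,
    "fiction": 20,
    "popular magazines": 20,
    "newspapers": 20,
    "academic journals": 20,
}

years = range(1990, 2009)  # 1990..2008 inclusive, i.e. 19 years

# Integer arithmetic avoids floating-point rounding issues.
per_genre_per_year = {g: WORDS_PER_YEAR * p // 100 for g, p in GENRE_PERCENT.items()}
approx_total = WORDS_PER_YEAR * len(years)

print(per_genre_per_year["fiction"])  # words of fiction per year
print(approx_total)                   # total implied by the per-year figure
```

Each genre works out to 4 million words per year. The per-year figure implies about 380 million words over 19 years, so the 20 million is presumably a rounded value next to the "more than 385 million" total.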

Page 13: Sinmin Literature Review Presentation

COMPOSITION OF UCSC TEXT CORPUS OF

CONTEMPORARY SINHALA

Page 14: Sinmin Literature Review Presentation

DATA STORAGE AND INFORMATION

RETRIEVAL FROM CORPUS

Existing corpora use two main technologies for data storage:

● Relational databases

● Indexed file systems

Page 15: Sinmin Literature Review Presentation

INDEXED FILE SYSTEMS AS STORAGE

● The BNC uses this mechanism.

● Data is stored as XML-like files that follow a scheme known as the Corpus Document Interchange Format (CDIF).

● This makes it possible to store a great deal of detail about the structure of each text, such as its division into sections or chapters, paragraphs, verse lines, etc.
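
To make the idea of structure-preserving XML storage concrete, here is a minimal Python sketch that parses a small XML document and counts its structural units. The element names (`text`, `chapter`, `p`, `s`) are hypothetical and are not the actual BNC/CDIF tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; NOT the real CDIF/BNC schema.
SAMPLE = """<text id="t1">
  <chapter n="1">
    <p><s>First sentence.</s><s>Second sentence.</s></p>
    <p><s>Third sentence.</s></p>
  </chapter>
</text>"""

root = ET.fromstring(SAMPLE)

def count_elements(node, tag):
    # iter() walks the node and all of its descendants.
    return sum(1 for _ in node.iter(tag))

structure = {tag: count_elements(root, tag) for tag in ("chapter", "p", "s")}
sentences = [s.text for s in root.iter("s")]

print(structure)
print(sentences)
```

Because the division into chapters, paragraphs and sentences is kept in the markup, queries can be restricted to a structural unit without re-segmenting the text.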

Page 16: Sinmin Literature Review Presentation
Page 17: Sinmin Literature Review Presentation

RELATIONAL DATABASE AS STORAGE

● COCA and Corpus del Español use relational databases.

Page 18: Sinmin Literature Review Presentation

DATA MODEL IN COCA

Page 19: Sinmin Literature Review Presentation

CORPUS DEL ESPAÑOL USES SEPARATE

TABLES FOR BIGRAMS AND TRIGRAMS
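
A minimal sketch of what "separate tables for bigrams and trigrams" could look like in a relational database, using Python's built-in `sqlite3` module. The table and column names are assumptions for illustration; the actual Corpus del Español schema is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table per n-gram length.
conn.executescript("""
CREATE TABLE bigram  (w1 TEXT, w2 TEXT, freq INTEGER, PRIMARY KEY (w1, w2));
CREATE TABLE trigram (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER, PRIMARY KEY (w1, w2, w3));
""")

def add_ngram(table, words):
    # Insert the n-gram with frequency 0 if it is new, then increment it.
    cols = [f"w{i + 1}" for i in range(len(words))]
    placeholders = ", ".join("?" for _ in words)
    where = " AND ".join(f"{c} = ?" for c in cols)
    conn.execute(f"INSERT OR IGNORE INTO {table} VALUES ({placeholders}, 0)", words)
    conn.execute(f"UPDATE {table} SET freq = freq + 1 WHERE {where}", words)

# Hypothetical sample text.
tokens = "the corpus stores text and the corpus grows".split()
for i in range(len(tokens) - 1):
    add_ngram("bigram", tuple(tokens[i:i + 2]))
for i in range(len(tokens) - 2):
    add_ngram("trigram", tuple(tokens[i:i + 3]))

# Which words follow "the", and how often?
followers = conn.execute(
    "SELECT w2, freq FROM bigram WHERE w1 = ? ORDER BY freq DESC, w2", ("the",)
).fetchall()
trigram_rows = conn.execute("SELECT COUNT(*) FROM trigram").fetchone()[0]
print(followers, trigram_rows)
```

Precomputing n-gram tables trades storage space for lookup speed: a query for the words following a given word becomes an indexed lookup instead of a scan over the running text.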

Page 20: Sinmin Literature Review Presentation

RELATIONAL DB VS INDEXED FILE

SYSTEMS

● Indexed file systems make extensive use of indexes.

● Relational database models are relatively fast.

● In indexed file systems, it is difficult to add additional layers of annotation.

Page 21: Sinmin Literature Review Presentation

No study has been done on how NoSQL databases perform when used to implement corpora.

Page 22: Sinmin Literature Review Presentation

INFORMATION VISUALIZATION

Most of the popular corpora, such as BNC, COCA, Corpus del Español and the Google Books corpus, use a similar kind of web interface.

Page 23: Sinmin Literature Review Presentation

USER INTERFACE

OF COCA

Page 24: Sinmin Literature Review Presentation
Page 25: Sinmin Literature Review Presentation

GOOGLE BOOKS NGRAM VIEWER UI

Page 26: Sinmin Literature Review Presentation

EXTRACTING LINGUISTIC FEATURES

● A main use of a language corpus is extracting linguistic features of a language.

● Linguistic features of many languages have been identified using corpora.

● Example - A corpus-based linguistics analysis on written corpus: colligation of “TO” and “FOR.”
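
Colligation concerns the grammatical categories a word tends to occur with. The Python sketch below counts the part-of-speech tag that immediately follows "to" and "for" in a hand-tagged toy sentence. The data, the tag names and the "next tag" heuristic are all assumptions for illustration and do not reproduce the method of the cited study.

```python
from collections import Counter, defaultdict

# Hand-tagged toy data; words and tags are illustrative assumptions.
tagged = [
    ("she", "PRON"), ("went", "VERB"), ("to", "PREP"), ("school", "NOUN"),
    ("to", "PREP"), ("learn", "VERB"), ("for", "PREP"), ("two", "NUM"),
    ("years", "NOUN"), ("for", "PREP"), ("fun", "NOUN"),
]

def colligates(tagged_tokens, targets):
    # For each target word, count the tag of the token that follows it.
    following = defaultdict(Counter)
    for (word, _), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word in targets:
            following[word][next_tag] += 1
    return following

result = colligates(tagged, {"to", "for"})
print(dict(result["to"]))
print(dict(result["for"]))
```

On a real POS-tagged corpus, the resulting tag distributions show, for example, how often "to" is followed by a verb versus a noun.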

Page 27: Sinmin Literature Review Presentation
Page 28: Sinmin Literature Review Presentation
Page 29: Sinmin Literature Review Presentation

CURRENT PROGRESS

Page 30: Sinmin Literature Review Presentation

IDENTIFIED SINHALA RESOURCES

● Online Newspapers

● News Websites

● School Textbooks

● Sinhala Wikipedia

● Online Mahawansaya

● Subtitles

● Sinhala Fiction

● Sinhala Blogs

● Sinhala Magazines

● Gazette

Page 31: Sinmin Literature Review Presentation

DIVIDED INTO 5 MAIN GENRES

● News: News Paper, News Items

● Academic: Text books, Religious, Wikipedia, Mahawansa

● Creative Writing: Fiction, Blogs, Magazine

● Spoken: Subtitle

● Gazette: Gazette

Page 32: Sinmin Literature Review Presentation

Crawlers have been implemented for the different sources, all adhering to the same output format.

https://github.com/madurangasiriwardena/corpus.sinhala.crawler

Page 33: Sinmin Literature Review Presentation

FINISHED CRAWLERS

Page 34: Sinmin Literature Review Presentation

CRAWLED DATA IS SAVED TO XML FILES WITH THE FOLLOWING METADATA

● Post Name

● Author

● Link

● Published Date
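
A minimal sketch of how one crawled item with the four metadata fields listed above might be serialized to XML in Python. The element names (`post`, `name`, `author`, `link`, `date`, `content`) and the sample values are hypothetical; the actual format produced by the Sinmin crawlers may differ.

```python
import xml.etree.ElementTree as ET

def build_post(post_name, author, link, published_date, content):
    # Element names are assumptions, not the crawlers' real schema.
    post = ET.Element("post")
    fields = (
        ("name", post_name),
        ("author", author),
        ("link", link),
        ("date", published_date),
        ("content", content),
    )
    for tag, value in fields:
        ET.SubElement(post, tag).text = value
    return post

# Hypothetical sample record.
post = build_post(
    "Sample post",
    "A. Author",
    "http://example.com/post/1",
    "2015-07-18",
    "Body text of the crawled page.",
)
xml_text = ET.tostring(post, encoding="unicode")
print(xml_text)
```

Keeping every crawler on one shared record shape means a single loader can feed all sources into the storage layer.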

Page 35: Sinmin Literature Review Presentation

CRAWLER CONTROLLER

The crawler controller monitors and manages the status of the web crawlers.

Crawler controller address -

http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb

Page 36: Sinmin Literature Review Presentation
Page 37: Sinmin Literature Review Presentation

We tested the performance of several database systems to determine which one we should use to store the data.

Page 38: Sinmin Literature Review Presentation

WE CONSIDERED THE FOLLOWING DATA STORAGE SYSTEMS

Page 39: Sinmin Literature Review Presentation

We compared performance for inserting data and for retrieving data for 12 different information needs.

Data set and source code -

https://github.com/madurangasiriwardena/performance-test
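
The general shape of such a comparison is a timing harness that runs the same insert and query workload against each candidate store. The Python sketch below shows that shape with an in-memory SQLite database standing in for a candidate system. It is not the authors' test code (which is in the repository linked above), and the table layout, data and query are assumptions.

```python
import sqlite3
import time

def timed(fn, *args):
    # Run fn(*args) and return its result with the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def insert_words(conn, words):
    conn.executemany("INSERT INTO word (token) VALUES (?)", ((w,) for w in words))
    conn.commit()

def frequency_of(conn, token):
    # One example "information need": how often does a token occur?
    return conn.execute(
        "SELECT COUNT(*) FROM word WHERE token = ?", (token,)
    ).fetchone()[0]

conn = sqlite3.connect(":memory:")  # stand-in for one candidate store
conn.execute("CREATE TABLE word (token TEXT)")

# Synthetic data: 10,000 tokens drawn from 100 distinct values.
words = ["w%d" % (i % 100) for i in range(10_000)]
_, insert_seconds = timed(insert_words, conn, words)
count, query_seconds = timed(frequency_of, conn, "w7")

print(f"insert: {insert_seconds:.4f}s, query: {query_seconds:.4f}s, count={count}")
```

Repeating the measurement for increasing data sizes would show how insertion time grows with corpus size for each system.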

Page 40: Sinmin Literature Review Presentation

DATA INSERTION TIME COMPARISON

Page 41: Sinmin Literature Review Presentation

INFORMATION RETRIEVAL PERFORMANCE

COMPARISON - PART 1

Page 42: Sinmin Literature Review Presentation

INFORMATION RETRIEVAL PERFORMANCE

COMPARISON - PART 2

Page 43: Sinmin Literature Review Presentation

Cassandra performed better than the others in most of the scenarios, and its insertion time increased linearly.

So we chose it for implementing the corpus.

Page 44: Sinmin Literature Review Presentation

USER INTERFACE DESIGN AND

IMPLEMENTATION

● The web interface of Sinmin has been designed for users who prefer a visualised and summarized view of Sinmin's statistical data.

● The visual design of the interface is intended to let any user, even without prior experience of the interface, fulfill their information requirements with little effort.

http://sinhala-corpus.projects.uom.lk/sinmin-web/

Page 45: Sinmin Literature Review Presentation
Page 46: Sinmin Literature Review Presentation
Page 47: Sinmin Literature Review Presentation
Page 48: Sinmin Literature Review Presentation

CORPUS API DESIGN AND IMPLEMENTATION

• REST API to expose corpus services

• More complex and customizable data retrieval and filtering

• Interface for third-party applications to consume

Page 49: Sinmin Literature Review Presentation

PUBLICATIONS

● Comparison between performance of various database systems for implementing a language corpus - 11th Beyond Databases, Architectures and Structures conference (Pending)

● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Pending)

Page 50: Sinmin Literature Review Presentation

REMAINING WORK FOR THE NEXT PHASE

• Finish writing the crawlers

• Feed data into the Cassandra database

• Connect the front end to the API

Page 51: Sinmin Literature Review Presentation

Questions?

Page 52: Sinmin Literature Review Presentation

Thank you!

