Sanskrit parser Project Report

Post on 13-Dec-2014

description

In this project we parse a Sanskrit sentence so that it can later be translated more easily into another language.

transcript

SANSKRIT PARSER (Parsing a Sanskrit Sentence into a Recognizable Format)

Project Mentor:

Mr. Nikhil Debbarma, Assistant Prof., CSE Dept., NIT Agartala

Team Members:
Akash Bhargava (10UCS002)
Ashok Kumar (10UCS010)
Laxmi Kant Yadav (10UCS027)
Vijay Kumar Gupta (10UCS057)

A translator must know the grammatical structure of both the input and the output language.

According to many researchers, Sanskrit is a very scientific language.

Sanskrit behaves much like a programming language.

So if we are able to make a translator that translates Sanskrit into machine code, it would be a significant development in the field of NLP (Natural Language Processing).

Why We Chose This Project

Why We Became Interested

“NASA scientist Rick Briggs had invited 1,000 Sanskrit scholars from India for working at NASA. But scholars refused to allow the language to be put to foreign use”- Dainik

Being understandable to both computers and humans, Sanskrit was considered useful in space research and many other natural language processing applications.

Content

We will first introduce some concepts and then apply them:

1. Advantages of using Sanskrit

2. Lexical Analysis

3. Parsing

4. Approach

5. Where we are now.

6. Problems

7. References

Linguistically: Sanskrit is a common base for a large group of Indo-European languages

Limited vocabulary: words represent properties (prefix + word + suffix)

Fixed Morphology

Concept of Vibhakti

Advantages of using Sanskrit (Why Sanskrit)

Words in Sanskrit belong to 3 categories, namely-

Dhatu Roop – root of all verbs
Shabda Roop – root of all nouns
Avyaya – words with no morphology (indeclinables)

Each word belonging to:
Dhatu Roop has 36 morphed versions
Shabda Roop has 21 morphed versions
Avyaya represents a single meaning

Fixed Morphology

Vibhakti as Pointer

Consider the sentence 'The man saw the girl with the binoculars.'
The man (S) saw (V) the girl (O) with the binoculars (I)
OR
The man (S) saw (V) the girl with the binoculars (O)

नरः द्विनेत्र्या बालाम् अपश्यत्
नरः द्विनेत्रीम् साकम् बालाम् अपश्यत्

The vibhakti (case ending) is also the reason a sentence is UNAMBIGUOUS: shuffling the words has NO effect on the meaning.
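A minimal sketch of why word order does not matter once every noun carries its vibhakti. The type and function names (`NounCase`, `TaggedNoun`, `word_with_case`) are our own illustration, not part of the project code, and the nouns are assumed to be tagged already by the lexical analysis.

```c
#include <stddef.h>

/* Hypothetical tag set: only the three cases used in the example sentence. */
typedef enum { CASE_NOMINATIVE, CASE_ACCUSATIVE, CASE_INSTRUMENTAL } NounCase;

typedef struct {
    const char *word;      /* surface form of the noun */
    NounCase    vibhakti;  /* case assumed to be found by lexical analysis */
} TaggedNoun;

/* Return the first word carrying the requested case, or NULL if none does.
   The role is read from the tag alone; the position is never consulted. */
const char *word_with_case(const TaggedNoun *nouns, size_t n, NounCase wanted)
{
    for (size_t i = 0; i < n; i++) {
        if (nouns[i].vibhakti == wanted) {
            return nouns[i].word;
        }
    }
    return NULL;
}
```

Calling `word_with_case` with `CASE_NOMINATIVE` on the same three tagged nouns in any order returns the same subject, which is the "pointer" behaviour described above.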

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens

A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer, or scanner

A lexer often exists as a single function which is called by a parser or another function, or can be combined with the parser in scannerless parsing

The lexical analyzer is the first phase of a translator. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Lexical Analysis

The role of the lexical analyzer

[Figure: source program → Lexical Analyzer → token → Parser → output; the Parser requests tokens with getNextToken; an Indexed Database is also shown.]

The output of lexical analysis is a stream of tokens.

A token is a syntactic category:

◦ In English: noun, verb, adjective, …

◦ In Sanskrit: vibhakti, kriya, visheshana, …

The parser relies on the token distinctions.

What’s a Token?

An implementation must do the following:

1. Recognize substrings corresponding to tokens.
2. Search for each identified token in the database to recognize its context.
3. Depending on the context, the token may be a different part of speech in Sanskrit, e.g. a verb (kriya, dhatu roop) or a noun carrying a vibhakti (shabda roop).
4. Tag every token accordingly.

Lexical Analyzer: Implementation

Two important points:

1. The goal is to partition the string. This is implemented by reading left to right, recognizing one token at a time.

2. “Lookahead” may be required to decide where one token ends and the next token begins.

◦ Even our simple example has lookahead issues: i vs. if, and = vs. ==
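The = vs. == case can be sketched with one character of lookahead. This is only an illustration for a programming-language scanner; the names `OpToken` and `scan_operator` are assumptions made for this example.

```c
#include <stddef.h>

typedef enum { TOK_ASSIGN, TOK_EQUALS, TOK_OTHER } OpToken;

/* Classify the operator starting at s[*pos] and advance *pos past it.
   One character of lookahead decides between "=" and "==". */
OpToken scan_operator(const char *s, size_t *pos)
{
    if (s[*pos] == '=') {
        /* s[*pos] is not the terminator, so s[*pos + 1] is safe to read. */
        if (s[*pos + 1] == '=') {
            *pos += 2;              /* consume both characters */
            return TOK_EQUALS;
        }
        *pos += 1;                  /* consume the single '=' */
        return TOK_ASSIGN;
    }
    if (s[*pos] != '\0') {
        *pos += 1;                  /* skip any other character */
    }
    return TOK_OTHER;
}
```

Without the peek at `s[*pos + 1]`, the scanner would have to emit a token for the first `=` before knowing whether a second one follows.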

Lookahead

Sanskrit's property of FIXED MORPHOLOGY lays the basis for analyzing individual verbs and nouns programmatically.

The input word's suffix is analyzed to obtain the following result:

Verbs – tense, number, person
Nouns – gender, number, case

LEXICAL ANALYSIS

Consider the dhatu (verb root) तप्, meaning ‘to heat’. The following inflections are analyzed lexically:

HEATS | WILL HEAT
तपति, तपतः, तपन्ति | तप्स्यति, तप्स्यतः, तप्स्यन्ति
तपसि, तपथः, तपथ | तप्स्यसि, तप्स्यथः, तप्स्यथ
तपामि, तपावः, तपामः | तप्स्यामि, तप्स्यावः, तप्स्यामः

HEATED | HEAT IT (order)
अतपत्, अतपताम्, अतपन् | तपतु, तपताम्, तपन्तु
अतपः, अतपतम्, अतपत | तप, तपतम्, तपत
अतपम्, अतपाव, अतपाम | तपानि, तपाव, तपाम
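As a rough sketch of suffix analysis for verbs, the block below matches only the nine present-tense endings and picks the longest one that fits. The table and the names `VerbEnding` and `match_present_ending` are our own assumptions, not the project's data; future-tense forms share these endings, so detecting tense would need an extra check that is left out here. Persons use English numbering.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *suffix;  /* UTF-8 ending of the inflected verb */
    int person;          /* 1, 2 or 3 in English numbering */
    int number;          /* 1 = singular, 2 = dual, 3 = plural */
} VerbEnding;

/* Illustrative table: present-tense endings only. */
static const VerbEnding PRESENT_ENDINGS[] = {
    {"ति", 3, 1},  {"तः", 3, 2},  {"न्ति", 3, 3},
    {"सि", 2, 1},  {"थः", 2, 2},  {"थ", 2, 3},
    {"ामि", 1, 1}, {"ावः", 1, 2}, {"ामः", 1, 3},
};

/* Byte-wise suffix test; safe for UTF-8 because it is self-synchronizing. */
static int ends_with(const char *word, const char *suffix)
{
    size_t wl = strlen(word), sl = strlen(suffix);
    return wl >= sl && memcmp(word + wl - sl, suffix, sl) == 0;
}

/* Return the longest matching ending, or NULL if no ending matches. */
const VerbEnding *match_present_ending(const char *word)
{
    const VerbEnding *best = NULL;
    size_t best_len = 0;
    size_t n = sizeof PRESENT_ENDINGS / sizeof PRESENT_ENDINGS[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(PRESENT_ENDINGS[i].suffix);
        if (len > best_len && ends_with(word, PRESENT_ENDINGS[i].suffix)) {
            best = &PRESENT_ENDINGS[i];
            best_len = len;
        }
    }
    return best;
}
```

Choosing the longest match matters because a plural form such as तपन्ति also ends in the singular ending ति.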

LEXICAL ANALYSIS

Consider the noun देव, meaning ‘god’. The following inflections are possible (singular, dual, plural):

1. Nominative (subject) देवः देवौ देवाः
2. Accusative (object) देवम् देवौ देवान्
3. Instrumental (by) देवेन देवाभ्याम् देवैः
4. Dative (to) देवाय देवाभ्याम् देवेभ्यः
5. Ablative (from) देवात् देवाभ्याम् देवेभ्यः
6. Genitive (of) देवस्य देवयोः देवानाम्
7. Locative (in) देवे देवयोः देवेषु
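The noun paradigm can be turned into a lookup table. The sketch below assumes the root is already known (for example from the database) and covers only masculine nouns declined like देव; the names `NounEnding` and `analyze_noun` are hypothetical. Several forms are shared between cases, so the function reports every analysis that fits rather than a single answer.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *ending;    /* what follows the root, in UTF-8 */
    const char *vibhakti;  /* case name */
    const char *number;    /* Singular, Dual or Plural */
} NounEnding;

/* The 21 case and number combinations for a masculine noun like देव. */
static const NounEnding A_STEM_ENDINGS[] = {
    {"ः", "Nominative", "Singular"},   {"ौ", "Nominative", "Dual"},        {"ाः", "Nominative", "Plural"},
    {"म्", "Accusative", "Singular"},  {"ौ", "Accusative", "Dual"},        {"ान्", "Accusative", "Plural"},
    {"ेन", "Instrumental", "Singular"}, {"ाभ्याम्", "Instrumental", "Dual"}, {"ैः", "Instrumental", "Plural"},
    {"ाय", "Dative", "Singular"},      {"ाभ्याम्", "Dative", "Dual"},       {"ेभ्यः", "Dative", "Plural"},
    {"ात्", "Ablative", "Singular"},   {"ाभ्याम्", "Ablative", "Dual"},     {"ेभ्यः", "Ablative", "Plural"},
    {"स्य", "Genitive", "Singular"},   {"योः", "Genitive", "Dual"},        {"ानाम्", "Genitive", "Plural"},
    {"े", "Locative", "Singular"},     {"योः", "Locative", "Dual"},        {"ेषु", "Locative", "Plural"},
};

/* Write up to max_out matching analyses into out; return how many were written.
   Returns 0 if the word does not start with the given root. */
size_t analyze_noun(const char *word, const char *root,
                    const NounEnding **out, size_t max_out)
{
    size_t root_len = strlen(root);
    size_t found = 0;
    if (strncmp(word, root, root_len) != 0) {
        return 0;
    }
    const char *rest = word + root_len;  /* the ending after the root */
    size_t n = sizeof A_STEM_ENDINGS / sizeof A_STEM_ENDINGS[0];
    for (size_t i = 0; i < n && found < max_out; i++) {
        if (strcmp(rest, A_STEM_ENDINGS[i].ending) == 0) {
            out[found++] = &A_STEM_ENDINGS[i];
        }
    }
    return found;
}
```

For देवेन the function finds one analysis (Instrumental, Singular), while देवाभ्याम् yields three, which a later stage would have to disambiguate from context.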

LEXICAL ANALYSIS

Input Sentence

Tokenize

Avyaya Analysis

Verb Analysis

Noun Analysis

Unknown word (add to database)
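The flow above can be sketched as a cascade that tries each analysis in turn. The word list and endings below are tiny hard-coded stand-ins for the indexed database, and `Category` and `classify` are names assumed for this example; a real analyzer would need far richer tables (for instance, dual verb forms ending in ः would be misread as nouns here).

```c
#include <stddef.h>
#include <string.h>

typedef enum { CAT_AVYAYA, CAT_VERB, CAT_NOUN, CAT_UNKNOWN } Category;

/* Illustrative stand-ins for the database contents. */
static const char *const AVYAYAS[]      = {"यत्र", "तत्र", "सह"};
static const char *const VERB_ENDINGS[] = {"ति", "सि", "मि"};
static const char *const NOUN_ENDINGS[] = {"ः", "म्", "ेन"};

static int ends_with(const char *word, const char *suffix)
{
    size_t wl = strlen(word), sl = strlen(suffix);
    return wl >= sl && memcmp(word + wl - sl, suffix, sl) == 0;
}

/* Try avyaya, then verb, then noun analysis; fall through to unknown. */
Category classify(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof AVYAYAS / sizeof AVYAYAS[0]; i++) {
        if (strcmp(word, AVYAYAS[i]) == 0) return CAT_AVYAYA;   /* whole-word match */
    }
    for (i = 0; i < sizeof VERB_ENDINGS / sizeof VERB_ENDINGS[0]; i++) {
        if (ends_with(word, VERB_ENDINGS[i])) return CAT_VERB;
    }
    for (i = 0; i < sizeof NOUN_ENDINGS / sizeof NOUN_ENDINGS[0]; i++) {
        if (ends_with(word, NOUN_ENDINGS[i])) return CAT_NOUN;
    }
    return CAT_UNKNOWN;  /* candidate for adding to the database */
}
```

Avyaya words are checked first because they have no endings to analyze, so an exact match is the only way to recognize them.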

The scanner recognizes words

The parser recognizes syntactic units

Parser operations:

◦ Check and verify syntax based on specified syntax rules

◦ Report errors

Automation:

◦ The process can be automated

Parsing

1. Simplicity of design

2. Improving efficiency

3. Enhancing portability

Why Separate Lexical Analysis and Parsing

Parsing Sanskrit Text

Now we move towards translating a Sanskrit sentence into its parsed equivalent

PARSING: analyze (a sentence) into its component parts and describe their syntactic roles.

Analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.

Parsing Sanskrit Text

Sanskrit sentence structure: SOV

English sentence structure: SVO

बालः पाठम् पठति (S O V)
Boy reads chapter (S V O)

Example Sanskrit Sentence

Approach(Coding Concept)

We first tokenize the input using strtok(str, " ");

Each token can be of 3 types: noun, verb, or avyaya (indeclinable, e.g. a preposition).

The task is to identify each token, which is done by matching it against the indexed database.

Each token is stored in a structure along with its meaning and its morphology.

The parser then comes into play and forms a tree-like structure from these tokens.
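A minimal sketch of the tokenizing step with strtok. The `Token` layout and the `tokenize` function are assumptions made for illustration; the actual structure used in the project is not shown in these slides. The meaning and morphology fields are left empty for the database lookup to fill in.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *text;        /* points into the caller's sentence buffer */
    const char *meaning;     /* to be filled in from the database */
    const char *morphology;  /* to be filled in by suffix analysis */
} Token;

/* Split `sentence` in place on spaces and store up to max_tokens tokens.
   Returns the number of tokens stored. strtok overwrites each separator
   with '\0', so the buffer must be writable. */
size_t tokenize(char *sentence, Token *tokens, size_t max_tokens)
{
    size_t count = 0;
    for (char *piece = strtok(sentence, " ");
         piece != NULL && count < max_tokens;
         piece = strtok(NULL, " ")) {
        tokens[count].text = piece;
        tokens[count].meaning = NULL;
        tokens[count].morphology = NULL;
        count++;
    }
    return count;
}
```

Because strtok modifies its argument, the sentence must be passed as a writable array (for example `char buf[] = "..."`), not as a string literal.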

Bottom-Up Parser Technique

Bottom-Up LR

◦ Construct the parse tree in a bottom-up manner

◦ Find the rightmost derivation in reverse order

◦ For every potential right-hand side and token, decide when a production is found

More powerful: bottom-up parsers can handle the largest class of grammars that can be parsed deterministically
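The bottom-up idea can be sketched with a naive shift-reduce recognizer. It is not a table-driven LR parser, and the toy grammar (Sentence → Subject Object Verb, Subject → nominative noun, Object → accusative noun) is an assumption chosen only to fit the SOV example; it does not model the free word order of real Sanskrit. All names are hypothetical.

```c
#include <stddef.h>

typedef enum { NOUN_NOM, NOUN_ACC, VERB, SUBJECT, OBJECT, SENTENCE } Symbol;

#define STACK_MAX 64

/* Try one reduction on the top of the stack; return 1 if the stack changed. */
static int reduce_once(Symbol *stack, size_t *top)
{
    if (*top >= 1 && stack[*top - 1] == NOUN_NOM) {   /* Subject -> NOUN_NOM */
        stack[*top - 1] = SUBJECT;
        return 1;
    }
    if (*top >= 1 && stack[*top - 1] == NOUN_ACC) {   /* Object -> NOUN_ACC */
        stack[*top - 1] = OBJECT;
        return 1;
    }
    if (*top >= 3 && stack[*top - 3] == SUBJECT &&
        stack[*top - 2] == OBJECT && stack[*top - 1] == VERB) {
        *top -= 2;                                    /* Sentence -> Subject Object Verb */
        stack[*top - 1] = SENTENCE;
        return 1;
    }
    return 0;
}

/* Return 1 if the tagged input reduces to a single SENTENCE, else 0. */
int parse_sov(const Symbol *input, size_t n)
{
    Symbol stack[STACK_MAX];
    size_t top = 0;
    for (size_t i = 0; i < n; i++) {
        if (top == STACK_MAX) return 0;   /* input too long for this sketch */
        stack[top++] = input[i];          /* shift the next token */
        while (reduce_once(stack, &top)) {
            /* keep reducing while a right-hand side sits on top of the stack */
        }
    }
    return top == 1 && stack[0] == SENTENCE;
}
```

Each reduction replaces a right-hand side on top of the stack with its left-hand side, so the parse tree is built from the leaves upward.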

Approach

Programming language used: C and C++
Database used: Linux file system, indexed
Data structures: array, linked list, structure, tree, indexing and hashing

INPUT: a Sanskrit sentence or paragraph, e.g. यत्र रामः गच्छति तत्र देवाः बालेन सह नदीम् निकषा तिष्ठन्ति।

OUTPUT: recognize all the parts of speech and form a tree structure to be able to understand the sentence.

How the Output Will be Shown in Terminal

यत्र::: this is a avyaya.. and the meaning is: where_there
रामः::: Nominative,Singular, Gender-Masculine ,noun and the root is: राम and the meaning is Ram
गच्छति::: The root is: गच्छ the meaning is: go present-tense,first-person,singular
तत्र::: this is a avyaya.. and the meaning is: there
देवाः::: Nominative,Plural Gender-Masculine ,noun ,and the root is: देव and the meaning is god
बालेन::: Instrumental,Singular, Gender-Masculine ,noun, and the root is: बाल and the meaning is boy
नदीम्::: Accusative,Singular, Gender-Feminine ,noun and the root is: नदी and the meaning is river

Avyaya's Role in Sanskrit

Avyaya words (indeclinables) are used to connect two or more simple sentences. Examples:
यदि-तदि (if-then)
यत्र-तत्र (where-there)
परन्तु (but)
अथापि (hence)
चेद् (provided, if)
Avyaya words not only connect sentences but also affect the structure of a simple sentence.

Challenges in the Code

Every word encountered in the input sentence could be any part of speech, as Sanskrit has no fixed word order.

Because of this property, searching the database becomes important.

The database and word collection were in Unicode format, so the size of each word becomes even larger.
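To make the size issue concrete, the sketch below counts code points in a UTF-8 string so they can be compared with the byte length reported by strlen. It assumes the text is stored as UTF-8 (the slides only say "unicode"), and `utf8_code_points` is a helper name of our own.

```c
#include <stddef.h>

/* Count Unicode code points in a UTF-8 string by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

Every Devanagari character takes three bytes in UTF-8, so the root देव is 3 code points but 9 bytes, while the Latin transliteration deva is 4 bytes.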

Problems

Grammar of Sanskrit language

How to represent it as a BNF grammar

Parser techniques

Structure of code

Where We are Now

A big chunk of our time was invested in researching the Sanskrit language and its grammar, which was quite difficult.

So far we have implemented the lexer and the parser.

Reference

Sanskrit & Artificial Intelligence (NASA): Knowledge Representation in Sanskrit and Artificial Intelligence by Rick Briggs
http://www.vedicsciences.net/articles/sanskrit-nasa.html

AI Magazine publishes the importance of Sanskrit
http://www.parankusa.org/SanskritAsProgramming.pdf

http://sanskrit.jnu.ac.in/morph/analyze.jsp

http://en.wikipedia.org/wiki/Sanskrit_verbs

http://en.wikipedia.org/wiki/Sanskrit_grammar

Thank You