Presentation on Digital Library: “Issues and Solutions”
Nurturing Living Languages
© C-DAC
Social and economic growth is catalyzed by the presence of the
Internet.
Development of the Internet has been mainly in English.
Domain names use only the 26 unaccented Latin letters, the 10 digits
(0-9), the hyphen and the dot.
For the proliferation and preservation of heritage, culture and
content creation in multiple languages, it is essential to have
domain names in multilingual scripts.
Background
The application (such as a browser) converts the name to ASCII
Compatible Encoding (ACE): www.xn--3b7vcv67.com
Registry entry: xn--3b7vcv67.com (ASCII characters)
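The conversion performed by the application can be sketched with Python's built-in "idna" codec, which follows the older IDNA 2003 rules. The domain name used here is purely illustrative and does not come from the slides.

```python
# Sketch: convert a Unicode domain name to its ASCII Compatible Encoding
# (ACE) and back, using Python's built-in "idna" codec (IDNA 2003 rules).

def to_ace(domain: str) -> str:
    """Return the ASCII-only form that a registry would store."""
    return domain.encode("idna").decode("ascii")

def from_ace(ace_domain: str) -> str:
    """Return the Unicode form that is shown to the user."""
    return ace_domain.encode("ascii").decode("idna")

if __name__ == "__main__":
    unicode_name = "www.भारत.com"   # illustrative name (assumption)
    ace_name = to_ace(unicode_name)
    print(ace_name)                 # ASCII only; the Devanagari label gets an "xn--" prefix
    print(from_ace(ace_name))       # back to the Unicode form
```

Only the non-ASCII label is rewritten; the "www" and "com" labels pass through unchanged.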
Background
xn--e2br9czb
xn--m1be
India has one of the largest linguistic diversities in the world:
4 major language families, at least 35 different languages and
around 2000 dialects.
Languages belong to either the Indo-Aryan (ca. 74%), the Dravidian
(ca. 24%), the Austro-Asiatic (Munda) (ca. 1.2%) or the
Tibeto-Burman (ca. 0.6%) families. Some of the languages of the
Himalayas are still unclassified.
India has 22 scheduled languages, and English continues to be the
“associate additional official language”.
The following scripts will be most needed: Assamese, Bangla,
Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil,
Telugu, Urdu.
Devanagari – Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali,
Dogri, Santhali, etc.
Thus the Devanagari code page can support all languages using that
particular script.
Solution:
Though the contents would reveal the language used, it would be
ideal if a special attribute code indicating the language were
inserted.
Konkani is written in Roman, Devanagari, Malayalam and
Kannada.
Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic),
Devanagari, Gujarati and also Roman.
Sindhi speakers have adopted the Perso-Arabic script for
representing their language. In the case of Konkani, Devanagari is
used as the official script.
Hence it is proposed that the same formula be used when allotting
IDNs.
However, nothing stops a client from wanting to have his IDN in all
the scripts, and this can be efficiently catered for by providing a
broad-based transliteration facility which would transliterate a
name from one Indian script to another.
Thus a Konkani domain name in Devanagari could be transliterated
into Kannada, Malayalam and Roman.
Solution:
The best solution to this is by way of linguistic or political
consensus
One language :: many scripts
The solution:
A tool for transliteration from one Indian script to another can be
easily deployed.
The transliterated data could be presented to the client, who could
verify the transliteration and, if it meets his approval, have the
IDN registered in all possible scripts.
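A transliteration tool of this kind could start from the observation that the Unicode blocks for the major Indic scripts follow a common ISCII-derived layout. The following is a minimal sketch under that assumption, not a production tool: a real transliterator needs per-script exception tables, and Roman and Perso-Arabic are not covered at all. The function and table names are hypothetical.

```python
# Sketch: naive transliteration between Indic scripts by shifting code
# points. Assumption: a character keeps the same offset within each
# script's 128-code-point Unicode block.
import unicodedata

BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

def transliterate(text: str, src: str, dst: str) -> str:
    """Map each character of the source script to the same slot in the target script."""
    src_start, dst_start = BLOCK_START[src], BLOCK_START[dst]
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(dst_start + offset)
            # Keep the original character when the target script has no
            # assigned character at that position.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    print(transliterate("कमल", "devanagari", "kannada"))
    print(transliterate("कमल", "devanagari", "gujarati"))
```

The client would still need to verify each result, since the offset mapping can produce sequences that are not natural spellings in the target script.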
ACE, i.e. ASCII Compatible Encoding.
This is intimately tied to Nameprep (RFC 3491) and Punycode
(RFC 3492) as well as to Stringprep (RFC 3454).
Nameprep prepares an IDN string, which is then encoded by Punycode
and stored, with the ACE prefix, as 7-bit ASCII data.
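The relationship between the three steps can be sketched per label as follows. This is a simplified illustration using Python's standard library; it omits the length and hostname-character checks that a full implementation would apply, and the function names are hypothetical.

```python
# Sketch of the per-label steps: Nameprep (RFC 3491), then Punycode
# (RFC 3492), then the "xn--" ACE prefix. Standard library only.
from encodings.idna import nameprep  # stdlib Nameprep profile of Stringprep

ACE_PREFIX = "xn--"

def label_to_ace(label: str) -> str:
    """Prepare one label and return its ASCII Compatible Encoding."""
    prepared = nameprep(label)   # case folding, normalisation, prohibited-character checks
    if prepared.isascii():
        return prepared          # plain ASCII labels need no Punycode step
    return ACE_PREFIX + prepared.encode("punycode").decode("ascii")

def ace_to_label(ace_label: str) -> str:
    """Reverse the encoding for display."""
    if not ace_label.startswith(ACE_PREFIX):
        return ace_label
    return ace_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

if __name__ == "__main__":
    ace = label_to_ace("भारत")   # illustrative label (assumption)
    print(ace)
    print(ace_to_label(ace))
```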
We would like to make a case for the use of ISCII-91 as a parallel
code for Brahmi-based scripts.
ISCII deploys the same encoding for all Brahmi-based scripts.
The advantage of this is obvious: storage in ISCII would allow IDN
to transliterate a name on the fly into any Indic script and
thereby ensure, at the Punycode level itself, that a name allotted
in one script is also automatically allotted in the other scripts
to the same owner. This would do away with name squatting, which
would otherwise be a regular feature of IDN allocation in Indic
scripts.
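The effect of a shared encoding can be illustrated with a script-neutral key: names that differ only in script collapse to the same key, so the second applicant is refused. The sketch below approximates ISCII's common layout with offsets inside the Unicode Indic blocks; that approximation, and all names in the code, are assumptions made for illustration.

```python
# Sketch: allot a name once for all Brahmi-derived scripts by storing a
# script-neutral key. Assumption: the offset of a character inside its
# Unicode block stands in for the shared ISCII code.
INDIC_FIRST, INDIC_LAST = 0x0900, 0x0D7F   # Devanagari through Malayalam
BLOCK_SIZE = 0x80

def script_neutral_key(label: str) -> tuple:
    key = []
    for ch in label:
        cp = ord(ch)
        if INDIC_FIRST <= cp <= INDIC_LAST:
            key.append(("indic", (cp - INDIC_FIRST) % BLOCK_SIZE))
        else:
            key.append(("other", cp))
    return tuple(key)

def allot(registry: dict, label: str, owner: str) -> bool:
    """Allot the name to the owner unless another owner holds it in any script."""
    holder = registry.setdefault(script_neutral_key(label), owner)
    return holder == owner

if __name__ == "__main__":
    registry = {}
    print(allot(registry, "कमल", "first-client"))    # Devanagari
    print(allot(registry, "કમલ", "second-client"))   # same name in Gujarati
```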
Alternate mechanism
IDN & THE PROBLEM OF ALLOTTING NAMES
The IDN server which will allot the domain names is to be
automated, and hence it is of vital interest that a mechanism of
checks and counter-checks be set up to ensure the highest level of
security.
Two major issues are at stake. These issues are mainly specific to
Indian scripts and the complex nature of their visual
rendering.
PROBLEM 1: DOUBLETS
The first is the need to ensure that doublets are avoided. Doublets
are IDNs which are nearly alike, either as homophones or as close
homographs. Thus spelling Maharashtra in three different ways
can lead to identity confusion, and since all three spellings
are different, the server would accept all of the names as valid
IDNs, whereas in fact the original client would not want his
IDN to be misused.
Problem 2: SECURITY ISSUES
More serious is the willful use of such tactics to perpetrate fraud
by misleading a user into believing that he has logged on to a
bona fide site, and thus persuading the user to divulge information
such as his credit card number.
UNDERLYING THESE PROBLEMS AND ISSUES ARE THREE MAJOR POTENTIAL
SECURITY HOLES
HOMOPHONES AND HOMOGRAPHS
SPELLING VARIANTS
SPELLING ERRORS
Each of these will be studied in relation to its pertinence to
ensuring maximal security.
These are aural and visual look-alikes and given the phonetic
nature of Indian scripts are a potential source of confusion.
A typology of these has been established:
VISUAL LOOK ALIKES
AURAL LOOK ALIKES
Homophones and Homographs
Devanagari
The first ligature is a half da + full dha; the second is a half
dha followed by a full da. To an average reader of Hindi, the two
forms look practically alike and lead to confusion.
A similar situation arises in the case of Gujarati.
The first is ka+la; the second is ka+halanta+la.
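Although the rendered forms look alike, the underlying code point sequences differ, which is what the registry actually compares. A short sketch listing the sequences described above (the helper name is hypothetical):

```python
# Sketch: list the code points behind the look-alike forms described above.
import unicodedata

def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Devanagari: half da + full dha versus half dha + full da
DA, DHA, HALANT = "\u0926", "\u0927", "\u094D"
deva_first = DA + HALANT + DHA
deva_second = DHA + HALANT + DA

# Gujarati: ka + la versus ka + halanta + la
G_KA, G_LA, G_HALANT = "\u0A95", "\u0AB2", "\u0ACD"
guj_first = G_KA + G_LA
guj_second = G_KA + G_HALANT + G_LA

if __name__ == "__main__":
    for form in (deva_first, deva_second, guj_first, guj_second):
        print(form, describe(form))
```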
Homophones and Homographs
AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS.
This can best be seen in the case of Nukta characters. These can be
generated in two different ways.
In each pair, the first character is a single character whereas the
second is made up of two characters: the consonant followed by the
dot or nukta character. To the naked eye the two look alike,
whereas for the machine these would be two different IDNs.
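Unicode normalization removes this particular ambiguity, because the single-character and two-character forms are canonically equivalent. A minimal sketch using Python's standard library, with Devanagari QA and NNNA as example characters chosen for illustration:

```python
# Sketch: the precomposed nukta letter and the consonant + nukta sequence
# compare unequal as raw strings but equal after Unicode normalization.
import unicodedata

def canonical(label: str) -> str:
    """Return the NFC-normalized form used for comparison."""
    return unicodedata.normalize("NFC", label)

precomposed = "\u0958"      # DEVANAGARI LETTER QA, a single code point
sequence = "\u0915\u093C"   # DEVANAGARI LETTER KA followed by SIGN NUKTA

if __name__ == "__main__":
    print(precomposed == sequence)                        # raw comparison
    print(canonical(precomposed) == canonical(sequence))  # normalized comparison
```

A registry that compares names only after normalization would treat both spellings as the same IDN.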
Homophones and Homographs
SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE-PAGE:
Within a code-page, two characters can look practically alike and
create ambiguity. This is especially the case when the font enabled
on the client machine is not of high quality; given the small size
of the characters (normally 10 point), this can lead to confusion.
Some examples are given below:
Devanagari
Homophones and Homographs
IDENTICAL CHARACTERS IN UNICODE
This is the case with a glyph shared by Urdu and Sindhi. Character
U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in
Sindhi represents /kheheh/. Since both fall within the same
code-page, aural disambiguation is impossible except by recourse to
the language used.
Homophones and Homographs
Aural Look-Alikes: Homophones
Indian languages being phonetic in nature, aural representation is
a major issue.
Aural look-alikes mainly arise from the fact that Indian languages
are generally typed as they are spoken. Very often they arise out
of
spelling variants and/or
the ignorance of the user as to the correct spelling of the
word.
A large number of sub-types of problems can emerge from such
homophonic representations.
Homophones and Homographs
Aural Look-Alikes: Homophones-1
Confusion between the two nasal modifiers (wherever such nasal
modifiers exist).
Hindi Gujarati
Confusion between two or more similar sounding consonants (normally
dental vs. retroflex sibilants and laterals):
Marathi Gujarati
Confusion arising out of short and long vowels:
Tamil: Gujarati Hindi
Homophones and Homographs
Absence or presence of a halanta.
This is a source of errors even among educated speakers of the
language. Proper names tend to be written at times with or without
the halanta.
Thus the name Shirke in Marathi can be written in the following two
ways; the first is correct, while the second is not normatively
valid but could be accepted:
Confusion arising out of the use of the rakar + “u” matra instead of
the vowel form:
vs.
Homophones and Homographs
Aural Look-Alikes: Homophones-3
A remote source of error would be the use of the Visarga or vowel
lengthener to modify an IDN. The Visarga is mainly used in Sanskrit
and very rarely in Neo-Indo-Aryan languages. However, an IDN with
or without the Visarga could create ambiguity.
Homophones and Homographs
Aural Look-Alikes: Homophones-4
Insertion of a zero width character (ZWJ/ZWNJ) within the name
string:
The first has no non-joiner, the second has a non-joiner. Visually
both look alike and can lead to confusion.
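One way to neutralise this case is to compare names with the zero-width joiner and non-joiner removed. A minimal sketch, with hypothetical helper names and an illustrative conjunct:

```python
# Sketch: treat two names as the same when they differ only by zero-width
# joiner (U+200D) or zero-width non-joiner (U+200C) characters.
ZWNJ = "\u200C"
ZWJ = "\u200D"

def strip_joiners(label: str) -> str:
    """Remove ZWJ and ZWNJ before comparing or storing a name."""
    return label.replace(ZWNJ, "").replace(ZWJ, "")

def same_ignoring_joiners(a: str, b: str) -> bool:
    return strip_joiners(a) == strip_joiners(b)

if __name__ == "__main__":
    plain = "\u0915\u094D\u0937"               # ka + halant + ssa
    with_zwnj = "\u0915\u094D\u200C\u0937"     # same, with a non-joiner inserted
    print(plain == with_zwnj)
    print(same_ignoring_joiners(plain, with_zwnj))
```

Note that in some scripts the joiners change how a conjunct is rendered, so stripping them is suitable for comparison but not necessarily for display.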
Homophones and Homographs
SUB-TYPE II SPELLING VARIANTS
This is best seen in the case of Hindi, where a nasal modifier can
substitute for a corresponding half nasal consonant.
The word Hindi itself can be written either as हिंदी or as हिन्दी.
Obviously two IDNs based on these spelling variants should not be
allowed; both must be resolved to the same norm.
A similar situation exists in Marathi in the use of (timba) vs. the
/e/ vowel modifier. The first is used in colloquial Marathi under
special environments, whereas the second is the literary form. A
filter which would normalize the two would have to be
written.
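For the Hindi nasal case, such a filter could rewrite a half nasal consonant as the nasal modifier (anusvara) whenever it precedes a consonant of its own class. The sketch below is a simplified illustration of that single rule, not the rule set drawn up by the language experts; the names are hypothetical.

```python
# Sketch: normalize "half nasal consonant + consonant of the same class"
# to "anusvara + consonant", so both spellings resolve to one form.
import re

ANUSVARA = "\u0902"
HALANT = "\u094D"

# Class nasal -> range of consonants in the same class (varga).
HOMORGANIC = {
    "\u0919": "\u0915-\u0918",  # nga before ka..gha
    "\u091E": "\u091A-\u091D",  # nya before ca..jha
    "\u0923": "\u091F-\u0922",  # nna before tta..ddha
    "\u0928": "\u0924-\u0927",  # na  before ta..dha
    "\u092E": "\u092A-\u092D",  # ma  before pa..bha
}

_PATTERN = re.compile(
    "|".join(f"{nasal}{HALANT}(?=[{rng}])" for nasal, rng in HOMORGANIC.items())
)

def normalize_nasals(word: str) -> str:
    """Replace each half nasal before a same-class consonant with anusvara."""
    return _PATTERN.sub(ANUSVARA, word)

if __name__ == "__main__":
    print(normalize_nasals("हिन्दी") == normalize_nasals("हिंदी"))
```

A half nasal before a consonant of a different class is left untouched, since there the two spellings are not interchangeable.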
SUB-TYPE III SPELLING ERRORS
These, whether conscious or unconscious, could create homographic
doublets and need to be detected in order to ensure that the client
does not have a spurious IDN competing with his real IDN.
Misspellings of words and inversions of letters can all lead to IDN
doublets.
A good example is words in Hindi which have Urdu roots and which
admit spellings without halanta (Urdu norm) and with halanta
(Hindi aural norm).
Proposed Recommendations
An action plan has been proposed for ensuring maximum security in
the allotment of IDNs in Indian scripts.
This is in the shape of recommendations arising out of
discussions.
The recommendations are both specific and generic in nature.
Level 2 Government bodies and Institutions (Bank, insurance,
healthcare, etc)
Level 3 Corporates and NGOs
Level 4 All other users.
Proposed Recommendations: GENERIC STRATEGIES-2
The implementation should be tested in TESTBED mode and IDNs
should be allotted in a phased manner:
Level 1 (highest security) and Level 2 (government bodies and
institutions) should be permitted to register in the test bed mode.
This will also have the advantage of automatically blocking all
demands by “spoofers” and “hackers” to squat on such names.
Names at Levels 1 and 2 should be automatically denied to other
users.
At this stage the automated software for providing variants based
on visual and homophonic identities should be set in place.
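The phased opening of registration could be expressed as a small policy table. The sketch below is one possible reading of the recommendation; the phase numbers and all identifiers are assumptions introduced for illustration.

```python
# Sketch: which applicant levels may register in which phase.
# Phase numbering is hypothetical; only the ordering comes from the text.
from enum import IntEnum

class Level(IntEnum):
    HIGHEST_SECURITY = 1   # Level 1
    GOVERNMENT = 2         # Level 2: government bodies and institutions
    CORPORATE_NGO = 3      # Level 3: corporates and NGOs
    OTHER = 4              # Level 4: all other users

PHASE_ALLOWS = {
    1: {Level.HIGHEST_SECURITY, Level.GOVERNMENT},                        # test bed
    2: {Level.HIGHEST_SECURITY, Level.GOVERNMENT, Level.CORPORATE_NGO},
    3: set(Level),                                                        # everyone
}

def may_register(applicant: Level, phase: int) -> bool:
    """Return True if an applicant of this level may register in this phase."""
    return applicant in PHASE_ALLOWS[phase]

if __name__ == "__main__":
    print(may_register(Level.GOVERNMENT, 1))
    print(may_register(Level.OTHER, 1))
```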
Proposed Recommendations: GENERIC STRATEGIES-2
Subsequently Level 3, i.e. corporates and NGOs, should be allowed
to register. The software will generate all possible variants of
their names, as per the rules of the language, and these can be
proposed to them. If they so desire they can register all these
variants or keep them open, after being overtly warned that such a
step could lead to spoofing.
Level 4 can be integrated at the end.
Phased allotment of IDNs will to a large extent eradicate spoofing
and phishing and ensure maximal security.
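Variant generation of the kind described can be sketched as the repeated application of interchangeable character pairs. The two rules below are illustrative assumptions only; they are not the rule lists prepared for each language.

```python
# Sketch: expand a name into its spelling variants from a list of
# interchangeable character pairs (illustrative rules, see lead-in).
RULES = [
    ("\u093F", "\u0940"),  # short i vowel sign <-> long ii vowel sign
    ("\u0901", "\u0902"),  # chandrabindu <-> anusvara
]

def variants(name, rules=RULES, limit=1000):
    """Return the names reachable by applying the rules in either direction.

    The limit is a soft cap that stops runaway expansion if a rule can be
    applied without end (for example when one side contains the other).
    """
    seen = {name}
    pending = [name]
    while pending and len(seen) < limit:
        current = pending.pop()
        for a, b in rules:
            for src, dst in ((a, b), (b, a)):
                pos = current.find(src)
                while pos != -1:
                    candidate = current[:pos] + dst + current[pos + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        pending.append(candidate)
                    pos = current.find(src, pos + 1)
    return seen

if __name__ == "__main__":
    for v in sorted(variants("हिंदी")):
        print(v)
```

The resulting set is what would be shown to the applicant, who can then choose to register or leave open each variant.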
1. Two scripts (code pages) should not be mixed.
2. As far as possible, numbers (digits) should not be used, unless
they acquire a linguistic value such as 365, 24/7 etc. Domain names
are not like mail applications where you can have the name followed
by a digit.
3. Punctuation marks should be avoided as far as possible. These
can also result in confusion, as is the case of the eyelash repha
in Marathi.
4. Although under ideal circumstances correct spelling would be
the norm, the first instance of a name registered, even if it is
incorrect, would be deemed as registered, and all further variants
generated by the software, including the correct one, would be
reserved or permitted as per the wish of the sanctioning
authority.
Proposed Recommendations: SPECIFIC ISSUES-2
5. The whole process is to be automated by means of software which
will ensure to the highest degree that the “security holes” are not
breached.
Given that there would be a large number of applications, and that
manual processing would not be possible or, if possible, would
result in inordinate delays, automation is a pre-requisite.
Identification of potential zones: Potential zones for ensuring
security were identified.
These are:
List of potential spelling variants
List of potential zones of error in terms of misspellings which
are not trapped by the variants list.
Explanatory documents and templates for each of the desired data
sets were provided by C-DAC GIST to those concerned.
The templates gave examples for each type of requirement, as in the
sample template below:
C-DAC, Pune has been entrusted with the creation of data for three
languages: Hindi, Marathi and Urdu.
As per the agreement, expert committees for all three languages
have been appointed, the experts being professors and professionals
working in the publishing industry, since these have the linguistic
skills and know-how to investigate and create the required
data.
A translation of the three-letter extensions of the names has also
been provided. To ensure across-the-board intelligibility, this is
in Sanskrit.
In the slides that follow, samples of the quantum of work
accomplished in each of the languages will be detailed.
Report-1
1) EDU
2) GOV
3) IN
6) MIL -
7) RES
13) MED
14) AGRI
Report-1: Marathi
In the case of Marathi, a committee headed by Shri Phadake, who has
books on “shuddha-lekhan” to his credit, has been appointed.
Work has commenced on all three areas:
Variants list
Spelling Variants
Erroneous Spellings
A large number of rules have been generated, as has data on
spelling variants and misspellings.
Report-2: Hindi
A similar exercise has been carried out for Hindi. Sample files are
provided below. Over 100 different rule variants have been
identified.
Report-3: Urdu
Under the able guidance of Prof Yunus Fahmi, spelling variants,
misspellings and variant lists are being created.
Some sample files for the variant list and spelling variants are
appended.