Urdu Character Set and Collating Sequence

Date post: 02-Feb-2016
Urdu Character Set and Collating Sequence. Sarmad Hussain, مرکزتحقیقاتِ اردو (Center for Research in Urdu Language Processing), FAST National University of Computer and Emerging Sciences.
Transcript
Page 1: Urdu Character Set and Collating Sequence

Urdu Character Set and Collating Sequence

Sarmad Hussain

مرکزتحقیقاتِ اردو
Center for Research in Urdu Language Processing

FAST National University of Computer and Emerging Sciences

Page 2: Urdu Character Set and Collating Sequence

2 مرکزتحقیقات اردو

Purpose of Presentation

► Indicate the “state of affairs”
   Character set
   Collating sequence
► Show what has been done regarding the standardization
► Identify what needs to be done

Page 3: Urdu Character Set and Collating Sequence

Sources

► Data from four dictionaries of Urdu
1. فیروزاللغات جامع، فیروز سنز، لاہور (Feroz-ul-Lughat Jame, Ferozsons, Lahore) (FLJ)
2. Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)
3. فرہنگ تلفظ، مقتدرہ قومی زبان، اسلام آباد (Farhang-e-Talaffuz, National Language Authority, Islamabad) (FT)
4. جدید اردو لغت، مقتدرہ قومی زبان، اسلام آباد (Jadeed Urdu Lughat, National Language Authority, Islamabad) (JUL)

Page 4: Urdu Character Set and Collating Sequence

Character Set

► Alphabet
► Harakat (Aerab)
► Other Symbols

Page 5: Urdu Character Set and Collating Sequence

“Typical” Alphabet

آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے

Source: اردو قاعدہ، فیروز سنز، لاہور (Urdu Qaida, Ferozsons, Lahore)

Page 6: Urdu Character Set and Collating Sequence

“Familiar” Harakaat (Aerab)

Zabar دَ
Zer دِ
Pesh دُ
Jazm دْ
Khari zabar دٰ
Khari zer دٖ
Ulta pesh دٗ
Do zabar دً
Do zer دٍ
Do pesh دٌ
Tashdeed دّ
Noon ghunna ن

Page 7: Urdu Character Set and Collating Sequence

“Common” Other Symbols

Numbers
0 ۰   1 ۱   2 ۲   3 ۳   4 ۴   5 ۵   6 ۶   7 ۷   8 ۸   9 ۹

Punctuation
؟ ؛ ٬ -

Honorifics

Other Symbols

Page 8: Urdu Character Set and Collating Sequence

Urdu Alphabet: State of Affairs

FT, JUL
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ء ی ے

FLJ, STCD
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے

Page 9: Urdu Character Set and Collating Sequence

Current GoP Standard: UZT 1.01

Page 10: Urdu Character Set and Collating Sequence

Logical Sections of UZT 1.01

► Alphabet (80 – 122)
► Aerab/diacritics/harakat (66 – 79, 123 – 126)
► Other characters
   Punctuation and arithmetic symbols (32 – 47, 58 – 65)
   Digits (48 – 57)
   Special symbols (160 – 176, 192 – 199)
   Miscellaneous
► Control characters (0 – 31, 127)
► Reserved control space (128 – 159, 255)
► Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
► Vendor area (208 – 239)
► Toggle character (254)
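The code ranges listed for UZT 1.01 can be read as a lookup table. The following is a minimal sketch, assuming only the ranges given on this slide; the names `UZT_SECTIONS` and `uzt_section` are made up for illustration and are not part of the standard.

```python
# Inclusive code ranges copied from the slide; names are illustrative only.
UZT_SECTIONS = [
    ("Control characters", [(0, 31), (127, 127)]),
    ("Punctuation and arithmetic symbols", [(32, 47), (58, 65)]),
    ("Digits", [(48, 57)]),
    ("Aerab/diacritics/harakat", [(66, 79), (123, 126)]),
    ("Alphabet", [(80, 122)]),
    ("Reserved control space", [(128, 159), (255, 255)]),
    ("Special symbols", [(160, 176), (192, 199)]),
    ("Reserved expansion space", [(177, 191), (200, 207), (240, 253)]),
    ("Vendor area", [(208, 239)]),
    ("Toggle character", [(254, 254)]),
]


def uzt_section(code: int) -> str:
    """Return the logical section of UZT 1.01 that a one-byte code falls in."""
    if not 0 <= code <= 255:
        raise ValueError("UZT 1.01 is an 8-bit code page: expected 0..255")
    for name, ranges in UZT_SECTIONS:
        for low, high in ranges:
            if low <= code <= high:
                return name
    raise ValueError(f"code {code} is not covered by the table")


print(uzt_section(80))   # first code of the alphabet range
print(uzt_section(254))  # the toggle character
```

Taken together, the ranges on the slide account for every code from 0 to 255, so the final `raise` should never be reached.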

Page 11: Urdu Character Set and Collating Sequence

Conclusions: Standard Urdu Character Set

► No general agreement on the Urdu character set among dictionary publishers
► Standard character set defined by the National Language Authority (NLA)
   not well-publicized
   not widely adopted
► The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
► Will soon be fully represented in Unicode/ISO 10646

Page 12: Urdu Character Set and Collating Sequence

Urdu Collating Sequence: State of Affairs

FT, JUL
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ہ ء ی ے

FLJ
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے

STCD
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے

Page 13: Urdu Character Set and Collating Sequence

آ ا Variation

► STCD and FLJ
آب، آپ، اب، ایوان

► FT and JUL
اب، ایوان، آب، آپ

Page 14: Urdu Character Set and Collating Sequence

ں ن Variation

► FLJ, FT & STCD
ماں، مان

► JUL
مان، ماں

Page 15: Urdu Character Set and Collating Sequence

ھ ہ Variation

► FLJ
باپ، بہن، بہنگی، بھابی، بھنگی، بیٹا

► STCD
باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا

► FT & JUL
باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی

بانو، بانھ، بانی
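The FLJ and FT & JUL orderings above differ in whether do-chashmi he (ھ) counts as a letter of its own or is joined to the preceding consonant as one unit. A minimal sketch of both treatments follows. It assumes the FLJ, STCD letter order from the alphabet slide, and the names `elements`, `rank` and `sort_key` are hypothetical. The STCD ordering, which appears to interleave ہ and ھ, is not modelled.

```python
# Letter order assumed from the FLJ, STCD alphabet slide (ھ follows ہ).
LETTERS = "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنوہھءیے"
DO_CHASHMI_HE = "ھ"  # U+06BE


def elements(word, contract):
    """Split a word into sortable units; if contract, glue ھ to the letter before it."""
    units = []
    for ch in word:
        if contract and ch == DO_CHASHMI_HE and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def rank(unit):
    # Unknown letters sort last; an aspirated unit sorts right after its base letter.
    base = LETTERS.index(unit[0]) if unit[0] in LETTERS else len(LETTERS)
    return (base, len(unit) - 1)


def sort_key(word, contract):
    return [rank(u) for u in elements(word, contract)]


words = ["بیٹا", "بھنگی", "بہن", "باپ", "بھابی", "بہنگی"]
flj_style = sorted(words, key=lambda w: sort_key(w, contract=False))
ft_jul_style = sorted(words, key=lambda w: sort_key(w, contract=True))
print(flj_style)
print(ft_jul_style)
```

With `contract=False` the six words come out in the FLJ order listed above; with `contract=True` every بھ word moves after all plain ب words, as in the FT & JUL list.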

Page 16: Urdu Character Set and Collating Sequence

ے ی Variation

► FLJ, FT & JUL
بی، بی بی، بے، بیابان

► STCD
بی، بے، بیابان، بی بی

► Middle “yay” predicament: ے or ی
بیکار = بے کار
ٹیلیوژن = ٹیلی وژن

Page 17: Urdu Character Set and Collating Sequence

Role of Aerab in Sorting

► Aerab ignored in the first (primary) pass of sorting an Urdu string
بہار (= بِہار)
بَہانہ
بہاءی (= بِہاءی)

► However, aerab are relevant in the second pass, when the first pass gives an exact match
بَن، بِن، بُن
سَن، سِن، سُن
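The two-pass rule can be expressed as a two-part sort key: a primary part built from the letters alone, and a secondary part that only matters when the primary parts are equal. This is a minimal sketch under stated assumptions: the letter order is the “typical” alphabet from the earlier slide, only zabar, zer and pesh are handled, and their relative weights (zabar < zer < pesh) are chosen to match the example above rather than taken from any standard.

```python
# Letter order taken from the "typical" alphabet slide; treat it as an assumption.
LETTERS = "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنوہءیے"

ZABAR, ZER, PESH = "\u064e", "\u0650", "\u064f"
# Assumed secondary weights: 0 means "no aerab"; then zabar < zer < pesh.
AERAB_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3}


def sort_key(word):
    """Return (primary, secondary): letters decide first, aerab only break ties."""
    primary, secondary = [], []
    for ch in word:
        if ch in AERAB_WEIGHT:
            if secondary:  # an aerab attaches to the letter before it
                secondary[-1] = AERAB_WEIGHT[ch]
        else:
            # Characters outside LETTERS (including other aerab) sort last.
            rank = LETTERS.index(ch) if ch in LETTERS else len(LETTERS)
            primary.append(rank)
            secondary.append(0)
    return (primary, secondary)


words = ["ب" + PESH + "ن", "س" + ZER + "ن", "ب" + ZABAR + "ن",
         "س" + PESH + "ن", "ب" + ZER + "ن", "س" + ZABAR + "ن"]
for w in sorted(words, key=sort_key):
    print(w)
```

Because Python compares tuples element by element, the secondary list is consulted only when two primary lists are identical, which is the behaviour the slide describes.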

Page 18: Urdu Character Set and Collating Sequence

Vocalic Aerab - Zabar, Zer, Pesh

► FT, FLJ, JUL
بَن، بِن، بُن
بَیر، بِیر، بیر

► STCD
بَن، بُن، بِن
سَن، سِن، سُن
بِیر، بیر

Page 19: Urdu Character Set and Collating Sequence

Vocalic Aerab – Khari Zabar

► No effect at primary level sorting
اعلا، اعلان، اعلم، اعلی
مَوسی، مُوسی

► No minimal pairs found, so involvement at the secondary level could not be determined

Page 20: Urdu Character Set and Collating Sequence

Consonantal Aerab - Hamza

► Ignored at primary level
► Minimal pairs not found to determine secondary level effect

مرا، مرٲت، مراتب، مرام، مرآت
باوا، باٶٹا، باون

Page 21: Urdu Character Set and Collating Sequence

Consonantal Aerab - Tashdeed

► Ignored at primary level
► Affects secondary level sorting
   “heavier than null”
► Interacts with vocalic aerab

بَرانا، برّانا، بَرایا
بدی، بّدی، بّدیا
بدو، بّدُو، بّدیا

All examples from FT

Page 22: Urdu Character Set and Collating Sequence

Ligature-Break (Half Space)

► Ignored at primary level and secondary level
ٹیلی وژن، ٹیلیوژن
ٹیلی فون، ٹیلیفون
بے کار، بیکار

► But given each pair, which word first?
   Tertiary level decision

Page 23: Urdu Character Set and Collating Sequence

Word-Break (Normal Space)

► Ignored at primary level?
► American Heritage Dictionary (2nd Collegiate ed.)
   black art
   black bear
   blackberry
   black box
   blacken
   Black Death
   black gold
► Space ignored at primary level

Page 24: Urdu Character Set and Collating Sequence

Word-Break (Normal Space) - II

► FLJ
1. بانگ
2. بانگِ درا
3. بانگ دینا

If sorting is done at word break then 1,3,2
So sorting ignores word break
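The argument on this slide can be checked with two candidate sort keys: one that compares entries word by word, and one that drops the space before comparing. This is a minimal sketch; the letter order, the aerab weights and all function names are assumptions made for illustration.

```python
# Letter order assumed from the "typical" alphabet slide.
LETTERS = "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنوہءیے"
ZER = "\u0650"
AERAB_WEIGHT = {"\u064e": 1, ZER: 2, "\u064f": 3}  # zabar, zer, pesh (assumed weights)


def key_of(text):
    """(primary, secondary) key for a run of letters and aerab without spaces."""
    primary, secondary = [], []
    for ch in text:
        if ch in AERAB_WEIGHT:
            if secondary:
                secondary[-1] = AERAB_WEIGHT[ch]
        else:
            primary.append(LETTERS.index(ch) if ch in LETTERS else len(LETTERS))
            secondary.append(0)
    return (primary, secondary)


def key_word_by_word(entry):
    # Each word is fully compared (letters, then aerab) before the next word is looked at.
    return [key_of(word) for word in entry.split()]


def key_ignoring_word_break(entry):
    # The space is dropped, so the whole entry is compared as one string.
    return key_of(entry.replace(" ", ""))


entries = ["بانگ", "بانگ" + ZER + " درا", "بانگ دینا"]  # entries 1, 2, 3 on the slide
by_word = sorted(entries, key=key_word_by_word)
ignoring_break = sorted(entries, key=key_ignoring_word_break)
print([entries.index(e) + 1 for e in by_word])
print([entries.index(e) + 1 for e in ignoring_break])
```

Word by word, the zer on the first word of entry 2 makes it heavier than the bare بانگ of entry 3, which gives 1, 3, 2. With the space dropped, the letters of the second word decide first, which gives the dictionary's 1, 2, 3.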

Page 25: Urdu Character Set and Collating Sequence

Conclusions: Urdu Collating Sequence

► Multi-level complex problem
► Pre-processing
   Contractions (ب ھ → بھ)
► Primary Level
   characters
► Secondary Level
   Vocalic aerab
   Consonantal aerab
   Interaction of vocalic and consonantal aerab
   Others (?)
► Tertiary Level
   Ligature break
   Others (?)

Page 26: Urdu Character Set and Collating Sequence

What Needs to be Done: Urdu

► If required, revisit and revise the Urdu character set
► Extensive work on sorting has been done at the linguistic level by NLA and UDB. Need to
   Standardize it
   Publicize it
► Need to develop it at the computational level: build a Collation Element Table to generate sort keys
   Standardize it
   Publicize it
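To make the last point concrete, here is a toy Collation Element Table and sort-key generator, loosely modelled on the layout used by the Unicode Collation Algorithm. Everything in it is an assumption made for illustration: the letter order, the numeric weights, the use of U+200C (zero width non-joiner) as the ligature break, and the decision to sort the joined spelling before the broken one. It is not the table proposed by NLA or UDB.

```python
# Assumed letter order ("typical" alphabet slide); a real table would be standardized.
LETTERS = "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنوہءیے"
COMMON = 1  # weight shared by everything that is unmarked at a level

# Toy table: character -> (primary, secondary, tertiary) weights.
# A weight of 0 means the character is ignored at that level.
CET = {ch: (i + 1, COMMON, COMMON) for i, ch in enumerate(LETTERS)}
CET.update({
    "\u064e": (0, 2, 0),  # zabar: secondary difference only
    "\u0650": (0, 3, 0),  # zer
    "\u064f": (0, 4, 0),  # pesh
    "\u200c": (0, 0, 2),  # ligature break (assumed ZWNJ): tertiary difference only
    " ": (0, 0, 0),       # word break: ignored at every level
})
UNKNOWN = (len(LETTERS) + 1, COMMON, COMMON)  # anything not in the table sorts last


def sort_key(text):
    """Concatenate primary, secondary and tertiary weights, separated by 0."""
    levels = ([], [], [])
    for ch in text:
        for level, weight in zip(levels, CET.get(ch, UNKNOWN)):
            if weight:
                level.append(weight)
    primary, secondary, tertiary = levels
    return primary + [0] + secondary + [0] + tertiary


words = ["ب\u064fن", "سن", "بن", "ب\u064eن", "ٹیلی\u200cفون", "ٹیلیفون"]
for w in sorted(words, key=sort_key):
    print(w, sort_key(w))
```

Since every real weight is at least 1 and the separator is 0, plain list comparison of the generated keys gives the multi-level behaviour described in the conclusions: letters decide first, aerab second, ligature breaks last.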

Page 27: Urdu Character Set and Collating Sequence

What Needs to be Done: Other Languages of Pakistan

► Need to work towards standardization of
   Character set
   Collating sequence
► Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization
► Need to develop Collation Element Tables for these languages for sorting

Page 28: Urdu Character Set and Collating Sequence

Thank you

Questions?