Unicode

3/11/2008 UNICODE

Sidhartha Sahoo
[email protected] (ncsi.iisc.ernet.in)

Contents

• Introduction
• Problem of character encoding
• What UNICODE is Exactly
• From ASCII to Unicode
• The Logic of Unicode
• Multiple Forms
• Benefits for India
• Unicode Consortium

Introduction

Computers at their most basic level deal only with numbers. They store letters, numerals and other characters by assigning a number to each one.

In the pre-Unicode environment, we had single-byte (8-bit) character sets, which limited us to a maximum of 256 characters. No single encoding could contain enough characters to cover all languages, so hundreds of different encoding systems were developed for assigning numbers to characters.

As a result, these encoding systems conflict with one another: two encodings can use the same number for two different characters, or different numbers for the same character. Any given computer needs to support many different encodings, yet whenever data is passed between different encodings or platforms, it runs the risk of corruption.
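The conflict described above can be seen with a few legacy encodings that ship with Python. This is only an illustrative sketch: the byte value 0xE4 and the three encodings are chosen here for the example and do not come from the slides.

```python
# One byte value, interpreted under three legacy single-byte encodings.
raw = b"\xe4"

as_latin1 = raw.decode("latin-1")  # Western European (ISO 8859-1)
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic
as_koi8r = raw.decode("koi8_r")    # another Cyrillic encoding

# Same number, three different characters.
print(as_latin1, as_cp1251, as_koi8r)

# The reverse conflict: one character, two different numbers.
de = "\u0434"  # CYRILLIC SMALL LETTER DE
print(de.encode("cp1251"), de.encode("koi8_r"))
```

If a file written in one of these encodings is read with another, every such byte silently turns into the wrong character, which is the corruption risk mentioned above.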

What is Text & Character?

• Code pages and encodings describe how text is handled and how it is stored in data structures.
• Inside a computer program or data file, text is stored as a sequence of numbers.
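A minimal sketch of "text as a sequence of numbers", using an arbitrary sample string chosen for this example:

```python
text = "Hi!"

# Each character corresponds to one number (its code).
code_points = [ord(ch) for ch in text]
print(code_points)

# Encoded for storage, the same numbers appear as byte values.
stored = text.encode("ascii")
print(list(stored))

# Turning the numbers back into characters restores the text.
restored = "".join(chr(n) for n in code_points)
print(restored)
```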

A character is a:
• Letter,
• Digit,
• Hyphen,
• Punctuation or
• Math symbol

Furthermore, there are control characters, which are typically not visible.
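As a rough illustration, Python's standard `unicodedata` module reports the general category that Unicode assigns to a character. The sample characters below are assumptions picked to match the kinds listed above.

```python
import unicodedata

# One sample per kind of character named in the list above.
samples = {
    "letter": "A",
    "digit": "7",
    "hyphen": "-",
    "punctuation": ",",
    "math symbol": "+",
    "control": "\n",
}

# Two-letter general category codes, e.g. "Lu" = Letter, uppercase.
categories = {kind: unicodedata.category(ch) for kind, ch in samples.items()}

for kind, ch in samples.items():
    print(f"{kind:12} {ch!r:6} {categories[kind]}")
```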

Problem of character encoding

• Which number is assigned to which character?
• When an ‘A’ is typed on the keyboard, the computer uses its character code as the basis for pulling the character shape of ‘A’ from the font file entry with the same binary number, and displays or prints it.
• The character ‘A’ may also have different integer values in different programs or data files (‘A’ might be ‘•’ in an Arabic font file).
• In some instances no number is available for certain characters (“&” → Ä).
• All data is encoded in the form of binary numerical codes.

Universal Character Encoding

Unique number for every character

…

What UNICODE is Exactly

• The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the world's diverse languages.
• It was originally designed as a 16-bit character encoding scheme and now defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), allowing characters from Western European, Eastern European, Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Thai, Urdu, Hindi and all other major world languages to be encoded in a single character set.

Unicode provides a unique number for every character,

• no matter what the platform,
• no matter what the program,
• no matter what the language.
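A small sketch of what "a unique number for every character" looks like in practice. The four sample characters are assumptions chosen to cover different scripts.

```python
# Characters from four scripts: Latin, Cyrillic, Devanagari, Han.
samples = ["A", "Ю", "ह", "中"]

# ord() returns the Unicode code point, independent of platform or program.
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{ch}  U+{cp:04X}")
```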

From ASCII to Unicode

• Most character sets and encodings in the 70s/80s were modifications or extensions of ASCII.
• Many of them used 8 bits and kept all or a subset of the 94 printable ASCII characters.
• The most common of these legacy encodings use a single byte per character (SBCS).
• They are all limited to 256 characters.
• Because of that, none of them can cover the letters of all European languages at once; even ISO 8859-1 (Latin-1) lacks a few Western European letters such as œ.
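The 256-character limit can be probed with a short sketch. The helper name `fits` and the sample words are hypothetical, introduced only for this illustration.

```python
def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# A single-byte encoding can never hold more than 256 characters.
latin1_repertoire = bytes(range(256)).decode("latin-1")
print(len(latin1_repertoire))

print(fits("café", "latin-1"))    # é is part of Latin-1
print(fits("cœur", "latin-1"))    # œ is not
print(fits("Привет", "latin-1"))  # neither is Cyrillic
print(fits("cœur Привет", "utf-8"))
```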

The Logic of Unicode

Basic Design Features
• Each character has a unique code and a unique name.
• Character classes are established (graphic, control, combining, punctuation) that govern behavior in systems.
• Expansion space is reserved for additional characters.
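The code/name pairing and the combining property can be inspected with Python's standard `unicodedata` module. This is a sketch; the three sample characters are assumptions chosen for the example.

```python
import unicodedata

latin_a = "A"
devanagari_a = "\u0905"
acute = "\u0301"  # an accent that attaches to the preceding character

# Every assigned graphic character has a unique name ...
names = [unicodedata.name(ch) for ch in (latin_a, devanagari_a, acute)]
print(names)

# ... and the name maps back to exactly one character.
looked_up = unicodedata.lookup("DEVANAGARI LETTER A")
print(f"U+{ord(looked_up):04X}")

# A non-zero combining class marks a combining character.
print(unicodedata.combining(latin_a), unicodedata.combining(acute))
```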

Where is Unicode used?

• The Unicode Standard has been adopted by many software and hardware vendors.
• Most OSs support Unicode.
• Unicode is required for international document and data interchange, the Internet and the WWW, and therefore by modern standards such as:
  • Java, C#, Perl, Python
  • Markup languages such as XML, HTML, XHTML
  • JavaScript, LDAP, CORBA etc.

Multiple Forms

• UTF-8: maximal compatibility with 8-bit systems
• UTF-16: good storage, interoperability with Windows/Java
• UTF-32: simplest processing

UTF-8

• UTF-8 is the 8-bit encoding of Unicode.
• It is a variable-width encoding and also a strict superset of ASCII.
• “Strict superset” means that every character in ASCII is available in UTF-8 with the same corresponding code point value.
• 1 character = 1 to 4 bytes in the encoding.
• Characters from European scripts: either 1 or 2 bytes.
• Characters from Indic and East Asian scripts: 3 bytes for most characters, 4 bytes for supplementary characters.
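The variable width and the ASCII compatibility can be checked with a short sketch; the sample characters are assumptions picked to hit each byte length.

```python
# One sample per UTF-8 width: ASCII, Latin with accent, Devanagari, emoji.
samples = ["A", "\u00e9", "\u0939", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)

# Strict superset of ASCII: pure ASCII text yields identical bytes.
ascii_text = "Unicode"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```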

• UTF-8 is used on UNIX platforms, in HTML and by most Internet browsers.

Main benefits of UTF-8:
• Compact storage requirements for European scripts: in general, European scripts will occupy less storage on disk and in memory.
• Ease of migration: since 7-bit ASCII data remains the same in UTF-8, the data conversion effort between ASCII-based character sets and UTF-8 is reduced significantly.

UTF-16

• UTF-16 is the 16-bit encoding of Unicode.
• Basically an extension of UCS-2.
• One Unicode character can be 2 or 4 bytes in the encoding.
• Characters from European and most Asian scripts are represented in 2 bytes.
• Supplementary characters are represented in 4 bytes.
• UTF-16 has been the main Unicode encoding on Windows since Windows 2000.
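How a supplementary character becomes 4 bytes can be sketched by computing its surrogate pair by hand. The function name `utf16_units` is hypothetical, and the sketch assumes its input is a single valid character that is not itself a surrogate.

```python
def utf16_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Characters up to U+FFFF fit in one 16-bit unit (2 bytes).
        return [cp]
    # Supplementary characters are split into a high and a low surrogate.
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

devanagari_ha = "\u0939"
emoji = "\U0001F600"

print([f"{u:04X}" for u in utf16_units(devanagari_ha)])
print([f"{u:04X}" for u in utf16_units(emoji)])

# Cross-check against Python's own big-endian UTF-16 encoder.
print(emoji.encode("utf-16-be").hex())
```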

Main benefits of UTF-16:
• More compact storage requirements for Asian scripts (2 bytes for commonly used characters).
• Ideal if European and Asian scripts are used together.
• Asian text occupies less storage on disk and in memory in UTF-16 than in UTF-8 (which needs 3 bytes per character for the Asian part).
• Balance of efficient access to characters and economical use of storage.
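The storage trade-off can be measured directly. In this sketch the helper `sizes` and the two sample words are assumptions made for illustration; `utf-16-le` is used so that no byte order mark is counted.

```python
def sizes(text):
    """Byte counts of the same text in UTF-8 and UTF-16."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}

# "namaste" in Devanagari: six code points.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"
english = "hello"

hindi_sizes = sizes(hindi)
english_sizes = sizes(english)

print(hindi, hindi_sizes)      # UTF-16 is smaller for Devanagari
print(english, english_sizes)  # UTF-8 is smaller for ASCII text
```

Which form is more compact therefore depends on the mix of scripts in the data.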

UTF-32

• 32-bit encoding
• Popular when memory space is no concern
• Fixed width (4 bytes)
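Fixed width is what makes processing simple: the n-th character always starts at byte 4·n. A minimal sketch, with a hypothetical helper `nth_char` and an arbitrary sample string:

```python
# Three characters that need 1, 3 and 4 bytes in UTF-8.
text = "A\u0939\U0001F600"

# Big-endian UTF-32 without a byte order mark: always 4 bytes each.
data = text.encode("utf-32-be")
print(len(data))

def nth_char(buf, n):
    """Read the n-th character straight from its fixed 4-byte slot."""
    return chr(int.from_bytes(buf[4 * n : 4 * n + 4], "big"))

print(nth_char(data, 2) == text[2])
```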

Indic Support in Unicode

• ISCII is the basis for characters and allocation.
• The Consortium is actively engaged with the Indian Government, which is a member.
• It welcomes the addition of missing characters (e.g. Vedic) and clarifications or corrections of usage.

Additional Characters

The Indian Government is developing proposals for:
• Additions of missing characters:
  • Vedic
  • Individual characters for certain scripts
• Annotations and Descriptions

Global Applications now support languages of India

• Companies supporting Indic with Unicode
• OpenType fonts
• Font support for Indic
  • Microsoft Windows
  • Java (IBM contributed ICU Indic Layout)
  • Linux
  • …

Benefits for India

• All documents, anywhere in the world, can have Indic text
• Allows seamless multilingual documents in India
  • including scriptures and minority languages
• Opens up software export market, beyond English
• Connects India to the world

How India Can Contribute

• Effective Communication with the Unicode Consortium
• Provide Resources for Development
  • Descriptions of Usage
  • Descriptions of Character Shaping
  • Transliteration Tables from Script to Script
  • Collation Information
  • OpenType fonts
  • …

Unicode @ the Library

» Display all scripts and characters
» Record data in all languages
» Exchange bibliographic data
» Search in all languages …

Misconceptions

• Unicode has everything. Not true!
  • Archaic scripts are not yet fully covered.
  • Indic scripts are not yet fully covered.
• Unicode is a font. Definitely not true!
  • It's a standard for encoding, not displaying.

Unicode Consortium

• The Unicode Consortium is a non-profit organization originally founded to develop, extend and promote use of the Unicode Standard.
• Members of the Consortium include major computer corporations, software producers, database vendors, research institutions, international agencies, various user groups, and interested individuals.

There are three consortium committees:
• Unicode Technical Committee: responsible for the creation, maintenance, and quality of the Unicode Standard.
• CLDR Technical Committee: responsible for the Common Locale Data Repository and related software localization standards and documents.
• Editorial Committee: responsible for editing the Consortium's publications and web pages.

International Domain Names

Approved - Unicode-Based

Examples:
• http://Юникод.com
• http://Βαλκανίων.com
• http://हमसब.com
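On the wire, such names travel as plain ASCII labels. The sketch below uses Python's built-in `idna` codec, which follows the older IDNA 2003 rules; the helper names `to_ascii` and `to_unicode` are hypothetical, and whether a given registry accepts these exact names is outside what the code shows.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna")

def to_unicode(ascii_domain):
    """Convert an ASCII-compatible domain name back to Unicode."""
    return ascii_domain.decode("idna")

hindi_domain = "हमसब.com"
wire_form = to_ascii(hindi_domain)

# Non-ASCII labels are carried with an "xn--" prefix.
print(wire_form)
print(to_unicode(wire_form))

# Cyrillic example from the list above; the codec may normalize letter case.
cyrillic_round_trip = to_unicode(to_ascii("Юникод.com"))
print(cyrillic_round_trip)
```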

Date post:	12-May-2015
Category:	Technology
Upload:	sidhartha-sahoo