Date post: | 12-May-2015 |
Category: |
Technology |
Upload: | sidhartha-sahoo |
3/11/2008 UNICODE
Contents
• Introduction
• Problem of character encoding
• What UNICODE is Exactly
• From ASCII to Unicode
• The Logic of Unicode
• Multiple Forms
• Benefits for India
• Unicode Consortium

Introduction
Computers at their most basic level deal only with numbers. They store letters, numerals and other characters by assigning a number to each one. In the pre-Unicode environment we had single-byte (8-bit) character sets, which limited us to at most 256 characters. No single encoding could contain enough characters to cover all languages, so hundreds of different encoding systems were developed for assigning numbers to characters.

As a result, these encoding systems conflict with one another: two encodings can use the same number for two different characters, or different numbers for the same character. Any given computer needs to support many different encodings, yet whenever data is passed between different encodings or platforms, it runs the risk of corruption.
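
The conflict between encodings can be seen with a single byte. The sketch below is only an illustration: the byte value 0xE4 and the three legacy code pages are chosen here as examples and do not come from the slides.

```python
# One byte value read under three legacy 8-bit encodings.
# The byte 0xE4 and the code pages are illustrative choices.
RAW = bytes([0xE4])

def decode_in(encoding: str) -> str:
    """Interpret RAW under the given legacy encoding."""
    return RAW.decode(encoding)

readings = {name: decode_in(name) for name in ("latin_1", "cp1251", "iso8859_7")}

for name, char in readings.items():
    # The same number maps to a different character in each code page.
    print(f"0xE4 in {name:10} -> {char!r} (U+{ord(char):04X})")
```

A file written in one of these code pages and opened in another shows the wrong letters, which is the corruption risk described above.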

What is text and what is a character?
• Code pages and encodings describe how text is handled and how it is stored in data structures.
• Inside a computer program or data file, text is stored as a sequence of numbers.
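
As a small sketch of "text is a sequence of numbers", Python can show the number behind each character and rebuild the text from those numbers. The sample string is an arbitrary choice for illustration.

```python
# Text is stored as numbers: list the number behind each character,
# then rebuild the same text from the numbers alone.
text = "Hi, Ä!"  # arbitrary sample string

code_points = [ord(ch) for ch in text]          # character -> number
rebuilt = "".join(chr(n) for n in code_points)  # number -> character

print(code_points)
print(rebuilt == text)
```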

A character is a:
• Letter,
• Digit,
• Hyphen,
• Punctuation mark or
• Math symbol
Furthermore, there are control characters, which are typically not visible.

Problem of character encoding
• Which number is assigned to which character?
• When an ‘A’ is typed on the keyboard, the computer uses its character code as the basis for pulling the shape of ‘A’ from the font file entry with the same binary number, and displays or prints it.
• The character ‘A’ may have different integer values in different programs or data files, and the code for ‘A’ may select an entirely different glyph in, for instance, an Arabic font file.
• In some encodings no number is available for certain characters (for example Ä).
• All data is encoded in the form of binary numerical codes.
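
The two problems named above (different numbers for the same character, and no number at all) can be sketched by trying to encode one character in several encodings. The helper `try_encode` and the list of encodings are illustrative assumptions, not part of the original slides.

```python
CHAR = "Ä"

def try_encode(encoding):
    """Return the bytes for CHAR, or None if the encoding has no number for it."""
    try:
        return CHAR.encode(encoding)
    except UnicodeEncodeError:
        return None

# Hypothetical selection of encodings, chosen only to show the contrast.
results = {enc: try_encode(enc) for enc in ("ascii", "latin_1", "cp437", "utf-8")}

for enc, data in results.items():
    shown = "no number available" if data is None else data.hex()
    print(f"{enc:8} -> {shown}")
```

ASCII has no slot for the character, while Latin-1 and the old DOS code page 437 both have one but disagree on the number.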

Universal Character Encoding
Unique number for every character
…

What UNICODE is Exactly
The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of diverse languages.
It was originally designed as a 16-bit character encoding scheme and has since grown beyond 65,536 characters; its code space now runs from U+0000 to U+10FFFF. It allows characters from Western European, Eastern European, Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Thai, Urdu, Hindi and all other major world languages to be encoded in a single character set.

Unicode provides a unique number for every character,
• no matter what the platform,
• no matter what the program,
• no matter what the language.

From ASCII to Unicode
• Most character sets and encodings of the 1970s and 1980s were modifications or extensions of ASCII.
• Many of them were 8-bit encodings that kept a subset of the 94 printable ASCII characters.
• The most common of these encodings use a single byte per character (SBCS).
• They are all limited to 256 characters.
• Because of that, none of them can cover the letters of all the European languages at once.

The Logic of Unicode
Basic design features:
• Each character has a unique code and a unique name.
• Character classes (graphic, control, combining, punctuation) are established to govern behavior in systems.
• Expansion space is reserved for additional characters.
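
These design features can be inspected with Python's standard `unicodedata` module, which exposes the code, name and general category of each character. The sample characters and the helper `describe` are illustrative choices.

```python
import unicodedata

# Illustrative samples: letter, digit, hyphen, math symbol,
# control character (newline) and a combining accent.
samples = ["A", "5", "-", "+", "\n", "\u0301"]

def describe(ch):
    """Return (code point, name, general category) for one character."""
    # Control characters have no name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X}", name, unicodedata.category(ch)

table = [describe(ch) for ch in samples]

for code, name, category in table:
    print(f"{code}  {category}  {name}")
```

The two-letter category (for example `Lu` for an uppercase letter, `Cc` for a control character, `Mn` for a combining mark) is what lets software treat whole classes of characters uniformly.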

Where is Unicode used?
• The Unicode Standard has been adopted by many software and hardware vendors.
• Most operating systems support Unicode.
• Unicode is required for international document and data interchange, the Internet and the WWW, and therefore by modern standards and technologies such as:
  • Java, C#, Perl, Python
  • Markup languages such as XML, HTML and XHTML
  • JavaScript, LDAP, CORBA etc.

Multiple Forms
• UTF-8: maximal compatibility with 8-bit systems
• UTF-16: good storage, interoperability with Windows/Java
• UTF-32: simplest processing
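
A quick sketch of the three forms side by side: the same text is encoded in each and the byte counts are compared. The sample string is arbitrary, and the little-endian codec variants are used only so that no byte order mark is added to the counts.

```python
# The same three characters in the three Unicode encoding forms.
text = "Aé中"  # arbitrary sample: ASCII letter, accented letter, CJK ideograph

# The "-le" variants are used so that no byte order mark inflates the sizes.
FORMS = ("utf-8", "utf-16-le", "utf-32-le")

encoded = {form: text.encode(form) for form in FORMS}
sizes = {form: len(data) for form, data in encoded.items()}

for form in FORMS:
    print(f"{form:10} {sizes[form]:2d} bytes  {encoded[form].hex(' ')}")

# All three forms carry the same characters and decode to the same text.
round_trips = all(encoded[form].decode(form) == text for form in FORMS)
```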

UTF-8
• UTF-8 is the 8-bit encoding form of Unicode.
• It is a variable-width encoding and a strict superset of ASCII.
• “Strict superset” means that every ASCII character is available in UTF-8 with the same code point value, stored as the same single byte.
• 1 character = 1 to 4 bytes in the encoding.
• Characters from European scripts: either 1 or 2 bytes.
• Characters from Asian scripts: mostly 3 bytes, with 4 bytes for supplementary characters.
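
The variable width can be made concrete with a hand-written encoder. This is a minimal sketch of the UTF-8 bit layout, not production code: the function name `utf8_encode` is hypothetical, and it assumes a valid code point (it does not reject surrogates or values above U+10FFFF). Python's built-in codec is used as the reference.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (sketch; assumes a valid scalar value)."""
    if code_point < 0x80:            # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (supplementary characters)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Illustrative samples: ASCII, Latin with accent, Devanagari, CJK, emoji.
samples = ["A", "é", "अ", "中", "\U0001F600"]
lengths = [len(utf8_encode(ord(ch))) for ch in samples]
matches_builtin = all(utf8_encode(ord(ch)) == ch.encode("utf-8") for ch in samples)
print(lengths, matches_builtin)
```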

• UTF-8 is used on UNIX platforms, in HTML and by most Internet browsers.
Main benefits of UTF-8:
• Compact storage requirements for European scripts: in general, European scripts occupy less storage on disk and in memory.
• Ease of migration: since 7-bit ASCII data remains the same in UTF-8, the data conversion effort between ASCII-based character sets and UTF-8 is reduced significantly.
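
The migration benefit can be checked directly: bytes that are valid 7-bit ASCII are already valid UTF-8 and need no conversion. The sample strings below are arbitrary illustrations.

```python
# 7-bit ASCII data is byte-for-byte identical in UTF-8.
ascii_bytes = "Plain ASCII text, 7-bit only.".encode("ascii")
same_as_utf8 = ascii_bytes.decode("utf-8").encode("utf-8") == ascii_bytes

# In UTF-8, every byte of a non-ASCII character has its high bit set,
# so it can never be mistaken for an ASCII character.
accented = "café".encode("utf-8")
ascii_part = bytes(b for b in accented if b < 0x80)
non_ascii_part = bytes(b for b in accented if b >= 0x80)

print(same_as_utf8, ascii_part, non_ascii_part.hex())
```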

UTF-16
• UTF-16 is the 16-bit encoding form of Unicode.
• It is basically an extension of UCS-2.
• One Unicode character can take 2 or 4 bytes in the encoding.
• Characters from European and most Asian scripts are represented in 2 bytes.
• Supplementary characters are represented in 4 bytes.
• UTF-16 has been the main Unicode encoding on Windows since Windows 2000.
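
The 4-byte case works by splitting a supplementary character into two 16-bit units, known as a surrogate pair. The sketch below computes the units by hand and compares them with Python's built-in codec; the function names are hypothetical and valid code points are assumed.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units for one code point (sketch, no validation)."""
    if code_point < 0x10000:
        return [code_point]                 # one unit = 2 bytes
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]                      # two units = 4 bytes

def builtin_units(ch: str) -> list:
    """Reference result taken from Python's own UTF-16 codec."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for ch in ["A", "中", "\U0001F600"]:
    units = utf16_units(ord(ch))
    print(f"U+{ord(ch):04X} -> {[f'{u:04X}' for u in units]}")
```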

Main benefits of UTF-16:
• More compact storage for Asian scripts (2 bytes for commonly used characters).
• Ideal if European and Asian scripts are used together.
• For Asian text, UTF-16 occupies less storage on disk and in memory than UTF-8 (which needs 3 bytes per character for the Asian part).
• A balance of efficient access to characters and economical use of storage.
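
The storage trade-off can be measured on short samples. The three words below are arbitrary illustrations (not taken from the slides), and the little-endian codec is used so that no byte order mark is counted.

```python
# Compare storage of the same text in UTF-8 and UTF-16.
samples = {
    "English": "Unicode",
    "Hindi": "यूनिकोड",
    "Chinese": "统一码",
}

def sizes(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for a string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

report = {label: sizes(text) for label, text in samples.items()}

for label, (chars, utf8, utf16) in report.items():
    print(f"{label:8} {chars} code points  UTF-8: {utf8:2d} bytes  UTF-16: {utf16:2d} bytes")
```

ASCII-only text is half the size in UTF-8, while the Devanagari and Chinese samples are one third smaller in UTF-16.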

UTF-32
• 32-bit encoding
• Popular when memory space is no concern
• Fixed width (4 bytes)
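
Fixed width is what makes processing simple: the n-th character always starts at byte offset 4 × n. A small sketch, with an arbitrary sample string and a hypothetical helper `char_at`:

```python
import struct

text = "Aé中\U0001F600"            # arbitrary sample, including one emoji
data = text.encode("utf-32-le")    # little-endian, no byte order mark

def char_at(buf: bytes, index: int) -> str:
    """Read the character at position `index` straight from the byte buffer."""
    (code_point,) = struct.unpack_from("<I", buf, 4 * index)
    return chr(code_point)

print(len(data), [char_at(data, i) for i in range(len(data) // 4)])
```

The price is space: every character, including plain ASCII, costs 4 bytes.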

Indic Support in Unicode
• ISCII is the basis for the characters and their allocation.
• The Consortium is actively engaged with the Indian Government, which is a member.
• The Consortium welcomes additions of missing characters (e.g. Vedic) and clarifications or corrections of usage.

Additional Characters
The Indian Government is developing proposals for:
• Additions of missing characters:
  • Vedic
  • Individual characters for certain scripts
• Annotations and descriptions

Global Applications now support languages of India
• Companies supporting Indic with Unicode
• OpenType fonts
• Font support for Indic:
  • Microsoft Windows
  • Java (IBM contributed ICU Indic Layout)
  • Linux
  • …

Benefits for India
• All documents, anywhere in the world, can have Indic text.
• Allows seamless multilingual documents in India, including scriptures and minority languages.
• Opens up the software export market, beyond English.
• Connects India to the world.

How India Can Contribute
• Effective communication with the Unicode Consortium
• Provide resources for development:
  • Descriptions of usage
  • Descriptions of character shaping
  • Transliteration tables from script to script
  • Collation information
  • OpenType fonts
  • …

Unicode @ the Library
» Display all scripts and characters
» Record data in all languages
» Exchange bibliographic data
» Search in all languages …

Misconceptions
• Unicode has everything. Not true! Archaic scripts are not yet fully covered. Indic scripts are not yet fully covered.
• Unicode is a font. Definitely not true! It's a standard for encoding, not displaying.

Unicode Consortium
The Unicode Consortium is a non-profit organization originally founded to develop, extend and promote use of the Unicode Standard.
Members of the Consortium include major computer corporations, software producers, database vendors, research institutions, international agencies, various user groups, and interested individuals.

There are three consortium committees:
• Unicode Technical Committee: responsible for the creation, maintenance, and quality of the Unicode Standard.
• CLDR Technical Committee: responsible for the Common Locale Data Repository and related software localization standards and documents.
• Editorial Committee: responsible for editing the Consortium's publications and web pages.

International Domain Names
Approved, Unicode-based. Examples:
• http://Юникод.com
• http://Βαλκανίων.com
• http://हमसब.com
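
On the wire, such names are carried in an ASCII-compatible form with labels that start with `xn--`. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; that choice, and the helper names `to_ascii` and `to_unicode`, are assumptions made for illustration rather than anything the slides specify.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible form."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(ascii_hostname: str) -> str:
    """Convert an ASCII-compatible host name back to Unicode."""
    return ascii_hostname.encode("ascii").decode("idna")

# Two of the host names listed above.
hosts = ["Юникод.com", "हमसब.com"]
converted = {host: to_ascii(host) for host in hosts}

for host, ascii_form in converted.items():
    # Conversion lowercases the name, as domain names are case-insensitive.
    print(f"{host} -> {ascii_form} -> {to_unicode(ascii_form)}")
```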