Supplementary Character Support in Microsoft Products
Michael S. Kaplan
Software Design Engineer
Microsoft
12 September 2002 San Jose, California (IUC22)
What are supplementary characters?
"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"
High/low surrogate?
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology:
– "surrogate pair" preferred over "surrogate character"
See http://www.trigeminal.com/16to32AndBack.asp
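The two ranges above translate directly into range checks on 16-bit code units. A minimal sketch, assuming code units are held as plain integers (the function names are illustrative, not from any Microsoft API):

```python
# Surrogate ranges as given on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is in the high-surrogate range."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is in the low-surrogate range."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """A pair is well formed only as high surrogate followed by low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Note that order matters: a low surrogate followed by a high surrogate is not a pair.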
Conversion example #1
– The first character encoded with a surrogate pair (D800, DC00) as UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate 0000000000+0000000000 = x00000
4. Add x10000
Result: U+10000. This makes sense, since the first supplementary character follows immediately after the last character in the 16-bit Unicode range (U+FFFF).
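The four steps above can be sketched as a small function. This is an illustrative helper (the name is an assumption), taking the two code units as integers:

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point (UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    high_bits = high & 0x3FF   # steps 1 and 2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    # step 3: concatenate into a 20-bit value; step 4: add x10000
    return ((high_bits << 10) | low_bits) + 0x10000


print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
```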
Conversion example #2
– You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16
1. Subtract x10000 - Result: x1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001) - Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010) - Result: 1101110000001010 (DC0A)
Your surrogate pair: D841, DC0A
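The reverse direction follows the same steps in code. A minimal sketch (the function name is illustrative), cross-checked against Python's own UTF-16 encoder:

```python
from typing import Tuple


def code_point_to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # step 1: subtract x10000
    high = 0xD800 + (offset >> 10)       # steps 2 and 3: high ten bits plus D800
    low = 0xDC00 + (offset & 0x3FF)      # steps 2 and 4: low ten bits plus DC00
    return high, low


high, low = code_point_to_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A

# Python's built-in encoder produces the same two code units (big-endian bytes).
assert chr(0x2040A).encode("utf-16-be") == bytes([0xD8, 0x41, 0xDC, 0x0A])
```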
UTF-8 conversions
Illegal conversion: six-byte UTF-8 (two surrogate code points of UTF-16, converted separately)
Legal conversion: four-byte UTF-8 (one UTF-32 code point)
CESU-8 is the inverse of the above
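To make the difference concrete, the sketch below encodes U+2040A both ways. Python has no built-in CESU-8 codec, so `encode_units_separately` is a hypothetical helper that converts each UTF-16 code unit on its own, which is the six-byte form described above:

```python
def encode_units_separately(text: str) -> bytes:
    """Encode every UTF-16 code unit as its own UTF-8 sequence (CESU-8 style)."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets the UTF-8 codec emit a lone surrogate as three bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U0002040A"
legal = ch.encode("utf-8")              # one code point, four bytes
six = encode_units_separately(ch)       # two surrogates, three bytes each

print(legal.hex())  # f0a0908a
print(six.hex())    # eda181edb08a

# A strict UTF-8 decoder rejects the six-byte form.
try:
    six.decode("utf-8")
except UnicodeDecodeError:
    print("six-byte form rejected as UTF-8")
```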
UTF-8 example
Unicode surrogate pair:
aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
becomes incorrect UTF-8 totaling 6 bytes:
1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, you should take a Unicode surrogate pair:
110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below, uuuuu is defined as wwww+1):
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
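The four-byte bit layout above can be written out with shifts and masks. A sketch only, with variable names chosen to match the letters in the pattern; the result is compared with Python's own UTF-8 encoder as a sanity check:

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Build the four UTF-8 bytes straight from a surrogate pair's bit fields."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    wwww = (high >> 6) & 0xF     # 110110wwwwzzzzyy
    zzzz = (high >> 2) & 0xF
    yy_high = high & 0x3
    yyyy_low = (low >> 6) & 0xF  # 110111yyyyxxxxxx
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1             # five-bit plane number
    return bytes([
        0xF0 | (uuuuu >> 2),                  # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,   # 10uuzzzz
        0x80 | (yy_high << 4) | yyyy_low,     # 10yyyyyy
        0x80 | xxxxxx,                        # 10xxxxxx
    ])


print(surrogate_pair_to_utf8(0xD841, 0xDC0A).hex())  # f0a0908a
assert surrogate_pair_to_utf8(0xD841, 0xDC0A) == "\U0002040A".encode("utf-8")
```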
Encoding choices for MS
UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32
REASONS:
– There was obviously an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16. A completely new API set was not required.
– A move to UTF-32 would require twice as much space for all characters.
– A move to UTF-8 would require even more than twice as much space in many cases.
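Storage cost depends heavily on the text being stored, so it is worth measuring on representative data. The sketch below is one way to do that; the sample strings are arbitrary illustrations, not data from the talk:

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Byte counts for the same text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "ASCII": "Unicode",
    "BMP CJK": "\u6f22\u5b57",
    "Supplementary": "\U0002040A",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

UTF-32 is always four bytes per code point, while the UTF-8 and UTF-16 totals vary with the script of the text.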
The products...
Mostly the new generation of products:
– Windows 2000/XP
– Office XP (some support in Office 2000)
– Visual Studio .NET
Most (all) of these products supported Unicode already
– a little bit of extra work needed for supplementary characters
– usually just UTF-8 changes were needed
Windows 2000
Uniscribe support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
No specific surrogate font/IME
Must be turned on: http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp
Windows XP
*.* from Windows 2000
Turned on by default!
GDI+ support for rendering
Font CMAP extensions
Lots of UTF-8 issues fixed
No specific surrogate font/IME (yet)
Extensions to fallback fonts [limited]:
HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1
HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2
HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3
(etc.)
Other system components
MLang
Internet Explorer
– http://i18nWithVB.com/surrogate_ime/
IIS 5.0/6.0
The downlevel story
No good support for Unicode, let alone supplementary characters
Uniscribe/RichEdit does improve the downlevel story for display purposes
Officially, no support on Win9x
The Office suite
Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0
Office - Specific Features
Insertion/Deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit
CHS/CHT/CHP Office
The product and the langpacks support an extended Unicode IME that handles supplementary characters
An Extension B font is also included
Visual Studio[.NET]
String class and globalization namespace
StringInfo
GetTextElementEnumerator
– Handles supplementary characters
– Also handles composite characters
GDI+
IDE support
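The idea behind enumerating text elements can be illustrated outside .NET. The Python sketch below is an approximation, not the StringInfo algorithm: it walks UTF-16 code units, joins each surrogate pair into one character, and attaches following combining marks (as reported by `unicodedata.combining`) to the preceding base character. The function name `text_elements` is an assumption for illustration.

```python
import unicodedata
from typing import Iterator, List


def text_elements(units: List[int]) -> Iterator[str]:
    """Yield user-visible elements from a sequence of UTF-16 code units."""
    i = 0
    current = ""
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Two code units, one supplementary character.
            ch = chr(0x10000 + ((unit & 0x3FF) << 10) + (units[i + 1] & 0x3FF))
            i += 2
        else:
            ch = chr(unit)
            i += 1
        if current and unicodedata.combining(ch):
            current += ch          # combining mark joins the element in progress
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


# "a" + combining acute accent, U+2040A as a surrogate pair, then "b"
units = [0x0061, 0x0301, 0xD841, 0xDC0A, 0x0062]
print(len(units), "code units,", len(list(text_elements(units))), "text elements")
```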
SQL Server
Past - no support (for Unicode, even!)
Present - surrogate "safe" (neutral)
Future - surrogate "aware"
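The slide does not define "safe" and "aware". One common reading, assumed here, is that a surrogate-safe system stores and returns pairs intact but still counts each pair as two characters, while a surrogate-aware system counts and indexes by whole characters. The sketch below only illustrates that difference in counting; it does not model SQL Server itself.

```python
def length_in_code_units(text: str) -> int:
    """Length as a surrogate-neutral system sees it: 16-bit code units."""
    return len(text.encode("utf-16-le")) // 2


def length_in_code_points(text: str) -> int:
    """Length as a surrogate-aware system would report it: whole characters."""
    return len(text)  # Python strings are sequences of code points


value = "a\U0002040A"
print(length_in_code_units(value))   # 3
print(length_in_code_points(value))  # 2

# "Safe": the data survives a round trip through UTF-16 storage unchanged.
assert value.encode("utf-16-le").decode("utf-16-le") == value
```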
Items not [currently] supported
Character Map
Graph 10
Outlook 10 mail headers
Fonts/IMEs
"Collations" for supplementary characters
Collation plan for supplementary characters in the UCA?
All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.
All Plane 3-14 (currently not assigned) will be treated like any other unassigned characters.
Plane 14 language tags will be treated as if they were unassigned.
All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.
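The ordering in this plan can be pictured as coarse buckets keyed on the plane of each character. The sketch below is a rough illustration, not UCA weights, and rests on several assumptions: BMP ideographs are approximated by the ranges U+3400-U+4DBF and U+4E00-U+9FFF, unassigned planes are placed after Plane 2 and before private use, and ordering inside a bucket simply falls back to code point value.

```python
from typing import Iterable, List


def collation_bucket(ch: str) -> int:
    """Coarse sort bucket for one character, following the plan above."""
    cp = ord(ch)
    plane = cp >> 16
    if plane == 0:
        # Simplified test for BMP ideographs (assumption, see lead-in).
        is_ideograph = 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF
        return 2 if is_ideograph else 0
    if plane == 1:
        return 1   # after other non-ideographic scripts, before ideographs
    if plane == 2:
        return 3   # after the BMP ideographs
    if plane <= 14:
        return 4   # treated as unassigned (position is an assumption)
    return 5       # planes 15-16, private use: after everything else


def sort_by_plan(chars: Iterable[str]) -> List[str]:
    """Sort characters by bucket, then by code point as a placeholder tiebreak."""
    return sorted(chars, key=lambda ch: (collation_bucket(ch), ord(ch)))


mixed = ["\U000F0000", "\u4e00", "\U0002040A", "\U00010400", "a", "\U000E0001"]
print([f"U+{ord(ch):04X}" for ch in sort_by_plan(mixed)])
```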
Questions?
Don’t forget to fill out your evals!