Date posted: 11-Apr-2017
Category: Software
Uploaded by: kenneth-farrall
ENCODING NIGHTMARES
And how to avoid them
PHILADELPHIA SOFTWARE LOCALIZATION MEETUP
Welcome to our kickoff event! For more information, visit the meetup site at: https://www.meetup.com/Philadelphia-Software-Localization-Meetup/
PLAN OF TALK
Encoding Nightmares
Character Encoding and the Modern Tower of Babel
Rise of Unicode
Rules of Thumb to Avoid Nightmares
Tricks of the Trade
Discussion
TAIWANESE WEBSITE FAIL
DZONGKHA (BHUTANESE) AS WINDOWS-1252
CORRUPTED DOCUMENT, DATA LOSS
ENCODING NIGHTMARES CAN LEAD TO …
Confusion
Missed deadlines
Software bugs
Data corruption
Embarrassment
CHARACTER ENCODING AND THE MODERN TOWER OF BABEL
BINARY LANGUAGE
The bit: two states (0, 1)
Represented by switches that are “on” (1) or “off” (0) (yes, no)
Grouped together, bits represent more states
n bits = 2^n states
8 bits = 1 byte = 256 states
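The relationship between bit count and number of states (two raised to the number of bits) can be checked with a short Python sketch. The helper name `states` is made up for illustration.

```python
def states(n_bits: int) -> int:
    """Return how many distinct values n_bits binary digits can represent."""
    return 2 ** n_bits


# Each extra bit doubles the number of representable states.
for n in (1, 2, 7, 8, 16):
    print(f"{n:>2} bits -> {states(n):>6} states")
```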
BINARY CHARACTER ENCODING
ASCII character encoding
Associates a binary string with each English letter, number, punctuation mark, etc.
How many needed?
Uses 128 distinct binary numbers (0 to 127, which fit in 7 bits), each mapped to a member of the ASCII character set
Defined in the ASCII “code page”
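As a rough illustration of the ASCII mapping, the sketch below lists the number and 7-bit pattern behind each character of a short string. The function name `ascii_bits` is hypothetical; `ord` and `format` are standard Python built-ins.

```python
def ascii_bits(text: str) -> list:
    """Return (character, code number, 7-bit binary string) for ASCII text."""
    # encode("ascii") raises UnicodeEncodeError if text contains non-ASCII characters.
    text.encode("ascii")
    return [(ch, ord(ch), format(ord(ch), "07b")) for ch in text]


for ch, number, bits in ascii_bits("Hi!"):
    print(ch, number, bits)
```

Passing a string such as "é" raises an error, which is the limitation the next slides pick up.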
EUROPEAN LANGUAGES NEED MORE SPACE
German, French, and other languages needed more than 128 characters
Started to use the 8th bit (doubles the possibilities)
256 slots in these 8-bit character maps
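A small sketch of what "using the 8th bit" means in practice, assuming ISO 8859-1 (Latin-1) as the example 8-bit code page. The helper `high_bit_set` is a name invented here.

```python
def high_bit_set(byte: int) -> bool:
    """True when the 8th (most significant) bit of a byte is 1."""
    return byte & 0x80 != 0


# Plain ASCII letters stay below 128; accented letters land in the upper half.
for ch in "a\u00e9\u00fc":  # a, e-acute, u-umlaut
    value = ch.encode("latin-1")[0]
    print(ch, value, format(value, "08b"), high_bit_set(value))
```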
CHINESE, JAPANESE, KOREAN (CJK) NEED EVEN MORE
In Chinese, 2,000 distinct characters is often considered a minimum threshold for literacy. Several thousand characters are in common use, and large dictionaries list more than 40,000, many found only in rare or historical literature.
Japanese uses about 2,000 ideographic characters (kanji) borrowed from Chinese, mixed with its own two phonetic scripts (hiragana and katakana).
Modern Korean relies mostly on a phonetic script (Hangul) and much less on the broader set of Chinese characters.
256 characters is not enough for any of them.
DOUBLE BYTE CHARACTER ENCODINGS
Two bytes, 16 bits
2^16 = 65,536 possible characters
Some byte values are reserved as signals (for example, to mark the first byte of a two-byte character), so these encodings can’t actually store all 65,536
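To make the double-byte idea concrete, the sketch below measures how many bytes each character takes in two legacy CJK encodings that ship with Python (Shift JIS and Big5). The helper `byte_lengths` is hypothetical; the sample strings are chosen only for illustration.

```python
def byte_lengths(text: str, encoding: str) -> list:
    """Return (character, number of bytes) for each character in the given encoding."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


# ASCII letters still take one byte; CJK characters take two.
print(byte_lengths("A\u65e5\u672c", "shift_jis"))  # A + two kanji
print(byte_lengths("A\u4e2d\u6587", "big5"))       # A + two hanzi

# The first byte of a two-byte character has its high bit set,
# which is the "signal" that another byte follows.
lead = "\u65e5".encode("shift_jis")[0]
print(hex(lead), lead >= 0x80)
```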
https://r12a.github.io/scripts/tutorial/part2 / (Creative Commons license)
NUMBER OF ENCODINGS EXPLODE
ISO 646, ASCII, EBCDIC, CP37, CP930, CP1047, ISO 8859-1, ISO 8859-2, ISO 8859-3, ISO 8859-4, ISO 8859-5, ISO 8859-6, ISO 8859-7, ISO 8859-8, ISO 8859-9, ISO 8859-10, ISO 8859-11, ISO 8859-12, ISO 8859-13, ISO 8859-14, ISO 8859-15, CP437, CP720, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869, CP872, Windows-1250, Windows-1251, Windows-1252, Windows-1253, Windows-1254, Windows-1255, Windows-1256, Windows-1257, Windows-1258, Mac OS Roman, Shift JIS, GB 2312, GBK, Taiwan Big5, KOI8-R, KOI8-U, KOI7 …
1980S: THE COMPUTING TOWER OF BABEL
Same binary sequence represents entirely different characters
Sharing documents across borders becomes very difficult
Unintelligible files (a common experience during the early days of the web)
Hard to create a document containing multiple languages
Double-byte encodings increase the likelihood of encoding nightmares and add complexity to them
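The "same binary sequence, different characters" problem can be reproduced in a few lines. This sketch decodes one byte (0xE9) under several legacy code pages that Python supports; the choice of byte and of code pages is arbitrary, and `decode_everywhere` is a made-up helper name.

```python
def decode_everywhere(raw: bytes, encodings) -> dict:
    """Decode the same bytes under each named encoding."""
    return {name: raw.decode(name) for name in encodings}


# Western European, DOS, two Cyrillic code pages, and Greek.
ENCODINGS = ("latin-1", "cp437", "koi8_r", "cp1251", "iso8859_7")

results = decode_everywhere(b"\xe9", ENCODINGS)
for name, text in results.items():
    print(f"{name:>10}: {text}")
```

Each code page yields a different character for the identical byte, so a file is only readable if the reader knows which table the writer used.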
THE NIGHTMARE
If you open and save a file with the wrong character encoding, you can change it permanently. Important data may then be irretrievable.
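A minimal sketch of how this goes wrong, under the assumption that the file is UTF-8 and the tool opening it guesses something else. Real applications differ in how they treat bytes they cannot decode; the second case models a tool that substitutes replacement characters.

```python
original = "Z\u00fcrich caf\u00e9"
raw = original.encode("utf-8")

# Case 1: bytes read as Windows-1252. The text looks garbled (mojibake),
# but every byte is still there, so the mistake can be undone.
mojibake = raw.decode("cp1252")
recovered = mojibake.encode("cp1252").decode("utf-8")
print(mojibake, "->", recovered)

# Case 2: bytes read as ASCII with undecodable bytes replaced by U+FFFD.
# Once this text is saved, the original bytes are gone for good.
lossy = raw.decode("ascii", errors="replace")
saved = lossy.encode("utf-8")
print(lossy, saved != raw)
```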
RISE OF UNICODE
WHAT IS UNICODE?
Global, unified “solution” to the character encoding Tower of Babel
One big encoding table for all the world’s characters
Every character has a unique, defined “code point”
Capacity for more than 1 million characters
UNICODE CONSORTIUM
Non-profit corporation with global members from industry, government, academia, and other NGOs
Approves new characters for inclusion in the official Unicode Standard
Works closely with W3C and ISO
MORE ON UNICODE
Abstract characters, not glyphs
Broken into 17 planes (each with 65,536 code points): Basic Multilingual Plane + 16 other planes
Room for more than 1 million individual characters (17 × 65,536 = 1,114,112 code points)
A code point is a number, NOT a specific binary encoding of that number (UTF-8 differs from UTF-16)
Lots of room for growth
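The difference between a code point and its binary encoding can be seen by printing both. In this sketch, `describe` is a hypothetical helper; the plane is computed by dividing the code point by 65,536, and UTF-16 is shown in big-endian byte order.

```python
def describe(ch: str) -> dict:
    """Show a character's code point, its plane, and its UTF-8 / UTF-16 bytes."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # plane 0 is the Basic Multilingual Plane
        "utf8": ch.encode("utf-8").hex(),
        "utf16": ch.encode("utf-16-be").hex(),
    }


# ASCII letter, accented Latin letter, Chinese character, emoji (outside the BMP).
for ch in ("A", "\u00e9", "\u8c22", "\U0001F600"):
    print(describe(ch))
```

The same code point produces different byte sequences in UTF-8 and UTF-16, which is why knowing "it's Unicode" is not enough to read a file.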
VERSION 9.0 (JUNE 21, 2016)
Adds exactly 7,500 characters, for a total of 128,172 characters:
Osage, a Native American language
Nepal Bhasa, a language of Nepal
Fulani and other African languages
Tangut, a major historic script of China
72 emoji characters, such as new smileys and people, animals and nature, and food and drink
STILL NOT UBIQUITOUS!
Pre-Unicode encodings are very much still in use:
Legacy operating systems
Popular applications
MS Office products
And even within Unicode, nightmares are still possible (UTF-?)
4 RULES OF THUMB
LIMIT YOUR APPLICATIONS
Every app in the chain has the potential to corrupt.
Make sure nobody opens the file “just to take a look.”
USE UTF-8
For websites and mobile apps, almost always the right choice
If a resource uses a different encoding, use iconv or a similar tool to convert
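Where iconv is not at hand, the same conversion can be sketched in Python. This assumes you already know the source encoding; the function name `convert_to_utf8`, the file names, and the sample text are all made up for the example.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, src_encoding: str) -> None:
    """Re-encode src (known to be in src_encoding) as UTF-8 at dst."""
    # Strict decoding: bytes that are invalid in src_encoding raise an error
    # instead of being silently replaced.
    text = src.read_bytes().decode(src_encoding)
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        legacy = Path(tmp) / "legacy.txt"
        converted = Path(tmp) / "converted.txt"
        legacy.write_bytes("Gr\u00fc\u00dfe aus K\u00f6ln".encode("cp1252"))
        convert_to_utf8(legacy, converted, "cp1252")
        print(converted.read_bytes().decode("utf-8"))
```

Writing to a separate destination file keeps the original untouched, in line with the first rule of thumb.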
KNOW YOUR METADATA
HTML 4:
<head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head>
HTML5:
<head> <meta charset="UTF-8"> </head>
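One way to check what a page declares is to read the charset out of its meta tags. The sketch below uses Python's standard `html.parser` and handles both forms shown above; the class and function names are hypothetical, and it ignores other signals such as HTTP headers.

```python
from html.parser import HTMLParser


class CharsetSniffer(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # Short form: <meta charset="UTF-8">
            self.charset = attr["charset"]
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=UTF-8"
            for part in (attr.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value


def declared_charset(html: str):
    """Return the charset named in the document's meta tags, or None."""
    sniffer = CharsetSniffer()
    sniffer.feed(html)
    return sniffer.charset


print(declared_charset('<head> <meta charset="UTF-8"> </head>'))
```

The declared value is only a claim: it still has to match the bytes the server actually sends.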
KNOW THE DIFFERENCE BETWEEN CHARACTERS AND GLYPHS
Technically, Unicode encodes characters, not glyphs or fonts.
Characters can be thought of as the base shape, while glyphs and fonts are particular appearances of those characters, including combinations of “root” characters that appear as one symbol, like the é.
This distinction can be important when you are diagnosing a character display problem, but the boundary can be fuzzy. Ä, for example, is a complete character with its own code point (U+00C4), but it can also be stored as two code points: the base character A followed by a combining umlaut (U+0308).
You may have the correct encoding, but the particular font you are using may not have the glyphs needed to display the encoded character.
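The two ways of storing Ä can be compared directly with Python's standard `unicodedata` module. This is a small sketch; NFC and NFD are the Unicode normalization forms that compose and decompose such sequences.

```python
import unicodedata

composed = "\u00c4"     # single code point: A with diaeresis
decomposed = "A\u0308"  # base letter A + combining diaeresis

# They render the same but are different strings of different lengths.
print(composed == decomposed, len(composed), len(decomposed))

# Normalizing both to the same form makes them comparable.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", composed)
print(as_nfc == composed, as_nfd == decomposed)
```

Normalizing before comparing or searching avoids "identical-looking" strings that fail to match.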
TRICKS OF THE TRADE
CHECK AND CONVERT ENCODING
Some text editors can guess a file’s encoding, and standalone utilities (like iconv) can convert from one encoding to another
Detection libraries are available (Mozilla Universal Charset Detector, International Components for Unicode)
They can often guess correctly, but they are imperfect
Some tools allow you to check large sets of files in batches
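A batch check does not have to guess. The sketch below uses only the Python standard library (not the detection libraries named above) to flag files that are not valid UTF-8; the function names and the `*.txt` default pattern are assumptions for illustration.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root: Path, pattern: str = "*.txt") -> list:
    """Return files under root matching pattern that are not valid UTF-8."""
    return sorted(
        path
        for path in root.rglob(pattern)
        if path.is_file() and not is_valid_utf8(path.read_bytes())
    )


print(is_valid_utf8("caf\u00e9".encode("utf-8")))   # UTF-8 bytes
print(is_valid_utf8("caf\u00e9".encode("cp1252")))  # legacy bytes
```

This only tells you which files need attention; identifying their actual encoding still takes a detector or a human.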
UTF-8 WITH BOM?
BOM = Byte Order Mark
Essentially a signal to the receiver that the byte stream is Unicode
Can be prepended to files by otherwise “neutral” apps like Windows Notepad
Can trip up various programming languages and introduce garbage (PHP, for example)
Could show up in a text editor (if misinterpreted) as a series of stray characters at the start of the file, such as ï»¿
Use an editor (such as Sublime Text) or an encoding converter to convert to straight UTF-8 without a BOM
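For scripted cleanup, a BOM can be detected and removed at the byte level. This sketch relies on Python's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; `strip_utf8_bom` is a hypothetical helper name.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = "hello".encode("utf-8-sig")  # the utf-8-sig codec writes a BOM first
print(with_bom)
print(strip_utf8_bom(with_bom))

# Decoding with plain utf-8 keeps the BOM as an invisible U+FEFF character;
# decoding with utf-8-sig drops it.
print(repr(with_bom.decode("utf-8")), repr(with_bom.decode("utf-8-sig")))
```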

SPREADSHEET TIP
Be careful with CSV and Excel
Excel often mangles CSV encoding
Use Google Docs (or a Mac) to save the CSV as Excel and then convert back to CSV
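When a CSV is produced or consumed by a script rather than a spreadsheet app, stating the encoding explicitly avoids depending on a platform default. This is a complementary sketch using Python's standard `csv` module, not the workflow described in the talk; the helper names and sample rows are invented.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, rows, encoding="utf-8"):
    """Write rows to a CSV file in an explicitly chosen encoding."""
    with open(path, "w", encoding=encoding, newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv(path, encoding="utf-8"):
    """Read a CSV file back using the same explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    rows = [["city", "greeting"], ["M\u00fcnchen", "Gr\u00fc\u00df Gott"], ["\u5317\u4eac", "\u4f60\u597d"]]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "greetings.csv"
        write_csv(target, rows)
        print(read_csv(target) == rows)
```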
TOOLS Will post to our discussion page at the Meetup site. Add your own!
DISCUSSION
Questions? Tips? Horror Stories?
THANK YOU!
Merci – Gracias – Danke
Grazie – Obrigado – شكرا – 谢谢 – 감사합니다 – ありがとう
www.mtmlinguasoft.com