Post on 26-Mar-2015
Bits of Unicode
Data structures for a large character set
Mark Davis
IBM Emerging Technologies
☢ Caution ☢
• “Characters” is ambiguous; sometimes it means:
– Graphemes: “x̣” (also “ch”, …)
– Code points: 0078 0323
– Code units: 0078 0323 (or UTF-8: 78 CC A3)
• For programmers
– Unicode associates codepoints (or sequences of codepoints) with properties
– See UTR#17
The Problem
• Programs often have to do <key,value> lookups
– Look up properties by codepoint
– Map codepoints to values
– Test codepoints for inclusion in set
  • e.g. value == true/false
• Easy with 256 codepoints: just use array
Size Matters
• Not so easy with Unicode!
• Unicode 3.0
– subset (except PUA)
– up to FFFF₁₆ = 65,535₁₀
• Unicode 3.1
– full range
– up to 10FFFF₁₆ = 1,114,111₁₀
Array Lookup
• With ASCII
• Simple
• Fast
• Compact
– codepoint ➠ bit: 32 bytes
– codepoint ➠ short: ½ K
• With Unicode
• Simple
• Fast
• Huge (esp. v3.1)
– codepoint ➠ bit: 136 K
– codepoint ➠ short: 2.2 M
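The sizes on this slide follow from simple arithmetic. A minimal sketch, assuming one bit or one 16-bit short per codepoint and the full range 0..10FFFF (the class and method names are illustrative only):

```java
final class ArraySizes {
    /** Bytes needed to store one bit per codepoint. */
    static long bitSetBytes(long codepoints) {
        return (codepoints + 7) / 8;
    }

    /** Bytes needed to store one 16-bit short per codepoint. */
    static long shortArrayBytes(long codepoints) {
        return codepoints * 2;
    }

    public static void main(String[] args) {
        long small = 0x100;     // 256 codepoints
        long full = 0x110000;   // 0..10FFFF, i.e. 1,114,112 codepoints
        System.out.println(bitSetBytes(small));      // 32
        System.out.println(shortArrayBytes(small));  // 512
        System.out.println(bitSetBytes(full));       // 139264
        System.out.println(shortArrayBytes(full));   // 2228224
    }
}
```

139,264 bytes is exactly 136 × 1024, and 2,228,224 bytes is roughly 2.2 million bytes.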
Further complications
• Mappings, tests, properties often must be for sequences of codepoints.
– Human languages don’t just use single codepoints.
– “ch” in Spanish, Slovak; etc.
First step: Avoidance
• Properties from libraries often suffice
– Test for (Character.getType(c) == Nd) instead of long list of codepoints
  • Easier
  • Automatically updated with new versions
• Data structures from libraries often suffice
– Java Hashtable
– ICU (Java or C++) CompactArray
– JavaScript properties
• Consult http://www.unicode.org
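As a small illustration of the property test above: in Java the Nd category is exposed as the constant Character.DECIMAL_DIGIT_NUMBER. The class and method names below are illustrative only.

```java
final class DigitTest {
    /** True if the codepoint has general category Nd (decimal digit number). */
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit('7'));     // ASCII digit
        System.out.println(isDecimalDigit(0x0663));  // ARABIC-INDIC DIGIT THREE
        System.out.println(isDecimalDigit('x'));     // a letter, not a digit
    }
}
```

The test covers digits from every script the runtime knows about, with no hand-maintained list of codepoints.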
Data structures: criteria
• Speed
– Read (static)
– Write (dynamic)
– Startup
• Memory footprint
– RAM
– Disk
• Multi-threading
Hashtables
• Advantages
– Easy to use out-of-the-box
– Reasonably fast
– General
• Disadvantages
– High overhead
– Discrete (no range lookup)
– Much slower than array lookup
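Overhead: char1 ➠ char2
[Figure: memory layout of one hashtable entry mapping char1 to char2: the entry's hash, key, value, and next fields, plus object overhead for the table, the entry, char1, and char2]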
Trie
• Advantages
– Nearly as fast as array lookup
– Much smaller than arrays or Hashtables
– Take advantage of repetition
• Disadvantages
– Not suited for rapidly changing data
– Best for static, preformed data
Trie structure
[Figure: the codepoint is split into bit fields M1 and M2; the Index array points to blocks in the Data array]
Trie code
• 5 Operations
– Shift, Lookup, Mask, Add, Lookup
v = data[index[c >> S1] + (c & M2)]
[Figure: the codepoint split into fields M1 and M2, with S1 the shift that isolates M1]
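The lookup above can be sketched in Java together with a simple builder. This is a minimal sketch, not ICU's CompactArray: the block size (S1 = 7), the class name, and the way identical blocks are detected are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SimpleTrie {
    static final int S1 = 7;              // assumed shift: blocks of 128 values
    static final int M2 = (1 << S1) - 1;  // mask for the offset within a block

    final int[] index;   // block number -> start offset of that block in data
    final short[] data;  // distinct blocks, stored back to back

    /** Builds from a flat array whose length is a multiple of the block size. */
    SimpleTrie(short[] flat) {
        int blockLen = 1 << S1;
        index = new int[flat.length >> S1];
        short[] out = new short[flat.length];
        int used = 0;
        Map<String, Integer> seen = new HashMap<>();  // block contents -> offset in out
        for (int b = 0; b < index.length; b++) {
            short[] block = Arrays.copyOfRange(flat, b * blockLen, (b + 1) * blockLen);
            String key = Arrays.toString(block);
            Integer start = seen.get(key);
            if (start == null) {                      // first time this block is seen
                start = used;
                System.arraycopy(block, 0, out, used, blockLen);
                used += blockLen;
                seen.put(key, start);
            }
            index[b] = start;                         // repeated blocks share storage
        }
        data = Arrays.copyOf(out, used);
    }

    /** Shift, lookup, mask, add, lookup. */
    short get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

With a flat array covering 0..10FFFF in which only a handful of codepoints have non-zero values, almost every block is identical, so data shrinks to a few blocks while get stays a constant-time expression.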
Trie: double indexed
• Double, for more compaction:
– Slightly slower than single index
– Smaller chunks of data, so more compaction
[Figure: the codepoint is split into bit fields M1, M2, and M3; Index1 points into Index2, which points to blocks in Data]
Trie code: double indexed
b1 = index1[ c >> S1 ]
b2 = index2[ b1 + ((c >> S2) & M2)]
v = data[ b2 + (c & M3) ]
[Figure: the codepoint split into fields M1, M2, and M3, with shifts S1 and S2]
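A two-stage version can be sketched the same way: first compact the data into blocks, then compact the resulting index. The shift values, the class name, and the compact helper below are assumptions for illustration, not ICU's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class DoubleTrie {
    static final int S1 = 11;                        // assumed: top bits select an index1 entry
    static final int S2 = 5;                         // assumed: data blocks of 32 values
    static final int M2 = (1 << (S1 - S2)) - 1;      // middle bits: 64 index2 entries per chunk
    static final int M3 = (1 << S2) - 1;             // low bits: offset within a data block

    final int[] index1, index2, data;

    /** Splits src into blocks, keeps each distinct block once; returns {block offsets, compacted blocks}. */
    static int[][] compact(int[] src, int blockLen) {
        int[] starts = new int[src.length / blockLen];
        int[] out = new int[src.length];
        int used = 0;
        Map<String, Integer> seen = new HashMap<>();
        for (int b = 0; b < starts.length; b++) {
            int[] block = Arrays.copyOfRange(src, b * blockLen, (b + 1) * blockLen);
            String key = Arrays.toString(block);
            Integer start = seen.get(key);
            if (start == null) {
                start = used;
                System.arraycopy(block, 0, out, used, blockLen);
                used += blockLen;
                seen.put(key, start);
            }
            starts[b] = start;
        }
        return new int[][] { starts, Arrays.copyOf(out, used) };
    }

    /** Builds from a flat array whose length is a multiple of 1 << S1. */
    DoubleTrie(int[] flat) {
        int[][] stage = compact(flat, 1 << S2);        // data blocks
        data = stage[1];
        stage = compact(stage[0], 1 << (S1 - S2));     // chunks of index2 entries
        index1 = stage[0];
        index2 = stage[1];
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

Because the data blocks are smaller than in the single-index sketch, more of them are identical, and the second index is itself compacted; the cost is one extra lookup per access.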
Inversion List
• Compaction of set of codepoints
• Advantages
– Simple
– Very compact
– Faster write than trie
– Very fast boolean operations
• Disadvantages
– Slower read than trie or hashtable
Inversion List Structure
• Structure
– Index (optional)
– List of codepoints in ascending order
• Example Set
[ 0020-0061, 0135, 19A3-201B ]
Index  Value
0:     0020  in
1:     0062  out
2:     0135  in
3:     0136  out
4:     19A3  in
5:     201C  out
Inversion List Example
• Find smallest i such that c < data[i]
– If no i, i = length
• Then c ∈ List ↔ odd(i)
• Examples:
– In: 0023, 0135
– Out: 001A, 0136, A357
List (index 0 to 5): 0020, 0062, 0135, 0136, 19A3, 201C
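The lookup rule can be sketched in Java using the standard library's binary search. The class and method names are illustrative only; the data is the example set from the slide.

```java
import java.util.Arrays;

final class InversionList {
    final int[] data;  // ascending boundaries: a range starts at each even index and ends before each odd one

    InversionList(int... data) {
        this.data = data;
    }

    /** Smallest i such that c < data[i], or data.length if there is none. */
    int findIndex(int c) {
        int pos = Arrays.binarySearch(data, c);
        // Found at pos: the next entry is the first one greater than c.
        // Not found: binarySearch returns -(insertion point) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** c is in the set exactly when the index is odd. */
    boolean contains(int c) {
        return (findIndex(c) & 1) == 1;
    }

    public static void main(String[] args) {
        InversionList set = new InversionList(0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C);
        System.out.println(set.contains(0x0023));  // true
        System.out.println(set.contains(0x0136));  // false
    }
}
```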
Inversion List Operations
• Fast Boolean Operations
• Example: Negation
Before: 0: 0020, 1: 0062, 2: 0135, 3: 0136, 4: 19A3, 5: 201C
➠
After: 0: 0000, 1: 0020, 2: 0062, 3: 0135, 4: 0136, 5: 19A3, 6: 201C
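Negation only toggles a leading 0000 boundary, which flips the in/out parity of every later index. A minimal sketch, assuming the list is a plain int array (the class and method names are illustrative only):

```java
import java.util.Arrays;

final class InversionListOps {
    /** Returns the complement: drop a leading 0 if present, otherwise prepend one. */
    static int[] negate(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] result = new int[list.length + 1];  // result[0] is already 0
        System.arraycopy(list, 0, result, 1, list.length);
        return result;
    }
}
```

Applying negate twice returns the original list, and no codepoint in the list is compared or modified.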
Inversion List: Binary Search
• from Programming Pearls
• Completely unrolled, precalculated parameters
int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {  // each case deliberately falls through to the next
case 21:
    if (x < data[t = index-0x10000]) index = t;
case 20:
    if (x < data[t = index-0x8000]) index = t;
…
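For comparison, here is the same idea of halving power-of-two steps written as a loop. This is not the unrolled code from the slide: it steps upward from the start of the array instead of downward, and the class and method names are illustrative only.

```java
final class SteppedSearch {
    /** Smallest i such that x < data[i], or data.length if there is none. */
    static int findIndex(int[] data, int x) {
        int pos = -1;                                   // last index known to hold a value <= x
        int step = Integer.highestOneBit(data.length);  // largest power of two <= length (0 if empty)
        for (; step > 0; step >>= 1) {
            int t = pos + step;
            if (t < data.length && data[t] <= x) {
                pos = t;                                // everything up to t is <= x
            }
        }
        return pos + 1;
    }
}
```

Unrolling this loop into one statement per step, with the starting step chosen from the array length in advance, gives the switch-with-fall-through shape shown on the slide.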
Inversion Map
• Inversion List plus
• Associated Values
– Lookup index just as in Inversion List
– Take corresponding value
Inversion list: 0: 0020, 1: 0062, 2: 0135, 3: 0136, 4: 19A3, 5: 201C
Values: 0: 0, 1: 5, 2: 3, 3: 9, 4: 8, 5: 3, 6: 0
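An inversion map adds one value per lookup index, so it needs one more value than it has boundaries. A minimal sketch, assuming int values and illustrative names; the boundaries are the example list, and the values are read from the figure as seven single digits, one per lookup index:

```java
import java.util.Arrays;

final class InversionMap {
    final int[] starts;  // ascending range boundaries, as in an inversion list
    final int[] values;  // values[i] applies when the lookup index is i

    InversionMap(int[] starts, int[] values) {
        if (values.length != starts.length + 1) {
            throw new IllegalArgumentException("need one more value than boundaries");
        }
        this.starts = starts;
        this.values = values;
    }

    int get(int c) {
        int pos = Arrays.binarySearch(starts, c);
        int i = pos >= 0 ? pos + 1 : -pos - 1;  // smallest i with c < starts[i]
        return values[i];
    }

    public static void main(String[] args) {
        InversionMap map = new InversionMap(
            new int[] {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C},
            new int[] {0, 5, 3, 9, 8, 3, 0});
        System.out.println(map.get(0x0023));  // 5
        System.out.println(map.get(0x0135));  // 9
    }
}
```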
Key ➠ String Value
• Problem
– Often almost all values are 1 codepoint
– But, must map to strings in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate extra processing
– Can use same solution for UTF-16 code units
Example
• Get a character ch
• Find its value v
• If v is in [D800..E000], may be string
– check v2 = valueException[v - D800]
– if v2 not null, process it, continue
• Process v
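The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the per-character values live in a plain char array rather than a trie, the range is treated as D800 up to but not including E000, and the uppercase-style mappings are only sample data. All names are illustrative.

```java
final class ExceptionValues {
    final char[] values = new char[0x10000];           // value per UTF-16 code unit (a trie in practice)
    final String[] valueException = new String[0x800]; // strings for values D800..DFFF

    ExceptionValues() {
        for (int i = 0; i < values.length; i++) {
            values[i] = (char) i;                       // default: map to itself
        }
        values['a'] = 'A';                              // ordinary single-codepoint value
        values['\u00DF'] = 0xD800;                      // exception value: look in valueException
        valueException[0] = "SS";
    }

    String map(char ch) {
        char v = values[ch];
        if (v >= 0xD800 && v < 0xE000) {                // may be a string
            String v2 = valueException[v - 0xD800];
            if (v2 != null) {
                return v2;
            }
        }
        return String.valueOf(v);                       // process v as a single value
    }
}
```

Only the few characters with string values pay for a String object; every other character costs one array entry.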
String Key ➠ Value
• Problem
– Often almost all keys are 1 codepoint
– Must have string keys in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate possible follow-on codepoints
– Can use same solution for UTF-16 code units
– Use key closure!
Closure
• If (X + Y) is a key, then X is a key
Before
s ➠ x
sh ➠ y
shch ➠ z
c ➠ w
➠
After
s ➠ x
sh ➠ y
shc ➠ yw
shch ➠ z
c ➠ w
Why Closure?
Input: s h c h a …
s ➠ x
sh ➠ y
shc ➠ yw
shch ➠ z
shcha ➠ not found, use last
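With a closed key set, matching can extend one character at a time and stop at the first prefix that is not a key. A minimal sketch, assuming the keys are held in a java.util.Map and text is processed by UTF-16 code unit; the class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

final class ClosedKeyMatcher {
    /** Length of the longest key that prefixes text at start, or 0 if none. Requires a closed key set. */
    static int longestMatch(Map<String, String> closed, String text, int start) {
        int best = 0;
        for (int end = start + 1; end <= text.length(); end++) {
            if (!closed.containsKey(text.substring(start, end))) {
                break;            // closure: no longer key can start with this prefix
            }
            best = end - start;   // remember the last match
        }
        return best;
    }

    /** Replaces each longest match with its value; unmatched characters are copied. */
    static String transform(Map<String, String> closed, String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(closed, text, i);
            if (len == 0) {
                out.append(text.charAt(i++));
            } else {
                out.append(closed.get(text.substring(i, i + len)));
                i += len;
            }
        }
        return out.toString();
    }

    /** The "After" table from the closure slide. */
    static Map<String, String> example() {
        Map<String, String> m = new HashMap<>();
        m.put("s", "x");
        m.put("sh", "y");
        m.put("shc", "yw");
        m.put("shch", "z");
        m.put("c", "w");
        return m;
    }
}
```

If the shc entry is removed, the same loop stops at sh for the input shcha and never reaches shch, which is the failure closure prevents.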
Bitpacking
• Squeeze information into value
• Example: Character Properties
– category: 5 bits
– bidi: 4 bits (+ exceptions)
– canonical category: 6 bits + expansion
  • compressCanon = (bits >> SHIFT) & MASK;
  • canon = expansionArray[compressCanon];
Statetables
• Classic:
– entry = stateTable[ state, ch ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
Statetables
• Unicode:
– type = trie[ch];
– entry = stateTable[ state, type ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
• Also, String Key ➠ Value
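The Unicode variant can be sketched with a tiny machine that finds runs of decimal digits. This is a sketch under stated assumptions: Character.getType stands in for the trie lookup, the table has only two states, and an extra end-of-input type is added so the loop can reach a negative state. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class DigitRuns {
    static final int OTHER = 0, DIGIT = 1, EOF = 2;   // character types
    static final int NONE = 0, START = 1, EMIT = 2;   // actions

    // STATE_TABLE[state][type] = {next state, action}; a negative next state ends the loop.
    static final int[][][] STATE_TABLE = {
        /* state 0: outside a run */ { {0, NONE}, {1, START}, {-1, NONE} },
        /* state 1: inside a run  */ { {0, EMIT}, {1, NONE},  {-1, EMIT} },
    };

    /** Stand-in for type = trie[ch]. */
    static int typeOf(int c) {
        return Character.getType(c) == Character.DECIMAL_DIGIT_NUMBER ? DIGIT : OTHER;
    }

    static List<String> find(String text) {
        List<String> runs = new ArrayList<>();
        int state = 0, start = 0, i = 0;
        do {
            int c = i < text.length() ? text.codePointAt(i) : -1;
            int type = c < 0 ? EOF : typeOf(c);
            int[] entry = STATE_TABLE[state][type];
            state = entry[0];
            if (entry[1] == START) {
                start = i;                             // a run begins here
            } else if (entry[1] == EMIT) {
                runs.add(text.substring(start, i));    // a run ended just before i
            }
            if (c >= 0) {
                i += Character.charCount(c);
            }
        } while (state >= 0);
        return runs;
    }
}
```

The table has one column per character type rather than one per codepoint, so it stays small no matter how large the character set is.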
Sample Data Structures: ICU
• Trie: CompactArray
– Customized for each datatype
– Automatic expansion
– Compact after setting
• Character Properties
– use CompactArray, Bitpacking
• Inversion List: UnicodeSet
– Boolean Operations
Sample Usage #1: ICU
• Collation
– Trie lookup
– Expanding character: Key ➠ String Value
– Contracting character: String Key ➠ Value
• Break Iterators
– For grapheme, word, line, sentence break
– Statetable
Sample Usage #2: ICU
• Transliteration
– Requires
  • Mapping codepoints in context to others
  • Rearranging codepoints
  • Controlling the choice of mapping
– Character Properties
– Inversion List
– Exception values
Sample Usage #3: ICU
• Character Conversion
– From Unicode to bytes
  • Trie
– From bytes to Unicode
  • Arrays for simple maps
  • Statetables for complex maps
    – recognizes valid / invalid mappings
    – provides compaction
• Complications
– Invalid vs. Valid mapped vs. Valid unmapped
– Fallbacks
References
• Unicode Open Source: ICU
– http://oss.software.ibm.com/icu
– ICU4j: Java API
– ICU4c: C and C++ APIs
• Other references: see Mark’s website
– http://www.macchiato.com
Q & A