Date post: 26-Mar-2015

Bits of Unicode

Data structures for a large character set

Mark Davis
IBM Emerging Technologies


☢ Caution ☢

• “Characters” is ambiguous; it sometimes means:
  – Graphemes: “x̣” (also “ch”, …)
  – Code points: 0078 0323
  – Code units: 0078 0323 (or UTF-8: 78 CC A3)
• For programmers
  – Unicode associates codepoints (or sequences of codepoints) with properties
  – See UTR#17
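
The three counts can differ for the same text. A minimal Java sketch, assuming only the standard `String` API (the class name `CharacterCounts` is hypothetical), counts code points, UTF-16 code units, and UTF-8 code units for the “x̣” example (U+0078 U+0323):

```java
import java.nio.charset.StandardCharsets;

final class CharacterCounts {
    // The grapheme from the slide: LATIN SMALL LETTER X + COMBINING DOT BELOW.
    static final String X_DOT_BELOW = "x\u0323";

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf16Units(String s) {
        return s.length();               // Java strings are sequences of UTF-16 code units
    }

    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For this string the code point and UTF-16 counts agree (2 each) while UTF-8 needs 3 code units; for a supplementary codepoint such as U+10400 the counts are 1, 2, and 4.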


The Problem

• Programs often have to do <key,value> lookups
  – Look up properties by codepoint
  – Map codepoints to values
  – Test codepoints for inclusion in set
      • e.g. value == true/false
• Easy with 256 codepoints: just use array


Size Matters

• Not so easy with Unicode!
• Unicode 3.0
  – subset (except PUA)
  – up to FFFF₁₆ = 65,535₁₀
• Unicode 3.1
  – full range
  – up to 10FFFF₁₆ = 1,114,111₁₀


Array Lookup

• With ASCII
  – Simple
  – Fast
  – Compact
      • codepoint ➠ bit: 32 bytes
      • codepoint ➠ short: ½ K
• With Unicode
  – Simple
  – Fast
  – Huge (esp. v3.1)
      • codepoint ➠ bit: 136 K
      • codepoint ➠ short: 2.2 M
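
The sizes above follow from simple arithmetic. As a rough check, assuming a 256-entry table for the small case and the full range 0..10FFFF for Unicode 3.1 (the helper names are hypothetical):

```java
final class TableSizes {
    // One bit per codepoint, rounded up to whole bytes.
    static long bitTableBytes(long codepoints) {
        return (codepoints + 7) / 8;
    }

    // One 16-bit short per codepoint.
    static long shortTableBytes(long codepoints) {
        return codepoints * 2;
    }

    public static void main(String[] args) {
        System.out.println(bitTableBytes(256));        // 32 bytes
        System.out.println(shortTableBytes(256));      // 512 bytes = ½ K
        System.out.println(bitTableBytes(0x110000));   // 139264 bytes = 136 K
        System.out.println(shortTableBytes(0x110000)); // 2228224 bytes, about 2.2 M
    }
}
```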


Further complications

• Mappings, tests, properties often must be for sequences of codepoints.
  – Human languages don’t just use single codepoints.
  – “ch” in Spanish, Slovak; etc.


First step: Avoidance

• Properties from libraries often suffice
  – Test for (Character.getType(c) == Nd) instead of long list of codepoints
      • Easier
      • Automatically updated with new versions
• Data structures from libraries often suffice
  – Java Hashtable
  – ICU (Java or C++) CompactArray
  – JavaScript properties
• Consult http://www.unicode.org
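
As an illustration of the property test above: in Java the general category Nd is exposed as `Character.DECIMAL_DIGIT_NUMBER`, so the check needs no hand-maintained list. The wrapper class `DigitCheck` is a hypothetical name for this sketch.

```java
final class DigitCheck {
    // True for any codepoint whose general category is Nd (decimal digit),
    // according to the Unicode version shipped with the running JDK.
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }
}
```

`DigitCheck.isDecimalDigit(0x0663)` (ARABIC-INDIC DIGIT THREE) is true just as `isDecimalDigit('7')` is, without either appearing in the source.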


Data structures: criteria

• Speed
  – Read (static)
  – Write (dynamic)
  – Startup
• Memory footprint
  – RAM
  – Disk
• Multi-threading


Hashtables

• Advantages
  – Easy to use out-of-the-box
  – Reasonably fast
  – General
• Disadvantages
  – High overhead
  – Discrete (no range lookup)
  – Much slower than array lookup


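Overhead: char1 ➠ char2

[Figure: a hashtable entry with hash, key, value, and next fields plus object overhead; the key and value refer to separate char1 and char2 objects, each with its own overhead.]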


Trie

• Advantages
  – Nearly as fast as array lookup
  – Much smaller than arrays or Hashtables
  – Take advantage of repetition
• Disadvantages
  – Not suited for rapidly changing data
  – Best for static, preformed data


Trie structure

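[Figure: the codepoint is split into fields M1 and M2; M1 selects an entry in the Index array, which locates a block in the Data array, and M2 selects the value within that block.]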


Trie code

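• 5 Operations
  – Shift, Lookup, Mask, Add, Lookup

v = data[index[c >> S1] + (c & M2)]

[Figure: the codepoint divided into an upper field M1, isolated by shifting right by S1, and a lower field M2.]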
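
To make the lookup concrete, here is a minimal sketch of a single-index trie built from a flat table. It is not ICU's CompactArray: the class name `SimpleTrie`, the block size of 256 (`S1 = 8`), and the string-keyed deduplication are all assumptions chosen for brevity. Identical blocks are stored once, which is where the compaction comes from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimpleTrie {
    static final int S1 = 8;                  // assumed: 256 values per block
    static final int M2 = (1 << S1) - 1;      // mask for the low S1 bits

    final int[] index;    // per block: offset of that block's values in data
    final short[] data;   // the distinct blocks, concatenated

    // flat.length must be a multiple of the block size (0x110000 is).
    SimpleTrie(short[] flat) {
        int blocks = flat.length >> S1;
        index = new int[blocks];
        Map<String, Integer> seen = new HashMap<>();
        List<short[]> unique = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            short[] block = Arrays.copyOfRange(flat, b << S1, (b + 1) << S1);
            String key = Arrays.toString(block);
            Integer offset = seen.get(key);
            if (offset == null) {             // first time this block content appears
                offset = unique.size() << S1;
                seen.put(key, offset);
                unique.add(block);
            }
            index[b] = offset;
        }
        data = new short[unique.size() << S1];
        for (int i = 0; i < unique.size(); i++) {
            System.arraycopy(unique.get(i), 0, data, i << S1, 1 << S1);
        }
    }

    // Shift, lookup, mask, add, lookup.
    short get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

With a table that is zero everywhere except at two codepoints, only three distinct blocks survive, so `data` holds 768 shorts instead of 1,114,112.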


Trie: double indexed

• Double, for more compaction:
  – Slightly slower than single index
  – Smaller chunks of data, so more compaction


Trie: double indexed

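[Figure: the codepoint is split into fields M1, M2, and M3; M1 selects an entry in Index1, which locates a block in Index2; M2 selects an entry in that block, which locates a block in Data; M3 selects the value within it.]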


Trie code: double indexed

b1 = index1[ c >> S1 ]

b2 = index2[ b1 + ((c >> S2) & M2)]

v = data[ b2 + (c & M3) ]

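[Figure: the codepoint divided into fields M1, M2, and M3; shifting right by S1 isolates M1, and shifting right by S2 brings M2 to the low bits.]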
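
The double-indexed lookup can be sketched the same way by compacting twice: first the data blocks, then the array of block offsets. The field widths (`S1 = 11`, `S2 = 5`), the class name `TwoStageTrie`, and the `compact` helper are assumptions for illustration, not the layout of any particular library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TwoStageTrie {
    static final int S1 = 11, S2 = 5;                 // assumed field widths
    static final int M2 = (1 << (S1 - S2)) - 1;       // middle field mask
    static final int M3 = (1 << S2) - 1;              // low field mask

    final int[] index1, index2, data;

    // flat.length must be a multiple of 1 << S1 (0x110000 is).
    TwoStageTrie(int[] flat) {
        int[] blockOffsets = new int[flat.length >> S2];
        data = compact(flat, 1 << S2, blockOffsets);
        index1 = new int[flat.length >> S1];
        index2 = compact(blockOffsets, 1 << (S1 - S2), index1);
    }

    // Splits src into blocks of blockLen, stores each distinct block once, and
    // records in offsets[b] where block b starts in the returned array.
    static int[] compact(int[] src, int blockLen, int[] offsets) {
        Map<String, Integer> seen = new HashMap<>();
        List<int[]> unique = new ArrayList<>();
        for (int b = 0; b < offsets.length; b++) {
            int[] block = Arrays.copyOfRange(src, b * blockLen, (b + 1) * blockLen);
            String key = Arrays.toString(block);
            Integer at = seen.get(key);
            if (at == null) {
                at = unique.size() * blockLen;
                seen.put(key, at);
                unique.add(block);
            }
            offsets[b] = at;
        }
        int[] out = new int[unique.size() * blockLen];
        for (int i = 0; i < unique.size(); i++) {
            System.arraycopy(unique.get(i), 0, out, i * blockLen, blockLen);
        }
        return out;
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

The smaller 32-entry data blocks are more likely to repeat than 256-entry ones, at the cost of one extra array access per lookup.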


Inversion List

• Compaction of set of codepoints
• Advantages
  – Simple
  – Very compact
  – Faster write than trie
  – Very fast boolean operations
• Disadvantages
  – Slower read than trie or hashtable


Inversion List Structure

• Structure
  – Index (optional)
  – List of codepoints in ascending order
• Example Set

[ 0020-0061, 0135, 19A3-201B ]

Index   Value   Range
0:      0020    in
1:      0062    out
2:      0135    in
3:      0136    out
4:      19A3    in
5:      201C    out


Inversion List Example

• Find smallest i such that c < data[i]
  – If no i, i = length
• Then c ∈ List ↔ odd(i)
• Examples:
  – In: 0023, 0135
  – Out: 001A, 0136, A357

Index   Value   Range
0:      0020    in
1:      0062    out
2:      0135    in
3:      0136    out
4:      19A3    in
5:      201C    out
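
The membership rule can be written directly as a binary search. This is a minimal sketch with a hypothetical class name `InversionList`; it uses a plain loop rather than the unrolled search discussed later.

```java
final class InversionList {
    final int[] data;   // ascending boundaries: range start, range end + 1, ...

    InversionList(int... boundaries) {
        data = boundaries.clone();
    }

    // Smallest i such that c < data[i], or data.length if there is none.
    int findIndex(int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    // c is in the set exactly when the index is odd.
    boolean contains(int c) {
        return (findIndex(c) & 1) == 1;
    }
}
```

With `new InversionList(0x20, 0x62, 0x135, 0x136, 0x19A3, 0x201C)`, the codepoints 0023 and 0135 give indices 1 and 3 (in), while 001A, 0136, and A357 give 0, 4, and 6 (out).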


Inversion List Operations

• Fast Boolean Operations
• Example: Negation

Before          After
Index   Value   Index   Value
0:      0020    0:      0000
1:      0062    1:      0020
2:      0135    2:      0062
3:      0136    3:      0135
4:      19A3    4:      0136
5:      201C    5:      19A3
                6:      201C
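
Negation only touches the front of the list: inserting a leading 0000 shifts every boundary by one position, which flips odd and even. A minimal sketch (the class and method names are hypothetical):

```java
import java.util.Arrays;

final class InversionListNegation {
    // Returns the complement of the set described by an inversion list.
    // If the list already starts at 0, that boundary is removed instead.
    static int[] negate(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] out = new int[list.length + 1];   // out[0] stays 0
        System.arraycopy(list, 0, out, 1, list.length);
        return out;
    }
}
```

Applying `negate` twice returns the original list.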


Inversion List: Binary Search

• from Programming Pearls
• Completely unrolled, precalculated parameters

int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {
    // no break statements: each case falls through to the next
    case 21: if (x < data[t = index-0x10000]) index = t;
    case 20: if (x < data[t = index-0x8000]) index = t;
    …
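
The slide's code is cut off, so the following is not a completion of it. It is a separate loop-form sketch of the same idea, a binary search that moves in power-of-two steps with one comparison per step; the class name `PowerOfTwoSearch` is hypothetical. Unrolling the loop for a known maximum length would give straight-line code of the kind the slide describes.

```java
final class PowerOfTwoSearch {
    // Returns the smallest i such that x < data[i], or data.length if none.
    // data must be sorted in ascending order.
    static int findIndex(int[] data, int x) {
        int n = data.length;
        int pos = 0;                                   // elements known to be <= x
        for (int step = Integer.highestOneBit(n); step > 0; step >>= 1) {
            int next = pos + step;
            if (next <= n && data[next - 1] <= x) {
                pos = next;                            // the first 'next' elements are all <= x
            }
        }
        return pos;
    }
}
```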


Inversion Map

• Inversion List plus
• Associated Values
  – Lookup index just as in Inversion List
  – Take corresponding value

Keys                Values
Index   Key         Index   Value
0:      0020        0:      0
1:      0062        1:      5
2:      0135        2:      3
3:      0136        3:      9
4:      19A3        4:      8
5:      201C        5:      3
                    6:      0
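
A minimal sketch of the lookup, assuming the value column on the slide reads 0, 5, 3, 9, 8, 3, 0 (one value per possible index 0..6). The class name `InversionMap` here is hypothetical.

```java
final class InversionMap {
    final int[] keys;     // ascending boundaries, as in an inversion list
    final int[] values;   // one more entry than keys: values[i] applies to index i

    InversionMap(int[] keys, int[] values) {
        if (values.length != keys.length + 1) {
            throw new IllegalArgumentException("need exactly one more value than keys");
        }
        this.keys = keys.clone();
        this.values = values.clone();
    }

    int get(int c) {
        int lo = 0, hi = keys.length;
        while (lo < hi) {                 // smallest i with c < keys[i], else keys.length
            int mid = (lo + hi) >>> 1;
            if (c < keys[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return values[lo];
    }
}
```

Under that reading, every codepoint in 0020..0061 maps to 5, 0135 alone maps to 9, and anything below 0020 or from 201C upward maps to 0.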


Key ➠ String Value

• Problem
  – Often almost all values are 1 codepoint
  – But, must map to strings in a few cases
  – Don’t want overhead for strings always
• Solution
  – Exception values indicate extra processing
  – Can use same solution for UTF-16 code units


Example

• Get a character ch
• Find its value v
• If v is in [D800..E000], may be string
  – check v2 = valueException[v - D800]
  – if v2 not null, process it, continue
• Process v
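
The steps above can be sketched as follows. Everything specific here is an assumption for illustration: the uppercase mapping, the single exception entry for U+00DF, and the `lookup` method, which stands in for a real trie lookup.

```java
final class ExceptionValues {
    static final int BASE = 0xD800;                       // start of the exception range
    static final String[] valueException = { "SS" };      // slot 0: string value for U+00DF

    // Hypothetical per-codepoint lookup standing in for a trie: returns either
    // a single mapped codepoint or BASE + slot for a string-valued exception.
    static int lookup(int ch) {
        if (ch == 0x00DF) {
            return BASE + 0;
        }
        return Character.toUpperCase(ch);
    }

    static String map(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(ch -> {
            int v = lookup(ch);
            int slot = v - BASE;
            if (slot >= 0 && slot < valueException.length && valueException[slot] != null) {
                out.append(valueException[slot]);         // exception: value is a string
            } else {
                out.appendCodePoint(v);                   // common case: one codepoint
            }
        });
        return out.toString();
    }
}
```

Only the rare string-valued entries pay for a `String`; every other value stays a plain integer in the table.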


String Key ➠ Value

• Problem
  – Often almost all keys are 1 codepoint
  – Must have string keys in a few cases
  – Don’t want overhead for strings always
• Solution
  – Exception values indicate possible follow-on codepoints
  – Can use same solution for UTF-16 code units
  – Use key closure!


Closure

• If (X + Y) is a key, then X is a key

Before          ➠   After
s ➠ x               s ➠ x
sh ➠ y              sh ➠ y
                    shc ➠ yw
shch ➠ z            shch ➠ z
c ➠ w               c ➠ w


Why Closure?

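[Figure: matching the input “s h c h a …” one codepoint at a time. “sh” yields y, “shc” yields yw, and “shch” yields z; “shcha” is not found, so the last match is used.]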
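
Closure is what lets a matcher extend one codepoint at a time and stop at the first miss, without ever backing up. A minimal sketch, assuming a `java.util.Map` as the table and keys made of BMP characters only (the class name `ClosedKeyLookup` is hypothetical):

```java
import java.util.Map;

final class ClosedKeyLookup {
    // Replaces each longest matching key in text with its value; characters
    // that start no key are copied through. Assumes the key set is closed
    // under prefixes, so scanning can stop at the first prefix that is not a key.
    static String transform(Map<String, String> map, String text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos;
            while (end < text.length() && map.containsKey(text.substring(pos, end + 1))) {
                end++;                                   // extend the match by one character
            }
            if (end == pos) {
                out.append(text.charAt(pos++));          // no key starts here
            } else {
                out.append(map.get(text.substring(pos, end)));   // not found: use last match
                pos = end;
            }
        }
        return out.toString();
    }
}
```

With the closed table, “shcha” yields “za” and “shca” yields “ywa”. Without the added key shc, the scan over “shcha” would stop after “sh” and never reach “shch”.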


Bitpacking

• Squeeze information into value
• Example: Character Properties
  – category: 5 bits
  – bidi: 4 bits (+ exceptions)
  – canonical category: 6 bits + expansion
      • compressCanon = (bits >> SHIFT) & MASK;
      • canon = expansionArray[compressCanon];
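
A minimal sketch of such packing, using the field widths from the slide. The field order, the class name `PackedProperties`, and the sample contents of `expansionArray` are assumptions for illustration.

```java
final class PackedProperties {
    static final int CATEGORY_BITS = 5, BIDI_BITS = 4, CANON_BITS = 6;
    static final int BIDI_SHIFT = CATEGORY_BITS;                 // assumed field order
    static final int CANON_SHIFT = CATEGORY_BITS + BIDI_BITS;
    static final int CATEGORY_MASK = (1 << CATEGORY_BITS) - 1;
    static final int BIDI_MASK = (1 << BIDI_BITS) - 1;
    static final int CANON_MASK = (1 << CANON_BITS) - 1;

    // Sample table: the packed 6-bit field is an index into this array,
    // which holds the full-width values.
    static final int[] expansionArray = {0, 1, 7, 8, 9, 202, 216, 220, 230, 240};

    static int pack(int category, int bidi, int canonIndex) {
        return (category & CATEGORY_MASK)
             | ((bidi & BIDI_MASK) << BIDI_SHIFT)
             | ((canonIndex & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) {
        return bits & CATEGORY_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    static int canon(int bits) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return expansionArray[compressCanon];
    }
}
```

All three properties fit in 15 bits, so one 16-bit trie value can carry them together.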


Statetables

• Classic:
  – entry = stateTable[ state, ch ];
  – state = entry.state;
  – doSomethingWith( entry.action );
  – until (state < 0);


Statetables

• Unicode:
  – type = trie[ch];
  – entry = stateTable[ state, type ];
  – state = entry.state;
  – doSomethingWith( entry.action );
  – until (state < 0);
• Also, String Key ➠ Value
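
The point of the Unicode variant is that the table has one column per character type rather than per codepoint. A minimal sketch, assuming three made-up types and a two-state table that recognizes a letter followed by letters or digits; `typeOf` stands in for the trie lookup and the per-entry action is omitted.

```java
final class TypeStateTable {
    static final int OTHER = 0, LETTER = 1, DIGIT = 2;

    // stateTable[state][type] gives the next state; a negative value means stop.
    static final int[][] stateTable = {
        { -1, 1, -1 },   // state 0: start, needs a letter
        { -1, 1,  1 },   // state 1: inside a token, letters and digits continue it
    };

    // Stand-in for type = trie[ch]: collapses codepoints into a few classes.
    static int typeOf(int ch) {
        if (Character.isLetter(ch)) {
            return LETTER;
        }
        if (Character.getType(ch) == Character.DECIMAL_DIGIT_NUMBER) {
            return DIGIT;
        }
        return OTHER;
    }

    // Returns the index just past the token that starts at 'start'
    // (equal to 'start' if no token begins there).
    static int tokenEnd(String text, int start) {
        int state = 0;
        int pos = start;
        while (pos < text.length()) {
            int ch = text.codePointAt(pos);
            state = stateTable[state][typeOf(ch)];
            if (state < 0) {
                break;
            }
            pos += Character.charCount(ch);
        }
        return pos;
    }
}
```

The table stays at 2 × 3 entries no matter how many codepoints exist; only the type lookup has to scale with Unicode.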


Sample Data Structures: ICU

• Trie: CompactArray
  – Customized for each datatype
  – Automatic expansion
  – Compact after setting
• Character Properties
  – use CompactArray, Bitpacking
• Inversion List: UnicodeSet
  – Boolean Operations


Sample Usage #1: ICU

• Collation
  – Trie lookup
  – Expanding character: Key ➠ String Value
  – Contracting character: String Key ➠ Value
• Break Iterators
  – For grapheme, word, line, sentence break
  – Statetable


Sample Usage #2: ICU

• Transliteration
  – Requires
      • Mapping codepoints in context to others
      • Rearranging codepoints
      • Controlling the choice of mapping
  – Character Properties
  – Inversion List
  – Exception values


Sample Usage #3: ICU

• Character Conversion
  – From Unicode to bytes
      • Trie
  – From bytes to Unicode
      • Arrays for simple maps
      • Statetables for complex maps
          – recognizes valid / invalid mappings
          – provides compaction
• Complications
  – Invalid vs. Valid mapped vs. Valid unmapped
  – Fallbacks


References

• Unicode Open Source: ICU
  – http://oss.software.ibm.com/icu
  – ICU4j: Java API
  – ICU4c: C and C++ APIs
• Other references: see Mark’s website:
  – http://www.macchiato.com


Q & A

