Post on 10-Feb-2016
Collation in ICU
Mark Davis
Chief SW Globalization Architect
IBM Globalization Center of Competency
San Jose, California — 04/22/2322st International Unicode Conference 2
Collation = Sorting Order
How hard can it be?
  A < B < C < …
Complications
  Languages are complex and varied
  Unicode is a big set of characters
  Performance is crucial
Varies By:
Language        Swedish: z < ö        German: ö < z
Usage           Dictionary: öf < of   Telephone: of < öf
Customizations  A < a                 a < A
Versioning      Fixes, New Gov. Stds, New Characters
Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
   ignored if there is an L1 character difference
3. Case: ao < Ao < aò
   ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
   ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order
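The first three levels can be sketched as a level-by-level comparison. This is only a toy for precomposed Latin letters, not ICU's algorithm: the function names and the choice of weights (accent marks compared by code point, lowercase before uppercase) are assumptions made for illustration, and punctuation (level 4) is not modelled.

```python
import unicodedata

def level_keys(s):
    """Return (primary, secondary, tertiary) lists for a toy Latin collation."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", s):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())      # L1: base letter only
        secondary.append(marks)           # L2: accents ("" means unaccented)
        tertiary.append(base.isupper())   # L3: lowercase sorts before uppercase
    return primary, secondary, tertiary

def compare(a, b, strength=3):
    """Compare level by level; a lower level matters only if higher levels tie."""
    ka, kb = level_keys(a), level_keys(b)
    for level in range(strength):
        if ka[level] != kb[level]:
            return -1 if ka[level] < kb[level] else 1
    return 0
```

With `strength=1` the call `compare("as", "às", strength=1)` returns 0, because the accent difference is only visible at level 2.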
Context Sensitivity
Contractions
  H < Z, but CZ < CH
Expansions
  OE < Œ < OF
Both
  カー < カイ
  キー > キイ
Canonical Equivalence
Å ≡ Å ≡ A + º
x + . + ^ ≡ x + ^ + .
ự ≡ u + ’
  ≡ ư + .
  ≡ ụ + ’
  ≡ u + . + ’
  ≡ u + ’ + .
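These equivalences can be checked with Python's standard `unicodedata` module, which implements Unicode normalization. The helper name `canonically_equivalent` is made up for this sketch; two strings are canonically equivalent exactly when their NFD forms are identical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

A_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN
RING = "\u030A"          # COMBINING RING ABOVE
DOT_BELOW = "\u0323"     # COMBINING DOT BELOW
CIRCUMFLEX = "\u0302"    # COMBINING CIRCUMFLEX ACCENT
HORN = "\u031B"          # COMBINING HORN

# Five spellings of "u with horn and dot below".
u_horn_dot_forms = [
    "\u1EF1",                    # precomposed
    "\u01B0" + DOT_BELOW,        # u-horn + dot below
    "\u1EE5" + HORN,             # u-dot-below + horn
    "u" + DOT_BELOW + HORN,
    "u" + HORN + DOT_BELOW,
]
```

For example, `canonically_equivalent(A_RING, "A" + RING)` is true, and every pair drawn from `u_horn_dot_forms` compares as equivalent.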
Oddities
Normal accents
  cote < coté < côte < côté
  • first accent difference determines order
French accents
  cote < côte < coté < côté
  • last accent difference determines order
Logical Order Exception (Thai, Lao) เ ก sorts like ก เ
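The two accent orders can be reproduced with a toy two-level key in which the French variant compares the accent level back to front. The function `accent_key` and its weighting (accent marks compared by code point) are assumptions for illustration, not ICU's data; the Thai/Lao reordering is not covered.

```python
import unicodedata

def accent_key(word, backwards=False):
    """Toy key: base letters first, then accents (reversed for French order)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(decomposed[0])     # base letter
        secondary.append(decomposed[1:])  # accent marks, "" if none
    if backwards:
        secondary.reverse()               # last accent difference decides
    return primary, secondary

words = ["côté", "coté", "côte", "cote"]
normal = sorted(words, key=accent_key)
french = sorted(words, key=lambda w: accent_key(w, backwards=True))
```

Here `normal` follows the "first accent difference" order and `french` the "last accent difference" order listed above.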
Merging Database Fields
F1 = LastName, F2 = FirstName
Sequential        Weak 1st          Merged
F1, then F2       F1 (L1), F2       L1, L2, L3
diSilva, John     diSilva, John     diSilva, John
diSilva, Fred     dísilva, John     di Silva, John
di Silva, John    di Silva, John    dísilva, John
di Silva, Fred    di Silva, Fred    diSilva, Fred
dísilva, John     diSilva, Fred     di Silva, Fred
dísilva, Fred     dísilva, Fred     dísilva, Fred
Customizations
Parameters that change collation behavior
Choice of language (locale)
Runtime choices
Examples to follow
Parametric Customizations
Strength
  Base
  Base+Accent
  Base+Accent+Case
  &c.
Case: A < a a < A
Punctuation: di Silva < diSilva diSilva < di Silva
Punctuation (Alternates)
Base Character
  di silva
  di Silva
  Di silva
  Di Silva
  Dickens
  disilva
  diSilva
  Disilva
  DiSilva
Ignorable
  Dickens
  di silva
  disilva
  di Silva
  diSilva
  Di silva
  Disilva
  Di Silva
  DiSilva
Extended Customizations
User-defined
  “&” ≡ “ampersand”
Merging tailorings
  Iranian + French
Script Order
  b < ב < β < б
  β < b < б < ב
Numbers
  A-10 < A-2
  A-2 < A-10
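The two number orders differ in whether digit runs are compared character by character or by numeric value. A minimal sketch of the numeric variant follows; `numeric_key` is a hypothetical helper, not an ICU function.

```python
import re

def numeric_key(s):
    """Split into text and digit runs; digit runs compare by numeric value."""
    parts = re.split(r"(\d+)", s)   # odd indices hold the digit runs
    key = []
    for i, part in enumerate(parts):
        if i % 2:
            key.append((1, int(part), ""))
        else:
            key.append((0, 0, part))
    return key

labels = ["A-10", "A-2"]
plain_order = sorted(labels)                     # character by character
numeric_order = sorted(labels, key=numeric_key)  # digits as numbers
```

`plain_order` keeps A-10 before A-2 because "1" precedes "2", while `numeric_order` puts A-2 first.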
Collation also used for:
Searching
  ignore case, accent options
Selection
  Return all records where
  • Jones ≤ name < Smith
Graphemes
  What a user considers a “character”
Regular expressions (Level 3)
  • See UTR #18, UTR #29
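Selection by range can be sketched as a binary search over keys that sort in collation order. The key below is a toy that ignores case and accents; `key` and `select_range` are names invented for this example, and a real system would use the collator's own sort keys.

```python
import bisect
import unicodedata

def key(s):
    """Toy primary-strength key: drops accents and case."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

def select_range(names, low, high):
    """Return the names with low <= name < high under the toy collation."""
    ordered = sorted(names, key=key)
    keys = [key(n) for n in ordered]
    start = bisect.bisect_left(keys, key(low))
    end = bisect.bisect_left(keys, key(high))
    return ordered[start:end]

# "M\u00fcller" is Müller, written with an escape to keep the literal unambiguous.
names = ["smith", "Jones", "Adams", "JONES", "M\u00fcller", "Smythe", "Nguyen"]
selected = select_range(names, "Jones", "Smith")
```

A plain code point comparison would give a different answer here, since "JONES" sorts before "Jones" in binary order.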
UCA
UTS #10: Unicode Collation Algorithm
  Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
  Default ordering: all Unicode code points
  Provides for tailoring to given languages
Also see: The Unicode Standard, §5.17: Sorting and Searching
Aligned with ISO 14651
APIs
String Compare
Sort Keys
String Search
Special-Purposes
  Sortkeys that bracket “Smith”
  • X <= Smith* < Y
  Merged sortkeys
Sort Keys
Transform a string into a series of bytes that binary-compare in collation order
a: 06 C3 01 20 01 02 00
A: 06 C3 01 20 01 08 00
á: 06 C3 01 20 32 01 02 02 00
ab: 06 C3 06 D7 01 20 20 01 02 02 00
b: 06 D7 01 20 01 02 00
Level 1 Level 2 Level 3
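The layout above (level 1 weights, a 01 separator, level 2 weights, 01, level 3 weights, 00 terminator) can be imitated with a toy key builder. The weights below are invented for the letters a to z and do not match ICU's values; only the secondary 20 and tertiary 02/08 bytes are borrowed from the slide.

```python
import unicodedata

def sort_key(s):
    """Toy sort key for a-z with optional accents; weights are made up."""
    primary, secondary, tertiary = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFC", s):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(0x10 + ord(base.lower()) - ord("a"))  # assumes a-z
        secondary.append(0x32 if marks else 0x20)            # any accent: 32
        tertiary.append(0x08 if base.isupper() else 0x02)    # upper: 08
    return (bytes(primary) + b"\x01" + bytes(secondary) + b"\x01"
            + bytes(tertiary) + b"\x00")

# "\u00e1" is á (a with acute).
strings = ["b", "ab", "\u00e1", "A", "a"]
ordered = sorted(strings, key=sort_key)  # plain byte comparison of the keys
```

Because the separator 01 is lower than every weight, a shorter primary sequence always sorts first, which is why "á" lands before "ab".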
String Compare vs. Sort Keys
Same results in either case
SC faster for single comparisons
  average 5 to 10 times!
SK faster for multiple comparisons
  index once
  binary compare many times
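The "index once, compare many times" side of this trade-off can be sketched by counting how often key material is derived during a sort. The toy `compare` below rebuilds keys on every call, so it does not reproduce the incremental comparison that makes a real string compare fast for a single pair; all names are assumptions for illustration.

```python
import functools
import unicodedata

counter = {"keys": 0}

def sort_key(s):
    """Toy primary-strength key; counts how often it is computed."""
    counter["keys"] += 1
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

def compare(a, b):
    """Pairwise compare: does its work again on every call."""
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

names = ["Zoë", "adam", "Émile", "bob", "Adam", "zoe"]

counter["keys"] = 0
by_compare = sorted(names, key=functools.cmp_to_key(compare))
work_with_compare = counter["keys"]

counter["keys"] = 0
by_key = sorted(names, key=sort_key)   # one key per string, then byte compares
work_with_keys = counter["keys"]
```

Both sorts give the same order; the key-based sort computes exactly one key per string.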
String Search
Naïve Approach
  key matches in target at <x, y>
  iff target.substring(x, y) ≡ key
Boundary Complications
  Ignorables: “a” matches in “(a)”?
  • at <0,2> & <1, 2> & <0,3> & <1,3>?
  Contractions: “c” matches in “churo”?
  Normalization: “å” matches in “a¸˚”?
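The ignorables case can be reproduced by applying the naïve definition literally with a toy key that drops punctuation. `key` and `naive_matches` are hypothetical names, and the sketch deliberately enumerates every substring to show the ambiguity rather than to search efficiently.

```python
import unicodedata

def key(s):
    """Toy key in which punctuation is ignorable."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd
                   if not unicodedata.category(c).startswith("P"))

def naive_matches(pattern, target):
    """All <x, y> with target[x:y] equivalent to pattern under the toy key."""
    wanted = key(pattern)
    return [(x, y)
            for x in range(len(target))
            for y in range(x + 1, len(target) + 1)
            if key(target[x:y]) == wanted]

matches = naive_matches("a", "(a)")
```

The single letter matches at four different boundary pairs, so a search API has to pick a policy (for example, the shortest or the longest match).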
WARNING 1: Basics
Not aligned with character set or repertoire
  Latin-1: Swedish and German sorting differ
Not code point (binary) order
  Binary: Z < a < v < w
  English: Z > a
  Swedish: v ≡ w
Not a property of strings
  With same database
  • Swedish user: view/select
  • German user: view/select
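The same list can come out in three different orders depending on the comparison used. The two "language" keys below are crude stand-ins (case folding, and folding w into v) chosen only to mirror the examples above; they are not real English or Swedish tailorings.

```python
def english_like_key(s):
    """Toy key: case is ignored, so Z sorts after a."""
    return s.casefold()

def swedish_like_key(s):
    """Toy key: additionally treats w as equal to v."""
    return s.casefold().replace("w", "v")

words = ["wa", "Zebra", "apple", "vb"]
binary_order = sorted(words)                         # code point order
english_order = sorted(words, key=english_like_key)
swedish_order = sorted(words, key=swedish_like_key)
```

The order is a property of the comparison chosen by each user, not of the stored strings.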
WARNING 2: Operations
Order not preserved under concatenation / substringing
x < y ↛ xz < yz
x < y ↛ zx < zy
xz < yz ↛ x < y
zx < zy ↛ x < y
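A contraction is enough to produce counterexamples for all four implications. The sketch assumes a toy traditional-Spanish-style key in which "ch" is a single unit sorting between c and d; the weights and names are invented and only lowercase a to z is handled.

```python
def spanish_like_key(s):
    """Toy key for lowercase a-z: 'ch' is one unit between c and d."""
    weights = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            weights.append(2 * (ord("c") - ord("a")) + 1)  # just after c
            i += 2
        else:
            weights.append(2 * (ord(s[i]) - ord("a")))
            i += 1
    return weights

def less(a, b):
    return spanish_like_key(a) < spanish_like_key(b)

# x < y does not imply xz < yz: "c" < "cz", yet "ch" > "czh".
suffix_case = less("c", "cz") and not less("c" + "h", "cz" + "h")
# x < y does not imply zx < zy: "h" < "i", yet "ch" > "ci".
prefix_case = less("h", "i") and not less("c" + "h", "c" + "i")
```

Reading the same pairs in the other direction gives the two converse failures: "czh" < "ch" although "cz" is not less than "c", and "ci" < "ch" although "i" is not less than "h".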
WARNING 3: Dependence
Collation is a relation over strings
Sort keys embody part of that relation
Thus, comparing sort keys from different tailorings (or parameters) gives undefined results.
  C < CH < D
  May move binary value for D
WARNING 4: Stability
Stable Sort
  Records with equal comparison come out in original order
  Property of algorithm, not comparison
Semi-Stable Comparison
  x ≠ y → x ≢ y
  Property of comparison, not algorithm
  Degrades performance
  Doesn’t do what people think (or really want)!
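The difference can be seen with Python's built-in sort, which is a stable algorithm. The primary-strength key and the record data below are made up for this sketch; the semi-stable variant appends the NFD string as a tie-breaker, so no two distinct strings compare equal.

```python
import unicodedata

def primary_key(s):
    """Toy key: ignores accents and case, so several records tie."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

# (name, record id); "\u00e9" is é.
records = [("resume", 1), ("R\u00e9sum\u00e9", 2), ("apple", 3),
           ("RESUME", 4), ("r\u00e9sum\u00e9", 5)]

# Stable sort: tied records keep their original relative order.
stable = sorted(records, key=lambda r: primary_key(r[0]))

# Semi-stable comparison: ties are broken by NFD code point order instead.
semi_stable = sorted(
    records,
    key=lambda r: (primary_key(r[0]), unicodedata.normalize("NFD", r[0])))

stable_ids = [rid for _, rid in stable]
semi_stable_ids = [rid for _, rid in semi_stable]
```

The semi-stable result is deterministic but no longer reflects the input order, which is usually what people mean when they ask for a "stable" result.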
Implementation Details
Many possible implementations
ICU as example here.
What is ICU?
Internationalization libraries for C, C++, Java*
Open source – non-viral
Sponsored by IBM
* Sun’s Java licenses an earlier ICU version; ICU4J updates it.
Unicode standard compliant
  full supplementary support
Cross-platform; extensible and customizable
High performance and thread-safe
  Multiple locales in same thread – simultaneously
http://oss.software.ibm.com/icu/
ICU Features
Unicode text handling
Character set conversions (700+)
Collation & Searching
Locales (170+)
Resource Bundles
Calendar & Time zones
Complex-text layout engine
Breaks: character, word, line, & sentence
Formatting
  Date & time
  Messages
  Numbers & currencies
Transforms
  Normalization
  Casing
  Transliterations
Java
Sun licensed and includes an early version of ICU collation in Java
Latest ICU Java version:
  Dramatically faster
  Much lower in memory consumption
  Halved sortkey length
  Many additional features
ICU/Java Collation Architecture
L1-3, contractions, expansions, …
Locale tailorings
Fully rule-based specification
Arbitrary runtime user customizations
  & ‘?’ = ‘question mark’
  & ‘$’ = ‘dollar sign’
  & z < ‘george’
ICU Collation I
Full UCA compliance
Full supplementary character support
Solid performance
Small sort-keys
Small Memory Footprint
ICU Collation II
Parametric control
Tailorable to any language
Multiple Versions simultaneously
Memory Requirements
Flat-file (memory mapped)
  speeds initialization
  reduces memory footprint (next slide)
Delta Tailoring
  Single copy of UCA (≈80K)
  Small delta files per locale
Memory Mappable
Old: separate allocations
New: offsets within mem-map
Delta Tailoring
[Flow diagram: “a” → FR (found / not found) → UCA (found / not found) → code (synthesized)]
Sort Key Compression
Common weights are 1-byte
  Primary, secondary, tertiary, quaternary
Sequences are compressed
UTF-16 Values for “Märk Davis” (22 bytes)
  004D 00E4 0072 006B 0020 0044 0061 0076 0069 0073 0000
Sort Key (L3, ignorable punctuation - 19 bytes)
  2F 17 39 2B 1D 17 41 27 3B 01 77 96 0A 01 8F 80 8F 07 00
Simultaneous Multiple Versions
Programs can link against different versions of ICU, simultaneously!
Preserves exact binary order over time.
Performance: Coding
Avoided unnecessary function calls.
  Example: strlen too expensive!
Avoided excess object creation
  Reduce, Reuse, Recycle
Fast-pathed common cases
Used stack memory buffers
  (with expansion if necessary)
Made inner loops as tight as possible
Performance: Algorithmic
Checks for identical prefixes
Tolerant of most unnormalized text
  invokes normalization rarely
Compressed sort keys
Incremental length/normalization
FCD format
Fast C or D (FCD)
Accepts all NFD, most NFC, without normalization
X                    FCD  NFC  NFD
A-ring               Y    Y
Angstrom             Y
A + ring             Y         Y
A + grave            Y         Y
A-ring + grave       Y
A + cedilla + ring   Y         Y
A + ring + cedilla
A-ring + cedilla          Y
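The FCD column can be checked with a short test over canonical combining classes. The sketch assumes the usual definition of the FCD condition (each character's decomposition must not start with a non-zero combining class lower than the class ending the previous character's decomposition) and uses only Python's `unicodedata`; `is_fcd` is a name chosen here, not an ICU call.

```python
import unicodedata

def lead_trail_ccc(ch):
    """Combining classes of the first and last code points of ch's NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(s):
    """True if s could be decomposed character by character without reordering."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

A_RING, ANGSTROM = "\u00C5", "\u212B"
RING, GRAVE, CEDILLA = "\u030A", "\u0300", "\u0327"
```

For instance, `is_fcd("A" + CEDILLA + RING)` is true, while `is_fcd(A_RING + CEDILLA)` is false because the cedilla (class 202) would have to move in front of the ring (class 230).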
Perf: ICU vs. Windows, glibc
Function: Full UCA!
String comparison: comparable
  ≈ 20% worse to 400% better
Sort keys: much shorter
  ≈ half as long
Warning: speed comparisons are approximate!
  Depends on data, parameters, features, CPU
Perf: ICU vs. Java
Function: Full UCA!
String comparison: faster
  ≈ 2-3 times better
Sort keys: shorter
  ≈ half as long
Also available: JNI version
Warning: speed comparisons are approximate!
  Depends on data, parameters, features, CPU
More Information
ICU
  http://oss.software.ibm.com/icu/
Design Document
  http://oss.software.ibm.com/cvs/icu/icuhtml/design/collation/
Latest Version of these slides
  http://www.macchiato.com
Q & A
Backup Slides
Not used in the presentation, except in response to questions
WARNING 5: Math. Relation
S = {Unicode Strings}
Reflexive
  ∀a ∊ S: a ≤ a
Antisymmetric
  ∀a, b ∊ S: a ≤ b & b ≤ a → a = b
Transitive
  ∀a, b, c ∊ S: a ≤ b & b ≤ c → a ≤ c
Total
  ∀a, b ∊ S: a ≤ b ∨ b ≤ a
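These properties can be checked by brute force over a small sample. The relation below is a toy primary-strength comparison (accents and case ignored), an assumption made for this sketch; under it, antisymmetry holds only up to collation equivalence, not string identity.

```python
import itertools
import unicodedata

def key(s):
    """Toy primary-strength key: ignores accents and case."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).casefold()

def le(a, b):
    return key(a) <= key(b)

sample = ["a", "A", "\u00e1", "b", "ab", ""]   # "\u00e1" is á
pairs = list(itertools.product(sample, repeat=2))
triples = list(itertools.product(sample, repeat=3))

reflexive = all(le(a, a) for a in sample)
transitive = all(le(a, c) for a, b, c in triples if le(a, b) and le(b, c))
total = all(le(a, b) or le(b, a) for a, b in pairs)
# a <= b and b <= a forces equal keys, but not identical strings.
antisymmetric_on_keys = all(key(a) == key(b)
                            for a, b in pairs if le(a, b) and le(b, a))
antisymmetric_on_strings = all(a == b
                               for a, b in pairs if le(a, b) and le(b, a))
```

In this sample "a" and "A" are each ≤ the other without being the same string, so the last check fails while the key-based one passes.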
Identical Prefixes
Sorting / Searching Databases
  Many comparisons to “close” strings
Check initial prefixes with binary compare
Drop into collation loop at first difference
Complication…
Initial Prefix Complication
Need to back up if in “bad” position:
Type                    Example
Contraction (Spanish)   c h
Normalization           a °
Surrogate Pair          <L> <T>
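The prefix check with back-up can be sketched as follows. The contraction set is hypothetical tailoring data, the function names are invented, and the surrogate-pair case does not arise because Python strings are sequences of code points rather than UTF-16 units.

```python
import unicodedata

CONTRACTIONS = {"ch"}   # hypothetical data for a traditional-Spanish tailoring

def unsafe(s, i):
    """True if starting the collation loop at index i would split a unit."""
    if i == 0 or i >= len(s):
        return False
    if unicodedata.combining(s[i]):          # mark belongs to the previous base
        return True
    return s[i - 1:i + 1] in CONTRACTIONS    # middle of a contraction

def safe_start(a, b):
    """Length of the binary-identical prefix, backed up to a safe position."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    while i > 0 and (unsafe(a, i) or unsafe(b, i)):
        i -= 1
    return i
```

For "chico" versus "cielo" the binary prefix is "c", but the first string continues with "h", so the comparison has to restart at index 0 to see the contraction as a whole.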
Fractional UCA
Fractional weights for compression
Gaps for tailoring, future UCA additions
Only stores differences in tailoring file
Reduces memory footprint

            UCA (a æ ɒ b)          Frac. UCA (a æ ɒ b)
primary     0861 0865 0871 0875    17 18 60 18 66 19
secondary   20 20 20 20            03 03 03 03
tertiary    02 02 02 02            03 03 03 03
Exceptional Values
Normal weight storage
  PPPPPPPPPPPPPPPP SSSSSSSS CC TTTTTT
  16b              8b       1 1  6b
Special Weight Storage
  FFFF TTTT dddddddddddddddddddddddd
  4b   4b Tag   24 bit data
  NOT_FOUND, EXPANSION, CONTRACTION, THAI, …
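The two 32-bit layouts can be sketched with plain bit operations. The field widths come from the slide; everything else is an assumption for illustration: that the four F bits are all ones and act as the "special" flag, the tag number used in the example, and the function names. The slide does not say what the two C bits mean, so they are carried as an opaque 2-bit field.

```python
def pack_normal(primary, secondary, c_bits, tertiary):
    """16-bit primary, 8-bit secondary, 2 C bits, 6-bit tertiary."""
    assert primary < (1 << 16) and secondary < (1 << 8)
    assert c_bits < (1 << 2) and tertiary < (1 << 6)
    return (primary << 16) | (secondary << 8) | (c_bits << 6) | tertiary

def unpack_normal(ce):
    return ((ce >> 16) & 0xFFFF, (ce >> 8) & 0xFF, (ce >> 6) & 0x3, ce & 0x3F)

def pack_special(tag, data):
    """Flag nibble (assumed to be 0xF), 4-bit tag, 24-bit data."""
    assert tag < (1 << 4) and data < (1 << 24)
    return (0xF << 28) | (tag << 24) | data

def unpack_special(ce):
    return ((ce >> 24) & 0xF, ce & 0xFFFFFF)

def is_special(ce):
    return (ce >> 28) == 0xF

normal = pack_normal(0x0861, 0x20, 0, 0x02)
special = pack_special(2, 0x000123)   # tag 2 is an arbitrary example value
```

Under this assumption a reader of the table first tests `is_special` and only then decides whether to interpret the word as weights or as a tag with data.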