Last revised: 10 December 2006
HKIUG Unicode Task Force and the EACC to Unicode Migration
Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library
[email protected]
7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library
  • HKIUG Unicode Task Force and the EACC to Unicode Migration
    Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library
    [email protected]
    7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library

  • Contents
    - HKIUG Unicode Task Force
    - CJK/Unicode Resources and the Unicode Version of TSVCC Table
    - Migrating INNOPAC's storage environment from EACC to Unicode
    - MARC-8 and Unicode Environments
    - Outstanding Issues

  • Observations

  • 曆 (U+66C6) [Calendar] and 歷 (U+6B77) [History]: 历 (U+5386) is the Simplified form of both

  • Observation #1: Although OCLC WorldCat's storage environment has been migrated to Unicode and its Connexion client is Unicode-based, the work is not yet finished. There are still problems that require attention.
    How about INNOPAC and its Unicode Storage Environment? How ready is it for existing EACC-based sites to migrate to?

  • U+5386

  • Export (in MARC-8)

  • Export output is {27 46 2A}: incorrect!

  • Observation #2: The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only be achieved when the majority of systems store and use data natively in Unicode.
    There is an immediate need for INNOPAC sites to migrate to the Unicode storage environment!
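
The round-trip failure can be illustrated with the multi-mapping recorded for U+5386 in the TSVCC examples of this deck (EACC 27462A and 274349 both map to it). The following is a minimal sketch, assuming a simple dictionary-based crosswalk; the dictionaries and function names are hypothetical and do not represent any vendor's actual conversion routine.

```python
# Hypothetical crosswalk tables. Only the two EACC codes and U+5386
# come from the slides; everything else is illustrative.
EACC_TO_UNICODE = {
    "27462A": "\u5386",
    "274349": "\u5386",
}

# A reverse table can hold only one EACC code per Unicode character.
# Assumption: the exporter picks 27462A, as in the export output shown above.
UNICODE_TO_EACC = {"\u5386": "27462A"}


def round_trip(eacc_code: str) -> str:
    """Convert EACC -> Unicode -> EACC and return the resulting EACC code."""
    return UNICODE_TO_EACC[EACC_TO_UNICODE[eacc_code]]


def survives_round_trip(eacc_code: str) -> bool:
    """True when the code that comes back is the code that went in."""
    return round_trip(eacc_code) == eacc_code
```

Under these assumptions `survives_round_trip("27462A")` holds while `survives_round_trip("274349")` does not: once two EACC codes share one Unicode code point, the original distinction cannot be recovered on export.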

  • HKIUG Unicode Task Force
    In 2003-2004, an ad hoc group of systems librarians and catalogers from member libraries worked closely with Innovative Interfaces, Inc. (III) on issues related to CJK and the EACC to Unicode mappings.
    - Developed HKIUG Version of the EACC to Unicode mapping table
    - Resolved EACC to Unicode multi-mapping problem
    - Began drafting TSVCC (Traditional, Simplified, Variant Chinese Characters) table

  • HKIUG Unicode Task Force [2]
    In February 2005, the HKIUG Unicode Task Force was officially established to:
    - maintain the CJK/Unicode resources produced in 2003-2004;
    - develop new resources, such as the Unicode Version of the TSVCC table;
    - facilitate the searching, display and retrieval of CJK records in library catalogs; and
    - assist member libraries in migrating from EACC-based character encoding to Unicode

  • HKIUG Unicode Task Force [3]
    Members of the Task Force:
    - CHAN Wai Ming (Secretary), University of Hong Kong
    - HO Yee Ip, Chinese University of Hong Kong
    - LAM Ki Tat (Chair), The Hong Kong University of Science and Technology
    - Joanna PONG, City University of Hong Kong
    - SUN Zehua, The Hong Kong University of Science and Technology
    - Mr. Philip WONG, City University of Hong Kong
    Recruiting new members: we welcome colleagues to join forces

  • HKIUG Unicode Task Force [4]
    Achievements in 2006:
    - July 2006 - finished and released the Unicode Version of the TSVCC Table
    - August 2006 - released the CJK/Unicode Resources developed over the past three years to the Internet for open access [http://hkiug.ln.edu.hk/unicode/]
    - November 2006 - visited Hong Kong Shue Yan College (HKSYC) Library to study its Unicode Storage Environment, and reported outstanding issues to III

  • TSVCC Table - Unicode Version
    - When searching for "Li fa", you will want to retrieve records that contain either the Traditional or the Simplified form, since the two forms have a Traditional-Simplified relationship
    - Similarly, when searching for a character, you will want to retrieve its Variant forms
    - This requires linking T, S, V forms during searching

  • TSVCC Table - Unicode Version [2]
    Results of implementing TSVCC Linking:
    - Improvement in searching: higher recall
    - Trade-off: lower precision
    - If search results are sorted/displayed in TSVCC normalized form, misleading and inaccurate display may occur, such as the OCLC Connexion browse list display problem mentioned previously
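
The recall/precision trade-off can be sketched as follows, assuming that linking is implemented by normalizing every character to one representative of its link group before indexing and searching. The group used here is the one the Unicode Version lists for U+66C6; the choice of representative and the function names are hypothetical.

```python
# One TSVCC link group, taken from the link-case examples in this deck.
LINK_GROUP = ["\u66C6", "\u5386", "\u66A6", "\u6B77", "\u6B74", "\uF98B", "\uF98C"]

# Assumption: the first member of the group serves as the normalized form.
NORMALIZE = {ch: LINK_GROUP[0] for ch in LINK_GROUP}


def normalize(text: str) -> str:
    """Replace each linked character by its group's representative."""
    return "".join(NORMALIZE.get(ch, ch) for ch in text)


def matches(query: str, heading: str) -> bool:
    """A heading is retrieved when its normalized form contains the normalized query."""
    return normalize(query) in normalize(heading)
```

With this sketch a query for U+5386 retrieves a heading containing U+66C6 (higher recall), but a query for U+6B77 [History] also retrieves U+66C6 [Calendar] (lower precision). Displaying `normalize(...)` output instead of the stored text would show U+66C6 for every member of the group, which is the kind of misleading display described above.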

  • TSVCC Table - Unicode Version [3]
    HKIUG Unicode Task Force constructed two versions of the TSVCC table, for INNOPAC systems that store characters in EACC and in Unicode respectively:
    - EACC Version [1.0 released August 2005]
    - Unicode Version [1.0 released July 2006]

  • TSVCC Table - Unicode Version [4]
    TSVCC link cases collected in the Unicode Version are:
    - derived from the EACC Version, e.g. EACC link, U+XXXX multi-mapped;
    - harvested from the Unicode Consortium's Unihan Database, e.g. kSimplifiedVariant, kZVariant;
    - proposed by the Unicode Task Force members, e.g. hkiugSimplifiedVariant, hkiugZVariant

  • TSVCC Table - Unicode Version [5]Examples of Link Cases in Unicode Version:

    U+66C6 | U+5386 | U+66A6 | U+6B77 | U+6B74 | U+F98B | U+F98C | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77

    U+5C5B | U+5C4F | U+6452 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
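
The two link cases above can be turned into a lookup structure with a few lines of code. This is a minimal sketch that assumes the table is a text file in exactly the layout shown on the slide (code points separated by `|`, followed by a `#` remark); the function names are hypothetical and the real table file may differ.

```python
from typing import Dict, FrozenSet, List, Tuple


def parse_link_case(line: str) -> Tuple[List[str], str]:
    """Split one link-case line into its characters and its '#' remark."""
    codes_part, _, remark = line.partition("#")
    chars = []
    for field in codes_part.split("|"):
        field = field.strip()
        if field.startswith("U+"):
            chars.append(chr(int(field[2:], 16)))
    return chars, remark.strip()


def build_lookup(lines: List[str]) -> Dict[str, FrozenSet[str]]:
    """Map every character to the full set of forms it is linked with."""
    lookup: Dict[str, FrozenSet[str]] = {}
    for line in lines:
        chars, _ = parse_link_case(line)
        group = frozenset(chars)
        for ch in chars:
            lookup[ch] = group
    return lookup


# The two example lines from the slide.
LINES = [
    "U+66C6 | U+5386 | U+66A6 | U+6B77 | U+6B74 | U+F98B | U+F98C | "
    "#EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 "
    "AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77",
    "U+5C5B | U+5C4F | U+6452 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B",
]
LOOKUP = build_lookup(LINES)
```

Splitting at the first `#` keeps code points mentioned inside the remark (such as U+5386 in the first line) from being counted twice as group members.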

  • TSVCC Table - Unicode Version [6]
    - Supports linking of CJK Compatibility Ideographs, e.g. [U+F92F] in the previous screen dump, a variant from KS C5601-1987
    - Supports linking of forms used differently in Mainland China and in Hong Kong

  • TSVCC Table - Unicode Version [7]
    We welcome contributions from CJK experts and colleagues of member libraries to enhance the TSVCC tables, e.g. projects to establish TSVCC links from Hangul Syllables, Hiragana and Katakana to CJK ideographs

  • MARC-8 and Unicode Environments
    In 2000, the Library of Congress issued:
    - Specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html]
    - MARC-8 means characters are encoded in either one 8-bit byte (e.g. ASCII) or three 8-bit bytes (e.g. EACC)
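
To make the one-byte versus three-byte distinction concrete, here is a minimal sketch that turns a six-hex-digit EACC code into its three bytes. The code 213034 is the EACC value quoted in the sorting-key example later in this deck; the helper name is hypothetical.

```python
def eacc_bytes(code: str) -> bytes:
    """Turn a six-hex-digit EACC code such as '213034' into its three bytes."""
    raw = bytes.fromhex(code)
    if len(raw) != 3:
        raise ValueError("an EACC code is exactly three bytes")
    return raw


ascii_a = "a".encode("ascii")   # one 8-bit byte
eacc_cjk = eacc_bytes("213034")  # three 8-bit bytes
```

In this example all three EACC bytes fall in the printable ASCII range (they print as `!04`), so a plain text viewer shows them as ordinary ASCII characters rather than as one CJK character.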

  • A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in EACC in MARC-8 environment

  • MARC-8 and Unicode Environments [2]
    UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html]
    - Use UTF-8 as character encoding
    - Leader position 9 contains value a
    - Field 066 (Character Sets Present) is not needed
    - The script identification information in subfield 6 (Linkage) can be dropped
    - Lengths are specified by number of 8-bit bytes, rather than number of characters
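
Two of the points above lend themselves to a short check: the value of Leader position 9, and lengths counted in bytes rather than characters. This is a minimal sketch; the function names and the sample leaders in the usage note are made up for illustration.

```python
def is_unicode_record(leader: str) -> bool:
    """Leader position 9 (counted from 0) holds 'a' in the UCS/Unicode environment."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    return leader[9] == "a"


def length_in_bytes(value: str) -> int:
    """Length as counted for a UTF-8 encoded record: bytes, not characters."""
    return len(value.encode("utf-8"))
```

For instance, the made-up leader `00714cam a2200205 a 4500` would be treated as UCS/Unicode, while `00714cam  2200205 a 4500` (blank in position 9) would not. A two-character string such as U+4E2D U+570B has a byte length of 6 in UTF-8, which is the figure that lengths in the record must reflect.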

  • MARC-8 and Unicode Environments [3]
    Unicode combining rule for diacritics, i.e. combining marks follow rather than precede the character they modify
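
The reordering implied by this rule can be sketched as below. The sketch assumes the input has already been decoded to Unicode code points but is still in the mark-before-base order; the function name is hypothetical, and real converters handle more cases than this.

```python
import unicodedata


def marks_after_base(chars: str) -> str:
    """Move each run of combining marks from before its base character to after it."""
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in chars:
        if unicodedata.combining(ch):  # non-zero combining class: a combining mark
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # stray marks with no following base are kept at the end
    return "".join(out)
```

After reordering, standard normalization behaves as expected: `unicodedata.normalize("NFC", marks_after_base("\u0301e"))` yields the single precomposed character U+00E9.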

  • Migrating from EACC to Unicode
    The following INNOPAC systems are in Unicode Storage Environment:
    - HKSYC (Hong Kong Shue Yan College)
    - HKALL (the INN-Reach system for the eight universities in Hong Kong)
    - HKUST Tool Testing Database

  • Migrating from EACC to Unicode [2]
    HKSYC Visit
    - A group of systems librarians and catalogers from member libraries visited HKSYC Library in November 2006 to learn how its INNOPAC system works in Unicode Storage Environment
    - A number of outstanding issues were identified and/or confirmed
    - If you have migrated to Unicode storage or plan to migrate now, you might face the same problems

  • Migrating from EACC to Unicode [3]
    Outstanding Issues
    - TSVCC Linking is not turned on; even if it were turned on, it would not be using the latest HKIUG version
    - When certain CJK characters, such as U+8AAC and U+7CB5, are entered via Millennium Editor and the record is saved, these characters are stripped away and not saved: a destructive bug awaiting a fix

  • Migrating from EACC to Unicode [4]
    - Export from INNOPAC: only export in MARC-8 Environment was provided. There should be an option for users to export in Unicode Environment. III replied that this option is available.
    - Import (Load) into INNOPAC: only import in MARC-8 Environment was provided. There should be an option for users to load MARC records in Unicode Environment (i.e. in UTF-8). III replied that this option is available.

  • Migrating from EACC to Unicode [5]
    - It seemed that sorting at HKSYC is still EACC-based
    - The sorting key seemed to be constructed from: [No. of strokes][EACC code value]
    - For example, as observed from WebPAC's URL, the sorting key for 中國 is: 04{213034}11{21376f}. It should instead be sorted by Unicode code value, i.e. 04{u4e2d}11{u570b}
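
The two key formats quoted above can be reproduced with a short sketch. The stroke counts (4 and 11), the EACC codes and the expected key strings come from the slide; the lookup tables and function names are hypothetical stand-ins for whatever data the system actually uses.

```python
# Illustrative lookup tables covering only the two characters in the example.
STROKES = {"\u4e2d": 4, "\u570b": 11}
EACC = {"\u4e2d": "213034", "\u570b": "21376f"}


def eacc_sort_key(text: str) -> str:
    """Key as observed: [No. of strokes][EACC code value] per character."""
    return "".join(f"{STROKES[ch]:02d}{{{EACC[ch]}}}" for ch in text)


def unicode_sort_key(text: str) -> str:
    """Key as proposed: [No. of strokes][Unicode code value] per character."""
    return "".join(f"{STROKES[ch]:02d}{{u{ord(ch):04x}}}" for ch in text)
```

Because the stroke count leads each segment in both formats, the two keys differ only in how ties between characters with the same stroke count are broken: by EACC code value in the first, by Unicode code point in the second.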

  • Migrating from EACC to Unicode [6]
    The illogical sorting order found in HKUST's Tool Testing Database also needs to be fixed:
    1: ASCII space/punctuation (e.g. :)
    2: ASCII numerals (e.g. 1)
    3: CJK characters with pinyin
    4: ASCII alphabets (e.g. a)
    5: CJK characters without pinyin

  • Migrating from EACC to Unicode [7]
    Pure Unicode Storage Environment
    - Once migrated to Unicode Storage Environment, there should be no need for mapping back and forth between EACC and Unicode, except for some necessary conversion routines
    - In order to maintain a natively Unicode environment, EACC dependence should be identified and eliminated

  • Conclusion
    How far are we towards native Unicode?
    - Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records
    - ILS vendors including III are working very hard to implement and enhance Unicode support
    - Libraries and CJK experts are providing advice and suggesting solutions

  • Conclusion [2]
    Migrating INNOPAC to Unicode
    - We have reviewed various outstanding issues found in INNOPAC's Unicode Storage Environment
    - We hope these issues will be resolved quickly so that HKIUG member libraries can start to migrate their systems to Unicode
    - HKIUG Unicode Task Force will continue to work closely with III to enable a smooth migration

  • Additional Readings
    - Lam, K.T. EACC to Unicode migration. OCLC-CJK Users Group 2006 Annual Meeting. [http://hdl.handle.net/1783.1/2500]
    - Wong, Philip and K.T. Lam. HKIUG's Unicode projects : untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

  • Thank You!

