
Chinese Information Processing (I): Basic Concepts and Practice Unit 2: Encoding Chinese Characters.

Date post: 19-Dec-2015
Page 1: Chinese Information Processing (I): Basic Concepts and Practice Unit 2: Encoding Chinese Characters.

Chinese Information Processing (I): Basic Concepts and Practice

Unit 2: Encoding Chinese Characters

Page 2

Character Standard Set

Character sets are standard sets of characters established for two main purposes:

1. education (non-coded)

2. computing (coded).

Page 3

Non-coded Character Set: Hanzi in China

Xiàndài Hànyǔ Tōngyòng Zìbiǎo 现代汉语通用字表 (Commonly Used Characters in Modern Chinese), published on March 25, 1988, is a standardized list of 7,000 hanzi. Among these characters, 3,500 are also listed in the Xiàndài Hànyǔ Chángyòng Zìbiǎo 现代汉语常用字表 (Frequently Used Characters in Modern Chinese): 2,500 are chángyòng zì 常用字 (frequently used characters) and 1,000 are cì chángyòng zì 次常用字 (secondary frequently used characters).

Page 4

Non-coded Character Set: Hanzi in China

Jiǎnhuàzì Zǒngbiǎo 简化字总表 (Simplified Character Table) enumerates 2,249 simplified hanzi.

Table   Characters   Description
1       350          Independently simplified hanzi
2       146          Simplified character components used in other hanzi
3       1,753        Hanzi simplified by using simplified components from Table 2 of the Simplified Character Table

Page 5

Non-coded Character Set: Hanzi in Taiwan

1. The basic set of hanzi in Taiwan is listed in a table called 常用國字標準字體表 chángyòng guózì biāozhǔn zìtǐ biǎo (The Table of Standard Commonly Used Chinese Characters). It enumerates 4,808 hanzi.

2. An additional set of 6,341 hanzi is defined in 次常用國字標準字體表 cì chángyòng guózì biāozhǔn zìtǐ biǎo (The Table of Standard Secondary Commonly Used Chinese Characters).

3. 18,480 rare hanzi are defined in 罕用字體表 hǎnyòng zìtǐ biǎo (The Table of Rarely Used Characters).

4. 18,609 hanzi variants are defined in 異體國字字表 yìtǐ guózì zìbiǎo (The Table of Character Variants).

(Source: Lunde, 1999. p. 68)

Page 6

Coded Character Set: ASCII

ASCII: American Standard Code for Information Interchange

In 1963, the American Standards Association (ASA) announced the American Standard Code for Information Interchange (ASCII).

Total number of characters: 128

Page 7

Coded Character Set: ASCII

Page 8

Coded Character Set: ASCII

Page 9

Coded Character Set in China: GB

GB is an abbreviation of Guójiā Biāozhǔn 国家标准 (Guo-jia Biao-zhun), or "National Standard".

Page 10

Coded Character Set in China: GB 2312-80

symbols (94)

numerals (72)

ISO 646-CN (94 full-width characters)

hiragana (83)

katakana (86)

Greek alphabet (48)

Cyrillic (Russian) alphabet (66)

pinyin and bopomofo characters (26, 37)

line-drawing elements (76)

hanzi level 1 (3,755, ordered by pinyin reading)

hanzi level 2 (3,008, ordered by Chinese character radical, then stroke)
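As a quick consistency check, the block sizes listed above can be added up. The sketch below only does the arithmetic on the figures from this slide; the dictionary name and its keys are illustrative and not part of the standard.

```python
# Character counts per block of GB 2312-80, copied from the list above.
gb2312_blocks = {
    "symbols": 94,
    "numerals": 72,
    "ISO 646-CN (full-width)": 94,
    "hiragana": 83,
    "katakana": 86,
    "Greek": 48,
    "Cyrillic": 66,
    "pinyin": 26,
    "bopomofo": 37,
    "line-drawing elements": 76,
    "hanzi level 1": 3755,
    "hanzi level 2": 3008,
}

hanzi_total = gb2312_blocks["hanzi level 1"] + gb2312_blocks["hanzi level 2"]
grand_total = sum(gb2312_blocks.values())
non_hanzi_total = grand_total - hanzi_total

print(hanzi_total)      # 6763
print(non_hanzi_total)  # 682
print(grand_total)      # 7445
```

The two hanzi levels together give 6,763 characters, and the whole set comes to 7,445.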

Page 11

GB Table (the rows shown, with lead bytes 0x81 to 0x83, belong to the GBK extension; the two-byte codes of GB 2312-80 itself start at lead byte 0xA1)

Row 1 (0x81):丂丄丅丆丏丒丗丟丠両丣並丩丮丯丱丳丵丷丼乀乁乂乄乆乊乑乕乗乚乛乢乣乤乥乧乨乪乫乬乭乮乯乲乴乵乶乷乸乹乺乻乼乽乿亀亁亂亃亄亅亇亊亐亖亗亙亜亝亞亣亪亯亰亱亴亶亷亸亹亼亽亾仈仌仏仐仒仚仛仜仠仢仦仧仩仭仮仯仱仴仸仹仺仼仾伀伂伃伄伅伆伇伈伋伌伒伓伔伕伖伜伝伡伣伨伩伬伭伮伱伳伵伷伹伻伾伿佀佁佂佄佅佇佈佉佊佋佌佒佔佖佡佢佦佨佪佫佭佮佱佲併佷佸佹佺佽侀侁侂侅來侇侊侌侎侐侒侓侕侖侘侙侚侜侞侟価侢

Row 2 (0x82):侤侫侭侰侱侲侳侴侶侷侸侹侺侻侼侽侾俀俁係俆俇俈俉俋俌俍俒俓俔俕俖俙俛俠俢俤俥俧俫俬俰俲俴俵俶俷俹俻俼俽俿倀倁倂倃倄倅倆倇倈倉倊個倎倐們倓倕倖倗倛倝倞倠倢倣値倧倫倯倰倱倲倳倴倵倶倷倸倹倻倽倿偀偁偂偄偅偆偉偊偋偍偐偑偒偓偔偖偗偘偙偛偝偞偟偠偡偢偣偤偦偧偨偩偪偫偭偮偯偰偱偲偳側偵偸偹偺偼偽傁傂傃傄傆傇傉傊傋傌傎傏傐傑傒傓傔傕傖傗傘備傚傛傜傝傞傟傠傡傢傤傦傪傫傭傮傯傰傱傳傴債傶傷傸傹傼

Row 3 (0x83):傽傾傿僀僁僂僃僄僅僆僇僈僉僊僋僌働僎僐僑僒僓僔僕僗僘僙僛僜僝僞僟僠僡僢僣僤僥僨僩僪僫僯僰僱僲僴僶僷僸價僺僼僽僾僿儀儁儂儃億儅儈儉儊儌儍儎儏儐儑儓儔儕儖儗儘儙儚儛儜儝儞償儠儢儣儤儥儦儧儨儩優儫儬儭儮儯儰儱儲儳儴儵儶儷儸儹儺儻儼儽儾兂兇兊兌兎兏児兒兓兗兘兙兛兝兞兟兠兡兣兤兦內兩兪兯兲兺兾兿冃冄円冇冊冋冎冏冐冑冓冔冘冚冝冞冟冡冣冦冧冨冩冪冭冮冴冸冹冺冾冿凁凂凃凅凈凊凍凎凐凒凓凔凕凖凗
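The rows above carry lead bytes 0x81 to 0x83. A short check with Python's built-in codecs, offered as a sketch rather than an authoritative reading of the standards, suggests that these positions belong to the GBK extension and not to GB 2312-80 proper. The helper names `lead_byte` and `in_gb2312` are made up for this example.

```python
def lead_byte(ch: str, codec: str) -> int:
    """Return the first byte of ch when encoded with the given codec."""
    return ch.encode(codec)[0]


def in_gb2312(ch: str) -> bool:
    """True if Python's gb2312 codec can encode ch."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False


first = "丂"  # first character of the row labelled 0x81 above

print(hex(lead_byte(first, "gbk")))     # 0x81
print(in_gb2312(first))                 # False
print(in_gb2312("啊"))                  # True
print(hex(lead_byte("啊", "gb2312")))   # 0xb0
```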

Page 12

Coded Character Set in Taiwan: Big 5

(so called because it was drawn up by "five large computer makers")

Big-5

symbols, row 1 (157)

symbols, row 2 (157)

symbols, row 3 (94)

hanzi level 1 (5,401 Chinese characters ordered by number of strokes, then radical)

hanzi level 2 (7,652 Chinese characters ordered by number of strokes, then radical)

Page 13

Big Five Table

Row 1 (0xA1): ,、。.‧;:?!︰…‥﹐﹑﹒ ·﹔﹕﹖﹗|–︱—︳╴︴﹏()︵︶{}︷︸〔〕︹︺【】︻︼《》︽︾〈〉︿﹀「」﹁﹂『』﹃﹄﹙﹚﹛﹜﹝﹞‘’“”〝〞‵′#&*※ §〃○●△▲◎☆★◇◆□■▽▼㊣℅ ¯ ̄_ ˍ﹉﹊﹍﹎﹋﹌﹟﹠﹡+- ×÷±√<>=≦≧≠∞ ≡﹢﹣﹤﹥﹦~∩∪⊥∠∟⊿㏒㏑∫∮≒∵∴♀♂⊕⊙↑↓←→↖↗↙↘∥∣/

Row 2 (0xA2):\/﹨$¥〒¢£%@℃℉﹩﹪﹫㏕㎜㎝㎞㏎㎡㎎㎏㏄ °兙兛兞兝兡兣嗧瓩糎▁▂▃▄▅▆▇█▏▎▍▌▋▊▉┼┴┬┤├▔─│▕┌┐└┘╭╮╰╯═╞╪╡◢◣◥◤╱╲╳0123456789ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ〡〢〣〤〥〦〧〨〩十卄卅ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuv

Row 4 (0xA4):一乙丁七乃九了二人儿入八几刀刁力匕十卜又三下丈上丫丸凡久么也乞于亡兀刃勺千叉口土士夕大女子孑孓寸小尢尸山川工己已巳巾干廾弋弓才丑丐不中丰丹之尹予云井互五亢仁什仃仆仇仍今介仄元允內六兮公冗凶分切刈勻勾勿化匹午升卅卞厄友及反壬天夫太夭孔少尤尺屯巴幻廿弔引心戈戶手扎支文斗斤方日曰月木欠止歹毋比毛氏水火爪父爻片牙牛犬王丙

Row 5 (0xA5):世丕且丘主乍乏乎以付仔仕他仗代令仙仞充兄冉冊冬凹出凸刊加功包匆北匝仟半卉卡占卯卮去可古右召叮叩叨叼司叵叫另只史叱台句叭叻四囚外央失奴奶孕它尼巨巧左市布平幼弁弘弗必戊打扔扒扑斥旦朮本未末札正母民氐永汁汀氾犯玄玉瓜瓦甘生用甩田由甲申疋白皮皿目矛矢石示禾穴立丞丟乒乓乩亙交亦亥仿伉伙伊伕伍伐休伏仲件任仰仳份企伋光兇兆先全
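In the table above, the hexadecimal row label is the first byte of each character's two-byte Big Five code, and the second byte gives the position within the row. The following sketch uses Python's built-in `big5` codec to print the codes of the first few characters of row 0xA4; the helper name `big5_hex` is illustrative only.

```python
def big5_hex(ch: str) -> str:
    """Return the Big Five code of ch as uppercase hexadecimal."""
    return ch.encode("big5").hex().upper()


# First three characters of the row labelled 0xA4 above.
for ch in "一乙丁":
    print(ch, big5_hex(ch))

# The first byte is the row, the second byte the position in the row.
row, position = "一".encode("big5")
print(hex(row), hex(position))  # 0xa4 0x40
```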

Page 14

CJKV Character Set Server

This site generates properly encoded CJKV character sets, either displayed directly in your browser or sent to you via e-mail (in uuencoded form, if requested or necessary).

http://www.oreilly.com/~lunde/cjkv-char.html

Page 15

Coded Character Set: Unicode

U.S. computer firms began work on multilingual character sets and multilingual character encoding systems in the first half of the 1980s, and Xerox Corporation and IBM Corporation successfully implemented computer systems based on their research results. The Xerox researchers then proselytized their work to other U.S. software firms and eventually succeeded in launching a U.S. industry project called Unification Code, or Unicode, whose goal was to unify all of the world's character sets into a single large character set.

Page 16

ISO/IEC 10646-1: 1993

ISO 646

ISO 8859-1

Eastern European accented characters

International Phonetic Alphabet (IPA)

Greek (including accented characters, "monotoniko" and "polytoniko")

Cyrillic, Georgian and Armenian

Hebrew

Arabic characters (all four forms: initial, medial, final and stand-alone)

Indian subcontinent character sets (including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam)

Thai and Lao

Chinese/Japanese/Korean (CJK) ideographic characters (including hangul, katakana, hiragana, and bopomofo)

Mathematical operators and special character forms

Box and line drawing characters

Geometric shapes and Dingbats

Special OCR characters used on cheques

Encircled characters and numbers

Page 17

ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟᄠᄡᄢ

გდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰჱჲჳჴ

अआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढ

'()* +,- ./ ٤٥٦٧٨أؤإئابةتثجحخدذرزسشـفقكلم
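In a unified character set, every character from every script gets its own code point. As a small illustration (not part of the original slides), the sketch below looks up the code point and official name of the first character of three of the sample rows above, using Python's standard `unicodedata` module. The function name `describe` is an assumption made for this example.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the code point (as U+XXXX) and the Unicode name of ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


# First character of the hangul, Georgian and Devanagari sample rows.
for ch in ["ᄀ", "გ", "अ"]:
    print(ch, *describe(ch))
```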

Page 18

How to Encode Characters

The ASCII letters are represented by binary numbers (made up of zeroes and ones) in a character code table.

ASCII codes are represented by 7 zeros and ones, so they are called 7-bit codes. Each code is normally stored in 1 byte (8 bits), with the highest bit set to 0.
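To make the 7-bit representation concrete, here is a minimal Python sketch that prints the binary code of a few ASCII letters. The function name `ascii_bits` is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary code of an ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


print(ascii_bits("A"))  # 1000001
print(ascii_bits("a"))  # 1100001

# Seven bits allow 2**7 different codes, the size of the ASCII set.
print(2 ** 7)  # 128
```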

Page 19

Encoding Chinese Characters

Chinese has far more characters than a 7-bit encoding can cover, so an encoding with two bytes per character is used.

For “啊”, the first byte is 0110000 and the second byte is 0100001. The character is located in zone 16, position 1 of the code table; 32 is added to each number to give the byte values, so zone 16 becomes 48 (0110000) and position 1 becomes 33 (0100001).
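The step from zone and position numbers to the two 7-bit values can be sketched as follows. It assumes the usual GB 2312 convention that 32 (0x20) is added to both numbers so that the bytes fall in the printable range; the function name `quwei_to_7bit` is illustrative.

```python
def quwei_to_7bit(zone: int, position: int) -> tuple:
    """Convert a zone/position pair (each 1 to 94) to two 7-bit strings."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must be between 1 and 94")
    # Adding 32 moves the values past the ASCII control characters.
    return format(zone + 32, "07b"), format(position + 32, "07b")


# "啊" sits in zone 16, position 1.
print(quwei_to_7bit(16, 1))  # ('0110000', '0100001')
```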

 

Page 20

Page 21

To avoid overlapping with 7-bit ASCII codes, it is stipulated that each byte of a GB character must begin with 1 (that is, its highest bit is set). The actual code for a character is therefore 2 bytes (two 8-bit values). For example,

“啊” has 10110000 (0xB0) in the first byte and 10100001 (0xA1) in the second byte.
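The two bytes quoted for “啊” can be checked with Python's built-in `gb2312` codec. The sketch below also rebuilds the same bytes from zone 16, position 1, by adding 0x20 and then setting the high bit (adding 0x80); the helper name `gb_bytes_from_quwei` is an assumption made for this example.

```python
def gb_bytes_from_quwei(zone: int, position: int) -> bytes:
    """Build the 8-bit GB code from a zone/position pair."""
    # 7-bit value = number + 0x20; setting the high bit adds 0x80.
    return bytes([zone + 0x20 + 0x80, position + 0x20 + 0x80])


encoded = "啊".encode("gb2312")

print([format(b, "08b") for b in encoded])    # ['10110000', '10100001']
print(gb_bytes_from_quwei(16, 1) == encoded)  # True

# Every byte of a GB character has its highest bit set,
# so it can never be mistaken for a 7-bit ASCII code.
print(all(b & 0x80 for b in encoded))         # True
```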

