Language Tags the next generation

Page 1: Language Tags the next generation

Language Tags

the next generation

Internationalization and Unicode Conference #32

1

Page 2: Language Tags the next generation

Presenters

Addison Phillips, Lab126

Mark Davis, Google

2

Page 3: Language Tags the next generation

Languages, Language Tags, and Locales (oh my!)

Identifying language (and locale)—the challenge

ISO 639

IETF BCP 47
– RFC 4646, RFC 4647
– RFC 4646bis

Challenges for users

3

Page 4: Language Tags the next generation

Human Language as Metadata

Some data is just data, but some data is human-readable text.

Text processing depends on language:
– spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation.

IT systems depend on language negotiation:
– localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation

4

Page 5: Language Tags the next generation

Human Language

"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

In this book a number of dialects are used, to wit: the Missouri negro dialect; the extremest form of the backwoods Southwestern dialect; the ordinary "Pike County" dialect; and four modified varieties of this last. The shadings have not been done in a haphazard fashion, or by guesswork; but painstakingly, and with the trustworthy guidance and support of personal familiarity with these several forms of speech. I make this explanation for the reason that without it many readers would suppose that all these characters were trying to talk alike and not succeeding.

5

Page 6: Language Tags the next generation

Identifying Languages

Languages don’t form nice hierarchies
– “splitters” vs “lumpers”
– dialects, subdialects, regional and stylistic differences, patois

Differing communities with different needs
– terminology, librarians, computer systems, translators, etc.

6

Page 7: Language Tags the next generation

In the Beginning (ca. 1980 CE)

Received Wisdom from the Dark Ages

Locales:
– japanese, french, german, C
– ENU, FRA, JPN
– ja_JP.PCK
– AMERICAN_AMERICA.WE8ISO8859P1

Languages…
… looked a lot like locales (and vice versa)

7

Page 8: Language Tags the next generation

ISO 639

Defines language identifier codes

Multiple parts:
– ISO 639-1 (alpha2 codes, 26² = 676 possible) (136 codes)
– ISO 639-2 (alpha3 codes, 26³ = 17576 possible) (about 500)
– ISO 639-3 (alpha3 codes) (about 7000)
– ISO 639-4 (principles for encoding)
– ISO 639-5 (language families)
– ISO 639-6 (alpha4 codes) (under development)

8

Page 9: Language Tags the next generation

Impact of ISO 639-3

ISO 639-2 and 639-3 share a codespace
– all 639-2 codes are also 639-3 codes
– Macrolanguages

9

Page 10: Language Tags the next generation

Human Language

"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

en

10

Page 11: Language Tags the next generation

Parallel Efforts

ISO 639
– ISO 639-1 (early 1980s)
– ISO 639-2 (alpha3)
– ISO 639-3 (2007)

IETF BCP 47
– RFC 1766 (1995)
– RFC 3066 (2001)
– RFC 4646 (2006)
– RFC 4646bis (2008)

11

Page 12: Language Tags the next generation

BCP 47

Internet Engineering Task Force (IETF) “Best Current Practice” (BCP)

Enable presentation, selection, and negotiation of content in protocols and formats

– Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl, Apache, IE, Mozilla……….

12

Page 13: Language Tags the next generation

Adds Granularity

Need to identify language on varying levels of mutual intelligibility and granularity

"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

en

en-US

13

Page 14: Language Tags the next generation

BCP 47 (Historic) Basic Structure

Alphanumeric (ASCII only) subtags
Up to eight characters long
Separated by hyphens
Case not important (i.e. zh = ZH = zH = Zh)

1*8alphanum *( "-" 1*8alphanum )
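The four rules on this slide can be sketched in a few lines of Python. This is only an illustration of the historic structure as the slide states it, not a validator for any particular RFC; the function names are made up for the example.

```python
def is_well_formed_historic(tag):
    """Check the historic shape: hyphen-separated ASCII alphanumeric
    subtags, each one to eight characters long."""
    subtags = tag.split("-")
    return all(
        1 <= len(subtag) <= 8 and subtag.isascii() and subtag.isalnum()
        for subtag in subtags
    )


def same_tag(left, right):
    """Compare two tags ignoring case, since case carries no meaning."""
    return left.lower() == right.lower()


print(is_well_formed_historic("zh-TW"))   # well formed
print(is_well_formed_historic("en--US"))  # empty subtag, not well formed
print(same_tag("zh", "ZH"))               # same tag
```

Comparing lowercased copies leaves the stored tag untouched, so whatever casing the author chose can still be displayed as written.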

14

Page 15: Language Tags the next generation

RFC 1766

zh-TW
– zh: ISO 639-1 (alpha2)
– TW: ISO 3166 (alpha2)

i-klingon
– Registered value

15

Page 16: Language Tags the next generation

RFC 3066

sco-GB
– sco: ISO 639-2 (alpha 3 codes)

But use alpha 2 codes when they exist:
– en-GB (not eng-GB)

16

Page 17: Language Tags the next generation

What’s a Locale

– “a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences.”

java.util.Locale
.NET Culture
LANG (setlocale in C, C++)
NLS_LANG in Oracle
… and so on…

17

Page 18: Language Tags the next generation

Locales? Huh?

Theatre Center News: The date of the last version of this document was 2003 年 3 月 20. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.

18

Page 19: Language Tags the next generation

Locales and Languages

locale ≊ language + [other stuff]

Language needs to specify written form

U+224A (“≊”) = ALMOST EQUAL OR EQUAL TO

19

Page 20: Language Tags the next generation

Locale Identifiers

Different ideas:
– “Accept-Locale” vs. Accept-Language
– URIs/URNs, etc.
– CLDR/LDML

And Requirements:
– Operating environments and harmonization
– App Servers
– Web Services

New Solution? Cost of Adoption:
– UTF-8 to the browser: 8 long years

20

Page 21: Language Tags the next generation

Locales and Language Tags meet

We really need locale identifiers.

Language tags are being (ab)used as locale identifiers anyway…

Not going to need a big new thing…

… we can do this really fast…

Yeah, we’ll write an RFC

IUC23, March 2003

21

Page 22: Language Tags the next generation

Problems with BCP 47 (circa RFC 3066)

Script Variation:
– zh-Hant/zh-Hans
– (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)

Obsolescence of registrations:
– art-lojban (now jbo), i-klingon (now tlh)

Instability in underlying standards:
– sr-CS (CS used to be Czechoslovakia, then Serbia and Montenegro, and now it’s neither)

Lack of a single authoritative, stable source

22

Page 23: Language Tags the next generation

And More Problems

Little support for registered values in software
Reassignment of values by ISO 3166
Lack of consistent tag formation (Chinese dialects?)
Standards not readily available, bad references

Bad implementation assumptions
– the rules: 1*8alphanum *( "-" 1*8alphanum )
  example: “abcd1234-5678efgh-boont”
– badly interpreted as: 2*3ALPHA [ "-" 2ALPHA ]
  example: only stuff like “en-US” or “frr-CH”

Many registrations to cover small variations
– 8 German registrations to cover two variations
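To make the "bad implementation assumptions" point concrete, here is a rough Python sketch that writes both grammars from this slide as regular expressions and runs a few tags through each. The pattern names and the helper are invented for the illustration.

```python
import re

# The rule as the slide states it: 1-8 alphanumerics per subtag, any number of subtags.
GENERAL = re.compile(r"[A-Za-z0-9]{1,8}(-[A-Za-z0-9]{1,8})*")

# The common misreading: 2-3 letters, optionally followed by one 2-letter subtag.
NARROW = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{2})?")


def accepted_by(pattern, tag):
    """True when the whole tag matches the given pattern."""
    return pattern.fullmatch(tag) is not None


for tag in ["en-US", "frr-CH", "i-klingon", "zh-Hant", "abcd1234-5678efgh-boont"]:
    print(tag, accepted_by(GENERAL, tag), accepted_by(NARROW, tag))
```

Software built on the narrow pattern rejects registered tags such as "i-klingon" and anything carrying a script subtag, which is one reason registered values saw so little support.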

23

Page 24: Language Tags the next generation

LTRU and RFC 4646

Defines a generative syntax
– machine readable
– future proof, extensible

Defines a single source (IANA Language Subtag Registry)
– Stable subtags, no conflicts
– Machine readable

Defines when to use subtags
– (sometimes)

24

Page 25: Language Tags the next generation

Anatomy of a Language Tag

sl-Latn-IT-rozaj-1994-r-foovia-x-mine

– sl: ISO 639-1/2 (alpha2/3)
– Latn: ISO 15924 script codes (alpha 4)
– IT: ISO 3166 (alpha2) or UN M49
– rozaj-1994: Registered variants
– r-foovia: Extensions (none at present)
– x-mine: Private Use
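Because each kind of subtag has its own length and content, a tag like the one on this slide can be taken apart without consulting the registry. The following Python sketch shows the idea under simplifying assumptions: `parse_tag` is a made-up name, and extended language subtags, grandfathered tags, and tags that consist only of private use are deliberately not handled.

```python
import re


def parse_tag(tag):
    """Split a tag into parts using only subtag position, length, and content."""
    subtags = tag.split("-")
    count = len(subtags)
    parts = {"language": None, "script": None, "region": None,
             "variants": [], "extensions": {}, "private_use": []}

    # Primary language: letters only.
    if not re.fullmatch(r"[A-Za-z]{2,8}", subtags[0]):
        raise ValueError("unsupported primary subtag: " + subtags[0])
    parts["language"] = subtags[0].lower()
    i = 1

    # Script: exactly four letters.
    if i < count and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        parts["script"] = subtags[i].title()
        i += 1

    # Region: two letters or three digits.
    if i < count and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        parts["region"] = subtags[i].upper()
        i += 1

    # Variants: five to eight alphanumerics, or a digit plus three alphanumerics.
    while i < count and re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[i]):
        parts["variants"].append(subtags[i].lower())
        i += 1

    # Extensions: a single character other than "x", then 2-8 character subtags.
    while i < count and re.fullmatch(r"[A-WY-Za-wy-z0-9]", subtags[i]):
        singleton = subtags[i].lower()
        i += 1
        values = []
        while i < count and re.fullmatch(r"[A-Za-z0-9]{2,8}", subtags[i]):
            values.append(subtags[i].lower())
            i += 1
        if not values:
            raise ValueError("extension without subtags: " + singleton)
        parts["extensions"][singleton] = values

    # Private use: "x" followed by everything that remains.
    if i < count and subtags[i].lower() == "x":
        rest = subtags[i + 1:]
        if not rest or not all(re.fullmatch(r"[A-Za-z0-9]{1,8}", s) for s in rest):
            raise ValueError("malformed private use sequence")
        parts["private_use"] = [s.lower() for s in rest]
        i = count

    if i != count:
        raise ValueError("unexpected subtag: " + subtags[i])
    return parts


print(parse_tag("sl-Latn-IT-rozaj-1994-r-foovia-x-mine"))
```

The sketch also applies the conventional casing (lowercase language, title-case script, uppercase region), although case has no meaning in a tag.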

25

Page 26: Language Tags the next generation

More Examples

fr, de, nl, en, ja
fr-FR, fr-CA, de-DE, de-CH…
es-419 (Spanish for Americas)
en-US (English for USA)
de-CH-1996 (Old tags are all valid)
sl-rozaj-1994 (Multiple variants)
zh-t-wadegile (Extensions)

26

Page 27: Language Tags the next generation

Solves the Script problem

zh-Hant (!= zh-TW)
zh-Hans (!= zh-CN)

Azerbaijani (az)
– Arab, Cyrl, Latn

Serbian (sr)
– Cyrl, Latn

Yiddish (yi)
– Hebr, Latn

Mongolian (mn)
– Cyrl, Latn, Hani

Belarusian (be)
– Cyrl, Latn

Etc.

27

Page 28: Language Tags the next generation

Benefits

Subtag registry in one place: one source, machine-readable
Subtags identified by length/content
Extensible
Compatible with RFC 3066 tags
Stable: subtags are forever

28

Page 29: Language Tags the next generation

Tag Choice

“Tag Content Wisely”
– use the shortest tag reasonable
– use as many subtags as necessary to disambiguate
– don’t invent things; use the registry
– map deprecated values to modern equivalents

Suppress-Script
– avoid scripts when they add no additional information (Suppress-Script in the registry indicates this for some languages in some cases.)
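Two of these recommendations, mapping deprecated values and dropping redundant scripts, lend themselves to a small sketch. The tables below are tiny hand-typed excerpts for illustration only (the deprecated mappings are the ones mentioned in this talk), and `tidy_tag` is a hypothetical name; real code would load both tables from the registry.

```python
# Illustrative excerpts only; a real implementation would read the registry.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "de": "Latn"}
DEPRECATED = {"art-lojban": "jbo", "i-klingon": "tlh"}


def tidy_tag(tag):
    """Map a deprecated tag to its replacement, or drop a script subtag
    that the suppress-script table marks as adding no information."""
    replacement = DEPRECATED.get(tag.lower())
    if replacement is not None:
        return replacement

    subtags = tag.split("-")
    has_script = len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha()
    if has_script:
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower(), "")
        if suppressed.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


print(tidy_tag("en-Latn-US"))  # script is redundant for English
print(tidy_tag("zh-Hant-TW"))  # script carries information, kept
print(tidy_tag("i-klingon"))   # deprecated registration
```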

29

Page 30: Language Tags the next generation

Specialized Subtags

zxx (non-linguistic, not applicable)

und (undetermined)

mis (uncoded)

mul (multiple)

Zxxx (not written)

Collection codes

30

Page 31: Language Tags the next generation

Unicode Language Identifiers (CLDR)

Adds some region codes:

– ZZ
– QU
– etc.

Provides for canonicalization

Restricts syntax:
– no grandfathered codes
– no extlang

31

Page 32: Language Tags the next generation

Problems

Matching
– Does “en-US” match “en-Latn-US”?

Tag Choices
– Users have more to choose from.

Implementations
– More to do, more to think about
– (easier to parse, process, support the good stuff)

32

Page 33: Language Tags the next generation

Tag Matching (RFC 4647)

Uses “Language Ranges” in a “Language Priority List” to select sets of content according to the language tag

Three Schemes
– Basic Filtering
– Extended Filtering
– Lookup

See also: “Unicode in Google” talk for “distance matching” (later today)

33

Page 34: Language Tags the next generation

Tags are not Tokens!

Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure

<span class="foo" xml:lang="en-US" />

.foo:lang(en) { color: red; }

Accept-Language: zh;q=1.0, de-DE;q=0.8
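An Accept-Language value is itself structured: it is a list of language ranges, each with an optional weight. The sketch below turns such a value into a priority list, assuming the standard HTTP form in which ranges are separated by commas and weights are given as `q` parameters. The function name is invented, and malformed weights are simply treated as zero.

```python
def parse_accept_language(header_value):
    """Return the language ranges ordered by descending q weight."""
    weighted = []
    for position, item in enumerate(header_value.split(",")):
        item = item.strip()
        if not item:
            continue
        language_range, _, params = item.partition(";")
        quality = 1.0  # a range without a q parameter has full weight
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            weighted.append((quality, position, language_range.strip()))
    # Highest weight first; ties keep the order in which they were sent.
    weighted.sort(key=lambda entry: (-entry[0], entry[1]))
    return [language_range for _, _, language_range in weighted]


print(parse_accept_language("zh;q=1.0, de-DE;q=0.8"))
```

The resulting list is what the matching schemes in RFC 4647 call a language priority list.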

34

Page 35: Language Tags the next generation

Filtering

Ranges specify the least specific item
– “en” matches “en”, “en-US”, “en-Brai”, “en-boont”

Basic matching uses plain prefixes
– “en-US” matches “en-US” or “en-US-boont” but not “en-Latn-US”

Extended matching can match “inside bits”
– “en-*-US”
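Both filtering schemes are short enough to sketch in Python. The code follows the matching steps described in RFC 4647 as closely as a short example allows; the function names are made up here, and the input is assumed to be well-formed ranges and tags.

```python
def basic_filter(language_range, tag):
    """Basic filtering: the range must equal the tag or be a prefix of it
    that ends on a subtag boundary. The range "*" matches everything."""
    range_lower, tag_lower = language_range.lower(), tag.lower()
    return (range_lower == "*"
            or tag_lower == range_lower
            or tag_lower.startswith(range_lower + "-"))


def extended_filter(language_range, tag):
    """Extended filtering: range subtags must appear in order in the tag,
    but the tag may carry extra subtags in between."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must agree unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard: nothing to check
        elif t >= len(tag_subtags):
            return False                # tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past an extension or private use marker
        else:
            t += 1                      # skip an "inside bit" of the tag
    return True


print(basic_filter("en-US", "en-Latn-US"))       # basic prefix match fails
print(extended_filter("en-*-US", "en-Latn-US"))  # extended match succeeds
```

Under extended filtering the plain range "en-US" also matches "en-Latn-US", because the wildcard only makes explicit what the algorithm already allows.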

35

Page 36: Language Tags the next generation

Lookup

Range specifies the most specific tag in a match. Returns exactly one item.

– “en-US” might return either “en” or “en-US” but not “en-US-boont”

Mirrors the locale fallback mechanism and many language negotiation schemes.
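Lookup can be sketched as progressive truncation: try the range as given, then drop subtags from the end until something matches. The sketch below assumes ranges without wildcards and a caller-supplied default; `lookup` and its parameters are illustrative names, not an existing API.

```python
def lookup(priority_list, available_tags, default=None):
    """Return the single best available tag for a language priority list."""
    by_lowercase = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        if subtags == ["*"]:
            continue  # a bare wildcard says nothing useful for lookup
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            subtags.pop()
            # Do not leave a dangling single-character subtag such as "x".
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(lookup(["en-US"], ["en", "en-US-boont"]))
print(lookup(["zh-Hans-SG"], ["zh", "zh-Hans", "en"]))
```

The range is only ever shortened, never extended, which is why "en-US" can return "en" or "en-US" but never "en-US-boont".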

36

Page 37: Language Tags the next generation

Lookup and Language Negotiation

Resources “fall back” to find the best match

Global Binary

Resources (falling back):
– zh-Hans-SG (Chinese, Simplified script, Singapore)
– zh-Hans (Chinese, Simplified script)
– zh (Chinese)
– (root)

See also: “Unicode in Google” talk (later today)

37

Page 38: Language Tags the next generation

What Do I Do (Content Author)?

Not much.
– Existing tags are all still valid: tagging is mostly unchanged.
– Resist temptation to (ab)use the private use subtags.

Unless your language has script variations:
– Tag content with the appropriate script subtag(s)

Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.

38

Page 39: Language Tags the next generation

What Do I Do (Programmer)?

Check code for compliance with RFC 4646
– Decide on well-formed or validating
– Implement suppress-script
– Change to using the registry
– Bother infrastructure folks (Java, MS, Mozilla, etc) to implement the standard
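"Change to using the registry" mostly means reading its record format. As a rough sketch, the code below parses a short hand-typed excerpt laid out the way the IANA Language Subtag Registry lays out its records (records separated by `%%`, one `Field: value` per line, continuation lines indented). The excerpt, function names, and lookup helper are assumptions for illustration; a validating implementation would load the full registry file instead.

```python
# Hand-typed excerpt in the registry's record layout, for illustration only.
REGISTRY_EXCERPT = """\
%%
Type: language
Subtag: en
Description: English
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: US
Description: United States
Added: 2005-10-16
"""


def parse_registry(text):
    """Parse registry text into a list of {field: [values]} records."""
    records = []
    for block in text.split("%%"):
        record = {}
        last_field = None
        for line in block.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and last_field is not None:
                # Continuation of the previous field's value.
                record[last_field][-1] += " " + line.strip()
            else:
                field, _, value = line.partition(":")
                last_field = field.strip()
                record.setdefault(last_field, []).append(value.strip())
        if "Type" in record and "Subtag" in record:
            records.append(record)
    return records


def is_registered(records, subtag_type, subtag):
    """A validating check: is this subtag present with the given type?"""
    return any(
        record["Type"][0] == subtag_type
        and record["Subtag"][0].lower() == subtag.lower()
        for record in records
    )


records = parse_registry(REGISTRY_EXCERPT)
print(is_registered(records, "language", "en"))
print(is_registered(records, "region", "ZZ"))
```

A well-formed-only implementation stops at syntax; a validating one adds a check like `is_registered` for every subtag in the tag.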

39

Page 40: Language Tags the next generation

I need a new subtag…

Register new subtags with [email protected]

– only primary language or variant subtags
– read RFC 4646 for instructions
– two-week review period with expert approval

40

Page 41: Language Tags the next generation

LTRU Milestone Dates

RFC 4646
– Registry went live in December 2005

RFC 4647

(Anticipated) RFC 4646bis
– This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6

41

Page 42: Language Tags the next generation

RFC 4646bis (Internet-Draft)

Currently taking shape
– Adds about 7000 additional primary language subtags from ISO 639-3
– Extended language subtags for Chinese and other languages being debated

… and some cleanup work on processes and procedures

42

Page 43: Language Tags the next generation

Macrolanguages and Extlang: The Big Debate

zh-Hant-HK Chinese, Traditional Script, Hong Kong SAR

yue-Hant-HK Cantonese, Traditional Script, Hong Kong SAR

zh-yue-Hant-HK Chinese, Cantonese, Traditional Script, Hong Kong SAR (“yue” here is the extlang)

or do we do………..

43

Page 44: Language Tags the next generation

Current Solution

yue-Hant-HK

zh-yue-Hant-HK

Permitted, but deprecated in favor of “no extlang” form
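Moving from the extlang form to the preferred form amounts to dropping the macrolanguage prefix. A minimal sketch, assuming a small hand-typed table of extended language subtags and their prefixes (real code would take these from the registry) and a made-up function name:

```python
# Illustrative excerpt: extended language subtag -> macrolanguage prefix.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh"}


def drop_extlang_prefix(tag):
    """Rewrite "zh-yue-..." as "yue-..." when yue is an extlang of zh."""
    subtags = tag.split("-")
    if len(subtags) > 1:
        prefix = EXTLANG_PREFIX.get(subtags[1].lower())
        if prefix is not None and prefix == subtags[0].lower():
            return "-".join(subtags[1:])
    return tag


print(drop_extlang_prefix("zh-yue-Hant-HK"))
print(drop_extlang_prefix("zh-Hant-HK"))
```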

44

Page 45: Language Tags the next generation

Things to Do (languages)

Get involved in LTRU
Get involved in W3C Internationalization Activity
Get involved with Unicode and CLDR
Write implementations
Work on adoption of BCP 47: understand the impact

45

Page 46: Language Tags the next generation

Things to Read

Tag and Registry RFC
http://www.ietf.org/rfc/rfc4646.txt

Matching RFC
http://www.ietf.org/rfc/rfc4647.txt

4646bis Draft
http://www.ietf.org/internet-drafts/draft-ltru-4646bis-17.txt

References
http://www.langtag.net
http://www.inter-locale.com

LTRU Mailing List
https://www1.ietf.org/mailman/listinfo/ltru

46

Page 47: Language Tags the next generation

Ideas and Questions

47

