Unicode Regular Expressions
s/ / /g
Nick Patch
23 January 2013
Unicode Refresher
Unicode attempts to support the characters of the world: a massive task!
Unicode Refresher
It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.
Unicode Refresher
In Unicode, this sense of characters is represented by one or more code points,
which are each stored in one or more bytes.
Unicode Refresher
However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes.
We need to modernize our habits!
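To make the three levels concrete, here is a minimal sketch that counts the same string as bytes, as code points, and as user-perceived characters. It assumes a JavaScript runtime that provides `TextEncoder` and `Intl.Segmenter` (for example Node.js 16 or later); neither appears in the original slides.

```javascript
// Assumption: a runtime with TextEncoder and Intl.Segmenter (e.g. Node.js 16+).
// α followed by three combining marks: one "character" to a reader.
const sample = '\u03B1\u0313\u0300\u0345';

// Bytes: length of the UTF-8 encoding.
const byteCount = new TextEncoder().encode(sample).length;

// Code points: the string iterator yields one entry per code point.
const codePointCount = [...sample].length;

// User-perceived characters: count extended grapheme clusters.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const graphemeCount = [...segmenter.segment(sample)].length;

console.log({ byteCount, codePointCount, graphemeCount });
// { byteCount: 8, codePointCount: 4, graphemeCount: 1 }
```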
Unicode Refresher
Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
Normalization
NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ
Normalization
NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис
Normalization
ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
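The two equivalence chains can be checked mechanically. The sketch below is illustrative only: it uses the built-in `String.prototype.normalize` (ES2015 and later) rather than any library from the slides, and the helper names `canonicallyEqual` and `allEqual` are made up here. Each spelling is written with escapes so the code points are unambiguous.

```javascript
// First chain: every spelling is canonically equivalent to ᾂ (U+1F82).
const chainA = [
  '\u1F82',                   // ᾂ
  '\u1F02\u0345',             // ἂ + ypogegrammeni
  '\u1F80\u0300',             // ᾀ + grave
  '\u1FB3\u0313\u0300',       // ᾳ + psili + grave
  '\u03B1\u0313\u0300\u0345', // α + psili + grave + ypogegrammeni
  '\u03B1\u0313\u0345\u0300', // α + psili + ypogegrammeni + grave
  '\u03B1\u0345\u0313\u0300', // α + ypogegrammeni + psili + grave
];

// Second chain: grave comes before psili, which is a different sequence.
const chainB = [
  '\u1FB2\u0313',             // ᾲ + psili
  '\u1F70\u0313\u0345',       // ὰ + psili + ypogegrammeni
  '\u1F70\u0345\u0313',       // ὰ + ypogegrammeni + psili
  '\u1FB3\u0300\u0313',       // ᾳ + grave + psili
  '\u03B1\u0300\u0313\u0345', // α + grave + psili + ypogegrammeni
  '\u03B1\u0300\u0345\u0313', // α + grave + ypogegrammeni + psili
  '\u03B1\u0345\u0300\u0313', // α + ypogegrammeni + grave + psili
];

// Two strings are canonically equivalent when their NFD forms are identical.
const canonicallyEqual = (a, b) => a.normalize('NFD') === b.normalize('NFD');
const allEqual = (chain) => chain.every((s) => canonicallyEqual(s, chain[0]));

console.log(allEqual(chainA));                       // true
console.log(allEqual(chainB));                       // true
console.log(canonicallyEqual(chainA[0], chainB[0])); // false
```

Marks with the same combining class (psili and grave) keep their relative order under normalization, which is why the two chains never meet.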
Perl Normalization
use Unicode::Normalize;
say $str;      # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ
JavaScript Normalization
var unorm = require('unorm');
console.log($str);            // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ
PHP Normalization
echo $str; # ᾀ◌̀
echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ
echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ
Grapheme Clusters
regex: /^.$/
string 1: ᾂ
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. string 1 matches but string 2 fails: mixed results
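The mixed result is easy to reproduce. This sketch uses a JavaScript regex with the `u` flag (ES2015 and later), where `.` matches one code point; the variable names are illustrative only.

```javascript
const precomposed = '\u1F82';                    // ᾂ as a single code point
const decomposed = precomposed.normalize('NFD'); // α + three combining marks

// With the u flag, "." matches exactly one code point (excluding line breaks).
const oneCodePoint = /^.$/u;

console.log(oneCodePoint.test(precomposed)); // true
console.log(oneCodePoint.test(decomposed));  // false
console.log([...decomposed].length);         // 4
```

Both strings render identically, yet only the precomposed one satisfies the pattern.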
Grapheme Clusters
regex: /^\X$/
string 1: ᾂ
string 2: α◌̓◌̀◌ͅ
1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!
Perl
use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );
$str =~ /^\X$/;
$str =~ s/^(\X)$/->$1<-/;
PHP
preg_match('/^\X$/u', $str);
preg_replace('/^(\X)$/u', '->$1<-', $str);
JavaScript
[This slide intentionally left blank.]
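JavaScript regular expressions still have no `\X`. On runtimes newer than this talk, one possible workaround is `Intl.Segmenter`, which splits a string into extended grapheme clusters. The sketch below assumes such a runtime (for example Node.js 16 or later), and the helper names `graphemes`, `isSingleGrapheme`, and `wrapSingleGrapheme` are hypothetical stand-ins for the Perl and PHP one-liners, not part of any library.

```javascript
// Assumption: Intl.Segmenter is available (e.g. Node.js 16+ or a recent browser).
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Split a string into its grapheme clusters.
function graphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), (part) => part.segment);
}

// Rough counterpart of /^\X$/: the whole string is exactly one cluster.
function isSingleGrapheme(str) {
  return graphemes(str).length === 1;
}

// Rough counterpart of s/^(\X)$/->$1<-/: wrap the string only if it is one cluster.
function wrapSingleGrapheme(str) {
  return isSingleGrapheme(str) ? `->${str}<-` : str;
}

console.log(isSingleGrapheme('\u1F82'));                   // true
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGrapheme('\u03B1\u03B2'));             // false
console.log(wrapSingleGrapheme('\u1F82'));
```

This is not a drop-in regex feature: it cannot be combined with other pattern syntax, so mixed patterns need the string to be segmented first.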
Match Any Character
two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
Match Any Letter
letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и
letter grapheme cluster: е(?=\pL)\Xи
regex: / о \p{Cyrillic} т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. string 1 matches but string 2 fails: mixed results
regex: / о (?= \p{Cyrillic} ) \X т /x
string 1: който
string 2: кои◌̆то
1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
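The same walkthrough can be sketched in JavaScript, with two caveats. JavaScript needs the `u` flag and the explicit `Script=` name for property escapes (ES2018 and later), and it has no `\X`, so this example substitutes `\P{M}\p{M}*` (one non-mark followed by any number of marks). That substitute is only an approximation of a grapheme cluster, adequate for this string but not a full implementation of Unicode segmentation.

```javascript
// който with a precomposed й, and the same word with и + combining breve.
const composed = '\u043A\u043E\u0439\u0442\u043E';
const decomposedWord = '\u043A\u043E\u0438\u0306\u0442\u043E';

// о, one Cyrillic code point, т
const naive = /\u043E\p{Script=Cyrillic}\u0442/u;

// о, lookahead for a Cyrillic code point, then a base plus its marks, then т.
// Assumption: \P{M}\p{M}* stands in for \X, which JavaScript lacks.
const clusterAware = /\u043E(?=\p{Script=Cyrillic})\P{M}\p{M}*\u0442/u;

console.log(naive.test(composed));              // true
console.log(naive.test(decomposedWord));        // false
console.log(clusterAware.test(composed));       // true
console.log(clusterAware.test(decomposedWord)); // true
```

The lookahead still checks only the first code point of the cluster, exactly as in the Perl pattern; the combining breve itself has the script value Inherited, not Cyrillic.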
Character Literals
[ یي ]
(?: ی|ي )
[\x{064A}\x{06CC}]
[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
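For comparison, a JavaScript sketch of the same idea: code point escapes use `\u{...}` with the `u` flag, and because JavaScript has no `\N{...}` named escapes, named constants are used here as a stand-in. The constant names are illustrative, chosen to mirror the Unicode character names.

```javascript
// Code point escapes avoid look-alike characters and bidi confusion in source code.
const yehByCodePoint = /^[\u{064A}\u{06CC}]$/u;

// Stand-in for \N{...}: constants named after the Unicode character names.
const ARABIC_LETTER_YEH = '\u{064A}';
const ARABIC_LETTER_FARSI_YEH = '\u{06CC}';
const yehByName = new RegExp(`^[${ARABIC_LETTER_YEH}${ARABIC_LETTER_FARSI_YEH}]$`, 'u');

console.log(yehByCodePoint.test(ARABIC_LETTER_YEH));  // true
console.log(yehByName.test(ARABIC_LETTER_FARSI_YEH)); // true
console.log(yehByName.test('\u{0628}'));              // false (ARABIC LETTER BEH)
```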
Properties
\p{Script=Latin}
Name: Script
Value: Latin
Match any code point with the value “Latin” for the Script property.
Properties
\P{Script=Latin}
Name: Script
Value: not Latin
Negated form: Match any code point without the value “Latin” for the Script property.
Properties
\p{Latin}
Name: Script (implicit)
Value: Latin
The Script and General Category properties don't require the name because they're so common and their values don't conflict.
Properties
\p{General_Category=Letter}
Name: General Category
Value: Letter
Match any code point with the value “Letter” for the General Category property.
Properties
\p{gc=Letter}
Name: General Category (gc)
Value: Letter
Property names may be abbreviated.
Properties
\p{gc=L}
Name: General Category (gc)
Value: Letter (L)
The General Category property is so commonly used that its values all have standard abbreviations.
Properties
\p{L}
Name: General Category (implicit)
Value: Letter (L)
And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.
Properties
\pL
Name: General Category (implicit)
Value: Letter (L)
Single-character General Category values don't require curly braces.
Properties
\PL
Name: General Category (implicit)
Value: not Letter (L)
Don't forget negation!
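The forms above are Perl syntax, and other engines accept only some of them. As one point of comparison, this sketch checks which spellings compile as JavaScript regular expressions with the `u` flag (ES2018 and later). The `compiles` helper is hypothetical, written just for this check.

```javascript
// Returns true if the pattern source is accepted in Unicode mode.
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false;
  }
}

const forms = [
  '\\p{Script=Latin}',            // accepted
  '\\P{Script=Latin}',            // accepted (negated)
  '\\p{Latin}',                   // rejected: Script values need the property name
  '\\p{General_Category=Letter}', // accepted
  '\\p{gc=Letter}',               // accepted
  '\\p{gc=L}',                    // accepted
  '\\p{L}',                       // accepted: General Category may be implicit
  '\\pL',                         // rejected: curly braces are required
  '\\PL',                         // rejected: curly braces are required
];

for (const form of forms) {
  console.log(form, compiles(form));
}

console.log(/^\p{Script=Latin}$/u.test('a'));      // true
console.log(/^\p{Script=Latin}$/u.test('\u044F')); // false (Cyrillic я)
console.log(/^\P{L}$/u.test('1'));                 // true (a digit is not a letter)
```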
s/ / /g