Dyalog’08
Migrating to Unicode
Morten Kromberg
Workshop at Dyalog’08 - Elsinore
Agenda
• What is Unicode?
• V.12 Design Goals
• Key Unicode Features
• Language Differences
  – ⎕DR, ⍋ of char data
  – Space & Performance
• "Interop": Classic vs Unicode
  – WSs & Component Files
  – TCP Sockets & Conga
  – External Vars, Mapped Files
  – Own DLLs and APs
• Native Files
  – Unicode Text Files (UTF-8)
• External Interfaces
  – COM/OLE, Microsoft.NET
  – ODBC / SQAPL
  – ⎕NA: A & W win32 calls
• Source Code Management
  – SALT, SubVersion, Diff Tools
• Planning Migrations
Migrating to Unicode
Dyalog’08 - Elsinore 3
What is Unicode?
Wikipedia: An industry standard allowing computers to consistently represent and manipulate text expressed in any of the world's writing systems.
• It assigns a number, or code point, to each of approximately 100,000 characters
  – Including the APL character set.
• The first version of the standard appeared in 1991; support is now becoming "common" on all platforms
Why do we want Unicode?
• Obviously: It allows us to write applications which use text from all the world's written languages…
• Less obviously, but perhaps more important in the short term:
  – APL no longer needs its own character set ("Atomic Vector")
  – Characters no longer need to be translated on the way in and out of APL
  – APL source code can be stored in "ordinary" text files and be handled by "standard" management tools
What is Unicode in practice?
Char  Name                                  Hex    Dec     UTF-8 (decimal bytes)
A     Latin capital letter A                00041  65      65
Æ     Latin capital letter AE               000C6  198     195 134
α     Greek small letter alpha              003B1  945     206 177
ؤ     Arabic letter waw with hamza above    00624  1572    216 164
⍺     APL functional symbol alpha           0237A  9082    226 141 186
𠀁     CJK ideograph extension B, second     20001  131073  240 160 128 129
• Most often, when someone tells you the data ”is Unicode”, they mean ”UTF-8 encoded”.
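The columns of the table above can be recomputed mechanically. The following is a minimal sketch in Python (not APL, so that it can be run without a Dyalog interpreter); the helper name `describe` is made up for this illustration.

```python
# Recompute the Hex, Dec and UTF-8 columns for the sample characters.
SAMPLES = ["A", "Æ", "α", "ؤ", "⍺", "\U00020001"]

def describe(ch):
    """Return (hex code point, decimal code point, UTF-8 bytes as decimals)."""
    code_point = ord(ch)                   # the Unicode code point
    utf8_bytes = list(ch.encode("utf-8"))  # the bytes a UTF-8 file would hold
    return f"{code_point:05X}", code_point, utf8_bytes

for ch in SAMPLES:
    hex_cp, dec_cp, utf8 = describe(ch)
    print(hex_cp, dec_cp, *utf8)
```

The code point is a property of the character; the byte sequence is a property of the chosen encoding.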
Use Google...
Wikipedia too ...
Encodings
Encoding  Description
UCS-4     4 bytes per character (= Dyalog ⎕DR type 320). Often used as internal representation on Unix systems.
UCS-2     2 bytes per character (= type 160). The internal format for "wide" chars under Windows until Win2000.
UTF-8     THE most popular encoding for text files. Identical to ASCII for range 0-127 (= good for Americans). 2 bytes/char from 128-2047, 3 bytes 2048-65535, 4 bytes after that. The only encoding which is independent of "endian-ness".
UTF-16    Identical to UCS-2 for most of first plane, but can encode all characters. Replaced UCS-2 on Windows after Win2000.
• "Unicode" assigns unique numbers to characters. Encodings are ways to represent these numbers on file.
• UCS (Universal Character Set) encodings have a fixed width; UTF (Unicode Transformation Format) encodings are variable width.
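As a rough illustration of the width rules and of "endian-ness", here is a small Python sketch (Python rather than APL, so it runs without Dyalog). The function name `utf8_width` and the sample text are assumptions chosen for the example.

```python
def utf8_width(code_point):
    """Bytes UTF-8 needs for one code point, using the ranges listed above."""
    if code_point < 128:
        return 1
    if code_point < 2048:
        return 2
    if code_point < 65536:
        return 3
    return 4

text = "{⍺+⍵}"
# Same five characters, three different sizes on file.
sizes = {
    "utf-8": len(text.encode("utf-8")),
    "utf-16-le": len(text.encode("utf-16-le")),
    "utf-32-le": len(text.encode("utf-32-le")),
}

# Byte order only matters for the multi-byte units of UTF-16 and UTF-32.
little = "⍺".encode("utf-16-le")
big = "⍺".encode("utf-16-be")
print(sizes, list(little), list(big))
```

UTF-8 has no byte-order variants because its unit is a single byte.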
Version 12.0 Design Goals
• To allow users to develop Unicode applications (containing all the world's symbols)
• To make the Dyalog IDE a Unicode application
  – No more "translate tables"!
• Avoid having to explain ⎕AV to future generations
  – Only one "kind" of characters
• Design should encourage migration
  – Controlled migration with "interop" between old & new apps
  – No "Big Bang" data conversion events
  – Classic & Unicode editions allow "parallel runs"
Unicode vs Classic
• Unicode Edition:
  – Character data is defined as Unicode code points
  – No translation of data as it moves in & out of APL
• Classic Edition:
  – Character data is defined as indices into ⎕AV
  – Translate tables used for keyboard, display and file I/O
• Classic will remain available as long as even one major customer has not been able to migrate
  – The price may increase at some point
Key Unicode Features (1)
• New Character Data Types 80, 160, 320: 1-, 2- and 4-byte representations of Code Points.
      ⎕DR 'Hello'
80
      ⎕DR '{⍺+⍵}'
160
      ⎕DR '𠀁𠀁𠀁'
320
• NB: One character = one array element!
Key Unicode Features (2)
• Monadic ⎕UCS converts to and from code points (self inverse):
      ⎕UCS 'Hello'
72 101 108 108 111
      ⎕UCS '{⍺+⍵}'
123 9082 43 9077 125
      ⎕UCS (2*17)+⍳3
𠀁𠀂𠀃
Key Unicode Features (3)
• Dyadic ⎕UCS encodes and decodes data as UTF-8, UTF-16 or UTF-32:
      'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
      'UTF-8' ⎕UCS 240 160 128 129, 240 160 128 130, 240 160 128 131
𠀁𠀂𠀃
      'UTF-16' ⎕UCS '𠀁𠀂𠀃'
55360 56321 55360 56322 55360 56323
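The UTF-16 result above shows two numbers per character because code points beyond 65535 are split into a "surrogate pair". A minimal Python sketch of that arithmetic (Python is used only so the example is runnable without Dyalog; `utf16_units` is a name invented here):

```python
def utf16_units(ch):
    """Return the UTF-16 code units (as integers) for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                 # fits in a single 16-bit unit
    offset = cp - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)  # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF) # bottom 10 bits -> low surrogate
    return [high, low]

print(utf16_units("⍺"), utf16_units("\U00020001"))
```

For characters in the first plane, such as the APL symbols, UTF-16 and UCS-2 give the same single unit.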
Demo 1 ...
(key features)
Language Differences
• If you are only using APL workspaces and component files, most code from earlier versions will just load & run
• Potential problems are:
  – Monadic ⍋ (only real language difference)
  – ⎕DR to test for character data
  – Dyadic use of ⎕DR to "cast" data
  – Space usage (char arrays can be larger)
Monadic ⍋
• Due to differences in the internal representation, grade up without a collation sequence may return different results:
  Classic:
      ⍋'aA'
1 2
      ⎕AV⍳'aA'
18 66
  Unicode:
      ⍋'aA'
2 1
      ⎕UCS 'aA'
97 65
• Give ⍋ a left argument of ⎕AV to maintain the current behaviour
• In many cases where ⍋ is used monadically, the exact order does not matter
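The difference comes down to which number each character is compared by. The sketch below uses Python as a stand-in for APL; `grade_up` and `HYPOTHETICAL_AV` are illustrative names, and the collating sequence is NOT the real ⎕AV, only a short alphabet in which lower case precedes upper case.

```python
def grade_up(text, collation=None):
    """Return 1-based indices that sort text (like grade up with index origin 1)."""
    if collation is None:
        key = lambda i: ord(text[i])              # order by code point
    else:
        key = lambda i: collation.index(text[i])  # order by position in the sequence
    return [i + 1 for i in sorted(range(len(text)), key=key)]

# Assumption for illustration: lower case sorts before upper case.
HYPOTHETICAL_AV = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

print(grade_up("aA"))                   # code point order
print(grade_up("aA", HYPOTHETICAL_AV))  # explicit collating sequence
```

Supplying the collating sequence explicitly makes the result independent of the internal representation.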
Testing for Character Data
• This no longer works as expected:
      82=⎕DR X
• Dyalog recommends:
      (10|⎕DR ⍵)∊0 2
  – The latter is correct in all versions
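The recommended test works because the last digit of the type code identifies the kind of data. A small Python sketch of the same arithmetic (Python only for runnability; the function name is invented, and the numeric type codes 11, 83, 163, 323, 645 and 326 are listed here as the commonly documented non-character types):

```python
def is_character_type(dr):
    """Mirror of (10|⎕DR ⍵)∊0 2: character types end in 0 (Unicode) or 2 (Classic)."""
    return dr % 10 in (0, 2)

CHARACTER_TYPES = [80, 160, 320, 82]
OTHER_TYPES = [11, 83, 163, 323, 645, 326]

print([is_character_type(t) for t in CHARACTER_TYPES + OTHER_TYPES])
```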
Dyadic ⎕DR for "Casting"
• Classic (and previous versions):
      83 ⎕DR '⍋'      ⍝ ⎕AV[⎕IO+198]
¯109                  ⍝ Via APL+Win tables
• Unicode:
      83 ⎕DR '⍋'      ⍝ ⎕UCS 9035
75 35                 ⍝ 9035 = 256⊥⌽75 35
• The internal representation is different, and Unicode does NO TRANSLATION
• Code which (e.g.) reads characters from native files and then "casts" to numbers using ⎕DR needs work
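The Unicode result 75 35 is simply the two bytes of the 16-bit code point in little-endian order. A Python sketch of that reinterpretation (a stand-in for the APL, assuming a little-endian machine as on Intel hardware):

```python
import struct

code_point = ord("⍋")                              # 9035
raw = struct.pack("<H", code_point)                # 2 bytes, little-endian
as_signed_bytes = list(struct.unpack("<2b", raw))  # view the same bytes as type 83

# Rebuild the code point from the bytes, most significant byte last.
rebuilt = as_signed_bytes[0] + 256 * as_signed_bytes[1]
print(as_signed_bytes, rebuilt)
```

No table lookup is involved, which is why the Classic and Unicode results differ.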
More on ⎕DR ... (and ⎕UCS)
• Unicode Edition still recognises 82 as a left argument:
      82 ⎕DR ¯109
⍋
• This returns the same character as in Classic. But:
      ⎕DR 82 ⎕DR ¯109
160                   ⍝ Type 82 cannot exist in Unicode
• Conversely, ⎕UCS exists in Classic:
      ⎕UCS 9035
⍋
      ⎕UCS 180        ⍝ But must return elements of ⎕AV
TRANSLATION ERROR     ⍝ Cannot convert to type 82
Space and Time
• Character data will require 2 bytes per element in the Unicode Edition, if it contains APL symbols. No existing APL arrays can need 4 bytes per element.
• Primitives which manipulate or search this data may run more slowly (more data to sift through).
• Comments and character constants in code, and the script form of namespaces and classes, are also affected
Time and Space
• When copying functions between Classic and Unicode, the format needs to be converted; this can be expensive.
• The same applies when reading a ⎕OR "across the line".
• It is not recommended to dynamically import functions across the Classic/Unicode boundary in production applications.
• Some VERY LARGE functions which could be fixed in v11.0 may not fix in the Unicode Edition: lists of names and constants in a function share space with comments.
  – A proposal to relax all limits on functions may be implemented in version 12.1
Unicode vs Classic
• Use the Unicode Edition if:
  – You want to develop new applications
  – You need to manage characters not in ⎕AV now.
• Use the Classic Edition if:
  – You need other v12+ enhancements, but are not ready to convert to Unicode yet
  – Classic is upwards compatible with v11.0 (as usual)
• UE and CE are maintained from single source, and are "identical" except for character arrays.
• Start planning your migration now! (please!)
So you want to migrate soon...
• If you "only use APL" (workspaces, component files, sockets), applications SHOULD just load & run
• If you
  – Fell for the temptation to use any external tools or storage media as part of your application
  – Wrote your own APs or DLLs
  – Or want to start using data not in ⎕AV
  ... you may have a little work to do. Let's take a look!
”Interop”
• Unicode and Classic editions are designed to inter-operate seamlessly – also with v11 & v10.1
• 12.0 Classic can read and translate Unicode character data found in files, workspaces and on TCP sockets
• Unicode editions will translate data to type 82 when using TCP Sockets and Component files flagged as non-Unicode (for interop with v11 & v10.1)
• If Unicode data contains characters not in ⎕AV, the result is a TRANSLATION ERROR
• Unicode editions still recognise 82 as a valid argument to ⎕DR and native file functions, and are able to map data in old native files to "the same character".
”Interop”
• The intention is that users should be able to perform controlled experiments when migrating to Unicode
• No ”Big Bang” data conversion events; old files and workspaces can still be read
• We hope that users will "reciprocate" by moving as quickly as possible; it is as easy as we could make it!
Workspaces
• Classic and Unicode editions can load each other's workspaces, but:
  – Classic cannot load (or COPY from) a workspace containing characters not in ⎕AV (TRANSLATION ERROR)
• The contents of ⎕AV are defined by ⎕AVU, a list of 256 Unicode Code Points:
      ⎕AV[97+⍳26]               ⍝ By default in v12.0, "Dyalog Alt"
ÁÂÃÇÈÊËÌÍÎÏÐÒÓÔÕÙÚÛÝþãìðòõ
      ⎕AVU[97+⍳26]←9397+⍳26     ⍝ Underscored alphabet (sort of)
      ⎕AV[97+⍳26]               ⍝ Now we have "Dyalog Std" mapping
ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• When )COPYing from a pre-v12 workspace, ⎕AVU in the target namespace decides how incoming character data is translated. So code written using Alt and Std can be merged and keep its original appearance.
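The role of the translate table can be sketched as a 256-entry list of code points. The Python below is only an analogy: the starting table is a plain identity mapping (NOT the real contents of ⎕AVU), and `TranslationError`, `from_classic` and `to_classic` are names invented for the illustration.

```python
class TranslationError(Exception):
    """Raised when a character has no position in the 256-entry table."""

def from_classic(indices, avu):
    """Turn stored table positions into Unicode text."""
    return "".join(chr(avu[i]) for i in indices)

def to_classic(text, avu):
    """Turn Unicode text into table positions, or fail if that is impossible."""
    lookup = {cp: i for i, cp in enumerate(avu)}
    try:
        return [lookup[ord(ch)] for ch in text]
    except KeyError as err:
        raise TranslationError(f"code point {err.args[0]} is not in the table") from err

# Hypothetical tables: an identity mapping, and a copy with 26 positions
# remapped to the circled letters at code points 9398-9423.
alt = list(range(256))
std = list(alt)
std[97:123] = range(9398, 9424)

stored = [97, 98]  # the same stored data ...
print(from_classic(stored, alt), from_classic(stored, std))  # ... decodes differently
```

The same stored positions yield different characters depending on which table is in force, and any character absent from the table cannot be written back.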
More on ⎕AVU
• The Dyalog Std font is still used in some older ("anglo") applications
• Dyalog Alt is used across Western Europe
• Some countries use fonts created by local distributors:
      )copy avu Russian.⎕AVU
C:\...avu saved Fri Jun 27 10:00:52 2008
      3 50⍴65↓⎕AV
ABCDEFGHIJKLMNOPQRSTUVWXYZАБВГД⍙ЕЖЗИЙКЛМНОПРСТУФХЦ
ЧШЩЪЫЬЭЮ{€}⊣⌷¨Яабв⍨гдежзийклмнопрстуфхцч[/⌿\⍀<≤=≥>
≠∨∧-+÷×?∊⍴~↑↓⍳○*⌈⌊∇∘(⊂⊃∩∪⊥⊤|;,⍱⍲⍒⍋⍉⌽⊖⍟⌹!⍕⍎⍫⍪≡≢шщъы
• The translate table is also used when reading component files and APL data arriving on TCP Sockets
• It has namespace scope, so classes or namespaces can be defined to read data from Classic systems using different languages if necessary
Underscores Must Die!
• There is no underscored alphabet in Unicode. Underscoring is a form of "emphasis" (like bold or italic). The underscored alphabet is the ONLY incompatibility with the rest of the world and should be phased OUT.
• The APL385 Unicode font incorrectly displays underscores for code points 9398-9423 (decimal). These positions should really display as Ⓐ..Ⓩ
• (Don't ask why circled alphabetics ARE in Unicode while underscores are not, but Dyalog decided to map underscores to this range)
⎕AV: Just another variable
• In the Unicode Edition, the Atomic Vector is only used to define how to inter-operate with Classic systems. Only characters in ⎕AV can be shared. Assuming the default (Alt) setting:
      'ÁⒶ'∊⎕AV
1 0
• System variable ⎕Ⓐ (name now displays as ⎕Á) should no longer be used. It continues to exist and returns ⎕AV[97+⍳26]
Chars Allowed in Names
• The list has not been extended; the following are allowed:
      0123456789 (but not as the 1st character in a name)
      ABCDEFGHIJKLMNOPQRSTUVWXYZ_
      abcdefghijklmnopqrstuvwxyz
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
      àáâãäåæçèéêëìíîïðñòóôõöøùúûüþ
      ∆⍙
      ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
• In a standard font, underscores display as Ⓐ to Ⓩ
• In Unicode, all of the above can now be used simultaneously (previously, the available set depended on whether the Alt or Std font was selected). Russian letters are NOT allowed.
Component File Interop
• Like workspaces, Component Files can be shared between Classic and Unicode editions.
• The same restriction applies: Classic cannot read arrays containing characters not in ⎕AV.
• Files can be marked as non-Unicode, in which case Unicode cannot write characters not in ⎕AV.
  – All "small" (32-bit) component files are non-Unicode
• For ordinary APL arrays (no ⎕ORs), the Unicode edition can share files with old versions of APL too.
File Properties
• New system function ⎕FPROPS allows you to control whether a file may contain Unicode data:
      'c:\temp\smallfile' ⎕FCREATE 32 32
      'EJSU' ⎕FPROPS 32       ⍝ Endian, Journaled, Size, Unicode
0 0 32 0
      'c:\temp\bigfile' ⎕FCREATE 64 64
      'EJSU' ⎕FPROPS 64
0 0 64 1
• Size defaults to 64 from v12.0 (new startup flag –F32/-F64)
• Small address size (32-bit) files are limited to 4 GB in size and can NOT have the Unicode bit set
• Setting Journaling on prevents sharing with v11.0 or earlier
Translation Error on Write
• Unicode edition can write to non-Unicode component files:
'{⍺+⍵}' ⎕FAPPEND 32 ⍝ ∧/'{⍺+⍵}'∊⎕AV – fine!
'U' 0 ⎕FPROPS 64 ⍝ Switch Unicode OFF
'𠀁𠀁𠀁 ' ⎕FAPPEND 64 ⍝ Chars not in ⎕AV
TRANSLATION ERROR
'U' 1 ⎕FPROPS 32 ⍝ Not allowed for small files
TRANSLATION ERROR
• If non-Unicode files do not contain namespaces or ⎕ORs, v10.1 and v11.0 can use them
• Note: Large files (64-bit) cannot be used with versions 10.0 or earlier.
TCP Socket / Conga Interop
• TCPSocket objects have an Encoding property:
  Encoding  Style  Meaning
  None      Char   No translation, characters must be in range 0-255.
  UTF-8     Char   To UTF-8 on send, from UTF-8 on receive
  Classic   APL    Chars transmitted encoded as elements of ⎕AV
  Unicode   APL    Types 80, 160 or 320 used as required
• The default is None for Char, and Classic for APL
• APL sockets are non-Unicode by default to avoid crashing down-version APL interpreters receiving Unicode data
• Conga always sends data in "native" form; receive will fail with a TRANSLATION ERROR if data cannot be represented
External Variables
• External Variables are implemented as small span component files (32-bit files) – and can thus NOT contain Unicode data:
      'c:\temp\xvar' ⎕XT 'x'
      x
Hello World
      x←'𠀁𠀁𠀁'
TRANSLATION ERROR
• External Variables should be seen as a "deprecated" feature: You will still be able to use existing external variables, but should plan to convert to component files or mapped files at your convenience.
Mapped Files
• Like external variables, the use of APL mapped files (containing APL arrays with header information) should be seen as a deprecated feature.
– Convert to using other mechanisms at your earliest convenience.
• Support for RAW mapped files (where type information is provided when mapping) remains core functionality (and will probably get more important in a world of multicore machines):
      32↓102↑80 ¯1 ⎕MAP 'c:\Program Files\ComfortKeyboard\changes.txt'
Added new interface languages: Latvian, Brazilian Portuguese, Italian.
• Type 82 is NOT supported in the Unicode Edition: Mapped variables are "in the workspace" and cannot be translated on access.
• To read a raw file written using data type 82, map it with data type 83 and extract the characters by indexing into ⎕AVU.
(Own) DLLs and APs
• The format for passing APL arrays to Libraries and Auxiliary Processors is unchanged, except that a Unicode Edition will pass character arrays of type 80, 160 or 320
• Dyalog-provided libraries have been upgraded. A number of old APs like PREFECT are no longer shipped, but v11 versions will continue to work fine with the Classic Edition.
• If you have written your own APs or DLLs which handle character data, these need to be updated to deal with new data types.
• You can return any of the Classic or Unicode character types; they will be translated (subject to the usual TRANSLATION ERROR limitations).
Native Files
• Unicode Edition also still supports type 82, so that old files containing APL characters can be used. They map to the "same characters", but with a different internal representation:
  V11:
      'c:\temp\plus' ⎕NCREATE ¯1
      '{⍺+⍵}' ⎕NAPPEND ¯1
  V12:
      ⎕DR ⎕←⎕NREAD ¯1 82 5 0
{⍺+⍵}
160
Native Files & Unicode
• Unicode Edition supports new data types 80, 160, 320 – reading or writing 1, 2 or 4 bytes at a time (file is UCS-1, -2 or -4 encoded).
• Code Change Possibly Required: The DEFAULT TYPE when appending character arrays is now 80 (was 82):
      'plus:' ⎕NAPPEND ¯2      ⍝ Type 80 (all ANSI)
      '{⍺+⍵}' ⎕NAPPEND ¯1     ⍝ Type 160 (APL chars)
DOMAIN ERROR                   ⍝ Data cannot be narrowed
• Early Beta versions of 12.0 used the type of the left argument, but this led to variable numbers of bytes being written depending on the content of an array (160 if a non-ANSI character was included).
• If you need to write text containing APL to a native file, use type 160, or perhaps better, use UTF-8!
Native Files & UTF-8
• The most common way to store Unicode data in text files is to encode it using UTF-8: This is a format understood by "most" web applications and other Unicode-enabled applications.
      text←'plus←{⍺+⍵}'
      'UTF-8' ⎕UCS 'plus'
112 108 117 115
      'c:\temp\plus.txt' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS 'plus') ⎕NAPPEND ¯1
      ⎕CMD 'notepad c:\temp\plus.txt' 'normal'
• Windows Notepad is able to detect that the file is UTF-8 encoded and displays the text correctly.
• The monadic ⎕UCS on the left converts integers in the range 0-255 into one-byte Unicode characters before appending. Without it, integers above 127 would be written as type 163 (2 bytes per element).
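For comparison, the same "encode, then write raw bytes" pattern can be sketched in Python (used here only because it runs without Dyalog). The file name and temporary directory are assumptions for the example.

```python
import os
import tempfile

text = "plus←{⍺+⍵}"

# Encode first, then write bytes: the file never sees "characters".
path = os.path.join(tempfile.mkdtemp(), "plus.txt")
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Reading is the reverse: raw bytes in, decode as UTF-8.
with open(path, "rb") as f:
    raw = f.read()
round_trip = raw.decode("utf-8")
print(len(text), "characters,", len(raw), "bytes")
```

The ASCII part of the text occupies one byte per character; each APL symbol occupies three.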
• UCS-2 (2 bytes per character) is supported by many Microsoft apps (like Visual Studio). UCS-2 was the standard until Windows 2000 – now replaced by UTF-16, which is identical to UCS-2 for most data, but expands to 4 bytes when required.
• Applications need to know which encoding has been used. Two common methods of indicating this are ”Byte Order Marks” at the beginning of the file, and (for web pages) HTTP tags.
Byte Order Mark
• By convention, the first few bytes of text files are sometimes (but not always) an encoding of U+FEFF, the "Byte Order Mark", also known as "Zero width no-break space".
• This convention allows applications to "guess" the encoding used:
  1st bytes are...  Encoding is therefore probably
  EF BB BF          UTF-8
  FF FE             UTF-16 or UCS-2, written by little endian CPU (Intel)
  FE FF             UTF-16 or UCS-2, big endian
  FF FE 00 00       UTF-32 / UCS-4, little endian
  00 00 FE FF       UTF-32 / UCS-4, big endian
• The convention is more common under Windows than Unix/Linux. Sometimes writing the BOM makes things worse...
Reading Text Files
    ∇ Chars←ReadFile name;nid;signature;nums
      ⍝ Read ANSI or Unicode character file (Windows)
      nid←name ⎕NTIE 0
      signature←3↑⎕NREAD nid 83 3 0
      :If signature≡¯17 ¯69 ¯65               ⍝ UTF-8 (EF BB BF)
          Chars←⎕NREAD nid 80(¯3+⎕NSIZE nid)3
          Chars←'UTF-8' ⎕UCS ⎕UCS Chars
      :ElseIf (2↑signature)≡¯1 ¯2             ⍝ Little-endian UTF-16 (FF FE)
          Chars←⎕NREAD nid 160(¯1+⎕NSIZE nid)2
      :Else                                   ⍝ ANSI
          Chars←⎕NREAD nid 80(⎕NSIZE nid)0
      :EndIf
      ⎕NUNTIE nid
    ∇
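The same guessing strategy, extended to the other signatures in the Byte Order Mark table, might look like this in Python (a sketch only, not a translation of ReadFile). The function name `decode_text` and the choice of cp1252 as the "ANSI" fallback are assumptions.

```python
import codecs

# Longer signatures first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_text(raw, fallback="cp1252"):
    """Guess the encoding from a leading byte order mark, else use the fallback."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)  # drop the mark itself
    return raw.decode(fallback)                     # no mark: assume "ANSI"

sample = codecs.BOM_UTF16_LE + "{⍺+⍵}".encode("utf-16-le")
print(decode_text(sample))
```

As the slides note, this is a guess: a file without a mark may still be UTF-8, and a UTF-16 file whose text begins with a NUL character would be mistaken for UTF-32.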
Writing Text Files
Writing a UTF-8 Web Page
      html←'<html>',NL,' <head>',NL
      html,←' <meta http-equiv="content-type" content="text/html; charset=UTF-8" />'
      html,←' </head>',NL,'<body>',NL
      html,←' <font face="APL385 Unicode">'
      html,←'plus←{⍺+⍵}</font>',NL
      html,←'</body>',NL,'</html>',NL
      'c:\temp\plus.htm' ⎕NCREATE ¯1
      (⎕UCS 'UTF-8' ⎕UCS html) ⎕NAPPEND ¯1
      ⎕NUNTIE ¯1
      ⎕CMD 'iexplore c:\temp\plus.htm' ''
Web Page: Results
UTF-8 Files with .NET
      apltxt←⎕SE.SALT.New 'C:\..\UTF8File' 'c:\temp\apl.txt'
      apltxt.Text
 Compute average in APL:  avg←{(+/⍵)÷⍴⍵}
      apltxt.Text,←⊂'⍝ Morten was here'
      System.Text.Encoding.⎕NL -2
ASCII  BigEndianUnicode  Default  Unicode  UTF32  UTF7  UTF8
External Interfaces: COM/.NET
• COM/OLE, Microsoft.Net: No problem
  – Dyalog has "always" translated chars to UCS-2/UTF-16 for these interfaces
  – Translation code removed in v12 Unicode
• We already saw it in action:
      ↑System.IO.File.ReadAllLines ⊂'c:\temp\apl.txt'
Compute average in APL: avg←{(+/⍵)÷⍴⍵}
SQAPL / ODBC & Unicode
      SQA.Connect 'B' 'MS SQL Server' 'pass' 'user'
(not all results displayed in the following)
      SQA.Columns 'B' 'idioms'
0  COLUMN_NAME .. DATA_TYPE TYPE_NAME    COLUMN_SIZE
   id          ..         4 int identity          10
   exp         ..        ¯9 nvarchar             400
      ⎕←data←3 1⊃SQA.Do 'B' 'select * from idioms'
1 {(+/⍵)÷⍴⍵}
2 {⍵/⍳⍴⍵}
3 {(<\⍵)⍳1}
      data[;2]←{⎕UCS 'UTF-8' ⎕UCS ⍵}¨data[;2]   ⍝ Make UTF8
SQAPL Example (continued)
      SQA.Do 'B' 'alter table idioms add utf8exp varbinary(100)'
      SQA.Prepare 'B.U1' 'update idioms set utf8exp=:<X20: where id=:<I:' ('Bulk' 20)
      SQA.X 'B.U1' (⌽data)   ⍝ Store UTF8
      ⎕←data←3 1⊃SQA.Do 'B' 'select id,exp,utf8exp from idioms'
1 {(+/⍵)÷⍴⍵} {(+/âµ)÷â´âµ}
2 {⍵/⍳⍴⍵}    {âµ/â³â´âµ}
3 {(<\⍵)⍳1}  {(<\âµ)â³1}
      data[;2]≡¨{'UTF-8' ⎕UCS (⎕UCS ⍵)~0}¨data[;3]   ⍝ It works!
1 1 1
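The odd-looking third column is what UTF-8 bytes look like when each byte is shown as a separate one-byte character. A Python sketch of that effect and of the round trip back (Python as a runnable stand-in; Latin-1 is assumed as the single-byte view, which may differ from what a given database client displays):

```python
expr = "{(+/⍵)÷⍴⍵}"

utf8_bytes = expr.encode("utf-8")               # what goes into the binary column
as_single_bytes = utf8_bytes.decode("latin-1")  # one character per stored byte

# Reversing the view and decoding as UTF-8 recovers the original text.
restored = as_single_bytes.encode("latin-1").decode("utf-8")
print(len(expr), len(utf8_bytes), restored == expr)
```

Each ⍵ becomes three bytes, shown as â, an unprintable character and µ, which is why the stored form looks garbled but loses no information.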
ODBC / SQAPL Summary
• SQAPL 6.0 supports ODBC Unicode data types:
  ODBC Type     SQAPL Type  Description
  WCHAR         U           "Wide" fixed-length string
  WVARCHAR      W           "Wide" variable-length
  WLONGVARCHAR  Q           "Wide" unlimited-length
• These can be used in the same way as the single-byte types. In most cases, the choice is automatic (as we have seen).
• Note: The above applies to databases which have Unicode data types. However, Unicode data is often stored in single-byte types, UTF-8 encoded.
• Most of the work will be understanding how to store Unicode in your database, and converting the data (see your Database Manual).
External Interfaces: ⎕NA
• In Classic & previous editions, parameter type C meant untranslated bytes and T meant ”text”, translated to ANSI.
• In Unicode, both are untranslated.
• T without a width specification now means "wide characters according to the host convention"
• Thus: T means T1 in Classic, T2 in Unicode for Windows, and T4 in Unicode for Unix/Linux
• This means that the use of type T (<0T, >0T, =T) should be portable across Classic/Unicode systems
• Some (typically Unix/Linux) system calls expect data to be UTF-8 encoded: You must use dyadic ⎕UCS to do the translation.
• Future extensions to ⎕NA may provide UTF-8 encoding.
Selection of A or W Functions
• Under Windows, Win32 library calls which handle text are generally available in two variants:
  – An ANSI (narrow) version with a name ending in A
  – A Unicode (wide) version with a name ending in W
• For example, the function to display a message box is available as MessageBoxA and MessageBoxW.
• If you specify the character * at the end of a name, this will be replaced by A in Classic and W in the Unicode Edition.
• The intention is to allow you to write code which will work now under Classic and continue to work under Unicode, to facilitate parallel code testing and a controlled migration.
Portable ⎕NA Example
• The following function is portable between Classic and Unicode:
    ∇ ok←title MsgBox msg;MessageBox
      ⎕NA 'I user32|MessageBox* I <0T <0T I'
      ok←1=MessageBox 0 msg title 1   ⍝ 1=OK, 2=Cancel.
    ∇
• The function MessageBoxA will be selected by Classic, MessageBoxW by Unicode.
• <0T will mean 1-byte (translated) text under Classic, and 2-byte (untranslated) text under Unicode
  – Strictly speaking, text should be translated to UTF-16 in Classic, but this is only required for "a few" special chars
APL Source in Unicode Files
• SALT (Simple APL Library Toolkit) supports storage of functions, namespaces and classes in UTF-8 files with a .dyalog extension.
• You can also very easily write your own storage mechanism using Unicode text files. Under .Net it is trivial:
  Save:
      System.IO.File.WriteAllText 'c:\temp\foo.txt' (⎕VR 'foo') System.Text.Encoding.UTF8
  Load:
      ⎕FX System.IO.File.ReadAllText ⊂'c:\temp\foo.txt'
• Without .Net it requires a wee bit more work (as we have seen earlier)
Source Code Management
• Storing APL source in Unicode text files may seem less convenient to the seasoned APL programmer, but there are very significant advantages:
• High quality tools (both free and ”commercial”) built for other languages can be used to edit, compare, manage source, and build systems – without further ado
• Not only does this make it easier to position APL as a tool for ”professional” software development, many of these tools are actually useful (there are some smart people ”out there”)
• Young developers joining your APL team will already be familiar with these tools and feel ”at home” more quickly
• The quality of life of the APL developer need not be sacrificed!
Demo of Source Code Mgt
Source Code Mgt Demo
• All tools shown here were downloaded from the internet; none of them knows anything about APL.
Demo: Working with MyApp
Keyboarding
• Discuss IME vs new Keyboards
• Demo new Console Unix/Linux APLs
Migration Check List
• Are you migrating in order to simplify and stay current, or because you want to support "foreign" text in your application?
  – Probably, you should do the former first (or at least experiment with it), before trying the latter
• For the former, you only need to make sure that your interfaces to external systems (native files, databases etc) work the same way as before
  – You may need to add checks to prevent the inadvertent entry of Unicode characters that your external interfaces cannot handle
• For the latter, you need to be sure that external systems ALSO support Unicode, and how they want to exchange data with your application
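A check of the kind mentioned above could be as simple as testing each character against the encoding the external interface expects. The Python below is a sketch under assumptions: the function name is invented, and cp1252 stands in for whatever single-byte encoding a given external system actually requires.

```python
def unsupported_characters(text, encoding="cp1252"):
    """Return the characters in text that the target encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

print(unsupported_characters("Æble"))     # representable in the assumed encoding
print(unsupported_characters("avg←{⍵}"))  # contains characters that are not
```

An application would reject or flag input for which the returned list is not empty, before the data reaches the external interface.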
Think about ...
• (Dyadic) ⎕DR
• Monadic ⍋ of char data
• APL style TCP Sockets
• Interop required with earlier versions?
• External Vars
• Mapped Files
• Own DLLs and APs
• Native Files
  – Need non-⎕AV/ANSI data
  – Convert to UTF-8?
• Win32 or other system calls via ⎕NA
• Underscores(!)
• Switching to SALT / SubVersion?
Suggested Strategy
• Migrate to v12 Classic, write code which works in both Classic & Unicode.
• Wait until the entire user base has upgraded to v12.
• Move application to Unicode Edition.
• Suggested timeframe for a large application with many interfaces might be 2-4 years.
• Start thinking now!