CDuce: a white paper

(working document)

Veronique Benzaken
LRI, UMR 8623, C.N.R.S., Universite Paris-Sud, Orsay, France
Veronique.Benzaken@lri.fr

Giuseppe Castagna
C.N.R.S., Departement d’Informatique, Ecole Normale Superieure, Paris, France
Giuseppe.Castagna@ens.fr

Alain Frisch
Departement d’Informatique, Ecole Normale Superieure, Paris, France
Alain.Frisch@ens.fr

Version of October 22, 2002

Document regularly updated; the latest version can be retrieved at www.cduce.org

Abstract

In this paper, we present the functional language CDuce, discuss some design issues, and show its adequacy for working with XML documents. Distinctive features of CDuce are powerful pattern matching, first class functions, overloaded functions, a very rich type system (arrows, sequences, pairs, records, intersections, unions, differences), precise type inference and a natural interpretation of types as sets of values. We also discuss how to add constructs for programming XML queries in a declarative (and, thus, optimizable) way and finally sketch a dispatch algorithm to demonstrate how static type information can be used in efficient compilation schemes.

1 Introduction

In this paper, we present the functional language CDuce, discuss some design issues, and show its adequacy for writing applications that handle, transform, and query XML documents. To keep the presentation short, and because part of the design is still in progress, we just present some highlights of the language. Theoretical foundations of the type system can be found in [8]. The homepage for CDuce, including an extended version of this paper, other references, and an online interactive prototype, is http://www.cduce.org.

CDuce is a general purpose typed functional programming language, whose design is guided by keeping XML applications in mind. The work on CDuce started from an attempt to overcome some limitations of XDuce [11, 9, 10]:

• General purpose types. XDuce is XML-specific: the only

datatype it can manipulate is (sequences of) XML documents; this makes it difficult to write complex applications, which are not just simple transformations (filtering, reordering, renaming). XDuce demonstrates how specific features (regular expression types and patterns) may be adequate to XML applications, but we believe that these features could be integrated in a less specific language, improving the interface between XML and the core application. CDuce uses a general purpose type algebra, with standard type constructors (products, records, functions), and retains all the power of XDuce regular expression types through explicit use of recursive types and boolean combinations (union, intersection, difference). It is possible with CDuce to create complex data structures, model XML document types, and interface smoothly with other languages.

• Expressive patterns. As a by-product, we extended the pattern algebra, making it possible to extract non-consecutive subsequences of elements with a single pattern; unlike XDuce^1, we have an exact typing even for non-tail variables, and the pattern algorithms are easily derived from simple definitions [8].

• Higher-order and overloading. XDuce is a functional language, with a pattern matching reminiscent of ML, but it lacks higher-order functions. The CDuce type system includes higher-order functions, and also provides late-bound overloaded functions. Such functions may dispatch on the dynamic type of their argument, making it possible to use an object-oriented programming style in a functional setting. We are also considering supporting incremental programming in CDuce through a module system allowing further redefinition/specialization of overloaded functions.

• XML attributes. To encode XML attribute sets in a classical language, the natural datatype seems to be record types. To take into account attribute specificities (an attribute may be optional, mandatory, or prohibited), we designed special record type constructors.

• Type constraints. An XML document is not only a tree structure; it also has basic information, such as numbers or structured strings, in attributes or element contents. We believe that the type system and the pattern matching should reflect this in some detail, for instance to express (and validate) some constraints on the produced XML documents (for example, that a date attribute is a string of the form YYYY-MM-DD).

A preliminary and shorter version of this work was presented at the workshop PLAN-X: Programming Language Technologies for XML, Pittsburgh, PA, October 2002.

^1 Hosoya has drawn our attention to his independent work (not yet published) on this issue; he has recently implemented in XDuce the support for exact typing of non-tail variables, using an automaton-based approach.


During our work we found out that CDuce and, more generally, XML applications raise some interesting implementation issues. As a matter of fact, we verified that precise type information on the documents the applications work on allows many important optimizations (for instance, searching for an element with a given name in a whole document is of course much easier when one knows where such an element may potentially be).

2 A sample session

Let us write and comment a sample CDuce program, along the lines of the typical example of [7]. First, we declare some types:

    type Bib = <bib>[Book*];;
    type Book = <book>[Title Year Author+];;
    type Year = <year>[IntStr];;
    type Title = <title>[String];;
    type Author = <author>[String];;
    type IntStr = /['0'-'9']+/;;

The type Bib represents XML documents that store bibliographic information (the corresponding XSchema can be found in [7]). It states that a bibliography is a possibly empty sequence of book elements, each consisting of a sequence formed by one title element, one year element and one or more author elements. Square brackets [...] are used to denote sequences, whose type is constrained by a regular expression over types. Regular expressions for character strings are enclosed in /.../ (the slashes can be omitted when the expression consists of a single string): for example the type IntStr specifies that the content of a <year> element is a string representation of an integer. We could have used integers directly by defining Year as <year>[1900..maxint].

An XML document satisfying the above type is the following:

    <bib>
      <book>
        <title>Persistent Object Systems</title>
        <year>1994</year>
        <author>M. Atkinson</author>
        <author>V. Benzaken</author>
        <author>D. Maier</author>
      </book>
      <book>
        <title>OOP: a unified foundation</title>
        <year>1997</year>
        <author>G. Castagna</author>
      </book>
    </bib>

If the document is stored in a file, say bib.xml, then it can be loaded with the built-in operator load_xml, assigned to a local variable bib0, and immediately checked to be of type Bib by pattern matching:

    let bib0 =
      match (load_xml "bib.xml") with
      | (x & Bib) -> x
      | _ -> error "Wrong type !";;

When this declaration is entered interactively the system answers:

    |- bib0 : Bib

which indicates that the type checker keeps track of the fact that bib0 is indeed of type Bib (error raises a fatal exception when the loaded document is not of the correct type). We could have defined bib0 directly as follows:

    let bib0 =
      <bib>[
        <book>[
          <title>["Persistent Object Systems"]
          <year>["1994"]
          <author>["M. Atkinson"]
          <author>["V. Benzaken"]
          <author>["D. Maier"]
        ]
        <book>[
          <title>["OOP: a unified foundation"]
          <year>["1997"]
          <author>["G. Castagna"]
        ]
      ]

Suppose that instead of working with the XML type Bib we preferred to use an internal representation for bibliographies based on record types. We can easily define this type and a conversion function as follows:

    type Intern =
      {title=String; year=Int; authors=[String+]};;

    let fun intern (Bib -> [Intern*])
      <bib>l -> map l with
        <book>[<title>[t] <year>[y] a::Author+] ->
          {title = t;
           year = int_of_string y;
           authors = map a with <author>[x] -> x};;
    |- intern : Bib -> [Intern*]

    let bib1 = intern bib0;;
    |- bib1 : [Intern*]

The let fun construction defines the function intern, whose type is Bib->[Intern*] as declared in the interface of the function, which follows its name.

The first pattern extracts the sequence of books and binds it to the variable l; the map l with ... expression transforms each element of l into the corresponding record; the last pattern matching removes the author tags. Note that the application of int_of_string cannot fail because the value y is of type IntStr; otherwise, the system would have issued a warning. In the pattern <book>[...], we observe two kinds of capture variables: t and y capture a single object, whereas a captures a sequence of objects (x::E binds to x the whole subsequence matching the regular expression E).

We can now define an overloaded function that extracts the list of authors from either representation; if the argument is an Intern object, then we get a sequence of strings; otherwise, we get a sequence of Author elements.

    let fun authors
      (Intern->[String+]; Book->[Author+])
      | { authors = a }
      | <book>[<title>_ <year>_; a] -> a;;

The matching expression in the body of the function is formed by a single branch (with an alternative pattern: that is, ({...}|<book>[...]) -> a). The function interface reflects


the overloaded aspect of the function as it declares two distinct arrow types. The underscore symbol _ matches every expression, while the semicolon followed by a binds the rest of the sequence to a (the semicolon expression ;x is syntactic sugar for x::(_*)): the type system infers that a captures an object of type [Author+] when the argument of the function is of type Book.

    let fun extract ([(String|Author)+] -> [String+])
      l -> map l with
        | <author>[a] -> a
        | a -> a;;

This function takes a list of values that are either strings or author elements and removes the tags; the | in types stands for union. Now, any sequence of type [String+] or [Author+] is a fortiori an acceptable argument for this function, so we can define:

    let fun authors2 ( (Intern|Book) -> [String+] )
      b -> extract (authors b);;

(of course, there are more direct ways to implement this function.)

Another possible XML representation for the bibliography would be a flat sequence of elements, as defined by the type:

    type Flat = [ (Title Year Author+)* ];;

The corresponding pair of conversion functions is:

    let fun flatten_bib ([Book*] -> Flat)
      l -> transform l with <book>x -> x;;
    |- flatten_bib : [Book*] -> Flat

    let fun unflatten_bib (Flat -> [Book*])
      [ b::(Title Year Author+); r ] ->
          [<book>b] @ (unflatten_bib r)
      | [] -> [];;
    |- unflatten_bib : Flat -> [Book*]

The @ operator performs sequence concatenation. The transform construction is very much like map; the main difference is that all the returned sequences are concatenated together (that is, it flattens the result that map would return).
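The difference between map and transform can also be illustrated outside CDuce. The following is a minimal Python sketch, not CDuce code: elements are modelled as (tag, children) pairs, and the names map_seq, transform_seq and unflatten are hypothetical helpers introduced only for this illustration.

```python
def map_seq(seq, f):
    """Model of `map l with p -> e`: one result per element."""
    return [f(x) for x in seq]


def transform_seq(seq, f):
    """Model of `transform l with p -> e`: each result is a sequence,
    and all the results are concatenated together."""
    out = []
    for x in seq:
        out.extend(f(x))
    return out


# Placeholder data: two books encoded as (tag, children) pairs.
books = [
    ("book", [("title", "A"), ("year", "1994"), ("author", "X"), ("author", "Y")]),
    ("book", [("title", "B"), ("year", "1997"), ("author", "Z")]),
]


def content(book):
    """Counterpart of the branch `<book>x -> x`."""
    return book[1]


nested = map_seq(books, content)      # a sequence of two sequences
flat = transform_seq(books, content)  # one flat sequence of seven elements


def unflatten(flat_seq):
    """Regroup each `Title Year Author+` run into a book element."""
    result = []
    for elem in flat_seq:
        if elem[0] == "title":
            result.append(("book", [elem]))  # a title starts a new book
        else:
            result[-1][1].append(elem)       # year/author extend the current book
    return result
```

In this model, unflatten(flat) rebuilds the original list of books, mirroring the round trip between flatten_bib and unflatten_bib.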

It is possible to use a value in type position to force that specific value; for instance if we declare

    type Chair_auth = <author>["Pierce"|"Wadler"];;
    type Chair = <book>[_* Chair_auth _*];;

then Chair is the type of books where either Pierce or Wadler (chairmen of the PLAN-X workshop) appears as one author. Now we can define an extraction function:

    let fun chair_books (Bib -> [(Chair & Book)*])
      <bib>[(b::Chair | _)*] -> b;;
    |- chair_books: Bib -> [(Chair & Book)*]

The & denotes the intersection of types, thus the interface of the function declares that the result will be both of type Chair and of type Book (these two being incomparable). Note that the b variable is under a repetition operator (the * Kleene-star in

regular expressions); the meaning is that all the matching subsequences are concatenated together (here, each of them has a single element). As explained in Section 3.6 on pattern matching, a variable can occur several times in the same pattern (in the case above because of the repetition operator): when this happens the variable is bound to the recomposition of the bindings of all occurrences by the constructor they appear in (in the case above all the bindings of the occurrences of b are recomposed into a sequence). This example demonstrates how a single pattern can perform a quite complex operation.
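To make the recomposition of bindings under a repetition more concrete, here is a small Python sketch of what the chair_books pattern computes. It is only a model under stated assumptions: books are (tag, children) pairs, the data is placeholder data, and is_chair is a hypothetical stand-in for the type test against Chair.

```python
CHAIRS = {"Pierce", "Wadler"}  # the string singletons used in Chair_auth


def is_chair(book):
    """Stand-in for the type test `Chair`: some author is Pierce or Wadler."""
    tag, children = book
    return tag == "book" and any(
        child_tag == "author" and text in CHAIRS for child_tag, text in children
    )


def chair_books(bib):
    """Model of `<bib>[(b::Chair | _)*] -> b`."""
    tag, books = bib
    if tag != "bib":
        raise ValueError("not a <bib> element")
    b = []
    for item in books:
        if is_chair(item):
            # Each iteration binds b to a one-element subsequence;
            # the bindings of all iterations are concatenated.
            b = b + [item]
        # Otherwise the `_` alternative matches and binds nothing.
    return b


# Placeholder bibliography with three books.
bib = ("bib", [
    ("book", [("title", "T1"), ("year", "2002"), ("author", "Pierce")]),
    ("book", [("title", "T2"), ("year", "1997"), ("author", "Castagna")]),
    ("book", [("title", "T3"), ("year", "1999"), ("author", "A. Author"), ("author", "Wadler")]),
])
selected = chair_books(bib)
```

Unlike CDuce, this sketch gives no static guarantee about the result; the point of the paper's example is precisely that the type checker infers [(Chair & Book)*] for it.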

The type Chair does not say anything about the year and title elements. But since the argument of the function is known to be of type Bib, it is possible to infer (and CDuce does it) that the extracted values are both of type Chair and of type Book (this is indicated by the intersection type operator &).

3 Overview of the CDuce language

3.1 The type algebra

The CDuce type algebra has no specific constructor for sequences and XML documents. The constructions we used in the previous section are encoded, as shown in Section 3.2, in the core type algebra formed by the following types:

• basic scalar types, such as Int, String, Bool, etc.,

atoms (an atom is a constant of the form `id where id is an arbitrary identifier) and two type constants Empty and Any (the latter is also written _, especially in patterns) that denote respectively the empty (i.e., the smallest) and the universal (i.e., the largest) type;

• classical type constructors: product types (t1,t2), record types { a1 = t1; ...; an = tn }, and functional types (t1 -> t2);

• boolean connectives: intersection t1&t2, union t1|t2 and difference t1\t2;

• singleton types: for any scalar or constructed (non-functional) value v, v is itself a type (for instance, the value `nil denotes also the type of empty sequences, while 18 is the type of the integer 18);

• recursive types: they are defined by recursive toplevel declarations or by the syntax T where T1 = t1 and ... and Tn = tn, where T and the Ti are type identifiers (i.e., identifiers starting with a capital letter). For instance, the type of sequences of integers may be written Ilist where Ilist = (Int, Ilist) | `nil.

In CDuce types have a set-theoretic interpretation: a type is the set of all values (i.e. closed irreducible expressions: roughly, expressions that are neither applications, nor field selections, nor matching expressions; sometimes we equivalently use the word “result” instead of “value”) that have that type. For example, the type (t1,t2) is the set of all expressions (v1,v2) where vi is a value of type ti; similarly t1->t2 is the set of all functional expressions fun f (s1;...;sn) e that have type


t1->t2 (that is, all the functional expressions that when applied to an expression of type t1 return a result in t2). Likewise, when a value is used in a type position (as in Chair_auth in the previous section) it denotes the singleton containing that value (whence the name of singleton types).

This interpretation of types is the basis of the whole intuition behind the CDuce type system: the programmer must rely on it to understand all the type constructions and type equivalences of the system. Thus, for example, the difference of two types contains all the values that are contained in the first type but not in the second, the union of two types is formed by all the values of each type, and the intersection of, say, an arrow and a record is equivalent to (in the sense that it has the same interpretation as) the empty type.

In particular, subtyping is just set inclusion: a type is a subtype of another if the latter contains all the values that are in the former (for more details see [8]).
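The set-theoretic reading can be tried out on a toy scale. The sketch below is a Python model, not part of CDuce: it fixes a small finite universe of values (an assumption made only so that sets can be enumerated), represents each type as a frozenset of values, and uses hypothetical helper names (singleton, pair, subtype).

```python
from itertools import product

# A deliberately tiny universe: a few scalars and all pairs of scalars.
ATOMS = [0, 1, 18, "a", "b"]
UNIVERSE = frozenset(ATOMS) | frozenset(product(ATOMS, ATOMS))

# A type is modelled as the set of values that have that type.
Int = frozenset(v for v in UNIVERSE if isinstance(v, int))
String = frozenset(v for v in UNIVERSE if isinstance(v, str))
Any = UNIVERSE
Empty = frozenset()


def singleton(v):
    """A value used in type position denotes the set containing only it."""
    return frozenset([v])


def pair(t1, t2):
    """Product type (t1,t2), restricted to pairs present in the universe."""
    return frozenset((a, b) for a in t1 for b in t2 if (a, b) in UNIVERSE)


def subtype(s, t):
    """Subtyping is set inclusion."""
    return s <= t


# Boolean connectives are the usual set operations:
int_and_string = Int & String         # intersection: no common value
not_int = (Int | String) - Int        # difference of a union
```

In this model subtype(singleton(18), Int) holds, Int & String is the empty type, and a product is covariant in both components. Real CDuce types denote infinite sets, so the actual subtyping algorithm of [8] cannot work by enumeration.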

There are actually two kinds of record types: the open record type { a1 = t1; ...; an = tn } that classifies records in which the fields labeled ai are present with the prescribed type, but other fields may also appear; the closed record type {| a1 = t1; ...; an = tn |}, instead, forbids any label other than the ai's^2. It is also possible (both for open and for closed record types) to specify optional fields: the syntax ai =? ti states that the ai field may be absent, but when it is present, it must have type ti. Many natural subtyping and equivalence relations hold for record types, like {|a = t|} ≤ { a = t }, or { a1 = t1; a2 = t2 } ≃ { a1 = t1 } & { a2 = t2 }, or {|a1=t1; a2=?t2|} ≃ {|a1=t1|} | {|a1=t1; a2=t2|}; and once more they can all be deduced by considering the set-theoretic interpretation of record types as sets of record values.

For scalar types, we also introduce specific subtypes of Int and String: i..j ≤ Int is an interval (i and j are integer constants), and /regexp/ ≤ String is the set of character strings described by the regular expression regexp. Regular expressions are built from string constants, the wild-card “.”, character classes, and the usual regular expression operators |, *, ?, + and concatenation.

3.2 XML documents

CDuce is close in spirit and in syntax to other XML-oriented languages such as XDuce or the algebra introduced in [7]. However, it is a general purpose programming language based on a very small functional core. In particular, XML-related types are encoded in terms of the type constructors we presented in the previous section.

Sequences. Sequences are encoded à la Lisp using pairs and a terminator `nil representing the empty sequence: a sequence of values v1, ..., vn is written in CDuce as [ v1 ... vn ], but this actually is only syntactic sugar for (v1,(...,(vn,`nil)...)).

^2 There is a small subtlety about singleton record types. For instance, the type { x = 3 }, being open, contains all the records that have field x = 3 and maybe other fields too. The singleton type corresponding to the value { x = 3 } must be written {| x = 3 |}. We have chosen the same notation for record values and open record types, because we believe that open record types are much more useful in programming than closed ones.

In the sample session we saw that regular expressions can

be applied to types to define new sequence types. This is just syntactic sugar as the same sequence types could be defined by combining boolean type connectives and recursive types. More precisely it is possible to define sequence types by [tyregexp] where tyregexp is a regular expression built from types and the usual regular expression operators. For instance, the Ilist type in the previous section is equivalent to [Int*] while [Int* String+] represents sequences built from a possibly empty list of integers concatenated with a non-empty list of strings.
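The encoding of sequences as nested pairs, and the reading of a sequence type as a regular expression over types, can be sketched in a few lines of Python. This is an illustrative model only: the string "nil" stands in for the atom `nil, and encode, decode, is_int_list and matches_int_star_string_plus are hypothetical names.

```python
import re

NIL = "nil"  # stands in for the atom `nil (assumption of this model)


def encode(values):
    """[v1 ... vn]  ->  (v1, (..., (vn, `nil)...))"""
    seq = NIL
    for v in reversed(values):
        seq = (v, seq)
    return seq


def decode(seq):
    """Inverse of encode: walk the nested pairs down to the terminator."""
    out = []
    while seq != NIL:
        head, seq = seq
        out.append(head)
    return out


def is_int_list(seq):
    """Membership in `Ilist where Ilist = (Int, Ilist) | `nil`, i.e. [Int*]."""
    while seq != NIL:
        if not (isinstance(seq, tuple) and len(seq) == 2 and isinstance(seq[0], int)):
            return False
        seq = seq[1]
    return True


def matches_int_star_string_plus(seq):
    """Membership in [Int* String+]: abstract each element by the letter of
    its type, then match the resulting word against the regular expression."""
    word = "".join(
        "i" if isinstance(v, int) else "s" if isinstance(v, str) else "?"
        for v in decode(seq)
    )
    return re.fullmatch(r"i*s+", word) is not None
```

For instance, encode([1, 2]) yields (1, (2, "nil")), and a sequence of two integers followed by one string belongs to the model of [Int* String+] but not to the model of [Int*].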

XML elements. The value `nil is just a special case of atom `id where id is any identifier. Atoms are also used to encode XML tags. We saw in the sample session that an XML element <tag> elem-seq </tag> can be written in CDuce as <tag>[elem-seq] where elem-seq is a sequence of elements. This latter notation actually is syntactic sugar for (`tag,({},[elem-seq])). The expression {} denotes the empty record value. In its more general form tags can have attributes as for <tag a1 = v1 ... an = vn> elem-seq </tag>. This is written in CDuce as <tag {a1 = v1; ...; an = vn}>[elem-seq] which is syntactic sugar for (`tag,({a1 = v1; ...; an = vn},[elem-seq])). When appearing in tags curly braces and semicolons can be omitted as in:

    <a href="click.htm" target="_top">["Click here"]

We applied this convention in all the examples of the sample session as we always omitted the pair of braces {} denoting the empty record.
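The encoding of an element as a triple-like nesting of tag, attribute record and content sequence can be mimicked as follows. This is a Python sketch under stated assumptions: tags are plain strings rather than atoms, attribute records are dicts, sequences are lists, and elem and to_xml are hypothetical helpers (to_xml performs no escaping).

```python
def elem(tag, attrs, children):
    """Model of <tag {attrs}>[children] as (`tag, ({attrs}, [children]))."""
    return (tag, (dict(attrs), list(children)))


def to_xml(value):
    """Print the encoded value back in XML surface syntax (no escaping)."""
    if isinstance(value, str):
        return value
    tag, (attrs, children) = value
    attr_text = "".join(f' {name}="{val}"' for name, val in attrs.items())
    body = "".join(to_xml(child) for child in children)
    return f"<{tag}{attr_text}>{body}</{tag}>"


# The hyperlink example: attributes form a record, the text is the content.
link = elem("a", {"href": "click.htm", "target": "_top"}, ["Click here"])

# An element without attributes carries the empty record {}.
title = elem("title", {}, ["OOP: a unified foundation"])
```

Because elements are ordinary pairs in this encoding, no XML-specific machinery is needed to take them apart: tag, (attrs, children) = link is enough.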

As an illustration, here is a set of declarations for an XML document type representing a (flat) address book where the address tag has the optional attribute kind:

    type AddrBook = <book>Content;;
    type Content = [(Name Addr Tel?)*];;
    type Name = <name>[String];;
    type Addr = <addr {kind=?"home"|"work"}>[String];;
    type Tel = <tel>[String];;

The same convention as for record expressions is used for open record types occurring in tags of XML types (and patterns, see Section 3.6), thus an equivalent notation for the Addr type is:

    type Addr = <addr kind=?"home"|"work">[String];;

On the contrary the {| |} parentheses for closed record types cannot be omitted. For example the type

    type Addr1 = <addr {|kind =? "home"|"work"|}>[...]

matches elements with tag addr that have at most the kind attribute with the specified type (elements of type Addr, instead, can have arbitrary attributes with the only restriction that if present the kind attribute must have the specified type).

This kind of flat representation (mixing fields for all the entries in the address book) is a little weird; here is the CDuce function that transforms it into a sequence of entries (coded as records) where the address attribute is discarded:


    type Entry = { name = String; addr = String;
                   tel =? String };;

    let fun parse (Content -> [Entry*])
      [<name>[n] <addr>[a] <tel>[t]; r] ->
          ({ name = n; addr = a; tel = t }, parse r)
      | [<name>[n] <addr>[a]; r] ->
          ({ name = n; addr = a }, parse r)
      | [] -> [];;

The function body is made of a pattern matching, that is a sequence of branches p -> e where p is a pattern and e an expression. The powerful pattern algebra (discussed in Section 3.6) greatly contributes to CDuce expressivity. For instance, a single pattern can filter out of an address book all the telephone numbers:

    let fun notel (Content -> [(Name Addr)*])
      [ (x::(Name Addr) Tel?)* ] -> x;;

The several matched subsequences for x are concatenated together; the pattern could also be written [(x::(Name|Addr) Tel?)*] and the CDuce type-checker would still be clever enough to infer that x can only bind a sequence of type [(Name Addr)*]. The following function splits a list of Entry records according to whether the phone number is present or not:

    type A = Entry & { tel = Any };;
    type B = Entry & { tel =? Empty };;

    let fun split ([Entry*] -> ([A*],[B*]))
      [((x::A) | (y::B))*] -> (x,y);;

Note the definition of B. The record type {tel=?Empty} means “whenever the field tel is present, its value must belong to the type Empty”; as there is no value of type Empty, this is equivalent to saying “the field tel must be absent”.
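For readers who prefer to see the same processing spelled out imperatively, here is a Python sketch of parse and split. It is a model under assumptions, not a translation with the same guarantees: elements are simplified to (tag, text) pairs, records to dicts, the input is assumed to be well-formed Content, and the sample data is made up.

```python
def parse(content):
    """Model of parse: Content -> [Entry*].
    `content` is a flat list of (tag, text) pairs following (Name Addr Tel?)*."""
    entries = []
    i = 0
    while i < len(content):
        (tag1, name), (tag2, addr) = content[i], content[i + 1]
        assert (tag1, tag2) == ("name", "addr")
        entry = {"name": name, "addr": addr}
        i += 2
        # The optional Tel element: consume it only when it is there.
        if i < len(content) and content[i][0] == "tel":
            entry["tel"] = content[i][1]
            i += 1
        entries.append(entry)
    return entries


def split(entries):
    """Model of split: entries of type A have a tel field,
    entries of type B are those where tel is absent."""
    with_tel = [e for e in entries if "tel" in e]
    without_tel = [e for e in entries if "tel" not in e]
    return with_tel, without_tel


# Made-up address book content.
content = [
    ("name", "Ann"), ("addr", "1 Main St"), ("tel", "555-0100"),
    ("name", "Bob"), ("addr", "2 High St"),
]
entries = parse(content)
a_entries, b_entries = split(entries)
```

The CDuce versions say the same thing in a couple of patterns, and additionally let the type checker verify that the two components of the result have types [A*] and [B*].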

3.3 Overloaded functions

The simplest form for a toplevel function declaration is

    let fun f (t->s) x -> e

in which the body of a function is formed by a single branch x->e of pattern matching. As we saw in the previous sections, the body of a function may be formed by several branches with complex patterns. The interface (t->s) specifies a constraint on the behavior of the function to be checked by the type system: when applied to an argument of type t, the function returns a result of type s. In general the interface of a function may specify several such constraints, as we did for example in the authors function in Section 2.

The general form of a toplevel function declaration is thus:

    let fun f (t1->s1; ...; tn->sn)
      | p1->e1 | ... | pm->em

(the first vertical bar can be omitted). Such a function accepts arguments of type (t1|...|tn), it has all the types ti->si, and, thus, it also has their intersection (t1->s1 & ... & tn->sn).

The use of several arrow types in an interface serves to give the function a more precise (that is, smaller) type. We can roughly distinguish two different uses of multiple arrow types in an interface:

1. when each arrow type specifies the behavior of a different piece of code forming the body of the function, the compound interface serves to specify the overloaded behavior

of the function. This is the case for the authors function of Section 2 or for the function below

    let fun add ((Int,Int)->Int;
                 (String,String)->String)
      | (x & Int, y & Int) -> x+y
      | (x & String, y & String) -> x^y;;

where each arrow type in the interface refers to a different branch of the body.

2. when the arrow types specify different behaviors for the same code, then the compound interface serves to give a more precise description of the behavior of the function. For example the interface below

    type IL = [Int*];;      integer list
    type NE = [Int+];;      non empty list
    type EE = ([],[]);;     pair of empty lists

    let fun concat ( ((IL,IL)\EE)->NE; EE->[])
      | ((h,t),l2) -> (h, concat (t,l2))
      | ([],l2) -> l2;;

specifies that the function concat returns an empty sequence if and only if both the argument sequences are empty (and the type checker verifies it).

Of course, these two uses can be combined, allowing the definition of overloaded functions with a precise typing.
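The two uses of a compound interface can be approximated dynamically in Python. The sketch below is only an analogy: Python has no intersection types, so the first use becomes a run-time dispatch on the argument, and the second becomes a property checked by brute force on a few inputs rather than proved by a type checker. The helper interface_holds is a hypothetical name.

```python
def add(arg):
    """Overloaded behavior: one branch per arrow type of the interface."""
    x, y = arg
    if isinstance(x, int) and isinstance(y, int):
        return x + y          # branch for (Int,Int) -> Int
    if isinstance(x, str) and isinstance(y, str):
        return x + y          # branch for (String,String) -> String (x^y in the paper)
    raise TypeError("argument outside (Int,Int) | (String,String)")


def concat(l1, l2):
    """Same code for every argument; the interface only refines its description."""
    if l1:
        head, tail = l1[0], l1[1:]
        return [head] + concat(tail, l2)
    return l2


def interface_holds(samples):
    """Check on samples what the interface ((IL,IL)\\EE)->NE; EE->[] states:
    the result is empty if and only if both arguments are empty."""
    return all(
        (concat(a, b) == []) == (a == [] and b == [])
        for a in samples
        for b in samples
    )


samples = [[], [1], [1, 2]]
```

Here interface_holds(samples) evaluates to True, which is weak evidence on nine cases; in CDuce the same statement is established once and for all at compile time.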

A typical use for overloaded functions in the XML setting could be the definition of transformations that operate on documents of different DTDs and, for example, uniformly replace some tag by another (changing in this way the type) as in the following case:

    type Movie = <movie>[Title <producer>[String]];;
    type Film = <film>[Titre <producteur>[String]];;
    type Livre = <livre>[Titre Annee Auteur+];;
    type Titre = <titre>[String];;
    ...

    let fun to_french (String -> String;
                       Book -> Livre;
                       Movie -> Film;
                       Title -> Titre; ...)
      <(t)>x ->
        let u = match t with
          | `book -> `livre | `movie -> `film
          | `title -> `titre | `year -> `annee
          | `author -> `auteur | `producer -> `producteur
          | y -> y
        in
        <(u)>(map x with e -> to_french e);;

Similarly overloading can be used to produce documents structured in different ways according to the value of some tag attribute, and this is particularly useful in case of recursive definitions. We will show this by an example that we comment in detail in Section 3.8.

3.4 Higher-order functions

Functions are first class values in CDuce. This means that a function can be fed to or returned by a so-called higher-order function. The syntax for a local function is the same as a toplevel function declaration, fun f (t1->s1; ...; tn->sn) ..., with the only difference that f may be omitted if the function is not recursive. A classical example of


higher order function is the composition function, which takes two functions and returns as result the function that composes them:

    let fun composition ( (s->t,t->u) -> (s->u) )
      (f,g) -> fun (s->u) x -> g(f x)

Higher-order holds for all functions, overloaded functions included. For example consider the following function which takes a binary integer function and an ITree, and returns an integer:

    type ITree = Int | (ITree,ITree);;

    let fun squeeze((((Int,Int)->Int),ITree) -> Int)
      | (f,(x,y)) -> f (squeeze(f,x),squeeze(f,y))
      | (f,x) -> x;;
    |- squeeze : (((Int,Int)->Int),ITree) -> Int

It is perfectly correct to pass to squeeze the overloaded function add defined before:

    squeeze(add,(3,(4,3)));;
    |- Int => 10

We can always pass a function of type ((Int,Int)->Int) & ((String,String)->String) where a function of type (Int,Int)->Int is expected, as the former is a subtype of the latter. Once more it would be possible to use a compound interface to specialize the behavior of squeeze. If for example STree = String | (STree,STree), then we could have specified for squeeze the following interface

    ( ((String,String)->String , STree) -> String ;
      ((Int,Int)->Int , ITree) -> Int )

In this case we would have extended the application domain of squeeze since squeeze(add,("a","b")) would typecheck and return the result "ab".
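The composition and squeeze functions have direct counterparts in any language with first class functions. The following Python sketch mirrors them, with the caveat that Python checks nothing statically: the pair (f, tree) is passed as two arguments for readability, and add is redefined here so that the block is self-contained.

```python
def composition(f, g):
    """(s->t, t->u) -> (s->u): the function that applies f, then g."""
    return lambda x: g(f(x))


def add(arg):
    """The overloaded function of Section 3.3, dispatching at run time."""
    x, y = arg
    if isinstance(x, int) and isinstance(y, int):
        return x + y
    if isinstance(x, str) and isinstance(y, str):
        return x + y
    raise TypeError("argument outside (Int,Int) | (String,String)")


def squeeze(f, tree):
    """Fold a binary tree with f; leaves are returned unchanged."""
    if isinstance(tree, tuple):            # branch (f,(x,y))
        x, y = tree
        return f((squeeze(f, x), squeeze(f, y)))
    return tree                            # branch (f,x)


int_result = squeeze(add, (3, (4, 3)))     # 3 + (4 + 3)
str_result = squeeze(add, ("a", "b"))      # the STree case of the extended interface
length_as_text = composition(len, str)     # s = sequences, t = int, u = str
```

The values computed are 10 for the integer tree and "ab" for the string tree, matching the results stated in the text.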

In the setting of XML applications, a typical use of higher order functions is to parametrize the behavior of another function. For instance, the function render below, which is in charge of “rendering” a complex document to HTML, is parametrized by a function of type Entry2html that renders specific parts of the document.

    type Html = ...;;
    type Entry2html = [Name Addr Tel?] -> Html;;
    let fun render ((Entry2html,AddrBook) -> Html)
      ...;;

We next define the function print_entry of type Entry2html that is passed to render to define a new function my_render:

    let fun print_entry (([Name Addr Tel?] | Entry) -> Html)
      | [<name>[n] <addr>[a] <tel>[t]]
      | { name = n; addr = a; tel = t } ->
          ...
      | [<name>[n] <addr>[a]]
      | { name = n; addr = a } ->
          ...;;

    let fun my_render (AddrBook -> Html)
      book -> render (print_entry, book);;

Note that ([Name Addr Tel?]|Entry) -> Html is a subtype of Entry2html: every function value that when applied to an argument of type ([Name Addr Tel?]|Entry) returns a result in Html is also a function that when applied to an argument of type [Name Addr Tel?] returns a result in Html. Therefore it is legal to use print_entry as the first argument of render.

Functions being first-class, it is possible to store them in data structures; for instance, one could dynamically look up a rendering function in an associative table, according to the value of some field in an input document. This would create a dynamic template system.

3.5 CDuce modules

In this section we sketch the design guidelines we are following in the definition of the (not yet implemented) CDuce module system.

A CDuce module is a sequence of recursive toplevel declarations (types, functions, data, ...); a module may refer to components defined in other modules using a classical dot notation. A module may come with an interface that specifies (in a separate file) the type of defined functions and values.

Type declaration. All the type declarations in a single module are mutually recursive. A declaration type T = t does not create a new type; it just gives the name T to the type t. This is important because it puts the emphasis on the structure of the values, not the specific type declarations; two unrelated modules may thus communicate and work on the same values, even though they do not share any common declaration.

We are planning to add support for abstract types in two flavors. A transparent abstract type gives no information about the structure of its values outside the module where it is defined, but it is still possible to use pattern matching to inspect values of this type. An opaque abstract type prevents any such inspection: values are really black boxes outside the module they belong to.

Incremental programming. It is possible to overload a function first defined in another module. For instance, imagine that the functions add and concat of Section 3.3 were defined in some module A, and that instead of having concat in a separate function we want to obtain the same behavior by overloading add. This can be done in a different module as follows:

    overload A.add ( ((IL,IL)\EE)->NE; EE->[])
      | (l,l) -> A.concat(l,l);;

This affects the global behavior of the function A.add, even when called inside the module A. Technically, this is referred to as dynamic binding. The new behavior is obtained by adding the new branches before the old ones in the definition of A.add. As for typing, we see that it is possible to specify an additional set of constraints in the interface. Of course the old constraints must be satisfied by the new function, too (this automatically enforces the inheritance condition of λ&: see [3]). The new constraints are statically visible only in the module where the overloading occurs (and the ones that rely


on it), not in A, as this would require type-checking it again. However, sometimes, redoing type-checking might prove useful. For instance, if we redefine add so that it always returns positive integers

    overload A.add ((Int,Int)->(0..maxint))
      | (x&Int,y&Int) -> if x+y>0 then x+y else 1;;

and we type check again every function in A that calls add, we might obtain more precise typing by using the information that the result of add is always positive. We are investigating ways to allow this powerful kind of incremental programming without sacrificing all the benefit of separate compilation and implementation independence; basically, we are planning to use some kind of interface for modules and allow the parts of the implementation that have to be potentially re-typechecked to appear in the interface.
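Since the module system is described as not yet implemented, the following is only a speculative Python sketch of the dynamic-binding behavior the text describes: new branches are tried before old ones, and existing callers observe the change. The class Overloaded, its overload method, and the guard functions are all hypothetical names invented for this illustration; nothing here reflects an actual CDuce mechanism.

```python
class Overloaded:
    """A function made of (guard, body) branches tried in order."""

    def __init__(self):
        self.branches = []

    def overload(self, guard, body):
        # New branches are added *before* the old ones.
        self.branches.insert(0, (guard, body))

    def __call__(self, arg):
        for guard, body in self.branches:
            if guard(arg):
                return body(arg)
        raise TypeError("no branch matches the argument")


def is_int_pair(arg):
    return (isinstance(arg, tuple) and len(arg) == 2
            and all(isinstance(v, int) for v in arg))


def is_list_pair(arg):
    return (isinstance(arg, tuple) and len(arg) == 2
            and all(isinstance(v, list) for v in arg))


# "Module A": the original definition of add and a client that calls it.
add = Overloaded()
add.overload(is_int_pair, lambda p: p[0] + p[1])


def double_sum(x, y):
    return 2 * add((x, y))


before = double_sum(-3, 1)   # uses the original integer branch

# "Another module": overload add with list concatenation, then with a
# redefinition on integers that always returns a positive result.
add.overload(is_list_pair, lambda p: p[0] + p[1])
add.overload(is_int_pair, lambda p: p[0] + p[1] if p[0] + p[1] > 0 else 1)

after = double_sum(-3, 1)    # the client in "module A" sees the new branch
```

The value of double_sum(-3, 1) changes from -4 to 2 although double_sum itself was never touched, which is the run-time side of the story. The typing side (old constraints still hold, new ones are visible only to the overloading module) has no counterpart in this sketch.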

3.6 Pattern matching

Pattern matching is one of CDuce’s key features. Although it has an ML-like flavor, it is much more powerful, as it allows one to express in a single pattern a complex processing that can dynamically check both the structure and the type of the matched values.

We already saw examples of pattern matching forming the body of a function declaration. As in ML, in CDuce there is also a standalone pattern-matching expression match e with p1->e1 |...| pn->en. Local binding let p = e1 in e2 is just syntactic sugar for match e1 with p -> e2.

A pattern may either match or reject a value; when it matches, it binds its capture variables to the corresponding parts of the value and the computation can continue with the body of the branch. Otherwise, control is passed to the next branch. Note that this is only a description of the semantics of pattern matching, but the actual implementation uses less naive and more efficient algorithms to simulate it; for instance, we designed an algorithm that uses a single (partial) run on the value to dispatch on the correct branch, and takes advantage of static typing information (see Section 6).

Capture variables and deconstructors. As in ML, a variable, say, x is a pattern that accepts any value and binds it to x. A pair pattern (p1,p2) accepts every value of the form (v1,v2) where vi matches pi. If a variable x appears both in p1 and in p2, then each pattern pi binds x to some value v'i; the semantics is here to bind the pair (v'1,v'2) to x for the whole pattern (and so recursively). For instance, a pattern matching branch (x,(y,x)) -> (x,y) is equivalent to (x1,(y,x2)) -> ((x1,x2),y). Similarly (x,(x,(y,x))) -> (x,y) is equivalent to (x1,(x2,(y,x3))) -> ((x1,(x2,x3)),y). Such examples are not very interesting, though, as we really gain in expressivity when the multiple occurrences of a variable are generated by the use of recursive patterns, as we show later on.
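The rule for repeated variables in pair patterns is easy to get wrong when read quickly, so here is a tiny Python interpreter for just that fragment. It is a sketch under assumptions: patterns are represented as tagged tuples ("var", name) and ("pair", p1, p2), a representation invented for this example, and only variables and pairs are covered.

```python
def match(pattern, value):
    """Return a dict of bindings, or None when the pattern rejects the value."""
    kind = pattern[0]
    if kind == "var":
        # A variable accepts any value and binds it.
        return {pattern[1]: value}
    if kind == "pair":
        if not (isinstance(value, tuple) and len(value) == 2):
            return None
        left = match(pattern[1], value[0])
        right = match(pattern[2], value[1])
        if left is None or right is None:
            return None
        merged = dict(left)
        for name, bound in right.items():
            if name in merged:
                # The variable occurs on both sides: bind it to the pair
                # of the two bindings (recomposition by the constructor).
                merged[name] = (merged[name], bound)
            else:
                merged[name] = bound
        return merged
    raise ValueError(f"unknown pattern kind: {kind}")


x, y = ("var", "x"), ("var", "y")

# (x,(y,x)) matched against (1,(2,3))
first = match(("pair", x, ("pair", y, x)), (1, (2, 3)))

# (x,(x,(y,x))) matched against (1,(2,(3,4)))
second = match(("pair", x, ("pair", x, ("pair", y, x))), (1, (2, (3, 4))))
```

The first match binds x to (1, 3) and y to 2, and the second binds x to (1, (2, 4)) and y to 3, in agreement with the two equivalences given in the text.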

Similarly to pair patterns, record patterns are of the form {a1=p1;...;an=pn} and {|a1=p1;...;an=pn|}: the former matches every record whose fields ai match the pi, while the latter matches records formed exactly by the ai fields and whose contents match the pi. We use the same convention as for types and allow the braces of open records occurring in tags to be omitted. However, contrary to pair patterns, we do not allow multiple occurrences of a variable in a record pattern (there is no difficulty in structuring the results of the different occurrences into a record, but we did not find any interesting use of such a feature).

Type constraint and conjunction. Every type can be used as a pattern, whose semantics is to accept only values of this type³, and to create no binding. This is particularly useful because in CDuce a type may reflect precise constraints on the values (structure or content, for instance 0..10). Note that scalar constants can be used, and are just a special case of type constraint with singleton types. The wild-card type _ is simply an alternative notation for the type constant Any and, as such, it matches every value. Similarly, <_>t (resp. <_ r>t) is a shorthand for (_,(_,t)) (resp. (_,(r,t))).

To combine a type constraint and a capture variable, one can use the conjunction operator & for patterns, as in (x & Int). The semantics of the conjunction in a pattern is to check both sub-patterns and merge their respective sets of bindings. This merging may require a non-trivial inference by the type system: for example, if we have the pattern (x & <book>_), then in order to deduce the type of x the type system must infer the type of the content of the element <book>.

Alternative and default value. There is also an alternative (disjunction) operator p | q with a first-match policy: it first tries to match the pattern p and, if that fails, it tries q; the two patterns must have the same set of capture variables. This pattern may be used together with the pattern (x := c), where c is an arbitrary scalar constant, to provide a default value for a capture variable. For instance, the pattern ((x & Int) | (x := 0)) captures a value and binds it to x when it is an integer, and otherwise ignores it and continues with the binding x := 0.

Recursive patterns. As for types, recursive patterns may be defined through the syntax P where P1 = p1 and ... and Pn = pn, where P, P1, ..., Pn are variables ranging over patterns. Recursive patterns allow one to express complex extraction of information from the matched value. For instance, consider the pattern p where p = (x & Int, _) | (_,p); it extracts the first element of type Int from a sequence (recall that sequences are coded with pairs). The order in the alternative is important, because the pattern p where p = (_,p) | (x & Int, _) would extract the last element of type Int.

A pattern may also extract and reconstruct a subsequence, using the convention described before that when a capture variable appears on both sides of a pair pattern, the two values bound to this variable are paired together. For instance, p where p = (x & Int, p) | (_, p) | (x := `nil) extracts all the elements of type Int from a sequence, and the pattern p where p = (x & Int, (x & Int, _)) | (_, p) extracts the first pair of consecutive integers.

³ Conversely, every pattern without capture variables is a type; this intermingling motivated our choice to use the same notation for constructors common to types and patterns. For instance, the term (1,2) in a pattern position can be interpreted (i) as a type constraint where the type is the constructed constant (1,2), (ii) as a type constraint where the type is the product type of the two scalar constants 1 and 2, or (iii) as a pair pattern formed by two type constraints, 1 and 2. All these interpretations yield the same semantics. Therefore the use of the same constructors for types and patterns reduces the number of possible denotations for what is, from a semantic viewpoint, the same pattern: had we used a different syntax for product types, say, t1 x t2, then we would have had two denotations, i.e. (1,2) and 1x2, for the same pattern. The same reason motivates our choice of denoting record types by { a1=t1;...;an=tn } rather than by the more common { a1:t1;...;an:tn }.
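The effect of the order of alternatives in these recursive patterns can be replayed on pair-encoded sequences. The Python sketch below is a hand-written model under stated assumptions: sequences are nested pairs ending in the string "nil" (standing for the atom `nil), and the function names encode, decode, first_int, last_int and all_ints are introduced here for illustration only. Each function follows the branch order of the pattern quoted in its docstring.

```python
NIL = "nil"  # stands for the atom `nil that ends a pair-encoded sequence


def encode(items):
    """Encode a Python list as nested pairs: [1, 2] becomes (1, (2, "nil"))."""
    seq = NIL
    for item in reversed(items):
        seq = (item, seq)
    return seq


def decode(seq):
    """Inverse of encode."""
    out = []
    while seq != NIL:
        head, seq = seq
        out.append(head)
    return out


def is_int(v):
    return isinstance(v, int) and not isinstance(v, bool)


def first_int(value):
    """p where p = (x & Int, _) | (_, p); None models rejection."""
    if not isinstance(value, tuple):
        return None
    head, tail = value
    if is_int(head):          # first alternative is tried first
        return head
    return first_int(tail)    # second alternative


def last_int(value):
    """p where p = (_, p) | (x & Int, _); same branches, opposite order."""
    if not isinstance(value, tuple):
        return None
    head, tail = value
    found = last_int(tail)    # first alternative: look further down
    if found is not None:
        return found
    return head if is_int(head) else None


def all_ints(value):
    """p where p = (x & Int, p) | (_, p) | (x := `nil)."""
    if isinstance(value, tuple):
        head, tail = value
        if is_int(head):
            # x is bound on both sides of the pair: the bindings are paired.
            return (head, all_ints(tail))
        return all_ints(tail)
    return NIL                # default value closes the new sequence


sample = encode([1, "a", 2, "b", 3])
```

On this sample, first_int returns 1, last_int returns 3, and decode(all_ints(sample)) rebuilds the subsequence of integers.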

Regular expression patterns. CDuce provides syntactic sugar to define patterns working on sequences with regular expressions built from patterns, usual regular expression operators, and sequence capture variables. For instance, we have seen the pattern [ (x::(Name Addr) Tel?)* ]; the variable x captures subsequences of consecutive Name and Addr elements, and concatenates all these subsequences. It is actually compiled into the pattern:

    p where p = (x & Name, q) | (x & `nil)
        and q = (x & Addr, r)
        and r = (Tel, p) | p

This example illustrates how sequence capture variables are compiled by propagating them down to simple patterns, where they become standard capture variables. The (x & `nil) pattern above has a double purpose: it checks that the end of the matched sequence has been reached, and it binds x to `nil, to create the end of the new sequence.

Note the difference between [ x & Int ] and [ x :: Int ]. Both patterns accept sequences formed of a single integer i, but the first one binds i to x, whereas the second one binds to x the (full) subsequence [i].

Regular expression operators *, +, ? are greedy in the sense that they try to match the longest possible sequence. Ungreedy versions *?, +? and ?? are also provided; the difference in the compilation scheme is just a matter of order in alternative patterns. For instance, [_* (x & Int) _*] is compiled to p where p = (_,p) | (x & Int, _), whereas [_*? (x & Int) _*] is compiled to p where p = (x & Int, _) | (_,p).

It is often useful to bind (or ignore) the tail of the matched sequence; instead of [... r::(_*)] (resp. [... (_*)]), one can use the notation [...; r] (resp. [...; _]).

String patterns. The previous paragraph introduced regular expression patterns that match sequences and in which variables may capture subsequences. CDuce also has regular expression patterns for strings, with the syntax /pregexp/ where pregexp is a regular expression built from string constants, wild-cards ".", character classes, usual regular expression operators, and substring capture variables. For instance, the pattern /y::(....) "-" m::(..) "-" d::(..)/ could be used to extract relevant numeric information from a string representation of a date; if the matched value is known to be of type /['0'-'9']{4} "-" ['0'-'9']{2} "-" ['0'-'9']{2}/, then the CDuce type-checker can infer that the value bound to y has type /['0'-'9']{4}/, and this can be used to check statically that a call to a function int_of_string on y will succeed.

    transform e with (x & Int) -> [x x];;
    (* Select and duplicate any integer *)

    transform e with
      | (x & Int) -> [(to_string x)]
      | (x & String) -> [x];;
    (* Select only strings and integers (transformed to strings) *)

Actually, transform can be defined from map and the unary operator flatten, which concatenates a sequence of sequences.

Our transform construction is very similar to the for construction in [7]; a loop for x in e1 do e2 would simply be translated to transform e1 with x -> e2.
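The relation between map, flatten and transform stated above can be sketched in Python. This is a model of the described behaviour under simplifying assumptions: a branch is a pair (accepts, body) of Python functions standing for a pattern and its expression, and all names (match_first, map_seq, flatten, transform) are introduced for this illustration.

```python
def match_first(value, branches):
    """First-match policy over (accepts, body) branches; the last one must accept."""
    for accepts, body in branches:
        if accepts(value):
            return body(value)
    raise ValueError("no branch accepts %r" % (value,))


def map_seq(seq, branches):
    """map e with ...: unmatched elements are kept (implicit x -> x branch)."""
    identity = (lambda v: True, lambda v: v)
    return [match_first(v, branches + [identity]) for v in seq]


def flatten(seqs):
    """Concatenate a sequence of sequences."""
    return [v for seq in seqs for v in seq]


def transform(seq, branches):
    """transform e with ...: map with an implicit _ -> [] branch, then flatten."""
    discard = (lambda v: True, lambda v: [])
    return flatten(map_seq(seq, branches + [discard]))


def is_int(v):
    return isinstance(v, int) and not isinstance(v, bool)


def is_str(v):
    return isinstance(v, str)


e = [1, "a", 2.5, 2]
duplicated = transform(e, [(is_int, lambda x: [x, x])])
as_strings = transform(e, [(is_int, lambda x: [str(x)]), (is_str, lambda x: [x])])
```

The two calls mirror the two transform examples of the text: the first keeps and duplicates the integers, the second keeps strings and integers, the latter converted to strings.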

3.7 Extra support for sequences

Although there is no special support for sequences in the core type and pattern algebras (regular expression types and patterns are just syntactic sugar), CDuce provides some language constructions to support them.

Map. A common operation on sequences is to apply some transformation to each element. In ML, this kind of operation may be defined as a higher-order function map, taking as arguments a list and a function that operates on the elements of the list; the type of map is $\forall \alpha, \beta.\ (\alpha \to \beta) \times \alpha\ \mathtt{list} \to \beta\ \mathtt{list}$. In CDuce, one can define for instance:

    let fun int_str ((Int -> String, [Int*]) -> [String*])
      | (f,(x,l)) -> (f x, int_str (f,l))
      | _ -> `nil;;

However, there are two drawbacks: (i) we can only define a monomorphic instantiation of the ML counterpart, hence code duplication; and (ii) the application of such a function is more verbose, because the abstraction (local function) passed as first argument has to provide an explicit interface in CDuce (for instance, fun (Int -> Int) x -> x + 1 instead of the ML fun x -> x + 1).

Actually, (ii) is related to the following issue: the complex type algebra of CDuce makes type inference probably infeasible in practice (inferred types may be completely unreadable); moreover, as the semantics is driven by types, there is not necessarily a notion of best type for an expression. However, we are planning to consider limited forms of inference, such as local inference, to handle simple cases such as the one above.

To address (i), one can try to add some form of parametric polymorphism to CDuce; we already started to consider the problem and adapted the CDuce type algebra to handle type variables. But it turns out that ML-like polymorphism is not always sufficient for XML-oriented processing and CDuce-like type algebras. The function fun x -> x + 1 has of course type Int->Int, but also all the types i..j -> i + 1..j + 1 for any integers i and j. As a different example, suppose we have a sequence of type [ (t1 t2)* ] and want to obtain a sequence of type [ (t1' t2')* ] (say, ti and ti' correspond to XML elements), by applying two distinct transformations for elements of type t1 and those of type t2. This is beyond the power of ML polymorphism; we could instantiate, in the type of map, $\alpha$ with t1 | t2 and $\beta$ with t1' | t2', and this would give [ (t1' | t2')* ] for the type of the result, which forgets the order of elements.

    type Person = FPerson | MPerson;;
    type FPerson = <person gender='F'>[ Name Children ];;
    type MPerson = <person gender='M'>[ Name Children ];;
    type Children = <children>[Person*];;
    type Name = <name>[String];;
    type Man = <man>[ Name Sons Daughters ];;
    type Woman = <woman>[ Name Sons Daughters ];;
    type Sons = <sons>[ Man* ];;
    type Daughters = <daughters>[ Woman* ];;
    let fun sort (MPerson -> Man ; FPerson -> Woman)
      <person gender=gen>[ n <children>[(m::MPerson | f::FPerson)*] ] ->
        let tag = match gen with 'M' -> `man | 'F' -> `woman in
        let s = map m with x -> sort x in
        let d = map f with x -> sort x in
        <(tag)>[ n <sons>s <daughters>d ];;

Figure 1: A CDuce program

As a pragmatic solution, we adopted the following construction in CDuce: map e with p1 -> e1 | ... | pn -> en. Here, the expression e must evaluate to a sequence, and each of its elements will go through the pattern matching and get transformed if matched by some branch (otherwise, it is left unchanged, as if there were an implicit x -> x branch at the end of the pattern matching). Here is an example where this kind of polymorphism is required: the following function uses two different tags to represent home addresses and work addresses (default is "work"):

    type AddrBook1 = <book>[(Name Addr1 Tel?)*];;
    type Addr1 = <home>[String] | <work>[String];;
    let fun patch_addr (AddrBook -> AddrBook1)
      <book>x -> let y = map x with
                   | <addr kind="home">z -> <home>z
                   | <addr>z -> <work>z
                 in <book>y;;

Transform. The map construction does not affect the length of the sequence, and each element is mapped to a single element in the result. CDuce also provides a variant of map, written transform, where each branch of the pattern is supposed to return a (possibly empty) sequence, and all the returned sequences, for each element in the source sequence, are concatenated together. The implicit default branch is now _ -> [] (so, unhandled elements are discarded). Here are some examples:

3.8 A detailed example

In Figure 1 we have written a small program that illustrates the expressivity and the peculiar characteristics of CDuce. Let us go through it in detail.

We have XML documents of type Person that are used to store information about persons. To that end they use an attribute gender, an element name, and an element children that is a sequence of Person elements. This is expressed by the first five type declarations of Figure 1. A completely equivalent definition for Person would have been:

    type Person = <person gender=("F"|"M")>[Name Children]

Imagine now that we want to transform our representation so as to get rid of the attribute, by using either the tag <man> or the tag <woman> instead. We also want to split the children of a person into two different sequences, one of sons, composed of men, and the other of daughters, composed of women. Of course we want to apply this transformation recursively to the children of a person. In practice, we want to define a function sort of type Person -> (Man | Woman), where Man and Woman are the types defined in the last four type declarations of Figure 1. The definition of the transformation ends the figure. It is declared to be an overloaded function that, when applied to an MPerson, returns an XML document of type Man and that, when applied to an FPerson, returns a value of type Woman. The body is composed of a single pattern matching whose pattern binds four variables: gen, which is bound to the gender of the argument of the function; n, which is bound to its name; m, which is bound to the sequence of all children that are of type MPerson; and f, which is bound to the sequence of all children that are of type FPerson. Here we see the first use of a peculiar feature of CDuce, namely the ability of patterns to capture subsequences of non-consecutive elements of a sequence. On the next line we define tag to be `man or `woman according to the value of gen, and then we recursively apply sort to the elements of m and f. Here is the crucial use of a second feature of CDuce, that is, overloading: since m is of type [MPerson*], by the overloaded type of sort we can deduce that s is of type [Man*]; similarly we deduce for d the type [Woman*]. From this the type checker deduces that the expressions <sons>s and <daughters>d are of type Sons and Daughters, and therefore it returns for the sort function the type (MPerson -> Man) & (FPerson -> Woman). Note that the use of overloading here is critical: although sort has type Person -> (Man | Woman) (as sort is of type MPerson->Man & FPerson->Woman, which is a subtype of the former), had we declared sort with that type the function would not have type-checked: in the recursive calls the type we would have been able to deduce for s and for d is [ (Man | Woman)* ], which is not enough to type-check the result.
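The run-time behaviour of the sort function of Figure 1 can be mimicked in a few lines of Python. The sketch below is only meant to show what the transformation computes; it models XML elements as dictionaries (an encoding chosen for this example), and it carries none of the static guarantees discussed above, since Python cannot express the overloaded type (MPerson -> Man) & (FPerson -> Woman).

```python
def sort(person):
    """Replace the gender attribute by a tag and split children by gender."""
    tag = {"M": "man", "F": "woman"}[person["gender"]]
    children = person["children"]
    # m and f of Figure 1: two subsequences of non-consecutive elements.
    sons = [sort(c) for c in children if c["gender"] == "M"]
    daughters = [sort(c) for c in children if c["gender"] == "F"]
    return {"tag": tag, "name": person["name"], "sons": sons, "daughters": daughters}


def person(gender, name, children=()):
    """Hypothetical constructor for the dictionary encoding of <person>."""
    return {"gender": gender, "name": name, "children": list(children)}


family = person("F", "Ann", [person("M", "Bob"), person("F", "Eve"), person("M", "Tom")])
sorted_family = sort(family)
```

In this run the sons are Bob and Tom, although they are not adjacent in the original sequence of children, and Eve is the only daughter.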

Another point worth stressing is that if, for example, we had defined the Children type as follows

    type Children = <children>[Person+]

with the intention of asserting that every person in the base has at least one child, then the type checker would emit a warning stating that the types Person, FPerson, MPerson, and Children are empty. As a matter of fact, there does not exist any value with these types, for the simple reason that there is no value to start from (we are in a purely functional language).⁴

4 Types

The type system is at the core of CDuce. The whole language was conceived and designed around it. From a practical point of view, the most interesting and useful characteristic of the type system is the semantic interpretation we described before, according to which a type is nothing but a set of values denoted by some syntactic expression⁵. This simple intuition is all that is needed to grasp the semantics of CDuce's type system and, in particular, of:

Subtyping: subtyping is simply defined as inclusion of sets of values: a type t is a subtype of s if and only if every value which has type t also has type s; when this does not hold, the type system can always exhibit a value of type t and not of type s.

Boolean connectives: boolean connectives in the type algebra are simply interpreted as their set-theoretic counterparts on sets of values: intersection &, union |, and difference \ are the usual set-theoretic operations.

Type equivalences: two types are equivalent if and only if all the values in the former are values in the latter and vice versa. So for example [ Int (String Int)* ] ≃ [ (Int String)* Int ].
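The three items above can be made concrete on a toy model in which a type is literally a finite set of values. The Python sketch below is an illustration under a strong simplifying assumption (a small finite universe of integers and strings, whereas CDuce types denote infinite sets and are handled symbolically); the names UNIVERSE, interval, is_subtype, counterexample and equivalent are introduced for this example.

```python
# Toy universe of values; real CDuce types denote infinite sets.
UNIVERSE = frozenset(range(-3, 4)) | frozenset({"a", "b"})
INT = frozenset(v for v in UNIVERSE if isinstance(v, int))
STRING = frozenset(v for v in UNIVERSE if isinstance(v, str))


def interval(lo, hi):
    """The type lo..hi, restricted to the toy universe."""
    return frozenset(v for v in INT if lo <= v <= hi)


def is_subtype(t, s):
    """Subtyping is set inclusion."""
    return t <= s


def counterexample(t, s):
    """A value of type t and not of type s, or None when t is a subtype of s."""
    difference = t - s
    return min(difference, key=repr) if difference else None


def equivalent(t, s):
    """Two types are equivalent when they contain the same values."""
    return is_subtype(t, s) and is_subtype(s, t)


witness = counterexample(INT, interval(0, 3))
```

Here &, | and \ are Python's set operators &, | and -, so that, for instance, (INT | STRING) - STRING is equivalent to INT, and witness is some negative integer showing that Int is not a subtype of 0..3.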

Understanding types is fundamental to CDuce programming, as they are pervasive. In particular, pattern matching is based on types: all a pattern can do is capture a value, deconstruct it, or check its type. So pattern matching can basically be seen as dynamic dispatch on types, combined with information extraction. This gives CDuce a type-driven semantics reminiscent of object-oriented languages, as overloaded functions can mimic dynamic dispatch on method invocations. Note however that a class-based approach (mapping each XML element type to a class) would be infeasible, as the standard dispatch mechanism in OO languages is much less powerful than pattern matching (which can look for and extract information deep inside the value). By keeping "methods" outside objects, we also get the equivalent of multi-methods (dispatch on the type of all the arguments, not just on the type of a distinguished "self").

⁴ While there does not exist any value in this type, there do exist expressions with this type. Of course they are all diverging. For example, consider the mutually recursive functions:

    let mkP (_ -> Person) x -> <person gender='F'>["Infinite Loop" mkC(x)]
    and mkC (_ -> Children) x -> <children>[ mkP(x) ];;

Then mkC(1) is an expression of the type Children above.

⁵ Of course, every type system induces a set-theoretic interpretation of types as sets of values. The point is that the CDuce type system is built on such an interpretation. As a result, for instance, the subtyping relation of CDuce is both sound and complete w.r.t. set inclusion, whereas in other type systems just soundness holds (if a type t is a subtype of s then all the values in t are also in s, but the converse does not hold).

Besides this dynamic function, types also play a major role in the static counterpart of the language. Type correctness of all CDuce transformations can be statically ensured. This is an important point: although many type systems have been proposed for XML documents (DTD, XML-Schema, ...), most XML applications are still written in languages (e.g. XSLT) that, unlike XDuce or CDuce, cannot ensure that a program will only produce XML documents of the expected type. Furthermore, in CDuce pattern matching has exact type inference, in the sense that the typing algorithm assigns to each capture variable exactly the set of all values it may capture. This yields a very precise static type system, which provides a better description of the dynamic behavior of programs.

Finally, types also play an important role in the compiler back-end, as the type-driven computation raises interesting issues about the execution model of CDuce and opens the door to type-aware compilation schemas and type-driven optimizations, which we hint at in Section 6.

4.1 Highlights of the type system

Since the whole intuition of the CDuce type system relies on interpreting types as sets of values, it is important to explain how values are typed. This is straightforward in most cases apart from function values. So we explain below the typing rule for functions and, in order to ease the presentation, we split it into two rules: a subrule for typing function bodies (that is, lists of pattern matching branches), whose derivation is then used in the typing rule for functions.

A typing judgment has the form $\Gamma \vdash e : t$ where $\Gamma$ is a typing environment (a map from variables to types), e is a CDuce expression, and t is a type; the intended meaning is that if the free variables of e are assigned values that respect $\Gamma$, then every possible result of e will be a value in t.

Pattern matching. Let B denote the sequence of branches p1 -> e1 | ... | pn -> en. The rule below derives the typing judgment $\Gamma \vdash t/B \Rightarrow s$, whose intended meaning is: matching a value of type t against the sequence of branches B always succeeds and every possible result is of type s.

\[
\frac{
  t \leq \lbag p_1 \rbag \mid \dots \mid \lbag p_n \rbag
  \qquad
  t_i = t \setminus \lbag p_1 \rbag \setminus \dots \setminus \lbag p_{i-1} \rbag \;\&\; \lbag p_i \rbag
  \qquad
  \begin{cases}
    \Gamma, (t_i/p_i) \vdash e_i : s_i & \text{if } t_i \not\simeq \mathtt{Empty} \\
    s_i = \mathtt{Empty} & \text{if } t_i \simeq \mathtt{Empty}
  \end{cases}
  \qquad
  s = s_1 \mid \dots \mid s_n
}{
  \Gamma \vdash t/B \Rightarrow s
}
\]

Let us look at this rule in detail. The matched value is supposed to be of type t. The first premise checks that the pattern matching is exhaustive; for each pattern pi, $\lbag p_i \rbag$ is a type that represents exactly all the values that are matched by pi. The exhaustivity condition just says that every value that belongs to type t must be accepted by some pattern.

Now we have to type-check each branch. At runtime, when the branch pi->ei is considered, one already knows that the value has been rejected by all the previous patterns $p_1, \dots, p_{i-1}$; if the branch succeeds, one also knows that the value is of type $\lbag p_i \rbag$. So, when type-checking the expression of the branch, one knows that the value will be of type $t_i$, that is, of type t and of type $\lbag p_i \rbag$ but not of any of the types $\lbag p_1 \rbag, \dots, \lbag p_{i-1} \rbag$. Now we type-check the body ei of the branch; to do so, one must collect some type information about the variables bound by pi. This is the purpose of $(t_i/p_i)$: it is a typing environment that associates to each variable x in pi a type that collects all the values that can be bound to x by matching some value of type $t_i$ against pi.

It is evident that all the "magic" of type inference resides in the operators $\lbag p \rbag$ and $(t/p)$. These operators were introduced in [8]. Their definition reflects their intuitive semantics and is also used to derive the algorithms that compute them. In the next section examples are given to illustrate some non-trivial computations performed by these algorithms.

The result of the pattern matching will be the result of one of the branches that can potentially be used. This is expressed by taking for the type of the pattern matching the union of the result types of each branch i such that $t_i$ is not empty; indeed, if $t_i$ is empty, the branch cannot be selected, and we take $s_i$ = Empty as its contribution.

Functions. What is the use of unused branches (i.e., those with $t_i \simeq \mathtt{Empty}$) in a pattern matching? The answer is in the typing rule for abstractions:

\[
\frac{
  t = t_1 \to s_1 \;\&\; \dots \;\&\; t_n \to s_n
  \qquad
  (\forall i)\;\; \Gamma, f : t \vdash t_i/B \Rightarrow u_i \leq s_i
}{
  \Gamma \vdash \mathtt{fun}\ f\,(t_1 \to s_1; \dots; t_n \to s_n)\ B : t
}
\]

The type system simply checks all the constraints given in the interface (as the function can call itself recursively, we remember when typing the body that f is a function of the type given by the interface). So the body is type-checked several times, and for some type $t_i$, it may be the case that some branch in B is not used. Let us illustrate this by a simple example:

    fun (Int -> Int; String -> String)
      | Int -> 42
      | (x & String) -> x

When type-checking the body for the constraint String -> String, the first branch is not used, and even though its return type is not empty (it is 42, which is the type assigned to the constant 42), it must not be taken into account to prove the constraint.

This is not a minor point: the fact of not considering the return type of unused branches is the main difference between dynamic overloading and a type-case (or, equivalently, the dynamic types of [1]). The latter always returns the union of the result types of all the branches and, as such, it is not able to discriminate on different input types.
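The difference between the two typing disciplines can be replayed on finite sets. The Python sketch below is a toy model, not the algorithm of [8]: types are finite sets of values, each branch is given directly by the set of values its pattern accepts together with a function from the refined input type $t_i$ to the type of its body, and the names type_of_match, check_interface and type_case are assumptions of this example.

```python
INT = frozenset(range(0, 50))          # toy stand-in for Int
STRING = frozenset({"a", "b", "c"})    # toy stand-in for String


def type_of_match(t, branches):
    """Model of the branch rule: union of s_i over the branches with t_i non-empty."""
    accepted_by_some = frozenset().union(*(accepted for accepted, _ in branches))
    if not t <= accepted_by_some:
        raise TypeError("pattern matching is not exhaustive")
    result = frozenset()
    remaining = t
    for accepted, body_type in branches:
        t_i = remaining & accepted     # t minus earlier patterns, and this one
        if t_i:                        # unused branches contribute Empty
            result |= body_type(t_i)
        remaining -= accepted
    return result


def check_interface(arrows, branches):
    """Model of the function rule: each arrow t_i -> s_i is checked separately."""
    return all(type_of_match(t_i, branches) <= s_i for t_i, s_i in arrows)


def type_case(branches):
    """A type-case ignores the input type: union of all branch results."""
    return frozenset().union(*(body(accepted) for accepted, body in branches))


# fun (Int -> Int; String -> String) | Int -> 42 | (x & String) -> x
BODY = [(INT, lambda t_i: frozenset({42})), (STRING, lambda t_i: t_i)]
overloaded_ok = check_interface([(INT, INT), (STRING, STRING)], BODY)
```

In this model overloaded_ok holds, because for the input type STRING the first branch is skipped, whereas type_case(BODY) contains 42 and therefore is not included in STRING.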

4.2 Pattern type inference: examples

We saw that $\lbag p \rbag$ and $(t/p)$ are the core of the type system. They are defined as the least solution of some set of equations (there may be several solutions when considering recursive patterns). These definitions are quite straightforward, as they reflect the intuitive semantics of the operators. For example, $\lbag p \rbag$ is defined by the following set of equations

\[
\begin{array}{ll}
\lbag x \rbag = \mathtt{Any} & \lbag p_1 \mid p_2 \rbag = \lbag p_1 \rbag \mid \lbag p_2 \rbag \\
\lbag t \rbag = t & \lbag p_1 \,\&\, p_2 \rbag = \lbag p_1 \rbag \,\&\, \lbag p_2 \rbag \\
\lbag (x := c) \rbag = \mathtt{Any} & \lbag (p_1, p_2) \rbag = (\lbag p_1 \rbag, \lbag p_2 \rbag)
\end{array}
\]

which simply state that a pattern formed by a variable matches (the type formed by) all values, that a type used as a pattern matches all the values it contains, that an alternative pattern matches the union of the types matched by each pattern, and so on. The same intuition guides the definition of $(t/p)$. So for example:

\[
\begin{array}{l}
(t/x)(x) = t \\
(t/(p_1 \mid p_2))(x) = ((t \,\&\, \lbag p_1 \rbag)/p_1)(x) \mid ((t \setminus \lbag p_1 \rbag)/p_2)(x)
\end{array}
\]

...

states that when we match the pattern x against values ranging over the type t, the values captured by x will be exactly those in t; similarly, when we match values ranging over t against an alternative pattern, the values captured by a variable x will be those captured by x when the first pattern is matched against those values of t that are accepted by p1, union those captured by x when the second pattern is matched against the values in t that are accepted by p2 but not by p1.
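These equations can be executed directly on a toy model where types are finite sets of scalar values. The Python sketch below covers only variables, type constraints, default values, conjunction and alternative (no pairs and no recursion, so no least-solution computation is needed); the tuple encoding of patterns and the names accepted and captured are assumptions made for this illustration, with accepted(p) playing the role of $\lbag p \rbag$ and captured(t, p, x) the role of $(t/p)(x)$.

```python
UNIVERSE = frozenset(range(-2, 3)) | frozenset({"a", "b"})   # toy Any
INT = frozenset(v for v in UNIVERSE if isinstance(v, int))
STRING = UNIVERSE - INT
EMPTY = frozenset()


def accepted(p):
    """Set of values matched by pattern p."""
    kind = p[0]
    if kind in ("var", "default"):      # x and (x := c) accept everything
        return UNIVERSE
    if kind == "type":
        return p[1]
    if kind == "or":
        return accepted(p[1]) | accepted(p[2])
    if kind == "and":
        return accepted(p[1]) & accepted(p[2])
    raise ValueError(kind)


def captured(t, p, x):
    """Set of values bound to x when values of type t are matched against p."""
    t = t & accepted(p)                 # only accepted values produce bindings
    kind = p[0]
    if kind == "var":
        return t if p[1] == x else EMPTY
    if kind == "default":
        return frozenset({p[2]}) if p[1] == x and t else EMPTY
    if kind == "type":
        return EMPTY
    if kind == "and":
        return captured(t, p[1], x) | captured(t, p[2], x)
    if kind == "or":
        first = accepted(p[1])          # first-match policy
        return captured(t & first, p[1], x) | captured(t - first, p[2], x)
    raise ValueError(kind)


# ((x & Int) | (x := 0))
P = ("or", ("and", ("var", "x"), ("type", INT)), ("default", "x", 0))
```

For the pattern ((x & Int) | (x := 0)), matching against STRING binds x only to 0, while matching against the whole universe gives exactly INT, since the default value 0 already belongs to it.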

The most important result is that the equations above can be used to define two algorithms that compute $\lbag p \rbag$ and $(t/p)$. Rather than entering into the details of the algorithms⁶, we prefer to give a couple of examples that show the subtlety of the computation they are required to perform.

Filter. Consider the pattern p where p = ((x & Int),p) | (_,p) | (x:=`nil) that extracts from a sequence all the integers occurring in it⁷. In the table below we show the types of all values that are captured by the variable x of the pattern p when the latter is matched against (values ranging over) different types:

    t                      (t/p)(x)
    [Int String Int]       [Int Int]
    [Int|String]           [Int?]
    [Int* String Int]      [Int+]
    [Int+ String Int]      [Int+ Int]
    [(0..10)+ String]      [(0..10)+]
    [(Int String)+]        [Int+]

⁶ See [8] for more details about these algorithms as well as the algorithms for typing and subtyping. Here we just want to point out that with our semantic approach the proofs of completeness and termination for the algorithms seem considerably easier than those for XDuce's.

⁷ More precisely, if p is matched against a sequence L, then x is bound to the subsequence of L containing all the integers in L.

Pairs. Consider the following types, already introduced in Section 3.3:

    type IL = [Int*];;               (* integer list *)
    type NE = [Int+];;               (* non-empty list *)
    type T = (IL,IL)\([],[]);;       (* pair of lists not both empty *)

and the patterns p = ((x,y),z) and q = ([],z). Then:

\[
\begin{array}{l}
((\mathtt{T} \,\&\, \lbag p \rbag)/p) = [\,x \mapsto \mathtt{Int}\,;\ y \mapsto \mathtt{IL}\,;\ z \mapsto \mathtt{IL}\,] \\
((\mathtt{T} \setminus \lbag p \rbag)/q) = [\,z \mapsto \mathtt{NE}\,]
\end{array}
\]

The typing of patterns, pattern matching, and functions is essentially all that is needed to understand how the type algorithm works, as the remaining rules are straightforward. The only exception is the typing of the constructions map and transform, which needs to compute the transformations of regular expressions (over types) and for which the same techniques as those of [7] can be used.

5 Queries

In the world of XML, the boundary between programming languages, transformation languages, and query languages/algebras is not easy to draw and, as pointed out in [12], there is no definitive standard for query languages for XML. Indeed, a query can be seen as a transformation that filters the XML documents to extract the relevant information and presents it with a given structure, and a transformation is just a special kind of application. A declarative language such as XSLT [4] is clearly not on the "programming" side, but systems such as XQuery [2] or the one in [7] are very close in spirit to XDuce, and they can be seen as real programming languages for XML.

CDuce was designed as a programming language, recasting some XML-specific features from XDuce in the more general setting of higher-order functional languages. But it turns out that a small set of extra constructions can also endow it with query-like facilities that are standard in the database world: projection, selection, and join.⁸

We already mentioned that the transform construction allows us to encode the for iteration from [7] in CDuce.

⁸ The fact that CDuce can implement such constructions is not surprising: any Turing-complete language can obviously do it. The point is that, instead of defining a fixed implementation of these constructions, one can use the semantic foundations of CDuce to obtain different implementations and natural (insofar as semantic) transformations that open the door to query optimization.

As in [7], the projection operator (denoted by /) can be defined from this construction: if e is a CDuce expression representing a sequence of elements and t is a type, e/t is syntactic sugar for⁹:

    transform e with <_>c -> transform c with (x & t) -> [x]

This new syntax can be used to obtain a notation close to XPath [5]. For example, consider the type AddrBook of Section 3.2 modified so that the <addr> elements contain subelements such as <street>, <town> and so on. If addrbook is of type AddrBook, then the expression [addrbook]/<addr kind="home">_/<town>_ extracts from addrbook the sequence of all town elements that occur in a "home" address. To enhance readability we use the syntactic convention that in such paths the wild-cards _ that follow tags can be omitted, and write [addrbook]/<addr kind="home">/<town>. This corresponds to the XPath expression /addr[@kind="home"]/town. Our type system allows us to push the simulation of XPath further, for example to consider the position of the elements. So we may imagine writing [addrbook]/<addr kind="home">{2}/<town> to select exactly the town of the second "home" address element of addrbook, which would correspond to the XPath expression /addr[2][@kind="home"]/town and be coded into the following (unreadable but efficient) expression:

    transform addrbook with
      <_>[(_*?) <addr kind="home">_ (_*?)
          <addr kind="home">[ (x::<town> | _)* ]; _] -> x

But CDuce path expressions are not just there to mimic XPath. Since our syntax allows us to specify the type of element contents, we can use this option to express complex conditions in paths; for instance, using the types defined in Section 2, the path [bib]/<book>[Title Year Author Author]/<title> extracts the titles of all the books with exactly two authors.

A CDuce compiler could take advantage of equivalences similar to those mentioned in [7] to optimize complex queries; a typical example of an optimization rule is the rewriting of transform (transform e with p1 -> e1) with p2 -> e2 into transform e with p1 -> transform e1 with p2 -> e2. Considering query-like features in a programming language leads to introducing a query optimizer in the code generator.
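The optimization rule quoted above can be checked on a small model. In the Python sketch below, a single-branch transform is modelled as a function applied to every element and returning a (possibly empty) list; the branch functions duplicate_ints and show_small are arbitrary examples invented for the illustration, and the equality is only expected to hold for side-effect-free branches, which is an assumption of this sketch.

```python
def transform(seq, branch):
    """transform e with p -> e': `branch` returns [] for rejected elements."""
    out = []
    for value in seq:
        out.extend(branch(value))
    return out


def duplicate_ints(v):
    """Models (x & Int) -> [x x]."""
    return [v, v] if isinstance(v, int) else []


def show_small(v):
    """Models (x & 0..2) -> [to_string x] on integers."""
    return [str(v)] if 0 <= v < 3 else []


e = [1, "a", 2, 5]

# Left-hand side: build the intermediate sequence, then traverse it again.
nested = transform(transform(e, duplicate_ints), show_small)

# Right-hand side: one traversal of e, the second transform is pushed inside.
fused = transform(e, lambda v: transform(duplicate_ints(v), show_small))
```

Both sides produce the same sequence, but the fused form avoids materializing the intermediate result of the inner transform over the whole of e.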

To implement joins, we introduce a cartesian product operator in CDuce; if s1, s2, ..., sn are sequences, then prod(s1,s2,...,sn) evaluates to a sequence containing all the (v1,v2,...,vn) where vi appears in si. We leave the order of the resulting sequence unspecified so as to allow optimizations. For instance, if s1 = [ 1 2 ] and s2 = [ "A" "B" ], then the expression prod(s1,s2) evaluates to some permutation of [ (1,"A") (1,"B") (2,"A") (2,"B") ]. A typical join creates the cartesian product of sequences of XML documents, and filters it (with transform or a pattern). The compiler is free to apply algebraic optimizations or use available indexes to implement the join efficiently.

⁹ We can take advantage of the fact that in CDuce a single pattern can perform the complex operation of extracting all the elements of a given type, to define the following more compact encoding: transform e with <_>[ (x::t | _)* ] -> x

The typing rule forprod is the following:� ` e1 :[t1*℄ : : : � ` en :[tn*℄� ` prod(e1,...,en) : [(t1,...,tn)*℄Note that we restrict theei’s to be homogeneous sequences(i.e, sequences whose elements are all of the same type) whichyields the product to be an homogeneous sequence, as well. Asele t construction can then be defined easily. The meaningof sele t e from x1 in e1,...,xn in en where e0 isdefined to be the same as:transform prod(e1,...,en)with (x1,...,xn) -> if e0 then e else [℄. Thecompiler can implement this more efficiently; for instance,ife0 does not involve all thexi, the query optimizer can, as usual,push selections (and/or projections) on some components be-fore creating the full cartesian product. The following exam-ple, inspired from [7], illustrates the join between two docu-mentsbib0 andrev0 of typesBib andReview respectively.type Review =<reviews>[BibRev*℄;;type BibRev = <book>[<title>[String℄<review>[String℄℄;;let rev0 = <reviews>[<book>[<title>["Persistent Obje t Systems"℄<review>["Good topi "℄℄<book>[<title>["Les illusions perdues"℄<review>["A promising writer"℄℄℄;;sele t <bookr>([b℄/<title>�[b℄/<author>�[r℄/<review>)from b in [bib0℄/<book> , r in [rev0℄/<book>where [b℄/<title> = [r℄/<title>yielding the following result.[<bookr>[<title>["Persistent Obje t Systems"℄<author>["M. Atkinson"℄<author>["V. Benzaken"℄<author>["D. Maier"℄<review>["Good topi "℄℄<bookr>[<title>...℄℄Sometimes we need to access to the content of an elementwhen this is a sequence of just one element. 
This is done by the expression e.<tag>, as in the following example where we add the review as an attribute of the book tag:

    select <book review=r.<review>>( [b]/<title> @ [b]/<book> @ [<price>["unknown"]] )
    from b in [bib0]/<book> , r in [rev0]/<book>
    where b.<year>="2000" and b.<title> = r.<title>

The type system allows access to the content of an element only if the element occurs exactly once and contains a sequence of length 1, as stated by the following typing rule (where ¬t is by definition Any\t)

$$\frac{\Gamma \vdash e : \texttt{[(}\neg\texttt{<t r>\_)* <t r>[}s\texttt{] (}\neg\texttt{<t r>\_)*]}}{\Gamma \vdash e\texttt{.<t r>} : s}$$

(r is an optional attribute specification) that can be easily deduced from the encoding of e.<t r>:

    match e with [(¬<t r>_)* <t r>[x] (¬<t r>_)*] -> x

This construct corresponds to the XSLT element value-of where, say, (/<a>/<b>/<c>).<c> would be written as <xsl:value-of select="/a/b/c">.
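The condition enforced by the typing rule for e.<t r> can be illustrated with a small Python model. This is only a sketch under stated assumptions: an XML element is represented as a `(tag, children)` pair, attributes are ignored, and the hypothetical function `select_content` checks at run time what the CDuce type system guarantees statically (exactly one element with the given tag, whose content is a sequence of length 1).

```python
def select_content(seq, tag):
    """Model of e.<tag>: return the single child of the unique <tag> element.

    In CDuce the uniqueness conditions are proved by the type checker;
    here they are checked dynamically and reported with ValueError.
    """
    contents = [children for (t, children) in seq if t == tag]
    if len(contents) != 1:
        raise ValueError("<%s> must occur exactly once" % tag)
    if len(contents[0]) != 1:
        raise ValueError("<%s> must contain exactly one item" % tag)
    return contents[0][0]


# Content of a book element, following the result shown in the text.
book = [
    ("title", ["Persistent Object Systems"]),
    ("author", ["M. Atkinson"]),
    ("author", ["V. Benzaken"]),
    ("author", ["D. Maier"]),
]

print(select_content(book, "title"))

try:
    select_content(book, "author")  # three authors: rejected
except ValueError as err:
    print("rejected:", err)
```

The second call corresponds to a program that CDuce would reject at compile time, since `<author>` occurs more than once in the sequence.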

Along the lines of what precedes, we could easily show how to use CDuce to encode the various examples presented in the current W3C proposal for XML Query and, in general, obtain a more precise typing. However, that is not the point, as we do not want to fix a precise implementation for queries. On the contrary, in this section we wanted to single out constructions that leave the compiler with maximal freedom in the implementation and, therefore, a large latitude in query optimization. In other terms, we sought constructions that allow us to express queries as declaratively as possible, especially because the set-theoretic semantic foundations of CDuce constitute an adequate support for defining optimizations. In the frame of the CDuce type system, the prod(...) construction above looks like a promising choice.
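The reading of select as a filter over prod, and the freedom it leaves to the optimizer, can be sketched in Python. This is a model, not CDuce: `select` and `select_pushed` are hypothetical names, records are plain dictionaries, each result is a single row rather than a CDuce sequence, and the sample data only stand in for bib0 and rev0 (the year values and the second book are invented for the example).

```python
from itertools import product


def select(result, sources, where):
    """select e from x1 in e1,...,xn in en where e0, read as a filter
    applied to every tuple of the cartesian product of the sources."""
    return [result(*tup) for tup in product(*sources) if where(*tup)]


def select_pushed(result, sources, local_filters, where):
    """Same query, but conditions that mention a single source are applied
    to that source before the product is built (selection push-down)."""
    narrowed = [[x for x in src if keep(x)]
                for src, keep in zip(sources, local_filters)]
    return select(result, narrowed, where)


# Illustrative stand-ins for bib0 and rev0 (assumed data).
books = [
    {"title": "Persistent Object Systems", "year": "2000"},
    {"title": "Another Book", "year": "1999"},
]
reviews = [
    {"title": "Persistent Object Systems", "review": "Good topic"},
    {"title": "Another Book", "review": "Fine"},
]


def make_row(b, r):
    return (b["title"], r["review"])


direct = select(
    make_row, [books, reviews],
    lambda b, r: b["year"] == "2000" and b["title"] == r["title"])

pushed = select_pushed(
    make_row, [books, reviews],
    [lambda b: b["year"] == "2000", lambda r: True],
    lambda b, r: b["title"] == r["title"])

print(direct == pushed)
```

Both plans return the same rows, but the second builds a product over one book instead of two. Because the semantics only fixes the result up to the order of prod, a compiler may pick either plan.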

6 Implementation

We described the static type checking of CDuce programs in [8]. In particular, we defined a high-level specification of a subtyping algorithm by a notion of coinductive simulation that characterizes empty types (t is a subtype of s if and only if the difference type t \ s is empty). Many optimizations and non-trivial implementation techniques can be applied to pass from this specification to a practical algorithm: among these are those mentioned in [11] (for the knowledgeable reader, note that it is also possible to cache negative results of the subtyping algorithm in a destructive structure, as they cannot be invalidated even under different assumptions) and the definition of maximal sharing and unique representation for recursive types, which improve the impact of caching mechanisms.

In CDuce, type checking is not just a preliminary verification. We believe that static typing provides key information for designing an efficient execution model for CDuce and XML languages in general. To motivate this idea let us consider the following example. Suppose that A and B are two types and consider the function:

    fun (<a>[A+|B+] -> Int)
        <a>[A+] -> 0
      | <a>[B+] -> 1;;

A naive compilation schema would yield the following behavior for the function. First check whether the first pattern matches the argument. To do this: (i) check that it has the form (`a,(_,l)) and (ii) run through l to verify that it is a non-empty sequence of elements of type A (checking that an element is of type A may be complex, if A represents for instance a complex DTD). If this fails, try the second branch and do all these tests again with B. The argument may be run through completely several times.

There are many useless tests; first, one knows statically that the argument is necessarily a pair with first component `a:


there is no need to check this. Then, one knows that the second component is a pair (_,l) where l is a non-empty sequence whose elements are either all of type A or all of type B. To know in which situation we are, one just has to look at the first element and perform some tests to discriminate between A and B (for instance, if A = <x>[...] and B = <y>[...], it is enough to look at the head tag). Using these optimizations, only a small part of the argument is looked at (and just once).
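The difference between the two compilation schemas can be made concrete with a small Python model. It is a sketch under several assumptions: elements are `(tag, children)` pairs, A stands for elements tagged `x` and B for elements tagged `y` (as in the parenthetical example above), membership in A or B is reduced to a tag test, and the hypothetical `stats` dictionary merely counts how many elements each strategy inspects.

```python
def has_tag(elem, tag, stats):
    """Stand-in for the (possibly expensive) test 'elem has type A/B'."""
    stats["inspected"] += 1
    return elem[0] == tag


def naive_match(arg, stats):
    """Try each branch in turn, re-checking everything from scratch."""
    tag, children = arg
    if tag == "a" and len(children) > 0 and all(
            has_tag(c, "x", stats) for c in children):
        return 0
    if tag == "a" and len(children) > 0 and all(
            has_tag(c, "y", stats) for c in children):
        return 1
    raise ValueError("argument is outside <a>[A+|B+]")


def typed_match(arg, stats):
    """Rely on the static type <a>[A+|B+]: the tag is known to be `a, the
    list is non-empty and homogeneous, so the first element decides."""
    _, children = arg
    return 0 if has_tag(children[0], "x", stats) else 1


arg = ("a", [("y", [])] * 4)  # four elements of type B

naive_stats = {"inspected": 0}
typed_stats = {"inspected": 0}
print(naive_match(arg, naive_stats), naive_stats["inspected"])
print(typed_match(arg, typed_stats), typed_stats["inspected"])
```

In this model the naive version inspects five elements (one failed test for A, then four tests for B) while the type-directed version inspects one. With realistic types, where each membership test walks a whole subtree, the gap is larger.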

In general, a naive approach to compiling pattern matching may yield multiple runs and backtracking through the matched value. Forgetting for a moment functions and records, values can be seen as binary trees, and types simply represent regular tree languages. It is a well-known fact that such tree languages may be recognized by deterministic bottom-up tree automata; this indicates that backtracking can be eliminated. It is possible to adapt the theory of tree automata to handle the full range of CDuce patterns (with capture variables) and values. However, determinization may create huge and intractable automata (number of states and transition function); this is due to the fact that such automata perform a uniform computation, disregarding the current position in the tree. When matching a pair (v1, v2), one can choose to perform different computations on v1 and on v2 (each with a smaller automaton), but classical bottom-up tree automata do not have this flexibility. Also, we want to take into account static type information about the matched value to help pattern matching. This can be combined with the previous remark: when matching (v1, v2), one can start for example with v1; according to the result of this computation, we get more information about the (dynamic) type of the matched value and this information can simplify the work on v2: for instance, if we statically know that (v1, v2) has type (t1,t2) | (s1,s2) with non-intersecting types t1 and s1, and if the computation on v1 tells us that v1 has type t1, then we know that v2 necessarily has the type t2, and no further check is required. By using static type information, it is thus possible not only to avoid backtracking, but also to avoid checking whole parts of the matched value. This is particularly useful when working with tag-coupled document types (such as DTD types) where the tag of an XML element already provides a lot of information about its content.

Let us give another example, where we use atoms to simulate ML datatype constructors:

    type Ex = (`number,Int)
            | (`plus,(Ex,Ex))
            | (`mul,(Ex,Ex));;
    let fun eval (Ex->Int)
        (`number,(n & Int)) -> n
      | (`plus,(x & Ex,y & Ex)) -> (eval x) + (eval y)
      | (`mul,(x & Ex,y & Ex)) -> (eval x) * (eval y);;

The type constraints & Ex and & Int are useless here, but the programmer may want to specify them. A naive implementation would check that the “constructors arguments” are of the correct type (and this requires looking at the whole subexpressions), even though this is guaranteed by the static type of the function argument. An efficient implementation would simply look at the tag, dispatch according to its value, and then assume correct types for the arguments.
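The efficient strategy for eval can be mimicked in Python. The sketch below assumes that expressions are encoded as `(tag, argument)` tuples with string tags in place of CDuce atoms; the name `eval_ex` is hypothetical. It looks only at the tag and never re-validates the arguments, which mirrors the assumption that the static type Ex already holds for the input.

```python
def eval_ex(e):
    """Evaluate an expression of (modeled) type Ex by dispatching on its tag.

    No check is made that the arguments are well formed: in CDuce this is
    guaranteed by the static type of the function argument.
    """
    tag, arg = e
    if tag == "number":
        return arg
    x, y = arg
    if tag == "plus":
        return eval_ex(x) + eval_ex(y)
    # Only `mul remains, by the static type Ex.
    return eval_ex(x) * eval_ex(y)


# 2 + (3 * 4)
expr = ("plus", (("number", 2), ("mul", (("number", 3), ("number", 4)))))
print(eval_ex(expr))
```

The last branch contains no tag test at all: once `number` and `plus` are excluded, the type leaves a single possibility.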

A third example is given by the selection expression e.<t r> introduced in the previous section, which would rather be implemented by match e with [(¬<t r>_)* <t r>[x] ; _] -> x, as static type checking already ensures that <t> does not occur in the rest of the sequence.

Even more, combining static information and query optimization techniques such as rewriting, the expression [bib]/<book>[Title Year Author Author]/<title> mentioned in the previous section could be translated into:

    match bib with <_>[ (<_>[(x::_) _ _ _] | _)* ] -> x

where tests are reduced to the essential.

An efficient dispatch algorithm. Let us sketch an efficient dispatch compilation schema; we will just present some ideas, as the full formalization is outside the scope of this paper. Given n disjoint types t1, ..., tn, and a value v belonging to their union, the problem is to decide which ti the value belongs to. For instance, if the program has to test at run-time whether a value belongs to a type t, and the type-checker proves that the value is necessarily of type s, then the compiler can just produce code that decides between the types t1 = s&t and t2 = s\t. If either t1 or t2 is empty, then there is no code to produce at all. More generally, a possible implementation of pattern matching could first determine which branch to choose, by making a choice among the types ti defined as in the typing rule of the pattern matching ($t_i = (t \setminus \lbag p_1 \rbag \setminus \cdots \setminus \lbag p_{i-1} \rbag) \mathbin{\&} \lbag p_i \rbag$).

The algorithm we outline has the important property of being semantic, in the sense that its result does not change if the ti are replaced by equivalent types (two types are equivalent if they denote the same set of values). As a consequence, for instance, there is no need to apply algebraic rewriting rules to syntactically simplify the types ti before applying the algorithm.
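Since types denote sets of values, the branch types used for dispatch can be illustrated with ordinary finite sets. The Python sketch below is a toy model under explicit assumptions: types and the sets of values accepted by patterns are `frozenset`s over a small universe of integers, and `branch_types` is a hypothetical helper. It computes, for each branch, the values of the input type that reach that branch, and shows that a branch whose type is empty needs no code.

```python
def branch_types(t, accepted):
    """For input type t and the sets accepted by patterns p1..pn, return
    t1..tn where ti = (t minus the earlier accepted sets) intersected
    with the set accepted by pi."""
    remaining = frozenset(t)
    result = []
    for p in accepted:
        result.append(remaining & p)
        remaining = remaining - p
    return result


universe = frozenset(range(10))
evens = frozenset(x for x in universe if x % 2 == 0)
small = frozenset(x for x in universe if x < 5)

types = branch_types(universe, [evens, small, universe])
print([sorted(ti) for ti in types])

# A syntactically different but equivalent description of 'small' gives
# the same branch types: only the denoted set matters.
small_again = frozenset(x for x in universe if 2 * x < 10)
print(branch_types(universe, [evens, small_again, universe]) == types)

# With a more precise static type, some branches become empty (dead code).
narrow = branch_types(frozenset({0, 2, 4}), [evens, small, universe])
print([i for i, ti in enumerate(narrow) if ti])
```

The resulting types are pairwise disjoint and cover the input type, which is the situation assumed by the dispatch problem. Real CDuce types are infinite sets handled symbolically, so this model illustrates only the set-theoretic reading, not the actual algorithm.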

The compilation schema consists in descending deep into the value, starting from the root, and accumulating enough information to stop the process as soon as possible. For instance, if the value is a pair (v1, v2), the idea is to generate a new set of disjoint types s1, ..., sm and use v1 to make a choice among them; according to the result i (meaning “v1 is in si”), a new set of types ui1, ..., uili is examined and v2 is used to single out a new result j that will be enough to determine the ti the pair (v1, v2) belongs to.

All the types si, uij are computed at compile time from t1, ..., tn.10 The choice of the si's must be done so that (i) any possible value v1 will belong to one si, and (ii) the information extracted from v1 during the selection of the si is enough to avoid backtracking to v1 (that is, the knowledge of v2 and of the i such that v1 is in si must be enough to decide the tj the value (v1, v2) belongs to).

For example, consider t1 = ((0..100),Int) and t2 = ((50..150),String). A possible choice is s1 = (0..49), s2 = (50..100), and s3 = (101..150). Given v = (v1,v2),

10 Actually, it is possible to take li = n and arrange so that the dispatch on v2 is a tail-recursive call (this means: the result j of dispatching among the uij is also the result for dispatching among t1, ..., tn).


if v1 is in s1 (resp. s3), then v is in t1 (resp. t2); if v1 is in s2, then we look at v2 and check whether it is in u21 = Int or in u22 = String.
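The dispatch just described for t1 = ((0..100),Int) and t2 = ((50..150),String) can be written out as a decision procedure. The following Python sketch assumes that the value is a Python pair already known to belong to t1 | t2 (the static guarantee), that Int and String are modeled by Python `int` and `str`, and that the function name `dispatch` is hypothetical. It returns the index of the type the value belongs to.

```python
def dispatch(v):
    """Return 1 if v is in t1 = ((0..100),Int), 2 if v is in
    t2 = ((50..150),String). Precondition: v belongs to t1 | t2."""
    v1, v2 = v
    if v1 <= 49:
        # v1 is in s1 = 0..49: only t1 is possible, v2 is never inspected.
        return 1
    if v1 >= 101:
        # v1 is in s3 = 101..150: only t2 is possible.
        return 2
    # v1 is in s2 = 50..100: both types remain possible, so look at v2
    # and choose between u21 = Int and u22 = String.
    return 1 if isinstance(v2, int) else 2


for value in [(10, 5), (120, "x"), (75, 3), (75, "s")]:
    print(value, "->", dispatch(value))
```

The bounds 0 and 150 are never tested, because the precondition already guarantees them. The second component is examined only in the overlapping range 50..100, and the first component is never revisited afterwards.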

The choice of the si's is not unique and must be done heuristically. Indeed, capturing more information from v1 than strictly necessary may allow us to cut down the exploration of v2. For instance, suppose that t1 = (a1, a2), t2 = (b1, b2), that a1 is disjoint from b1 and a2 is disjoint from b2. In order to decide in which ti a value (v1, v2) is, one can either look at v1 (making the choice between a1 and b1) and ignore v2, or look at v2 (making the choice between a2 and b2) and ignore v1: there is no better choice in general. For the implementation of CDuce we chose to extract as much information from v1 as possible (that is, any subdivision of an si would give the same set of possible values for v2); our choice is based on the fact that in CDuce sequences are coded as right-associating pairs and we do not want to run through a whole sequence when the first elements may be enough to conclude. Similarly, for the encoding of XML elements we first descend on the left of a pair so that we examine in turn the tag, the attributes, and only at last the children. We do not detail here formally how to choose the si's; nevertheless, it is important to note that this can be done “semantically” so that the choice will be invariant under the replacement of the ti's by equivalent types.

7 Other and future issues

We summarize here some open questions in the design of CDuce and sketch further research directions.

Polymorphism and inference. Section 3.7 discusses the lack of polymorphism and type inference; it suggests that CDuce could benefit from some kind of powerful polymorphism mechanism, such as higher-order types; of course, introducing such features would make everything more complicated. For instance, we would probably be obliged to get rid of simple semantic definitions for the typing of pattern matching, and use more or less ad hoc approximations.

The current approach is to add specific constructions such as map or transform to handle typical cases where polymorphism is needed; a possible direction is to allow the programmer to define new constructions (syntax, typing rules, compilation schemes) in a meta language (ML for instance), and use some kind of plug-in mechanism to extend the CDuce core compiler.

Concrete interaction with XML. The natural behavior of an XML transformation written in CDuce would be to parse the input XML document, validate it with a given CDuce type, run the transformation, and output a final XML document. However, we can consider improvements to this simple scenario:

• combining programs: frameworks such as Transmorpher [6] are proposed to combine several XML transformations together to form a complex application. Between two transformations, it may be useless to output and then immediately re-parse documents; instead, XML transformation engines (and CDuce could be considered as one of them) should be able to communicate directly by exchanging internal representations of documents. Moreover, if the output of a transformation is proven to have the expected type (and the CDuce type system can enforce such a constraint), it is not necessary to validate the input of the next transformation;

• textual representation of XML is important for exchanging documents between several organizations, but it may be inadequate for some applications. For instance, in the setting of XML databases, we do not want to parse the full database for each query. A binary representation of XML would match current practice of non-XML databases; the representation should be optimized to take into account the structure of documents, and types will of course be important in this setting. For instance, if the content model of an element is completely fixed by the type, compact and efficient storage as in relational databases could be used.

• a specific CDuce program, such as a query, may not need to look at the whole input document (database); it may be interesting to design communication protocols between XML programs (the CDuce interpreter or programs) and (binary) XML storage managers that allow extracting only the needed parts of documents.

Missing XML features. The XML recommendation specifies a way to refer from one element to another with ID, IDREF and IDREFS attributes. Currently, there is no specific support in CDuce for these. We are planning to add (mutable) reference types to CDuce to reflect the pointer structure defined by ID and IDREF inside a document. This feature would be close to ML references, thus allowing the definition of complex and spurious data structures (such as graphs).

We are also planning to add support for namespaces (using pairs instead of simple atoms to represent element tags).

References

[1] M. Abadi, L. Cardelli, B. Pierce, and G. Plotkin. Dynamic typing in a statically typed language. Transactions on Programming Languages and Systems, 13(2):237–268, April 1991.

[2] Scott Boag, Don Chamberlin, Mary Fernandez, Daniela Florescu, Jonathan Robie, Jerome Simeon, and Mugur Stefanescu. XQuery 1.0: An XML Query Language. W3C Recommendation, http://www.w3.org/TR/xslt, November 1999.

[3] G. Castagna, G. Ghelli, and G. Longo. A calculus for overloaded functions with subtyping. Information and Computation, 117(1):115–135, 1995.

[4] James Clark. XSL Transformations (XSLT). W3C Recommendation, http://www.w3.org/TR/xslt, November 1999.

[5] James Clark and Steve DeRose. XML Path Language (XPath). W3C Recommendation, http://www.w3.org/TR/xpath, November 1999.

[6] Jerome Euzenat and Laurent Tardif. XML transformation flow processing. In 2nd conference on Extreme markup languages, pages 61–72, 2001. Available at http://transmorpher.inrialpes.fr/wpaper/.


[7] Mary Fernandez, Jerome Simeon, and Philip Wadler. An algebra for XML query. In Foundations of Software Technology and Theoretical Computer Science, number 1974 in Lecture Notes in Computer Science, pages 11–45, 2000.

[8] Alain Frisch, Giuseppe Castagna, and Veronique Benzaken. Semantic Subtyping. In Proceedings, Seventeenth Annual IEEE Symposium on Logic in Computer Science, pages 137–146. IEEE Computer Society Press, 2002.

[9] Haruo Hosoya and Benjamin C. Pierce. XDuce: A typed XML processing language. In Proceedings of Third International Workshop on the Web and Databases (WebDB2000), 2000.

[10] Haruo Hosoya and Benjamin C. Pierce. Regular expression pattern matching for XML. In The 25th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2001.

[11] Haruo Hosoya, Jerome Vouillon, and Benjamin C. Pierce. Regular expression types for XML. In Proceedings of the International Conference on Functional Programming (ICFP), volume 35(9) of SIGPLAN Notices, 2000.

[12] V. Vianu. A web odyssey: from Codd to XML. In Proc. of International Conference on Principles of Database Systems (PODS '01), pages 1–15. ACM Press, 2001.
