
Computing with Data: Concepts and Challenges

John Chambers
Bell Labs, Lucent Technologies

Abstract

This paper examines work in “computing with data”—in computing support for scientific and other activities to which statisticians can contribute. Relevant computing techniques, besides traditional statistical computing, include data management, visualization, interactive languages and user-interface design. The paper emphasizes the concepts underlying computing with data, with emphasis on how those concepts can help in practical work. We look at past, present, and future: some concepts as they arose in the past and as they have proved valuable in current software; applications in the present, with one example in particular, to illustrate the challenges these present; and new directions for future research, including one exciting joint project.


Contents

1 Introduction

2 The Past
  2.1 Programming Languages in 1963
  2.2 Statistical Computing: Bell Labs, 1965
  2.3 Statistical Computing: England, 1967
  2.4 Statistical Computing: Thirty Years Later

3 Concepts
  3.1 Language
  3.2 Objects
  3.3 Interfaces

4 The Present
  4.1 How are We Doing?
  4.2 An Application

5 Challenges

6 The Future
  6.1 Distributed Computing with Data
  6.2 A Co-operative Project in Computing for Statistics

7 Summary

A Principles for Good Interactive Languages

B Languages and GUIs

1 Introduction

Our use of computing, as statisticians or members of related professions, takes place in a broader context of activities, in which data is acquired, managed, and processed for any of a great variety of purposes: the term computing with data refers to these activities. Traditional statistical computing is an important part, but far from the whole.

This paper examines concepts, both in statistical software and in other areas, that can contribute to computing with data and thus to the activities that it supports, including science as well as industry, government and many others. This paper is based on the Neyman Lecture presented at the 1998 Joint Statistical Meetings, at the invitation of the Institute of Mathematical Statistics. The Neyman Lecture is intended to cover some aspect of the interface between statistics and science. Computing with data is an appropriate topic for the lecture, since the ability to define and execute the computations that we really want is often the limiting factor in applying statistical techniques today. The paper presents personal reflections on the topic, based on a long career in many related fields. I hope that users of current statistical software will find the concepts useful, and especially that the paper will contribute to discussion of how future software can improve its support of statistics research and applications.

The plan of the paper is to sandwich two general sections, on concepts and on challenges, between three sections looking at more specific examples in the past, the present, and the future. Some glimpses at the past will help motivate the concepts, an application in the present will help introduce the challenges, and possible directions for the future, including an exciting new joint project, will suggest responses to the challenges.

2 The Past

A coherent history of computing with data would require much more space and attention to diverse issues than we can spare here. As an alternative, we will take brief glimpses, first of some relevant general programming languages in 1963, and then of two activities in statistical computing that took place in 1965 and 1967. The choice, particularly in the case of statistical computing, is admittedly personal, with the excuses that it provides some history that has not previously been published and that the specifics will lead us to useful general concepts.

2.1 Programming Languages in 1963

What programming languages might catch the attention of someone interested in computing with data around 1963? In that year I was a beginning graduate student in statistics at Harvard. Those of us interested in computing occasionally wandered down the river to MIT, sometimes to hear about new developments, sometimes just to play “space war”, perhaps the world’s first computer video game. Here are three documents we might have seen (at any rate, all of them found their way into my collection soon after):

• an IBM manual for the Fortran II language;

• the MIT Press publication of the Lisp manual (McCarthy 1962);

• the book A Programming Language by Kenneth Iverson (1962).

Fortran would have been visible around Harvard; to learn about the other two at that time, one would likely have needed to be someplace such as MIT, more in the thick of computing. Each of the three introduced some concepts worth noting; together, they give a broad though not complete idea of the background for thinking about computing for statistics during that period.

Fortran (FORmula TRANslator), introduced in 1958, brought the notion that formulas, the symbolism by which scientists and mathematicians described their ideas, could be translated into computer instructions that produced analogous results. The formulas were the standard “scientific notation” for arithmetic and comparisons (slightly altered to accommodate contemporary character set limitations). Also, and perhaps more importantly, it used the “functional” notation that called for evaluating a function by writing down its name followed by a list of its arguments:

min(x, y, 100)

The concept of formulas and function calls to communicate computations may seem obvious, even trivial. But when I encountered Fortran first as an undergraduate in 1961, the concept was revolutionary for someone who had learned computing via machine language and wiring boards. An important step had been taken in seeing computing as a central tool for science.

The Lisp language, like Fortran, remains an active part of the computing scene today. Indeed, while Fortran is far from dead, its relative importance has diminished over the last twenty years or so. Lisp, on the other hand, started out as a specialized and rather academic language for manipulating list structures. Since that time it has vastly expanded its applications, to support software such as the emacs system for editing and programming, and the Lisp-Stat statistical system (Tierney 1990). Two concepts closely associated with the development of Lisp are particularly relevant. First, Lisp emphasized the ability to define serious computations from a simple, clearly defined starting point. The recursive definition of structures and computations was the key mechanism. Second, Lisp, especially in its later versions, developed the concept of an evaluation model for the language. That is, the language not only allowed users to specify what computations were to be done, it gave them a definition of the meaning of those computations, based on a model for how the evaluator operated (the λ calculus, taken from early theoretical work on computing).

Most readers will have heard of both Fortran and Lisp. Rather fewer perhaps will recognize the title of Iverson’s book. The book did not, in fact, describe an existing programming language; rather it presented a system of notation to describe certain computations. Soon after, however, it was decided to implement the notation as an actual language. To name the language, for want of any other ideas perhaps (I can testify that naming programming languages is a challenge), the authors just took the initials of the book’s title: APL.

APL, like Fortran, was a language using a version of scientific formula notation to express computations. In APL, in fact, this was the only way to express computations. All computations were built up from what would now be called unary and binary infix operators, parentheses and square brackets. APL had many operators, using non-alphanumeric characters, Greek letters, and special symbols. For example, the question mark indicated random operations:

4 ? 4

evaluates to a random permutation of the numbers 1, 2, 3, 4. Typing the expressions required a special terminal interface (originally a special type ball on an IBM electric typewriter). APL developed an intensely loyal user group, for statistical as well as other applications; the book by Anscombe (1981) presented a detailed approach to statistical programming in APL. For non-believers, the expressions could be of daunting obscurity; for example,

M ← (⍳4) ⌽ 4 4 ⍴ 'ABCD'[4?4]

This assigns to M a random symbolic 4 by 4 Latin Square (Anscombe 1981, pp 44–47). Devoted use of APL continued for many years, but its reputation for obscurity and inefficiency discouraged wider adoption. In retrospect, however, the language was a bold and important contribution, if admittedly as idiosyncratic as its author.

Two additional concepts are introduced by APL, the first being an interactive user interface: users typed formulas, APL evaluated them and, by default, printed the result. The user becomes part of the computing model. Interactive interfaces existed contemporaneously in a few specialized systems, but APL brought the concept that a general programming language could also be an interactive, user-friendly system.

A second new concept appeared in the treatment of the data, the operands appearing in the formulas. The data corresponding to operands could be scalars, vectors, or multiway arrays. But they were, in fact, dynamic, self-describing objects, to use modern terminology. Assignment operations created objects in the working area by storing the results of a computation. The user was not responsible for figuring out the dimensions of an array result; instead, APL stored the structure information with the object, and provided operators to let the user/programmer retrieve this information.
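The same idea survives directly in modern statistical languages. As a rough analogue (a sketch in S-style code, runnable in R rather than APL), an assignment creates an object whose structure travels with it, and the user queries that structure rather than tracking it separately:

    ## A computation whose result is a 3-by-4 array; the dimensions are
    ## stored as part of the object, not tracked separately by the user.
    x <- matrix(rnorm(12), nrow = 3, ncol = 4)

    dim(x)      # retrieve the stored structure: 3 4
    length(x)   # total number of elements: 12
    mode(x)     # the kind of data the object holds: "numeric"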

2.2 Statistical Computing: Bell Labs, 1965

Over the next few years, a number of projects were begun in providing computing for statistical applications. In this section and section 2.3 we will look at two somewhat unusual efforts, a design developed at Bell Labs for a statistical system and a working group in England looking at possible steps to encourage the use of computers in statistics. Neither of these resulted in a specific software package, but both involved thinking hard about the computing needs of statistics. Concepts and goals were laid out that remain relevant today, although the overall computing environment has of course changed radically. The first effort and some aspects of the second have never been described in print before, and both have some interesting anecdotes.

During 1965, a group at Bell Labs discussed a possible system to support data analysis and statistical research. There were already a number of statistical systems and programs by this time. The BMD and P-Stat programs were in use, and unknown to us at the time, John Nelder and Graham Wilkinson were then developing the first version of GenStat (Dixon 1964, Buhler 1965, Nelder 1966). The Bell Labs situation was, then as now, characterized by the combination of some large-scale, real problems with the freedom to deal in long-term research that was itself often stimulated by the applications. The question “What should the computer do for us?” then needed to be answered for both purposes.¹

During a series of meetings and by writing design documents and a little prototype software, we designed a statistical computing system based on an extension of the PL-1 language. The system was never implemented, for several reasons, mostly not related to the feasibility or merits of the proposed statistical system itself. The system had been predicated on a general operating system, Multics, which itself was a bold joint venture, but which Bell Labs dropped because of delays (it went on to be created largely at MIT and to hold on to a small but faithful user community for about twenty years). The concepts behind the statistical system, however, and even more the motivation for the concepts, are worth reviewing. The discussions included John Tukey, Martin Wilk, Ram Gnanadesikan, and Colin Mallows, and so brought substantial collective insight into the needs of statistical data analysis.

A number of unpublished documents record our thinking from 1965; two brief excerpts will show some key concepts. On February 2, 1965, we had a planning meeting and general discussion. The next day, John Tukey wrote an informal memo, Tukey (1965), to the participants. Figure 1 is a diagram from that memo.

Tukey’s contributions to statistics need little summary here, but his relation to this particular project does: he divided his time between Princeton University and Bell Labs, and tended to appear at the Labs on occasion to attend meetings or seminars, to discuss current interests, and to advise management.

¹ Large-scale applications did exist, even at this time; the analysis of the data recorded by Telstar, an early communications satellite, involved tens of thousands of observations and challenged contemporary computing technology (Gabbe, Wilk & Brown 1965).


[Figure 1 (diagram not reproduced). Recoverable labels: Data Analyst’s Mind and Intuition; Data Analyst’s Operations (Analysh); Conceptual Items; Conceptual Computer; Actual Computer; Data Analyst’s I/O; Facilitating Transducers; signed John W. Tukey (Feb. 3, 1965).]

Figure 1: Concepts for a Statistical System, from John Tukey.

His appearance at meetings often resulted in one of his inimitable memos, full of new ideas and newly invented terminology. Interpreting Tukey’s ideas was a fruitful if risky activity; Figure 1 provides a good example. Let me try my hand at explaining it, with the admission that thirty-three years of hindsight may affect the results. Keep in mind that this was written within a day of the meeting that stimulated it. We should be looking for the concepts it reveals, not picking at details.

Follow the arrows clockwise from the Mind and Intuition block. Tukey’s notion is that data analysts have an arsenal of operations applicable to data, which they describe to themselves and to each other in a combination of mathematics and (English) words, for which he coins the term Analysh. These descriptions can be made into algorithms (my term, not his)—specific computational methods, but not yet realized for an actual computer (hence the conceptual computer). Then a further mapping implements the algorithm, and running it produces output for the data analyst. The output, of course, stimulates further ideas and the cycle continues.²

On August 3, 1965, an extended group met to review some of the work so far. The following paragraph comes from opening remarks (Wilk 1965) made at that meeting by Martin Wilk, who was then head of the statistics research department at Bell Labs (emphasis added):

What do we want? We want to have easy, flexible, availability of basic or higher level operations, with convenient data manipulation, bookkeeping and IO capacity. We want to be able easily to modify data, output formats, small and large programs and to do all this and more with a standard language adapted to statistical usage.

The first emphasized passage in this paragraph defines a goal expressed thirty years later in the book Programming with Data (Chambers 1998b) by the slogan:

To turn ideas into software, quickly and faithfully.

The operations must be easily available so we can go quickly from the data analyst’s concepts to working software. They must be flexible so that the software can faithfully reflect those concepts, as we work to refine the first efforts. The term higher level operations refers to efforts to find close computational parallels to the data analyst’s essential concepts, another aspect of easy use of computing.

The second emphasized passage reflects another general principle, that we need to embed our statistical computing in a general environment, not restrict it to some pre-defined scope. The approach planned in 1965 to attain this goal was to embed the statistical operations directly in a general programming language. Later on, the same goal would be approached via interfaces between a statistical language and other languages and systems (see section 3.3 for interfaces and section 6.2 for a possible combination of both approaches).

The documents cited and the project itself were essentially forgotten after Bell Labs dropped out of the Multics operating system development. Eleven years later, a different group met again at Bell Labs with a similar charter; the resulting software eventually became the first version of S. In 1976, we did not make any explicit use of or reference to the previous work and I was the only participant common to both efforts. Only when re-reading the older documents in preparing the Neyman Lecture was I struck by the key concepts that were later re-expressed in different terms and finally implemented.

² The “facilitating transducers” I interpret to mean software that allows information to be translated back and forth between internal machine form and forms that humans can write or look at—a transducer, in general, converts energy from one form to another. So parsers and formatting software would be examples.


2.3 Statistical Computing: England, 1967

In December, 1966, a meeting on statistical computing was held in Cheltenham, England, organized by Brian Cooper and John Nelder. Papers were presented describing current systems and other topics. Some of the papers appeared in Applied Statistics in 1967 (Cooper & Nelder 1967). I was able to participate, since David Cox had invited me to spend a post-graduate year teaching at Imperial College; my contribution was a general review (Chambers 1967b), since the 1965 Bell Labs effort was not going ahead.

After the meeting, a number of the participants spontaneously formed a “British Working Party on Statistical Computing”. Members included some of the main practitioners of statistical computing in Britain at the time. The goals of the group were to promote statistical computing and start some initiatives to encourage research and co-operation in the field. Supported initially by the U.K. Science Research Council, the Working Party was, I believe, the first professional organization in statistical computing; it later metamorphosed into the computing section of the Royal Statistical Society. A description of the Working Party later appeared in the American Statistician (Chambers 1968); experience with the Working Party encouraged some of us to support early efforts towards a computing section for the American Statistical Association.

The main concrete achievement of the Working Party was the algorithm section of the journal Applied Statistics. We had some other, more ambitious goals as well; for example, to encourage data exchange and standardization among statistical systems. No explicit proposal to this effect ever resulted, but at least two memoranda and some additional correspondence provide another interesting example of concepts that were to recur later.

John Nelder wrote a description of data structures, drawing both on the recent GenStat work and on an earlier paper related to the structure of variates in experimental design. Partly in response to this and partly reflecting the earlier Bell Labs effort, I wrote a memo (Chambers 1967a) titled Self-Defining Data Structures for Statistical Computing. A thirty-one-year-old document with that title must be interesting, one way or another.

The memo contained “some ideas concerning self-defining data structures, with particular reference to data matrices”. The underlying concepts, which had evolved from discussions of the Working Party, were in effect a proposed standard for organizing statistical data on external media (although the word “standard” never appeared in the memo). What did “data matrix” mean? “Conceptually, a data matrix consists of a two-way array of data.” However, the term “array” was used in a more general sense than in Fortran or similar languages. In later terminology, it corresponded closely to a data table in a spreadsheet or relational database system, or to a data frame in S.

The memo defined a data matrix abstractly by twelve functions that would have values for a data structure “if and only if the structure is a data matrix”. For example, datum(i,j) returned the value of the j-th variate for the i-th observational unit. Individual data elements could have numeric, string, or logical values, or the values MISSING or UNDEFINED (the distinction being that MISSING values were for observations undefined for (i, j) in the range of the data matrix dimensions, and UNDEFINED for observations outside the range). Other functions provided various labelling information. There was additional discussion of possible physical storage layouts and extended structures, relating to what we would now consider different experimental designs.
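The memo's twelve functions are not reproduced here, but the flavor of such an access-function view of a data matrix can be sketched in S-style code (runnable in R); datum and MISSING correspond to the memo, the other helper names are invented for illustration, and NA stands in for the MISSING value:

    ## A small data matrix: rows are observational units, columns are variates.
    fuel_data <- data.frame(Weight = c(2800, 3200, NA),
                            Type   = c("compact", "sedan", "sedan"))

    ## Access-function view, in the spirit of the memo: the structure is known
    ## only through functions defined on it.
    n_units    <- function(d) nrow(d)           # number of observational units
    n_variates <- function(d) ncol(d)           # number of variates
    variate    <- function(d, j) names(d)[j]    # labelling information
    datum      <- function(d, i, j) d[i, j]     # value of variate j for unit i

    datum(fuel_data, 1, 1)    # 2800
    datum(fuel_data, 3, 1)    # NA, i.e. MISSING within the defined range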

The concept of self-defining data structures was implicit in APL. Here it is explicit and in some respects more general; at any rate, the data matrix structure as here defined is closely related to later data tables in statistical and database software. Self-defining data structure is key to object-based computing with data (section 3.2).

2.4 Statistical Computing: Thirty Years Later

Programming languages of thirty-five years ago had introduced some key ideas. Concerned professionals in statistics were starting to address important issues such as the real needs for serious computing with data and the concepts underlying such computing.

Some natural questions then arise. If all this was going on over thirty years ago, why do we still need to discuss the same topics now? And did the aborted efforts at that time represent missed opportunities for the statistics profession?

The first question is not at all impertinent; most of the rest of the paper will amount to considering partial answers to the question. What current concepts have we in fact got right, more or less? What are the key challenges still facing us? Are there opportunities in the near future to get some things more nearly right than we have managed in the past?

As for missed opportunities, some measure of regret is indeed inevitable. The Working Party discussion of data matrix structure had many correspondences to the relational data model, which was then about a decade in the future and which now dominates most database management software. There were far too many things wrong with the data matrix proposal as it was left for it to have made a positive contribution, but had we pursued our ideas together, statisticians might well have exerted more influence on data definition in computing, to the benefit of both the statistics and computing communities.


Only gradually, and along with enormous changes in the general computing environment, would the concepts recognized to some extent thirty years ago become practical parts of the statistician’s computing toolkit.

The key concepts identified in programming languages of 1963 were not so obvious then, and practical issues about languages like Lisp and APL would inhibit their use for some time. The concepts for statistical computing, for example of data structure, would need rethinking and refinement. Some important computations, such as graphics, lacked the general formulation thirty years ago that would arrive a few years later. Non-interactive languages (such as our proposed extension of PL-1) would not give sufficiently easy access to computations, and general interactive languages such as APL would not prove to be flexible enough, in the opinion of most potential users. That “easy and flexible” combination would remain difficult to attain.

All the languages in section 2.1 belonged to what was then beginning to be called scientific computing, in contrast to business computing. They shared the common notion that some form of symbolic formulation (function, list, or operator) was a reasonable way for the human to communicate to the computer. Business-oriented languages rejected this as too abstract for their users, in favor of various specialized syntaxes, often verbal in some form of English-like layout. This bias, for example, continued into systems for database management, including the SQL standard for relational databases. Integrating such systems with languages similar to those discussed here remains a major challenge. Modern facilities such as the JDBC interface to database software (see, for example, Hamilton, Cattell & Fisher 1997) are an improved approach to this interface, but much remains to be done.

3 Concepts

With reflections on the past as background, let us next attempt to characterize what makes computing with data effective. Our emphasis will be on defining some key concepts, and relating them to practical software. Three concepts in particular will help:

1. language, the ability to express computations;

2. objects, the ability to deal with structure;

3. interfaces, the ability to connect to other computations.

All of these arose in our discussion of past work; over time, they have evolved and become even more important.


One additional general point needs emphasis: in modern computing, there should not be a sharp distinction between users and programmers. Most programming with statistical systems is done by users, and should be. As soon as the system doesn’t do quite what the user wants, the choice is to give up or to become a programmer, by modifying what the system currently does.

Such user/programmers then naturally go through stages of involvement. In the first stage, the user needs to get that initial programming request across to the system, quickly and easily. Later, the user needs the ability to refine that first request gradually, to come closer to what was really wanted. Good software for computing with data should support all such stages smoothly.

3.1 Language

The role of language in programming is to prescribe (to the computer) and to describe (to ourselves and other humans) requests for computations. In a fundamental sense, what we can describe is what we can compute. For computing with data, in the sense we are using the term—the organization, visualization and analysis of processes involving significant quantity and complexity of information—the languages available determine what we can learn and show about the data.

Over the thirty or thirty-five year history of languages for statistical computing, has some definable progress been made? I think the answer is yes (an answer of no would certainly be discouraging) and that a little further examination of the role language plays will help. Certainly what makes a “good” language is both subjective and empirical. Every user or observer of a language will justifiably form an opinion; and empirically a good language is one with which users can produce good results.

To be a little more specific about the assessment, however, I think we can isolate two goals already implicit in the discussion.

• Users should be able to program in the language quickly and conveniently.

• Users should be able to say what they mean accurately in the language.

This is another version of the slogan in section 2.2:

To turn ideas into software, quickly and faithfully.

The relevant sense of time in measuring “quickly” should be the clock on the wall, not the computer processor time consumed. Human time is the scarce resource, for many reasons. Most users have many demands on their time; learning from data has to compete for that time, and if a new idea takes too long to express, some alternative to gaining the knowledge will be used instead. How can languages reduce the human time expended to express a new idea?


1. Users should be able to capture what has already been done and just change it a little.

2. The language and the system should give the user a high-level view, doing as much as possible automatically.

The first goal needs some tools for easy definition and modification of software in the language; see, for example, Chambers (1998b, pp 19–21). This helps the user already working in a language; not so well served currently is the user who starts from a graphical interface; we need to develop techniques to capture such interaction and convert it cleanly to language form.

The second goal is one area where real progress has, I think, been made. Languages such as APL and data structure discussions such as the Working Party memorandum had, indeed, grasped some key concepts. Modern languages, however, have a much more advanced set of concepts. Invoking a function or operation on a self-describing, general object, particularly in a language using the class/method paradigm, is much more likely to produce an expression that visibly “says what it means”. There is some tension between this goal and principles, such as strong typing, proposed for general programming languages. Section 6.2 will suggest ways to resolve such tensions.

Next, the “faithfully” part of the slogan. For a language to reflect faithfully the ideas its users want to express, it needs first to be expressive and flexible. Ease of expression, which we need to say things quickly, is equally important in saying things accurately. A long, convoluted piece of software is unlikely to be clear, probably not even to its author and certainly not to anyone else. Unclear software is also unlikely to be accurate, particularly as time goes by and it is modified for various reasons.

Just how best to encourage and assist faithful programming of ideas remains controversial. Many principles have been proposed, including structured programming, functional languages, object-oriented programming, strong typing, and many more. The notion that users become programmers and, in particular, that they do so by moving from interactive use of a system to increasingly serious programming, introduces some special considerations. As a general position, I believe that languages for computing with data should include a language that encourages initial programming steps by the user of an interactive system, but which also allows the programming to migrate smoothly to more precise and detailed control. The current version of S (Chambers 1998b) is an attempt to implement such an approach, but new and better efforts are definitely a hope for the future, as section 6 discusses. Interestingly, some recent discussion of programming languages in general makes a related distinction between interactive “scripting” languages and “system programming” languages (Ousterhout 1998). The ability to combine both, however, is perhaps the most important future goal.

An appendix to the web version of this paper (Chambers 1998a) suggests some specific principles for languages to support computing with data. To summarize, it recommends a combination of functional programming, dynamic and untyped object references, and class/method-based software for further refinement. A second appendix discusses the relation between graphical user interfaces and languages; while sometimes cast as opponents they should ideally work together, particularly by defining elements of the interfaces in the language.

3.2 Objects

What are the main useful concepts in organizing data for our computations?

1. The data we deal with exist as dynamic, self-defining objects.

2. A fundamental organizing principle for objects is that they belong to, or have as an attribute, a class, which determines the information they contain and the methods or operations that are defined for them.

3. The information content of classes can be defined recursively, starting from some simple base classes.

4. Everything is an object. Computations can examine the definitions of classes (and other constructs in the language) by using suitable objects (metadata objects) containing that information.

These concepts are almost exactly those in the discussion of data structure for the version of S described in Chambers (1998b). Very encouragingly, though, they also apply, with a few changes, to current versions of Java and to the language underlying the CORBA standard for distributed computing (of which we shall say more in the discussion of future projects), as well as to other modern languages to a greater or lesser degree.
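A minimal illustration of points 1, 2 and 4, again as a sketch in S-style code runnable in R (the class name is invented): an object carries its class, a method is selected by that class, and the class and the method definition are themselves available to computations as objects:

    ## A fitted-model-like object carrying its class as an attribute (point 2).
    fit <- structure(list(coef = c(1.2, -0.4)), class = "toyFit")

    ## A method selected by the object's class: print() dispatches on "toyFit".
    print.toyFit <- function(x, ...)
      cat("toyFit with", length(x$coef), "coefficients\n")
    print(fit)              # uses the class-specific method

    ## Everything is an object (point 4): the class name and even the method's
    ## definition can themselves be examined as data.
    class(fit)              # "toyFit"
    body(print.toyFit)      # the method's definition, as an object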

3.3 Interfaces

In contrast to the biblical account of human languages, languages for computing never had a period before the Tower of Babel. As soon as the concept of a programming language or system started to crystallize, different realizations of the concept began to appear. By thirty-five years ago, there were many languages (relative to the amount of computing in the world, at least as many as today), with recognizably different purposes.


For our discussion, the term interface means the ability of software in one language to invoke operations or methods in other languages. The need for interface computation was recognized very early, but only later was this concept explicitly implemented.

The everything is an object principle and a functional programming style suggest that an inter-language interface should be a function call returning a self-describing object. This is the model that S and some other statistical languages use currently for interfaces to subroutines in languages such as C and Fortran in the same process, and to subprocesses such as shell commands. The function defining each interface uses a model for computations in the other language; for example, mapping basic objects in S into arrays in a Fortran subroutine call.
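Both kinds of interface still look like ordinary function calls whose value is a self-describing object. A minimal sketch (S-style code, runnable in R; the Fortran call is shown only schematically, since it would need a compiled routine, and "myqr" is a hypothetical name):

    ## Interface to a subprocess: the shell command runs elsewhere, but the
    ## result comes back as an ordinary, self-describing character vector.
    files <- system("ls", intern = TRUE)
    class(files)     # "character"
    length(files)    # how many lines the command produced

    ## Interface to compiled code follows the same model: arguments are mapped
    ## onto Fortran arrays and the value is a list of objects.
    ## z <- .Fortran("myqr", x = as.double(x), n = as.integer(nrow(x)))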

What about general communication with other processes, perhaps running on other machines? The technique of remote procedure calling implements this style of interaction, for a single language; for example, current versions of Java implement remote method invocation, again between two Java applications.

This model for communication will only work between different languages if (at least) one language knows enough to construct the necessary data and perform the request in a form meaningful for the remote program in the other language. The limitation of such approaches is that each pair of systems will need a separate interface. Just as the data-structure proposal of thirty years ago hoped to do with data matrices, so it would be better if some single form could be devised to mediate both the requests and the objects. The prospects for better approaches in the future are promising, as discussed in section 6.1.

4 The Present

A few general reflections and a look at one substantial current project may help assess the impact of past work on computing with data and the challenges ahead.

4.1 How are We Doing?

This question paraphrases former New York mayor Ed Koch. The answer really comes from the voters, or in our case the users, but some general reflections are possible. The concepts outlined in the previous section have matured over the decades and been incorporated in various ways into systems and languages for computing with data. Users do get a real boost in their ability to express their ideas in software.

For example, consider a typical linear regression expression in S:

lm(Fuel ~ Weight + Type, auto98)


Assuming the user is comfortable with functional notation, and a little familiar with models in S, the expression says fairly directly what the user wants: a linear model that fits the variable Fuel to a predictor containing terms in variables Weight and Type, using the dataset named auto98. The language, based on users’ reactions, seems to express the ideas reasonably well, contributing to turning ideas about models into software quickly.

Less overt but equally important is the role of objects, particularly the everything is an object philosophy. The formula expressing the model is itself an object, meaning that the software implementing the model fitting can examine its own call to find the variables in the model and construct from that information the specific data needed. The user is not required to extract the specific variables, only to identify them. The variables are themselves self-describing objects, meaning that they can be interpreted according to their structure. For example, Weight is a numeric variable and Type a categorical factor; in the model, these will contribute suitable, different predictors to the fit. The value of the call to lm is also a self-describing object. Rather than pre-specifying all the information wanted from the fit, the user constructs the fitted model as an object and then studies that object in an open-ended, interactive series of computations.
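That open-ended study is itself just more function calls on the returned object; for instance (a sketch assuming the auto98 dataset above exists; the code is S-style and runs equally in R):

    fit <- lm(Fuel ~ Weight + Type, auto98)

    summary(fit)        # a summary chosen by the class of fit
    coef(fit)           # extract the estimated coefficients
    plot(resid(fit))    # residuals, for whatever diagnostic comes next
    update(fit, . ~ . + Weight:Type)   # refine the model without restating it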

The effectiveness of the computation also benefits from the use of the interface concept. While the management of the model specification and the construction of the object resulting from the fit are naturally part of a statistical language, the fundamental numerical computations in this case fall in an area long and skillfully worked by numerical analysts and implemented in program libraries, traditionally in Fortran. A natural interface from the statistical language to Fortran allows these computations to be used, essentially unchanged, without much concern on the statistical programmer’s part about the numerical details.

The features cited are largely “obvious” by now for users of modern statistical systems, but they represent efficiency gains in the use of the scarcest resource, skilled human time. So we have done pretty well in a number of ways in improving the environment for data analysis. However, the challenges arguably loom at least as large now as ever, for several reasons. We will look at these in section 5, but an underlying theme is that the general environment has been changing around us at least as quickly as our own software has been evolving. Both new opportunities and tougher challenges have resulted. A look at one current project will illustrate.

4.2 An Application

To focus our understanding of where we are and what challenges we face in improving our computing with data, we will benefit from looking at a current, serious, and quite successful project. The computations involve the most voluminous data source in the communications industry: the detailed information about telephone calls, or call detail records in the jargon of the industry. Over a period of several years a group of research statisticians and others looked at statistical and computational techniques using this data, with the specific goal of identifying possible fraudulent use of the system; for example, calls charged to a customer account but not authorized by the customer. The project began at AT&T Bell Labs and continued, after the split between AT&T and Lucent Technologies, as two projects at AT&T Research and at Bell Labs (the details below follow the Bell Labs version). Both projects have been interesting and successful; in particular, AT&T believes it has made substantial savings by applying the techniques.

Why, on the other hand, is this a useful example for discussing computing with data? There are several reasons: first, the computational problems are very serious, just at the edge of what is currently possible; second, the success of the project came with a great deal of human effort, so that we can reasonably ask whether improving the computing environment might make similar projects easier in the future; and, third, although the project is unique, as are nearly all serious efforts, it demonstrates a number of similarities with other major applications of statistics to science, business, and society. Elucidating these common features will help to illuminate some challenges and suggest directions for future work.

Figure 2 illustrates the process for the application. All billable calls generate information (the call detail records) needed for billing and other transaction requirements. The data includes the time and duration of the call, the calling and called number, and special codes for different kinds of call. The transaction manager program makes this information available to the fraud detection software. As calls come in for a particular customer, the software updates a signature, a statistical characterization of that customer’s calling pattern. We all use the phone system for diverse purposes, so one call may differ from the next in many ways; however, in a statistical sense the distributional properties of our usage tend to have distinctive features, differing among customers but changing only in fairly slow or regular ways for an individual customer. Designing and computing signatures to characterize both normal and fraudulent use is the art behind this approach to fraud detection. When a customer’s calling pattern starts to differ significantly from the signature, and perhaps starts to resemble signatures typical of fraudulent use, the software will at some point raise an event (in computing terminology) that triggers an intensive followup procedure, involving human intervention. Additional data, such as customer records, may be used in the followup.

The volume of data flowing through the transaction manager can be very large, from millions to hundreds of millions of calls per day depending on the application and on where the data is being drawn off for analysis. The software must update the signatures very quickly, putting strong time constraints on the form of the signature.


[Figure 2 (diagram not reproduced). Recoverable labels: Telephone; Phone Switch; Transaction Data Manager; Signature Update & Test; Customer Signatures; Customer Records; Intensive Followup.]

Figure 2: Processing of telephone call detail records to update signatures characterizing customer behavior, and to test for possible fraudulent usage.

Also, the computations must be designed both to maintain a database of customer signatures, when behavior seems normal, and also to provide as accurate a discriminator as possible for fraudulent behavior. The classification of calling patterns as possibly fraudulent can never be sure; hence, the need for a followup with human intervention. But the enormous volume of data means that the signatures must do a good job of selection, since otherwise the followup will be swamped by false detections (and cost more than it saves) or will miss a substantial amount of fraudulent calling. Clearly the design, implementation, and testing of the signatures presents a major challenge.
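The actual signature definitions are not given in this paper, so the following is only a toy sketch (S-style code, runnable in R) of the general shape of such a computation: an exponentially weighted update of a per-customer summary, and a crude test that raises an event when new behavior drifts too far from the signature. All names and thresholds are invented for illustration.

    ## Toy signature: running mean and variance of log call duration,
    ## updated with exponential weighting so old behavior fades slowly.
    update_signature <- function(sig, new_duration, weight = 0.05) {
      x <- log(new_duration)
      sig$mean <- (1 - weight) * sig$mean + weight * x
      sig$var  <- (1 - weight) * sig$var  + weight * (x - sig$mean)^2
      sig
    }

    ## Crude test: flag an event when a new call is far from the signature.
    flag_event <- function(sig, new_duration, threshold = 4) {
      z <- (log(new_duration) - sig$mean) / sqrt(sig$var)
      abs(z) > threshold
    }

    sig <- list(mean = log(180), var = 1)    # an established customer signature
    sig <- update_signature(sig, new_duration = 200)
    flag_event(sig, new_duration = 36000)    # a wildly long call: TRUE

A real signature must of course summarize far more than duration, and must be updatable within the time constraints just described.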

As a technical and statistical effort, the application has many fascinating aspects. We have introduced it here to examine computational questions, but before moving on to that, a few brief comments on the application generally are needed to put the computations in perspective. What contributed to the success of the application? As usual, good ideas (such as the signature notion) and much hard work were the main ingredients. In addition, however, there was a willingness to engage in the process underlying the technical problem, and to take on challenges in areas such as data management that were not obviously part of the statistician’s job, but were essential to success. There was also a willingness to rethink the statistical methods in a form that made sense and was implementable under the constraints.

On the other hand, one might ask whether the statistician brings some special value to the application, beyond what might be achieved by techniques broadly defined as “data mining”, based more on algorithmic than data analytical perspectives. Two aspects seem to support the value of statistical insights. First, the notion of estimating and comparing distributions, particularly in an adaptive way, turns out to be an important insight, and such thinking comes naturally from a statistical perspective. Second, this application does not lend itself to completely automated solutions: the best that a procedure is likely to hope for is to help the human in the followup, not by any means to declare automatically that a fraud situation exists. In this context, the data analyst’s skills at summary and visualization are key, particularly if we can combine them with a quality interface for the followup, as we will suggest in the next section.

The signatures have to be computable very quickly, so the production versions need to be implemented directly in an efficient form, with relatively little flexibility. In designing the signatures, however, the statisticians will benefit from “easy and flexible” access to any helpful tools. The current technique uses tools such as shell and awk scripts. The tools may themselves be manufactured or augmented by other programs, and the results of testing them will be assessed by reading the results into a statistical system (S) for display and summary. A production system will be implemented by another group, in a language that can be added to the transaction manager in the switch.

It takes nothing away from the achievement here to note that the programming for the computations is very labor intensive and requires the participants to link the tools together in an ad hoc way. Also, the databases and the user interface for followup are not currently part of the design process, but should be. In the next section, we will consider how we might modify the computing environment to make better use of the designers’ time and effort.

5 Challenges

The position of statistics as a profession, relative to science in general, to applications in industry, business, and government, or to overall academic activities, has made substantial advances over the period surveyed by this paper. Research and applications in industry, for example, have scored a number of successes such as the example in the previous section. (It can be added that advances in the computing support for statistics have contributed to many of these successes.) There remains a fairly widespread sense that our technology should be applied and/or appreciated more in many areas. Hahn & Hoerl (1998) and their discussants provide some views on the subject, in the context of industrial applications.

As suggested in the example, much of the challenge can be summarized by saying that we need to be more involved in the process behind the data. From the perspective of the present paper, this leads to the question: What aspects of computing can contribute to broadening our involvement in the process, and are there some specific new directions in statistical computing that would help?

A prescription to the statistics profession that it should be involved in more of the process will not be very helpful if it does not go beyond just encouraging students to learn more and practitioners to work harder. In particular, from the computational perspective, if the advice is only to learn database management languages, user interface programming, and various other systems in addition to the statistical software already needed, we are unlikely to see much progress.

So the overall challenge presented here translates into the challenge for computing of making this increasingly diverse range of tasks accessible to practitioners without forcing them to learn many new languages or to abandon the computing environments that, as we have noted, have largely proven helpful to the profession.

Fortunately, new technology in computing may help us to meet this challenge: section 6 will look at examples. First, though, let us make the nature of the challenge more concrete. To do so, suppose we revisit the application in section 4.2, where call detail data was used to detect possible fraudulent calling behavior.

As we noted in that section, the tools used to test out possible signature computations were to be reprogrammed later in another language; the database requirements for managing the signature databases likewise were specially crafted; results from the testing were imported manually into a statistical system for display and summary. The design of software for the followup (in particular, of a user interface for the human intervention required) was not included in the study.

Some improvements in the computing environment would help both the statistical work and the reimplementation that follows. If the computations could be distributed transparently among different languages and systems, better use could be made of software specialized to tasks such as data management and user interface programming. Software reuse in the reimplementation would also be improved if the initial design and experimentation could use some or all of the same systems as would be appropriate for the final procedure. In other words, we should be able to choose high-level, appropriate languages for the various tools, and link them together easily and flexibly.

Figure 3 suggests one possible configuration to support the application in this style. By comparison with Figure 2, we have essentially blown open the single box for computing and testing signatures into a co-ordinated collection of diverse tools.


[Figure 3 (diagram not reproduced). Recoverable labels: Customer Records (Oracle DBMS); Processing, Experimental (Perl App); Processing, Production; Manage Signature Data (Special DBMS); JDBC App; Signatures; Selection and Analysis (Statistical Language); Analysis, Visualization (Java or C++ App); Intensive Followup (Java GUI); Communication.]

Figure 3: Revised version of Figure 2, filling in the specific computations with a wish list, in which high-level systems take over important tasks. Arrows show communication of requests among the systems.

Signature databases are now managed for us in a systematic way; as one helpful possibility, the figure shows an interface to the database management through Java DataBase Connectivity, JDBC (Hamilton et al. 1997). Many other possibilities exist, but JDBC is general, handles many database systems, and benefits from many features of Java.
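The wish list has the statistical language issuing such database requests directly. As a modern hedge on what that might look like (the paper envisages JDBC from S; the sketch below instead uses the R packages DBI and RSQLite, which are not part of the design described here, and an invented table of signatures):

    library(DBI)

    ## Connect to an (in-memory) database standing in for the signature DBMS.
    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    ## An invented signatures table, just to have something to query.
    dbWriteTable(con, "signatures",
                 data.frame(customer = c("A", "B"), mean_dur = c(5.2, 6.1)))

    ## The statistician asks for summary information without leaving the
    ## statistical language; the result comes back as an ordinary data frame.
    dbGetQuery(con,
               "SELECT customer, mean_dur FROM signatures WHERE mean_dur > 5.5")

    dbDisconnect(con)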

Similarly, the followup box, which was unspecified before, is now assumed to be a graphical user interface, programmed at least in prototype form during the design phase. Bringing the GUI into the design phase is important because it allows experimenting with tools that might help the human in the followup to obtain more information, perhaps even by communicating with the statistical language to obtain visualizations and summaries. The prototypes of signature computations will still need to be done in some easily re-programmed form, but now we want to bring the corresponding language explicitly into the programming environment. Finally, there will need to be some more efficient version of the chosen computations for the production system. However, instead of assuming that these will be re-programmed from scratch we would much prefer for at least a close approximation to be produced during the design and analysis phase. That way, at least a good approximation to the performance of the final product can be obtained early on, instead of committing to an expensive reprogramming effort without direct evidence that the result will perform well enough.


Notice that relatively little of the technology we needed in the example falls under the typical definition of statistical computing. Database management systems, software for designing graphical user interfaces, and techniques for communicating computations among diverse systems are not found in the usual texts on statistical computing. Nor, to repeat the earlier point, is it useful to simply argue that they should be. Statisticians will need to understand some key concepts from these areas, to some extent, but the challenge is more to make the concepts accessible from good computing environments for data analysis, while hiding as much of the detail as possible.

The arrows in Figure 3 show requests that might likely be made from one of the systems to another. For example, if we assume the statisticians are working in their interactive environment, they will need to issue requests to the database directly to find some summary information about the signatures on the database. The prototype signature computations, of course, also need to communicate to the simulated transaction database to get input and to the signature database to update signatures. The statistical language needs to communicate to the prototype generator, both to re-program the specific computations and also to run and analyze simulations. The user interface for followup needs access to the customer database and, as we mentioned, plausibly to the statistical language or to some other graphical system for helpful visualization and summary. And so on, through a very extensive range of useful inter-connections.

Through all this we need to remember that statisticians and, even more, users of the followup interface should not be expected to master more than one or a very few of these systems in detail. It is the job of the general computing environment to provide “easy and flexible” communication among the systems. Surprisingly, perhaps, the prospects for such an environment are quite promising. Several routes might get us there; section 6.2 describes a new joint project in statistical computing that I regard as an exciting approach to this and other goals.

6 The Future

Those of us interested in computing with data have many opportunities and challenges. In particular we would like to use computing technology to help statisticians contribute more to applications. The good news is that there are some exciting opportunities for relevant research in these areas.


6.1 Distributed Computing with Data

A computing environment such as that suggested by Figure 3 combines traditional statistical computing, say through one or more statistical languages, with a rich variety of other tools. The statistician can adopt a broader role, becoming directly concerned with data management issues and with the design of user interfaces to handle the results produced by the statistical techniques. The languages and systems involved will need to communicate with one another, and with the end user.

All of this is a rich but complicated network of tools. To make things still more complicated, the users, the data, and the tools may be distributed geographically or over different computers and operating systems.

Neither the designers nor the users of tools in this environment should have to struggle with the details of all these systems. For the most part, each person will want to work with a single language or non-programming interface; the distribution of the actual computations should be transparent.

To deal with such an environment, we need some new programming concepts, not particularly associated with statistical computing in the past. Fortunately, a number of trends in general programming languages and systems are currently moving towards similar environments; after all, the need for the kind of distributed environment suggested by Figure 3 is felt broadly, not just in our own work.

Some of the key concepts, and some general programming tools related to them, are as follows.

• Requests from one component of the system to another should appear local, with the communication dealt with for the programmer, not by the programmer. For example, Java provides the technique of remote method invocation, in which an essentially ordinary-looking call to a Java method may in fact be evaluated remotely.

• Such requests, however, should ideally not be restricted to a particular language or operating system. There are many existing systems, certainly in our applications, that need to be accessible. The most developed standard and architecture for language- and system-neutral communication is CORBA, the Common Object Request Broker Architecture (Pope 1998). To oversimplify, CORBA provides a mechanism to define methods offered by an application server in a form that can be translated into various supporting languages. CORBA also provides a variety of services that manage the details of communication.

• Although we want the details of control and communication handled automatically, the programming environment must include a variety of high-level ways to specify interactions between systems and with users. In particular, the handling of events is central to all user interface design and to many other aspects of distributed systems as well. Both Java and CORBA provide mechanisms to define desired event management at a relatively high level of abstraction.

• Finally, all the information about methods, classes, and related objects must be available dynamically, so that the system itself can find the information for the programmer, rather than leaving the programmer to define the details of interfaces. This ability, variously called introspection, reflectance, runtime discovery, and self-defining metadata, is increasingly recognized as crucial to powerful distributed computing. Java, CORBA, and other systems provide such mechanisms.
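
As a concrete illustration of the first and last items above, the sketch below declares a hypothetical remote interface and then uses Java's reflection facilities to discover its methods at run time. The SignatureStore name and its methods are invented for illustration (loosely following the fraud-detection example) and are not taken from any actual system; the sketch also omits the registry steps a real remote server would need.

    import java.lang.reflect.Method;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // A hypothetical remote interface: to the programmer, a call on a
    // SignatureStore looks like an ordinary Java method call, even though
    // the implementation may run on another machine (remote method
    // invocation, the first item above).
    interface SignatureStore extends Remote {
        double[] getSignature(String customerId) throws RemoteException;
        void updateSignature(String customerId, double[] signature) throws RemoteException;
    }

    // Runtime discovery (the last item above): the system itself can ask
    // what operations the interface offers, with no separately maintained
    // description of the interface.
    public class IntrospectionDemo {
        public static void main(String[] args) {
            Method[] methods = SignatureStore.class.getMethods();
            for (int i = 0; i < methods.length; i++) {
                System.out.println(methods[i].getName() + " returns "
                        + methods[i].getReturnType().getName());
            }
        }
    }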

To move towards powerful distributed systems, we need to incorporate the new concepts into our basic programming environment. The more modern statistical languages often include some such facilities currently, but none of them combines a full set of such features with a broad range of the basic necessities for computing with data.

Here, then, is a challenge for the next generation of our software, along with some strong suggestions about useful computing tools to deal with the challenge. Where next?

6.2 A Co-operative Project in Computing for Statistics

A new project is starting that promises to bring many benefits to computing with data, including access to the distributed approach of the previous section. The project, called Omega for the moment, proposes a joint effort among a wide group interested in computing for statistics. The initial discussions included designers responsible for the current statistical languages S (John Chambers and Duncan Temple Lang), R (Robert Gentleman and Ross Ihaka) and Lisp-Stat (Luke Tierney). During and immediately after the conference "Statistical Science and the Internet" in July 1998, it became clear that we all have interests in new work that could benefit greatly from working together with a broad group of participants to produce some high-quality, open-source software.

The Omega web page, http://omega.stat.wisc.edu, documents the goals and activities of the Omega project. The present section outlines my personal view of the current, preliminary plans.

The expected results of the project can be viewed on three levels:

1. the definition of standard modules, to define capabilities needed for statistical applications;


2. implementation of these modules in packages that provide the capabilities;

3. interactive languages that include access to the capabilities.

Nearly all details are subject to change at this time, but a general approach that could provide the three levels can be described now. If other implementations prove better, these can be chosen instead of, or in addition to, the description below.

The second level is perhaps the easiest to describe. While the implementation of packages is not restricted to a specific language, there is particular interest in the Java language and the virtual machine on which it is based. Java is designed around distributed computing and includes dynamically accessible definitions of available data structures and methods. A large and growing body of software in Java is available. Its use in programming user interfaces and graphical applications is well known, but increasing attention is being paid to its use as a general programming environment, including progress in more efficient compilation of the code.
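
As a rough sketch of how the first two levels might look in Java, the fragment below writes a "module" as an interface describing a capability and a "package" as a class implementing it; an interactive language (the third level) would then call the implementation. The DistributionModule and GaussianPackage names, and the crude numerical details, are invented for illustration and are not part of the Omega plans.

    // Level 1: a "module", here written as a Java interface describing a
    // capability.  In Omega the definition might equally be expressed in a
    // language-neutral form such as CORBA IDL.
    interface DistributionModule {
        double density(double x);     // density at x
        double quantile(double p);    // inverse cumulative distribution
        double[] sample(int n);       // n random draws
    }

    // Level 2: a "package" implementing the module's capabilities.
    class GaussianPackage implements DistributionModule {
        private final java.util.Random rng = new java.util.Random();

        public double density(double x) {
            return Math.exp(-0.5 * x * x) / Math.sqrt(2.0 * Math.PI);
        }

        public double quantile(double p) {
            // Crude bisection of an approximate cumulative distribution,
            // adequate only to illustrate the module/package distinction.
            double lo = -10.0, hi = 10.0;
            for (int i = 0; i < 100; i++) {
                double mid = 0.5 * (lo + hi);
                if (cdf(mid) < p) lo = mid; else hi = mid;
            }
            return 0.5 * (lo + hi);
        }

        public double[] sample(int n) {
            double[] x = new double[n];
            for (int i = 0; i < n; i++) x[i] = rng.nextGaussian();
            return x;
        }

        // A simple logistic approximation to the normal cdf, for the sketch.
        private double cdf(double x) {
            return 1.0 / (1.0 + Math.exp(-1.702 * x));
        }
    }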

Although it includes dynamic facilities, Java grew from a more traditional, compilation-oriented view of programming. For our applications, this needs to be supplemented by a high-level but general interactive language—the "quickly" part of our goal for computing with data. In the Omega project, we anticipate providing several such interactive approaches. An interactive interface to Java itself has been developed; initially, it is relatively "pure" Java, but it can now be extended in a number of ways to provide features of interest to us and our users. For example, some functional-language style of evaluating calls to function objects would mimic abilities in current statistical languages. A language based on Java or on its underlying virtual machine can access packages written in Java directly, using Java's reflectance capability; therefore, the interface issue, in the sense of section 3.3, largely goes away for such packages.
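
To make the functional-language idea concrete, here is one way calls to function objects might be represented; the FunctionObject interface and the mean example are purely hypothetical, meant to suggest the flavor rather than describe the actual interactive interface.

    // A hypothetical representation of a function object, so that an
    // interactive layer over Java can evaluate calls in the functional
    // style familiar from statistical languages.
    interface FunctionObject {
        Object apply(Object[] arguments);
    }

    class MeanFunction implements FunctionObject {
        public Object apply(Object[] arguments) {
            double[] x = (double[]) arguments[0];
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) sum += x[i];
            return new Double(sum / x.length);
        }
    }

    class FunctionCallDemo {
        public static void main(String[] args) {
            FunctionObject mean = new MeanFunction();
            Object result = mean.apply(new Object[] { new double[] { 1.0, 2.0, 6.0 } });
            System.out.println("mean = " + result);   // prints mean = 3.0
        }
    }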

The other way to provide interactive use of packages is to add an interface between an existing statistical language (S, R, or Lisp-Stat, for example) and the packages. These interfaces would use dynamic access to module definition, say in CORBA, to request a service without intermediate compilation steps. Rather than a separate interface for each service, we only need to implement one client interface between each statistical system and CORBA. We can also implement a single server interface for each statistical system, to accept requests for methods and attempt to satisfy them dynamically in the statistical language.

We expect to implement both new languages and interfaces. The new languages are themselves interesting areas of research, in addition to the work on modules and packages. Interfaces to existing languages will be essential to providing a full range of computing to programmers and users, as facilities at the module and package level are developed in Omega.


The long-range value of the new project will rely on wide participation from the statistical community. We need to design and implement modules for new computing facilities (among many others, interesting possibilities include: general database access; new directions in modeling; Bayesian estimation; new visualization and graphics techniques). Our optimism that useful, high-quality software can come from such joint projects is based in part on existing successes in this style, such as Linux, and more directly on experience with statistical software, such as R. Here is another opportunity for us to work together to our mutual benefit.

7 Summary

The "peaceful collision of computing and statistics", as John Tukey described it in his 1965 memo, has profoundly changed the way statistics is done and how it is applied in science and other activities. The changes, on the whole beneficial, continue at an undiminished pace as we cope with new opportunities posed by changes in computing and in the applications we try to serve.

Over this long period, some concepts have been found that help produce useful software for computing with data. Languages that make it easy to express statistical ideas and that adapt smoothly as we refine the ideas allow us to turn those ideas into effective software. Powerful capabilities are created from true object-based computing, in which everything is an object and the language is capable of understanding and operating on its own classes, methods, and other components, as well as on an open-ended structure for statistical data. Effective interfaces are needed between languages and systems, so that computing with data is not restricted to a narrow scope; in the future, this needs to include computing distributed over geography, language, and computing environment.

Modern statistical software has made substantial strides in applying these concepts. Statisticians benefit from the advances in their own research and in developing tools for applications. The challenges remain undiminished, however, since both computing technology and the scope of potential applications have changed at least as fast as we have. Fortunately, those changes have included some powerful new computing tools that we can use to move ourselves forward; in the process, we can enjoy some exciting research in software for statistics. One new project has been described that offers hope for interesting and useful joint work on a wide variety of topics. If we can work effectively together, the benefits for all of us could be substantial.


Acknowledgements

In terms of computing ideas, no small list could begin to enumerate contributions over this long period of time, but mention should be made of some of those involved in versions of S in the past (Rick Becker, Allan Wilks, Trevor Hastie, and Daryl Pregibon, for example) and in current plans (Duncan Temple Lang and David James, for example).

For the description of the fraud detection project I am indebted to discussions with both the Bell Labs group, including Diane Lambert, José Pinheiro, and Don Sun, and the AT&T Research group, including Rick Becker and Allan Wilks.

A Principles for Good Interactive Languages

No one can object to this endorsement of clear, compact, understandable programming, but what properties should languages have to encourage such programming? (Encourage is the correct verb; no language can prevent dreadful programming, much less ensure good programming.) There has, needless to say, been a generation or more of argument about such issues. Here are some personal opinions.

1. The effect of evaluating basic operations or function calls should be only to return an object, with no hidden side effects (the principle of functional languages).

2. The standard, default invocation of an operation should be simple and should then be simple to modify, for example through specification of optional arguments by name.

3. One should be able to build up extensive software by gradual refinement and extension. One important tool is a general way to define methods. Methods organize software in a conceptual table of generic functions by signatures for classes of the arguments to the functions (a small sketch of such a dispatch table follows this list).

4. Individual functions and/or methods should be local objects, organizable at one or more levels into packages. The language should contain simple mechanisms for persistent storage of objects, including function/method objects, along with the necessary database operations implied to access and manage such objects. Users should be able to include or attach such packages easily to provide tools for their own programming.

5. The language's syntax should be, to paraphrase Einstein, as simple as possible, but no simpler. In practice this means that the designer has the difficult task of deciding how much syntactic sugar is good for the users. Extreme simplicity appeals to designers and theoreticians but has been shown to turn off users; conversely, sloppy or excessively elaborate languages can pretty convincingly be argued to lead to poor programming.

6. The most important principle for modern languages may be: "everything is an object". The language must be able to compute on itself (see the example in section 4.1). There must be dynamic access to information about the language itself and about the available methods (see section 6).
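
Item 3 can be made concrete with a small sketch of such a table: a generic function that keeps methods indexed by the class of its argument and dispatches on that class at call time. The names are hypothetical, and a real system would also handle inheritance and signatures over several arguments, which this sketch ignores.

    import java.util.HashMap;

    // The body of a method registered with a generic function.
    interface MethodBody {
        Object invoke(Object argument);
    }

    // A generic function: a named table of methods indexed by the class of
    // the argument, with dispatch at call time.
    class GenericFunction {
        private final String name;
        private final HashMap table = new HashMap();   // class name -> MethodBody

        GenericFunction(String name) { this.name = name; }

        void setMethod(Class argumentClass, MethodBody body) {
            table.put(argumentClass.getName(), body);
        }

        Object call(Object argument) {
            MethodBody body = (MethodBody) table.get(argument.getClass().getName());
            if (body == null) {
                throw new RuntimeException("no applicable method for " + name);
            }
            return body.invoke(argument);
        }
    }

    class DispatchDemo {
        public static void main(String[] args) {
            GenericFunction summary = new GenericFunction("summary");
            summary.setMethod(double[].class, new MethodBody() {
                public Object invoke(Object argument) {
                    double[] x = (double[]) argument;
                    double sum = 0.0;
                    for (int i = 0; i < x.length; i++) sum += x[i];
                    return "n = " + x.length + ", mean = " + (sum / x.length);
                }
            });
            System.out.println(summary.call(new double[] { 1.0, 2.0, 3.0 }));
        }
    }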

No language would get perfect marks on these criteria, and the score for any particular language would likely be debatable.

The version of S described in Chambers (1998b) was designed with these ideas in mind, but constrained of course by back compatibility, limits on time and resources, and plenty of human frailty. My personal opinion is that Java is a general programming language providing a good base, in terms of these criteria, particularly if it is supplemented by interactive access to suitable packages, in order to improve the "quickly" part of the goal. The supplement needs to provide "functional programming" features, such as items 1–3 in the list; these have proven to be useful in interactive programming. The project described in section 6.2 addresses a number of these issues.

B Languages and GUIs

The fundamental goal of language in programming is the ability to prescribe (to the computer) and to describe (to ourselves and other humans) requests for computations. In a fundamental sense, what we can describe is what we can compute. For those needing to compute with data, in the sense we are using the term—the organization, visualization and analysis of processes involving significant quantity and complexity of information—the languages available determine what we can learn and show about the data.

It may seem obvious, perhaps even trivial, to assert then that languages, and in particular good languages, are the heart of successful computing with data. I believe the statement to be true, but it is not easy to defend explicitly, and much activity in computing with data (or equally in other kinds of computing) goes on as if it were not true. So before trying to outline some key concepts for languages, we need to examine some counter-indications.

The most obvious examples are GUIs: all those buttons, sliders, menus, flying icons, waving elves and other visual aids to computing. Don't they imply that for a large fraction of users, formal languages are one or more of irrelevant, incomprehensible, or not worth the effort? The busy executive, the non-specialist from another field, the non-technical person just needing a few simple answers—these users are much likelier to interact with a non-threatening, step-at-a-time visual interface than they are to learn the rules of a language. Even those of us who may be inclined to program and not particularly frightened by a little formal notation may still feel inclined to use a visual interface when operating outside the range of our usual computational tasks.

The point-and-click, drag-and-drop, and generally non-verbal approaches to computing must be accepted as increasingly dominant, particularly for non-technical audiences. (And most particularly for the "third society" of those who intensively use technology but generally without thinking about the underlying computations.)

This recognition is not, however, a reason either for despair or for downplaying the role of language in computing with data. The role changes, but if anything the importance of language, and especially of good language, increases in the new context.

1. Non-verbal systems are worth much more if well-based on a language.

2. Languages play multiple roles in the new context.

Base the GUI on Clear Language Definitions

An interactive, non-verbal system for computing with data justifies itself by providing its users with an easy, effective way to interact with the application at issue. If we, as statisticians or others with some interest in the techniques being used, feel some uncertainty, the qualms are likely to be with the quality of what the user has learned. What in effect was the analysis implied by a particular session of pointing, dragging, sliding, etc.? Does the end result communicate a reasonable message to the user? What if the user really meant to ask something else, perhaps not easily "programmed" in this non-verbal form?

All of these questions will be better handled if the non-verbal actions map simply and directly into a good language, particularly one that is object-based and object-oriented, in senses made more specific on page 13. Just as a simple example to make the ideas concrete, suppose the user selects a button looking like Figure 4. We would probably guess that this produces a linear fit to some data. (That's our professional bias showing; other audiences might have some other interpretation, but suppose we're right in this case.)

[Figure 4: A button from a hypothetical graphical user interface.]

The result is very context-sensitive; that is, the action performed in response to the event of selecting the button depends both on the button definition directly and on the sequence of previous events in the user's session with the system. Most importantly, what the user perceives as the result of the action depends on this context. Something happens to the state of the user's interaction with the system; perhaps the user sees some immediate visualization of the event such as a line on one or more scatter plots, a table summarizing the fit, at the least a message that computations were performed. Then future interactions with the system may depend on this event; the user may call up other displays, or further analytical buttons may use the results.

If, let's say, the effect of selecting the button is to perform a linear regression, then we're asking for the definition of that action. We can only satisfactorily answer any of the questions we raised if both the direct action in response to the button and also the relevant "state" we talked about have some clear definition. That clear definition can come from a language supporting the computing with data. If the language can represent clearly the regression action itself and also (this is the really important part) the objects that define the context of the regression, then we can say what the user's analysis means in a clear way. We can then hope to make some guesses about how meaningful the results would be (for example, did the user's previous actions create a dataset for the regression that might be misleading because of selections of variables or observations). Most of all, if the action represented by the sequence of user interactions has a clear language definition, the way is open to modify that action if it should be changed.
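
As an illustration only, the sketch below shows one way the button's action could be given such a definition: the session context is an explicit object, and each button press appends an expression (in an S-like syntax, here just a string) to a session log that can later be inspected or modified. The class names and the expression syntax are hypothetical.

    import java.awt.event.ActionEvent;
    import java.awt.event.ActionListener;
    import javax.swing.JButton;

    // The "state" of the user's session, made explicit: which dataset and
    // variables earlier actions have established.
    class SessionContext {
        String dataset = "current";
        String response = "y";
        String predictor = "x";
        final StringBuffer log = new StringBuffer();
    }

    // The linear-fit button: its action is recorded as an expression in the
    // underlying language, not just as a transient visual effect.
    class LinearFitButton extends JButton {
        LinearFitButton(final SessionContext context) {
            super("Fit line");
            addActionListener(new ActionListener() {
                public void actionPerformed(ActionEvent event) {
                    String expression = "fit <- lm(" + context.response + " ~ "
                            + context.predictor + ", data = " + context.dataset + ")";
                    context.log.append(expression).append("\n");
                    // ... the expression would be evaluated by the statistical
                    // engine and the result displayed to the user.
                }
            });
        }
    }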

So the argument is that a clear language definition for the actions corresponding to a non-verbal interactive system greatly improves the quality of the system. I would also argue that the implementation of the system benefits as well, in a fairly obvious way. The development of the system, its debugging, and the process of refining it to be more useful will all require the developers to understand the system and the effects of various modes of user interaction with it. Developers will also benefit from the simplicity of changing the system. In both respects, basing the system on a good language and environment for programming with data will help, while programming the system in a low-level, obscure implementation will correspondingly defeat attempts to understand what's going on and will discourage efforts to change what needs changing.

Multiple Language Roles for GUIs

Issues of computing with data, however, are only part of the language role in any such non-verbal system. The presentation (rendering, in computer graphics terms) of the relevant information to the user and the management of user-interaction events define the user's actual view and feel of the system. Obviously, the quality of rendering and interaction greatly affects how happy the user will be with the system. With our background in mathematics and science, we may tend at first to minimize the importance of such questions. Any attempt to create a high-quality GUI should quickly disabuse us of that tendency, but may also generate a strong desire to have someone else do all the GUI programming.

Whoever does program the rendering and interaction, however, the interface between that programming and the computing with data is an essential part of the overall system design. The danger is sometimes that lack of a good way to program this interface causes the programming to be done in a language or an environment that is good for one side of the interface at the cost of the other side. To be specific, there is a temptation either to add the GUI programming to the statistical system (usually producing a clumsy and/or ugly result), or to program the system in a visual language with a crude or inflexible interface to the computing with data.

The good news here is that current developments in programming environments promise a better approach. Programming environments that provide good support for GUI programming (Java, for example) can exist in a distributed system with environments that provide good support for computing with data. The interfaces between such systems can be defined in clear, object-based terms and can be implemented conveniently by the programmer.
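
A minimal sketch of what such an object-based interface might look like, with all names hypothetical: the GUI side programs only against the interface, and different implementations can compute locally or forward the request to a statistical engine elsewhere in the distributed system.

    // A hypothetical object-based interface between the GUI side and the
    // computing-with-data side.
    interface RegressionService {
        RegressionResult fit(double[] response, double[][] predictors);
    }

    // A plain result object that the GUI can render however it chooses.
    class RegressionResult {
        final double[] coefficients;
        final double residualStandardError;

        RegressionResult(double[] coefficients, double residualStandardError) {
            this.coefficients = coefficients;
            this.residualStandardError = residualStandardError;
        }
    }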

References

Anscombe, F. (1981), Computing in Statistical Science through APL, Springer-Verlag.

Buhler, R. (1965), P-stat: an evolving user oriented language for statistical analysis of social science data, in 'Fall Joint Computer Conf.'.


Chambers, J. M. (1967a), Self-defining data structures for statistical computing, Technical report, Bell Laboratories. (Also a report to the British Working Party on Statistical Computing.)

Chambers, J. M. (1967b), 'Some general aspects of statistical computing', Appl. Statist. 17, 124–132.

Chambers, J. M. (1968), 'A British working party on statistical computing', Amer. Statistician 22(April), 19–20.

Chambers, J. M. (1998a), Computing with data: Concepts and challenges, Technical report, Bell Labs. (http://cm.bell-labs.com/stat/doc/Neyman98.ps).

Chambers, J. M. (1998b), Programming with Data: A Guide to the S Language, Springer-Verlag.

Cooper, B. E. & Nelder, J. A. (1967), 'Prologue and epilogue', Appl. Statist. 16(2), 88 and 149–151. (Pages 89–148 contain papers and discussion from the December, 1966 meeting.)

Dixon, W. J. (1964), BMD: Biomedical Computer Programs, Health Sciences Computing Facility, University of California, Los Angeles.

Gabbe, J. D., Wilk, M. B. & Brown, W. L. (1965), An analytical description of Telstar 1 measurements of the spatial distribution of 50-130 Mev protons, Technical report, Bell Laboratories.

Hahn, G. & Hoerl, R. (1998), 'Key challenges for statisticians in business and industry', Technometrics 40, 195–213. (With discussion.)

Hamilton, G., Cattell, R. & Fisher, M. (1997), JDBC Database Access with Java: A Tutorial and Annotated Reference, Addison-Wesley.

Iverson, K. E. (1962), A Programming Language, Wiley.

McCarthy, J. (1962), Lisp 1.5 Programmer's Manual, M.I.T. Press.

Nelder, J. A. (1966), General statistical program (GENSTAT IV): User's guide, Technical report, Waite Institute, Glen Osmond, Australia.

Ousterhout, J. K. (1998), 'Scripting: Higher level programming for the 21st century', IEEE Computer. (Web version at http://www.scriptics.com/people/john.ousterhout/scripting.html.)


Pope, A. (1998), The CORBA Reference Guide, Addison-Wesley.

Tierney, L. (1990), Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Wiley.

Tukey, J. W. (1965), An approach to thinking about a statistical computing system, (unpublished notes circulated at Bell Laboratories).

Wilk, M. (1965), Introductory remarks, (unpublished notes for a meeting).


