Battling entropy: the development of the MLwiN statistical modelling package, or the confessions of a well-intentioned hacker.
Jon Rasbash, Centre for Multilevel Modelling, University of Bristol
The way it is.
Here is Edward Bear, coming downstairs now, bump, bump, bump, on the back of his head. It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if he could stop bumping for a moment and think of it. And then he feels that perhaps there isn't.
Another relevant opening paragraph
A doctor, a civil engineer and a computer scientist were arguing about which was the oldest profession in the world. The doctor said, "Well, in the Bible it says that God created Eve from a rib taken from Adam. Clearly this required surgery, so my profession must be the oldest in the world." The civil engineer interrupted: "But earlier, in the book of Genesis, it says that God created the order of the heavens and the earth out of chaos. That was certainly a most spectacular feat of civil engineering. So, Doctor, my profession is older."
The computer scientist smiled confidently: "And who do you think created the chaos?"
Grady Booch – Object-Oriented Analysis and Design.
Origins of MLwiN
Mike Healy's NANOSTAT (~1981), a Minitab clone written in RATFOR.
B. W. Kernighan, RATFOR – a rational Fortran. Workshop on Fortran Preprocessors, Pasadena, Calif., p. 3, November 1974.
Mike wanted to do something on his Osborne portable computer – so he wrote NANOSTAT.
NANOSTAT architecture
Like MINITAB, data is represented as a set of columns.
Command verbs take columns, numbers and boxes as arguments.
Commands can be strung together, the outputs from one command acting as the inputs to another.
A simple architecture: a command parser, functions to create columns, and a series of a hundred or so commands that take inputs and create outputs, with no side effects.
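To make that architecture concrete, here is a toy sketch in C++ (invented names; not actual NANOSTAT code) of the worksheet-of-columns design plus side-effect-free command verbs:

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A worksheet is a set of named columns, as in MINITAB/NANOSTAT.
using Column = std::vector<double>;
using Worksheet = std::map<std::string, Column>;

// A command verb reads its argument columns and returns a fresh output
// column: no side effects on the worksheet.
using Command = std::function<Column(const std::vector<Column>&)>;

int main() {
    Worksheet ws;
    std::map<std::string, Command> verbs;
    verbs["ADD"] = [](const std::vector<Column>& in) {
        Column out(in[0].size());
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] = in[0][i] + in[1][i];
        return out;
    };
    // The parser would turn "ADD C1 C2 C3" into: resolve C1 and C2,
    // run the verb, store the result in C3.
    ws["C1"] = {1.0, 2.0, 3.0};
    ws["C2"] = {4.0, 5.0, 6.0};
    ws["C3"] = verbs["ADD"]({ws["C1"], ws["C2"]});
}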
ML2, ML3 and MLN: DOS programs
We added capabilities to fit a two-level multilevel model in 1988 and called the program ML2.
ML3 was released in 1990, and the source code was translated to C.
MLN was released in 1995, and the new N-level algorithm was written in C++.
MLN and C++
The N-level computational algorithm (never published) is a set of C++ classes for handling problem-specific, highly patterned matrices. To illustrate, consider the model:
\[ y_{ij} = (X\beta)_{ij} + u_j + e_{ij}, \qquad u_j \sim \mathrm{N}(0, \sigma_u^2), \qquad e_{ij} \sim \mathrm{N}(0, \sigma_e^2) \]
One computationally intensive step in the IGLS algorithm is to estimate the variances and covariances of the random effects. Let's look at what that involves, from a computing perspective.
Estimating equation for \(\theta = (\sigma_u^2, \sigma_e^2)\):
\[ \hat{\theta} = (Z^{*T} V^{*-1} Z^*)^{-1} Z^{*T} V^{*-1} Y^* \]
where, for each level-2 block \(j\), \(Y^*_j\) is of dimension \((n_j^2 \times 1)\), \(V^*_j\) is of dimension \((n_j^2 \times n_j^2)\) and \(Z^*_j\) is of dimension \((n_j^2 \times p)\).
But given the block diagonality of \(V^{*-1}\) this simplifies to
\[ \hat{\theta} = \Bigl(\sum_j Z_j^{*T} V_j^{*-1} Z_j^*\Bigr)^{-1} \sum_j Z_j^{*T} V_j^{*-1} Y_j^* \]
which greatly reduces the computational load.
Even so, storage is proportional to \(n_j^4\) and flop counts are proportional to \(n_j^3\). If \(n_j = 100\), RAM requirements exceed 100MB; in the early 1990s this was not possible on PCs, so…
Exploiting patterns
All the large matrices were highly structured and could be represented in terms of complex expressions using smaller building-block matrices. Doing this reduces storage from \(n_j^4\) to \(n_j p\) and flop counts from \(n_j^3\) to \(n_j p^2\). A sketch of the idea follows.
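As a flavour of what "exploiting patterns" means in code, here is a minimal sketch, assuming the two-level random-intercept model above (illustrative only, not the actual unpublished MLwiN routine, and shown for the analogous fixed-part computation). For that model \(V_j = \sigma_e^2 I + \sigma_u^2 J\) with \(J\) the all-ones matrix, so by the Sherman–Morrison identity \(Z_j^T V_j^{-1} Z_j\) can be built from \(Z_j^T Z_j\) and the column sums of \(Z_j\), without ever forming \(V_j\):

#include <vector>

// Computes A = Z' V^{-1} Z for one level-2 block without building V.
// Z is the n x p design matrix for the block, stored row-major.
// V = s2e*I + s2u*J, so V^{-1} = (1/s2e) * (I - shrink*J) with
// shrink = s2u / (s2e + n*s2u), giving O(n*p^2) flops and O(n*p) storage.
std::vector<double> block_ZtVinvZ(const std::vector<double>& Z,
                                  int n, int p, double s2e, double s2u) {
    std::vector<double> ZtZ(p * p, 0.0), colsum(p, 0.0);
    for (int i = 0; i < n; ++i) {
        const double* zi = &Z[i * p];
        for (int r = 0; r < p; ++r) {
            colsum[r] += zi[r];
            for (int c = 0; c < p; ++c)
                ZtZ[r * p + c] += zi[r] * zi[c];
        }
    }
    const double shrink = s2u / (s2e + n * s2u);
    std::vector<double> A(p * p);
    for (int r = 0; r < p; ++r)
        for (int c = 0; c < p; ++c)
            A[r * p + c] = (ZtZ[r * p + c] - shrink * colsum[r] * colsum[c]) / s2e;
    return A;  // accumulate over blocks j to form the sums above
}

The same idea, applied through the Kronecker structure of \(V^*\), yields the reductions quoted above.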
Creating the C++ matrix class hierarchy
In designing this class hierarchy I wanted to be able to take expressions such as
\[ \hat{\theta} = (Z^{*T} V^{*-1} Z^*)^{-1} Z^{*T} V^{*-1} Y^* \]
and program them directly as
theta = inv(~Zstar*inv(Vstar)*Zstar)*(~Zstar*inv(Vstar))*Ystar
However, we are working here in terms of the big matrices, which directly reflect the statistical logic but are hopelessly inefficient computationally. Each big matrix is represented internally as a patterned set of smaller rectangular and symmetric matrices. The statistical logic can then be expressed at an abstract level, while the details of storage and computation are handled efficiently by subclasses.
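A minimal sketch of how such a design can be wired up in C++ (the real MLwiN class hierarchy is unpublished; these names and details are illustrative only): a cheap-to-copy Matrix handle delegates to a virtual implementation hierarchy, so patterned subclasses decide the storage and arithmetic while the caller writes the algebra:

#include <memory>

class MatrixImpl {  // root of the patterned-matrix hierarchy
public:
    virtual ~MatrixImpl() = default;
    virtual std::shared_ptr<MatrixImpl> transpose() const = 0;
    virtual std::shared_ptr<MatrixImpl> inverse() const = 0;
    virtual std::shared_ptr<MatrixImpl> times(const MatrixImpl&) const = 0;
};

// Value-semantics handle: overloaded operators give the algebraic syntax.
class Matrix {
public:
    explicit Matrix(std::shared_ptr<MatrixImpl> p) : impl_(std::move(p)) {}
    Matrix operator~() const { return Matrix(impl_->transpose()); }
    friend Matrix inv(const Matrix& m) { return Matrix(m.impl_->inverse()); }
    friend Matrix operator*(const Matrix& a, const Matrix& b) {
        return Matrix(a.impl_->times(*b.impl_));
    }
private:
    std::shared_ptr<MatrixImpl> impl_;
};

// Subclasses such as a block-diagonal matrix store only their blocks and
// override inverse()/times() to work block by block; a Kronecker-product
// matrix never expands itself. The estimating equation is then programmed
// directly, as on the slide:
//   Matrix theta = inv(~Zstar * inv(Vstar) * Zstar)
//                * (~Zstar * inv(Vstar)) * Ystar;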
Success? The code was fast and efficient and has been pumping away for over a decade.
But did C++ and OOD help? Not sure.
C++ syntax, compiler error messages and the lack of garbage collection were all difficult.
I ignored the advice "don't write a new application and learn C++/OOD at the same time".
For example, I would get some complex message about why a variable could not be seen when I thought I had followed C++/OOD principles and syntax.
Then I would think "oh, sod it, I'll just make the variable global" – breaking the encapsulation principle.
I have not touched the code for at least 5 years and have no intention of extending it.
It would have helped to have a mentor with good applied experience of OOP/OOD.
How well does OOD, which conceptualises problems as a series of communicating objects with taxonomic relationships specified by class hierarchies, work for the highly procedural business of statistical algorithm development?
Is there a macho (or perhaps lawyer-like) culture lurking in software engineering? (COM example??)
My early experience of contacting computer scientists.
MLwiN
In 1996 we began work on a Windows version of MLN.
The key difference between console-based MLN and Windows-based MLwiN: in MLN you only see something – e.g. the model setup, a graph, a prediction, data, multilevel residuals, model constraints, hypothesis tests – when you ask for it with a command.
In MLwiN all these interdependent objects can be displayed simultaneously on screen in different windows, and an action changing one can affect the objects viewed in all the other windows, which must then be redrawn.
We therefore require an architecture that passes messages to windows when their displays have become out of date; the windows can then respond by redrawing themselves as they see fit.
Objects responding to messages: the OOD paradigm.
MLwiN implementation
The GUI front end is written in VB; the command-driven console app was turned from an EXE into a DLL.
Simultaneously we had an application in to JISC for a parallel and distributed processing version of MLN/MLwiN, where the GUI runs on a PC and the computation is done on a server or a grid.
This required minimising data transfer between the GUI and the DLL handling the computation.
Recording of system state and task processing are handled by the C++ DLL; the VB front end is a view on the system (collecting input and displaying output).
MLwiN architecture to handle simultaneous interdependent displays and buffering of GUI/back-end data:
[Diagram] The VB GUI (user interface windows, data structures, command interpreter) talks to the C++ command-driven program (action manager/dispatcher, data buffers, invalid flags – one per data item). Windows register interest in actions, send commands and request actions; the action manager notifies windows of actions and flags their data invalid, and data is copied to the GUI buffers on request.
An action knows which data structures it sets out of date; a window knows which actions affect it.
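A small sketch of the synchronisation idea in the diagram above (names assumed for illustration; the actual framework is Bruce Cameron's design, described next). Windows register interest in actions; performing an action flags the affected data items invalid and notifies the interested windows, which redraw as they see fit:

#include <map>
#include <set>
#include <string>
#include <vector>

class Window {
public:
    virtual ~Window() = default;
    virtual void onAction(const std::string& action) = 0;  // typically: redraw
};

class ActionManager {
public:
    // A window registers interest in the actions that affect it.
    void registerInterest(const std::string& action, Window* w) {
        interested_[action].insert(w);
    }
    // Each action declares which data items it sets out of date.
    void setAffectedData(const std::string& action,
                         std::vector<std::string> items) {
        affects_[action] = std::move(items);
    }
    // Called after the back end performs an action on behalf of the GUI.
    void actionPerformed(const std::string& action) {
        for (const auto& item : affects_[action])
            invalid_.insert(item);            // one invalid flag per data item
        for (Window* w : interested_[action])
            w->onAction(action);              // windows redraw as they see fit
    }
    // A window checks this before reusing a buffered copy of a data item.
    bool dataInvalid(const std::string& item) const {
        return invalid_.count(item) != 0;     // if set, re-request the data
    }
private:
    std::map<std::string, std::set<Window*>> interested_;
    std::map<std::string, std::vector<std::string>> affects_;
    std::set<std::string> invalid_;
};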
Done with some help
A friend, Bruce Cameron, was hired as a project consultant to design the framework. We benefited greatly from the input of an experienced software engineer/systems analyst.
The above architectural framework has worked well.
Bruce's input was probably crucial to MLwiN's success, such as it is.
MLwiN 1.0 released in 1998.
The Equations window
One of the design features was to allow users to work with statistical equations directly to specify and explore multilevel models.
This is because the expository materials were all based around equation representations. Users learning multilevel modelling had a double whammy: understanding how the equations operationalised the techniques, then translating from that representation to the commands that ran the model, and then back-translating text-based tables of results into the equation representation. This translation placed an unnecessary cognitive load on learners.
Many quantitative social scientists were resistant to equations. But the influential quantitative social scientists loved it.
Equations window
A multilevel regression model with random intercepts, already specified by pointing and clicking, to be extended to random slopes.
An IO device that allows, via direct manipulation, models to be specified and changed and results to be viewed. An IO device embedded in the statistical context, not an open-ended declarative symbolic-language processor.
…and view results (after running the model).
Programming the Equations window
The Equations window was a great success – and extremely straightforward to implement. This was because we had the right frameworks:
• VB's GUI programming model
• Bruce's synchronisation architecture
MCMC
In 1998 the project was joined by Bill Browne, who implemented MCMC algorithms for multilevel models in MLwiN.
Bill implemented special-case, optimised code.
It became apparent that MCMC algorithms were easier to extend to a wide range of statistical models than IGLS and the other algorithms we had been working with. These algorithms also scaled well in terms of computational load.
Bill worked with the Centre for Multilevel Modelling from 1998 to 2003; much of his work on the program is recorded in:
Browne, W.J. (2003). MCMC Estimation in MLwiN (Version 2.0). Institute of Education, University of London.
Extensibility problems
By 1999, although the architecture for the move to Windows was reasonably sound, another architectural problem was coming into focus.
The software architecture reflecting the representation of statistical models was ten years out of date, with new developments being "shoe-horned" into the old architecture. A few key differences over the decade:
1989: Normal responses; hierarchical population structures; IGLS estimation.
1999: Normal, Poisson, Binomial and Multinomial responses; hierarchical, crossed and multiple membership structures; IGLS, bootstrap and MCMC estimation.
Time for a major redesign of the software
Update the architecture to reflect the new types of models we had developed.
A central strand of statistical analysis is the process of working through a series of models and comparing them, so update the software architecture to support multiple "live" statistical models.
Make the new model information structures estimation-method independent, e.g. convenient to plug in IGLS, MCMC, quadrature, SIM_ML, bootstrapping, AIP; the current model structures are IGLS-centric.
Create an object model of the objects that are the stuff of statistical modelling: data, models, estimates, predictions, graphs, estimation engines etc.
Design in interoperability with other software (via COM, CORBA).
A big task – could UML help?
After reading quite a bit of Grady Booch and other Three Amigos texts I got excited about using UML and OOD to help us implement the next generation of the MLwiN software.
I thought this was a great opportunity to learn OO design and process skills and bring some much-needed rigour, clarity and good practice to our software design and development procedures.
I set to work…
A year later I crumpled into a heap and simply could not continue.
What went wrong?
UML helping communication
A key feature claimed for UML diagrams is that they serve as a representation that software developers and application experts (statisticians) can use to communicate reasonably unambiguously. This helps ensure that the developers build the system the application experts want, and that the objects in the system (and their inter-relationships) correspond to objects in the application knowledge domain – facilitating extensibility.
When I tried to use UML diagrams to talk about statistical structures and processes with statisticians, I found they got in the way. This could be due to my inexpert use of the diagrams. They got frustrated and I got defensive.
Lost in the process
I got lost in the UML multi-phase, iterative process. Had I spent enough time developing use cases? Should I now move on to static class diagrams? How detailed should they be at this stage? Had I got the fundamental class design right? Would these interaction diagrams be useful now? And what exactly was this Rational Unified Process anyway?
First of all, I thought that if I read enough I would be able to get things clear. Which seemed to work – until I tried to apply what I had read.
Then I thought, well, I'll just plough on anyway and it will become clear through doing. Oh, I am still confused; better go back and read some more.
After a year of this I had failed to produce a single line of code.
Not another bloody ticket sales application.
All the UML texts used airline ticket sales or loyalty-card schemes as their exemplars – sometimes hundreds of pages for a single worked example. I found it hard to transpose those exemplars onto using UML to design and implement a statistical modelling system.
A victim of hype
Although the UML texts contain statements like "there is no silver bullet", they are very persuasive: they are selling a methodology and, in the case of Rational Rose, software products to go with it.
Some stronger health warnings on the packet might have been helpful, as would some case studies of where and why UML failed.
Mentor required
In hindsight I realised that I needed a mentor to guide me through the process.
Mea culpa: I could have sought out a mentor, but I had the feeling that I really ought to clarify things a bit before seeking help from an expert. A possibly fatal lack of confidence on my part.
Friendly, accessible experts required.
Current development strategy for new statistical models
We are currently developing MCMC estimation for:
Multilevel latent category models (aka growth trajectories)
Multilevel mover/stayer models
Multilevel factor analysis and structural equation models
Multilevel multivariate response models with responses of different types defined at different levels: useful for simultaneous equation models, multiprocess models and causal models, and as an engine for multiple imputation for missing data.
All these models are being developed in MATLAB
MATLAB as a prototyping environment
Relevant MATLAB features:
Excellent features for matrix programming, and thus good for prototyping algorithms.
Excellent external interfaces to other systems: DLL (with extensive examples for C and FORTRAN), COM, DDE and SOAP.
The MATLAB compiler will translate a set of .m files to C or C++, and compile and link them. This allows easy creation of a royalty-free EXE or DLL.
A GUI RAD programming framework (combos, sliders, buttons, radio boxes, check boxes, text boxes, menus, list boxes, button groups, panels) with all the obvious event hooks defined; if that is not enough, a container for any ActiveX control.
Renders TeX strings into equations.
Development process
For each new model to be implemented:
1. Develop the algorithms, the model set-up interface, the model output display and the model diagnostic devices in MATLAB.
2. Use the MATLAB compiler to produce a C DLL interfaced to MLwiN; the model appears on the MLwiN menu.
3. MLwiN calls the DLL, converting the MLwiN data matrix to a MATLAB matrix.
4. Results are passed back to MLwiN structures.
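The DLL boundary might look roughly like this hypothetical sketch (all names and signatures invented for illustration; the talk does not show the real interface). The key point is the buffered hand-over: MLwiN passes its data matrix as one flat array, and results come back through a caller-supplied buffer:

// Hypothetical C-style entry point exported by the MATLAB-compiled DLL.
extern "C" int fit_model(
    const double* data, int n_rows, int n_cols,   // MLwiN data matrix, flat
    const int* model_spec, int spec_len,          // encoded model set-up
    double* estimates, int n_estimates)           // results written back
{
    if (!data || !model_spec || !estimates) return 1;      // bad pointers
    if (n_rows <= 0 || n_cols <= 0 || spec_len <= 0 || n_estimates <= 0)
        return 2;                                          // bad sizes
    // 1. Wrap the flat array as a matrix in the estimation code's format.
    // 2. Run the MATLAB-compiler-generated estimation routine (elided).
    // 3. Copy the parameter estimates into the caller's buffer (elided).
    return 0;                                              // success
}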
Are we using MATLAB as a development engine or as a prototyping environment?
The code is not as fast as handcrafted C/C++ – by about an order of magnitude.
The architecture is a little piecemeal, treating each new model type as a separate entity. It lacks extensibility: what happens if you want to combine model types?
However, two project programmers and two statisticians have a very immediate need to learn MCMC, and this provides a good platform for that.
As team members develop a better understanding of MCMC we can then think about a more general, extensible architecture.
MCMC learning group
We are seeking funding to set up an MCMC learning group of about 10 people associated with the team: mathematical statisticians, programmers/software engineers and applied social statisticians.
The group will use an online learning environment and work through simple to more complex models using MCMC estimation, covering:
MCMC estimation theory for each model
Implementation in MATLAB, handcrafted C/C++ routines, BUGS and OpenBUGS
Applications of the models to substantive problems.
(A sketch of the kind of simple starting model involved follows below.)
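As a taste of the starting point, here is a minimal Gibbs sampler sketch for the two-level random-intercept model shown earlier (assumptions: a flat prior on the intercept and limiting vague priors on the two variances; illustrative only, with no burn-in or convergence checking):

#include <cmath>
#include <random>
#include <vector>

struct Data { std::vector<std::vector<double>> y; };  // y[j][i], level-2 blocks

// One MCMC run for y_ij = b0 + u_j + e_ij.
void gibbs(const Data& d, int iters, std::mt19937& rng) {
    const int J = static_cast<int>(d.y.size());
    double b0 = 0.0, s2u = 1.0, s2e = 1.0;
    std::vector<double> u(J, 0.0);
    std::normal_distribution<double> z(0.0, 1.0);
    int N = 0;
    for (const auto& g : d.y) N += static_cast<int>(g.size());

    for (int it = 0; it < iters; ++it) {
        // u_j | rest ~ Normal(precision-weighted mean, 1/precision)
        for (int j = 0; j < J; ++j) {
            const int nj = static_cast<int>(d.y[j].size());
            const double prec = nj / s2e + 1.0 / s2u;
            double sum = 0.0;
            for (double yij : d.y[j]) sum += yij - b0;
            u[j] = sum / (s2e * prec) + z(rng) / std::sqrt(prec);
        }
        // b0 | rest ~ Normal(mean of (y - u), s2e / N)  [flat prior]
        double tot = 0.0;
        for (int j = 0; j < J; ++j)
            for (double yij : d.y[j]) tot += yij - u[j];
        b0 = tot / N + z(rng) * std::sqrt(s2e / N);
        // variances | rest: scaled inverse-chi-square, drawn via gamma
        double sse = 0.0, ssu = 0.0;
        for (int j = 0; j < J; ++j) {
            ssu += u[j] * u[j];
            for (double yij : d.y[j]) {
                const double r = yij - b0 - u[j];
                sse += r * r;
            }
        }
        std::gamma_distribution<double> ge(N / 2.0, 2.0 / sse);
        std::gamma_distribution<double> gu(J / 2.0, 2.0 / ssu);
        s2e = 1.0 / ge(rng);
        s2u = 1.0 / gu(rng);
        // record (b0, s2u, s2e) as one draw of the chain here
    }
}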
Outputs of the learning group
A better understanding across the team of the potential of MCMC estimation.
A better understanding of the computing issues in the specification, estimation and interpretation of statistical models using MCMC.
This increased understanding will guide decisions on a future, more general architecture – which could be for MLwiN to become a front end for OpenBUGS, which provides general model specification structures and access to samplers for nodes.
Leaving a learning ladder for others to follow.
LEMMA
We recently received funding to do some statistical methodology development, but mostly capacity building for social scientists.
From our workshops we know that many quantitative social science researchers in government and academic departments don't understand the mechanics of a multiple regression equation with interactions between continuous and categorical variables.
We are thinking hard about who we can target for progression, where conceptually and socially (e.g. work environment) they are getting stuck, and what software tools, training materials and formats they need.
The architecture of the learning environment we are developing could be the subject of another whole presentation.
A cross-disciplinary model for development?
[Diagram: social scientists, statisticians, software engineers, and ICT/learning technology/e-learning/usability specialists collaborating around "informed hackers", who produce the software and training materials.]
Standards for statistical model representation
Many tools exist for transferring primary data between proprietary formats and existing standards. However, no standards exist for the secondary data of statistical model structure, and no tools exist for transferring model structure between proprietary representations (with some exceptions in data mining).
Development of a cross-platform, language-independent component for storing model specifications is highly desirable…
Standard model component
[Diagram: a generic statistical model representation at the centre, with a Model–GUI interface to GUI components (e.g. the equation window, a graphical model view), a Model–EE interface to estimation engines, and a Model–data-source interface to data sources.]
The usual advantages of component-based design (see the sketch below):
New estimation engine (EE) algorithms can be plugged into the model, making comparison of EEs much easier – good science.
Different data sources, e.g. Excel or SAS worksheets, can easily be bound to the model.
Alternative GUI devices can be plugged into the model for developing model specification and exploration tools.
Facilitates collaborative working.
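A hypothetical sketch of those component interfaces (the interface names come from the diagram above; all member functions are invented for illustration):

#include <string>
#include <vector>

// What a data source must provide so worksheets can be bound to the model.
class ModelDataSourceInterface {
public:
    virtual ~ModelDataSourceInterface() = default;
    virtual std::vector<double> column(const std::string& name) const = 0;
};

// What an estimation engine (IGLS, MCMC, ...) needs from the model,
// and how it hands its estimates back.
class ModelEEInterface {
public:
    virtual ~ModelEEInterface() = default;
    virtual int parameterCount() const = 0;
    virtual void setEstimates(const std::vector<double>& theta) = 0;
};

// What a GUI device (e.g. an equations window) needs to display and
// directly manipulate the model specification.
class ModelGUIInterface {
public:
    virtual ~ModelGUIInterface() = default;
    virtual std::string termAsText(int term) const = 0;
    virtual void addRandomTerm(int term, int level) = 0;
};

// A concrete generic-model component would implement all three, so a new
// estimation engine can be plugged in against ModelEEInterface alone.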
Is graphical modelling a good framework to use to build the model representation component?
Reflections
It's papers, not programs, stupid.
Software engineering is not credited. For many years the ESRC explicitly did not fund it: they had a policy of funding prototyping only and leaving commercial outfits to exploit and further develop prototypes into widely usable systems. This was misguided – the interaction between software engineers, statisticians and applied researchers is crucial, and commercial outfits take too long to respond. We "sneaked in under the radar".
Software engineering can be very valuable, but software modelling techniques can be complex and easy to get lost in. Again, good cross-disciplinary communication is required.
This is now changing with the rising profile of GRID and e-learning/ICT.
The academic environment produces organic rather than structured development.