HealthWatch: A Management Tool Combining Clinical and Population Data Sets


The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student) _______________________________

HealthWatch: A Management Tool Combining Clinical and Population Data Sets

Mark David Hawker

Informatics

2007/2008


Summary

A leading health expert, Muir Gray (2006), was famously quoted as saying that the analysis of data in healthcare will have a “bigger impact than any single drug or technology likely to be introduced in the next decade”. As society becomes more reliant on technology in everyday life, and information becomes more readily available on demand, the same approach can clearly be applied in the healthcare industry. There are several benefits to being better informed, including reduced costs through more accurate resource provisioning, improved quality of care through better awareness of the factors affecting disease, and increased staff professionalism. We hope to explore the concept of healthcare data analysis and provide a base for others to investigate in future years.

This project demonstrates how software tools and techniques from a computing discipline can be applied to a problem in the healthcare domain. We use a novel requirements analysis technique known as personas, and apply an innovative combined data mining and project management methodology inspired by the Agile Unified Process (AUP), participatory design (PD) and the Cross-Industry Standard Process for Data Mining (CRISP-DM). Through the exploration of clinical and population data sets we create a data model which is exploited through the implementation of a management tool. We emphasise the term “management” because we are not attempting to replace human-computer interaction (HCI), but rather to use technology to complement human knowledge acquisition. Evaluation is demonstrated throughout, both within the methodology and post-implementation using the DECIDE framework (Preece et al., 2002). The evaluation is carefully constructed to address technical and socio-technical issues such as usability, and finally recommendations are made for further extensions.


Acknowledgements

Firstly I would like to thank my project supervisor, Dr. Natasha Shakhlevich, for her continued support and guidance throughout the project. The feedback received from my project assessor, Dr. Roy Ruddle, on the mid-project report and during the progress meeting was extremely helpful in deciding the scope of the project and is greatly appreciated. Finally, I wish to thank my friends and family who took part in the evaluation, and Dr. Rick Jones for providing additional comments.


List of Abbreviations

Third Normal Form (3NF)
Aggregate Local Region (ALR)
Aggregate National Region (ANR)
Active Server Pages (ASP)
Agile Unified Process (AUP)
Boyce-Codd Normal Form (BCNF)
Census Area Statistics (CAS)
County Local Authority (CLA)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Comma-Separated Variables (CSV)
Database Management System (DBMS)
Department of Health (DH)
Don’t Repeat Yourself (DRY)
Enterprise Unified Process (EUP)
File Transfer Protocol (FTP)
Geographic Information System (GIS)
Government Office Region (GOR)
General Practice (GP)
Graphical User Interface (GUI)
Human-Computer Interaction (HCI)
Integrated Data Modelling, Analysis and Presentation (IDMAP)
Information Technology (IT)
Model-View-Controller (MVC)
National Administrative Codes Service (NACS)
National Centre for Health Outcomes Development (NCHOD)
National Health Service (NHS)
Object-Relational Mapping (ORM)
Primary Care Trust (PCT)
Participatory Design (PD)
Practical Extraction and Reporting Language (PERL)
Hypertext Pre-Processor (PHP)
Quality and Outcomes Framework (QOF)
Questionnaire for User Interaction Satisfaction (QUIS)
Rapid Application Development (RAD)
Rich Internet Application (RIA)
Rational Unified Process (RUP)
Strategic Health Authority (SHA)
Structured Query Language (SQL)
Scalable Vector Graphics (SVG)
Terminology Reference Data Update Distribution (TRUD)
eXtensible Markup Language (XML)
eXtreme Programming (XP)
Yorkshire Centre for Health Informatics (YCHI)


Table of Contents

Summary
Acknowledgements
List of Abbreviations
Table of Contents
1. Requirements Analysis
1.1 Problem Statement
1.2 Requirements Capture
1.3 Project Scope
1.4 Project Aim and Objectives
1.5 Project Approach
1.5.1 Data Models
1.5.2 Existing Solutions
1.6 Project Evaluation
2. Project Management
2.1 Agile Unified Process (AUP)
2.2 Participatory Design (PD)
2.3 Cross-Industry Standard Process for Data Mining (CRISP-DM)
2.4 Conclusion
3. Project Schedule
3.1 Milestones
3.2 Deadlines
3.3 Tasks
4. Data Phase
4.1 Data Understanding
4.2 Data Preparation
4.2.1 Logical Database Design
4.2.2 Physical Database Design
5. Presentation Phase
5.1 Modelling
5.2 Construction
6. Evaluation
6.1 Evaluation Paradigms and Techniques
6.2 DECIDE Evaluation Framework
6.2.1 Determine
6.2.2 Explore
6.2.3 Choose
6.2.4 Identify
6.2.5 Decide
6.2.6 Evaluate
6.3 Further Work
6.4 Evaluation Summary
7. Conclusion
References
Appendix A: Personal Reflection
Appendix B: Project Plan
Appendix C: Yorkshire and The Humber Aggregate Local Region Pairings
Appendix D: Informed Consent Form
Appendix E: User Survey Template
Appendix F: Dr. Rick Jones’ Evaluation Cover Letter


1. Requirements Analysis

In this section we discuss the background to the project, including the aim and objectives, approach and scope, and requirements capture through a technique known as personas. This user-centred design approach focuses on systems development from the perspective of the user rather than on technical requirements (Junior & Filgueiras, 2005). The ethos is to demonstrate the application of information technology (IT) as a management tool to support decision-making rather than replacing human-computer interaction (HCI) with intelligent agents. Finally, we discuss how the project will be evaluated from technical and socio-technical perspectives.

1.1 Problem Statement

Data providers such as The Information Centre, a National Health Service (NHS) authority that collects, analyses and distributes national statistics on health and social care across the UK, provide reports that are freely available to view and download online. These reports are well produced, but they are not dynamic: they cannot be queried or analysed electronically. They also present data from only a single universe of discourse, which cannot be cross-tabulated with other areas of interest such as population data. This presents a challenge for health professionals attempting to compare clinical data with other types of data, such as demographics, to distinguish the factors affecting disease prevalence.

1.2 Requirements Capture

As we do not have any explicit end-users, a persona has been used to understand the requirements of this project and as a source for evaluation. Personas can be used to simulate the needs of real users and the features they require. They are one of many user modelling techniques, including user roles, user profiles, extreme characters, stereotypes and archetypes (Junior & Filgueiras, 2005). All of these techniques are similar in that they create a picture of potential users through research or data collection, whereas personas differ in that they can be based on imaginary or perceived information to enable more accurate characterisation (Junior & Filgueiras, 2005).

This technique was first documented by Cooper (1999) in his book The Inmates Are Running the Asylum, which argues for keeping users happy by ensuring technology works in the way they think. We have detailed only one persona, but multiple personas can enhance system design by providing multiple perspectives, which can be illustrated using feature-weighted priority matrices (Grudin & Pruitt, 2003). The decision was made to detail only one persona because the scope of data collection, preparation and visualisation was already large.

Adopting the recommendations of Cooper & Reimann (2007), the persona has three important goals, classified as life, experience and purpose goals. Life goals are holistic and describe the overall aim of the individual’s existence; experience goals describe how the individual wants to feel when using the product or service; and purpose goals describe which features the individual would like to use. Grudin & Pruitt (2003) advocate the use of this technique in conjunction with participatory design (PD) for the development of systems where user engagement is paramount. The following persona specification will be used throughout the project:

Tom Clayton

“The system shall enable me to compare clinical and population data.”
“The system shall reduce the time needed to manually find and extract clinical and population data.”
“The system shall enable me to filter results by geographic region and demographics.”

Life Goals: He is a health professional who researches diabetes prevalence. He creates reports that describe the quality of care received across the UK and looks for reasons why there may be variance. In the past these reports have been used for service commissioning, where particularly prevalent populations required more treatment options than less prevalent populations.

Experience Goals: He feels he is skilled in his work and has good knowledge of statistical data analysis. He already uses a site provided by The Information Centre (2007) to compare general practice (GP) results, but this does not include any population factors. He wants to remain in control of the final results and be able to investigate them further.

Purpose Goals: He wants to be able to extend his knowledge by looking at population data which provide him with demographic details such as age, sex or ethnicity. The main question he needs to answer is: is diabetes prevalence affected by the ethnicity of a population at a local or national level?

1.3 Project Scope

Considering the scope of the project helps to identify which areas will, and will not, be covered in the design of the system, and influences the project aim, objectives and approach. This is important as it establishes boundaries that prevent time being wasted. It is recognised that scope may change over time, but this is unlikely in the case of this project as the requirements and environment are not changing. In large-scale projects, scope is generally defined in terms of deliverables, functionality and data, and technical structure (Turbit, 2005). For a project of this scale it is unnecessary to go into such detail, particularly as these issues are discussed elsewhere in this report, but we will describe the intentions.

As there is no “out-of-the-box” solution combining clinical and population data, the majority of time will be spent collecting and preparing the data structures. Presentation elements will be condensed, and it is hoped that this project will act as a springboard for others to continue and explore more advanced areas of visualisation. For example, we wish to integrate an export feature into the software which will enable users to download data. The export will be available both in a generic comma-separated variables (CSV) format, which can be imported into popular software packages such as Microsoft Excel or OpenOffice Calc, and in the ARFF format used by the specialised data mining tool WEKA.
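The export logic can be sketched as follows. This is a minimal illustration rather than the HealthWatch code: the function name and the type-guessing rule are assumptions, although the @RELATION, @ATTRIBUTE and @DATA keywords are genuine ARFF syntax.

```python
import csv
import io

def csv_to_arff(csv_text, relation="healthwatch"):
    """Convert a small CSV extract into WEKA's ARFF text format.

    Assumes the first row holds column names, and treats a column as
    NUMERIC when every value parses as a float, STRING otherwise.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_numeric(col):
        try:
            for row in data:
                float(row[col])
            return True
        except ValueError:
            return False

    lines = ["@RELATION " + relation, ""]
    for i, name in enumerate(header):
        kind = "NUMERIC" if is_numeric(i) else "STRING"
        lines.append("@ATTRIBUTE %s %s" % (name, kind))
    lines += ["", "@DATA"] + [",".join(row) for row in data]
    return "\n".join(lines)

sample = "area,prevalence\nLeeds,4.2\nBradford,5.1\n"
print(csv_to_arff(sample))
```

Because the data rows themselves are already comma-separated, the CSV and ARFF exports can share one extraction routine and differ only in the header they emit.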


Data mining is the process of discovering useful patterns in large databases and using them to make decisions (Kantardzic, 2002). We envisage that the data could be used to present correlations between attributes, such as age and diabetes prevalence, but also to identify geographical areas which have similar characteristics. This may be useful as health professionals can begin to “learn” what programmes and services are most effective in areas of high (or low) prevalence. WEKA is an open-source collection of machine learning algorithms and tools for data classification, regression, clustering, association and visualisation used in data mining tasks (Witten & Frank, 2005). WEKA processes files in its own ARFF file format, which will need to be considered when creating the export functions. Many of the details are available on the WEKA web site, www.cs.waikato.ac.nz/ml/weka/, and are discussed by Witten & Frank (2005).
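As an illustration of the kind of correlation envisaged, a Pearson coefficient can be computed from first principles over area-level figures. All numbers below are invented for demonstration; they are not project data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up figures: mean age and diabetes prevalence (%) for five
# hypothetical areas; a value near +1 would suggest prevalence
# rises with the age of the population.
mean_age = [36.2, 38.9, 41.5, 43.0, 45.8]
prevalence = [3.1, 3.6, 4.2, 4.4, 5.0]
print(round(pearson(mean_age, prevalence), 3))
```

WEKA performs this kind of analysis (and much more) out of the box, which is exactly why the ARFF export matters: the heavy lifting can be delegated rather than re-implemented.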

1.4 Project Aim and Objectives

The project will incorporate the development of an independent software system, including the production of a data model, from clinical and population data sets. The focus will be on the development of the data model and the software tools and techniques that can be used. The aim is to:

Develop a data model that incorporates clinical and population data sets, and develop a prototype system that queries and presents outputs from the data model and can be used by a health professional to make inferences about disease prevalence in populations.

The project will involve an investigation into several data providers, and the challenge is to create a data model which links data sets with differing levels of granularity (local or national). To satisfy the aim, the following objectives need to be achieved:

• Identify the scope of and approach to the problem, including evaluating existing solutions.

• Identify an appropriate methodology for data processing and project management.

• Research and develop a data model for describing the multiple data sets.

• Design and develop a prototype system that queries the data model and presents the results.

• Evaluate the system to see whether it influences working practice.

These objectives will be useful in planning the project and in constructing the project schedule (see Section 3) to support its delivery.

1.5 Project Approach

We recognise that the project could be implemented in a number of ways and using a number of technologies, from presenting the data as a printable report to developing a fat-client desktop application which enables the user to analyse and store data locally and extract updates from a server. For this project, where the primary focus is on data modelling rather than software development, the implementation can remain relatively “rough and ready”. A web-based thin-client approach, which requires only access to a web browser rather than a client application, will ease the distribution of the solution and recognises that installing applications may be restricted in the user’s work environment. The selection of programming language and database management system (DBMS) is transparent to the user and so is not discussed in much detail in this report. What has been included is a discussion of what is meant by a data model, as there are a number of options available. The most important factors in development are the choice of design pattern, the model-view-controller (MVC) architecture, and the dynamic charting tool, which are discussed later in this report.

There are many programming languages that could be used, including Active Server Pages (ASP), Practical Extraction and Reporting Language (PERL), Python and Ruby. However, Hypertext Pre-Processor (PHP) has been chosen as the programming language for two reasons:

• The developer has more experience of this language than of the alternatives, through external web development projects. This will reduce the time needed to learn a new language, which is beneficial as the presentation component is secondary to producing the data model.

• PHP5 introduced support for object-orientation and improved MySQL and eXtensible Markup Language (XML) handling. This includes the ability to create constructors and methods, and cleaner exception handling via try and catch blocks (Trachtenberg, 2004).

As well as a programming language, we also require a suitable DBMS to store the data. Again, there are many choices available, including Apache Derby, MySQL, PostgreSQL and SQLite. For a relatively small project, where we will (hopefully) not be processing millions of rows of data, any platform would be suitable as the variance between them is minimal. We chose MySQL as it was available alongside the PHP installation and there is an interactive add-on, phpMyAdmin, which can be used to install and view databases via a web interface. This is useful for debugging, as most of the other DBMSs provide access only via a command line rather than a graphical interface.

We intend to use a web application framework to support development, as it helps to reduce the overheads associated with development activities such as database access, page templating and code structure, since common functions are pre-coded. The use of these frameworks promotes code re-use according to the Don’t Repeat Yourself (DRY) philosophy (Hunt & Thomas, 1999), and they often utilise MVC architectures. Having referred to the comparison of web application frameworks presented on Wikipedia (2007) and browsed the documentation and examples for each, CodeIgniter (http://www.codeigniter.com/) was identified as the easiest to use and the best documented. CodeIgniter supports the MVC architecture and is written in PHP. There is no need for a more sophisticated framework, as we do not require any of the advanced forum, e-commerce or gallery features offered by other frameworks.
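The MVC split that such a framework enforces can be caricatured in a few lines. This is a toy Python sketch rather than CodeIgniter code; the class names and figures are invented.

```python
class PrevalenceModel:
    """Model: data access only. A real model would query MySQL."""
    _rows = {"Leeds": 4.2, "Bradford": 5.1}  # invented figures

    def prevalence_for(self, area):
        return self._rows.get(area)

class TableView:
    """View: presentation only, with no data access of its own."""
    def render(self, area, value):
        return "%s: %.1f%% diabetes prevalence" % (area, value)

class ReportController:
    """Controller: receives the request and coordinates model and view."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def show(self, area):
        value = self.model.prevalence_for(area)
        return self.view.render(area, value)

controller = ReportController(PrevalenceModel(), TableView())
print(controller.show("Leeds"))
```

The pay-off of the separation is that the view (an HTML template in the real system) can be redesigned, or the model pointed at a different data source, without touching the other two layers.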

1.5.1 Data Models

A data model is a way of describing the schema of a data store which enables us to concentrate on design principles without an immediate concern for technical implementation (Beynon-Davies, 2004). There are a number of data models available: relational, object-oriented, deductive and post-relational. We will investigate the relational data model as it is considered the most appropriate to the project. This is based on the scarcity of object-oriented database management systems, the growth of object-relational mapping (ORM) tools for translating relations into object-like structures, and the prevalence of the MVC architecture. A database is a set of structures for organising data, so we must have a set of principles for describing and implementing those structures. Adhering to three data model rules, data definition, manipulation and integrity, ensures data will “just work” on any database management platform. All of these rules will be important when designing and implementing the database, so we briefly describe how each is enforced in the relational data model.

Data definition is the process of exploiting data structures to suit an application. In the relational data model we have a single data structure, the relation. Being a mathematical construct (Codd, 1970), a relation must obey a number of rules, which are presented by Beynon-Davies (2004). In particular, every relation must have a distinct name, column names must be distinct, and the entries in a column must be drawn from the same domain: for example, you cannot mix text and integer values.
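A DBMS enforces the distinct-column-name rule at definition time, as a small sketch shows. SQLite’s in-memory engine is used here purely for illustration, in place of the project’s MySQL; the table names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A relation with a distinct name and distinct column names is accepted.
conn.execute("CREATE TABLE prevalence (area TEXT, rate REAL)")

# Duplicate column names violate the relational rules and are rejected
# by the DBMS when the table is defined.
try:
    conn.execute("CREATE TABLE bad (area TEXT, area REAL)")
    duplicate_allowed = True
except sqlite3.OperationalError:
    duplicate_allowed = False
print(duplicate_allowed)
```

The column types declared in the CREATE TABLE statement are how the domain rule is expressed in SQL, although how strictly each platform polices them varies.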

Data manipulation has four aspects: how we input data, how we remove data, how we amend data and how we retrieve data (Beynon-Davies, 2004). Insertion of records is performed only once, when populating the database, and so has not been included. The language for manipulating data in the relational model is the Structured Query Language (SQL), which contains structured methods for performing each of these tasks. We shall focus on the ones appropriate to this project, in particular how we retrieve data from the relation:

Restrict: Extracts the rows that match a condition; represented by the WHERE operator in SQL.

Project: Extracts specified columns; represented by the SELECT operator in SQL.

Natural Join: Takes two (or more) relations and joins them into a single relation; represented in SQL using a combination of SELECT, FROM and WHERE r1.id = r2.id.
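The three retrieval operations combine naturally in a single SQL statement. The sketch below runs against SQLite (for illustration, not the project’s MySQL) with invented GP and QOF rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gp (id INTEGER, name TEXT);
    CREATE TABLE qof (gp_id INTEGER, prevalence REAL);
    INSERT INTO gp VALUES (1, 'Keighley Road'), (2, 'Oakworth');
    INSERT INTO qof VALUES (1, 4.2), (2, 5.1);
""")

# Project (SELECT) and restrict (WHERE prevalence > 5.0) combined with
# a join expressed, as above, by matching gp.id against qof.gp_id.
rows = conn.execute("""
    SELECT gp.name, qof.prevalence
    FROM gp, qof
    WHERE gp.id = qof.gp_id AND qof.prevalence > 5.0
""").fetchall()
print(rows)
```

This composability is the practical attraction of the relational model for HealthWatch: cross-tabulating clinical and population relations is just another join.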

Data integrity is split into two key aspects: entity integrity and referential integrity. Entity integrity relates to primary keys and states that every table must have one, and that it must be unique and cannot be null. If this is not adhered to, a relation may contain duplicated rows, which is against the rules of the relational data model. Referential integrity relates to foreign keys and ensures that if a referenced row is updated or deleted then all corresponding references are also updated or deleted. As we are not updating or deleting any records, referential integrity will simply ensure that any foreign key references a valid row of another relation.
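Both integrity rules can be demonstrated with a toy schema. SQLite is again used for illustration (it requires foreign-key checking to be switched on explicitly); the region and GP rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE gp (
        id INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES region(id)
    );
    INSERT INTO region VALUES (1, 'Yorkshire and The Humber');
""")

# Entity integrity: a duplicate primary key is rejected.
try:
    conn.execute("INSERT INTO region VALUES (1, 'London')")
    pk_violated = False
except sqlite3.IntegrityError:
    pk_violated = True

# Referential integrity: a foreign key must reference an existing row.
try:
    conn.execute("INSERT INTO gp VALUES (10, 99)")  # no region 99
    fk_violated = False
except sqlite3.IntegrityError:
    fk_violated = True

print(pk_violated, fk_violated)
```

Declaring the constraints in the schema moves these checks from application code into the DBMS, which is exactly the “just work” guarantee referred to above.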

1.5.2 Existing Solutions

We have identified two existing solutions which show different ways of presenting data. Both are provided on behalf of the NHS and are aimed primarily at health professionals. The Information Centre provides a web site for comparing Quality and Outcomes Framework (QOF) results for all GPs across the UK. The site enables users to search for a GP (or list of GPs) and then select which QOF indicators they want to compare, e.g. diabetes or asthma statistics. The results are presented in a tabular format, scored from 0 to 100%, and can be compared to the Primary Care Trust (PCT) average (a regional score) or the national average (see Figure 1, below). An export to Microsoft Excel is also available for downloading the results. The site is a good example of how simple techniques, such as bar charts and combinations of colour and text, can be effective in conveying information.

Figure 1 Screenshot Showing the Number of Diabetes Instances from 7 General Practices (GP) in Keighley (The Information Centre, 2007)

The National Centre for Health Outcomes Development (NCHOD, 2007) provides comparative data for 700 local government and health organisations across England. Data is collected annually through comparative analysis of a number of health-related factors, known as compendium indicators, at both national and local levels. Each indicator has a specification giving precise details of how the statistic was calculated and which parts of the population it includes, as well as concise definitions.

Figure 2 Screenshot Showing Male Mortality Rates from Diabetes (NCHOD, 2007)


As well as being able to download the data, there is also an “interactive atlas” with features enabling the user to select, filter, sort and explore the data (see Figure 2, above). The atlas was created using a commercial product, InstantAtlas™, which utilises scalable vector graphics (SVG) technologies. SVG is a language for creating interactive two-dimensional graphics and applications using XML (W3C, 2008), and may require a web browser plug-in to be installed before graphics can be viewed. Whilst this may be seen as a barrier to entry, it enables increased functionality as processing can be performed client-side rather than server-side. There are three types of atlas, each presenting the data in a different way (NCHOD, 2007):

• Nested Rate Plot: Presents organisations’ data and confidence intervals nested within groups of similar organisations to enable comparisons to be made between them.

• Funnel Plot: Presents the distribution of organisations’ data compared to average figures.

• Correlation Plot: Allows users to map and compare the relationship between two indicators.

This system is the most advanced implementation we have found as it presents data using

geographical locations to show a “heat map” of results which gives a quick summary of the entire

region. It also enables the user to toggle the display of boundaries and to zoom in on specific regions.

The implementation is particularly clever as it holds data over multiple years enabling comparisons to

be made not just at a geographic level but also over different periods of time. A trial workbench is

also provided which allows users to download customised data created using different statistical

methods. For example, the indicator “deaths at home from all cancers” can be divided into two

products: an indirectly age-standardised rate, or a number and percentage. This is useful as it enables analysts

to customise outputs requiring minimal time and effort to re-format the results.

1.6 Project Evaluation

An evaluation framework known as DECIDE (Preece et al., 2002) has been identified which ensures

appropriate goals and questions are defined. Evaluation will be split into two types defined by Scriven

(1967) as formative and summative. Formative evaluation addresses issues throughout the design

process – an element which is implicitly built into the project methodology through elements of PD

and expert-developer interaction – and summative evaluation is completed at the end of the project by

“observers” who did not take part in development. The balance of formative and summative

evaluation provides an adequate review from multiple user perspectives in an attempt to increase

response validity and reduce bias. As the project is targeted towards health professionals it will be

important to gain domain expert feedback to ensure the system results are accurate. Three experts

were contacted and provided with a brief of the project and all three replied saying that they would be

willing to provide feedback when required.

2. Project Management

In this section we present an overview and evaluation of three popular methodologies and then

conclude with the chosen methodology which we will follow throughout the project. A methodology

is a framework of procedures, techniques, tools and documentation aids (Bennett et al. 2006) intended

to aid the acquisition of an information system (Bocij et al. 2006). Following a methodology helps in

managing and prioritising tasks, and provides a mechanism for assessing and monitoring progress

against a project schedule (see Section 3). A combination of methodologies may be required, or a

single methodology may need to be modified to suit the specific project tasks.

2.1 Agile Unified Process (AUP)

The Agile Unified Process (AUP) is an evolution of the Rational Unified Process (RUP) and the

Enterprise Unified Process (EUP) by Ambler (1999; 2006). The underlying principle of the RUP was

to invest time into planning activities and artefact production which should result in lower costs,

timely delivery and better software quality (Germain & Robillard, 2005). Development in the AUP is

spread over four iterative phases: Inception, Elaboration, Construction and Transition. Each phase

emphasises a different discipline such as modelling, implementation and testing which ensures full

coverage over the project life-cycle. The combination of phases and iterations enables developers to

spend less time modelling requirements and more time on implementation and refining models.

Bennett et al. (2006) summarise that, although complex, it is possible to adhere to its principles without

completing each phase slavishly. The AUP suits projects with multiple developers and is oriented

towards a business environment, including activities, such as estimating costs and risks, that are not part of this

project. Whilst iterative working can be beneficial when requirements are constantly changing, for a

project such as this where time is constrained and requirements are fixed it would be inappropriate to

iterate through the entire cycle. However, iterating over each phase individually may be more

appropriate which links into the participatory design (PD) approach.

2.2 Participatory Design (PD)

PD is an approach to information systems development where stakeholders share a guiding vision

which can lead to hybrid experiences (Bennett et al. 2006). Practices take place in an “in-between”

region that is neither in the software professionals’ domain nor in the workers’ domain (Muller,

2002). The origins of PD are from Scandinavian work with trade unions and research projects in user-

participation in systems development date back to the 1970s (Bødker, 1996). PD is an exciting

concept which is similar to newer methods such as eXtreme Programming (XP) which can be seen in

today’s development world. These concepts promote rapid development through knowledge sharing.

There are several benefits to using PD such as being able to challenge assumptions of others, learning

reciprocally and generating new ideas through shared experiences (Muller, 2002). Whilst its

application may seem like common sense, it can be a very powerful tool as described in the

integration of personas by Grudin & Pruitt (2003) and in providing synergy with Rapid Application

Development (RAD) methods as researched by Beynon-Davies & Holmes (1998). Whilst not strictly

a methodology, PD is a way of working that can be incorporated into a project management

framework. This enables a single developer to gain opinions and criticisms from others on issues such

as implementation, development and testing. This sociological approach to development is one

favoured by the author as it treats software engineering as more than just developing solutions, but

solutions that are used by the “inmates” described by Cooper (1999).

2.3 CRoss Industry Standard Process for Data Mining (CRISP-DM)

The Cross Industry Standard Process for Data Mining (CRISP-DM) is a tool-neutral data mining

process model that was conceived in 1996 (Chapman et al. 2000) and builds on attempts to define

knowledge discovery methodologies (Wirth & Hipp, 2000). The aim of the process model is to act as

a safety net for data mining practitioners to ensure they can demonstrate to prospective customers that

data mining is sufficiently mature to be adopted as a key part of business processes. The CRISP-DM

process model consists of six phases that are broken into generic and specialised tasks:

• Business Understanding: State the business and data mining objectives and success criteria

on which evaluation will be based. This will help in understanding which data mining

concepts will be most appropriate, for example, classification or clustering.

• Data Understanding: Collect, describe and explore the data to verify its quality.

• Data Preparation: Clean the data by removing any “noisy” data and consolidate it by

performing transformations to create new variables or formats.

• Modelling: Select and apply appropriate techniques to satisfy the data mining objectives.

• Evaluation: Identify how well the model performed and whether it meets the needs of the

business objectives. Interpret the model to determine its usefulness.

• Deployment: Determine how the results will be used and how often they need to be updated.

A major challenge highlighted by Wirth & Hipp (2000) was in implementing such a methodology

inside a project management framework. (They had to employ external service providers to complete

some tasks.) It is clear from the user guide (Chapman et al. 2000) and the experiences of Wirth &

Hipp (2000) that the process is complex and time-consuming and needs to be adapted to create a

sufficient balance between following the methodology and achieving the requirements of the project.

2.4 Conclusion

The final methodology was created using a combination of principles from the AUP, PD and CRISP-

DM. As we have suggested: the AUP emphasises iterating over a number of phases and refining

models and code over time and is suited to projects with multiple developers and changing

requirements; PD enables shared working and promotes collaboration between developers and

workers; the stages in CRISP-DM are useful, although care needs to be taken not to get too involved

in the detail of deliverables from each stage.

The Integrated Data Modelling, Analysis and Presentation (IDMAP) Process (see Figure 3, below)

has two phases. The phases incorporate a synergy between a developer and expert who cross-

collaborate and share domain knowledge. The expert is not necessarily the problem originator

although they can be included at the Evaluation stage.

Figure 3 Integrated Data Modelling, Analysis and Presentation (IDMAP) Process

The first phase is influenced by CRISP-DM and focuses on data capture, understanding and modelling

and will be referred to throughout the report as the Data Phase. The second phase is influenced by the

AUP and focuses on the development of the prototype system and will be referred to as the

Presentation Phase. The processes in each phase are iterative and can be considered complete when

both the developer and expert are in agreement. Although the time-scale of the project will only

enable a single iteration of the life cycle, it is possible to iterate over the entire process if required. If

the data were more volatile then the project stages might be more complex. The six stages of the IDMAP

life cycle are:

• Requirements Analysis: This stage provides an understanding of requirements which will

form the basis for design, implementation and evaluation of the project. This includes

identifying the project aim and objectives, scope and formalising requirements. Background

reading is also included in this stage as it is important to investigate what solutions already

exist as it may be possible to either use them or learn from the lessons of others.

• Data Understanding: This stage includes initial data collection and data familiarisation. This

may identify data inconsistencies or quality issues that would require replacement data to be

used. Understanding the data may also lead to the detection of interesting subsets of data that

can be exploited to add value to the requirements.

• Data Preparation: This stage involves re-constructing the data into a format that can be

queried and translated into information. Included in this stage is removing any “noisy” data

and performing transformations to create new variables or formats.

• Modelling: This stage includes manipulating the data model into information, e.g.,

constructing queries or functions which present the results as a table or chart. This stage

assumes the data has been translated into an appropriate data model that can be manipulated.

• Construction: This stage involves the development of a graphical user interface (GUI) or

report which presents the results from the modelling stage. The results will be influenced by

the aim of the project and the objectives of the study.

• Evaluation: Although each process has an implicit evaluation created through the PD

collaboration of developer and expert, the project should be evaluated to present conclusions

for further study and evaluating the outputs against specified criteria.

Application of this methodology is presented in the project schedule (see Section 3) and is reflected in

the design of this report which includes the four phases of the IDMAP process: Requirements

Analysis, Data Phase, Presentation Phase and Evaluation. Reflections on the success or otherwise of

the methodology are presented in Section 7 and Appendix A, as it is hoped that it will provide an

improved framework for composite data and systems projects.

3. Project Schedule

In this section we detail the tasks required to complete this project using the chosen Integrated Data

Modelling, Analysis and Presentation (IDMAP) methodology including the tasks, milestones and

deadlines which will be used to judge the success of the project planning. The online project

management and collaboration tool Basecamp (http://www.basecamphq.com/) will be used to manage

and track project milestones, to-do lists and messages, as it sends reminder e-mails when deadlines are

close. Tick (http://www.tickspot.com/) was also used, at least initially, as an online time-tracking tool as

it could be integrated into Basecamp but was later replaced by Microsoft Excel as it was often

difficult to monitor how many hours were spent on tasks. The tool is probably more useful in a

situation with multiple developers who are being paid hourly or are working to a budget.

Figure 4 Final Project Plan

The final project plan (see Figure 4, above) contains the sections defined in this report on the left-

hand side and follows the methodology which separates data collection and preparation from the

creation of the graphical system. (Background Reading was separated from Requirements Analysis for

clarity.) The tasks specific to each section are detailed below including milestones and deadlines to

ensure timely project completion. Grey areas denote exam periods where no work was expected to be

completed and green areas are where work was completed.

3.1 Milestones

26/11/2007 Complete Data Phase

25/02/2008 Complete Presentation Phase

3.2 Deadlines

07/12/2007 Mid-Project Report

07/03/2008 Table of Contents and Draft Chapter

14/03/2008 Project Meeting

23/04/2008 Project Report (Hard Copy)

25/04/2008 Project Report (Electronic Copy)

3.3 Tasks

Requirements Analysis: w/c 1st October 2007 – w/c 15th October 2007

Define and formalise the persona, aim and objectives, scope and how the software will be evaluated.

Initial project aim and minimum requirements must be submitted by the 19th October 2007.

Background Research: w/c 15th October 2007 – w/c 17th December 2007

Investigate methodologies, existing solutions and software tools and techniques which can be

included in the mid-project report. This will be an on-going task but will be targeted to be completed

early in the project lifecycle to enable any further reading to be completed (if required).

Data Understanding: w/c 22nd October 2007 – w/c 19th November 2007

Identify and review data sets. Cleanse the data of anomalous values and ensure all relevant data has

been collected. Make contact with appropriate experts who have knowledge of the data sets in case

there are any questions or gaps in developer knowledge.

Data Preparation: w/c 12th November 2007 – w/c 19th November 2007

Re-construct the data into a data model suitable for manipulating. Documentation will include logical

and physical database design including an entity-relationship diagram. The notional milestone set for

completion of this stage is 26th November 2007 which ensures the presentation phase will be started

on time to leave enough time for evaluation.

Construction: w/c 17th December 2007 – w/c 25th February 2008

Create the graphical user interface for the data model using the software tools and techniques

identified in the background reading including the model-view-controller (MVC) architecture. The

notional milestone set for completion of this stage is 25th February 2008.

Evaluation: w/c 3rd March 2008 – w/c 31st March 2008

Evaluate the project in terms of technical and socio-technical perspectives including design heuristics,

usability and user satisfaction. Provide possible extensions and general successes/failures.

Project: w/c 1st October 2007 – w/c 21st April 2008

Although this is an on-going process, there are deliverables such as the mid-project report, table of

contents and draft chapter, project meeting and final report which will need to include some written

elements. The milestones set for the data and presentation phases should allow adequate time for the

report to be written up and checked prior to each submission.

4. Data Phase

In this section we discuss the data sources and steps required to translate them into a relational

database format. Data granularity is discussed, as some sources are available at a finer grain of detail,

and an ideal method for matching is proposed along with the approach taken in this project. A logical

and physical design is included to ensure the integrity of the final data model. Consultation took place

between the developer and experts who had knowledge of the availability of data and included an

employee at The Information Centre and the National Administrative Codes Service (NACS).

4.1 Data Understanding

Figure 5 Data Sources and Proposed Data Architecture

The initial understanding is that we have two distinct data sources (see Figure 5, above) – clinical and

population – which are entirely separate, as there are no direct

relationships between the two. The aim is to integrate these data marts so that we can link the clinical

and population data sources together using common attributes. There are already links between the

Quality Outcomes Framework (QOF) and Terminology Reference Data Update Distribution (TRUD)

data marts and the NACS and TRUD data marts which are discussed below.

Clinical Data

The QOF is an annual reward and incentive programme detailing general practice (GP) results, with

the latest being from 2006/2007. It details how well a practice is organised, patients' experiences of

care and how diseases, e.g., diabetes, are managed. Another potential source was the National

Diabetes Audit, but this could not be obtained for confidentiality reasons. The QOF

data set contains records from 8,372 GPs in England each with the following attributes:

• GP Code which uniquely identifies a GP in the UK along with its Practice List Size.

• Number of Disease Cases. (We can work out prevalence, or density, by taking the number of

cases and dividing by the Practice List Size.)

• Primary Care Trust (PCT) Code for linking a GP to a PCT. This enables data from a

collection of GPs to be aggregated at a local level.
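The prevalence calculation described above, and its aggregation over a cluster of practices, can be sketched as follows (all figures are illustrative, and `prevalence` and `pooled_prevalence` are hypothetical helper names):

```python
def prevalence(disease_cases, practice_list_size):
    """Disease prevalence (density) as a proportion of the registered population."""
    return disease_cases / practice_list_size

def pooled_prevalence(practices):
    """Prevalence for a cluster of practices (e.g. all GPs in a PCT):
    total cases over total list size, rather than the mean of the
    individual practice rates, which would over-weight small practices."""
    total_cases = sum(cases for cases, _ in practices)
    total_size = sum(size for _, size in practices)
    return total_cases / total_size

# A practice with 120 diabetes cases on a list of 3,000 patients: 4% prevalence.
single = prevalence(120, 3000)
cluster = pooled_prevalence([(120, 3000), (150, 5000)])
```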

The NACS contains the geographical details of all GPs, PCTs and Strategic Health Authorities (SHA)

and so will be useful for linking a GP to a PCT and a PCT to a SHA. (A SHA is a collection of PCTs

which are aggregated at a national level.) This will enable results to be obtained for a “cluster” of GPs

belonging to a PCT, or for all GPs in a SHA. The NACS data set contains records from 157 PCTs,

8,631 GPs and 10 SHAs. However, the codes in the QOF data do not match the coding structures of

NACS, and so a third service, TRUD, is required to match values across the two data marts. This was an

unforeseen problem as it was assumed that the QOF and NACS data would be connected. This

highlights one of the problems of applying theory to a practical task and how additional knowledge

was needed to find the appropriate service that would link the two sets.

The TRUD is a service hosted by the UK Terminology Centre that provides NHS reference data to

NHS and non-NHS third parties and contains national reference code data on NHS individuals and

organisations. Restricted access is given via File Transfer Protocol (FTP) which contains several data

files including a Microsoft Access database file with lookups for all 12,217 GP records in the UK.

(Note that the QOF data mart contains 8,372 GP records, NACS 8,631 and TRUD 12,217 which

means that already there are discrepancies in numbers and will lead to losses of data.)
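The impact of such discrepancies on a join can be quantified with simple set operations over the GP codes. A minimal sketch using hypothetical codes:

```python
# Hypothetical GP codes illustrating the mismatch between the data marts.
qof_codes = {"B86012", "B86030", "B86071"}
trud_codes = {"B86012", "B86030", "B86099", "B86105"}

matched = qof_codes & trud_codes   # practices that survive the lookup join
lost = qof_codes - trud_codes      # QOF practices with no TRUD entry (data loss)
```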

Population Data

The Census is one of the most comprehensive sources of information about the UK population. A

form was sent to every household and establishment to be completed and sent back via post before

Census day which included details on demographics and employment (Directgov, 2007a). The most

recent Census was conducted on Sunday 29th April 2001 and the next is scheduled for 2011.

The results are broken into manageable “chunks” that are available at varying geographical levels.

The levels of interest for this project are Government Office Regions (GOR), which roughly

translate to a SHA, and County Local Authorities (CLA), which roughly translate to a PCT. Along

with these levels the results are split into three “products”. A description of each product has been

adapted from Directgov (2007b):

• Key Statistics consist of a series of 33 tables which provide a summary of the complete

results of the Census. The summaries are designed to enable easy comparisons between areas

across the full range of Census variables and cover the most significant or requested counts.

• Census Area Statistics (CAS) consists of a series of tables which provide detailed information

down to the most local geographic level. “Themed” tables are available which are specific to

a certain population type, e.g., dependent children, and univariate tables are available which

provide detailed information on a particular topic, e.g., age or country of birth.

• The Standard Tables consist of a series of detailed cross-tabulations that can be used to

compare a multitude of statistics. As with the CAS, “themed” tables are available.

Standard Table S108, which contains sex, age and economic activity by ethnic group, is ideal for this

project as it provides summaries of all of the relevant variables that are required. The data set contains

1,479 records for all 9 GOR and 148 CLA. A software tool, Nomis (2007), was used as it

provides customised downloads of Census data. The software has a simple web interface

where selections can be made of the required attributes and geographic level.

Granularity

The major challenge in combining the two data sets is ensuring accuracy of data. We describe an ideal

technique for achieving this. However, due to time constraints a simpler technique was used which

was less accurate but ensured maximum coverage and speed. From the data we can see that the

clinical data sets include this hierarchy: GP > PCT > SHA and the population data set includes this

hierarchy: CLA > GOR. Note that the Census does include other geographies such as counties, Unitary

Authorities and electoral wards, but these were less appropriate for the simpler technique.

Level of Detail      Clinical        Population

Top                  SHA (10)        GOR (9)

Middle               PCT (157)       CLA (148)

Bottom               GP (8,631)      N/A

Table 1 Granularity of the Clinical and Population Data Sets

Table 1 (above) shows that as we drill down to a finer level of detail the difference between the

numbers of instances increases. At the top level, there is one more SHA instance than GOR. This is

because there are three regions: South East, South Central and South West in the clinical data and only

South East and South West in the population data. A simple technique is to combine these regions into

an “aggregate” Southern region. However, this technique does not work as easily (if at all) for the

middle level as some regions have multiple PCTs which do not map onto any CLA and vice versa.

Because of this we deemed it appropriate to take a single GOR – Yorkshire and The Humber – and

link the middle regions together (see Appendix C), which enabled parts of the solution such as

drilling down into a national region to be demonstrated.
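The simple technique can be expressed as lookup tables that fold the mismatched top-level regions into shared aggregate names. A sketch, with region names taken from the text but the mapping itself hypothetical:

```python
# Hypothetical mappings: only the southern regions need to be merged.
SHA_TO_AGGREGATE = {
    "South East": "Southern",
    "South Central": "Southern",
    "South West": "Southern",
}
GOR_TO_AGGREGATE = {
    "South East": "Southern",
    "South West": "Southern",
}

def to_aggregate(region, mapping):
    """Map a region onto its aggregate region; unmapped regions
    (e.g. Yorkshire and The Humber) keep their own name."""
    return mapping.get(region, region)
```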

A more “scientific” method would be to investigate the land borders of each region and statistically

work out their boundaries. Once we knew these it would be possible to pinpoint whether a location

lay within a particular region. This would be similar to tracing the outlines of regions from one data

set onto a map and then adding points from the second data set on top. This was the reason why post

codes were collected from the data marts as it was thought this technique would be feasible in the

time-frame. With every post code it is possible to use a geo-coding service to gain the latitude and

longitude of an area which can then be used to plot it onto a map.
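Once boundaries and geo-coded points were available, testing whether a location lies within a region could use a standard ray-casting (point-in-polygon) algorithm. A sketch, with a hypothetical square boundary in (longitude, latitude) coordinates:

```python
def point_in_region(lon, lat, boundary):
    """Ray-casting test: is the point (lon, lat) inside the polygon 'boundary',
    given as a list of (lon, lat) vertices? Adequate for simple region outlines."""
    inside = False
    j = len(boundary) - 1
    for i in range(len(boundary)):
        xi, yi = boundary[i]
        xj, yj = boundary[j]
        # Count crossings of a horizontal ray cast from the point: an odd
        # number of edge crossings means the point is inside.
        if (yi > lat) != (yj > lat):
            if lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
                inside = not inside
        j = i
    return inside

# Hypothetical square region for illustration.
region = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

Real region outlines are far more complex, which is why the manual technique described below was adopted instead.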

Geographic Information Systems (GIS) provide a means for visualising data, such as that which we

described above, and are discussed by Leonard & Samy (1998) using software to forecast demand for

services in healthcare regions (see Figure 6, below).

Figure 6 Screenshot Created from the SAS Software (Leonard & Samy, 1998)

This example is similar to that of the NCHOD (2007) application with the use of a map and graphical

representation of results using adjacency matrices and complex time-series models. However, a

manual technique was adopted in this project by taking a sample of instances from a GOR and using

an online mapping tool Google Maps http://local.google.co.uk/ to identify which was the closest PCT

to the CLA. Evidently, the GIS approach would have been more exciting!

4.2 Data Preparation

Having identified the sources of data we are able to highlight the appropriate entities and relations

needed for this project. It may be appropriate to clean the data so that only relevant attributes are used.

For example, Economic Group from the original population data contained several categories, such as

Full and Part Time workers, Retired, and Permanently Sick or Disabled, which could be aggregated into

four main groups: Active, Inactive, Unemployed and Student.

Clinical Data

The QOF data was prepared by removing all attributes that were not required, such as other disease

statistics, leaving a GP Code, Practice List Size and Diabetes Cases. A new attribute, Disease

Prevalence, was also created by dividing Diabetes Cases by Practice List Size. As the NACS and

TRUD data sets were already structured, it was a case of removing attributes that were not required

and combining the PCT and SHA into a single relation. The final TRUD relation now has two

attributes, a GP Code and a PCT Code which will be used as a lookup for the NACS and QOF data.
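The role of the TRUD relation as a lookup between QOF and NACS can be sketched as a dictionary join that also aggregates practices up to PCT level (all codes and figures hypothetical):

```python
from collections import defaultdict

# Hypothetical records: QOF keyed by GP Code; TRUD maps GP Code -> PCT Code.
qof = {
    "B86012": {"list_size": 3000, "diabetes_cases": 120},
    "B86030": {"list_size": 5000, "diabetes_cases": 150},
    "B86071": {"list_size": 2000, "diabetes_cases": 90},   # no TRUD entry
}
trud = {"B86012": "5N1", "B86030": "5N1"}

pct_totals = defaultdict(lambda: {"cases": 0, "size": 0})
for gp_code, row in qof.items():
    pct_code = trud.get(gp_code)
    if pct_code is None:
        continue  # GP absent from the TRUD lookup: dropped, as in the project
    pct_totals[pct_code]["cases"] += row["diabetes_cases"]
    pct_totals[pct_code]["size"] += row["list_size"]
```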

Population Data

The Census data was the most time consuming data set to prepare. The steps were:

1. Download all nine GOR data sets contained in Microsoft Excel files using the Nomis (2007)

online tool. Each CLA was represented as a column and each of the Census statistics as a row.

2. The un-normalised statistic column was normalised to occupy a single column for each value,

e.g., All People, Aged 16 to 24 Years, Economically Active, Employee, Part Time, Mixed, White

and Black Caribbean now occupied seven columns instead of one. A new column Aggregate

Economic Group was created which removed some of the unnecessary detail from the economic

statistics. Four Aggregate Economic Groups were created: Active, Inactive, Unemployed and

Student. This was repeated for all GOR and they were all pasted into a single data file. Microsoft

Excel was used as formulae could be created to split and concatenate values quickly.

3. All un-used attributes were removed from this file, e.g., specific Ethnic Group types (Irish) so

that they would be aggregated into their aggregate group (White).

4. All redundant rows were removed. For example, the data set initially included a summary of all

records which aggregated all of the Male and Female statistics.

5. A Microsoft Access database was created to aggregate all of the statistics. It used GROUP BY

Sex, Age, Ethnic Group and Economic Group and then a SUM function on each of the values.

The original Census data contains 1,479 records for all 148 CLA. This means that we had a potential

1,479 x 148 = 218,892 values which was reduced to 11,920 (5%) through the cleansing phase.
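Step 5's aggregation was performed in Microsoft Access; the same GROUP BY/SUM query can be sketched in SQL, here using SQLite with illustrative rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE census
                (sex TEXT, age TEXT, ethnic_group TEXT,
                 economic_group TEXT, value INTEGER)""")
rows = [
    ("M", "16-24", "White", "Active", 10),
    ("M", "16-24", "White", "Active", 15),   # duplicate group: values are summed
    ("F", "16-24", "White", "Student", 7),
]
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?)", rows)

aggregated = conn.execute(
    """SELECT sex, age, ethnic_group, economic_group, SUM(value)
       FROM census
       GROUP BY sex, age, ethnic_group, economic_group
       ORDER BY sex""").fetchall()
```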

4.2.1 Logical Database Design

Logical database design shows how each entity can be arranged without considering the physical

limitations of the DBMS (Beynon-Davies, 2004). Logical design will not be the same as physical

design as it does not take into account volume and transaction analysis used to optimise relations.

Bracketing notation has been used to describe the relations with primary keys underlined and foreign

keys tagged with an asterisk*.

CENSUS (Statistic Code, CLA Name, GOR Name, Sex, Age, Ethnic Group, Economic Group, Value)

NACS (PCT Code, PCT Name, PCT Post Code, SHA Code, SHA Name, SHA Post Code)

QOF (GP Code, GP List Size, Diabetes Cases, Diabetes Prevalence)

TRUD (GP Code*, PCT Code*)

Normalisation is performed to remove data redundancy and anomalies that reduce the integrity of data

(Beynon-Davies, 2004). Normalisation involves transforming data from an un-normalised form into

third normal form (3NF) or Boyce-Codd normal form (BCNF) by identifying the determinacy

between attributes. Functional determinacy means that for every value of A there is a single value for

B. For example, PCT Code functionally determines PCT Name; however, a PCT Name may not be

unique and so does not functionally determine PCT Code. The CENSUS relation contains a composite

key of Statistic Code and CLA Name as both attributes are required to functionally determine the other

attributes. Anomalies will arise if we update the relation and so conversion is required to BCNF.

CENSUS_CLA (Statistic Code, CLA Name*, Sex, Age, Ethnic Group, Economic Group, Value)

CENSUS_GOR (CLA Name, GOR Name)
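The functional-determinacy test used above (PCT Code determines PCT Name, but not vice versa) can be checked mechanically. A sketch with hypothetical PCT rows:

```python
def functionally_determines(records, a, b):
    """True if attribute a functionally determines attribute b: every value
    of a is associated with exactly one value of b across all records."""
    seen = {}
    for record in records:
        value_a, value_b = record[a], record[b]
        if value_a in seen and seen[value_a] != value_b:
            return False
        seen[value_a] = value_b
    return True

# Hypothetical PCT rows: codes are unique, but two PCTs share a name.
pcts = [
    {"PCT Code": "5N1", "PCT Name": "Airedale"},
    {"PCT Code": "5N2", "PCT Name": "Airedale"},
]
```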

The NACS relation is not in 3NF as SHA Name and SHA Post Code are only transitively

dependent on PCT Code, via SHA Code. To transform it into 3NF we separate the PCT and SHA attributes.

NACS_PCT (PCT Code, PCT Name, PCT Post Code, SHA Code*)

NACS_SHA (SHA Code, SHA Name, SHA Post Code)

Through the process of normalisation we have created additional relations not initially described in

the data sets. To join the data we require a new AGGREGATE_NATIONAL entity which lists the

high level names that are common to both data sets, and an AGGREGATE_LOCAL entity which lists

the medium level names. The final logical database design is:

CENSUS_CLA (Statistic Code, CLA Name*, Sex, Age, Ethnic Group, Economic Group, Value,

Aggregate Local Region*)

CENSUS_GOR (CLA Name, GOR Name, Aggregate National Region*)

NACS_PCT (PCT Code, PCT Name, PCT Post Code, SHA Code*, Aggregate Local Region*)

NACS_SHA (SHA Code, SHA Name, SHA Post Code, Aggregate National Region*)

QOF (GP Code, GP List Size, Diabetes Cases, Diabetes Prevalence)

TRUD (GP Code*, PCT Code*)

AGGREGATE_NATIONAL (Aggregate National Region*)

AGGREGATE_LOCAL (Aggregate Local Region*)

Note that to link the sets to the AGGREGATE relations we have added foreign keys to each relation

which references either a local or national region.
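The final design above could be declared as a physical schema along the following lines (SQLite syntax; column names and types are indicative only, and the CENSUS and AGGREGATE relations are omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NACS_SHA (
    sha_code       TEXT PRIMARY KEY,
    sha_name       TEXT,
    sha_post_code  TEXT,
    aggregate_national_region TEXT
);
CREATE TABLE NACS_PCT (
    pct_code       TEXT PRIMARY KEY,
    pct_name       TEXT,
    pct_post_code  TEXT,
    sha_code       TEXT REFERENCES NACS_SHA(sha_code),
    aggregate_local_region TEXT
);
CREATE TABLE QOF (
    gp_code             TEXT PRIMARY KEY,
    gp_list_size        INTEGER,
    diabetes_cases      INTEGER,
    diabetes_prevalence REAL
);
CREATE TABLE TRUD (
    gp_code   TEXT REFERENCES QOF(gp_code),
    pct_code  TEXT REFERENCES NACS_PCT(pct_code),
    PRIMARY KEY (gp_code, pct_code)
);
""")
```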

4.2.2 Physical Database Design

The physical database design involves taking the results of the logical design and fine-tuning them

against transaction, performance and storage requirements (Beynon-Davies, 2004). This requires us to

change the design to suit the needs of the system, e.g., de-normalising relations to improve

performance. The physical design process consists of volume and transaction analysis and defining a

physical schema. This may seem like a step backwards, but it is useful in ensuring the database is

configured as optimally as possible which will aid the speed of querying and retrieving results.

Page 27: HealthWatch: A Management Tool Combining Clinical and Population Data Sets

20

Volume Analysis

Volume analysis is used to establish the average and maximum number of instances per entity and is

useful for deciding storage requirements and for transaction analysis. There were 12,217 GPs listed in the TRUD data set. Of the 8,372 GPs listed in the QOF data set, 6 do not appear in the TRUD data set and so have been omitted, leaving a final total of 8,366 records. The AGGREGATE_LOCAL relation has

been excluded from the analysis as it has been deemed out of the scope of this project.

Using the figures from the logical database design, we are able to determine the relation sizes (see

Table 2, below). Column size was calculated by estimating the field sizes of each record based on

documentation found at the understanding phase. For example, CENSUS_CLA has eight fields with

indicative character sizes shown in parentheses: Statistic Code (15), CLA Name (100), Sex (1), Age

(7), Ethnic Group (1), Economic Group (1), Value (8) and Aggregate Local Region (100).

Relation             Rows (Exact)   Column Size (Chars.)   Relation Size

CENSUS_CLA           11,920         233                    2,777,360

CENSUS_GOR           149            300                    44,700

NACS_PCT             157            219                    34,383

NACS_SHA             10             211                    2,110

QOF                  8,372          21                     175,812

TRUD                 8,366          16                     133,856

AGGREGATE_NATIONAL   8              100                    800

Total                                                      3,169,021

Table 2 Volume Analysis Summary

Assuming each character is a byte we can estimate the size of the database as being around 3MB.
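The figures in Table 2 can be reproduced with a short sketch (Python is used here purely for illustration; the row counts and column sizes are those estimated above):

```python
# Rows and estimated column sizes (in characters) per relation, from Table 2.
relations = {
    "CENSUS_CLA":         (11_920, 233),
    "CENSUS_GOR":         (149,    300),
    "NACS_PCT":           (157,    219),
    "NACS_SHA":           (10,     211),
    "QOF":                (8_372,  21),
    "TRUD":               (8_366,  16),
    "AGGREGATE_NATIONAL": (8,      100),
}

# Relation size = rows x column size; the sum gives the whole-database estimate.
sizes = {name: rows * cols for name, (rows, cols) in relations.items()}
total = sum(sizes.values())

print(sizes["CENSUS_CLA"])  # 2777360
print(total)                # 3169021 characters, ~3MB at one byte per character
```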

Transaction Analysis

Transaction analysis involves determining the expected frequency and number of instances returned

by a search query, e.g., viewing all population and diabetes prevalence data for an Aggregate National

Region (ANR) in order to assess memory and storage requirements:

1. Select an Aggregate National Region from the AGGREGATE_NATIONAL relation.

2. Select the CLA Names from CENSUS_GOR that match the Aggregate National Region.

3. Extract the statistics from CENSUS_CLA that match the CLA Names.

4. Select the SHA Codes from NACS_SHA that match the Aggregate National Region.


5. Select the PCT Codes from NACS_PCT that match the SHA Codes.

6. Select the GP Codes from TRUD that match the PCT Codes.

7. Extract the diabetes data from QOF that match the GP Codes.
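The seven steps above can be sketched with in-memory structures (a Python sketch; the one-row sample relations are hypothetical stand-ins for the real data sets):

```python
# Hypothetical miniature versions of the relations (illustrative only).
AGGREGATE_NATIONAL = ["North"]
CENSUS_GOR = [("Leeds", "Yorkshire", "North")]    # (CLA Name, GOR Name, ANR)
CENSUS_CLA = [("S1", "Leeds", "M", "16-24", 42)]  # (Statistic Code, CLA Name, Sex, Age, Value)
NACS_SHA = [("Q12", "North")]                     # (SHA Code, ANR)
NACS_PCT = [("5N1", "Q12")]                       # (PCT Code, SHA Code)
TRUD = [("B86001", "5N1")]                        # (GP Code, PCT Code)
QOF = [("B86001", 10000, 500, 5.0)]               # (GP Code, List Size, Cases, Prevalence)

anr = AGGREGATE_NATIONAL[0]                                # step 1: pick an ANR
clas = {cla for cla, _, a in CENSUS_GOR if a == anr}       # step 2: matching CLA Names
stats = [r for r in CENSUS_CLA if r[1] in clas]            # step 3: census statistics
shas = {code for code, a in NACS_SHA if a == anr}          # step 4: matching SHA Codes
pcts = {pct for pct, sha in NACS_PCT if sha in shas}       # step 5: matching PCT Codes
gps = {gp for gp, pct in TRUD if pct in pcts}              # step 6: matching GP Codes
diabetes = [r for r in QOF if r[0] in gps]                 # step 7: diabetes data
```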

For each step (see Table 3, below) we identified an average number of instances returned and

estimated that this transaction would be run 10 times per hour (period).

Step   Type of Access   No. of Instances Returned   Per Period

1      Read             1                           10

2      Read             15                          10

3      Read             750                         10

4      Read             2                           10

5      Read             15                          10

6      Read             500                         10

7      Read             500                         10

Table 3 Transaction Analysis for viewing all Data for an Aggregate National Region (ANR)

By multiplying the number of instances returned against the column size of each relation we can see

how much data will be stored when performing the query (see Table 4, below): (1 x 100) + (15 x 300)

+ (750 x 233) + (2 x 211) + (15 x 219) + (500 x 16) + (500 x 21) = 201,557 bytes.

Step   Relation             No. of Instances Returned   Column Size

1      AGGREGATE_NATIONAL   1                           100

2      CENSUS_GOR           15                          300

3      CENSUS_CLA           750                         233

4      NACS_SHA             2                           211

5      NACS_PCT             15                          219

6      TRUD                 500                         16

7      QOF                  500                         21

Table 4 Transaction Analysis Results
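The byte count for this transaction can be checked directly (a sketch; the per-step instance counts and column sizes are those listed in Table 4):

```python
# (instances returned, column size) per step, from Table 4.
steps = [(1, 100), (15, 300), (750, 233), (2, 211), (15, 219), (500, 16), (500, 21)]

# Data held in memory for one run of the query.
bytes_per_query = sum(n * size for n, size in steps)
print(bytes_per_query)  # 201557
```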

If the DBMS performed these restrictions naively, it would need to create natural joins between CENSUS_GOR and CENSUS_CLA, between NACS_SHA and NACS_PCT, and between TRUD and QOF. This would mean that (149 x 11,920) + (157 x 10 x


8,372 x 8,366) = 109,964,814,720 rows would be placed in memory. Although we recognise this is

not how a modern DBMS works, it will help in generating a baseline figure for comparison. To

reduce the number of transactions required, de-normalisation can be used to hard-code relationships

into the entities. For example, rather than a foreign key from NACS_PCT to NACS_SHA we add the

SHA Name and SHA Post Code attributes to NACS_PCT and drop NACS_SHA. In general, de-

normalisation involves identifying a foreign key entry in a primary relation and then transferring the

attributes from the secondary relation into the primary relation and removing the duplicated foreign

key. By completing this we now have a single entity for clinical data and a single entity for population

data which are linked together by two aggregate entities (see Figure 7, below).

Figure 7 Final Physical Database Design

To test the hypothesis that de-normalisation will reduce the number of calculations we completed the

volume and transaction analysis again. The following changes are apparent in the entities:

• DATA_CENSUS contains 11,920 rows of size 433 (233 + 300 - 100 [duplication of CLA

Name]) making the relation size 5,161,360 bytes.

• DATA_CLINICAL contains 8,366 rows of size 448 (219 + 211 + 21 + 16 – 19 [duplication

of GP Code, SHA Code and PCT Code]) making the relation size 3,747,968 bytes.

Comparator             Normalised        Un-Normalised

Volume Analysis        3,169,021         8,909,428

Transaction Analysis   201,557           548,850

Natural Joins          109,964,814,720   20,286

Table 5 Comparisons between Normalised and De-Normalised Entities

This has changed the transaction steps to:

1. Select an Aggregate National Region from the AGGREGATE_NATIONAL relation.


2. Extract the statistics from DATA_CENSUS that match the Aggregate National Region.

3. Extract the diabetes data from DATA_CLINICAL that match the Aggregate National Region.

The results are presented in Table 5 (above), which shows that the physical size of the database has increased to approximately 280% of its normalised size, from 3,169,021 bytes to 8,909,428 bytes. For transaction analysis,

we will be returning an average of 750 rows from DATA_CENSUS and 500 rows from

DATA_CLINICAL (based on the averages for Steps 3 and 6 from Table 3).

By multiplying the number of instances returned against the column size from the new volume

analysis we can see how much data will be stored when performing the query: (1 x 100) + (750 x 433)

+ (500 x 448) = 548,850 bytes, roughly 2.7 times the normalised figure. However, using the calculations for

performing restrictions, as used above, the DBMS would not need to create any natural joins and so

only 20,286 rows would be placed in memory. This shows that an increase in physical size can

improve performance by reducing the number of calculations required to process a query.
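These comparisons can be verified with the same kind of sketch (Python; the counts are those quoted above, and the join figure deliberately uses the naive cross-product baseline described earlier):

```python
# De-normalised relation sizes (rows x estimated column size, in bytes).
data_census = 11_920 * 433     # CENSUS_CLA merged with CENSUS_GOR
data_clinical = 8_366 * 448    # QOF, TRUD, NACS_PCT and NACS_SHA merged

# Transaction analysis for one ANR query against the de-normalised schema.
txn_bytes = 1 * 100 + 750 * 433 + 500 * 448

# Naive join baseline (normalised) vs rows scanned when de-normalised.
naive_join_rows = 149 * 11_920 + 157 * 10 * 8_372 * 8_366
denorm_rows = 11_920 + 8_366
```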

Attribute                                            Original Source   Final Relation

Statistic Code, CLA Name, GOR Name, Sex, Age,        Census            DATA_CENSUS
Ethnic Group, Economic Group, Value

GP Code                                              QOF, TRUD         DATA_CLINICAL

GP List Size, Diabetes Cases, Diabetes Prevalence    QOF               DATA_CLINICAL

PCT Code                                             NACS, TRUD        DATA_CLINICAL

PCT Name, PCT Post Code, SHA Code, SHA Name,         NACS              DATA_CLINICAL
SHA Post Code

Aggregate Local Region                               N/A               DATA_CENSUS, DATA_CLINICAL,
                                                                       AGGREGATE_LOCAL

Aggregate National Region                            N/A               DATA_CENSUS, DATA_CLINICAL,
                                                                       AGGREGATE_NATIONAL

Table 6 Attribute Mappings from Original Source to Final Relation

Clarification is presented in Table 6 (above) which shows the attribute mappings from original source

to final relation. The final database schema is:

DATA_CENSUS (Statistic Code, CLA Name, GOR Name, Aggregate National Region*, Sex,

Age, Ethnic Group, Economic Group, Value, Aggregate Local Region*)


DATA_CLINICAL (GP Code, GP List Size, Diabetes Cases, Diabetes Prevalence, PCT Code,

PCT Name, PCT Post Code, SHA Code, SHA Name, SHA Post Code, Aggregate National

Region*, Aggregate Local Region*)

AGGREGATE_NATIONAL (Aggregate National Region)

AGGREGATE_LOCAL (Aggregate Local Region)

To conclude, this section presented the steps taken from understanding the data sets to preparing them for the project, and described the two techniques, logical and physical database design, used to create the final entities. Volume and transaction analysis was performed, and the results for a normalised and a de-normalised schema were compared to ensure an optimal database structure. The granularity of the data was discussed, and a technique for linking the data sets was described as a suitable extension. The final database design, a culmination of the logical and physical design, is shown in Figure 7.


5. Presentation Phase

In this section we discuss the role of the model-view-controller (MVC) architecture in separating the

application into three components: models that comprise functionality and access to data; views that

present the user interface; and controllers that manage updates to views (Bennett et al. 2006). This

architecture aids code maintainability by enabling multiple views to be created using the same

models. We also describe the dynamic charting tool, amCharts, and how it was integrated into the

project. When problems arose, consultation took place between the developer and experts who had

knowledge of amCharts, in particular from the application’s user forum.

Figure 8 Proposed System Architecture Using the MVC Architecture

An overview of the system is presented in Figure 8 (above) which illustrates the application of the

MVC architecture. We haven’t included server components as it is possible to store the database,

models, controller and views on separate servers (if required) demonstrating the scalability of the

architecture. To access data from the database, requests are made via the view (user interface) and processed by the controller, which in turn interacts with the models containing the logic to extract data from the database. The results are then passed back to the controller, which updates the view in

response to the request. For example, if a user requests a list of GPs, the system will print a list of GPs extracted from the database back to the console.
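This request flow can be sketched as a minimal MVC loop (Python, purely illustrative; the class and method names here are invented for the sketch, not taken from the actual application):

```python
class GpModel:
    """Model: encapsulates functionality and access to the data layer."""
    def __init__(self, rows):
        self._rows = rows  # stands in for database access

    def get_all_gps(self):
        return list(self._rows)


class ConsoleView:
    """View: presents results to the user (here, as plain console text)."""
    def render(self, gps):
        return "\n".join(gps)


class GpController:
    """Controller: mediates between the view's request and the model."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def list_gps(self):
        # Pull data through the model and hand it to the view for display.
        return self.view.render(self.model.get_all_gps())


controller = GpController(GpModel(["B86001", "B86002"]), ConsoleView())
print(controller.list_gps())
```

Because the controller only sees the model's interface, a second view (a chart, a table, an export file) can reuse the same model unchanged, which is the maintainability argument made above.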

5.1 Modelling

Modelling describes the low-level details of the application related to models and controllers. Models

are the building blocks of the application and provide an interface between the data and application

layers. The way models have been used in this application promotes code re-use as the four main

functions (creating a pie chart, a bar chart, a data table and exporting data) all have similar components


which are described in this section. The application uses two models which represent Aggregate

National Region (ANR) and Aggregate Local Region (ALR) concepts. These concepts reference the

database, and link together the clinical and population data sets. Both models have similar methods

but at different levels of abstraction, with the difference being that the ALR concept accepts a

parameter so that results can be obtained for a single ANR.

AmCharts is a free, customisable Adobe Flash chart and graph creator available from

http://www.amcharts.com/, and supports importing data and settings from comma-separated values (CSV) or eXtensible Markup Language (XML) files. It has been chosen as it can dynamically render a

chart based on files generated from live data. Code can be re-used as the same chart component can be

used with a number of different data files.

To display a chart the following JavaScript is placed inside the <body> element of a page:

1. <script type="text/javascript" src="swfobject.js"></script>

2. <div id="barchart" class="chart"><strong>You need to upgrade your Flash Player.</strong></div>

3. <script type="text/javascript">
   // <![CDATA[
   var so = new SWFObject("amcolumn.swf", "amcolumn", "520", "580", "8", "#ffffff");
   so.addVariable("settings_file", escape("settings.xml"));
   so.addVariable("data_file", escape("data.xml"));
   so.write("barchart");
   // ]]>
   </script>

From the code there are a number of components that are used to create the chart:

1. The amCharts JavaScript library is imported which contains details on how to create the chart

along with any other information required by the tool such as directory paths.

2. A container <div> is placed where the chart is to appear; note that the <div> requires an id attribute which is used later in the process. If the user does not have JavaScript enabled or does

not have the Adobe Flash Player, the default text will be shown. This enables the tool to “degrade

gracefully” in older generation web browsers or those with JavaScript disabled.

3. The remaining JavaScript creates the SWFObject which is a column chart in this instance with a

width of 520 pixels, a height of 580 pixels, the minimum version of the Adobe Flash Player is 8

and the background colour is white (hexadecimal value #ffffff). Variables are then added to


the object, which describe the locations of the settings and data files. Finally, the object is written to the page by replacing the <div id="barchart"> element with the rendered chart.

These components are required for the simplest implementation, but there are additional settings

available for advanced use such as what preload text should be displayed and additional settings for a

more specialised chart. As most of the settings are set explicitly in the code, this enables multiple

charts to be displayed on a single page with different variables. A sample data file is provided which

can be generated using a function to extract data from the database and translate it into XML:

1. <?xml version="1.0" encoding="UTF-8"?>

2. <chart>

3.   <series>
       <value xid="0">East Midlands</value>
       <value xid="1">East</value>
     </series>

4.   <graphs>

5.     <graph gid="Males" title="Males">
         <value xid="0" start="1499115">49.6</value>
         <value xid="1" start="1925336">49.6</value>
       </graph>
     </graphs>

   </chart>

From the code above there are a number of components that are used to create the chart:

1. Standard XML header information describing the XML version and character encoding.

2. Root node which can contain multiple <graph> nodes but only a single <series> node.

3. Node representing the x-axis labels and identifiers that will be referenced by each graph.

4. Parent node of all of the graph nodes, as multiple graphs can be shown on the same axis.

5. Each graph is represented by a title and a unique identifier, gid; each of its values carries an xid which corresponds to the series identifier. In the example, start represents a total and the value inside represents a percentage. This was a "hack" to get the graph to display values and percentages concurrently.
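A data file of this shape could be generated from query results with a function along these lines (a Python sketch; the real application builds the XML in PHP, and the field names here are illustrative):

```python
from xml.sax.saxutils import escape  # escape XML-special characters in names

def build_chart_xml(labels, graphs):
    """labels: x-axis names; graphs: {title: [(total, percentage), ...]}."""
    out = ['<?xml version="1.0" encoding="UTF-8"?>', "<chart>", "<series>"]
    out += [f'<value xid="{i}">{escape(name)}</value>' for i, name in enumerate(labels)]
    out += ["</series>", "<graphs>"]
    for title, values in graphs.items():
        out.append(f'<graph gid="{escape(title)}" title="{escape(title)}">')
        # As in the sample: start carries the total, the node text the percentage.
        out += [f'<value xid="{i}" start="{total}">{pct}</value>'
                for i, (total, pct) in enumerate(values)]
        out.append("</graph>")
    out += ["</graphs>", "</chart>"]
    return "\n".join(out)

xml = build_chart_xml(["East Midlands", "East"],
                      {"Males": [(1499115, 49.6), (1925336, 49.6)]})
```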

The settings file has a similar structure to the data file, but holds parameters for controlling every part

of the look and feel of the chart, from borders to bar colours, labels to legends. A single settings file

can contain all of the information to control multiple charts by referencing each of them by their


unique identifier. The flexibility of amCharts, in terms of additional settings and ease of use, makes it one of the most successful charting tools, and it is used by the likes of Microsoft, Sony and Motorola.

Whether creating the data file for a pie or bar chart, a data table or a file export, similar functions can be used in all cases. In pseudo-code the functions contain the following components:

1. Extract data from database which may include a parameter if we require results for a specific

ANR which would be represented as a WHERE command in SQL.

2. Process each row depending on what output we want. For example, this code creates a data value for a bar chart:

   <value xid="'.$value->aggregateNationalRegion.'" start="'.number_format(($value->diabetesCases/$value->gpListSize)*100,2).'">'.number_format(($value->diabetesCases/$value->gpListSize)*1000,2).'</value>

   where $value is an object holding the row's values and number_format formats a number to 2 decimal places.

To allow the user to select which charts they want to see, e.g., data related to sex or age, switch and case statements have been used. These work by passing a parameter into the switch statement and branching based on the parameter value.

switch ($animal) {
    case "Dog":
        print "You have selected a dog.";
        break;
    case "Cat":
        print "You have selected a cat.";
        break;
    default:
        print "You have not selected anything.";
        break;
}

Through the use of switch and case branching statements we can start to build a framework for a

dynamic web application that can be customised to the user. The use of amCharts enables interactive

graphs and charts to be created which can be generated from live data and with a number of settings.

Controller

The controller is responsible for loading models, describing the workflow of the application and is

used to pass data to the view. The application controller has three components for displaying the home


page, a drilled down page of a specific ANR and for creating the export view. The main body of each

controller loads the models that are required, and contains functions for referencing those models and for displaying views. A typical call inside a controller looks like $data['allAnrNames'] = $this->anr->getAllAnrNames(); which saves all of the ANR names from

the database into the $data array. The $this->anr variable denotes that the application uses the

loaded ANR model and calls the method getAllAnrNames() for that model. Each controller has a

similar final method: $this->load->view('application/index', $data); This shows

that we are loading the index view from the application directory and are passing the contents

of the $data array to the view. Another useful feature is that the controller can handle post and get variables from forms or web links. The following code could be used to extract which day of the

week was selected from a menu whose name is weekDay and store it in the $data array:

$data['weekDay'] = $this->input->post('weekDay');

The advantage of a controller is that it gives full control over the variables and data flows between models and views, which can vary from displaying a static page to one with lots of dynamic content.

5.2 Construction

Construction describes the graphical outputs of the application related to views. The reason for

choosing the MVC architecture is that multiple views can be created using the same models, e.g.,

presenting data as a table or chart. This section details each view with the aid of a screenshot.

Views

Typically, a view presents data from models via a graphical user interface (GUI) which makes it

suitable for interaction. We have created three views in this project: two present a GUI and one is

used to create a dynamic export file:

• The main view is the home page of the application which presents data from each ANR. This

is the highest level of abstraction which can be drilled into by selecting an ANR.

• The detail view presents data for a single ALR which has been selected by the user and been

passed from the main view. The results differ from the main view as they are at a lower level

of abstraction but have similar models and controller.

• The export view does not display any results graphically, but translates the data into a

downloadable form. This is achieved by modifying the content-type of the output from

text/html to text/plain which prompts the web browser to display the file as simple

text. This then allows the user to save the results in either a CSV or WEKA ARFF file format.

At the top of each of the views is a paragraph of text which explains where the data has been extracted

from and the potential limitations of the methods used to combine the data sets. This is so users can

trace back to where the original data was extracted from (if required).


The screenshot in Figure 9 (below) shows the Clinical Data section of the main view which includes a

pie chart displaying the total number of diabetes cases across England and a drop-down box enabling

the user to drill-down into an ANR. This section was included as the first iteration of implementation

as it does not access any population data and was used to test how easy amCharts was to use. The data

for the pie chart has been extracted from the DATA_CLINICAL relation and has been aggregated by

ANR and is accessed from the ANR model using the following SQL.

SELECT aggregateNationalRegion, SUM(diabetesCases) AS diabetesCases

FROM DATA_CLINICAL

GROUP BY aggregateNationalRegion

Figure 9 Screenshot of Clinical Data Section of the Main View

The results from the clinical data alone present a one-dimensional view, which shows that the South has the highest proportion (24%) of diabetes cases and the North East has the lowest proportion (5%).

However, this does not take into account population density or any other demographic factor. Using

this view could lead to incorrect inference as we are not able to comment on whether the South has a

higher population which has led to the number of cases being higher. For the second iteration we have

combined the clinical and population data to highlight and address this issue.

The screenshot in Figure 10 (below) shows the Clinical and Population Data section of the main view

which includes a combination bar and line graph and data table. The chart can be automatically

updated to show data for Sex, Age, Ethnicity or Economic Group by selecting the category from the

drop-down menu and is currently showing Ethnicity data.

The bar chart is 100% stacked which means each attribute represents a percentage against the total

value. For example, in London the highest proportion of people is White (72.63%), followed by Asian

(11.85%), Black or Black British (10.36%), Chinese or Other Ethnic Group (2.91%) and Mixed


(2.24%). The red line represents the number of diabetes cases compared to the total population size

(known as prevalence) and so we can now see that the West Midlands has the highest proportion of

cases (6%) and the South has one of the smallest (5.13%).

Figure 10 Screenshot of Clinical and Population Data Section of the Main View

The initial pie chart showed that the South had the highest proportion of diabetes cases, but now with

the addition of the population data we can see that it is in fact the lowest when population size is

considered. The screenshot (see Figure 10, above) demonstrates how data can be presented as a chart

and data table using the same model.

Finally, the Export Data section shows a convenient form (see Figure 11, below) that allows the user

to select which output format they require. Clicking the “Export” button presents the user with a

dialog box where they are able to download the data to their computer.

Figure 11 Screenshot of Export Data Section of the Main View

This example shows the output presented by selecting the WEKA ARFF option:

@relation anr

@attribute name string

@attribute females numeric


@attribute males numeric

@attribute aged16To24 numeric

@attribute aged25AndOver numeric

@attribute asian numeric

@attribute blackOrBlackBritish numeric

@attribute chineseOrOther numeric

@attribute mixed numeric

@attribute white numeric

@attribute active numeric

@attribute inactive numeric

@attribute unemployed numeric

@attribute student numeric

@attribute diabetes numeric

@data

'East Midlands', 1521595, 1499115, 451904, 2568806, 119949, 30819,

16296, 18720, 2834926, 1852548, 858737, 210758, 98667, 174117…
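Producing such a file amounts to emitting a header of @attribute declarations followed by comma-separated @data rows. A minimal sketch (Python; the real export is generated in PHP, and only three of the attributes from the example are shown):

```python
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type) pairs; rows: value lists matching them."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} {atype}" for name, atype in attributes]
    lines += ["", "@data"]
    for row in rows:
        # Quote string values; print numerics as-is.
        lines.append(", ".join(f"'{v}'" if isinstance(v, str) else str(v) for v in row))
    return "\n".join(lines)

arff = to_arff("anr",
               [("name", "string"), ("females", "numeric"), ("males", "numeric")],
               [["East Midlands", 1521595, 1499115]])
```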

The results so far have only been from the main view which is the highest level of abstraction. We

have also shown an example of the export view in the form of WEKA ARFF output which shows how

data can be re-formatted into a desired format. To make the management tool even more powerful, the

detail view enables the user to view the same graphs and tables but at a lower level of abstraction. We

had previously noted in this report that matching a County Local Authority (CLA) to a Primary Care

Trust (PCT) would be out of scope of this project as the variation in geographic regions would make

the task of matching areas very inaccurate.

To test the detail view we chose a local region, Yorkshire and The Humber, and attempted to match

the two entities to create ALRs. The pairings are presented in Appendix C and show an almost one-to-one correspondence between CLAs and PCTs in this region, with the only exception being that North

Yorkshire and York were separate in the population statistics. Yorkshire and The Humber was chosen

primarily because of local knowledge but also because of the small variance in regions between the

clinical and population data. Figure 12 (below) shows the total number of diabetes cases in Yorkshire

and The Humber with Leeds (13%) and York (13%) accounting for most of the cases. The lowest

numbers of cases were in North East Lincolnshire, North Lincolnshire and Calderdale which are not

labelled as they each account for less than 4%. The amCharts settings file enabled this feature to

prevent over-crowding of labels.

Figure 12 Diabetes Cases in Yorkshire and The Humber


As we have seen at the higher level, these results do not take into account any population data and so

an example is presented in Figure 13 (below) which shows the Economic Groups, e.g., Students,

Unemployed etc. of each region along with diabetes prevalence. The results almost show the inverse

of just the number of diabetes cases on their own, with Leeds (4.88%) and York (4.71%) being among

the lowest for diabetes prevalence. North East Lincolnshire (6.34%) and North Lincolnshire (6.38%)

are now among the highest in proportion to their size.

Figure 13 Economic Groupings and Diabetes Prevalence in Yorkshire and The Humber

To conclude, this section explained the elements of the MVC architecture in supporting the

application, described how amCharts was used to create graphs and charts, and showed some

screenshots of the final application. In the evaluation we will investigate the results further with a

domain expert to see if conclusions can be made about their validity. Even from the results we have

presented, it is easy to see that the management tool can be powerful in providing quantitative evidence.


6. Evaluation

In this section we present an evaluation of the project and, to complement the choice of adopting a

user-centred approach to requirements analysis and implementation, we aim to provide a balance

between technical and socio-technical perspectives. Evidently, there are many paradigms and

techniques that could be used to conduct the summative evaluation and so we have identified the

DECIDE framework (Preece et al. 2002) to provide a useful checklist to aid the evaluation design. To

conclude we look at the overall strengths and weaknesses of the evaluation and provide

recommendations for further work based on the responses from the evaluation and research.

6.1 Evaluation Paradigms and Techniques

The results of an evaluation are explicitly, or implicitly, influenced by the questions that are asked and

the perspectives of the evaluators. It is important to consider these perspectives, or paradigms, to

ensure appropriate techniques are selected to provide a balanced evaluation. This is described by

Phillips et al. (2000) as being an “eclectic-mixed methods-pragmatic” approach whose strength is in

acknowledging there are no “right” approaches to evaluation and that maintaining an open mind is

essential. This is reflected by the many different approaches to evaluation in the literature and we aim

to present two views and techniques that can be used for each. Preece et al. (2002) identify four

evaluation paradigms: “quick and dirty”, usability testing, field studies, and predictive evaluation:

• “Quick and dirty” evaluations are a common way of gaining informal feedback from users

which focus primarily on design such as layout and aesthetics and can be done at any stage of

the project lifecycle. Responses are highly qualitative and can be collected through verbal or

written notes or comments. The advantages of this approach are that results can be identified

quickly and enables interaction between users and developers.

• Usability testing involves observing users’ performance on a set of prepared tasks and can be

measured in terms of number of errors made and time to complete the tasks. It is a highly

quantitative approach and is conducted in a laboratory environment where every key stroke

and action is recorded. Although this approach has benefits, such as identifying mismatches

between designs and user perceptions, field studies and predictive evaluations have grown in

prominence since the introduction of usability testing in the 1980s (Preece et al. 2002).

• Field studies differ from usability tests as evaluation is performed in the users’ natural

environment rather than in a laboratory. This can help prevent what is known as the

Hawthorne Effect (Landsberger, 1958), where a short-term improvement in performance may be witnessed simply because users know they are being observed. The most common approach is where the developer

acts as an observational outsider recording how users interact with the system and noting any

comments or suggestions they have for improvements.


• Predictive evaluation is where expert evaluators apply their own knowledge, often with the

aid of heuristics, to predict usability problems. Evaluation of products using tried and tested

heuristics has become popular with guidelines such as those created by Nielsen (2005). In

general, users are not involved and data is presented as a list of problems from expert reviews.

A different view is presented by Dix et al. (2004) who identify two approaches to evaluation through

expert analysis or user participation. They then classify evaluation techniques into three categories:

analytical, experimental and query-based, and observational:

• Expert analysis is similar to predictive evaluation where an expert assesses the impact of

design on typical users using their own knowledge or heuristics. The main benefit of this is

that they are relatively cheap as they do not require user involvement but are weak in

assessing use of the system in a natural environment. Analytical techniques include cognitive

walkthrough (similar to usability testing) and heuristic evaluation.

• User participation is used to evaluate the system from the perspective of the user and is

similar to field studies presented by Preece et al. (2002). Experimental techniques include

testing hypotheses and other statistical measures such as error rates, processing times and

response times which is a highly quantitative and scientific approach. Query-based techniques

include interviews and questionnaires which provide softer data but enable the evaluator to

question the users’ actions. Observational techniques include “think aloud” and co-operative

evaluation where users are asked to talk through a process and justify decisions made.

We have only selected a handful of techniques, but there are evidently many more that could be used.

Qualitative approaches have been deliberately favoured as we believe systems should be treated as

organically as possible and the aim of the project was always to focus on users rather than processes.

6.2 DECIDE Evaluation Framework

Having identified paradigms and techniques we can begin to structure the evaluation. Preece et al.

(2002) describe an evaluation framework to help ensure that clear goals and appropriate questions are

defined. This is split into six elements and is known as DECIDE:

• Determine the high-level goal(s) that the evaluation is set to address. Different goals include

helping clarify user needs, fine-tuning an interface against those needs or investigating the

degree to which technology influences working practices. These are important as they help

influence which paradigms and techniques are most suitable for the evaluation.

• Explore the questions that need to be answered to achieve the goal(s).

• Choose the evaluation paradigm and the techniques to answer the questions. Practical and

ethical issues must be considered and trade-offs made as the most appropriate techniques may

prove too costly or particular laboratory equipment may be unavailable. A combination of

techniques can be used to obtain responses from different user perspectives.

• Identify practical issues such as involving appropriate users, facilities and equipment,

schedules and evaluators’ expertise. Evaluators must be representative of the user population

of the system (or as close as possible) and may have particular domain expertise. Particularly

in field studies the length of time taken should be considered as ten minutes may be too short

to gain adequate answers. Users must be reassured so that they feel at ease, enabling them to

perform normally, and it may also be good practice to provide an introduction and reasoning

behind the system so that they fully understand why they have been chosen.

• Decide how to deal with ethical issues such as respecting users’ privacy and ensuring

confidential information is kept secure. Preece et al. (2002) describe the ethical issues

concerned with evaluation and these can be summarised as: tell participants the goals of the

study and what is expected of them such as the time it should take and how their responses

will be analysed; make sure they know that they are free to stop the evaluation at any time;

and avoid including quotations that would personally identify any individual.

• Evaluate the results, which includes deciding how data will be analysed and presented. Issues

such as reliability, validity, biases, scope and ecological validity should be considered and

any assumptions should be noted.

In the remaining sections we apply the DECIDE framework to the project and present the final

conclusions. In response to the results of the evaluation, suggestions for further work are described

which extend the comments made in the evaluate section of the framework.

6.2.1 Determine

The goal is to assess how the solution could influence the working practice of the persona and

whether it satisfies the objectives set during the requirements analysis:

“The system shall enable me to compare population and clinical data.”

“The system shall reduce the time needed to manually find and extract population and clinical data.”

“The system shall enable me to filter results by geographic region and demographics.”

6.2.2 Explore

From the objectives we can see that there are a number of questions that can be asked about the final

solution which will help assess the impact on the working practice of the persona. We have split these

objectives into two main questions which focus on socio-technical and technical elements:

1. Does the solution enable clinical and population data to be compared?

a. Does it enhance the speed of making inferences about the population?

i. Is the system easy to navigate?

ii. Is the terminology confusing because it is poorly explained or inconsistent?

iii. Are response times for most operations adequate?

iv. Are characters on screen legible?

v. Is the amount of help given for performing tasks adequate?

b. Does it enhance the efficiency of making inferences about the population?

i. Are the results valid?

ii. Are the results easy to interpret?

iii. Can valid inferences be made from the results?

2. Does the solution enable users to filter results by geographic region and demographics?

a. Are online instructions visible and understandable?

b. Are colours used appropriate?

c. Are charts easy to interpret?

We consider these to be the main questions that we hope to address in this evaluation in order to

achieve the goal. They focus on the speed of the system, whether the results are valid and achieving

the functional requirement of being able to filter by geographic region and demographics.

6.2.3 Choose

From the questions that have been identified the approach suggested by Dix et al. (2004) of splitting

the evaluation into expert analysis and user participation is the most appropriate as it provides the

greatest range of perspectives. Expert analysis will focus on answering 1b and user participation can

be used to provide general feedback related to responsiveness and functionality. Table 7 (below)

highlights these approaches in relation to the evaluation paradigms defined by Preece et al. (2002).

Paradigm                Expert Analysis             User Participation

“Quick and Dirty”       N/A                         Survey

Predictive Evaluation   Semi-Structured Interview   N/A

Table 7 Evaluation Paradigms and Techniques

A survey will be created, focussing on 1a and 2, including a set of bipolar semantically anchored

items (Coleman & Williges, 1985) which the evaluators will rate (e.g. simple versus complicated,

hostile versus friendly, concise versus redundant). Some questions were extracted from the

Questionnaire for User Interaction Satisfaction (QUIS) which was developed by Shneiderman and

refined by Chin et al. (1988) and user interface design heuristics suggested by Nielsen (2005). Users

will complete a set of tasks using the system such as “Which Aggregate National Region has the

highest number of diabetes cases?” and will be encouraged to comment on their answers and on

potential areas of improvement.

We deliberately focussed the survey on issues of design and usability as we anticipated not receiving

many responses. Following Nielsen’s (Dix et al. 2004) experience it is possible that between three and

five evaluators will be sufficient to identify around 75% of usability problems. Subsequent evaluators generally repeat problems that have already been identified, which means that little value is added by increasing the number of responses. To complement the survey, expert analysis will

include inviting a healthcare domain expert to evaluate the software through a semi-structured

interview to gauge its validity, and to gain feedback on improvements that may not have been

identified by users who are not domain experts.

6.2.4 Identify

The practical issues that have been considered and answered for this evaluation are:

1. How is the survey going to be distributed and how are responses going to be received?

The survey will be distributed to a group of friends through the online social networking site Facebook (http://www.facebook.com/). A group will be created explaining the aim of the evaluation, with a downloadable survey which can be filled out and sent back via e-mail. The

survey will contain a consent form which will be used to ensure the evaluation conforms to the

identified ethical issues.

2. How is an adequate cross-section of evaluators going to be achieved?

This area is always going to be problematic and it is safe to assume that this will never be fully

achieved. Users will be invited to supply a few details about themselves such as age, sex and

technical experience which may be used in the results. As the user evaluation is quite generic and

focused on design and usability it may be that these differences do not affect responses.

3. Will the expert have time and where will the interview take place?

We have personal contact with the expert and so it is hoped they will have time to complete the

interview. If not, two further experts have been identified as alternatives. The location of

the interview will be subject to the expert and we will need to remain as flexible as possible.

Any other issues will be dealt with as they arise.

6.2.5 Decide

As we intend to involve multiple users it is essential they are briefed on the goal of the evaluation.

Each evaluator will only be required to spend a maximum of half an hour on the tasks set, with the

exception of the expert whose evaluation may take longer, and they should be assured that they are

under no obligation to take part in the study. To comply with these issues we will require each

participant to complete an informed consent form (see Appendix D) before completing any elements

of the evaluation which will stress that none of their details will be made available to third parties and

all quotations will remain anonymous. This will be achieved by allocating each user an identification

number rather than using their name in all correspondence.

6.2.6 Evaluate

The results of the evaluation are presented in two sections: from the user survey and then from the

expert analysis. The user survey (see Appendix E) was designed to focus on issues of design and

usability and was based around a number of questions that could be answered using the software.

Responses were received from four evaluators; across the five questions, a total of four mistakes were made.

1. Which Aggregate National Region (ANR) has the highest percentage of diabetes cases?

2. Which ANR has the highest disease prevalence?

3. Which ANR has the largest non-White population?

4. Which Aggregate Local Region (ALR) in Yorkshire and The Humber has the highest number of

diabetes cases?

5. How many students are there in Leeds?

Question 1 was answered correctly by all evaluators as it only required them to use the Clinical Data

chart and identify that the South had the highest percentage (24%). This question was used to ease the

user into the software and to enable them to experience how the statistics were being presented.

Question 2 was answered incorrectly by one evaluator. To get the right answer it required evaluators

to use the diabetes prevalence line from the Clinical and Population Data chart and to identify that

the West Midlands had the highest prevalence (6%). The evaluator who gave the incorrect answer

appeared to have quoted the figure representing the number of Males in the region rather than

prevalence. This potentially means that the combination of bar and line graph could be confusing.
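The prevalence figures quoted here are simply case counts expressed as a share of the regional population. As a sketch (the case counts and populations below are invented placeholders, chosen only to reproduce a 6% figure; they are not the project's actual QOF or Census values):

```python
def prevalence(cases, population):
    """Disease prevalence as a percentage of the regional population."""
    return 100.0 * cases / population

# Hypothetical figures for illustration only.
regions = {
    "West Midlands": (318_000, 5_300_000),
    "South":         (402_000, 8_000_000),
}

for name, (cases, population) in regions.items():
    print(f"{name}: {prevalence(cases, population):.1f}%")
```

Reading the prevalence line rather than the raw Male count would have given the evaluator the ratio above instead of an absolute figure.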

Question 3 was answered correctly by all evaluators and required them to change the category of the

Clinical and Population Data chart to Ethnicity and then read off the highest value which was London

(27.37%). This could have been completed in a number of ways including adding up the non-White

percentages from the data table or by identifying the answer from the chart.

Question 4 was answered incorrectly by three evaluators. They all identified that to access ALR data they must drill into the Yorkshire and The Humber region and read from the Clinical Data chart, but were

confused by the fact York and Leeds both had 13% of cases. The correct answer was York as this had

the highest number (25,682) against Leeds (25,403) but this could only be found by rolling the mouse

over the chart to discover the result.
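The ambiguity behind Question 4 comes from rounding: two different counts can display as the same whole-number percentage. A minimal illustration, using the York and Leeds counts quoted above against an assumed regional total (the total is a hypothetical figure, not taken from the data):

```python
# Case counts are those reported in the evaluation; the regional total
# is an assumed figure chosen for illustration.
cases = {"York": 25_682, "Leeds": 25_403}
regional_total = 196_000  # hypothetical Yorkshire and The Humber total

for alr, n in cases.items():
    pct = round(100 * n / regional_total)
    print(f"{alr}: {n} cases ({pct}%)")
# Both display as 13%, so only the underlying counts (revealed on
# mouse-over in the tool) distinguish the correct answer, York.
```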

Question 5 was answered correctly by all evaluators and, without a prompt, required them to drill into

the Yorkshire and The Humber region, change the category of the Clinical and Population Data chart

to Economic Group and then read off the Student value which was 17,280. As this information wasn’t

presented to them directly in the question it implicitly demonstrates the retention of knowledge of the

system over time and also suggests the time to learn is small (Shneiderman & Plaisant, 2005).

The survey was split into six parts and used a 1 to 7 scoring system for all responses where 1 was poor

and 7 excellent. There was also a section below each question to enable evaluators to leave further

comments. (Most took up this opportunity which enabled richer responses.) Rather than taking the

mean average of the responses, the minimum value will be used. This is known as the “least misery

strategy” (Masthoff, 2004) as it shows that the group of evaluators is only as satisfied as the least

satisfied evaluator. There is also a possibility that some evaluators rated more positively than others

which would bias a mean average.
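The least misery aggregation can be sketched in a few lines; the ratings below are hypothetical placeholders rather than the actual survey responses:

```python
# Hypothetical 1-7 ratings from four evaluators for three survey parts.
ratings = {
    "2.1": [5, 6, 4, 7],
    "2.2": [6, 6, 5, 6],
    "3.1": [7, 4, 6, 5],
}

for part, scores in ratings.items():
    mean = sum(scores) / len(scores)
    least_misery = min(scores)  # group is only as satisfied as its least satisfied member
    print(f"Part {part}: mean={mean:.2f}, least misery={least_misery}")
```

Note how one generous evaluator can pull the mean up while the minimum stays anchored to the worst experience, which is exactly the bias the least misery strategy avoids.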

The responses from the survey are presented in Figure 14 (below) with the red line showing the

minimum value for each part. The y-axis scale has been modified as a space saver as all ratings were

above 3. (Starting from 1 would have been inappropriate based on these results.)

[Figure: “User Survey Responses” — ratings (y-axis 3 to 7) for survey parts 1.1 to 6.2, with one series per evaluator (Evaluator 1 to 4) and a series marking the minimum value per part]

Figure 14 Chart Showing Responses from User Survey

Even with the small number of responses the ratings show a relatively large variance which

will be discussed and then recommendations for further work presented. The sub-parts of each

section, which are shown in the chart, will be discussed collectively under the headings supplied in

the user survey (see Appendix E).

Part 1: General Computer Experience

The survey forms were distributed via Facebook which meant that the target audience was primarily

students. As the system is intended for health professionals, who are believed to have similar technical

experience to students because they will have been in higher education, this was thought to be

the most appropriate method. Responses showed that the evaluators felt they were closer to experts

rather than beginners; however, nobody rated themselves higher than 6. This is important for two

reasons: if evaluators consider themselves to be near-experts and cannot answer all questions correctly

there may be serious usability issues (such as their responses to Question 4); and, they may make

better assumptions about functions than less experienced users which means that some issues may be

missed. This last risk has knowingly been accepted as addressing it would require contacting people outside of

Facebook and it is considered unrealistic to assume that health professionals would have less

experience than students (and not all evaluators were from a computing background).

Part 2: Overall User Reactions

Reactions to the system were generally positive, but concerns were raised related to the amount of

information presented. One evaluator commented that “you need to search a lot of the website to find

what you’re looking for” which relates to the low ratings for stimulation and ease of use, and it is

assumed that a lot of trial and error was required to find the answers to the questions. The same

evaluator also stated the site was “well laid out and compact so navigation isn’t time consuming”

which suggests poor documentation rather than poor layout.

Part 3: Screen

Questions regarding screen elements were focused towards character legibility, colour and layout. An

issue was highlighted that the site itself only occupied a relatively small area in the centre of the

screen which could be expanded to occupy more screen “real estate”. The characters on screen were

identified as being quite hard to read but still legible. This may mean that an alternative typeface should

be used, such as a serif font rather than sans-serif, or that there should be increased whitespace

separating textual elements. In one case the screen layouts and arrangement of information were scored

quite low which supports the comments made in Part 2. It was suggested there should be a “tip at the

beginning to inform the user that they can hover over sections of the graph to find out more

information” which is consistent with the fact that evaluators answered some questions incorrectly. Question 4

in particular required evaluators to roll their mouse over the graph which was obviously missed.

Although the evaluator remarked “it took me a while to realise [that], but then again it could just be

me!” in fact it was experienced (unknowingly) by all but one evaluator. The pie charts were described

as “great” however the line graphs required a “longer look to understand them”. An interesting

comment was that the use of colour made the management tool look “friendly and accessible and puts

the user at ease” which was important when faced with lots of medical data. The same evaluator also

suggested the modern and up-to-date look and feel of the site almost gave “extra validity to the data”.

Part 4: Terminology and System Information

As the project incorporated new concepts such as ANR and ALR along with other medical data it was

important to gauge how easy they were to understand. A low rating was received for the clarity of

instructions for commands or functions which has been discussed above regarding the hovering over

of charts. Another evaluator commented that “paragraphs are short and concise which allows the

understanding of the tools and statistics”. Unfortunately the evaluator was one of those who answered

Question 4 incorrectly and so although paragraphs are concise they must also be relevant.

Part 5: Learning

For one evaluator the prospect of exploration of features by trial and error was more discouraging

than encouraging as it was “sometimes unclear where to obtain specific information”. They stated that

“the whole page needs to be analysed before being able to find the information required”. This is

worrying because if the correct information cannot be found quickly then users may not use the system in

the future. However, it also suggests users have to actively engage with the system which means they

will not just take answers for granted but will question and learn from errors. Nielsen (2005) identifies

this as “recognition rather than recall” as users should not have to remember information between

components. One of the most famous findings in psychology is Miller’s (1956) theory that only 7±2

pieces of information can be stored in short-term memory at one time. Short-term memory is a “store

in which information was assumed to be processed when first perceived” (Preece et al. 2002). The

theory suggests that if users are required to process too much information at one time they may forget

important details or become confused very easily. Obviously this must be reduced through separating

“chunks” of information and providing accurate help and documentation.

Part 6: System Capabilities

The final part related to issues of system speed and reliability and gave an opportunity for evaluators

to leave any further comments. All ratings were high and it was noted that “no problems [were]

experienced”. One evaluator mentioned that the “time to completion” preloading bar used in each of

the graphs re-assured them that the buttons they clicked were working. This relates to a heuristic

known as “visibility of system status” described by Nielsen (2005) in which users are kept informed

about what is going on through system feedback.

Semi-Structured Interview

Along with the user participation survey, expert analysis was completed with the aid of a healthcare

expert, Dr. Rick Jones, through a semi-structured interview. Dr. Jones has a keen interest in clinical

decision support, statistical analysis and educational uses of computers. He is also the Deputy

Director of the Yorkshire Centre for Health Informatics (YCHI) and Senior Lecturer in Chemical

Pathology both at the University of Leeds.

The focus of the interview was to discuss the project as a whole, the validity of results and to identify

areas of improvement and further work. Several positive comments arose from the meeting, most

notably that YCHI are hosting an exploratory pilot pathology benchmarking project (http://www.ychi.leeds.ac.uk/benchmarking/) which is attempting to provide timely insights into the

delivery of diagnostic services to National Health Service (NHS) organisations through routinely

collected pathology test data. The benchmarking project is similar to this project in terms of data

sources but also adds a new dimension by integrating pathology lab test data. This project was

forwarded to some of the benchmarking team who were impressed by the “professionalism of the

outputs, the quality of the academic analysis and the potential for further exploitation” and will be

looking to integrate some of the tools demonstrated in this project, which were said to be “very

desirable”, in the next stages of their development.

An example of a current output is shown in Figure 15 (below) which shows the number of HbA1c

(diabetes) tests taken per 1,000 patients (the blue bars) against the respective diabetes prevalence (the

black line). The results are filtered to show each Primary Care Trust (PCT) from four Strategic Health

Authorities (SHAs) and can be customised by the user. The population statistics that have been

prepared in this project could be integrated with such outputs as both use the standard National

Administrative Codes Service (NACS) nomenclature.

Figure 15 Screenshot Showing the Number of HbA1c Tests Compared to Diabetes Prevalence

Dr. Jones commented that the results presented in this project seemed valid but would need to include

other risk factor statistics if it were to be used in a clinical environment: for example, whether a family member has diabetes, or whether the patient is overweight or has high blood pressure. From these comments it

may have been more appropriate to focus the project on gaining quantifiable results, e.g., the costs

associated with the number of diabetes tests being performed unnecessarily rather than providing

‘best-guesses’ at what may be causing the diabetes cases in the first place. A covering letter with

further comments is available in Appendix F.

6.3 Further Work

From the results of the evaluation, and from other sources, we have identified a number of extensions

that would improve the project. As the computing discipline is constantly evolving, newer

technologies make additions more feasible and it is important to be aware of change. For example, Google has released its Charts API (http://code.google.com/apis/chart/), which could be used as a replacement for amCharts, and Rich Internet Application (RIA) development kits, which enable interactive content to be displayed in ways not possible a few years ago, are increasing in functionality and size.
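As a sketch of what such a replacement might involve: the Charts API of the time was an HTTP image service driven entirely by URL parameters. The parameter names below (cht, chs, chd, chl) are the documented ones; the data values are illustrative placeholders, not the project's actual figures:

```python
from urllib.parse import urlencode

# Build a pie-chart image URL for the Google Chart API (an HTTP image
# service); the data series and labels here are illustrative only.
params = {
    "cht": "p",                 # chart type: pie
    "chs": "400x200",           # image size in pixels
    "chd": "t:24,21,19,18,18",  # data series, text encoding
    "chl": "South|London|North|East|West Midlands",  # slice labels
}
url = "http://chart.apis.google.com/chart?" + urlencode(params)
print(url)  # embed in an <img src="..."> tag to render the chart
```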

• Documentation could be improved by either creating an interactive screen cast which

demonstrates the use of the software or by explaining functions more clearly. This was

identified by the evaluators who noted it was difficult to find what they were looking for and

who were unaware of some of the features, such as hovering, provided by amCharts.

• There should be a clear distinction made between what information each graph represents

and how results should be interpreted. The combination bar and line graph was potentially

difficult to understand, which led to errors, and so statistics should either be separated or a

clearer legend and description of what results are being presented should be displayed.

• More functionality could be added to enable users to zoom in and out of the charts. This

would improve clarity of the data and also enable the user to focus on areas relevant to them

rather than just seeing the “big picture”. This feature is provided by amCharts in the form of a

scroller which enables charts to be customised by selecting areas of interest from a web form

or from the chart directly using the mouse.

• Other disease indicators could be used such as heart disease or cancer. These statistics could

be presented in the same way as diabetes, or new methods could be created to enable cross-

comparisons between indicators. This would be similar to the NCHOD (2007) interactive

atlas enabling live cross-comparisons. Replicating current features would be straightforward as the Quality Outcomes Framework (QOF) data included other indicators which were removed in

the data preparation phase of this project. By modifying the data model to include new

indicators and updating the models, views and controller this change would be possible.

• Exploring data mining concepts further in WEKA such as clustering may provide further

insights into the data. This could include identifying “similar” regions in terms of

demographics and seeing whether they also had similar disease prevalence. WEKA also has a

Java-based interface which could be programmed alongside this project to extend its

functionality: for example, a desktop application which accepted live data feeds from this software and output them in a different format.
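The clustering idea could first be prototyped without WEKA by measuring demographic similarity directly. The sketch below uses invented demographic vectors and prevalence figures purely to illustrate the "similar regions, similar prevalence?" comparison:

```python
import math

# Hypothetical demographic vectors (percentages) and diabetes prevalence
# per region -- stand-ins for the project's Census/QOF data.
regions = {
    "A": ([12.0, 30.0, 58.0], 5.1),
    "B": ([11.5, 31.0, 57.5], 5.0),
    "C": ([25.0, 40.0, 35.0], 6.4),
}

def distance(u, v):
    """Euclidean distance between two demographic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

target = "A"
vec, prev = regions[target]
others = {k: v for k, v in regions.items() if k != target}
nearest = min(others, key=lambda k: distance(vec, others[k][0]))
print(f"Region most similar to {target}: {nearest}; "
      f"prevalence {prev}% vs {others[nearest][1]}%")
```

A clustering algorithm such as k-means would generalise this pairwise comparison to whole groups of regions.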

The list of extensions is in no way exhaustive, but presents the views specific to the goal of the

evaluation. It is possible that other issues would have been identified had a different approach been taken, as discussed in the evaluation summary (see Section 6.4).

6.4 Evaluation Summary

The goal of the evaluation was to assess how the solution could influence the working practice of the

persona through the questions identified in the explore stage of the framework. Two techniques were

used, user surveys and expert analysis, which were elements of “quick and dirty” and predictive

evaluation paradigms. User surveys were used to address issues of design and usability and expert

analysis addressed the validity of the results and where this project fits into the

context of the National Health Service (NHS). Although only a few responses were received this was

justified by Nielsen’s (Dix et al. 2004) experience in usability evaluations and was the reason why

expert analysis was also conducted. Responses from the evaluation were generally positive and helped

identify a few problems in the design of the software such as lack of clear documentation. The use of

the DECIDE framework helped structure the evaluation, enabling greater focus on

implementation and generating user responses. Finally we presented areas of further work addressing

the results of the evaluation but also areas that were considered out of the scope of this project.

7. Conclusion

In this section we review the project as a whole to identify its achievements and include suggestions

for future improvements. The project can be considered a success for the following reasons:

• A problem was defined which not only provided a sufficient challenge but also could be

extended in a number of ways in the future. The problem was relevant to the Informatics

degree programme as it not only included information creation, processing and systems

engineering but also bridged the gap between computing and healthcare domains.

• A methodology was designed, built on the experiences of others (Wirth & Hipp, 2000), which

effectively integrated data mining and project management processes. Sufficient maturity was

demonstrated to follow the methodology and to adhere to the project schedule which led to

the project being completed a week before the deadline. There was a slight change in the final

project plan which reflected a change in scope. This was because initially it was thought

aspects of data mining could be explored when in fact the time taken to create the data model

would have meant that discussion would have been of limited originality.

• Knowledge of clinical and population data sets was demonstrated and improved throughout

the project. This included being disciplined in how data was collected, as some resources, such as the Quality Outcomes Framework (QOF) data, needed to be applied for, and

persevering with often monotonous data transformation procedures.

• Although creating the graphical user interface (GUI) was not the primary concern of the project, it was developed with considerable care. Attention was paid to adhering to

current web standards defined by the W3C and implementation required good knowledge of

CodeIgniter, eXtensible Markup Language (XML), JavaScript, MySQL, phpMyAdmin and

Hypertext Pre-Processor (PHP) to name but a few.

• Evaluation was completed systematically following a documented framework (DECIDE).

Although only a few responses were received these were considered enough to formulate

accurate conclusions about the system. An area of improvement would have been to conduct

the evaluation earlier in the project cycle or to provide prototypes before the final

implementation. However, by making a small number of changes, such as improving help and

documentation, the system would be able to go live.

Further personal reflections are presented in Appendix A, which may serve as a reference for

prospective students undertaking similar projects. In summary, we hope this report accurately reflects

the skills and time taken in the development of this project as each stage was meticulously crafted to

produce a solution which was not only technically creative but also usable and extensible.

References

AMBLER, S. 1999. Enhancing the Unified Process [online]. [Accessed 25 October 2007]. Available

from: http://www.ddj.com/architect/184415741

AMBLER, S. 2006. The Agile Edge: Unified and Agile [online]. [Accessed 25 October 2007].

Available from: http://www.ddj.com/architect/184415460

BENNETT, S., MCROBB, S. & FARMER, R. 2006. Object-Oriented Systems Analysis and Design

(3rd. Ed.) Using UML. Berkshire: McGraw-Hill Education.

BEYNON-DAVIES, P. 2004. Database Systems (3rd Ed.). Hampshire: Palgrave Macmillan.

BEYNON-DAVIES, P. & HOLMES, S. 1998. Integrating rapid application development and

participatory design. IEE Proceedings - Software. 145 (4), pp. 105-112.

BOCIJ, P., CHAFFEY, D., GREASLEY, A. (Ed.). & HICKIE, S. 2006. Business Information Systems

(3rd Ed.) Technology, Development and Management for the E-Business. Essex: Prentice Hall.

BØDKER, S. 1996. Creating conditions for participation: conflicts and resources in systems

development. Human-Computer Interaction. 11 (3), pp. 215-236.

CHAPMAN, P., CLINTON, J., KERBER, R., KHABAZA, T., REINARTZ, T., SHEARER, C. &

WIRTH, R. 2000. CRoss Industry Standard Process for Data Mining [online]. [Accessed 27 October

2007]. Available from: http://www.crisp-dm.org/CRISPWP-0800.pdf

CHIN, J.P., DIEHL, V.A. & NORMAN, K.L. 1988. Development of an instrument measuring user

satisfaction of the human-computer interface in: Proceedings of the SIGCHI Conference on Human

Factors in Computing Systems, pp. 213-218.

CODD, E.F. 1970. A relational model of data for large shared data banks. Communications of the

ACM. 13 (6), pp. 377-387.

COLEMAN, W.D. & WILLINGES, R.C. 1985. Collecting detailed user evaluations of software

interfaces in: Proceedings of the Human Factors Society, pp.240-244.

COOPER, A. 1999. The Inmates are Running the Asylum (2nd Ed.) Why High-tech Products Drive Us

Crazy and How to Restore the Sanity. Indiana: Sams.

COOPER, A. & REIMANN, R.M. 2007. About Face 3 (3rd Ed.) The Essentials of Interaction

Design. Indiana: John Wiley & Sons.

DIRECTGOV, 2007a. What is a Census? [online]. [Accessed 12 November 2007]. Available from:

http://www.statistics.gov.uk/Census/WhatisaCensus/

DIRECTGOV, 2007b. Census 2001: the most Comprehensive Survey of the UK Population [online].

[Accessed 12 November 2007]. Available from: http://www.statistics.gov.uk/Census2001/

Page 55: HealthWatch: A Management Tool Combining Clinical and Population Data Sets

48

DIX, A., FINLAY, J., ABOWD, G.D. & BEALE, R. 2004. Human-Computer Interaction (3rd Ed.).

Essex: Pearson Education Limited.

GERMAIN, E & ROBILLARD, P.N. 2005. Engineering-based processes and agile methodologies for

software development: a comparative case study. Journal of Systems and Software. 75 (1), pp. 17-27.

GRUDIN, J. & PRUITT, J. 2003. Personas, Participatory Design and Product Development: an

Infrastructure for Engagement [online]. [Accessed 27 October 2007]. Available from:

http://research.microsoft.com/research/coet/Grudin/Personas/Grudin-Pruitt.pdf

HUNT, A. & THOMAS, D. 1999. The Pragmatic Programmer: From Journeyman to Master.

London: Addison-Wesley.

JUNIOR, P.T.A. & FILGUEIRAS, L.V.L. 2005. User Modelling with Personas in: Proceedings of the

2005 Latin American Conference on Human-Computer interaction, pp. 277-282.

KANTARDZIC, M. 2002. Data Mining: Concepts, Models, Methods and Algorithms. Chichester:

John Wiley & Sons.

LANDSBERGER, H.A. 1958. Hawthorne Revisited: Management and the Worker, Its Critics, and

Developments in Human Relations in Industry. New York: Cornell University Press.

LEONARD, M & SAMY, R. Forecasting Geographic Data [online]. [Accessed 23 January 2008].

Available from: http://support.sas.com/rnd/app/papers/papers_ets.html

MASTHOFF, J. 2004. Group modelling: Selecting a sequence of television items to suit a group of

viewers. User Modelling and User-Adapted Interaction. 14 (1), pp.37-85.

MILLER, G. 1956. The magical number seven, plus or minus two: some limits on our capacity for

processing information. The Psychological Review. 63, pp.81-97.

MUIR GRAY, J.A. 2006. The National Knowledge Service Plan 2007-2010 [online]. [Accessed 22

January 2008]. Available from: http://www.nks.nhs.uk/nksplan2007.pdf

MULLER, M.J. 2002. Participatory Design: The Third Space in HCI [online]. [Accessed 27 October

2007]. Available from:

http://domino.research.ibm.com/cambridge/research.nsf/2b4f81291401771785256976004a8d13/5684

4f3de38f806285256aaf005a45ab?OpenDocument

NCHOD, 2007. Clinical and Health Outcomes Knowledge Base [online]. [Accessed 12 December

2007]. Available from: http://www.nchod.nhs.uk/

NIELSEN, J. 2005. Heuristics for User Interface Design [online]. [Accessed 02 March 2008].

Available from: http://www.useit.com/papers/heuristic/heuristic_list.html

NOMIS, 2007. Official Labour Market Statistics [online]. [Accessed 13 November 2007]. Available

from: https://www.nomisweb.co.uk/

Page 56: HealthWatch: A Management Tool Combining Clinical and Population Data Sets

49

PHILLIPS, R., BAIN, J., MCNAUGHT, C., RICE, M. & TRIPP, D. 2000. Handbook for Learning-

centred Evaluation of Computer-facilitated Learning Projects in Higher Education [online].

[Accessed 05 March 2008]. Available from:

http://www.tlc.murdoch.edu.au/archive/cutsd99/handbook/handbook.html

PREECE, J., ROGERS, Y. & SHARP, H. 2002. Interaction Design: Beyond Human-Computer

Interaction. New York: John Wiley & Sons.

SCRIVEN, M. 1967. The Methodology of Evaluation in: TYLER, R.W., GAGNÉ, M. & SCRIVEN,

M. (Eds.). Perspectives of Curriculum Evaluation. Chicago: Rand McNally, pp.39-83.

SHNEIDERMAN, B. & PLAISANT, C. 2005. Designing the User Interface (4th Ed.) Strategies for

Effective Human-Computer Interaction. New York: Addison-Wesley.

THE INFORMATION CENTRE, 2007. Online GP Practice Results Database [online]. [Accessed 12

December 2007]. Available from: http://www.qof.ic.nhs.uk/

TRACHTENBERG, A. 2004. Why PHP5 Rocks! [online]. [Accessed 30 September 2007]. Available

from: http://www.onlamp.com/pub/a/php/2004/07/15/UpgradePHP5.html

TURBIT, N. 2005. Defining the Scope of a Project [online] [Accessed 06 April 2008]. Available

from: http://www.projectperfect.com.au/downloads/Info/info_define_the_scope.pdf

W3C, 2008. Scalable Vector Graphics (SVG) [online]. [Accessed 08 April 2008]. Available from:

http://www.w3.org/Graphics/SVG/

WIKIPEDIA. 2007. Comparison of web application frameworks [online]. [Accessed 12 December

2007]. Available from: http://en.wikipedia.org/wiki/Comparison_of_web_application_frameworks

WIRTH, R & HIPP, J. 2000. CRISP-DM: Towards a Standard Process Model for Data Mining

[online]. [Accessed 27 October 2007]. Available from: http://sunwww.informatik.uni-

tuebingen.de/forschung/papers/padd00.pdf

WITTEN, I.H. & FRANK, E. 2005. Data Mining: (2nd Ed.) Practical Machine Learning Tools and

Techniques. San Francisco: Morgan Kaufmann.


Appendix A: Personal Reflection

Phew. With development of the system finally complete and all components of the report finished, I can now look back over the past year and reflect on what an amazing experience it has been. I am extremely pleased with how the project has turned out, and many lessons have been learnt that I can take into a future career. For me, the most important factor in the project's success was incorporating things that I loved throughout the requirements, methodology, implementation and evaluation, from the research into personas and user-centred design to investigating evaluation paradigms and techniques that truly reflect human experiences with technology. If anything, I would have loved to explore elements of human cognition and perception a lot more! For any students attempting a project on a similar topic or using similar methods, I would recommend the following:

• Plan your requirements analysis carefully. Defining a suitable scope for the project is essential, and is a lesson I learnt quite late into development. From the beginning I had in mind the kind of approach I wanted to adopt, including how I was going to evaluate my solution, which helped because I could always refer back to my requirements to ensure I was on track.

• Manage your time appropriately. Even with a 60/60 split between modules, I found that I was extremely busy with coursework in the first semester and had much more time for the project in the second. Ironically, I seemed to complete more work in the first semester, as I was committed to finishing the data processing elements before the Christmas break.

• Choose a suitable methodology. Having reviewed many of the common methodologies, I realised that none of them addressed both data processing and project issues. If anything, I would use the Integrated Data Modelling, Analysis and Presentation (IDMAP) process, but ensure that there is adequate collaboration between yourself and an expert; I found this often difficult to achieve and did not utilise it as much as I would have liked. One area I would emphasise is designing your system and gathering responses early in development. In this project I relied heavily on my evaluation to highlight errors, which meant they could not be corrected in the final version.

• Refine, refine, refine. Don't be afraid to add or remove elements from the write-up that are no longer appropriate; I found this when I re-defined my project scope following suggestions made in project meetings. During the write-up phase I must have made in excess of 100 refinements throughout the project. Because I started planning and writing from the very first day, I had completed around 90% of the write-up well before the hand-in dates, which helped immensely as deadlines approached. I can imagine a lot of other students will be cramming work into the two weeks prior to final submission, but as I write I am around 99% complete. Writing up early also helped during meetings with my supervisor, as feedback was based on real work rather than hypothetical work.


• Plan your report carefully, including the length of its sections. This was advice given to me by my assessor after the mid-project report, and it prompted me to modify the layout of my work quite dramatically. Always look at your report through the eyes of a reader and identify the most logical layout and presentation. For example, this report follows my chosen methodology, which also demonstrates that I followed my project plan.

• Read about how you are being assessed on the project web site. Maintaining a good balance between implementation and write-up is something that I hope this project demonstrates. From the beginning I knew how I was going to be assessed and was able to reflect this in my methodology and planning.

• Be adventurous. Through background research I investigated techniques which extended concepts taught throughout my degree programme and which show the application of technical methods to a real-world problem. For example, I looked at evaluation in detail, using techniques relevant to both technical and social perspectives.

• Be professional when communicating with others. It is essential to show appreciation for the time others spend helping with your project. In my experience this ensured future responses were quicker, and it created a rapport which often led to people divulging a lot more information than originally intended. Remember, too, that you are representing the University of Leeds as well as yourself, and should be looking to promote your skills in case you apply for jobs at similar organisations in the future.

Hopefully following some of this advice will enable you to implement a project that you can be proud of. I found that keeping a level head and aiming to complete work in advance of deadlines helped immeasurably, although I know this approach does not work for everyone! Over the year I feel I have grown and matured, not only as an Informatics student but as an individual, which is something that will stay with me forever. If I could offer one final piece of advice, it would be to enjoy yourself and do something that you love. Oh, and drink lots of coffee. Good luck.


Appendix B: Project Plan


Appendix C: Yorkshire and The Humber Aggregate Local Region Pairings

CLA Name                 | PCT Name                 | Aggregate Local Region
-------------------------|--------------------------|-------------------------
Barnsley                 | Barnsley                 | Barnsley
Bradford                 | Bradford and Airedale    | Bradford
Calderdale               | Calderdale               | Calderdale
Doncaster                | Doncaster                | Doncaster
East Riding of Yorkshire | East Riding of Yorkshire | East Riding of Yorkshire
Kingston upon Hull       | Hull Teaching            | Hull
Kirklees                 | Kirklees                 | Kirklees
Leeds                    | Leeds                    | Leeds
North East Lincolnshire  | North East Lincolnshire  | North East Lincolnshire
North Lincolnshire       | North Lincolnshire       | North Lincolnshire
Rotherham                | Rotherham                | Rotherham
Sheffield                | Sheffield                | Sheffield
Wakefield                | Wakefield District       | Wakefield
North Yorkshire, York    | North Yorkshire and York | York


Appendix D: Informed Consent Form

I state that I am over 18 years of age and wish to participate in a program of research being conducted by Mr. Mark Hawker at the School of Computing, University of Leeds.

The purpose of the research is to assess the usability of HealthWatch, a website developed to aid health professionals in comparing clinical data against population data.

The procedures will involve the monitored use of HealthWatch. I will be asked to complete a survey and answer open-ended questions about HealthWatch and my experience using it. All information collected in the study is confidential, and I will not be identified by name at any time. I understand that I am free to ask questions or withdraw from participation at any time without penalty.


Appendix E: User Survey Template

To complete the evaluation you will be required to answer a few simple questions (see below) and then reflect on your experience by filling out the user evaluation survey from Page 3 onwards. Sections that require your input are highlighted in yellow.

Please visit http://leeds.thebubblejungle.com/project/ and try to answer the following questions:

1. Which Aggregate National Region (ANR) has the highest percentage of diabetes cases? XXX
2. Which ANR has the highest disease prevalence? XXX
3. Which ANR has the largest non-White population? XXX
4. Which Aggregate Local Region (ALR) in Yorkshire and The Humber has the highest number of diabetes cases? XXX
5. How many students are there in Leeds? XXX

Next, please fill out the user evaluation survey with your comments.

User Evaluation

To fill out the form, replace the number you consider to be the most appropriate answer with an 'X'. Please provide additional comments for justification in the spaces provided.

Finally, please be as honest as possible when filling out the survey. If something doesn't make sense or is hard to use, then please say so!

Part 1: General Computer Experience

1.1 General computer experience
    Beginner 1 2 3 4 5 6 7 Expert    NA

Part 2: Overall User Reactions

2.1 Overall reactions to the system
    Terrible 1 2 3 4 5 6 7 Wonderful    NA

2.2 Overall reactions to the system
    Frustrating 1 2 3 4 5 6 7 Satisfying    NA

2.3 Overall reactions to the system
    Dull 1 2 3 4 5 6 7 Stimulating    NA

2.4 Overall reactions to the system
    Difficult 1 2 3 4 5 6 7 Easy    NA

Part 3: Screen

3.1 Characters on the computer screen
    Hard to Read 1 2 3 4 5 6 7 Easy to Read    NA

3.1.1 Character shapes (fonts)
    Barely Legible 1 2 3 4 5 6 7 Very Legible    NA

3.2 Screen layouts were helpful
    Never 1 2 3 4 5 6 7 Always    NA

3.2.1 Arrangement of information
    Illogical 1 2 3 4 5 6 7 Logical    NA

3.2.2 Sequence of screens
    Confusing 1 2 3 4 5 6 7 Clear    NA

3.4 Colours used are
    Inappropriate 1 2 3 4 5 6 7 Appropriate    NA

3.5 Chart results are
    Hard to Interpret 1 2 3 4 5 6 7 Easy to Interpret    NA

Part 4: Terminology and System Information

4.1 Use of terminology throughout
    Inconsistent 1 2 3 4 5 6 7 Consistent    NA

4.2 Computer terminology is used
    Too Frequently 1 2 3 4 5 6 7 Appropriately    NA

4.3 Terminology on the screen
    Ambiguous 1 2 3 4 5 6 7 Precise    NA

4.4 Instructions for commands or functions
    Confusing 1 2 3 4 5 6 7 Clear    NA

Part 5: Learning

5.1 Learning to operate the system
    Difficult 1 2 3 4 5 6 7 Easy    NA

5.2 Exploration of features by trial and error
    Discouraging 1 2 3 4 5 6 7 Encouraging    NA

5.3 Amount of help given
    Inadequate 1 2 3 4 5 6 7 Adequate    NA

Part 6: System Capabilities

6.1 System speed
    Too Slow 1 2 3 4 5 6 7 Fast Enough    NA

6.2 The system is reliable (no crashes)
    Never 1 2 3 4 5 6 7 Always    NA

Thank you for completing this evaluation.

Please e-mail this document to [email protected].


Appendix F: Dr. Rick Jones’ Evaluation Cover Letter

Mark's project partly arose from the Department of Health's (DH) interest in understanding healthcare activity in relation to clinical demand: in particular, how to move away from cost-and-volume metrics towards metrics based on clinical need and disease burden. To that end, the DH had sponsored a University-led research project (Primary Care Benchmarking: Leeds/Keele, 2007-2009) to look at laboratory test utilisation by general practices (GPs) in relation to disease prevalence.

A major challenge in achieving that goal is to understand how to obtain, manage and integrate the data needed to underpin the different dimensions of the analysis (clinical, population, organisational, ontological). I feel that Mark has engaged with this problem and has demonstrated a robust and scalable solution which will serve to underpin further work in the main DH project. During the year he has attended a number of meetings with the team from Keele Benchmarking and The Information Centre and quickly grasped the issues involved.

He then independently researched the problem and brought his considerable skill to formalising the solution and building the prototype. He has demonstrated a clear insight into the scalability issues and the need to build a robust platform into which other datasets could be imported to allow extended analysis. His work has been seen by the team working on the DH project, who have been impressed by the professionalism of the outputs, the quality of the academic analysis and the potential for further exploitation.

It is our intention to use Mark's work as a core component in the next stage of the benchmarking system development. This will involve adding further datasets, but will use the same generic tools Mark has developed, with possible extensions into geographic information system (GIS) displays. There is a significant opportunity to extend this work further into a research collaboration between health and computer science.

