+ All Categories
Home > Documents > Supercomputing 2003, UK e-Science Booth 1 First Data Investigation on the Grid: FirstDIG Terry...

Supercomputing 2003, UK e-Science Booth 1 First Data Investigation on the Grid: FirstDIG Terry...

Date post: 13-Dec-2015
Category:
Upload: eustace-dickerson
View: 216 times
Download: 0 times
Share this document with a friend
Popular Tags:
35
Supercomputing 2003, UK e-Science Booth First Data Investigation on the Grid: FirstDIG Terry Sloan, Paul Graham, Adam Carter Edinburgh Parallel Computing Centre (EPCC) Telephone: +44 131 650 5155 Email: [email protected] [email protected] [email protected]
Transcript

Supercomputing 2003, UK e-Science Booth

First Data Investigation on the Grid: FirstDIG

Terry Sloan, Paul Graham, Adam CarterEdinburgh Parallel Computing Centre (EPCC)

Telephone: +44 131 650 5155

Email: [email protected] [email protected]

[email protected]

2 Supercomputing 2003, UK e-Science Booth

Overview

The Project Motivation Methodology Data Sources, Cleaning, Analysis OGSA-DAI Future Work

Supercomputing 2003, UK e-Science Booth

The Project

4 Supercomputing 2003, UK e-Science Booth

The Project

Two aims:

1. Demonstrate deployment of OGSA-DAI within the First South Yorkshire bus operational environment and learn from it

2. Short data analysis using OGSA-DAI service enabled data sources to answer business questions posed by First South Yorkshire

5 Supercomputing 2003, UK e-Science Booth

The Project (cont)

Partners– First plc represented by First South Yorkshire– National e-Science Centre represented by EPCC

Timescales– 9 months – Start May 2003– End JanDec 2004– Nov 2003 = Project Month 7 (PM7)

Supercomputing 2003, UK e-Science Booth

Motivation

7 Supercomputing 2003, UK e-Science Booth

Motivation

First plc– Few UK e-Science projects involve service companies such as First

plc– Operate worldwide in variety of transport sectors– Over 10000 vehicles in the UK, 23% of the market– UK’s largest operator– Challenge is meeting the needs of the travelling public whilst making

money– Data Mining may assist but huge range of fragmented data sources

OGSA-DAI : Data Access and Integration– Potentially provides a solution– Need business users to make transition from science to commerce

Supercomputing 2003, UK e-Science Booth

Methodology

9 Supercomputing 2003, UK e-Science Booth

Methodology

Business questions Data sources Data cleaning/analysis OGSA-DAI service-enabled data sources Replicate data cleaning/data analysis Feedback on OGSA-DAI suitability and areas for

improvement.

Supercomputing 2003, UK e-Science Booth

Data Sources, Cleaning & Analysis

11 Supercomputing 2003, UK e-Science Booth

Data Sources in the Bus Industry

Many different kinds of data involved with running a bus company– Mileage, revenue, customer contact, schedule, fuel consumption,

vehicle maintenance, routes…

Many means to collect data– Manually entered data at depot– Data collected on buses from ticket machines– Data collected on buses from GPS systems– GPS system notes when bus passes through a predefined

“footprint” and records the time at which this happens

12 Supercomputing 2003, UK e-Science Booth

Disparate Databases

Data is typically stored in disparate databases– Various reasons for this: Incremental construction of systems.– Not a problem for day-to-day running and querying but…

Introduces challenges for Data Analysis– Systems introduced at different times– Different database engines– Different front-ends– Different operating systems– Different physical locations– Different ways of representing data

13 Supercomputing 2003, UK e-Science Booth

An Example Process

CLEAN

CLEAN

AGGREGATE

AGGREGATE

JOIN

RE-FORMAT

RE-FORMAT

ANALYSE

14 Supercomputing 2003, UK e-Science Booth

Cleaning and Reformatting

One Bus, Many Names– e.g. Service 25A might be “025A”, “25A”, “25a”– Sometimes referring to individual depots, and sometimes to

operating regions which may include various depots.– Furthermore, if data is stored separately for each depot, data

might not explicitly include a reference to a depot – this has to be added when the data is aggregated

Pre-processing can often be done with SQL after some initial analysis– e.g. Create tables with entries corresponding to the depot and

columns containing data on how this depot is labelled in the different databases.

15 Supercomputing 2003, UK e-Science Booth

Cleaning and Reformatting 2

Pre-processing with SQL (continued)– Harder for example of service names: Need larger table. Requires

effort, but need only be created once.– Alternative:

• Read data from database

• Process data with other tools (Perl, SPSS, …)

• Load results to new table in database

16 Supercomputing 2003, UK e-Science Booth

Aggregation

Data can be aggregated in various ways– e.g. By Service, By Day

SQL can do much of the simple aggregation:

SELECT Service, Region, SUM(Revenue) AS TotalRevenueFROM RevenueTableGROUP BY Service, Region

In practice SQL can be somewhat more complicated

17 Supercomputing 2003, UK e-Science Booth

Aggregation 2

As before, an alternative is:– Read data from database– Aggregate with external program (SPSS, Perl, even Excel)– Load data back into database

Whether or not this is worth doing depends on– Availability of Aggregation Functions in database engine– Extent of processing required: If a database is stored on a small

or heavily-used machine, it may be quicker to export, process, and import.

18 Supercomputing 2003, UK e-Science Booth

Joins

Can combine data from more than one database:– Complaints versus Lateness– Revenue versus Lost Miles– Complaints versus Lost Miles

Often Joins are on data aggregated in some way:– By Service– By Day

Subsets of the data can also be considered– e.g. no weekends

19 Supercomputing 2003, UK e-Science Booth

Hurdles: Non-Standard SQL

Non-Standard SQL introduces some hurdles for transparent integration of data

Date Formats:– No standard data format: DD/MM/YYYY or MM/DD/YY– No standard date handling functions– Compare MS Access and mySQL:

SELECT * FROM AccessTable WHEREIncidentDate BETWEEN #11/30/2000# AND#11/30/2002#

SELECT * FROM MySQLTable WHEREIncidentDate BETWEEN '2000-11-30' AND'2002 11 30'

20 Supercomputing 2003, UK e-Science Booth

Hurdles: How representative is Data?

Data available for mining can influence results Representative data required for meaningful

results Since data is not collected for the purposes of

data mining, it may be incomplete For example, data might only be collected to

analyse a perceived problem with a particular route

21 Supercomputing 2003, UK e-Science Booth

Required Datamining Tools

SQL can be used for basic data analysis but OGSA-DAI doesn’t replace data mining tools

More complicated data analysis requires external tools: e.g. C5, Perl, SPSS, Excel

OGSA-DAI’s use here is to extract data required for analysis and deliver it to the system on which analysis is to be performed in a useful format Machine

performing analysis

GRID

OGSA-DAI

OGSA-DAI

Supercomputing 2003, UK e-Science Booth

OGSA-DAI

23 Supercomputing 2003, UK e-Science Booth

The problem

Access to databases at First The databases:

– Are located at different sites– Are hosted on different operating systems– Are not all available via the internal network– Have different DBMS

Require ability to analyse their contents in a uniform manner and include cross-database analysis

24 Supercomputing 2003, UK e-Science Booth

The solution

OGSA-DAI– Open Grid Services Architecture Data Access and Integration– DAIS-WG at GGF

Grid middleware:– Assists with the access and integration of data from separate data sources

via the Grid– Represents databases as Grid Services– Enables access from other machines in a secure manner

OGSA-DAI Partners– Funded under UK e-Science Core program– Universities of Edinburgh, Manchester and Newcastle– IBM and Oracle– http://www.ogsadai.org.uk

25 Supercomputing 2003, UK e-Science Booth

OGSA Data Access and Integration

Based on Grid Services concept– Stateful web services with an associated lifetime– Has a set of behaviours, and conforms to a set of interfaces

through which a client may interact

Three main Grid Services:– DAI Service Group Registry (DAISGR)

• Holds a list of …

– Grid Data Service Factory (GDSF)• Associated with a single database

– Grid Data Service (GDS)• A “session” with a database

26 Supercomputing 2003, UK e-Science Booth

OGSA-DAI typical interaction 1/3

Client

DAISGRGDSF

1.

Web ServicesContainer

27 Supercomputing 2003, UK e-Science Booth

OGSA-DAI typical interaction 2/3

Database

GDSClient

GDSF

Web ServicesContainer

2.

28 Supercomputing 2003, UK e-Science Booth

OGSA-DAI typical interaction 3/3

DatabaseGDS

Client

Web ServicesContainer

3.

29 Supercomputing 2003, UK e-Science Booth

First and OGSA-DAI

Our remit:– To evaluate the suitability of the use of OGSA-DAI in a

commercial environment

Need to find out if OGSA-DAI:– Is appropriate– Is secure– Is straightforward to deploy and use– Does what we need!

Feedback from project goes straight to OGSA-DAI team

30 Supercomputing 2003, UK e-Science Booth

Progress

Have a test deployment running at EPCC

Using two of the databases identified in the data analysis WP– The Customer Contact System

• Microsoft Access

• Information on customer complaints e.g. time, service, nature

– The Mileage database• dBASE IV

• Information on bus mileage e.g. lost miles

31 Supercomputing 2003, UK e-Science Booth

Issue

OGSA-DAI currently does not officially support Access or dBASE IV !

However, does support JDBC-accessible databases

Solution– Use the Microsoft provided ODBC driver– Use the Sun provided JDBC-ODBC bridge

32 Supercomputing 2003, UK e-Science Booth

Set up

Using three machines within our firewall– One to host the CCS database– One to host the Mileage database– One to act as the client

ClientDAISGR

GDSFGDSF

DAISGR

33 Supercomputing 2003, UK e-Science Booth

Limitations

Data type support– The BIT data type (Yes/No fields)– The Date format

“Out of range” character codes– Limitation of XML

Firewalls– General Grid computing discussion

Usability– Use of XML can be confusing

Supercomputing 2003, UK e-Science Booth

Future Work

35 Supercomputing 2003, UK e-Science Booth

Future work

Deploy at First– And test within their network

A client tool– To improve usability

Additional databases and DBMS– First have other databases under different DBMS they want to

integrate

Single DAI Service Group Registry– These databases should be registered centrally

More complex interactions– Joins across databases …


Recommended