DELIVERABLE

D6.1 – Proactive rail infrastructures pilots design

Project Acronym TT

Project Title Transforming Transport

Grant Agreement number 731932

Call and topic identifier ICT-15-2016-2017

Funding Scheme Innovation Action (IA)

Project duration 30 Months [1 January 2017 – 30 June 2019]

Coordinator Mr Chris Byron (Thales)

Website www.transformingtransport.eu

Document fiche

Authors: Ángel Pérez Bartolomé [INDRA], Jesús Miguel Alonso Pérez [FERROVIAL], Miguel Rodríguez Plaza [ADIF], Philip Flynn [Thales]

Internal reviewers: Jarno Kanninen [MATT], Klaus Julian Eggemann [FHG]

Work Package: WP6

Task: T6.2 / T6.3

Nature: R

Dissemination: PU

Document History

Version | Date | Contributor(s) | Description
1.0 | 14/03/2017 | INDRA: Ángel Pérez Bartolomé; ADIF: Miguel Rodríguez Plaza, Álvaro Perez Mansilla; FERROVIAL: Jesús Miguel Alonso Pérez, Pablo Eugenio Fernández Vivanco, Jose Ignacio Jardi Cuerda | Final document for internal review
1.02 | 14/03/2017 | Thales: Philip Flynn | Internal review

Keywords: Rail, rail domain, pilot design

Abstract (few lines): This deliverable reports on the work performed in WP6/T6.2 “Predictive Rail Asset Management Pilot” and T6.3 “Predictive High Speed Network Maintenance Pilot” with respect to the pilot design for the rail domain of the TransformingTransport project.

DISCLAIMER

This document does not represent the opinion of the European Community, and the European Community is not responsible for any use that might be made of its content. This document may contain material which is the copyright of certain TT consortium parties, and may not be reproduced or copied without permission. All TT consortium parties have agreed to full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

Neither the TT consortium as a whole, nor any individual party of the TT consortium, warrants that the information contained in this document is capable of use, or that use of the information is free from risk, and neither accepts any liability for loss or damage suffered by any person using this information.

ACKNOWLEDGEMENT

This document is a deliverable of the TT project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement Nº 731932.

Table of Contents

DEFINITIONS, ACRONYMS AND ABBREVIATIONS
EXECUTIVE SUMMARY
1 OVERALL MOTIVATION AND AMBITIONS FOR PILOT DOMAIN
  1.1 PROACTIVE RAIL INFRASTRUCTURES (WP6) OVERVIEW
2 INITIAL PILOT: D6.1: PROACTIVE RAILWAY INFRASTRUCTURE
  2.1 REQUIREMENTS
  2.2 OBJECTIVES
  2.3 USE CASES / SCENARIOS
    2.3.1 Use case 1
    2.3.2 Use case 2
    2.3.3 Use case 3
  2.4 DATA ASSETS
  2.5 BIG DATA TECHNOLOGY, TECHNIQUES AND ALGORITHMS
    2.5.1 Overview and Positioning in BDV Reference Model
    2.5.2 Detailed Explanation of Technology, Techniques and Algorithms
      2.5.2.1 Technologies
      2.5.2.2 Techniques & Algorithms
  2.6 BIG DATA INFRASTRUCTURE
    2.6.1 Platform architecture
    2.6.2 Value view architecture
    2.6.3 Conceptual Architecture overview diagram
    2.6.4 Logical Architecture diagram
    2.6.5 Logical Architecture Detail
  2.7 ROADMAP
3 REPLICATION PILOT: PREDICTIVE HIGH SPEED NETWORK MAINTENANCE PILOT
  3.1 REQUIREMENTS
  3.2 OBJECTIVES
  3.3 USE CASES / SCENARIOS
    3.3.1 UC1: Prediction of the degradation of switch and crossing elements
      3.3.1.1 Objective
      3.3.1.2 Input data
      3.3.1.3 Work environment
      3.3.1.4 Relation with objectives and requirements
    3.3.2 UC2: Prediction of the degradation of track profiles
      3.3.2.1 Objective
      3.3.2.2 Input data
      3.3.2.3 Work environment
      3.3.2.4 Relation with objectives and requirements
    3.3.3 UC3: Prediction of the evolution of the slopes that are part of the railway infrastructure
      3.3.3.1 Objective
      3.3.3.2 Input data
      3.3.3.3 Work environment
      3.3.3.4 Relation with objectives and requirements
  3.4 DATA ASSETS
    3.4.1 Techniques to monitor the infrastructure
  3.5 BIG DATA TECHNOLOGY, TECHNIQUES AND ALGORITHMS
    3.5.1 Big Data Technology
      3.5.1.1 Conceptual model Indra Big Data platform
      3.5.1.2 Sofia2 Typical workflow
      3.5.1.3 Sofia2 Storage
      3.5.1.4 DataFlow (ETL module)
      3.5.1.5 Notebooks (Collaborative analytical)
      3.5.1.6 Machine Learning
      3.5.1.7 Dashboards
    3.5.2 Big Data Techniques and Algorithms
      3.5.2.1 Goal definition based on business knowledge
      3.5.2.2 Data preparation and management
      3.5.2.3 Creating analytic models
      3.5.2.4 Model Evaluation
      3.5.2.5 Deployment: Solution integration with the pilot
  3.6 POSITIONING OF PILOT SOLUTIONS IN BDVA REFERENCE MODEL
  3.7 BIG DATA INFRASTRUCTURE
    3.7.1 INDRA Cloud platform
    3.7.2 ADIF Railway Technology Centre Platform
  3.8 ROADMAP
    3.8.1 Objectives of each stage
    3.8.2 Deployment of each stage
    3.8.3 Data evolution of each stage
4 COMMONALITIES AND REPLICATION
  4.1 COMMON REQUIREMENTS AND ASPECTS
  4.2 ASPECTS OF REPLICATION
5 CONCLUSIONS

Table of Figures

Figure 1: Pilot Approach Diagram
Figure 2: Pilot objectives identification process
Figure 3: Sofia2 Big Data Platform used modules
Figure 4: Sofia2 Typical workflow
Figure 5: Sofia2 DataFlow module
Figure 6: Sofia2 DataFlow example diagram
Figure 7: Sofia2 DataFlow monitoring information
Figure 8: Sofia2 Notebook
Figure 9: Sofia2 Notebook Spark graphics
Figure 10: Sofia2 Notebook HIVE graphics
Figure 11: Sofia2 Dashboard types
Figure 12: Methodology based on CRISP-DM
Figure 13: Different kind of models
Figure 14: Algorithm type for each use case
Figure 15: Model creation process
Figure 16: Big Data Value Reference Model
Figure 17: Sofia2 Big Data Cloud Platform Infrastructure
Figure 18: Pilot infrastructure deployment and data evolution

Definitions, Acronyms and Abbreviations

Acronym Title

CO Confidential, only for members of the consortium (including Commission Services)

CR Change Request

D Demonstrator

DL Deliverable Leader

DM Dissemination Manager

DMS Document Management System

DoA Description of Action

Dx Deliverable (where x defines the deliverable identification number e.g. D1.1.1)

EIM Exploitation Innovation Manager

EU European Union

FM Financial Manager

MSx project Milestone (where x defines a project milestone e.g. MS3)

Mx Month (where x defines a project month e.g. M10)

O Other

P Prototype

PC Project Coordinator

PM partner Project Manager

PO Project Officer

PP Restricted to other programme participants (including the Commission Services)

PU Public

QA Quality Assurance

QAP Quality Assurance Plan

QFD Quality Function Deployment

QM Quality Manager

R Report

RE Restricted to a group specified by the consortium (including Commission Services)

RUP Rational Unified Process

STEP Standard Technology Evaluation Process

STM Scientific and Technical Manager

TL Task Leader

WP Work Package

WPL Work Package Leader

WPS Work Package Structure

Executive Summary

Big Data will have a profound economic and societal impact in the mobility and logistics sector, which is one of the most-used industries in the world, contributing approximately 15% of GDP. Big Data is expected to lead to 500 billion USD in value worldwide in the form of time and fuel savings, and savings of 380 megatons of CO2 in mobility and logistics. With freight transport activities projected to increase by 40% by 2030, transforming the current mobility and logistics processes to become significantly more efficient will have a profound impact. A 10% efficiency improvement may lead to EU cost savings of 100 BEUR. Despite these promises, only 19% of EU mobility and logistics companies employ Big Data solutions as part of value creation and business processes.

The Transforming Transport project will demonstrate, in a realistic, measurable and replicable way, the transformations that Big Data will bring to the mobility and logistics market. To this end, Transforming Transport validates the technical and economic viability of Big Data to reshape transport processes and services to significantly increase operational efficiency, deliver improved customer experience, and foster new business models.

Transforming Transport will address seven pilot domains of major importance for the mobility and logistics sector in Europe: (1) Smart Highways, (2) Sustainable Vehicle Fleets, (3) Proactive Rail Infrastructures, (4) Ports as Intelligent Logistics Hubs, (5) Efficient Air Transport, (6) Multi-modal Urban Mobility, (7) Dynamic Supply Chains. The Transforming Transport consortium combines knowledge and solutions of major European ICT and Big Data technology providers together with the competence and experience of key European industry players in the mobility and logistics domain.

1 Overall Motivation and Ambitions for Pilot Domain

1.1 Proactive Rail Infrastructures (WP6) Overview

The pilot will be delivered to overcome the following major barriers to moving to a predict-and-prevent rail maintenance approach. Rail infrastructures involve a complex supply chain between equipment manufacturers, maintainers and operators. Acquiring long-term performance (maintenance, fault) data, understanding the quality, accuracy and provenance of the largely unstructured data, processing it to identify emerging faults (diagnostics), and disseminating useful, timely prognosis information with known confidence levels for preventive and predictive maintenance have historically proved problematic.

This pilot will apply Big Data predictive and prescriptive analytics to a UK national rail route, to reduce the long-term cost of maintenance and increase network availability by facilitating focused short- and medium-term proactive interventions. The pilot will be run on historical and real-time data sources.

The objective of this pilot is to improve the reliability of high speed rail networks by optimising operators' performance and the maintenance of the rail infrastructure. The pilot will consist of the application of Big Data technologies to process the heterogeneous data collected, in order to understand the variables that have an impact on operators' performance and to model the nature of the maintenance incidents occurring in the infrastructure (tracks, tunnels, bridges, etc.), based on rail traffic, rolling-stock flows, maintenance data, planning and control data and other information sources.

This analysis will also include external variables such as weather forecasts or specific events (e.g. summer holidays) to anticipate the maintenance activities on the rail network and therefore to improve maintenance operations across the whole rail infrastructure. Moreover, this solution might also allow rail operators to predict in real time the impact of certain events on traffic management.

2 Initial Pilot: D6.1: Proactive Railway Infrastructure

2.1 Requirements

The high level requirement for this pilot is to provide insights and information to key decision makers such that maintenance of rail assets can be optimised to:

• Improve the safety of rail workers by minimising the time spent trackside.
• Improve the safety of rail passengers by identifying trends toward asset failure.
• Improve the reliability of services by minimising downtime and service interruptions.
• Improve cost efficiency by prioritising preventative maintenance on assets based on a number of factors.
• Improve service capacity by minimising disruption.

This pilot is intended to prove whether or not the application of ‘Big Data’ techniques can

help to achieve one or more of the above optimisations. The diagram below shows how the

project’s high level requirements cascade into the pilot’s requirements and subsequently

into the challenges being addressed.

High level requirements | Pilot requirements | Pilot challenges
Improve Safety | Reduction in frequency of attended maintenance | TBD% reduction in frequency of attended maintenance
Improve Reliability | Reduction in downtime for assets | TBD% reduction in downtime for assets
Improve Cost Efficiency | Reduction in maintenance based cost | TBD% reduction in maintenance based cost
Improve Capacity | Reduced number of cancellations and delays | TBD% reduced number of cancellations and delays

To meet these challenges, the pilot deliverable must answer the following questions:

• When will an asset fail?
• What trends can be identified based on historical data sources?
• What accuracy of data is required in order to produce a likely prediction?
• What relationships exist between:
  o Data sets? E.g. weather, forecast timetable, vehicle category etc.
  o Asset failures? E.g.
    - Model/manufacturer related
    - Dependency related
• Given a failure, can another cascading failure be predicted soon after?
• What is the cost incurred/saved by predicting (and subsequently preventing) failures?
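The cascading-failure question above can be explored, as a first step, with a simple count over a historical fault log. The following is a minimal sketch only, not the pilot's method: the record layout (asset identifier plus timestamp), the dependency map, the asset names and the 24-hour window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def cascade_rate(faults, depends_on, window=timedelta(hours=24)):
    """Fraction of all faults followed, within `window`, by a fault on a dependent asset.

    faults:     list of (asset_id, datetime) tuples (hypothetical record layout).
    depends_on: dict mapping asset_id -> set of asset_ids that depend on it
                (hypothetical; in practice this would come from asset data).
    """
    ordered = sorted(faults, key=lambda f: f[1])
    followed = 0
    for i, (asset, when) in enumerate(ordered):
        dependants = depends_on.get(asset, set())
        for other, later in ordered[i + 1:]:
            if later - when > window:
                break  # faults are time-ordered, so nothing later can qualify
            if other in dependants:
                followed += 1
                break  # count each triggering fault at most once
    return followed / len(ordered) if ordered else 0.0


# Illustrative data only: a track circuit (TC1) and two sets of points (PT1, PT2).
faults = [
    ("TC1", datetime(2017, 1, 1, 8)),
    ("PT1", datetime(2017, 1, 1, 10)),
    ("TC1", datetime(2017, 1, 5, 8)),
    ("PT2", datetime(2017, 1, 9, 8)),
]
depends_on = {"TC1": {"PT1"}}
print(cascade_rate(faults, depends_on))  # 0.25
```

A count of this kind only shows co-occurrence within a time window. Whether one failure actually causes the next is a separate question that needs asset specialist input.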

2.2 Objectives

The objective is to apply new Big Data technology and processes to facilitate a step change in maintenance across European rail infrastructures, in order to:

• Verify the quality, accuracy and provenance of asset data, leading to confidence to
• Provide timely, focused, prioritised maintenance activities (predict and prevent), leading to
• Improved reliability and availability of track-side assets, with
• Higher availability of rail infrastructure for passenger and freight services; and
• Enhanced worker safety through minimising track-side activities.

2.3 Use Cases / Scenarios

Industry experts working in conjunction with Network Rail have identified three use cases, covering the areas to which the majority of delays to mainline rail services in the UK and Europe have been attributed. By addressing these three significant contributors, it is hoped that data analytics can be used to reduce the delays and disruption experienced by users, increase the safety of both users and trackside staff, reduce costs incurred through both penalties and unnecessary maintenance, and improve capacity.

2.3.1 Use case 1

Name

Asset Maintenance and Renewal Analysis – Overhead Line Equipment (OLE)

Objective

Provide evidence to allow an Infraco to inform the OLE maintenance regime and renewal programme based upon measurable, quantifiable analysis of OLE condition and expected life.

Overview Description

Provide information to improve the maintenance and renewal schedules for Overhead Lines.

Intended Benefits

Improved quantifiable models for maintenance/renewal activities in order to:

- identify conditions known to impact upon the reliability and availability of the system

- minimise risk of disruption due to unexpected incidents (maintain uptime)

- demonstrate the effect of intervention (appropriate renewal)

- improve availability due to reduced OLE failure

Approach

Collect, prepare, analyse and visualise, using big data techniques, the factors influencing OLE maintenance and renewal. Analysis of OLE by:

- geo-spatial location (tunnels/cuttings/embankments/open)
- weather effects (corrosion due to humidity, temperature, precipitation, ice, wind, lightning etc.)
- usage (km, no. of passes; train type/frequency; passenger/freight; speed)
- power feed/voltage
- component/manufacturer, e.g. contact strips
- age; wear/remaining life
- stray current
- fault types (schedule 8)
- temporary speed restrictions (TSR)
- maintenance history and schedule
- [Pancheck data]
- [Video data]
- track geometry

Key Contacts / Regions

Steve Hooker – East Anglia
Paul Barnes – Great Western
Simon Taylor – systems integration; LNW; Virgin on-train data
Osman Maruf (via C. Lowe) – centre

Data Sources / Volumetrics

Requirement that the data sources are consistent and co-incident, i.e. cover the same area and time.

- Yellow (New Measurement) Train OLE and track geometry measurement data
- Pantograph instrumentation (passenger trains)
  o data and video analytics
- Overhead line maintenance and renewal history
- Fault (FMS) and Maintenance (Ellipse) records
- OLE asset data (Ellipse/OLE-EX)
- Environmental (historical precipitation, temperature)
- Usage (passenger/freight – type/loading; performance characteristics; electric/diesel) [timetable & movements]
- Pancheck data
- Schedule 8 payments (delay minutes/incident) including Temporary Speed Restrictions

[Other relevant data sources to be confirmed in visit to HQ data team via C. Lowe]

Standards

Relevant standards to be agreed.

Risks

1. Data alignment problems (different data sets collected from separate areas) – unable to confirm correlations

2. Analysis is not sufficiently complete to feed into Anglia OLE renewals process during Mid-2017 (opportunity cost to Network Rail)

3. Failure to confirm causality rather than correlations from big data techniques

Corroboration/ Confirmation of Progress and Results

Key stakeholders: S. Hooker; Forum: OLE-EX forum (M. Dobbs (LNW E&P); P. Doughty (Head Contact Systems); P. Barnes (Western E&P))

Repeatability

Demonstration that the results are applicable to at least one UK route, can be extended to several UK routes and are repeatable in European rail infrastructures.

Related Work

Relevant work and publications by:

- Network Rail – internal programme results – Project Insight
- Network Rail / University of Oxford (video analytics of pantograph)
- JR-East 2013 using train based laser inspection
- NeTIRail 2016 (H2020) University of Sheffield. ADS-Electronic Research (geo-spatial analysis)
- Railway Reliability Data Handbook (Network Rail)

Quick Wins

- Improve Temporary Speed Restriction process and reduce TSR
- Explore effect around Neutral Sections
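The requirement in this use case that data sources be "consistent and co-incident" (covering the same area and time) can be checked before any analysis starts. The sketch below is one possible way to do so, under stated assumptions: each data source is summarised by a hypothetical `Coverage` record holding a linear route section in kilometres and a date range. The dataset names and values are invented for illustration and do not describe the actual Network Rail data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Coverage:
    """Hypothetical summary of what one data source covers."""
    name: str
    start_km: float
    end_km: float
    first_day: date
    last_day: date


def common_coverage(datasets):
    """Return (start_km, end_km, first_day, last_day) shared by all datasets,
    or None if they do not overlap in both space and time."""
    if not datasets:
        return None
    start_km = max(d.start_km for d in datasets)
    end_km = min(d.end_km for d in datasets)
    first_day = max(d.first_day for d in datasets)
    last_day = min(d.last_day for d in datasets)
    if start_km >= end_km or first_day > last_day:
        return None
    return start_km, end_km, first_day, last_day


# Illustrative values only.
datasets = [
    Coverage("measurement_train", 0.0, 120.0, date(2015, 1, 1), date(2016, 12, 31)),
    Coverage("fault_records", 40.0, 200.0, date(2014, 6, 1), date(2016, 6, 30)),
    Coverage("weather", 0.0, 300.0, date(2010, 1, 1), date(2017, 1, 31)),
]
print(common_coverage(datasets))
```

If the function returns `None`, the first risk listed for this use case (data alignment problems) applies, and correlations between those sources cannot be confirmed.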

2.3.2 Use case 2

Name

Asset Failure Diagnosis and Prognosis for Points and Track Circuits

Objective

Assist Route based Flight Engineer and asset maintenance teams to quantifiably assess the risk of an imminent or future asset failure, based upon the asset’s as-is health state, known fault symptom diagnosis and expected further factors influencing the rail system’s operation (system-level prioritisation of maintenance interventions).

Overview Description

Determine from the real-time prognosis future risk curve whether it is necessary for maintenance to intervene in traffic, wait for engineering hours or inspect the asset at the next scheduled maintenance period. Provide clear, quantified, diagnostic information for the on-site maintenance team. Provide evidence to support exploitation of existing data sources, enabling improved decision making, improvement of maintenance regimes and prevention of asset failure. Focus on mechanical assets, primarily points operating equipment and electrical operation of track circuits. Analyse potential root causes of service affecting failures.

Intended Benefits

Improved quantifiable confidence (for maintenance action) in the earlier diagnosis of emerging and fully developed asset fault symptoms. Support improved decision making, improvement of maintenance regimes and prevention of asset failure. Provides evidence for the safe targeted maintenance of individual assets or groups of assets.

Approach

Undertake large scale, high volume, multi-sourced relevant (track and train) data set analysis to create validated algorithms for diagnostics and prognostics that could be deployed onto the operational railway. Leverage existing operational expertise to provide confidence in approach and outputs.


Key Contacts / Regions

P. Barnes – Western Route
S. Hooker – Anglia Route
Forum (Signalling function) to present, discuss, feed back and suggest further research. Organised through C. Lowe.


Data Sources / Volumetrics

Datasets are available for the approach:

- Diagnostics – Intelligent Infrastructure, FMS (fault) and historical weather data; minimum of one year of data across the Route assets.
- Prognosis – as a minimum, asset usage (movements, train type/freight vehicle), location, historical fault/maintenance records.
- Associated interlocking data sets.

Standards

Relevant standards to be agreed.

Risks

1. Extraction and correlation of Intelligent Infrastructure (II) symptom data with related FMS (fault) data may be difficult to achieve. Limited previous analysis in this area has produced positive correlation for fully emerged fault symptoms (not emerging ones). It has been identified that FMS entries may be of poor quality for this purpose.

2. The success of II in reducing on-the-day faults will have skewed some of the data, i.e. corrective maintenance has occurred before a fault occurred. This requires a more granular analysis against pre-II (control) conditions.

3. Measurement of delay savings is difficult to quantify accurately (prevention of event) and hence determination of the ultimate value of this approach could be seen as unreliable.

4. The Network Rail business and safety case for changes to the maintenance regime requires clear, rigorous evidence; hence the data collection and analysis process needs to be proven and reliable to a high standard throughout (including cross-correlation and the capability to deal with incomplete or inaccurate data).


Corroboration / Confirmation of Progress and Results

Anglia Route monitors point and track circuit assets using Intelligent Infrastructure. Target first time fix rate stated as 85%. Initial corroboration required is that similar levels of diagnostic accuracy can be achieved using automated techniques within reasonable timeframes. The next diagnostic checkpoint is to identify potential faults earlier using emerging symptoms (trends). Metrics for prognosis accuracy to be developed.

Repeatability

Each UK route uses Intelligent Infrastructure. Results on a single Route to be compared with different Routes with alternative operational challenges and mix of assets. Model to be extrapolated to European mainline railways.

Related Work

PCIPP prognosis outputs from University of Nottingham to be evaluated for use. Diagnostic algorithmic approaches from Thales TRT and PCIPP (University of Birmingham - UoB) to be considered. Other background research commissioned by Network Rail with the UoB.

Quick Wins

- Effect of temperature on operation
- Interaction of water with track circuits
- Correlation of earth leakage and track circuit operation
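The decision described in the overview of this use case (intervene in traffic, wait for engineering hours, or inspect at the next scheduled maintenance period) can be expressed as a threshold rule over the prognosis risk curve. The sketch below is illustrative only and rests on several assumptions not found in the source: the risk curve is given as points of (hours ahead, cumulative failure probability) with strictly increasing hours, values between points are linearly interpolated, and a single acceptable-risk threshold `max_risk` is used. The function names and all numbers are hypothetical.

```python
def risk_at(curve, hours):
    """Linearly interpolate the cumulative failure probability `hours` ahead.

    curve: list of (hours_ahead, probability), sorted with strictly
           increasing hours_ahead (assumed format).
    """
    if hours <= curve[0][0]:
        return curve[0][1]
    for (h0, p0), (h1, p1) in zip(curve, curve[1:]):
        if h0 <= hours <= h1:
            return p0 + (p1 - p0) * (hours - h0) / (h1 - h0)
    return curve[-1][1]  # beyond the last point: hold the final value


def recommend(curve, hours_to_engineering, hours_to_scheduled, max_risk=0.1):
    """Pick the latest intervention whose risk stays within `max_risk`."""
    if risk_at(curve, hours_to_engineering) > max_risk:
        return "intervene in traffic"
    if risk_at(curve, hours_to_scheduled) > max_risk:
        return "wait for engineering hours"
    return "inspect at next scheduled maintenance"


# Illustrative risk curve: 4% failure risk within 12 h, 30% within 72 h.
curve = [(0, 0.0), (12, 0.04), (72, 0.30)]
print(recommend(curve, hours_to_engineering=6, hours_to_scheduled=48))
# wait for engineering hours
```

In practice the threshold would need to be set per asset type and agreed with the maintenance teams, and the curve would carry confidence levels rather than a single probability.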

2.3.3 Use case 3

Name

Assessing the value of passenger-train monitoring for track maintenance

Objective

Improve asset management decisions based upon train and track measurements taken from passenger trains. Assess whether advanced analytics can increase the value of UGMS data in support of better track maintenance.

Overview Description

Provide information to improve the maintenance schedules for track-bed.

Intended Benefits

Prioritisation of maintenance focus, based upon clearer picture of current and future track conditions. More readily identify poor track quality and deteriorating conditions.

Approach

Collect, prepare, analyse and visualise, using big data techniques, the factors indicating changes in the state of the track. Use the availability of more frequent (but less accurate) track measurements to inform track maintenance decisions: the Yellow (New Measurement) Train provides data that is less frequent (but more accurate) than that available from passenger trains running over the same infrastructure. This could allow more rapid identification of, and response to:

- changing track bed conditions
- rail wear
- geometry deterioration


Paul Barnes - Great Western (Future, 2018, UGMS operation) Simon Taylor – LNE (Historical UGMS data) Brian Whitley and Patric Mak via C. Lowe


- Yellow (New Measurement) Train track geometry data
- UGMS (Unmanned Geometry Measurement Systems) from timetabled passenger trains
- Environmental (historical precipitation, temperature)
- Track geometry, material, age
- Tamping history
- Usage (passenger/freight – type/loading; performance characteristics) [timetable & movements]
- Eddy current data

Standards

Alignment with emerging T1010 recommendations and future RSSB open data standards.

ISO 13374.

Risks

1. Data alignment problems (different data sets collected from separate areas) – unable to confirm correlations

2. Access to data in a timely fashion

3. Unable to distinguish between root cause and symptoms

4. Insufficient asset specialist input into requirements and review of output


Correlation between NMT and UGMS / big data results. Forum contacts: Brian Whitney & Patric Mak via C. Lowe

Repeatability

Applicable to any UK route. No expectation that the approaches generated would be restricted to the UK.

Related Work

Note work done by:

1. University of Huddersfield, Institute of Railway Research (IRR), Omnicom and University of York in RAMP demonstrator (RSSB/Innovate UK)

2. University of Huddersfield, IRR, Siemens – Tracksure for under-track voids (RSSB funded)

3. Perpetuum track condition monitoring, South East Performance Improvement Project (real-time track condition monitoring using passenger train mounted accelerometers)

4. International track recording services, e.g. Mermec (Italy), Fugro (NL)

5. RSSB/T1010 RCM

6. TfL (Acton) big data analysis for trains

Quick Wins Early analysis results to support Western’s strategy in the use of UGMS.

2.4 Data Assets

ID | Use Case | Data Asset Name | Short Description | Expected Use | Initial Availability Date | Data Type | Link to Data Card (in basecamp)
1 | 1 & 2 | ELLIPSE | Information about dates and locations of maintenance and renewal of OLE and trackside equipment | To assess how points of maintenance and renewal affect asset condition and other measurements. | TBD | Suspected SQL Server | TBD
2 | 1 & 2 | FMS | Fault Management System. Information about equipment failures, their time, and their location | To assess relationship between failure and other data. | | | TBD
3 | 1 & 2 | ITPS/ACTRAFF | Integrated Train Planning System (Timetable). Information about the type and extent of asset usage. | To assess relationship between usage and condition of the asset. | | | TBD
4 | 1 & 2 | TRUST | Train movement reports. Information about asset failures and associated delay. | To assess relationship between failure and performance risk relative to other parts of the network. | | | TBD
5 | 1 | OLE_EX | OLE Heights, staggers, force, acceleration, GPS information. | To assess whether there is any correlation between measured data and failure events. To assess whether there is any correlation between expected wire wear and actual wire wear. To compare LADS and OLE-EX processed data. | | | TBD
6 | 1 | OLE_EX | OLE asset data such as structures/midspans/neutral section and location of OLE | To support identification of the location of problem | | | TBD
7 | 1 | LADS | OLE Heights, staggers and GPS information. | To assess whether there is any correlation between measured data and failure events. To assess whether there is any correlation between expected wire wear and actual wire wear. To compare LADS and OLE-EX processed data. | | | TBD
8 | 1 | 379 In-service OLE | TBC | To identify places where contact is lost and compare with other OLE monitored data and track geometry information. | | | TBD
9 | 1 | SSI | Train Movements, speeds, and other related information | To allow comparison between actual train movements, asset health, and faults raised | | | TBD
10 | 1 | NRWS | Information about air temperature. | To compare the air temperature to the actual temperature close to the OLE; if the OLE temperature is lower than the measured air temperature we can infer a threshold of "safe" performance. | | | TBD
11 | 1 | OLE temperature monitoring | Information about the temperature within the vicinity of the OLE. | To compare the air temperature to the actual temperature close to the OLE; if the OLE temperature is lower than the measured air temperature we can infer a threshold of "safe" performance. | | | TBD
12 | 1 | OLE ALP | Information about how the OLE degrades over time. | To compare the "theoretical" life of the OLE with the "actual" life of the OLE based on usage. | | | TBD
13 | 1 | Network Model | SRS for each 5 chain length | To enable the performance risk to be managed. | | | TBD
14 | 1 & 3 | Yellow Train Track Geometry Data | High quality OLE, Track Gauge and ride quality measurements | To allow trends to be identified between OLE and other Geometrical outliers and faults raised. | | | TBD
15 | 2 | Intelligent Infrastructure (II) | The II system contains asset monitoring information such as detailed point swing readings | To allow trends to be identified between asset types, makes, models etc. in conjunction with FMS, weather and other data sources | TBD | Wonderware InData database, exposed through SQL Server connector | TBD
116 | 1, 2, & 3 | MetDesk | Historical UK weather readings local to track assets going back over the last 10 years | To allow trends between the weather and asset failures to be identified | | | TBD
117 | 1, 2, & 3 | Earthworks Failure | 10 years of approx. 1500 events | To allow trends between weather, earthworks, and asset failures/health changes to be identified | | | TBD

Funded by the European Union’s H2020 GA - 731932

2.5 Big Data Technology, Techniques and Algorithms

2.5.1 Overview and Positioning in BDV Reference Model

Data Visualization and User interaction

The form of the output will be determined by the value and output of the prognostic/predictive algorithms and models. Results may be presented, for example, as 2D or 3D representations, as required.


Version 1.0, 15/03/2017

Data Analytics

Data analytics will be based around a process described as follows:

Data Processing Architectures

Data will likely need to be processed in multiple ways to answer the questions presented, including batch and streaming (real-time) processing. The programme will use big data processing tools on the cloud platform to accomplish these goals, e.g. U-SQL, Pig, Kafka and MapReduce.

Data Management

Data will be collected, handled, and processed in accordance with the work package data

management plan.

Existing

The programme will be consuming existing data provided by the data providers and will be

hosted on a managed cloud environment.


2.5.2 Detailed Explanation of Technology, Techniques and Algorithms

2.5.2.1 Technologies

The technologies chosen for this work package will mostly come from the Hadoop ecosystem, the de facto standard for big data analytical processing. Thales will use the Microsoft Azure service as a platform, which is based upon Hadoop. The Azure services are fully cloud based, with a minimum supported SLA of 99.9% for all components, and provide a scalable, fully hosted solution with built-in security features.

Key technologies to be used:

Storage & Analytics

Azure Data Lake – HDFS backed distributed storage with potentially limitless storage and

scalable performance.

HBase or Cassandra – Distributed multidimensional column-oriented databases, capable of linear scaling and of storing and retrieving petabytes of data extremely quickly.

Data Lake Analytics – MapReduce based jobs capable of performing analytical

calculations upon huge quantities of schema-less data

SQL Server – RDBMS developed by Microsoft, designed to store relational data

Azure Machine Learning – Apache Mahout based Machine Learning service. Used to

automatically categorise and suggest relations based upon a training set of data

Languages

R – Statistical and graphical software library which is ideal for manipulation of data and

creating machine learning models.

C# .Net – Mature and widely used C based language under the Microsoft .Net family of

technologies.

Testing, Development & Management

SpecFlow – Automated test tool that binds requirements to functionality written in .Net

languages

Visual Studio Team Services – Tools for providing Agile working practices, code source

control, and continuous integration functionality.

PivotalTracker – Task management tool

Messaging, Orchestration & Service Bus Technologies

AMQP compatible message queues – open standard for message queue services

Azure Event Hub – Telemetry ingestion from website, apps, or streams of data

Azure Notification Hub – Allows push notifications to be sent to any platform

https://azure.microsoft.com/en-gb/solutions/data-lake/

http://hbase.apache.org/

https://azure.microsoft.com/en-gb/services/data-lake-analytics/

https://azure.microsoft.com/en-gb/services/sql-database/

https://azure.microsoft.com/en-gb/services/sql-database/

https://www.r-project.org/

https://msdn.microsoft.com/en-us/library/kx37x362.aspx

http://specflow.org/

https://www.visualstudio.com/team-services/

https://www.pivotaltracker.com/

https://www.amqp.org/

https://azure.microsoft.com/en-gb/services/event-hubs/

https://azure.microsoft.com/en-gb/services/notification-hubs/


Azure Data Factory – Composition and orchestration of data services in Azure

UI

Microsoft PowerBI – Interactive Data Visualisation Tools

Universal Windows Platform – Allows single user interface over all devices running

Windows 10

2.5.2.2 Techniques & Algorithms

Supervised machine learning algorithms have the potential to process huge quantities of data

to identify trends and categorise data on a previously impossible scale, and in near real-time.

Models are created using between 70% and 80% of a known dataset and then validated using

the remaining 20% to 30%.
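The hold-out validation described above can be sketched as follows. This is a minimal illustration only, assuming a 75% training fraction (within the 70% to 80% range quoted); the function name `split_dataset`, the fixed random seed and the synthetic labelled records are hypothetical and are not part of the pilot design.

```python
import random


def split_dataset(records, train_fraction=0.75, seed=42):
    """Shuffle a labelled dataset and split it into training and validation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled examples: (feature value, known state).
dataset = [(i, "fault" if i % 5 == 0 else "ok") for i in range(100)]
train, validation = split_dataset(dataset, train_fraction=0.75)
```

The model would be fitted on `train` only, and its accuracy reported on `validation`, so that the reported figure reflects data the model has not seen.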

MapReduce was developed by Google for processing very large amounts of schema-less data in a highly parallel architecture. Google published the model in 2004, open-source implementations followed (most notably Apache Hadoop), and it is now a key part of the ‘Big Data’ landscape. MapReduce functionality is a major component of the Hadoop ecosystem. The concept itself is very simple, but it is capable of very complicated aggregation and what-if style processing. MapReduce jobs are likely to be a very prominent part of this work package.
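To make the MapReduce concept concrete, the sketch below runs the three phases (map, shuffle, reduce) in a single Python process. It is an illustration of the programming model only, not of Hadoop or of Azure Data Lake Analytics; the record layout and the function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit (key, value) pairs; here one count per asset type in a fault record."""
    yield record["asset_type"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(records):
    """Run map, shuffle and reduce sequentially over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical fault records.
faults = [
    {"asset_type": "points"},
    {"asset_type": "track circuit"},
    {"asset_type": "points"},
]
fault_counts = run_job(faults)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data between them; the functions themselves stay this simple.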

Traditional star schema data warehouses were designed to allow large amounts of data to be aggregated very quickly. They are flexible, widely understood and easily ingested by a huge range of reporting tools. In contrast, multi-dimensional columnar NoSQL stores are designed to hold more than hundreds of millions of rows of data, but are quite specific in how data can be accessed. Their performance scales linearly and, while they do not offer full ACID guarantees, they generally offer tuneable settings to meet or more closely approach the requirements. It is highly likely that both a star schema warehouse and a columnar database will be required. It is also likely that a traditional RDBMS will be required for comparatively small amounts (fewer than 10 million rows) of relational data.
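As a minimal sketch of the star schema idea, the example below uses Python's built-in `sqlite3` module as a stand-in for the warehouse. The table and column names (`dim_asset`, `dim_date`, `fact_fault`, `delay_minutes`) and the sample rows are hypothetical; they only illustrate one fact table joined to its dimensions for aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the context of each measurement.
    CREATE TABLE dim_asset (asset_id INTEGER PRIMARY KEY, asset_type TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table holds the measures, keyed by the dimensions.
    CREATE TABLE fact_fault (
        asset_id INTEGER REFERENCES dim_asset(asset_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        delay_minutes INTEGER
    );
    INSERT INTO dim_asset VALUES (1, 'points'), (2, 'track circuit');
    INSERT INTO dim_date VALUES (1, 2014), (2, 2015);
    INSERT INTO fact_fault VALUES (1, 1, 30), (1, 2, 10), (2, 2, 45), (1, 2, 5);
    """
)

# Aggregate delay by asset type and year by joining the fact table to its dimensions.
rows = conn.execute(
    """
    SELECT a.asset_type, d.year, SUM(f.delay_minutes) AS total_delay
    FROM fact_fault AS f
    JOIN dim_asset AS a ON a.asset_id = f.asset_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY a.asset_type, d.year
    ORDER BY a.asset_type, d.year
    """
).fetchall()
```

Reporting tools issue queries of this shape, which is why the star schema remains convenient for KPI and historical reporting even when the bulk data lives elsewhere.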

https://powerbi.microsoft.com/en-us/

https://developer.microsoft.com/en-us/windows/apps/getstarted


2.6 Big Data Infrastructure

Note: The architectural diagrams in this section may change and evolve as more information becomes apparent through the research and progress of this work package.

2.6.1 Platform architecture

The image below presents a conceptual platform architecture. Data ingested on the left becomes far more useful and actionable as more intelligence is applied, resulting in information to back business decisions.


2.6.2 Value view architecture

The diagram below illustrates how value grows as data passes through each phase of the data

manipulation process.

[Figure: Thales Analytics Platform, Value View. Elements shown: Thales HSM Key Store; TT Secure Storage (Data Lake); Data Factory (data fetch); Schema/Data Processor (raw data processing of binary and text data, input and output); Store Processed Data; Asset Data; Insights; Asset: State; ML: Anomaly Detection; Asset: Health Assessment; ML: Failure Classifier; Asset: Prediction (RuL & confidence); ML: RuL Prediction; Reports. The numbered elements are described in the list that follows.]

0. Thales HSM key store – The Thales HSM key store is a military-grade Hardware Security Module which is used to store encryption keys.

1. Customer / Data owner – Customers/data owners hold the key to their data and store it

in a Thales Hardware Security Module. This enables the customer/data owner to take

their key at any point and render all encrypted data useless. This way the customer/data

owner is always in control of their own data.

2. Data store – The customer / data owner uploads their data as a one-off bulk upload, in regular batches, or by streaming. The data at this point is raw and will never be modified from its original state. Because the data is write-once, the accuracy of predictions can be traced back to the original data.

3. Data factory – Data factory is an orchestration tool used to transport and manipulate

data.

4. Schema/data processor – Data will be uploaded in its raw format. This will include many

different forms such as proprietary binary formats, XML, JSON, XLS, CSV, etc. The data

processor manipulates the raw format into a generic canonical format which can be

ingested by subsequent steps.

5. Asset Data – This is data purely about an asset. This is unmodified, but canonicalised

data.

6. Insights - Using the asset data humans can draw insights from viewing the ‘raw’ data in

various graphs and tables.

7. Asset State – Asset States are used to state definitively whether an asset is in a particular state or not. States might be “Working” and “Broken”, or could be more specific, such as “Brush wear fault”, “Obstruction fault”, “Over voltage fault”.


8. Machine learning anomaly detection – Using known data, models can be created, similar to a set of complex rules, via a process known as Machine Learning. Machines can then be used to categorise new data into one of the learnt categories. The output is the known state assigned to the data, together with a degree of certainty expressing the machine’s confidence in its decision.

9. Asset Health Assessment – States describe the present; they do not help to predict or influence the future. Health assessments, however, typically provide a percentage indicating how healthy an asset is. This can be used by a user to influence maintenance patterns. It is not, however, used to predict the future, i.e. no knowledge of future events is taken into account, only past events.

10. Machine learning failure classifier – Using past knowledge of failures machine learning

can produce health assessments.

11. Asset prediction Remaining Useful Life – By taking into account planned and forecast conditions and events, and by knowing how these events have historically affected the health of an asset, a figure of remaining useful life can be determined. This figure can then be used to better plan and prioritise maintenance.

12. Machine learning Remaining Useful Life – Machine learning can again be used to produce these valuable figures. By comparing the upcoming planned and forecast events, the models are able to produce figures that estimate a remaining useful life duration.

13. Reports – This is the critical part of the system. If the information isn’t presented in a

useful fashion then the user won’t be able to take effective action. This may be

presented as graphs, tables, etc.
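The step from a health assessment (item 9) to a remaining useful life figure (item 11) can be illustrated with a deliberately simple, non-machine-learning stand-in. The sketch assumes that health is lost in proportion to load, that the loss rate can be estimated from past observations, and that an asset is considered failed below a 20% health threshold; all three assumptions, and every name in the code, are hypothetical and are not taken from the pilot design.

```python
def degradation_per_unit_load(history):
    """Estimate health lost per unit of load from past (load, health_drop) observations."""
    total_load = sum(load for load, _ in history)
    total_drop = sum(drop for _, drop in history)
    if total_load <= 0:
        raise ValueError("history must contain positive load")
    return total_drop / total_load


def remaining_useful_life(current_health, planned_daily_load, rate, failure_threshold=20.0):
    """Return how many planned days the asset absorbs before its health falls
    below the threshold, or None if the plan ends before that happens."""
    health = current_health
    for day, load in enumerate(planned_daily_load):
        # Apply the planned load for this day using the historical loss rate.
        health -= rate * load
        if health < failure_threshold:
            return day
    return None


# Hypothetical history: (load units, percentage points of health lost).
rate = degradation_per_unit_load([(100, 2.0), (300, 6.0)])
rul_days = remaining_useful_life(30.0, [100, 100, 200, 200, 100], rate)
```

A learnt model would replace the fixed rate with a prediction that also depends on forecast conditions such as weather, and would return a confidence alongside the figure.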

2.6.3 Conceptual Architecture overview diagram

[Figure: conceptual architecture overview. Labels shown: WP6, HMI, Users, Data Streams, Data Files, External Input Interfaces, External Output Interfaces.]


2.6.4 Logical Architecture diagram

[Figure: logical architecture of WP6. Inputs shown: Data Streams and Data Files, via the External Input Interfaces. Stages shown, in order: Data Acquisition, Data Manipulation, State Detection, Health Assessment, Prognosis Assessment. Other labels: Canonical data; Alerts; % Health Used / Remaining; X days remaining; HMI; Users; External Output Interfaces.]


2.6.5 Logical Architecture Detail

[Figure: logical architecture detail of WP6. The recoverable labels and annotations are listed below.]

Key: WP6; Machine Learnt Intelligence; HMI & Reporting; Data repository; Ingestion.

Stages, between the External Input Interfaces and the External Output Interfaces: Data Acquisition, Data Manipulation, State Detection, Health Assessment, Prognosis Assessment.

- Input Files: Batch Binary; Batch Text; Streaming JSON/XML.
- Queue and Aggregator: groups individual messages into files of multiple messages. Annotations: Save bundle files; Pass to Parser to be ‘canonicalised’; Consumed by; Streaming analytics.
- Raw Data (Save / move; Persist Data): zipped files containing raw data on cheap storage, likely never to be read again. Stored in folder structures as /Source/YYYY/MM/DD
- Parser and Grey Data (Persist Data): canonical JSON for the entire document/message. Data is appended/split where necessary into few, large hourly files. Stored in folder structures as /Source/YYYY/MM/DD/YYYYMMDD_HH.JSON
- Batched Golden Data Generation and Golden Data (Persist Data): Star Schema DW; Time Series Multi Dimensional; Documents; Key Value; etc. as appropriate.
- Big Data Analytics: Map Reduce Job; Statistical Analysis Job. Annotations: Consumes Canonical Data; Consumes Golden Set Data.
- Reporting: Near real-time reporting; KPI, historical, batch reporting; One Off Reports; Recurring Reports.
- Machine Learning - Model Creation. Annotation: New Model.
- Machine Categorisation & Trending - Model Usage. Annotations: Batched data to categorise & extrapolate; Near real-time data to categorise & extrapolate; Categorised / trend prediction information.
- Unsupervised Machine Learning - Model Update. Annotations: Near real-time data to update Machine Learnt Model; Batched data to update Machine Learnt Model; Updated model.
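The hourly file layout noted in the logical architecture detail (/Source/YYYY/MM/DD/YYYYMMDD_HH.JSON) and the aggregator step that groups individual messages into files can be sketched as below. This is a sketch under stated assumptions: each message is assumed to carry an ISO 8601 `timestamp` field, the source name "II" is used only as an example, and the function names are hypothetical. Nothing is written to storage; the bundles are returned in memory.

```python
import json
from collections import defaultdict
from datetime import datetime


def hourly_path(source, timestamp):
    """Build /Source/YYYY/MM/DD/YYYYMMDD_HH.JSON for a message timestamp."""
    return f"/{source}/{timestamp:%Y/%m/%d}/{timestamp:%Y%m%d_%H}.JSON"


def bundle_by_hour(source, messages):
    """Group individual messages into one bundle per hourly file path."""
    bundles = defaultdict(list)
    for message in messages:
        ts = datetime.fromisoformat(message["timestamp"])
        # Each message becomes one serialised JSON line in its hourly bundle.
        bundles[hourly_path(source, ts)].append(json.dumps(message, sort_keys=True))
    return dict(bundles)


# Hypothetical streamed messages.
messages = [
    {"timestamp": "2015-06-01T13:05:00", "asset": "P101", "value": 1.2},
    {"timestamp": "2015-06-01T13:40:00", "asset": "P101", "value": 1.3},
    {"timestamp": "2015-06-01T14:01:00", "asset": "P102", "value": 0.9},
]
bundles = bundle_by_hour("II", messages)
```

Grouping into few, large files in this way keeps the number of objects in distributed storage small, which is generally friendlier to batch analytics than many tiny files.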


2.7 Roadmap

Stage | Delivery Date (Project Month) | Features / Objectives Addressed | Embedding in Productive Environment | Big Data Infrastructure Used | Scale of Data
S1: Technology Validation | 9 | Validate the quality of the input data obtained. Confirm the feasibility of the initial objectives | Microsoft Azure Cloud environment – development resource group | Microsoft Azure Cloud environment | 2013 – 2015, Milton Keynes to Waterloo, partial number of data sets
S2: Large-scale experimentation and demonstration | 18 | Validate the quality of the developed algorithms and determine their reliability | Microsoft Azure Cloud environment – ‘Production’, high power, resource group | | 2013 – 2015, Milton Keynes to Waterloo, all data sets
S3: In-situ trials | 27 | Validate the quality of the developed algorithms and determine their reliability | Microsoft Azure Cloud environment – ‘Production’, high power, resource group | | 2013 – 2015, Milton Keynes to Waterloo, all data sets


3 Replication Pilot: Predictive High Speed Network Maintenance Pilot

Several maintenance strategies can currently be applied to a railway infrastructure. Traditionally, two main approaches have been considered:

a) Preventive maintenance is maintenance carried out according to prescribed criteria, with the objective of reducing the probability of failure or the degradation of elements.

There are different types:

- Systematic and periodic.
- Conditional, triggered when an established limit is exceeded or by the wear of the systems.

Preventive periods are defined by three factors:

- Reliability of the materials.
- Characteristics of their functionality.
- Level of degradation, which depends on time and on the number and speed of circulations.

Three types of preventive maintenance can be defined:

- Scheduled preventive maintenance, where reviews are carried out at fixed time intervals.
- Predictive maintenance, based on equipment reliability. It tries to determine when repairs must be made by means of monitoring that establishes the maximum period of use before repair.
- Opportunity maintenance, which is carried out during periods of non-use, thus avoiding stopping railway operation while the assets are in use.

b) Corrective maintenance is based on identifying and rectifying breakdowns that have already happened. This type of maintenance is carried out after an incident in order to return the affected element to its original status.

There are two types of corrective maintenance:

- Programmed: the works are carried out during the scheduled maintenance time.
- Not programmed.

Two types of intervention can be considered:

- Palliative: an emergency, non-definitive repair, normally needed because the service must be restored quickly.
- Solving: the repair is carried out definitively.


However, the sector needs an effective optimization of the whole process, in line with the requirements and the rising competitiveness of the infrastructure market. It is therefore necessary to develop new strategies based on real data and on the specific conditions of the infrastructure location.

Taking into account the existing needs, it has been established that Predictive Maintenance Strategies can provide the framework necessary to reach the optimization objectives required by the sector. In this context, Big Data technologies are essential to provide the platform for developing accurate models and to support the new strategy to be implemented.

Our pilot focuses on predictive maintenance as one of the preventive maintenance types. Predictive maintenance in the railway domain is the set of programmable activities designed to maintain the track geometry quality and to ensure the proper operation of the elements of the superstructure.

Some of the objectives of predictive maintenance are:

- To detect causes that can induce problems.
- To decide the suitable moment to review or replace an element.
- To reduce the unavailability periods of an element.

Predictive maintenance is based on:

- The selection of specific parameters to be measured.
- The evaluation of the range of admissible values.
- The devices and the control method used to measure each parameter.
- The definition of the measurement frequency for each parameter.
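The four elements above (parameter, admissible range, measurement method and measurement frequency) map naturally onto a small data structure. The sketch below is illustrative only: the parameter name, its limits, the 30-day interval and the readings are hypothetical values chosen for the example and are not taken from any standard or from the pilot data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredParameter:
    """One measured parameter with its admissible range and measurement frequency."""
    name: str
    unit: str
    min_admissible: float
    max_admissible: float
    measurement_interval_days: int

    def is_admissible(self, value):
        return self.min_admissible <= value <= self.max_admissible


def out_of_range(parameter, measurements):
    """Return the (day, value) measurements that fall outside the admissible range."""
    return [(day, v) for day, v in measurements if not parameter.is_admissible(v)]


# Hypothetical parameter, limits and readings taken every 30 days.
gauge = MonitoredParameter("track gauge deviation", "mm", -3.0, 5.0, 30)
readings = [(0, 1.2), (30, 3.8), (60, 5.6)]
alerts = out_of_range(gauge, readings)
```

A predictive approach would go one step further than this range check and estimate, from the trend of the readings, when the limit is likely to be exceeded.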

Additionally, a predictive approach poses a further challenge for Big Data technologies because of the difficulty of foreseeing future situations and/or extreme events. Climate change is a clear example: its effects over the whole lifecycle of the infrastructure are currently quite difficult to determine (qualitatively and/or quantitatively).

The benefits that a predictive maintenance strategy can provide are summarized as follows:

1. Extend asset life.
2. Reduce the costs of maintenance (scheduled and unscheduled).
3. Predictive models to forecast (and plan accordingly) maintenance activities.
4. Increase warranty recoveries.
5. Better knowledge of the real causes of early failure.
6. Downtime reduction (people and machinery).
7. Integrated platform to manage a vast amount of information.
8. Provide decision-making support.
9. GHG reduction.
10. Development of effective climate change adaptation strategies.

In a parallel effort, a series of needs identified by various professionals has been collected. Of particular interest are the results presented in the technical paper “Railway infrastructure maintenance - a survey of planning problems and conducted research” by Linköping University, Department of Science and Technology, Norrköping SE-601 74, Sweden.

The main outcome of this paper is an identification of problems and improvement opportunities related to railway maintenance. In summary, these opportunities are listed as follows:

Strategic opportunities:
o Maintenance dimensioning
o Maintenance contract design
o Maintenance resource dimensioning and localization
o Service life and maintenance frequency determination
o Network design considering maintenance
o Renewal scheduling and project planning

Operational opportunities:
o Possession scheduling
o Maintenance vehicle and team routing
o Rescheduling
o Deterioration-based maintenance scheduling
o Maintenance vehicle routing and team scheduling
o Maintenance project planning
o Work timing and resource scheduling
o Track usage planning

Although the opportunities listed above are more specific and operationally focused, they are well aligned with the objectives that a predictive strategy implies. This is relevant for ensuring the replicability of the pilot results not only in a national context but also in a European framework.

Taking into account all these requirements and characteristics, the following figure shows a general diagram of the approach proposed for the Replication Pilot of WP6.


Figure 1 Pilot Approach Diagram

The following sections describe in detail the initial design of the pilot, indicating the parameters to be taken into consideration.

3.1 Requirements

The main high-level requirement for this pilot is to optimize the maintenance tasks of the high speed railway infrastructure. Given the breadth of this initial requirement, this chapter focuses on describing the optimization to be achieved in the pilot, indicating which aspects of infrastructure maintenance will be addressed and why they were chosen.

Within the maintenance of the railway infrastructure, the main concerns, and therefore requirements, of the railway infrastructure managers and of the companies responsible for carrying out the maintenance tasks are:

1. Infrastructure security: have a railway infrastructure that is at all times in a condition that ensures the level of security for which it was designed.

2. Safety of the maintenance personnel: all interventions and work performed as part of the maintenance of the railway infrastructure must be designed and controlled in a manner that ensures the safety of the maintenance staff.

3. High availability: increase infrastructure availability.

4. Cost optimization: reduce the cost of maintaining the infrastructure, since this is one of the aspects that allows the train to be competitive against other transport modes.

In order to meet these high-level requirements, the following specific requirements can be

established for the pilot:

ID Requirement description

REQ1 Minimizing the number of maintenance interventions

REQ2 Minimization of times required for each maintenance work

REQ3 Minimization of elements or assets requiring maintenance

REQ4 Optimization of maintenance activities that involve nearby elements and have to be performed close together in time

REQ5 Migrate the maintenance philosophy from periodic maintenance to intelligent maintenance based on predictive information, passing through the current maintenance model of “Condition based maintenance” or “by state”

The basis of this current maintenance model is to maintain continuous intensive monitoring of

the elements and track geometry. The next step for the maintenance activities is to be based on

predictions of the evolution of infrastructure wear, which is one of the pilot requirements.

This pilot is expected to achieve the following improvements in the management of maintenance tasks on the high speed rail network:

- Reduction of maintenance cost by up to 12%
- Increase in network availability by up to 20%
- Improvement of maintenance efficiency by reducing by 10% the number of maintenance works due to element failures

[Figure: the four high-level requirements: Cost optimization, Safety for maintenance staff, Infrastructure security, High availability.]

The following figure shows the process of identifying the pilot's objectives.

Figure 2 Pilot objectives identification process

To meet these challenges, the pilot must provide useful information to answer the following questions appropriately:

- When will an element of the infrastructure fail?
- How accurate are the predictions made by the pilot?
- What elements will be affected by the expected failure?
- What is the relationship between failures?
- In the event of a fault, is another fault expected to occur nearby?
- What is the cost (time and money) of each of the actions necessary to anticipate a possible failure?
- What is the cost (time and money) of each of the actions necessary to solve a failure?
- What is the cost (time and money) of the work necessary to solve a failure when it occurs?

3.2 Objectives

The main objective is to develop a pilot that obtains, analyses and links all the information currently available on the operation and maintenance of the high-speed rail network, both historical and real-time, applying massive data analysis methodologies. The aim is to extract the knowledge needed to predict accurately how the wear of the infrastructure elements will evolve, and to use that prediction to optimize the maintenance works on the high speed railway infrastructure.

The pilot is intended to provide a tool that supplies useful information for decision making, allowing possible problems or failures in the railway infrastructure to be anticipated. In this way, greater infrastructure availability than at present can be reached at a lower cost, while maintaining the high level of security required.

Therefore, the specific objectives of the pilot are:

ID Objective description

OBJ1 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the degradation of switches and crossings of the high speed lines throughout its useful life

OBJ2 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the degradation of track profiles of the high speed lines throughout its useful life

OBJ3 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the topographic characteristics of the slopes that are part of the railway infrastructure

According to the standard EN 13306:2011, maintenance is the combination of all technical,

administrative and management actions during the life cycle of an item intended to retain it in,

or restore it to a state in which it can perform the required function.

All the elements of the track, such as the materials that make it up and the geometric parameters that relate them to each other, wear out due to the effects of atmospheric agents and of the vehicles running on them. For them to continue performing their functions, a set of actions has to be carried out to ensure the quality of the route in relation to the needs of the traffic.

The maintenance tasks are aimed at ensuring the safety of train circulation, achieving the highest possible degree of comfort for travellers and maintaining the regularity indices that characterize the trains on each track.

During the first phases of the pilot's execution, three aspects of the railway infrastructure will be studied in order to optimize its maintenance:

Switch and crossing elements

Track profiles

Slopes close to tracks

These three aspects of the railway infrastructure have been chosen as pilot study elements for the following reasons:

They are the elements that concentrate the majority of the maintenance works of the high speed lines.

The maintenance cost associated with these elements is high.

The maintenance of these elements has a large margin for improvement and therefore a large margin for cost reduction.

Each maintenance work to correct a degradation of these elements requires heavy machinery that must travel to the fault point. Moving this kind of machinery takes a lot of time and has an impact on regular train traffic.

All these maintenance works require long execution times to resolve the incidents, which reduces the availability of the railway infrastructure even during off-peak hours.

A lot of information is available on these aspects, regarding both the evolution of their useful life and the incidents that have occurred in the past.

Historical data on these elements exist, and periodically updated information on them is available.

Economic cost data exist for each maintenance task associated with these elements.

The maintenance of these elements is currently performed periodically or when a failure is detected, so they fit the objective of performing predictive maintenance.

The quality of the information related to these elements is high.

The sources of information available are very diverse.

A priori, there are common sources of information that can be useful to detect the degradation of these elements.

The study of the maintenance of these elements and of their degradation will make it possible to estimate the cost savings and the increase in the availability of the railway network.

During the next steps of the pilot, these aspects of the infrastructure will be analysed in order to determine whether it will be possible to predict the wear of the infrastructure elements studied in the pilot. This decision will be made taking into account several aspects, such as:

Data quality

Data availability

Data volume

Relationship between all available data

Margin of improvement

Possibility to get new measures in the future

3.3 Use Cases / Scenarios

The following use cases will be realized during the execution of the pilot:


3.3.1 UC1: Prediction of the degradation of switch and crossing elements

3.3.1.1 Objective

The result of this use case will be a prediction of the wear of switch and crossing elements and a comparison between the cost of the current maintenance activities and the cost of the maintenance activities designed using the pilot results.

3.3.1.2 Input data

For this scenario it will be necessary to use at least the following information:

Historical data of switch and crossing state

Historical reports about maintenance of switch and crossing elements

Relation between switch and crossing elements faults and maintenance activities

Measures about the current state of the switch and crossing elements

Train traffic information

Historical weather conditions

Maintenance activities to recover the optimal state of the switch and crossing elements

Resources for each type of maintenance work

Cost of each type of maintenance work

3.3.1.3 Work environment

The main activities of this use case will be to analyse, use and link all the available railway infrastructure data (operational, environmental and maintenance information) in order to make an accurate prediction of the degradation of the switch and crossing elements installed on the High Speed Lines.

The main steps to fulfil the objectives are:

1. Select specific switches and crossings along the pilot line

2. Study the failures that occurred in the past with these selected items

3. Collect all the available data related to each selected item

4. Execute the big data algorithm to predict the wear of these selected items

5. Compare the results to determine the degree of reliability of the predictions made by the pilot

6. Carry out a cost analysis to detect the improvements achieved with the pilot
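Steps 4 to 6 can be illustrated with a minimal sketch. The deliverable does not specify the prediction algorithm, so an ordinary least-squares trend line stands in for it here purely for illustration. The wear readings, the intervention limit, the intervention counts and the unit cost are all invented values, and the function names are hypothetical.

```python
def fit_trend(days, wear):
    """Least-squares line wear = slope * day + intercept (stand-in for step 4)."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(wear) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, wear))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predictions and measurements (step 5)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Invented wear readings (mm) of one selected switch, by day of measurement.
history_days = [0, 30, 60, 90]
history_wear = [0.0, 0.5, 1.0, 1.5]
slope, intercept = fit_trend(history_days, history_wear)

# Step 5: compare predictions with later measurements not used for fitting.
check_days = [120, 150]
check_wear = [2.1, 2.4]
predicted = [slope * d + intercept for d in check_days]
mae = mean_absolute_error(predicted, check_wear)

# Day on which the trend reaches an assumed intervention limit of 3.0 mm.
wear_limit = 3.0
day_limit = (wear_limit - intercept) / slope

# Step 6: invented yearly costs, periodic plan versus prediction-driven plan.
cost_per_intervention = 1000.0
periodic_cost = 12 * cost_per_intervention
predictive_cost = 8 * cost_per_intervention
saving = periodic_cost - predictive_cost

print(f"MAE {mae:.2f} mm, limit reached around day {day_limit:.0f}, saving {saving:.0f}")
```

The held-out comparison in step 5 is the part that matters: the reliability of a prediction can only be judged against measurements that were not used to build it.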


3.3.1.4 Relation with objectives and requirements

The following table shows the relation of this use case with the pilot objectives and the requirements.

Use case | Objective | Requirement
UC1 | OBJ1 | REQ1, REQ2, REQ3, REQ4, REQ5

3.3.2 UC2: Prediction of the degradation of track profiles

3.3.2.1 Objective

The result of this use case will be a prediction of the wear of track profiles and a comparison between the cost of the current maintenance activities and the cost of the maintenance activities designed using the pilot results.

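3.3.2.2 Input data

For this scenario it will be necessary to use at least the following information:

Historical data of track profile state

Historical reports about maintenance of track profile

Relation between track profile faults and maintenance activities

Measures about the current state of the track profile

Train traffic information

Historical weather conditions

Maintenance activities to recover the optimal state of the track profile

Resources for each type of maintenance work

Cost of each type of maintenance work

3.3.2.3 Work environment

The main activities of this use case will be to analyse, use and link all the available railway infrastructure data (operational, environmental and maintenance information) in order to make an accurate prediction of the degradation of the track profile along the High Speed Lines.

The main steps to fulfil the objectives are:

1. Select specific tracks along the pilot line

2. Study the failures that occurred in the past with these selected items

3. Collect all the available data related to each selected item

4. Execute the big data algorithm to predict the wear of these selected items

5. Compare the results to determine the degree of reliability of the predictions made by the pilot

6. Carry out a cost analysis to detect the improvements achieved with the pilot

3.3.2.4 Relation with objectives and requirements

The following table shows the relation of this use case with the pilot objectives and the requirements.

Use case | Objective | Requirement
UC2 | OBJ2 | REQ1, REQ2, REQ5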

3.3.3 UC3: Prediction of the evolution of the slopes that are part of the railway infrastructure

3.3.3.1 Objective

The result of this use case will be a prediction of the evolution of slopes and a comparison between the cost of the current maintenance activities and the cost of the maintenance activities designed using the pilot results.

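3.3.3.2 Input data

For this scenario it will be necessary to use at least the following information:

Historical data of slopes state

Historical reports about maintenance at slopes

Measures about the current state of the slopes

Maintenance activities to recover the optimal state of the slopes

3.3.3.3 Work environment

The main activities of this use case will be to analyse, use and link all the available railway infrastructure data (operational, environmental and maintenance information) in order to make an accurate prediction of the evolution of the degradation of slopes on the High Speed Lines.

The main steps to fulfil the objectives are:

1. Select specific slopes along the pilot line

2. Study the failures that occurred in the past with these selected items

3. Collect all the available data related to each selected item

4. Execute the big data algorithm to predict the wear of these selected items

5. Compare the results to determine the degree of reliability of the predictions made by the pilot

6. Carry out a cost analysis to detect the improvements achieved with the pilot

3.3.3.4 Relation with objectives and requirements

The following table shows the relation of this use case with the pilot objectives and the requirements.

Use case | Objective | Requirement
UC3 | OBJ3 | REQ1, REQ3, REQ5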

3.4 Data Assets

Railway infrastructures are very complex and comprise many different elements. Each of these elements has diverse associated parameters that define its characteristics, at least from the maintenance point of view.

In order to identify and describe these parameters, the following table shows the data sets, and their possible sources, that characterize these parameters and the infrastructure element (or elements) to which they relate.

Name of Data Asset | Short Description | Initial Availability Date | Data Type | Link to Data ID Card (in basecamp)
Ferrovial Drone flights | The drone will provide topographic data from the flights to be carried out in the Pilot area. This data set comprises the point cloud of the terrain with its coordinates, data from the drone (location, altitude, etc.) as well as aerial photos. | Depending on the flight planning and on the number of flights, which must be approved by the competent authority | LAS format and aerial photos | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252586
Machinery Technical Specifications | Technical specification data (type, manufacturer, plate (national and international), max speed, year of acquisition, etc.) | 01/04/2017 | pdf, xls, jpg, mpg4, doc |
Machinery Fuel Consumption | Data regarding machinery fuel consumption during the maintenance operations. | 01/01/2015 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423253184
Machinery Work Mode | Data related to the machinery work mode (standstill periods, working stages, transit) | 01/01/2015 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251931
Machinery Engine Work | Data related to the machinery engine while it is working (standstill periods, working stages, transit) | | |
Machinery GPS location | Data related to the location of the machinery according to its GPS coordinates and the theoretical distances to reach the maintenance worksite. | | |
Machinery Automatic Fuel Consumption | Data regarding machinery fuel consumption during the maintenance operations [automatically obtained] | | |
Machinery Automatic Work Mode | Data related to the machinery work mode (standstill periods, working stages, transit) [automatically obtained] | | |
Machinery Automatic Engine Work | Data related to the machinery engine while it is working (standstill periods, working stages, transit) [automatically obtained] | | |
Machinery Number of Tamping Insertions | Data related to the number of tamping insertions carried out by the tamping machine [automatically obtained] | | |
Machinery General Engine Data | Data related to engine data of the machinery focused on its operation [automatically obtained] | | |
Machinery Tamping Device Temperature | Data related to the monitoring of the temperature of the tamping devices while the machine is working [automatically obtained] | | |
Track Geometry | Data related to the track geometry before and after the maintenance activities, used to check that the track parameters are within the required tolerances. | 01/01/2008 | xls, csv and paper |
Railway Design Projects [paper support] | Full construction designs including all the elements of the infrastructure and the as-built info collected during the construction stage [paper support] | 01/01/2002 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252531
Railway Design Projects [digital support] | Full construction designs including all the elements of the infrastructure and the as-built info collected during the construction stage [digital support] | 01/01/2002 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. |
Maintenance Operation Drainage Clearing [paper support] | Maintenance reports regarding the operations carried out for drainage clearing [paper support] | 01/01/2002 - 01/01/2008 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252563
Maintenance Operation Drainage Clearing [digital support] | Maintenance reports regarding the operations carried out for drainage clearing [digital support] | | |
Maintenance Embankment Slope Clearing [paper support] | Maintenance reports regarding the operations carried out for clearing of embankment slopes [paper support] | 01/01/2002 - 01/01/2008 | |
Maintenance Operation Embankment Slopes Clearing [digital support] | Maintenance reports regarding the operations carried out for clearing of embankment slopes [digital support] | | |
Maintenance Operation Track Bed Profiling [paper support] | Maintenance reports regarding the operations carried out for track bed profiling [paper support] | 01/01/2002 - 01/01/2008 | |
Maintenance Operation Track Bed Profiling [digital support] | Maintenance reports regarding the operations carried out for track bed profiling [digital support] | | |
Maintenance Operation Line Fencing Preservation [paper support] | Maintenance reports regarding the operations carried out for track bed profiling [paper support] | 01/01/2002 - 01/01/2008 | |
Maintenance Operation Line Fencing Preservation [digital support] | Maintenance reports regarding the operations carried out for track bed profiling [digital support] | | |
PNOA MÁXIMA ACTUALIDAD | Recent orthophotograph mosaic of the national territory at scale 1:50,000 | 01/01/1990 (although there are older pictures, they are not considered sufficiently up to date) | |
MTN25 RÁSTER | Recent raster file of the National territory map at scale 1:25,000 | | WEB Service |
BTN25 | Recent vector file of the National territory map at scale 1:25,000 | | WEB Service |
BTN100 | Topography base of the National territory map at scale 1:100,000 | NA | WEB Service |
MDT05/MDT05-LIDAR | Digital model of the terrain of the National territory with a mesh of 5x5 m | 01/01/2017 | WEB Service |
Recent Seismicity | Recent seismic activity and earthquakes within National territory and boundaries. | | |
Seismogenic Zones | Seismic related zones and elements within National territory and boundaries. | 01/10/2010 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252328
Hydrogeologic map | Hydrogeological GIS information of the National territory. | 01/01/1991 - 01/01/1999 | WEB Service |
Vegetation Index | Time-dependent vegetation index GIS information for the territory of Andalusia. | | |
Noise map of Malaga | GIS information of the noise distribution in the city of Malaga related to train traffic | 01/01/1991 - 01/01/1999 | WEB Service |
Weather Information | Weather information gathered by ADIF's weather stations deployed throughout its facilities and infrastructures. | | |
Weather Information | Weather information gathered by the AEMET network. | 01/01/1920 | API REST | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252425
European Railway Data | The Agency is responsible for developing and maintaining several registers and databases in order to ensure transparency and equal access to documents for all railway market actors. These data sets include aspects such as incidents, safety, interoperability or rolling stock. | 01/01/1992 | WEB Service / API REST |
Tramification | Tracks and units in service | JANUARY 2010 | XLS, PDF | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439141
Tramification (IDEADIF) | Tracks and units in service (geometry) | JANUARY 2010 | SHP | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439174
PIDAME | Applying for maintenance tasks | JANUARY 2010 | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439054
SIOS | Works, projects and maintenance tasks | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439101
ICECOF | Monitoring and control system of railway operation | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439032
CGRH24 | 24h network management centre | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438890
DAVINCI | High technology in railway operations | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438950
Dynamic inspection | Dynamic inspections database | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438992
Geometrical inspection | Geometrical inspections database | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439016
S&C inspection | S&C inspections database | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439085
TAMPING | Tamping database | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439118
AEMET | Observing networks used for meteorological and climatological studies | 1981 | CSV, PDF, XML |
CLIMA | Environmental climatological information subsystem | | XLS | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438915

3.4.1 Techniques to monitor the infrastructure

The data available on the state of the infrastructure are collected through several types of inspection. The pilot will use the information provided by these inspections as input data, both to try to find relationships between them that are not currently recognised and to analyse the evolution of each measure in order to try to predict the infrastructure wear.

The different kinds of inspection that will be used in the pilot are described below:

Geometric track inspection: in Spain, the track geometry control vehicle records the parameters of the track geometry. Accelerations are measured in the axle boxes to detect both short-wave and long-wave wear and levelling defects in welds or joints. These defects are associated with dynamic overload, which causes rolling contact fatigue as well as vibrations and discomfort. Longitudinal and transversal levelling are also checked; their defects can be short-wave, medium-wave or long-wave and can produce dynamic overload, instability in running and discomfort. Warping defects are also measured with this vehicle because they can result in derailments. Other parameters measured are alignment, track gauge, the rail head transverse section and the profile of the track.

Geometric inspections are carried out approximately every two weeks.

Dynamic inspection: one of the most important sources of information for the maintenance of the superstructure of the high-speed lines is the periodic dynamic inspection of the track. This is an indirect method of detecting defects in the superstructure through the study of the accelerations experienced at certain points of the structure of a commercial vehicle in motion. The characteristics and condition of the vehicle, and its speed, influence the "reading" that it makes of the track.

The inspection is made with the dynamic control vehicle, which runs at the same speed as a commercial train. The following accelerations are measured and recorded: vertical in the axle boxes, and lateral and vertical in the car body. The thresholds for each acceleration are set on the basis of experience. The system provides a list of the kilometre points where the limits have been exceeded, together with the values measured at these points.

o Lateral bogie acceleration. For values above 6 m/s², the defect is checked as soon as possible and corrected immediately. Values between 4 and 6 m/s² are scheduled for programmed correction, and values between 2.5 and 4 m/s² are kept under surveillance.

o Vertical axle box acceleration. Only values above 30 m/s² are considered. In that case it is necessary to make specific topographic studies of the track to analyse sleeper settlement, among other causes.

o Lateral and vertical car body accelerations. Besides these accelerations, longitudinal acceleration is occasionally also measured. They indicate the comfort of the traveller. Values below 1 m/s² are considered normal. They are also of interest for the detection of long-wave defects.

Twelve dynamic inspections are performed annually. This frequency provides updated information (monthly or bimonthly) on the state of the superstructure, on its evolution and on the effectiveness of the work done since the last inspection. For this reason, dynamic inspections are considered the main source of information for the maintenance of the superstructure, and they should be the basis for scheduling it.

Rail ultrasonic inspection: rail head defects with longitudinal orientation are detected by a vertically acting ultrasonic probe.

A vehicle equipped with ultrasonic instruments for rail inspection performs this activity twice a year.

Track visual inspection: specialized personnel make a visual inspection of the track superstructure on foot. All the aspects that could have an impact on normal operation are checked.

In particular, the following elements are evaluated:

o Ballast: state, pollution, presence of weeds, dimensions of the ballast layer.

o Sleepers: presence of fissures or cracks, damage from track machinery, squaring.

o Fasteners: correct placement and operation of the fastener.

o Rail: appearance of surface defects, cracks or fissures.

o S&C: inspections

The frequency of these visual inspections is twice a year.

Train cab ride: a visual inspection of the track is made from the driver's cab of a train in commercial service, recording all the singularities of the line. Potential defects are confirmed as soon as possible by taking field data, and a geometric correction is scheduled if necessary.

The frequency of these visual inspections is weekly.

With all the data from the dynamic inspections, geometric inspections, train cab inspections and on-foot inspections, the maintenance staff have the information necessary to schedule maintenance work. The analysis of the acceleration graphs is very useful, and it is most effective when the most important elements of the superstructure are located on them. From these studies, the works to be done on the track are scheduled, taking into account those that need treatment with heavy machinery, specific topographic studies or confirmation by dynamic inspection to solve the problem. In addition, the areas or points whose treatment requires more investment and specific planning are identified.
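The lateral bogie acceleration bands used in the dynamic inspection can be written as a small classifier, shown here as a minimal sketch. The text does not say which band the boundary values 2.5, 4 and 6 m/s² belong to, so the inclusive limits below are an assumption, as are the function names, the label "within limits" and the sample readings.

```python
def classify_lateral_bogie(acc_ms2):
    """Map a lateral bogie acceleration (m/s^2) to the action band in the text.

    Assumption: boundary values fall in the less severe band above 2.5,
    i.e. exactly 4.0 and 6.0 are "programmed correction".
    """
    if acc_ms2 > 6.0:
        return "immediate correction"
    if acc_ms2 >= 4.0:
        return "programmed correction"
    if acc_ms2 >= 2.5:
        return "surveillance"
    return "within limits"


def exceedance_list(readings):
    """Return (kilometre point, value, band) for every reading above 2.5 m/s^2.

    `readings` is an iterable of (kilometre_point, lateral_bogie_acceleration).
    """
    report = []
    for km, acc in readings:
        band = classify_lateral_bogie(acc)
        if band != "within limits":
            report.append((km, acc, band))
    return report


# Invented sample readings: (kilometre point, acceleration in m/s^2).
sample = [(12.4, 1.8), (57.9, 3.1), (103.2, 4.7), (218.6, 6.3)]
for km, acc, band in exceedance_list(sample):
    print(f"km {km}: {acc} m/s2 -> {band}")
```

The output has the same shape as the list described for the dynamic control vehicle: the kilometre points where a limit was exceeded, with the measured values attached.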

3.5 Big Data Technology, Techniques and Algorithms

Note: The Big Data technology, techniques and algorithms used in this project will be based on the SOFIA2 Platform (by Indra Sistemas, S.A.). SOFIA2 is also the Big Data platform used in “WP4 – Smart Highways” and “WP8 – Smart Airport Turnaround”, so this chapter is shared with those work packages.

Each of the pilots that share the Big Data platform will use both common and specific modules of SOFIA2 in order to fulfil its specific objectives.

This chapter is divided into two parts: the first describes the characteristics of the Big Data platform and the specific modules that will be used in the railway pilot, and the second describes the specific techniques used to develop the descriptive and predictive algorithms of this railway pilot.

3.5.1 Big Data Technology

The aim of the SOFIA project (an Artemis project, 2009-2012) was to make information in the physical world available for smart services by connecting the physical world with the information world. The idea was to enable cross-industry interoperability and to foster innovation while maintaining the value of legacy solutions. Sofia implements the Smart M3 semantic model. The semantic information broker (SIB) of the Smart M3 model is able to dispatch extremely high volumes of data, but it is not able to process them; processing is delegated to the nodes.

The results of Sofia were taken as input by Indra to create Sofia2, an IoT and Big Data solution oriented to real-life conditions. The performance was optimized and the semantic requirements were simplified. Indra also enhanced Sofia2 through the Analytics Labs, bringing an open source toolset for Big Data exploitation, based on the main standards and on tools like Hadoop.

The Sofia2 community version provides a set of open APIs based on the main standards, so that any developer can extend the functionality of the platform to suit their needs. With this approach, no specific skill or training is needed to use Sofia2 beyond the existing open solutions. As communication protocols, Sofia2 uses MQTT, RESTful, Ajax Push, WebSocket, AMQP and JMS.

The interchange of information is based on the definition of ontologies, a semantic solution to face the wide spectrum of IoT. This modelling approach only requires defining the information semantically and developing the Knowledge Processor for the data source (sensor or system), which structures the data obtained from the environment according to its semantic meaning. Sofia2 uses SSAP-JSON and SSAP-XML as standards for the exchange of information.

Sofia2 supports a wide range of use cases, for example Smart Cities, Mobility, Energy, Building, Home or Health, among others. The approach is the same in every case; the only things that change are the semantic definition of the information and the sensors or devices connected.

3.5.1.1 Conceptual model Indra Big Data platform

The main objective of the Sofia2 Big Data Platform is to simplify the use of all its technologies and to speed up the use and exploitation of structured and unstructured data, even in real time.

Sofia2 is a modular platform whose modules can be deployed independently according to need. All the concepts of the platform are managed from a unified web console, and the platform scales as needed on proven technologies.

Built on widely tested open source software (such as Hadoop, Spark and Hive, among others), Sofia2 supports real-time scenarios, batch processing, machine learning and visualization. Additionally, Sofia2 is extensible and adaptable, integrating security at the data modelling level and offering validations in data exploitation and semantics.


Figure 3: Sofia2 Big Data Platform used modules

3.5.1.2 Sofia2 Typical workflow

Typical flows in Sofia2 do not use all modules. The workflow and the modules that will be used for the current pilot are presented below.

The workflow of SOFIA2 has the following four steps.

Figure 4: Sofia2 Typical workflow


1- Upload information

Ingestion of information through the DataFlow module, which defines its treatment and its storage in the Staging Area of the Storage module. The information can be treated in this phase to homogenize it and to ensure the quality of the data.

2- Analysis of the information

Processing of the information and execution of machine learning algorithms through the Notebook and Machine Learning modules to obtain data of business value.

3- Storage of models and results

The values resulting from the execution of the machine learning algorithms are stored on the platform.

4- Display information

Through the Dashboard module, all the information stored in the platform, both that inserted in step 1 and that inserted in step 3, can be accessed and consulted.
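The four steps can be pictured with a toy pipeline. This is only a sketch in plain Python and does not use any Sofia2 API: the two in-memory containers stand in for the staging area and the results storage, the mean per asset stands in for the analysis, and all names and records are hypothetical.

```python
import json
from statistics import mean

staging_area = []    # stand-in for the staging area filled in step 1
results_store = {}   # stand-in for the platform storage used in step 3


def upload(records):
    """Step 1: ingest records, homogenizing identifiers and dropping empty values."""
    for rec in records:
        if rec.get("value") is None:
            continue  # simple quality rule: discard readings without a value
        staging_area.append(
            {"asset": rec["asset"].strip().upper(), "value": float(rec["value"])}
        )


def analyse():
    """Step 2: derive one figure per asset (here simply the mean reading)."""
    by_asset = {}
    for rec in staging_area:
        by_asset.setdefault(rec["asset"], []).append(rec["value"])
    return {asset: mean(values) for asset, values in by_asset.items()}


def store(results):
    """Step 3: keep the results next to the raw information."""
    results_store.update(results)


def display():
    """Step 4: expose both the uploaded information and the results."""
    return json.dumps(
        {"raw_records": len(staging_area), "results": results_store}, sort_keys=True
    )


upload([
    {"asset": " sw-001 ", "value": "1.5"},
    {"asset": "SW-001", "value": 2.5},
    {"asset": "sw-002", "value": None},
])
store(analyse())
print(display())
```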

The modules of the Sofia2 platform used in the construction of the pilot are explained in the following subsections.

3.5.1.3 Sofia2 Storage

The information modelled in the platform is stored in the Big Data Repository included in the platform.

The reference implementation of this repository, which is supported on Hadoop, is used. Apache Hadoop is an open-source framework that allows the distributed processing of large amounts of data (petabytes) on clusters of machines.

Hadoop is currently synonymous with Big Data because it is:

Economical: it runs on low-cost equipment forming clusters.

Scalable: if more processing power or storage capacity is needed, more nodes can easily be added to the cluster.

Efficient: Hadoop distributes the data and processes it in parallel on the nodes, moving the processing (tasks) to the data.

Reliable: Hadoop keeps several copies of each block of data, so the data remain available if a node fails.

The main parts of Hadoop used by the solution are:

HDFS, the Hadoop distributed file system:

A distributed file system that abstracts the physical storage and offers a single view of all the storage resources of the cluster.

To store a file, it splits the file into blocks and stores each block on a different node of the cluster. It also replicates each block, by default on three nodes.

It is possible to store files larger than the disk of any single machine in the cluster.

If a node of the cluster fails, the system keeps running while the node is repaired, using the information replicated on other nodes.

Hive, the data warehouse infrastructure on Hadoop, which allows SQL queries on data stored in Hadoop.

Impala, which allows online SQL access to the data stored in HDFS.
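The HDFS behaviour described above (splitting a file into blocks, keeping several replicas, surviving a node failure) can be simulated in a few lines. This is a simplified sketch, not HDFS code: real HDFS placement is rack-aware, whereas the round-robin placement, the function names and the node names used here are assumptions made for illustration.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to distinct nodes.

    Returns {block_index: [node, ...]} using a plain round-robin placement.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    n_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# A 1000 MB file, 128 MB blocks, four hypothetical nodes, three replicas.
cluster = ["node1", "node2", "node3", "node4"]
placement = place_blocks(1000, 128, cluster)
print(len(placement), "blocks;", "block 0 on", placement[0])
print("readable without node2:", readable_after_failure(placement, "node2"))
```

Because every block lives on three different nodes, losing any single node leaves at least two replicas of each block, which is why the cluster keeps serving the file while the node is repaired.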

3.5.1.4 DataFlow (ETL module)

The DataFlow module is the main entry point for data and information into the platform. This module can be used as an ETL (Extract, Transform and Load) tool, both to ingest data, including complex transformations within the platform, and to export data with intermediate transformations.

Figure 5: Sofia2 DataFlow module

The main capabilities of the DataFlow module can be summarized as follows:

Extraction: data is extracted from homogeneous or heterogeneous data sources. Up to 18 different data origins are integrated in the module: Excel, AmazonS3, HadoopFS, Sofia2 (which lets you select the ontology, fields or query), Kafka, etc.

Transformation: the data is transformed so that it can be stored in the proper format or structure for querying and analysis. It is composed of many different tasks:

Evaluation of expressions: performs checks and calculations whose results can be written to new or existing fields.

Actions on fields: different actions are available on the fields, such as Converter, Merger, Masker, Hasher, remove, rename...

Parser of JSON, XML and logs: parses valid information according to the different log formats and to XML and JSON schemas.

Flow selector: selects the next activity to execute on the dataset, depending on execution conditions.

Evaluators in different languages: specific actions on the data written in different programming languages (Python, JavaScript, Jython...).

Other components, such as the record replicator or the replacement of values.

Figure 6: Sofia2 DataFlow example diagram

Load: the data is loaded into the final target database.

There are more than twenty possible destinations, which are incorporated into the process by drag and drop from the taskbar. The following components are highlighted:

Sofia2 (which lets you select the ontology, fields and other additional parameters)

AmazonS3

Cassandra

Hadoop

Kafka

Flume
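The extract, transform and load stages can be illustrated outside the platform with the Python standard library. This is a sketch of the general ETL pattern, not of the DataFlow module itself: the CSV sample, the field names, the in-memory SQLite target and the 1435 mm nominal gauge used in the expression are assumptions made for the example.

```python
import csv
import hashlib
import io
import sqlite3

# Invented raw input, as it might arrive from one of the data origins.
RAW = """asset_id,inspector,gauge_mm
SW-001,Ana Lopez,1435.2
SW-002,Luis Gil,1436.8
"""


def extract(text):
    """Extraction: read the source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert, rename and hash fields, and evaluate an expression."""
    out = []
    for row in rows:
        gauge = float(row["gauge_mm"])  # converter: text to number
        out.append({
            "asset": row["asset_id"],  # rename: asset_id becomes asset
            # hasher: the personal name is replaced by its SHA-256 digest
            "inspector_hash": hashlib.sha256(row["inspector"].encode("utf-8")).hexdigest(),
            "gauge_mm": gauge,
            # expression: deviation from an assumed nominal gauge of 1435 mm
            "gauge_dev_mm": round(gauge - 1435.0, 1),
        })
    return out


def load(rows, conn):
    """Load: write the transformed rows into the target database."""
    conn.execute(
        "CREATE TABLE measurements "
        "(asset TEXT, inspector_hash TEXT, gauge_mm REAL, gauge_dev_mm REAL)"
    )
    conn.executemany(
        "INSERT INTO measurements "
        "VALUES (:asset, :inspector_hash, :gauge_mm, :gauge_dev_mm)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
for asset, deviation in conn.execute("SELECT asset, gauge_dev_mm FROM measurements"):
    print(asset, deviation)
```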


Figure 7: Sofia2 DataFlow monitoring information

3.5.1.5 Notebooks (Collaborative analytical)

The Notebook is an interactive and intuitive Sofia2 module that makes it possible to display data and facilitates its analysis.

In summary, the Notebook is a collaborative tool that is capable of:

1. Performing complex analysis of the information managed by the platform (both real-time and historical)

2. Combining different languages (Spark, R, Python, Hive, SparkSQL and Shell)

3. Generating intuitive figures (such as tables, graphs, maps and others)

4. Planning the execution of procedures (by new notebooks, one per procedure). That is, Notebooks can be created with different targets, for example a real-time execution, a periodic execution or a batch execution.

As an example, a Notebook can load data from the Hadoop Distributed File System (HDFS) into Spark, launch queries and perform complex machine learning processes through the MLlib libraries.

It is also possible to use R code and the numerous libraries of that language, for example to display Leaflet maps.


Figure 8: Sofia2 Notebook

SOFIA2 Notebooks can combine Scala code, Spark, SparkSQL, Hive, R, Shell and many others with HTML content or reactive Angular directives, allowing real-time interaction through a powerful interface, all in a shared, multi-user environment.

Each supported language is managed by an interpreter, so whenever code is written in a given language, the paragraph must start with the marker of that language.

In addition, the Notebook allows instant visualization of data: graphics can be configured easily and the type of display changed quickly. Advanced graphics can also be created thanks to the libraries of each language.

Figure 9: Sofia2 Notebook Spark graphics


https://sofia2about.files.wordpress.com/2016/03/image0051.jpg

Page | 57


Version 1.0, 15/03/2017

Figure 10: Sofia2 Notebook HIVE graphics

Each Notebook consists of paragraphs, which may use different languages. Paragraphs can be run individually, showing both their output and their execution state.

Both individual paragraphs and the full notebook can be exposed externally via a URL, which shows the executions of the notebook or paragraph in real time.

Another important feature is the ability to schedule notebook execution through a CRON expression, so that a notebook can run repeatedly without losing context. The execution interval can be selected from predefined values or written as a custom expression.

Together, these features provide a collaborative web tool capable of performing complex analysis of the information managed by the platform (both real-time and historical), combining different languages and generating graphical views (or other actions). The analysis can be scheduled for periodic execution, automatically refreshing the result that is exposed at a URL.

3.5.1.6 Machine Learning

The Machine Learning platform supports different learning techniques, most notably the following:

Regression: Techniques to estimate relationships between variables.

Clustering: Techniques for grouping data by similarities.

Classification: Techniques to identify the membership of an element to a specific group.

Recommendation / Prediction: Techniques for forecasting the value for a new entity based on historical preferences and/or behaviors.

Through the interpreter, SOFIA2 allows users to:

Store models created on the platform. They can then be managed from the web console, from which they can also be invoked with parameters and assigned permissions.

Publish SOFIA2Models scripts that provide methods to retrieve the model, save it, invoke it and assess its quality.

Generate REST APIs for evaluating input data sets with the generated models. This allows the models to be invoked through standard mechanisms that are also covered by the integrated security of the platform.

This module allows workflows to be defined visually, so that only the configuration parameters and input data need to be entered to define analytic processes.

3.5.1.7 Dashboards

This module allows you to create a simple and visual dashboard with the information managed by the platform.

This module offers various types of gadgets (data outputs) that help to build a complete, personalized dashboard.

Figure 11: Sofia2 Dashboard types

3.5.2 Big Data Techniques and Algorithms

Applying a methodology to data mining processes is important for planning and executing such projects. Some organizations implement the KDD (Knowledge Discovery in Databases) process, while others use more specific standards such as CRISP-DM (IBM SPSS) or SEMMA (if they are using SAS tools). In this project, however, we will use open source software, mainly the R and Python languages and the RStudio tool.

Data mining, or information exploitation, is a process for extracting useful, comprehensible and new knowledge from large data volumes. Its main goal is to find hidden or implicit information that cannot be obtained through conventional statistical methods. The inputs to data mining processes are records from operational databases or data warehouses.

We are using a methodology based on CRISP-DM with some shortcuts. Starting from a defined goal, which implicitly contains the business knowledge, the data must first be prepared. Data preparation usually includes enriching the data with Open Data. An advanced model is then created, which produces results that require validation. These last three stages (data preparation, creation of advanced models and results validation) form a cycle that is iterated until results that are valid for the business are achieved. The major steps are represented in the following diagram.

Figure 12: Methodology based on CRISP-DM

Each stage will be analysed separately so we can provide additional details.

1. Goal definition based on business knowledge

2. Data preparation and management

3. Creating analytic models

4. Validation and conclusions

5. Deployment: Solution integration with the pilot

3.5.2.1 Goal definition based on business knowledge

The first task of a data analyst is to understand what the customer really needs to achieve. It is important to identify the primary objective and how it relates to the other objectives.

The analyst should describe the criteria that are useful from the business perspective, so that the business side can easily understand the situation. A more detailed study is then needed of all the resources, restrictions, assumptions and other factors that should be considered when determining the objective of the data analysis and the project plan.

The business goals are then converted into data mining goals, that is, translated into technical terms. It remains important, however, to define the criteria for business success. The tool to be used is also selected at this stage.

3.5.2.2 Data preparation and management

First, an initial data collection makes it possible to assess data quality, gain first insights and identify interesting data subsets from which to form hypotheses about hidden information.

Secondly, the final data set to be used in the analysis is built. This includes tasks such as selecting tables, records and attributes, as well as transforming data, creating new specific variables and cleaning the data.

Data cleaning can range from replacing defective data to estimating missing data through modelling. Other operations include producing derived variables or creating new variables.
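As an illustration of these cleaning operations, the following minimal Python sketch replaces a defective reading with the mean of the valid readings and then adds a derived variable. The field names (`km`, `wear_mm`, `high_wear`), the sample records and the 1.2 mm threshold are hypothetical and are not taken from the pilot data; mean imputation is only one of many possible estimation strategies.

```python
from statistics import mean

# Hypothetical measurement records; None marks a defective reading.
records = [
    {"km": 10.0, "wear_mm": 1.0},
    {"km": 10.5, "wear_mm": None},
    {"km": 11.0, "wear_mm": 2.0},
]

def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values replaced by the mean of the valid ones."""
    valid = [r[field] for r in rows if r[field] is not None]
    fill = mean(valid)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def add_derived(rows, threshold=1.2):
    """Add a derived boolean variable flagging wear above an illustrative threshold."""
    return [{**r, "high_wear": r["wear_mm"] > threshold} for r in rows]

cleaned = add_derived(impute_mean(records, "wear_mm"))
```

The original records are left untouched, so the raw data remain available if a different estimation method is tried in a later iteration.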

Another common operation is combining the data with open sources, especially when there are relationships between the initial data and the Open Data, for instance combining data with socio-economic variables from EUROSTAT.

Combining data also covers aggregations, that is, new values calculated by summarizing information from multiple records. For instance, for a table of customer purchases, new fields could be the number of purchases, the average purchase quantity, the percentage of articles on promotion, etc.
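The customer purchase example can be sketched in Python as follows. The record layout (`customer`, `quantity`, `promo`) and the sample values are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical purchase records: one row per article bought.
purchases = [
    {"customer": "A", "quantity": 2, "promo": True},
    {"customer": "A", "quantity": 4, "promo": False},
    {"customer": "B", "quantity": 1, "promo": True},
]

def summarise(rows):
    """Aggregate multiple purchase records into one summary row per customer."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["customer"]].append(row)
    summary = {}
    for customer, items in groups.items():
        n = len(items)
        summary[customer] = {
            "n_purchases": n,
            "avg_quantity": sum(i["quantity"] for i in items) / n,
            # Booleans count as 0 or 1 when summed.
            "pct_promo": 100.0 * sum(i["promo"] for i in items) / n,
        }
    return summary

summary = summarise(purchases)
```

Each summary row can then be joined back to the analysis data set as a group of new variables.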

3.5.2.3 Creating analytic models

With our methodology, we can address any kind of model: descriptive, diagnostic, predictive and prescriptive. The following graphic illustrates the difference:

Figure 13: Different kind of models

As shown, the more complex the chosen technique, the more value can be added for the client. This pilot is expected to achieve the predictive level.

One of the main classifications divides machine-learning algorithms into two groups:

Unsupervised algorithms;

Supervised algorithms.

Unsupervised algorithms are applied when only input data are available, with no corresponding output variables. The goal of this technique is to determine the underlying structure or distribution of the data and to organize the data by similarity.

Examples of application of these techniques include customer segmentation, finding hidden patterns, etc.

One of the most widely used unsupervised algorithms is the K-means algorithm.
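To show how K-means organizes data by similarity, the following is a minimal pure-Python sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The sample points and the hand-picked initial centroids are hypothetical; a production setting would normally rely on a library implementation and a more careful initialization.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

No output variable is supplied at any point: the grouping emerges only from the distances between the observations.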

Supervised algorithms, on the other hand, try to learn a function that maps the input data to the output variable. In these cases, the variable to be predicted is known in advance.

Supervised algorithms are divided into two groups:

Classification algorithms: the output variable is categorical, for example fraud/not fraud, green/red/blue, failure/no failure;

Regression algorithms: the output variable is a real number, for example a temperature or a pressure value.
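To make the distinction concrete, the sketch below fits a simple least-squares regression line, whose output is a real number, and contrasts it with a trivial threshold rule whose output is a category. The variable names, the sample values and the 8.0 threshold are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: temperature grows linearly with load.
load = [0.0, 1.0, 2.0, 3.0]
temperature = [5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_line(load, temperature)

def predict_temperature(x):
    """Regression output: a real number."""
    return slope * x + intercept

def classify(value, threshold=8.0):
    """Classification output: a category."""
    return "failure" if value > threshold else "no failure"
```

With these definitions, `predict_temperature(4.0)` yields a numeric estimate, whereas `classify(predict_temperature(4.0))` yields one of two labels.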

The next table summarizes the most common algorithms in supervised learning in both

categories:

Classification problem | Regression problem
Logistic regression | Simple regression
Decision tree | Ridge regression
Random forest | Lasso regression
Gradient boosted trees | Elastic net (Ridge+Lasso)
Neural network, Deep learning | Regression tree
Adaboost | K-neighbors regression
Naive Bayes | SVR
K-neighbors | Random forest regression
SVM | Gradient boosted tree regression
- | AFT

For both types of problem, many of the algorithms listed above are tested and the most accurate one is chosen.

The next diagram shows the type of algorithm selected for each of the use cases and scenarios described earlier in the document:

Figure 14: Algorithm type for each use case

Once the most appropriate type of algorithm has been chosen, a procedure is needed to test the quality and validity of the model. The data are therefore divided into a training sample (from which the algorithm learns from the past) and a test sample (on which the accuracy of the algorithm is measured), as the next figure depicts.
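The split between training and test data can be sketched as follows. This is a minimal illustration under stated assumptions: the `(feature, failed)` pairs are synthetic, the 25 % test fraction is arbitrary, and the midpoint-threshold classifier is a toy stand-in for whichever algorithm is actually selected.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows reproducibly and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Toy classifier: midpoint between the mean feature value of each class."""
    failures = [x for x, failed in train if failed]
    healthy = [x for x, failed in train if not failed]
    return (sum(failures) / len(failures) + sum(healthy) / len(healthy)) / 2

def accuracy(threshold, test):
    """Fraction of test rows whose label matches the threshold prediction."""
    hits = sum((x > threshold) == failed for x, failed in test)
    return hits / len(test)

# Hypothetical (feature, failed) pairs: high values are labelled as failures.
data = [(float(v), v >= 10) for v in range(20)]
train, test = train_test_split(data)
threshold = fit_threshold(train)
score = accuracy(threshold, test)
```

Because the model never sees the test rows during fitting, `score` estimates how the model would behave on new data rather than how well it memorised the past.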

Figure 15: Model creation process

3.5.2.4 Model Evaluation

The data scientist interprets the models according to domain knowledge, the data mining success criteria and the desired test design, and then discusses the results with the business analysts in the business context.

Depending on the model evaluation, the tuning parameters are reviewed and adjusted for a new evaluation, until the model that best answers the business goals is obtained. Business decisions may even arise that render the model deficient. Based on the evaluation results and the process review, the project team decides how to proceed: whether the project should end, whether it should continue with a modified development and further iterations, or whether a new data mining process should start.

A good way to define the total outputs of data mining is OUTPUTS = MODELS + CONCLUSIONS.

3.5.2.5 Deployment: Solution integration with the pilot

Supervision and maintenance are important issues if data mining results become part of the daily business.

Generally, data mining processes do not run independently in an IT environment; they have to interact with other applications or be incorporated into the business processes. We therefore consider this stage crucial to ensuring the success of the data mining algorithms.

3.6 Positioning of Pilot Solutions in BDVA Reference Model

Figure 16: Big Data Value Reference Model

Data Visualization and User interaction: The pilot will provide a set of specific reports that present the information on each prediction in a readable and useful format, to support decision-making when optimizing maintenance work.

See chapter 3.5.1.7 Dashboards.

Data Analytics: Algorithms developed for the pilot:

o Descriptive: A descriptive analysis will be carried out first to gain a full understanding of the data.

o Predictive: The main aim of this pilot is to develop specific algorithms based on predictive data analysis to predict the evolution and degradation of each of the elements of the pilot.

See chapters 3.5.1.4 DataFlow (ETL module), 3.5.1.5 Notebooks (Collaborative analytical) and 3.5.2.3 Creating analytic models.

Data Processing Architectures:

o Batch process: For each data source, processes will be specified to feed the algorithms with new data collected during the execution of the pilot. Given the nature of the data sources, the inclusion of new data will be a periodic task.

o Interactive: Some of the identified data sources have an unstructured format, so this information will need to be processed by interactive methods before it can be used and included in the pilot.

See chapters 3.5.1.4 DataFlow (ETL module) and 3.5.1.5 Notebooks (Collaborative analytical).

Data Management: The initial data sources identified provide information to the pilot in standard formats based on Excel, PDF and XML files. All these sources will be processed to allow their initial inclusion and the progressive insertion of data throughout the execution of the pilot. For a more detailed description, see chapter 3.4 Data Assets.

The techniques used to manage the data will be:

o Collection: techniques and tools for gathering and storing data in its original form (i.e., raw data).

o Preparation/Curation: techniques and tools for converting raw data into cleansed, organized information.

o Linking/Integration: techniques and tools for matching, aligning and integrating information.

o Access: techniques, tools and interfaces for accessing information (incl. access rights management).

See chapter 3.5.2.2 Data preparation and management.

3.7 Big Data Infrastructure

In pilot stages 1 and 2 we will use an INDRA Cloud infrastructure to execute the first steps of our methodology (data preparation and management, creating analytic models and model evaluation). These first steps and the first iterations of our methodology cycle will be executed on the Cloud infrastructure, but at the end of stage 2 and in stage 3 the pilot will be deployed at the Railway Technology Centre (CTF) of ADIF in Málaga, in the IT environment described in this chapter.

3.7.1 INDRA Cloud platform

The INDRA Cloud platform will be shared by three Transforming Transport Domains:

WP4 – Smart Highways

WP6 – Proactive Rail Infrastructures

WP8 – Smart Airport Turnaround

The cloud platform is dimensioned according to the estimated data volume of the pilots that will share it.

The following figure shows an overview of the platform that will be used in the first pilot stages.

Figure 17: Sofia2 Big Data Cloud Platform Infrastructure

3.7.2 ADIF Railway Technology Centre Platform

SOFIA2 will be deployed on the ADIF platform with the same modules and characteristics as on the cloud platform.

To deploy the railway replication pilot at the ADIF CTF, there are two hardware alternatives with the following characteristics.

IBM xServer X3850

Processor 32 cores

RAM 64GB

Network 4 network cards 10/100/1000 4FC DUAL cards

Memory 2x 146GB 15k internal disks

IBM p770

Processor 6 x Power7 with 8 cores

RAM 192GB

Network 3 Quad ethernet cards 4FC DUAL cards

Memory 2 internal disks

This system can scale up to 64 cores and 1 TB of RAM at 1066 MHz.

For the data storage system, we will use a hardware solution with the following characteristics:

Dual IBM System Storage DS5300 2x IBM System Storage EXP5000

Memory 2 x 16 disks with 300GB and 15000RPM

3.8 Roadmap

The pilot will be executed in the following three stages. Each stage differs in its objectives, deployment environment, Big Data infrastructure and data used.

The following table presents the main characteristics of these stages.

Stage | Delivery Date (Project Month) | Features / Objectives Addressed | Embedding in Productive Environment | Big Data Infrastructure Used | Scale of Data

S1: Technology Validation | 9 | Validate the quality and volume of the input data obtained; provide KPIs for input data; confirm the feasibility of the initial objectives (OBJ1, OBJ2, OBJ3) | At test lab | Indra's hardware and Cloud | Historical data

S2: Large-scale experimentation and demonstration | 18 | Generate the descriptive algorithm for all the feasible objectives (OBJ1, OBJ2, OBJ3); generate the predictive algorithm for all the objectives (OBJ1, OBJ2, OBJ3); validate the quality of the developed algorithms and determine their reliability | At test lab + initial deployment at ADIF CTF of Málaga | Indra's hardware and Cloud + dedicated hardware | Historical data + simulated data

S3: In-situ trials | 27 | Generate all the processes that feed the algorithms with new and updated information; provide conclusions on each objective and assess the improvement found (OBJ1, OBJ2, OBJ3) | Deployed at ADIF CTF of Málaga | Dedicated hardware | Historical data + updated information

3.8.1 Objectives of each stage

The objectives of each stage are:

1. In stage 1, the main objective is to check whether the data are useful for the pilot objectives, that is, whether the data sources provided and the relationships between them are appropriate to meet the objectives proposed in the pilot.

During this stage, the quality of the available information will be analysed in depth to determine the feasibility of each of the pilot objectives, as well as the likely reliability of the results and predictions that can be obtained during the execution of the pilot.

2. In stage 2, the main objective is to develop all the descriptive and predictive algorithms. Another important aim of this stage is to test the algorithms and make adjustments that increase the reliability of the results.

Pilot performance will also be tested during this stage with high data volumes, to ensure that the pilot will run properly in the final stage.

3. In stage 3, the main objective is to provide a complete tool for extracting useful knowledge with which to optimize maintenance activities on the high-speed rail lines.

As part of this stage, an analysis of the reliability of the results provided by the pilot will be delivered.

3.8.2 Deployment of each stage

The deployment of the Big Data platform will evolve through the following steps:

1. In stage 1, INDRA will deploy the Big Data platform in the Cloud-based test lab environment, where it will be used for validating the pilot objectives and for data analysis.

2. In stage 2, the pilot will run in the Cloud environment in order to develop the first version of the algorithms, process the initial data and provide the first results.

In parallel, the Big Data platform will be deployed at the ADIF CTF in Málaga to carry out the first load tests and thus check whether the available hardware can run the Big Data platform correctly under the higher data loads expected in the following phases of the pilot.

3. In stage 3, the ADIF CTF in Málaga will be used as the centre for information processing and for executing the Big Data algorithms, in their successive versions, with the data corresponding to each of the phases described below.

Figure 18: Pilot infrastructure deployment and data evolution

3.8.3 Data evolution of each stage

The data used at each stage evolve as follows:

1. In stage 1, the pilot will use all the historical data available from the ADIF and FERROVIAL systems and data sources.

2. In stage 2, the same data used in the first stage will be used, plus simulated data with the same characteristics as the historical data.

3. In stage 3, the historical data used in the first stage will be used together with the historical information stored during the execution of the pilot, plus information received from provider systems in real and near-real time. Given the nature of the provider systems used as data sources, the information will be received periodically and included in the pilot, so that the volume of data used increases progressively.

4 Commonalities and Replication

4.1 Common requirements and aspects

Although the two railway pilots share the objective of providing a tool to predict the degradation of the railway infrastructure and thus optimize maintenance work, each one focuses on a different part of the railway infrastructure.

The initial data of the two railway pilots are also very similar, both in their nature and in the mechanisms used to obtain them. Both pilots receive information collected by laboratory trains and by systems that measure the state of the different elements of the infrastructure.

Both pilots can therefore be considered complementary: both will make it possible to analyse the state of the infrastructure and predict its degradation, and both will provide tools for infrastructure administrators and infrastructure maintenance companies to plan maintenance activities more efficiently.

Both pilots aim to optimize maintenance from two points of view: increasing the availability of the infrastructure and reducing the cost of maintenance, while always ensuring the safety of both the infrastructure and the workers involved in maintenance activities.

4.2 Aspects of Replication

For future projects or pilots that aim to perform a predictive analysis of the wear of an element of the railway infrastructure, the following elements of this pilot can be reused:

Analysis of the spectrum of existing data related to the operation and maintenance of the railway infrastructure.

Techniques for extracting information from heterogeneous data obtained from different data sources.

Prediction-based data analysis techniques.

Algorithms for predicting the evolution of the wear of elements belonging to the railway infrastructure.

The railway primary pilot uses data provided by the UK's Network Rail. Initially, the data provided will cover a small subsection of the overall rail network; once the benefit is proven, data for a wider area of the network might be provided. The results, however, are not limited to that region or to the UK as a whole, but are applicable to rail-side assets worldwide.

The railway replication pilot plans to obtain information from the Cordoba-Málaga high-speed line, but some of the data sources used in this pilot can also provide information on other high-speed lines of the Spanish rail network. The pilot can therefore be replicated, reusing much of the effort made here, and in the future information on other lines can be extracted in order to optimize their maintenance activities.

5 Conclusions

The conclusions related to the primary pilot are:

The primary pilot parties have engaged with the data provider, and a data usage license is currently being reviewed by the legal teams of all involved parties.

We have a small sample of the data set proposed to be made available to us. While it is not large or complete enough for us to begin work with, it is sufficient to give us a general idea of the kind of data we will be receiving.

Several system architectures have been designed and are being evaluated in parallel whilst we await data.

Next steps:

Decide the work breakdown amongst the involved parties

Obtain data

Ingest data into Big Data repository

Perform canonicalization of data before making it available to all parties

Data scientists to perform analysis of data and propose algorithms to work towards end goals

The conclusions related to the replication pilot are:

At this stage of the project, the replication pilot has defined its objectives for optimizing the maintenance activities of the high-speed rail lines.

We have an initial analysis of the data available for the pilot, with some information about their characteristics. The available information looks very promising for use in the pilot and for fulfilling its objectives.

We have several options for the Big Data infrastructure on which to deploy the pilot at each stage of the project.

The next steps towards the next milestones will focus on three areas:

1. Data: Perform an in-depth analysis of the available data, checking the data quality, the volume of data available from each data source, the specific format of the information and the access characteristics of each data source.

2. Algorithms: Generate the descriptive and predictive algorithms.

3. Big Data infrastructure: Deploy the Big Data platform in the cloud environment to support the algorithm generation process.
