DELIVERABLE
D6.1 – Proactive rail infrastructures pilots design
Project Acronym TT
Project Title Transforming Transport
Grant Agreement number 731932
Call and topic identifier ICT-15-2016-2017
Funding Scheme Innovation Action (IA)
Project duration 30 Months [1 January 2017 – 30 June 2019]
Coordinator Mr Chris Byron (Thales)
Website www.transformingtransport.eu
Document fiche
Authors: Ángel Pérez Bartolomé [INDRA], Jesús Miguel Alonso Pérez
[FERROVIAL], Miguel Rodríguez Plaza [ADIF], Philip Flynn
[Thales]
Internal reviewers: Jarno Kanninen [MATT]
Klaus Julian Eggemann [FHG]
Work Package: WP6
Task: T6.2 / T6.3
Nature: R
Dissemination: PU
Document History
Version | Date | Contributor(s) | Description
1.0 | 14/03/2017 | INDRA: Ángel Pérez Bartolomé; ADIF: Miguel Rodríguez Plaza, Álvaro Perez Mansilla; FERROVIAL: Jesús Miguel Alonso Pérez, Pablo Eugenio Fernández Vivanco, Jose Ignacio Jardi Cuerda | Final document for internal review
1.02 | 14/03/2017 | Thales: Philip Flynn | Internal review
Keywords: Rail, rail domain, pilot design
Abstract (few lines): This deliverable reports on the work performed in
WP6/T6.2 “Predictive Rail Asset Management
Pilot” and T6.3 “Predictive High Speed Network
Maintenance Pilot” with respect to the pilot
designs for the rail domain of the
Transforming Transport project.
DISCLAIMER
This document does not represent the opinion of the European Community, and the
European Community is not responsible for any use that might be made of its content. This
document may contain material, which is the copyright of certain TT consortium parties, and
may not be reproduced or copied without permission. All TT consortium parties have agreed
to full publication of this document. The commercial use of any information contained in this
document may require a license from the proprietor of that information.
Neither the TT consortium as a whole, nor a certain party of the TT consortium warrant that
the information contained in this document is capable of use, nor that use of the
information is free from risk, and does not accept any liability for loss or damage suffered by
any person using this information.
ACKNOWLEDGEMENT
This document is a deliverable of the TT project. This project has received funding from the
European Union’s Horizon 2020 research and innovation programme under grant agreement
Nº 731932.
Table of Contents
DEFINITIONS, ACRONYMS AND ABBREVIATIONS
EXECUTIVE SUMMARY
1 OVERALL MOTIVATION AND AMBITIONS FOR PILOT DOMAIN
1.1 PROACTIVE RAIL INFRASTRUCTURES (WP6) OVERVIEW
2 INITIAL PILOT: D6.1: PROACTIVE RAILWAY INFRASTRUCTURE
2.1 REQUIREMENTS
2.2 OBJECTIVES
2.3 USE CASES / SCENARIOS
2.3.1 Use case 1
2.3.2 Use case 2
2.3.3 Use case 3
2.4 DATA ASSETS
2.5 BIG DATA TECHNOLOGY, TECHNIQUES AND ALGORITHMS
2.5.1 Overview and Positioning in BDV Reference Model
2.5.2 Detailed Explanation of Technology, Techniques and Algorithms
2.5.2.1 Technologies
2.5.2.2 Techniques & Algorithms
2.6 BIG DATA INFRASTRUCTURE
2.6.1 Platform architecture
2.6.2 Value view architecture
2.6.3 Conceptual Architecture overview diagram
2.6.4 Logical Architecture diagram
2.6.5 Logical Architecture Detail
2.7 ROADMAP
3 REPLICATION PILOT: PREDICTIVE HIGH SPEED NETWORK MAINTENANCE PILOT
3.1 REQUIREMENTS
3.2 OBJECTIVES
3.3 USE CASES / SCENARIOS
3.3.1 UC1: Prediction of the degradation of switch and crossing elements
3.3.1.1 Objective
3.3.1.2 Input data
3.3.1.3 Work environment
3.3.1.4 Relation with objectives and requirements
3.3.2 UC2: Prediction of the degradation of track profiles
3.3.2.1 Objective
3.3.2.2 Input data
3.3.2.3 Work environment
3.3.2.4 Relation with objectives and requirements
3.3.3 UC3: Prediction of the evolution of the slopes that are part of the railway infrastructure
3.3.3.1 Objective
3.3.3.2 Input data
3.3.3.3 Work environment
3.3.3.4 Relation with objectives and requirements
3.4 DATA ASSETS
3.4.1 Techniques to monitor the infrastructure
3.5 BIG DATA TECHNOLOGY, TECHNIQUES AND ALGORITHMS
3.5.1 Big Data Technology
3.5.1.1 Conceptual model Indra Big Data platform
3.5.1.2 Sofia2 Typical workflow
3.5.1.3 Sofia2 Storage
3.5.1.4 DataFlow (ETL module)
3.5.1.5 Notebooks (Collaborative analytical)
3.5.1.6 Machine Learning
3.5.1.7 Dashboards
3.5.2 Big Data Techniques and Algorithms
3.5.2.1 Goal definition based on business knowledge
3.5.2.2 Data preparation and management
3.5.2.3 Creating analytic models
3.5.2.4 Model Evaluation
3.5.2.5 Deployment: Solution integration with the pilot
3.6 POSITIONING OF PILOT SOLUTIONS IN BDVA REFERENCE MODEL
3.7 BIG DATA INFRASTRUCTURE
3.7.1 INDRA Cloud platform
3.7.2 ADIF Railway Technology Centre Platform
3.8 ROADMAP
3.8.1 Objectives of each stage
3.8.2 Deployment of each stage
3.8.3 Data evolution of each stage
4 COMMONALITIES AND REPLICATION
4.1 COMMON REQUIREMENTS AND ASPECTS
4.2 ASPECTS OF REPLICATION
5 CONCLUSIONS
Table of Figures
Figure 1: Pilot Approach Diagram
Figure 2: Pilot objectives identification process
Figure 3: Sofia2 Big Data Platform used modules
Figure 4: Sofia2 Typical workflow
Figure 5: Sofia2 DataFlow module
Figure 6: Sofia2 DataFlow example diagram
Figure 7: Sofia2 DataFlow monitoring information
Figure 8: Sofia2 Notebook
Figure 9: Sofia2 Notebook Spark graphics
Figure 10: Sofia2 Notebook HIVE graphics
Figure 11: Sofia2 Dashboard types
Figure 12: Methodology based on CRISP-DM
Figure 13: Different kind of models
Figure 14: Algorithm type for each use case
Figure 15: Model creation process
Figure 16: Big Data Value Reference Model
Figure 17: Sofia2 Big Data Cloud Platform Infrastructure
Figure 18: Pilot infrastructure deployment and data evolution
Definitions, Acronyms and Abbreviations
Acronym Title
CO Confidential, only for members of the consortium (including Commission
Services)
CR Change Request
D Demonstrator
DL Deliverable Leader
DM Dissemination Manager
DMS Document Management System
DoA Description of Action
Dx Deliverable (where x defines the deliverable identification number e.g. D1.1.1)
EIM Exploitation Innovation Manager
EU European Union
FM Financial Manager
MSx project Milestone (where x defines a project milestone e.g. MS3)
Mx Month (where x defines a project month e.g. M10)
O Other
P Prototype
PC Project Coordinator
PM partner Project Manager
PO Project Officer
PP Restricted to other programme participants (including the Commission Services)
PU Public
QA Quality Assurance
QAP Quality Assurance Plan
QFD Quality Function Deployment
QM Quality Manager
R Report
RE Restricted to a group specified by the consortium (including Commission
Services)
RUP Rational Unified Process
STEP Standard Technology Evaluation Process
STM Scientific and Technical Manager
TL Task Leader
WP Work Package
WPL Work Package Leader
WPS Work Package Structure
Executive Summary
Big Data will have a profound economic and societal impact in the mobility and logistics
sector, which is one of the most-used industries in the world and contributes approximately
15% of GDP. Big Data is expected to lead to 500 billion USD in value worldwide in the form of
time and fuel savings, and savings of 380 megatons of CO2 in mobility and logistics. With
freight transport activities projected to increase by 40% by 2030, making the current
mobility and logistics processes significantly more efficient will have a profound
impact. A 10% efficiency improvement may lead to EU cost savings of 100 BEUR. Despite
these promises, only 19% of EU mobility and logistics companies employ Big
Data solutions as part of value creation and business processes.
The Transforming Transport project will demonstrate, in a realistic, measurable, and
replicable way the transformations that Big Data will bring to the mobility and logistics
market. To this end, Transforming Transport validates the technical and economic viability of
Big Data to reshape transport processes and services to significantly increase operational
efficiency, deliver improved customer experience, and foster new business models.
Transforming Transport will address seven pilot domains of major importance for the
mobility and logistics sector in Europe: (1) Smart Highways, (2) Sustainable Vehicle Fleets, (3)
Proactive Rail Infrastructures, (4) Ports as Intelligent Logistics Hubs, (5) Efficient Air
Transport, (6) Multi-modal Urban Mobility, (7) Dynamic Supply Chains. The Transforming
Transport consortium combines knowledge and solutions of major European ICT and Big
Data technology providers together with the competence and experience of key European
industry players in the mobility and logistics domain.
1 Overall Motivation and Ambitions for Pilot Domain
1.1 Proactive Rail Infrastructures (WP6) Overview
The pilot will be delivered to overcome the following major barriers to moving to a
predict-and-prevent rail maintenance approach. Rail infrastructures involve a complex supply
chain between equipment manufacturers, maintainers and operators. Historically, it has
proved problematic to acquire long-term performance (maintenance, fault) data, to understand
the quality, accuracy and provenance of the largely unstructured data, to process it to
identify emerging faults (diagnostics), and to disseminate useful, timely prognosis
information with known confidence levels for preventive and predictive maintenance.
This pilot will apply Big Data predictive and prescriptive analytics to a UK national rail route,
to reduce the long term cost of maintenance and increase network availability through the
facilitation of focused short and medium term proactive interventions. The Pilot will be run
on historical and real-time data sources.
The objective of this pilot is to improve the reliability of high speed rail networks by
optimising the operator's performance and the maintenance of the rail infrastructure. The pilot
will consist of applying Big Data technologies to process the heterogeneous data collected,
in order to understand the variables that affect the operator's performance and to
model the nature of the maintenance incidents that occur in the infrastructure (tracks,
tunnels, bridges, etc.), based on rail traffic, rolling-stock flows, maintenance data, planning &
control data and other information sources.
This analysis will also include external variables such as weather forecasts or specific events
(e.g. summer holidays) to anticipate maintenance activities on the rail network and therefore
improve maintenance operations across the whole rail infrastructure. Moreover,
this solution might also allow rail operators to predict in real time the impact of certain
events on traffic management.
2 Initial Pilot: D6.1: Proactive Railway Infrastructure
2.1 Requirements
The high level requirement for this pilot is to provide insights and information to key
decision makers such that maintenance of rail assets can be optimised to:
• Improve the safety of rail workers by minimising the time spent trackside.
• Improve the safety of rail passengers by identifying trends toward asset failure.
• Improve the reliability of services by minimising downtime and service interruptions.
• Improve cost efficiency by prioritising preventative maintenance on assets based on a number of factors.
• Improve service capacity by minimising disruption.
This pilot is intended to prove whether or not the application of ‘Big Data’ techniques can
help to achieve one or more of the above optimisations. The diagram below shows how the
project’s high level requirements cascade into the pilot’s requirements and subsequently
into the challenges being addressed.
High level requirements | Pilot requirements | Pilot challenges
Improve Safety | Reduction in frequency of attended maintenance | TBD% reduction in frequency of attended maintenance
Improve Reliability | Reduction in downtime for assets | TBD% reduction in downtime for assets
Improve Cost Efficiency | Reduction in maintenance based cost | TBD% reduction in maintenance based cost
Improve Capacity | Reduced number of cancellations and delays | TBD% reduced number of cancellations and delays
To meet these challenges the pilot deliverable must answer the following questions:
• When will an asset fail?
• What trends can be identified based on historical data sources?
• What accuracy of data is required in order to produce a likely prediction?
• What relationships exist between:
  o Data sets? E.g. weather, forecast timetable, vehicle category etc.
  o Asset failures? E.g. model/manufacturer related, dependency related
• Given a failure, can another cascading failure be predicted soon after?
• What is the cost incurred/saved by predicting (and subsequently preventing) failures?
2.2 Objectives
The application of new Big Data technology and processes to facilitate the introduction of a step change in maintenance across European rail infrastructures to:
• Verify the quality, accuracy and provenance of asset data, leading to confidence to
• Provide timely focused prioritised maintenance activities (predict and prevent), leading to
• Improved reliability and availability of track-side assets, with
• Higher availability of rail infrastructure for passenger and freight services; and
• Enhancing worker safety through minimising track-side activities
2.3 Use Cases / Scenarios
Three use cases have been identified by industry experts working in conjunction with
Network Rail. They address contributors to which the majority of delays to mainline rail
services in the UK and Europe have been attributed. By considering the impact that can be
made on these three significant contributors, it is hoped that data analytics can be used to
reduce the delays and disruption incurred by users, increase the safety of both users and
trackside staff, reduce costs incurred through both penalties and unnecessary maintenance,
and improve capacity.
2.3.1 Use case 1
Name Asset Maintenance and Renewal Analysis Overhead Line Equipment (OLE)
Objective Provide evidence to allow an Infraco to inform the OLE maintenance regime and renewal programme based upon measurable, quantifiable analysis of OLE condition and expected life.
Overview Description
Provide information to improve the maintenance and renewal schedules for Overhead Lines.
Intended Benefits
Improved quantifiable models for maintenance/renewal activities in order to:
- identify conditions known to impact upon the reliability and availability of the system
- minimise risk of disruption due to unexpected incidents (maintain uptime)
- demonstrate the effect of intervention (appropriate renewal)
- improve availability due to reduced OLE failure
Approach
Collect, prepare, analyse and visualise using big data techniques the factors influencing OLE maintenance and renewal. Analysis of OLE by:
- geo-spatial location, tunnels/cuttings/embankments/open
- weather effects (corrosion due to humidity, temperature, precipitation, ice, wind, lightning etc.)
- usage (km, no. of passes; train type/frequency; passenger/freight; speed)
- power feed/voltage
- component/manufacturer e.g. contact strips
- age; wear/remaining life
- stray current
- fault types (schedule 8)
- temporary speed restrictions (TSR)
- maintenance history and schedule
- [Pancheck data, Video data]
- track geometry
Key Contacts / Regions
Steve Hooker - East Anglia
Paul Barnes - Great Western
Simon Taylor – systems integration; LNW; Virgin on-train data
Osman Maruf (via C. Lowe) - centre
Data Sources / Volumetrics
Requirement that the data sources are consistent and co-incident i.e. cover the same area and time.
- Yellow (New Measurement) Train OLE and track geometry measurement data
- Pantograph instrumentation (passenger trains)
  o data and video analytics
- Overhead line maintenance and renewal history
- Fault (FMS) and Maintenance (Ellipse) records
- OLE asset data (Ellipse/OLE-EX)
- Environmental (historical precipitation, temperature)
- Usage (passenger/freight – type/loading; performance characteristics; electric/diesel) [timetable & movements]
- Pancheck data
- Schedule 8 payments (delay minutes/incident) including Temporary Speed Restrictions
[Other relevant data sources to be confirmed in visit to HQ data team via C. Lowe]
Standards
Relevant standards to be agreed.
Risks
1. Data alignment problems (different data sets collected from separate areas) – unable to confirm correlations
2. Analysis is not sufficiently complete to feed into Anglia OLE renewals process during Mid-2017 (opportunity cost to Network Rail)
3. Failure to confirm causality rather than correlations from big data techniques
Corroboration/ Confirmation of Progress and Results
Key stakeholders: S. Hooker; Forum: OLE-EX forum (M. Dobbs (LNW E&P); P. Doughty (Head Contact Systems); P. Barnes (Western E&P))
Repeatability
Demonstration that the results are applicable to at least one UK route, can be extended to several UK routes and are repeatable in European rail infrastructures.
Related Work Relevant work and publications by:
- Network Rail – internal programme results – Project Insight
- Network Rail / University of Oxford (video analytics of pantograph)
- JR-East 2013 using train based laser inspection
- NeTIRail 2016 (H2020) University of Sheffield. ADS-Electronic Research (geo-spatial analysis)
- Railway Reliability Data Handbook (Network Rail)
Quick Wins
- Improve Temporary Speed Restriction process and reduce TSR
- Explore effect around Neutral Sections
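The data source requirement in use case 1, that sources be consistent and co-incident (cover the same area and time), can be checked before any analysis is attempted. The following is a minimal Python sketch, assuming each source can be summarised by a track extent in kilometres and a date range. The `Coverage` class, its field names and the sample extents are hypothetical and are not taken from the pilot data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Coverage:
    """Spatial and temporal extent of one data source (hypothetical fields)."""
    name: str
    km_from: float
    km_to: float
    start: date
    end: date


def common_coverage(sources: List[Coverage]) -> Optional[Tuple[float, float, date, date]]:
    """Return the (km_from, km_to, start, end) window shared by all sources.

    Returns None when the sources do not overlap in space or in time.
    """
    km_from = max(s.km_from for s in sources)
    km_to = min(s.km_to for s in sources)
    start = max(s.start for s in sources)
    end = min(s.end for s in sources)
    if km_from >= km_to or start > end:
        return None
    return km_from, km_to, start, end


# Illustrative extents only; real values would come from the data owners.
sources = [
    Coverage("measurement_train", 0.0, 120.0, date(2015, 1, 1), date(2016, 12, 31)),
    Coverage("fault_records", 30.0, 200.0, date(2014, 6, 1), date(2016, 6, 30)),
    Coverage("weather_history", 0.0, 150.0, date(2010, 1, 1), date(2016, 12, 31)),
]
window = common_coverage(sources)
print(window)
```

Restricting the analysis to the shared window is one way to reduce the data alignment risk listed for this use case, at the cost of discarding data outside it.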
2.3.2 Use case 2
Name Asset Failure Diagnosis and Prognosis For Points and Track Circuits
Objective Assist Route based Flight Engineer and asset maintenance teams to quantifiably assess the risk of an imminent or future asset failure, based upon the asset’s as-is health state, known fault symptom diagnosis and expected further factors influencing the rail system’s operation (system-level prioritisation of maintenance interventions).
Overview Description
Determine from the real-time prognosis future risk curve whether maintenance needs to intervene in traffic, wait for engineering hours, or inspect the asset at the next scheduled maintenance period. Provide clear, quantified, diagnostic information for the on-site maintenance team. Provide evidence to support exploitation of existing data sources, enabling improved decision making, improvement of maintenance regimes and prevention of asset failure. Focus on mechanical assets, primarily points operating equipment and electrical operation of track circuits. Analyse potential root causes of service-affecting failures.
Intended Benefits
Improved quantifiable confidence (for maintenance action) in the earlier diagnosis of emerging and fully developed asset fault symptoms. Support improved decision making, improvement of maintenance regimes and prevention of asset failure. Provides evidence for the safe targeted maintenance of individual assets or groups of assets.
Approach
Undertake large scale, high volume, multi-sourced relevant (track and train) data set analysis to create validated algorithms for diagnostics and prognostics that could be deployed onto the operational railway. Leverage existing operational expertise to provide confidence in approach and outputs.
Key Contacts / Regions
P. Barnes – Western Route
S. Hooker - Anglia Route
Forum (Signalling function) to present, discuss, feedback and suggest further research. Organised through C. Lowe.
Data Sources / Volumetrics
Datasets are available for the approach:
- Diagnostics – Intelligent Infrastructure, FMS (fault) and historical weather data; minimum of one year of data across the Route assets.
- Prognosis – as a minimum asset usage (movements, train type/freight vehicle), location, historical fault/maintenance records.
- Associated Interlocking data sets
Standards
Relevant standards to be agreed.
Risks
1. Extraction and correlation of II (symptom) data with related FMS (fault) may be difficult to achieve. Limited previous analysis in this area has produced positive correlation for fully emerged fault symptoms (not emerging). Identified that FMS entries may be of poor quality for this purpose.
2. The success of II in reducing on-the-day faults will have skewed some of the data, i.e. corrective maintenance has taken place before a fault occurred. A more granular level of analysis against pre-II (control) conditions will therefore be needed.
3. Measurement of delay savings is difficult to quantify accurately (prevention of event) and hence determination of ultimate value of this approach could be seen as unreliable.
4. The Network Rail business and safety case for changes to the maintenance regime requires clear, rigorous evidence; hence the data collection and analysis process needs to be proven and reliable to a high standard throughout (including cross-correlation and the capability to deal with incomplete or inaccurate data).
Corroboration/ Confirmation of Progress and Results
Anglia Route monitors point and track circuit assets using Intelligent Infrastructure. Target first time fix rate stated as 85%. Initial corroboration required is that similar levels of diagnostic accuracy can be achieved using automated techniques within reasonable timeframes. The next diagnostic checkpoint is to identify potential faults earlier using emerging symptoms (trends). Metrics for prognosis accuracy to be developed.
Repeatability
Each UK route uses Intelligent Infrastructure. Results on a single Route to be compared with different Routes with alternative operational challenges and mix of assets. Model to be extrapolated to European mainline railways.
Related Work PCIPP prognosis outputs from University of Nottingham to be evaluated for use. Diagnostic algorithmic approaches from Thales TRT and PCIPP (University of Birmingham - UoB) to be considered. Other background research commissioned by Network Rail with the UoB.
Quick Wins
- Effect of temperature on operation
- Interaction of water with track circuits
- Correlation of earth leakage and track circuit operation
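The overview of use case 2 describes choosing between intervening in traffic, waiting for engineering hours, or inspecting at the next scheduled maintenance period on the basis of a prognosis risk curve. The Python sketch below illustrates one possible form of that decision rule. It is not the pilot's algorithm: the curve representation (pairs of hours ahead and failure probability, sorted by time), the single risk threshold, the function names and all numbers are assumptions made for illustration. It also assumes that engineering hours begin no later than the next scheduled maintenance.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (hours ahead, probability of failure by then),
# sorted by hours ahead.
RiskCurve = List[Tuple[float, float]]


def hours_until_risk(curve: RiskCurve, threshold: float) -> Optional[float]:
    """First horizon at which the predicted risk reaches the threshold, else None."""
    for hours, probability in curve:
        if probability >= threshold:
            return hours
    return None


def recommend(curve: RiskCurve, threshold: float,
              hours_to_engineering: float, hours_to_scheduled: float) -> str:
    """Pick the least disruptive intervention that still precedes the risk horizon."""
    horizon = hours_until_risk(curve, threshold)
    if horizon is None or horizon > hours_to_scheduled:
        return "inspect at next scheduled maintenance"
    if horizon > hours_to_engineering:
        return "wait for engineering hours"
    return "intervene in traffic"


# Illustrative curve: risk reaches 20% twelve hours ahead.
curve = [(0, 0.01), (6, 0.05), (12, 0.20), (24, 0.50)]
decision = recommend(curve, 0.20, hours_to_engineering=8, hours_to_scheduled=72)
print(decision)
```

In practice the threshold would need to be agreed with the asset owner, and the confidence in the curve itself (risk 3 and risk 4 above) would also bear on the decision.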
2.3.3 Use case 3
Name Assessing the value of passenger-train monitoring for track maintenance
Objective Improving asset management decisions based upon passenger-based train / track measurement. Assess whether advanced analytics can increase the value of UGMS data in support of better track maintenance.
Overview Description
Provide information to improve the maintenance schedules for track-bed.
Intended Benefits
Prioritisation of maintenance focus, based upon clearer picture of current and future track conditions. More readily identify poor track quality and deteriorating conditions.
Approach
Collect, prepare, analyse and visualise, using big data techniques, the factors indicating changes in the state of the track. Make more frequent (but less accurate) track measurements available to inform track maintenance decisions. The Yellow (New Measurement) Train provides data that is less frequent (but more accurate) than that available from passenger trains running over the same infrastructure. The more frequent passenger-train data could allow more rapid identification of, and response to:
- changing track bed conditions
- rail wear
- geometry deterioration
Key Contacts / Regions
- Paul Barnes - Great Western (Future, 2018, UGMS operation)
- Simon Taylor – LNE (Historical UGMS data)
- Brian Whitley and Patric Mak via C. Lowe
Data Sources / Volumetrics
- Yellow (New Measurement) Train track geometry data
- UGMS (Unmanned Geometry Measurement Systems) from timetabled passenger trains
- Environmental (historical precipitation, temperature)
- Track geometry, material, age
- Tamping history
- Usage (passenger/freight – type/loading; performance characteristics) [timetable & movements]
- Eddy current data
Standards
Alignment with emerging T1010 recommendations and future RSSB open data standards.
ISO 13374.
Risks
1. Data alignment problems (different data sets collected from separate areas) – unable to confirm correlations
2. Access to data in a timely fashion
3. Unable to distinguish between root cause and symptoms
4. Insufficient asset specialist input into requirements and review of output
Corroboration/ Confirmation of Progress and Results
Correlation between NMT and UGMS / big data results
Forum contacts: Brian Whitney & Patric Mak via C. Lowe
Repeatability
Applicable to any UK route. No expectation that the approaches generated would be restricted to the UK.
Related Work
Note work done by:
1. University of Huddersfield, Institute of Railway Research (IRR), Omnicom and University of York in RAMP demonstrator (RSSB/Innovate UK)
2. University of Huddersfield, IRR, Siemens – Tracksure for under-track voids (RSSB funded)
3. Perpetuum track condition monitoring South East Performance Improvement Project (Real-time track condition monitoring using passenger train mounted accelerometers)
4. International track recording services e.g. Mermec (Italy), Fugro (NL)
5. RSSB/T1010 RCM
6. TfL (Acton) big data analysis for trains
Quick Wins
Early analysis results to support Western’s strategy in the use of UGMS.
2.4 Data Assets
ID | Use Case | Data Asset Name | Short Description | Expected Use | Initial Availability Date | Data Type | Link to Data Card (in Basecamp)
1 | 1 & 2 | ELLIPSE | Information about dates and locations of maintenance and renewal of OLE and trackside equipment | To assess how points of maintenance and renewal affect asset condition and other measurements. | TBD | Suspected SQL Server | TBD
2 | 1 & 2 | FMS | Fault Management System. Information about equipment failures, their time, and their location | To assess relationship between failure and other data. | TBD | Suspected SQL Server | TBD
3 | 1 & 2 | ITPS/ACTRAFF | Integrated Train Planning System (Timetable). Information about the type and extent of asset usage. | To assess relationship between usage and condition of the asset. | TBD | Suspected SQL Server | TBD
4 | 1 & 2 | TRUST | Train movement reports. Information about asset failures and associated delay. | To assess relationship between failure and performance risk relative to other parts of the network. | TBD | Suspected SQL Server | TBD
5 | 1 | OLE_EX | OLE heights, staggers, force, acceleration, GPS information. | To assess whether there is any correlation between measured data and failure events. To assess whether there is any correlation between expected wire wear and actual wire wear. To compare LADS and OLE-EX processed data. | TBD | Suspected SQL Server | TBD
6 | 1 | OLE_EX | OLE asset data such as structures/midspans/neutral section and location of OLE | To support identification of the location of problem | TBD | Suspected SQL Server | TBD
7 | 1 | LADS | OLE heights, staggers and GPS information. | To assess whether there is any correlation between measured data and failure events. To assess whether there is any correlation between expected wire wear and actual wire wear. To compare LADS and OLE-EX processed data. | TBD | Suspected SQL Server | TBD
8 | 1 | 379 In-service OLE | TBC | To identify places where contact is lost and compare with other OLE monitored data and track geometry information. | TBD | Suspected SQL Server | TBD
9 | 1 | SSI | Train movements, speeds, and other related information | To allow comparison between actual train movements, asset health, and faults raised | TBD | Suspected SQL Server | TBD
10 | 1 | NRWS | Information about air temperature. | To compare the air temperature to the actual temperature close to the OLE; if the OLE temperature is lower than the measured air temperature we can infer a threshold of "safe" performance. | TBD | Suspected SQL Server | TBD
11 | 1 | OLE temperature monitoring | Information about the temperature within the vicinity of the OLE. | To compare the air temperature to the actual temperature close to the OLE; if the OLE temperature is lower than the measured air temperature we can infer a threshold of "safe" performance. | TBD | Suspected SQL Server | TBD
12 | 1 | OLE ALP | Information about how the OLE degrades over time. | To compare the "theoretical" life of the OLE with the "actual" life of the OLE based on usage. | TBD | Suspected SQL Server | TBD
13 | 1 | Network Model | SRS for each 5 chain length | To enable the performance risk to be managed. | TBD | Suspected SQL Server | TBD
14 | 1 & 3 | Yellow Train Track Geometry Data | High quality OLE, track gauge and ride quality measurements | To allow trends to be identified between OLE and other geometrical outliers and faults raised. | TBD | Suspected SQL Server | TBD
15 | 2 | Intelligent Infrastructure (II) | The II system contains asset monitoring information such as detailed point swing readings | To allow trends to be identified between asset types, makes, models etc. in conjunction with FMS, weather and other data sources | TBD | Wonderware InData database, exposed through SQL Server connector | TBD
16 | 1, 2, & 3 | MetDesk | Historical UK weather readings local to track assets going back over the last 10 years | To allow trends between the weather and asset failures to be identified | TBD | Suspected SQL Server | TBD
17 | 1, 2, & 3 | Earthworks Failure | 10 years of approx. 1500 events | To allow trends between weather, earthworks, and asset failures/health changes to be identified | TBD | Suspected SQL Server | TBD
Funded by the European Union’s H2020 GA - 731932
2.5 Big Data Technology, Techniques and Algorithms
2.5.1 Overview and Positioning in BDV Reference Model
Data Visualization and User interaction
The form of the output is still to be determined and will depend on the value and output of the
prognostic/predictive algorithms and models. Results may be presented, for example, as 2D or 3D
visualisations, depending on what is necessary.
D6.1 – Proactive rail infrastructures pilots design
Version 1.0, 15/03/2017
Data Analytics
Data analytics will be based around a process described as follows:
Data Processing Architectures
Data will likely need to be processed in multiple ways in order to answer the questions presented,
including batch and streaming (real-time) processing. The programme will utilise big data processing
tools in the cloud platform to accomplish these goals, e.g. U-SQL, Pig, Kafka, MapReduce etc.
Data Management
Data will be collected, handled, and processed in accordance with the work package data
management plan.
Existing
The programme will consume existing data supplied by the data providers. This data will be
hosted in a managed cloud environment.
2.5.2 Detailed Explanation of Technology, Techniques and Algorithms
2.5.2.1 Technologies
The technologies chosen for this work package will be mostly from the Hadoop ecosystem, as
the de facto standard for big data analytical processing. Thales will be making use of the
Microsoft Azure platform, whose big data services are based upon or compatible with Hadoop.
The Azure services are fully cloud based, with a minimum supported SLA of 99.9% for all
components, and provide a scalable, fully hosted solution with security features built in.
Key technologies to be used:
Storage & Analytics
Azure Data Lake – HDFS backed distributed storage with potentially limitless storage and
scalable performance.
HBase or Cassandra – Distributed multidimensional column-oriented databases,
capable of linear scaling and of storing and retrieving petabytes of data extremely
quickly.
Data Lake Analytics – MapReduce based jobs capable of performing analytical
calculations upon huge quantities of schema-less data
SQL Server – RDBMS developed by Microsoft, designed to store relational data
Azure Machine Learning – Apache Mahout based Machine Learning service. Used to
automatically categorise and suggest relations based upon a training set of data
Languages
R – Statistical and graphical software library which is ideal for manipulation of data and
creating machine learning models.
C# .NET – Mature and widely used C-family language within the Microsoft .NET family of
technologies.
Testing, Development & Management
SpecFlow – Automated test tool that binds requirements to functionality written in .Net
languages
Visual Studio Team Services – Tools for providing Agile working practices, code source
control, and continuous integration functionality.
PivotalTracker – Task management tool
Messaging, Orchestration & Service Bus Technologies
AMQP compatible message queues – open standard for message queue services
Azure Event Hub – Telemetry ingestion from website, apps, or streams of data
Azure Notification Hub – Allows push notifications to be sent to any platform
Azure Data Factory – Composition and orchestration of data services in Azure
UI
Microsoft PowerBI – Interactive Data Visualisation Tools
Universal Windows Platform – Allows single user interface over all devices running
Windows 10
2.5.2.2 Techniques & Algorithms
Supervised machine learning algorithms have the potential to process huge quantities of data
to identify trends and categorise data on a previously impossible scale, and in near real-time.
Models are created using between 70% and 80% of a known dataset and then validated using
the remaining 20% to 30%.
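The hold-out split described above can be sketched as follows. This is a minimal illustration in Python, not the pilot's implementation: the function name `split_dataset`, the sample values and the 75% training fraction are assumptions chosen only to show the mechanics.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a labelled dataset and split it into training and validation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labelled samples: (point swing time in seconds, label).
samples = [(2.0 + 0.1 * i, "fault" if i % 5 == 0 else "ok") for i in range(100)]

train, validation = split_dataset(samples, train_fraction=0.75)
print(len(train), len(validation))  # 75 25
```

The model is fitted on `train` only; `validation` is held back so that accuracy is measured on records the model has not seen.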
MapReduce was developed by Google for processing very large amounts of schema-less
data in a highly parallelised architecture. Google described the model in a paper published in
2004, open-source implementations followed (notably Apache Hadoop), and it is now a key part
of the ‘Big Data’ landscape. The MapReduce functionality is a major component of the Hadoop
ecosystem. The concept itself is very simple, but it is capable of very complicated aggregation
and what-if style processing. MapReduce jobs are likely to be a very prominent part of this
work package.
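The MapReduce concept can be illustrated in a few lines of single-process Python. This is only a sketch of the map, shuffle and reduce phases, not a Hadoop or Azure job; the fault records and the function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    asset_id, _fault_type = record
    yield (asset_id, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


# Hypothetical fault records: (asset identifier, fault type).
records = [
    ("points-A", "obstruction"),
    ("tc-7", "earth leakage"),
    ("points-A", "brush wear"),
    ("points-A", "obstruction"),
]

pairs = (pair for record in records for pair in map_phase(record))
fault_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(fault_counts)  # {'points-A': 3, 'tc-7': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes and the framework performs the shuffle; the logic per record and per key stays this simple.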
Traditional star schema data warehouses were designed to allow the aggregation of large
amounts of data very quickly. They are flexible, widely understood and easily ingested by a
huge range of reporting tools. In contrast, multi-dimensional columnar store NoSQL databases
are designed to hold hundreds of millions of rows of data or more, but are quite specific in
how data can be accessed. Their performance is linearly scalable and, while they don’t offer full
ACID guarantees, they generally offer tuneable settings to meet or more closely approach the
requirements. It is highly likely that both a star schema warehouse and a columnar database
will be required. It is also likely that a traditional RDBMS will be required for comparatively
small amounts (sub 10 million rows) of relational data.
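A star schema can be sketched with one fact table and two dimension tables. The example below uses Python's built-in `sqlite3` module purely for illustration (the deliverable names SQL Server as the likely RDBMS); all table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the 'who/what/when' of each fact.
    CREATE TABLE dim_asset (asset_key INTEGER PRIMARY KEY, asset_type TEXT, route TEXT);
    CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds one row per fault, keyed to the dimensions.
    CREATE TABLE fact_fault (
        asset_key     INTEGER REFERENCES dim_asset(asset_key),
        date_key      INTEGER REFERENCES dim_date(date_key),
        delay_minutes REAL
    );
""")
conn.executemany("INSERT INTO dim_asset VALUES (?, ?, ?)",
                 [(1, "points", "Anglia"), (2, "track circuit", "Anglia")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20150101, 2015, 1), (20150201, 2015, 2)])
conn.executemany("INSERT INTO fact_fault VALUES (?, ?, ?)",
                 [(1, 20150101, 12.0), (1, 20150201, 8.0), (2, 20150101, 30.0)])

# Typical warehouse query: aggregate facts grouped by a dimension attribute.
rows = conn.execute("""
    SELECT a.asset_type, COUNT(*) AS faults, SUM(f.delay_minutes) AS total_delay
    FROM fact_fault AS f
    JOIN dim_asset AS a ON a.asset_key = f.asset_key
    GROUP BY a.asset_type
    ORDER BY a.asset_type
""").fetchall()
print(rows)  # [('points', 2, 20.0), ('track circuit', 1, 30.0)]
```

Reporting tools issue queries of this shape, which is why the star layout is easy for them to ingest; a columnar NoSQL store would instead require the access pattern to be designed into the row key up front.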
2.6 Big Data Infrastructure
Note: The architectural diagrams in this section may change and evolve as more information
becomes apparent through the research and progress of this work package.
2.6.1 Platform architecture
The image below presents a conceptual platform architecture. Data ingested on the left
becomes far more useful and actionable as more intelligence is applied, resulting in information
to back business decisions.
2.6.2 Value view architecture
The diagram below illustrates how value grows as data passes through each phase of the data
manipulation process.
Figure: Thales Analytics Platform, Value View. Diagram elements: Thales HSM key store; TT secure storage; data lake; data factory; schema/data processor (raw data processing of binary and text data); asset data; insights; asset state; ML anomaly detection; asset health assessment; ML failure classifier; asset prediction (RuL and confidence); ML RuL prediction; store processed data; reports.
0. Thales HSM key store – The Thales HSM key store is a military grade Hardware Security
Module which is used to store encryption keys.
1. Customer / Data owner – Customers/data owners hold the key to their data and store it
in a Thales Hardware Security Module. This enables the customer/data owner to take
their key at any point and render all encrypted data useless. This way the customer/data
owner is always in control of their own data.
2. Data store – The customer / data owner uploads their data as a one-off bulk upload, in
regular batches, or by streaming. The data at this point is raw and will never be modified
from its original state. Because the data is write-once, the accuracy of predictions can be
traced back to the original data.
3. Data factory – Data factory is an orchestration tool used to transport and manipulate
data.
4. Schema/data processor – Data will be uploaded in its raw format. This will include many
different forms such as proprietary binary formats, XML, JSON, XLS, CSV, etc. The data
processor manipulates the raw format into a generic canonical format which can be
ingested by subsequent steps.
5. Asset Data – This is data purely about an asset. This is unmodified, but canonicalised
data.
6. Insights - Using the asset data humans can draw insights from viewing the ‘raw’ data in
various graphs and tables.
7. Asset State – Asset States are used to say definitively whether an asset is in a particular
state or not. States might be “Working” and “Broken”, or could be more specific such as
“Brush wear fault”, “Obstruction fault”, “Over voltage fault”.
8. Machine learning anomaly detection – Using known data, models can be created, similar
to a set of complex rules, via a process known as Machine Learning. Machines can then
be used to categorise new data into one of the learnt categories. This can be used to
categorise data into known states, together with a degree of certainty expressing the
machine’s confidence in its decision.
9. Asset Health Assessment – States describe the present; they don’t help to predict or
influence the future. Health assessments, however, typically provide a percentage
indicating how healthy an asset is. This can be used by a user to influence maintenance
patterns. It is not, however, used to predict the future, i.e. no knowledge of future events
is taken into account, only past events.
10. Machine learning failure classifier – Using past knowledge of failures machine learning
can produce health assessments.
11. Asset prediction Remaining Useful Life – By taking into account planned and forecasted
conditions and events, and by knowing how these events have historically affected the
health of an asset, a figure of remaining useful life can be determined. This figure can
then be used to better plan and prioritise maintenance.
12. Machine learning Remaining Useful Life – Machine learning can again be used to
produce these valuable figures. By comparing the upcoming planned and forecasted
events the models are able to produce figures that estimate a remaining useful life
duration.
13. Reports – This is the critical part of the system. If the information isn’t presented in a
useful fashion then the user won’t be able to take effective action. This may be
presented as graphs, tables, etc.
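The step from health assessment (item 9) to a remaining useful life figure (items 11 and 12) can be illustrated with a deliberately simple stand-in for the machine learning models: a straight-line trend fitted to past health percentages and extrapolated to a failure threshold. The function names, the 20% threshold and the sample values are assumptions for illustration only; the pilot's own models are expected to use planned and forecasted events as well.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remaining_useful_life(days, health_pct, failure_threshold=20.0):
    """Days from the last observation until the fitted trend reaches the threshold.

    Returns None when the fitted health trend is not declining.
    """
    slope, intercept = fit_line(days, health_pct)
    if slope >= 0:
        return None
    crossing_day = (failure_threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical health assessments (percent) taken every 30 days.
days = [0, 30, 60, 90]
health = [100.0, 94.0, 88.0, 82.0]

rul = remaining_useful_life(days, health)
print(f"Estimated remaining useful life: about {rul:.0f} days")
```

With these sample values the health falls by 6 points every 30 days, so the fitted trend reaches the 20% threshold roughly 310 days after the last observation.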
2.6.3 Conceptual Architecture overview diagram
Figure: conceptual architecture overview. Diagram labels: WP6; Data Streams; Data Files; External Input Interfaces; External Output Interfaces; HMI; Users.
2.6.4 Logical Architecture diagram
Figure: WP6 logical architecture. Diagram labels: Data Streams; Data Files; External Input Interfaces; Data Acquisition; Data Manipulation; Canonical data; State Detection; Health Assessment; Prognosis Assessment; Alerts; % Health Used / Remaining; X days remaining; HMI; Users; External Output Interfaces.
2.6.5 Logical Architecture Detail
Figure: WP6 logical architecture detail, laid out across the stages Data Acquisition, Data Manipulation, State Detection, Health Assessment and Prognosis Assessment, between the External Input Interfaces and the External Output Interfaces. Key: Machine Learnt Intelligence; HMI & Reporting; Data repository. Annotations recoverable from the diagram:
- Ingestion: input files arrive as batch binary, batch text, or streaming JSON/XML. A queue feeds an aggregator, which groups individual messages into files of multiple messages, saves the bundle files and passes them to the parser to be ‘canonicalised’. The queue is also consumed by streaming analytics.
- Raw data: saved/moved and persisted as zipped files containing raw data on cheap storage, likely never to be read again. Stored in folder structures as /Source/YYYY/MM/DD
- Grey data: the parser produces canonical JSON for the entire document/message, appending or splitting data where necessary into few, large hourly files. Stored in folder structures as /Source/YYYY/MM/DD/YYYYMMDD_HH.JSON
- Golden data: batched golden data generation persists data to a star schema DW, time series, multi-dimensional, document, key-value or other stores as appropriate.
- Big data analytics: MapReduce jobs and statistical analysis jobs consume canonical data and golden set data. Outputs: near real-time reporting; KPI, historical and batch reporting; one-off reports; recurring reports.
- Machine learning, model creation: consumes canonical data and produces a new model.
- Machine categorisation and trending, model usage: takes batched and near real-time data to categorise and extrapolate, and outputs categorised / trend prediction information.
- Unsupervised machine learning, model update: takes batched and near real-time data to update the machine learnt model, and produces an updated model.
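The hourly folder convention shown in the Logical Architecture Detail (/Source/YYYY/MM/DD/YYYYMMDD_HH.JSON) can be sketched as follows. This is an illustrative Python outline under stated assumptions: the function names, the source name "II", the message fields, and the choice of one JSON document per line inside each hourly file are all hypothetical and not taken from the deliverable.

```python
import json
from datetime import datetime


def canonical_path(source, timestamp):
    """Build /Source/YYYY/MM/DD/YYYYMMDD_HH.JSON for the hour containing `timestamp`."""
    return f"/{source}/{timestamp:%Y/%m/%d}/{timestamp:%Y%m%d_%H}.JSON"


def bundle_by_hour(source, messages):
    """Group canonical messages into one payload per hourly file.

    Assumption: each message becomes one JSON document per line.
    """
    bundles = {}
    for message in messages:
        ts = datetime.fromisoformat(message["timestamp"])
        line = json.dumps(message, sort_keys=True)
        bundles.setdefault(canonical_path(source, ts), []).append(line)
    return {path: "\n".join(lines) for path, lines in bundles.items()}


# Hypothetical canonicalised point-swing messages.
messages = [
    {"timestamp": "2015-03-14T09:26:00", "asset": "points-A", "swing_s": 2.4},
    {"timestamp": "2015-03-14T09:58:10", "asset": "points-A", "swing_s": 2.6},
    {"timestamp": "2015-03-14T10:02:00", "asset": "points-A", "swing_s": 2.5},
]

bundles = bundle_by_hour("II", messages)
for path in sorted(bundles):
    print(path)
# /II/2015/03/14/20150314_09.JSON
# /II/2015/03/14/20150314_10.JSON
```

Keeping a few large files per hour, rather than one file per message, suits distributed file systems, which tend to handle a small number of large files better than many small ones.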
2.7 Roadmap
Stage | Delivery Date (Project Month) | Features / Objectives Addressed | Embedding in Productive Environment | Big Data Infrastructure Used | Scale of Data
S1: Technology Validation | 9 | Validate the quality of the input data obtained. Confirm the feasibility of the initial objectives | Microsoft Azure Cloud environment – development resource group | Microsoft Azure Cloud environment | 2013 – 2015, Milton Keynes to Waterloo, partial number of data sets
S2: Large-scale experimentation and demonstration | 18 | Validate the quality of the developed algorithms and determine their reliability | Microsoft Azure Cloud environment – ‘Production’, high power, resource group | Microsoft Azure Cloud environment | 2013 – 2015, Milton Keynes to Waterloo, all data sets
S3: In-situ trials | 27 | Validate the quality of the developed algorithms and determine their reliability | Microsoft Azure Cloud environment – ‘Production’, high power, resource group | Microsoft Azure Cloud environment | 2013 – 2015, Milton Keynes to Waterloo, all data sets
3 Replication Pilot: Predictive High Speed Network Maintenance Pilot
Several maintenance strategies can be applied to railway infrastructure. Traditionally, two main
approaches have been considered:
a) Preventive maintenance is maintenance carried out according to prescribed criteria, with the
objective of reducing the probability of failure or the degradation of elements.
There are different types:
- Systematic and periodic.
- Conditional, triggered when an established limit is exceeded or by the wear of the systems.
Preventive periods are defined by three factors:
- Reliability of the materials.
- Characteristics of its functionality.
- Level of degradation, which depends on time and on the number and speed of train movements.
We can define three different types of preventive maintenance:
- Scheduled preventive maintenance, where reviews are carried out at set time intervals.
- Predictive maintenance, based on equipment reliability. It tries to determine the moment at
which repairs must be made, by means of monitoring that determines the maximum
period of use before repair.
- Opportunity maintenance, which is carried out taking advantage of periods of non-use,
thus avoiding stopping railway operation while the assets are in use.
b) Corrective maintenance, which is based on identifying and rectifying breakdowns that have
already happened.
This type of maintenance is carried out after an incident in order to return the asset to its original status.
There are two types of corrective maintenance:
- Programmed: when the works are carried out during the maintenance time.
- Not programmed.
We can consider two types of interventions:
- Palliative: when the repair is an emergency repair, not a definitive one, normally because the
service needs to be restored quickly.
- Solving: when the repair is definitive.
However, the sector needs to optimize the whole process effectively, in line with the
requirements and the rising competitiveness of the infrastructure market. New strategies
therefore need to be developed, based on real data and on the specific conditions of the
infrastructure location.
Taking the existing needs into account, Predictive Maintenance Strategies have been identified
as able to provide the framework necessary to reach the optimization objectives required
by the sector. In this context, Big Data technologies are essential to provide the platform on
which to develop accurate models and to support the new strategy to be implemented.
Our pilot focuses on predictive maintenance as one of the preventive maintenance types.
In the railway domain, predictive maintenance is the set of programmable activities designed to
maintain track geometry quality and to ensure the proper operation of the elements of the
superstructure.
Some of the objectives of predictive maintenance are:
- To detect causes that can induce problems.
- To decide the suitable moment to review or replace an element.
- To reduce the unavailability periods of an element.
Predictive maintenance is based on:
- The selection of specific parameters to be measured.
- The evaluation of the range of admissible values.
- The devices and the control method used to measure each parameter.
- The definition of the measurement frequency for each parameter.
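The four ingredients listed above (parameter, admissible range, measuring device and measurement frequency) can be captured in a small data structure. The following Python sketch is illustrative only: the class name, field names, device description and the numerical limits for track gauge are hypothetical values, not figures from the pilot or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredParameter:
    name: str              # parameter selected for measurement
    unit: str
    min_admissible: float  # lower limit of the admissible range
    max_admissible: float  # upper limit of the admissible range
    device: str            # device / control method used to measure it
    interval_days: int     # measurement frequency

    def is_admissible(self, value):
        """True when the measured value lies inside the admissible range."""
        return self.min_admissible <= value <= self.max_admissible

    def margin(self, value):
        """Distance to the nearest limit; negative when outside the range."""
        return min(value - self.min_admissible, self.max_admissible - value)


# Hypothetical definition: limits are illustrative, not normative.
gauge = MonitoredParameter(
    name="track gauge", unit="mm",
    min_admissible=1430.0, max_admissible=1440.0,
    device="track recording vehicle", interval_days=30,
)

readings = [1435.2, 1438.9, 1441.3]
flags = [gauge.is_admissible(r) for r in readings]
print(flags)  # [True, True, False]
```

Tracking how `margin` shrinks from one measurement to the next is one simple way to decide the suitable moment to review an element before the limit is actually exceeded.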
Additionally, a predictive approach poses an additional challenge for Big Data
technologies because of the difficulty of foreseeing future situations and/or extreme events. For
example, climate change is a clear instance of this problem, because its effects over
the whole lifecycle of the infrastructure are currently quite difficult to determine (qualitatively
and/or quantitatively).
The benefits that a predictive maintenance strategy can provide are summarized as follows:
1. Extend asset life.
2. Reduce the costs of maintenance (scheduled and unscheduled).
3. Predictive models to forecast (and plan accordingly) maintenance activities.
4. Increase warranty recoveries.
5. Better knowledge of the real causes of early failure.
6. Downtime reduction (people and machinery).
7. Integrated platform to manage a vast amount of information.
8. Provide decision-making support.
9. GHG reduction.
10. Development of Effective Climate Change Adaptation Strategies.
In a parallel effort, a series of needs identified by various professionals has been
collected. Of particular interest are the results presented in the technical paper “Railway
infrastructure maintenance - a survey of planning problems and conducted research” by
Linköping University, Department of Science and Technology, Norrköping SE-601 74,
Sweden.
The main outcome of this paper is an identification of problems and improvement
opportunities related to railway maintenance. In summary, these opportunities are listed as
follows:
Strategic opportunities:
o Maintenance dimensioning
o Maintenance contract design
o Maintenance resource dimensioning and localization
o Service life and maintenance frequency determination
o Network design considering maintenance
o Renewal scheduling and project planning
Operational opportunities:
o Possession scheduling
o Maintenance vehicle and team routing
o Rescheduling
o Deterioration-based maintenance scheduling
o Maintenance vehicle routing and team scheduling
o Maintenance project planning
o Work timing and resource scheduling
o Track usage planning
Although the opportunities listed above are more specific and operationally
focused, they are well aligned with the objectives of a predictive
strategy. This is relevant for ensuring the replicability of the
results of the pilot, not only in a national context but also within a European framework.
Taking into account all these requirements and characteristics, the following figure shows a
general diagram of the approach proposed for the WP6 Replication Pilot.
Figure 1 Pilot Approach Diagram
The following sections describe in detail the initial design of the pilot, indicating the
parameters to take into consideration.
3.1 Requirements
The main high level requirement for this pilot is to optimize the maintenance tasks of the high
speed railway infrastructure. Given the breadth of this initial requirement, this chapter
focuses on describing the optimization to be achieved in this pilot, stating which
aspects of infrastructure maintenance will be addressed and why these aspects were chosen.
Within the maintenance of the railway infrastructure, the main concerns and therefore
requirements of the railway infrastructure managers and of the companies responsible for
carrying out the maintenance tasks of the railway infrastructure are:
1. Infrastructure security: Have a railway infrastructure that is at all times in conditions that
ensure the level of security for which it was designed.
2. Safety of the maintenance personnel: All interventions and work performed as part of the
maintenance of the railway infrastructure must be designed and controlled in a manner
that ensures the safety of the maintenance staff.
3. High availability: Increase infrastructure availability.
4. Cost optimization: Reduction of the cost of maintaining the infrastructure given that it is
one of the aspects that allows the train to be competitive against other transport modes.
In order to meet these high-level requirements, the following specific requirements can be
established for the pilot:
ID Requirement description
REQ1 Minimizing the number of maintenance interventions
REQ2 Minimization of times required for each maintenance work
REQ3 Minimization of elements or assets requiring maintenance
REQ4 Optimization of maintenance activities that involve nearby elements and have to be performed close together in time
REQ5 Migrate the maintenance philosophy from periodic maintenance to intelligent maintenance based on predictive information, passing through the current maintenance model of “condition-based maintenance” (maintenance “by state”)
The basis of this current maintenance model is to maintain continuous intensive monitoring of
the elements and track geometry. The next step for the maintenance activities is to be based on
predictions of the evolution of infrastructure wear, which is one of the pilot requirements.
This pilot is expected to achieve the following improvements in the management
of maintenance tasks on the high speed rail network:
- Reduce maintenance cost by up to 12%
- Increase network availability by up to 20%
- Improve maintenance efficiency by reducing by 10% the number of maintenance works
caused by element failures
The following figure shows the process of identifying the pilot's objectives.
Figure 2 Pilot objectives identification process
To meet these challenges, the pilot must provide useful information that makes it possible to
answer the following questions appropriately:
- When will an element of the infrastructure fail?
- How accurate are the predictions made by the pilot?
- Which elements will be affected by the expected failure?
- What is the relationship between failures?
- In the event of a fault, is another fault expected to occur nearby?
- What is the cost (time and money) of each of the actions necessary to anticipate a
possible failure?
- What is the cost (time and money) of each of the actions necessary to solve a failure?
- What is the cost (time and money) of the work necessary to solve a failure when it
occurs?
3.2 Objectives
The main objective is to develop a pilot that obtains, analyses and links all the information
currently available on the operation and maintenance of the high-speed rail network,
both historical and real-time, applying massive data analysis methodologies. The aim is to
extract the knowledge needed to predict accurately how the wear of the infrastructure
elements will evolve, and so to optimize the maintenance works of the high speed railway
infrastructure.
The pilot is intended to provide a tool that supplies useful information for the decision-making
process, making it possible to anticipate problems or failures that can arise in the
railway infrastructure. In this way, greater levels of infrastructure availability than the
current ones can be reached at a lower cost, while maintaining the high level of security
required.
Therefore, the specific objectives of the pilot are:
ID Objective description
OBJ1 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the degradation of switches and crossings of the high speed lines throughout its useful life
OBJ2 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the degradation of track profiles of the high speed lines throughout its useful life
OBJ3 Collect, process and analyse information based on Big Data techniques for the prediction of the evolution of the topographic characteristics of the slopes that are part of the railway infrastructure
According to the standard EN 13306:2011, maintenance is the combination of all technical,
administrative and management actions during the life cycle of an item intended to retain it in,
or restore it to a state in which it can perform the required function.
All the elements of the track, such as the materials that make it up and the geometric
parameters that relate them to each other, wear out due to the effects of atmospheric agents and
of the vehicles running on them. For these elements to continue performing their functions, a
set of actions has to be carried out to ensure the quality of the route in relation to the needs
of the traffic.
The maintenance tasks are aimed at ensuring the safety of the circulation, reaching the
maximum possible degree of comfort for travellers and maintaining regularity indices that
characterize the trains on each track.
During the first phases of the pilot's execution, three aspects of the railway infrastructure will
be studied in order to optimize its maintenance.
Page | 38
D6.1 – Proactive rail infrastructures pilots design
Version 1.0, 15/03/2017
The three aspects are:
Switch and crossing elements
Track profiles
Slopes close to tracks
These three aspects of the railway infrastructure have been chosen as pilot study elements for
the following reasons:
They are the elements that concentrate the majority of the maintenance works on the high speed lines.
The maintenance cost associated with these elements is high.
The maintenance of these elements has a large margin for improvement and therefore a large margin for cost reduction.
Each maintenance work to correct a degradation of these elements requires heavy machinery that must travel to the fault point. Moving this kind of machinery takes a lot of time and has an impact on regular train traffic.
All the maintenance works require long execution times to resolve the incidents. This reduces the availability of the railway infrastructure, even at off-peak hours.
There is a lot of information available about these aspects, regarding both the evolution of their useful life and the incidents that have occurred in the past.
There are historical data on these elements, and periodically updated information on them is available.
There are economic cost data for each maintenance task associated with these elements.
The maintenance of these elements is currently carried out periodically or when a failure is detected, so it fits the objective of performing predictive maintenance on them.
The quality of the information related to these elements is high.
The sources of information available are very diverse.
A priori, there are common sources of information that can be useful for detecting the degradation of these elements.
Studying the maintenance of these elements and their degradation will make it possible to estimate the cost savings and the increase in the availability of the railway network.
During the next steps of the pilot, these aspects of the infrastructure will be analysed in order
to determine whether it is possible to predict the wear of the infrastructure elements analysed
in the pilot. This decision will be made taking into account several aspects, such as:
Data quality
Data availability
Data volume
Relationship between all available data
Margin of improvement
Possibility to get new measures in the future
3.3 Use Cases / Scenarios
The following use cases will be realized during the execution of the pilot:
3.3.1 UC1: Prediction of the degradation of switch and crossing elements
3.3.1.1 Objective
The result of this use case will be a prediction of the wear of switch and crossing elements
and a comparison between the cost of current maintenance activities and the cost of the
maintenance activities designed using the pilot results.
3.3.1.2 Input data
For this scenario it will be necessary to use at least the following information:
Historical data on the state of switch and crossing elements
Historical reports on the maintenance of switch and crossing elements
Relation between switch and crossing element faults and maintenance activities
Measurements of the current state of the switch and crossing elements
Train traffic information
Historical weather conditions
Maintenance activities to recover the optimal state of the switch and crossing elements
Resources for each type of maintenance work
Cost of each type of maintenance work
3.3.1.3 Work environment
The main activities in this use case will be to analyse, use and link all the available railway
infrastructure, operational, environmental and maintenance data, in order to make an accurate
prediction of the degradation of the switch and crossing elements installed on High Speed
Lines.
The main steps to fulfil the objectives are:
1. Select specific switches and crossings along the pilot line
2. Study the failures that have occurred in the past in these selected items
3. Collect all the available data related to each selected item
4. Execute the big data algorithm to predict the wear of these selected items
5. Compare the results to determine the degree of reliability of the predictions made by the pilot
6. Carry out a cost analysis to identify the improvements achieved with the pilot
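As an illustration of steps 4 and 5, the sketch below fits a simple linear wear trend to past measurements of one item and then checks the prediction against later measurements that were held back. It is only a minimal example under stated assumptions: the linear model is a stand-in for whatever algorithm the pilot finally selects, and the measurement values, units and function names are hypothetical.

```python
from statistics import mean


def fit_linear_trend(times, values):
    """Least-squares straight line through (time, value) points.

    Returns (slope, intercept).
    """
    t_bar, v_bar = mean(times), mean(values)
    covariance = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    variance = sum((t - t_bar) ** 2 for t in times)
    slope = covariance / variance
    return slope, v_bar - slope * t_bar


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predictions and observations."""
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical wear measurements (mm) of one switch, by month since installation.
history_months = [0, 6, 12, 18, 24]
history_wear = [0.0, 0.5, 1.1, 1.4, 2.0]

# Hypothetical later measurements, held back to check the prediction (step 5).
check_months = [30, 36]
check_wear = [2.4, 3.0]

slope, intercept = fit_linear_trend(history_months, history_wear)
predicted = [slope * month + intercept for month in check_months]
error = mean_absolute_error(predicted, check_wear)

print(f"estimated wear rate: {slope:.3f} mm/month")
print(f"mean absolute error on held-back data: {error:.3f} mm")
```

A small error on the held-back measurements would indicate that the prediction is reliable enough to feed the cost analysis of step 6.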
3.3.1.4 Relation with objectives and requirements
The following table shows the relation of this use case to the pilot objectives and
requirements.
Use case | Objective | Requirements
UC1 | OBJ1 | REQ1, REQ2, REQ3, REQ4, REQ5
3.3.2 UC2: Prediction of the degradation of track profiles
3.3.2.1 Objective
The result of this use case will be a prediction of the wear of track profiles and a
comparison between the cost of current maintenance activities and the cost of the
maintenance activities designed using the pilot results.
3.3.2.2 Input data
For this scenario it will be necessary to use at least the following information:
Historical data on the state of the track profile
Historical reports on the maintenance of the track profile
Relation between track profile faults and maintenance activities
Measurements of the current state of the track profile
Train traffic information
Historical weather conditions
Maintenance activities to recover the optimal state of the track profile
Resources for each type of maintenance work
Cost of each type of maintenance work
3.3.2.3 Work environment
The main activities in this use case will be to analyse, use and link all the available railway
infrastructure, operational, environmental and maintenance data, in order to make an accurate
prediction of the degradation of the track profile along the High Speed Lines.
The main steps to fulfil the objectives are:
1. Select specific tracks along the pilot line
2. Study the failures that have occurred in the past in these selected items
3. Collect all the available data related to each selected item
4. Execute the big data algorithm to predict the wear of these selected items
5. Compare the results to determine the degree of reliability of the predictions made by the pilot
6. Carry out a cost analysis to identify the improvements achieved with the pilot
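The cost analysis of step 6 can be sketched as a comparison between the total cost of the current maintenance plan and that of a plan designed from the predictions. This is a minimal sketch only: the intervention types, unit costs and counts are hypothetical and do not come from the pilot data.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    kind: str         # type of maintenance work (hypothetical labels)
    unit_cost: float  # cost of one intervention, in euros (hypothetical)
    count: int        # number of interventions in the period analysed


def total_cost(plan):
    """Total cost of a maintenance plan."""
    return sum(item.unit_cost * item.count for item in plan)


def saving(current_plan, predictive_plan):
    """Absolute and relative saving of the predictive plan over the current one."""
    current = total_cost(current_plan)
    predictive = total_cost(predictive_plan)
    return current - predictive, (current - predictive) / current


# Hypothetical plans for one track section over the same period.
current_plan = [
    Intervention("periodic tamping", 10_000.0, 4),
    Intervention("corrective repair", 25_000.0, 2),
]
predictive_plan = [
    Intervention("condition-based tamping", 10_000.0, 3),
    Intervention("corrective repair", 25_000.0, 1),
]

absolute, relative = saving(current_plan, predictive_plan)
print(f"saving: {absolute:.0f} EUR ({relative:.1%})")
```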
3.3.2.4 Relation with objectives and requirements
The following table shows the relation of this use case to the pilot objectives and
requirements.
Use case | Objective | Requirements
UC2 | OBJ2 | REQ1, REQ2, REQ5
3.3.3 UC3: Prediction of the evolution of the slopes that are part of the railway infrastructure
3.3.3.1 Objective
The result of this use case will be a prediction of the evolution of slopes and a comparison
between the cost of current maintenance activities and the cost of the maintenance activities
designed using the pilot results.
3.3.3.2 Input data
For this scenario it will be necessary to use at least the following information:
Historical data on the state of the slopes
Historical reports on maintenance of the slopes
Measurements of the current state of the slopes
Historical weather conditions
Maintenance activities to recover the optimal state of the slopes
Resources for each type of maintenance work
Cost of each type of maintenance work
3.3.3.3 Work environment
The main activities in this use case will be to analyse, use and link all the available railway
infrastructure, operational, environmental and maintenance data, in order to make an accurate
prediction of the evolution of the degradation of slopes on High Speed Lines.
The main steps to fulfil the objectives are:
1. Select specific slopes along the pilot line
2. Study the failures that have occurred in the past in these selected items
3. Collect all the available data related to each selected item
4. Execute the big data algorithm to predict the wear of these selected items
5. Compare the results to determine the degree of reliability of the predictions made by the pilot
6. Carry out a cost analysis to identify the improvements achieved with the pilot
3.3.3.4 Relation with objectives and requirements
The following table shows the relation of this use case to the pilot objectives and
requirements.
Use case | Objective | Requirements
UC3 | OBJ3 | REQ1, REQ3, REQ5
3.4 Data Assets
Railway infrastructures are highly complex and comprise many different elements. Each of these
elements has diverse associated parameters that define its characteristics, at least from the
maintenance point of view.
In order to identify and describe these parameters, the following table shows the data sets
that characterize them, their possible sources and the related element (or elements) of the
infrastructure.
Name of Data Asset | Short Description | Initial Availability Date | Data Type | Link to Data ID Card (in basecamp)
Ferrovial Drone flights | The drone will provide topographic data from the flights to be carried out in the pilot area. This data set comprises the point cloud of the terrain with its coordinates, data from the drone (location, altitude, etc.) and aerial photos. | Depends on the planning and number of flights, which must be approved by the competent authority | LAS format and aerial photos | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252586
Machinery Technical Specifications | Technical specification data (type, manufacturer, plate (national and international), max speed, year of acquisition, etc.) | 01/04/2017 | pdf, xls, jpg, mpg4, doc | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251967
Machinery Fuel Consumption | Data regarding machinery fuel consumption during the maintenance operations. | 01/01/2015 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423253184
Machinery Work Mode | Data related to the machinery work mode (standstill periods, working stages, transit) | 01/01/2015 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251931
Machinery Engine Work | Data related to the machinery engine when it is working (standstill periods, working stages, transit) | 01/01/2015 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252967
Machinery GPS location | Data related to the location of the machinery according to its GPS coordinates and the theoretical distances to reach the maintenance worksite. | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251904
Machinery Automatic Fuel Consumption | Data regarding machinery fuel consumption during the maintenance operations [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423253222
Machinery Automatic Work Mode | Data related to the machinery work mode (standstill periods, working stages, transit) [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251943
Machinery Automatic Engine Work | Data related to the machinery engine when it is working (standstill periods, working stages, transit) [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423253148
Machinery Number of Tamping Insertions | Data related to the number of tamping insertions carried out by the tamping machine [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423251990
Machinery General Engine Data | Data related to engine data of the machinery focused on its operation [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423253173
Machinery Tamping Device Temperature | Data related to the monitoring of the temperature of the tamping devices while the machine is working [automatically obtained] | 01/06/2017 | xls, csv | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252008
Track Geometry | Data related to the track geometry before and after the maintenance activities, used to check that the track parameters are within the required tolerances. | 01/01/2008 | xls, csv and paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252072
Railway Design Projects [paper support] | Full construction designs including all the elements of the infrastructure and the as-built info collected during the construction stage [paper support] | 01/01/2002 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252531
Railway Design Projects [digital support] | Full construction designs including all the elements of the infrastructure and the as-built info collected during the construction stage [digital support] | 01/01/2002 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252504
Maintenance Operation Drainage Clearing [paper support] | Maintenance reports regarding the operations carried out for drainage clearing [paper support] | 01/01/2002 - 01/01/2008 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252563
Maintenance Operation Drainage Clearing [digital support] | Maintenance reports regarding the operations carried out for drainage clearing [digital support] | 01/01/2008 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252542
Maintenance Embankment Slope Clearing [paper support] | Maintenance reports regarding the operations carried out for clearing of embankment slopes [paper support] | 01/01/2002 - 01/01/2008 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252054
Maintenance Operation Embankment Slopes Clearing [digital support] | Maintenance reports regarding the operations carried out for clearing of embankment slopes [digital support] | 01/01/2008 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252030
Maintenance Operation Track Bed Profiling [paper support] | Maintenance reports regarding the operations carried out for track bed profiling [paper support] | 01/01/2002 - 01/01/2008 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252145
Maintenance Operation Track Bed Profiling [digital support] | Maintenance reports regarding the operations carried out for track bed profiling [digital support] | 01/01/2008 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252112
Maintenance Operation Line Fencing Preservation [paper support] | Maintenance reports regarding the operations carried out for line fencing preservation [paper support] | 01/01/2002 - 01/01/2008 | paper | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252663
Maintenance Operation Line Fencing Preservation [digital support] | Maintenance reports regarding the operations carried out for line fencing preservation [digital support] | 01/01/2008 | pdf, cad, jpg, xls, word, txt, mov, mpeg4, dwg, etc. | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252604
PNOA MÁXIMA ACTUALIDAD | Recent orthophotography mosaic of the national territory at scale 1:50.000 | 01/01/1990 (older pictures exist but can be considered insufficiently up to date) | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252250
MTN25 RÁSTER | Recent raster file of the national territory map at scale 1:25.000 | 01/01/1990 (older pictures exist but can be considered insufficiently up to date) | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252340
BTN25 | Recent vector file of the national territory map at scale 1:25.000 | 01/01/1990 (older pictures exist but can be considered insufficiently up to date) | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252376
BTN100 | Topographic base of the national territory map at scale 1:100.000 | NA | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252232
MDT05/MDT05-LIDAR | Digital terrain model of the national territory with a mesh of 5x5 m | 01/01/2017 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252445
Recent Seismicity | Recent seismic activity and earthquakes within the national territory and boundaries. | 01/01/2017 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252278
Seismogenic Zones | Seismic related zones and elements within the national territory and boundaries. | 01/10/2010 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252328
Hydrogeologic map | Hydrogeological GIS information of the national territory. | 01/01/1991 - 01/01/1999 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252188
Vegetation Index | Time-dependent vegetation index GIS information within the territory of Andalusia. | 01/01/1993 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252393
Noise map of Malaga | GIS information of the noise distribution in the city of Malaga related to train traffic | 01/01/1991 - 01/01/1999 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252355
Weather Information | Weather information gathered by ADIF's weather stations deployed throughout its facilities and infrastructures. | 01/01/2015 | WEB Service | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252411
Weather Information | Weather information gathered by the AEMET network. | 01/01/1920 | API REST | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252425
European Railway Data | The Agency is responsible for developing and maintaining several registers and databases in order to ensure transparency and equal access to documents for all railway market actors. These data sets include aspects such as incidents, safety, interoperability or rolling stock. | 01/01/1992 | WEB Service / API REST | https://3.basecamp.com/3320520/buckets/1429164/uploads/423252465
Tramification | Tracks and units in service | JANUARY 2010 | XLS, PDF | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439141
Tramification (IDEADIF) | Tracks and units in service (geometry) | JANUARY 2010 | SHP | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439174
PIDAME | Applying for maintenance tasks | JANUARY 2010 | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439054
SIOS | Works, projects and maintenance tasks | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439101
ICECOF | Monitoring and control system of railway operation | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439032
CGRH24 | 24h network management centre | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438890
DAVINCI | High technology in railway operations | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438950
Dynamic inspection | Dynamic inspections database | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438992
Geometrical inspection | Geometrical inspections database | | | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439016
S&C inspection | S&C inspections database | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439085
TAMPING | Tamping database | | XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423439118
AEMET | Observing networks used for meteorological and climatological studies | 1981 | CSV, PDF, XML | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438859
CLIMA | Environmental climatological information subsystem | | XLS | https://3.basecamp.com/3320520/buckets/1429164/uploads/423438915
3.4.1 Techniques to monitor the infrastructure
The available data on the state of the infrastructure are collected by several types of
inspection. The pilot will use the information provided by these inspections as input data, in
order to look for relationships between them that are not currently recognised and to
analyse the evolution of each measurement to try to predict the infrastructure wear.
The different kinds of inspection that will be used in the pilot are described below:
Geometric track inspection: in Spain, the track geometry control vehicle records the
parameters of the track geometry. Accelerations are measured in the axle boxes to
detect short-wave and long-wave wear and levelling defects in welds or joints.
These defects are associated with dynamic overload, which causes rolling contact
fatigue as well as vibrations and discomfort. Longitudinal and transverse levelling
are also checked; they are related to short-wave, medium-wave and long-wave defects
and can produce dynamic overload, uncertainties in train running and discomfort.
Warping defects are also measured with this device because they can result in
derailments. Other parameters measured are alignment, track gauge, the rail head
transverse section and the profile of the track.
Geometric inspections are carried out about every two weeks.
Dynamic inspection: One of the most important sources of information for the
maintenance of the superstructure of high-speed lines is the periodic dynamic
inspection of the track. This is an indirect method of detecting defects in the
superstructure through the study of the accelerations experienced by a commercial
vehicle in motion at certain points of its structure. The characteristics and condition
of the vehicle, and its speed, influence the "reading" that it makes of the track.
It is carried out with the dynamic control vehicle, which runs at the same speed as a
commercial train. The following accelerations are measured and recorded: vertical in the
axle boxes, and lateral and vertical in the car body. The thresholds for each acceleration
are set according to experience. The system provides a list of the kilometre points where
limits have been exceeded, together with the values measured at these points.
o Lateral bogie acceleration. For values above 6 m/s², the defect is checked as soon as possible and corrected immediately. Values between 4 and 6 m/s² are scheduled for programmed correction, and values between 2.5 and 4 m/s² are placed under surveillance.
o Vertical axle box acceleration. Only values above 30 m/s² are considered. In these cases it is necessary to carry out specific topographic studies of the track to analyse, among other causes, the settlement of the sleepers.
o Lateral and vertical car body accelerations. Besides these accelerations, longitudinal acceleration is occasionally also measured. They indicate the comfort of the traveller. Values below 1 m/s² are considered normal. They are also of interest for the detection of long-wave defects.
Twelve dynamic inspections are performed annually. This frequency provides updated
information (monthly or bimonthly) on the state of the superstructure and its evolution,
and on the effectiveness of the work done since the last inspection. For this reason,
dynamic inspections are considered the main source of information for the maintenance
of the superstructure, and they should be the basis for scheduling superstructure
maintenance measures.
Rail ultrasonic inspection: Rail head defects with longitudinal orientation are detected
by a vertically acting ultrasonic emitter.
A vehicle equipped with ultrasonic instruments for rail inspection performs this activity
twice a year.
Track visual inspection: Specialized personnel carry out a visual inspection of the track
superstructure on foot. All the aspects that could have an impact on normal operation
are checked.
In particular, the following elements are evaluated:
o Ballast: condition, contamination, presence of weeds, dimensions of the ballast layer.
o Sleepers: presence of fissures or cracks, damage from track machinery, squaring.
o Fasteners: correct placement and operation of the fastener.
o Rail: appearance of surface defects, cracks or fissures.
o S&C: inspections
These visual inspections are carried out twice a year.
Train cab ride: A visual inspection of the track is made from the cab of a train in
commercial service, recording all the singularities of the track. Potential defects are
confirmed as soon as possible by taking field data, and geometric correction is
programmed if necessary.
These visual inspections are carried out weekly.
With all the data from the dynamic, geometric, train cab and on-foot inspections, the
maintenance staff have the information necessary to schedule maintenance work. The analysis
of the acceleration graphs is very useful, and it is most effective when the most important
elements of the superstructure are located on the graphs. From these studies, the works to be
done on the track are scheduled, taking into account those that need treatment with heavy
machinery, specific topographic studies or confirmation by dynamic inspection to solve the
problem. In addition, those areas or points whose treatment requires more investment and
specific planning are identified.
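The acceleration thresholds quoted in this section for the dynamic inspection can be expressed as a small classification routine. The following is a minimal sketch only: the function names, the category labels and the sample readings are hypothetical, and the treatment of values lying exactly on a boundary (assigned here to the less severe category at 6 m/s² and to the more severe one at 4 and 2.5 m/s²) is an assumption, since the text does not specify it.

```python
def classify_lateral_bogie(acceleration):
    """Classify a lateral bogie acceleration given in m/s^2.

    Thresholds from the text: above 6 immediate correction,
    4 to 6 programmed correction, 2.5 to 4 surveillance.
    Boundary handling is an assumption of this sketch.
    """
    if acceleration > 6.0:
        return "immediate correction"
    if acceleration >= 4.0:
        return "programmed correction"
    if acceleration >= 2.5:
        return "surveillance"
    return "no action"


def vertical_axle_box_needs_study(acceleration):
    """Only vertical axle box values above 30 m/s^2 are considered."""
    return acceleration > 30.0


def car_body_is_normal(acceleration):
    """Car body accelerations below 1 m/s^2 are considered normal."""
    return acceleration < 1.0


def exceedances(readings):
    """List (kilometre point, value, category) for readings that need attention."""
    result = []
    for km, acceleration in readings:
        category = classify_lateral_bogie(acceleration)
        if category != "no action":
            result.append((km, acceleration, category))
    return result


# Hypothetical lateral bogie readings: (kilometre point, acceleration in m/s^2).
readings = [(12.3, 1.8), (45.0, 3.1), (45.7, 4.9), (102.4, 6.5)]

for km, acceleration, category in exceedances(readings):
    print(f"km {km}: {acceleration} m/s^2 -> {category}")
```

The output corresponds to the list of kilometre points with exceeded limits that the inspection system is described as providing.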
3.5 Big Data Technology, Techniques and Algorithms
Note: The Big Data technology, techniques and algorithms used in this project will be based
on the SOFIA2 Platform (by Indra Sistemas, S.A.). SOFIA2 is also the Big Data platform used in
“WP4 – Smart Highways” and “WP8 – Smart Airport Turnaround”, so this chapter is shared with
those work packages.
Each of the pilots that share the Big Data platform will use both common and specific modules
of SOFIA2 in order to fulfil its specific objectives.
This chapter is divided into two parts: the first describes the characteristics of the Big Data
platform, with the specific modules that will be used in the railway pilot, and the second
describes the specific techniques used to develop the descriptive and predictive algorithms of
this railway pilot.
3.5.1 Big Data Technology
The aim of the SOFIA project (an ARTEMIS project, 2009-2012) was to make information in the physical world available for smart services by connecting the physical world with the information world. The idea was to enable cross-industry interoperability and to foster innovation while maintaining the value of legacy solutions. Sofia implements the Smart M3 semantic model. The semantic information broker (SIB) of the Smart M3 model is able to dispatch extremely high volumes of data, but it is not able to process them; processing is delegated to the nodes.
The results of Sofia were taken as input by Indra to create Sofia2, an IoT and Big Data solution oriented to real-life conditions. The performance was optimized and the semantic requirements were simplified. Indra also enhanced Sofia2 through the Analytics Labs, which bring an open source toolset for Big Data exploitation, based on the main standards and tools such as Hadoop.
The Sofia2 community version provides a set of open APIs based on the main standards, so that any developer can extend the functionality of the platform for their needs. With this approach, no specific skill or training beyond the existing open solutions is needed to use Sofia2. As communication protocols, Sofia2 uses MQTT, RESTful, Ajax Push, Websocket, AMQP and JMS.
The interchange of information is based on the definition of ontologies, a semantic solution to deal with the wide spectrum of IoT. This modelling approach only requires the information to be defined semantically, and a Knowledge Processor to be developed for the data source (sensor or system) that will structure the data obtained from the environment with its semantic meaning. Sofia2 uses SSAP-Json and SSAP-XML as standards for the exchange of information.
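To make the idea of an ontology and a Knowledge Processor more concrete, the sketch below defines a small ontology as a set of typed fields and maps a raw sensor reading onto it. It is purely illustrative: the ontology name, the field names and the JSON layout are hypothetical, and the output is not the SSAP message format, which is not described in this document.

```python
import json

# Hypothetical ontology: field name -> expected Python type.
TRACK_ACCELERATION_ONTOLOGY = {
    "line": str,
    "kilometre_point": float,
    "lateral_bogie_acceleration_ms2": float,
    "measured_at": str,
}


def to_ontology_instance(raw, ontology_name, ontology):
    """Keep only the fields defined in the ontology and check their types.

    Plays the role of a very small Knowledge Processor: it gives the raw
    reading a structure with an agreed semantic meaning.
    """
    instance = {}
    for field, expected_type in ontology.items():
        if field not in raw:
            raise ValueError(f"missing field: {field}")
        value = raw[field]
        if not isinstance(value, expected_type):
            raise TypeError(f"{field} should be of type {expected_type.__name__}")
        instance[field] = value
    return json.dumps({ontology_name: instance}, sort_keys=True)


# Hypothetical raw reading coming from an inspection vehicle.
raw_reading = {
    "line": "HSL-example",
    "kilometre_point": 45.7,
    "lateral_bogie_acceleration_ms2": 4.9,
    "measured_at": "2017-03-14T10:15:00Z",
    "operator_note": "not part of the ontology, so it is dropped",
}

message = to_ontology_instance(
    raw_reading, "TrackAcceleration", TRACK_ACCELERATION_ONTOLOGY
)
print(message)
```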
Sofia2 supports a wide range of use cases, for example in Smart Cities, Mobility, Energy, Building, Home or Health, among others. The approach is the same in every case; the only things that change are the semantic definition of the information and the sensors or devices connected.
3.5.1.1 Conceptual model Indra Big Data platform
The main objective of the Sofia2 Big Data Platform is to simplify the use of all its
technologies and to speed up the use and exploitation of structured and unstructured data,
even in real time.
Sofia2 is a modular platform whose modules can be deployed independently according to need.
All the concepts of the platform are managed from a unified web console, and the platform can
be scaled according to need on the basis of proven technologies.
Built on widely tested Open Source software (such as Hadoop, Spark and Hive, among others),
Sofia2 supports real-time and batch scenarios, machine learning (ML) and visualization.
Additionally, Sofia2 is extensible and adaptable, integrating security at the data modelling
level and offering validations in data exploitation and semantics.
Figure 3: Sofia2 Big Data Platform used modules
3.5.1.2 Sofia2 Typical workflow
Typical flows in Sofia2 do not use all the modules. The workflow and the modules that will be
used for the current pilot are presented below.
The SOFIA2 workflow has the following four steps.
Figure 4: Sofia2 Typical workflow
1- Upload information
Ingestion of information through the DataFlow module, which defines its treatment and
stores it in the Staging Area of the Storage module. The information can be treated in
this phase to homogenize it and ensure the quality of the data.
2- Analysis of the information
Processing of information and execution of machine learning algorithms through the
Notebook and Machine Learning modules to obtain data with business value.
3- Storage of models and results
The values resulting from the execution of the machine learning algorithms are stored
on the platform.
4- Display information
Through the Dashboard module, all the information stored in the platform, both that
inserted in step 1 and that inserted in step 3, can be accessed and consulted.
The modules of the Sofia2 platform used in the construction of the pilot are explained in the
following subsections.
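The four workflow steps can be pictured with a small, self-contained sketch. It does not use any Sofia2 API: the functions, the in-memory repository and the sample records are hypothetical stand-ins for the DataFlow, Notebook/Machine Learning, Storage and Dashboard modules, and a simple average replaces the machine learning algorithm.

```python
from statistics import mean


def ingest(raw_rows):
    """Step 1: homogenise the records and drop incomplete ones."""
    staged = []
    for row in raw_rows:
        if row.get("value") in (None, ""):
            continue  # basic data quality rule: skip rows without a value
        staged.append(
            {"element": row["element"].strip().upper(), "value": float(row["value"])}
        )
    return staged


def analyse(staged):
    """Step 2: stand-in for a machine learning step (mean value per element)."""
    groups = {}
    for record in staged:
        groups.setdefault(record["element"], []).append(record["value"])
    return {element: mean(values) for element, values in groups.items()}


def store(repository, staged, results):
    """Step 3: keep both the ingested data and the analysis results."""
    repository["staging"] = staged
    repository["results"] = results


def display(repository):
    """Step 4: render the stored results as text lines (stand-in for a dashboard)."""
    return [
        f"{element}: {value:.2f}"
        for element, value in sorted(repository["results"].items())
    ]


# Hypothetical raw measurements with inconsistent formatting and a missing value.
raw_rows = [
    {"element": " s&c-01 ", "value": "1.2"},
    {"element": "S&C-01", "value": "1.4"},
    {"element": "track-07", "value": ""},
    {"element": "track-07", "value": "0.8"},
]

repository = {}
staged = ingest(raw_rows)
store(repository, staged, analyse(staged))
for line in display(repository):
    print(line)
```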
3.5.1.3 Sofia2 Storage
The information modelled in the platform is stored in the Big Data Repository included in the platform.
The pilot uses the reference implementation of this repository, which is built on Hadoop. Apache Hadoop is an open-source framework that allows the distributed processing of large amounts of data (petabytes) across clusters of machines.
Hadoop is currently synonymous with Big Data because it is:
Economical: it runs on low-cost equipment forming clusters.
Scalable: if more processing power or storage capacity is needed, more nodes can be
added to the cluster very easily.
Efficient: Hadoop distributes the data and processes it in parallel on the nodes, moving the processing (tasks) to the data.
Reliable: Hadoop keeps several copies of the data, so processing can continue when a node fails.
The main parts of Hadoop used by the solution are:
HDFS, the Hadoop distributed file system
A distributed file system that abstracts the physical storage and offers a
single view of all the storage resources of the cluster.
To store a file, HDFS splits it into blocks and stores the blocks on different nodes of
the cluster. By default it also replicates each block on three nodes.
It is possible to store files larger than the disk of any single machine
in the cluster.
If a node of the cluster fails, the system continues running while the node is repaired,
using the information replicated on other nodes.
Hive, the data warehouse infrastructure on Hadoop, which allows SQL queries on data
stored in Hadoop.
Impala, which allows online SQL access to the data stored in HDFS.
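The HDFS behaviour described above (splitting a file into blocks, replicating each block and surviving a node failure) can be illustrated with a small simulation. This is only a sketch and not HDFS code: the block size of 128 MB and the replication factor of 3 are assumed defaults, the node names are invented, and the round-robin placement is a simplification of the real placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size (a common HDFS default)
REPLICATION = 3       # assumed replication factor (the HDFS default)


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Round-robin placement: a toy stand-in for the real placement policy.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_replicas(placement, failed_node):
    """Return, per block, the replicas that remain readable after one node fails."""
    return {b: [n for n in reps if n != failed_node] for b, reps in placement.items()}


nodes = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical cluster
placement = place_blocks(file_size_mb=1000, nodes=nodes)
after_failure = surviving_replicas(placement, failed_node="node2")
print(len(placement), "blocks;", min(len(r) for r in after_failure.values()), "replicas left at worst")
```

Because every block lives on three different nodes, the loss of any single node still leaves at least two readable copies of each block, which is why the cluster can keep running during the repair.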
3.5.1.4 DataFlow (ETL module)
The DataFlow module is the main entry point for data and information into the platform. It can be
used as an ETL (Extract, Transform and Load) tool, both to ingest data through complex
transformations within the platform and to export data through intermediate
transformations.
Figure 5: Sofia2 DataFlow module
The main capabilities of the DataFlow module can be summarized as follows:
Extraction: Where data is extracted from homogeneous or heterogeneous data sources. Up to 18 different data origins are integrated in the module: Excel, AmazonS3, HadoopFS, Sofia2 (which lets you select the ontology, fields or query), Kafka, etc.
Transformation: Where the data is transformed for storage in the proper format or structure for querying and analysis. It is composed of many different tasks:
Evaluation of expressions: performs checks and calculations whose results can be written to new or existing fields.
Actions on fields: different actions are available on fields, such as Converter, Merger, Masker, Hasher, remove, rename...
Parser of JSON, XML and logs: parses information according to the different log formats and to XML and JSON schemas.
Flow selector: selects the next activity to execute on the dataset, depending on execution conditions.
Evaluators in different languages: specific actions on the data can be coded in different languages (Python, JavaScript, Jython...).
Other components, such as the Replicator registry or the replacement of values.
Figure 6: Sofia2 DataFlow example diagram
Load: Where the data is loaded into the final target database.
There are more than twenty possible destinations, which can be incorporated into the process via
drag and drop from the taskbar. We highlight the following components:
Sofia2, which lets you select the ontology, fields and other additional parameters
AmazonS3
Cassandra
Hadoop
Kafka
Flume
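The three capabilities listed above (extraction, transformation and load) can be sketched in a few lines of plain Python. This is not the DataFlow module itself: the CSV columns, the table name and the function names are invented for illustration, and an in-memory SQLite database stands in for the real destination.

```python
import csv
import hashlib
import io
import sqlite3

# Hypothetical input: the column names and values are invented for illustration.
RAW_CSV = """sensor_id,operator,temp_f
S1,alice,68.0
S2,bob,212.0
S3,carol,
"""


def extract(text):
    """Extraction: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: quality check, field rename, hashing and an evaluated expression."""
    out = []
    for row in rows:
        if not row["temp_f"]:          # quality check: skip rows with a missing reading
            continue
        out.append({
            "sensor": row["sensor_id"],                                              # rename
            "operator_hash": hashlib.sha256(row["operator"].encode()).hexdigest(),   # hasher
            "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 2),                 # expression
        })
    return out


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, operator_hash TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (:sensor, :operator_hash, :temp_c)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT sensor, temp_c FROM readings ORDER BY sensor").fetchall())
```

Keeping the three phases as separate functions mirrors the way a visual flow chains an origin, a set of processors and a destination.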
Figure 7: Sofia2 DataFlow monitoring information
3.5.1.5 Notebooks (Collaborative analytical)
The Notebook is an interactive and intuitive Sofia2 module that displays the data and facilitates its analysis.
In summary, the Notebook is a collaborative tool capable of:
1. Performing complex analysis of the information managed by the platform (both real-time and historical),
2. Combining different languages (Spark, R, Python, Hive, SparkSQL and Shell),
3. Generating intuitive figures (such as tables, graphs, maps and others),
4. Planning the execution of procedures (through new notebooks, one per procedure). That is,
you can create Notebooks with different targets, for example a real-time execution, a periodic execution or a batch execution.
As an example, a Notebook can load data from the Hadoop Distributed File System (HDFS) into Spark, launch queries and perform complex machine learning processes through the MLlib libraries.
It is also possible to use R code and the numerous libraries of that language, which allows, for example, displaying Leaflet maps.
Figure 8: Sofia2 Notebook
SOFIA2 Notebooks can combine Scala code, Spark, SparkSQL, Hive, R, Shell and many others with
HTML content or reactive Angular directives, allowing real-time interaction through a powerful
interface, all in a shared, multi-user environment.
Each supported language is managed by an interpreter, so whenever you want to write code
in a certain language you must place the corresponding marker at the start of the paragraph.
In addition, it allows instant visualization of data, making it easy to configure charts and
quickly switch between display types. It is also possible to create advanced graphics
thanks to the libraries of each language.
Figure 9: Sofia2 Notebook Spark graphics
Figure 10: Sofia2 Notebook HIVE graphics
Each Notebook consists of paragraphs, which may use different languages. Paragraphs can be run
individually, showing their output as well as their execution state.
Both individual paragraphs and the full notebook can be published via URL, showing in real time,
in all cases, the executions of the notebook or paragraph.
Another important feature is the possibility of scheduling the execution of notebooks through
a CRON expression, so that a notebook can run repeatedly without loss of context. The
execution interval can be selected from predefined options or written as a custom expression.
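To make the idea of a CRON schedule concrete, the following sketch checks whether a given moment satisfies a cron expression. It assumes the classic five-field Unix syntax (minute, hour, day of month, month, day of week); the platform's scheduler may use a different variant, and the helper names are hypothetical.

```python
from datetime import datetime


def _field_matches(field, value, minimum):
    """Check one cron field ('*', '*/n', 'a-b', 'a,b,c' or a number) against a value."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if `moment` satisfies a five-field cron expression.

    Simplification: all five fields must match (classic cron treats
    day-of-month and day-of-week as alternatives when both are restricted).
    """
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7   # cron: 0 = Sunday; Python: 0 = Monday
    return (
        _field_matches(minute, moment.minute, 0)
        and _field_matches(hour, moment.hour, 0)
        and _field_matches(day, moment.day, 1)
        and _field_matches(month, moment.month, 1)
        and _field_matches(weekday, cron_weekday, 0)
    )


every_15_min = "*/15 * * * *"    # every quarter of an hour
weekdays_6am = "0 6 * * 1-5"     # 06:00 from Monday to Friday
print(cron_matches(every_15_min, datetime(2017, 3, 15, 10, 30)))
print(cron_matches(weekdays_6am, datetime(2017, 3, 15, 6, 0)))
```

A scheduler would evaluate such a check once per minute and launch the notebook whenever it returns true.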
With all these features we have a collaborative web tool that is capable of performing complex
analysis of the information managed by the platform (both real-time and historical),
combining different languages and generating graphical views (or other actions), and that can be
scheduled for periodic execution, automatically refreshing the result of the analysis exposed at
a URL.
3.5.1.6 Machine Learning
The Machine Learning platform allows you to apply different learning techniques, among which we highlight the following:
Regression: Techniques to estimate relationships between variables.
Clustering: Techniques for grouping data by similarities.
Classification: Techniques to identify the membership of an element to a specific group.
Recommendation / Prediction: Techniques for forecasting the value for a new entity
based on historical preferences and/or behaviours.
Through the interpreter, SOFIA2 allows you to:
Store models created on the platform. The models can then be managed
from the web console, from which they can also be invoked with parameters and
given permissions.
Publish SOFIA2Models scripts that provide methods to retrieve a model, save it,
invoke it and assess its quality.
Generate REST APIs that evaluate input data sets through the generated
models. This facilitates their invocation through standard mechanisms that also
integrate the security of the platform.
This module allows you to define workflows visually, so that it is only necessary to introduce
the configuration parameters and input data to define analytic processes.
3.5.1.7 Dashboards
This module allows you to create a simple and visual dashboard with the information managed by the platform.
It offers various types of gadgets (data outputs) which help to build a complete and
personalized dashboard.
Figure 11: Sofia2 Dashboard types
3.5.2 Big Data Techniques and Algorithms
Applying a methodology to data mining processes is important for planning and executing
such projects. Some organizations implement the KDD (Knowledge Discovery in Databases)
process, while others use more specific standards like CRISP-DM (IBM SPSS) or SEMMA (if they
are using SAS tools). However, in this project we will use open-source software, mainly
the R and Python languages and the RStudio tool.
Data mining, or information exploitation, is a process for extracting useful, comprehensible and
new knowledge from large data volumes. Its main goal is to find hidden or implicit
information that cannot be obtained through conventional statistical methods. The inputs for
data mining processes are records coming from operational databases or data warehouses.
We are using a methodology based on CRISP-DM with some shortcuts. The major steps are
represented in the next diagram. Starting from a defined goal, in which the business
knowledge is implicit, it is necessary to prepare the data. Data preparation usually includes
enrichment with Open Data. Afterwards, the creation of an advanced model produces results
that require validation. These last three stages (data preparation, creation of advanced models
and results validation) constitute a cycle, which is iterated until valid results for the business are
achieved. The methodology is shown in the following diagram.
Figure 12: Methodology based on CRISP-DM
Each stage will be analysed separately so we can provide additional details.
1. Goal definition based on business knowledge
2. Data preparation and management
3. Creating analytic models
4. Validation and conclusions
5. Deployment: Solution integration with the pilot
3.5.2.1 Goal definition based on business knowledge
The first goal for a data analyst is to understand what the customer really needs to achieve. It is
important to discover the primary objective and its relation to the remaining
objectives.
The analyst should describe the criteria that are useful from the business perspective, so that
the business can easily understand the situation. Afterwards, more detailed research is needed into
all the resources, restrictions, assumptions and other factors that should be considered to
determine the objective of the data analysis and the project plan.
The business goals are then converted into data mining goals, that is, translated
into technical terms. It is also important to determine the criteria for business success. The
tool to be used is also selected in this stage.
3.5.2.2 Data preparation and management
First, from an initial data collection, it is possible to assess the data quality, gain first
insights and identify interesting data subsets from which to form hypotheses about hidden
information.
Secondly, the final data set to be used in the analysis is built. This includes tasks such as the
selection of tables, records and attributes, as well as transformation, the creation of new specific
variables and data cleaning.
Data cleaning can range from the substitution of defective data to the estimation of data
through modelling. Other operations include the production of derived variables or the creation of new
variables.
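As an illustration of replacing defective data with estimated values, the sketch below fills missing or out-of-range readings by linear interpolation between the nearest valid neighbours. Linear interpolation is only one simple choice of estimation model, assumed here for illustration; the function name, the validity limits and the sample readings are hypothetical.

```python
def clean_series(values, valid_min, valid_max):
    """Replace missing or out-of-range readings by linear interpolation.

    Defective values at the edges are filled with the nearest valid reading.
    """
    ok = [v is not None and valid_min <= v <= valid_max for v in values]
    valid_idx = [i for i, good in enumerate(ok) if good]
    if not valid_idx:
        raise ValueError("no valid readings to estimate from")
    cleaned = list(values)
    for i, good in enumerate(ok):
        if good:
            continue
        before = [j for j in valid_idx if j < i]
        after = [j for j in valid_idx if j > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            cleaned[i] = values[lo] + weight * (values[hi] - values[lo])
        else:
            cleaned[i] = values[before[-1]] if before else values[after[0]]
    return cleaned


# Hypothetical readings: None is a missing value, 999.0 an obvious sensor error.
readings = [10.0, None, 14.0, 999.0, 20.0, None]
print(clean_series(readings, valid_min=0.0, valid_max=100.0))
```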
Another common operation consists of combining data with open sources, especially when there
are relationships between the initial data and the Open Data, for instance combining data with
socio-economic variables from EUROSTAT.
The combined data also cover aggregations, that is, new values calculated as summary information
from multiple records. For instance, for a table of customer purchases, new fields could be the
number of purchases, the average purchase quantity, the percentage of articles on promotion,
etc.
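The customer purchase example can be written as a short aggregation. The records and field names below are invented for illustration; they only show how several rows per customer collapse into one row of summary fields.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, quantity, on_promotion).
purchases = [
    ("C1", 2, True),
    ("C1", 4, False),
    ("C2", 1, False),
    ("C1", 6, True),
    ("C2", 3, True),
]


def aggregate_by_customer(records):
    """Summarize multiple purchase records into one row of new fields per customer."""
    grouped = defaultdict(list)
    for customer, quantity, on_promotion in records:
        grouped[customer].append((quantity, on_promotion))
    summary = {}
    for customer, items in grouped.items():
        n = len(items)
        summary[customer] = {
            "n_purchases": n,
            "avg_quantity": sum(q for q, _ in items) / n,
            "pct_promotion": 100.0 * sum(1 for _, promo in items if promo) / n,
        }
    return summary


for customer, fields in sorted(aggregate_by_customer(purchases).items()):
    print(customer, fields)
```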
3.5.2.3 Creating analytic models
With our methodology, we are able to address any kind of model: descriptive, diagnostic,
predictive and prescriptive. The difference between them is shown in the following figure:
Figure 13: Different kind of models
As shown, the more complex the technique chosen, the more value can be added for
the client. This pilot is expected to reach the predictive level.
One of the main classifications divides machine-learning algorithms into two groups:
Unsupervised algorithms;
Supervised algorithms.
Unsupervised algorithms are applied when you only have input data and no corresponding
output variables. The goal for this technique is to determine the underlying structure or
distribution of the data, to organize data by similarity.
Examples of application of these techniques are customer segmentation, finding hidden
patterns, etc.
One of the most widely used unsupervised algorithms is the K-means algorithm.
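A minimal sketch of K-means (Lloyd's algorithm) in plain Python shows how input data without output labels is organized by similarity. The points are invented, and the initialization from the first k points is a simplification assumed here; library implementations normally use random or k-means++ initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm for K-means on a list of equal-length tuples."""
    centroids = list(points[:k])          # simple deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```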
On the other hand, supervised algorithms try to learn a function that maps the input data to the
output variable. In these cases, you know in advance the variable you want to predict.
Supervised algorithms are divided into two groups:
Classification algorithms: the output variable is categorical, for example fraud/not fraud,
green/red/blue, failure/no failure;
Regression algorithms: the output variable is a real number, for example a temperature or a
pressure value.
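The difference between the two output types can be seen in a small sketch. The first part fits a simple regression by ordinary least squares, so its output is a real number. The second function only illustrates a categorical output: it applies a threshold rule to the regression result and is not one of the classification algorithms discussed in this section. The data, the variable names and the wear limit are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: accumulated load (x) against measured wear in mm (y).
load = [1.0, 2.0, 3.0, 4.0, 5.0]
wear = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(load, wear)


def predict_wear(x):
    """Regression: the output is a real number."""
    return intercept + slope * x


def predict_failure(x, wear_limit=9.0):
    """Categorical output derived from a hypothetical wear limit."""
    return "failure" if predict_wear(x) >= wear_limit else "no failure"


print(round(predict_wear(6.0), 2), predict_failure(6.0))
```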
The next table summarizes the most common algorithms in supervised learning in both
categories:
Classification problem | Regression problem
Logistic regression | Simple regression
Decision tree | Ridge regression
Random forest | Lasso regression
Gradient boosted trees | Elastic net (Ridge+Lasso)
Neural network, Deep learning | Regression tree
Adaboost | K-neighbors regression
Naive Bayes | SVR
K-neighbors | Random forest regression
SVM | Gradient boosted tree regression
- | AFT
In both types of problems, several of the algorithms listed above are tested and the
most accurate one is chosen.
The next diagram shows the selected type of algorithm for each of the use cases and
scenarios described earlier in the document:
Figure 14: Algorithm type for each use case
Once the most appropriate type of algorithm has been chosen, a procedure to test the quality
and validity of the model is needed. The data are therefore divided into one sample for training the model
(the algorithm learns from the past) and another for testing (the accuracy of the algorithm is
measured), as the next figure depicts.
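The division into training and test data, and the choice of the most accurate candidate, can be sketched as follows. The data set, the 70/30 split and the two toy models (a majority-class baseline and a one-nearest-neighbour classifier) are assumptions made for illustration and are not the algorithms selected for the pilot.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_model(train):
    """Baseline: always predict the most frequent label seen in training."""
    labels = [label for _, label in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def nearest_neighbour_model(train):
    """1-nearest-neighbour on a single numeric feature."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(model, test):
    """Fraction of test rows whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in test) / len(test)


# Hypothetical data: (vibration level, label).
data = [(v / 10, "failure" if v >= 50 else "ok") for v in range(0, 100, 5)]
train, test = train_test_split(data)
scores = {
    "majority": accuracy(majority_model(train), test),
    "1-NN": accuracy(nearest_neighbour_model(train), test),
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The models never see the test rows during training, so the measured accuracy estimates how each candidate would behave on new data.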
Figure 15: Model creation process
3.5.2.4 Model Evaluation
The data scientist interprets the models according to domain knowledge, the
success criteria in data mining and the desired test design. Later, the data scientist discusses
the results with the business analysts in the business context.
Depending on the model evaluation, the tuning parameters are reviewed and adjusted for a
new evaluation, until the model that best answers the business goals has been achieved.
It is even possible to encounter business decisions that make
the model deficient. So, according to the evaluation results and the process review, the project
team decides how to proceed: whether the project has to end, whether it should
continue by modifying the development so that more iterations are necessary, or whether a new data
mining process should start.
A good way to define the total outputs of data mining is: OUTPUTS = MODELS + CONCLUSIONS.
3.5.2.5 Deployment: Solution integration with the pilot
Supervision and maintenance are important issues if data mining results are part of the daily
business.
Generally, data mining processes do not run independently in an IT environment; they
have to interrelate with other applications or be incorporated into the business processes.
Therefore, we think this stage is crucial to ensure the success of the data mining algorithms.
3.6 Positioning of Pilot Solutions in BDVA Reference Model
Figure 16: Big Data Value Reference Model
Data Visualization and User interaction: The pilot will provide a set of specific reports
that present each of the predictions made in a readable and useful format,
so that they can support decision-making in
the optimization of the maintenance works.
See chapter 3.5.1.7 Dashboards.
Data Analytics: Algorithms developed for the pilot:
o Descriptive: A descriptive analysis will be carried out in the first place to get a full
understanding of the data.
o Predictive: The main aim of this pilot is to develop specific algorithms based on
predictive data analysis to predict the evolution and degradation of each of the
elements of the pilot.
See chapters 3.5.1.4 DataFlow (ETL module), 3.5.1.5 Notebooks (Collaborative
analytical) and 3.5.2.3 Creating analytic models.
Data Processing Architectures:
o Batch process: Processes that allow the feeding of the algorithms with new data
collected during the execution of the pilot will be specified for each data source.
The inclusion of new data will be a periodic task given the nature of the data
sources.
o Interactive: Some of the identified data sources have an unstructured
format. For this reason, it will be necessary to process this information by
interactive methods in order to use it and include it in the pilot.
See chapters 3.5.1.4 DataFlow (ETL module) and 3.5.1.5 Notebooks (Collaborative
analytical).
Data Management: The initial data sources identified provide information to the pilot in
standard formats based on Excel, PDF and XML files. All these sources will be treated to
allow their initial inclusion and the progressive insertion of data throughout the
execution of the pilot. For a more detailed description see chapter 3.4 Data Assets.
The techniques used to manage the data will be:
o Collection: techniques and tools for gathering and storing data in its original form
(i.e., raw data).
o Preparation/Curation: techniques and tools for converting raw data into
cleansed, organized information.
o Linking/Integration: techniques and tools for matching, aligning and integrating
information.
o Access: techniques, tools and interfaces for accessing information (incl. access
rights management).
See chapter 3.5.2.2 Data preparation and management.
3.7 Big Data Infrastructure
At pilot stages 1 and 2 we will use an INDRA Cloud infrastructure to execute the first
steps of our methodology (data preparation and management, creating analytic models and
model evaluation). These first steps and the first iterations of our methodology cycle will be
executed on the Cloud infrastructure, but at the end of stage 2 and in stage 3 the pilot will be
deployed at the Railway Technology Centre (CTF) of ADIF in Málaga, in the IT environment
described in this chapter.
3.7.1 INDRA Cloud platform
The INDRA Cloud platform will be shared by three Transforming Transport Domains:
WP4 – Smart Highways
WP6 – Proactive Rail Infrastructures
WP8 – Smart Airport Turnaround
The cloud platform is dimensioned taking into account the estimated volume of data of the
pilots that will share it.
The following figure shows an overview of the platform that will be used at the first pilot stages.
Figure 17: Sofia2 Big Data Cloud Platform Infrastructure
3.7.2 ADIF Railway Technology Centre Platform
SOFIA2 will be deployed at the ADIF platform with the same modules and characteristics
deployed at the cloud platform.
To deploy the replication railway pilot at the CTF of ADIF we have two hardware alternatives
with the following characteristics.
IBM xServer X3850
Processor | 32 cores
RAM | 64 GB
Network | 4 network cards 10/100/1000, 4 FC DUAL cards
Disks | 2 x 146 GB 15k internal disks
IBM p770
Processor | 6 x Power7 with 8 cores
RAM | 192 GB
Network | 3 Quad Ethernet cards, 4 FC DUAL cards
Disks | 2 internal disks
This system can scale up to 64 cores and 1 TB of RAM at 1066 MHz.
For the data storage system, we will use the hardware solution with the following
characteristics:
Dual IBM System Storage DS5300, 2x IBM System Storage EXP5000
Disks | 2 x 16 disks of 300 GB at 15,000 RPM
3.8 Roadmap
The pilot will be executed in the following three stages. Each stage has different
objectives, deployment environment, Big Data infrastructure and data.
The following table presents the main characteristics of these stages.
Stage | Delivery Date (Project Month) | Features / Objectives Addressed | Embedding in Productive Environment | Big Data Infrastructure Used | Scale of Data
S1: Technology Validation | 9 | Validate the quality and volume of the input data obtained; Provide KPIs for input data; Confirm the feasibility of the initial objectives (OBJ1, OBJ2, OBJ3) | At test lab | Indra's hardware and Cloud | Historical data
S2: Large-scale experimentation and demonstration | 18 | Generate the descriptive algorithm for all the feasible objectives (OBJ1, OBJ2, OBJ3); Generate the predictive algorithm for all the objectives (OBJ1, OBJ2, OBJ3); Validate the quality of the developed algorithms and determine their reliability | At test lab + Initial deployment at ADIF CTF of Málaga | Indra's hardware and Cloud + Dedicated hardware | Historical data + Simulated data
S3: In-situ trials | 27 | Generate all the processes that feed the algorithms with new and updated information; Provide conclusions on each objective and assess the improvement found (OBJ1, OBJ2, OBJ3) | Deployed at ADIF CTF of Málaga | Dedicated hardware | Historical data + Updated information
3.8.1 Objectives of each stage
The objectives of each stage are:
1. In stage 1, the main objective is to check whether the data is useful for the pilot objectives, by
checking whether the data sources provided and the relationships between them are
appropriate to meet the objectives proposed in the pilot.
During this stage the quality of the available information will be analysed in depth in
order to determine the feasibility of each of the objectives raised as part
of the pilot, as well as the possible reliability of the results and predictions that can be
obtained throughout the execution of the pilot.
2. In stage 2, the main objective is to develop all the descriptive and predictive algorithms.
Another important aim of the stage is to test the algorithms and to establish
adjustments that increase the reliability of the results.
During this stage the pilot performance will also be tested with a high data
volume, to ensure that the pilot will execute properly at the final stage.
3. In stage 3, the main objective is to provide a complete tool that allows useful
knowledge to be extracted in order to optimize the maintenance activities on the high-speed rail lines.
As part of this stage, an analysis of the reliability of the results provided by
the pilot will be delivered.
3.8.2 Deployment of each stage
The deployment of the Big Data platform will evolve through the following steps:
1. In stage 1, the Big Data platform will be deployed by INDRA in the test lab environment,
based on Cloud technology, and will be used for the validation of the pilot objectives and
the data analysis.
2. In stage 2, the pilot will be executed in the Cloud environment in order to
develop the first version of the algorithms, process the initial data and provide the first
results.
In parallel, the Big Data platform will be deployed at the Málaga CTF of ADIF in order to
carry out the first load tests and thus check whether the available hardware is
capable of running the Big Data platform correctly under the higher
data load expected in the following phases of the pilot.
3. In stage 3, we will use the Málaga CTF of ADIF as the centre for information processing
and execution of the Big Data algorithms, deployed in their different evolutions and with
the data corresponding to each of the phases described below.
Figure 18: Pilot infrastructure deployment and data evolution
3.8.3 Data evolution of each stage
The data evolution aspect at each stage follows these points:
1. In stage 1, the pilot will use all the historical data that is available
from the ADIF and FERROVIAL systems and data sources.
2. In stage 2, the same data used in the first stage will be used, plus simulated data
with the same characteristics as the historical data.
3. In stage 3, the historical data used in the first stage will be used together with the historical
information stored during the pilot's execution, plus information received from provider
systems in real and quasi-real time. Given the nature of the data
provider systems used as data sources, the information will be received periodically and
will be included in the pilot, so that the volume of data used increases progressively.
4 Commonalities and Replication
4.1 Common requirements and aspects
Although the two railway pilots share the common objective of providing a tool to
predict the degradation of the railway infrastructure and thus optimize the maintenance work,
each one is focused on different parts of the railway infrastructure.
The initial data also show great similarities between the two pilots of the railway
domain, since both the nature of the data and the mechanisms for obtaining it are very similar.
Both pilots receive information collected by laboratory trains and by systems that measure
the state of the different elements of the infrastructure.
Therefore, both pilots can be considered complementary, in that both will
make it possible to analyse the state of the infrastructure, predict its degradation and
provide tools to both infrastructure administrators and infrastructure maintenance companies
to establish a more optimized planning of the maintenance activities.
Both pilots aim to optimize maintenance from two points of view: increasing the
availability of the infrastructure and reducing the cost of maintenance, while always ensuring the safety
of both the infrastructure and the workers involved in maintenance activities.
4.2 Aspects of Replication
For future projects or pilots where the objective is to perform a predictive analysis of the wear
of an element that is part of the railway infrastructure, the following points that are part of this
pilot can be used:
Analysis of the spectrum of existing data related to the operation and maintenance of
the railway infrastructure.
Techniques for extracting information based on heterogeneous data obtained from
different data sources.
Prediction-based data analysis techniques.
Algorithms for the prediction of the evolution of the wear of elements belonging to the
railway infrastructure.
The railway primary pilot is using data provided by the UK's Network Rail. Initially the data
provided will cover a small subsection of the overall rail network; however, once the benefit is
proven, a wider area of the rail network might be provided. The results are not
limited to either that region or the UK as a whole, but have applications for all rail-side assets
worldwide.
The railway replication pilot plans to obtain information from the Cordoba-Málaga high-speed
line, but some of the data sources used in this pilot can also provide
information on other high-speed lines of the Spanish rail network. For this reason,
replication of the pilot will be possible, reusing much of the effort made in this pilot,
and will make it possible in the future to extract information from other lines in order to optimize their
maintenance activities.
5 Conclusions
The conclusions related to the primary pilot are:
The primary pilot parties have engaged with the data provider and a data usage license
is currently being reviewed by the legal teams of all involved parties.
We have a small sample of the data set proposed to be made available to us. While it is
not enough or complete enough for us to begin work with, it’s sufficient to give us a
general idea of the kind of data we’ll be receiving.
Several system architectures have been designed and are being evaluated in parallel
whilst we are awaiting data.
Next steps:
Decide amongst involved parties the work breakdown
Obtain data
Ingest data into Big Data repository
Perform canonicalization of data before making it available to all parties
Data scientists to perform analysis of data and propose algorithms to work towards end
goals
The conclusions related to the replication pilot are:
At this stage of the project, the replication pilot has defined its objectives in order to
optimize the maintenance activities of the high-speed rail lines.
We have an initial analysis of the data available for use in the pilot, with some
information about its characteristics. The available information is very promising for
use in the pilot and thus for fulfilling its objectives.
We have several options related to the Big Data infrastructure in order to deploy the
pilot at each stage of the project.
The next steps towards the next milestones will focus on three lines of work:
1. Data: Perform an in-depth analysis of the data available, checking the data quality, the available
volume of data from each data source, the specific format of the information and the
access characteristics of each data source.
2. Algorithms: Generate the descriptive and predictive algorithms.
3. Big Data infrastructure: Deploy the Big Data platform in the cloud environment to support
the algorithm generation process.