DSS implementation and data ingestion


Summary

Background and motivations
AI Methodology
    Data research, evaluation and selection
    KPIs identification
    Identification of the levers to improve KPIs
    Scenarios analysis
DSS Data
    Data providers
DSS Architecture
    Architectural choices
        Apache Mesos
        Integration with Apache Marathon
        Datacenter Operating System
        Kafka
        Spark
        Cassandra
        WSO2 API Management
        Data flow
Focus on Lombardy Region
    Innovation Hub for green business


BACKGROUND AND MOTIVATIONS

Models that highlight the specific needs of a territory and suggest effective levers are a valuable support tool for decision-making processes aimed at targeted interventions with concrete results. Such models can be obtained through the systematic use of data-driven methodologies, in combination with Big Data Analytics and AI tools. Artificial Intelligence can be used to identify the main levers on which it is possible to act, either to overcome shortcomings or to strengthen situations that are already virtuous but still have room for improvement, in order to work on KPIs in the most efficient and effective way by "learning" from the past experience of the territory itself or of similar territories. In this way, it is possible both to simulate the impact of positive actions undertaken by competitor regions in the same areas and to prevent negative trends already observed in other regions.

The starting point for building such models is the availability of quality data. The data can be collected from heterogeneous data providers and subsequently integrated. Once integrated, they have to be properly normalised and pre-processed before being provided as input to the AI algorithms, which automatically analyse them to build models that quantify the relationships between phenomena.

In this scenario, the use of automated processes for data collection, normalisation and pre-processing helps to minimise the time cost and to maximise the accuracy of the process.

For the above-mentioned reasons, the design of the platform has been completed. The platform will automatically collect the data and host the algorithms needed to process them. In particular, once implemented, it will be used both during data collection and during data analysis, in order to obtain models that support the decision-making process.


AI METHODOLOGY

The underlying methodological approach involves comparing the region, or set of regions, of interest with a set of reference regions, following a "data-driven" approach. The real innovation of this methodology is the use of automatic analysis techniques not only to identify the main problems or shortcomings of the region of interest, but also to identify the main levers on which it is possible to act to overcome those shortcomings or to enhance virtuous situations.

The flow of the methodological approach is schematised in the following figure, which lists the relevant steps described in detail below.

Figure 1: Data-driven analysis

Data research, evaluation and selection

The first step is to find reliable, objective-oriented quantitative information, taking into account both the specific territorial context and the surrounding one. To this end, datasets containing values for as large as possible a set of European regions at NUTS2 (or possibly NUTS1) level are necessary.

KPIs identification

To identify the key performance indicators (KPIs), it is first necessary to analyse the relevant high-level indicators found in the different regional plans for the field of interest. The KPI research is then completed by analysing the literature related to the topics of interest, in order to identify further potential levers with documented impacts on the relevant KPIs in the European territory.

For the chosen set of candidate KPIs, the performance of the region of interest is analysed with respect to a set of competitor regions, in order to identify the areas of intervention. Competitor regions are selected on the basis of a context analysis when there is a set of regions that represent known "competitors". Where the literature in the sector does not suggest consolidated groups of competing regions, the selection is made from the data, based on similarity with the region of interest, computed automatically on a set of context indicators. For example, if the theme of social inclusion is considered, the regions most similar to the target region in terms of demographic and industrial structure can be selected, as in the sketch below.
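A minimal sketch of this data-driven competitor selection, assuming standardised context indicators and a nearest-neighbour notion of similarity; the NUTS2 codes, indicator names and values are illustrative, not the project's actual data:

```python
# Standardise the context indicators, then pick the regions closest
# to the region of interest as its "competitors".
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

context = pd.DataFrame(
    {"population_density": [420, 150, 390, 80, 410],
     "industry_share":     [0.31, 0.18, 0.29, 0.12, 0.33]},
    index=["ITC4", "FR71", "DE21", "ES41", "AT33"],  # NUTS2 codes
)

scaled = StandardScaler().fit_transform(context)

# 3 nearest neighbours of the region of interest (first row, ITC4);
# the query region returns as its own nearest neighbour, so ask for 4.
nn = NearestNeighbors(n_neighbors=4).fit(scaled)
_, idx = nn.kneighbors(scaled[[0]])
print(context.index[idx[0][1:]].tolist())  # selected competitor regions
```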



The main purpose of this step is to provide evidence of the positioning of the region under analysis with respect to the competitor regions, with reference to a particular KPI.

Identification of the levers to improve KPIs

Artificial Intelligence is used to identify the main levers on which it is possible to act to overcome shortcomings or to enhance situations that are already virtuous but still have room for improvement, in order to work on KPIs in the most efficient and effective way.

The data-driven lever selection is based on the use of AI algorithms which process both the data related to an output indicator (the KPI), the dependent variable, and a large pool of input indicators (the levers), the independent variables. The aims of this processing are to identify the levers having the greatest impact on the KPI and to assess the type of impact (positive or negative) on the KPI itself. The identification of the relevant levers and of their impact takes place with the so-called "learning by examples" approach, where each example is represented by the data of a specific region.

Specifically, the levers-KPIs relationship is modelled as a multivariate regression (i.e. dependent on several variables) in which the KPI is a function of the lever indicators. Therefore, each training region is described by a set of values for the lever indicators (inputs) and a single value for the KPI (output). Multiple KPIs are treated in parallel.

From the Machine Learning point of view, each region represents an "example", while the whole set of regions constitutes the training set used to train the machine learning algorithm, in order to identify the multivariate function that best models the relationship between levers and KPIs, based on the available examples. For this analysis, the decision tree approach is used, as it makes it possible to identify which of the levers are actually relevant to the KPIs by creating an input-output function that depends on a subset of the inputs.
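A minimal sketch of this "learning by examples" step with a decision-tree regressor (scikit-learn is used here for brevity; the platform itself relies on Spark MLlib and related libraries, described later). Regions, indicator names and values are invented:

```python
# Each row is a region: the columns are lever indicators (inputs)
# and the last column is the KPI (output).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

regions = pd.DataFrame({
    "rd_expenditure":      [1.2, 0.8, 2.1, 1.5, 0.6],     # levers
    "tertiary_education":  [28.0, 22.5, 35.1, 30.2, 19.8],
    "broadband_coverage":  [76.0, 61.0, 92.0, 84.0, 55.0],
    "patent_applications": [110, 45, 230, 160, 30],        # KPI
})

X = regions.drop(columns="patent_applications")
y = regions["patent_applications"]

# A shallow tree keeps the lever-KPI function interpretable and
# naturally selects the subset of levers that actually matter.
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Levers ranked by their estimated impact on the KPI.
for name, importance in sorted(
        zip(X.columns, model.feature_importances_),
        key=lambda t: -t[1]):
    print(f"{name}: {importance:.2f}")
```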

Scenarios analysis

The last stage consists in building predictive scenarios for the chosen KPIs, in order to define realistic targets. The quantitative modelling of the relationship between levers and KPIs makes it possible to create projective scenarios based on assumptions about the future values of the lever indicators, automatically computing the corresponding KPI value and thus setting realistic targets.

In particular, starting from the analysis of the trends of the competitor regions, the methodology allows the KPI evolution to be estimated in the following scenarios (a sketch follows the list):

Neutral scenario: the region keeps pursuing the same KPI-related policies as in the past.

Best-case scenario: the region improves its effectiveness in pursuing policies related to the KPI, until it reaches the same performance as the leading competitor region.

Worst-case scenario: the region worsens its effectiveness in pursuing policies related to the KPI, following the trend of the least performing competitor region.
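A minimal sketch of the scenario computation, reusing the hypothetical `model` and lever names from the previous sketch; the scenario values are invented placeholders, not real competitor trends:

```python
# Turn assumptions on future lever values into projected KPI values.
import pandas as pd

scenarios = pd.DataFrame({
    # neutral: levers follow the region's own historical trend
    # best:    levers converge to the leading competitor's values
    # worst:   levers follow the least performing competitor's trend
    "rd_expenditure":     [1.3, 2.1, 0.7],
    "tertiary_education": [29.0, 35.1, 21.0],
    "broadband_coverage": [80.0, 92.0, 60.0],
}, index=["neutral", "best_case", "worst_case"])

# The regression model computes the KPI corresponding to each
# assumed configuration of the lever indicators.
projections = pd.Series(model.predict(scenarios), index=scenarios.index)
print(projections)
```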


DSS DATA

In order to perform the analysis, the platform requires the availability of "quality" data, that is, data which are:

available in digital format and homogeneous, so that they can be processed automatically;

reliable, in order to guarantee the accuracy of the analysis results;

regularly updated, in order to analyse the effect of a specific intervention;

of high granularity, in order to identify the most advantageous conditions towards which interventions should be oriented.

These data need to be extracted from heterogeneous data providers and then integrated, in compliance with data processing regulations such as the GDPR.

Data providers

The following initial set of relevant data providers has been identified for ingestion into the DSS:

OECD – Regional Statistics and Indicators

OECD REGPAT – Regional Patent Database

ISTAT – Indicators of Fair and Sustainable Welfare

ISTAT – Regional Statistics

SALUTEGOV – Regional Economic and Financial Database archive

SALUTEGOV – ASL Structures and activities

AIFI – Venture Capital Investments

Terna – Electric energy consumption

INAPP – IeFP courses data

INDIRE – Dual system data

CRISP and Fondazione Agnelli – Technical and professional schools performance data

COEWEB – Import-export data

EQI (QoG) – European Quality of Government Index

ISPRA – Environmental data

RIS 2017 – Regional Innovation Scoreboard

EUROSTAT


DSS ARCHITECTURE

Architectural choices

This section details the architectural choices made for the realisation of a distributed and scalable computing platform for the storage and analysis of large amounts of data and real-time information flows. All the technological components and the interactions among them are analysed.

Apache Mesos

Mesos (http://mesos.apache.org) is a kernel for distributed systems which allows cluster management by abstracting resources such as memory, storage space and CPU, and making them available, in isolation, to the services that request them.

It can be installed on local servers or on the main cloud providers (AWS, Azure, GCP), and is compatible with Linux, macOS and Windows servers.

Figure 2: Dynamic configuration of nodes with Apache Mesos


Mesos uses a master/agent architecture. The master nodes manage the distribution of resources to the agent nodes, i.e. the nodes where the services actually run. Replicating the master nodes yields a fault-tolerant configuration.

Mesos can natively run containers (with Docker, appc and OCI images), relying either directly on Docker or on its own implementation, which allows greater control over resources. An integrated REST API makes it possible to automate cluster monitoring and management procedures.

Programs that exploit Mesos are called frameworks, and are responsible for allocating resources to the services they manage.

Figure 3: Allocation of resources to different frameworks via Mesos

Integration with Apache Marathon

Marathon (http://mesosphere.github.io/marathon) is a framework for Mesos that orchestrates services and other frameworks on Mesos, similarly to what an init system like systemd does, while also providing service discovery and load balancing capabilities.

Marathon executes services on the cluster according to given constraints. If a service, or an instance of it, stops working, it is restarted; if a node crashes, the services allocated to it are automatically relocated to other machines. Marathon also scales service instances according to their load, and provides persistent volumes to services such as databases. As with Mesos, a REST API is available to automate and customise the management of services.
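As a hedged illustration of this automation, the sketch below posts a hypothetical application definition to Marathon's REST API (POST /v2/apps); the gateway host, service id, image and resource figures are assumptions:

```python
# Deploy a containerised service through Marathon's REST API.
import requests

app_definition = {
    "id": "/dss/ingestion-worker",         # hypothetical service id
    "container": {
        "type": "DOCKER",
        "docker": {"image": "python:3.11-slim"},
    },
    "cmd": "python -m worker",             # placeholder entry point
    "cpus": 0.5,
    "mem": 512,
    "instances": 2,                        # Marathon keeps 2 copies alive
}

resp = requests.post("http://marathon.example:8080/v2/apps",
                     json=app_definition, timeout=10)
resp.raise_for_status()
print(resp.json().get("deployments"))      # deployment(s) now in progress
```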

Datacenter Operating System

Datacenter Operating System, or DC/OS (http://dcos.io), is a commercial product based on Mesos and Marathon, of which a free open-source version is also available. DC/OS simplifies the installation on nodes, provides a repository of services that can be installed directly on the cluster, and offers an interface which simplifies the deployment, configuration and updating of services.

Kafka

Kafka (http://kafka.apache.org) is a distributed streaming platform. Input data are immediately saved to disk and are retained for a configurable period of time. The data streams are organised into topics, which are divided into partitions stored on several servers.

A client application can subscribe to the Kafka topics it wants to receive messages from, and can read all the retained messages, starting from any offset.
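A minimal sketch with the kafka-python client, assuming a broker at kafka.example:9092; the topic name and message fields are illustrative:

```python
# Publish a JSON message to a topic, then replay the topic from the start.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("regional-indicators", {"region": "ITC4", "kpi": 1.7})
producer.flush()

# auto_offset_reset="earliest": read every retained message in the topic,
# not only those produced after the consumer connects.
consumer = KafkaConsumer(
    "regional-indicators",
    bootstrap_servers="kafka.example:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```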

Spark

Spark (http://spark.apache.org) is a framework for distributed data processing, designed specifically for machine learning applications. It can be very fast (up to two orders of magnitude faster than Hadoop MapReduce), thanks to keeping data in RAM and to the optimisation of the computation pipeline.

Several high-level libraries already integrated in Spark provide functionalities for Machine Learning (ML), graph management and analysis, data manipulation through SQL-like queries, and real-time data streaming. The use of these libraries, and of other non-native tools which can be integrated with Spark, accelerates development and implementation.

Figure 4: The Spark integrated library ecosystem

Furthermore, Spark offers an ML library (MLlib) and integrates, partially or fully, with other libraries, creating a complete ecosystem for prototyping, developing and deploying machine learning solutions.

MLlib. The native library for ML in Spark, originally inspired by scikit-learn (from which it inherits the pipeline concept). It provides a Python interface and includes a complete collection of machine learning algorithms and data transformation capabilities. It is actively supported, open-source, and has a growing community of developers and users that guarantees support and improvements over time.
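A minimal sketch of an MLlib pipeline in the scikit-learn style mentioned above; the dataset and column names are invented:

```python
# Pipeline = feature assembly + regressor, trained in one fit() call.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor

spark = SparkSession.builder.appName("dss-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.2, 28.0, 110.0), (0.8, 22.5, 45.0), (2.1, 35.1, 230.0)],
    ["rd_expenditure", "tertiary_education", "kpi"],
)

assembler = VectorAssembler(
    inputCols=["rd_expenditure", "tertiary_education"],
    outputCol="features",
)
tree = DecisionTreeRegressor(featuresCol="features", labelCol="kpi")

model = Pipeline(stages=[assembler, tree]).fit(df)
model.transform(df).select("kpi", "prediction").show()
```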

H2O. A machine learning platform that, through a complete integration ("Sparkling Water"), calls itself "the killer application for Spark" (https://blog.h2o.ai/2014/06/h2o-killer-application-spark/). It allows the construction of machine learning models and pipelines in distributed computing environments, and it also has a Python interface. It partially overlaps with, and complements, the MLlib algorithm library (e.g. with dedicated Deep Learning functionality).

Other ML libraries integrate partially or fully with Spark, further expanding and completing the pool of algorithms provided by the two main libraries above. Some examples are:

XGBoost. A library that implements learning algorithms belonging to the gradient boosting framework, compatible with the Spark environment and successfully used in other use cases. Several benchmarks show that it is more efficient and performs better than MLlib and H2O in the class of algorithms in which it specialises (http://datascience.la/benchmarking-random-forest-implementations/).

Spark-sklearn bridge. Parallelisation on distributed environments of scikit-learn tasks; scikit-learn is the main Python library successfully used in the past for prototyping and developing other use cases.

Cassandra

Cassandra (http://cassandra.apache.org) is a distributed NoSQL database that uses a column-family model. Data are automatically replicated across multiple nodes in the cluster, or across multiple clusters, to ensure fault tolerance. The Cassandra cluster is based on a peer-to-peer architecture that tolerates the loss of multiple nodes and distributes the workload.

Cassandra is extremely scalable: clusters with more than 50,000 nodes are known.
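A hedged sketch using the DataStax Python driver; the contact point, keyspace, table layout and replication factor are assumptions:

```python
# Create a replicated keyspace and a simple KPI table, then read/write.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra.example"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS dss
    WITH replication = {'class': 'SimpleStrategy',
                        'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS dss.kpi_values (
        region text,
        year   int,
        kpi    text,
        value  double,
        PRIMARY KEY ((region), year, kpi)   -- region = partition key
    )
""")

session.execute(
    "INSERT INTO dss.kpi_values (region, year, kpi, value) "
    "VALUES (%s, %s, %s, %s)",
    ("ITC4", 2020, "patent_applications", 110.0),
)
for row in session.execute(
        "SELECT * FROM dss.kpi_values WHERE region = %s", ("ITC4",)):
    print(row.region, row.year, row.kpi, row.value)
```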

WSO2 API Management

The API Management functionalities of the WSO2 framework (http://wso2.com/api-management) are used to manage access to APIs on different backends. This module simplifies API management by controlling access and supporting different types of authentication and authorisation, as well as validating content and providing protection against bots and forged tokens. It can limit the number of requests based on the service and on the specific user, and provides advanced statistics on API usage for monitoring purposes.
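As a hedged illustration of token-based access through the gateway, the sketch below calls a hypothetical API published on WSO2; the gateway URL, API path and token are placeholders:

```python
# Call a backend API through the WSO2 gateway with an OAuth2 token.
import requests

GATEWAY = "https://gateway.example:8243"   # assumed WSO2 gateway host
TOKEN = "..."                              # token issued by WSO2 (placeholder)

resp = requests.get(
    f"{GATEWAY}/dss/v1/kpi/ITC4",          # hypothetical published API
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
if resp.status_code == 429:
    # The gateway throttled this user/service: request quota exceeded.
    print("rate limited, retry later")
else:
    resp.raise_for_status()
    print(resp.json())
```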


Data flow

The real-time data flow, based on the components described above, is summarised in the following figure.

Figure 5: Real-time data flow

Real-time data are taken from remote streams and collected by Kafka, which stores them for a defined period of time, guaranteeing their integrity. From Kafka, the data are picked up by the real-time flow, based on Apache Spark Streaming, which processes them in micro-batches: it filters and transforms them, and saves them to a Cassandra database. The batch flow, based on Apache Spark and exploiting the analysis capabilities provided by MLlib and H2O, accesses the data in Cassandra (and possibly the Kafka streams directly) and performs the analysis, saving the results in the database. Through the APIs, a web service can access the analysis results recorded in the database and present them to the user.
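A hedged sketch of the real-time leg of this flow (Kafka to Spark Structured Streaming to Cassandra); the broker address, topic, keyspace/table and the presence of the spark-cassandra-connector on the classpath are assumptions:

```python
# Kafka -> micro-batch processing -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("dss-stream").getOrCreate()

schema = (StructType()
          .add("region", StringType())
          .add("kpi", StringType())
          .add("value", DoubleType()))

# Streaming source: Kafka topic with JSON-encoded indicator values.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka.example:9092")
       .option("subscribe", "regional-indicators")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("d"))
          .select("d.*")
          .filter(col("value").isNotNull()))   # simple filtering step

def save_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to the Cassandra table via the
    # spark-cassandra-connector (must be available on the classpath).
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="dss", table="kpi_stream")
     .save())

query = parsed.writeStream.foreachBatch(save_to_cassandra).start()
query.awaitTermination()
```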


FOCUS ON LOMBARDY REGION

Innovation Hub for green business

Within the AlpGov2 project, besides the activities carried out by each Action Group, five transversal Strategic Policy Areas are pursued through the collaboration of different Action Groups: besides the EIF (EUSALP Innovation Facility), the others are thematic initiatives focused on different topics: Innovation Hub for Green Business, Smart Villages, Spatial Planning, and Carbon Neutral Alpine Region.

For each of the above-mentioned topics, several indicators can be identified and different datasets can be compared, depending on data availability, consistency and comparability. In fact, not all data providers regularly update the data they make available in open form, and the available datasets are often not standardised, which makes comparisons difficult or even impossible; moreover, since some of these datasets are not significant or not sufficiently thorough with respect to the geographical depth of interest, the choice of the topics and of the associated indicators is deliberately redundant, in order to have a wider spectrum of possibilities to choose from. Further steps will thus deal with the identification of the strategic areas of interest in which to apply the predictive scenarios for the chosen KPIs.
