FROM DATA STORE TO DATA SERVICES - DEVELOPING SCALABLE DATA ARCHITECTURE AT SURS

Tomaž Špeh

UNECE Workshop on the Modernisation of Statistical Production

Geneva, 15-17 April 2015

Modernization activities

• More efficient production, decreasing costs and administrative burden, increasing efficiency and flexibility

• Produce data faster and at lower cost

• Maintain or preferably increase quality

• Adopt international frameworks (CSPA, GSIM, GSBPM) and develop reusable and sharable software components

Comprehensive data sources

• Large existing data store, with the widest and deepest data access (survey, admin data)

• Investigating ways to unlock all kinds of new data sources (mobile, scraped, scanned)

• Use of modern technologies such as the internet, mobile phones, automated scanning techniques, etc.

DW based statistical production

• Data from everywhere need to be accessible and integrated in a timely and secure fashion

• Building solutions must be fast, iterative and repeatable (using common statistical data processing system)

• Sufficient infrastructure must be available for conducting statistical analyses (scalability, privacy)

• The statisticians need to “run the show”.

Great, but what is the problem?

Agility Gap

Based on TDWI research

Point to point architecture

• Integration complexity

• Extracting and moving data adds latency and cost

• Every project solves data access and integration in a different way

• Solutions are tightly coupled to data sources

• Data is fragmented across the organization

• Poor flexibility and agility

„Nature only replicates things when necessary“

What is Data Virtualization?

• Wikipedia:

„Data virtualization allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located.“

• Or more simply:

„A solution that sits in front of multiple data sources and allows them to be treated as a single SQL database“
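The "single SQL database" idea can be illustrated with a minimal sketch. It assumes two hypothetical sources (a survey table and an administrative register) and uses SQLite's ATTACH purely as a stand-in for a data virtualization server; all table, column and view names are invented for the example.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Attach two independent stores and publish one logical view over both."""
    conn = sqlite3.connect(":memory:")
    # Each attached database plays the role of a separate physical source.
    conn.execute("ATTACH DATABASE ':memory:' AS survey")
    conn.execute("ATTACH DATABASE ':memory:' AS admin")
    conn.execute("CREATE TABLE survey.responses (person_id INTEGER, employed INTEGER)")
    conn.execute("CREATE TABLE admin.register (person_id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO survey.responses VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)]
    )
    conn.executemany(
        "INSERT INTO admin.register VALUES (?, ?)",
        [(1, "East"), (2, "West"), (3, "East")],
    )
    # The logical view: consumers query it without knowing where the data live.
    # A TEMP view is used because it may reference several attached databases.
    conn.execute(
        """
        CREATE TEMP VIEW employment_by_region AS
        SELECT r.region AS region,
               SUM(s.employed) AS employed,
               COUNT(*) AS persons
        FROM survey.responses AS s
        JOIN admin.register AS r ON r.person_id = s.person_id
        GROUP BY r.region
        """
    )
    return conn


def employment_by_region(conn: sqlite3.Connection) -> list:
    """Query the logical view as if it were a table in a single database."""
    return conn.execute(
        "SELECT region, employed, persons FROM employment_by_region ORDER BY region"
    ).fetchall()
```

The consumer only ever sees `employment_by_region`; the join across the two stores is resolved at query time and no copy of the data is made.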

Use cases

• Improved data integration architecture

• „BIG DATA“ Mobile data integration

Hub and spoke architecture

Integration Hub

• Data virtualization abstracts, federates, and publishes data sources in an array of different formats

• It simplifies and accelerates the process of accessing, combining, and utilizing disparate data sources

• It hides the complexity of the different data sources from the consumers.

• It focuses on creating a unified common data model through logical views rather than on highly efficient data movement

• Logical views are exposed as data services to consumers in many different formats: SQL, Web Services (SOAP/XML and RESTful)
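Publishing one logical view in several formats might look like the sketch below. It is only an illustration: the formatter names and the `serve` function are hypothetical, and a real data virtualization server would handle this internally.

```python
import json
import xml.etree.ElementTree as ET


def to_json(rows: list) -> str:
    """Render rows of a logical view as a JSON payload (REST style)."""
    return json.dumps(rows, sort_keys=True)


def to_xml(rows: list, root_tag: str = "rows", row_tag: str = "row") -> str:
    """Render the same rows as a simple XML payload (SOAP/XML style)."""
    root = ET.Element(root_tag)
    for row in rows:
        element = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(element, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical registry of output formats offered to consumers.
FORMATTERS = {"json": to_json, "xml": to_xml}


def serve(view_rows: list, fmt: str) -> str:
    """Expose one logical view in the format the consumer asks for."""
    if fmt not in FORMATTERS:
        raise ValueError(f"unsupported format: {fmt}")
    return FORMATTERS[fmt](view_rows)
```

The view is defined once; adding another output format means adding one entry to the registry, not building another extract.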

Advanced features

• Role based access

• Agile data mart design

• Source analysis

• Data masking

Mobile data integration

• Pilot project Population Statistics Project using mobile positioning data.

• Enhance statistics about population (active, retired, etc.), distribution regarding time and location

• Anonymized raw (micro) data have been transferred to the statistical organization for processing

• Under our regulatory compliance rules, the data are considered confidential

• New challenges in areas such as efficient processing, integration into the current environment, privacy and security issues

• Data lake: a storage repository that holds raw data in its native format

Mobile data integration

• Easy access to mobile data

• Confidentiality requirements, row level security and masking of columns

• Enable agile integration with existing enterprise and provide consistent security policy across multiple data sources

• Reporting tool accesses the data virtualization server via rich SQL dialect (SAS Visual analytics)

• Data virtualization server translates rich SQL to HDFS

• Logical tables
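Row level security and column masking, as named above, can be sketched as a policy applied when a logical table is read. The record layout, role names and the `Policy` class are assumptions made for illustration; they do not describe the product or configuration used at SURS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-role policy: visible rows and masked columns."""
    row_filter: Callable[[dict], bool]
    masked_columns: frozenset = frozenset()


def apply_policy(rows: list, policy: Policy, mask: str = "***") -> list:
    """Return only the permitted rows, with sensitive columns masked."""
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row level security: the row is hidden from this role
        visible.append(
            {
                column: (mask if column in policy.masked_columns else value)
                for column, value in row.items()
            }
        )
    return visible


# Invented sample of mobile positioning records.
RECORDS = [
    {"subscriber_id": "A17", "region": "East", "hour": 8},
    {"subscriber_id": "B42", "region": "West", "hour": 9},
    {"subscriber_id": "C03", "region": "East", "hour": 17},
]

# Invented roles: one sees a single region with identifiers masked,
# the other sees everything.
POLICIES = {
    "regional_analyst": Policy(
        row_filter=lambda row: row["region"] == "East",
        masked_columns=frozenset({"subscriber_id"}),
    ),
    "methodologist": Policy(row_filter=lambda row: True),
}
```

Because the policy is enforced in one layer, the same rules apply whichever reporting tool issues the query.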

Benefits

• Data Virtualization layer delivers the data firewall functionality.

• Robust security infrastructure and reduction in physical copies of data, providing security mechanisms such as role based control and data masking, thus reducing risk

• Provides business-friendly representation of data, allowing the statisticians to interact with their data without having to know the complexities of their database or where the data are stored and allowing standard tools to acquire data (SAS Visual analytics)

• Business agility, actionability, information speed

SOA adoption

• Data services as a key part of a SOA. They provide the necessary interface to data for all business services.

• Expose all data through a single uniform interface

• Provide a single point of access to all business services in the system

• Expose legacy data sources as data services

• Provide a uniform means of exposing/accessing metadata

• Provide a searchable interface to data and metadata

• Provide uniform access controls to information
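The points above (single interface, searchable metadata, uniform access control) can be pictured with a small catalogue of data services. This is a sketch under assumptions: the `DataService` and `Catalog` classes, the service names and the roles are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataService:
    """Hypothetical data service: metadata plus a function that returns rows."""
    name: str
    description: str
    allowed_roles: frozenset
    fetch: Callable[[], list]


class Catalog:
    """Single point of access to all registered data services."""

    def __init__(self) -> None:
        self._services = {}

    def register(self, service: DataService) -> None:
        self._services[service.name] = service

    def search(self, keyword: str) -> list:
        """Searchable metadata: match the keyword in names and descriptions."""
        needle = keyword.lower()
        return sorted(
            name
            for name, service in self._services.items()
            if needle in name.lower() or needle in service.description.lower()
        )

    def get(self, name: str, role: str) -> list:
        """Uniform access control: the same check guards every service."""
        service = self._services[name]
        if role not in service.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {name!r}")
        return service.fetch()


def build_catalog() -> Catalog:
    """Register two invented services, one open and one restricted."""
    catalog = Catalog()
    catalog.register(DataService(
        name="population_by_region",
        description="Population counts by region from the register",
        allowed_roles=frozenset({"public", "statistician"}),
        fetch=lambda: [{"region": "East", "persons": 2}],
    ))
    catalog.register(DataService(
        name="mobile_positions",
        description="Anonymized mobile positioning records",
        allowed_roles=frozenset({"statistician"}),
        fetch=lambda: [{"region": "East", "hour": 8}],
    ))
    return catalog
```

A legacy source would be exposed the same way: wrap its access routine in `fetch` and register it, so consumers never address the source directly.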

But it’s not a Silver Bullet

• Can be slow, depending on how much data has to be fetched from remote systems to the DV platform; platforms try to be smart to reduce this

• Can impact performance on underlying systems: lots of users running queries against resource-sensitive systems is not a good idea

• Requires resources: another set of servers, technologies, etc. to manage, but this cost is often offset by the reduction in complexity elsewhere.

• Not a replacement: it is an additional tool, and ETL is still needed
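The first caveat, and the way platforms "try to be smart", can be illustrated by comparing how many rows leave the source when a filter is applied after fetching and when it is pushed down into the source query. The sketch uses an in-memory SQLite table as a stand-in for a remote system; the table and function names are invented.

```python
import sqlite3


def make_source(n_rows: int = 1000) -> sqlite3.Connection:
    """Create a stand-in remote source; every tenth event is in region East."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(i, "East" if i % 10 == 0 else "West") for i in range(n_rows)],
    )
    return conn


def fetch_then_filter(conn: sqlite3.Connection, region: str) -> tuple:
    """Naive federation: pull every row, then filter in the DV layer."""
    rows = conn.execute("SELECT id, region FROM events ORDER BY id").fetchall()
    transferred = len(rows)
    return transferred, [row for row in rows if row[1] == region]


def pushdown_filter(conn: sqlite3.Connection, region: str) -> tuple:
    """Filter pushdown: the source evaluates the predicate itself."""
    rows = conn.execute(
        "SELECT id, region FROM events WHERE region = ? ORDER BY id", (region,)
    ).fetchall()
    return len(rows), rows
```

Both functions return the same result set, but the second one moves only the matching rows, which is the difference that matters when the source is remote or heavily loaded.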

Conclusions

• Data virtualization could play a key role in modern statistical data integration stacks to cover some strategic needs.

• Extensive use of new data sources requires a new infrastructure that is not covered with traditional ETL solutions alone.

• The answer to the original question of “when should I use data virtualization and when should I use ETL tools?” really is “it depends on circumstances”.

• When combining structured data with unstructured data or requiring real-time access to up-to-date data, data virtualization is the better option.

• When copying massive amounts of data for complex analytics or historical data marts with no concerns about data freshness, ETL and static data warehouses are still the best option.

• Data virtualization can often be used to increase the value of existing ETL.

• When deployed in a SOA environment, Data Virtualization can provide data services to any application.
