BIG DATA TECHNOLOGY:
INFRASTRUCTURE
AND PLATFORMS
Ms. Gemma Batlle
Business Development Manager, Eurecat
www.eurecat.org
@eurecat_events
Mr. Marc Planagumà & Mr. Jose Luis Sánchez
BigData Platform Managers
ServiZurich
www.zurich.es
INTERNAL USE ONLY
Build Enterprise Data Lake without Drowning
26/10/2017
Jose Luis Sanchez and Marc Planagumà
Big Data Congress Barcelona 2017
ServiZurich – Big Data Delivery Center - EDAA
What is a Data Lake?
A data lake is a method of storing data within a system or repository in its natural format (structured or unstructured).
The idea of a data lake is to have a single store for all data in the enterprise, ranging from raw data to transformed data.
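As a rough illustration of the definition above, the sketch below lands files in their natural format in a raw zone and writes a derived copy to a transformed zone of the same store. The directory layout (`raw/`, `transformed/`), the helper names `land_raw` and `transform_csv_to_json`, and the sample data are assumptions made for this example; the talk does not describe a concrete layout.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte in the raw zone, keeping its native format."""
    target = lake / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def transform_csv_to_json(lake: Path, raw_file: Path) -> Path:
    """Write a transformed copy to the transformed zone; the raw file is left untouched."""
    with raw_file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    target = lake / "transformed" / (raw_file.stem + ".json")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(rows), encoding="utf-8")
    return target


# Hypothetical lake root and sample files (assumptions for illustration only).
lake_root = Path(tempfile.mkdtemp(prefix="lake_"))
# Structured data (CSV) and unstructured data (free text) share one store.
policies = land_raw(lake_root, "crm", "policies.csv", b"policy_id,premium\nP1,100\nP2,250\n")
notes = land_raw(lake_root, "callcenter", "call_0001.txt", b"Customer asked about renewal.")
curated = transform_csv_to_json(lake_root, policies)
curated_rows = json.loads(curated.read_text(encoding="utf-8"))
```

Keeping the raw copy unchanged means a later transformation can always be re-derived from the original data.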
…and what are the main goals of an enterprise data lake?
© Zurich
Break the Silos
Bring together Structured and Unstructured Data
Enable Big Data Analytics
What does a Data Lake in an Enterprise mean?
The Main Management Challenges
BI and Big Data Coexistence
Business Intelligence: Data Warehouse, Data Analytics
Big Data: Data Lake, Big Data Analytics
Open up space for Open Source
Vs.
Disruption in Infrastructure Strategies
Fast Acceleration in Technology Life Cycle
How to build a Data Lake in an Enterprise?
The Main Technology Challenges
Corporate Tools Stack Set-up
Big Data Platform into Production
Development Chain
SLAs
High Availability
Disaster Recovery
Support
Automation
Enterprise Platform Enablement
Security
Processes
Integration
Reports
Monitoring
Data Science framework in Production
Solution Developers
Platform Engineers
Data Scientists
Enjoying the Lake to Swim on Data
Thanks!
Mr. Víctor Dertiano
Senior Manager
BI Geek, www.bi-geek.com
2getherbank
Architecture of a financial platform
BIG DATA CONGRESS
1. What is 2getherbank
2. Architecture – Information Platform
What is 2getherbank
Architecture – Information Platform
Thank you very much
Mr. Albert Climent
Senior Data Scientist
Pervasive Technologies, www.pervasive-tech.com
Google Cloud Platform for Big Data
Albert Climent Bigas
Senior Data Scientist – Pervasive Technologies
26 October 2017
www.pervasive-tech.com
TABLE OF CONTENTS
• Problem statement
• Big Data lifecycle
• Big Data infrastructure
• Solutions
• Google Cloud Platform
PROBLEM STATEMENT
The 3 V's of Big Data
VOLUME
• Terabytes
• Transactions
• Tables, files
• Records
VELOCITY
• Batch
• Near time
• Real time
• Streams
VARIETY
• Unstructured
• Semi-structured
• Structured
• Combined
BIG DATA LIFECYCLE
Data Acquisition
Data Preparation
Data Representation
Data Analysis
Data Interpretation
BIG DATA INFRASTRUCTURE
On-premise | Cloud
Dedicated IT team for maintenance | No dedicated IT team required
Limited access to devices | Access from any location (internet)
Limited software updates and improvements | Continuous updates and improvements
High upfront and equipment renewal costs | Costs limited to usage
Managed risk of data loss | Low risk of data loss
SOLUTIONS: On-premise
SOLUTIONS: Cloud
GOOGLE CLOUD PLATFORM
[Diagram labels: VARIETY, VELOCITY, VOLUME]
Mr. Oscar Romero
Professor in the Department of Service and Information System Engineering at UPC and member of the Database Technologies and Information Management research group
Universitat Politècnica de Catalunya, www.essi.upc.edu/dtim
@romero_m_oscar
DTIM
Evolution of Data Ecosystems
From the Data Warehouse to the Data Lake
OSCAR ROMERO
DTIM RESEARCH GROUP (http://www.essi.upc.edu/dtim/)
UNIVERSITAT POLITÈCNICA DE CATALUNYA - BARCELONATECH
What is Big Data?
26-10-2017 OSCAR ROMERO - EVOLUTION OF DATA ECOSYSTEMS
Volume
Variability
Variety
Velocity
Value
Veracity
Today, the Focus is on Variety
That Big Data is synonymous with large volumes of data is a myth
“Rather, it is the ability to integrate more sources of data than ever before — new data, old data, big data, small data, structured data, unstructured data, social media data, behavioral data, and legacy data”
The Variety Challenge
MIT Sloan Management Review (2016): http://sloanreview.mit.edu/article/variety-not-volume-is-driving-big-data-initiatives/
The Long Tail of Big Data
The ultimate goal is…
◦ Integrate new data sources on-demand,
  ◦ Legacy Systems
  ◦ External Data (typically, semi-structured or unstructured data)
  ◦ Social Media and Behavioural Data Sources
◦ Provide the required flexibility for conducting on-demand data analysis techniques
  ◦ Data preparation
«Data is the new oil!» - Clive Humby, 2006
«No! Data is the new soil» - David McCandless, 2010
Value
Model-First (Load-Later)
Sources:
- Twitter API (JSON): USER FEEDBACK (User, Tweet, Date, Location)
- In-house DB (PostgreSQL): PRODUCT INFO (Product, Product features)
- Web Logs (Logs): USER WEB BEHAVIOUR (User, Product, Landing time, Visits ts)
Transformations:
- Sentiment Analysis (e.g., Text Mining)
- Log Analysis (e.g., Process Mining)
- Product homogenization (e.g., duplicate detection)
Target schema:
- User: Avg rating, List of preferences
- Product: Popularity, Top feature, Bottom feature
- Feature: Avg(sentiment)
Relationships: Is part of, Assesses, Interested In
Relationship attributes:
- Avg(sentiment)
- Keen: Avg(landing time) / #visits
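The "Keen" attribute on this slide is defined as Avg(landing time) / #visits. A minimal sketch of that computation follows, assuming one web-log record per visit. The field names (`user`, `product`, `landing_time`, `visit_ts`), the unit (seconds), and the sample values are assumptions for illustration; only the formula comes from the slide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical web-log records: one row per visit (sample data, not from the talk).
web_logs = [
    {"user": "u1", "product": "p1", "landing_time": 30.0, "visit_ts": "2017-10-01T10:00:00"},
    {"user": "u1", "product": "p1", "landing_time": 90.0, "visit_ts": "2017-10-02T11:30:00"},
    {"user": "u2", "product": "p1", "landing_time": 20.0, "visit_ts": "2017-10-02T12:00:00"},
]


def keenness(records):
    """Return {(user, product): avg(landing time) / number of visits}."""
    landing_times = defaultdict(list)
    for record in records:
        landing_times[(record["user"], record["product"])].append(record["landing_time"])
    return {pair: mean(times) / len(times) for pair, times in landing_times.items()}


keen = keenness(web_logs)
```

In a model-first design this value would be computed once by the transformation step and stored in the fixed target schema.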
Drawbacks
Permanent Transformations
Fixed Target Schema
High Entry Barriers
Load-First Model-Later
Sources (loaded as-is into the Data Lake):
- Twitter API (JSON): USER FEEDBACK
- In-house DB (PostgreSQL): PRODUCT INFO
- Web Logs (Logs): USER WEB BEHAVIOUR
Data Lake
Data Views:
- Analyst 1: User (Avg rating, List of preferences); Product (Popularity, Top feature, Bottom feature); Feature (Avg(sentiment)); relationships: Is part of, Assesses
- Analyst 2: User (Avg rating, List of preferences); Product (Popularity, Top feature, Bottom feature); relationship: Interested In
Relationship attributes:
- Avg(sentiment)
- Keen: Avg(landing time) / #visits
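The load-first, model-later idea on this slide can be sketched as follows: records are kept in the lake exactly as they arrive, and each analyst applies a model only when reading. The record fields, the `source` tag, and the `read_view` helper are hypothetical names introduced for this example, not part of the talk.

```python
import json

# Raw records are loaded as-is; no target schema is imposed at load time.
# (Sample data and field names are assumptions for illustration.)
raw_lake = [
    '{"source": "twitter", "user": "u1", "tweet": "Love the camera", "location": "Barcelona"}',
    '{"source": "weblog", "user": "u1", "product": "p1", "landing_time": 30.0}',
    '{"source": "weblog", "user": "u2", "product": "p1", "landing_time": 20.0}',
]


def read_view(lake, source, fields):
    """Model at read time: parse each raw record and project only the requested fields."""
    view = []
    for line in lake:
        record = json.loads(line)
        if record.get("source") == source:
            view.append({field: record.get(field) for field in fields})
    return view


# Analyst 1 models user feedback; Analyst 2 models web behaviour over the same lake.
analyst_1_view = read_view(raw_lake, "twitter", ["user", "tweet", "location"])
analyst_2_view = read_view(raw_lake, "weblog", ["user", "product", "landing_time"])
```

Both views are derived from the same untouched raw records, so adding a third analyst or a new field requires no reload.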
Drawbacks
Data Swamp
(Automated) Complex Transformations
From Data Swamps to Semantic Data Lakes
Sources:
- Twitter API (JSON): USER FEEDBACK
- In-house DB (PostgreSQL): PRODUCT INFO
- Web Logs (Logs): USER WEB BEHAVIOUR
Data Lake: File 1, File 2, File 3
Catalog:
- Product, Product features
- User, Tweet, Date, Location
- User, Product, Landing time, Visits ts
Data Views:
- Analyst 1: User (Avg rating, List of preferences); Product (Popularity, Top feature, Bottom feature); Feature (Avg(sentiment)); relationships: Is part of, Assesses
- Analyst 2: User (Avg rating, List of preferences); Product (Popularity, Top feature, Bottom feature); relationship: Interested In
Relationship attributes:
- Avg(sentiment)
- Keen: Avg(landing time) / #visits
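One way to read the catalog on this slide is as metadata that records which attributes each file in the lake holds, so analysts can find data without opening every file. The sketch below assumes that reading. The `CatalogEntry` type, the `files_with` helper, and the assignment of File 1 to File 3 to particular sources are hypothetical; the slide only lists the files and the attribute groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    file_name: str
    source: str
    attributes: tuple


# Hypothetical assignment of files to sources; attribute names follow the slide.
catalog = [
    CatalogEntry("File 1", "Twitter API (JSON)", ("user", "tweet", "date", "location")),
    CatalogEntry("File 2", "In-house DB (PostgreSQL)", ("product", "product_features")),
    CatalogEntry("File 3", "Web Logs", ("user", "product", "landing_time", "visits_ts")),
]


def files_with(entries, attribute):
    """List the files whose catalogued schema contains the given attribute."""
    return [entry.file_name for entry in entries if attribute in entry.attributes]


user_files = files_with(catalog, "user")
product_files = files_with(catalog, "product")
```

A lookup like this is what lets a data view be assembled from the right files instead of searching the whole lake.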
From IT-Centered to User-Centered
AUTOMATIC DATA GOVERNANCE
Thanks! Any questions?
Email: oromero@essi.upc.edu
Homepage: http://www.essi.upc.edu/dtim/people/oromero
Twitter: @romero_m_oscar