10 Big Data Technologies you Didn't Know About

Date post: 24-Jan-2017
Upload: jesus-rodriguez
Page 1: 10 Big Data Technologies you Didn't Know About

Big Data Technologies You Didn’t Know About

About Us

• Emerging technology firm focused on helping enterprises build breakthrough software solutions

• Building software solutions powered by disruptive enterprise software trends

  - Machine learning and data science
  - Cyber-security
  - Enterprise IoT
  - Powered by cloud and mobile

• Bringing innovation from startups and academic institutions to the enterprise

• Award winning agencies: Inc 500, American Business Awards, International Business Awards

Agenda

• Big data technologies you didn’t know about
• Apache Flink
• Apache Samza
• Google Cloud Dataflow
• StreamSets
• TensorFlow
• Apache NiFi
• Druid
• LinkedIn WhereHows
• Microsoft Cognitive Services

Two Goals…

Think Beyond Traditional Big Data Stacks

Learn from Companies Building Big Data Pipelines at Scale

Big Data pipelines in the enterprise

Areas of a Big Data Pipeline

Big Data Pipeline:

• Data Processing
• Stream Data Ingestion
• Data Transformations
• Cognitive Computing
• Machine Learning
• High Performance Data Access

Data Processing

Technology Stacks You Know

But You Probably Didn’t Know About….

Apache Flink

• Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data Analytics.

• The Apache Flink engine uses data streaming, in-memory processing, and iteration operators to improve performance.

• Apache Flink originated in a research project called Stratosphere, conceived in 2008 by Professor Volker Markl of the Technische Universität Berlin in Germany.

• In German, Flink means agile or swift. Flink joined the Apache incubator in April 2014 and graduated as an Apache Top Level Project (TLP) in December 2014.

Apache Flink

Draws on concepts from MPP database technology:

• Declarativity
• Query optimization
• Efficient parallel in-memory and out-of-core algorithms

Draws on concepts from Hadoop MapReduce technology:

• Massive scale-out
• User Defined Functions
• Complex data types
• Schema on read

Adds:

• Streaming
• Iterations
• Advanced dataflows
• General APIs

Apache Flink

Apache Flink: An Example

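// `env` is assumed to be a Flink execution environment created elsewhere.
// The "..." arguments are elided on the original slide.
case class Word(word: String, frequency: Int)

DataStream API (streaming):

// Read lines of text from a socket
val lines: DataStream[String] = env.fromSocketStream(...)

// Count words over a 5-second window that is evaluated every second
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

// Read lines of text from a file
val lines: DataSet[String] = env.readTextFile(...)

// Count words over the whole data set
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()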

Stream Data Processing

Technology Stacks You Know

But You Probably Didn’t Know About….

Apache Samza

• Created by LinkedIn to extend the capabilities of Apache Kafka
• Simple API
• Managed state
• Fault tolerant
• Durable messaging
• Scalable
• Extensible
• Processor isolation

Apache Samza: Overview

• Samza code runs as a YARN job

• You implement the StreamTask interface, which defines a process() call.

• A StreamTask runs inside a task instance, which itself runs inside a YARN container.
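
The per-message callback model described above can be sketched in plain Scala. This is an illustration under stated assumptions, not Samza's API: `Envelope`, `Collector`, `SimpleStreamTask`, `CountByKeyTask` and the stream names are hypothetical stand-ins, and the real `process()` call also receives a task coordinator.

```scala
import scala.collection.mutable

// Simplified stand-ins for illustration only; these are NOT Samza's classes.
final case class Envelope(stream: String, key: String, message: String)

trait Collector { def send(out: Envelope): Unit }

trait SimpleStreamTask {
  // Called once per incoming message, mirroring the process() callback.
  def process(in: Envelope, collector: Collector): Unit
}

// Keeps a running count per key (local task state) and emits the updated count.
final class CountByKeyTask extends SimpleStreamTask {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

  def process(in: Envelope, collector: Collector): Unit = {
    counts(in.key) += 1
    collector.send(Envelope("counts-out", in.key, counts(in.key).toString))
  }
}

// An in-memory collector standing in for the output stream.
final class BufferCollector extends Collector {
  val sent = mutable.ArrayBuffer.empty[Envelope]
  def send(out: Envelope): Unit = { sent += out; () }
}

object SamzaSketch {
  // Feeds a finite batch of messages through a fresh task, in arrival order.
  def run(messages: Seq[Envelope]): Seq[Envelope] = {
    val task = new CountByKeyTask
    val collector = new BufferCollector
    messages.foreach(m => task.process(m, collector))
    collector.sent.toSeq
  }
}
```

In this sketch the framework's job is reduced to the `foreach` loop in `SamzaSketch.run`; in a real deployment the container would drive the task and restore its state after a failure.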

Apache Samza: Operators

• Filter records matching a condition
• Map: record ⇒ func(record)
• Join two or more datasets by key
• Group records with the same value in a field
• Aggregate records within the same group
• Pipe: job 1’s output ⇒ job 2’s input

• MapReduce assumes a fixed dataset. Can we adapt this to unbounded streams?

Apache Samza: Sample Code

Data Transformation

Technology Stacks You Know

But You Probably Didn’t Know About….

Google Cloud Dataflow

• Native Google Cloud data processing service

• Simple programming model for batch and streamed data processing tasks

• Provides a managed service to control the execution of data processing jobs

• Data processing jobs can be authored using the Dataflow SDKs (Apache Beam)

Google Cloud Dataflow: Details

• A pipeline encapsulates an entire series of computations that accepts input data from external sources, transforms that data, and produces output data.

• A PCollection represents a set of data in a pipeline

• Sources and sinks abstract the read and write operations in a pipeline

• Google Cloud Dataflow provides management, monitoring and security capabilities for data pipelines
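
To make the pipeline, collection, source and sink vocabulary concrete, here is a minimal in-memory sketch in plain Scala. It does not use the Dataflow or Beam SDKs: `PColl`, `Source`, `Sink` and `WordCountPipeline` are hypothetical names invented for this example, and everything runs locally on a `Vector`.

```scala
// Hypothetical in-memory stand-ins for the concepts above; NOT the Dataflow or Beam SDK.
final case class PColl[A](elements: Vector[A]) {
  // Each transform returns a new collection and leaves its input untouched.
  def map[B](f: A => B): PColl[B] = PColl(elements.map(f))
  def filter(p: A => Boolean): PColl[A] = PColl(elements.filter(p))
  def flatMap[B](f: A => Iterable[B]): PColl[B] = PColl(elements.flatMap(f))
  // Counts how often each distinct element occurs (output order is unspecified).
  def countPerElement: PColl[(A, Int)] =
    PColl(elements.groupBy(identity).map { case (k, v) => (k, v.size) }.toVector)
}

// A source produces the pipeline's input; a sink consumes its output.
trait Source[A] { def read(): PColl[A] }
trait Sink[A] { def write(data: PColl[A]): Unit }

object WordCountPipeline {
  // The pipeline: read lines, split into words, normalize, count, write.
  def run(source: Source[String], sink: Sink[(String, Int)]): Unit = {
    val counts = source.read()
      .flatMap(line => line.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .countPerElement
    sink.write(counts)
  }
}
```

Because the pipeline only talks to the `Source` and `Sink` traits, the same `run` method could be pointed at a file, a queue, or a test fixture without changing the transforms.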

Google Cloud Dataflow is Based on Apache Beam

• 1. Portable - You can use the same code with different runners (abstraction) and backends: on premises, in the cloud, or locally

• 2. Unified - Same unified model for batch and stream processing

• 3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.

• 4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel
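
Event windowing, watermarks and lateness are easier to follow with a small worked example. The sketch below is plain Scala rather than Beam code, and it makes two simplifying assumptions: windows are fixed-size, and the watermark is just the largest event time seen so far. `Event`, `FixedWindows` and `allowedLatenessMs` are names chosen for this illustration.

```scala
// Hypothetical event record: a key plus the time at which the event happened.
final case class Event(key: String, eventTimeMs: Long)

object FixedWindows {
  // Start of the fixed window that contains the given event time.
  def windowStart(eventTimeMs: Long, sizeMs: Long): Long =
    eventTimeMs - Math.floorMod(eventTimeMs, sizeMs)

  // Counts events per (window start, key), processing them in arrival order.
  // Assumption for this sketch: the watermark is the largest event time seen so far.
  // An event is dropped as too late when its window end plus the allowed lateness
  // is not after the watermark.
  def countPerWindow(
      arrivals: Seq[Event],
      sizeMs: Long,
      allowedLatenessMs: Long): Map[(Long, String), Int] = {
    var watermark = Long.MinValue
    var counts = Map.empty[(Long, String), Int]
    for (e <- arrivals) {
      watermark = math.max(watermark, e.eventTimeMs)
      val start = windowStart(e.eventTimeMs, sizeMs)
      val windowEnd = start + sizeMs
      if (windowEnd + allowedLatenessMs > watermark) {
        val k = (start, e.key)
        counts = counts.updated(k, counts.getOrElse(k, 0) + 1)
      }
    }
    counts
  }
}
```

With 10 ms windows and 5 ms of allowed lateness, an event stamped 3 that arrives after an event stamped 12 is still counted in window 0, while an event stamped 4 that arrives after an event stamped 27 is dropped.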

But You Probably Didn’t Know About….

StreamSets Data Collector

• Data processing platform optimized for data in motion

• Visual data flow authoring model
• Open source distribution model
• On-premise and cloud distributions
• Rich monitoring and management interfaces

StreamSets Data Collector: Details

• Data collectors stream and process data in real time using data pipelines

• A pipeline describes a data flow from origin to destination

• A pipeline is composed of origins, destinations and processors

• Extensibility model based on JavaScript and Jython

• The lifecycle of a data collector can be controlled via the administration console

Machine Learning

Technology Stacks You Know

But You Probably Didn’t Know About….

TensorFlow

• Second-generation machine learning system, the successor to DistBelief

• TensorFlow grew out of a project at Google, called Google Brain, aimed at applying various kinds of neural network machine learning to products and services across the company.

• An open source software library for numerical computation using data flow graphs

• Used in the following projects at Google:
  1. DeepDream
  2. RankBrain
  3. Smart Reply
  And many more

TensorFlow: Details

• Data flow graphs describe mathematical computation with a directed graph of nodes & edges

• Nodes in the graph represent mathematical operations.

• Edges represent the multidimensional data arrays (tensors) communicated between them. 

• Edges describe the input/output relationships between nodes.

• The flow of tensors through the graph is where TensorFlow gets its name.
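
The graph model above (operations as nodes, tensors flowing along edges) can be illustrated with a toy evaluator in plain Scala. This is not the TensorFlow API: `Node`, `Constant`, `Placeholder`, `Add`, `Mul` and `ToySession` are hypothetical names, and a tensor is simplified to a one-dimensional `Vector[Double]`.

```scala
// A toy dataflow graph in plain Scala; this is NOT the TensorFlow API.
sealed trait Node
final case class Constant(value: Vector[Double]) extends Node
final case class Placeholder(name: String) extends Node
final case class Add(left: Node, right: Node) extends Node
final case class Mul(left: Node, right: Node) extends Node

object ToySession {
  // Evaluates a node by first evaluating the nodes that feed it.
  // `feeds` supplies a value for every placeholder used in the graph.
  def run(node: Node, feeds: Map[String, Vector[Double]]): Vector[Double] = node match {
    case Constant(v)    => v
    case Placeholder(n) => feeds(n)
    case Add(l, r)      => zipWith(run(l, feeds), run(r, feeds))(_ + _)
    case Mul(l, r)      => zipWith(run(l, feeds), run(r, feeds))(_ * _)
  }

  // Element-wise combination of two equally sized tensors.
  private def zipWith(a: Vector[Double], b: Vector[Double])(
      f: (Double, Double) => Double): Vector[Double] = {
    require(a.length == b.length, "tensor shapes must match")
    a.zip(b).map { case (x, y) => f(x, y) }
  }
}
```

Building the graph (`Add(Mul(w, x), b)`) and running it are separate steps here, which mirrors the define-then-run split suggested by the graph and session concepts.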

TensorFlow

• Tensor
• Variable
• Operation
• Session
• Placeholder
• TensorBoard

Fast Data Access

Technology Stacks You Know

But You Probably Didn’t Know About….

Druid

• Druid was started in 2011
  ‣ Power interactive data applications
  ‣ Multi-tenancy: lots of concurrent users
  ‣ Scalability: trillions of events/day, sub-second queries
  ‣ Real-time analysis

• Key features
  ‣ Low latency ingestion
  ‣ Fast aggregations
  ‣ Arbitrary slice-n-dice capabilities
  ‣ Highly available
  ‣ Approximate & exact calculations

Druid: Details

• Realtime Node
• Historical Node
• Broker Node
• Coordinator Node
• Indexing Service
• JSON-based query language
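
As a rough illustration of the JSON-based query language, the Scala helper below assembles the body of a native timeseries query as a string. The field names follow Druid's native query format; the data source, metric name and interval passed in are hypothetical, and the helper does no JSON escaping, so it is only a sketch.

```scala
object DruidQuerySketch {
  // Builds the JSON body of a native "timeseries" query that sums one metric per hour.
  // Assumption: the arguments contain no characters that need JSON escaping.
  def timeseries(dataSource: String, metric: String, interval: String): String =
    s"""{
       |  "queryType": "timeseries",
       |  "dataSource": "$dataSource",
       |  "granularity": "hour",
       |  "intervals": ["$interval"],
       |  "aggregations": [
       |    { "type": "longSum", "name": "total_$metric", "fieldName": "$metric" }
       |  ]
       |}""".stripMargin
}
```

A call such as `DruidQuerySketch.timeseries("page_views", "clicks", "2017-01-01/2017-01-02")` yields a document that would be sent over HTTP to a Broker node, which fans the query out to the Realtime and Historical nodes.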

Low Latency Data Flows

Technology Stacks You Know

But You Probably Didn’t Know About….

Apache NiFi

• Powerful and reliable system to process and distribute data

• Directed graphs of data routing and transformation

• Web-based User Interface for creating, monitoring, & controlling data flows

• Highly configurable - modify data flow at runtime, dynamically prioritize data

• Data Provenance tracks data through entire system

• Easily extensible through development of custom components

Apache NiFi: Architecture

Apache NiFi: Concepts

• FlowFile
  • Unit of data moving through the system
  • Content + Attributes (key/value pairs)

• Processor
  • Performs the work, can access FlowFiles

• Connection
  • Links between processors
  • Queues that can be dynamically prioritized

• Process Group
  • Set of processors and their connections
  • Receive data via input ports, send data via output ports
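
The relationship between these concepts can be sketched in a few lines of plain Scala. None of the types below are NiFi classes: `FlowFile`, `Processor`, `Connection`, `TagLength` and the `priority` attribute are hypothetical stand-ins, with the connection modeled as a priority queue between processors.

```scala
import scala.collection.mutable

// Plain-Scala stand-ins for the concepts above; these are NOT NiFi's classes.
final case class FlowFile(content: String, attributes: Map[String, String])

trait Processor {
  // Returns the transformed FlowFile, or None to drop it.
  def process(in: FlowFile): Option[FlowFile]
}

// A connection is a queue between processors. In this sketch, FlowFiles with a
// higher numeric "priority" attribute leave the queue first (default 0).
final class Connection {
  private val byPriority =
    Ordering.by[FlowFile, Int](_.attributes.getOrElse("priority", "0").toInt)
  private val queue = mutable.PriorityQueue.empty[FlowFile](byPriority)

  def enqueue(f: FlowFile): Unit = queue.enqueue(f)

  def drain(): List[FlowFile] = {
    val out = List.newBuilder[FlowFile]
    while (queue.nonEmpty) out += queue.dequeue()
    out.result()
  }
}

// Example processor: adds a "length" attribute without changing the content.
object TagLength extends Processor {
  def process(in: FlowFile): Option[FlowFile] =
    Some(in.copy(attributes = in.attributes + ("length" -> in.content.length.toString)))
}

object NiFiSketch {
  // Runs each FlowFile through the processor, queues the results, then drains by priority.
  def run(inputs: Seq[FlowFile], processor: Processor): List[FlowFile] = {
    val connection = new Connection
    inputs.flatMap(f => processor.process(f).toList).foreach(connection.enqueue)
    connection.drain()
  }
}
```

Keeping attributes separate from content is what lets a processor such as `TagLength` route or annotate data without touching the payload.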

Data Discovery

Technology Stacks You Know

But You Probably Didn’t Know About….

WhereHows

LinkedIn WhereHows

• Where is my data? How did it get there?

• Enterprise data catalog

• Metadata search

• Collaboration

• Data lineage analysis

• Connectivity to many data sources and ETL tools

• Powers LinkedIn’s data discovery layer

LinkedIn WhereHows: Architecture

• Web interface for data discovery

• API enabled

• Backend server that controls metadata crawling and integration with other systems

LinkedIn WhereHows: Data Lineage

• Collects metadata from ETL platforms and scripts

• Sources include
  • Pig
  • MapReduce
  • Informatica
  • Teradata

• Visualizes the lineage information associated with a data source
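
Lineage analysis of this kind boils down to walking a graph built from job metadata. The plain-Scala sketch below shows one way to answer "how did this dataset get here?"; it is not WhereHows code, and `JobRun`, `LineageSketch` and the dataset names in the usage note are hypothetical.

```scala
// Hypothetical metadata record: one ETL job with the datasets it read and wrote.
final case class JobRun(job: String, inputs: Set[String], outputs: Set[String])

object LineageSketch {
  // Returns every dataset the target was derived from, directly or indirectly.
  def upstream(target: String, runs: Seq[JobRun]): Set[String] = {
    @scala.annotation.tailrec
    def loop(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        // Inputs of every job that wrote one of the datasets on the frontier.
        val parents = runs
          .filter(r => r.outputs.exists(frontier.contains))
          .flatMap(_.inputs)
          .toSet
        val fresh = parents -- seen
        loop(fresh, seen ++ fresh)
      }
    loop(Set(target), Set.empty)
  }
}
```

Given a Pig job that turns `raw_logs` into `clean_logs`, a MapReduce job that joins `clean_logs` with `members` into `sessions`, and a report job that turns `sessions` into `dashboard`, calling `LineageSketch.upstream("dashboard", runs)` returns all four upstream datasets. Tracking the `seen` set keeps the walk finite even if the metadata contains a cycle.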

Cognitive Computing

Technology Stacks You Know

But You Probably Didn’t Know About….

Microsoft Cognitive Services

• Based on Project Oxford and Bing

• Offers 22 cognitive computing APIs

• Main categories include:

• Vision

• Speech

• Language

• Knowledge

• Search

• Integrated with Cortana Intelligence Suite

Microsoft Cognitive Services

Microsoft Cognitive Services: Developer Experience

• 22 different REST APIs that abstract cognitive capabilities

• SDKs for Windows, iOS, Android and Python

• Open source

Summary

• The big data ecosystem is constantly evolving
• There are a lot of relevant new technologies beyond the traditional Hadoop-Spark stacks
• Big internet companies are leading innovation in the space

Thanks

