
Post on 06-Jun-2020


Big Data Visualization with Tableau

Avirup Chakraborty(MDS201908)

Debangshu Bhattacharya(MDS201910)

Ipsita Ghosh(MDS201913)

Swaraj Bose(MDS201936)

Sreya K.K.(MDS201804)

What is big data?

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.

Velocity

Variety

Volume

Veracity

Why is data visualization important?

Communicates relationships in the data through images

Allows trends and patterns to be seen more easily

Gives meaning to complicated datasets so that their message is clear and concise

Makes outlier detection easier

Makes results from complex algorithms much easier to understand

Provides a summary of the data

Challenges in big data visualization

(4 V’s yet again!)

Traditional visualization tools (e.g. MS Excel, Minitab) are not capable of handling large datasets

Providing low latency in visualization

Parallelization is required

Dimensions of the data have to be chosen carefully

Most current visualization tools perform poorly with respect to scalability, functionality and response time

Steps for big data visualization

1. Data acquisition

2. Parsing and filtering

3. Mining hidden patterns

4. Data visualization

5. Refinement
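The steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the CSV text, the column names and the helper functions (`acquire`, `parse_and_filter`, `mine`, `visualize`) are all hypothetical, and a text bar chart stands in for a real visualization tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input; a real pipeline would read files, APIs or streams.
RAW = """city,temperature
Kolkata,31
Chennai,34
Kolkata,n/a
Delhi,29
Chennai,36
"""

def acquire(text):
    """Data acquisition: read the raw records as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_and_filter(rows):
    """Parsing and filtering: convert types, drop rows that do not parse."""
    clean = []
    for row in rows:
        try:
            clean.append((row["city"], float(row["temperature"])))
        except ValueError:
            continue  # e.g. the "n/a" reading
    return clean

def mine(records):
    """Mining: compute the mean temperature per city."""
    totals, counts = defaultdict(float), defaultdict(int)
    for city, temp in records:
        totals[city] += temp
        counts[city] += 1
    return {city: totals[city] / counts[city] for city in totals}

def visualize(summary, scale=1.0):
    """Visualization: one text bar per city; `scale` is the knob to refine."""
    return [
        f"{city:<8} {'#' * round(value * scale)} {value:.1f}"
        for city, value in sorted(summary.items())
    ]

summary = mine(parse_and_filter(acquire(RAW)))
chart = visualize(summary, scale=0.5)  # refinement: adjust scale and redraw
print("\n".join(chart))
```

Refinement here just means calling `visualize` again with a different `scale`; in practice it covers any iteration on the chart after a first look.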

What is Tableau?

A powerful and fast-growing data visualization tool used in the Business Intelligence industry.

Connects easily to nearly any data source.

Allows for instantaneous insight on data by transforming it into interactive visualizations called dashboards.

Why is Tableau helpful?

Handles large volumes of data

No scripts or code required; provides a user interface

Filters multiple datasets simultaneously

Creates interactive and shareable dashboards depicting trends and variations

Integrates with other programming languages to do complex calculations

And many more….

Trivia

Founded: January 2003, California

Founders: Christian Chabot, Chris Stolte (Stanford University), Pat Hanrahan

Headquarters: Seattle, Washington

Website: https://www.tableau.com/

Built using C++

Latest version (at the time of this presentation): 2020.1

Tableau Desktop:

A data visualization tool designed to create data visualizations, reports and dashboards in a fast and intelligent way.

Users can connect to multiple data sources, carry out multi-dimensional data analysis, create dashboards or reports, modify metadata and publish a complete workbook to Tableau Server if needed.

Adapt your content for any screen size and any device (i.e. desktop, laptop, tablet or even a smartphone!).

Tableau Desktop

Personal Edition vs. Professional Edition

Personal Edition

Connects to a limited set of data sources: Microsoft Access, Microsoft Excel, Microsoft Azure, Tableau Data Extract, text files (CSVs).

Cannot connect to Tableau Server, but allows users to create package files for Tableau Reader.

Costs $999 per user.

Professional Edition

Connects to a wider variety of data sources: Amazon Redshift, Google Analytics, Google BigQuery, Hortonworks Hadoop, OLAP databases, Salesforce.

Enables connection to Tableau Server and the creation of package files for Tableau Reader.

Costs $1999 per user.

Tableau Server:

Tableau Server is essentially an online hosting platform to hold all your Tableau workbooks, data sources and more.

It works like any other server: you can store things here and they will be safe from fires and pesky hackers.

So, what are the advantages of Tableau Server?

1. Firstly…. COLLABORATION!

Being a Tableau product, Tableau Server lets us use the functionality of Tableau without always having to download and open workbooks.

Users need not install Tableau Desktop on their machines, and they can still interact with dashboards shared with them.

2. CLOUD SUPPORT

Tableau Server can be deployed on-premises as well as in public clouds like Azure, AWS, IBM Cloud, Google Cloud Platform etc.

It also enables an administrator to track and manage the content, licenses, performance, and permissions for data sources with ease.

3. COMPATIBILITY

Tableau Server supports a variety of Android apps, iPhone apps and web browsers like Internet Explorer, Mozilla Firefox, Google Chrome and Safari.

4. LIMITED ACCESS DESIGN

On Tableau Server, we can set permissions to different bits

of work, to allow us as an organization to determine who

can access and interact with what.

Let us illustrate this using a really simple example >>

Consider this ‘imaginary’ company consisting of Tony Stark, Dr. Bruce Banner and Nick Fury:

❖ Tony Stark has access on the server to upload and edit work in a project containing test documents.

❖ Dr. Banner can interact only with the production-quality documents.

❖ And…Nick Fury can access but not edit the final presentation documents.

❖ Of course…Loki cannot even have a look at the documents!

(at least we can hope so)
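The example above can be written down as a small permission table. This is a minimal sketch, not Tableau Server's actual permission model or API: the capability names, the project names and the `is_allowed` helper are all hypothetical and only mirror the slide.

```python
from enum import Flag, auto

class Capability(Flag):
    """Hypothetical capabilities; not Tableau Server's real capability names."""
    NONE = 0
    VIEW = auto()
    INTERACT = auto()
    EDIT = auto()
    UPLOAD = auto()

# Hypothetical projects and grants mirroring the example company.
PERMISSIONS = {
    "test": {
        "Tony Stark": Capability.VIEW | Capability.INTERACT
        | Capability.EDIT | Capability.UPLOAD,
    },
    "production": {
        "Bruce Banner": Capability.VIEW | Capability.INTERACT,
    },
    "final presentation": {
        "Nick Fury": Capability.VIEW,
    },
}

def is_allowed(user, project, capability):
    """Return True only if the user was explicitly granted the capability."""
    granted = PERMISSIONS.get(project, {}).get(user, Capability.NONE)
    return capability in granted

print(is_allowed("Tony Stark", "test", Capability.EDIT))
print(is_allowed("Nick Fury", "final presentation", Capability.EDIT))
print(is_allowed("Loki", "test", Capability.VIEW))
```

Anything not explicitly granted is denied, which is why Loki, who appears nowhere in the table, gets no access at all.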

Tableau Public:

Tableau Public is a FREE tool that anyone can use to connect to

data, create interactive data visualizations and publish them on

the web.

Once these visualizations are on Tableau Public, one can share them on social media or even embed them in webpages.

Since everyone has access to published data, users should be careful not to put proprietary data on Tableau Public.

Limitations to Tableau Public:

Row limitation:

Limited to 15,000,000 rows of data per workbook.

Limited storage:

Limited to ten gigabytes (10 GB) of storage space for your workbooks.

No workbook privacy:

Tableau Public does not allow saving workbooks locally. They have to be saved publicly, which means that everyone can see the data since it is saved on the cloud.

No security:

As visualizations are public, anyone can access the data and make changes by downloading the workbook.

Tableau Online:

Tableau Online is a hosted version of Tableau Server. It is a business analytics platform where people can share dashboards, interact with reports and gain insights.

It is hosted in the cloud, so no hardware or set-up time is needed.

“Want the sharing and collaboration of

Server, but without having to actually

manage a server? Then you want Tableau

Online. Secure. Scalable. And Look Ma—No

hardware to maintain!”

- https://www.tableau.com/products

Roughly, Tableau Online can also be thought of as a private (and paid, obviously) version of Tableau Public.

Key Features:

Fully hosted in the cloud. Servers are managed by the Tableau team.

Supports live data connections to Amazon Redshift and Google BigQuery, as well as to SQL-based sources hosted on cloud platforms.

Ideal for a small number of users who need to interact with the data and visualizations in a secure way.

Easily accessible from a browser or the Tableau Mobile app.

Authenticates users through TableauID (email address and password). No guest access allowed.

Subscription rate is $500 per user for one year (half the price of individual Tableau Server licenses).


Tableau Reader

Tableau Reader is a FREE desktop application.

Allows interaction with data visualizations created with Tableau Desktop.

Users can filter, drill down and view the details of the data as long as the author allows.

Tableau Start Page

Left pane - Displays the connected data source and other details about your data.

Canvas - Displays information about how the data source is set up and options for combining the data.

Data grid - Displays the first 1,000 rows of the data contained in the Tableau data source.

Metadata grid - Displays the fields in your data source as rows.

Tableau Worksheet

The Dashboard Workspace

DEMO

Philosophy of Tableau working with Big Data

• Democratisation of Data: Knowledge workers of all skill levels

should be able to access and analyze data wherever it resides.

• Partnerships within the Big Data Ecosystem

Overview of how Tableau works with big data

Data access and connectivity

To enable analysis of data of any size and format, Tableau supports broad access to data wherever it lives.

o SQL and NoSQL based connections: Tableau uses SQL to interface with Hadoop, NoSQL databases and Spark.

o Open Database Connectivity (ODBC): By using ODBC, one can access any data source that supports the SQL standard and implements the ODBC API. For Hadoop, this includes interfaces such as Hive Query Language (HiveQL), Impala SQL, BigSQL and Spark SQL.

o Web Data Connector: With the Tableau Web Data Connector SDK, users can build connections to data not covered by the existing connectors, that is, any data accessible over HTTP, including internal web services, JSON data, and REST APIs.
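A connector for JSON data has to turn nested records into flat rows and columns before a tool like Tableau can use them. The sketch below shows only that reshaping step, in Python rather than with the Web Data Connector SDK itself; the payload, the field names and the `flatten` helper are hypothetical.

```python
import json

# Hypothetical payload, standing in for the body of an HTTP response.
PAYLOAD = """
[
  {"id": 1, "user": {"name": "Asha", "country": "IN"}, "views": 120},
  {"id": 2, "user": {"name": "Ben", "country": "UK"}, "views": 75}
]
"""

def flatten(record, prefix=""):
    """Flatten nested objects into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

rows = [flatten(record) for record in json.loads(PAYLOAD)]
columns = sorted({column for row in rows for column in row})

print(columns)
for row in rows:
    print([row.get(column) for column in columns])
```

Dotted column names such as `user.name` are one possible convention; any scheme works as long as each row ends up with the same flat set of fields.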

Fast Interaction with all data at scale

1. Hyper data engine

• Hyper is a high-performance in-memory data engine technology that helps customers analyze large or complex data sets faster.

• It uses dynamic code generation and cutting-edge parallelism techniques to achieve high query speed.

• Hyper can also augment and accelerate slower data sources by creating an extract of the data and bringing it in-memory.


2. Hybrid data architecture

• Tableau can connect live to data sources or bring data (or a subset) in-memory.

• Users can go back and forth between these modes to suit their needs.

• This hybrid approach brings a lot of flexibility and helps in query optimization.
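The difference between a live connection and an in-memory extract can be illustrated with Python's built-in `sqlite3` module. This is a sketch of the idea only: SQLite stands in for both the source database and the extract (it is not Hyper), and the `sales` table, its data and the helper functions are hypothetical.

```python
import sqlite3

# Stand-in for a source database (hypothetical table and data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2019, 100.0), ("East", 2020, 150.0),
     ("West", 2019, 80.0), ("West", 2020, 120.0)],
)

QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

def run_query(conn):
    """Run the same aggregate query against whichever store is passed in."""
    return conn.execute(QUERY).fetchall()

def build_extract(conn, year):
    """Extract mode: copy a subset of the source into a separate in-memory store."""
    extract = sqlite3.connect(":memory:")
    extract.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
    rows = conn.execute(
        "SELECT region, year, amount FROM sales WHERE year = ?", (year,)
    ).fetchall()
    extract.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return extract

extract_2020 = build_extract(source, 2020)

# The source changes after the extract was taken.
source.execute("INSERT INTO sales VALUES ('East', 2020, 50.0)")

live_result = run_query(source)           # sees the new row
extract_result = run_query(extract_2020)  # snapshot of the 2020 subset
print(live_result)
print(extract_result)
```

The live query always reflects the current source, while the extract answers from its own smaller copy until it is refreshed; that trade-off is what lets users pick the mode that suits a given question.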


3. VizQL™

• A traditional analysis tool analyzes data in rows and columns, chooses a subset of the data to present, organizes that data into a table, and then creates a chart from that table.

• VizQL creates a visual representation of the data right away, giving visual feedback as the user analyzes.

• VizQL provides an intuitive user experience that lets people answer questions as fast as they can think of them.

• In this cycle of visual analysis, users learn as they go, add more data if needed, and ultimately get deeper insights.

Tableau and Big data analytics ecosystem

Tableau fits nicely in the big data paradigm because it prioritizes flexibility: the ability to move data across platforms, adjust infrastructure on demand, take advantage of new data types, and enable new users and use cases.

Cloud infrastructure

• Organizations are increasingly moving business processes and infrastructure to the cloud.

• Cloud-based infrastructure and data services have removed some of the major hurdles faced with on-premises Hadoop data lakes.

• Cloud-based big data analytics solutions are easier to implement and manage than ever before.

• Tableau delivers key integrations with cloud-based technologies that organizations already use, including Amazon Web Services, Google Cloud Platform and Microsoft Azure.

Ingest and prep

• In modern ingest-and-load design patterns, the destination for raw data of any size or shape is often a data lake.

• Stream data is generated continuously by connected devices and apps located everywhere, such as social networks, smart meters, home automation, video games, and IoT sensors.

• Often, this data is collected via pipelines of semi-structured data.

• While real-time analytics and predictive algorithms can be applied to streams, we typically see stream data routed through a lambda architecture and stored in raw formats in a data lake, such as Hadoop, for analytics usage.


• Lambda architecture is a data processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.

• The design balances latency, throughput, and fault tolerance challenges.

• A variety of options exist today for streaming data including Amazon Kinesis, Storm, Flume, Kafka, and Informatica Vibe Data Stream.

• Once data has landed in a data lake, it needs to be ingested and prepared for analysis.

• Tableau has partners like Informatica, Alteryx, Trifacta, and Datameer that help with this process and work fluidly with Tableau.
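The lambda architecture mentioned above combines a batch layer with a speed (stream) layer and merges both when a query is served. A minimal in-process sketch, assuming a hypothetical event log of (timestamp, show) pairs and hypothetical function names; real systems would use the streaming and storage tools listed above:

```python
from collections import Counter

# Hypothetical event log: (timestamp, show watched).
EVENTS = [(1, "A"), (2, "B"), (3, "A"), (4, "A"), (5, "B"), (6, "C")]

# Hypothetical cutoff: the last batch job covered timestamps up to this value.
BATCH_CUTOFF = 4

def batch_view(events, cutoff):
    """Batch layer: high-throughput recomputation over all data up to the cutoff."""
    return Counter(show for ts, show in events if ts <= cutoff)

def speed_view(events, cutoff):
    """Speed layer: low-latency counts for events the batch job has not seen yet."""
    return Counter(show for ts, show in events if ts > cutoff)

def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed

batch = batch_view(EVENTS, BATCH_CUTOFF)
speed = speed_view(EVENTS, BATCH_CUTOFF)
merged = serve(batch, speed)
print(dict(merged))
```

When the next batch run completes, the cutoff moves forward and the speed view shrinks accordingly, so errors in the fast path are eventually overwritten by the batch result.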

Storage

1. Hadoop Data Lake

• Hadoop has been used for data lakes due to its resilience and low cost, scale-out data storage, parallel processing, and clustered workload management.

• It provides massive storage for any kind of data, massive processing power, and the ability to handle extreme volumes of concurrent tasks or jobs.

• Tableau provides direct connectivity to all the major Hadoop distributions: Cloudera via Impala, Hortonworks via Hive, and MapR via Apache Drill.

2. Databases and Data warehouses

• Even companies that adopt other technologies typically retain relational databases as a part of their data source mixture. Snowflake is one example of a cloud-native SQL-based enterprise data warehouse with a native Tableau connector.


3. Cloud

• Object stores, such as Amazon Web Services Simple Storage Service (S3).

• Tableau supports Amazon’s Athena data service to connect to Amazon S3.

4. NoSQL Databases

• NoSQL databases with flexible schemas can also be used as data lakes.

• Tableau has various tools that enable connectivity to NoSQL databases directly.

• Examples of NoSQL databases that are often used with Tableau include, but are not limited to, MongoDB, Datastax, and MarkLogic.

Processing

• The data science and engineering platform, Databricks, offers data processing on Spark.

• Spark is a popular engine for both batch-oriented and interactive, scale-out data processing.

• Through a native connector to Spark, one can visualize the results of complex machine learning models from Databricks in Tableau.

Query acceleration

o Faster databases leveraging in-memory and massively parallel processing (MPP) technology, like Exasol and MemSQL

o Hadoop-based stores like Kudu

o Technologies that enable faster queries with preprocessing, like Vertica

o Query accelerators

❖ SQL-on-Hadoop engines like Apache Impala, Hive LLAP, etc.

❖ Online Analytical Processing (OLAP)-on-Hadoop technologies like AtScale, etc.

Data Catalog

• Enterprise data catalogs essentially serve as a business glossary of data sources and common data definitions, allowing users to more easily find the right data for decision making from governed and approved data sources.

• Data catalogs exist within visual analytics solutions and are also available as standalone offerings designed for seamless integration with Tableau. Informatica is an example of a data catalog partner of Tableau.

Major Cloud Provider examples

USE CASE

QUIZ: Which is the largest streaming platform for TV shows and movies?

NETFLIX!

• Has grown to support more than one third of all internet traffic

• This growth created the need to expand its capabilities

• Its extensive platform built on Tableau and AWS acts as a blueprint for many organisations looking to build scalable and flexible business intelligence on the cloud

NETFLIX

Features

Data platform is complex, but elegant

Built on events and operational data fed into Amazon S3

Data is sent to the appropriate processors (NoSQL, Amazon Redshift etc.), and the results are then aggregated into Tableau Data Extracts

Data lake/warehouse strategy allows storage of massive amounts of data

Provides a high-level view of data to analyze and explore

All data connections and extracts end up on Tableau Server, hosted on EC2.

Benefits to Netflix of using Tableau Server

Netflix can reuse its data sources and govern them across a wide range of users. E.g. dashboards can be developed that show usage and watch patterns within individual countries.

This helps country managers easily manage programming for their audiences.

Dozens of people can view a dashboard while only one data source feeds it.

Permissions can be set so that the right people have access to the information that is relevant to them.

So what did we learn?

References:

Big Data Analytics for Data Visualization: Review of Techniques - Geetika Chawla, Savita Bamal, Rekha Khatana

Visualizing Big Data - Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni Koucheryavy, Thomas Olsson

Big Data and Tableau - Sofia Machairidou

https://www.tableau.com/learn/whitepapers/tableau-big-data-overview

https://www.tableau.com/products

https://www.thedataschool.co.uk/tom-pilgrem/earth-tableau-server/

https://en.wikipedia.org/wiki/Tableau_Software

THANK YOU