
White Paper

WORKING WITH PENTAHO DATA INTEGRATION USING GREENPLUM
The interoperability between Pentaho Data Integration (Kettle) and Greenplum Database

Abstract

This white paper explains how Pentaho Data Integration (Kettle) can be configured and used with the Greenplum database in a three-tier architecture. This allows quick verification and validation of the connectivity and interoperability of Pentaho Data Integration with Greenplum.

September 2011


Copyright © 2011 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided "as is". EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.

VMware is a registered trademark of VMware, Inc. All other trademarks used herein are the property of their respective owners.

Part Number h8294


Table of Contents

Executive summary
    Audience
Organization of this paper
Overview of Pentaho Data Integration
Overview of Greenplum Database
Integration of Pentaho PDI and Greenplum Database
Using JDBC drivers for Greenplum database connections
    Installation of new driver
    Configuration
    Usage
    Simple Example Job that writes from and to Greenplum Database
Future expansion and interoperability
Conclusion
References


Executive summary

Pentaho Data Integration (PDI), also known as Kettle, is one of the most popular open source business intelligence data integration products for working with analytical databases such as the EMC Greenplum Database. The EMC Greenplum Database can manage, store, and analyze terabytes to petabytes of data in data warehouses. Pentaho Data Integration unifies the ETL, modeling, and visualization processes into a single, integrated environment, and together with Greenplum it helps joint customers drive better business decisions and speed up Business Intelligence development and deployment. Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java Database Connectivity) drivers, and Greenplum Database can be used on both the source and target sides of Pentaho ETL transformations.

Audience

This white paper is intended for EMC field-facing employees, such as sales, technical consultants, and support staff, as well as customers who will be using the Pentaho Data Integration tool for their ETL work. It is neither an installation guide nor introductory material on Pentaho. It documents Pentaho's connectivity and operational capabilities and shows readers how the tool can be used in conjunction with the Greenplum database to retrieve, transform, and present data to users. The reader is not expected to have any prior Pentaho knowledge, but a basic understanding of data integration concepts and ETL tools will help.

Organization of this paper

This paper covers the following topics:

• Overview of Pentaho Data Integration
• Overview of Greenplum Database
• Using JDBC drivers for Greenplum database connections
• Future expansion and interoperability


Overview of Pentaho Data Integration

Pentaho Data Integration (PDI) delivers comprehensive Extraction, Transformation and Loading (ETL) capabilities using a metadata-driven approach. It is commonly used in building data warehouses, designing Business Intelligence applications, migrating data, and integrating data models. It consists of several components:

• Spoon – Main GUI, graphical Jobs/Transformations designer
• Carte – HTTP server for remote execution of Jobs/Transformations
• Pan – Command-line execution of Transformations
• Kitchen – Command-line execution of Jobs
• Encr – Command-line tool for encrypting strings for storage
• Enterprise Edition (EE) Data Integration Server – Data Integration engine, security integration with LDAP/Active Directory, monitoring/scheduling, content management

Pentaho is capable of loading huge data sets into Greenplum Database, taking full advantage of the massively parallel processing environment provided by the Greenplum product family.

Overview of Greenplum Database

Greenplum Database is based on a shared-nothing, massively parallel processing (MPP) architecture that facilitates Business Intelligence and analytical processing on commodity hardware. Data is distributed across multiple segment servers in the Greenplum Database so that there is no disk-level sharing, and the segment servers process queries in parallel to provide a high degree of parallelism and scalability.

Highlights of the Greenplum Database:

• Dynamic Query Prioritization


  Provides continuous real-time balancing of resources across queries.

• Self-Healing Fault Tolerance
  Provides intelligent fault detection and fast online differential recovery.

• Polymorphic Data Storage – Multi-Storage/SSD Support
  Includes tunable compression and support for both row- and column-oriented storage.

• Analytics and Language Support
  Supports analytical functions for advanced in-database analytics.

• Health Monitoring and Alerting
  Provides integrated email and SNMP notification for advanced support capabilities.

Integration of Pentaho PDI and Greenplum Database


Using JDBC drivers for Greenplum database connections

Pentaho Kettle ships with many different JDBC drivers, each contained in a single Java archive (.jar) file in the libext/JDBC directory. By default, Pentaho PDI ships with a PostgreSQL JDBC .jar file, which is used to connect to Greenplum. Java 1.6 is required for the installation. A startup script adds all of these .jar files to the classpath.

Installation of new driver

To add a new driver, simply drop the .jar file containing the driver into the appropriate directory. For example:

• For the Data Integration Server: <Pentaho_installed_directory>/server/data-integration-server/tomcat/lib/
• For the Data Integration client: <Pentaho_installed_directory>/design-tools/data-integration/libext/JDBC/
• For the BI Server: <Pentaho_installed_directory>/server/biserver-ee/tomcat/lib/
• For the Enterprise Console: <Pentaho_installed_directory>/server/enterprise-console/jdbc/

If you install a new JDBC driver for Greenplum on the BI Server or DI Server, you must restart all affected servers to load the newly installed database driver. In addition, if you want to establish a Greenplum data source in the Pentaho Enterprise Console, you must install that JDBC driver in both the Enterprise Console and the BI Server for it to take effect.

In brief, to update the driver, update the .jar file in /data-integration/libext/JDBC/.

Assuming that a Greenplum Database (GPDB) is installed and ready to use, users can define Greenplum database connections in the Database Connection dialog: give the connection a name, choose Greenplum as the Connection Type, choose "Native (JDBC)" as Access, and provide the Host Name, Database Name, Port Number, User Name, and Password in the Settings section.
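Once the connection is saved, a quick way to confirm that PDI can actually reach Greenplum is to run a trivial query through a Table Input step (or directly in psql). A minimal sanity check might be:

-- Returns the Greenplum/PostgreSQL version string
SELECT version();

-- Shows which database and user the connection is using
SELECT current_database(), current_user;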

Special attention may be required to set up the host files and configuration files on the Greenplum database side as well as on the hosts where Pentaho is installed. For instance, in the Greenplum database, the user may need to configure pg_hba.conf with the IP address of the Pentaho host. In addition, users may need to add the hostnames and corresponding IP addresses on both systems (the Pentaho PDI server and the Greenplum Database) to ensure the two machines can communicate.
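As a hedged illustration (hypothetical database name, user, and client address; adjust to your environment), a pg_hba.conf entry on the Greenplum master that allows the Pentaho host to connect might look like:

# TYPE  DATABASE  USER     ADDRESS           METHOD
host    gpdb      gpadmin  192.168.10.25/32  md5

After editing pg_hba.conf, reload the Greenplum configuration (for example with gpstop -u) so the change takes effect.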


Configuration

The detailed steps for JDBC configuration are largely self-explanatory; the screenshots illustrate how to choose the existing JDBC driver for Greenplum Database as your data source or target database in Spoon:

1) After the user opens a Job in Spoon, the View panel contains a folder called "Database Connections" for the job he/she is working on. The user can highlight and right-click "Database Connections", then choose either "New Connection" or "New Connection Wizard".


2) If the user chooses the "New Connection" option, he/she needs to input the Greenplum database details as follows:


3) If the user chooses the "New Connection Wizard" option, the wizard guides the user through the JDBC definition process.

First, select the database name and type:


Second, set the JDBC settings and click "Next":


Third, input the username and password and click "Finish":

Last, test the database connection; if the input details are correct, a prompt like this should appear:


Usage

There are a few ways to apply Greenplum database connections, such as:

1) The following diagram shows how to apply the newly defined Greenplum database connection as the data source:


2) The following diagrams show how to apply the newly defined Greenplum database connection to the target tables to be loaded:

Example 1: Dimension Lookup/Update:


Example 2: Insert/Update for loading the target table:


Simple Example Job that writes from and to Greenplum Database

Here is a simple new transformation to test the JDBC connectivity defined earlier. (Assumption: gpadmin is the database user.)

First, a source table called category is created:

CREATE TABLE category
(
    category_id serial NOT NULL,
    name varchar(25) NOT NULL,
    last_update timestamp without time zone NOT NULL DEFAULT now(),
    CONSTRAINT category_pkey PRIMARY KEY (category_id)
) DISTRIBUTED BY (category_id);

ALTER TABLE category OWNER TO gpadmin;

Populate the category table with data using either INSERT or COPY commands:

insert into category (category_id, name, last_update) values (1,'Action','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (2,'Animation','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (3,'Children','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (4,'Classics','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (5,'Comedy','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (6,'Documentary','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (7,'Drama','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (8,'Family','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (9,'Foreign','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (10,'Games','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (11,'Horror','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (12,'Music','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (13,'New','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (14,'Sci-Fi','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (15,'Sports','2006-02-15 09:46:27');
insert into category (category_id, name, last_update) values (16,'Travel','2006-02-15 09:46:27');
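Alternatively, the same rows can be bulk loaded with the COPY command. A minimal sketch, assuming a comma-separated file on the Greenplum master (hypothetical file path):

-- Bulk load the category rows from a CSV file on the Greenplum master
COPY category (category_id, name, last_update)
FROM '/home/gpadmin/category.csv'
WITH CSV;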

Then, a target table called category_demo_target is created:

CREATE TABLE category_demo_target
(
    category_id integer,
    showname varchar(25),
    last_update timestamp without time zone
) DISTRIBUTED BY (category_id);

ALTER TABLE category_demo_target OWNER TO gpadmin;

A simple sample job/transformation can then be created as follows:


Sample Configuration of the source table:
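In essence, the source step in Spoon is typically a Table Input step that pairs the Greenplum connection defined earlier with a SQL query against the source table; for this example it could simply be:

-- Query placed in the Table Input step to read the source rows
SELECT category_id, name, last_update
FROM category
ORDER BY category_id;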

Sample Configuration of the target table:


Click the green arrow near the top left corner to execute the transformation/job.

Now, check the target table category_demo_target to see whether the data has been loaded into this target Greenplum database table; for example, run the queries shown below.
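A quick check from psql (or a Table Input preview in Spoon) might look like the following; after a successful run, the 16 source rows should appear in the target table:

-- Confirm the load: row count plus a sample of the loaded rows
SELECT count(*) FROM category_demo_target;

SELECT category_id, showname, last_update
FROM category_demo_target
ORDER BY category_id
LIMIT 5;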

Users can add different components to this transformation or incorporate it into a well-developed job for transforming the data from source to target.


Future expansion and interoperability

Both Greenplum and Pentaho are increasing their capabilities to address the Big Data trend, as data sizes are growing significantly across the industry. Both companies are therefore expanding their interoperability to meet these upcoming demands. For example:

• One of the latest enhancements Pentaho made, as part of its expanded OLAP support, is a native bulk loader integration with EMC Greenplum to improve the data loading process and overall performance. Pentaho offers native adapter support for the Greenplum GPLoad capability (bulk loader), which enables joint customers to leverage data integration capabilities to capture, transform, and quickly load massive amounts of data into Greenplum Databases, especially in the form of data warehouses.

• Pentaho has certified support for Pentaho Data Integration (PDI) working with EMC Greenplum Hadoop, EMC Greenplum Database, and the corresponding data warehouse products. Further collaboration will be established between Pentaho and the EMC Greenplum Hadoop HD solutions. When Pentaho complements the Greenplum distribution of Hadoop, it provides an end-to-end Data Integration and Business Intelligence suite that improves data movement into and out of Hadoop, with the cost advantages of commercial open source. This benefits customers by providing more choices, enhanced performance, and better cost-saving options.

• The EMC Greenplum Data Integration Accelerator (DIA) will be integrated with Pentaho PDI (with the fast-loading adapter invoking the gpfdist utility) to deliver the best data staging and data loading performance. To meet the challenges of fast data loading, the DIA is purpose-built for batch loading and micro-batch loading, and it leverages a growing number of data integration applications such as Pentaho. (An illustrative gpfdist-based load is sketched after this list.)
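For context, the GPLoad/gpfdist fast-loading path mentioned above ultimately relies on Greenplum external tables. A minimal sketch, assuming a delimited file served by gpfdist on a hypothetical ETL host, looks like this:

-- External table reading a pipe-delimited file served by gpfdist
-- (hypothetical host, port, and file name)
CREATE EXTERNAL TABLE category_ext
(
    category_id integer,
    name varchar(25),
    last_update timestamp without time zone
)
LOCATION ('gpfdist://etl-host:8081/category.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Parallel load from the external table into a regular Greenplum table
INSERT INTO category_demo_target (category_id, showname, last_update)
SELECT category_id, name, last_update
FROM category_ext;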


Conclusion

This white paper discussed how to create and apply a JDBC driver connection between Pentaho Data Integration and the Greenplum Database. It covers only the preliminary interoperability between Pentaho PDI and the Greenplum database for basic data integration and business intelligence projects.

It also briefly discussed the anticipated interoperability and integrations of both technologies to accommodate the Big Data trend, such as the Greenplum native bulk loader, Pentaho integration with the Greenplum Hadoop HD solutions, and Greenplum DIA integration with Pentaho BI/DI servers and tools. Those future expansions will be discussed in upcoming white papers.


References

1) Pentaho Kettle Solutions – Building Open Source ETL Solutions with Pentaho Data Integration (ISBN-10: 0470635177 / ISBN-13: 978-0470635179)

2) Getting Started with Pentaho Data Integration guide from www.pentaho.com

3) The PostgreSQL Pagila schema, for use in the Greenplum database

