PDI-Labguide ETL Using Pentaho Data Integration

Post on 26-Oct-2015


Infosys Technologies Limited

Version No: 3.0

Lab Guide

For Pentaho Data Integration 4.0.1

(also known as Kettle)

Table of Contents

Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE

Assignment 1: The Kettle Repository

Assignment 2: My first Data transfer using Kettle

Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ transformations

Assignment 4: Creating an ODBC data source

Assignment 5: Using the ‘Database Lookup’ transformation

Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE

Learning Objective: To download and install Pentaho Data Integration 4.0.1, and open the PDI

interface.

Step 1: Install the Java Runtime Environment (version 1.4 or higher) on your system.

Step 2: Go to the http://www.pentaho.com site and download Pentaho Data Integration 4.0.1.

Step 3: Unzip the downloaded PDI zip file. Open the ‘data-integration’ folder, and double click on the

spoon.bat file to open the PDI IDE.

Assignment 1: The Kettle Repository

Learning Objective: To learn the concept of a repository in PDI (Kettle) and learn how to create,

connect or disconnect from a repository.

Concept of Repository: The Kettle repository is a workspace that the data integrator works on. This

workspace is a physical region of the hard-drive that is designated exclusively for Kettle. In the

repository, all information about transformations, jobs, schedules, etc. is stored. The repository concept

promotes re-usability, which in turn saves time and effort.

A repository may be created in two ways:

1) Kettle database repository

2) Kettle file repository

When Kettle is started, the ‘Repository Connection’ dialog box appears, asking you to select a repository

from the list of existing repositories, or create a new one.

To create a file repository:

Step 1: In the ‘Repository Connection’ dialog box, click on the ‘+’ button. The ‘Select the repository type’

dialog box will appear.

Step 2: Select ‘Kettle file repository’ and click on ‘OK’.

Step 3: In the ‘File repository settings’ dialog box, click on the ‘Browse’ button and select a folder that

will serve exclusively as your file repository space. Fill in the ID and Name fields and click on ‘OK’. Then

click on ‘OK’ in the ‘Repository Connection’ dialog box to select the newly created repository.

You are now ready to create transformations and jobs on this workspace.
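
Because a file repository is just a folder on disk, its contents can be inspected outside the IDE. The sketch below is a minimal illustration in Python, assuming that Kettle saves transformations as ‘.ktr’ files and jobs as ‘.kjb’ files inside the repository folder. The helper name list_repository_objects and the demo file names are hypothetical and are not part of Kettle.

```python
import tempfile
from pathlib import Path


def list_repository_objects(repo_dir):
    """Return the transformation (.ktr) and job (.kjb) files found under repo_dir."""
    root = Path(repo_dir)
    transformations = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.ktr"))
    jobs = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.kjb"))
    return {"transformations": transformations, "jobs": jobs}


# Demonstration on a throwaway folder; the file names below are hypothetical.
with tempfile.TemporaryDirectory() as demo_repo:
    Path(demo_repo, "Employee.ktr").write_text("<transformation/>")
    Path(demo_repo, "OrderDetails.ktr").write_text("<transformation/>")
    Path(demo_repo, "nightly_load.kjb").write_text("<job/>")
    contents = list_repository_objects(demo_repo)
    print(contents)
```

To try it on your own repository, pass the folder you selected in Step 3 to list_repository_objects.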

To disconnect from the current working repository, go to Tools menu:

Tools -> Repository -> Disconnect repository

…or alternatively, press Ctrl+D.

NOTE: In the course of working with Kettle, if you want to change your repository or create a new one,

then you can do so by first disconnecting from the current working repository. Then, open the

‘Repository Connection’ dialog box from:

Tools -> Repository -> Connect

…or alternatively, press Ctrl+R. The ‘Repository Connection’ dialog box appears.

Assignment 2: My first Data transfer using Kettle

Learning Objective: To create a simple transformation that involves data transfer from a flat file to an

Access database destination.

Step 1: In the Kettle IDE file menu, open File -> New -> Transformation, or alternatively, press Ctrl+N.

Step 2: To save your transformation file with a name of your choice, press Ctrl+S. The ‘Transformation

properties’ dialog box opens up. Give the transformation a name of your choice, and then click on ‘OK’.

Step 3: On the ‘Design’ pane on the left of the IDE, expand the ‘Input’ group. Drag and drop the ‘Text file

input’ on the transformation design surface.

Step 4: Double-click on the ‘Text file input’. The text file input properties dialog box opens up. Click on

‘Browse’ to select the flat file to be used as an input.

Select the ‘Products.txt’ flat file that will be used as input for the transformation. After clicking on

‘Open’, click on the button ‘Add’ to add the file to the list of selected files.

Step 5: Go to the ‘Content’ tab. Since this is a ‘Comma separated values (CSV)’ flat file, specify the

separator as comma (,).

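Step 6: Open the ‘Fields’ tab and click on ‘Get fields’. When asked for the number of sample lines, enter 0

to scan the whole file. Review the scan results of the flat file and click on the ‘Close’ button.

You can also view the contents of the text file by clicking on the ‘Preview rows’ button.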

Step 7: Once done, click on ‘OK’ to complete the process of defining a flat file input.
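
What ‘Get fields’ and ‘Preview rows’ do with a comma-separated file can be sketched in a few lines of Python. This is only an illustration of the idea, not how Kettle is implemented. The sample text is a hypothetical stand-in for ‘Products.txt’ (its real columns are not listed in this guide), and the sketch assumes the first line is a header row.

```python
import csv
import io

# Hypothetical stand-in for the contents of Products.txt; the real file's
# columns and values are not shown in this guide.
SAMPLE_PRODUCTS_TXT = """ProductID,ProductName,UnitPrice
1,Chai,18.00
2,Chang,19.00
3,Aniseed Syrup,10.00
"""


def get_fields(text, separator=","):
    """Return the field names from the header row (the idea behind 'Get fields')."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    return next(reader)


def preview_rows(text, limit, separator=","):
    """Return up to `limit` data rows as dictionaries (the idea behind 'Preview rows')."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    rows = []
    for row in reader:
        if len(rows) >= limit:
            break
        rows.append(row)
    return rows


fields = get_fields(SAMPLE_PRODUCTS_TXT)
preview = preview_rows(SAMPLE_PRODUCTS_TXT, limit=2)
print(fields)
print(preview)
```

The separator argument plays the role of the separator set on the ‘Content’ tab in Step 5.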

Step 8: Expand the ‘Output’ group on the design pane, and drag and drop ‘Access output’ on the

transformation surface.

To determine data flow sequence from one transformation item to another, a ‘Hop’ is used.

To create the hop: a) Click on the Text file input, then press the <SHIFT> key and draw a line to the Access Output.

OR b) Place the mouse pointer on Text file input until the hover menu appears and then drag the

hop Output connector to Access output.

OR c) Place mouse pointer on the Text file input, press the middle button of the mouse then drag

the hop pointer and release on Access Output.

Step 9: Double-click on the ‘Access output’ to open its properties dialog box. Since the Access database

does not currently exist, enter the file name along with the full path in ‘The database filename’ field.

Also enter the name of the target table in the ‘Target table’ field. Keep the checkboxes of the ‘Create

database’ and ‘Create table’ options selected, so that the database and the table will be created

respectively if they do not exist already.

After this is done, click on ‘OK’.

Step 10: To run the transformation, click on the green-coloured triangular button.

The ‘Execute a transformation’ dialog box opens up. Click on ‘Launch’ to execute the transformation.

The ‘Execution Results’ pane appears.

In the ‘Step Metrics’ tab, the column ‘Active’ shows ‘Finished’ if the transformation was executed

successfully.

Open the ‘Northwind’ Access database file. You will see that the data has been successfully populated in

the ‘Products’ table.
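
The whole data transfer (text file in, database table out, with the table created if it does not exist) can be sketched as follows. This is a minimal sketch under stated assumptions: Python's built-in sqlite3 module is used as a stand-in for the Access database, since no Access driver is assumed to be available, and the sample rows are hypothetical. The function name load_text_into_table is illustrative only.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for Products.txt.
SAMPLE_PRODUCTS_TXT = "ProductID,ProductName\n1,Chai\n2,Chang\n3,Aniseed Syrup\n"


def load_text_into_table(text, connection, table, separator=","):
    """Copy comma-separated text into a table, creating the table if it is missing."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)          # first line holds the field names
    rows = list(reader)            # remaining lines are the data rows
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Comparable in spirit to the 'Create table' checkbox: create only if absent.
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    connection.commit()
    return len(rows)


# SQLite in-memory database used as a stand-in for the Access file.
connection = sqlite3.connect(":memory:")
rows_written = load_text_into_table(SAMPLE_PRODUCTS_TXT, connection, "Products")
stored = connection.execute('SELECT "ProductName" FROM "Products" ORDER BY rowid').fetchall()
print(rows_written, stored)
```

The returned row count corresponds loosely to the figures shown in the ‘Step Metrics’ tab after a run.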

Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ transformations

Learning Objective: To learn how to use the ‘Calculator’ to calculate a new column using existing

column values, and select specific fields to be populated in the destination using the ‘Select Values’

transformation.

Requirements:

i. The columns from the ‘employee’ Excel sheet that are required to be sent to an Excel worksheet

are: EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, HireDate, City, Country,

HomePhone, Extension and ReportsTo.

ii. In the ‘Employee’ table, the ‘FirstName’ and ‘LastName’ columns should be stored as a single

column in the destination.

Step 1: Create a new transformation called ‘Employee’. Drag and drop ‘Excel input’ on the

transformation surface. Double-click the ‘Excel input’ to open its properties dialog box. Click on

‘Browse’.

Select the Excel workbook that contains the source data for the ‘Employee’ table, and click on the ‘Add’

button to add it to the list of selected files.

Step 2: Go to the ‘Sheets’ tab, and click on ‘Get sheetnames’ to get the list of the names of the sheets

that you wish to include in the data flow. A dialog appears, that asks you to select the sheets you want.

Select the sheet named ‘employee’ and click on the ‘>’ button to include it in the list of selected sheets.

Then click on ‘OK’.

Step 3: Next, go to the ‘Fields’ tab and click on the ‘Get fields from header row’ button to get a list of the

field names from the first row of the Excel sheet ‘employee’.

Click on ‘Preview rows’ and enter the number of rows that you would like to preview (this facility lets

the developer confirm that the connection fetches the data from the Excel sheet

correctly).

Step 4: Click on ‘OK’ to complete the task of defining a connection to the Excel sheet data source.

Step 5: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Add constants’

transformation on the transformation surface. Double-click on it to open its properties dialog box.

Name the new field as ‘space’, specify data-type as ‘String’ and length as 1. The value should be given as

a space.

After this is done, click on ‘OK’. The ‘Add constants’ will now add a new field called ‘space’ in the data

flow.

Step 6: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Calculator’

transformation on the transformation surface. Create a hop from ‘Add constants’ to ‘Calculator’.

Step 7: Double-click on the ‘Calculator’ to open its properties dialog box.

i. Specify the new field name as ‘FullName’.

ii. Select the calculation type as ‘A+B+C’.

iii. Specify ‘Field A’ as ‘FirstName’, ‘Field B’ as ‘space’, ‘Field C’ as ‘LastName’, ‘Value type’ as ‘String’

and ‘Length’ as 70. Click on ‘OK’.
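
The combined effect of ‘Add constants’ (Step 5) and the ‘A+B+C’ calculation (Step 7) on each row can be sketched in Python as below. This is an illustration of the logic only. The function names and the sample employee rows are hypothetical, and the ‘Length’ setting is not modelled.

```python
def add_constant(rows, name, value):
    """Add the same constant field to every row (the idea behind 'Add constants')."""
    return [{**row, name: value} for row in rows]


def calculate_a_plus_b_plus_c(rows, new_field, field_a, field_b, field_c):
    """Add new_field = A + B + C to every row; for strings this is concatenation."""
    return [
        {**row, new_field: row[field_a] + row[field_b] + row[field_c]}
        for row in rows
    ]


# Hypothetical sample rows; only the field names come from this assignment.
employees = [
    {"EmployeeID": 1, "FirstName": "Nancy", "LastName": "Davolio"},
    {"EmployeeID": 2, "FirstName": "Andrew", "LastName": "Fuller"},
]

with_space = add_constant(employees, "space", " ")
with_full_name = calculate_a_plus_b_plus_c(
    with_space, "FullName", "FirstName", "space", "LastName"
)
print([row["FullName"] for row in with_full_name])
```

The helper field ‘space’ is still present in every row after the calculation, which is one reason a ‘Select values’ step follows.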

Step 8: From the ‘Transform’ group in the Design pane of Kettle, drag and drop ‘Select values’

transformation on the transformation surface. Create a hop from ‘Calculator’ to ‘Select values’.

[NOTE: The ‘Select values’ transformation is used for the purpose of specifically removing the columns

that are not required further in the data flow. The existing columns that are required may also be

renamed and cast to another data type, if needed.]

Step 9: Double-click on the ‘Select values’ transformation to open its properties dialog box. Click on the

‘Get fields to select’ button to fetch the fields that are presently in the data flow.

Step 10: Go to the ‘Remove’ tab. This is where the columns that have to be excluded from the data flow

are specified.

Under the ‘Fieldname’ column, click on the drop-down. It will show a list of the available fields in the

data flow. Click on the name of the column you wish to exclude. For example, click on ‘Address’, since it

is not required further in the data flow.

Do the same for all other fields that have to be excluded.

Step 11: Under the ‘Metadata’ tab, click on the ‘Get fields to change’ button. Remove the fields that are

not required in the data flow. Specify the alternative name, data-type, length, precision, etc. for each of

the input fields (if required).

Once done, click on ‘OK’.
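
The three things ‘Select values’ is used for in this assignment (removing fields, renaming fields and changing data types) can be sketched as one Python function. This is a minimal sketch, not Kettle's implementation. The function name select_values, the renamed field ‘Name’ and the sample row are assumptions made for illustration.

```python
def select_values(rows, remove=(), rename=None, cast=None):
    """Drop, rename and convert fields in each row.

    remove: field names to exclude (the idea behind the 'Remove' tab)
    rename: mapping of old name -> new name
    cast:   mapping of field name -> conversion function (the 'Metadata' idea)
    """
    rename = rename or {}
    cast = cast or {}
    output = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in remove:
                continue                      # excluded from the data flow
            if name in cast:
                value = cast[name](value)     # change the data type
            new_row[rename.get(name, name)] = value
        output.append(new_row)
    return output


# Hypothetical sample row; 'Address' is the field removed in Step 10.
rows = [
    {"EmployeeID": "1", "FullName": "Nancy Davolio", "Address": "sample address",
     "space": " ", "City": "Seattle"},
]

selected = select_values(
    rows,
    remove=("Address", "space"),
    rename={"FullName": "Name"},   # hypothetical rename, for illustration only
    cast={"EmployeeID": int},
)
print(selected)
```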

Step 12: From the ‘Output’ group in the Design pane of Kettle, drag and drop ‘Excel output’ on the

transformation surface.

Create a hop from ‘Select values’ to ‘Excel output’. Double-click on ‘Excel output’ to open its properties

dialog box.

Click on the ‘Browse’ button.

Step 13: Select the folder where you want to save the destination Excel workbook. Specify the name of

the file, and click on ‘Save’.

Step 14: In the ‘Content’ tab, specify the sheet name as ‘Employee’.

Step 15: In the ‘Fields’ tab, click on the ‘Get Fields’ button to fetch the fields that have to be included in

the ‘Employee’ worksheet. Specify ‘#’ as format for integer fields. Once done, click on ‘OK’.

Step 16: Your transformation is now complete and ready to be executed. Run the transformation by

clicking on the green triangular button, and then clicking on the ‘Launch’ button after that.

After execution, the destination Excel sheet looks like this:

Assignment 4: Creating an ODBC data source

Step 1: Click on Start->Control Panel->Administrative Tools->Data Sources (ODBC). In the ‘ODBC Data

Source Administrator’ dialog box, select the ‘User DSN’ tab and click on ‘Add’.

Step 2: Select ‘Microsoft Access Driver (*.mdb, *.accdb)’ and click on ‘Finish’.

Step 3: Specify data source name, description and then click on ‘Select’ to select the Access database to

be used.

Step 4: Select ‘Northwind.accdb’ from its location and click on ‘OK’.

Step 5: Click on ‘OK’ again.

Step 6: Click on ‘OK’ again.

The ODBC data source has now been created.

Assignment 5: Using the ‘Database Lookup’ transformation

Learning Objective: To learn how to look up values from a referenced table using key-value pairs, and

include the value field(s) into the data flow.

Requirements:

i. The ‘OrderDetails’ sheet from the Excel workbook ‘Northwind’ contains product-wise data

about orders. Replace the ‘ProductID’ field by the ‘ProductName’ and populate the data into

the Northwind.accdb Access database, into a table named ‘OrderDetails’.

Step 1: Create a new transformation file, and save it as ‘OrderDetails’.

Step 2: Drag and drop an ‘Excel input’ on the transformation surface. Edit the properties of the Excel

input.

i. Select the data source as ‘Northwind.xls’.

ii. Select the source sheet as ‘orderdetails’.

iii. Click on ‘Get fields from header row’ to fetch the fields for the data flow. Click on ‘OK’, once

done.

Step 3: Drag and drop ‘Database lookup’ on the transformation surface. Create a hop from ‘Excel input’

to the ‘Database lookup’.

Step 4: Double-click on ‘Database lookup’ to open its properties dialog box. For creating a new

connection to the Access database table ‘Products’ that belongs to the ‘Northwind.accdb’ database,

click on ‘New’.

Step 5: Give the connection a name. Select connection type as ‘MS Access’. Specify the name of the

ODBC connection to the Northwind.accdb database. Click on ‘Test’ to test the connection.

If connection is successful, the following message is displayed:

Click on ‘OK’.

Step 6: Click on ‘Browse’ to select the lookup table.

Step 7: Select the ‘Products’ table as the table to be looked up for value fields.

Step 8: To equate the key values between the source table and the lookup table, specify ‘Table field’ as

‘ProductID’, comparator as ‘=’ and ‘Field1’ as ‘ProductID’. Select the ‘Values to return from the lookup

table’ as ‘ProductName’.
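
The key comparison configured in Step 8 amounts to running one keyed query per incoming row and appending the returned value to that row. The sketch below illustrates this in Python, with the built-in sqlite3 module standing in for the Access database behind the ODBC connection. The sample data, the ‘OrderID’ and ‘Quantity’ fields and the function name database_lookup are hypothetical. Returning None for an unmatched key is an assumption of this sketch.

```python
import sqlite3

# SQLite in-memory database used as a stand-in for Northwind.accdb.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT)"
)
connection.executemany(
    "INSERT INTO Products VALUES (?, ?)", [(1, "Chai"), (2, "Chang")]
)


def database_lookup(rows, connection):
    """For each row, fetch ProductName where Products.ProductID = row's ProductID."""
    output = []
    for row in rows:
        match = connection.execute(
            "SELECT ProductName FROM Products WHERE ProductID = ?",
            (row["ProductID"],),
        ).fetchone()
        # Assumption: an unmatched key yields None rather than dropping the row.
        output.append({**row, "ProductName": match[0] if match else None})
    return output


# Hypothetical order rows standing in for the 'orderdetails' sheet.
order_details = [
    {"OrderID": 10248, "ProductID": 2, "Quantity": 12},
    {"OrderID": 10249, "ProductID": 99, "Quantity": 5},
]

looked_up = database_lookup(order_details, connection)
print(looked_up)
```

After the lookup each row carries both ‘ProductID’ and ‘ProductName’, which is why Step 9 removes ‘ProductID’ before the rows are written out.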

Step 9:

i. Drag and drop ‘Select Values’ on the transformation surface. Create a hop from ‘Database

lookup’ to ‘Select values’.

ii. In the ‘Remove’ tab, select the field ‘ProductID’ to be removed.

iii. In the ‘Metadata’ tab, specify the data types of the fields that are included in the data flow.

Step 10: Drag and drop ‘Access output’ on the transformation surface. Create a hop from ‘Select values’

to the ‘Access output’.

i. Specify the database as the existing ‘Northwind.accdb’ database.

ii. Give the table name as ‘OrderDetails’.

iii. Click on ‘OK’.

Step 11: Your transformation is now complete and ready to be executed. Run the transformation by

clicking on the green triangular button, and then clicking on the ‘Launch’ button after that.

After execution, the destination table looks like this:

<EOF>