Informatica Data Lake Management on the AWS Cloud

Quick Start Reference Deployment

January 2018

Informatica Big Data Team

Vinod Shukla – AWS Quick Start Reference Team

Contents

Overview
  Informatica Components
  Costs and Licenses
Architecture
  Informatica Services on AWS
Planning the Data Lake Management Deployment
  Deployment Options
  Prerequisites
Deployment Steps
  Step 1. Prepare Your AWS Account
  Step 2. Upload Your Informatica License
  Step 3. Launch the Quick Start
  Step 4. Monitor the Deployment
  Step 5. Download and Install Informatica Developer
  Manual Cleanup
Troubleshooting
Using Informatica Data Lake Management on AWS
  Transient and Persistent Clusters
  Common AWS Architecture Patterns for Informatica Data Lake Management
  Process Flow
Additional Resources
GitHub Repository
Document Revisions

This Quick Start deployment guide was created by Amazon Web Services (AWS) in partnership with Informatica.

Quick Starts are automated reference deployments that use AWS CloudFormation templates to deploy key technologies on AWS, following AWS best practices.

Overview

This Quick Start reference deployment guide provides step-by-step instructions for deploying the Informatica Data Lake Management solution on the AWS Cloud.

A data lake uses a single, Hadoop-based data repository that you create to manage the supply and demand of data. Informatica's solution on the AWS Cloud integrates, organizes, administers, governs, and secures large volumes of both structured and unstructured data. The solution delivers actionable, fit-for-purpose, reliable, and secure information for business insights.

Consider the following key principles when you implement a data lake:

- The data lake must remove barriers to onboarding data of any type and size from any source.
- Data must be easily refined and immediately provisioned for consumption.
- Data must be easy to find, retrieve, and share within the organization.
- Data is a corporate accountable asset, managed collaboratively by data governance, data quality, and data security initiatives.

This Quick Start is for users who want to deploy and develop an Informatica Data Lake Management solution on the AWS Cloud.


Informatica Components

The Data Lake Management solution uses the following Informatica products:

Informatica Big Data Management enables your organization to process large, diverse, and fast-changing datasets so you can get insights into your data. Use Big Data Management to perform big data integration and transformation without writing or maintaining Apache Hadoop code. Collect diverse data faster, build business logic in a visual environment, and eliminate hand-coding to get insights on your data.

Informatica Enterprise Data Catalog brings together all data assets in an enterprise and presents a comprehensive view of the data assets and data asset relationships. Enterprise Data Catalog captures the technical, business, and operational metadata for a large number of data assets that you use to determine the effectiveness of enterprise data. From across the enterprise, Enterprise Data Catalog gathers information related to metadata, including column data statistics, data domains, data object relationships, and data lineage information. A comprehensive view of enterprise metadata can help you make critical decisions on data integration, data quality, and data governance in the enterprise.

The Developer tool includes the native and Hadoop run-time environments for optimal processing. In the native environment, the Data Integration Service processes the data. In the Hadoop environment, the Data Integration Service pushes the processing to nodes in a Hadoop cluster.

Costs and Licenses

You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. See the pricing pages for each AWS service you will be using for cost estimates.

This Quick Start requires a license to deploy the Informatica Data Lake Management solution, as described in the Prerequisites section. To sign up for a demo license, contact Informatica.


Architecture

Figure 1 shows the typical components of a generic data lake management solution.

Figure 1: Components of a data lake management solution

The solution includes the following core components, beginning with the lower part of the diagram in Figure 1:

Big Data Infrastructure: From a connectivity perspective (for example, on-premises, cloud, IoT, unstructured, semi-structured), the solution reliably accommodates an expanding volume and variety of data types. The solution has the capacity to scale up (when you increase individual hardware capacity) or scale out (when you increase infrastructure capacity linearly for parallel processing), and can be deployed directly into your AWS environment.

Big Data Storage: The solution can store large amounts of a variety of data (structured, unstructured, semi-structured) at scale, with performance that guarantees timely delivery of data to business analysts.

Big Data Processing: The solution can process data at any latency, such as real time, near real time, and batch, using big data processing frameworks such as Apache Spark.

Metadata Intelligence manages all the metadata from a variety of data sources. For example, a data catalog manages data generated by big data and by traditional sources. To do this, it collects, indexes, and applies machine learning to metadata. It also provides metadata services such as semantic search, automated data domain discovery and tagging, and data intelligence that can guide user behavior.

Big Data Integration, in which a data lake architecture must integrate data from various disparate data sources, at any latency, with the ability to rapidly develop ELT (extract, load, and transform) or ETL (extract, transform, and load) data flows.

Big Data Governance and Quality are critical to a data lake, especially when dealing with a variety of data. The purpose of big data governance is to deliver trusted, timely, and relevant information to support the business outcome.

Big Data Security is the process of minimizing data risk. Activities include discovering, identifying, classifying, and protecting sensitive data, as well as analyzing its risk based on value, location, protection, and proliferation.

Finally, Intelligent Data Applications (Self-Service Data Preparation, Enterprise Data Catalog, and Data Security Intelligence) provide data analysts, data scientists, data stewards, and data architects with a collaborative self-service platform for data governance and security that can discover, catalog, and prepare data for big data analytics.

Informatica Services on AWS

Deploying this Quick Start with default parameters builds the Informatica Data Lake environment illustrated in Figure 2 in the AWS Cloud. The Quick Start deployment automatically creates the following Informatica elements:

- Domain
- Model Repository Service
- Data Integration Service

In addition, the deployment automatically embeds Hadoop clusters in the virtual private cloud (VPC) for metadata storage and processing.

The deployment then assigns the connection to the Amazon EMR cluster for the Hadoop Distributed File System (HDFS) and Hive. It also sets up connections to enable scanning of Amazon Simple Storage Service (Amazon S3) and Amazon Redshift environments as part of the data lake.


The Informatica domain and repository database are hosted on Amazon Relational Database Service (Amazon RDS) using Oracle, which handles management tasks such as backups, patch management, and replication.

To access Informatica Services on the AWS Cloud, you can install the Informatica client to run Big Data Management on a Microsoft Windows machine. You can then access Enterprise Data Catalog by using a web browser.

Figure 2 shows the Informatica Data Lake Management solution deployed on AWS.

Figure 2: Informatica Data Lake Management solution deployed on AWS

The Quick Start sets up a highly available architecture that spans two Availability Zones, and a VPC configured with public and private subnets according to AWS best practices. Managed network address translation (NAT) gateways are deployed into the public subnets and configured with an Elastic IP address for outbound internet connectivity.


The Quick Start also installs and configures the following services during the one-click deployment:

- Informatica domain, which is the fundamental administrative unit of the Informatica platform. The Informatica platform has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines.
- Model Repository Service, which is a relational database that stores all the metadata for projects created using Informatica client tools. The model repository also stores run-time and configuration information for applications that are deployed to a Data Integration Service.
- Data Integration Service, which is a compute component within the Informatica domain that manages requests to submit big data integration, big data quality, and profiling jobs to the Hadoop cluster for processing.
- Content Management Service, which manages reference data. It provides reference data information to the Data Integration Service and Informatica Developer.
- Analyst Service, which runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool. You can perform column and rule profiling, manage scorecards, and manage bad records and duplicate records in the Analyst tool.
- Profiling, which helps you find the content, quality, and structure of data sources of an application, schema, or enterprise. A profile is a repository object that finds and analyzes data irregularities across data sources in the enterprise, as well as hidden data problems that put data projects at risk. The profiling results include unique values, null values, data domains, and data patterns. When you use this Quick Start, you can run profiling on the Data Integration Service (the default) or on Hadoop.
- Business Glossary, which consists of online glossaries of business terms and policies that define important concepts within an organization. Data stewards create and publish terms that include information such as descriptions, relationships to other terms, and associated categories. Glossaries are stored in a central location for easy lookup by consumers. Glossary assets include business terms, policies, and categories that contain information that consumers might search for. A glossary is a high-level container that stores Glossary assets. A business term defines relevant concepts within the organization, and a policy defines the business purpose that governs practices related to the term. Business terms and policies can be associated with categories, which are descriptive classifications.
- Catalog Service, which runs Enterprise Data Catalog and manages connections between service components and external applications.
- An embedded Hadoop cluster that uses Hortonworks, running HDFS, HBase, YARN, and Solr.
- Informatica Cluster Service, which runs and manages all Hadoop services, the Apache Ambari server, and Apache Ambari agents on the embedded Hadoop cluster.
- Metadata and Catalog, which include the metadata persistence store, search index, and graph database in an embedded Hadoop cluster. The catalog represents an indexed inventory of all the data assets in the enterprise that you configure in Enterprise Data Catalog. Enterprise Data Catalog organizes all the enterprise metadata in the catalog and enables the users of external applications to discover and understand the data.

The Informatica domain and the Informatica Model Repository databases are configured on Amazon RDS using Oracle.

Planning the Data Lake Management Deployment

Deployment Options

This Quick Start provides two deployment options:

- Deployment of the Data Lake Management solution into a new VPC (end-to-end deployment). This option builds a new virtual private cloud (VPC) with public and private subnets, and then deploys the Informatica Data Lake Management solution into that infrastructure.
- Deployment of the Data Lake Management solution into an existing VPC. This option provisions data lake components into your existing AWS infrastructure.

The Quick Start provides separate templates for these options. It also lets you configure CIDR blocks, instance types, and data lake settings, as discussed later in this guide.
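If you prefer to script the launch rather than click through the console, the parameters described later in this guide can be assembled programmatically. A minimal Python sketch of building the parameter list in the shape AWS CloudFormation's CreateStack API expects; the parameter values below are illustrative only, and the stack-creation call itself is deliberately left out:

```python
# Sketch: build the Parameters list that AWS CloudFormation expects,
# i.e. a list of {"ParameterKey": ..., "ParameterValue": ...} dicts.
# Values below are examples, not recommendations.

def to_cfn_parameters(params: dict) -> list:
    """Convert a plain dict into CloudFormation's parameter format."""
    return [{"ParameterKey": k, "ParameterValue": str(v)} for k, v in params.items()]

new_vpc_params = to_cfn_parameters({
    "AvailabilityZones": "us-east-2a,us-east-2b",
    "VPCCIDR": "10.0.0.0/16",
    "RemoteAccessCIDR": "203.0.113.0/24",   # restrict to your own network
    "KeyPairName": "my-datalake-keypair",   # hypothetical key pair name
})

# With boto3, this list would be passed as the Parameters argument of
# cloudformation.create_stack(StackName=..., TemplateURL=..., Parameters=...).
```

The same helper works for either template; only the keys differ (the existing-VPC template additionally asks for VPC and subnet IDs).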

Prerequisites

Specialized Knowledge

Before you deploy this Quick Start, we recommend that you become familiar with the following AWS services:

- Amazon VPC
- Amazon EC2
- Amazon EMR

If you are new to AWS, see the Getting Started Resource Center.


Technical Requirements

Before you deploy this Quick Start, verify the following prerequisites:

- You have an account with AWS, and you know the account login information.
- You have purchased a license for the Informatica Data Lake Management solution. To sign up for a demo license, contact Informatica, your sales representative, or the consulting partner you're working with.
- The license file should have a name like AWSDatalakeLicense.key.

Deployment Steps

Step 1. Prepare Your AWS Account

1. If you don't already have an AWS account, create one at https://aws.amazon.com by following the on-screen instructions.

2. Use the region selector in the navigation bar to choose the AWS Region where you want to deploy the Informatica Data Lake Management solution on AWS.

3. Create a key pair in your preferred region.

   When you log in to any Amazon EC2 instance or Amazon EMR cluster, you use a private key file, which has the file name extension .pem, for authentication. If you do not have an existing .pem key to use, follow the instructions in the AWS documentation to create a key pair.

   Note: Your administrator might ask you to use a particular existing key pair.

   When you create a key pair, you save the .pem file to your desktop system. Simultaneously, AWS saves the key pair to your account. Make a note of the key pair that you want to use for the Data Lake Management instance, so that you can provide the key pair name during network configuration.
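SSH refuses to use a private key that other users can read, so it is worth verifying the .pem file's permissions before you need it. A small sketch (the file path is hypothetical, and `chmod 400` is the usual way to tighten an over-permissive key):

```python
import os
import stat

def pem_file_ok(path: str) -> bool:
    """Return True if the private key file exists and is readable only
    by its owner (SSH rejects keys that are group- or world-readable)."""
    if not os.path.isfile(path):
        return False
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0

# To tighten permissions the way `chmod 400 my-key.pem` would:
# os.chmod("my-key.pem", 0o400)
```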

4. If necessary, request a service limit increase for the Amazon EC2 M3 and M4 instance types. You might need to do this if you already have an existing deployment that uses these instance types, and you think you might exceed the default limit with this reference deployment.

Step 2. Upload Your Informatica License

Upload the license for the Informatica Data Lake Management solution to an S3 bucket, following the instructions in the Amazon S3 documentation. You will be prompted for the bucket name during deployment.
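Before uploading, you can sanity-check the object key you plan to use. A minimal sketch based on the conventions in this guide: the .key extension comes from the example file name in the Prerequisites section, and the top-level (no subfolder) requirement comes from the License Key Name parameter note later in this guide:

```python
def license_key_valid(s3_key: str) -> bool:
    """Check that an S3 object key looks right for this Quick Start's
    license file: a .key file at the top level of the bucket, i.e.
    with no "/" (no subfolder) in the key."""
    return s3_key.endswith(".key") and "/" not in s3_key

# license_key_valid("AWSDatalakeLicense.key")           -> True
# license_key_valid("licenses/AWSDatalakeLicense.key")  -> False (subfolder)
```

The actual upload (for example, `aws s3 cp AWSDatalakeLicense.key s3://your-bucket/`) is done as described in the Amazon S3 documentation.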

To sign up for a demo license, contact Informatica, your sales representative, or the consulting partner you're working with.

Step 3. Launch the Quick Start

Note: You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using this Quick Start. For full details, see the pricing pages for each AWS service you will be using in this Quick Start. Prices are subject to change.

1. Choose one of the following options to launch the AWS CloudFormation template into your AWS account. For help choosing an option, see Deployment Options earlier in this guide.

Option 1: Deploy the data lake into a new VPC on AWS

Option 2: Deploy the data lake into an existing VPC on AWS

Important: If you're deploying Informatica Data Lake Management into an existing VPC, make sure that your VPC has two private and two public subnets in different Availability Zones for the database instances. These subnets require NAT gateways or NAT instances in their route tables, to allow the instances to download packages and software without exposing them to the internet. You'll also need the domain name option configured in the DHCP options, as explained in the Amazon VPC documentation. You'll be prompted for your VPC settings when you launch the Quick Start.

Each deployment takes about two hours to complete.

2. Check the region that's displayed in the upper-right corner of the navigation bar, and change it if necessary. This is where the network infrastructure for Informatica Data Lake Management will be built. The template is launched in the US East (Ohio) Region by default.

3. On the Select Template page, keep the default setting for the template URL, and then choose Next.


4. On the Specify Details page, change the stack name if needed. Review the parameters for the template. Provide values for the parameters that require input. For all other parameters, review the default settings and customize them as necessary. When you finish reviewing and customizing the parameters, choose Next.

In the following tables, parameters are listed by category and described separately for the two deployment options:

- Parameters for deploying Informatica components into a new VPC
- Parameters for deploying Informatica components into an existing VPC

Note: The templates for the two scenarios share most, but not all, of the same parameters. For example, the template for an existing VPC prompts you for the VPC and subnet IDs in your existing VPC environment. You can also download the templates and edit them to create your own parameters based on your specific deployment scenario.

Option 1: Parameters for deploying into a new VPC

View template

Network Configuration:

Availability Zones (AvailabilityZones)
Default: Requires input
The two Availability Zones that will be used to deploy Informatica Data Lake Management components. The Quick Start preserves the logical order you specify.

VPC CIDR (VPCCIDR)
Default: 10.0.0.0/16
The CIDR block for the VPC.

Private Subnet 1 CIDR (PrivateSubnet1CIDR)
Default: 10.0.0.0/19
The CIDR block for the private subnet located in Availability Zone 1.

Private Subnet 2 CIDR (PrivateSubnet2CIDR)
Default: 10.0.32.0/19
The CIDR block for the private subnet located in Availability Zone 2.

Public Subnet 1 CIDR (PublicSubnet1CIDR)
Default: 10.0.128.0/20
The CIDR block for the public (DMZ) subnet located in Availability Zone 1.

Public Subnet 2 CIDR (PublicSubnet2CIDR)
Default: 10.0.144.0/20
The CIDR block for the public (DMZ) subnet located in Availability Zone 2.

IP Address Range (RemoteAccessCIDR)
Default: Requires input
The CIDR IP range that is permitted to access the Informatica domain and the Amazon EMR cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses. For example, to allow access only from the single address 10.20.30.40, enter 10.20.30.40/32.
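Before launching, you can confirm that the RemoteAccessCIDR value you plan to supply actually covers the addresses you will connect from. A small sketch using Python's standard ipaddress module (the addresses are examples only):

```python
import ipaddress

def cidr_allows(cidr: str, ip: str) -> bool:
    """Return True if `ip` falls inside the CIDR block that will be
    permitted to reach the Informatica domain and the EMR cluster."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)

# A /32 admits exactly one address; 0.0.0.0/0 admits every address,
# which is exactly the unconstrained range the guide recommends avoiding.
```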

Amazon EC2 Configuration:

Informatica Embedded Cluster Size (ICSClusterSize)
Default: Small
The size of the Informatica embedded cluster. Choose from the following:
Small: c4.8xlarge, single node
Medium: c4.8xlarge, three nodes
Large: c4.8xlarge, six nodes

Informatica Domain Instance Type (InformaticaServerInstanceType)
Default: c4.4xlarge
The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge.

Key Pair Name (KeyPairName)
Default: Requires input
A public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair that you created in your preferred region when you prepared your AWS account.
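The embedded-cluster sizes above map to fixed instance types and node counts, which is handy to have as a lookup when scripting or estimating cost. A sketch with the values copied from the parameter description:

```python
# The ICSClusterSize choices from the parameter table, as a lookup.
# Instance types and node counts are taken from the table's description.
ICS_CLUSTER_SIZES = {
    "Small":  {"instance_type": "c4.8xlarge", "nodes": 1},
    "Medium": {"instance_type": "c4.8xlarge", "nodes": 3},
    "Large":  {"instance_type": "c4.8xlarge", "nodes": 6},
}

def cluster_nodes(size: str) -> int:
    """Node count for a given embedded cluster size."""
    return ICS_CLUSTER_SIZES[size]["nodes"]
```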

Amazon EMR Configuration:

EMR Cluster Name (EMRClusterName)
Default: Requires input
The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.

EMR Core Instance Type (EMRCoreInstanceType)
Default: m4.xlarge
The instance type for Amazon EMR core nodes.

EMR Core Nodes (EMRCoreNodes)
Default: Requires input
The number of core nodes. Enter a value between 1 and 500.

EMR Master Instance Type (EMRMasterInstanceType)
Default: m4.xlarge
The instance type for the Amazon EMR master node.

EMR Logs Bucket Name (EMRLogBucket)
Default: Requires input
The S3 bucket where the Amazon EMR logs will be stored.


Amazon RDS Configuration:

Informatica Database Username (DBUser)
Default: awsquickstart
The user name for the database instance associated with the Informatica domain and services (such as the Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string.

Informatica Database Instance Password (DBPassword)
Default: Requires input
The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

Redshift Cluster Type (RedshiftClusterType)
Default: single-node
The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes to provision in your cluster.

Redshift Database Name (RedshiftDatabaseName)
Default: dev
The name of the first database to create when the cluster is created.

Redshift Database Port (RedshiftDatabasePort)
Default: 5439
The port number on which the cluster accepts incoming connections.

Redshift Number of Nodes (RedshiftNumberOfNodes)
Default: 1
The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1.

Redshift Node Type (RedshiftNodeType)
Default: ds2.xlarge
The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation.

Redshift Username (RedshiftUsername)
Default: defaultuser
The user name that is associated with the master user account for the cluster that is being created.

Redshift Password (RedshiftPassword)
Default: Requires input
The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that contains at least one uppercase letter, one lowercase letter, and one number.
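The password rule quoted above can be checked before you start the two-hour deployment rather than after it fails. A minimal sketch that mirrors only the rule as stated in the table, not every restriction Amazon Redshift itself enforces (for example, certain special characters are also disallowed):

```python
import re

def redshift_password_ok(pw: str) -> bool:
    """Check the master-password rule from the parameter table:
    8-64 characters with at least one uppercase letter, one
    lowercase letter, and one digit."""
    return (8 <= len(pw) <= 64
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"[0-9]", pw) is not None)
```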


Informatica Enterprise Catalog and BDM Configuration:

Informatica Administrator Username (InformaticaAdminUser)
Default: Requires input
The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.

Informatica Administrator Password (InformaticaAdminPassword)
Default: Requires input
The administrator password for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.

License Key Location (InformaticaKeyS3Bucket)
Default: Requires input
The name of the S3 bucket in your account that contains the Informatica license key.

License Key Name (InformaticaKeyName)
Default: Requires input
The Informatica license key name; for example, INFALicense_10_2.key. Note: The key file must be in the top level of the S3 bucket and not in a subfolder.

Import Sample Content (ImportSampleData)
Default: No
Select Yes to import sample catalog data. You can use the sample data to get started with the product.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the parameters in this category.

Quick Start S3 Bucket Name (QSS3BucketName)
Default: quickstart-reference
The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. If you want to customize the templates and override the Quick Start behavior for your specific implementation, you can specify your own bucket, provided that you copy all of the assets and submodules into it.

Quick Start S3 Key Prefix (QSS3KeyPrefix)
Default: informatica/datalake/latest/
The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation.


Option 2: Parameters for deploying into an existing VPC

View template

Network Configuration:

VPC (VPCID)
Requires input
The ID of your existing VPC where you want to deploy the Informatica Data Lake Management solution (for example, vpc-0343606e). The VPC must meet the following requirements:
- It must be set up with public access through the internet via an attached internet gateway.
- The DNS Resolution property of the VPC must be set to Yes.
- The Edit DNS Hostnames property of the VPC must be set to Yes.

Informatica Domain Subnet (InformaticaServerSubnetID)
Requires input
A publicly accessible subnet ID where the Informatica domain will reside. Select one of the available subnets listed.

Informatica Database Subnets (DBSubnetIDs)
Requires input
The IDs of two private subnets in the selected VPC.
Note: These subnets must be in different Availability Zones in the selected VPC.

IP Address Range (IPAddressRange)
Requires input
The CIDR block that is permitted to access the Informatica domain and the Informatica embedded cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses. For example, to permit only the single address 10.20.30.40, enter 10.20.30.40/32. (An IPv4 prefix length cannot exceed /32.)
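Python's standard ipaddress module can confirm what a candidate CIDR block actually covers before you enter it (illustrative only; not part of the deployment). Note that a contiguous range that is not power-of-two aligned cannot be expressed as a single CIDR block:

```python
import ipaddress

# A /32 covers exactly one host address.
single = ipaddress.ip_network("10.20.30.40/32")
print(single.num_addresses)  # 1

# The range 10.20.30.40-10.20.30.49 is not a single CIDR block;
# summarize_address_range shows the blocks actually needed to cover it.
blocks = list(ipaddress.summarize_address_range(
    ipaddress.ip_address("10.20.30.40"),
    ipaddress.ip_address("10.20.30.49")))
print([str(b) for b in blocks])  # ['10.20.30.40/29', '10.20.30.48/31']
```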

Amazon EC2 Configuration:

Key Pair Name (KeyPairName)
Requires input
A public/private key pair, which allows you to connect securely to your instance after it launches. When you created an AWS account, this is the key pair you created in your preferred region.

Informatica Domain Instance Type (InformaticaServerInstanceType)
Default: c4.4xlarge
The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge.

Informatica Embedded Cluster Size (ICSClusterSize)
Default: Small
The size of the Informatica embedded cluster. Choose from the following:
- Small: c4.8xlarge, single node
- Medium: c4.8xlarge, three nodes
- Large: c4.8xlarge, six nodes

Amazon EMR Configuration:

EMR Master Instance Type (EMRMasterInstanceType)
Default: m4.xlarge
The instance type for the Amazon EMR master node.

EMR Core Instance Type (EMRCoreInstanceType)
Default: m4.xlarge
The instance type for Amazon EMR core nodes.

EMR Cluster Name (EMRClusterName)
Requires input
The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.

EMR Core Nodes (EMRCoreNodes)
Requires input
The number of core nodes. Enter a value between 1 and 500.

EMR Logs Bucket Name (EMRLogBucket)
Requires input
The S3 bucket where the Amazon EMR logs will be stored.

Amazon RDS Configuration:

Informatica Database Username (DBUser)
Default: awsquickstart
The user name for the database instance associated with the Informatica domain and services (such as Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string.

Informatica Database Instance Password (DBPassword)
Requires input
The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

Redshift Database Name (RedshiftDatabaseName)
Default: dev
The name of the first database to create when the cluster is created.

Redshift Cluster Type (RedshiftClusterType)
Default: single-node
The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes to provision in your cluster.

Redshift Number of Nodes (RedshiftNumberOfNodes)
Default: 1
The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1.

Redshift Node Type (RedshiftNodeType)
Default: ds2.xlarge
The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation.

Redshift Username (RedshiftUsername)
Default: defaultuser
The user name that is associated with the master user account for the cluster that is being created.

Redshift Password (RedshiftPassword)
Requires input
The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that contains at least one uppercase letter, one lowercase letter, and one number.

Redshift Database Port (RedshiftDatabasePort)
Default: 5439
The port number on which the cluster accepts incoming connections.

Informatica Enterprise Catalog and BDM Configuration:

Informatica Administrator Username (InformaticaAdminUsername)
Requires input
The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.

Informatica Administrator Password (InformaticaAdminPassword)
Requires input
The administrator password for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.

License Key Location (InformaticaKeyS3Bucket)
Requires input
The name of the S3 bucket in your account that contains the Informatica license key.

License Key Name (InformaticaKeyName)
Requires input
The Informatica license key file name; for example, INFALicense_10_2.key.
Note: The key file must be at the top level of the S3 bucket, not in a subfolder.

Import Sample Content (ImportSampleData)
Default: No
Select Yes to import sample catalog data. You can use the sample data to get started with the product.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the parameters in this category.

Quick Start S3 Bucket Name (QSS3BucketName)
Default: quickstart-reference
The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. If you want to customize the templates and override the Quick Start behavior for your specific implementation, you can specify your own bucket, provided that you copy all of the assets and submodules into it.

Quick Start S3 Key Prefix (QSS3KeyPrefix)
Default: informatica/datalake/latest/
The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation.

When you finish reviewing and customizing the parameters, choose Next.

5. On the Options page, you can specify tags (key-value pairs) for resources in your stack and set advanced options. When you're done, choose Next.

6. On the Review page, review and confirm the template settings. Under Capabilities, select the check box to acknowledge that the template will create IAM resources.

7. Choose Create to deploy the stack.
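The same launch can be scripted. A hedged sketch of assembling the Parameters payload that CloudFormation's CreateStack API expects; the helper function and the placeholder values are ours, while the parameter keys come from the tables above:

```python
def to_cfn_parameters(values: dict) -> list:
    """Convert a plain dict of template parameters into the Parameters
    list structure that the CloudFormation CreateStack API expects."""
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]

# Placeholder values; substitute your own before launching.
params = to_cfn_parameters({
    "KeyPairName": "my-keypair",
    "InformaticaAdminUsername": "Administrator",
    "InformaticaKeyS3Bucket": "my-license-bucket",
    "InformaticaKeyName": "INFALicense_10_2.key",
    "EMRCoreNodes": "2",
})
# This list is what you would pass as Parameters= to boto3's
# cloudformation.create_stack, or as --parameters to the
# "aws cloudformation create-stack" CLI command.
```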

Step 4. Monitor the Deployment

During deployment, you can monitor the creation of the cluster instance and the Informatica domain, and get more information about system resources.

1. Choose the stack that you are creating, and then choose the Events tab to monitor the creation of the stack. Figure 3 shows part of the Events tab.

Figure 3: Monitoring the deployment in the Events tab

When stack creation is complete, the Status field shows CREATE_COMPLETE, and the Outputs tab displays a list of stacks that have been created, as shown in Figure 4.

Figure 4: Stack creation complete

2. Choose the Resources tab. This tab displays information about the stack and the Data Lake instance. You can select the linked physical ID properties of individual resources to get more information about them, as shown in Figure 5.

Figure 5: Resources tab

3. Choose the Outputs tab. When the Informatica domain setup is complete, the Outputs tab displays the following information:

RedShiftIamRole: Amazon Resource Name (ARN) for the Amazon Redshift IAM role
EICCatalogURL: URL for the Informatica EIC user console
InstanceID: Informatica domain host name
InformaticaAdminConsoleURL: URL for the Informatica administrator console
EtcHostFileEntry: Entry to add to the /etc/hosts file to enable access to the domain, using the host name of the administrative server
EICAdminURL: URL for the EIC Administrator
EMRResourceManagerURL: URL for the Amazon EMR Resource Manager
RedShiftClusterEndpoint: Amazon Redshift cluster endpoint
CloudFormationLogs: Location of the AWS CloudFormation installation log
S3DatalakeBucketName: Name of the S3 bucket used for the data lake
InstanceSetupLogs: Location of the setup log for the Informatica domain EC2 instance
InformaticaHadoopInstallLogs: Location of the master node Hadoop installation log
InformaticaDomainDatabaseEndPoint: Informatica domain database endpoint
InformaticaAdminConsoleServerLogs: Location of the Informatica domain installation log
InformaticaHadoopClusterURL: URL to the IHS Hadoop gateway node
InformaticaBDMDeveloperClient: Location where you can download the Informatica Developer tool (see step 5)

Note: If the Outputs tab is not populated with this information, wait for the domain setup to complete.

4. Use the links in the Outputs tab to access Informatica management tools. For example:

InformaticaAdminConsoleURL: Open the Instance Administration screen. You can use this screen to manage Informatica services and resources, and to get additional information about the instance, such as the public DNS and public IP address.

EICAdminURL: Administer the Enterprise Data Catalog environment.

EICCatalogURL: Access Enterprise Data Catalog. See the Informatica Enterprise Data Catalog User Guide for information about logging in to Enterprise Data Catalog.
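The same output keys can also be read programmatically rather than from the console. A sketch of parsing the response of "aws cloudformation describe-stacks" into a key-to-value map; the sample JSON, stack name, and values below are placeholders of ours, not real endpoints:

```python
import json

# Abbreviated sample of the JSON that "aws cloudformation
# describe-stacks" returns; values here are placeholders.
describe_stacks_json = """
{"Stacks": [{"StackName": "informatica-datalake",
             "Outputs": [
               {"OutputKey": "EICCatalogURL",
                "OutputValue": "https://example-host:6205/ldmcatalog"},
               {"OutputKey": "S3DatalakeBucketName",
                "OutputValue": "my-datalake-bucket"}]}]}
"""

# Build an OutputKey -> OutputValue map for easy lookup.
outputs = {o["OutputKey"]: o["OutputValue"]
           for o in json.loads(describe_stacks_json)["Stacks"][0]["Outputs"]}
print(outputs["S3DatalakeBucketName"])  # my-datalake-bucket
```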

Step 5. Download and Install Informatica Developer

Informatica Developer (the Developer tool) is an application that you use to design and implement data integration, data quality, data profiling, data services, and big data solutions. You can use the Developer tool to import metadata, create connections, and create data objects. You can also use it to create and run profiles, mappings, and workflows.

1. Log in to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation/.
2. Choose the Outputs tab.
3. Right-click the value of the InformaticaBDMDeveloperClient key to download the Developer tool client installer.
4. Uncompress and launch the installer to install the Developer tool on a local drive.


Manual Cleanup

If you deploy the Quick Start for a new VPC, Amazon EMR creates security groups that are not deleted when you delete the Amazon EMR cluster. To clean up after deployment, follow these steps:

1. Delete the Amazon EMR cluster.
2. Delete the Amazon EMR-managed security groups (ElasticMapReduce-master, ElasticMapReduce-slave). Because these groups reference each other, first delete their circularly dependent rules, and then delete the security groups themselves.
3. Delete the AWS CloudFormation stack.
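Step 2 above can be automated. A hedged sketch (the function name is ours; `ec2` is a boto3 EC2 client, and the function should run only after the EMR cluster is deleted):

```python
def delete_emr_managed_security_groups(
        ec2, group_names=("ElasticMapReduce-master", "ElasticMapReduce-slave")):
    """Revoke the mutually referencing ingress rules first, then delete
    the now-unreferenced security groups."""
    groups = ec2.describe_security_groups(
        Filters=[{"Name": "group-name", "Values": list(group_names)}]
    )["SecurityGroups"]
    for g in groups:  # break the circular dependency between the groups
        if g.get("IpPermissions"):
            ec2.revoke_security_group_ingress(
                GroupId=g["GroupId"], IpPermissions=g["IpPermissions"])
    for g in groups:  # the groups no longer reference each other
        ec2.delete_security_group(GroupId=g["GroupId"])
```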

Troubleshooting

Q. I encountered a CREATE_FAILED error when I launched the Quick Start.

A. If you encounter this error in the AWS CloudFormation console, we recommend that you relaunch the template with Rollback on failure set to No. (This setting is under Advanced in the AWS CloudFormation console, Options page.) With this setting, the stack's state will be retained and the instance will be left running, so you can troubleshoot the issue. (You'll want to look at the log files in %ProgramFiles%\Amazon\EC2ConfigService and C:\cfn\log.)

Important: When you set Rollback on failure to No, you'll continue to incur AWS charges for this stack. Make sure to delete the stack when you've finished troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website.

Q. I encountered an error while installing the Informatica domain and services.

A. We recommend that you view the /installation.log file to get more information about the errors you encountered.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation templates.

A. We recommend that you launch the Quick Start templates from the location we've provided or from another S3 bucket. If you deploy the templates from a local copy on your computer or from a non-S3 location, you might encounter template size limitations when you create the stack. For more information about AWS CloudFormation limits, see the AWS documentation.


Using Informatica Data Lake Management on AWS

After you deploy this Quick Start, you can use any of the patterns described in this section to use the Informatica Data Lake Management solution on AWS.

Transient and Persistent Clusters

Amazon EMR provides two methods to configure a cluster: transient and persistent. Transient clusters are shut down when the jobs are complete. For example, if a batch-processing job pulls web logs from Amazon S3 and processes the data once a day, it is more cost-effective to use transient clusters to process web log data and shut down the nodes when the processing is complete. Persistent clusters continue to run after data processing is complete. The Informatica Data Lake Management solution supports both cluster types. For more information, see the Amazon EMR best practices whitepaper.

This Quick Start sets up a persistent EMR cluster with a configurable number of core nodes, as defined by the EMRCoreNodes parameter.
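The transient/persistent choice maps to a single flag in the EMR API. A hedged sketch of the Instances section of a boto3 run_job_flow request; the helper function is ours, and the instance types mirror this Quick Start's defaults for illustration only:

```python
def emr_instances_config(persistent: bool, core_nodes: int) -> dict:
    """Build the Instances section of an EMR run_job_flow request.
    KeepJobFlowAliveWhenNoSteps=False yields a transient cluster that
    terminates when its steps finish; True keeps it running."""
    return {
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 1 + core_nodes,  # one master plus the core nodes
        "KeepJobFlowAliveWhenNoSteps": persistent,
    }

# This Quick Start's behavior corresponds to a persistent cluster:
instances = emr_instances_config(persistent=True, core_nodes=2)
```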

Common AWS Architecture Patterns for Informatica Data Lake Management

Informatica Data Lake Management supports the following patterns that leverage AWS for big data processing.

Pattern 1: Using Amazon S3

In this first pattern, data is loaded to Amazon S3 using Informatica. For data processing, the Informatica Big Data Management mapping logic pulls data from Amazon S3 and sends it for processing to Amazon EMR. Amazon EMR does not copy the data to the local disk or HDFS. Instead, the mappings open multithreaded HTTP connections to Amazon S3, pull data to the Amazon EMR cluster, and process data in streams, as illustrated in Figure 6.


Figure 6: Pattern 1 using Amazon S3

Pattern 2: Using HDFS and Amazon S3 as Backup Storage

In this pattern, Informatica writes data directly to HDFS and leverages the Amazon EMR task nodes to process the data, periodically copying data to Amazon S3 as backup storage, as illustrated in Figure 7.

The advantage of this pattern is performance: data resides in HDFS, local to the Amazon EMR cluster that processes it. The disadvantage is durability. Because Amazon EMR uses ephemeral disk to store data, data could be lost if an EC2 instance in the Amazon EMR cluster fails. HDFS replicates data within the Amazon EMR cluster and can usually recover from node failures. However, data loss could still occur if the number of lost nodes is greater than your replication factor. Informatica recommends that you back up HDFS data to Amazon S3 periodically.


Figure 7: Pattern 2 using HDFS and Amazon S3 as backup

Pattern 3: Using Amazon Kinesis and Kinesis Firehose for Real-Time and Streaming Analytics

In the third pattern, unbounded event streams that are continuously generated by devices, IoT applications, and cloud applications are ingested into Amazon Kinesis in real time, using Informatica Edge Data Streaming. With Informatica Big Data Streaming, which leverages the existing Informatica platform, streaming pipelines can be built from pre-built transformations, connectors, and parsers. These elements are optimized to execute on an Amazon EMR cluster in streaming mode using Spark Streaming. They support consuming data records from an Amazon Kinesis stream and act as a producer for writing data to a defined Amazon Kinesis Firehose delivery stream. Data can be persisted to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES), and delivered in JSON and binary payloads.

For more information about deploying Informatica Big Data Streaming on AWS, please contact Informatica or your implementation partner.
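As an illustration of the producer side of this pattern only (not Informatica's implementation), a Kinesis put_record call looks roughly like this; the function, stream name, and event shape are ours, and `kinesis` is a boto3 Kinesis client:

```python
import json

def publish_event(kinesis, stream_name: str, event: dict) -> None:
    """Write one JSON event to an Amazon Kinesis stream. Records with
    the same partition key land on the same shard, preserving order
    per device."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("device_id", "default")),
    )
```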

Figure 8 shows the Informatica Big Data Streaming architecture.


Figure 8: Pattern 3 using the Informatica Big Data Streaming architecture

Pattern 4: Using AWS for Self-Service Data Discovery and Preparation

In the last pattern, Informatica Enterprise Data Lake provides data analysts with a collaborative, self-service, big data discovery and preparation solution. Analysts can rapidly discover and turn raw data into insights, with quality and governance powered by data intelligence deployed on AWS.

When deployed on AWS, Informatica Enterprise Data Lake leverages the existing Informatica platform, which allows analysts to discover, search, and explore data assets for analysis using an AI-driven data catalog. The Data Lake Management solution makes recommendations based on the behavior and shared knowledge of the data assets used for analysis. Once analysts find the relevant data, they can blend, transform, cleanse, and enrich data by using a Microsoft Excel-like data preparation interface, at scale on an Amazon EMR cluster.

Data is prepared, published, and made available for consumption in the data lake. An analyst can assess the prepared data using ad hoc queries to generate charts, tables, and other visual formats. IT can operationalize the data preparation steps behind the analysts' ad hoc work into Informatica big data mappings, which run in batch on an Amazon EMR cluster.


You can deploy Informatica Enterprise Data Lake on the same AWS infrastructure that supports Informatica Big Data Management and Informatica Enterprise Data Catalog. Figure 9 shows the data flows for Informatica Enterprise Data Lake.

Figure 9: Data flows used in pattern 4

Process Flow

Figure 10 shows the process flow for using the Informatica Data Lake Management solution on AWS. It illustrates the data flow process using the Informatica Data Lake Management solution with Amazon EMR, Amazon S3, and Amazon Redshift.


Figure 10: Informatica Data Lake Management Solution process flow using Amazon EMR

The numbers in Figure 10 refer to the following steps:

Step 1: Collect and move data from on-premises systems into Amazon S3 storage. Consider offloading infrequently used data, and batch-load raw data to a defined landing zone in Amazon S3.

Step 2: Collect cloud application and streaming data generated by machines and sensors in Amazon S3 storage instead of staging it in a temporary file system or a data warehouse.

Step 3: Discover and profile data stored in Amazon S3, using Amazon EMR as the processing infrastructure. Profile data to better understand its structure and context. Parse raw data, whether multi-structured or unstructured, to extract features and entities, and cleanse data with data quality tasks. To prepare data for analysis, execute prebuilt transformations and data quality rules natively on Amazon EMR.

Step 4: Match duplicate data within and across big data sources and link the records to create a single view.

Step 5: Perform data masking to protect confidential data such as credit card information, social security numbers, names, addresses, and phone numbers from unintended exposure, reducing the risk of data breaches. Data masking helps IT organizations manage access to their most sensitive data, providing enterprise-wide scalability, robustness, and connectivity to a vast array of databases.


Step 6: Data analysts and data scientists can prepare and collaborate on data for analytics by incorporating semantic search, data discovery, and intuitive data preparation tools for interactive analysis with trusted, secure, and governed data assets.

Step 7: After cleansing and transforming data on Amazon EMR, move high-value curated data back to Amazon S3 or to Amazon Redshift. From there, users can directly access data with BI reports and applications.
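The load into Amazon Redshift in step 7 is typically done with a COPY statement that references the data-lake bucket and an IAM role such as the one in the RedShiftIamRole output. A sketch, with a hypothetical helper, table, bucket, and role ARN of our own:

```python
def redshift_copy_sql(table: str, bucket: str, prefix: str,
                      iam_role_arn: str) -> str:
    """Compose a Redshift COPY statement that loads curated CSV files
    from S3 using IAM-role authentication."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS CSV;"
    )

sql = redshift_copy_sql(
    "curated.web_logs", "my-datalake-bucket", "curated/web_logs/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole")
```

The statement would then be executed against the cluster endpoint reported in the RedShiftClusterEndpoint output.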

Additional Resources

AWS services
- AWS CloudFormation: http://aws.amazon.com/documentation/cloudformation/
- Amazon EBS: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html
- Amazon EC2: http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/
- Amazon EMR: https://aws.amazon.com/documentation/emr/
- Amazon Redshift: https://aws.amazon.com/documentation/redshift/
- Amazon S3: https://aws.amazon.com/documentation/s3/
- Amazon VPC: http://aws.amazon.com/documentation/vpc/

Informatica
- Informatica Network, a source for product documentation, Knowledge Base articles, and other information: https://network.informatica.com

Quick Start reference deployments
- AWS Quick Start home page: https://aws.amazon.com/quickstart/


GitHub Repository

You can visit our GitHub repository to download the templates and scripts for this Quick Start, to post your comments, and to share your customizations with others.

Document Revisions

January 2018: Initial publication

© 2018, Amazon Web Services, Inc. or its affiliates, and Informatica LLC. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS's current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS's products or services, each of which is provided "as is" without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the "license" file accompanying this file. This code is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

