    Informatica Data Explorer and Informatica Data Quality (Version 9.5.0)

    System Performance Guidelines


    Informatica Data Explorer and Informatica Data Quality System Performance Guidelines

    Version 9.5.0
    June 2012

    Copyright (c) 2010-2012 Informatica. All rights reserved.

    This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

    Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013 (1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

    The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

    Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

    Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright Sun Microsystems. All rights reserved. Copyright RSA Security Inc. All Rights Reserved. Copyright Ordinal Technology Corp. All rights reserved. Copyright Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright Meta Integration Technology, Inc. All rights reserved. Copyright Intalio. All rights reserved. Copyright Oracle. All rights reserved. Copyright Adobe Systems Incorporated. All rights reserved. Copyright DataArt, Inc. All rights reserved. Copyright ComponentSource. All rights reserved. Copyright Microsoft Corporation. All rights reserved. Copyright Rogue Wave Software, Inc. All rights reserved. Copyright Teradata Corporation. All rights reserved. Copyright Yahoo! Inc. All rights reserved. Copyright Glyph & Cog, LLC. All rights reserved. Copyright Thinkmap, Inc. All rights reserved. Copyright Clearpace Software Limited. All rights reserved. Copyright Information Builders, Inc. All rights reserved. Copyright OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright International Organization for Standardization 1986. All rights reserved. Copyright ej-technologies GmbH. All rights reserved. Copyright Jaspersoft Corporation. All rights reserved. Copyright International Business Machines Corporation. All rights reserved. Copyright yWorks GmbH. All rights reserved. Copyright Lucent Technologies 1997. All rights reserved. Copyright (c) 1986 by University of Toronto. All rights reserved. Copyright 1998-2003 Daniel Veillard. All rights reserved. Copyright 2001-2004 Unicode, Inc. Copyright 1994-1999 IBM Corp. All rights reserved. Copyright MicroQuill Software Publishing, Inc. All rights reserved. Copyright PassMark Software Pty Ltd. All rights reserved.

    This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

    The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.

    This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

    This product includes Curl software which is Copyright 1996-2007, Daniel Stenberg. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

    The product includes software copyright 2001-2005 (c) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/license.html.

    The product includes software copyright 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license.

    This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

    This product includes software copyright 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http://www.gnu.org/software/kawa/Software-License.html.

    This product includes OSSP UUID software which is Copyright 2002 Ralf S. Engelschall, Copyright 2002 The OSSP Project, Copyright 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

    This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

    This product includes software copyright 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http://www.pcre.org/license.txt.

    This product includes software copyright 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php.

    This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://www.stlport.org/doc/license.html, http://www.asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231; http://developer.apple.com/library/mac/#samplecode/HelpHook/Listings/HelpHook_java.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://www.jmock.org/license.html; and http://xsom.java.net/.

    This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the MIT License (http://www.opensource.org/licenses/mit-license.php) and the Artistic License (http://www.opensource.org/licenses/artistic-license-1.0).

    This product includes software copyright 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/.

    This Software is protected by U.S. Patent Numbers 5,794,246; 6,014,670; 6,016,501; 6,029,178; 6,032,158; 6,035,307; 6,044,374; 6,092,086; 6,208,990; 6,339,775; 6,640,226; 6,789,096; 6,820,077; 6,823,373; 6,850,947; 6,895,471; 7,117,215; 7,162,643; 7,243,110; 7,254,590; 7,281,001; 7,421,458; 7,496,588; 7,523,121; 7,584,422; 7,676,516; 7,720,842; 7,721,270; and 7,774,791, international Patents and other Patents Pending.

    DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

    NOTICES

    This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions:

    1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

    2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

    Part Number: IN-DPS-95000-0001


    Table of Contents

    Preface
        Informatica Resources
            Informatica Customer Portal
            Informatica Documentation
            Informatica Web Site
            Informatica How-To Library
            Informatica Knowledge Base
            Informatica Multimedia Knowledge Base
            Informatica Global Customer Support

    Chapter 1: Profiling Service Module Performance Tuning
        Profiling Service Module Performance Tuning Overview
        Profiling Service Module Overview
        Data Integration Service Resources
        Profiling Service Module Resources
        Column Profiling Resource Guidelines
            Nonrelational and Relational Data Source Profiling
            Hardware Considerations for Column Profiling
            Flat File and Mainframe Considerations
            Relational Database Considerations
        Key and Functional Dependency Resource Guidelines
            Hardware Considerations for Key and Functional Dependency Discovery
        Foreign Key and Overlap Discovery Resource Guidelines
            Hardware Considerations for Foreign Key and Overlap Discovery
        Profile Warehouse Guidelines
            Profile Warehouse Guidelines for Column Profiling
            Profile Warehouse Guidelines for Key and Functional Dependency Discovery
            Profile Warehouse Guidelines for Foreign Key and Overlap Discovery
        Summary

    Chapter 2: System Performance Guidelines for Data Quality Mapping Operations
        Overview
        Basic On-Disk Installation Size
        General Runtime Memory Size
            Address Validation Reference Data
        Mapping Memory and Disk Size Guidelines
            Standard Transformations
            Reference Data-Based Transformations
            Dynamic Transformations
        Data Quality Table Guidelines
            Reference Database Tables
            Exception Management Tables

    Appendix A: Disk and Memory Guidelines for United States Customers
        User 1: Matching Mapping
        User 2: Address Validation Mapping
        User 3: Standardization Mapping
        User 4: Association Mapping
        Additional Memory and Disk Usage

    Appendix B: Address Validation Reference Data
        Address Validation Reference Data with On-Disk Size

    Appendix C: Identity Population Data
        Identity Population Data with On-Disk Size


    Preface

    The System Performance Guide is written for system administrators and others who must plan the installation of Informatica Data Explorer and Informatica Data Quality.

    This guide assumes that you understand operating system, database, profiling, and data quality concepts.

    Informatica Resources

    Informatica Customer Portal

    As an Informatica customer, you can access the Informatica Customer Portal site at http://mysupport.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, the Informatica Multimedia Knowledge Base, Informatica Product Documentation, and access to the Informatica user community.

    Informatica Documentation

    The Informatica Documentation team takes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at [email protected]. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.

    The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to Product Documentation from http://mysupport.informatica.com.

    Informatica Web Site

    You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

    Informatica How-To Library

    As an Informatica customer, you can access the Informatica How-To Library at http://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more about Informatica products and features. It includes articles and interactive demonstrations that provide solutions to common problems, compare features and behaviors, and guide you through performing specific real-world tasks.


    Informatica Knowledge Base

    As an Informatica customer, you can access the Informatica Knowledge Base at http://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge Base, contact the Informatica Knowledge Base team through email at KB_Feedback@informatica.com.

    Informatica Multimedia Knowledge Base

    As an Informatica customer, you can access the Informatica Multimedia Knowledge Base at http://mysupport.informatica.com. The Multimedia Knowledge Base is a collection of instructional multimedia files that help you learn about common concepts and guide you through performing specific tasks. If you have questions, comments, or ideas about the Multimedia Knowledge Base, contact the Informatica Knowledge Base team through email at KB_Feedback@informatica.com.

    Informatica Global Customer Support

    You can contact a Customer Support Center by telephone or through the Online Support. Online Support requires a user name and password. You can request a user name and password at http://mysupport.informatica.com.

    Use the following telephone numbers to contact Informatica Global Customer Support:

    North America / South America

        Toll Free
        Brazil: 0800 891 0202
        Mexico: 001 888 209 8853
        North America: +1 877 463 2435

    Europe / Middle East / Africa

        Toll Free
        France: 0805 804632
        Germany: 0800 5891281
        Italy: 800 915 985
        Netherlands: 0800 2300001
        Portugal: 800 208 360
        Spain: 900 813 166
        Switzerland: 0800 463 200
        United Kingdom: 0800 023 4632

        Standard Rate
        Belgium: +31 30 6022 797
        France: +33 1 4138 9226
        Germany: +49 1805 702 702
        Netherlands: +31 306 022 797
        United Kingdom: +44 1628 511445

    Asia / Australia

        Toll Free
        Australia: 1 800 151 830
        New Zealand: 09 9 128 901

        Standard Rate
        India: +91 80 4112 5738


    C H A P T E R 1

    Profiling Service Module Performance Tuning

    This chapter includes the following topics:

    Profiling Service Module Performance Tuning Overview

    Profiling Service Module Overview

    Data Integration Service Resources

    Profiling Service Module Resources

    Column Profiling Resource Guidelines

    Key and Functional Dependency Resource Guidelines

    Foreign Key and Overlap Discovery Resource Guidelines

    Profile Warehouse Guidelines

    Summary

    Profiling Service Module Performance Tuning Overview

    The performance tuning guidelines help you to determine the system resources for the Profiling Service Module.

    The following categories describe the system resource recommendations:

    Resource guidelines for the Profiling Service Module and the Data Integration Service, including memory, disk space, and CPU usage.

    Resource guidelines for column profiling, key and functional dependency discovery, and foreign key and overlap discovery based on data source types and hardware capacity.

    Resource guidelines for the profile warehouse.

    The Profiling Service Module is a component of the Data Integration Service. The performance guidelines in this document are for the Profiling Service Module and are independent of the resource requirements for other modules within the Data Integration Service.


    CPU

    The Profiling Service Module uses less than 1 CPU. Consider the following CPU requirements for different profile types:

    The CPU requirements for column profiles depend on the data source type. Relational systems require less than one CPU for each Data Transformation Manager thread. Flat files use approximately 2.3 CPUs for each Data Transformation Manager thread. Key and functional dependency discovery require one CPU for each Data Transformation Manager thread.

    Join, foreign key, and overlap discovery require two CPUs for each Data Transformation Manager thread.

    When you calculate the number of CPUs required for Data Transformation Manager operations, round the total number up to the nearest integer. Disk space is a one-time cost when the Data Integration Service is installed. CPU overhead is minimal when the Data Integration Service is not running jobs.
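    For example, the following sketch applies the per-thread CPU figures above to a hypothetical workload. The thread counts and the use of 1.0 as a stand-in for "less than one CPU" are assumptions for illustration only.

        # Approximate CPUs per Data Transformation Manager (DTM) thread, per the guidelines above.
        import math

        CPUS_PER_DTM_THREAD = {
            "column_profile_relational": 1.0,   # stated as "less than one CPU"; 1.0 is a conservative stand-in
            "column_profile_flat_file": 2.3,
            "key_functional_dependency": 1.0,
            "foreign_key_overlap": 2.0,
        }

        def cpus_required(thread_counts):
            # thread_counts maps a profile type to the number of concurrent DTM threads.
            total = sum(CPUS_PER_DTM_THREAD[kind] * count for kind, count in thread_counts.items())
            return math.ceil(total)   # round the total up to the nearest integer

        # Hypothetical workload: two flat file threads and one foreign key/overlap thread.
        print(cpus_required({"column_profile_flat_file": 2, "foreign_key_overlap": 1}))   # 7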

    Memory

    No additional memory is required beyond the minimum needed to run the mapping.

    Disk

    No disk space is required.

    Operating System

    Use a 64-bit operating system, if possible, as a 64-bit system can handle memory sizes greater than 4 GB. A 32-bit system works if the profiling parameter fits within the memory limitations of the system.

    Column Profiling Resource Guidelines

    The system resource guidelines for column profiling depend on the type of data source and hardware capacity.

    Nonrelational and Relational Data Source Profiling

    The types of data source on which you run a column profile influence the requirements of the Profiling Service Module machine. In the case of flat file profiling, you can allocate hardware resources to the machine that runs the Profiling Service Module. For relational databases, you can allocate hardware resources to the database host machine.

    Most enterprises have a mixture of nonrelational and relational database data sources. In such a case, you can divide the resources between the machines. The relational database for the profile warehouse may run on a separate machine. This machine stores the profiling results and may store cached data. The resources that the profile warehouse needs depend on the quantity of data that it may need to store.

    When you run a column profile in a relational database, the Profiling Service Module allocates one mapping for each column. Each mapping divides the work between the relational database and Data Transformation Manager as follows:

    The relational database performs value frequency computation.

    The Profiling Service Module performs profile analysis, including pattern analysis. This task is less resource-intensive compared to the value frequency computation.

    When you run a profile on the columns in a flat file, the Profiling Service Module allocates one mapping to five columns by default. You can change this setting to run multiple mappings of five columns concurrently.


    Nonrelational Data Source Profiling

    The profiling methodology for nonrelational data sources includes the assumption that the data source cannot run any logic. This assumption applies to flat files, VSAM and other mainframe databases, and SAP and other nonrelational databases.

    The resources required for nonrelational data sources depend on the data source:

    Flat Files

    The Profiling Service Module generates mappings that perform all the profiling logic for flat files. The processing happens on the Data Integration Service using its resources.

    Mainframe Databases

    The profiling methodology for nonrelational data sources includes mainframe databases because of the financial cost of providing resources for mainframe machines. You can transfer the processing to the Data Transformation Manager to avoid this cost outlay. However, the one-time movement of data from the mainframe to the Data Transformation Manager may incur greater processing overhead than repeated read operations on the mainframe data.

    Some mainframe databases, such as IMS and VSAM, work like flat files with no ability to transfer the logic to the databases. Other mainframe databases, such as DB2, are expensive to access because of the way the mainframes are administered. Because of both these reasons, the Profiling Service Module treats mainframe sources as flat files. The Profiling Service Module performs all the processing tasks on the Data Integration Service to minimize administrative costs.

    Other Nonrelational Databases

    Databases, such as SAP, are included in the profiling methodology for nonrelational data sources because of the proprietary way that they access data. The Profiling Service Module cannot push the profiling logic to these databases.

    Relational Database Profiling

    Relational databases can run a few subtasks of the profiling task on the database machine.

    Transferring a profiling subtask to the database machine delivers higher performance than retrieving the data and running a profile on it in the Profiling Service Module machine. The Profiling Service Module transfers as much processing as possible to the database machine.

    Hardware Considerations for Column Profiling

    The factors that affect profile performance include the speed of the central processing unit, memory size, disk space, and the speed of the disk and network.

    Consider the following hardware factors for column profiling:

    Central Processing Unit (CPU)

    The Profiling Service Module takes advantage of the multithreaded environment of the Data Integration Service. Therefore, the CPU speed is less important than the number of cores in the CPU. To calculate the number of cycles that the Profiling Service Module uses each second, add the clock speeds of the cores.

    Memory

    Profiling operations run faster with more memory. In flat file profiling, the Profiling Service Module uses memory to sort the value frequency data and buffer data. The Profiling Service Module performs multiple read operations to the same part of a file by reading from the memory buffer, not from the flat file on disk. This also applies to rule profiling for flat file and relational sources.


    Disk

    The Profiling Service Module uses disk space for temporary storage when there is not enough memory to store the intermediate profile results. The Profiling Service Module uses multiple temporary directories in a single profiling job so that the storage and input/output operations can be spread among multiple disks in parallel.

    Profiling speed increases if the temporary directories are located on separate physical disks. Disk technologies, such as rotational speed and on-disk buffering, also benefit profile runs.

    Input/Output

    The input/output speeds for memory, disk, and network affect the Profiling Service Module performance. Higher speeds allow the Profiling Service Module to quickly access large amounts of data. Network speed is important for relational databases that are not located on the same machine as the Profiling Service Module and for flat files located on a network-attached storage device.

    Flat File and Mainframe Considerations

    When you run a profile job on a flat file, the Profiling Service Module divides the job into a number of mappings that infer the metadata for the columns and virtual columns. Each mapping can run serially, or two or more mappings can run in parallel.

    In addition, a second type of mapping may be generated to cache the source data. This mapping always runs in parallel with the column profiling mappings because it takes longer than a column profile mapping.

    In the case of a mainframe data source, the Profiling Service Module groups as many columns as possible into a single mapping to minimize the number of table scans on the data source. Mainframe data sources require more disk space than flat files to store the temporary computations.

    Column Profile Mapping Requirements

    The default column profile mapping for a flat file runs a profile on five columns at a time. For mainframe data sources, the column profile mapping runs a column profile on 50 columns at a time. The CPU, memory, disk, and operating system requirements are based on five columns.

    CPU

    Column profiling consumes approximately 2.3 CPUs for each mapping. When you calculate the number of CPUs you need, round up the total to the nearest integer.

    Memory for Mappings

    The Profiling Service Module uses two methods for profile mappings. First, it applies a method that requires approximately 2 MB of memory for each column. If the first method does not work, it uses the second method of sorting columns with a buffer of 64 MB.

    The minimum resource required is 10 MB, representing 2 MB * 5 columns. The maximum resource required is 72 MB, representing a 64 MB buffer for one high-cardinality column and 8 MB for the remaining four low-cardinality columns.

    Memory for the Buffer Cache

    The Profiling Service Module caches the flat file data as it reads the data. Profiling speed increases if the Profiling Service Module can cache all the file data.

    The exception to using cache memory is when two or more mappings read a file concurrently. In this case, add 100 MB of memory. This enables the mappings to share the read operations and increase performance.


    Disk

    A profile mapping may need disk space to perform profiling computations. The following formula calculates the disk space for a single mapping:

    2 * number of columns per mapping * maximum number of rows * ((2 bytes per character * maximum string size in characters) + frequency bytes)

    where

    2 = two passes. Some analyses need two passes.

    Number of columns for each mapping = 5 (default)

    Maximum number of rows = the maximum number of rows in any flat file

    2 bytes per character = the typical number of bytes for a single Unicode character

    Maximum string size in characters = the maximum number of characters in any column in any flat file, or 255, whichever is less

    Frequency bytes = 4 bytes to store the frequency calculation during the analysis

    Perform the above calculation and allocate the disk space to one or more physical disks. Use one disk for each mapping, and use a maximum of four disks.
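    As an illustration, the following sketch evaluates the formula above for a hypothetical flat file. The row count and string size are assumptions; substitute values from your own largest flat file.

        # Disk space for one flat file column profile mapping, per the formula above.
        COLUMNS_PER_MAPPING = 5     # default number of columns per mapping
        BYTES_PER_CHARACTER = 2     # typical bytes for a single Unicode character
        FREQUENCY_BYTES = 4         # bytes to store the frequency calculation

        def mapping_disk_bytes(max_rows, max_string_size):
            max_string_size = min(max_string_size, 255)   # capped at 255 per the guideline
            return (2 * COLUMNS_PER_MAPPING * max_rows *
                    (BYTES_PER_CHARACTER * max_string_size + FREQUENCY_BYTES))

        # Hypothetical file: 10 million rows, widest column of 100 characters.
        disk = mapping_disk_bytes(max_rows=10_000_000, max_string_size=100)
        print(f"{disk / 1024**3:.1f} GB per mapping")   # ~19.0 GB, spread over up to four disks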

    Operating System

    Use a 64-bit operating system to accommodate memory sizes greater than 4 GB. A 32-bit system works if the profiling parameter fits within the memory limitations of the system.

    Note: These guidelines cover the optimal flat file profiling case, which uses five columns for each mapping. In some cases, the Profiling Service Module must run the profile for more than five columns in one mapping, for example, when running a profile on mainframe data where the financial cost of accessing the data can be high.

    Profile Cache Mapping Requirements

    A profile cache mapping caches data to the profile warehouse and has different resource requirements than a column profile mapping.

    The CPU, memory, and disk space requirements for a profile cache mapping are as follows:

    CPU

    The cache mapping requires approximately 1.5 CPUs.

    Memory

    The cache mapping requires no additional memory beyond the Data Transformation Manager thread memory.

    Disk

    The cache mapping requires no disk space.

    Aggregate Profile Mapping Guidelines

    To compute the total resources required by profiling, add the profile mapping requirements to the cache mapping requirements.

    Use the following formula to determine the total profiling resources:

    (number of concurrent profile mappings * resources per mapping) + cache mapping resources
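    For example, a minimal sketch of this aggregation, using the per-mapping CPU and memory figures discussed earlier in this chapter as hypothetical inputs:

        # Total profiling resources = (concurrent profile mappings * resources per mapping)
        #                             + cache mapping resources
        def total_resources(concurrent_mappings, per_mapping, cache_mapping):
            return {key: concurrent_mappings * per_mapping[key] + cache_mapping[key]
                    for key in per_mapping}

        # Assumed figures: 2.3 CPUs and 72 MB per profile mapping, 1.5 CPUs for the cache mapping.
        print(total_resources(
            concurrent_mappings=3,
            per_mapping={"cpus": 2.3, "memory_mb": 72},
            cache_mapping={"cpus": 1.5, "memory_mb": 0},
        ))
        # {'cpus': 8.4, 'memory_mb': 216} -> round the CPU total up to 9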


    Relational Database Considerations

    The Profiling Service Module transfers as much processing as it can to the database machine where the relational database resides. The division of work between the Profiling Service Module and the database can be challenging when you estimate resources for each machine.

    To understand the resource requirements, you must first distinguish rule profiling from column profiling.

    Rule and Column Profiling

    Depending on the rule logic, rules can be pushed down to the database or handled internally by the Profiling Service Module.

    If a rule is pushed down to the database, it is treated like a column during the profile run.

    Rules that run inside the Profiling Service Module are treated like columns in a flat file profile run. The rules are grouped into mappings of five output columns at a time before running the profile. The flat file calculations apply in this case.

    Bandwidth

    The network between the relational database and the Profiling Service Module must be able to handle the data transfers.

    For large databases, the bandwidth required can be considerable.

    Relational Database Mapping Resources

    The following resource considerations are based on a single mapping that pushes the profiling logic down to the relational database for each column.

    CPU

    Depending on the relational database, at least one CPU processes each query. If the relational database provides a mechanism to increase this, such as the parallel hint in Oracle, the number of CPUs utilized increases accordingly.

    Memory

    The relational database requires memory in the form of a buffer cache. The greater the buffer cache, the faster the relational database runs the query. Use at least 512 MB of buffer cache.

    Disk

    Relational systems use temporary table space. The formula for the maximum amount of temporary table space required is:

    2 * maximum number of rows * (maximum column size + frequency bytes)

    where

    2 = two passes (some analyses need two passes).

    Maximum number of rows = the maximum number of rows in any table.

    Maximum column size = the number of bytes in any column in a table that is not one of the very large data types, for example CLOB, that you cannot run a profile on. The column size must take into account the character encoding, such as Unicode or ASCII.

    Frequency bytes = 4 or 8 bytes to store the frequency during the analysis. This is the default size that the database uses for COUNT(*).


    In many situations, less disk space is needed. Perform the disk computation and allocate the temporary table space to one or more physical disks. Use one disk for each mapping, and use a maximum of four disks.
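    A small sketch of this temporary table space estimate, with hypothetical inputs:

        # Maximum temporary table space for one relational column profile mapping, per the formula above.
        def temp_table_space_bytes(max_rows, max_column_size, frequency_bytes=8):
            return 2 * max_rows * (max_column_size + frequency_bytes)

        # Assumed table: 50 million rows, widest profiled column of 80 bytes.
        space = temp_table_space_bytes(max_rows=50_000_000, max_column_size=80)
        print(f"{space / 1024**3:.1f} GB of temporary table space")   # ~8.2 GB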

    Operating System

    Use a 64-bit operating system to accommodate memory sizes greater than 4 GB. A 32-bit system works if the profiling parameter fits within the memory limitations of the system.

    Key and Functional Dependency Resource Guidelines

    The Profiling Service Module processes a data source sample to infer the keys and functional dependencies. The bandwidth requirement for flat files and relational databases is low because the sampled data size is usually small.

    Hardware Considerations for Key and Functional Dependency Discovery

    Both key and functional dependency discovery algorithms are CPU- and temporary disk-intensive. The algorithms use memory as a cache between the intermediate results and temporary disk.

    The factors that affect profile performance include CPU, memory, disk size, and disk speed:

    Central Processing Unit (CPU)

    Uses one CPU for each mapping.

    Memory

    Requires 256 MB of memory in addition to the mapping memory.

    Disk Size

    Caches intermediate profile results to the disk. The required amount of disk space depends on the complexity of the data and the number of columns. A minimum of 128 GB of disk space is recommended.

    Disk Speed

    The input/output speeds, for both memory and disk, affect the Profiling Service Module performance. Higher speeds allow the Profiling Service Module to quickly access large amounts of data.

    Foreign Key and Overlap Discovery Resource Guidelines

    Hardware Considerations for Foreign Key and Overlap Discovery

    The Profiling Service Module uses one of two strategies to infer foreign keys or overlapping columns. If column signatures are available before the profile run, the Profiling Service Module does not require additional resources. If the signatures need to be computed, the profile needs additional resources.

    The hardware resource requirements are as follows:

    8 Chapter 1: Profiling Service Module Performance Tuning

  • 7/28/2019 IN_950_SystemPerformanceGuidelines_en.pdf

    16/34

    Central Processing Unit (CPU)

    Requires two CPUs for each mapping.

    Memory

    Requires 64 MB of additional memory for internal caches if no column profile is run. Requires no additional memory if a column profile is run.

    Disk

    Does not require temporary disk space.

    Profile Warehouse Guidelines

    The profile warehouse stores profiling results. More than one Profiling Service Module may point to the same profile warehouse. The main resource for the profile warehouse is disk space. The disk size calculations depend on the expected storage sizes of integers. Some databases, such as Oracle, use a compressed number format and therefore require less disk space.

    Profile Warehouse Guidelines for Column Profiling

    Column profiling stores three types of results in the profile warehouse: statistical and bookkeeping data, value frequencies, and staged data.

    Statistical and Bookkeeping Data Guidelines

    Each column contains a set of statistics, such as the minimum and maximum values. It also contains a set of tables that store bookkeeping data, such as the profile ID. These take up very little space and you can exclude them from disk space calculations.

    Consider the disk requirement to be effectively zero.

    Value Frequency Calculation Guidelines

    Value frequencies are a key element in profile results. They list the unique values in a column along with a count of the occurrences of each value.

    Low-cardinality columns have very few values, but high-cardinality columns can have millions of values. The Profiling Service Module limits the number of unique values it identifies to 16,000 by default. You can change this value.

    Use this formula to calculate disk size requirements:

    Number of columns * number of values * (average value size + 64)

    where

    Number of columns = the sum of columns and virtual columns in the profile run.

    Number of values = the number of unique values. If you do not use the default of 16,000, use the averagenumber of values in each column.

    Average value size includes Unicode encoding of characters.

    64 bytes for each value = 8 bytes for the frequency and 56 bytes for the key.
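    For example, a minimal sketch of the value frequency calculation, using the default limit of 16,000 values and an assumed average value size:

        # Disk space for value frequency results, per the formula above.
        def value_frequency_bytes(num_columns, values_per_column=16_000, avg_value_size=20):
            # 64 bytes per value = 8 bytes for the frequency + 56 bytes for the key
            return num_columns * values_per_column * (avg_value_size + 64)

        # Assumed profile: 80 columns, the default 16,000 values each, 20-character average values.
        print(f"{value_frequency_bytes(80) / 1024**2:.0f} MB")   # ~103 MB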


    Cached Data Guidelines

    Cached data is also known as staged data. It is a copy of the source data that is used for drilldown operations. Depending on the data source, this can use a very large amount of disk space.

    Use the following formula to calculate the disk size requirements for cached data:

    number of rows * number of columns * (average value size + 24)

    Note: 24 is the cache key size.

    Sum the results of this calculation for all cached tables.

    For example, an 80-column table that contains 100 million rows with an equal mixture of high- and low-cardinality columns may require the following disk space:

    Value frequency data: 50 MB

    Cached data: 327,826 MB

    Total: 327,876 MB

    Source data is staged when caching is selected. If you do not cache data for drilldown, the disk space required is significantly reduced. All profiles store the value frequencies.
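    The cached data figure dominates the example above. The following sketch evaluates the staged data formula; the 17-character average value size is an assumption chosen to land near the example figure.

        # Disk space for cached (staged) source data, per the formula above.
        def cached_data_bytes(num_rows, num_columns, avg_value_size):
            return num_rows * num_columns * (avg_value_size + 24)   # 24 = cache key size

        # The 80-column, 100 million row table from the example, assuming ~17-character values.
        cached = cached_data_bytes(num_rows=100_000_000, num_columns=80, avg_value_size=17)
        print(f"{cached / 10**6:,.0f} MB")   # 328,000 MB (1 MB taken as 10^6 bytes), close to the example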

    Other Resource Needs

    The profile warehouse has the following memory and CPU requirements:

    Memory

    The queries run by the Profiling Service Module do not use significant amounts of memory. Use the manufacturer's recommendations based on the table sizes.

    CPU

    The recommended options are:

    1 CPU for each concurrent profile job. This applies to each relational database or flat file profile job, not to each profile mapping.

    2 CPUs for each concurrent profile job if the data is cached.

    Profile Warehouse Guidelines for Key and Functional Dependency Discovery

    The disk space for key and functional dependency discovery depends on the number of inferred keys, functional dependencies, and their dependency violations. These items take up significant space in the profile warehouse only if you configure the profile to infer a large number of keys and functional dependencies.

    You can use the following formulas to compute the disk space. If the confidence parameter is set to 100%, the profile warehouse does not store violating rows, and you can omit that part of the computation.

    Keys

    Number of Keys * Average Number of Key Columns * 32 + Number of Keys * (32 + (2 Bytes per Character * Average Column Size)) * Average Number of Key Columns * Average Number of Violating Rows

    Where

    Number of Keys is the number of inferred keys.


    Average Number of Key Columns is the average number of columns in the key.

    32 is the number of bytes used to store one column in the key.

    Average Column Size is the average number of characters in the columns. Assumes numbers and dates have been converted to strings.

    2 Bytes per Character is the typical number of bytes used for a single Unicode character.

    Average Number of Violating Rows is the average number of rows that violate the key.

    Functional Dependency

    Number of FDs * (Average Number of LHS Columns + 1) * 32 + Number of FDs * (32 + (2 Bytes per Character * Average Column Size)) * (Average Number of LHS Columns) * Average Number of Violating Rows

    Where

    Number of FDs is the number of inferred functional dependencies.

    Average Number of LHS Columns is the average number of columns in the determinant of the functional dependency. One is added for the dependent column.

    32 is the number of bytes used to store one column in the functional dependency.

    Average Column Size is the average number of characters in the columns. Assumes numbers and dates have been converted to strings.

    2 Bytes per Character is the typical number of bytes used for a single Unicode character.

    Average Number of Violating Rows is the average number of rows that violate the functional dependency.
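    A small sketch of the key storage formula, as reconstructed above, with hypothetical inputs. The functional dependency formula has the same shape, with (Average Number of LHS Columns + 1) in place of the key column count in the first term.

        # Profile warehouse space for inferred keys, per the formula above (as reconstructed).
        def key_space_bytes(num_keys, avg_key_columns, avg_column_size,
                            avg_violating_rows, bytes_per_char=2):
            key_metadata = num_keys * avg_key_columns * 32
            per_violating_row = (32 + bytes_per_char * avg_column_size) * avg_key_columns
            return key_metadata + num_keys * per_violating_row * avg_violating_rows

        # Assumed results: 10 inferred keys of 2 columns, 30-character columns, 50 violating rows each.
        print(key_space_bytes(10, 2, 30, 50))   # 92,640 bytes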

    Profile Warehouse Guidelines for Foreign Key and Overlap Discovery

    The disk space for foreign key and overlap discovery depends on the number of inferred foreign keys and overlapping column pairs. These items take up significant space in the profile warehouse only if you configure the profile to infer a large number of foreign keys and overlapping columns.

    You can use the following formulas for computing the disk space. The Profiling Service Module computes column signatures once for foreign key and overlap discovery.

    Signatures

    Number of Columns in Schema * 3600

    Where

    Number of Columns in Schema is the total number of columns in the profile model. After the Profiling Service Module generates the column signature for a profile task, subsequent profile tasks reuse the signature.

    3600 is the amount of space required to store the signatures for one column.

    Foreign Keys

    Number of Foreign Keys * 2 * (Average Number of Key Columns) * 32 + Number of Foreign Keys * (32 + (2 Bytes per Character * Average Column Size)) * Average Number of Key Columns * Average Number of Violating Rows

    Where

    Number of Foreign Keys is the number of inferred foreign keys.

    Average Number of Key Columns is the average number of columns in the primary or foreign key.

    2 is the multiplier to get the total number of columns for the foreign key.


    C H A P T E R 2

    System Performance Guidelines for Data Quality Mapping Operations

    This chapter includes the following topics:

    Overview

    Basic On-Disk Installation Size

    General Runtime Memory Size

    Mapping Memory and Disk Size Guidelines

    Data Quality Table Guidelines

    Overview

    This chapter provides system resource recommendations for Informatica Data Quality 9.0.1 mapping operations.

    The chapter addresses three areas:

    The size of installed elements on the file system

    The runtime footprint of general services for all users

    The memory and disk overhead when a user runs mappings

    The effect of mapping execution on disk and memory usage is the most critical of these factors. It is also the most difficult to estimate. When you determine your resource needs, consider the number of concurrent mappings submitted to the server, the types of transformation used in each mapping, and the size of the source data sets.


    Mapping Memory and Disk Size Guidelines

    From the point of view of resource usage and performance, data quality transformations can be divided into the following categories:

    Standard transformations

    Reference data-based transformations

    Dynamic transformations

    The standard components do not incur additional costs in memory or disk usage beyond the standard running size. Reference data-based transformations can hold reference table lookup structures in memory. Dynamic transformations can use third-party engines, sort space, or B-tree storage.

    The dynamic transformations' use of memory and disk can vary considerably, depending on the data that they process.

    Standard Transformations

    The standard transformations are Comparison, Decision, and Merge.

    The memory or disk usage of these transformations does not vary with the size of the data processed. These components process data rows in small batches and send them to the next component in the mapping immediately.

    Reference Data-Based Transformations

    The following transformations use Informatica reference table data:

    Case Converter

    Labeller

    Parser

    Standardizer

    These transformations process data immediately, but they have initialization costs that increase memory use according to their configuration. While the reference table data is managed in the database, at runtime it is held in memory for performance reasons. To optimize data throughput, this in-memory storage is designed for speed rather than space efficiency. Each transformation has its own copy of the in-memory reference data.

    Multiply the number of bytes in each column of the reference table by the number of lines in the reference table. Then multiply the total by 1.3 to estimate the in-memory footprint.

    For example, consider a reference table with the following dimensions:

    Rows: 10,000

    Columns: 6

    Average byte count: 25

    The product of 10,000 * 6 * 25 * 1.3 equals approximately 2 MB of runtime memory usage. This runtime memory cost applies to the lifetime of the mapping. All in-memory reference tables are freed when the mapping is finished.
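    A minimal sketch of this estimate, reproducing the worked example above:

        # In-memory footprint of a reference table at runtime:
        # rows * columns * average bytes per value * 1.3 overhead factor.
        def reference_table_memory_bytes(rows, columns, avg_bytes_per_value, overhead=1.3):
            return rows * columns * avg_bytes_per_value * overhead

        size = reference_table_memory_bytes(rows=10_000, columns=6, avg_bytes_per_value=25)
        print(f"{size / 1024**2:.1f} MB per transformation instance")   # ~1.9 MB, roughly 2 MB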


    Dynamic Transformations

    The following transformations have dynamic memory and disk usage requirements. These components store large numbers of rows internally for block processing and have memory and disk requirements that increase with the volume of input rows and the number of columns per row.

    Address Validator Transformation

    The following factors affect performance for this transformation. Refer also to the section on General Runtime Memory Size, as the Address Validator transformation affects all users when an address validation mapping runs.

    Preload and Cache Settings

    If the memory is available, fully preload all the address reference data files you need. If you use full preloading, set the CacheSize value to NONE. Otherwise, accept the default CacheSize value of LARGE.

    Concurrent Address Validator Mappings in PowerCenter

    Data Quality 9.0.1 creates a single process for all Address Validator mappings it runs, while PowerCenter creates a process for each mapping. Address validation processes do not share memory in PowerCenter. If you specify full preloading in PowerCenter, each address validation mapping needs approximately 6 GB of memory at runtime. If this is an issue, use partial preloading for address reference data in PowerCenter.

    Memory Usage

    Memory usage depends on the amount of data specified by the Pre-Load setting. The AddressDoctor engine consumes only the required amount of memory.

    Maximum AddressObject Setting

    The required AddressObject setting depends on the number of concurrent address validation mappings that run. This is a consideration in Data Quality 9.0.1, as all mappings are run in a single process, and it can be a consideration in PowerCenter if there is more than one Address Validator transformation in a mapping.

    Set the AddressObject count to the maximum number of Address Validator transformations that can run concurrently in a process. If you run two mappings concurrently in Data Quality 9.0.1, with two Address Validator transformations in each mapping, set the AddressObject count to 4. In the same situation in PowerCenter, set the AddressObject count to 2.

    The address validation mapping may fail if the required AddressObjects are not available.

    Maximum ThreadCount Setting

    The optimal setting for Maximum ThreadCount is the number of processors or cores available for the address validation process. If Maximum ThreadCount is set lower, the mappings will run more slowly.

    Association Transformation

    This transformation makes extensive use of B-tree file-based storage.

    Each column that the transformation reads has its own B-tree, and a general B-tree is used to store all the input data rows. The Informatica B-tree is space-efficient but not compressed.

    Use the following formulas to estimate the disk and memory needs of this transformation:

    Association Transformation Column Size

    total volume of data for each column + 20 bytes for each input row

    On-disk Runtime Cost of the General Storage Cache

    size of input data set + 10 bytes for each row


    Maximum Internal Memory Map for Association IDs and Data Rows

    number of rows x 20 bytes
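    Expressed as code, the three formulas can be collected into a small sizing sketch. The helper names are illustrative; the per-row constants (20 bytes for each column B-tree and for the internal memory map, 10 bytes for the general cache) come from the formulas above.

        def association_column_btree_bytes(rows, column_data_bytes_per_row):
            """On-disk size of one key column B-tree: column data plus 20 bytes per input row."""
            return rows * (column_data_bytes_per_row + 20)

        def association_general_cache_bytes(rows, row_data_bytes):
            """On-disk size of the general storage cache: input data plus 10 bytes per row."""
            return rows * (row_data_bytes + 10)

        def association_memory_map_bytes(rows):
            """Maximum internal memory map for association IDs and data rows."""
            return rows * 20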

    Consolidation and Key Generator Transformations

    These transformations use standard Informatica sort transformations. By default, they are configured to give the transformation as much memory as possible without affecting system performance.

    You can set a limit on the amount of main memory the sort transformation uses to sort data. This increases on-disk temporary memory use, as the sort transformation must store all data rows.

    Match Transformation

    The Match transformation can make use of two types of B-tree.

    When you configure the transformation with pass-through ports and for identity matching, both types of B-tree are used. Assume that B-tree storage will not significantly exceed the space used by the data, if the data resides outside the B-tree on the file system.

    Data Quality Table Guidelines

    Informatica Data Quality uses the following types of proprietary database table:

    Reference data tables

    Exception management tables

    Reference Database Tables

    Data Quality uses reference tables to enable operations such as standardization, labeling, and parsing. Each reference data set is carried in a table and has a size in the database equivalent to its on-disk size. Use the following formulas to calculate reference data table size:

    Assumption: Columns Have the Same Average Data Size

    number of data rows x number of columns x number of characters per column

    Assumption: Columns Have Different Average Data Sizes

    number of data rows x (characters in column 1 + characters in column 2 + ... + characters in column n)
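    The following sketch applies both formulas, assuming that one character occupies one byte in the database; adjust the figures for multi-byte encodings. The function names are illustrative.

        def reference_table_db_bytes_uniform(rows, columns, chars_per_column):
            """Database size when every column has the same average data size."""
            return rows * columns * chars_per_column

        def reference_table_db_bytes(rows, chars_per_column):
            """Database size when columns have different average data sizes."""
            return rows * sum(chars_per_column)

        print(reference_table_db_bytes_uniform(10_000, 5, 25))          # 1,250,000 bytes
        print(reference_table_db_bytes(10_000, [30, 20, 25, 25, 25]))   # 1,250,000 bytes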

    Exception Management Tables

    You can examine database tables for bad-quality or duplicate records in the Analyst tool. The table must contain columns that are recognized by the Analyst tool.

    Some columns contain data that the Analyst tool can use as a filter. Other columns contain values that indicate the records you want to write back to the source database. The columns must be in place before you import the database table to the staging database.

    Some column names are case-sensitive, and you must enter them in uppercase. The names of columns that hold source data are not case-sensitive.

    You must also verify that your source data does not contain columns with reserved names.


    Bad Record Table

    The Bad Record table contains all potential bad-quality records that a mapping writes to the exception channel. Each row contains the original data plus a set of control columns.

    The control column structures are as follows:

    ID (numeric)

    RECORD_STATUS (varchar, 20)

    UPDATED_STATUS (varchar, 20)

    ROW_IDENTIFIER (numeric)

    COL* (user-defined text)

    Issue Table

    This table allows multiple issues to be created for a single column. The table contains a row of data for each issue in a column cell.

    For the following row in a Bad Record table, the Issue table contains seven rows:

    Table 1. Bad Record Table, Row 1

    Column 1        Column 2        Column 3
    Two issues      Four issues     One issue

    The Issue table format is as follows:

    ID (numeric)

    ISSUE* (user-defined text)

    Duplicate Record Table

    The Duplicate Record table contains all potential duplicate records that a mapping writes to the exception channel. Each row contains the original data plus a set of control columns.

    The control columns are as follows:

    ID (numeric)

    CLUSTER_ID (decimal 20,2)

    MATCH_SCORE (decimal 20,2)

    IS_MASTER (char 1)

    UPDATED_STATUS (varchar, 20)

    Audit Table

    An Audit table is an Informatica system table that is associated with a Bad Record table. It contains a summary table and a detail table.

    When you edit the Bad Record table, the Audit table records the change with a line of detail for each row-level change and an issue line for each column change. For example, if you make three changes to a record in the Bad Record table, the Audit table updates the detail table with three rows of data that contain the before and after states of the data.


    The summary table and detail table are defined as varchars. The size of each detail row depends on the contents of the old and new column values and the column name. The size of each summary row depends on the size of the user comments and the user name.


    The following table lists the reference data file sizes:

    United States Batch / Interactive 533 MB

    United States GeoCoding 422 MB

    United States FastCompletion 380 MB

    Total disk usage added by this mapping = 0

    Memory usage = 533 MB + 422 MB + 380 MB = 1.3 GB

    User 3: Standardization Mapping

    This user runs a standardization mapping on a data source that contains 10 million rows.

    The mapping has minimal transformations and loads ten reference tables. The memory considerations are as follows:

    Each reference table has 10,000 rows with five columns and 25 bytes average data per column

    Total disk usage added by this mapping = 0

    Memory usage per reference table = 10,000 x 5 x 25 x 1.3 = approximately 1.6 MB

    Total memory usage for the ten reference tables = 16 MB

    User 4: Association Mapping

    This user runs a single-source mapping on a data source that contains 10 million rows and uses an Association transformation that reads eight groups.

    This mapping has no matching transformations. It sources data directly from a table. Each Association key column has a 10-byte key, and there are 10 additional columns of row data, each 50 bytes wide.

    The memory considerations are as follows:

    Each key column B-tree requires 300 MB of storage: 10M x (10 + 20)

    General storage requirement = 10M x ((8 x 10) + (10 x 50)) = 5.8 GB

    Total disk usage added by this mapping = 300 MB x 8 columns + 5.8 GB = 8.2 GB

    Total memory usage = 10M x 20 bytes = 200 MB
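    As a rough cross-check, the same figures follow from the Association transformation formulas earlier in this chapter. This sketch is illustrative only; it uses decimal units, as the example does, and, like the example, treats the additional 10 bytes per row in the general storage cache as negligible.

        ROWS = 10_000_000
        KEY_COLUMNS, KEY_BYTES = 8, 10
        DATA_COLUMNS, DATA_BYTES = 10, 50

        # One B-tree per key column: key data plus 20 bytes for each input row.
        column_btree = ROWS * (KEY_BYTES + 20)                            # 300 MB per key column

        # General storage cache: size of the input data set. The extra 10 bytes
        # per row (about 100 MB here) is treated as negligible in the example.
        row_bytes = KEY_COLUMNS * KEY_BYTES + DATA_COLUMNS * DATA_BYTES   # 580 bytes per row
        general_cache = ROWS * row_bytes                                  # 5.8 GB

        disk_total = KEY_COLUMNS * column_btree + general_cache           # 8.2 GB
        memory_total = ROWS * 20                                          # 200 MB

        print(disk_total / 1e9, "GB;", memory_total / 1e6, "MB")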

    Additional Memory and Disk Usage

    All users run their mappings concurrently. This has the following impact:

    Disk usage = 165 MB + 550 MB + 8,200 MB = 8.7 GB

    Memory usage = 10 MB + 1,300 MB + 16 MB + 200 MB = 1.5 GB


    APPENDIX B

    Address Validation Reference Data

    The address validation reference data that you install can take up sizable disk space. This appendix lists the largest address validation reference data files and their sizes.

    Address Validation Reference Data with On-Disk Size

    The following table lists the largest address validation reference data files.

    Country Validation Type Size

    United States Batch/Interactive 533 MB

    United Kingdom FastCompletion 501 MB

    United States GeoCoding 422 MB

    United States FastCompletion 380 MB

    United Kingdom Batch/Interactive 306 MB

    France FastCompletion 210 MB

    France Batch/Interactive 153 MB

    Argentina FastCompletion 120 MB

    Brazil FastCompletion 104 MB

    Germany FastCompletion 102 MB

    Germany Batch/Interactive 99 MB

    United Kingdom Supplementary 94.5 MB

    Italy FastCompletion 92.9 MB

    Argentina Batch/Interactive 90 MB

    Canada FastCompletion 83.1 MB


    India FastCompletion 83.1 MB

    India Batch/Interactive 80 MB

    Germany GeoCoding 73.5 MB

    Brazil Batch/Interactive 73.3 MB

    Italy Batch/Interactive 66 MB

    Canada Batch/Interactive 61.8 MB

    United Kingdom GeoCoding 51.8 MB

    Sweden FastCompletion 49 MB

    Mexico FastCompletion 48.5 MB

    Australia FastCompletion 44.6 MB

    Russian Federation FastCompletion 44.3 MB

    Mexico Batch/Interactive 42.8 MB

    Australia Batch/Interactive 40.9 MB

    Russian Federation Batch/Interactive 40.5 MB

    France GeoCoding 39.7 MB

    Portugal FastCompletion 38.8 MB

    Italy GeoCoding 36.6 MB

    Netherlands FastCompletion 35.5 MB

    Canada GeoCoding 32.7 MB

    China FastCompletion 28.4 MB

    Netherlands Batch/Interactive 27.8 MB

    Sweden Batch/Interactive 27.4 MB

    Spain GeoCoding 25.6 MB

    Australia GeoCoding 25.4 MB

    Spain FastCompletion 23.7 MB

    Chile FastCompletion 23.4 MB

    Netherlands GeoCoding 22.7 MB


    Portugal Batch/Interactive 22.5 MB

    China Batch/Interactive 21.4 MB

    Finland GeoCoding 18.8 MB

    Switzerland FastCompletion 18.2 MB

    Sweden GeoCoding 17.8 MB

    Chile Batch/Interactive 16.8 MB

    Belgium FastCompletion 16.1 MB

    Spain Batch/Interactive 15.4 MB


    APPENDIX C

    Identity Population Data

    The following table lists identity population data files and their sizes.

    File Name Size

    IM_cyrillic.zip 3 MB

    IM_arabic_r.zip 3 MB

    IM_singapore.zip 2 MB

    IM_india.zip 2 MB

    IM_chinese_t.zip 2 MB

    IM_aml.zip 2 MB

    IM_greek_l.zip 2 MB

    IM_switzerland.zip 2 MB

    IM_france.zip 2 MB

    IM_philippines.zip 2 MB

    IM_luxembourg.zip 2 MB

    IM_belgium.zip 2 MB

    IM_germany.zip 2 MB

    IM_brasil.zip 2 MB

    IM_portugal.zip 2 MB

    IM_korean_r.zip 2 MB

    IM_italy.zip 1 MB

    IM_turkey.zip 1 MB

    IM_hk_r.zip 1 MB

    IM_sweden.zip 1 MB

    IM_czech.zip 1 MB

    IM_netherlands.zip 1 MB

    IM_taiwan_r.zip 1 MB

    IM_denmark.zip 1 MB

    IM_slovakia.zip 1 MB

    IM_malaysia.zip 1 MB

    IM_thai_r.zip 1 MB


    IM_spain.zip 1 MB

    IM_chinese_r.zip 1 MB

    IM_colombia.zip 1 MB

    IM_argentina.zip 1 MB

    IM_indo_chin_r.zip 1 MB

    IM_chile.zip 1 MB

    IM_peru.zip 1 MB

    IM_vietnam_r.zip 1 MB

    IM_puerto_rico.zip 1 MB

    IM_mexico.zip 1 MB

    IM_thai.zip 1 MB

    IM_finland.zip 1 MB

    IM_norway.zip 1 MB

    IM_poland.zip 1 MB

    IM_greek.zip 1 MB

    IM_hungary.zip 1 MB

    IM_estonia.zip 1 MB

    IM_korean.zip 1 MB

    IM_ofac.zip 1 MB

    IM_hebrew.zip 1 MB

    IM_chinese_i.zip 1 MB

    IM_arabic.zip 1 MB