Date post: | 22-Dec-2015 |
Category: |
Documents |
View: | 220 times |
Download: | 0 times |
1
A Comparative Study between ETL and E-LT approaches for loading data into a Data Warehouse
Vikas Ranjan
CSCI 693
2
Agenda
Introduction Extract, Transform and Load (ETL) Approach Strengths/ Weaknesses of ETL Extract, Load and Transform (E-LT) Approach Strengths/ Weaknesses of E-LT Experiments Results Conclusion/ Future Work References Q & A
3
Introduction All Business Intelligence applications are data-centric.
Large volumes of data is stored and processed along with its history in the Data Warehouse.
The Data is extracted from various heterogeneous source systems and transformed as per the business requirements.
Data is used for analytical purposes- future forecasting, profitability analysis, trend analysis etc. to drive the business.
4
Extract, Transform and Load (ETL)
Traditional approach of loading data into data warehouses. Data is first pulled/ pushed from various heterogeneous sources
like ERP, CRM, RDBMS, Flat Files. Business rules are applied on the data in the Staging Area. Transformed data is loaded into target database. Often designed backward, thus only the relevant data is fetched.
5
Strengths/ Weaknesses of ETL
Strengths Can perform complex operations in single data flow diagrams. Mostly designed backward, thus only relevant data is loaded. Used for building real-time data warehouses. Robust tools are available like Informatica, Ab Initio, Data Stage .
Weaknesses Data transformation step of ELT is performed by ETL engine,
therefore increasing processing time. Data is moved over the network twice. Since it is developed backward, more effort for future redesign.
6
Extract, Load and Transform (E-LT)
Newer Approach of loading into target data warehouse. The Data is extracted from various sources same way it is done via
ETL. This extracted data is loaded directly into target data warehouse. The transformations and complex business rules are applied by
native SQL drivers. The processing is done by database engine rather than ETL engine.
7
Strengths/ Weaknesses of E-LT
Strengths Since all the data is available, the future changes can be easily
incorporated. Once the data is loaded on the target platform, all transformations are
placed on the RDBMS engine. This reduces network congestion. Provides optimal performance as no extra hardware needed.Weaknesses E-LT suffers from a limited availability of tools like Informatica
Pushdown, Data Integrator etc. Does not work well for complex business cases. E-LT cannot be used to design near real-time enterprise data
warehouse.
8
Tools and Data Set
Software / Hardware
OS Platform: Sun M9000 Server (www.sun.com). Relational Database Management System: Teradata V2R6.2.
1 (www.teradata.com). ETL/E-LT Tool: Informatica PC 8.6.1 Hot Fix 9 Advanced E
dition (www.informatica.com).
Data Set All the data used is experiments is obtained from the test data
base a telecom company. All the experiments were conducted using Informatica as ETL
and E-LT tool.
9
Experiment 1
ETL vs. E-LT (Full Pushdown): Informatica has recently introduced both ETL and E-LT capabilities in its Power Center tool
Job Name: Tax _Write_Off Job Same Job was developed using ETL approach as well as E-LT
approach. Informatica pushed down all the code processing to the RDBMS
engine.
10
Experiment1 contd.
ETL Job: Informatica Server handled all the code processing and generated its own internal SQL
11
Experiment 1 contd.
E-LT Job ( Full Pushdown): In the full pushdown, Informatica Server pushed all the code to RDBMS engine and Informatica Server worked purely as an E-LT tool
12
Experiment 1 Results
Approximately Five Times more performance gain using E-LT Full Pushdown
Approach Data Read (rows)
Data Load (rows)
Runtime
(second)
Throughput
(rows/sec)
Memory CPU
ETL 32044 32044 110 291 DB/ETL Server
DB/ETL Server
E-LT Full 32044 32044 6 5340 DB Memory
DB CPU
13
Experiment 2
ETL vs. E-LT (Target Pushdown):
Job Name: Tax_Extract_ETL_tst Job Same job was developed using ETL approach as well as E-LT approach with Target Pushdown only The Pushdown to database engine happened at Target database only
14
Experiment 2 contd.
ETL Job: Informatica Server handled all the code processing and generated its own internal SQL
15
Experiment 2 contd.
E-LT Job (Target Pushdown): Informatica Server pushed code processing on target database to RDBMS engine
16
Experiment 2 Results
No Performance Differences since the Pushdown was on Target Database only
Approach Data Read (rows)
Data Load (rows)
Runtime
(second)
Throughput
(rows/sec)
Memory CPU
ETL 154271 462813 1879 246 ETL/Partial DB
ETL/Partial DB
E-LT Target 154271 462813 1887 245.26 DB/E-LT Server
DB/E-LT Server
17
Experiment 3
ETL vs. E-LT (Source Pushdown):
Job Name: Tax_Extract_ETL Job Same job was developed using ETL approach as well as E-LT approach with Source Pushdown only The Pushdown to database engine happened at Source database only
18
Experiment 3 contd.
ETL Job: Informatica Server handled all the code processing and generated its own internal SQL
19
Experiment 3 contd.
E-LT Job (Source Pushdown): Informatica Server pushed down code processing on source side to RDBMS engine
20
Experiments 3 Results No Performance Differences as Pushdown was on Source Database only Source Pushdown E-LT does not work with Teradata Sequence Generator
Approach Data Read (rows)
Data Load (rows)
Runtime
(second)
Throughput
(rows/sec)
Memory CPU
ETL 154271 154271 948 162 ETL/Partial DB ETL/Partial DB
E-LT Source
154271 154271 947 162 DB/E-LT Server DB/E-LT Server
21
Results Significant performance gains were obtained using full pushdown E-
LT over ETL approach.
No performance gains using both source and target pushdown of E-LT approach and ETL approach as both used ETL Server and Database Engine resources( memory and CPU).
Code changes were needed to redesign existing ETL jobs to use E-LT power of database engine.
E-LT does not work in building Real Time Data warehouses
22
Conclusion and Future Work ETL works well for very complex transformations and
active data warehouses
E-LT works well for small and medium-sized data marts and when the source and target are on the same database platform only
Future: Building data warehouse solutions using hybrid approach (combination of ETL and E-LT processes)
.
23
References
.
[1] A. Simitsis, P. Vassiliadis, T. Sellis “Optimizing ETL Processes in DataWarehouses,” in Proc. 21st International Conference on Data Engineering, 2005, (ICDE 2005), pp. 564-575. [2] G.X. Zhou, Q.S. Xie, Y. Hu, “E-LT Integration to Heterogeneous Data Information for SMEs Networking based on E-HUB,” in Proc. Fourth International Conference on Natural Computation, 2008, IEEE, pp. 212-216.[3] I. William, S. Derek, and N. Genia, DW 2.0: The Architecture for the Next Generation of Data Warehousing. Burlington, MA: Morgan Kaufman, 2008, pp. 215-229.[4] R. J. Davenport, September 2007. [Online] ETL vs. ELT: A Subjective View. InSource IT Consulting Ltd., U.K. Available at: http://www.insource.co.uk/pdf/ETL_ELT.pdf[5] L. Troy, C. Pydimukkala, How to Use PowerCenter with Teradata to Load and Unload Data, Informatica Corporation [Online], Available at: www.myinformatica.com
24
Questions ???