Date posted: 10-Jun-2015
1
Data Ingestion, Extraction, and Preparation for Hadoop
Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace
2
Safe Harbor Statement
• The information being provided today is for informational purposes only. The
development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
• Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development.
• All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
• Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
3
The Hadoop Data Processing Pipeline — Informatica PowerCenter + PowerExchange (Available Today)

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Customer Profile, Product & Service Offerings, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Targets: Sales & Marketing Data Mart, Customer Service Portal
Roadmap timeframe: 1H 2012
4
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, file copy, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, Pig/Hive UDFs, MR
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)
5
Unleash the Power of Hadoop
With High-Performance Universal Data Access
Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats, …
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
Industry Standards: EDI-X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, Cargo IMP, MVR
XML Standards: ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
Social Media: Facebook, Twitter, LinkedIn
MPP Appliances: EMC/Greenplum, Vertica, AsterData
6
Ingest Data

Sources: Web server; ERP, CRM; Databases, Data Warehouse; Message Queues, Email, Social Media; Mainframe
Access Data (PowerExchange): Batch, Real-time, CDC
Pre-Process (PowerCenter): e.g. Filter, Join, Cleanse — reuse PowerCenter mappings
Ingest Data into: HDFS, Hive
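The three-stage ingest flow above (access, pre-process, land in Hadoop) can be sketched in plain Python. This is only an illustration of the pattern, not Informatica's implementation; the record fields, the "test account" filter, and the tab-delimited landing format are all invented for the example:

```python
import csv
import io

def cleanse(record):
    """Cleanse step: trim whitespace and normalize the email to lowercase."""
    rec = {k: v.strip() for k, v in record.items()}
    rec["email"] = rec["email"].lower()
    return rec

def pre_process(records):
    """Pre-process: filter out test accounts, then cleanse each record."""
    for rec in records:
        if rec["account_type"].strip() != "test":   # filter step
            yield cleanse(rec)                      # cleanse step

def to_delimited(records, fieldnames):
    """Serialize records to a tab-delimited payload ready to land in HDFS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()

raw = [
    {"id": "1", "email": " Alice@Example.COM ", "account_type": "paid"},
    {"id": "2", "email": "bot@example.com", "account_type": "test"},
]
payload = to_delimited(pre_process(raw), ["id", "email", "account_type"])
```

Doing this filtering and cleansing before ingest is what keeps garbage out of the cluster; the same logic, expressed as a PowerCenter mapping, is what the slide means by "reuse PowerCenter mappings".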
7
Extract Data

Extract Data from: HDFS
Extract Data (PowerExchange): Batch
Post-Process (PowerCenter): e.g. transform to target schema — reuse PowerCenter mappings
Deliver Data to: Web server; ERP, CRM; Databases, Data Warehouse; Mainframe
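The extract flow above mirrors ingest in reverse: pull results off HDFS, post-process them into the target schema, then deliver. A minimal sketch of the post-process step, with column names invented for illustration (the deck does not specify a schema):

```python
def to_target_schema(row):
    """Post-process: map a raw Hadoop output row (all strings) onto the
    target mart's typed columns. Names here are hypothetical."""
    return {
        "customer_id": int(row["cust"]),
        "total_spend": round(float(row["spend"]), 2),
        "segment": row["seg"].upper(),
    }

# Rows as they might come back from a Hadoop job: untyped, terse field names.
hadoop_rows = [
    {"cust": "42", "spend": "1234.5", "seg": "smb"},
]
mart_rows = [to_target_schema(r) for r in hadoop_rows]
```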
8
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Create & Load Into Hive Table
9
The Hadoop Data Processing Pipeline — Informatica HParser (Available Today)

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Customer Profile, Product & Service Offerings, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Targets: Sales & Marketing Data Mart, Customer Service Portal
Roadmap timeframe: 1H 2012
10
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, file copy, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, Pig/Hive UDFs, MR
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)
11
Informatica HParser
Productivity: Data Transformation Studio
12
Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0/Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
B2B Standards: UN/EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
Other: IATA-PADIS, PLMXML, NEIM
• Easy, example-based visual enhancements and edits
• Definitions use business (industry) terminology
• Enhanced validations
• Out-of-the-box transformations for all messages in all versions
• Updates and new versions delivered by Informatica
Informatica HParser
Productivity: Data Transformation Studio
13
Informatica HParser
How does it work?
Hadoop cluster (HDFS), with the HParser engine deployed from the service repository

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

1. Develop an HParser transformation
2. Deploy the transformation
3. Run HParser on Hadoop to produce tabular data
4. Analyze the data with Hive / Pig / MapReduce / other tools
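Step 3 — turning raw files into tabular data — is the essence of what a parser like this does. A toy stand-in in Python makes the idea concrete; the web-log format and field names below are assumptions for illustration, not HParser's actual output:

```python
import re

# Common-log-style pattern; the group names are invented for this sketch.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+)'
)

def parse_line(line):
    """Turn one raw log line into a tab-separated row (None if unparseable).
    Tab-separated output is directly loadable into a Hive table."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return "\t".join(m.group("ip", "ts", "method", "path", "status"))

line = '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200'
row = parse_line(line)
```

Once every input file has been reduced to rows like this, the Hive/Pig/MapReduce analysis in step 4 can treat the data as an ordinary table.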
14
The Hadoop Data Processing Pipeline — Informatica Roadmap (1H 2012)

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Customer Profile, Product & Service Offerings, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
Targets: Sales & Marketing Data Mart, Customer Service Portal
Legend: Available Today vs. Roadmap (1H 2012)
15
Options

Structured (e.g. OLTP, OLAP)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
• Parse & Prepare Data: N/A
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)

Unstructured, semi-structured (e.g. web logs, JSON)
• Ingest/Extract Data: Informatica PowerCenter + PowerExchange, file copy, Flume, Scribe, Kafka
• Parse & Prepare Data: Informatica HParser, Pig/Hive UDFs, MR
• Transform & Cleanse Data: Hive, Pig, MR (future: Informatica roadmap)
16
Informatica Hadoop Roadmap – 1H 2012
• Process data on Hadoop
  • IDE, administration, monitoring, workflow
  • Data processing flow designed through the IDE: source/target, filter, join, lookup, etc.
  • Execution on the Hadoop cluster (pushdown via Hive)
• Flexibility to plug in custom code
  • Hive and Pig UDFs
  • MR scripts
• Productivity with optimal performance
  • Exploit Hive performance characteristics
  • Optimize the end-to-end data flow for performance
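"Pushdown via Hive" means the visually designed data flow is translated into HiveQL and executed by the cluster, rather than run row-by-row in a separate engine. A toy translator sketches the idea; the mapping structure, table names, and filter are all invented for illustration:

```python
def to_hiveql(mapping):
    """Translate a tiny logical mapping (source, optional filter, target)
    into the HiveQL statement that would run on the cluster."""
    sql = f"INSERT INTO {mapping['target']} SELECT * FROM {mapping['source']}"
    if mapping.get("filter"):
        sql += f" WHERE {mapping['filter']}"   # filter pushed into the query
    return sql + ";"

mapping = {"source": "StockAnalysis0", "target": "STG0", "filter": "volume > 0"}
hive_sql = to_hiveql(mapping)
```

The payoff of this approach is that the designer works at the logical level while Hive's optimizer and MapReduce's parallelism do the heavy lifting.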
17
Mapping for Hive Execution

• Logical representation of processing steps (source → transformations → target)
• Validate & configure for Hive translation
• Preview the generated Hive code:

INSERT INTO STG0 SELECT * FROM StockAnalysis0;
INSERT INTO STG1 SELECT * FROM StockAnalysis1;
INSERT INTO STG2 SELECT * FROM StockAnalysis2;
18
Takeaways
• Universal connectivity
  • Completeness and enrichment of raw data for holistic analysis
  • Prevent Hadoop from becoming another silo accessible only to a few experts
• Maximum productivity
  • Collaborative development environment
  • Right level of abstraction for data processing logic
  • Reuse of algorithms and data flow logic
• Metadata-driven processing
  • Documented data lineage for auditing and impact analysis
  • Deploy on any platform for optimal performance and utilization
19
Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys
Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™.
Objectives:
• What are "they" saying?
• Gauge the level of sentiment
• Fanatical Support™ for the win:
  • Increase NPS
  • Increase MRR
  • Decrease churn
  • Provide the right products
  • Keep our promises
20
Customer Sentiment Use Cases
Pulling it all together

Case 1: Match social media posts with a customer; determine a probable match.
Case 2: Determine the sentiment of a post by searching key words and scoring the post.
Case 3: Determine correlations between posts, ticket volume, and NPS that lead to negative or positive sentiment.
Case 4: Determine correlations between sentiment and products/configurations which lead to negative or positive sentiment.
Case 5: The ability to trend all inputs over time…
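Case 2 above is the most mechanical of the five use cases. A minimal sketch of keyword-based scoring, with an invented lexicon (a production scorer would handle negation, weights, and a far larger vocabulary):

```python
# Hypothetical keyword lexicon -- not Rackspace's actual word lists.
POSITIVE = {"fanatical", "great", "fast", "love"}
NEGATIVE = {"down", "slow", "outage", "churn"}

def score_post(text):
    """Score a post: +1 per distinct positive keyword, -1 per negative one."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

score = score_post("Love the fast support, but the outage was painful.")
```

At Hadoop scale the same function would run as a Hive/Pig UDF or inside a MapReduce job over the full post archive, feeding the correlation analyses of Cases 3 and 4.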
21
Rackspace Fanatical Support™
Big Data Environment

Data Sources (DBs, flat files, data streams):
• Databases: Oracle, MySQL, MS SQL, Postgres, DB2
• Files: Excel, CSV, flat files, XML
• Other: EDI, binary, sys logs, messaging, APIs
Ingestion: message bus / port listening
Storage & processing: Hadoop HDFS (search, analytics, algorithmic); Greenplum DB
Analytics: BI stack and BI analytics — Direct Analytics over Hadoop, and Indirect Analytics over Hadoop
22
Twitter Feed for Rackspace
Using Informatica
[Screenshots: input data and output data]
23