Post on 12-Apr-2017
Put your data to work with Big Data services from AWS and Informatica
Data is growing
1.7MB of new data will be created every second for every human being on the planet by 2020
http://www.whizpr.be/upload/medialab/21/company/Media_Presentation_2012_DigiUniverseFINAL1.pdf
58% compound annual growth rate, surpassing $1 billion by 2020, forecasted for the Hadoop market
http://www.ap-institute.com/big-data-articles/big-data-what-is-hadoop-%E2%80%93-an-explanation-for-absolutely-anyone.aspx
http://www.marketanalysis.com/?p=279
<0.5% of all data is ever analyzed and used at the moment
http://www.technologyreview.com/news/514346/the-data-made-me-do-it/
Big Data is for everyone
The market for Big Data technologies is growing more than six times faster than the information technology market as a whole…
…and those companies who use their data well win.
Why AWS for Big Data?
Immediately Available
Broad and Deep Capabilities
Trusted and Secure
Scalable
Collect, Store, Analyze, and Visualize
It’s easy to get data to AWS, store it securely, and analyze it with the engine of your choice, without any long-term commitment or vendor lock-in
Collect
Import/Export
Snowball
Direct Connect
VM Import/Export
Store
Amazon S3
EMR
Amazon Glacier
Amazon Redshift
DynamoDB
Analyze
Amazon Kinesis
Lambda
EMR
EC2
Aurora
AWS provides the most complete platform for Big Data
What can you do with Big Data on AWS?
Big Data Repositories
Clickstream Analysis
ETL Offload
Machine Learning
Online Ad Serving
BI Applications
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools, Hadoop, Machine Learning, Streaming
Analysis in-line with process flows
Pay as you go, grow as you need
Managed availability & DR
Enterprise Big Data SaaS
The cloud can be made more secure than on-premises
High speed redundant direct connect lines
Load billions of rows in minutes
All data in private VPC
All data encrypted with private on-premises hardware keys
Encryption of data, transport, backups, partial spills
Audit of all SQL actions
Audit of all configuration changes
Data warehouses can support real-time data
Big data does not mean batch
Can be streamed in
Can be processed in near real time
Can be used to respond quickly to requests
You can mix and match on-premises and cloud
Custom development and managed services
Infrastructure with managed scaling, security
Hybrid Cloud Data Management with AWS and Informatica
Presented by Andrew McIntyre
Agenda
The IT Landscape and how it is changing
IT challenges with Hybrid Cloud Architecture
Customer success story with SendGrid
How Informatica can help customers migrate to Hybrid Architecture
Why choose Informatica?
IT Landscape is Changing…
Why Enterprises are Adopting Cloud Architecture
Business agility requires IT agility
Cloud economics pay off in a big way
Focus on core competencies & unique value
Hybrid Cloud is Common Approach
On-Premises:
ERP & On-Premises Apps
Traditional Relational Databases
Traditional Data Warehouse
+
Cloud:
Amazon Redshift
Defining Hybrid Cloud Data Management
Integrate, Cleanse, Govern, Master, Secure
Integrating data from:
On-premises databases, data warehouses, apps with SaaS applications
With Public cloud: AWS
Data Management Challenges in Hybrid Cloud Architecture
Connectivity
Many Data Systems: Cloud & On-Prem
Reuse work across systems
Secure connection
Data Visibility
Complex data flows, less comprehension
Quality, governance, security, regulation, audits, mastering
Scalability
Support large data volume
Match infinite capacity in cloud platform
Operational Control
Monitor & manage data in production
Ensure operational success
Monitor end-to-end business process
Informatica + AWS Use Cases
Lift and Shift: Moving on-premises databases, systems and/or DW to AWS-based workloads
Hybrid App Integration: Integrate on-premises and cloud apps with Informatica Cloud. Also known as iPaaS (integration Platform-as-a-Service)
Hybrid Data Warehousing: Load multiple data sources from cloud and/or on premise to AWS using Informatica Cloud
+
Lift and Shift your Workloads
Cloud
On premise
Use Case Summary: Moving on-premises databases, systems and/or data warehouse to AWS-based workloads
Amazon Redshift
On-premises Data Warehouse
Other Databases
Your Data Integration Platform
Firewall
Amazon RDS
Amazon Aurora
Hybrid App Integration
Use Case Summary: Integrate on-premises and cloud apps with Informatica Cloud. Also known as iPaaS (integration Platform-as-a-Service)
Cloud
On premise
Data Warehouse
On-premises Apps
Firewall
Amazon RDS
Amazon Redshift
Your Data Integration Platform
On-premises Data Warehouse
Other Databases
Hybrid Data Warehousing
Use Case Summary: Load multiple data sources from cloud and/or on premise to AWS using Informatica Cloud
On-premises Data Warehouse
Your Data Integration Platform
ERP, on-premises Apps
Traditional Relational Databases
Social Media
Logs
IoT
Analytics Tools
Cloud
On premise
Firewall
Amazon RDS
Amazon Redshift
Informatica Cloud for Amazon Web Services
Amazon DynamoDB
Amazon EMR
Amazon S3
Amazon Redshift
Amazon Aurora
Amazon RDS
Informatica Cloud provides native connectivity to Amazon Web Services for scalable, high-performance integration with any cloud and on-premises data source.
Informatica Cloud and Amazon Redshift
Seamless integration with any data system on cloud and on-prem
Native, high performance data integration and synchronization
The only solution to provide “Upsert” functionality
Step by step integration wizards for non-technical users
Advanced point and click integration workflows for technical users
Hybrid Data Warehousing
An Informatica Case Study
SendGrid: Company Background
Founded in 2009, after graduating from the TechStars program, SendGrid developed an industry-disrupting, cloud-based email service to solve the challenges of reliably delivering emails on behalf of growing companies. Like many great solutions, SendGrid was born from the frustration of three engineers whose application emails didn’t get delivered, so they built an app for email deliverability. Today, SendGrid’s reliable email platform delivers over 25 billion transactional and marketing emails each month on behalf of many of your favorite brands, including Uber, Airbnb, Spotify, Foursquare and NextDoor.
Business and Technical Requirements
Emphasis for the architecture was speed over accuracy, sustainability and growth.
As a result, the architecture was already hitting the limitations of its design.
Architecture Issues
Prior to my joining the company, SendGrid had already committed to using MySQL for a new data warehouse build. The SendGrid Data Warehouse architecture that was underway did not follow a formal data warehousing methodology. It was built specifically to support the BI tool and its features and limitations. This resulted in an architecture that does not follow many of the industry-standard data warehousing best practices.
Business and Technical Requirements
Our small team is responsible for the strategic direction, design, delivery and availability of business data for corporate-wide utilization in measuring performance, business outcomes and decision making capabilities. Data and analytics need to be provided in various ways and formats through effective and efficient delivery methods.
To accomplish this, the team was tasked with building a new data warehouse. We planned to start on our main data source, which houses our email event and customer information.
Business Needs for Data & Analytics
Meet the Team:
Director, Enterprise Data Operations
Data Warehouse Architect / ETL Developer
BI Developer
Business Systems Analyst
Technical Requirements
Data Warehouse Assessment Needed
Evaluate the overall data warehouse architecture and suggest required changes and improvements to:
Database technology, design and work products
ETL tool, design and work products
Database Technology:
Nimble
Cost effective
Meets storage and capacity needs
Allows the team to be self-sufficient without reliance on other teams’ skill sets
ETL Tool:
Mature ETL tool to leverage for data warehousing
Data Warehouse Assessment
Database Technology Options
Data Warehouse Assessment
The Findings: Overview
Confirmed assumption that utilizing MySQL was not sustainable as a database technology
Switch to a technology that better aligns with a data warehouse infrastructure: Amazon Redshift selected
Mature ETL tool is needed for data warehousing while providing a user-friendly tool for business communities
Informatica selected to load data into Amazon Redshift from multiple data sources, cloud, and on-prem while supporting citizen integrators.
Data Warehouse and Analytics Conceptual Architecture
Data Sources (cloud or on-premises): Marketo, SalesForce, Zuora, Mail db, Zendesk, Jira, Segment*
ETL: Informatica
Enterprise Data Warehouse (Amazon Redshift)
Acquisition Layer: Raw Data
Core Layer: Mapping Schemas
Publishing Layer: Clean Data/Metadata, Dimensional Data
Time, Revenue, Customer, SalesForce, Product Volume Usage, Product Usage
Test and Learn, Campaigns
Hadoop Cluster
Data Mining, Benchmark Data
Analytics Tools: Reporting/Analytics, Dashboards, Export Data
Project Outlook
The project is still in the early stages of the data warehouse build. We have set up our Amazon Redshift instance for the data warehouse and have started sourcing data from six sources, a mix of both cloud and on-premises.
We are actively using the Informatica data integration portfolio in a hybrid architecture to support ETL integration.
By the end of 2016, we will have enough data from multiple data sources in the Amazon Redshift data warehouse and our BI tool to roll out self-service analytics with a foundational view of customer, product, revenue, and email volume and usage data.
We are confident that with this approach we have set ourselves up for success in a nimble, scalable, cost-effective manner to rapidly enable business-driven insights for SendGrid!
How Informatica Can Help
Connectivity
High-performance out-of-the-box native connectors to any data system
Abstraction layer enables reuse
Secure
Data Visibility
Metadata-driven visual design: visibility into data flows across cloud and on-prem
Metadata: the foundation of quality, governance, security, mastering
Scalability
Inherently designed for performance at scale
iPaaS offers infinite integration capacity and bursting
Operational Control
Single point of control for production data across cloud and on-prem
Admin can monitor production data flows and flag issues early
Informatica Addresses Data Management Challenges
Hundreds of Connectors For Every Type of Data Source
Sales & Service
Big Data
Human Resources
Web Protocols & API
ERP & Financials
B2B
Marketing
Social
IT & Admin
Analytics
Informatica’s 3 Key Differentiators
1 CONNECT: Unlock your data
Hundreds of out-of-box connectors for cloud and on-prem data sources
2 DEVELOP: UI maximizes productivity for developers & citizen integrators
Visual data mapping
Out of box templates & wizards
Easy to use & highly reusable
3 DEPLOY: Scale with Performance
Optimized to process the largest data volumes
Pushdown Optimization
Automated
Informatica Product Portfolio for Hybrid Cloud Management
Cloud Test Data Management
Cloud Application Integration
Cloud Data Integration
Data as a Service
Cloud Customer 360
Amazon Redshift Upsert – Manual Coding Method
1. Extract the data from source
2. Put into flat files and compress
3. Transfer Compressed Files To S3
4. Wait for S3 Consistency
5. Create Staging Table in Redshift
6. Copy Data From S3 Into Staging Table
7. Inner Join With Target Table To Delete Rows To Be Updated
8. Insert Updated Rows From Staging Table
9. Delete Staging Table
10. Delete Files From S3
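As a rough illustration of what steps 5 to 9 of the manual method involve, the sketch below builds the SQL a hand-coded staging-table upsert might send to Amazon Redshift. It is not taken from the presentation: the function name `build_upsert_sql`, the table names, the bucket, and the IAM role are all hypothetical, the flat files are assumed to be gzip-compressed and pipe-delimited, and identifiers are interpolated without any escaping, so treat it as a sketch only.

```python
def build_upsert_sql(target, staging, key_columns, s3_uri, iam_role):
    """Return, in order, SQL statements for a staging-table upsert (steps 5-9)."""
    # Rows in the target that share a key with a staged row are the ones to replace.
    join_condition = " AND ".join(
        f"{target}.{col} = {staging}.{col}" for col in key_columns
    )
    return [
        # Step 5: staging table with the same columns as the target.
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        # Step 6: bulk-load the compressed flat files from S3.
        f"COPY {staging} FROM '{s3_uri}' IAM_ROLE '{iam_role}' GZIP DELIMITER '|';",
        "BEGIN;",
        # Step 7: join against the target and delete the rows to be updated.
        f"DELETE FROM {target} USING {staging} WHERE {join_condition};",
        # Step 8: insert the new and updated rows from the staging table.
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;",
        # Step 9: drop the staging table.
        f"DROP TABLE {staging};",
    ]


# Hypothetical names, used only to show the shape of the output.
statements = build_upsert_sql(
    target="accounts",
    staging="accounts_stage",
    key_columns=["id"],
    s3_uri="s3://example-bucket/accounts/",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-role",
)
for statement in statements:
    print(statement)
```

Wrapping the delete and insert in one transaction keeps readers from seeing the target table with the old rows removed but the new rows not yet inserted.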
Or, Do It In 3 Simple Steps…
Amazon Redshift Upsert – Informatica Cloud Method
1. Choose Upsert Operation
2. Map Your Fields
3. Run Or Schedule!
Informatica Cloud Amazon Redshift Architecture
Informatica Cloud Secure Agent
Metadata Mappings
Firewall
1 Build mapping and execute job
2 Retrieve Account Data
3 Put Account Data into Flat File(s)
4 Transfer compressed Flat File(s) to S3
5 Initiate copy from S3
6 Load data into Amazon Redshift
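To make the flat-file steps in this flow more concrete, here is a minimal sketch of writing retrieved rows to a delimited flat file and compressing it before transfer. It uses only the Python standard library and does not describe how the Secure Agent works internally: the function name, the pipe delimiter, and the sample account rows are assumptions, and the upload to S3 is left out.

```python
import csv
import gzip
import io


def rows_to_compressed_flat_file(rows, delimiter="|"):
    """Serialize rows as delimited text and return the gzip-compressed bytes."""
    text = io.StringIO()
    writer = csv.writer(text, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


# Made-up account rows standing in for the retrieved account data.
accounts = [(1, "Acme", "active"), (2, "Globex", "trial")]
payload = rows_to_compressed_flat_file(accounts)
print(len(payload), "compressed bytes ready to transfer")
```

Compressing before transfer reduces the amount of data sent across the firewall, and a Redshift copy can read gzip-compressed files directly.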
4,500 iPaaS customers
70+ OEMs with over 1,000 customers
300B transactions per month, 130% growth YoY
1M< integration jobs / processes per day
Next Steps…
Additional Resources
Getting Started – Amazon Web Services
www.informatica.com/products/cloud-integration/connectivity/amazon-connectors.html
4 hour Trial of Specific Use Cases
60 Day Trial of All Functionality
www.informatica.com/products/cloud-integration/connectivity/amazon-connectors/amazon-test-drive.html
Informatica.com Amazon Marketplace