@nmotgi
Nitin Motgi
Accelerating Hadoop Projects with Cask Data Application Platform
PROPRIETARY & CONFIDENTIAL2
• Introduction to data applications
• Challenges with building operational data applications on Hadoop
• Goals and Motivation for CDAP
• Introduction to CDAP and Architecture Overview
• Use-cases
• Building Blocks
  • Datasets
  • Programs
  • Application and Application Template
Agenda
Applications that use data insights to enhance the customers/user experience, achieve a business objective or improve a business process.
What are Data Applications?
• 360-Degree Customer View
• Recommendation Engine
• Predictive Modeling
• Fraud Analysis
• Network Threat Detection
• Telemetry Analysis
• Time Series Analysis
• Data Processing - ETL
• And many more
Examples
Challenges
Technology Explosion
• 2006: Core Hadoop (HDFS, MR)
• 2008: HBase, ZooKeeper, Core Hadoop
• 2009: Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
• 2010: Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
• 2011: Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
• 2012: Spark, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
• Present: Sentry, Tez, Knox, Parquet, YARN, Spark, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
• Application complexity
• Many domains to bridge
• Lots of boilerplate
• Inconsistent APIs
• No reusability
• Lack of developer productivity
Challenges
Application Complexity
Motivation
• Simple yet powerful platform for developers to build applications on Hadoop
• Expose capabilities rather than features
• Make Hadoop accessible to developers with no Hadoop knowledge
Goals
• Unified platform for building solutions on Hadoop
• Simpler application development lifecycle
• Reusable Data and Processing Patterns with Abstractions
• Framework level correctness and consistency
Introduction to Cask Data Application Platform
An open source, integrated, distributed and extensible platform for building data applications on Hadoop.
Cask Data Application Platform
Provides
Supports developers, operations, and organizations through the entire enterprise data application lifecycle.
CASK DATA APP PLATFORM
Data Lifecycle
Ingest
Explore
Transform
Serve
Application Lifecycle
Develop
Test
Deploy
Scale
Enterprise Lifecycle
Secure
Manage
Monitor
Operate
Supports
[Figure: Unification. Ingest, Explore, Transform, and Serve are unified on Datasets. Labels in the diagram: ACID, Streams, Realtime - Tigon, MR, Spark, Ad-hoc query, JDBC, Query, RPC, and Dataset API, SPI & Management Services.]
Application Structure
Deployment Architecture
CDAP Server
• Services
  • Master
  • Router
  • Auth Server
• Highly Available (HA)
• Installed on edge node(s)
• Supports Kerberos - Impersonation & Perimeter Security
• Manages system services in YARN
System Services (Twill Containers)
• Transactions (Tephra)
• Metrics Aggregation
• Log Aggregation
• Dataset Services
• Metadata Management Service
• Explore Service
• Stream Management Service & more
• Reliable and scalable real-time business critical analytics
• Closed Loop Recommendation and Analytics
• Data Ingestion As A Service
• Extendable and Reusable use-case blueprints
• ETL Automation - Real-time and Batch
• Data As A Service
• Reduce development and operational complexity of Hadoop
Typical Use-cases
Building Blocks
• Dataset: Encapsulated data access patterns and data model in a reusable, domain-specific API
• Program: Standardized containers for processing paradigms
• Application: Programmatic abstraction for composing multiple Datasets and Programs that integrates ingestion, exploration, transformation and serving
Dataset
RDBMS: Raw Storage; Interfaces, Data Modeling, Data Layout, Optimizations and Schema
• Optimizations are pushed closer to storage
• Applications use SQL to access data (store or retrieve)
Hadoop: Raw Storage
• Modeling, layout and optimizations are embedded within applications
• Hard to scale - lack of reusability
Dataset: Raw Distributed Storage, Model, Layout, Optimizations and optional Schema
• Access through domain-specific APIs with optional SQL interface
• Optimizations embedded within datasets
• Simpler applications!
Dataset Motivation
• Encapsulate a data access pattern and data model in a reusable, domain-specific API
• Establishes best practices in schema definition
• Abstract away underlying storage platform
• Reusable as data storage templates
• Easy sharing of stored data:
  • Between applications
  • Batch and real-time processing
• Integrated testing
• Extensible to create your own solutions
• Transparent integration with
  • Hive metastore
  • MR Input/Output Formats
  • Spark RDDs
Building Blocks - Dataset
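To make the first and third bullets concrete, here is a minimal sketch in plain Java of a domain-specific API that hides its storage layout behind an abstract store. It does not use the CDAP Dataset API; `KeyValueStore`, `InMemoryStore`, and `PageViewCounts` are hypothetical names invented for this illustration, and the in-memory map merely stands in for a real storage platform.

```java
import java.util.HashMap;
import java.util.Map;

class DatasetSketch {
  public static void main(String[] args) {
    PageViewCounts counts = new PageViewCounts(new InMemoryStore());
    counts.increment("/home");
    counts.increment("/home");
    System.out.println(counts.get("/home"));    // prints 2
    System.out.println(counts.get("/missing")); // prints 0
  }
}

// Hypothetical storage abstraction (an assumption, not a CDAP interface).
interface KeyValueStore {
  void put(String key, String value);
  String get(String key);
}

// In-memory stand-in for the underlying storage platform.
class InMemoryStore implements KeyValueStore {
  private final Map<String, String> data = new HashMap<>();

  @Override
  public void put(String key, String value) {
    data.put(key, value);
  }

  @Override
  public String get(String key) {
    return data.get(key);
  }
}

// Domain-specific API: callers work with page views, not keys and encodings.
class PageViewCounts {
  private final KeyValueStore store;

  PageViewCounts(KeyValueStore store) {
    this.store = store;
  }

  long increment(String page) {
    long next = get(page) + 1;
    store.put(rowKey(page), Long.toString(next));
    return next;
  }

  long get(String page) {
    String raw = store.get(rowKey(page));
    return raw == null ? 0L : Long.parseLong(raw);
  }

  // The key layout is an implementation detail hidden from applications.
  private static String rowKey(String page) {
    return "views:" + page;
  }
}
```

Swapping `InMemoryStore` for another `KeyValueStore` implementation would leave `PageViewCounts` and its callers unchanged, which is the reuse the bullets describe.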
• System Dataset Types
  • Secondary Indexes
    • Example use case: Entity storage - store customer records indexed by location
  • Object Mapping
    • Example use case: Entity storage - easily store User instances for user profiles
  • Timeseries Data
    • Example use case: any data organized around a time dimension
  • Data Cube
    • Example use case: Retail product sales reports, web analytics
  • Partitioned Fileset
    • Example use case: Time partitioned processing of feeds
• Custom Dataset Types
  • Build your own!
Dataset - Types
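As an illustration of the secondary-index use case (customer records indexed by location), the following is a small in-memory sketch in plain Java. It is not the CDAP indexed-table implementation; `IndexedCustomers` and its methods are hypothetical names, and the point is only that every write maintains both the primary record and the index entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SecondaryIndexSketch {
  public static void main(String[] args) {
    IndexedCustomers customers = new IndexedCustomers();
    customers.put("c1", "Ada", "London");
    customers.put("c2", "Bob", "Paris");
    customers.put("c3", "Cy", "London");
    System.out.println(customers.idsByLocation("London")); // prints [c1, c3]

    customers.put("c1", "Ada", "Paris"); // moving a customer updates the index
    System.out.println(customers.idsByLocation("London")); // prints [c3]
  }
}

// Hypothetical dataset: primary storage by id plus a secondary index by location.
class IndexedCustomers {
  // Primary data: customer id -> {name, location}
  private final Map<String, String[]> byId = new HashMap<>();
  // Secondary index: location -> sorted customer ids
  private final Map<String, Set<String>> index = new HashMap<>();

  void put(String id, String name, String location) {
    String[] old = byId.put(id, new String[] {name, location});
    if (old != null) {
      // Remove the stale index entry for the previous location.
      index.get(old[1]).remove(id);
    }
    index.computeIfAbsent(location, k -> new TreeSet<>()).add(id);
  }

  String nameOf(String id) {
    String[] record = byId.get(id);
    return record == null ? null : record[0];
  }

  List<String> idsByLocation(String location) {
    return new ArrayList<>(index.getOrDefault(location, Collections.emptySet()));
  }
}
```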
Dataset - Example
• A Java Library
• Table Dataset
  • First Name, Last Name and Link to Picture in a Table
• Fileset Dataset
  • Pictures in a Fileset
• Instance of Dataset as
  • HBase Table and
  • HDFS Directory
• Access using SQL (Hive)
• Tigon, MR & Spark can access

public class ContactsDataset extends AbstractDataset {
  private ObjectMappedTable<Contact> contacts;
  private FileSet pictures;

  public ContactsDataset(DatasetSpecification spec,
                         @EmbeddedDataset("contacts") ObjectMappedTable<Contact> contacts,
                         @EmbeddedDataset("pictures") FileSet pictures) {
    super(spec.getName(), contacts, pictures);
    this.contacts = contacts;
    this.pictures = pictures;
  }

  public void addContact(String nick, Contact contact) {
    contacts.write(nick, contact);
  }

  public Contact getContact(String nick) {
    return contacts.read(nick);
  }
  // continued...
Dataset - Composite
Embedded Datasets
public class ContactsDataset extends AbstractDataset {
  // ...continued
  public void addPhoto(String nick, byte[] photoBytes) throws IOException {
    Contact contact = getContact(nick);
    if (contact.getPicturePath() != null) {
      // delete picture path
    }

    String picturePath = "pic." + nick;
    Location location = pictures.getLocation(picturePath);
    try {
      // Close the stream before updating the contact so the bytes are flushed.
      try (OutputStream out = location.getOutputStream()) {
        ByteStreams.copy(new ByteArrayInputStream(photoBytes), out);
      }
      contact.setPicturePath(picturePath);
      contacts.write(nick, contact);
    } catch (IOException e) {
      LOG.error("Got exception: ", e);
      // delete path
      throw e;
    }
  }
}
Dataset - Transactional Update
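The slide title implies that the picture write and the contact write should succeed or fail together. The sketch below illustrates that all-or-nothing idea in plain Java with staged writes that are either committed or rolled back as a group. It is a deliberately simplified illustration and not how CDAP or Tephra implement transactions; `BufferedStore` and `TxRunner` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class TransactionalUpdateSketch {
  public static void main(String[] args) {
    BufferedStore contacts = new BufferedStore();
    BufferedStore pictures = new BufferedStore();

    boolean ok = TxRunner.execute(() -> {
      pictures.put("pic.ada", "<bytes>");
      contacts.put("ada", "pic.ada");
    }, contacts, pictures);
    System.out.println(ok + " " + contacts.get("ada")); // prints true pic.ada

    boolean failed = TxRunner.execute(() -> {
      pictures.put("pic.bob", "<bytes>");
      throw new IllegalStateException("simulated failure");
    }, contacts, pictures);
    System.out.println(failed + " " + pictures.get("pic.bob")); // prints false null
  }
}

// Hypothetical store that stages writes until commit.
class BufferedStore {
  private final Map<String, String> committed = new HashMap<>();
  private final Map<String, String> pending = new HashMap<>();

  void put(String key, String value) {
    pending.put(key, value);
  }

  // Reads see this store's own staged writes first, then committed data.
  String get(String key) {
    return pending.containsKey(key) ? pending.get(key) : committed.get(key);
  }

  void commit() {
    committed.putAll(pending);
    pending.clear();
  }

  void rollback() {
    pending.clear();
  }
}

// Runs a unit of work and commits or rolls back all participating stores together.
class TxRunner {
  static boolean execute(Runnable work, BufferedStore... stores) {
    try {
      work.run();
    } catch (RuntimeException e) {
      for (BufferedStore store : stores) {
        store.rollback();
      }
      return false;
    }
    for (BufferedStore store : stores) {
      store.commit();
    }
    return true;
  }
}
```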
public class ContactsDataset extends AbstractDataset
    implements RecordScannable<StructuredRecord> {
  //..

  @Override
  public Type getRecordType() {
    return StructuredRecord.class;
  }

  @Override
  public List<Split> getSplits() {
    return contacts.getSplits();
  }

  @Override
  public RecordScanner<StructuredRecord> createSplitRecordScanner(Split split) {
    return contacts.createSplitRecordScanner(split);
  }
}
Dataset - Explorable
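The pattern behind `getSplits()` and `createSplitRecordScanner(split)` is that the data is divided into independent ranges and each range is read by its own scanner. A minimal plain-Java sketch of that idea follows, assuming an in-memory list of rows; `RowSplit` and its methods are hypothetical and do not reproduce the CDAP `Split` or `RecordScanner` types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SplitScanSketch {
  public static void main(String[] args) {
    List<String> rows = Arrays.asList("ada", "bob", "cy", "dee", "eve");
    List<RowSplit> splits = RowSplit.of(rows.size(), 2);
    System.out.println(splits.size()); // prints 3

    // Each split can be scanned independently, for example by a separate task.
    for (RowSplit split : splits) {
      Iterator<String> scanner = split.scan(rows);
      while (scanner.hasNext()) {
        System.out.println(split + " -> " + scanner.next());
      }
    }
  }
}

// Hypothetical split: a half-open range [start, end) of row positions.
final class RowSplit {
  final int start;
  final int end;

  private RowSplit(int start, int end) {
    this.start = start;
    this.end = end;
  }

  // Divide totalRows rows into consecutive splits of at most splitSize rows.
  static List<RowSplit> of(int totalRows, int splitSize) {
    if (splitSize <= 0) {
      throw new IllegalArgumentException("splitSize must be positive");
    }
    List<RowSplit> splits = new ArrayList<>();
    for (int s = 0; s < totalRows; s += splitSize) {
      splits.add(new RowSplit(s, Math.min(s + splitSize, totalRows)));
    }
    return splits;
  }

  // Scanner over only the rows that belong to this split.
  Iterator<String> scan(List<String> rows) {
    return rows.subList(start, end).iterator();
  }

  @Override
  public String toString() {
    return "[" + start + "," + end + ")";
  }
}
```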
Dataset Example - Usage
public class Contacts extends AbstractApplication {
  @Override
  public void configure() {
    try {
      setName("Contacts");
      setDescription("An application to manage contacts and their pictures");

      createDataset("contacts", ContactsDataset.class);

      // Define programs, other datasets...
    } catch (UnsupportedTypeException e) {
      // cannot happen with Contact
    }
  }
}
Programs
• Standardized containers for processing paradigms
• Establishes a unified way of extracting logs & metrics
• Compose complex applications - real-time or batch
• Seamless integration with Datasets - simple or composite
• Provides conceptual integrity across different processing paradigms
• Integrated end-to-end testing
• Extensible to add new processing paradigms
• Leverage common services to ease
  • version management
  • deployment
  • management
Building Blocks - Programs
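One way to picture a "standardized container" with a unified way of extracting metrics is a single runner that executes any kind of program through one interface and records metrics the same way for all of them. The plain-Java sketch below assumes exactly that; `Program`, `ProgramContext`, and `ProgramRunner` are hypothetical names and not the CDAP program APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProgramContainerSketch {
  public static void main(String[] args) {
    ProgramRunner runner = new ProgramRunner();
    runner.run("batch-sum", ctx -> {
      for (int i = 0; i < 3; i++) {
        ctx.count("records");
      }
    });
    runner.run("noop", ctx -> { });
    // prints {batch-sum.runs=1, batch-sum.records=3, noop.runs=1}
    System.out.println(runner.metrics());
  }
}

// Hypothetical common contract for every processing paradigm.
interface Program {
  void run(ProgramContext ctx);
}

// Handed to each program so metrics are emitted the same way everywhere.
class ProgramContext {
  private final String programName;
  private final Map<String, Long> sink;

  ProgramContext(String programName, Map<String, Long> sink) {
    this.programName = programName;
    this.sink = sink;
  }

  void count(String metric) {
    sink.merge(programName + "." + metric, 1L, Long::sum);
  }
}

// The container: runs any Program and collects its metrics in one place.
class ProgramRunner {
  private final Map<String, Long> metrics = new LinkedHashMap<>();

  void run(String name, Program program) {
    ProgramContext ctx = new ProgramContext(name, metrics);
    ctx.count("runs");
    program.run(ctx);
  }

  Map<String, Long> metrics() {
    return metrics;
  }
}
```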
Application
Programmatic abstraction for composing a use case by combining Datasets and Programs to perform ingestion, transformation and serving.
Building Blocks - Application
public class PurchaseApp extends AbstractApplication {
  public static final String APP_NAME = "PurchaseHistory";

  @Override
  public void configure() {
    setName(APP_NAME);
    setDescription("Purchase history application.");
    addStream(new Stream("purchaseStream"));
    createDataset("frequentCustomers", KeyValueTable.class);
    createDataset("userProfiles", KeyValueTable.class);
    addFlow(new PurchaseFlow());
    addWorkflow(new PurchaseHistoryWorkflow());
    addService(new PurchaseHistoryService());
    addService(UserProfileServiceHandler.SERVICE_NAME, new UserProfileServiceHandler());
    addService(new CatalogLookupService());
    try {
      createDataset("history", PurchaseHistoryStore.class, PurchaseHistoryStore.properties());
      ObjectStores.createObjectStore(getConfigurer(), "purchases", Purchase.class);
    } catch (UnsupportedTypeException e) {
      // This exception is thrown by ObjectStore if its parameter type cannot be
      // (de)serialized (for example, if it is an interface and not a class, then there is
      // no auto-magic way to deserialize an object). In this case that will not happen
      // because PurchaseHistory and Purchase are actual classes.
      throw new RuntimeException(e);
    }
  }
}
• Is a use-case blueprint
• Composed using one or more Programs and Datasets
• Supports real-time, batch, or a combination of both
• Is an application that is highly reusable through configuration and extensible through plugins
• Plugins extend the Application Template by implementing an interface expected by the template
• Supported by an end-to-end testing framework
Building Blocks - Application Template
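The relationship between a template, its plugins, and the adapters created from configuration can be sketched in plain Java as follows. This is an illustration under assumptions, not the CDAP Application Template API: `PipelineTemplate`, `TransformPlugin`, `Adapter`, and the `plugin` and `value` configuration keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TemplateSketch {
  public static void main(String[] args) {
    PipelineTemplate template = new PipelineTemplate();
    template.registerPlugin("upper", cfg -> s -> s.toUpperCase());
    template.registerPlugin("prefix", cfg -> s -> cfg.get("value") + s);

    Map<String, String> config1 = new HashMap<>();
    config1.put("plugin", "upper");
    Map<String, String> config2 = new HashMap<>();
    config2.put("plugin", "prefix");
    config2.put("value", ">> ");

    // One template, two adapters that differ only in configuration.
    List<String> input = Arrays.asList("a", "b");
    System.out.println(template.createAdapter(config1).process(input)); // prints [A, B]
    System.out.println(template.createAdapter(config2).process(input)); // prints [>> a, >> b]
  }
}

// The interface the template expects plugins to implement.
interface TransformPlugin {
  String apply(String record);
}

// A configured, runnable instance of the template.
class Adapter {
  private final TransformPlugin plugin;

  Adapter(TransformPlugin plugin) {
    this.plugin = plugin;
  }

  List<String> process(List<String> records) {
    List<String> out = new ArrayList<>();
    for (String record : records) {
      out.add(plugin.apply(record));
    }
    return out;
  }
}

// The reusable part: fixed processing logic, pluggable transform chosen by config.
class PipelineTemplate {
  private final Map<String, Function<Map<String, String>, TransformPlugin>> factories =
      new HashMap<>();

  void registerPlugin(String name, Function<Map<String, String>, TransformPlugin> factory) {
    factories.put(name, factory);
  }

  Adapter createAdapter(Map<String, String> config) {
    String name = config.get("plugin");
    Function<Map<String, String>, TransformPlugin> factory = factories.get(name);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown plugin: " + name);
    }
    return new Adapter(factory.apply(config));
  }
}
```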
[Figure: an Application Template exposes a Pluggable Interface. Adapter1, Adapter2, and Adapter3 are shown with Config1, Config2, and Config3 and a Plugin each.]
Want to Learn More?
Open-source (Apache License v2)
Website: http://cdap.io
Mailing List: [email protected] [email protected]
IRC: #cdap on freenode.net
QUESTIONS?