Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
AJUG, Jan 17, 2017
Todd Fritz
Cox Automotive, Inc.
2
Disclaimer
• This presentation is a draft of what will be presented next month at DevNexus.
• “Sneak peek”
• Reactions and questions may influence the evolution of the content.
3
4
[email protected]@coxautoinc.com
www.linkedin.com/in/tfritz
http://www.slideshare.net/ToddFritz
https://github.com/todd-fritz
5
About Me
• Senior Solutions Architect @ Cox Automotive, Inc.
• Strategic Data Services
• The opinions contained herein may not represent my employer, but I believe they should.
• Background is building platforms, middleware, MoM, EIP, EDA, etc.
• DevOps mentality
• Exposed to many environments, technologies, people
• Life-long learner and always curious
• Novice bass player
• Scuba diver
License: CC BY-SA 3.0
6
Previous Presentations
• DevNexus 2015: http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015
• Great Wide Open - Atlanta (April 3, 2014): http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud
• AJUG (April 15, 2014): http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group
  • Video: https://vimeo.com/94556976
7
Agenda
• Forward
• The Briefest History
• Background: Reactive Systems, Patterns, Implementations
• The Enterprise
• Fast Data
• The Data Lake for Analytics, App Dev
• Presentation Improvements Planned for DevNexus
• Questions
• Resources
8
Forward
“Our greatest glory is not in never failing, but in rising every time we fall.”
- Confucius
9
Onward
• Why is Reactive important?
  • Reactive Systems and Programming != Reactive Management
  • Reactive underpins every use case, every business capability, every product feature
  • Tendency for companies to survey the market and select products to match perceived business need
  • Importance of vision, governance, tenancy, entitlements, security
  • Both process and technology
• Use case: build a system that can scale from thousands to millions of users, handling millions to billions of messages
  • Real-time data for application components, middleware, data processing, analytics
• A journey of innovation and successive refinement
10
The Briefest History
“History is the sum total of things that could have been avoided.”
- Konrad Adenauer
11
The Briefest History
• “Reactive” is not new
• Underlying principles go back almost 50 years, to the days of punch cards
• Erlang
  • Built to scale, handle extremely high volume
  • Extensive use in Telco for decades; many billions of messages
  • Actors
• Bedrock is messaging
  • We’ve been using this technique for decades, via many technologies
  • Improvements over time around component isolation, decoupling to benefit scalability and concurrency
12
Reactive Systems, Patterns, Implementations
“People who think they know everything are a great annoyance to those of us who do.”
- Isaac Asimov
13
Reactive Systems
• The Reactive Manifesto
  • http://www.reactivemanifesto.org
• Many organizations independently building software to new, and similar, patterns
  • Increasing pressures to simplify, scale, innovate and improve customer experience
  • Increasing proliferation and interoperability of system environments and connected devices
  • More data with contextual use cases
• Yesterday’s architectures just don’t cut it
  • Need more flexible, resilient, robust systems
  • Solution through evolution
• Spawn of Actors (Akka, Erlang)
  • Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns
14
“…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”*
“A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps)
Individual components may each qualify as reactive, but combining them does not guarantee a Reactive System.
What is Reactive? A Set of Design Principles
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
15
Reactive Begets*
1. Reactive Systems (architecture, design)
2. Reactive Programming (declarative event-based)
3. Functional Reactive Programming (FRP)
NOTE: The inventor of the term FRP, Conal Elliott, says it is misapplied today (e.g. to RxJS, RxJava, Bacon.js, etc.). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
16
Reactive Programming*
• Reactive Programming != Functional Reactive Programming
  • Subset of asynchronous programming
  • New information drives logic flow, vs. control flow driven by a thread of execution
• Avoids resource contention (Amdahl’s Law) that impedes scalability
  • Decompose into multiple steps that are asynchronous and non-blocking
  • Combine into a composed workflow
• Reactive Systems very rarely block
• Reactive API libraries are either declarative (functional composition, combinators) or callback-based (attached to events, executed during the dataflow chain), with stream-based operators (windowing, triggers, etc.)
• Reactive programming is event-driven
• Reactive systems are message-driven
• Wait, what? More about this distinction in a few slides.
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
17
About Dataflow Programming*
• Reactive Programming is related to Dataflow Programming
  • Both emphasize flow of data vs. flow of control
• Examples
  • Futures / Promises
  • (Reactive) Streams: unbounded data processing. Asynchronous, non-blocking, back-pressured pipelines connecting sources and destinations.
  • Dataflow Variables: single-assignment variables (AKA a cell in Excel) whereby a value change can trigger dependent functions to produce new values (state)
• Technologies that do this include
  • Akka Streams
  • RxJava
  • Vert.x
• Reactive Streams Specification
  • The standard for interoperability amongst Reactive Programming libraries on the JVM
  • “…an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.”
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
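The spreadsheet-cell analogy for dataflow can be sketched in a few lines of plain Python. This is only an illustration: the `Cell` class and its methods are hypothetical names, not part of Akka Streams, RxJava or Vert.x, and unlike a strict single-assignment dataflow variable this cell may be updated repeatedly, the way an Excel cell can.

```python
class Cell:
    """A spreadsheet-like cell: holds a value, or a formula over other cells."""

    def __init__(self, value=None, formula=None, inputs=()):
        self._formula = formula          # function of the input values, or None
        self._inputs = list(inputs)      # upstream cells this cell reads
        self._dependents = []            # downstream cells to notify on change
        for cell in self._inputs:
            cell._dependents.append(self)
        self.value = value
        if formula is not None:
            self._recompute()

    def set(self, value):
        """Push a new value in; the change flows to every dependent cell."""
        self.value = value
        for dep in self._dependents:
            dep._recompute()

    def _recompute(self):
        self.value = self._formula(*(c.value for c in self._inputs))
        for dep in self._dependents:
            dep._recompute()


price = Cell(10)
quantity = Cell(3)
total = Cell(formula=lambda p, q: p * q, inputs=(price, quantity))
with_tax = Cell(formula=lambda t: round(t * 1.1, 2), inputs=(total,))

quantity.set(5)   # the data change drives recomputation; nobody polls for it
print(total.value, with_tax.value)
```

The point of the sketch is the direction of control: `quantity.set(5)` never mentions `total` or `with_tax`, yet both end up current because the flow of data, not a caller's control flow, triggers the work.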
18
Why Reactive Programming*
• Notable benefits
  • Increased (efficient) utilization of compute resources (incl. multi-core)
  • Increased performance via serialization reduction
    • Amdahl’s Law
    • Neil Gunther’s Universal Scalability Law, which quantifies the effects of contention and coordination in concurrent, distributed systems. It explains how the cost of coherency can lead to negative returns as new resources are added to the system.
• Productivity. Reactive libraries handle complexities such as asynchronous, non-blocking compute and IO, and coordination between components.
• Great for creating components that are composed into workflows that are back-pressured, scalable and high-performance
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
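Both scalability laws named on this slide are short formulas, so a small sketch may make the "negative results" remark concrete. The function names and the sample coefficients (95% parallel work, contention 0.05, coherency 0.001) are illustrative assumptions, not measurements from any system in this talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup is capped by the serial fraction of the work."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def usl_capacity(workers: int, contention: float, coherency: float) -> float:
    """Universal Scalability Law: relative capacity at N workers.

    contention (alpha) models queueing on shared resources;
    coherency (beta) models the cost of keeping workers in agreement.
    """
    n = workers
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))


for n in (1, 8, 32, 128):
    print(n, round(amdahl_speedup(0.95, n), 2), round(usl_capacity(n, 0.05, 0.001), 2))
```

With these assumed coefficients the Amdahl curve flattens out below 20x no matter how many workers are added, while the USL curve peaks and then declines: the coherency term grows with the square of the worker count, which is the "adding resources makes it worse" effect.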
19
Event-Driven vs. Message-Driven*
• Reactive Programming: event-driven
  • Computation via dataflow chains
  • Events are not directed; “addressable event sources”
  • Events are mere facts that can be observed
  • Emitted by changes in state (state machine)
  • Listeners attach to event sources and react to the events they emit
  • Emitted locally
• Reactive Systems: message-driven
  • Basis of communication across network/components; prefer asynchronous
  • Sender and receiver are decoupled
  • Focus on resilience and elasticity via the communication inherent to distributed systems
  • Long-lived, addressable components
  • Wait for messages to be sent, then react to them
  • Messages are “directed”
  • A message has a clear destination; “addressable recipient”
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
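The "addressable event source" versus "addressable recipient" distinction can be shown side by side in plain Python. `EventSource` and `Mailbox` are hypothetical names for this sketch; a real system would put a network and an asynchronous dispatcher behind the mailbox.

```python
from collections import deque


class EventSource:
    """Event-driven: the source is addressable; it does not know its listeners."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def emit(self, fact):
        # An event is a fact about a state change; anyone attached may observe it.
        for listener in self._listeners:
            listener(fact)


class Mailbox:
    """Message-driven: the recipient is addressable; the sender targets it."""

    def __init__(self, name):
        self.name = name
        self._queue = deque()

    def tell(self, message):
        # Fire and forget: enqueue and return without waiting for processing.
        self._queue.append(message)

    def drain(self, handler):
        while self._queue:
            handler(self._queue.popleft())


seen = []
thermostat = EventSource()
thermostat.subscribe(lambda fact: seen.append(("dashboard", fact)))
thermostat.subscribe(lambda fact: seen.append(("audit", fact)))
thermostat.emit({"temp_changed": 21})      # undirected: both listeners see it

billing = Mailbox("billing")
billing.tell({"charge": 42})               # directed: one clear destination
handled = []
billing.drain(handled.append)
print(seen, handled)
```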
20
Event-Driven & Message-Driven*
• Common pattern is to use Messaging as the means to communicate Events across network/components
  • Events within Messages
• Examples
  • AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)
• Pros
  • Abstraction and simplicity
• Cons
  • Lose some control
  • Messaging forces developers to deal with complex realities of distributed programming
    • Failure detection
    • Message delivery contracts (dupes, retry, ordering)
    • Consistency guarantees
  • Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc.)
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
21
Reactive Systems: Characteristics*
Responsive
• Low latency
• Consistent & predictable
• Foundation of Usability
• Essential for Utility
• Expose problems quickly
• Happier Customers
Resilient
• Responsive through failure
• Resilient (H/A) through replication
• Failure isolated to Component (bulkheads)
• Delegate recovery to external Component (supervisor hierarchies)
• Client does not handle failures
Elastic
• Responsive as workload varies
• Devoid of bottlenecks or hot spots
• Perf metrics drive predictive / reactive autonomic scaling
• Favor commodity infrastructure
Message Driven
• Asynchronous
  • Loose Coupling
  • Isolation
  • Location Transparency
• Non Blocking
  • Recipients can passivate
• Message Passing
  • Flow Control
  • Exception Management
  • Elasticity
  • Back Pressure
* Source: The Reactive Manifesto
22
Reactive Systems: Patterns
Architecture Pattern: Single Component
• Component does one thing, fully and well
• Single Responsibility Principle
• Max cohesion, min coupling
Architecture Pattern: Let-it-Crash
• Prefer a full component restart to complex internal error handling
• Design for failure
• Leads to more reliable components
• Avoids hard to find/fix errors
• Failure is unavoidable
Implementation Pattern: Circuit Breaker
• Protect services by breaking connections during failures
• From EE (electrical engineering)
• Protects clients from timeouts
• Allows time for service to recover
Source: https://www.lightbend.com/blog/architect-reactive-design-patterns
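The Circuit Breaker bullets can be sketched as a small state machine in plain Python. This is a minimal illustration under stated assumptions, not the implementation shipped by Akka or any other library: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after     # seconds to leave the service alone
        self._clock = clock
        self._failures = 0
        self._opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; service is recovering")
            # Window elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]                                # fake clock so the demo is instant
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise IOError("backend down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except IOError:
        outcomes.append("failed")

now[0] = 11.0                                # recovery window has passed
outcomes.append(breaker.call(lambda: "ok"))  # trial call succeeds, circuit closes
print(outcomes)
```

The third call is rejected without touching the backend, which is how the pattern both protects clients from waiting on timeouts and gives the service time to recover.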
Implementation Pattern: Saga
• Divide long-lived, distributed transactions into quick local ones with compensating actions for recovery
• Compensating transactions run during saga rollback
• Concurrent sagas can see intermediate state
• Sagas need to be persistent to recover from hardware failures (save points)
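A saga's rollback behavior can be sketched as a list of (action, compensation) pairs. The `run_saga` helper and the booking steps are hypothetical, and the sketch keeps its progress in memory only; a real saga would persist each completed step (the "save points" above) so that it can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If any action fails, run the compensations of the steps that already
    completed, in reverse order, then re-raise the failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)


log = []


def charge_card():
    raise RuntimeError("payment declined")


steps = [
    (lambda: log.append("reserve seat"), lambda: log.append("release seat")),
    (lambda: log.append("hold hotel"), lambda: log.append("cancel hotel")),
    (charge_card, lambda: log.append("refund card")),
]

try:
    run_saga(steps)
except RuntimeError as exc:
    log.append(f"saga rolled back: {exc}")
print(log)
```

Note that the failed step's own compensation ("refund card") never runs, because that step never completed; only the two earlier local transactions are undone.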
23
Sound Complicated?
• Still need to use the brain and learn
  • Not going to be able to build reactive systems with just paper certifications
  • No substitute for experience!
• The “new” is based on evolution (old techniques and patterns)
• Paradigms and patterns not isolated to components or technology
Source: The Reactive Manifesto
24
The Enterprise
“Even if you are on the right track, you’ll get run over if you just sit there.”
- Will Rogers
25
The Enterprise
• We all work for 1..n companies that have a size, complexity and age
• Young companies (e.g. start-ups)
  • Less legacy overhead
  • More able to adopt newer technology; attempting to innovate, disrupt, find a niche
  • Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product
  • More likely to build cloud native (various reasons)
  • Staff typically has larger sphere of influence
  • Sometimes filled with Unicorns
• Mid-size companies
  • May have legacy overhead
  • Complexity if growth through acquisition vs. growth through product evolution
  • Strategic use of Enterprise products
  • May be cloud native, or hybrid
  • Increased division of labor, defined roles, paying customers to keep happy
26
The Enterprise
• Generalizing large companies
  • Probably where the terms “technical debt” and “culture debt” came from
  • Certainly has legacy overhead, very complex environments
  • Fear of change, less room to fail (backed by valid business reasons)
  • Likely has complexity due to both acquisition and product evolution
    • Purchase a company, adding a “different” tech stack
    • Money to fully absorb? Risk? Or purchase to evolve existing lines of business?
  • Use of Enterprise products, likely prefers “supported” flavors of open source
  • A blend of on-premises, hybrid, cloud
  • More politics and cats to herd
  • Slower to see value from new technology
  • Matrixed division of labor, defined roles, paying customers to keep happy
  • Transformative change is more difficult; planning, budgeting, process, management structure
27
Breaking Down the Enterprise
• An Enterprise has a variety of software applications
  • Customer facing, revenue generating
  • “Supporting” applications for operation, development, maintenance activities
  • Datamarts used for analytics
  • Data typically moved from system of record into one or more centralized data “hubs”
    • OLAP / Data Warehouses
    • Data Lake, common use of Hadoop
• Older application architectures focused on the application, not on enterprise interoperability
  • Drop in ETL, ESB, SOA, iPaaS to do so ($$$)
  • Perhaps refactor service layers into more modern middleware
• Enterprises becoming “real time”
  • The old, batch-oriented, application silos are too complex and just can’t do this well
28
How the Enterprise Benefits from Reactive
• (Not a complete list)
• Products (and supporting software) operate in real time
  • Opens the door to things that increase customer satisfaction/retention
• Save money, become more agile (speed to market); a productivity enhancer
• Less reliant on batch or pull (streaming batch to interop with legacy)
• Data flows to the systems that need it, where it is acted on in real time
• More powerful data processing
• Easier to manage; reduced TCO
• Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service)
• Leverage cloud solutions, auto scaling, resiliency, high availability
• Analysts get to use the latest tools, oftentimes in the cloud
• Enables autonomic automation
29
And Again, those Analysts
• Friends don’t let their friend-analysts do map-reduce
• Analysts are not highly technical (some do code in R)
• Larger companies may have communities of SAS users
• They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data.
• Can’t expect people to switch careers or skill sets to accommodate new technology.
30
31
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
32
Fast Data
“If you are in a spaceship that is traveling at the speed of light, and you turn on the headlights, does anything happen?”
- Steven Wright
33
Fast Data
• Big Data -> Fast Data
  • Constant stream of inbound data, at incredible rates
  • The old way is to store it, then analyze
    • Say hello to Map Reduce!
    • Hadoop did reinvent how to process petabytes of data on commodity hardware
• Why not analyze and act on data as it is received?
  • Using (proven) technology built to scale to massive volume?
  • Akka, Kafka, Spark, Flink
  • Use case reminder: millions of concurrent actors handling billions of messages, in real time
• Fast Data means acting on data in real time, and sending it to destinations
  • Data Lake
  • Other systems
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
34 Time is Gold
* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
35
Reality Check
• Fast Data is the obvious future
  • Many of us have been using these techs for years
• The easy part
  • Build new systems using Reactive patterns, architectures and technology
  • You start-ups have it easy…
  • The ability of a company to adopt disruptive architectures and patterns is inversely related to its size
• In a real world where SDLC costs money
  • How to reconcile with legacy architectures and implementations?
  • How to interoperate with disparate systems with varying capabilities?
  • Does it make sense to refactor “what works”?
  • Are Frankenstein / stove-piped solutions the norm, or (partially) a matter of perspective, when viewed in the rear-view mirror of innovation?
  • What is a cost-efficient approach to adopt and adapt to Reactive?
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
36
Now, the Fun Stuff
• Enough background, time to talk tech
• For this discussion, the reference platform is Lightbend’s Fast Data platform
  • (A few deviations mentioned)
• Core techs will be
  • Kafka: open source or Confluent
  • Spark: open source or Databricks (a nice managed service in AWS)
  • Akka (and Akka-HTTP): open source or Lightbend stack
  • Alpakka
* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
37
Source: Lightbend, Inc.
38
Kafka 101
• JMS -> AMQP -> Kafka
  • Streaming platform
  • Process messages when provided
  • Fault-tolerant storage
  • Pub/Sub capability
• Common use cases
  • Data pipelines (messaging) between a source and destination
  • Real-time action / transformation
  • Metrics, weblogs, stream processing, event sourcing
• A commit log
  • Spread across a cluster
  • Record streams stored in topics
  • Each record has a key, value, timestamp
  • Each topic has offsets (tracked per partition) and a retention policy
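The commit-log bullets (records with key, value and timestamp; offsets; a retention policy) can be mimicked with a toy, in-memory, single-partition log. `TopicLog` and its methods are hypothetical names for illustration only; this is not the Kafka client API, and it ignores clustering and replication entirely.

```python
import time


class TopicLog:
    """Toy single-partition topic: an append-only list of records."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []        # (offset, key, value, timestamp)
        self._next_offset = 0

    def append(self, key, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        offset = self._next_offset
        self._records.append((offset, key, value, ts))
        self._next_offset += 1
        return offset

    def read_from(self, offset):
        """Return every retained record at or after the given offset."""
        return [r for r in self._records if r[0] >= offset]

    def enforce_retention(self, now):
        """Drop records older than the retention window; offsets never change."""
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[3] >= cutoff]


topic = TopicLog(retention_seconds=60)
topic.append("car-1", "listed", timestamp=0)
topic.append("car-2", "listed", timestamp=30)
topic.append("car-1", "sold", timestamp=90)

first_pass = topic.read_from(0)          # a consumer reads from the start
replay = topic.read_from(1)              # resetting the offset replays records
topic.enforce_retention(now=100)         # records older than 60s are dropped
print(len(first_pass), [r[0] for r in replay], [r[0] for r in topic.read_from(0)])
```

Because reading does not remove anything, a consumer can rewind simply by asking for an earlier offset; records disappear only when the retention policy expires them.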
39 Kafka Performance
Source: “Introduction to Kafka”, Ducas Francis
30k/s = 1.8M/min = 108M/hr ≈ 2.6B/day
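The chain of unit conversions on this slide is easy to check. Assuming a steady 30,000 records per second, the hourly figure matches the slide exactly and the daily figure comes to 2.592 billion:

```python
per_second = 30_000
per_minute = per_second * 60      # 1.8 million
per_hour = per_minute * 60        # 108 million
per_day = per_hour * 24           # 2.592 billion
print(per_minute, per_hour, per_day)
```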
40
Kafka APIs
• Producer API
  • To publish a record to Kafka
• Consumer API
  • To subscribe to one or more topics
  • Consumer groups
  • Handle records published to topics
• Streams API
  • Stream processing
  • Consume from Stream An, do processing, publish to Stream Bn
  • E.g. aggregation, joining
• Connector API
  • To build custom Producers/Consumers
  • Purpose-built integration components
41
Why Use Kafka?
• > Messaging systems (RabbitMQ, AMQ, etc.)
  • Just don’t scale well and become complex
  • Need to use other abstractions for batching
  • Lack replay ability (reset offset, etc.)
• > Log forwarders, e.g. Scribe or Flume
  • Push architecture
  • High performance
  • Scales well
  • Sensitive to business logic in endpoints, because they need to push data fast
  • Assume data is pushed to a large sink (e.g. Hadoop)
  • Oh, and then queried later. So much for real-time.
• Supports Polyglot
  • Python, Go, .NET, Node.js, C/C++, etc.
• Robust ecosystem
  • https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
42
Spark: What is it?
• “…a fast and general engine for large-scale data processing.”
• Hadoop / YARN
  • Map-Reduce: mapper, reducer, disk IO, queue, fetch resource
  • Great for parallel file processing of large files
  • Synchronization barrier during persistence
• Spark
  • In-memory data processing
  • Interactive/iterative data query
  • Better supports more complex, interactive (real-time) apps
  • Up to 100x faster than Hadoop MR in memory, 10x faster on disk
  • Microbatching
Source: spark.apache.org
43
Spark
• Combine SQL, streaming, complex analytics
  • SQL
  • Dataframes
  • MLlib
  • GraphX
  • Spark Streaming
Source: spark.apache.org
44
Spark Execution Modes
• Run it
  • Standalone
  • Hadoop
  • Mesos
  • Cloud
• Access data
  • HDFS
  • Cassandra
  • HBase
  • S3
  • Hive
  • Tachyon, and more
• Write code
  • Scala
  • Java
  • Python, Clojure, R
• Interactive query shell (notebooks)
Source: spark.apache.org
45
Spark: Hadoop MR
• Slow due to replication, serialization, filesystem IO
• Inefficient use cases:
  • Iterative algorithms (ML, Graphs, Network analysis)
  • Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries)
Source: “Spark Overview”, Lisa Hua
46 Spark: Hadoop & Spark (in Hadoop)
Source: “Spark Overview”, Lisa Hua
• Spark has additional features, such as interop with S3 storage
Spark: A Clustered Application
47
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Spark: Execution Terminology
48
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Job: a set of tasks to be executed as a result of an action
• Stage: a set of tasks in a job that can be run in parallel
• Task: an individual unit of work sent to a single executor
Spark: SQL
49
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Spark SQL is a module for structured data querying
• Supports basic SQL and HiveQL
• Can act as a distributed query engine via JDBC/ODBC, or CLI
Spark: Dataframe, Datasets, RDDs
50
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• A Dataframe is a distributed collection of data organized into named columns
  • Analogous to a relational table, or a data frame in R/Python (with richer optimizations)
• Dataset was added in 1.6 to provide the benefits of RDDs together with Spark SQL’s execution engine
  • Build datasets from JVM objects and then manipulate them with functional transformations
51
Spark Streaming
• Scalable, high-throughput, fault-tolerant processing of real-time streams and use cases
• Microbatched. Think of it in terms of EDA (e.g. Esper): “windowing” (etc.) over groups of events, vs. handling a single message/event at a time.
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
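Micro-batching and windowing can be illustrated without a cluster. The sketch below is plain Python, not the Spark Streaming API: `micro_batches` and `windowed_sums` are hypothetical helpers, and the one-second batch interval and two-batch window are arbitrary choices for the demo.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width batch intervals."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(value)
    return dict(sorted(batches.items()))


def windowed_sums(batches, batch_seconds, window_batches):
    """Sum each batch together with the (window_batches - 1) batches before it."""
    sums = {}
    for start in batches:
        covered = [start - i * batch_seconds for i in range(window_batches)]
        sums[start] = sum(sum(batches.get(s, [])) for s in covered)
    return sums


events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5), (3.0, 6)]
batches = micro_batches(events, batch_seconds=1)
print(batches)
print(windowed_sums(batches, batch_seconds=1, window_batches=2))
```

Each event is handled as part of the batch its timestamp falls into, never on its own, which is the trade-off the slide contrasts with single-message processing.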
52
Spark Streaming
• Can merge inbound data with historical data
• Write code in Scala, Python, Java, etc.
• Lower-level access via DStream, obtained via StreamingContext
  • Create RDDs from the DStream
• Two primary metrics to monitor and tune:
  • Processing time (per batch)
  • Scheduling delay (processed upon arrival?)
• Use Kryo for serialization
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Spark Streaming - comparison with other techs
53
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
Building a Spark Application
54
Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
• Scala or Java, compiled to a JAR (in turn uploaded to worker nodes)
• Running a Spark app is as easy as pie (submit options can expand the experience):
$ spark-submit myAwesomePythonScript.py theFileURL
$ spark-submit --class SkyNetInScala skynet2017.jar theFileURL
• Spark uses log4j (beware)
• YARN can aggregate worker logs
55 Akka, who uses it?
Source: “Akka Actor Introduction”, Gene
56 Akka
Source: “Introducing Akka”, Jonas Bonér
• Vision
  • Simpler: Concurrency, Scalability, Fault-Tolerance
  • With a single unified
    • Programming Model
    • Managed Runtime
    • Open Source Distribution
• Manage System Overload (backpressure)
• Scale up & Scale out
• Program to a higher level
  • No more shared state, state visibility, threads, locks, concurrent collections, thread notifications
  • Low-level concurrency is built into the plumbing; it becomes a simple workflow where you just deal with messages
• Increases CPU utilization, lowers latency, high throughput, scalable!
• Superior, proven model to detect and recover from errors
57 Akka: Perfect for the Cloud
Source: “Introducing Akka”, Jonas Bonér
• Elastic and dynamic
• Fault tolerant & self healing (autonomic)
• Adaptive load-balancing, cluster rebalancing & Actor migration
• Build loosely coupled systems that can dynamically adapt at runtime
58 Akka 101
Source: “Introducing Akka”, Jonas Bonér
• Akka’s unit of code is called an Actor
  • The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric
  • Encapsulates code like servlets or session beans; policy decisions separated from biz logic
• Actors have been around since 1973, and if you’ve ever used a phone, that software helped make it work. 9 nines of uptime!
• Think of an actor as a VM in the cloud (it isn’t, but)
  • Encapsulated, decoupled
  • Managing own memory and behavior
  • Communicates asynchronously with non-blocking messages
  • Elastic: grow/shrink on demand
  • Hot deploy, change runtime behavior
59 Akka: How to use Actors?
Source: “Introducing Akka”, Jonas Bonér
• Alternative to:
  • A thread
  • An object instance or component
  • Callback or listener
  • Singleton or service
  • Router, load-balancer or pool
  • Session bean or MDB
  • Out of process service
  • A Finite State Machine (FSM)
60 Akka: What is it?
Source: http://bit.ly/hewitt-on-actors
• Carl Hewitt’s definition
  • Fundamental unit of computation that embodies:
    • Processing
    • Storage
    • Communication
• 3 Axioms. When an actor receives a message it can:
  • Create new Actors
  • Send messages to Actors it knows
  • Designate how it should handle the next message it receives
61 Akka: Core Actor Operations
Source: “Introducing Akka”, Jonas Bonér
0. Define
1. Create
2. Send
3. Become
4. Supervise
62 Akka: Define Operation
Source: “Introducing Akka”, Jonas Bonér
0. Define
• Define the message (class) the actor should respond to, and Actor class
63 Akka: Create Operation
Source: “Introducing Akka”, Jonas Bonér
1. Create
• Yes, creates a new actor. From ActorSystem, then ActorRef.
• Lightweight: 2.6M actors per GB of RAM
• Strong encapsulation of state/behavior (indistinguishable) and message queue
64 Akka: Send Operation
Source: “Introducing Akka”, Jonas Bonér
2. Send
• Sends a message to an Actor
  • Asynchronous and non-blocking (fire & forget)
• Everything is Reactive
  • Actor is passivated until receiving a message, which triggers it to awaken
  • Messages are energy
• Everything is asynchronous and lockless
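The first three operations (Define, Create, Send) can be imitated with a thread and a queue. This is a plain-Python sketch, not Akka: `GreetingActor`, `tell` and the `None` stop signal are hypothetical, and a real actor runtime multiplexes many actors over a small thread pool instead of giving each one a thread.

```python
import queue
import threading


class Greeting:                       # 0. Define the message...
    def __init__(self, who):
        self.who = who


class GreetingActor:                  # ...and the actor that responds to it
    def __init__(self):
        self._mailbox = queue.Queue()
        self.greeted = []             # private state, touched only by the actor thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """2. Send: enqueue and return immediately (fire and forget)."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()   # idle until a message arrives
            if message is None:             # hypothetical stop signal
                break
            if isinstance(message, Greeting):
                self.greeted.append(f"Hello {message.who}")

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()


actor = GreetingActor()               # 1. Create
actor.tell(Greeting("AJUG"))
actor.tell(Greeting("DevNexus"))
actor.stop()                          # wait for the mailbox to drain
print(actor.greeted)
```

The sender never touches `greeted` while the actor is running and never waits for a reply; all state changes happen on the actor's side of the mailbox, which is why no locks appear in the sketch.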
65 Akka: Performance
Source: “Introducing Akka”, Jonas Bonér
+50 million messages per second !!!
66 Akka: Remote Deployment
Source: “Introducing Akka”, Jonas Bonér
67 Akka: Become Operation
Source: “Introducing Akka”, Jonas Bonér
3. Become
• Dynamically redefine an Actor’s behavior
• Reactively triggered by receipt of a message
• Will now react differently to the messages it receives
• Behaviors are stacked and can be pushed and popped
(Think of it as the object changing its type: interface, protocol, implementation)
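A behavior stack is simple to sketch. The toy `Actor` below processes messages synchronously so the effect is easy to follow; the `become`/`unbecome` names mirror the operation on this slide, but the class itself is a hypothetical illustration, not Akka's API.

```python
class Actor:
    """Toy actor whose message handler can be swapped at runtime."""

    def __init__(self, initial_behavior):
        self._behaviors = [initial_behavior]   # stack; the top handles messages

    def become(self, behavior):
        self._behaviors.append(behavior)

    def unbecome(self):
        if len(self._behaviors) > 1:
            self._behaviors.pop()

    def receive(self, message):
        return self._behaviors[-1](self, message)


def happy(actor, message):
    if message == "bad news":
        actor.become(grumpy)
        return "oh no"
    return "great!"


def grumpy(actor, message):
    if message == "good news":
        actor.unbecome()
        return "fine"
    return "go away"


mood = Actor(happy)
replies = [mood.receive(m) for m in ("hello", "bad news", "hello", "good news", "hello")]
print(replies)
```

The same message ("hello") gets a different reply depending on which behavior is on top of the stack, which is also the basis for implementing a finite state machine this way.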
68 Akka: Become Operation – Why?
Source: “Introducing Akka”, Jonas Bonér
Why do this?
• A busy actor can become an Actor Pool or Router!
• Implement FSM (Finite State Machine)
• Implement graceful degradation
• Spawn empty workers that can “Become” whatever the Master desires
• Very useful. Limited only by your imagination.
69 Akka: Failure Management
Source: “Introducing Akka”, Jonas Bonér
In Java/C/C++ etc.
• Single thread of control
• If that thread blows up you are screwed
• Only option is to do explicit error handling within your single thread
• Errors isolated within thread; other threads have no clue
• Results in tons of defensive code, scattered throughout the codebase and entangled in your business logic
70 Akka: Supervise Operation
Source: “Introducing Akka”, Jonas Bonér
4. Supervise
• Manage another Actor’s failures (or the person sitting next to you)
• Let actors monitor (supervise) each other for failure, then respond
• Notification sent to Actor’s supervisor if failure occurs
• Neat separation of processing from error handling
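The separation of processing from error handling can be sketched with two small classes. `Worker`, `Supervisor` and the restart-up-to-a-limit policy are assumptions made for this illustration; Akka's supervision strategies are richer and are configured differently.

```python
class Worker:
    """Child actor: holds only business logic and lets failures escape."""

    def __init__(self):
        self.processed = 0

    def receive(self, message):
        if message == "poison":
            raise ValueError("cannot process message")
        self.processed += 1
        return f"done: {message}"


class Supervisor:
    """Parent actor: owns the failure policy (here: restart, up to a limit)."""

    def __init__(self, worker_factory, max_restarts=3):
        self._factory = worker_factory
        self._max_restarts = max_restarts
        self.restarts = 0
        self._worker = worker_factory()

    def forward(self, message):
        try:
            return self._worker.receive(message)
        except Exception as failure:
            if self.restarts >= self._max_restarts:
                raise                       # escalate: give up on this child
            self.restarts += 1
            self._worker = self._factory()  # let it crash, start a fresh instance
            return f"restarted after: {failure}"


supervisor = Supervisor(Worker)
results = [supervisor.forward(m) for m in ("a", "poison", "b")]
print(results, supervisor.restarts)
```

`Worker` contains no defensive code at all; the decision about what a failure means lives in one place, the supervisor, instead of being scattered through the business logic.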
71 Akka Streaming
Source: Akka Documentation
• Akka Streams API is completely decoupled from Reactive Streams interfaces
  • Reactive Streams interfaces are an implementation detail for passing stream data between processing stages
  • Akka Streams is interoperable with any conformant Reactive Streams implementation
• Principles
  • All features explicit in the API
  • Compositionality; combined pieces retain the function of each part
  • Model of the domain of distributed bounded stream processing
• Reactive Streams -> JDK9 Flow APIs
72 Akka Streaming
Source: Akka Documentation
• Immutable building blocks (blueprints), reusable by libraries, include
  • Source: something with exactly one output stream
  • Sink: something with exactly one input stream
  • Flow: something with exactly one input and one output stream
  • BidiFlow: something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction
  • Graph: a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape
• Built-in backpressure capability
  • No stage can push downstream unless it received a pull beforehand
• Difference between error and failure
  • Error is accessible within the stream as a data element (signaled via onNext)
  • Failure means the stream itself has collapsed (signaled via onError)
  • Want failure to propagate faster than data (essential, to deal with backpressure)
  • Data elements emitted before a failure can still be lost if the onError overtakes them
  • Recovery element acts as a bulkhead to confine a stream collapse to a given region of the stream topology, to isolate the outside from the impact of the collapsed region (e.g. buffered elements)
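The rule that no stage pushes downstream without a prior pull can be shown with a demand-driven source and a slow sink. This is a synchronous, plain-Python sketch with hypothetical `Source` and `SlowSink` classes; it does not use Akka Streams or the Reactive Streams interfaces, where the same demand signal travels asynchronously.

```python
class Source:
    """Emits elements only in response to downstream demand."""

    def __init__(self, elements):
        self._iterator = iter(elements)
        self.emitted = 0

    def pull(self, demand):
        """Return at most `demand` elements; never push more than requested."""
        batch = []
        for _ in range(demand):
            try:
                batch.append(next(self._iterator))
            except StopIteration:
                break
        self.emitted += len(batch)
        return batch


class SlowSink:
    """Consumes in small steps and signals demand equal to its free capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.received = []

    def run(self, source):
        max_in_flight = 0
        while True:
            batch = source.pull(self.capacity)   # demand travels upstream
            if not batch:
                break
            max_in_flight = max(max_in_flight, len(batch))
            self.received.extend(batch)          # "process" the batch
        return max_in_flight


source = Source(range(10))
sink = SlowSink(capacity=3)
peak = sink.run(source)
print(sink.received, peak)
```

However fast the source could produce, the number of elements in flight never exceeds what the sink asked for, so a slow consumer cannot be overwhelmed and no unbounded buffer is needed.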
73 Akka HTTP
Source: Akka Documentation
• Built atop Akka Streams
• Can expose an incoming connection in the form of a Source instance
• To start listening on the network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray)
• Backpressure on source? Akka HTTP stops consuming data from the network; this eventually leads to a zero TCP window, applying backpressure to the sending party itself (e.g. a sensor)
• Rules
  • Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs, allowing full compositionality
  • Avoid destruction of compositionality
    • Express functionality of a library such that materialization can be done by the user, outside of the library’s control
  • Libraries may optionally and additionally provide facilities that consume and materialize graphs
    • Allows a library to provide convenience “sugar” for use cases
74
Alpakka 101
• Akka Streams Integration
  • https://github.com/akka/alpakka
  • http://developer.lightbend.com/docs/alpakka/current/
• Adds interesting capabilities to Akka Streams
  • Modern alternative to Apache Camel (EIP implementation)
  • Camel en ze Akka
• Community driven, focused on connectors to external libraries, integration patterns and conversions
  • “A call to arms”
  • https://github.com/akka/alpakka/releases/tag/v0.1
75
The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot and the decades cannot improve.”
- Garrison Keillor
76 The Data Lake
• Note: This section will be expanded for DevNexus.
77 The Data Lake
• The Data Lake is where a copy of much of the data from source systems “ends up”, via Fast Data, etc.
• Easily accessible, massive repository of data built on commodity hardware (or Cloud)
• Data is not stored in a way that is optimized for data analysis (S3)
  • Data Lake retains all attributes
  • Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117
  • Let’s combine all this data to drive increased information sharing and usage, while reducing cost through consolidation / tech simplification
  • Does it really work?
  • Has the ideal of Enterprise-wide data management been realized?
  • Deriving value from data is still in the hands of the business end user (enter: Fast Data Platforms)
78 The Dark Side of the Data Lake
Source: “Gartner Says Beware of the Data Lake Fallacy”
• Many companies tend to vacuum data into Hadoop for later use
• Many companies use overlapping tools within the same ecosystem that do not interoperate
• Data lakes ignore how/why data is used, governed, defined and secured
  • Does this sound like a good solution?
  • Data Lake solves the old problem of siloing data. Great, so now it’s all commingled.
  • Federated query? AWS Athena? Why move data if not necessary?
• Inability to quantitatively measure data quality
  • Accepts any data without governance or oversight
  • Accepts any data without metadata (description)
  • Inability to share lineage of findings by other analysts to share found value
  • Security, access control (and tracking of both)
  • Data Ownership, Entitlements?
  • Tenancy?
  • Regard for regulatory controls, compliance issues?
  • What to do?
79 More to Come…
Source: “Gartner Says Beware of the Data Lake Fallacy”
• At DevNexus 2017
  • Thurs, Feb 23 @ 2:30pm
  • http://devnexus.com/s/devnexus2017/presentations/17212
80
Presentation Improvements for DevNexus
“Build something 100 people love, not something 1 million people kind of like.”
- Brian Chesky
Planned Improvements
• More diagrams, fewer words
• Refine, refine, refine
• Mix in coding examples
• Improve contrast between older architectures and reactive, for the Enterprise
• Content that contrasts different streaming options (Akka, Spark, Kafka)
• Add specific performance details
• Incorporate additional, interesting content (incl. Data Lake related)
81
82
Questions?
“I'm sorry, if you were right, I'd agree with you.”
- Robin Williams
83
Resources
• The Reactive Manifesto (http://www.reactivemanifesto.org/)
• Chaos Monkey? Use Linear Fault Driven Testing instead.
• https://www.lightbend.com/blog/architect-reactive-design-patterns
• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html
• https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka
• https://kafka.apache.org
• http://www.slideshare.net/ducasf/introduction-to-kafka
• http://www.slideshare.net/SparkSummit/grace-huang-49762421
• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• https://github.com/akka/alpakka/releases/tag/v0.1
• http://www.slideshare.net/LisaHua/spark-overview-37479609
• http://spark.apache.org/
• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/
• http://www.slideshare.net/gene7299/akka-actor-presentation
• http://www.slideshare.net/jboner/introducing-akka
• http://bit.ly/hewitt-on-actors
• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html
• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/
Ideas
84