ML INFRASTRUCTURE PART 3
Connectivity
Third Quarter 2019 — Algorithmia Research

About This Series
Since 2014, Algorithmia has accelerated the deployment and adoption of
machine learning (ML) for many of the world’s largest enterprises. In addition
to our enterprise product, algorithmia.com currently serves more than 8,000
different models to more than 90,000 developers and processes millions of
requests every day.
As we’ve scaled through the years and serviced requests from our customers,
we’ve learned a lot about best practices and scaling machine learning
infrastructure. We’re passing that knowledge to ML developers to help them
along the path to maturity and empower data science teams to achieve more.
"Preparing data for ML pipelines is challenging when end-to-end data and analytic architectures are not refined to interoperate with underlying analytic platforms. New architectural patterns can help, but data engineers must understand end-to-end ML workflows to properly apply them."

Gartner, Preparing and Architecting for Machine Learning: 2018 Update, Carlton Sapp, 14 September 2018
ML Infrastructure
Machine learning (ML) toolsets, languages, and processes are evolving quickly.
The infrastructure to support ML must be able to adapt as data scientists
experiment with new and better solutions. At the same time, organizations must
be able to connect a variety of systems into a platform that delivers consistent
results.
ML architecture can be broken into four distinct functional groups: Data and
Data Management Systems, Training Platforms and Frameworks, Serving and
Life Cycle Management, and the External Systems with which they all interact.
This paper examines the infrastructure needed to connect those groups into a
system capable of productionizing ML at scale.
Data and Data Management Systems
This group comprises the data used in ML projects for training and scoring, along with the related storage and management systems. In almost all cases, this infrastructure is in place before the buildout of the remainder of an organization's ML architecture.
Ml Infrastructure Part 3 — Connectivity A 5
Most data management systems include built-in authentication, role access
controls, and data views. In more advanced cases, the organization will have
a data-as-a-service engine that allows for querying data through a unified
interface. Even in the simplest cases, ML projects likely rely on a variety of data formats and data stores from many different vendors. For example, one model might train on images from a cloud-based Amazon S3 bucket, while another pulls rows from on-premises PostgreSQL and SQL Server databases, and a third interprets streaming transactional data from a Kafka pipeline.
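As a concrete illustration of that heterogeneity, the minimal sketch below shows a single pipeline reading from all three kinds of stores through their own client libraries. The bucket, host, credential, and topic names are illustrative placeholders.

```python
# A minimal sketch of one pipeline reading from three different data stores.
# Bucket, host, credential, and topic names are illustrative placeholders.
import boto3
import psycopg2
from kafka import KafkaConsumer

# Images for a vision model, pulled from a cloud object store.
s3 = boto3.client("s3")
image_bytes = s3.get_object(Bucket="training-images", Key="cats/001.jpg")["Body"].read()

# Tabular features, pulled from an on-premises relational database.
conn = psycopg2.connect(host="db.internal", dbname="features",
                        user="ml_reader", password="...")
with conn.cursor() as cur:
    cur.execute("SELECT user_id, amount, label FROM transactions LIMIT 1000")
    rows = cur.fetchall()

# Streaming transactions, consumed from a Kafka topic.
consumer = KafkaConsumer("transactions", bootstrap_servers="kafka.internal:9092")
for message in consumer:
    print(message.value)  # hand off to a scoring step in a real pipeline
```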
Training Platforms and Frameworks
Training platforms and frameworks are the wide variety of tools used for model
building and training, each of which should ultimately generate model files and
dependencies that the serving infrastructure can run and manage.
Figure 3.1: The four functional groups of an ML architecture and their connections: data and data management systems (including ETL pipelines and a feature store), training platforms and frameworks (e.g., TensorFlow, Scikit-learn, PyTorch, Keras, SageMaker, Dataiku, H2O.ai, Azure ML, Cloudera Workbench), serving and life cycle management (model portfolio, orchestration, CLI, dependency management, languages, versions, pipelining, compute, hardware, model evaluation, governance, monitoring), and external systems. Source: Algorithmia, 2019
Within training platforms, tooling options are nearly limitless. Dataiku, Amazon
SageMaker, Azure ML Studio, Cloudera Data Science Workbench, and dozens of
other commercial training platforms compete with home-grown solutions, any
of which might be the right solution for a given team and job. Given the highly
specialized nature of training tools, freedom of choice is paramount.
Serving and Life Cycle Management
The services that allow data scientists to deliver trained models into production
and maintain them include everything needed to:
● INGEST and containerize models and dependencies;
● CATALOG models to make them discoverable;
● SERVE models in a scalable environment;
● INTEGRATE into DevOps alerting, logging, and system health monitoring;
● MANAGE and govern the entire life cycle in compliance with performance, regulatory, and governance specifications.
External Systems
Machine learning does not exist in isolation. A wide variety of applications
external to the ML process need to consume model output, log and audit model
behavior, or otherwise monitor or integrate with data, training, and production
systems.
Connectivity Approaches

As discussed in The Roadmap to Machine Learning Maturity whitepaper,
ML-focused projects generate value only after they connect these distinct
functional areas into a workflow. Data is useful only after models interpret it,
and model inference generates value when external apps consume it. The path
toward integration generally falls into one of two categories:
Approach 1: Horizontally Integrated
● PRO: Fastest path to automating existing processes.
● CON: Fragile integrations, ongoing software development and maintenance, vendor lock-in.
The quickest way to develop an ML platform is by supporting only a subset of
solutions from each of the functional groups. By limiting the available options,
it is faster and easier to integrate each component into a horizontal platform,
often hardcoding the handoff from one component to the next.
This is where many companies building DIY systems begin—automating existing
processes and tightly integrating current tools. For these organizations, horizontal
integration offers the fastest path to in-house production. It requires no additional workforce training and simply adds speed to workflows already in place.

Figure 3.2: Horizontally integrated systems build hardcoded connectors between specific components.
Unfortunately, this commits an organization to full-time software development.
Rather than training models and adding business value, organizations spend
resources building and maintaining brittle integrations, creating new projects
for any new tools and services, and ultimately, attempting to compete with
commercial platforms with far larger budgets.
Many commercial platforms also pursue a horizontally integrated
strategy, largely to block competitors. By mandating a training platform,
storage solution, or deployment infrastructure, commercial vendors can
increase the value of their customer relationships and reduce churn.
Compared to DIY solutions, these vendors generally offer more points of
integration—but only with non-competitive components, such as
frameworks or languages. These solutions bet that what they provide will be
good enough, and that the ease of an all-in-one solution will be worth vendor
lock-in and lack of choice as the customer matures.
In theory, horizontally integrated systems can be quite efficient and easy to use, since they follow a simple path. In practice, resource constraints (for DIY systems) or lack of competition (for commercial systems) often leads to sub-par experiences throughout the system: without the ability or necessity to do better, many systems simply don't improve, offering poor documentation, sub-par UX, or lackluster performance.
Regardless of whether the platform was purchased off the shelf or built in-
house, organizations will ultimately encounter complications in their upgrade
path as they attempt to stay on
top in a fast-moving environment.
Commercial cloud platform customers
are beholden to a massive company’s
rollout schedule, while DIY shops are
forced to wait for compatibility among
multiple vendor roadmaps.
Approach 2: Loosely Coupled, Tightly Integrated
● PRO: Flexibility to select the best tools for any job, which helps future-proof the ML investment.
● CON: More up-front development time or software licensing costs.
In ML, agility is essential because infrastructure that works today is guaranteed
to be outdated in six months.
Fortunately, each component of the ML system is fairly self-contained, and the interactions between those components are fairly consistent:
● Data informs all systems through queries.
● Training systems export model files and dependencies (see the sketch after this list).
● Serving and life cycle management systems return inferences to
applications and model pipelines, and export logs to systems of record.
● External systems call models, trigger events, and capture and modify data.
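To make the training-side contract concrete, the sketch below trains a scikit-learn model and exports the two artifacts a serving system needs: a serialized model file and a dependency manifest. The file names and pinned versions are illustrative.

```python
# A minimal sketch of the training-side contract: export a model file
# plus the dependencies the serving layer needs to rebuild its environment.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression().fit(X, y)

# Artifact 1: the trained model file.
joblib.dump(model, "model.joblib")

# Artifact 2: a dependency manifest (versions here are illustrative).
with open("requirements.txt", "w") as f:
    f.write("scikit-learn==0.21.3\njoblib==0.13.2\n")
```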
The infrastructure and workflows within each component are quite complex, but the connective tissue binding them together does not need to be.
An architecture that allows each system to evolve independently can help
organizations choose the right components for today without sacrificing the
flexibility to rethink those choices tomorrow. To enable this loosely coupled,
best-of-breed approach, a deployment platform must support three kinds of
connectivity: Publish/Subscribe, Data Connectors, and RESTful APIs.
Publish/Subscribe
Publish/subscribe (pub/sub) is an asynchronous, message-oriented notification pattern. In a pub/sub model, one system acts as a publisher, sending events to a message broker. Subscriber systems explicitly enroll in a channel through that broker, which forwards publisher notifications and verifies their delivery; subscribers can then use those notifications as event triggers.
The pub/sub pattern is highly scalable, but in the context of ML, its most
important feature is flexibility. By abstracting communications between
publishers and subscribers, each side operates independently. This reduces
the overhead of integrating any number of systems and allows publishers and
subscribers to be swapped at any time, with no impact on performance. Since
ML infrastructure is diverse, evolves quickly, and has a wide variety of demand
cycles, the flexibility of pub/sub’s loose coupling provides an excellent fit for
most high-level communications tasks.
There are myriad technologies and services available to manage pub/sub
systems—Amazon SNS, Azure Service Bus, Google Pub/Sub, Kafka, and many
more. A deployment and management system must be designed to interact with these systems. Algorithmia's AI Layer, for example, provides configurable event listeners that allow users to trigger actions based on input from pub/sub systems. In Figure 3.3, a change in source data triggers a run of a specific model in response, while also potentially triggering actions in other subscribed systems.

Figure 3.3: A change in source data triggers a run of a specific model in response, while also potentially triggering actions in other subscribed systems.
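A hypothetical sketch of that flow follows, assuming an illustrative SNS topic ARN and model-serving endpoint: a data pipeline publishes a "source data updated" event, and an event listener reacts by triggering a model run.

```python
# A sketch of the pub/sub flow in Figure 3.3. The SNS topic ARN and the
# model-serving endpoint below are hypothetical placeholders.
import json

import boto3
import requests

# Publisher side: a data pipeline announces that new source data has landed.
sns = boto3.client("sns")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:source-data-updated",
    Message=json.dumps({"dataset": "transactions", "partition": "2019-07-01"}),
)

# Subscriber side: an event listener (wired to the topic via SQS, Lambda,
# or a platform-native listener) reacts by triggering a model run.
def on_data_updated(event_body: str) -> None:
    payload = json.loads(event_body)
    requests.post(
        "https://models.example.com/v1/fraud-detector/runs",  # hypothetical
        json={"dataset": payload["dataset"], "partition": payload["partition"]},
    )
```

Because the publisher knows nothing about its subscribers, either side can be replaced without touching the other, which is exactly the loose coupling the pattern is valued for here.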
Data Connectors
While the model is the engine of any machine learning system, data is both the
fuel and the driver. Data feeds the model during training, influences the model in
production, then retrains the model in response to drift. As data changes, so does its
interaction with the model, and to support that iterative process, an ML deployment
and management system must integrate with every relevant data store.
Connecting to Cloud Data
From consumer apps to the enterprise, the cloud is the new default for data
storage, and a common location for training data. Any model deployment and
management platform should include connectors to the most popular cloud-
based data storage services. The AI Layer, for example, includes support for blob storage services from Amazon, Microsoft, Google, and other providers.

Figure 3.4: The AI Layer includes pre-built integrations to popular cloud-based data sources.
Connecting to Databases
An enormous amount of enterprise data is stored in databases, most of which
remain on-premises. A model deployment platform should be extensible to
allow developers to connect to a wide variety of databases.
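One common way to achieve that extensibility, sketched below with illustrative names rather than any particular platform's API, is a small connector interface that each database-specific implementation fills in; the platform then codes only against the interface.

```python
# A minimal sketch of an extensible connector interface. Class and
# method names are illustrative, not any particular platform's API.
from abc import ABC, abstractmethod
from typing import Iterable

class DataConnector(ABC):
    """Contract the platform codes against; database drivers plug in behind it."""

    @abstractmethod
    def query(self, statement: str) -> Iterable[tuple]:
        """Run a read query and yield result rows."""

class PostgresConnector(DataConnector):
    def __init__(self, dsn: str) -> None:
        # Lazy import so only installed drivers are loaded.
        import psycopg2
        self._conn = psycopg2.connect(dsn)

    def query(self, statement: str) -> Iterable[tuple]:
        with self._conn.cursor() as cur:
            cur.execute(statement)
            yield from cur.fetchall()

# Supporting SQL Server, Oracle, etc. means adding one new subclass,
# with no change to the platform code that consumes connectors.
```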
Connecting to Other Sources
A massive amount of training data is stored in filesystems as images, delimited
data files, or other file formats. While a deployment platform could mandate that data scientists upload these files to an approved cloud storage bucket and then connect to it directly (as horizontally integrated platforms from cloud providers do), this introduces an unnecessary step and ties users to storage solutions that can increase costs and might be deprecated in the future.
To offset these risks, deployment platforms should offer Web- and API-
based tools to import and host data from filesystems. This hosted data store
should support grouping of assets (similar to folders in a filesystem) and full
permissioning at both the asset and group levels. Models and algorithms should be able to access any hosted data via a URI.
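The sketch below shows what URI-addressable hosted data can look like in practice, using the Algorithmia Python client; the API key, collection, and file names are illustrative.

```python
# A sketch of URI-addressable hosted data, using the Algorithmia Python
# client. The API key, collection, and file names are illustrative.
import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")

# Upload a local file into a hosted, permissioned collection.
client.file("data://my_org/training_sets/labels.csv").putFile("labels.csv")

# Any model or algorithm can then read the same asset through its URI.
csv_text = client.file("data://my_org/training_sets/labels.csv").getString()
```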
RESTful APIs
ML model output is consumed in many different ways. Applications written in
a variety of languages call models directly. Other models, written in yet another
set of languages, ingest output as part of multi-model pipelines.
Because of the variety of requesting platforms, and the unpredictability of those
requests, a loose coupling is, again, the most elegant answer, and RESTful APIs
are the most elegant implementation, due to the five required REST constraints:
1. Uniform Interface
● DEFINITION: All requests adhere to a common format.
● BENEFIT: Requests from different systems are formatted identically,
dramatically reducing the burden of supporting disparate platforms.
2. Client–Server
● DEFINITION: The server only interacts with the client through requests.
● BENEFIT: The client and the server are decoupled black boxes, allowing
clients to change and servers to evolve without breaking the relationship.
3. Stateless
● DEFINITION: All necessary information must be included within a request
rather than relying on information from previous requests or other data.
● BENEFIT: Statelessness greatly expands flexibility and horizontal
scalability, while also enabling comprehensive auditing.
4. Layered system
● DEFINITION: The REST client is agnostic to any layers between itself and
the server.
● BENEFIT: This allows for performance- or security-related
infrastructure to sit between the client and server, as needed.
5. Cacheable
● DEFINITION: Developers can declare certain responses to be cacheable.
● BENEFIT: Cacheability reduces latency and increases scalability.
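Taken together, these constraints reduce a model invocation to a single self-contained request. The sketch below shows such a call with a hypothetical endpoint and API key; note that the request carries everything the server needs, so no session state is assumed on either side.

```python
# A minimal sketch of a stateless REST call to a served model. The endpoint
# and API key are illustrative placeholders; the request is self-contained.
import requests

response = requests.post(
    "https://api.example.com/v1/algo/my_org/fraud_detector/1.0.0",  # hypothetical
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"amount": 1200.00, "merchant": "acme", "country": "US"},
)
response.raise_for_status()
print(response.json())  # the inference, plus any metadata the platform returns
```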
Management APIs
A deployment and management system should expose not just its models but also its management functions via API. Every point of entry and exit to the
system, as well as all management commands, should be available via API. This
will enable a variety of integrations with external systems and ensure that no
application is truly incompatible.
The most requested integration for any deployment platform is likely Jupyter
Notebooks—the standard interface for data science across a number of
frameworks. Jupyter notebooks are used to document and visualize work and
administer many functions of data science workbenches. Extending a notebook
to include deployment is a natural closing of the loop that allows data scientists
to remain in their preferred environment while seeing a project through to the
finish.
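The sketch below shows what that closing of the loop might look like from a notebook cell, against a purely hypothetical management API: the data scientist registers the trained artifact and promotes it to production without leaving Jupyter.

```python
# A sketch of deploying from a notebook via a management API.
# The endpoints, payloads, and response shapes are hypothetical.
import requests

BASE = "https://mlops.example.com/v1"          # hypothetical management API
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Register a new model version by uploading the trained artifact.
with open("model.joblib", "rb") as artifact:
    resp = requests.post(f"{BASE}/models/fraud-detector/versions",
                         headers=HEADERS, files={"artifact": artifact})
version = resp.json()["version"]

# Promote that version to production through the same API surface
# an external CI/CD system could also call.
requests.post(f"{BASE}/models/fraud-detector/versions/{version}/promote",
              headers=HEADERS)
```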
Figure 3.5: Sample deployment to the AI Layer from a Jupyter Notebook.
What’s Next?
In the previous chapters in this series, we identified seven challenges of the
ML life cycle and discussed what it takes to deploy ML models. In subsequent
chapters, we’ll continue to examine what you need to maintain the ML life cycle,
whether you build the architecture yourself, use a third-party solution, or work
with a service provider. Topics will include:
● Serving & Scaling
● Management & Governance
About Algorithmia
Algorithmia helps organizations extend human potential with the AI Layer, a
machine learning operating system.
The AI Layer empowers organizations to:
● Deploy models from a variety of frameworks, languages, and platforms.
● Connect popular data sources, orchestration engines, and step functions.
● Scale model inference on multiple infrastructure providers.
● Manage the ML life cycle with tools to iterate, audit, secure, and govern.
To learn more about how Algorithmia can help your company accelerate its ML
journey, visit our website at algorithmia.com.
Copyright © 2019 Algorithmia, Inc. All Rights Reserved. WP7-190710-v.1.3