Date post: | 13-Apr-2017 |
Upload: | jesus-rodriguez |
A Practical Guide to Enterprise Machine Learning Platforms By Tellago Research
Contents
Overview
Key Characteristics of Enterprise Machine Learning Solutions
Cloud vs. On-Premise Machine Learning Platforms
Enterprise Cloud Machine Learning Platforms
    Azure Machine Learning
    AWS Machine Learning
    IBM Watson Developer Cloud
    Databricks
On-Premise Enterprise Machine Learning Platforms
    Revolution Analytics
    Dato
    Spark MLlib and Spark R
    PredictionIO
    Scikit-learn
Summary
Overview
Machine learning is becoming one of the most important aspects of modern
enterprise applications. Recent years have seen an explosion of innovation in
machine learning platforms, taking the discipline from a domain restricted to a
few data scientists to a mainstream developer audience. As a result, companies
are now in a position to build comprehensive machine learning applications that
were not feasible just 2-3 years ago.
The explosion in machine learning technologies doesn’t come without a price for
enterprises. As with any other rapidly emerging technology trend, machine
learning has seen rapid growth in the number of new platforms and startups that
provide relevant capabilities for enterprises. As a result, many enterprises
struggle to navigate the new ecosystem of machine learning technologies and
platforms.
This paper provides an analysis of some of the most relevant technologies in the
machine learning space, along with lessons that Tellago’s data science practice
team has learned implementing machine learning solutions in the real world. The
analysis presented in this paper is based solely on practical experience, not
theoretical exercises.
Key Characteristics of Enterprise Machine Learning Solutions
Integration with Mainstream Data Stores
Integration with diverse data stores is a key element for the mainstream
adoption of machine learning platforms. Databases, SaaS platforms, ERPs, and
CRMs are just some of the data sources that can be relevant in machine learning
scenarios. The ability to integrate seamlessly with different line-of-business
systems drastically simplifies the adoption of machine learning platforms in
enterprise environments.
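As a rough illustration of why this matters, the sketch below pulls records from two different kinds of sources, a relational database and a CSV export such as a CRM might produce, into a single list of feature rows. It uses only Python's standard library. The table name `customers`, the column names, and the helper functions are hypothetical and chosen purely for illustration.

```python
import csv
import io
import sqlite3


def rows_from_database(conn):
    """Read (customer_id, monthly_spend) pairs from a relational store."""
    cursor = conn.execute(
        "SELECT customer_id, monthly_spend FROM customers ORDER BY customer_id"
    )
    return [{"customer_id": cid, "monthly_spend": float(spend)} for cid, spend in cursor]


def rows_from_csv(text):
    """Read open-ticket counts from a CSV export (e.g. from a CRM)."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(r["customer_id"]): int(r["open_tickets"]) for r in reader}


def build_feature_rows(conn, crm_csv):
    """Join both sources on customer_id into rows a model could train on."""
    tickets = rows_from_csv(crm_csv)
    features = []
    for row in rows_from_database(conn):
        # Customers missing from the CRM export default to zero open tickets.
        row["open_tickets"] = tickets.get(row["customer_id"], 0)
        features.append(row)
    return features


# Hypothetical in-memory data standing in for real enterprise systems.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, 120.0), (2, 75.5), (3, 300.0)]
)
crm_export = "customer_id,open_tickets\n1,2\n3,5\n"

feature_rows = build_feature_rows(conn, crm_export)
```

In a real deployment each source would sit behind its own connector. The point of the sketch is that the join logic stays small once the platform can reach every store.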
Integration with R and Python
R and Python have been the main platforms used in machine learning and data
science applications. Consequently, there are many widely adopted machine
learning frameworks implemented in R and Python. The interoperability with R and
Python libraries allows machine learning platforms to take advantage of well-
established data science practices and techniques implemented in those
frameworks. In that sense, enterprises can benefit from machine learning platforms
that can natively leverage R and Python libraries.
Simple Infrastructure
Scaling machine learning infrastructures can be a complex endeavor. Even worse,
the complexities around the configuration of machine learning infrastructures
sometimes become a friction point for the early adoption of machine learning
platforms. To avoid those challenges, enterprises should look for machine learning
platforms that are relatively simple to set up and don’t require massive
investments in infrastructure. This will allow organizations to focus on the
evaluation of core machine learning capabilities instead of the infrastructure behind
them.
Programmatic Interfaces
Executing and evaluating machine learning models is often seen as an activity
performed exclusively by humans. However, incorporating machine learning models
into business applications is just as relevant in the enterprise. To achieve that,
machine learning platforms should support the programmatic execution of models
via APIs or mainstream enterprise programming platforms such as .NET or Java.
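To make the idea concrete, here is a minimal sketch of what programmatic execution can look like from the application side: a model wrapped behind a function that accepts a JSON request and returns a JSON response, which is the same contract an HTTP prediction API would expose. The scoring rule, the field names, and `handle_prediction_request` are hypothetical stand-ins, not any vendor's API.

```python
import json


def score(features):
    """Hypothetical stand-in for a trained model: flags high-value orders."""
    return 1 if features["order_total"] * features["items"] > 500 else 0


def handle_prediction_request(body):
    """Turn a JSON request body into a (status_code, JSON response) pair."""
    try:
        payload = json.loads(body)
        features = {
            "order_total": float(payload["order_total"]),
            "items": int(payload["items"]),
        }
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed input is reported to the caller instead of raising.
        return 400, json.dumps({"error": f"bad request: {exc!r}"})
    return 200, json.dumps({"prediction": score(features)})


status, response = handle_prediction_request('{"order_total": 90.0, "items": 7}')
bad_status, bad_response = handle_prediction_request('{"items": 7}')
```

Keeping the handler free of any web framework means the same function can sit behind an HTTP endpoint, a message queue consumer, or a batch job.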
Monitoring and Management Tools
Monitoring and managing the execution of machine learning models is essential
to the adoption of this type of platform in enterprise environments. From the
monitoring perspective, machine learning platforms should provide both
analytics about the results of executed models and operational metrics related
to the execution of those models. Additionally, organizations should favor
machine learning platforms that provide a simple but robust management
experience.
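The two kinds of monitoring data mentioned above can be sketched in a few lines. The wrapper below records operational metrics (latency and error count) and result analytics (the distribution of predictions) for any predict function. The class name `MonitoredModel` and the toy model are hypothetical. A real platform would ship these figures to its own dashboards.

```python
import time
from collections import Counter


class MonitoredModel:
    """Wrap any predict function and record both kinds of metrics."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self.latencies_ms = []              # operational: time per call
        self.errors = 0                     # operational: failed calls
        self.prediction_counts = Counter()  # analytics: output distribution

    def predict(self, features):
        start = time.perf_counter()
        try:
            result = self._predict_fn(features)
        except Exception:
            self.errors += 1
            raise
        finally:
            # Latency is recorded for successful and failed calls alike.
            self.latencies_ms.append((time.perf_counter() - start) * 1000.0)
        self.prediction_counts[result] += 1
        return result

    def summary(self):
        calls = len(self.latencies_ms)
        return {
            "calls": calls,
            "errors": self.errors,
            "avg_latency_ms": sum(self.latencies_ms) / calls if calls else 0.0,
            "prediction_counts": dict(self.prediction_counts),
        }


# Hypothetical model: classify a transaction amount as "high" or "low".
model = MonitoredModel(lambda amount: "high" if amount > 100 else "low")
for amount in [20, 250, 75, 999]:
    model.predict(amount)
report = model.summary()
```

A sudden shift in `prediction_counts` is often the first visible sign that input data has drifted, which is why result analytics belong next to the operational metrics.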
Extensibility
Until a few years ago, machine learning platforms were notoriously closed systems.
That factor limited the mainstream adoption of these platforms in enterprise
environments, as many machine learning solutions need a level of customization
that can only be achieved by extending the core platform. For that reason,
organizations should carefully evaluate the extensibility models of machine learning
platforms and analyze how those can help to optimize the platform for their specific
scenarios.
Cloud vs. On-Premise Machine Learning Platforms
A simple way to make sense of the crowded machine learning platform market is to
distinguish between cloud and on-premise platforms. For many
organizations, the nature of the underlying infrastructure (cloud vs. on-premise) is
a determining factor in terms of which machine learning platforms to evaluate.
Deciding between an on-premise and a cloud platform is a dilemma for most
organizations, but it’s even more relevant when it comes to data-centric
platforms. While cloud machine learning platforms abstract the complexity
of the underlying machine learning infrastructure and are rapidly driving innovation
in the space, they lack the levels of control and extensibility that you can achieve
with on-premise machine learning stacks.
The following sections of this document provide an analysis of some of the most
relevant cloud and on-premise platforms in the machine learning space.
Enterprise Cloud Machine Learning Platforms
Machine learning platforms are rapidly emerging as one of the most important
components of platform as a service (PaaS) technologies. While the first iteration of
cloud big data technologies focused on providing a seamless experience for hosting
and provisioning a Hadoop-based infrastructure, the leading platforms in the space
are rapidly adding higher-value data intelligence capabilities, including machine
learning. This movement has been led by vendors like Microsoft, Amazon and IBM,
which have added sophisticated machine learning capabilities to their existing PaaS
offerings. Additionally, there is a large number of startups trying to provide
specialized machine learning cloud services that simplify the experience for
organizations trying to apply machine learning models to specific business
scenarios. When analyzing the cloud machine learning platform space, organizations
should consider Azure, AWS, IBM and Databricks as some of the leaders in the space.
Azure Machine Learning
Overview: Azure machine learning is a fully managed service included in the
Azure platform that allows the implementation of predictive analytics
solutions using machine learning. The service provides interfaces for building,
deploying and managing machine learning models and is tightly integrated
with other Azure services. Currently, Azure machine learning is included as
part of the Cortana Analytics suite.
Key Capabilities: Azure machine learning includes some of the following
capabilities:
o Machine Learning Studio: Microsoft Azure Machine Learning Studio
is a collaborative, drag-and-drop tool you can use to build, test, and
deploy predictive analytics solutions on your data. Machine Learning
Studio publishes models as web services that can easily be consumed
by custom apps or BI tools such as Excel.
o API Generation: Azure machine learning provides the infrastructure
to expose machine learning models as APIs that can be
programmatically accessed by client applications. These APIs can also
be integrated with Azure API Management to enable more
sophisticated management and monitoring features.
o R and Python Extensibility: Azure machine learning allows
developers to incorporate custom R and Python scripts into models.
This extensibility mechanism allows developers to implement machine
learning applications that combine the capabilities of Azure with many
of the popular R and Python machine learning frameworks in the
market.
Challenges: Azure machine learning is still relatively limited in terms of
integration with on-premise data stores, which are predominant in the
enterprise. Additionally, we feel Azure machine learning would benefit from more
complete extensibility mechanisms beyond the ones provided by R and
Python scripts.
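The API Generation capability described above can be exercised from any HTTP client. The sketch below builds, but does not send, a request in the JSON shape that Machine Learning Studio request-response services accept to our understanding: an `Inputs` object carrying `ColumnNames` and `Values`, plus a bearer API key. The endpoint URL, the API key, and the column names are placeholders. The exact schema for a given service should be taken from that service's own API help page.

```python
import json
import urllib.request

# Placeholders: a real endpoint URL and key come from the published service.
ENDPOINT_URL = (
    "https://example.invalid/workspaces/WORKSPACE_ID"
    "/services/SERVICE_ID/execute?api-version=2.0"
)
API_KEY = "REPLACE_WITH_API_KEY"


def build_scoring_request(column_names, rows):
    """Build a POST request for a published scoring service (not sent here)."""
    payload = {
        # "input1" is assumed to be the name of the service's input port.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


# Hypothetical input columns for a single row to score.
request = build_scoring_request(["age", "income"], [["42", "58000"]])
# Sending it would be: urllib.request.urlopen(request).read()
```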
AWS Machine Learning
Overview: Amazon Machine Learning is a native AWS service that makes it
easy for developers of all skill levels to use machine learning technologies.
Amazon Machine Learning provides visualization tools and wizards that guide
developers through the process of creating machine learning (ML) models
without having to learn complex ML algorithms and technology. Amazon
Machine Learning makes it easy to obtain predictions for your application
using simple APIs, without having to implement custom prediction generation
code, or manage any infrastructure.
Key Capabilities: Amazon Machine Learning enables some of the following key
capabilities:
o Model Creation: AWS APIs and wizards make it easy for any
developer to create and fine-tune ML models from data stored
in different data stores and query these models for predictions. The
service’s built-in data processors, scalable ML algorithms, interactive
data and model visualization tools, and quality alerts help you build
and refine your models quickly.
o Prediction Services: AWS machine learning provides the
mechanisms for quickly and reliably generating predictions for your
applications based on previously created machine learning models. The
prediction services can be elastically scaled using AWS infrastructure.
o Data Transformation DSL: AWS machine learning includes a
domain-specific language (DSL) that allows developers to model
transformations on the data processed by machine learning models.
Data transformations implemented using this DSL can be published
as “recipes” and reused across other transformation processes.
Challenges: The experience of getting started with AWS machine learning is
relatively complex compared to its competitors in the space. We believe the
AWS machine learning service can benefit from incorporating more visual
tools that facilitate the authoring of machine learning models. Another
challenging factor in AWS machine learning applications remains the
communication with on-premise data stores.
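To give a feel for the recipes mentioned under Data Transformation DSL, the sketch below assembles one as a Python dictionary and serializes it to JSON. To our understanding, Amazon Machine Learning recipes are JSON documents with `groups`, `assignments` and `outputs` sections, and `group`, `quantile_bin`, `lowercase` and `no_punct` are among the documented transformations. The column names (`age`, `country`, `gender`, `review_text`) are hypothetical, and the recipe format reference remains the authority on exact syntax.

```python
import json

# Column names (age, country, gender, review_text) are hypothetical.
recipe = {
    "groups": {
        # A named group of variables that share the same treatment.
        "DEMOGRAPHICS": "group(country, gender)",
    },
    "assignments": {
        # An intermediate variable: age bucketed into 10 quantile bins.
        "binned_age": "quantile_bin(age, 10)",
    },
    "outputs": [
        # Variables, raw or transformed, that are passed on to training.
        "binned_age",
        "DEMOGRAPHICS",
        "lowercase(no_punct(review_text))",
    ],
}

# The serialized recipe is what would be supplied when a model is created.
recipe_json = json.dumps(recipe, indent=2)
```

Because a recipe is plain JSON, it can be versioned alongside application code and reused across models, which is the reuse benefit described above.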
IBM Watson Developer Cloud
Overview: IBM Watson developer cloud is a series of cognitive data services
included as part of the IBM Bluemix platform. The Watson developer cloud
includes services such as vision analysis, text analytics, text-to-speech
transformation and concept expansion, among a dozen others, that enable
developers to incorporate deep learning and cognitive data capabilities within
their applications.
Key Capabilities: The Watson developer cloud includes some of the
following capabilities:
o Text Analytics: Watson developer cloud provides a large number of
text analytics services, including relationship extraction,
concept insights and sentiment analysis. These services can be easily
integrated with other machine learning or business applications.
o Vision Analytics: Watson developer cloud provides a group of
innovative services that abstract key image analysis capabilities such
as face recognition, object detection and image link extraction. These
services can complement image libraries required in line-of-business
applications and solutions.
o Integration with Bluemix Services: Watson developer cloud is
included as part of IBM Bluemix and, consequently, is tightly
integrated with other Bluemix platform services. As a result,
developers can implement robust applications that leverage
cognitive data services.
Challenges: Watson developer cloud is a collection of APIs that enable
cognitive data capabilities. As a result, Watson developer cloud is typically
used as a complement to machine learning applications and can’t be
considered a complete machine learning solution.
Databricks
Overview: Databricks is a cloud-integrated platform that enables the
implementation and operation of Apache Spark applications. As part of its
current capabilities, Databricks provides strong support for Spark MLlib and
Spark R.
Key Capabilities:
o Model Performance: Databricks provides a highly scalable
architecture that powers the performance of Spark MLlib models. This
capability allows developers to focus on writing Spark MLlib solutions
without worrying about the underlying infrastructure.
o Support for R: In addition to Spark MLlib, Databricks provides support
for Spark R. This capability allows developers to write
sophisticated applications that combine traditional machine learning
and R models to achieve optimal results.
o On-premise Support: One of the biggest advantages of Databricks is
that it is completely based on Apache Spark. That model allows
developers to write machine learning applications that can seamlessly
work in both on-premise and cloud topologies.
Challenges: The current feature set of Spark MLlib and Spark R is relatively
limited compared to some of its cloud competitors. Additionally, Databricks is
a Spark-exclusive cloud, which means that it doesn’t include complementary
platform services comparable to the ones provided by PaaS solutions like Azure,
AWS or Bluemix.
On-Premise Enterprise Machine Learning Platforms
Similar to the cloud space, the on-premise machine learning space is experiencing
an explosion in the number of technologies and platforms that enable the
implementation of enterprise-ready machine learning solutions. Unlike the
cloud space, new on-premise machine learning technologies tend to be
built on popular open source data science frameworks such as R and Python instead
of on proprietary stacks. As a result, many of the leading machine learning
platforms are also delivered as open source distributions. The following sections of
this paper evaluate some of the key on-premise machine learning stacks such as
Revolution Analytics, Dato, Spark, PredictionIO and Scikit-learn.
Revolution Analytics
Overview: Revolution R Enterprise provides the infrastructure for
implementing enterprise-ready analytics applications based on R. Supporting
a variety of big data statistics, predictive modeling and machine learning
capabilities, Revolution R Enterprise is also 100% R. Revolution R Enterprise
supports a variety of analytical capabilities including exploratory data
analysis, model building and model deployment.
Key Capabilities: Revolution R provides some of the following key
capabilities:
o Scalable R: Revolution R Enterprise scales and accelerates R, running
R scripts in a high-performance, parallel architecture that supports
systems from workstations to clusters and grids including Hadoop and
enterprise data warehouses.
o Enterprise-Ready R Capabilities: Revolution R expands R with
enterprise-ready capabilities such as logging, instrumentation,
security and monitoring, among other features that are essential to
operationalize R solutions in the enterprise.
o Integration with Mainstream Analytic Tools: Revolution R
provides integration with many of the most popular analytics tools in
the enterprise such as Tableau, Excel or QlikView. Additionally,
Revolution R integrates with traditional reporting platforms such
as Cognos and Business Objects.
Challenges: Revolution R is optimized for authoring applications in the R
language. Sometimes, this model proves limiting for the implementation of
complete enterprise applications. Additionally, the applications implemented
with Revolution R can be complex to integrate into other enterprise solutions.
Dato
Overview: Dato enables the rapid development, simple deployment, and
robust management of real-time services and applications that use machine
learning. Dato leverages the advancements in Python machine learning
libraries to enable the implementation of highly sophisticated,
enterprise-ready machine learning solutions. The Dato platform includes three key
products: GraphLab Create, Dato Distributed and Dato Predictive Services.
Key Capabilities:
o Model Creation: Dato’s GraphLab Create is an extensible machine
learning framework that enables developers and data scientists to
easily build and deploy intelligent applications and services at scale. It
includes distributed data structures and rich libraries for data
transformation and manipulation as well as scalable task-oriented
machine learning toolkits for creating, evaluating, and improving
machine learning models.
o Scalable Execution: The Dato platform includes Dato Distributed,
which is a server product that allows distributed execution of machine
learning jobs on a cluster of machines. Jobs can include distributed
training of machine learning models, parallel model scoring &
predictions, distributed hyperparameter tuning, model ensembling,
and evaluation tasks. This capability abstracts the complexities of
scaling machine learning models in enterprise environments.
o API Access: Dato Predictive Services enables the execution of Dato
machine learning models as high performance APIs. This capability
allows developers to easily incorporate machine learning models into
new applications without having to use any proprietary libraries.
Challenges: As with any new product, enterprises adopting Dato face the
challenge of embracing a product without a large community of developers
and system implementers. However, the communities around Dato are
rapidly growing. Additionally, Dato is completely Python-centric, which makes
it challenging to adopt for organizations without that in-house expertise.
Spark MLlib and Spark R
Overview: Apache Spark includes two main libraries for machine learning
applications: Spark MLlib and Spark R. MLlib is Spark’s scalable machine
learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative filtering,
dimensionality reduction, as well as underlying optimization primitives. Spark
R is an R package that provides a light-weight frontend to use Apache Spark
from R. Spark R provides a distributed data frame implementation that
supports operations like selection, filtering and aggregation (similar to R
data frames and dplyr) but on large datasets.
Key Capabilities: Spark provides the following key capabilities for machine
learning applications:
o Scalability: Because Spark MLlib and Spark R are built on the Spark
platform, they enjoy the scalability and performance benefits of the
Spark architecture. In that sense, Spark machine learning models can
run across large topologies with hundreds of nodes and recover from
unexpected errors.
o Support for R: The addition of Spark R offers developers the
distinctive option of combining R and machine learning models as part of
the same applications. More importantly, both Spark R and Spark MLlib
are provisioned, scaled and managed using the same underlying
infrastructure.
o Developer and System Integrator Community: Apache Spark is
enjoying a rapidly growing community of developers and system
integrators. As a result, organizations can count on strong support for
machine learning applications built on Spark MLlib and Spark R.
Challenges: Setting up the infrastructure required to run Spark MLlib and
Spark R applications at enterprise scale can be a very complex endeavor.
Additionally, the tools to fully operationalize Spark MLlib and Spark R
applications are still limited compared to other platforms in the space.
PredictionIO
Overview: PredictionIO is an open-source machine learning server for
developers and data scientists to build and deploy predictive applications in a
fraction of the time. The PredictionIO template gallery offers a wide range of
predictive engine templates that developers can download and customize
easily. PredictionIO is built on top of Apache Spark and extends it
with enterprise-ready capabilities such as event-based activation, API
generation and monitoring tools.
Key Capabilities:
o Template Based Authoring: PredictionIO provides a model for
authoring simple machine learning applications based on templates.
These templates abstract some of the underlying complexity of a
machine learning model and can be extended and customized for
specific scenarios.
o Event Based Activation: PredictionIO includes an event server
component that enables the asynchronous activation of machine
learning engines. This architecture provides a scalable model to
execute machine learning applications across diverse topologies.
o Monitoring and Management Tools: PredictionIO extends Apache
Spark with sophisticated management and monitoring tools that
facilitate the operational readiness of machine learning applications.
Challenges: Although very easy to use for simple machine learning
scenarios, PredictionIO can prove limiting in the implementation of more
complex models. Additionally, PredictionIO still hasn’t been able to build
large developer and system integrator communities that would streamline its
implementation in enterprise environments.
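The event server described under Event Based Activation is fed over HTTP. As a minimal sketch, the code below builds, without sending, the kind of JSON event that applications post to it, using the field names from PredictionIO's event API as we understand it (`event`, `entityType`, `entityId`, `targetEntityType`, `targetEntityId`, `properties`, `eventTime`). The host, port, access key, and the "rate" event itself are placeholder assumptions for a hypothetical recommendation engine.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Placeholders: host, port and access key depend on the deployment.
EVENT_SERVER_URL = "http://localhost:7070/events.json"
ACCESS_KEY = "REPLACE_WITH_ACCESS_KEY"


def build_event_request(user_id, item_id, rating, when):
    """Build a POST request recording that a user rated an item (not sent)."""
    event = {
        "event": "rate",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
        "eventTime": when.isoformat(),
    }
    query = urllib.parse.urlencode({"accessKey": ACCESS_KEY})
    return urllib.request.Request(
        EVENT_SERVER_URL + "?" + query,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event_request = build_event_request(
    "u1", "i42", 5, datetime(2017, 4, 13, 12, 0, tzinfo=timezone.utc)
)
# Sending it would be: urllib.request.urlopen(event_request).read()
```

Because events are collected asynchronously and engines are trained from the accumulated event store, the application that emits events never has to wait on model training.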
Scikit-learn
Overview: Scikit-learn is a framework that provides a range of supervised and
unsupervised learning algorithms via a consistent interface in Python. It is
licensed under a permissive simplified BSD license and is distributed with
many Linux distributions, encouraging academic and commercial use.
Key Capabilities:
o Rich Machine Learning Algorithm Library: Scikit-learn provides
what can be considered the richest collection of machine learning
algorithms of any framework in the space. The framework
builds on popular libraries like NumPy and SciPy to provide
sophisticated numerical and scientific computing capabilities.
o Simple Programming Model: Despite its large feature set,
Scikit-learn provides a very simple programming model that allows developers
without strong expertise in machine learning to implement highly
sophisticated data science applications.
o Rich Data Visualizations: Scikit-learn provides a strong set of data
visualization capabilities that can be combined with the machine
learning model to rapidly evaluate the effectiveness of the models.
Challenges: Scikit-learn is a programming framework and not a machine
learning platform. In that sense, Scikit-learn does not provide the scalability
models or the monitoring and management tools typically included in
machine learning platforms. As a result, enterprises should look to leverage
the rich capabilities of Scikit-learn in conjunction with other machine learning
platforms to implement enterprise-ready data science solutions.
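The "consistent interface" mentioned in the overview refers to the convention that an estimator is trained with `fit(X, y)` and queried with `predict(X)`. The sketch below is not Scikit-learn code. It is a dependency-free toy class, hypothetically named `ToyNearestMean`, that follows the same convention to show why models written this way are easy to swap for one another.

```python
class ToyNearestMean:
    """Toy classifier following the fit/predict convention (not Scikit-learn)."""

    def fit(self, X, y):
        # Learn the mean feature vector of each class label.
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.means_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self  # returning self allows fit(...).predict(...) chaining

    def predict(self, X):
        # Assign each row to the class whose mean is closest (squared distance).
        def distance(row, mean):
            return sum((a - b) ** 2 for a, b in zip(row, mean))

        return [
            min(self.means_, key=lambda label: distance(row, self.means_[label]))
            for row in X
        ]


# Hypothetical training data: two features per sample, two classes.
X_train = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]]
y_train = ["small", "small", "large", "large"]
predictions = ToyNearestMean().fit(X_train, y_train).predict([[0.9, 1.1], [8.5, 9.2]])
```

Any object exposing the same two methods could replace `ToyNearestMean` without touching the calling code, which is the practical benefit of the simple programming model described above.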
Summary
Machine learning is becoming one of the most relevant aspects of data intelligence
solutions in the enterprise. Enterprises evaluating machine learning platforms
should consider both cloud and on-premise options. Cloud enterprise machine
learning platforms excel at abstracting the underlying infrastructure needed to run
and scale machine learning models. On-premise enterprise machine learning
platforms offer rich extensibility models and typically rely on open source
distribution channels.
Platforms like Azure, AWS and IBM are leading the charge in the cloud enterprise
machine learning space. Vendors like Databricks are also bringing a lot of
innovation to the space. In the on-premise arena, companies like Dato and
PredictionIO, as well as popular open source frameworks like Apache Spark and
Scikit-learn, are some of the robust options for enterprises building data science
solutions. This paper included an analysis of some of the key machine learning
platforms, including their strengths and weaknesses, based on our experience in
real-world implementations.