White Paper
Creating an Agile Closed Loop Data Science Workflow with Datameer and IBM Cloud Private for Data
Introduction
There’s a lot of hype and hoopla around AI and machine learning today. Whether you’re looking at technology trade publications and Web sites, analyst reports or even the mainstream press, AI is talked about everywhere. And its potential is indeed impressive.
There are significant factors limiting the success of AI in the enterprise, though,
because it can be hard to integrate into operational processes. Part of the reason is that
AI tools are, to varying degrees, primitive and segregated from mainstream analytics
tools. In reality, AI should be connected to all the other work you do with data; it can’t be
in an isolated zone unto itself. Data Science is not an island.
If you’re going to do machine learning work right, you’re going to need well-honed data
sets on which to build your models. It is not just about cleaning the data. It is about
finding more data to increase accuracy, and discovering data that may be more relevant to
the problem at hand.
Getting there isn’t just about “preparing” the data, either. It’s about exploring it and
understanding it – achieving intimacy with the data, grasping its content and meaning in
a fundamental way. It takes way more than simple preparation to do that. It takes deep
analysis and exploration. And that requires advanced analytics and machine learning
work to be done in tandem and in harmony.
The industry and ecosystem have largely segregated these two subdisciplines. Given
that the specializations in each are legitimately different, that is understandable. But –
understandable or not – it’s not something we should settle for. Machine learning without
analytics is machine learning that is less efficient, more expensive and less enlightened.
And given the way society is starting to delegate certain business-critical, and even
socially impactful, decisions to machine learning models, we have a deep obligation to
make those models as accurate as possible. Pairing strong analytics with ML isn’t just
good form; it’s mandatory and essential, from a societal point of view.
The good news is that this integration of analytics and machine learning can be done
relatively easily and effectively, and that can be shown in a practical, matter-of-fact
fashion. In this report, we’ll illustrate how to create a more efficient and effective ML
workflow, carried out using a rich data preparation and exploration platform – in this case
Datameer, combined with a machine learning platform and programming environment –
IBM Cloud Private for Data, utilizing Apache Spark and Jupyter notebooks.
IBM Cloud Private for Data (ICP4D) is an open, cloud-native environment for AI. With
this integrated, fully governed team platform, you can keep your data secure at its
source and add preferred data and analytics microservices flexibly. Datameer – which
is available from ICP4D as an add-on – is an advanced data preparation and exploration
platform that works especially well with big data. We’ll highlight the specific features of
the Datameer platform that complement the machine learning process especially well
and we’ll see how a strong symbiosis between Datameer and ICP4D can be fashioned to
create a faster workflow that produces better ML models and results.
We’ll also show how the cooperative relationship can work both ways – with data being
sent from Datameer to an ICP4D to build and test models, and the results of that testing,
with predictive output, fed back into Datameer for further analysis. We’ll show how this
loop can repeat iteratively – setting up a virtuous analytics-ML workflow cycle.
Our Example

Before we cover the rigors of the workflow for ML, it’s probably a good idea to explain
the high-level premise of machine learning in the context of a tabular data set. It’s
actually pretty simple: out of all the columns in the data set, there’s going to be one
whose value you’d eventually like to predict, and there are going to be several others
whose values become germane inputs and influencers into what the predicted values
will be. The column whose value you will predict is often called the label; the columns
relevant to the predicted values are often called features.
To flesh this out with an example, a data set of customer information, including buying
and demographic information, might become the basis for a model. One column that
indicated a customer’s spending level, using some discrete set of categories, might be
the label. Other columns, with demographic information, including age, gender, marital
status, income, and whether the customer rents or owns her home, might be determined
to be features. Later, when demographic data arrives for new or potential customers, you
can use the model to predict the spending level of each one. This can be of immense
help in running efficient marketing campaigns.
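To make the vocabulary concrete, here is a minimal plain-Python sketch of how a tabular record splits into a label and features. The column names and values are hypothetical, invented purely for illustration, not drawn from any real customer data set:

```python
# Hypothetical customer records: each dict is one row of the data set.
customers = [
    {"age": 34, "income": 48000, "owns_home": False, "spending_level": "medium"},
    {"age": 52, "income": 91000, "owns_home": True,  "spending_level": "high"},
    {"age": 29, "income": 36000, "owns_home": False, "spending_level": "low"},
]

LABEL = "spending_level"                             # the column we want to predict
FEATURES = [c for c in customers[0] if c != LABEL]   # every other column is a candidate feature

print(FEATURES)                                      # the inputs the model learns from
print([row[LABEL] for row in customers])             # the known outcomes used in training
```

When demographic data later arrives for a new customer, the model takes just the feature values and produces a predicted spending level.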
Today’s ML Workflow

In this section we’ll explain the typical machine learning workflow and how data scientists
tend to execute it with their customary AI platform tools. From there we’ll explain how this
workflow can be carried out more effectively using an AI platform like ICP4D in tandem
with Datameer. This approach will be no less convenient than working just with the AI
tools, and yet will deliver far more in the way of insightful data exploration, precision data
engineering and straightforward data preparation.
Understanding this structure and this little bit of vocabulary can actually get you a long
way toward understanding how machine learning works1. But understanding that much
of the workflow in machine learning involves building a data set with data that’s clean,
accurate, and focused to include just the feature columns (as well as the label column,
of course) gives you the essential insight you need to understand how to do the best ML
work possible.
What should now be clear with respect to the workflow is that it involves equal parts data
exploration (profiling, querying and visualizing the data), preparation (cleansing, de-
duplicating, and properly summarizing or aggregating the data), and feature engineering
(determining which columns seem to be impactful to the value in the label column).
And at a higher level, this work will need to be carried out repeatedly and iteratively,
potentially across numerous data sets, with data that may overlap, but which are
nonetheless distinct.
Tooling
In the world of data science today, much of the work tends to happen inside a
programming and data manipulation environment called a notebook. Notebooks are
Web browser-based documents that contain a combination of text (in markdown format,
specifically), code, and output generated by the code. One way to think of notebooks is
as a combination of a Wiki and a code editor. The code can be executed in place and the
output – quite often the result set of a query – can be expressed in text, tabular format.
There are a few different notebook standards out there but Jupyter notebooks are the
most prevalent, and ICP4D hosts them natively. Notebooks allow the output of various
commands to be viewed inline, within the notebook.
1 There’s lots more, of course, including the numerous machine learning frameworks out there, the algorithms available in each, the groupings those algorithms fall into, the various parameters each framework’s algorithm supports, and the range of values those parameters can be set to.
Figure 1: A notebook with DataFrame schema and table list output
Given what we have discussed so far, it should be clear that notebooks in ICP4D can be
used to explore, query, profile and transform data. But it’s also clear that in order to do
this, data scientists must write the code, and understand the intricacies of connecting to
data.
The latter point has two important consequences:
1. Only data scientists or data engineers, with the right coding skills, understanding of
file and database connection information, and having sufficient file and database
permissions can perform these tasks
2. Even those professionals with sufficient skills and access must manually write code
to perform these tasks, making the work error-prone and, at best, far less efficient
than it could be.
With this in mind, a better, more efficient workflow would use a platform specifically
designed to accelerate the difficult preparation and exploration tasks – finding the right
data, exploring it to understand its relevance to the problem, shaping it in the right way,
and engineering the features needed to make the ML model hum.
The data scientist can then use the notebook in ICP4D for creating the ML model, and
the ML platform for training and testing it. This would minimize the amount of coding
required. And, as we’ve discussed, once done with the ML modeling, one could also use
the data exploration capabilities of the preparation platform to validate the model across
any number of dimensions to ensure its accuracy.
Our Architecture

We can easily design an architecture that conforms to the more efficient workflow, by
combining Datameer and IBM Cloud Private for Data, with Apache Spark and Jupyter
notebooks. This accommodates data science without being rocket science; it’s pretty
straightforward. We’ll use:
• Datameer as our platform for data exploration, transformation and preparation
• ICP4D and Apache Spark as our ML platform, and notebooks as the coding
environment
• Datameer as our tool for post-training model performance evaluation, by analyzing
test output from the model built on Spark in ICP4D
Figure 2: An integrated machine learning architecture
A More Ideal ML Workflow

Our architecture enables a far more efficient – and repeatable – machine learning
workflow. Data starts in Datameer, where it is explored and shaped, allowing for upfront
code-less preparation. Datameer simplifies exploration of the data for relevancy, and
makes for easier feature engineering, to create the most appropriately shaped data set,
the first time around.
Datameer’s output is exported to ICP4D where Spark is used to build, train and test an
ML model. ML model test results are round-tripped back into the exploration platform for
detailed model validation. And if the model’s accuracy is not satisfactory, the data can be
further refined and the entire process repeated, creating a virtuous cycle.
Our workflow, then, has three phases: The preparatory workflow; model design, training and
testing; and model performance validation. The steps involved in each phase are as follows:
Preparatory Workflow
• Profiling
• Exploration
• Enrichment
• Preparation
• Coarse analytics
Model Design, Training and Testing
• Creating a model
• Training and testing it

Model Performance Validation
• Round-tripping the results from the model testing in the ML environment back to
Datameer
• Performing data exploration to explore all aspects and dimensions of the model to
ensure accuracy on any front
Datameer’s capabilities are designed to help you understand your data better, shape it
in the right manner, and to explore it on a self-service basis. Many of these capabilities
make Datameer a great platform for the machine learning preparatory workflow. We
review some of those features here. In the next section we’ll apply them to a particular
data set that we’ll then use to build a machine learning model.
Data Profiling
While the Python and R programming languages (and various machine learning
frameworks) provide functions to help data scientists profile their data, Datameer
provides that functionality in a visual format, requiring almost no effort at all.
Specifically, by using Datameer’s Inspector, while examining workbook data in
spreadsheet view, users can see vital profiling statistics (like number of rows, number of
distinct values, and the maximum, minimum and mean) on any column, just by selecting
it. Here’s a closeup view of the inspector in data profiling mode:
Figure 3: Profiling a single column in the Datameer Inspector
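To appreciate what the Inspector provides automatically, consider the code a data scientist would otherwise write in a notebook to compute the same statistics. This is a minimal plain-Python sketch on a hypothetical list of column values; real notebook code would typically operate on a DataFrame, but the manual effort is the point:

```python
from statistics import mean

# Hypothetical column of values pulled from a data set
column = [38, 50, 38, 53, 28, 37, 49, 52]

# Hand-rolled profile: row count, distinct values, min, max, and mean
profile = {
    "rows":     len(column),
    "distinct": len(set(column)),
    "min":      min(column),
    "max":      max(column),
    "mean":     mean(column),
}
print(profile)
```

In Datameer, all of this appears in the Inspector the moment a column is selected, with no code at all.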
Users can also view data profiling information for the entire data set by going into Flip
Sheet view.
Figure 4: Profiling an entire data set in Datameer‘s Flip Sheet view
And while it’s nice to have a histogram per column laid out in a single view as it is above,
any one of those histograms can be zoomed to full-screen mode, simply by clicking on it:
Figure 5: Flip Sheet view histogram in full screen
Data profiling, especially using histograms that visualize the distribution of values in a
column, is important functionality in the data science preparatory workflow. Knowing
that a given column has a small number of values may provide a clue that its value can
be predicted by a machine learning model. If the column has a large number of distinct
values, or even a small number if they’re evenly distributed, then it may be important
to predicting the value of another column. Datameer provides this data distribution
information by default. No special effort is required.
Visual Exploration
Visual Explorer (VE) is Datameer’s patent-pending technology for performing visual
analysis on large volumes of data, with amazingly fast interactive performance. This
allows customers to see trends and patterns in the data and do so in an iterative fashion.
Figure 6: Visual Explorer showing the impact of values in one column on the value of another
While the above visualization is easily accessible and helpful, VE is about much more
than quick and simple column charts. Users can easily create more sophisticated
visualizations, get started very quickly and configure things iteratively.
For example, users need only select the columns they’re interested in and VE will
automatically suggest particular visualizations, allowing the user to select one from a
gallery-style user interface:
Figure 7: Visual Explorer‘s suggested visualizations based on the selection of two columns from a data set.
Once a user selects a suggested visualization, it is displayed in large format:
Figure 8: A Visual Explorer visualization in large-format view
From here, the user can easily tune and refine the visualization, changing the particular
columns selected, and even particular column values included. This can continue until
the visualization is precisely configured according to the user’s needs. For example, the
above visualization could be changed to display a multi-line chart, showing one plotted
line per each of five selected values from the legend column:
Figure 9: Adding a legend/color column, as shown at upper-right, and filtering for specific values, at far left
The ability to do this on large volumes of data is very helpful in the creation of accurate
data models for which mere sampling of data must be avoided. The ability to do this
quickly encourages users to investigate their data tenaciously, since they can ask
question upon question with little if any waiting time.
VE enables users to achieve true, deep understanding of their data. And once satisfied
with the visualization, users can generate a worksheet version of the same data, just by
clicking “Create Summary Sheet”:
Figure 10: A summary sheet, created from the modified Visual Explorer output
The above data could then be further enriched with other columns, and the resulting data
set used to train a machine learning model. Alternatively, users can click “Create Details
Sheet” to generate a sheet with all columns, rather than just those in the visualization.
Smart Analytics
Based on some of the same algorithms used in machine learning work, Smart Analytics
is a Datameer feature that helps users understand the relationships between columns
and the impact one column has on others. Users can easily build decision tree,
column dependency or clustering models on data to see how columns relate and
which values most strongly lead to specific outcomes, insights that are extremely
important in the feature engineering process.
Figure 11: A Smart Analytics decision tree diagram
Easy to Follow Workflow
Beyond these exploration features, working with data in a spreadsheet and formula
environment is highly congruent with the machine learning preparatory workflow. In
Datameer, data is loaded into one sheet then gradually transformed and summarized in
successive sheets in the workbook.
This leaves the lineage of the data fully visible and discoverable. In effect, each tab of the
workbook is a chapter in a story of the data’s evolution, essentially a presentation of the
progressions in the analyst’s thought process in working with the data.
In some ways, the cells of a Datameer workbook are like the cells in a Data Scientist’s
notebook. Each one allows a set of transformations on, and/or analyses of, the data. The
difference is that notebooks have code, while a Datameer workbook has data rows and
columns with easily readable spreadsheet formulas as well as simple filters, sorts, joins
and unions.
Rich Suite of Functions
Another major part of feature engineering is creating new columns from the source
data to feed the right columns and values to the AI/ML model. Doing so requires a
large, comprehensive suite of easy-to-apply functions that transform and aggregate
data, as well as calculate new values using statistical or computational functions.
Datameer offers over 270 powerful, yet simple to use functions to transform and
massage the data into the right shape. These elements provide much of the same
expressiveness and power as code, but through declarative functions instead of imperative,
stepwise lines of code. They’re accessible to more authors and more readable for a
broader set of consumers.
These powerful spreadsheet-style formulas will come in handy when performing feature
engineering. ML models typically require numerical, statistical and encoded values. We
can apply transformation, calculation and statistical functions to the core data to produce
new columns that provide highly tuned values to the ML model in the right format.
Doing AI in Datameer and Spark

Notebooks are a great place to build and test models. Focusing the notebooks on this
task greatly reduces coding efforts. You’ll get the fastest workflow with the greatest
reusability by combining the preparation & exploration platform with the machine learning
platform. In this section we’d like to show you how that can be done. We’ll go through a
real-world example with a well-known public data set, with anonymized U.S. census data
on personal income. We’ll describe how we might examine the data in Datameer, build a
machine learning model on Apache Spark, with Python code in a notebook.
We’ll also run some data through the model to test it, and we’ll bring a data set back into
Datameer that includes that data, the model’s predictions, and the actual values for the
column it tried to predict. Let’s start on the Datameer side.
Preparatory Workflow in Datameer
Earlier we mentioned that performing preparatory work in notebooks (or other code-
based environments) can be less efficient than doing so in tools and platforms that are
more geared to self-service users and more based on direct work with the data rather
than on code meant to manipulate the data.
In order to demonstrate the validity of that claim, let’s examine how many of these
preparatory tasks could be carried out in Datameer. As we go through this explanation,
you’ll see that the Datameer solution is more integrated, code-free and superior in
visuals. You’ll see also that data transformation operations are at once richer and yet
more declarative in nature. That may sound a bit abstract now, but the example will
make it clearer.
First let’s start with mere data acquisition, or ingest. Unlike the notebook case, where
explicit code must be written to connect to a database or file, Datameer provides a
file browser metaphor, similar in concept to Windows’ File Explorer or the Mac’s Finder.
Data stewards, database administrators or personnel in a central analytics group can
configure the population of the folder structure in the browser.
This means access to data in Datameer can be handled with a sophisticated division
of labor. A specialist can define the connection to a database or file, and less technical
business users can consume the connections on a self-service basis. That means
the self-service users don’t need to know about server and database names, storage
volumes, folder names or login credentials. On the other hand, self-service users who
do have such information at their command can build the connections themselves, in
addition to consuming connections built by others.
Connections can be passive links to remote databases, import jobs that bring data
into the Datameer workbook, or file uploads that transfer data to Datameer from files
contributed by users from their own storage media. Data can live on the Datameer server,
in the database or in cloud storage.
A specialist can set up a file upload job from a URL pointing to the file:
Figure 12: Creating a URL-based file upload job pointing to our CSV file
A business user can then import the data from the file at that URL, simply by selecting
the file upload, and the data comes right in:
Figure 13: Viewing the census income data in a Datameer workbook
Note the currently selected column (workclass in the above case) is profiled in the
Datameer Inspector, on the right-hand side of the screen.
Note that unlike what has to happen in a data science notebook, not a single line of code
has been written here. Data profiling is carried out by Datameer implicitly, and is available
as a dedicated view, or contextually while viewing the data. This kind of profiling is critical
to good data science work. It has to be done for virtually every experiment. Having it
a click away makes far more sense than needing a data scientist to write, execute and
debug code to do it each and every time it’s needed.
Looking at the workclass column, we notice some values of “?”. Those same rows
contain a value of “?” in the occupation column as well. Based on a hunch that the
occupation column will be impactful on the prediction of income, we will remove those
rows from the data set:
Figure 14: Filtering out rows with “?” in the workclass and occupation columns
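For comparison, here is a rough sketch, in plain Python with hypothetical rows (the real census data set has many more columns), of the filter logic a data scientist would otherwise hand-code in a notebook:

```python
# Hypothetical rows from the census income data set
rows = [
    {"workclass": "Private",   "occupation": "Sales",      "income": "<=50K"},
    {"workclass": "?",         "occupation": "?",          "income": ">50K"},
    {"workclass": "Local-gov", "occupation": "Protective", "income": "<=50K"},
]

# Keep only rows where neither workclass nor occupation holds the "?" placeholder
cleaned = [r for r in rows if r["workclass"] != "?" and r["occupation"] != "?"]
```

In Datameer the same result comes from a point-and-click filter, with the filtered sheet preserved as a step in the workbook’s lineage.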
Let’s also delete the native-country column from the new sheet, since we don’t want to
use it as a criterion in predicting income:
Figure 15: Deleting the native-country column from the data set
Next, we’ll use one of Datameer’s built-in algorithmic enrichment functions to explore
relationships in the data, which can help with the feature engineering process. We’ll use
the column dependencies algorithm to examine which fields might be indicators of income.
Figure 16: Exploring Column relationships using algorithms in Datameer
These tell us that education, relationship and marital status seem to be indicative of income.
Now we’ll perform some “one-hot encoding,” in which we take categorical variables and
convert them into a form that helps our ML model do a better job in prediction. We use
a Datameer formula to convert the columns with categorical variables into individual
columns, each representing one of the values, which makes this feature engineering part
of the preparation faster and easier.
Figure 17: One Hot Encoding in Datameer
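The idea behind one-hot encoding can be sketched in a few lines of plain Python. The column name and values here are hypothetical, and Datameer performs the equivalent with a formula rather than code:

```python
# Hypothetical categorical column and its distinct values
marital_status = ["Married", "Single", "Married", "Divorced"]
categories = sorted(set(marital_status))   # one output column per category

# For each row, emit one column per category: 1 where the row matches, 0 otherwise
encoded = [
    {f"marital_status_{c}": int(v == c) for c in categories}
    for v in marital_status
]
```

The result is a set of purely numeric columns, which is exactly the form most ML algorithms expect their inputs to take.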
Finally, let’s download the data from the shaped_data sheet. We’ll download the filtered
data from the new sheet in the workbook as a CSV file to our local machine.
Figure 18: Downloading data from the filtered sheet in the workbook
Next, we can upload this file to our IBM Cloud Private for Data project:
Figure 19: Creating a new ICP4D data set from the downloaded worksheet data
Then, from a new Jupyter notebook in ICP4D, we can browse to our data set and
generate the code necessary to open it.
Figure 20: Browsing to our data set and generating a Spark DataFrame from it
In subsequent code in the same notebook, we can use the data from the CSV file for
training, test and validation data.
Model Design, Training and Testing in an ML platform
Now that our shaped data has been brought into the ICP4D environment, we can start
building a model on it. To show you how easy this is, we’ve written additional PySpark
code (Python code, running on Apache Spark) in the notebook that does just this.
We’ll now show you highlights of what we did there, including defining, training
and testing the model, then pushing the “scored” data (the test data along with the
predictions generated from it) back into the workbook as a new sheet.
Our first task is to read the data from the export job CSV file into a Spark DataFrame,
called income:
## READ IN CENSUS INCOME DATA FROM CSV
income = spark.read.csv(path=census_data_file_loc, header=True, inferSchema=True)
Note that the variable census_data_file_loc, which would have been initialized previously,
points to the exported CSV file. Note also that we set header=True so that Spark
recognizes the same column names we set up back in the Datameer workbook, and we
set inferSchema to True. Between these two settings, we avoid having to specify
anything about our columns’ names and data types.
Once the data is loaded, we can use Spark’s randomSplit function to divide the income
DataFrame into two new DataFrames, trainPartition and testPartition, for training and
test data, respectively. We’ll set it up so that 75% of the data is used for training, with the
remaining 25% of the data set aside for testing the model once it’s been trained.
trainingFraction = 0.75
testingFraction = 1 - trainingFraction
seed = 1234

# SPLIT SAMPLED DATAFRAME INTO TRAIN/TEST
trainPartition, testPartition = income.randomSplit([trainingFraction, testingFraction], seed=seed)
We’re getting closer, but Spark’s LogisticRegression machine learning algorithm (which
we’ll use to create the model) requires all our feature and label data to be numeric. That
poses a challenge since much of our data, for columns like Occupation and
Marital_Status, is text-based. Spark’s machine learning APIs help us out here, by giving us an
object called a StringIndexer, which can scan all the string data in a column, assign a
numeric ID to each unique value, then add a column to the DataFrame containing the
corresponding ID for each string value in the original column.
We’ll need to create StringIndexer objects for the columns Workclass, Marital_Status,
Occupation, Relationship, Race, Sex and Income_Bracket. We’ll then use a Spark
Pipeline object to apply each of them and the partitioned DataFrames from the previous
snippet to generate two new DataFrames, (finalTrain and finalTest, for the training and
test data respectively).
The new DataFrames include all the indexed columns, which we can then use as features
in our ML model. We’ll name these indexed feature columns after their string-based
source columns, with “_ind” appended as a suffix. We’ll name the indexed version of the
Income_Bracket column “label” so that Spark’s LogisticRegression algorithm knows
to treat it as our model’s label.
# Create the StringIndexer objects
sI0 = StringIndexer(inputCol="Workclass", outputCol="Workclass_ind")
sI1 = StringIndexer(inputCol="Marital_Status", outputCol="Marital_Status_ind")
sI2 = StringIndexer(inputCol="Occupation", outputCol="Occupation_ind")
sI3 = StringIndexer(inputCol="Relationship", outputCol="Relationship_ind")
sI4 = StringIndexer(inputCol="Race", outputCol="Race_ind")
sI5 = StringIndexer(inputCol="Sex", outputCol="Sex_ind")
sI6 = StringIndexer(inputCol="Income_Bracket", outputCol="label")

# Construct a pipeline based on the StringIndexer objects
transformPipeline = Pipeline(stages=[sI0, sI1, sI2, sI3, sI4, sI5, sI6])

# Create the final training and test DataFrames
finalTrain = transformPipeline.fit(trainPartition).transform(trainPartition)
finalTest = transformPipeline.fit(testPartition).transform(testPartition)
With that work complete, it’s almost time to build our model. One more preparatory step that
remains is to use a Spark VectorAssembler object to add a new column (called “features”) to
the DataFrame. This new column collects the values for all the feature columns into a list of
sorts, called a Vector. This step is necessary as the Spark LogisticRegression object, which
we’ll use to build our model, requires features to be vectorized.
Our features consist of six of the seven indexed columns we created with the
StringIndexer objects, as well as the Age, Fnlwgt, Education_Num, Capital_Gain,
Capital_Loss and Hours_Per_Week columns. We’ll list those using the VectorAssembler’s
setInputCols method and we’ll send the value “features” to setOutputCol so this new
vectorized column has that as its name.
# Vectorize all numeric feature columns into a new column called "features"
featuresAssembler = (VectorAssembler()
    .setInputCols(["Age", "Workclass_ind", "Fnlwgt", "Education_Num",
                   "Marital_Status_ind", "Occupation_ind", "Relationship_ind",
                   "Race_ind", "Sex_ind", "Capital_Gain", "Capital_Loss",
                   "Hours_Per_Week"])
    .setOutputCol("features"))
With our vectorized feature values in a column called “features” and our label in a column
called “label”, we are ready to define and train the model! We’ll create a Pipeline object,
passing it the VectorAssembler object and a LogisticRegression object. Then we’ll
generate a model by calling the Pipeline’s fit method, passing it the finalTrain DataFrame
we created previously.
# Initialize the LogisticRegression object
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Create pipeline based on VectorAssembler and LogisticRegression objects
pipeline = Pipeline(stages=[featuresAssembler, lr])
# Create the model by fitting the pipeline to the finalTrain DataFrame
model = pipeline.fit(finalTrain)
Note the settings for maxIter and regParam are somewhat arbitrary and beyond the
scope of this paper.
With the model built, we can now run our test data through it. This will generate a new
DataFrame that includes columns called “prediction” and “probability”. We can save that
data to a Spark SQL temp table, then select a subset of its columns and overwrite the
DataFrame with the result:
# Evaluate the model with the finalTest DataFrame
predictions = model.transform(finalTest)
predictions.createOrReplaceTempView("tmp_predictions")
# Remove indexed columns from the predictions DataFrame and add a "correct"
# Boolean column
sqlStatement = "SELECT Age, Workclass, Fnlwgt, Education, Marital_Status, \
                Occupation, Relationship, Race, Sex, \
                Capital_Gain, Capital_Loss, Hours_Per_Week, Income_Bracket, \
                label, prediction, label = prediction AS correct, probability \
                FROM tmp_predictions"
predictions = spark.sql(sqlStatement)
Finally, we can write out the contents of the predictions DataFrame to a CSV file, suitable
for pushing back into our Datameer workbook. The file will be saved in a folder called
PredictionData, which is itself located in the folder specified by the dataDir variable,
which was initialized previously. This is shown in the code below:
predictionsFilename = dataDir + "PredictionData"
predictions.write.mode("overwrite").option("header", "true").csv(predictionsFilename)
Round-Tripping: Bringing Test Results Back into Datameer for Evaluation of Accuracy and Analysis
With our model now trained and tested, we can examine the prediction data back in
Datameer. To get the data back in, we download the generated CSV file and then create a
File Upload job in Datameer to push it back into the workbook. The end result is shown below:
Figure 21: Viewing the prediction data back in the workbook
Note the presence of “label,” the indexed version of the Income_Bracket column, as well
as the presence of the “prediction” column, both at the far right. If you look carefully,
you’ll see that the prediction value is often, but not always, the same as the label. The
percentage of the time that the label and prediction value are the same will give us some
insight into the accuracy of the model we’ve built.
Determining that accuracy is easy enough. To start with, we could duplicate the sheet
and add a column to the new sheet, called “correct,” that returns the value 1 when the
label and prediction are the same and zero when they’re not:
Figure 22: Defining the “correct” column
We could then create a new sheet, grouped on the boolean value True (so that the sheet
contains a single group that includes all rows), then use the GROUPSUM function on the
Correct column to get the number of correct predictions, the GROUPCOUNT function to
get the total number of rows, and add a calculated column to determine the percentage
of correct predictions.
Figure 23: Determining ML model accuracy
As we can see from our analysis of the prediction data back in Datameer, the model we
built in Spark appears to be about 82% accurate. Not bad!
Returning to the Visual Explorer allows us, for example, to determine the accuracy of the
model across genders. One thing we are able to find out is that our model seems to be
more accurate for female subjects than male subjects:
Figure 24: Breakdown of correct and incorrect predictions by gender
Specifically, what we find is that while 91% of predictions for female subjects are correct,
only 78% of predictions for male subjects are correct. This insight illustrates how effective
it is to use an analytics tool like Datameer in tandem with a machine learning platform:
after training and testing the model on the Spark platform in Python, we can send the
scored data back to Datameer and analyze patterns in the model's accuracy.
Summing up
There's more we could do here, of course, but the validation example above, as well as
the entire workflow described in this report, demonstrates the complementary nature of
Datameer and an AI platform like IBM Cloud Private for Data, together with Apache Spark
and Jupyter notebooks. We've shown the full round-trip flow and examined how each
platform can be used for its strong suits, avoiding inefficiencies on either side.
Data analysis and ML experimentation are overlapping yet synergistic disciplines. While
data science tools can be used for both, using analytics tools in the phases preceding
and following model design/training/testing is more appropriate and will lead to better
results. Tools focused on prep and analysis are key assets in successful ML work, in
support of building ML models, and in evaluating them.
Data work exists along a phased continuum. Use the right tool for the right job,
leveraging in each phase the components that excel at the tasks involved. Doing so
will benefit the outcome of that phase and the process overall. The efficient workflow
we’ve described in this report will assure the best quality machine learning work and will
enhance the success of that work’s outcomes.
About Datameer
Datameer is an analytics lifecycle platform that helps enterprises unlock all their raw data.
The cloud-native platform was built for the complexity of large enterprises, yet it's so
easy to use that everyone from business analysts to data scientists to data architects can
collaborate on a centralized view of all their data. Without any code, teams can rapidly
integrate, transform, discover, and operationalize datasets for their projects. Datameer
breaks down data silos, gets companies ahead of their data demands, and empowers
everyone to discover insights. Datameer works with customers from every industry,
including Dell, Vodafone, Citibank, UPS, and more. Learn more at www.datameer.com.