Date post: 18-Aug-2021
Show Notes: http://www.superdatascience.com/491

SDS PODCAST

EPISODE 491:

R IN PRODUCTION

Jon: 00:00 This is episode number 491 with Veerle van Leemput,

Managing Director and Head of Data Science at Analytic

Health.

Jon: 00:12 Welcome to the SuperDataScience podcast. My

name is Jon Krohn, a chief data scientist and bestselling

author on deep learning. Each week we bring you

inspiring people and ideas to help you build a successful

career in data science. Thanks for being here today and

now let's make the complex simple.

Jon: 00:42 Welcome back to the SuperDataScience podcast. I am

absolutely delighted to have Veerle as my guest on the

program today. Hailing from the Netherlands, Veerle has

held a number of data science leadership roles at Dutch

companies. She now serves as Managing Director and

Head of Data Science at Analytic Health, a London-based

firm that builds data-centric software for the healthcare

industry. On the side, Veerle is an impressive podium

level weightlifter on the Dutch national stage.

Jon: 01:13 Beyond bonding over powerlifting, in today's episode

Veerle details for me how R is not only an option for

production software, but may in fact be the best

production option for you if data or data models are

central to your application. Specifically, Veerle runs down

for us her favorite R tools for data gathering, model

development and deployment into production systems.

Today's episode will primarily be of interest to technical

professionals like data scientists and software developers,

but we did our best to break down the technical concepts.

And we do have a lot of laughs in the episode, which

could make it appealing to anyone who enjoys a good

giggle. All right, you're ready for another awesome

episode? Let's go.

Jon: 02:00 Veerle, welcome to the program, I'm so excited to have

you on. Where in the world are you calling in from?

Veerle: 02:07 I'm calling in from Leiden in the Netherlands, it's just

below Amsterdam.

Jon: 02:12 Yeah, we need to know. Everyone wants to know how is

that in relation to Amsterdam?

Veerle: 02:18 Amsterdam is the only place in the Netherlands

apparently.

Jon: 02:23 Is there a really good football team in Leiden? I think

there is. If I heard better in that context.

Veerle: 02:28 I wouldn't know, they most of the time [crosstalk

00:02:30].

Jon: 02:32 Oh, no kidding. Did you ever play hockey?

Veerle: 02:34 Yeah. It is styling at least. No, I don't. I don't play hockey.

Jon: 02:37 Ah, you don't. I grew up in Canada, so I must play

hockey. It's a part of growing up there.

Veerle: 02:43 Oh, really?

Jon: 02:46 Yeah. So, well, I guess we won't have ice hockey to talk

about, but we do have powerlifting. So that's how you

originally came to my attention. So because we're both

data scientists, we're in each other's LinkedIn network,

however that happened. And then I think you commented

on a post. So we're recording at the beginning of July and

about a month ago... Actually, I think it was exactly six

weeks ago because I have six week weightlifting cycles.

And yesterday I again had a [inaudible 00:03:16] for my

deadlift.

Jon: 03:18 So I think six weeks ago, I posted my all-time deadlift PR,

which was 405 pounds. And probably after some initial

confusion about kilos and pounds, you commented on

the LinkedIn post. I can't remember what you said, but

then it... Oh yeah, you said something about, "Have you

ever thought about competing?" And I said, "Well, I've

done an Olympic weightlifting competition. I've never

done powerlifting and I'm definitely interested." Then I

was like, "Well, why would somebody ask this? Veerle, do

you do powerlifting?" And you said...

Veerle: 03:52 Of course. Yes, I am a powerlifter. And in fact, a couple of

weeks ago I participated in the Dutch Nationals

Powerlifting.

Jon: 04:00 No way.

Veerle: 04:01 Yes. And I came in second. So I'm vice champion

[inaudible 00:04:05]. Yeah.

Jon: 04:07 Wow. That's incredible. This is really exciting. I didn't

know that you were that into it. So, okay. So we should

let the audience know exactly what powerlifting is. So I

think there's always three movements in a traditional

powerlifting competition, right?

Veerle: 04:21 Yeah. There's three movements, there's the squats,

there's the bench press and there's the deadlift. And

those three you need to get the highest weight, and

combined it's your total and the total determines your

ranking basically.

Jon: 04:33 And so you just add up across the three, back squat so

you've got a barbell on your back, you squat to below

parallel and then back up-

Veerle: 04:42 Below parallel. Yeah. Yes.

Jon: 04:46 Exactly. Deadlift is the [crosstalk 00:04:47] one of those.

Oh yeah, bench press. So is that the order? Is always in

the same order in competition? You always do...

Veerle: 04:52 Yes.

Jon: 04:52 ... back squat, then bench press. Bench press, I think a

lot of people know that one you're laying on a bench

horizontally.

Veerle: 05:00 And you press the weight. Yeah.

Jon: 05:01 Yeah. Absolutely.

Veerle: 05:02 The important thing though, in powerlifting you need to

pause at the chest. So it's not like touch and go which

you see normally in the gym [crosstalk 00:05:08] you

have to wait for the judges to say, "Okay, press."

Jon: 05:13 Oh, really?

Veerle: 05:13 And then you can go.

Jon: 05:16 Oh, geez. That makes it a lot tougher.

Veerle: 05:18 Definitely.

Jon: 05:20 I've got a really bouncy rib cage. So that's my key to

bench press success.

Veerle: 05:23 Oh, really?

Jon: 05:25 Really bounce it off of there [inaudible 00:05:27].

Veerle: 05:28 [inaudible 00:05:28] really pausing.

Jon: 05:31 Just drop it, and catch it on the way back up. And then

the third movement is the deadlift, so that's the video that

I posted six weeks ago at time of recording. And so that

is... It's kind of the simplest idea. You've got a barbell on

the ground and you need to lift it up. You need to stand

up straight shoulders back. And of the three movements

that's the one that typically people can lift the most of.

Veerle: 05:58 Same for me.

Jon: 05:59 Yeah. It would be surprising if it was otherwise, if you

benched more than you...

Veerle: 06:02 It would be epic. If I benched more than my deadlift I

would be very, very good.

Jon: 06:11 And that's how you become the second most powerful

powerlifter in the Netherlands. So would you mind telling

us what was your combined score? What was your

combined weight across the three?

Veerle: 06:22 My combined score... Geez then I'd have to do the math

really because it's 115 kilos squats, then we had 67.5

bench and 152.5 deadlifts. And combined I think that's

324?

Jon: 06:42 I don't know, I don't have a calculator out. But...

Veerle: 06:47 A lot.

Jon: 06:47 It is a lot. And then so for people that want to do it in

pounds, you need to multiply it by 2.2 and that'll give you

the weight in pounds. And I think we can conclude Veerle

can lift a lot of weight. And so this is really cool. I didn't

know that you were so actively into it. And so what are

you doing now? So you had the big national competition

two weeks ago. Are you back in a training cycle now

training for something else?
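For listeners following along without a calculator, the total is quick to check in R. This is only a small sketch: the three lifts are the figures Veerle quotes above, the 2.2 factor is Jon's rounded kilos-to-pounds conversion, and the variable names are illustrative. The sum works out to 335 kilos rather than the 324 guessed on air.

```r
# Lifts in kilograms, as quoted in the conversation
lifts_kg <- c(squat = 115, bench = 67.5, deadlift = 152.5)

# A powerlifting total is the sum of the three lifts
total_kg <- sum(lifts_kg)

# Jon's rule of thumb: multiply kilos by 2.2 to get pounds
total_lb <- total_kg * 2.2

total_kg          # 335
round(total_lb)   # 737
```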

Veerle: 07:10 Well, actually I started to focus a bit more on Olympic

weightlifting now because I just [crosstalk 00:07:15] that

is. So, yeah, I'm now into snatches and clean and jerk.

But that's just the... Yeah, I don't know, I really liked it.

So I don't know if it's temporary yet, but I'm still a

powerlifter but now doing a sidetrack into weightlifting.

And I'm the proud owner of a fully equipped gym at home

as well, both powerlifting [inaudible 00:07:37]

weightlifting up. It's like a giant hobby. It's like getting out

of control a bit.

Jon: 07:43 I understand. I'm super fortunate to have a very well-

equipped gym across the street. It's basically unheard of,

I'd have to be absurdly wealthy to have a fully equipped

gym in my apartment in New York. That would be

incredible. Maybe that's something to aspire to. [crosstalk

00:07:58].

Veerle: 07:58 That is expensive hobby to have your own gym. Yeah.

[crosstalk 00:08:04].

Jon: 08:06 And so, yeah, so Olympic weightlifting, that's the one I'm

much more familiar with. And I've only done it once. I

competed once, and I was okay for my... I've lift a fair bit

if you don't consider weight or gender.

Veerle: 08:23 Okay. That's important [crosstalk 00:08:28].

Jon: 08:30 I know. So once you put me in my weight class, it's not so

impressive, but for the audience, there's only two

movements in Olympic weightlifting. Veerle already

mentioned them: the clean and jerk, and the

snatch. But they are more technical, I hope you don't

mind me saying [crosstalk 00:08:48].

Veerle: 08:49 Yes. It's very true. It's much more difficult really. Because

powerlifting is more brute force and weightlifting is like...

If you're one inch off, then you miss your attempt. It's

very different road. But that makes it kind of a challenge.

Jon: 09:05 It is. It's nice. It's like there's so much more on

positioning and timing, accuracy. I enjoy it a lot.

Veerle: 09:13 Yeah, me too.

Jon: 09:16 And yes, people can probably look up videos to get a

sense of how a clean and jerk and a snatch works, but it's

the same. So it's the same barbell that you use for all of

the powerlifting movements. Although, I guess technically

speaking you could have it different. You might even

given that you've such a well-equipped gym, you probably

have two different barbells.

Veerle: 09:34 Yes. I have two, women's weightlifting bar and a

powerlifting bar. Because the weight differs between bars

[inaudible 00:09:40].

Jon: 09:41 Wait, what?

Veerle: 09:43 Women do have a lighter bar than men. So the standard

bar is 20 kilos, but for women it's 15. So I have two bars.

Yeah.

Jon: 09:52 But you also have two 15 kilo bars. You have one for

powerlifting and one for Olympic.

Veerle: 09:57 No, powerlifting is 20. It's always 20.

Jon: 10:00 Oh, it's always 20. Oh, interesting. I didn't know that.

Wow. Okay. Oh, yeah, so in terms of the idea though the

barbell it looks... It's a barbell, at a distance it would look

the same whether it's powerlifting or Olympic

weightlifting. And so with a snatch, you have to in a

single movement get the bar overhead.

Jon: 10:25 So it's on the ground like with a deadlift, but then you in

a single movement, boom, it's over your head and you

have to stand up with it over your head and show control.

And the clean and jerk, you get to do it in two, so up to

your chest and then overhead. And so you can do more

weight that way. Anyway, so do you have a specific date

in mind for that, for your first [crosstalk 00:10:46]?

Veerle: 10:46 The competition? Well, perhaps in October, but yeah, it

depends. I really want to get a good snatch and then I

might be ready to get to the platform, but I'm not there

yet. I can tell you, my snatch is like basically a very

segmented lift at this point in time. It's not smooth at all,

but yeah, we learn every day. Right? You can only get

better.

Jon: 11:14 Well, very exciting. I look forward to watching the journey.

I'm going to say that again, because I just hit my mic with

my hand. I don't know if that worked. Well, very exciting

I'm looking forward to seeing how this journey unfolds for

you. And I do hope that you stay in touch, not just about

data science, but about this as well. It's really cool. And I

hope... Do you ever post about your weightlifting stuff on

LinkedIn?

Veerle: 11:37 Not on LinkedIn. No, I keep that separate. But I do have a

very dedicated Instagram account to all my lifting

[crosstalk 00:11:44].

Jon: 11:45 Oh, okay. Well maybe you can share that at the end as

well in addition to your LinkedIn details. All right. But

let's get away from Instagram style chat to LinkedIn style

chats. So yeah, so we know each other from LinkedIn.

And so yeah, you came to my attention because of this

powerlifting post that I made. And then shortly

thereafter, I noticed that on June 22nd, you gave a talk

at R-Ladies Amsterdam, and the talk was on R in

production.

Jon: 12:21 And my mind was blown. I'm constantly on this show and

in life in general saying I like R and I genuinely do. I was

an R user for years before I started using Python. And so a

lot of my statistical programming knowledge came from

using R, and I really like it. But in the last five, six years,

I don't use it very much because I started working at a

startup where we're putting models into production. And

I've always had this idea in my head, and I don't think I'm

the only one who goes around spreading this propaganda.

But there's this idea in the data science community that

especially if you're looking at putting models into

production, you need to be using Python. And well, today

you're going to tell us why I'm wrong.

Veerle: 13:18 Yes, definitely. Because you said that you worked at a

startup and needed to put models into production and

that's why you didn't use R. But let me tell you, I'm

working in a startup as well, and we have models in

production but we are using R. So it definitely is possible.

We're running a whole business on R basically. So the

stigma around, R is only for academics, and it's only for

statistical programming and it's only good enough for

quick prototyping, it's not true. And that's what I told

them.

Jon: 13:52 So let's dig into that in a second, but first give us a little

bit of context about your startup. So it's called Analytic

Health. It looks to me like it's headquartered in London.

Veerle: 14:02 Yes. Correct.

Jon: 14:04 But yeah. So tell us a bit about what the company does,

what you do there.

Veerle: 14:09 Yeah. So I'm Managing Director and Head of Data Science

at Analytic Health and together with my business partner,

Greg Mills, and our Head of Operations Jana, we develop

web applications for the healthcare sector. And what we

do is we gather and we analyze healthcare data in order

to retrieve value from the data. Because we believe that

we can accelerate innovation in healthcare by getting

insights from this data. And we gather data from the

United Kingdom, health care data on a daily basis. And

we also use internal sales data from pharmaceutical

companies. And we build web apps around it, and those

web apps are made in R, as well as the data gathering

process and the modeling process.

Jon: 14:58 Wow. This episode is brought to you by

SuperDataScience. Yes, our online membership platform

for transitioning into data science and the namesake of

the podcast itself. In the SuperDataScience platform, we

recently launched our new 99 day data scientist study

plan, a cheat sheet with week-by-week instructions to

get you started as a data scientist in as few as 15 weeks.

Each week, you complete tasks in four categories. The

first is, SuperDataScience courses to become familiar

with the technical foundations of data science. The

second is, hands-on projects to fill up your portfolio and

showcase your knowledge in your job applications. The

third is, a career toolkit with actions to help you stand

out in your job hunting. And the fourth is, additional

curated resources, such as articles, books, and podcasts

to expand your learning and stay up to date.

Jon: 15:54 To devise this curriculum we sat down with some of the

best data scientists as well as many of our most

successful students and came up with the ideal 99 day

data scientist study plan to teach you everything you

need to succeed. So you can skip the planning and simply

focus on learning. We believe the program can be

completed in 99 days, and we challenge you to do it. Are

you ready? Go to superdatascience.com/challenge,

download the 99 day study plan and use it with your

SuperDataScience subscription to get started as a data

scientist in under 100 days. And now let's get back to this

amazing episode.

Jon: 16:32 Okay. So it sounds like a really cool company and it

sounds like you have an amazing job there.

Veerle: 16:36 Yes.

Jon: 16:38 So what kinds of products do you have? I like this idea of

retrieving value from healthcare data, and it's cool that

you have these different kinds of sources. So I guess if it's

public data coming from the UK, is it from the National

Health Service from the NHS?

Veerle: 16:54 Yes. It's from the NHS, mainly. Yeah. Also from other

sources, but mainly from the NHS.

Jon: 17:01 And then also-

Veerle: 17:02 But imagine you have like 10 different data sources from

the NHS, which is great, but you need to access them all

separately. And that's the issue, getting that data from all

these different data sources and maintaining that is a

full-time job. And what we do is we basically do that. We

gather it, we combine it, we clean it, we validate it, bring

it all together and then put it into a web application to

easily access and to analyze. So we're basically doing all

this pre-work so that other people don't have to, and that

they can focus on what really matters, namely doing

enough of these things with those data instead of

gathering it.
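The gather, combine, clean and validate steps Veerle describes could look roughly like the base R sketch below. Everything in it is made up for illustration: the two data frames stand in for separate NHS extracts, and the column names and checks are assumptions, not Analytic Health's actual pipeline.

```r
# Two hypothetical extracts that arrive in slightly different shapes
source_a <- data.frame(practice = c("P01", "P02"), month = "2021-06",
                       items = c(120, 85))
source_b <- data.frame(Practice = c("p03", "P02"), Month = "2021-06",
                       Items = c("40", "85"))

# Clean: harmonise column names, codes and types
names(source_b) <- tolower(names(source_b))
source_b$practice <- toupper(source_b$practice)
source_b$items <- as.numeric(source_b$items)

# Combine, then drop rows that appear in both sources
combined <- unique(rbind(source_a, source_b))

# Validate before anything is loaded into the warehouse
stopifnot(
  !anyNA(combined),
  all(combined$items >= 0),
  !any(duplicated(combined[c("practice", "month")]))
)

combined
```

Failing loudly in the validation step means a bad extract stops the run instead of quietly reaching the web application.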

Jon: 17:44 Very cool. So this reminds me a little bit in a recent

episode, in episode 485 with Doug Eisenstein, Doug was

on the show talking about engineering data pipelines for

the financial sector. So there he was talking about... In

that case, it was people, investment managers working at

big financial companies in order to be able to make good

investment decisions, they need to have many different

data sources. It could be dozens of data sources that need

to be integrated together so that you have this one big

perspective of the situation. In that case, like the

economic situation so that you can make the right trading

decisions. So it sounds like what Analytic Health provides

is an analogous kind of system in healthcare, where you

have many data sources together. You engineer systems

so that those data sources become integrated and then

you create, I guess a user interface or an API, so that

users... Who's a typical end user of this product can

make better decisions?

Veerle: 18:51 Typical end users work in pharmaceutical companies and

are sales managers or brand managers who are trying to

optimize their supply chain processes, for example.

Jon: 19:02 Nice. That is super cool. So yeah. So tell us about how R

can do all of these aspects for us? So I guess the first

piece is going to be data gathering. So ETL processes, for

example, extraction, transformation, loading of data. How

can we do that at a production type scale in R?

Veerle: 19:29 Yeah, so imagine that we have all these data sources and

they are coming from all kinds of different things; these are not

Excel files that you're getting everywhere. So some data

comes from PDF documents, other data comes from

emails, other data comes from API endpoints. And all

this data becomes available at different times. For

example, some data is released on a monthly basis, and

other data is released on a weekly basis, or biweekly or

daily. So you need an automated process to basically

check all those data sources. And whenever there's new

data, the process needs to kick off and then start

gathering it, cleaning it, merging it and sending it back to

the database, our data warehouse where we then can well

look into the data warehouse to gather data. And a key

point here is that you need to have these processes

scheduled.

Veerle: 20:23 So you need a way where you can schedule all your ETL

processes to kick off at appropriate times. And we do that

with an R extension which is called cronR. And it's

basically a very simple R native tool that allows you to

schedule scripts or in our case whole projects to kick off

automatically. And so that it can start the data gathering

process when it's time to do that. And this cronR tool is

very simple to use because it even has an RStudio

interface. So it's basically your click-and-play solution. And

for the people who know Linux, it's obviously working on

Linux because it makes use of the crontab

functionality there. So that's a great tool where you can

actually automatically, especially if you have a server

which is always on, where you can automatically schedule all

your R jobs and also manage them.
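As a rough illustration of the scheduling Veerle describes, a job could be set up with the cronR package along these lines. This is a sketch under assumptions: the script name, job id and 06:00 schedule are invented, and a placeholder script is written to a temporary folder only so the example runs as-is. With `dry_run = TRUE` the crontab entry is previewed rather than installed.

```r
library(cronR)

# Hypothetical ETL script; a placeholder file is created so the sketch runs as-is
etl_script <- file.path(tempdir(), "nhs_etl.R")
writeLines('message("ETL run at ", Sys.time())', etl_script)

# Build the shell command that runs the script with Rscript and logs its output
cmd <- cron_rscript(etl_script)

# Preview a daily 06:00 job; dry_run = TRUE shows the entry without installing it
cron_add(
  command     = cmd,
  frequency   = "daily",
  at          = "06:00",
  id          = "nhs_etl_daily",               # illustrative job id
  description = "Daily ETL run (illustrative)",
  dry_run     = TRUE
)

# cron_ls() lists the scheduled jobs; cron_rm(id = "nhs_etl_daily") removes one
```

On a real server the script would live in a permanent project folder, and `dry_run` would be dropped once the previewed entry looks right.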

Jon: 21:20 Veerle, the tool that you've mentioned this sounds really

interesting. I haven't heard about it before. So it sounds...

Yeah, you mentioned how it builds upon crontab, which

is a familiar tool for a lot of people who are scheduling

processes on Linux systems. But this tool is called

Chrome. Is that right? Like a Google it's the same as like

Google Chrome, like the kind of metal?

Veerle: 21:40 No. Cron.

Jon: 21:42 Oh, it is Cron.

Veerle: 21:43 C-R-O-N. Yeah.

Jon: 21:44 So, it's cronR? Oh, I see, just like crontab? Okay. Okay.

Okay.

Veerle: 21:48 Yes.

Jon: 21:48 Very good.

Veerle: 21:49 Just to give you an idea about how many processes we're

talking, we have 32 ETL processes running daily. So how

are you going then to manage all these processes?

Because obviously we can't look at R all day making sure

that everything kicked off because life happens, errors

prevail, stuff goes wrong. So one thing that we also do to

manage the process is to monitor it. And we have a

beautiful data pipeline report coming into our mailbox

every day, telling us exactly what processes did kick off,

at what time, if there were errors, if there are other

noteworthy messages also dealt with R. So what we use

for this is blastula, which is a package which can help

you send emails. But you can schedule those. And we use

RStudio Connect here to deploy it on. So what we do here

is we scheduled this data pipeline report.

Jon: 22:54 Can I interrupt you for one second again, just to get the

name of the [inaudible 00:22:57]?

Veerle: 22:57 Blastula. So it's B-L-A-S-T-U-L-A.

Jon: 23:08 Blastula? Okay. I gotcha. And so that's... Yeah, it's kind

of this idea of maybe like an email blast. I don't know.

Whatever, we don't need you to figure out-

Veerle: 23:16 Yes.

Jon: 23:16 ... where this name came from.

Veerle: 23:17 It is. It is. And it makes beautiful emails. So these are not

the emails that you would expect from R. Like these techy

emails with only texts which look awful. No, these are

beautiful HTML emails with beautiful tables and graphs.

So this is coming into our mailbox every day at eight

o'clock, which we discuss then as our daily standup to

see, okay, all of these processes did they run accordingly.

And I think that is key when you are managing in

whatever language, even things in production you need to

monitor it. Because you can set up all these processes, do

it on a production scale, have many processes running on

servers, 1, 2, 3, 4, 5 servers. But you need a way to

actually make sure that everything runs accordingly. So

monitoring is key here. And I think what a lot of people

don't realize is that that stuff is also something that you

can do with R, for example with these email reports and

with organized projects.
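A daily pipeline report of the kind Veerle mentions might be composed with blastula roughly as follows. It is a sketch only: the run log, process names and wording are invented, and the real report presumably includes tables and graphs. Sending is left as a comment because it needs SMTP credentials.

```r
library(blastula)

# Hypothetical run log; in practice this would come from the ETL jobs themselves
runs <- data.frame(
  process = c("nhs_prescribing", "nhs_dispensing", "sales_upload"),
  started = c("06:00", "06:15", "06:30"),
  status  = c("ok", "ok", "error")
)

# One markdown bullet per process
lines <- sprintf("- **%s** started at %s: %s",
                 runs$process, runs$started, runs$status)

# Assemble the markdown body as a single string
body_text <- paste(
  c("## Data pipeline report",
    sprintf("%d of %d processes finished without errors.",
            sum(runs$status == "ok"), nrow(runs)),
    "",
    lines),
  collapse = "\n"
)

# compose_email() turns the markdown into an HTML email object
email <- compose_email(body = md(body_text))

# Sending requires SMTP credentials (addresses and file name are placeholders):
# smtp_send(email, to = "team@example.com", from = "reports@example.com",
#           subject = "Daily data pipeline report",
#           credentials = creds_file("smtp_creds"))
```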

Jon: 24:15 Yeah. I didn't know that. Okay. So before I interrupted

you, after you finished talking about blastula, you were

then talking about another tool that also sounded really

cool for the same kind of data quality process checking

kind of thing.

Veerle: 24:30 Yeah. So the question is how are you going to schedule

these email reports? And there we already touched a bit

on how you're going to deploy these things. And what we

use for deployment is RStudio Connect which is basically

an enterprise level tool, which you can purchase from

RStudio, which helps you to deploy your Shiny apps, your

email reports, markdown reports and APIs even. And

that's what we use for deploying everything that we have.

Jon: 25:02 Nice. So let's talk a bit more about that. So I guess I'll...

Maybe model deployment is the last step. Maybe before

we get to model deployment, we need to be talking about

model development in general. I guess you use RStudio as

your main tool?

Veerle: 25:21 Yes. We use RStudio Server; actually, it's a Pro version

which we have running on a server as well. And where we

have multiple accounts on so that we can work together

on the same projects. And yeah, we develop there, so we

develop our apps there, we develop our APIs there and we

develop our markdown reports there. And one important

thing that might be noteworthy to mention is that, in

production it's very important that you separate your

development processes from your production processes,

right? You don't want to do development work where your

production is and the other way round. So how we solve

that at a relatively small company is just to set up two

servers, on one server we have RStudio server running on

the other one we have RStudio Connect and those are

basically our development and production servers. So

whenever we develop something on our development

server with RStudio server, we then push it to RStudio

Connect and that's our live environment. And on RStudio

Connect is also the place where our customers go to, to

access our web applications.

Jon: 26:36 Nice. So I guess RStudio Connect might also make it easy

then it sounds like if RStudio Connect allows you to

deploy Shiny user interfaces. So maybe I could do it a

little bit, but you can do it better than me. Tell us a bit

about R Shiny.

Veerle: 26:55 So R Shiny is the tool to make web applications of your R

code. And R Shiny is basically very simple, you create a

user interface and you create a server part. And with a

user interface you obviously make the front-end of your

application. And with the server part, you will do the

back-end loading. And it's very easy to spin up

applications, but I also would like to correct something or

talk in favor of Shiny. Because Shiny's often said, "Okay,

Shiny is like this dashboarding tool. You can make great

dashboards with it." Yeah, sure. You can make

dashboards, but you can make fully professional

applications as well. And I think that if you're saying that

Shiny is only for dashboards, then you didn't understand

it correctly. Because you can do so much more with Shiny

than just the dashboard tool.
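The two parts Veerle describes, a user interface and a server function, fit in a few lines. The app below is a generic minimal sketch, not one of Analytic Health's applications: the slider, the histogram and all input and output ids are chosen purely for illustration.

```r
library(shiny)

# User interface: the front end the user sees in the browser
ui <- fluidPage(
  titlePanel("Minimal Shiny sketch"),
  sliderInput("n", "Number of observations", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: the back-end logic that reacts to the inputs
server <- function(input, output, session) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal draws", xlab = "Value")
  })
}

# Combine both parts into an app object; runApp(app) starts it locally
app <- shinyApp(ui = ui, server = server)
```

Moving the slider re-runs only the `renderPlot()` block, which is the reactive behaviour that lets a Shiny app grow well beyond a static dashboard.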

Veerle: 27:50 You can really make applications that you can't distinguish

from Vue and Node applications, for example. So I develop

in other languages as well. We also have Vue applications,

which is a JavaScript framework. It's not different. So I

think it's definitely, well, I can safely say that you can

have an app in production that is running purely on R.

Because your clients won't notice, in fact, your end users

don't care which tech stack something was developed in as

long as they [crosstalk 00:28:23].

Jon: 28:23 No they really don't. And yeah, I'm probably the same

kind of person who goes around saying not only that we

should only be using Python for production, but also that

if we're going to use R Shiny is for dashboards. That's

definitely something that [inaudible 00:28:42].

Veerle: 28:41 Yes.

Jon: 28:43 So [crosstalk 00:28:45].

Veerle: 28:44 You should never do that again.

Jon: 28:46 No I won't. That's why I wanted to have you on as a guest.

And yeah, you're changing my life here. Now I can go

back to R which is all I wanted to be doing all along. And I

think a really cool idea here, based on my experience I've

built R Shiny apps, mostly for dashboards. But I've done

that and I don't have expertise in HTML, CSS, JavaScript.

Really, I'm quite bad at those things. And I can make a

fully functional application in R Shiny. So I think that

there's probably a really cool opportunity here for a lot of

listeners to now with the conversation that we've had

today. If they have some experience with R or even if they

don't. So if they are doing their data science with another

tool like Python but they want to be making apps, now it

seems like the most obvious thing to be doing is learning

R and using R Shiny to develop and deploy those apps.

Because the way that you are thinking about your code is

going to be a lot more similar to how you do it in Python

than, say, learning JavaScript or HTML and

CSS. So that's really cool. There's a huge opportunity

there for...

Veerle: 30:09 And the beauty of... Because a web application is exactly

those three things, HTML, CSS, JavaScript. The beauty

about Shiny is it has all these amazing wrappers around

JavaScript libraries, which make all the cool JavaScript

stuff easily available for you as an R user. And that's the

thing that I love about Shiny, the development is so

amazing there. So every day new stuff comes out, which

allows you to make these awesome applications. And in

order to use the fundamentals of Shiny, you don't need to

know JavaScript yourself because you have these nice

wrappers around it. And obviously if you want to do more

customization later on, it would be handy if you know

JavaScript. Because then you can do even more amazing

stuff, but for the basics you don't need it to get started. I

started somewhere as well, like just building a Shiny app

while I had a bit of R experience. And yeah, you learn

along the way, but that's the case with every tool

that you choose.

Jon: 31:14 Nice. Super cool, Veerle. This is really exciting, I think

that, yeah, not only for me, but for a lot of listeners, I

think this has been a podcast that has potentially changed

their life. Not only will people be doing more powerlifting

and more Olympic lifting, but they'll also be using R in

production. So [crosstalk 00:31:35] are there any other

particular tools in your ecosystem that you recommend

we check out either for model development or

deployment? I think we've talked a little bit more about

ETL already, but for development we've focused mostly on

RStudio Server. Maybe there are particular packages that

you use a lot that you highly recommend in R?

Veerle: 31:55 Yeah. What I would recommend for Shiny apps is

checking out the Golem package. The Golem package is

basically a nice [inaudible 00:32:03] framework even, or

way to organize your project in order to make it a

production-grade Shiny application. So it provides you

basically with the infrastructure you need to set up a

professional application, which is also very scalable. So I

would definitely recommend to check out the Golem

package here.

Jon: 32:24 How do you spell that? That's like the Lord of the Rings

Gollum?

Veerle: 32:28 Yes.

Jon: 32:29 G-O-L-U-M?

Veerle: 32:32 Yeah, only E-M at the end, but yeah.

Jon: 32:37 Oh, yeah, yeah.

Veerle: 32:41 Yeah, but that's a great tool. And yeah, as I said, check

out blastula for email reports, so that you can keep

yourself updated about what's happening in your

processes. And one other tip I can give you when you are

setting up R in production is to make sure that when you

set up a new project, that you choose a structure and

make it the same across all projects. You can imagine

with us having 32 processes running it would be a pain if

we go from one project to the other and the project looks

totally different every time. So make sure that you have a

base structure, and you can even make a package out of

it on your own that can easily create a structure for you,

but make sure that you do it standardized.
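
Veerle's suggestion of a helper that stamps out the same layout for every project could be sketched in base R as follows. The function name `create_project`, the folder names, and the `run.R` entry script are hypothetical; the episode does not say what Analytic Health's own structure looks like.

```r
# Create a standard project skeleton at `path`.
# The folder names below are illustrative assumptions, not a convention
# taken from the episode.
create_project <- function(path) {
  dirs <- c("R", "data", "logs", "tests")
  for (d in dirs) {
    dir.create(file.path(path, d), recursive = TRUE, showWarnings = FALSE)
  }
  # Empty entry-point script that a scheduler could call.
  file.create(file.path(path, "run.R"))
  invisible(path)
}

# Usage: build a demo project inside the session's temporary directory.
demo_path <- file.path(tempdir(), "demo_project")
create_project(demo_path)
```

Wrapping a function like this in a small internal package is one way to make the structure available to the whole team with a single call.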

Jon: 33:26 Cool. How about the Tidyverse? Are you a Tidyverse fan or

an [crosstalk 00:33:32].

Veerle: 33:32 I'm a Tidyverse lover. I'm a Tidyverse fan, yeah. But the

thing is my business partner with whom I work very

closely obviously is not.

Jon: 33:41 No.

Veerle: 33:41 He's a [inaudible 00:33:42]. Oh, it's awful. He's a data.table

lover basically. So we have this clash of data.table and...

Jon: 33:52 And [inaudible 00:33:53]?

Veerle: 33:53 Dplyr. Yeah. [crosstalk 00:33:56]. But did you know there

is a great package as well, which is dtplyr, which basically

combines dplyr and data.table together?

Jon: 34:04 I did not know that.

Veerle: 34:07 So there you can make use of the advantages of the

speed of data.table. Because that's why I have to be

[inaudible 00:34:12]. In most of our Shiny apps, we use

data.table because of its speed and we want our apps to

be as fast as possible. So we prefer data.table over dplyr.

But the dplyr syntax is just beautiful. So what you can do

now with this package dtplyr, you can actually use the

dplyr syntax, but under the hood it uses data.table. And

you have a bit of overhead there, so it's not as fast as

using pure data.table, but it's considerably faster than

dplyr [inaudible 00:34:42].
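
As a rough illustration of what Veerle describes, the sketch below writes dplyr verbs against a data.table through dtplyr's `lazy_dt()`. The `sales` table and its columns are made-up example data, not anything from the episode.

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Small made-up table for illustration.
sales <- data.table(
  region = c("north", "south", "north", "south"),
  amount = c(10, 20, 30, 40)
)

# lazy_dt() wraps the data.table so that dplyr verbs are recorded and
# translated to data.table code; nothing runs until the result is
# collected, here with as_tibble().
result <- sales %>%
  lazy_dt() %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  arrange(region) %>%
  as_tibble()
```

Calling `show_query()` on the lazy pipeline, before `as_tibble()`, prints the data.table expression that dtplyr generated, which is a handy way to see the translation overhead Veerle mentions.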

Jon: 34:44 Yeah. And so I guess we should back up a little bit and let

listeners know a little bit more about the

Tidyverse. So the Tidyverse, I think it was created by

Hadley Wickham. He's definitely the biggest figure in the

space. So Hadley Wickham was actually on the

SuperDataScience show last year, so episode 337. And

Hadley Wickham works at RStudio which is the biggest

player in commercial development of R software. But they

open source tons and tons and tons of things. And so

we've been talking about RStudio Server and RStudio

Connect, which are tools that you can purchase from

RStudio. But there's a free IDE, which I've used for over a

decade, that is, I think, the leading IDE for developing

within an R environment. Same thing with R Shiny, which

we've talked about a lot: it's free, it's completely open

source. And dplyr... So dplyr is a part of this suite of R

packages in the Tidyverse. And the reason why it's called

the Tidyverse is because all of these packages are based

on the idea of having tidy data. Which is a particular way

of structuring your data. And then it allows you to pipe

functions. So you can take a given data frame, what's

called a tibble in the Tidyverse, right?

Veerle: 36:29 Yeah. Yeah.

Jon: 36:31 And then you take, like, a noun, you take an object,

and then you can pass it through a series of functions, a

series of verbs. And so you can form a pipeline,

basically, of operations, and it is such an intuitive and

easy way to write code. I absolutely love it. And I must

admit, Veerle, that it is something that I miss about R.

Veerle: 36:56 Yeah, I can imagine. It's awesome.
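
The noun-then-verbs pattern Jon describes can be sketched with dplyr on R's built-in `mtcars` dataset. The choice of dataset and the column name `mean_mpg` are assumptions for illustration only.

```r
library(dplyr)

# Start from the "noun" (a data frame), then pipe it through "verbs".
summary_tbl <- mtcars %>%
  as_tibble() %>%                        # convert to a tibble
  filter(cyl %in% c(4, 6)) %>%           # keep 4- and 6-cylinder cars
  group_by(cyl) %>%                      # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))                # most fuel-efficient group first
```

Each step takes the output of the previous one as its first argument, so the code reads top to bottom in the order the operations happen.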

Jon: 37:01 Okay, Veerle, this is all super cool if we are building a

user interface, something that somebody can click and

point around with. But what if our production application

needs to be an API that people can program against and

make calls against? Is there a tool that we can use in that

case?

Veerle: 37:20 Yes, definitely. You have the plumber package in R, and

the plumber package allows you to turn your R code

into an API endpoint. So your models can go directly in

there. And the nice thing about that is that it's taking

like three extra lines of code to turn your script into an

API, and then you can deploy it basically anywhere, you

can dockerize it and put it somewhere into AWS or Azure.

Or you can deploy to RStudio Connect. And that's a great

way to actually productionize your R code.
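
A minimal sketch of what such a plumber file might look like is shown below. The toy linear model, the helper name `predict_mpg`, the `/predict` route, and the file name `plumber.R` are all assumptions for illustration; a real service would load a trained model from disk instead of fitting one at start-up.

```r
# plumber.R
# Toy model fitted on R's built-in mtcars data (illustrative only).
model <- lm(mpg ~ wt, data = mtcars)

# Plain R function holding the prediction logic, so it can be tested
# without starting a web server.
predict_mpg <- function(wt) {
  new_data <- data.frame(wt = as.numeric(wt))
  list(mpg = unname(predict(model, newdata = new_data)))
}

# The "#*" comments are the plumber annotations that expose the
# function below as a GET endpoint.
#* Predict miles per gallon from car weight (in 1000 lbs)
#* @param wt Car weight
#* @get /predict
function(wt) {
  predict_mpg(wt)
}

# To serve the API locally (requires the plumber package):
# plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```

Query parameters arrive as strings, which is why the helper converts `wt` with `as.numeric()` before predicting.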

Jon: 37:55 Cool. Well, we've learned a lot. You've completely changed

my perspective on R as a tool we could be using in

production. I can't wait to check out blastula, cronR,

dtplyr. And I'm sure there's lots of audience members out

there who can't wait to get started as well. So do you

happen to have a book recommendation that might be

related to R? It might be unrelated to R, it doesn't matter

either way, but...

Veerle: 38:27 I can give a recommendation for R and not for R. But for

R I would definitely recommend JavaScript for R by John

Coene. It's a great book on how you can use

JavaScript and you can access it online, so it's pretty

cool. And as a non-R-related book, I would say The Art of

Thinking Clearly by Rolf Dobelli. It has these fairly short

chapters which you can read in like two minutes and it'll

give you fresh insights on life in general, which is also

very useful to use in a business, for example.

Jon: 39:05 That sounds cool. I want to check that out. That sounds

like... So something with such short chapters could be

ideal for a commute or just waiting in line for something

briefly. It sounds like the perfect way to get some extra

philosophy in life.

Veerle: 39:25 Yeah. At our company we have a weekly meeting where

each of us discusses one chapter of this book, and then we

talk about what it can mean for the business. Just to do

something other than programming and getting involved in

day-to-day business. We step outside once per week and

talk about these general things.

Jon: 39:43 Beautiful. All right. I'm looking forward to checking that

out. And then so I've learned so much from you in this

episode. I'm sure lots of people have, and I'm sure lots of

listeners are wondering how they can keep up with your

latest thoughts perhaps on the art of thinking clearly, but

maybe also on R. So how should people follow you?

Veerle: 40:03 They can follow me on LinkedIn. I regularly share tips

about R and also things that we're doing in the business.

So give me a follow if you want to know more.

Jon: 40:14 Nice. All right. So we will definitely include that in the

show notes, making it easy for people to follow you.

Veerle, this has been such a fun episode. I've really loved

having you on, and hopefully we can have you on again

sometime soon.

Veerle: 40:26 Thank you, Jon.

Jon: 40:34 That was a lot of fun and wow, did Veerle ever blow my

narrow Python centric mind. In today's episode, Veerle

filled us in on lots of specific tools for using R as your

production software language. Specifically, she mentioned

the R Tidyverse as a tidy way to manage your data and

data operations, particularly dplyr for piping operations

into each other sequentially and intuitively. She told us

about dtplyr for obtaining dplyr style piping with

computational efficiency that is near that of R's data.table. She

talked about cronR for scheduling processes to run

automatically. Blastula for beautiful automated emails.

RStudio Server for model development in the cloud. R

Shiny for designing user interfaces of any ilk. RStudio

Connect for deploying R Shiny apps and golem for

professional grade scalability of those Shiny apps. And

finally, she told us about R plumber for creating API

endpoints.

Jon: 41:38 As always, you can get all the show notes including the

transcript for this episode, the video recording, any

materials mentioned on the show, such as the list of R

packages that I just rifled through and the URL for

Veerle's LinkedIn profile, as well as my own social media

profiles at superdatascience.com/491. That's

superdatascience.com/491. If you enjoyed this episode I'd

of course greatly appreciate it if you left a review on your

favorite podcasting app or on the SuperDataScience

YouTube channel where we have a video version of this

episode. To let me know your thoughts on the episode,

please do feel welcome to add me on LinkedIn or on

Twitter, and then tag me in a post to let me know your

thoughts on this episode. Your feedback is invaluable for

figuring out what topics we should cover next.

Jon: 42:24 Since this is a free podcast if you're looking for a free way

to help me out, I'd be very grateful if you left a rating of

my book, Deep Learning Illustrated, on Amazon or

Goodreads. Give some videos on my YouTube channel a

thumbs up or subscribe to my free, content-rich

newsletter on jonkrohn.com. To support the

SuperDataScience company that kindly funds the

management, editing and production of this podcast

without any annoying third-party ads, you could create a

free login to their learning platform at

superdatascience.com. You can check out the 99 days to

your first data science job challenge at

superdatascience.com/challenge, or you could consider

buying a usually pretty damn cheap Udemy course

published by Ligency, an affiliate of SuperDataScience,

such as my Mathematical Foundations of Machine

Learning Course.

Jon: 43:11 All right, thanks to Ivana, Jaime, Mario and JP on the

SuperDataScience team for managing and producing

another amazing episode today. Keep on rocking it out

there, folks, and I'm looking forward to enjoying another

round of the SuperDataScience podcast with you very

soon.

