Show Notes: http://www.superdatascience.com/491
SDS PODCAST
EPISODE 491:
R IN PRODUCTION
Jon: 00:00 This is episode number 491 with Veerle van Leemput,
Managing Director and Head of Data Science at Analytic
Health.
Jon: 00:12 Welcome to the SuperDataScience podcast. My
name is Jon Krohn, a chief data scientist and bestselling
author on deep learning. Each week we bring you
inspiring people and ideas to help you build a successful
career in data science. Thanks for being here today and
now let's make the complex simple.
Jon: 00:42 Welcome back to the SuperDataScience podcast. I am
absolutely delighted to have Veerle as my guest on the
program today. Hailing from the Netherlands, Veerle has
held a number of data science leadership roles at Dutch
companies. She now serves as Managing Director and
Head of Data Science at Analytic Health, a London-based
firm that builds data-centric software for the healthcare
industry. On the side, Veerle is an impressive podium
level weightlifter on the Dutch national stage.
Jon: 01:13 Beyond bonding over powerlifting, in today's episode
Veerle details for me how R is not only an option for
production software, but may in fact be the best
production option for you if data or data models are
central to your application. Specifically, Veerle runs down
for us her favorite R tools for data gathering, model
development and deployment into production systems.
Today's episode will primarily be of interest to technical
professionals like data scientists and software developers,
but we did our best to break down the technical concepts.
And we do have a lot of laughs in the episode, which
could make it appealing to anyone who enjoys a good
giggle. All right, you're ready for another awesome
episode? Let's go.
Jon: 02:00 Veerle, welcome to the program, I'm so excited to have
you on. Where in the world are you calling in from?
Veerle: 02:07 I'm calling in from Leiden in the Netherlands, it's just
below Amsterdam.
Jon: 02:12 Yeah, we need to know. Everyone wants to know how is
that in relation to Amsterdam?
Veerle: 02:18 Amsterdam is the only place in the Netherlands
apparently.
Jon: 02:23 Is there a really good football team in Leiden? I think
there is. If I heard better in that context.
Veerle: 02:28 I wouldn't know, they most of the time [crosstalk
00:02:30].
Jon: 02:32 Oh, no kidding. Did you ever play hockey?
Veerle: 02:34 Yeah. It is styling at least. No, I don't. I don't play hockey.
Jon: 02:37 Ah, you don't. I grew up in Canada, so I must play
hockey. It's a part of growing up there.
Veerle: 02:43 Oh, really?
Jon: 02:46 Yeah. So, well, I guess we won't have ice hockey to talk
about, but we do have powerlifting. So that's how you
originally came to my attention. So because we're both
data scientists, we're in each other's LinkedIn network,
however that happened. And then I think you commented
on a post. So we're recording at the beginning of July and
about a month ago... Actually, I think it was exactly six
weeks ago because I have six week weightlifting cycles.
And yesterday I again had a [inaudible 00:03:16] for my
deadlift.
Jon: 03:18 So I think six weeks ago, I posted my all-time deadlift PR,
which was 405 pounds. And probably after some initial
confusion about kilos and pounds, you commented on
the LinkedIn post. I can't remember what you said, but
then it... Oh yeah, you said something about, "Have you
ever thought about competing?" And I said, "Well, I've
done an Olympic weightlifting competition. I've never
done powerlifting and I'm definitely interested." Then I
was like, "Well, why would somebody ask this? Veerle, do
you do powerlifting?" And you said...
Veerle: 03:52 Of course. Yes, I am a powerlifter. And in fact, a couple of
weeks ago I participated in the Dutch Nationals
Powerlifting.
Jon: 04:00 No way.
Veerle: 04:01 Yes. And I came in second. So I'm vice champion
[inaudible 00:04:05]. Yeah.
Jon: 04:07 Wow. That's incredible. This is really exciting. I didn't
know that you were that into it. So, okay. So we should
let the audience know exactly what powerlifting is. So I
think there's always three movements in a traditional
powerlifting competition, right?
Veerle: 04:21 Yeah. There's three movements, there's the squats,
there's the bench press and there's the deadlift. And
those three you need to get the highest weight, and
combined it's your total and the total determines your
ranking basically.
Jon: 04:33 And so you just add up across the three, back squat so
you've got a barbell on your back, you squat to below
parallel and then back up-
Veerle: 04:42 Below parallel. Yeah. Yes.
Jon: 04:46 Exactly. Deadlift is the [crosstalk 00:04:47] one of those.
Oh yeah, bench press. So is that the order? Is it always in
the same order in competition? You always do...
Veerle: 04:52 Yes.
Jon: 04:52 ... back squat, then bench press. Bench press, I think a
lot of people know that one you're laying on a bench
horizontally.
Veerle: 05:00 And you press the weight. Yeah.
Jon: 05:01 Yeah. Absolutely.
Veerle: 05:02 The important thing though, in powerlifting you need to
pause at the chest. So it's not like touch and go which
you see normally in the gym [crosstalk 00:05:08] you
have to wait for the judges to say, "Okay, press."
Jon: 05:13 Oh, really?
Veerle: 05:13 And then you can go.
Jon: 05:16 Oh, geez. That makes it a lot tougher.
Veerle: 05:18 Definitely.
Jon: 05:20 I've got a really bouncy rib cage. So that's my key to
bench press success.
Veerle: 05:23 Oh, really?
Jon: 05:25 Really bounce it off of there [inaudible 00:05:27].
Veerle: 05:28 [inaudible 00:05:28] really pausing.
Jon: 05:31 Just drop it, and catch it on the way back up. And then
the third movement is the deadlift, so that's the video that
I posted six weeks ago at time of recording. And so that
is... It's kind of the simplest idea. You've got a barbell on
the ground and you need to lift it up. You need to stand
up straight, shoulders back. And of the three movements
that's the one that typically people can lift the most of.
Veerle: 05:58 Same for me.
Jon: 05:59 Yeah. It would be surprising if it was otherwise, if you
benched more than you...
Veerle: 06:02 It would be epic. If I benched more than my deadlift I
would be very, very good.
Jon: 06:11 And that's how you become the second most powerful
powerlifter in the Netherlands. So would you mind telling
us what was your combined score? What was your
combined weight across the three?
Veerle: 06:22 My combined score... Geez then I'd have to do the math
really because it's 115 kilos squats, then we had 67.5
bench and 152.5 deadlifts. And combined I think that's
324?
Jon: 06:42 I don't know, I don't have a calculator out. But...
Veerle: 06:47 A lot.
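As a quick check of the numbers above, the three lifts Veerle lists can be summed in a few lines of R. This is only an illustrative sketch: the variable names are made up for the example, and the pounds figure uses the rough factor of 2.2 pounds per kilo.

```r
# Best lifts quoted in the conversation, in kilos.
lifts_kg <- c(squat = 115, bench = 67.5, deadlift = 152.5)

# A powerlifting total is the sum of the three lifts.
total_kg <- sum(lifts_kg)

# Approximate conversion to pounds (1 kg is roughly 2.2 lb).
total_lb <- total_kg * 2.2

print(total_kg)         # 335
print(round(total_lb))  # 737
```

So the total works out to 335 kilos, slightly more than the 324 estimated on air.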
Jon: 06:47 It is a lot. And then so for people that want to do it in
pounds, you need to multiply it by 2.2 and that'll give you
the weight in pounds. And I think we can conclude Veerle
can lift a lot of weight. And so this is really cool. I didn't
know that you were so actively into it. And so what are
you doing now? So you had the big national competition
two weeks ago. Are you back in a training cycle now
training for something else?
Veerle: 07:10 Well, actually I started to focus a bit more on Olympic
weightlifting now because I just [crosstalk 00:07:15] that
is. So, yeah, I'm now into snatches and clean and jerk.
But that's just the... Yeah, I don't know, I really liked it.
So I don't know if it's temporary yet, but I'm still a
powerlifter but now doing a sidetrack into weightlifting.
And I'm the proud owner of a fully equipped gym at home
as well, both powerlifting [inaudible 00:07:37]
weightlifting up. It's like a giant hobby. It's like getting out
of control a bit.
Jon: 07:43 I understand. I'm super fortunate to have a very well-
equipped gym across the street. It's basically unheard of,
I'd have to be absurdly wealthy to have a fully equipped
gym in my apartment in New York. That would be
incredible. Maybe that's something to aspire to. [crosstalk
00:07:58].
Veerle: 07:58 That is an expensive hobby, to have your own gym. Yeah.
[crosstalk 00:08:04].
Jon: 08:06 And so, yeah, so Olympic weightlifting, that's the one I'm
much more familiar with. And I've only done it once. I
competed once, and I was okay for my... I lift a fair bit
if you don't consider weight or gender.
Veerle: 08:23 Okay. That's important [crosstalk 00:08:28].
Jon: 08:30 I know. So once you put me in my weight class, it's not so
impressive, but for the audience, there's only two
movements in Olympic weightlifting. Veerle already
mentioned them: the clean and jerk, and the
snatch. But they are more technical, I hope you don't
mind me saying [crosstalk 00:08:48].
Veerle: 08:49 Yes. It's very true. It's much more difficult really. Because
powerlifting is more brute force and weightlifting is like...
If you're one inch off, then you miss your attempt. It's
a very different road. But that makes it kind of a challenge.
Jon: 09:05 It is. It's nice. It's like there's so much more on
positioning and timing, accuracy. I enjoy it a lot.
Veerle: 09:13 Yeah, me too.
Jon: 09:16 And yes, people can probably look up videos to get a
sense of how a clean and jerk and a snatch works, but it's
the same. So it's the same barbell that you use for all of
the powerlifting movements. Although, I guess technically
speaking you could have it different. You might even
given that you've such a well-equipped gym, you probably
have two different barbells.
Veerle: 09:34 Yes. I have two, a women's weightlifting bar and a
powerlifting bar. Because the weight differs between bars
[inaudible 00:09:40].
Jon: 09:41 Wait, what?
Veerle: 09:43 Women do have a lighter bar than men. So the standard
bar is 20 kilos, but for women it's 15. So I have two bars.
Yeah.
Jon: 09:52 But you also have two 15 kilo bars. You have one for
powerlifting and one for Olympic.
Veerle: 09:57 No, powerlifting is 20. It's always 20.
Jon: 10:00 Oh, it's always 20. Oh, interesting. I didn't know that.
Wow. Okay. Oh, yeah, so in terms of the idea though the
barbell it looks... It's a barbell, at a distance it would look
the same whether it's powerlifting or Olympic
weightlifting. And so with a snatch, you have to in a
single movement get the bar overhead.
Jon: 10:25 So it's on the ground like with a deadlift, but then you in
a single movement, boom, it's over your head and you
have to stand up with it over your head and show control.
And the clean and jerk, you get to do it in two, so up to
your chest and then overhead. And so you can do more
weight that way. Anyway, so do you have a specific date
in mind for that, for your first [crosstalk 00:10:46]?
Veerle: 10:46 The competition? Well, perhaps in October, but yeah, it
depends. I really want to get a good snatch and then I
might be ready to get to the platform, but I'm not there
yet. I can tell you, my snatch is like basically a very
segmented lift at this point in time. It's not smooth at all,
but yeah, we learn every day. Right? You can only get
better.
Jon: 11:14 Well, very exciting. I look forward to watching the journey.
I'm going to say that again, because I just hit my mic with
my hand. I don't know if that worked. Well, very exciting
I'm looking forward to seeing how this journey unfolds for
you. And I do hope that you stay in touch, not just about
data science, but about this as well. It's really cool. And I
hope... Do you ever post about your weightlifting stuff on
LinkedIn?
Veerle: 11:37 Not on LinkedIn. No, I keep that separate. But I do have a
very dedicated Instagram account to all my lifting
[crosstalk 00:11:44].
Jon: 11:45 Oh, okay. Well maybe you can share that at the end as
well in addition to your LinkedIn details. All right. But
let's get away from Instagram style chat to LinkedIn style
chats. So yeah, so we know each other from LinkedIn.
And so yeah, you came to my attention because of this
powerlifting post that I made. And then shortly
thereafter, I noticed that on June 22nd, you gave a talk
at the R-Ladies of Amsterdam and the talk was on R in
production.
Jon: 12:21 And my mind was blown. I'm constantly on this show and
in life in general saying I like R and I genuinely do. I was
an R user for years before I started using Python. And so a
lot of my statistical programming knowledge came from
using R, and I really like it. But in the last five, six years,
I don't use it very much because I started working at a
startup where we're putting models into production. And
I've always had this idea in my head, and I don't think I'm
the only one who goes around spreading this propaganda.
But there's this idea in the data science community that
especially if you're looking at putting models into
production, you need to be using Python. And well, today
you're going to tell us why I'm wrong.
Veerle: 13:18 Yes, definitely. Because you said that you worked at a
startup and needed to put models into production and
that's why you didn't use R. But let me tell you, I'm
working in a startup as well, and we have models in
production but we are using R. So it definitely is possible.
We're running a whole business on R basically. So the
stigma that R is only for academics, and it's only for
statistical programming, and it's only good enough for
quick prototyping, it's not true. And that's what I told
them.
Jon: 13:52 So let's dig into that in a second, but first give us a little
bit of context about your startup. So it's called Analytic
Health. It looks to me like it's headquartered in London.
Veerle: 14:02 Yes. Correct.
Jon: 14:04 But yeah. So tell us a bit about what the company does,
what you do there.
Veerle: 14:09 Yeah. So I'm Managing Director and Head of Data Science
at Analytic Health and together with my business partner,
Greg Mills, and our Head of Operations Jana, we develop
web applications for the healthcare sector. And what we
do is we gather and we analyze healthcare data in order
to retrieve value from the data. Because we believe that
we can accelerate innovation in healthcare by getting
insights from this data. And we gather healthcare data
from the United Kingdom on a daily basis. And
we also use internal sales data from pharmaceutical
companies. And we build web apps around it, and those
web apps are made in R, as well as the data gathering
process and the modeling process.
Jon: 14:58 Wow. This episode is brought to you by
SuperDataScience. Yes, our online membership platform
for transitioning into data science and the namesake of
the podcast itself. In the SuperDataScience platform, we
recently launched our new 99 day data scientist study
plan, a cheat sheet with week-by-week instructions to
get you started as a data scientist in as few as 15 weeks.
Each week, you complete tasks in four categories. The
first is, SuperDataScience courses to become familiar
with the technical foundations of data science. The
second is, hands-on projects to fill up your portfolio and
showcase your knowledge in your job applications. The
third is, a career toolkit with actions to help you stand
out in your job hunting. And the fourth is, additional
curated resources, such as articles, books, and podcasts
to expand your learning and stay up to date.
Jon: 15:54 To devise this curriculum we sat down with some of the
best data scientists as well as many of our most
successful students and came up with the ideal 99 day
data scientist study plan to teach you everything you
need to succeed. So you can skip the planning and simply
focus on learning. We believe the program can be
completed in 99 days, and we challenge you to do it. Are
you ready? Go to superdatascience.com/challenge,
download the 99 day study plan and use it with your
SuperDataScience subscription to get started as a data
scientist in under 100 days. And now let's get back to this
amazing episode.
Jon: 16:32 Okay. So it sounds like a really cool company and it
sounds like you have an amazing job there.
Veerle: 16:36 Yes.
Jon: 16:38 So what kinds of products do you have? I like this idea of
retrieving value from healthcare data, and it's cool that
you have these different kinds of sources. So I guess if it's
public data coming from the UK, is it from the National
Health Service from the NHS?
Veerle: 16:54 Yes. It's from the NHS, mainly. Yeah. Also from other
sources, but mainly from the NHS.
Jon: 17:01 And then also-
Veerle: 17:02 But imagine you have like 10 different data sources from
the NHS, which is great, but you need to access them all
separately. And that's the issue, getting that data from all
these different data sources and maintaining that is a
full-time job. And what we do is we basically do that. We
gather it, we combine it, we clean it, we validate it, bring
it all together and then put it into a web application to
easily access and to analyze. So we're basically doing all
this pre-work so that other people don't have to, and that
they can focus on what really matters, namely doing
enough of these things with those data instead of
gathering it.
Jon: 17:44 Very cool. So this reminds me a little bit in a recent
episode, in episode 485 with Doug Eisenstein, Doug was
on the show talking about engineering data pipelines for
the financial sector. So there he was talking about... In
that case, it was people, investment managers working at
big financial companies in order to be able to make good
investment decisions, they need to have many different
data sources. It could be dozens of data sources that need
to be integrated together so that you have this one big
perspective of the situation. In that case, like the
economic situation so that you can make the right trading
decisions. So it sounds like what Analytic Health provides
is an analogous kind of system in healthcare, where you
have many data sources together. You engineer systems
so that those data sources become integrated and then
you create, I guess a user interface or an API, so that
users... Who's a typical end user of this product? ...can
make better decisions.
Veerle: 18:51 Typical end users work in pharmaceutical companies and
are sales managers or brand managers who are trying to
optimize their supply chain processes, for example.
Jon: 19:02 Nice. That is super cool. So yeah. So tell us about how R
can do all of these aspects for us? So I guess the first
piece is going to be data gathering. So ETL processes, for
example, extraction, transformation, loading of data. How
can we do that at a production type scale in R?
Veerle: 19:29 Yeah, so imagine that we have all these data sources and
they are coming from all kinds of different things. These are
not Excel files that you're getting everywhere. So some data
comes from PDF documents, other data comes from
emails, other data comes from API endpoints. And all
this data becomes available at different times. For
example, some data is released on a monthly basis, and
other data is released on a weekly basis, or biweekly or
daily. So you need an automated process to basically
check all those data sources. And whenever there's new
data, the process needs to kick off and then start
gathering it, cleaning it, merging it and sending it back to
the database, our data warehouse where we then can well
look into the data warehouse to gather data. And a key
point here is that you need to have these processes
scheduled.
Veerle: 20:23 So you need a way where you can schedule all your ETL
processes to kick off at appropriate times. And we do that
with an R extension which is called cronR. And it's
basically a very simple R native tool that allows you to
schedule scripts or in our case whole projects to kick off
automatically, so that it can start the data gathering
process when it's time to do that. And this cronR tool is
very simple to use because it even has an RStudio
interface. So it is basically your click-and-play solution. And
for the people who know Linux, it's obviously working on
Linux because it makes use of crontab, that
functionality there. So that's a great tool, especially if you
have a server which is always on, where you can
automatically schedule all your R jobs and also manage
them.
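To make the scheduling step concrete, here is a minimal sketch of how a script might be registered with cronR. It is not Analytic Health's actual setup: the script contents, the job id and the 6AM schedule are assumptions for illustration, and dry_run = TRUE is used so that the crontab entry is only previewed rather than installed. It assumes a Linux or macOS machine where cron is available.

```r
library(cronR)

# Stand-in ETL script written to a temporary file so the sketch is
# self-contained; in practice this would be a real script on the server.
etl_script <- file.path(tempdir(), "etl_job.R")
writeLines('message("ETL run started at ", Sys.time())', etl_script)

# Build the shell command that runs the script via Rscript and
# redirects its output to a log file.
cmd <- cron_rscript(etl_script)

# Describe a daily job. The id, description and time are hypothetical.
# dry_run = TRUE previews the entry instead of installing it.
cron_add(
  command     = cmd,
  frequency   = "daily",
  at          = "6AM",
  id          = "etl_job",
  description = "Hypothetical daily ETL run",
  dry_run     = TRUE
)
```

Dropping dry_run would install the job, after which cron_ls() lists what is scheduled and cron_rm(id = "etl_job") removes it again.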
Jon: 21:20 Veerle, the tool that you've mentioned this sounds really
interesting. I haven't heard about it before. So it sounds...
Yeah, you mentioned how it builds upon crontab, which
is a familiar tool for a lot of people who are scheduling
processes on Linux systems. But this tool is called
Chrome. Is that right? Like Google, it's the same as
Google Chrome, like the kind of metal?
Veerle: 21:40 No. Cron.
Jon: 21:42 Oh, it is Cron.
Veerle: 21:43 C-R-O-N. Yeah.
Jon: 21:44 So, it's cronR? Oh, I see, just like crontab? Okay. Okay.
Okay.
Veerle: 21:48 Yes.
Jon: 21:48 Very good.
Veerle: 21:49 Just to give you an idea about how many processes we're
talking, we have 32 ETL processes running daily. So how
are you going then to manage all these processes?
Because obviously we can't look at R all day making sure
that everything kicked off because life happens, errors
prevail, stuff goes wrong. So one thing that we also do to
manage the process is to monitor it. And we have a
beautiful data pipeline report coming into our mailbox
every day, telling us exactly what processes did kick off,
at what time, if there were errors, if there are other
noteworthy messages. That is also done with R. So what we
use for this is blastula, which is a package which can help
you send emails. But you can schedule those. And we use
RStudio Connect here to deploy it on. So what we do here
is we schedule this data pipeline report.
Jon: 22:54 Can I interrupt you for one second again, just to get the
name of the [inaudible 00:22:57]?
Veerle: 22:57 Blastula. So it's B-L-A-S-T-U-L-A.
Jon: 23:08 Blastula? Okay. I gotcha. And so that's... Yeah, it's kind
of this idea of maybe like an email blast. I don't know.
Whatever, we don't need you to figure out-
Veerle: 23:16 Yes.
Jon: 23:16 ... where this name came from.
Veerle: 23:17 It is. It is. And it makes beautiful emails. So these are not
the emails that you would expect from R. Like these techy
emails with only text which look awful. No, these are
beautiful HTML emails with beautiful tables and graphs.
So this is coming into our mailbox every day at eight
o'clock, which we discuss then as our daily standup to
see, okay, all of these processes did they run accordingly.
And I think that is key when you are managing
things in production, in whatever language: you need to
monitor it. Because you can set up all these processes, do
it on a production scale, have many processes running on
servers, 1, 2, 3, 4, 5 servers. But you need a way to
actually make sure that everything runs accordingly. So
monitoring is key here. And I think what a lot of people
don't realize is that that stuff is also something that you
can do with R, for example, with these email reports and
with organized projects.
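A daily pipeline report of the kind described here could be sketched with blastula roughly as follows. The run log, process names and wording are invented for the example, and the sending step is left commented out because it needs real SMTP details; the addresses and host shown there are placeholders.

```r
library(blastula)

# Hypothetical run log; in practice this would be collected from the
# scheduled ETL jobs.
runs <- data.frame(
  process = c("process_a", "process_b", "process_c"),
  started = c("06:00", "06:15", "06:30"),
  status  = c("ok", "ok", "error"),
  stringsAsFactors = FALSE
)

# One Markdown bullet per process, plus a summary line.
bullets <- sprintf("- **%s** started at %s: %s",
                   runs$process, runs$started, runs$status)
summary_line <- sprintf("%d of %d processes finished without errors.",
                        sum(runs$status == "ok"), nrow(runs))
body_text <- paste(
  "## Data pipeline report",
  summary_line,
  paste(bullets, collapse = "\n"),
  sep = "\n\n"
)

# Render the Markdown into an HTML email object.
email <- compose_email(body = md(body_text))

# Sending requires SMTP credentials, so the call is commented out.
# smtp_send(
#   email,
#   to          = "team@example.com",
#   from        = "pipeline@example.com",
#   subject     = "Daily data pipeline report",
#   credentials = creds_envvar(
#     user        = "pipeline@example.com",
#     pass_envvar = "SMTP_PASSWORD",
#     host        = "smtp.example.com",
#     port        = 465,
#     use_ssl     = TRUE
#   )
# )
```

Printing the email object in an interactive session previews the rendered HTML, which is a convenient way to check the layout before scheduling it.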
Jon: 24:15 Yeah. I didn't know that. Okay. So before I interrupted
you, after you finished talking about blastula, you were
then talking about another tool that also sounded really
cool for the same kind of data quality process checking
kind of thing.
Veerle: 24:30 Yeah. So the question is how are you going to schedule
these email reports? And there we already touched a bit
on how you're going to deploy these things. And what we
use for deployment is RStudio Connect which is basically
an enterprise level tool, which you can purchase from
RStudio, which helps you to deploy your Shiny apps, your
email reports, markdown reports and APIs even. And
that's what we use for deploying everything that we have.
Jon: 25:02 Nice. So let's talk a bit more about that. So I guess I'll...
Maybe model deployment is the last step. Maybe before
we get to model deployment, we need to be talking about
model development in general. I guess you use RStudio as
your main tool?
Veerle: 25:21 Yes. We use RStudio Server, actually it's the Pro version,
which we have running on a server as well, and where we
have multiple accounts so that we can work together
on the same projects. And yeah, we develop there, so we
develop our apps there, we develop our APIs there and we
develop our markdown reports there. And one important
thing that might be noteworthy to mention is that, in
production it's very important that you separate your
development processes from your production processes,
right? You don't want to do development work where your
production is and the other way round. So how we solve
that at a relatively small company is just to set up two
servers: on one server we have RStudio Server running, on
the other one we have RStudio Connect, and those are
basically our development and production servers. So
whenever we develop something on our development
server with RStudio server, we then push it to RStudio
Connect and that's our live environment. And RStudio
Connect is also the place where our customers go to
access our web applications.
Jon: 26:36 Nice. So I guess RStudio Connect might also make it easy
then it sounds like if RStudio Connect allows you to
deploy Shiny user interfaces. So maybe I could do it a
little bit, but you can do it better than me. Tell us a bit
about R Shiny.
Veerle: 26:55 So R Shiny is the tool to make web applications of your R
code. And R Shiny is basically very simple, you create a
user interface and you create a server part. And with a
user interface you obviously make the front-end of your
application. And with the server part, you will do the
back-end loading. And it's very easy to spin up
applications, but I also would like to correct something or
talk in favor of Shiny. Because it's often said, "Okay,
Shiny is like this dashboarding tool. You can make great
dashboards with it." Yeah, sure. You can make
dashboards, but you can make fully professional
applications as well. And I think that if you're saying that
Shiny is only for dashboards, then you didn't understand
it correctly. Because you can do so much more with Shiny
than just the dashboard tool.
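The two halves described here, a user interface and a server part, look roughly like this in a minimal Shiny app. This is a generic sketch rather than anything from Analytic Health: the title, the input and output names, and the use of the built-in mtcars data set are all assumptions for illustration.

```r
library(shiny)

# User interface: what the user sees in the browser.
ui <- fluidPage(
  titlePanel("Minimal Shiny example"),
  sliderInput("n", "Number of rows", min = 1, max = 10, value = 5),
  tableOutput("preview")
)

# Server part: the back-end logic that reacts to the inputs.
server <- function(input, output, session) {
  output$preview <- renderTable(head(mtcars, input$n))
}

# Combine both halves into an app object.
app <- shinyApp(ui, server)

# runApp(app)  # uncomment to launch the app locally
```

Moving the slider re-runs only the renderTable expression, which is the reactive model that larger Shiny applications build on.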
Veerle: 27:50 You can really make applications that you can't distinguish
from Vue and Node applications, for example. So I develop
in other languages as well. We also have Vue applications,
Vue being a JavaScript framework. It's not different. So I
think it's definitely, well, I can safely say that you can
have an app in production that is running purely on R.
Because your clients won't notice, in fact, your end users
don't care which tech stack something was developed in as
long as they [crosstalk 00:28:23].
Jon: 28:23 No they really don't. And yeah, I'm probably the same
kind of person who goes around saying not only that we
should only be using Python for production, but also that
if we're going to use R, Shiny is for dashboards. That's
definitely something that [inaudible 00:28:42].
Veerle: 28:41 Yes.
Jon: 28:43 So [crosstalk 00:28:45].
Veerle: 28:44 You should never do that again.
Jon: 28:46 No I won't. That's why I wanted to have you on as a guest.
And yeah, you're changing my life here. Now I can go
back to R which is all I wanted to be doing all along. And I
think a really cool idea here, based on my experience I've
built R Shiny apps, mostly for dashboards. But I've done
that and I don't have expertise in HTML, CSS, JavaScript,
really. Like, I'm quite bad at those things. And I can make a
fully functional application in R Shiny. So I think that
there's probably a really cool opportunity here for a lot of
listeners to now with the conversation that we've had
today. If they have some experience with R or even if they
don't. So if they are doing their data science with another
tool like Python but they want to be making apps, now it
seems like the most obvious thing to be doing is learning
R and using R Shiny to develop and deploy those apps.
Because the way that you are thinking about your code is
going to be a lot more similar to how you do it in Python
than relative to say learning JavaScript or HTML and
CSS. So that's really cool. There's a huge opportunity
there for...
Veerle: 30:09 And the beauty of... Because a web application is exactly
those three things, HTML, CSS, JavaScript. The beauty
about Shiny is it has all these amazing wrappers around
JavaScript libraries, which make all the cool JavaScript
stuff easily available for you as an R user. And that's the
thing that I love about Shiny, the development is so
amazing there. So every day new stuff comes out, which
allows you to make these awesome applications. And in
order to use the fundamentals of Shiny, you don't need to
know JavaScript yourself because you have these nice
wrappers around it. And obviously if you want to do more
customization later on, it would be handy if you know
JavaScript. Because then you can do even more amazing
stuff, but for the basics you don't need it to get started. I
started somewhere as well, like just building a Shiny app
while I had a bit of R experience. And yeah, you learn
along the way, but that's with everything with every tool
that you choose.
Jon: 31:14 Nice. Super cool, Veerle. This is really exciting, I think
that, yeah, not only for me, but for a lot of listeners, I
think this has been a podcast that has potentially changed
their life. Not only will people be doing more powerlifting
and more Olympic lifting, but they'll also be using R in
production. So [crosstalk 00:31:35] are there any other
particular tools in your ecosystem that you recommend
we check out either for model development or
deployment? I think we've talked a little bit more about
ETL already, but for development we've focused mostly on
RStudio server. Maybe there are particular packages that
you use a lot that you highly recommend in R?
Veerle: 31:55 Yeah. What I would recommend for Shiny apps is
checking out the Golem package. The Golem package is
basically a nice [inaudible 00:32:03] framework even, or
way to organize your project in order to make it a
production-grade Shiny application. So it provides you
basically with the infrastructure you need to set up a
professional application, which is also very scalable. So I
would definitely recommend to check out the Golem
package here.
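For a sense of what that infrastructure looks like, golem can scaffold a new project in one call. The sketch below is illustrative only: the app name "pipelineapp" is hypothetical, and the project is created in a temporary directory so nothing is left behind.

```r
library(golem)

# Scaffold a new golem project in a temporary directory. The app name
# is hypothetical; open = FALSE avoids opening the project in RStudio.
app_dir <- file.path(tempdir(), "pipelineapp")
create_golem(path = app_dir, open = FALSE)

# The scaffold is an R package: UI and server code live under R/,
# with helper scripts for setup and deployment under dev/.
list.files(app_dir, recursive = TRUE)
```

Because the result is a regular R package, the usual package tooling for documentation, testing and dependency tracking applies to the Shiny app as well.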
Jon: 32:24 How do you spell that? That's like the Lord of the Rings
Golum?
Veerle: 32:28 Yes.
Jon: 32:29 G-O-L-U-M?
Veerle: 32:32 Yeah, only E-M at the end, but yeah.
Jon: 32:37 Oh, yeah, yeah.
Veerle: 32:41 Yeah, but that's a great tool. And yeah, as I said, check
out the email reports with blastula so that you can keep
yourself updated about what's happening in your
processes. And one other tip I can give you when you are
setting up R in production is to make sure that when you
set up a new project, that you choose a structure and
make it the same across all projects. You can imagine
with us having 32 processes running it would be a pain if
we go from one project to the other and the project looks
totally different every time. So make sure that you have a
base structure, and you can even make a package out of
it on your own that can easily create a structure for you,
but make sure that you do it standardized.
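One way to standardize a base structure is a small helper function that every new project starts from. The sketch below is hypothetical: the function name, the folder names and the run.R entry point are assumptions for illustration, not the layout Analytic Health uses.

```r
# Hypothetical helper that gives every ETL project the same layout.
# The folder and file names are illustrative, not a prescribed standard.
create_etl_project <- function(path) {
  folders <- c("R", "data-raw", "logs", "reports")
  for (folder in folders) {
    dir.create(file.path(path, folder), recursive = TRUE,
               showWarnings = FALSE)
  }

  # Single entry point that a scheduler can call for any project.
  entry_point <- file.path(path, "run.R")
  if (!file.exists(entry_point)) {
    writeLines("# Entry point: gather, clean, validate, load.", entry_point)
  }

  invisible(path)
}

# Create an example project in a temporary directory and inspect it.
project <- create_etl_project(file.path(tempdir(), "example_etl"))
list.files(project)
```

Wrapping a function like this in an internal package, as suggested above, means a scheduler can treat every project the same way, for example by always calling its run.R.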
Jon: 33:26 Cool. How about the Tidyverse? Are you a Tidyverse fan or
an [crosstalk 00:33:32].
Veerle: 33:32 I'm a Tidyverse lover. I'm a Tidyverse fan, yeah. But the
thing is my business partner with whom I work very
closely obviously is not.
Jon: 33:41 No.
Veerle: 33:41 He's a [inaudible 00:33:42]. Oh, it's awful. He's a data.table
lover basically. So we have this clash of data.table and...
Jon: 33:52 And [inaudible 00:33:53]?
Veerle: 33:53 Dplyr. Yeah. [crosstalk 00:33:56]. But did you know there
is a great package as well, which is dtplyr, which basically
combines dplyr and data.table together?
Jon: 34:04 I did not know that.
Veerle: 34:07 So there you can make use of the advantage of
data.table's speed. Because that's why I have to be
[inaudible 00:34:12]. In most of our Shiny apps, we use
data.table because of its speed and we want our apps to
be as fast as possible. So we prefer data.table over dplyr.
But the dplyr syntax is just beautiful. So what you can do
now with this package dtplyr, you can actually use the
dplyr syntax, but under the hood it uses data.table. And
you have a bit of overhead there, so it's not as fast as
using pure data.table, but it's considerably faster than
dplyr [inaudible 00:34:42].
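
To make the dtplyr idea concrete, here is a minimal sketch that assumes the data.table, dplyr, and dtplyr packages are installed. The `sales` table and its column names are made up for illustration.

```r
# Assumes data.table, dplyr, and dtplyr are installed.
library(data.table)
library(dplyr)
library(dtplyr)

# Made-up data for illustration.
sales <- data.table(
  region = c("north", "north", "south"),
  amount = c(100, 150, 200)
)

# lazy_dt() wraps the data.table so dplyr verbs are recorded, not run.
query <- sales %>%
  lazy_dt() %>%
  group_by(region) %>%
  summarise(total = sum(amount))

# Show the data.table code that dtplyr generated from the dplyr verbs.
show_query(query)

# Nothing is computed until the result is collected.
result <- as_tibble(query)
```

The translation step is the overhead mentioned above: dplyr verbs are converted into a data.table expression first, and data.table then does the actual work.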
Jon: 34:44 Yeah. And so I guess we should back up a little bit and let
listeners know a little bit more about the
Tidyverse. So the Tidyverse, I think it was created by
Hadley Wickham. He's definitely the biggest figure in the
space. So Hadley Wickham was actually on the
SuperDataScience show last year, so episode 337. And
Hadley Wickham works at RStudio which is the biggest
player in commercial development of R software. But they
open source tons and tons and tons of things. And so
we've been talking about RStudio server and RStudio
Connect, which are tools that you can purchase from
RStudio. But there's a free IDE, which I've used for over a
decade, that is, I think, the leading IDE for developing
within an R environment. Same thing with R Shiny, which
we've talked about a lot: it is free. It's completely open
source. And dplyr... So dplyr is a part of this suite of R
packages in the Tidyverse. And the reason why it's called
the Tidyverse is because all of these packages are based
on the idea of having tidy data, which is a particular way
of structuring your data. And then it allows you to pipe
functions. So you can take a given data frame, what's
called a tibble in the Tidyverse, right?
Veerle: 36:29 Yeah. Yeah.
Jon: 36:31 And then you take, like, a noun, you take an object
and then you can pass it through a series of functions, a
series of verbs. And so you can form a pipeline,
basically, of operations and it is such an intuitive and
easy way to write code. I absolutely love it. And I must
admit, Veerle, that it is something that I miss.
Veerle: 36:56 Yeah, I can imagine. It's awesome.
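
The noun-and-verbs idea can be sketched with a few dplyr calls. This assumes dplyr is installed. The `lifts` data and its column names are invented for illustration.

```r
library(dplyr)

# Made-up training log for illustration.
lifts <- tibble(
  lifter = c("A", "A", "B", "B"),
  lift   = c("squat", "squat", "squat", "deadlift"),
  kg     = c(100, 110, 120, 150)
)

best_squats <- lifts %>%            # the noun: a data frame
  filter(lift == "squat") %>%       # verb: keep squat rows only
  group_by(lifter) %>%              # verb: group by lifter
  summarise(best_kg = max(kg)) %>%  # verb: best squat per lifter
  arrange(desc(best_kg))            # verb: heaviest first
```

Each step receives the output of the previous one, so the code reads top to bottom in the order the operations happen.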
Jon: 37:01 Okay, Veerle, this is all super cool if we are building a
user interface, something that somebody can click and
point around with. But what if our production application
needs to be an API that people can program against and
make calls against? Is there a tool that we can use in that
case?
Veerle: 37:20 Yes, definitely. You have the plumber package in R, and
the plumber package allows you to turn your R code
into an API endpoint. So your models can go directly in
there. And the nice thing about that is that it's taking
like three extra lines of code to turn your script into an
API, and then you can deploy it basically anywhere, you
can dockerize it and put it somewhere into AWS or Azure.
Or you can deploy to RStudio Connect. And that's a great
way to actually productionize your R code.
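
As an illustration of those few extra lines, here is a minimal sketch of a plumber file. It assumes the plumber package is installed. The file name `plumber.R`, the `/predict` route, the helper `predict_mpg`, and the toy model on the built-in `mtcars` data are all assumptions made for the example.

```r
# plumber.R (hypothetical file name)
# Assumes the plumber package is installed.

# Toy stand-in for a real model: linear regression on built-in mtcars data.
model <- lm(mpg ~ wt, data = mtcars)

# Plain R function that does the actual work.
predict_mpg <- function(wt) {
  # Query parameters arrive as strings, so convert before predicting.
  newdata <- data.frame(wt = as.numeric(wt))
  list(mpg = unname(predict(model, newdata)))
}

# The special "#*" comments below turn the function into an endpoint.

#* Predict miles per gallon from car weight (in 1000 lbs)
#* @param wt Car weight
#* @get /predict
function(wt) {
  predict_mpg(wt)
}

# To serve it locally, run this from another R session:
#   plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
# and then request http://localhost:8000/predict?wt=3
```

Keeping the model logic in an ordinary function and the annotations in a thin wrapper makes the prediction code easy to test without starting a server.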
Jon: 37:55 Cool. Well, we've learned a lot. You've completely changed
my perspective on R as a tool we could be using in
production. I can't wait to check out blastula, CronR,
dtplyr. And I'm sure there's lots of audience members out
there who can't wait to get started as well. So do you
happen to have a book recommendation that might be
related to R? It might be unrelated to R, it doesn't matter
either way, but...
Veerle: 38:27 I can give a recommendation for R and not for R. But for
R I would definitely recommend JavaScript for R by John
Coene, that is. It's a great book on how you can use
JavaScript and you can access it online, so it's pretty
cool. And a non-R-related book, I would say, The Art of
Thinking Clearly by Rolf Dobelli. It has these fairly short
chapters which you can read in like two minutes and it'll
give you fresh insights on life in general, which is also
very useful to use in a business, for example.
Jon: 39:05 That sounds cool. I want to check that out. That sounds
like... So something with such short chapters could be
ideal for a commute or just waiting in line for something
briefly. It sounds like the perfect way to get some extra
philosophy in life.
Veerle: 39:25 Yeah. At our company we have a weekly meeting where
each of us discusses one chapter of this book and we talk
about what it can mean for business. Just to do
something other than programming and getting involved in
day-to-day business. We step outside once per week and
talk about these general things.
Jon: 39:43 Beautiful. All right. I'm looking forward to checking that
out. And then so I've learned so much from you in this
episode. I'm sure lots of people have, and I'm sure lots of
listeners are wondering how they can keep up with your
latest thoughts perhaps on The Art of Thinking Clearly, but
maybe also on R. So how should people follow you?
Veerle: 40:03 They can follow me on LinkedIn. I regularly share tips
about R and also things that we're doing in the business.
So give me a follow if you want to know more.
Jon: 40:14 Nice. All right. So we will definitely include that in the
show notes, making it easy for people to follow you.
Veerle, this has been such a fun episode. I've really loved
having you on, and hopefully we can have you on again
sometime soon.
Veerle: 40:26 Thank you, Jon.
Jon: 40:34 That was a lot of fun and wow, did Veerle ever blow my
narrow, Python-centric mind. In today's episode, Veerle
filled us in on lots of specific tools for using R as your
production software language. Specifically, she mentioned
the R Tidyverse as a tidy way to manage your data and
data operations, particularly dplyr for piping operations
into each other sequentially and intuitively. She told us
about dtplyr for obtaining dplyr-style piping with
computational efficiency that is near that of R's data.table. She
talked about CronR for scheduling processes to run
automatically. Blastula for beautiful automated emails.
RStudio server for model development in the cloud. R
Shiny for designing user interfaces of any ELT. RStudio
Connect for deploying R Shiny apps and golem for
professional-grade scalability of those Shiny apps. And
finally, she told us about R Plumber for creating API
endpoints.
Jon: 41:38 As always, you can get all the show notes including the
transcript for this episode, the video recording, any
materials mentioned on the show, such as the list of R
packages that I just rifled through and the URL for
Veerle's LinkedIn profile, as well as my own social media
profiles at superdatascience.com/491. That's
superdatascience.com/491. If you enjoyed this episode I'd
of course greatly appreciate it if you left a review on your
favorite podcasting app or on the SuperDataScience
YouTube channel where we have a video version of this
episode. To let me know your thoughts on the episode,
please do feel welcome to add me on LinkedIn or on
Twitter, and then tag me in a post to let me know your
thoughts on this episode. Your feedback is invaluable for
figuring out what topics we should cover next.
Jon: 42:24 Since this is a free podcast, if you're looking for a free way
to help me out, I'd be very grateful if you could leave a rating of
my book, Deep Learning Illustrated, on Amazon or
Goodreads, give some videos on my YouTube channel a
thumbs up, or subscribe to my free, content-rich
newsletter on jonkrohn.com. To support the
SuperDataScience company that kindly funds the
management, editing and production of this podcast
without any annoying third-party ads, you could create a
free login to their learning platform at
superdatascience.com. You can check out the 99 days to
your first data science job challenge at
superdatascience.com/challenge, or you could consider
buying a usually pretty damn cheap Udemy course
published by Ligency, an affiliate of SuperDataScience,
such as my Mathematical Foundations of Machine
Learning Course.
Jon: 43:11 All right, thanks to Ivana, Jaime, Mario and JP on the
SuperDataScience team for managing and producing
another amazing episode today. Keep on rocking it out
there, folks, and I'm looking forward to enjoying another
round of the SuperDataScience podcast with you very
soon.