
February 4, 2020

Prentice Women's Hospital
250 East Superior Street
Chicago, IL 60611

BDSD: BIOMEDICAL DATA SCIENCE DAY

DATA SCIENCE FOR EVERYBODY


About Biomedical Data Science Day

The Center for Data Science and Informatics presents Biomedical Data Science Day 2020, a one-day event of immersive, interactive workshops and talks for anyone interested in data science.

From novices to experts, attendees will dig deep into open source data sets, enhance their statistical knowledge, and learn new tools to navigate the computational biomedical sciences. Sessions will be led by experienced faculty, staff, students, and trainees from the Northwestern community and beyond.

"The day is really about building awareness and excitement about what is possible in biomedical informatics and data science. We encourage attendees to connect with each other to learn about new approaches to take with their research."
- Dr. Justin Starren

#BDSD2020


Schedule

8:00 - 9:00 am: Breakfast, Tech Support; Ask Me Anything with Charlton McIlwain
9:00 - 9:15 am: Welcome
9:15 - 10:15 am: Plenary 1, Nigam Shah, MBBS, PhD
10:15 - 10:30 am: Break
10:30 - 11:20 am: Session 1
11:30 am - 12:20 pm: Session 2
12:30 - 1:30 pm: Lunch; Ask Me Anything with Nigam Shah; Getting Started with SAS Viya; Reproducible Research Interest Group Discussion; Getting Started with AWS
1:30 - 2:20 pm: Session 3
2:30 - 3:20 pm: Session 4
3:30 - 3:40 pm: Break
3:40 - 4:40 pm: Plenary 2, Charlton McIlwain, PhD
4:40 - 4:45 pm: Closing
4:45 - 6:00 pm: Reception

Tech Requirements and Support

Please note that some sessions are interactive; these are indicated in the session descriptions. To participate in the interactive components of a session, please bring your laptop. Before and during BDSD, visit the "Interactive Sessions: Technical Details" link at bit.ly/BDSD2020 for software requirements and installation instructions. At BDSD, visit the Tech Support Desk near the registration table for answers to software installation questions.


Schedule At-A-Glance

Rooms L South, L North, and M

Breakfast / Q&A: 8:00 - 9:00
Plenary 1: 9:00 - 10:15 (Shaping the Future of AI in Healthcare, Nigam Shah)
Break: 10:15 - 10:30

Session 1: 10:30 - 11:20
- Evaluating AI/NLP in Healthcare: Reference Standards and Outcome Measures (Brett South)
- A Simple Introduction to Python (Andre Archer)
- Introduction to AI in Medical Imaging (Todd Parrish & Aggelos Katsaggelos)
- First Steps with R (Christina Maimone)
- Building CARDIAC (Cardiovascular Digital AI Core): A Service Line Datamart for Quality, Reporting, & Research (Faraz Ahmad)
- Enabling Predictive Analytics in the Enterprise: Lessons Learned from Implementing Epic's Cognitive Computing Platform (Anthony Wong)

Session 2: 11:30 - 12:20
- Data Exploration, Predictive Modeling, and Machine Learning with SAS Viya for Learners (Jacqueline Johnson)
- Using R to Process and Analyze Accelerometer Data (David Aaby)

Lunch: 12:30 - 1:30
- Ask Me Anything (Nigam Shah)
- Getting Started with SAS Viya (Jacqueline Johnson)

Session 3: 1:30 - 2:20
- When AI Meets Healthcare Big Data (Yuan Luo)
- Taking Care of Vizness: Graphing and Reporting with R (Shannon Haymond)

Session 4: 2:30 - 3:20
- The Good, the Bad and the Ugly: A Primer on the Wild West of Electronic Health Record Data (Theresa Walunas & Abel Kho)
- Analyzing Human Behavior and Lifestyle Using Wearable Sensors (Rawan Alharbi)
- From MIMIC-III to Model: Building a Workflow to Model Raw Retrospective Clinical Data (Garett Eickelberg)

Break: 3:20 - 3:40
Plenary 2: 3:40 - 4:45 (Modern Problems: Race & Our Data Science Past, Present & Future, Charlton McIlwain)
Reception: 4:45 - 6:00 pm


Rooms N, P, and Q

Breakfast / Q&A: Ask Me Anything (Charlton McIlwain)

- A Pythonic Journey from Scientific Data Visualization to Broadly Data Science (Wenjun Kou)
- Machine Learning in Bioinformatics: Introduction and Applications (Ramana Davuluri)
- InvenioRDM: A Next-Generation Repository for Research Data Management (Matt Carson)
- Breaking Data Silos: the Gen3 Platform for Creating Data Commons (Chris Meyer)
- White Hat P-Hacking: Bayesian Parables to Understand the Reproducibility Crisis (Omkar Venkatesh)
- Introduction to NLP Services in AWS (Randy Ridgley)

Lunch:
- Open Discussion: Reproducible Research Interest Group
- Getting Started with AWS (Randy Ridgley)

- Pragmatic Reproducible Research for Data Scientists (Luke Rasmussen)
- Overview of Causal Inference Concepts for Data Scientists (Lucia Petito)
- Predicting Severe Sepsis in a Children's Hospital (L. Nelson Sanchez-Pinto et al.)
- Building a Community for Development of Open Source Genomics Platform (Michael Bouzinier)
- Application of Topic Modeling and Information Theory for Clustering Single Cell RNA-Seq Datasets (Ziyou Ren)
- Heart Rate Variability Dysfunction is Associated with Outcomes in Critically Ill Children (Colleen Badke)

Break

- Making Sense of Big Data: Estimating the Intrinsic Dimensionality of Multi-Electrode Neural Recordings (Ege Altan)
- Probabilistic Programming for Bayesian Inference Using Rstan (Kyle Honegger)
- Supervised Machine Learning - Tuning Support Vector Machines (Asma Mustafa)
- Combining Data: We Build Our Own Functional Enrichment Tool (Thomas Stoeger)
- Biostat Basics: Some Practical Things to Know (Nina Srdanovic)
- Easy Data Visualization Using ggplot2 in R (Andrew Skol)
- Demonstrating Ceto, a Modular Suite of Pipelines for Next Generation Sequence Analysis (Elizabeth Bartom)

Key: AMAs + Discussions, Plenary, Interactive Coding, Talk, Code Demonstration


Plenary 1: 9:15 - 10:15 am
Location: Room L

Nigam Shah, MBBS, PhD
Stanford University
Associate Professor of Medicine (Biomedical Informatics) and of Biomedical Data Science

Shaping the Future of AI in Healthcare

We will review the issues involved in bringing Artificial Intelligence (AI) technologies to the clinic safely and ethically. We will begin with an overview of the U.S. healthcare system and how it affects the data strategy for powering a machine learning (ML) health system. We will discuss a framework for analyzing the utility of ML models in healthcare and the implicit assumptions in aligning incentives for AI-guided healthcare actions. We will conclude with an outline of ethical considerations for incorporating AI in healthcare, as well as its impact on the doctor-patient relationship.

Bio

Dr. Nigam Shah is Associate Professor of Medicine (Biomedical Informatics) at Stanford University, Assistant Director of the Center for Biomedical Informatics Research, and a core member of the Biomedical Informatics Graduate Program. Dr. Shah's research focuses on combining machine learning and prior knowledge in medical ontologies to enable use cases of the learning health system. Dr. Shah received the AMIA New Investigator Award for 2013 and the Stanford Biosciences Faculty Teaching Award for outstanding teaching in his graduate class on data-driven medicine. He was elected into the American College of Medical Informatics (ACMI) in 2015 and was inducted into the American Society for Clinical Investigation (ASCI) in 2016. He holds an MBBS from Baroda Medical College, India, and a PhD from Penn State University, and completed postdoctoral training at Stanford University.


Plenary 2: 3:40 - 4:40 pm
Location: Room L

Charlton McIlwain, PhD
NYU Steinhardt
Vice Provost for Faculty Engagement and Development
Professor of Media, Culture, and Communication
Author of "Black Software"

Modern Problems: Race & Our Data Science Past, Present & Future

Since the advent of modern data science in the early 1960s, interest in producing, collecting, analyzing and utilizing data has run far beyond our interest in producing and distributing knowledge. Discussions about what counts as data, and how data should be used, have been driven not just by our need to understand ourselves and the world we create and live in, but by a desire to influence human behavior, and direct social and political outcomes towards predetermined ends. In short, I argue that problems, rather than questions, drive modern data science. Drawing on the history of data science's formative years, and its collisions with the civil rights movement and the revolutions in computing development in the 1960s, I first describe why we need data science narratives, models and tools that are aware and critical of how we designate, define, frame and operationalize data science problems. I conclude by posing the question: What would it take to develop a data science that is motivated by equity and justice?

Bio

Author of the new book "Black Software: The Internet & Racial Justice, From the Afronet to Black Lives Matter," Charlton McIlwain is Vice Provost for Faculty Engagement & Development at New York University and Professor of Media, Culture, and Communication. His work focuses on the intersections of computing technology, race, inequality, #Ferguson, #BlackLivesMatter, and the Online Struggle for Offline Justice. He recently testified before the U.S. House Committee on Financial Services about the impacts of automation and artificial intelligence on the financial services sector.

Prentice Women's Hospital Third Floor Conference Center Map

[Floor plan: rooms Canning, L North, L South, M, N, P, Q, S, and T around the Harris Family Atrium, with stairs, escalators, elevators, and restrooms marked.]

© 2018 Northwestern Medicine. All rights reserved.


Sessions

Session 1: 10:30 - 11:20 am

Introduction to AI in Medical Imaging
Todd Parrish and Aggelos Katsaggelos
Radiology, BME and ECE, Northwestern University
Room L South

Biomedical Data Science has the potential to significantly alter the way healthcare is practiced in the future. There are many components of a patient's health record that could be used. In this session, we will focus on medical imaging data and how it can be used to diagnose disease, predict progression, or determine optimal treatment pathways. The first half of the session will provide a background on clinical imaging data. Each imaging modality exists because of the unique information it provides. A goal is to understand this complementary information and how to exploit it to provide optimal precision medicine. The second half will develop the basic concepts of machine learning and then build these into more advanced deep learning models. Attendees should leave with a better understanding of the types of imaging utilized.

First Steps with R
Christina Maimone
Northwestern IT Research Computing Services
Room L North
Interactive Session: Bring a laptop to participate. Participants will use a cloud computing environment through a web browser.

R is a great tool for conducting reproducible statistical analysis, data manipulation, and data visualization. If you've never used R before, this session is for you. You will read in a dataset, make your first plot in R, and run a few simple analyses.

Building CARDIAC (Cardiovascular Digital AI Core): A Service Line Datamart for Quality, Reporting, and Research
Faraz Ahmad
Feinberg School of Medicine, Northwestern University
Room M: Talk 1

In this session, we will describe a collaboration between the Bluhm Cardiovascular Institute, Feinberg School of Medicine faculty, and the Northwestern Medicine Electronic Data Warehouse (NMEDW) team to develop a service-line specific datamart for quality, reporting, and research. We will discuss our ongoing efforts to better organize cardiovascular data and how to work with the NMEDW Research Analytics Team, add new data sources to the NMEDW, and implement natural language processing pipelines. Discussants will include two Feinberg School of Medicine faculty, Dr. Faraz Ahmad and Dr. Yuan Luo, and two members of the NMEDW Research Analytics team, Dan Schneider and Martin Borsje.



Enabling Predictive Analytics in the Enterprise: Lessons Learned from Implementing Epic's Cognitive Computing Platform in a Pediatric Hospital
Anthony Wong
Data Analytics and Reporting, Lurie Children's Hospital
Room M: Talk 2

Prediction of clinical outcomes has historically been carried out discretely by individual departments, without further integration into the workflow. The need for prediction started with traditional analytics, when clinicians began to realize the value of data for decision-making. However, this practice often fails to deliver the operational capability and agility required to effectively manage such a program. The increase in adoption of electronic health records (EHR) has spurred the development of new data science tools with tighter integration into clinical practice. Advances in machine learning have enabled these systems to process large amounts of data and provide decision support. There is also a consensus in clinical informatics that an integrated EHR system can significantly improve the quality of patient care, reduce the risk of errors, and enhance communication between providers. Our organization recognizes these benefits and took steps to advance the use of machine learning techniques in the clinical decision-support process.

In this presentation, we will discuss our experience in developing a pediatric patient readmission risk model and the lessons we learned from integrating the predictive model directly into the EHR user workflows. The Cognitive Computing Platform, developed by Epic Systems, is the core implementation of predictive analytics in our EHR system. This multi-stage project engages various stakeholders from different departments in workflow design, clinical case management, and full system integration for the cloud services. We will also discuss the basic building blocks of Cognitive Computing, including Predictive Model Markup Language (PMML), Docker containers, and Python development in the cloud. This discussion also extends to the role of various components in the EHR, such as the reporting system and databases, that made these tasks possible. Finally, we will conclude with a discussion of the strengths and weaknesses, and the impact on our workflows.

A Pythonic Journey from Scientific Data Visualization to Broadly Data Science
Wenjun Kou
Department of Medicine, Gastroenterology and Hepatology, Northwestern University
Room N: Talk 1
Interactive Session: This session assumes familiarity with Python. Bring a laptop with access to a recent version of the Anaconda distribution of Python to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

Based on my personal journey of using Python, from biophysical simulation and modeling to clinical data science, I will discuss several case studies that illustrate features of Python-based applications. For



scientific simulation and modeling, I will start with the open-source big-data visualization library VisIt, which supports client-server mode and GUI or batch runs. I will then comment on the pros and cons of Python in physical modeling applications, in comparison with C++ and MATLAB. Moving to clinical data visualization and analytics, I will discuss a GUI-based design pattern using Python, contrasted with the popular client-server design pattern that uses JavaScript. The Pythonic pattern helps us develop user-friendly analytical tools that are easily distributed and maintained in the lab at Northwestern Gastroenterology. Finally, I will briefly discuss recent experience with deep learning built on top of libraries like TensorFlow and Keras, and one application that uses a microcomputer (Raspberry Pi) for lab tests. I will close with a concluding perspective on using Python.

InvenioRDM: A Next-Generation Repository for Research Data Management
Matt Carson, Guillaume Viger, and Sara Gonzales
Galter Health Sciences Library, Northwestern University
Room N: Talk 2

Proper collection, indexing, and preservation are vital to the discovery and dissemination of research output in scientific research. However, many research communities continue to battle the problem of "silos" at the institutional level that hinder discovery of research output. As part of a multi-organization collaboration in partnership with CERN, we are building a digital repository that can be easily deployed and managed, either locally or on a cloud-based platform, to collect, record, preserve, and disseminate a wide range of digital works across the translational community. In turn, this enhances their visibility, promotes people and their expertise, supports attribution of their work, aids discovery and accessibility by the international scientific community, and supports Findable, Accessible, Interoperable, & Reusable (FAIR) science. At the same time, we use this tool to promote good data practice workflows, incorporate standards and persistent identifiers, and account for privacy requirements in translational research. This session will introduce the InvenioRDM platform, run through example use cases relevant to the data science community, and conclude with a demo.

Machine Learning in Bioinformatics: Introduction and Applications
Ramana Davuluri
Feinberg School of Medicine, Northwestern University
Room P: Talk 1

With each successive discovery in genetics, the dynamic complexity of gene structure and gene regulation has become increasingly apparent. It is now understood that the majority of human genes produce multiple functional products, or isoforms, primarily through alternative transcription and splicing. Different isoforms within the same gene have been shown to participate in different functional pathways, and the altered expression of


specific isoforms has been associated with numerous cancers. Consequently, transcriptome analyses based on gene-centric informatics methods (a) may waste resources following up "leads" that cannot be replicated because they are false, (b) may miss important findings that should have been discovered, and, most importantly, (c) may misinterpret the underlying biology. In this talk, I will describe a transcriptome analysis pipeline and recent machine learning methodologies, some of which are based on "old ideas," that account for the underlying splice- and transcript-variants. These informatics methods are illustrated using our recently published studies on a platform-independent informatics pipeline for molecular sub-typing of glioblastoma and ovarian cancers.

Breaking Data Silos: the Gen3 Platform for Creating Data Commons
Chris Meyer
Center for Translational Data Science (Biological Sciences Division), University of Chicago
Room P: Talk 2

Data generated by non-profit research groups, large governmental organizations, and for-profit companies conducting clinical trials alike are typically collected for a very specific purpose and used only to answer an immediate question of interest. In the past, since there was no simple and inexpensive way to share de-identified data with researchers who might benefit from their re-use, the only way for others to find them was via word of mouth. Thus, these data were lost to the broader scientific community, stashed away in "data silos," a problem further complicated by a lack of standardization, since different groups collect and store data in every format imaginable. Recently, however, the Center for Translational Data Science at the University of Chicago has released Gen3, a free, open source platform for creating data commons. A data commons is a cloud-based software platform for managing, analyzing, harmonizing, and sharing large datasets. Gen3 aims to empower research groups at any scale, including small groups of researchers with a modest budget and IT resources, to independently create their own data commons in order to accelerate and democratize the process of scientific discovery. The Gen3 platform achieves this by creating an environment for making data FAIR: findable using cross-project queries, accessible through open APIs, interoperable through data harmonization and query/analysis gateways, and reusable through built-in analysis workspaces and apps. Finally, since Gen3 is open source, the research community is now becoming actively involved in its continued development.

The session will provide an introduction to Gen3, including demonstrations of data analysis in the workspace, analysis apps that provide results without access to raw data, and queries that run across multiple, interoperable data commons, i.e., the data ecosystem.


Biostat Basics: Some Practical Things to Know
Nina Srdanovic
Biostatistics Collaboration Center, Feinberg School of Medicine, Northwestern University
Room Q

This introductory biostatistics presentation will discuss different types of data and sampling and estimation methods. It will include an overview of elementary types of analyses including t-tests, one-way ANOVA, Chi-square test, linear and logistic regression, and regression vs. classification. The emphasis will be on statistical hypotheses and the interpretation of results.
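For readers who want to experiment before the session, the elementary analyses listed above map directly onto standard Python tooling. The following is an illustrative sketch using simulated data; the session itself does not prescribe any particular software.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two-sample t-test: compare a simulated biomarker between two groups.
control = rng.normal(loc=5.0, scale=1.0, size=40)
treated = rng.normal(loc=5.8, scale=1.0, size=40)
t_stat, p_val = stats.ttest_ind(control, treated)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

# Chi-square test of independence on a 2x2 contingency table
# (e.g., exposure vs. outcome counts).
table = np.array([[30, 10],
                  [18, 22]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Simple linear regression: slope, intercept, and R^2.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=50)
res = stats.linregress(x, y)
print(f"slope = {res.slope:.2f}, R^2 = {res.rvalue**2:.2f}")
```

As the session emphasizes, the interesting part is not running these functions but stating the hypotheses and interpreting the resulting statistics and p-values.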


Session 2: 11:30 am - 12:20 pm

Evaluating AI/NLP in Healthcare: Reference Standards and Outcome Measures
Brett R South
Center for AI, Research, and Evaluation (CARE), IBM Watson Health
Room L South

Artificial Intelligence (AI) has great potential to help make sense of big data, help patients and caregivers navigate care pathways, aid in clinical decision making and revolutionize healthcare in profound ways. The technology driving these innovations is maturing rapidly. With the rapid implementation of AI applications in healthcare, and more specifically the use of Natural Language Processing (NLP) to extract information from unstructured data sources, we must understand the parameters necessary for real-world evaluations of system performance. This session will present the current landscape of AI/NLP applications in healthcare and present practical real-world examples of these technologies from an industry perspective. This session will also address scientific methods around building valid reference standards that can be used for NLP evaluations, address guidelines for human experts used as judges in evaluation studies, and present performance metrics for evaluation of human and system performance for a given clinical use case.
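As background for the performance metrics the session discusses, NLP output is typically scored against a reference standard by counting true positives, false positives, and false negatives. A minimal sketch with hypothetical entity spans follows; the span format and values are illustrative only and not tied to any IBM Watson tooling.

```python
# Hypothetical example: score NLP entity extraction against a reference
# standard using exact-span matching.
def precision_recall_f1(gold: set, predicted: set):
    tp = len(gold & predicted)      # true positives: spans in both sets
    fp = len(predicted - gold)      # false positives: predicted only
    fn = len(gold - predicted)      # false negatives: missed spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Reference standard vs. system output: (start, end, label) tuples.
gold = {(0, 9, "DRUG"), (15, 27, "DOSE"), (40, 52, "CONDITION")}
pred = {(0, 9, "DRUG"), (15, 27, "DOSE"), (60, 70, "CONDITION")}

p, r, f = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # precision=0.67 recall=0.67 f1=0.67
```

Real evaluations add the complications the session covers: adjudicating disagreement between human judges, partial-span matching, and choosing metrics appropriate to the clinical use case.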

A Simple Introduction to Python
Colby Witherup Wood
Northwestern IT Research Computing Services
Room L North
Interactive Session: Bring a laptop to participate. Participants will use a cloud computing environment through a web browser.

Python is a very popular, general-purpose programming language. It is effective for a wide variety of uses, including automating research tasks, web scraping, data visualization, and machine learning. In this hands-on session, we'll load in a biomedical data set and create a few simple plots.

Data Exploration, Predictive Modeling, and Machine Learning with SAS Viya for Learners
Jacqueline Johnson
SAS Institute
Room M

Attend this session to gain a hands-on preview of SAS Viya for Learners: an engaging ecosystem of tools designed for academic classroom use and for users of all skill levels and roles. Gain hands-on experience with several interfaces and tools: from initial visual exploration, to template-driven predictive modeling and machine learning, to traditional SAS programming, to seamless integration with open source languages like R and Python. Viya for Learners is available to educators: https://www.sas.com/en_us/software/viya-for-learners.html.


Easy Data Visualization Using ggplot2 in R
Andrew Skol
Pathology and Laboratory Medicine, Lurie Children's Hospital
Room N
Interactive Session: Familiarity with R is assumed. Bring a laptop with access to R and RStudio to participate. See the Technical Details link at bit.ly/BDSD2020 for required packages.

Data visualization is one of the more challenging and time-consuming, yet important, steps in communicating the relationships that exist within complex (and simple) datasets. ggplot2 is a popular package for the R statistical programming language that is easy to learn, extremely flexible, and extensible. Using hands-on examples, I will present how to prepare data for ggplot2, many of the plotting functions provided within the ggplot2 package, and methods for customizing the look and annotation of plots. In addition, I will share a couple of plotting functions that complement the ggplot2 universe, such as heatmaps.

Combining Data: We Build Our Own Functional Enrichment Tool
Thomas Stoeger
Chemical and Biological Engineering, Northwestern University
Room P
Interactive Session: This session assumes familiarity with Python. Bring a laptop with access to a recent version of the Anaconda distribution of Python to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

Functional enrichment is a frequently used technique for analyzing large omics datasets, where it helps to interpret long lists of genes through prior knowledge. We will develop our own stand-alone, Python-based functional enrichment tool.

The main goal of the session is to learn how to combine meso-scale datasets. We will learn how to work with raw annotations, and how the assumptions and decisions that go into building bioinformatic tools influence the interpretation of our own experimental data. Since the tool will be programmed by you, you will have full control over those assumptions.

By the end of the tutorial we will have programmed a robust tool that includes commonly used annotation sources (such as the Gene Ontology). Additionally, you will be able to import custom lists of annotations and correct for non-expressed genes.
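As a preview of the core computation such a tool performs, enrichment of an annotation category in a hit list is commonly assessed with a hypergeometric test against the background of assayed genes. The gene identifiers below are made up for illustration; the session builds a far more complete tool.

```python
# Toy sketch of the core of a functional enrichment test: how surprising
# is the overlap between a hit list and an annotation category, given
# the background of all assayed genes?
from scipy.stats import hypergeom

background = {f"gene{i}" for i in range(1000)}   # all assayed genes
annotation = {f"gene{i}" for i in range(50)}     # e.g., one GO category
hits = {f"gene{i}" for i in range(0, 1000, 10)}  # 100 "significant" genes

overlap = len(hits & annotation)
M = len(background)               # population size
n = len(annotation & background)  # annotated genes in the background
N = len(hits)                     # sample size

# P(overlap >= observed) under random sampling without replacement.
p_value = hypergeom.sf(overlap - 1, M, n, N)
print(f"overlap={overlap}, p={p_value:.3g}")
```

Restricting the background to expressed genes, as the session advocates, changes M and n and can change the verdict substantially, which is exactly the kind of assumption the tool-builder controls.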


White Hat P-Hacking: Bayesian Parables to Understand the Reproducibility Crisis
Omkar Venkatesh
Northwestern University
Room Q: Talk 1

P-hacking. Salami slicing. Data dredging. "Kitchen sink"-ing. The current reproducibility crisis is often associated with fields like psychology, but its reach extends through many disciplines, including data science. At the heart of this crisis lies a misunderstanding not of the most sophisticated analytical machinery, but of simple statistical tenets and tools like the humble p-value. Whether intentional or unintentional, "p-hacking" and misuse of the significance-filter paradigm are rampant. This session will teach participants how to be even better at it: to be "white hat" p-hackers. Before we can analyze literature more carefully and be scrupulous in designing our own investigations, we must reexamine fundamentals in ways the textbooks traditionally do not.

We will explore questions such as: Why does low power systematically lead to a “magnitude bias” in published values? What really happens if we change the default significance threshold to 0.005 as some propose? Do we need to adjust our alpha values for all the tests we do in our lifetime? Is the p-value really just a poor-man’s Bayes factor? How bad is the reproducibility crisis, and has it been overblown?

This session will introduce tools like p-curves and funnel plots, which can reveal evidence of publication bias. It will also examine the strengths and limitations of Bayesian methods in both interpreting published findings and presenting new results. The graphical program JASP will be used to illustrate some of these points, though no prior knowledge is required (and no pun is intended). This talk is heavily indebted to the writings of Andrew Gelman, Eric-Jan Wagenmakers, Daniel Lakens, and others.
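The "magnitude bias" question above can be illustrated with a short simulation (not taken from the session materials): when power is low, only unusually large observed effects clear the significance threshold, so the effect estimates that survive the filter overestimate the true effect on average.

```python
# Simulate the "winner's curse": with a small true effect and a
# low-powered design, studies reaching p < 0.05 report inflated
# effect sizes on average.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect, n_per_arm, n_studies = 0.2, 30, 5000

significant_effects = []
for _ in range(n_studies):
    a = rng.normal(0.0, 1.0, n_per_arm)
    b = rng.normal(true_effect, 1.0, n_per_arm)
    t, p = stats.ttest_ind(b, a)
    if p < 0.05 and t > 0:  # the significance filter
        significant_effects.append(b.mean() - a.mean())

mean_sig = float(np.mean(significant_effects))
print(f"true effect: {true_effect}")
print(f"mean effect among significant results: {mean_sig:.2f}")
# The significant subset systematically overestimates the true effect.
```

With 30 observations per arm, crossing p < 0.05 requires an observed difference of roughly half a standard deviation, so every "significant" study here reports more than double the true effect.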

Introduction to NLP Services in AWS
Randy Ridgley
Amazon Web Services
Room Q: Talk 2

There is a proliferation of unstructured data. Companies collect massive amounts of news feeds, emails, social media, and other text-based information to get to know their customers better or to comply with regulations. However, most of this data sits unused and untouched. Natural language processing (NLP) holds the key to unlocking value within these huge data sets by turning free text into data that can be analyzed and acted upon. Join this session to get hands-on experience with how you can start mining text data effectively and extracting the rich insights it can bring.

Session 2 11:30 am - 12:20 pm


Sessions

When AI Meets Healthcare Big Data
Yuan Luo
Preventive Medicine, Northwestern University
Room L South

This talk will cover the scope of healthcare big data, using the Northwestern Memorial Healthcare chain and the eMERGE network as regional and national case examples. We will then delve into the different modalities of healthcare data (e.g., unstructured clinical notes, structured EHR data, imaging data, genetic data, etc.) and show how these modalities can be individually and/or jointly mined to derive actionable intelligence. We will illustrate with our recent work on developing AI algorithms with applications to clinical narrative text, integrative genomics, and clinical numerical time series. The common theme of these studies is building clinical models that improve both prediction accuracy and interpretability by exploring and combining relational information across data modalities.

Concrete examples include biomedical relation extraction from clinical notes (short text understanding) and computational phenotyping of cancer patients (long text understanding); imputing missing laboratory data and predicting patient mortality risk using numerical clinical time series; and integrating deep phenotypic and genetic information to characterize cardiac mechanics in hypertensive patients. In each example, I will show how to automatically build relational information into a graph representation and how to use AI to learn features from graphs. Depending on the degree of structure in the data format, the heavier machinery of factorization models becomes necessary to reliably group important features. I will demonstrate that these methods lead not only to improved performance but also to better interpretability.

Taking Care of Vizness: Graphing and Reporting with R
Shannon Haymond
Pathology, Lurie Children’s Hospital
Room L North
Interactive Session: Basic familiarity with R is assumed. Bring a laptop with access to R and RStudio to participate.

This session will demonstrate the use of R and RStudio for (1) reproducible data analysis and visualization workflows, (2) producing highly effective, publication-quality graphics, and (3) generating reports that can be widely and easily shared. A beginning-to-end workflow (i.e., from creating the RStudio project to generating a shareable data analysis report) will be shown for an example data set. The flexibility of visualization and report outputs from R makes this session applicable to anyone who wants to enhance their ability to reproducibly create graphs and communicate results of data analyses.

Session 3 1:30 - 2:20 pm


Using R to Process and Analyze Accelerometer Data
David Aaby
Preventive Medicine, Biostatistics Collaboration Center, Northwestern University
Room M
Interactive Session: Familiarity with R is assumed. Bring a laptop with access to R and RStudio to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

Accelerometers are widely used in both epidemiological and clinical research studies for estimating the duration and volume of physical activity (PA) and sedentary behavior (SB), as well as estimating PA at varying intensities. Both uniaxial and triaxial accelerometers have been used to capture movement, typically measured as counts per minute (cpm) in 60-second epochs. Processing accelerometer data can be challenging due to its size (1,440 data points per day per participant) and structure. Converting accelerometer data to meaningful physical activity variables typically requires three steps: (1) identifying periods when the accelerometer was not worn and determining which days have sufficient wear time; (2) calculating physical activity variables of interest for each day of wear; and (3) calculating averages across all valid days, perhaps separately for weekdays and weekends.
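Step (1) can be sketched with a simple rule of thumb. This is a hypothetical Python illustration, not the session's R accelerometry package; the 60-minute zero-count non-wear rule and 600-minute validity threshold used here are common conventions, not necessarily the workshop's choices.

```python
import numpy as np

def wear_time_minutes(counts, nonwear_run=60):
    """Classify each minute as worn/not worn: a run of `nonwear_run` or
    more consecutive zero counts is treated as device non-wear."""
    counts = np.asarray(counts)
    worn = np.ones(len(counts), dtype=bool)
    run_start = None
    for i, c in enumerate(np.append(counts, 1)):  # sentinel ends any final run
        if c == 0 and run_start is None:
            run_start = i
        elif c != 0 and run_start is not None:
            if i - run_start >= nonwear_run:
                worn[run_start:i] = False
            run_start = None
    return int(worn.sum())

# One simulated day: 1440 minutes of counts with a 300-minute non-wear block.
day = np.random.default_rng(0).integers(1, 500, 1440)
day[600:900] = 0
wear = wear_time_minutes(day)
valid_day = wear >= 600   # a common validity rule: at least 10 hours of wear
print(wear, valid_day)
```

The same per-day logic would then feed steps (2) and (3): compute PA/SB summaries on valid days only, then average across them.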

This workshop will introduce the power of using R for managing, preparing, analyzing, and visualizing accelerometer data. We will focus on the accelerometry package in R, which contains the functions needed to process accelerometer data. Example-driven instruction, using data from the 2003-2006 National Health and Nutrition Examination Survey (NHANES), will help attendees become familiar with reading in accelerometry data for multiple subjects, graphically summarizing accelerometer data within subjects, and summarizing accelerometer data in terms of PA and SB using the accelerometry package. Attendees will also learn how to customize their analysis to address a variety of research questions and interests. The package is compatible with both uniaxial and triaxial minute-to-minute count data. It was designed for analyzing ActiGraph accelerometer data, but can analyze cpm data from other devices as well.

Supervised Machine Learning: Tuning a Support Vector Machine
Asma Mustafa
Pathology and Laboratory Medicine, Lurie Children’s Hospital
Room N
Interactive Session: Familiarity with R is assumed. Bring a laptop with access to R and RStudio to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

The support vector machine (SVM) is a supervised machine learning algorithm widely used as a binary classifier. Depending on the nature of the features, finding the hyperplane that best separates the data into the assigned labels can be challenging and time consuming. In this tutorial, SVM tuning parameters, including kernel, cost, and gamma, will be tested with 10-fold cross-validation. Model performance will be evaluated using the model summary output and ROC/AUC curves. Three R packages will be used to perform the SVM classification and plots: e1071 for the svm function, ROCR for the ROC curve, and kernlab (0.9-29) for kernel selection. The dataset used for this tutorial is publicly available at: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer.

Demonstrating Ceto, a Modular Suite of Pipelines for Next Generation Sequence Analysis
Elizabeth Thomas Bartom
Department of Biochemistry and Molecular Genetics, Northwestern University
Room P
Interactive Session: Bring a laptop to participate. Participants will be provided with temporary access to Quest for the session. Participants will need software to make an SSH connection, such as a Mac Terminal, FastX, or PuTTY. See the Technical Details link at bit.ly/BDSD2020 for more info.

Ceto is a pipeline-building system developed here at Northwestern and installed on the Quest High Performance Compute Cluster. Given a set of raw sequence files and parameters such as experiment type and the genomic assembly to be used in the analysis, Ceto will generate a set of shell scripts that check the quality of the sequence, identify sequence contaminants (like mycoplasma RNA in a human cell line experiment), align the data to the genome, and map the read counts to their genomic context. Depending on the needs of the user, additional ChIP-seq and RNA-seq modules can be turned on to build additional shell scripts, for example to call peaks and identify differentially expressed genes. Ceto will also launch the shell scripts to the scheduler and manage the dependencies between jobs. In addition to demonstrating the core functionality of Ceto, we will highlight some new developments, including sample genotype checks for human RNA samples and a new web-based front end for setting up the experimental design in an RNA analysis. Ceto is available on GitHub at https://github.com/ebartom/NGSbartom, and we always welcome new users and developers.

Pragmatic Reproducible Research for Data Scientists
Luke Rasmussen
Preventive Medicine, Northwestern University Feinberg School of Medicine
Room Q: Talk 1

Conducting reproducible research is both an opportunity and a challenge. Research is more efficient and robust when research teams can easily recreate and reproduce findings using the original data. However, adopting reproducible research workflows can be daunting due to technical barriers, a perceived need to switch away from favorite software, or the impression that reproducible research is an “all-or-nothing” endeavor. In this session we will explore how to approach reproducible research: steps for starting small, ways to expand capability, and both technical and non-technical strategies to help along the way. Through interactive activities, we will engage participants in considering how reproducible research practices could apply to their own projects. We will discuss source code control, electronic laboratory notebooks, containers, and dynamic documents. Given the introductory nature of the session, a high-level survey of concepts (including available tools and software) will be provided. This session will equip participants to assess and implement next steps for incorporating reproducible research practices into their own projects.

Overview of Causal Inference Concepts for Data Scientists
Lucia Petito
Preventive Medicine (Biostatistics), Feinberg School of Medicine, Northwestern University
Room Q: Talk 2
Interactive Session: Familiarity with R is assumed. Bring a laptop with access to R to participate. See the Technical Details link at bit.ly/BDSD2020 for a list of R packages to install.

Causal inference is the unstated goal of many observational public health and medical studies that utilize big data. Most investigators believe that causation cannot be definitively proven from observational data, so they typically skip straight to estimating associations instead of considering which aspects of their study may induce bias and prohibit valid causal inferences. Here, I will give an overview of causal inference from observational data through estimation of the average treatment effect. I will provide a brief overview of the potential outcomes framework, causal assumptions, and directed acyclic graphs, and present a tool to avoid common issues: the target trial framework.
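As a hypothetical illustration of why skipping straight to associations can mislead, here is a Python sketch of average treatment effect estimation by outcome-model standardization (sometimes called the parametric g-formula). The simulated data, variable names, and linear outcome model are all assumptions for illustration, not the session's materials.

```python
import numpy as np

# Standardization: fit an outcome model including the confounder, predict
# every subject's outcome under treatment and under control, then average.
rng = np.random.default_rng(1)
n = 20000
L = rng.normal(size=n)                        # confounder
A = rng.binomial(1, 1 / (1 + np.exp(-L)))     # treatment depends on L
Y = 2.0 * A + 1.5 * L + rng.normal(size=n)    # true average treatment effect = 2.0

# The naive treated-vs-untreated contrast is confounded by L:
naive = Y[A == 1].mean() - Y[A == 0].mean()

# Outcome regression Y ~ 1 + A + L via least squares:
X = np.column_stack([np.ones(n), A, L])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]

# Standardize: predict under A=1 and A=0 for everyone, then average the difference.
X1 = np.column_stack([np.ones(n), np.ones(n), L])
X0 = np.column_stack([np.ones(n), np.zeros(n), L])
ate = (X1 @ beta - X0 @ beta).mean()
print(f"naive = {naive:.2f}, standardized ATE = {ate:.2f}")
```

The naive contrast is noticeably inflated because treated subjects tend to have higher L, while the standardized estimate recovers the true effect, which is the kind of bias the target trial framework is designed to guard against.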


The Good, the Bad and the Ugly: A Primer on the Wild West of Electronic Health Record Data
Theresa Walunas and Abel Kho
Center for Health Information Partnerships, Northwestern Medicine
Room L South

Electronic health record (EHR) data is both a treasure trove and a quagmire for clinical and translational research: the next wild west data frontier. This session will introduce electronic health records as a potential data source, outline considerations when working with protected health information, and show examples of integrating health record data with other data sources. During the session we will explore clinical and research applications of medical records and discuss their strengths, pitfalls, and cautionary concerns through both didactics and interactive conversation. By the end of this session, participants should have a better understanding of this unique and important data source, have had a few laughs, and be ready for their own showdown at the O.K. Corral of medical record data.

Analyzing Human Behavior and Lifestyle Using Wearable Sensors
Rawan Alharbi
Computer Science and Preventive Medicine, Northwestern University
Room L North
Interactive Session: Bring a laptop to participate. See the Technical Details link at bit.ly/BDSD2020 for software requirements.

Researchers seek to understand human behaviors in natural settings so they can design interventions that help manage symptoms, prevent illness, and improve health and wellbeing. Increasingly, researchers are using wearables to study behaviors such as physical activity, sleep, and eating in those settings. Wearables are an excellent tool for longitudinal behavioral studies because they allow continuous, non-invasive, passive, and personalized data collection from free-living people.

So how do wearables, like the Apple Watch, use embedded sensors and the signals extracted from them to detect human behavior? This talk will first give an overview of the end-to-end process for analyzing passive sensing data and inferring human behavior from wearables. We will walk through the passive sensing data analytic chain (PASDAC), a tool that enables programmers to clean, curate, segment, classify, and evaluate the signals generated by wearable sensors using signal processing and machine learning. We will also discuss the advantages and disadvantages of various approaches, along with opportunities for future research and code development. The second part of the talk will be an interactive coding session in which attendees will process a sample wearable sensor signal stream to build and validate an activity detection model using PASDAC.
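The segment, featurize, and classify stages of such a chain can be sketched generically. This is a hypothetical Python illustration, not PASDAC's actual API; the window sizes, feature choices, and threshold "classifier" below are placeholders.

```python
import numpy as np

def segment(signal, win=50, step=25):
    """Slide a fixed-length window over a 1-D sensor stream."""
    return np.array([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])

def featurize(windows):
    """Simple per-window features: mean, standard deviation, range."""
    return np.column_stack([windows.mean(1), windows.std(1),
                            windows.max(1) - windows.min(1)])

# Toy stream: a "still" phase followed by a high-variance "active" phase.
rng = np.random.default_rng(7)
still = rng.normal(1.0, 0.02, 1000)    # ~1 g, little motion
active = rng.normal(1.0, 0.5, 1000)    # vigorous movement
stream = np.concatenate([still, active])

feats = featurize(segment(stream))
# A trivial threshold "classifier" on the per-window standard deviation:
labels = (feats[:, 1] > 0.1).astype(int)   # 1 = active
print(labels[:5], labels[-5:])
```

In a real pipeline the threshold rule would be replaced by a trained classifier, and evaluation would use held-out, labeled activity data.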

Session 4 2:30 - 3:20 pm


From MIMIC-III to Model: Building a Workflow to Model Raw Retrospective Clinical Data
Garrett Eickelberg
Biomedical Informatics, Northwestern University
Room M
Interactive Session: Familiarity with Python is assumed. Bring a laptop with access to a recent version of the Anaconda distribution of Python to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

There is a common sentiment in the data science communities I have been part of: the majority of the work revolves around readying the data. Although this sentiment often holds true, the work that happens after data acquisition and before modeling is an overlooked part of data science. In this interactive coding session, I will guide the audience through the process of cleaning, managing, and ultimately modeling raw data from a source I both enjoy working with and find particularly challenging: electronic health records (EHRs). EHRs contain countless time-stamped measurements generated from different data sources during each patient’s clinical encounter. Moreover, most EHRs were designed with clinical care in mind, with little consideration for the usability of retrospective data analysis. In this “dirty data” adventure, I will provide a raw synthetic EHR dataset filled with examples of common data cleaning challenges and demonstrate the techniques I have learned to overcome them. By the end of this session, participants will have more experience with, and appreciation for, the work involved in EHR predictive modeling projects.

Predicting Severe Sepsis in a Children’s Hospital: Our Experience Developing an Actionable, Data-Driven Clinical Decision Support System
L. Nelson Sanchez-Pinto
Division of Critical Care, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Lurie Children’s Hospital
Room N: Talk 1

Pediatric severe sepsis and septic shock are associated with mortality rates of 10 to 30%. Furthermore, about a third of children who survive will have a decrease in their functional status, which can result in potentially lifelong adverse consequences. Prompt recognition and treatment remain the mainstay approaches to reducing mortality and morbidity. Most children who develop sepsis in the community will be treated in the emergency department, but a fraction may be admitted to the hospital with under-recognized sepsis or develop new severe sepsis while in the hospital. Recognizing and treating these children early may help improve their outcomes. Lurie Children’s has joined other members of the Children’s Hospital Association in the Improving Pediatric Sepsis Outcomes (IPSO) collaborative. At Lurie Children’s Hospital, between 1 and 2% of children in the inpatient wards will meet criteria for severe sepsis at some point during their hospitalization using the sepsis intention-to-treat criteria provided by IPSO. One of the strategies proposed by the collaborative to improve sepsis outcomes is the development of sepsis screening tools and prediction models. Our team, formed by clinicians, data scientists, quality improvement specialists, and clinical informaticians, was tasked with developing a severe sepsis prediction model for children in the inpatient wards. In this session we will describe our approach to adapting, calibrating, and validating a prediction model for severe sepsis in a population of over 40,000 pediatric inpatient encounters; our rationale for deriving clinically actionable thresholds for a prediction model-based clinical decision support system; and the design of the clinical workflows associated with this data-driven system.

Application of Topic Modeling and Information Theory for Clustering Single Cell RNA-Seq Datasets
Ziyou Ren
Medicine/DGP Biomedical Informatics Track, Northwestern University
Room N: Talk 2

Single cell RNA sequencing (scRNA-seq) technologies promise to enable the quantitative study of biological processes at the single cell level [Patel, A. P. et al. 2014; Treutlein, B. et al. 2014; Miyamoto, D. T. et al. 2015]. Commercial platforms such as 10x Chromium are becoming established in lab practice [Hwang, B. et al. 2018; Dong, M. B. et al. 2019; Xiong, X. et al. 2019]. Despite the prevalence of this approach and its technological breakthroughs, many challenges remain in developing robust, standardized computational frameworks to process and analyze scRNA-seq data. Data sparsity due to low RNA capture rates and uninformative genes are two major challenges in analyzing scRNA-seq data [Kharchenko, P. V. et al. 2014]. The current standard tool, Seurat, quantifies the similarity of every pairwise cell expression profile in order to identify “clusters” of cells of the same type [Kiselev, V. Y. et al. 2019]. However, the lack of uncertainty measurement in Seurat’s clustering algorithms fails to capture the complexity of cells in transitional states and leads to lower classification accuracy, especially in perturbed systems (e.g., disease vs. non-disease). Topic modeling, an established method for clustering documents, can be applied to single cell data through a straightforward analogy: genes are words and cells are documents. In this session, I will present a brief review of recent applications of topic modeling to single cell datasets [e.g., Carmen et al. 2019; Yotsukura, S. 2016]. I will then present my recent work on developing a new framework for cell type classification from scRNA-seq data using topic modeling and information theory.
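The genes-as-words, cells-as-documents analogy can be sketched on toy data. This is a hypothetical Python illustration using scikit-learn's LDA implementation on synthetic counts; the session's framework is not based on this code, and the data here is deliberately oversimplified.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy gene-count matrix: rows are "documents" (cells), columns are
# "words" (genes). Two cell types, each expressing a distinct gene block.
rng = np.random.default_rng(0)
counts = rng.poisson(1, size=(40, 10))            # low background expression
counts[:20, :5] += rng.poisson(20, size=(20, 5))  # type A: genes 0-4 high
counts[20:, 5:] += rng.poisson(20, size=(20, 5))  # type B: genes 5-9 high

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)                 # cell-by-topic weights
theta = theta / theta.sum(axis=1, keepdims=True)  # normalize rows explicitly

dominant = theta.argmax(axis=1)                   # each cell's dominant "topic"
print(dominant)
```

Because topic weights are proper mixtures, a cell in a transitional state could sit between topics rather than being forced into one hard cluster, which is the appeal of the approach described above.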


Heart Rate Variability Dysfunction is Associated with Outcomes in Critically Ill Children
Colleen Badke
Fellow, Pediatric Critical Care Medicine, Feinberg School of Medicine, Northwestern University
Room N: Talk 3

Objective: Discuss the role of the autonomic nervous system (ANS) in critical illness and the use of heart rate variability (HRV), a surrogate of ANS function, as a prognostic marker in the PICU.

Background: PICU patients are at risk of developing new or progressive multiple organ dysfunction syndrome (NPMODS) or dying. The ANS plays an essential role in maintaining homeostasis but may become dysregulated, indicating worsening organ dysfunction. ANS dysfunction (ANSD) can be assessed by measuring HRV.

Methods: Retrospective study of patients admitted to a large PICU between 2012 and 2016 with at least 12 hours of bedside monitor data during the first day of PICU admission. HRV was measured using integer HRV (HRVi), calculated as the standard deviation of heart rate over 5 minutes. An HRV dysfunction (HRVD) score was developed by calculating the inverse of the age-normalized HRVi values, multiplied by 10 and truncated at 0, such that higher HRVD scores indicate lower HRV and worse ANSD. Logistic regression was used to adjust for severity of illness on admission (using the PRISM III score).

Results: 5,455 pediatric patients met inclusion criteria, of whom 215 (4%) developed NPMODS and 109 (2%) died. HRVD scores ranged from 0 to 13 with a median of 1.0 (IQR 0-4.3). For every 1-point increase in the median HRVD score in the first 24 hours of admission, there was a 14% increase in the adjusted odds of NPMODS and a 23% increase in the adjusted odds of mortality (p<0.001) after adjusting for severity of illness. When combined with PRISM III, the median HRVD score had fair discrimination of NPMODS (area under the curve (AUC) = 0.74, 95% CI 0.70-0.77) and excellent discrimination of mortality (AUC = 0.92, 95% CI 0.89-0.93). This discrimination was significantly higher than for PRISM III or HRVD alone.

Conclusions: The HRVD score, an age-adjusted surrogate of ANS function, can be used to risk-stratify PICU patients.
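As a rough illustration of the HRVi measure, here is a hypothetical Python sketch. The abstract does not specify the study's exact age-normalization constants or truncation formula, so the HRVD scoring below is only one plausible reading for illustration, not the published method.

```python
import numpy as np

def hrvi(heart_rate, window=5):
    """Integer HRV: standard deviation of heart rate within each
    non-overlapping 5-minute window (heart rate sampled once per minute)."""
    hr = np.asarray(heart_rate, dtype=float)
    n = len(hr) // window
    return hr[:n * window].reshape(n, window).std(axis=1, ddof=1)

def hrvd_score(hrvi_vals, age_norm):
    """Hypothetical HRVD: lower (age-normalized) HRVi gives a higher score,
    scaled by 10 and truncated at 0 (illustrative constants only)."""
    z = hrvi_vals / age_norm
    return np.clip(10 * (1 - z), 0, None)

# Toy heart-rate trace: one variable hour, then one flat hour (low HRV).
hr = np.concatenate([80 + 5 * np.sin(np.arange(60)),   # fluctuating HR
                     np.full(60, 80.0)])               # flat HR, HRVi = 0
v = hrvi(hr)
scores = hrvd_score(v, age_norm=4.0)
print(scores[:3], scores[-3:])
```

The flat segment, where heart rate barely varies, receives the maximal dysfunction score, matching the study's direction of effect: lower HRV, higher HRVD, worse ANSD.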

Building a Community for Development of an Open Source Genomics Platform
Michael A. Bouzinier
Division of Genetics, Brigham and Women’s Hospital
Room P: Talk 1

Whole genome sequencing is rapidly becoming routine in clinical practice and in everyday life. Processing and interpreting genomic data requires more computational power and storage than any other task an ordinary person is likely to come across. One might assume this would lead to a massive software development effort; however, today most genomics platforms are either proprietary or academically developed and maintained. We have started building a community of clinicians, researchers, and professional software developers from different countries and backgrounds to create the first community-built, fully open-source genomic analysis software platform: Forome. Two important goals of the Forome Genomics Platform are to support both clinical and research workflows for all flavors of genetic data, and to build into the platform an integrated development environment for clinical rules, providing the ability to seamlessly transform research workflows into clinical guidelines and thus speeding the adoption of WGS into clinical practice. Forome includes a variant curation tool, Anfisa, which is based on three simple ideas: using OLAP for genetic data, using curated decision trees for clinical rules, and crowdsourcing the most difficult cases.

From a technical point of view, Forome Anfisa is based on the realization that genetic analysis belongs to the same class of analytical problems as data warehousing and business intelligence, in which relatively static data is processed. This is where OnLine Analytical Processing (OLAP) helps. Traditional database management systems (DBMS) try to balance efficient modification of data with fast access to it, while OLAP tools focus specifically on achieving maximum performance for data querying and information retrieval. The OLAP approach is proven in other verticals such as financial analysis and sales forecasting but, to the best of our knowledge, has never been applied to big data in genetics.

Making Sense of Big Data: Estimating the Intrinsic Dimensionality of Multi-Electrode Neural Recordings
Ege Altan
Biomedical Engineering, Northwestern University
Room P: Talk 2

A major challenge in the age of big data is to gain fundamental insights from large volumes of complex, high-dimensional data. As an example, neuroscientists can simultaneously record from thousands of neurons, a number that has been increasing exponentially for over five decades. The barrier to progress is no longer collecting data, but understanding it. Due to correlations between neurons, high-dimensional neural recordings contain redundant information. One approach to combat this redundancy is to estimate the intrinsic dimensionality of these recordings, representing the degrees of freedom required to describe the data without significant information loss. In the context of neural recordings, intrinsic dimensionality also quantifies the complexity of the information conveyed by a set of neurons. There are many methods for estimating intrinsic dimensionality, and developing them remains an active area of research. Which technique is appropriate depends on the nature of the data and the questions being posed. I will demonstrate the importance of understanding intrinsic dimensionality, and the challenges in its estimation, using Principal Components Analysis (PCA) and its variants. These challenges include determining the right hyperparameters, capturing nonlinear interactions, and robustness to noise. Then, I will provide a pipeline for estimating intrinsic dimensionality more accurately using state-of-the-art methods. Finally, I will apply this pipeline to neural data recorded from the primary motor cortex of monkeys as they performed different motor tasks, and describe the insights it provides.
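The basic idea of dimensionality estimation with PCA can be sketched on synthetic "recordings": embed a low-dimensional latent signal in many simulated neurons, add noise, and count the components needed to explain most of the variance. This is a hypothetical Python illustration; real neural data rarely behaves this cleanly, which is precisely the challenge the talk addresses.

```python
import numpy as np

# Simulate 100 "neurons" driven by a 5-dimensional latent signal plus noise.
rng = np.random.default_rng(3)
n_samples, n_neurons, latent_dim = 2000, 100, 5

latent = rng.normal(size=(n_samples, latent_dim))
mixing = rng.normal(size=(latent_dim, n_neurons))
data = latent @ mixing + 0.05 * rng.normal(size=(n_samples, n_neurons))

# PCA via SVD of the centered data matrix; squared singular values give
# the variance captured along each principal component.
centered = data - data.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
var_ratio = svals**2 / (svals**2).sum()

# Estimate: smallest number of components explaining 95% of the variance.
est_dim = int(np.searchsorted(np.cumsum(var_ratio), 0.95) + 1)
print("estimated dimensionality:", est_dim)
```

With strong signal and weak noise the 95% threshold recovers the latent dimensionality, but as noted above, the estimate is sensitive to the threshold hyperparameter and to nonlinearities that linear PCA cannot capture.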


Probabilistic Programming for Bayesian Inference Using RStan
Kyle Honegger
Data Analytics and Reporting, Lurie Children’s
Room Q
Interactive Session: Familiarity with R is assumed. Bring a laptop with access to R to participate. See the Technical Details link at bit.ly/BDSD2020 for more info.

From genomics and phylogenetics to pharmacokinetics and epidemiology, Bayesian methods are increasingly being used to perform statistical inference across biomedical disciplines. Until recently, however, their use has remained largely restricted to niche applications, where solutions to specific problems have been developed by expert Bayesian practitioners. Fortunately, recent advances in probabilistic programming have made it possible for everyday researchers to construct and perform Bayesian inference on probabilistic models custom-tailored to their specific problem. This workshop will walk participants through the process of building a custom model using the Stan modeling language for probabilistic programming. We will learn how to code a custom probabilistic model, choose sensible priors for model parameters by examining prior predictive distributions, and perform Bayesian inference using the RStan interface to R. We will evaluate the quality of the posteriors produced by MCMC sampling in RStan using diagnostic and visualization techniques, and employ approximate leave-one-out cross-validation to compare models. No prior experience with Stan or probabilistic programming is expected, but participants should have some familiarity with the principles of Bayesian reasoning.
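One of the steps above, the prior predictive check, is language-agnostic and can be sketched without Stan. This is a hypothetical Python illustration for a simple linear model; the workshop itself uses the Stan language and RStan, and the priors below are placeholders.

```python
import numpy as np

# Prior predictive check: before seeing any data, draw parameters from the
# priors and simulate datasets. If the simulations are wildly implausible
# for the scientific context, the priors need rethinking.
rng = np.random.default_rng(11)
x = np.linspace(0, 10, 50)   # hypothetical predictor grid

sim_max = []
for _ in range(1000):
    a = rng.normal(0, 1)            # prior on the intercept
    b = rng.normal(0, 1)            # prior on the slope
    sigma = abs(rng.normal(0, 1))   # half-normal prior on the noise scale
    y_sim = a + b * x + rng.normal(0, sigma, len(x))
    sim_max.append(np.abs(y_sim).max())

# Summarize the spread of outcomes implied by the priors alone.
print("90th percentile of max |y|:", np.percentile(sim_max, 90))
```

In Stan the same check is usually done by sampling a model with the likelihood commented out; the point here is only the logic of simulating from priors before touching the data.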


Thank You to Our Sponsors


#BDSD2020
DATA SCIENCE FOR EVERYBODY

