Date post: 17-Aug-2020
Data Analytics And Analysis Support in Research Services Kang Lee and Ben Rogers
Page 1: Data Analytics And Analysis Support in Research Services · 2018. 1. 17. · • Data analytics training workshops provided in 2017 • Data Science Institute Spring/Summer (Jan,

Data Analytics And Analysis Support in Research Services

Kang Lee and Ben Rogers

Trends in Research Computing

• Traditional Needs Not Going Away
• Large Scale Data Analytics Growing Rapidly
• Changing Research Data Sets
• Collaborative Data Analytics
• Interactive Data & Method Publishing

Traditional Needs Not Going Away

[Figure: Total Research Storage (TB) by quarter, Q1 2011 through Q1 2018; axis from 0 to 10,000 TB; series: Dedicated, HPC Infrastructure, RDSS, LSS]

Large Scale Data Analytics Growing Rapidly

• What do I mean by data analytics?
  • Applied Statistics
  • Machine Learning
  • “Big Data”

• What is “large scale”? Data analytics that you can’t efficiently do on a standard desktop system.

How Do We Know It’s Growing?

• National GPU Resources Heavily Oversubscribed
• AWS Volta Spot Instances at $200/hour during SC17
• 59 People Attended Deep Learning Workshop
• MRI Grant with Deep Learning Focus – 35 Faculty Support Letters, $3.3 Million
• Support emails every week about R or Python
• One of the major areas of interest/discussion during Research Computing Council roadmap development

Applied Statistical Analysis On Large Data Sets

• Examples
  • Identifying significant trends in business data
  • Examining outcomes in epidemiological data

• Common Tools – R, Python

Machine Learning

• Examples
  • Using medical imaging output to diagnose diseases.
  • Examining the effectiveness of CAPTCHA in the context of modern computer vision algorithms.

• Common Tools – TensorFlow, Caffe, Theano, Torch, scikit-learn, Python

“Big Data”

• Not gaining widespread traction (Hadoop, Spark, etc.)
• Campus Hadoop pilot use was ~90% coursework.
• Why?
  • Most structured research data sets are not large enough to require these tools on modern servers.
  • Disciplinary tools must support the new paradigm of data access and computation.

Changing Research Data Sets

• Multimodal Data
  • Data integration challenges
  • Larger data sets

• Passive Data Collection
  • Less controlled data collection (messier)
  • More missing data

• Data Reuse
  • May not have been collected with current purpose in mind

• Streaming Data
  • Desire for real-time analysis
  • Larger data sets

Collaborative Data Analytics and Interactive Data & Method Publishing

• Researchers increasingly want to collaborate directly on their data analytics in shared “electronic notebooks”.

• Researchers wish to be able to publish their work to the web with interactive mechanisms so that others can easily explore their results and/or data.

• Platforms – RStudio/Shiny, Jupyter/Jupyter Hub, Custom Web Applications

Challenges To Current Campus Services

• Exploratory Data Analytics requires interactivity
• Complex Software Stacks
  • TensorFlow
  • Spark/Hive
• Containers (Good & Bad)
• GPU costs
• Structured data store support
• Lack of some needed cloud integrations
• Lack of good service/funding models

Data Analytics Training
Interactive Data Analysis Environments
Iowa Quantified Pilot Project
Data Analytics Consulting Examples
Social Media Analytics

Data Analytics Training

• Data analytics training workshops provided in 2017

• Data Science Institute Spring/Summer (Jan, Jun)

• Introduction to Python Data Analytics (Jun, Aug, Sep, Dec)

• Introduction to Python Data Analytics for the Tippie College of Business (Nov)

• NVIDIA Deep Learning Institute (Jul)

• Web Scraping with Python (Oct)

• XSEDE Big Data Workshop (May, Dec)

Data Analytics Training

[Figure: Turnouts by Department; axis from 0 to 10. Annotation: A long tail]

Data Analytics Training

• Three dimensions of data analytics training

• Skill – data collection, refinement, exploration, modeling, visualization, publication

• Tool – widely-used open-source data analytics tools such as Python and R

• Level – introductory, intermediate, advanced

• We’re trying to meet the rising demand for a variety of data analytics training opportunities from a wide range of disciplines on campus

• UI3 provides regular coursework on informatics and data analytics

Data Analytics Training

• Big picture of the direction of data analytics training

[Diagram: training topics (General Introduction, Applied Statistics, Machine Learning, Deep Learning, Domain/Problem-Specific Topics) arranged across levels (Introductory Level, Intermediate Level, Advanced Level)]

Interactive Data Analysis Environments

Jupyter Notebook vs. Jupyter Hub

RStudio Desktop vs. RStudio Server

Running on a desktop vs. running on a server (or in the cloud)

Web-based

Interactive Data Analysis Environments

Jupyter Notebook vs. Jupyter Hub

RStudio Desktop vs. RStudio Server

Useful for individual work vs. useful for teamwork, teaching and publication

Interactive Data Analysis Environments

Funded by NSF (National Science Foundation)

“XSEDE is a single virtual system that scientists can use to interactively share computing resources, data and expertise. People around the world use these resources and services — things like supercomputers, collections of data and new tools — to improve our planet.”

“Jetstream, led by the Indiana University Pervasive Technology Institute (PTI), adds cloud-based, on-demand computing and data analysis resources to the national cyberinfrastructure.”

Interactive Data Analysis Environments

Quick demo of Jetstream Portal & Jupyter Hub

Interactive Data Analysis Environments

• What’s good for instructors?
  • They can easily create their own training environment, datasets and class materials and share them with trainees
    • Have all trainees start with the same environment
    • Minimize the time needed to tackle technical issues from each computer
  • The interactive environment is particularly useful for designing hands-on practice
  • They can have easy control over computing resources allocated to trainees

• What’s good for trainees?
  • They can benefit from the powerful computing resources of servers and won’t have to care about the computing power of their own computers

Interactive Data Analysis Environments

• Some drawbacks
  • Instructors need at least basic knowledge of servers
    • How to set up a Linux/Windows server
    • How to create user accounts
    • How to install software
    • How to monitor system resources
  • Cloud services could be unavailable when you need them
    • Due to scheduled maintenance
    • Due to unexpected hardware failure

Iowa Quantified Pilot Project

• A framing question: “What would you do with 10,000 ~$10 wireless sensors?”

Iowa Quantified Pilot Project

A Cloud-Based Scientific Gateway for Internet of Things Data Analytics

Iowa Quantified Pilot Project

A Cloud-Based Scientific Gateway for Internet of Things Data Analytics

Internet of Things: refers to the network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, actuators, and network connectivity which enables these objects to connect and exchange data (from https://en.wikipedia.org/wiki/Internet_of_things)

Iowa Quantified Pilot Project

A Cloud-Based Scientific Gateway for Internet of Things Data Analytics

Scientific Gateway: refers to an interface designed specifically to support a particular type of scientific research, with an emphasis on supporting the entire scientific process from start to finish (from https://kb.iu.edu/d/auwv)

Iowa Quantified Pilot Project

A Cloud-Based Scientific Gateway for Internet of Things Data Analytics

Fully implemented in the AWS (Amazon Web Services) ecosystem using AWS IoT, Amazon S3, Amazon Elasticsearch Service, AWS Lambda, etc.

Iowa Quantified Pilot Project

• Sensor deployment

  • 20 volt solar panel
  • IP56 rugged enclosure
  • 12 volt deep cycle marine battery
  • 12 volt to 5 volt DC/DC step-down converter
  • Raspberry Pi 2 Model B
  • GSM cellular data modem
  • Arduino with LoRaWAN module
  • LoRaWAN data read over serial connection with Python code.
  • Python handles all gateway tasks and MQTT communication with
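
The last two bullets describe a Python gateway that reads LoRaWAN data over a serial connection and forwards it over MQTT. A minimal sketch of the step in between, under stated assumptions: the comma-separated line format, the field names, and the `sensors/...` topic layout are all hypothetical, since the slides give none of them. Reading from the serial port and publishing to a broker are left out so the example runs with only the standard library.

```python
import json
import time

# Hypothetical topic prefix; the slides do not give the real topic names.
TOPIC_PREFIX = "sensors"


def parse_reading(line):
    """Parse one serial line of the assumed form 'node_id,temperature_c,humidity_pct'."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None  # malformed or partial line; skip it
    node_id, temp, hum = parts
    try:
        return {
            "node": node_id,
            "temperature_c": float(temp),
            "humidity_pct": float(hum),
        }
    except ValueError:
        return None  # non-numeric field; skip it


def to_mqtt_message(reading, timestamp=None):
    """Return the (topic, payload) pair a gateway could publish for one reading."""
    ts = time.time() if timestamp is None else timestamp
    topic = f"{TOPIC_PREFIX}/{reading['node']}/telemetry"
    payload = json.dumps({**reading, "ts": ts}, sort_keys=True)
    return topic, payload
```

In a real gateway, each line read from the serial port would pass through `parse_reading`, and the resulting topic and payload would be handed to whatever MQTT client the deployment uses.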

Iowa Quantified Pilot Project

• Architecture: Amazon Web Services (AWS) Ecosystem

[Diagram: architecture components and labels]
  • Data sources: IoT Things (Farm Telemetry, Wind Telemetry, Any IoT Things) over MQTT, HTTP, Any Protocol
  • Message Stream Processing: AWS IoT (Device Gateway, Message Broker, Rules Engine)
  • AWS Lambda
  • Database: Amazon Elasticsearch (Elasticsearch Index)
  • Data Warehouse: Amazon S3 (Bucket)
  • Monitoring Dashboards: Kibana, Amazon Elasticsearch
  • Analytics: Amazon EC2 (Jupyter Notebook, RStudio)
  • Flow labels: Analysis, Backup, Monitoring, In-Depth Analysis
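
The architecture lists AWS Lambda alongside an Elasticsearch index but does not say what the Lambda step does. One plausible role, sketched here purely as an assumption, is turning a batch of telemetry messages into a request body for the Elasticsearch bulk API (newline-delimited JSON, one action line followed by one document line). The `readings` event key and the `telemetry` index name are hypothetical, and depending on the Elasticsearch version the action line may need extra metadata such as a document type.

```python
import json


def build_bulk_body(messages, index_name="telemetry"):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON)."""
    lines = []
    for msg in messages:
        # Action line: index the document that follows into `index_name`.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the telemetry document itself.
        lines.append(json.dumps(msg, sort_keys=True))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def lambda_handler(event, context):
    """Entry point using the standard AWS Lambda Python handler signature.

    Assumes `event` carries a list of readings under the hypothetical key
    'readings'; a real rule could deliver a different shape.
    """
    body = build_bulk_body(event.get("readings", []))
    # Sending `body` to the Elasticsearch endpoint is deliberately left out.
    return {"documents": body.count("\n") // 2, "bulk_body": body}
```

Keeping the body construction in a plain function makes it testable without any AWS account; only the final HTTP call would depend on the deployment.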

Iowa Quantified Pilot Project

Quick demo of Kibana Dashboard

Iowa Quantified Pilot Project

• Future directions
  • Get the software infrastructure organized so that it is structured as a service on campus
  • Seek projects with funding that can support the infrastructure
  • Investigate the possibility of an institute that can help faculty develop their own sensors

Data Analytics Consulting Examples

• Consultations I’ve provided in 2017

Problem: Detail

  • Data collection
    • Web scraping of news articles
    • Web scraping of academic papers
    • Web scraping of product information
    • Twitter data collection
  • Data handling
    • Data extraction from the cloud
  • Modeling
    • Student survey data classification
  • Insight development
    • Awarded grant analysis
    • Data analysis strategies
    • Monitoring target audience on social media
    • Text analytics support for UI researchers
    • Network/topic analysis
  • Software
    • Parallelization in Python
    • Variable scoping in MATLAB
    • Running Jupyter Notebook on HPC
    • Building a webserver for data visualization

Social Media Analytics

• Social media analysis depends heavily on data collection & management
  • Web scraping vs. API (Application Programming Interface)
  • A dedicated server is needed for continuous data collection
  • Responsibility for the data falls on the user once it is collected

• Types of analytics
  • Statistics for understanding numbers
  • Text analytics for understanding text
  • Network analytics for understanding user/keyword networks
  • Geospatial analytics for understanding geographical or spatial characteristics
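
To make the text and network analytics types concrete, here is a small standard-library sketch. It is illustrative only: the tokenizer, the stopword list, and the sample posts in the usage note are assumptions, not the methods or data used in the talk. It counts keywords (text analytics) and builds weighted keyword co-occurrence edges (a simple keyword network).

```python
import re
from collections import Counter
from itertools import combinations

# A tiny, hypothetical stopword list; real analyses use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase a post and keep alphabetic tokens and hashtags, minus stopwords."""
    tokens = re.findall(r"#?[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def keyword_counts(posts):
    """Text analytics: how often each keyword appears across all posts."""
    return Counter(t for post in posts for t in tokenize(post))


def cooccurrence_edges(posts):
    """Network analytics: weighted edges between keywords that share a post."""
    edges = Counter()
    for post in posts:
        unique = sorted(set(tokenize(post)))
        # Each unordered keyword pair within one post adds 1 to its edge weight.
        edges.update(combinations(unique, 2))
    return edges
```

For example, calling `cooccurrence_edges(["Get the flu shot", "Shot clinic on campus"])` yields one edge per keyword pair within each post; the resulting counter can be loaded into any graph library for layout or centrality measures.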

Social Media Analytics

• Social Media Interest Group
  • A diverse group of 9 faculty members, 3 staff members, and 1 graduate student who are interested in social media
  • Gathers once a month to share information and look for collaboration opportunities
  • Anyone interested is welcome to join

• Computational Psychiatry Interest Group
  • A diverse group led by Prof. Jacob Michaelson at Psychiatry with a number of faculty members and MDs
  • Focused on computational aspects of psychiatry
  • Gathers once a month to share information and look for collaboration opportunities

Social Media Analytics

• Twitter analysis
  • My own tweet stream collection for more than two years
  • +100K random English tweets (+100GB) per month
  • Contact me if you need to do quick analysis using Twitter data
  • We’re looking to provide Twitter data as a service on campus

Social Media Analytics

Quick demo of a Twitter analysis example on addiction

