
Storage and Analysis of Big Data from Sensor Networks: Challenges and Opportunities

Sameer Tilak,

Calit2, UCSD

Large-Scale Sensor Network Applications

• Environmental Monitoring

– Limnology, Marine Science

• Participatory Sensing (Healthcare)

• Disaster Management and Emergency Response (Floods, earthquakes)

• Smart home and smart grid

Overview

Large-scale environmental observing systems consist of sensors embedded deeply within our physical environment. Typically, these systems use highly calibrated sensors deployed at strategic locations to generate science-quality data. Recently, personalized mobile sensing (e.g., sensors mounted on cars or carried by people) has received considerable attention from the research community. Although these systems rely on cheap, mobile sensors that are not well calibrated, they can provide spatial sampling diversity that traditional environmental observing systems lack. Together, these large-scale sensor networks can gather data at high spatio-temporal resolution and have the potential to give scientists unprecedented insight into complex physical environments.

Research Challenges

• Software O&M (Sensor-Rocks)

• Power Management: Smart phones and sensors (context-aware sensing)

• Visualization (Calit2 Optiportal)

• Data Processing (Storm and Apache Big data ecosystem)

• Data Storage (HDD vs SSD Technologies)

Sensor-Rocks: Sensor Network Software O&M Automation

Sameer Tilak, Tony Fountain, Philip Papadopoulos, Tajana Rosing, and Tim Telfer (MURPA)

Motivation: Automating Sensor Device and Network Definition

• Inexpensive sensor devices have no interactive screens and must therefore be configured entirely through a flashed system image, and only hand-built techniques exist today for this step;

• Scaling to 100s or 1000s of sensors means that we must be able to handle hardware heterogeneity of individual sensors without the time-consuming process of building a highly-customized, independent image for each variant; and

• We want better reproducibility of the basic software configuration so that we can easily adjust to the rapid changes of the Android environment and reap the benefits of new capabilities.

Rocks Background

• Rocks is a software toolkit that solves the cluster definition, deployment, and management problem; it has reduced the time from raw hardware to a working data-center system from days or weeks to a few hours.
• The toolkit treats the complete software footprint of any machine as a set of software packages and configuration that together form a Rocks Appliance (e.g., database, parallel file server, load manager, software router). Appliances can share packages and configuration.

Data Center Configuration using Rocks

A portion of a Rocks configuration graph and a complete configuration graph. Software installation on a given node is performed by traversing the configuration graph (sketched below).

A sample Rocks-based data center architecture
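To make the graph-traversal step concrete, here is a minimal sketch of deriving a node's package set by walking a configuration graph from its appliance vertex. This is not Rocks code: the appliance names, package names, and dictionary representation are invented for illustration; real Rocks graphs are richer XML documents.

```python
# Minimal illustration of deriving a node's software set by traversing a
# configuration graph. Names and structure are hypothetical, not Rocks'.

# Each vertex contributes packages; edges point to vertices it includes.
graph = {
    "base":        {"packages": ["kernel", "ssh"],       "includes": []},
    "database":    {"packages": ["postgresql"],          "includes": ["base"]},
    "file-server": {"packages": ["nfs-utils"],           "includes": ["base"]},
    "login":       {"packages": ["postgresql", "emacs"], "includes": ["database"]},
}

def packages_for(appliance, graph):
    """Collect the union of packages reachable from an appliance vertex."""
    seen, stack, packages = set(), [appliance], set()
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue                      # appliances may share sub-graphs
        seen.add(vertex)
        packages.update(graph[vertex]["packages"])
        stack.extend(graph[vertex]["includes"])
    return sorted(packages)

print(packages_for("login", graph))       # ['emacs', 'kernel', 'postgresql', 'ssh']
```

The same traversal, applied to different appliance vertices, yields different but overlapping package sets, which is how appliances share configuration without hand-built per-variant images.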

Android Software Stack

Android application and kernel stack

Key system-definition differences between Android and full Linux

• No scriptable system-definition framework like Red Hat's Kickstart exists today for Android.
• Unlike installed Linux systems, common tools for modifying configuration files via scripting languages are not part of the installed Android environment. This means that Linux can be used to automatically define and configure itself (this is what happens when you install from a DVD), but Android does not have the same closed form.

CyanogenMod

• Custom device firmware based on AOSP

• Android benefits for Wireless Sensor Devices:

• CM’s improvements on Android:

o Free

o Open source

o Large developer base

o Many supported devices

o Many communication options

o Virtualised applications

o Hardware abstraction layer

o Older device support

o Greater battery efficiency

o Less bloat

o Nightly builds

o CPU under/overclocking

o Wifi/Bluetooth/USB tethering

Sensor-Rocks Profile graphs

• Profile graph to describe overall set of configurations

• Rolls as sub-graphs of nodes, containing packages and scripts

• Produces an installable custom ROM for selected distributions

• Ability to specify conditions and dependencies

Sensor-Rocks (technical)

• Graph, roll, and node information stored in XML files

• Edify used to manage installation scripting (instead of Kickstart, as in standard Rocks)

• Shell scripts used for post scripting

• Package Manager handles source repositories and compiling for packages

• Roll Manager takes care of creating ROMs and the graph

• Device Manager will handle database of devices and configurations

Steps for Building a new device

• sensor-rocks-cm.py compile-roll -r earth-sensor -a "{'hw' : 'grouper'}"

• adb push <path-to-sensor-rolls>/out/rom.zip /sdcard/

• adb reboot recovery
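A minimal sketch of scripting these three steps, assuming sensor-rocks-cm.py and adb are on the PATH and behave exactly as invoked above; the roll and hardware names come from the slide, and the <path-to-sensor-rolls> placeholder is kept as-is rather than filled in:

```python
# Hypothetical wrapper around the three build/flash steps listed above.
import subprocess

SENSOR_ROLLS = "<path-to-sensor-rolls>"   # placeholder, as on the slide

def build_and_flash(roll="earth-sensor", hw="grouper"):
    # 1. Compile the roll into an installable ROM for the target hardware.
    subprocess.run(
        ["sensor-rocks-cm.py", "compile-roll", "-r", roll,
         "-a", "{'hw' : '%s'}" % hw],
        check=True)
    # 2. Push the resulting ROM image onto the device's sdcard.
    subprocess.run(
        ["adb", "push", f"{SENSOR_ROLLS}/out/rom.zip", "/sdcard/"],
        check=True)
    # 3. Reboot into recovery so the ROM can be installed.
    subprocess.run(["adb", "reboot", "recovery"], check=True)

if __name__ == "__main__":
    build_and_flash()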

Specifying Post-Config: Edify Script

Applications and Updater Script

• APK file stored appropriately

• Updater script generated automatically
  o Location
  o Contents
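As a rough illustration of what "generated automatically" can mean here, the sketch below emits a minimal Edify-style updater-script that installs a single APK. It is an assumption-heavy sketch, not the Sensor-Rocks generator: the APK name, target path, and permissions are placeholders, and partition mounting is deliberately omitted.

```python
# Hypothetical generator for a minimal Edify updater-script.
# Real Sensor-Rocks output will differ; /system mounting and device-specific
# paths are simplified or omitted here.

def make_updater_script(apk_name: str) -> str:
    lines = [
        'ui_print("Installing %s ...");' % apk_name,
        # Copy the APK bundled in the ROM zip into the system app directory.
        'package_extract_file("%s", "/system/app/%s");' % (apk_name, apk_name),
        # World-readable, owner-writable; typical for system apps.
        'set_perm(0, 0, 0644, "/system/app/%s");' % apk_name,
        'ui_print("Done.");',
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    with open("updater-script", "w") as f:
        f.write(make_updater_script("sensor-logger.apk"))  # hypothetical APK name
```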

OA Deployment

Moorea LTER

Martz SeapHox: pH, conductivity, temperature

Pro-Oceanus CO2-Pro: PCO2

MBARI-modified: pH

Sea-Bird Inductive Modem

Lake Deployment

NTL LTER

Hydro-lab DS-5: chlorophyll, dissolved oxygen

NexSens T-Node: temperature

Participatory Sensing: Healthcare

Patient-Centric Healthcare

Context-Aware Sensing

Source: Dr. Larry Smarr, Calit2, UCSD

Source: Dr. Larry Smarr, Calit2, UCSD

Data Visualization

Data Processing Challenges: Real-time and Batch-mode

• Real-time processing of tens of thousands of streams per second

• Data representation
• Integration of analysis tools
• Development of light-weight and scalable algorithms and models (see the sketch below)
• Accommodating failures: hardware (servers, disks, network, ...) and software (OS, middleware, applications)
• Accurate workload characterization
• Development of efficient data pipelines
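To ground the "light-weight and scalable algorithms" point, here is a minimal per-sensor sliding-window aggregation over a reading stream. It is illustrative only: a production deployment would run equivalent logic inside a stream processor such as Storm, distributed across many workers, and the window size and sensor names are assumptions.

```python
# Minimal sketch of per-sensor sliding-window averaging over a reading stream.
from collections import defaultdict, deque

WINDOW = 60  # keep the last 60 readings per sensor (an assumed window size)

windows = defaultdict(lambda: deque(maxlen=WINDOW))

def on_reading(sensor_id: str, value: float) -> float:
    """Ingest one reading and return the current windowed mean for the sensor."""
    w = windows[sensor_id]
    w.append(value)
    return sum(w) / len(w)

if __name__ == "__main__":
    # Simulated readings from two hypothetical sensors.
    stream = [("buoy-ph", 8.05), ("buoy-ph", 8.07), ("lake-temp", 14.2),
              ("buoy-ph", 8.02), ("lake-temp", 14.5)]
    for sensor, value in stream:
        print(sensor, round(on_reading(sensor, value), 3))
```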

Big Data Ecosystem

• Apache Hadoop
• Apache Pig
• Apache Hive
• Apache Cassandra
• Apache HBase
• Apache Oozie
• Storm
• Esper
• Add your favorite here…

Solid State Drives

• Solid-state (flash) drives present an attractive storage option. They are becoming cheaper and more common in data centers, and we believe this trend will continue. By 2020, the quantity of electronically stored data is projected to reach 35 trillion gigabytes. Big data technologies such as Apache Hadoop, HBase, Pig, and Hive are striving to make the storage, manipulation, and analysis of huge volumes of data cheaper and faster than ever. With current 6 Gbps SATA III interfaces, NAND-based solid-state drives deliver astounding random-I/O performance compared to traditional hard-disk drives. However, their benefit for big data processing has not yet been quantified.
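For readers unfamiliar with what "random I/O" means at the device level, the sketch below times random block reads from a file. This is not the methodology behind the results later in the deck (those come from Hadoop's TestDFSIO); the file paths are hypothetical, and page-cache effects are not controlled for.

```python
# Tiny illustration of file-level random-read benchmarking (HDD vs. SSD).
import os, random, time

def random_read_mbps(path: str, block: int = 4096, reads: int = 2000) -> float:
    """Time `reads` random block-sized reads from `path` and return MB/sec."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = time.time()
        for _ in range(reads):
            f.seek(random.randrange(0, max(size - block, 1)))
            f.read(block)
        elapsed = time.time() - start
    return (reads * block) / (1024 * 1024) / elapsed

if __name__ == "__main__":
    # Point these at large files living on the devices you want to compare.
    for label, path in [("hdd", "/data/hdd/testfile"), ("ssd", "/data/ssd/testfile")]:
        print(label, round(random_read_mbps(path), 1), "MB/sec")
```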

FutureGrid Key Concepts I

• FutureGrid is an international testbed modeled on Grid5000

– June 27 2012: 225 Projects, 920 users

• Supporting international Computer Science and Computational Science research in cloud, grid and parallel computing (HPC)

– Industry and Academia

• The FutureGrid testbed provides to its users:

– A flexible development and testing platform for middleware and application users looking at interoperability, functionality, performance or evaluation

– Each use of FutureGrid is an experiment that is reproducible

– A rich education and teaching platform for advanced cyberinfrastructure (computer science) classes

Source: Prof. Geoffrey Fox, Indiana University.

FutureGrid: a Grid/Cloud/HPC Testbed

[Figure: FutureGrid sites and network, showing private and public resources connected over the FG network; NID = Network Impairment Device; includes a 10 TF disk-rich + GPU system with 320 cores.]

Source: Prof. Geoffrey Fox, Indiana University.

Compute Hardware

Name     System type                           # CPUs            # Cores           TFLOPS  Total RAM (GB)       Secondary Storage (TB)  Site  Status
india    IBM iDataPlex                         256               1024              11      3072                 180                     IU    Operational
alamo    Dell PowerEdge                        192               768               8       1152                 30                      TACC  Operational
hotel    IBM iDataPlex                         168               672               7       2016                 120                     UC    Operational
sierra   IBM iDataPlex                         168               672               7       2688                 96                      SDSC  Operational
xray     Cray XT5m                             168               672               6       1344                 180                     IU    Operational
foxtrot  IBM iDataPlex                         64                256               2       768                  24                      UF    Operational
Bravo    Large disk & memory                   32                128               1.5     3072 (192 GB/node)   192 (12 TB/server)      IU    Operational
Delta    Large disk & memory with Tesla GPUs   32 CPU + 32 GPU   192 + 14336 GPU   9 (?)   1536 (192 GB/node)   192 (12 TB/server)      IU    Operational

TOTAL cores: 4384

Source: Prof. Geoffrey Fox, Indiana University.

FutureGrid: Inca Monitoring

Source: Prof. Geoffrey Fox, Indiana University.

5 Use Types for FutureGrid

• 225 approved projects (~920 users) as of June 27, 2012

– USA, China, India, Pakistan, lots of European countries
– Industry, Government, Academia

• Training, Education and Outreach (8%) – semester-long and short events; promising for small universities

• Interoperability test-beds (3%) – Grids and Clouds; Standards; from Open Grid Forum OGF

• Domain Science applications (31%) – Life science highlighted (18%), Non Life Science (13%)

• Computer science (47%) – Largest current category

• Computer Systems Evaluation (27%) – XSEDE (TIS, TAS), OSG, EGI

• Clouds are meant to need less support than other models, yet FutureGrid needs more user support.

Source: Prof. Geoffrey Fox, Indiana University.

Online MOOCs

• Science Cloud MOOC repository

– http://iucloudsummerschool.appspot.com/preview

• FutureGrid MOOCs – https://fgmoocs.appspot.com/explorer

• A MOOC that will use FutureGrid for class laboratories (for advanced students in IU Online Data Science masters degree) – https://x-informatics.appspot.com/course

• MOOC Introduction to FutureGrid can be used by all classes and tutorials on FutureGrid

• Currently uses Google Course Builder (Google Apps + YouTube) – built as a collection of modular ~10-minute lessons

Source: Prof. Geoffrey Fox, Indiana University.

SSD Experimentation using Lima, a FutureGrid Resource

Lima @ UCSD

• 8 nodes, 128 cores

• AMD Opteron 6212

• 64 GB DDR3

• 10GbE Mellanox ConnectX 3 EN

• 1 TB 7200 RPM enterprise SATA drive

• 480 GB SSD SATA Drive (Intel 520)
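The HDD/SSD comparisons below were produced with Hadoop's TestDFSIO benchmark. As a rough sketch of how such a sweep can be driven, the Python below invokes TestDFSIO for write and read passes; the jar path and the specific -nrFiles/-fileSize values are assumptions for illustration, not the exact parameters used on Lima.

```python
# Hypothetical driver for a sweep of TestDFSIO write/read runs.
# Assumes a working Hadoop installation; jar path and values are placeholders.
import subprocess

TEST_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"
SWEEP = [1000, 2000, 5000]   # illustrative values; the slides do not state the
                             # exact mapping of plotted x-values to parameters

def run_testdfsio(mode: str, n_files: int, file_size_mb: int) -> None:
    """Run one TestDFSIO pass; mode is '-write' or '-read'."""
    subprocess.run(
        ["hadoop", "jar", TEST_JAR, "TestDFSIO", mode,
         "-nrFiles", str(n_files), "-fileSize", str(file_size_mb)],
        check=True)

if __name__ == "__main__":
    for size in SWEEP:
        run_testdfsio("-write", 10, size)   # write first so the read pass has files
        run_testdfsio("-read", 10, size)
```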

TestDFSIO: Throughput (MB/sec)

[Charts: random write and random read, HDD vs. SSD, for x-axis values 1000-25000.]

TestDFSIO: Execution Time (sec)

[Charts: random write and random read, HDD vs. SSD, for x-axis values 1000-25000.]

TestDFSIO: Throughput (MB/sec)

[Charts: random write and random read, HDD vs. SSD, for x-axis values 10-250.]

TestDFSIO: Execution Time (sec)

[Charts: random write and random read, HDD vs. SSD, for x-axis values 10-250.]

MURPA Experience

• Geoff Pascoe

• Thomas Moore (Journal article in SPE): Honors thesis

• Tim Telfer (one conference paper, and will try to submit one more): Honors thesis

Why UCSD?

• World-renowned research university

• Calit2 is multi-disciplinary

• San Diego: great weather and culturally diverse