Date post: | 24-Dec-2015 |
Upload: | domenic-campbell |
Starting with Big Data
• Why care?
• In your reach: big data and big compute on a budget
• Start with data and apply math
• D4M with Accumulo: new technology from MIT and NSA that claims
  • It requires 100x less code; and
  • It is 100x faster than other approaches
• Fundamentally mathematical analysis for big data
• Lift the lid.
Virtualnation
Understand the world through data and math
• How do you want to understand the world?
• IT approaches have evolved from a past where IT was expensive and controlled by the few
• Problems were modeled and constrained not only to fit onto limited computers but also to fit in with the politics of the enterprise
• If you could observe without built-in constraints and preconceived bias, how would you approach computing?
• Understand through the scientific method: data and math
The Primordial Web (92)
[Diagram: Client (Browser, html) <-> Server (http; Language) <-> Database (sql). Arrows: http put, http get, SQL, data. Also labeled: Gopher.]
• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work just to view data.
• Won't catch on.
The Modern Web
[Diagram: Client (Game, data) <-> Server (http; Language) <-> Database (triples). Arrows: http put, http get, java, data.]
• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data.
• Great view. Massive data.
Future Web?
[Diagram: Client (Game, data) <-> Server (http; Language) <-> Database (triples). Arrows: http put, http get, java, data.]
• Game GUI! Fileserver for files! D4M for analysis! Triples for data!
• A little work to view a lot of data. Securely.
• Great view. Massive data.
Big Data and Big Compute on a budget
• ~$9K server with 256GB RAM, 32 CPU cores and 1.7TB SSD
• ~$26K for a 270TB storage server
• $199 for a 4TB USB drive
• ZFS / SmartOS as a free virtualization technology
• ~68TB: the entire transactional corpus of a $45B Australian retailer
• How big are your possible data sets?
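As a rough illustration of the budget figures above, the following Python sketch works out cost per terabyte from the quoted prices and checks how many copies of the ~68TB retail corpus fit on one storage server. The variable names are illustrative, the prices are the approximate figures from the slide, and redundancy and filesystem overhead are ignored (an assumption).

```python
# Approximate prices and capacities quoted on the slide.
storage_server_cost_usd = 26_000
storage_server_tb = 270
usb_drive_cost_usd = 199
usb_drive_tb = 4
retail_corpus_tb = 68  # entire transactional corpus of the retailer


def cost_per_tb(cost_usd, capacity_tb):
    """Return the cost in US dollars of one terabyte of raw capacity."""
    return cost_usd / capacity_tb


server_rate = cost_per_tb(storage_server_cost_usd, storage_server_tb)
usb_rate = cost_per_tb(usb_drive_cost_usd, usb_drive_tb)

# Whole copies of the corpus that fit on one storage server.
# Assumption: raw capacity only, no redundancy or filesystem overhead.
copies_on_server = storage_server_tb // retail_corpus_tb

print(f"Storage server: ~${server_rate:.0f}/TB")
print(f"USB drive:      ~${usb_rate:.0f}/TB")
print(f"Corpus copies per server: {copies_on_server}")
```

On these figures, a single storage server holds the whole corpus with room to spare, which is the point of the closing question.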
Apache Accumulo
NSA's Bigtable implementation, now a top-level Apache project
Cell-level security to support privacy and need-to-know
Supports large-scale processing of sparse matrices…
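To make the idea of cell-level security concrete, here is a minimal Python sketch in which every cell carries its own visibility requirement and a scan returns only the cells a reader is authorized to see. It is not Accumulo's API: the cell layout, the label names, and the `is_visible` and `scan` helpers are all hypothetical, and visibility is modelled as a list of alternative label sets rather than as an expression string.

```python
# Each cell is (row, column, visibility, value). Visibility is modelled as
# a list of alternative label sets: [{"a", "b"}, {"c"}] means (a AND b) OR c.
# An empty list means the cell is visible to everyone. This structure is a
# simplification for illustration only; it is not Accumulo's API.
cells = [
    ("txn001", "amount", [], "42.50"),
    ("txn001", "customer", [{"privacy"}], "Alice"),
    ("txn001", "card", [{"privacy", "finance"}, {"audit"}], "4111-xxxx"),
]


def is_visible(visibility, authorizations):
    """Return True if the authorizations satisfy at least one alternative."""
    if not visibility:
        return True
    return any(required <= authorizations for required in visibility)


def scan(cells, authorizations):
    """Return only the (row, column, value) entries the reader may see."""
    auths = set(authorizations)
    return [(r, c, v) for (r, c, vis, v) in cells if is_visible(vis, auths)]


print(scan(cells, []))           # reader with no authorizations
print(scan(cells, ["privacy"]))  # reader cleared for "privacy" only
print(scan(cells, ["audit"]))    # reader cleared for "audit" only
```

Because the check is applied per cell, two readers scanning the same row can legitimately see different columns, which is what supports need-to-know access.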
Parallel Warehouse Scale Computer
[Diagram: Parallel Architecture. Several nodes, each with CPU, RAM and disk, connected by a Network Switch.]
[Diagram: Memory Hierarchy. Levels: Registers, Cache, Local Memory, Remote Memory, Disk. Unit of Memory: Instruction Operands, Blocks, Messages, Pages.]
[Diagram: Implications. Axes: Bandwidth, Latency, Capacity, Programmability, each marked "High". Additional label: SSD.]
See http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Starting with Big Data
• Now cheap to collect all data forever.
• Unconstrained approach to data acquisition
• No analysis up front or modeling
• Much of it involves Graph Analytics
Cyber
• GOAL: Detect cyber attacks or malicious software
Social
• GOAL: Identify hidden social networks
ISR
• GOAL: Identify anomalous patterns of life
D4M - Signal Processing on Databases
High Level Composable API: D4M (“Databases for Matlab”)
Weak Signatures, Noisy Data, Dynamics
Novel Analytics for: Text, Cyber, Bio
Interactive Super-computing
High Performance Computing: Cluster + Hadoop
Distributed Database / Distributed File System
Distributed Database: Accumulo/HBase (triple store)
Matlab Demo - Reuters Corpus V1 (NIST)
810,000 Reuters news items
The demonstration picked 70,000 items and found 13,000 entities
A is a 70Kx13K associative array with 500K entries.
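The sizes quoted above show why a sparse representation matters. The following Python sketch computes how full the 70K x 13K array is and compares a rough dense and sparse memory footprint. The byte sizes (8 bytes per dense cell, 24 bytes per stored entry) are assumptions chosen for illustration, not figures from the demonstration.

```python
# Dimensions and entry count quoted on the slide.
rows, cols, entries = 70_000, 13_000, 500_000

possible_cells = rows * cols        # every (item, entity) pair
density = entries / possible_cells  # fraction of cells actually populated

# Assumed storage costs, for illustration only.
BYTES_PER_DENSE_CELL = 8     # one 64-bit number per cell
BYTES_PER_SPARSE_ENTRY = 24  # row index + column index + value

dense_bytes = possible_cells * BYTES_PER_DENSE_CELL
sparse_bytes = entries * BYTES_PER_SPARSE_ENTRY

print(f"Possible cells: {possible_cells:,}")
print(f"Density: {density:.4%}")
print(f"Dense storage:  ~{dense_bytes / 1e9:.2f} GB")
print(f"Sparse storage: ~{sparse_bytes / 1e6:.0f} MB")
```

Under these assumptions fewer than one cell in a thousand is populated, so storing only the non-empty entries cuts the footprint by a factor of several hundred.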
D4M demonstrations
D4M Stores Giant Sparse Matrices in the Accumulo Triple Store Database
Triple Store: Distributed Database
Query: T(:,ggaatctgcc)
Associative Arrays: Numerical Computing Environment
D4M: Dynamic Distributed Dimensional Data Model
[Graph figure with nodes A, B, C, D, E]
A D4M query returns a sparse matrix or graph from a triple store…
…for statistical signal processing or graph analysis in Matlab
Triple stores are high-performance distributed databases for heterogeneous data
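The query pattern on this slide can be sketched in a few lines of Python. This is not D4M or Accumulo code: the triples, the row keys such as `seq1`, and the `query` helper are hypothetical, and a nested dictionary stands in for the sparse matrix that a real query would return.

```python
from collections import defaultdict

# Hypothetical triples (row key, column key, value). Keys are plain strings,
# which is what lets a triple store hold heterogeneous data.
triples = [
    ("seq1", "ggaatctgcc", 1),
    ("seq1", "ttgacca", 1),
    ("seq2", "ggaatctgcc", 1),
    ("seq3", "acgtacgt", 1),
]


def query(triples, row=None, col=None):
    """Select triples by row and/or column key; None plays the role of ':'.

    Returns a nested dict {row: {col: value}}, a simple sparse matrix.
    """
    result = defaultdict(dict)
    for r, c, v in triples:
        if (row is None or r == row) and (col is None or c == col):
            result[r][c] = v
    return dict(result)


# Analogue of the slide's T(:,ggaatctgcc): every row with an entry in
# that column.
A = query(triples, col="ggaatctgcc")
print(A)
```

The result can be read either as a sparse matrix or as a graph whose edges link each returned row key to the queried column key, which matches the two uses named on the slide.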