1
Augment Your Analytics
Ecosystem Through Scalable
Graph Analytics
Kiran Narsu, YarcData
2
What is Graph Analytics?
Models of complex networks of data in a graph representation of nodes and edges
The nodes represent entities of interest and the edges represent the relationships between entities
The analysis of nodes and edges provides information on relationships in the data
3
What is a Graph?
Not a chart
A graph is a fundamental data structure: a collection of vertices (nodes) and edges (links, relationships, connections)
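As an illustration of the definition above, here is a minimal sketch of a graph held as adjacency sets. The entity names and the edge list are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is one relationship between two entities.
edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Alice", "Carol"), ("Carol", "Dave")]

# Adjacency sets: vertex -> set of directly connected vertices (undirected).
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

vertices = sorted(graph)
degree = {v: len(graph[v]) for v in vertices}  # number of edges touching each vertex
print(vertices)
print(degree)
```

Adjacency sets are only one possible representation; an edge list or a triple list carries the same information.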
4
Graphs are Everywhere
5
Simple Graph
RDF Triple
Subject | Predicate | Object
1234 | First Name | John
6
Graph vs. Relational - Example
Customer
Cust ID | Entity Type | Tax ID
1234 | Person | 999-99-9999

Account Position Mapping
AcctId | Instrument | Quantity | Instrument Type
AC567 | IBM | 1000 | Equity
AC567 | USIBM_OPT | 100 | Equity Option

CustomerAccount
Cust ID | Account ID
1234 | AC567

Account Master
Account ID | Account Type
AC567 | Trading Account

Person
Cust ID | Last Name | First Name
1234 | Smith | John

1234,First Name,John
1234,Last Name,Smith
1234,Entity Type,Person
1234,Tax ID,999-99-9999
1234,Account ID,AC567
AC567,Account Type,Trading Account
AC567,Instrument,IBM
AC567,Instrument,USIBM_OPT
IBM,Quantity,1000
USIBM_OPT,Quantity,100
IBM,Instrument Type,Equity
USIBM_OPT,Instrument Type,Equity Option
Traditional (relational tables) vs. Graph (RDF Triples)
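The mapping from relational rows to triples can be sketched as follows. This is a minimal illustration, assuming each row has one key column that becomes the subject; `row_to_triples` is a hypothetical helper, not part of any product or library.

```python
def row_to_triples(row, key):
    """Turn one relational row into (subject, predicate, object) triples.

    Assumption: the value of the key column is the subject, and every
    other column name becomes a predicate.
    """
    subject = row[key]
    return [(subject, column, value) for column, value in row.items() if column != key]

# Rows copied from the Customer and Person tables in the example.
customer = {"Cust ID": "1234", "Entity Type": "Person", "Tax ID": "999-99-9999"}
person = {"Cust ID": "1234", "Last Name": "Smith", "First Name": "John"}

triples = row_to_triples(customer, "Cust ID") + row_to_triples(person, "Cust ID")
for triple in triples:
    print(",".join(triple))
```

Both tables collapse into one flat list keyed by the same subject, which is why no join is needed to bring the customer's attributes together.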
7
Data Representation in Graph Format
[Figure: the RDF triples drawn as a graph. Nodes: 1234, John, Smith, Person, 999-99-9999, AC567, Trading Account, IBM, USIBM_OPT, 1000, 100, Equity, Equity Option, Option, Financial Instrument. Edge labels shown include First Name, Quantity and Type Of.]
Equity,Type Of,Financial Instrument
Equity Option,Type Of,Option
Option,Type Of,Financial Instrument
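Because the "Type Of" definitions sit in the same graph as the data, a traversal can combine the two. The sketch below follows "Type Of" edges to list every category an instrument belongs to. The function name `all_types` is an assumption made for illustration, and the type name is spelled "Equity Option" in every triple.

```python
from collections import defaultdict

triples = [
    ("USIBM_OPT", "Instrument Type", "Equity Option"),
    ("IBM", "Instrument Type", "Equity"),
    ("Equity", "Type Of", "Financial Instrument"),
    ("Equity Option", "Type Of", "Option"),
    ("Option", "Type Of", "Financial Instrument"),
]

# Index the "Type Of" edges: child type -> set of parent types.
parents = defaultdict(set)
for s, p, o in triples:
    if p == "Type Of":
        parents[s].add(o)

def all_types(instrument):
    """Return the direct instrument type plus every ancestor reachable via 'Type Of'."""
    found = set()
    stack = [o for s, p, o in triples if s == instrument and p == "Instrument Type"]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(all_types("USIBM_OPT")))
# ['Equity Option', 'Financial Instrument', 'Option']
```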
8
Emerging Questions Demand Emerging Approaches
Question | Graph Technique | Challenge
What is the shortest non-obvious path connecting two entities? | Path Analysis | Pre-loading all paths to analyze connections is difficult
Who are the central players in a given fraud event? | Betweenness Centrality | Fixed relational models inhibit finding entities who are central
What clusters or communities exist in a population? | Community Detection, Clustering | Finding communities without “bias” and then discovering attributes is technically difficult
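As a rough illustration of path analysis, the sketch below finds one shortest path between two entities with a breadth-first search. The graph, the entity names and the function `shortest_path` are hypothetical; a production system would run this over far larger data and usually over typed edges.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical undirected toy graph (each edge is listed in both directions).
adjacency = {
    "Alice": ["AC1", "Phone1"],
    "AC1": ["Alice", "Bob"],
    "Phone1": ["Alice", "Carol"],
    "Bob": ["AC1", "Carol"],
    "Carol": ["Phone1", "Bob", "Dave"],
    "Dave": ["Carol"],
}
print(shortest_path(adjacency, "Alice", "Dave"))
# ['Alice', 'Phone1', 'Carol', 'Dave']
```

Nothing is pre-computed here: the path is discovered at query time, which is the point the "Challenge" column makes about pre-loading all paths.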
Graph analytics can enable you to:
Connect widely disparate data, load it all in one place, and discover
connections in the data, without knowing the questions in advance
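For the "central players" question, betweenness centrality scores a node by how many shortest paths between other nodes pass through it. The sketch below uses Brandes' algorithm for an unweighted, undirected graph. The toy graph and its node names are assumptions for illustration only.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized scores)."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths (sigma) from the source.
        order = []
        predecessors = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        sigma[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical graph: two pairs of parties that only connect through "Broker".
adjacency = {
    "A": ["B", "Broker"],
    "B": ["A", "Broker"],
    "Broker": ["A", "B", "C", "D"],
    "C": ["Broker", "D"],
    "D": ["Broker", "C"],
}
scores = betweenness_centrality(adjacency)
print(max(scores, key=scores.get))  # Broker
```

In this toy graph every path between {A, B} and {C, D} runs through "Broker", so it is the only node with a non-zero score.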
9
Graphs – The Ideal Structure for “Discovery”
Dynamic Data Sources
• Simple data model
• Support for multiple data types
• Schema information and data mix harmoniously
• Augment data and definitions
Increasing Volumes
• Low redundancy
• Compact data format
• Easy to add data “on-the-fly”
Greater Flexibility
• No fixed schema - no constraints on queries
• Relationships not hidden – true discovery
• Support for unique analytic techniques such as clustering, community detection, path analysis, etc.
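To make "community detection" concrete, here is a minimal sketch of label propagation, one of several possible techniques. It is a simplified, deterministic variant (fixed visiting order, ties broken by the largest label), which is an assumption made so the example is repeatable; common formulations randomize both. The graph is hypothetical.

```python
from collections import Counter

def label_propagation(adjacency, max_rounds=20):
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = {node: node for node in adjacency}  # every node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue
            best = max(counts.values())
            # Assumption: break ties by taking the largest label, for repeatability.
            new_label = max(label for label, count in counts.items() if count == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return sorted(sorted(members) for members in communities.values())

# Hypothetical graph: two triangles joined by the single edge C-D.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E", "F"],
    "E": ["D", "F"],
    "F": ["D", "E"],
}
print(label_propagation(adjacency))
# [['A', 'B', 'C'], ['D', 'E', 'F']]
```

No attribute of the nodes is used: the groups come from the connections alone, and attributes can be examined afterwards.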
[Diagram labels: Business Questions; Data Structure]
10
Graphs Are Ideal for Interactive, Iterative “Discovery”
• Graphs allow you to define and redefine your analysis as you go along
• Graphs allow you to explore connections and relationships
• Graphs can flexibly handle new and different data types and volumes
While great for discovery, graphs pose challenges for traditional approaches:
• Graphs are hard to partition: unpredictable and extremely slow to follow relationships
• Graphs are not predictable: high cost to follow multiple competing paths
• Graphs are highly dynamic: high cost to load multiple, constantly changing datasets
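One way to see why partitioning hurts is to count how many relationships end up crossing machine boundaries. The sketch below is a toy calculation under stated assumptions: the graph, the four "machines" and the modulo placement rule are all hypothetical.

```python
def cut_fraction(edges, assignment):
    """Fraction of edges whose two endpoints are placed on different partitions."""
    crossing = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return crossing / len(edges)

# Hypothetical graph: a ring of 8 nodes plus four long-range relationships.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5), (2, 6), (3, 7)]

# Naive placement: spread nodes across 4 machines by node id modulo 4.
assignment = {node: node % 4 for node in range(8)}

print(round(cut_fraction(edges, assignment), 2))  # 0.67
```

With this placement, 8 of the 12 edges cross partitions, so most steps of a traversal would need a network hop. A better placement exists for this toy graph, but it depends on knowing the structure in advance.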
11
Graph Analytics with Urika
The Urika Response:
• Use ALL Your Data – No Subsets or Partitioning: Large Shared Memory Architecture, up to 512 Terabytes of RAM
• Get Answers in Seconds - Not Days or Weeks: Thousands of Massively Multi-Threaded Processors, 128 Threads/Processor
• Load New Data in Minutes - Not Weeks: Scalable I/O, load data at up to 350TB per hour
• Readily Deployable – No Proprietary Skills: Easy to Use, Open Standard Interface, Linux and W3C
Supercomputing heritage applied to the largest Big Data challenges
Key Graph Requirements:
Predictable, interactive performance on largest volumes of diverse data
Flexibility to add new data sources rapidly, in hours as opposed to months
Ability to analyze the “whole graph” without having to break it up across clusters
Leverage and extend IT skill sets and use standards-based approaches
Maximize portability of analytics
12
Urika and Existing Analytic Environments
Hadoop Clusters
13
Use Cases
14
Customer Insight – Relationship Discovery
Goal: Identify new cross-sell and upsell opportunities through discovery of communities, networks or clusters
Data sets: Customer data, customer transaction data, portfolio, website traffic, positions, balances, demographics
Technical Challenges: Speed up ability to put facts together and identify hidden clusters, communities or affinities
Users: Product managers, business analysts
Usage model: Iteratively identify communities or clusters where there is an affinity which can be exploited by Marketing
Augmenting: Existing data warehouses, analytical tools
15
Identify Unknown or Emerging Cyber Threats
Goal: Proactively identify unknown cyber threats by examining all relationships
Data sets: IP, MAC, BGP, Firewall, DNS, Netflow, Whois, NVD, CIDR…
Technical Challenges: Volume and Velocity of data; Temporal dependencies; Real-time response
Users: Cyber Analysts
Usage model: Iterative analysis of all patterns across all traffic to explore deviations in frequency of occurrence, derivative patterns of known threats and linking patterns through relationships in offline data
Augmenting: Existing data appliances
16
Concluding Thoughts – Key Advantages of Graph Analytics
Business Advantages of Graph Analytics
• Interactively answer your most complex questions & discover new threats, breaches or revenue opportunities
• Assess the impact of new data on your analysis interactively, not after a 3-6 month data onboarding process
• Ask questions you’ve not thought of yet, against all your relevant data, and rapidly gain new insights
• Get up and running in weeks, while leveraging 100% of existing internal IT and business skills
• Leverage your investment and bring scale to any business problem
• Add a powerful NEW capability to your ecosystem, and improve effectiveness of existing infrastructure
17
Who is YarcData?
A new division within Cray
100% focused on Big Data solutions
Rapidly-growing, multi-billion market
Experienced management team with deep enterprise roots
YarcData product proven at largest Gov’t/Intel clients
18
Cray’s Vision: The Fusion of Supercomputing and Big & Fast Data
Modeling The World
Cray Supercomputers solving “grand challenges” in science, engineering and analytics
Advanced Analytic Appliances; Storage & Data Management; Supercomputers
Data Models: Integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery
Math Models: Modeling and simulation augmented with data to provide the highest fidelity virtual reality results
Data-Intensive Processing: High throughput event processing & data capture from sensors, data feeds and instruments
One Way to Segment the “Big & Fast Data” Market…
Big Data Solutions:
• Data Warehouses + Extensions (Oracle, Teradata, Greenplum, DB2)
• NoSQL Databases (MongoDB, CouchBase, DynamoDB, AsterData)
• Hadoop / MapReduce (Cloudera, HortonWorks, MapR, Intel)
• Graph Analytics (Neo4j, AllegroGraph, Objectivity, Virtuoso)
These solutions can compete, but also can be very complementary as each has strengths & weaknesses
[Diagram labels: Big Data; Fast Data]
20
Cray Brings Supercomputing to Analytics
[Figure: positioning diagram. Labels: SAN Interconnects; Enterprise Data (structured); GRID; LAN/WAN interconnects; Distributed Memory; Big Data; CLOUD; Global Memory; Fast Data; uRiKA In-memory Graph Analytics; XC30 MPP Global Memory; CS300 Cluster Supercomputer & Hadoop; Ethernet Clusters.]
It is Really About Decision Making through
Fact Finding and Equation Solving
Key Function | Language | Data Approach | “Airline” Example
OLTP | Declarative (SQL) | Structured (relational) | ATM transactions; buying a seat on an airplane
OLAP Ad Hoc | Declarative (SQL+UDF) or NoSQL | Structured (relational) | Business Intelligence analysis of bookings for new ad placements or discounting policy
Semantic Ad hoc | Declarative (SPARQL) | Linked, Open (graph-based) | Analyze social graphs and infer who might travel where
API for analysis | Procedural (MapReduce) | Unstructured (Hadoop files) | Application Framework for large scale weblog analysis
Data Assimilation | Procedural (C++, Fortran) | Data merged with simulations | Sensor data incorporated into the computer simulation
Optimize Models | Procedural (Solver Libs) | Optimization <-> Simulation | Complex Scheduling; estimating empty seats
Simulate Models | Procedural (Fortran, C++) | Matrix Math (Systems of Eq’s) | Mathematical Modeling and simulation (design airplane)
[Side labels: Analyst Query; Languages & Tools for Programmers]
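The "Semantic Ad hoc" row describes declarative queries over graph data. As a rough sketch of the idea, the code below matches a conjunction of triple patterns, in the spirit of a SPARQL basic graph pattern, against the customer triples from the earlier example. It is not SPARQL and not any product's API: `match`, `query` and the `?variable` convention are assumptions made for illustration.

```python
def match(pattern, triple, binding):
    """Try to extend a variable binding so the pattern equals the triple; None on failure."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding

def query(triples, patterns):
    """Evaluate a conjunction of triple patterns, returning all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings

triples = [
    ("1234", "First Name", "John"),
    ("1234", "Account ID", "AC567"),
    ("AC567", "Instrument", "IBM"),
    ("AC567", "Instrument", "USIBM_OPT"),
]

# "Which customers hold IBM, and what are their first names?"
patterns = [
    ("?cust", "Account ID", "?acct"),
    ("?acct", "Instrument", "IBM"),
    ("?cust", "First Name", "?name"),
]
results = query(triples, patterns)
print(results)
# [{'?cust': '1234', '?acct': 'AC567', '?name': 'John'}]
```

The query states what to find, not how to join tables, which is what "declarative" means in this row of the table.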
22
Thank You