
Multiplatform Solution for Graph Datasources

Date post: 16-Apr-2017
Upload: stratio
Page 1: Multiplatform Solution for Graph Datasources

@StratioBD

Multiplatform Spark solution for Graph datasources, Stratio

Javier Domínguez

Page 2: Multiplatform Solution for Graph Datasources

Javier Domínguez Montes

He studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies, and is currently part of the Data Science team at Stratio Big Data, working with ML algorithms and profiling analysis built around Spark.

Page 3: Multiplatform Solution for Graph Datasources
Page 4: Multiplatform Solution for Graph Datasources

• Graph use cases

• DataStores

• Machine learning

• Dataset

• Main process explanation

• Notebooks show off

• Business example

• Results

• What's next?

Page 5: Multiplatform Solution for Graph Datasources

Page 6: Multiplatform Solution for Graph Datasources

[Figure: growth of data volumes over time. Timeline labels: 80's, 2000, 2010, 2015, 2020. Volume labels: 20 GB - 100 GB, 500 GB - 2 TB, 4 TB - 8 TB, 100 TB, > 10 PB.]

Page 7: Multiplatform Solution for Graph Datasources

VALUE IS THE DATA

VALUE IS UNDERSTANDING THE DATA

Page 8: Multiplatform Solution for Graph Datasources

DO NOT STAY ON THE SURFACE OF KNOWLEDGE

Page 9: Multiplatform Solution for Graph Datasources

• Graph use cases

• DataStores

• Machine learning


Page 10: Multiplatform Solution for Graph Datasources

Example of how to exploit a massive database from different stages and through several graph technologies

MACHINE LEARNING LIFE CYCLE WITH BIG DATA

Page 11: Multiplatform Solution for Graph Datasources

Machine Learning life cycle

Show how a data scientist can take advantage of a graph database through different datasources and technologies thanks to our solution.

Use a massive dataset as an example.

Query the datasource from different technologies, such as:

• GraphX

• GraphFrames

• Neo4j

And finally apply Machine Learning over our information!

Page 12: Multiplatform Solution for Graph Datasources

USE CASES

Page 13: Multiplatform Solution for Graph Datasources

Batch Queries

Making use of a massive graph datasource implies running batch queries over it. We will need to run them with our distributed technologies... The easier the better.

Motifs filter example

import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Build the graph from two DataFrames: the vertices need an "id" column,
// the edges need "src" and "dst" columns.
val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)

// Search for two-hop chains: person_1 -> person_2 -> technology
val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'").show()
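
To make the motif pattern above more concrete, here is a minimal sketch in plain Python, without Spark or GraphFrames, of what a two-hop motif search followed by a filter computes. The `edges` sample data and the `find_two_hop` helper are hypothetical names invented for this illustration; they are not part of the original deck or of the GraphFrames API.

```python
from collections import defaultdict

# Hypothetical sample edges: (source, relation, destination).
edges = [
    ("Javier", "works_with", "Ana"),
    ("Ana", "knows", "Neo4j"),
    ("Ana", "knows", "Spark"),
    ("Luis", "works_with", "Ana"),
]


def find_two_hop(edge_list):
    """Return every chain a -> b -> c, mirroring the motif
    (person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)."""
    outgoing = defaultdict(list)
    for src, rel, dst in edge_list:
        outgoing[src].append((rel, dst))
    matches = []
    for src, rel, mid in edge_list:
        # Extend the first edge with every edge leaving its destination.
        for rel2, dst in outgoing[mid]:
            matches.append({
                "person_1": src, "relation": rel,
                "person_2": mid, "abilities": rel2,
                "technology": dst,
            })
    return matches


motifs = find_two_hop(edges)

# Plain-Python counterpart of filtering on person_1 and technology.
filtered = [m for m in motifs
            if m["person_1"] == "Javier" and m["technology"] == "Neo4j"]
```

On a real dataset the same join would be distributed by Spark; the sketch only shows the shape of the result, one record per matched chain with a field for each named element of the pattern.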

Page 14: Multiplatform Solution for Graph Datasources

Online queries

Most of our clients or teammates will need fast and easy access to the information. We need a way to make easy queries and, of course, a graphical representation of our data!

We also need microservices, such as REST operations, over our datastore.

Page 15: Multiplatform Solution for Graph Datasources

DATASTORES

Page 16: Multiplatform Solution for Graph Datasources

Spark

Apache Spark is a fast and general engine for large-scale data processing.

GraphX

Spark API for the management and distributed computation of graphs. It comes with a great variety of graph algorithms:

• Connected components

• PageRank

• Triangle count

• SVD++

GraphFrames

It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.

Page 17: Multiplatform Solution for Graph Datasources

Neo4j

Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities.

Big data alone used to be enough, but enterprise leaders need more than just volumes of information to make bottom-line decisions. You need real-time insights into how data is related.

Page 18: Multiplatform Solution for Graph Datasources

MACHINE LEARNING

Page 19: Multiplatform Solution for Graph Datasources

Machine learning

It's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results – even on a very large scale. The result? High-value predictions that can guide better decisions and smart actions in real time without human intervention.

SVD

It relates all the existing objects in our dataset and infers possible new behaviors.

Page 20: Multiplatform Solution for Graph Datasources

• Dataset

• Main process explanation

• Notebooks show off


Page 21: Multiplatform Solution for Graph Datasources

STRATIO INTELLIGENCE

Integration of different Open Source libraries of distributed machine learning algorithms.

Development environment adapted to each data scientist.

Real-time decisions driven by models built with machine learning algorithms

Integrated with all components of the Stratio Big Data Platform

Comprehensive knowledge lifecycle management

Page 22: Multiplatform Solution for Graph Datasources

DATASET

Page 23: Multiplatform Solution for Graph Datasources

Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively.

Its data model expresses statements about resources as subject-predicate-object expressions, which are called triplets.

Subject: the resource, what we are describing.

Predicate: a property of the subject or a relationship with the object value.

Object value: the property's value or the related subject.

<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .

<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .

Freebase Google

Total triplets: 1.9 Billion
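
As an illustration of the subject-predicate-object model, the following sketch parses the two example lines from this slide into structured triplets. It assumes the simplified notation shown here (`<'subject'> <'predicate'> object .`, with the object either a quoted string or an integer), not the full RDF syntax of the real Freebase dump. The `Triplet` type and the `parse_triplet` helper are hypothetical names introduced for the example.

```python
import re
from typing import NamedTuple, Union


class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: Union[str, int]


# Matches the simplified slide notation only:
#   <'subject'> <'predicate'> object .
# where object is a quoted string or an integer.
TRIPLET_RE = re.compile(
    r"^<'(?P<s>[^']*)'>\s+<'(?P<p>[^']*)'>\s+"
    r"(?:'(?P<text>[^']*)'|(?P<number>-?\d+))\s+\.$"
)


def parse_triplet(line: str) -> Triplet:
    match = TRIPLET_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a triplet: {line!r}")
    if match.group("number") is not None:
        obj = int(match.group("number"))
    else:
        obj = match.group("text")
    return Triplet(match.group("s"), match.group("p"), obj)


lines = [
    "<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .",
    "<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .",
]
triplets = [parse_triplet(line) for line in lines]
```

Each parsed triplet maps naturally onto a graph: the subject and the object value become vertices (or a vertex and a property), and the predicate becomes the edge label.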

Page 24: Multiplatform Solution for Graph Datasources

PROCESS EXPLANATION

Page 25: Multiplatform Solution for Graph Datasources

[Diagram labels: Dataset (RDF); Transforms; Cast; GraphX; GraphFrames; Batch query; Extracts sample & transforms; Neo4j; Online query]

Page 26: Multiplatform Solution for Graph Datasources

[Diagram labels: Graph; K-core decomposition; Strongly connected graph; Apply algorithms; SVD; Behavior inference; Subject equality]

Page 27: Multiplatform Solution for Graph Datasources

K-Core process

A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k.

Objective

Remove all nodes with fewer than K connections. At the end, we want only the most representative and connected elements in our graph. In our use case we used K = 5.
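
The "repeatedly deleting all vertices of degree less than k" formulation translates directly into code. The sketch below is a minimal single-machine version in plain Python, not the distributed implementation used in the talk. The `k_core` function and the `sample_edges` data are hypothetical. It returns the vertices of the pruned subgraph, whose connected components are the k-cores.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the vertices that survive repeatedly deleting every
    vertex of degree less than k (undirected graph)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Start with every vertex whose degree is already below k.
    to_remove = [v for v, ns in neighbours.items() if len(ns) < k]
    removed = set()
    while to_remove:
        v = to_remove.pop()
        if v in removed:
            continue
        removed.add(v)
        for n in neighbours[v]:
            # Deleting v lowers the degree of each neighbour, which
            # may push that neighbour below k as well.
            neighbours[n].discard(v)
            if n not in removed and len(neighbours[n]) < k:
                to_remove.append(n)
        neighbours[v].clear()
    return set(neighbours) - removed


# Hypothetical sample: a 4-clique (a, b, c, d) with a tail d - e - f.
sample_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"),
]
core = k_core(sample_edges, 3)
```

With k = 3 the tail vertices are pruned and only the clique remains. The use case in the talk applies the same idea with K = 5.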

Page 28: Multiplatform Solution for Graph Datasources

NOTEBOOKS SHOW OFF

Page 29: Multiplatform Solution for Graph Datasources

BUSINESS EXAMPLE

Page 30: Multiplatform Solution for Graph Datasources

Jaccard Graph Clustering

Node clustering based on concrete relations, optimized for Big Data environments.

We've developed a straightforward functionality that detects patterns and clusters data in a graph database through daily machine learning processes.

[Diagram labels: Neo4j; Scala Graph functionalities; Jaccard indexation; Connected components; Java; HDFS / Parquet; Spark / GraphX]

40B Jaccard distance calculations in the everyday process

400K-node graph clustering
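
The slide names two building blocks, Jaccard distance and connected components, without detailing how they are combined. One plausible reading, sketched below in plain Python, is to link every pair of nodes whose relation sets are close in Jaccard distance and then take the connected components of the resulting similarity graph as clusters. The `max_distance` threshold, the `cluster` function and the sample `relations` are assumptions made for this example, not details from the talk.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster(relations, max_distance):
    """Link nodes whose relation sets are within max_distance, then
    return the connected components of that similarity graph."""
    parent = {node: node for node in relations}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in combinations(relations, 2):
        if jaccard_distance(relations[a], relations[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for node in relations:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical data: each node mapped to the set of items it relates to.
relations = {
    "u1": {"spark", "scala", "neo4j"},
    "u2": {"spark", "scala"},
    "u3": {"python", "pandas"},
    "u4": {"python", "pandas", "numpy"},
}
clusters = cluster(relations, max_distance=0.5)
```

The all-pairs loop is quadratic in the number of nodes, so it only suits small inputs. At the scale mentioned on the slide, the pairwise distances would have to be computed in a distributed engine such as Spark.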

Page 31: Multiplatform Solution for Graph Datasources

• Results

• What's next?


Page 32: Multiplatform Solution for Graph Datasources

Semantic search engine

Include ElasticSearch to run text searches as a search engine.

Apply more Machine Learning algorithms

• Connected components: As we've already done, try to cluster information thanks to their relationships.

• PageRank: Measure the importance of a subject.

• Triangle counting: Check possible triangle relationships inside our dataset to avoid redundancy.
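
Of the algorithms listed above, triangle counting is the simplest to show in a few lines. The sketch below is a single-machine illustration in plain Python, not the GraphX implementation. The `count_triangles` function and the `sample_edges` data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Count distinct triangles in an undirected graph given as pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    triangles = 0
    for v, ns in neighbours.items():
        # Count each triangle only from its smallest vertex, so it is
        # not counted once per corner.
        larger = [n for n in ns if n > v]
        for a, b in combinations(larger, 2):
            if b in neighbours[a]:
                triangles += 1
    return triangles


# Hypothetical sample: triangles 1-2-3 and 2-3-4 share the edge 2-3.
sample_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
total = count_triangles(sample_edges)
```

A subject that takes part in many triangles sits in a densely connected neighbourhood, which is the kind of structure the redundancy check mentioned above would look for.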

New Graph use cases

• Fraud detection

• Recommendation System

• Profiling

Page 33: Multiplatform Solution for Graph Datasources

THANK YOU

UNITED STATES

Tel: (+1) 408 5998830

EUROPE

Tel: (+34) 91 828 64 73

[email protected]

www.stratio.com

@StratioBD

Page 34: Multiplatform Solution for Graph Datasources

[email protected]

WE ARE HIRING

@StratioBD

