
Ted Willke, Intel Labs MLconf 2013

Date post: 10-May-2015
Upload: sessionsevents
Description:
Ted Willke, Principal Engineer/GM, Intel Labs: "Avoiding Cluster-Scale Headaches with Better Tools for Data Quality and Feature Engineering"
Transcript
Page 1: Ted Willke, Intel Labs MLconf 2013

Intel Labs Graph Analytics Operation

Page 2: Ted Willke, Intel Labs MLconf 2013

Machine Learning may nourish the soul… but Data Preparation will consume it.

Source: Wikipedia (Hell)

Source: Wikipedia (Banquet)

Page 3: Ted Willke, Intel Labs MLconf 2013

Machine Learning on Large Datasets

Data Quality and Feature Engineering

• Figure out what’s there

• Extract a bunch of features

• Figure out what’s needed

• Finalize and feed

[Figure: pipeline diagram. Labels: Input Data, Extract Transform Load, Feature Data, Training Set, Validation Set, Test Set, Build Model, Validate, New Data, Value. Annotations: Supervised Learning; Supervised and Unsupervised Learning; “Argghh!”]
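The split of feature data into a Training Set, Validation Set, and Test Set shown in the diagram can be sketched in a few lines of Python. This is only an illustration, not the speaker's tooling: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are all assumptions.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions and seed are illustrative defaults, not values from the talk.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


train, valid, test = split_dataset(range(10))
```

Shuffling before slicing keeps the three sets disjoint while avoiding any ordering bias in the input.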

Page 4: Ted Willke, Intel Labs MLconf 2013

Problems with Processing Large Datasets

Not turn-key

Are data scientists really expected to know…

how to set up Hadoop from scratch?

Java, Pig, Hadoop APIs?

how to extend with UDFs?

how to extract, analyze and visualize output beyond Hadoop?

“After hours of debugging our Hadoop setup, I was ecstatic to run a Hadoop command without a Java stack trace.”

- Zach

Page 5: Ted Willke, Intel Labs MLconf 2013

Problems with Processing Large Datasets

Not agile

| | Traditional Environment | Distributed Environment |
| Command response | < 1 sec | > 30 sec |
| Dependency inclusion | Simple | Several steps and changes |
| Validation | Established methods | Not clear |
| Development Cycle | Fast Iteration | Slow or linear |

Page 6: Ted Willke, Intel Labs MLconf 2013

Apache Pig

• A dataflow processing system for MapReduce

• A high-level scripting language -- Pig Latin

Page 7: Ted Willke, Intel Labs MLconf 2013

Why Pig for ETL?

• Easy to get up & running

• Easy to program – simple declarative scripting language, built-in dataflow primitives

• Nested data model support

• First class extensibility – custom filters, transforms, input/output formats, etc.

• Automatic dataflow optimization – Pig/MR runtime: ~0.97x for 0.12

• As configurable as MR

Page 8: Ted Willke, Intel Labs MLconf 2013

The story gets even better

• Elephant Bird – good support for different formats, codecs, etc.

• DataFu – Pig UDFs for data mining & statistics

• PiggyBank – collection of additional UDFs

Page 9: Ted Willke, Intel Labs MLconf 2013

So, we’re done, right?

No. Many open challenges, including complex models.

Page 10: Ted Willke, Intel Labs MLconf 2013

Property Graph Data Models

Source: TinkerPop (Property Graph)
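In a property graph, vertices and edges both carry key/value properties, and each edge has a label. The Python sketch below is a minimal in-memory illustration of that model. The class names, field names, and sample data are hypothetical and are not taken from TinkerPop or Graph Builder.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: int
    dst: int
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy property graph: labeled, directed edges with properties."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, **properties):
        self.vertices[vid] = Vertex(vid, properties)

    def add_edge(self, src, dst, label, **properties):
        self.edges.append(Edge(src, dst, label, properties))

    def out_neighbors(self, vid, label=None):
        """Ids reachable from vid, optionally restricted to one edge label."""
        return [e.dst for e in self.edges
                if e.src == vid and (label is None or e.label == label)]


# Invented sample data, for illustration only.
g = PropertyGraph()
g.add_vertex(1, name="Ted", kind="person")
g.add_vertex(2, name="Bicycles", kind="product")
g.add_edge(1, 2, "likes", rating=5)
```

Keeping the label separate from the free-form properties is what lets one graph mix relationship types such as "friends" and "likes".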

Page 11: Ted Willke, Intel Labs MLconf 2013

Graph Applications

Machine Learning

• Neural Networks

• Deep Learning (RBM)

• Belief Propagation

• Label Propagation/ARW

• Collaborative Filtering (ALS, SGD, SVD)

• Topic Modeling (LDA)

• K-Means

Mining

• PageRank

• Random Walk with Restart

• Connected Components

• Triangle Counting

• K-Truss

• Centrality Measures

• Network Diameter

• Degree Distribution

Traversal (Search)

• Depth-First Search

• Breadth-First Search
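As one concrete instance from the list above, PageRank can be computed by power iteration over an edge list. The sketch below is a single-machine illustration, not the distributed implementation the talk has in mind. The damping factor of 0.85, the iteration count, and the sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

Each pass preserves the total rank of 1.0, so the result can be read as a probability distribution over nodes.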

Page 12: Ted Willke, Intel Labs MLconf 2013

Graphical Machine Learning

[Figure: pipeline diagram. Stages: Input Data, Construct Graph, Build Model, Serve Model, Insight & Prediction. Data sources: HDFS, DB, Web Docs. Component labels: Intel Graph Builder, Graph Query, Processing & Storage.]

• Need fully-integrated solutions that are easy to program

• Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining

• Enables applications in broadband services, network security, retail, life sciences, financial markets, etc.

Page 13: Ted Willke, Intel Labs MLconf 2013

Graph Processing: Technology Challenges

Performance – Has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures

Integration – Multiple frameworks are difficult to synchronize, coordinate, and manage

Data Models – Most large-scale work is still on homogeneous graphs, but property graphs and meta-path concepts are more widely discussed

Algorithms – A wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophistication and scaled versions arriving “every day”

Data Visualization – No great packages to visualize relationships du jour, and interactive big data sampling and projection are too crude & slow

Programming – Challenging programming models in languages not popular with data scientists, IT developers, and other end-users

Data Preparation – Takes way too long, is way too manual, and is fraught with error

[Figure: rating scale labeled Progress!, Traction, Not so much]

Intel Labs continues to work on the gaps.

Page 14: Ted Willke, Intel Labs MLconf 2013

Pig ETL for Graphs?

Original Vision

Nothing specific for graph ETL. What’s needed:

• support for well-known input-output graph formats

• graph-specific filters & transforms

• STORE functions for graph stores

Page 15: Ted Willke, Intel Labs MLconf 2013

Graph Builder 2 Alpha

• Construction of heterogeneous information networks with Pig

• Better “progressive refinement” during acquisition, cleaning, and integration

• Incremental graph construction

• Interfacing for popular graph databases (Titan, RDF output, etc.)

[Figure: example heterogeneous graph* combining a Social Graph, a Ratings Graph, and a Product Graph. Vertices: Ted, Kushal, Mohit, Danny, Ivy, Frank, Nezih, Food Cart, Bicycles. Edge labels: friends, brothers, likes, uses. Inferred result: Ted may like bicycle-powered food cart.]

* Inspired by “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler

Page 16: Ted Willke, Intel Labs MLconf 2013

Example Stack Architecture

[Figure: stack diagram. Labels: Raw Data, RDBMS, HDFS, NFS, Graph ETL, Pig, Graph Builder, HBase, Titan, Hadoop, HDFS, ZooKeeper, Graph Analytics, ML, Giraph, Mahout, Real-time Graph Queries, Blueprints, Rexster, Gremlin, Feature Store, Model Store.]

Page 17: Ted Willke, Intel Labs MLconf 2013

Graph ETL Example

Extract, Transform, Load

Archive Records

http://www.1stvwparts.com/default.php?cPath=159 74.86.123.84 20091120145711 text/html 28628 HTTP/1.1 200 OK <table border="0" width="100%" cellspacing="0" cellpadding="0"> <tr> <td width="100%" class="infoBoxHeading_search">Quick Find</td> </tr></table><table border="0" width="100%" cellspacing="0" cellpadding="0" class="infoBox_search"> <tr> <td><table border="0" width="100%" cellspacing="0" cellpadding="3" . . .

Parse HTML, look for links and words

{ "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "links": ["http://www.1stvwparts.com/shopping_cart.php", "http://www.partsfirm.com", ...], "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "html", "php"] }

Graph Builder to Titan

row src dst #links
67033:-20071306431384422339653 http://www.kog.com http://www.dlstainedglass.com 2
91658:-20071306431384422339653 http://www.kog.com http://www.haegerstainedglass.com 2
941:-19442631361384422339653 http://www.ks-p.jp http://www.drag-race.nuhuh.bee.pl 1
44116:-18273037921384422339653 http://www.kune.fr http://www.chezfanny.fr 3
36891:-18273037921384422339653 http://www.kune.fr http://www.wp-jobboard.kune.fr 3
79906:-17817899301384422339654 http://www.kwc.edu http://www.umsl-sports.com 1
2238:-17817799001384422339654 http://www.kwc.org http://www.onlamp.com 1
68133:-17817799001384422339654 http://www.kwc.org http://www.tjhsst.edu 1
30677:-17817799001384422339654 http://www.kwc.org http://www.floydlandis.com 1
81185:-17817799001384422339654 http://www.kwc.org http://www.you-are-here.com 1
47527:-17817799001384422339654 http://www.kwc.org http://www.phonak-cycling.ch 1
63112:-17817799001384422339654 http://www.kwc.org http://www.link.brightcove.com 1
74837:-17817799001384422339654 http://www.kwc.org http://www.trustbut.blogspot.com 6
53668:-17817799001384422339654 http://www.kwc.org http://www.icanhascheezburger.com 4
97945:-17817799001384422339654 http://www.kwc.org http://www.mythbustersfanclub.com 12
93849:-17709983361384422339654 http://www.kwmd.us http://www.sierraclub.typepad.com 1
51421:-17700453681384422339654 http://www.kwne.jp http://www.ppvj.co.jp 1
13022:-17651665521384422339654 http://www.kwu.edu http://www.rollinghillszoo.com 2
16530:-17113867601384422339654 http://www.kyou.nu http://www.fan.unfading-scar.net 2
14199:-16755866041384422339654 http://www.kzy.com http://www.wbbm780.com 1
95253:-16755866041384422339654 http://www.kzy.com http://www.brewview.com 1
25828:-14077538951384422339655 http://www.lee.org http://www.kaiju.com 1
88133:-14077538951384422339655 http://www.lee.org http://www.sfgov.org 2
94243:-14077538951384422339655 http://www.lee.org http://www.liftport.com 1
56826:-14077538951384422339655 http://www.lee.org http://www.nishioka.com 1
88574:-14077538951384422339655 http://www.lee.org http://www.Smartflix.com 1
81966:-14077538951384422339655 http://www.lee.org http://www.smartflix.com 145
83164:-14077538951384422339655 http://www.lee.org http://www.torrentspy.com 1
99087:-14077538951384422339655 http://www.lee.org http://www.SerpentMother.com 1
39124:-14077538951384422339655 http://www.lee.org http://www.serpentmother.com 3
95995:-14077538951384422339655 http://www.lee.org http://www.toolbar.google.com 2

PageRank and Latent Dirichlet Allocation
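The "Parse HTML, look for links and words" step can be sketched with Python's standard `html.parser` module. This is a guess at the shape of that step, not the extraction code used in the talk: the class name `LinkAndWordParser`, the `parse_page` helper, the lower-case word pattern, and the sample page are all hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkAndWordParser(HTMLParser):
    """Collects href targets of <a> tags and lower-cased words from text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.update(re.findall(r"[a-z]+", data.lower()))


def parse_page(url, html):
    """Return a record shaped like the JSON on the slide (without a timestamp)."""
    parser = LinkAndWordParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "links": parser.links, "words": sorted(parser.words)}


record = parse_page(
    "http://www.example.com/",
    '<p>Quick Find <a href="http://www.example.org/cart">shopping cart</a></p>',
)
```

Because the parser hands over each text node separately, words from adjacent elements stay apart, which matters for the data-quality questions raised later in the deck.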

Page 18: Ted Willke, Intel Labs MLconf 2013

Development Flow

(or, what actually happened)

Extract with Python

Develop transforms

Test on a couple files

Fix bugs

Run Python in Jython (fail miserably)

Spend too much time enabling

Write UDF in Java

Find limitations

Develop custom load UDF instead

Development Pains with Pig As-Is

Data Process Flow

Load with Pig

Turn into edge list (Pig, UDF)

Store to HDFS (Pig)

Load into Titan (GraphBuilder)

Run ML algorithms (Giraph)

Model queries (Gremlin)

All of this before any Machine Learning!

Page 19: Ted Willke, Intel Labs MLconf 2013

Custom UDFs add a lot of complexity, time and effort.

If you don’t have this…. You’re stuck with this…

Out-of-the-Box Tools

package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Splits a chararray into a bag of single-token tuples.
public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (o == null) {
                // A null field yields a null bag instead of a NullPointerException.
                return null;
            }
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got " + o.getClass().getName());
            }
            // Delimiters: space, double quote, comma, parentheses, asterisk.
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*", false);
            while (tok.hasMoreTokens()) {
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            }
            return output;
        } catch (ExecException ee) {
            // Rethrow so the method always returns or throws (ExecException is an IOException).
            throw ee;
        }
    }

    // Declares the output as a bag of tuples, each holding one chararray token.
    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token", DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);
            Schema.FieldSchema tupleFs;
            tupleFs = new Schema.FieldSchema("tuple_of_tokens", tupleSchema, DataType.TUPLE);
            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema("bag_of_tokenTuples", bagSchema, DataType.BAG);
            return new Schema(bagFs);
        } catch (Exception e) {
            return null;
        }
    }
}

X = FOREACH A GENERATE TOKENIZE(f1);

(More of these please)
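For comparison, the tokenizing logic of the Java UDF above fits in a few lines of plain Python. The sketch mirrors the UDF's delimiter set (space, double quote, comma, parentheses, asterisk) and its one-token-per-tuple output. The function name `tokenize` is hypothetical, and the sketch leaves out the registration and schema work that running it inside Pig would require.

```python
import re

# Same delimiter characters the Java UDF passes to StringTokenizer.
DELIMITERS = ' ",()*'
_SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Return a list of 1-tuples, one per token, like a bag of token tuples."""
    if text is None:
        return None
    if not isinstance(text, str):
        raise TypeError("Expected input to be a string, but got " + type(text).__name__)
    return [(token,) for token in _SPLITTER.split(text) if token]


tokens = tokenize('hello, "big" (data) world*')
```

Almost all of the Java version is framework plumbing rather than tokenizing, which is the slide's point about custom UDFs.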

Page 20: Ted Willke, Intel Labs MLconf 2013

Breadth of Knowledge

Load Raw Data

Extract Links

Filter Bad Data

Group Like Links Together

Store - HBase

Store into Titan (Graph Builder)

Pig, Java, MapReduce

Page 21: Ted Willke, Intel Labs MLconf 2013

Even if you have ninja skills, you’ll still need to deal with weirdness.

Page 22: Ted Willke, Intel Labs MLconf 2013

Random Record

{ "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "gti", "cache", "returnsprivacy", "partsfirm", "rear", "scratch", "passat", "gmt", "plate", "handle", "scuff", "was", "hatch", "wagonnew", "got", "nov", "password", "advanced", "and", "aluminum", "controlbackup", "server", "specials", "quick", "x", "largo", "www", "splash", "items", "edition", "beetle", "home", "even", "cargo", "what", "mats", "for", "sportwagen", "please", "anniversary", "content", "lights", "fog", "new", "email", "monster", "ipod", "beetlepassatrabbitroutantiguantouareg", "quite", "cart", "acoustic", "apache", "november", "includes", "by", "care", "search", "ok", "of", "openssl", "shipping", "covereuropean", "s", "products", "stainless", "wheels", "protector", "com", "vinyl", "duty", "pre", "encoding", "cc", "sweet", "noticeconditions", "clear", "unix", "electronic", "post", "cabrio", "powered", "chrome", "transfer", "fri", "accent", "mkv", "cpath", "jetta", "maintenance", "type", "store", "more", "function", "shopping", "gear", "hub", "tiguan", "park", "pragma", "brushed", "must", "steel", "distance", "account", "look", "default", "bbs", "cap", "us", "guards", "reviews", "routan", "fun", "chunked", "my", "usecontact", "control", "they", "bumper", "warning", "brake", "eurovan", "exterior", "close", "rig", "dent", "check", "registerforgot", "information", "revalidate", "sensors", "connection", "member", "html", "touareg", "online", "interior", "european", "wheel", "http", "though", "models", "bestsellers", "expires", "categories", "kit", "catalog", "date", "php", "e", "center", "light", "thu", "no", "cover", "frontpage", "eos", "gorilla", "contact", "the", "selectcabriocceoseurovangligolfgtijettajetta", "left"] }

Page 23: Ted Willke, Intel Labs MLconf 2013

Uselessly common words


Common connector words can be trimmed… with a bunch more ETL.
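Trimming connector words is a simple filter once a stop-word list exists. In the sketch below the stop list and the minimum-length rule are assumptions chosen for illustration. Only the sample words come from the record shown earlier.

```python
# Hypothetical stop list; a real one would be far longer.
STOP_WORDS = frozenset({"and", "by", "for", "of", "the", "they", "was", "what"})


def trim_common_words(words, stop_words=STOP_WORDS, min_length=2):
    """Drop stop words and very short tokens, preserving input order."""
    return [w for w in words if w not in stop_words and len(w) >= min_length]


sample = ["golf", "rabbit", "was", "and", "x", "the", "beetle"]
kept = trim_common_words(sample)
```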

Page 24: Ted Willke, Intel Labs MLconf 2013

Words mangled together?


Is there an edge case that’s causing this?
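One plausible edge case, which the slides do not confirm, is tag stripping that deletes markup without leaving whitespace behind, so text from adjacent elements fuses into a single token. The sketch below reproduces that effect on an invented HTML snippet. The function names and the regular expression are illustrative only.

```python
import re

TAG = re.compile(r"<[^>]+>")


def words_naive(html):
    # Deleting tags outright glues neighbouring text nodes together.
    return re.findall(r"[a-z]+", TAG.sub("", html).lower())


def words_spaced(html):
    # Replacing each tag with a space keeps the text nodes apart.
    return re.findall(r"[a-z]+", TAG.sub(" ", html).lower())


snippet = "<td>Cover</td><td>Model</td>"
mangled = words_naive(snippet)
separated = words_spaced(snippet)
```

With this snippet the naive version yields the fused token "covermodel", which also appears in the record, while the spaced version yields "cover" and "model".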

Page 25: Ted Willke, Intel Labs MLconf 2013

Were these actually visible?


“html” was found in every record; something seems wrong.
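Words such as "pragma", "chunked", and "revalidate" read like HTTP header vocabulary, and "html" would appear in every `text/html` content type. A possible explanation, not stated in the source, is that response headers were tokenized along with the page body. Assuming that is the cause, headers can be separated from the body before any words are extracted. The helper name and the sample response below are hypothetical.

```python
def split_http_response(raw):
    """Return (headers, body) for a raw HTTP response string.

    Headers end at the first blank line; CRLF line endings are tried first.
    """
    for separator in ("\r\n\r\n", "\n\n"):
        head, found, body = raw.partition(separator)
        if found:
            return head, body
    return raw, ""  # no blank line: treat everything as headers


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Pragma: no-cache\r\n"
    "\r\n"
    "<p>Quick Find</p>"
)
headers, body = split_http_response(raw)
```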

Page 26: Ted Willke, Intel Labs MLconf 2013

-- Load raw data
raw_data = LOAD '/zach/common-crawl/1285409360731_9.arc.gz' USING ArcLoader()
    AS (header:chararray, html:chararray);

-- Extract links
edge_list = FOREACH raw_data GENERATE ExtractLinks(*);

-- Filter & normalize
edge_list_filtered = FILTER edge_list BY FilterAny(*);
src_based = FOREACH edge_list_filtered GENERATE NormalizeURL(*, 0);
src_based_cleaned = FILTER src_based BY FilterMalformedURL(*, 1);
dest_based = FOREACH src_based_cleaned GENERATE NormalizeURL(*, 1);
dest_based_self_loops_removed = FILTER dest_based BY FilterLoop(*);
final = FILTER dest_based_self_loops_removed
    BY NOT (src_domain MATCHES '.*mailto.*' OR dest_domain MATCHES '.*mailto.*');

-- Generate link counts
grouped = GROUP final BY (src_domain,dest_domain) PARALLEL 64;
with_link_count = FOREACH grouped GENERATE group.src_domain,
                                           group.dest_domain,
                                           COUNT(final) AS num_links:long;

-- Assign HBase keys
with_hbase_keys = FOREACH with_link_count GENERATE RowKeyAssignerUDF(*);
final_graph = FOREACH with_hbase_keys GENERATE FLATTEN($0)
    AS (key:chararray, src_domain:chararray, dest_domain:chararray, num_links:long);

-- Store into Titan
STORE_GRAPH(final_graph, 'hbase://pagerank_edge_list', 'Titan');
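The filter, normalize, and count stages of the script above can be mimicked on a single machine to check the logic before running on a cluster. The sketch below is an approximation under stated assumptions. The helper names are hypothetical, and since the real behavior of `NormalizeURL`, `FilterMalformedURL`, and `FilterLoop` is not shown in the slides, normalization here simply means reducing a URL to its lower-cased host.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host of an http(s) URL, or None if unusable."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return parts.hostname.lower()


def count_domain_links(edges):
    """Count links per (src_domain, dest_domain) pair."""
    counts = Counter()
    for src_url, dst_url in edges:
        src, dst = domain_of(src_url), domain_of(dst_url)
        if src is None or dst is None:  # malformed URLs and mailto: links
            continue
        if src == dst:                  # self-loops
            continue
        counts[(src, dst)] += 1
    return counts


edges = [
    ("http://www.lee.org/a", "http://www.Smartflix.com/x"),
    ("http://www.lee.org/b", "http://www.smartflix.com/y"),
    ("http://www.lee.org/a", "http://www.lee.org/c"),
    ("http://www.lee.org/a", "mailto:someone@example.com"),
    ("http://www.lee.org/a", "not a url"),
]
counts = count_domain_links(edges)
```

Lower-casing the host merges the two capitalizations into one pair here, whereas the sample output rows earlier in the deck list `Smartflix.com` and `smartflix.com` separately.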

Page 27: Ted Willke, Intel Labs MLconf 2013

Demo.

Page 28: Ted Willke, Intel Labs MLconf 2013

Open Problems with Pig ETL

(for Data Science)

Page 29: Ted Willke, Intel Labs MLconf 2013

Complex JSON/XML processing is painful

{ "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}

[Figure: Pig architecture diagram with callout 1. Labels: User Interface, Interactive Mode, Batch Mode, Embedded Mode (Java, Python, etc.), Pig Scripting Interface, Parser, Planner, Backend & Execution Engines, MR Jobs, Built in Functions and Operators, UDFs, Open source packages, LOAD Functions, STORE Functions, Data Type Support.]

json_data = LOAD 'test.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

unnested = FOREACH json_data GENERATE
    $0#'Top-Level-Field' AS (top_level_field_value: chararray),
    FLATTEN($0#'Inner-Json') AS (inner_json: map[]);

unnested = FOREACH unnested GENERATE
    top_level_field_value,
    FLATTEN(inner_json#'Name') AS (inner_name: chararray),
    FLATTEN(inner_json#'Value') AS (inner_value:long);
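For contrast, the same un-nesting of the slide's JSON document takes only a few lines with Python's standard `json` module. The function name `flatten_inner` and its tuple layout are hypothetical. The sketch shows the intended result of the Pig script rather than a replacement for it at cluster scale.

```python
import json


def flatten_inner(document, top_key="Top-Level-Field", inner_key="Inner-Json"):
    """Emit one (top_level, name, value) row per element of the inner array."""
    rows = []
    for inner in document.get(inner_key, []):
        rows.append((document.get(top_key), inner.get("Name"), inner.get("Value")))
    return rows


doc = json.loads(
    '{ "Top-Level-Field": "top_level", '
    '"Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}'
)
rows = flatten_inner(doc)
```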

Page 30: Ted Willke, Intel Labs MLconf 2013

Better high-level language integration

• Native-like experience with non-JVM languages (Python, R, etc.)

• REST interface can be improved (HCATALOG-182)


Page 31: Ted Willke, Intel Labs MLconf 2013

Better data exploration & error reporting

• Faster iterative processing (Spark, YARN)

• Better SAMPLE (WIP: PIG-1713)

• SUMMARY for descriptive statistics

• More descriptive error messages
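The slide asks for a SUMMARY operator without specifying it, so the sketch below only guesses at what such a summary might report for one numeric column. The function name, the chosen statistics, and the use of the first seven #links values from the sample edge list are assumptions.

```python
import statistics


def summary(values):
    """Basic descriptive statistics for one numeric column."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


link_counts = [2, 2, 1, 3, 3, 1, 1]
stats = summary(link_counts)
```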


Page 32: Ted Willke, Intel Labs MLconf 2013

Better control with HBaseStorage

Inefficient for bulk loading

Better HBase filter support

Batching support

Fetch multiple versions


Page 33: Ted Willke, Intel Labs MLconf 2013

Questions?

• Graph Builder 2 Alpha Dec’13

• Apache 2.0 OS code available at: www.01.org/graphbuilder/

Page 34: Ted Willke, Intel Labs MLconf 2013

Legal Notices

• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

• Intel may make changes to specifications and product descriptions at any time, without notice.

• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

• Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

• Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

• *Other names and brands may be claimed as the property of others.

• Copyright © 2013 Intel Corporation.

Page 35: Ted Willke, Intel Labs MLconf 2013

Abstract

Intel is working hard to build datacenter software from the silicon up that provides for a wide range of advanced analytics on Apache Hadoop. The Graph Analytics Operation within Intel Labs is helping to transform Hadoop into a full-blown “knowledge discovery platform” that can deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. But the analysis cannot start until features are engineered, a task that takes a lot of time and effort today. In this talk, I will describe some of the Hadoop-based tools we are developing to make it easier for data scientists to deal with data quality issues and construct features for scalable machine learning, including graph-based approaches.

