Drummond 9-29-2014 - Patently Innovative

Date post: 05-Aug-2015
Upload: david-e-drummond

PATENTLY INNOVATIVE
Finding where innovation lives!

David E Drummond
Insight Data Engineer

INTRO

Which state has the most innovation? Use the number of patents per state to see which regions are “patently innovative”.

DATA PIPELINE
Ingestion -> Batch Processing -> Real-Time Queries

XML, TSV
Data Cleansing
JSON Hive SerDe
HappyBase

DATA SET
~90 MB/week
~10 million patents
>45 GB

TSV, XML

CONVERT TO JSON AND PARSE

NODE PARALLELIZATION

.PY/.SH

Name Node

Worker Nodes

SINGLE LINE JSON RECORDS
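
The conversion step above can be sketched in Python. This is a minimal illustration, not the author's actual .py/.sh scripts: the field names (`patent_id`, `state`, `date`) and the `patent` record tag are hypothetical, and the real patent XML schema is more deeply nested. It only shows the idea of turning each TSV row or XML record into one JSON object per line.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def tsv_to_json_lines(tsv_text):
    """Yield one compact JSON string per TSV row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # json.dumps emits no raw newlines, so each record stays on one line.
        yield json.dumps(row, sort_keys=True)


def xml_to_json_lines(xml_text, record_tag="patent"):
    """Yield one compact JSON string per <record_tag> element.

    Only the direct children of each record are kept. `record_tag` and the
    child names are hypothetical, not the real patent XML schema.
    """
    root = ET.fromstring(xml_text)
    for record in root.iter(record_tag):
        fields = {child.tag: (child.text or "").strip() for child in record}
        yield json.dumps(fields, sort_keys=True)
```

One record per line means a file can be split at any line boundary, which is what lets the worker nodes process pieces of the data set independently.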

JSON HIVE SER-DE

CLEANED TABULAR DATA
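
In the pipeline this mapping is done by a JSON Hive SerDe inside Hive. As a rough plain-Python analogue (not Hive code), the sketch below projects single-line JSON records onto a fixed column list. The column names are hypothetical, and the choices to drop malformed lines and to fill missing fields with `None` are assumptions about what "cleaned" means here.

```python
import json

COLUMNS = ("patent_id", "state", "date")  # hypothetical column names


def json_lines_to_rows(lines, columns=COLUMNS):
    """Project single-line JSON records onto a fixed column list.

    Missing fields become None; blank or malformed lines are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: unparseable records are dropped
        if not isinstance(record, dict):
            continue  # a bare number or list is not a record
        yield tuple(record.get(name) for name in columns)
```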

HBASE SCHEMA

Denormalized schema for faster queries

Yearly

State  2005  2006  2007  …  2011  2012  2013
CA     8530  7411  7120  …  7849  7799  9185
TX     2167  1961  1806  …  2050  2121  2500

Monthly

State  200501  200502  …  201408  201409
CA     512     538     …  1380    1194
TX     102     217     …  350     263
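
A wide-row layout like the yearly and monthly tables above could be built as in the following sketch. It assumes one HBase row per state and one column per period; the column family name `c`, the `(state, date)` input shape, and the helper names are all hypothetical. The `table` argument is duck-typed so the sketch runs without an HBase cluster; with HappyBase it would be the object returned by `happybase.Connection(host).table(name)`.

```python
from collections import defaultdict


def count_by_period(rows, period="year"):
    """Count patents per state and period.

    `rows` are (state, date) pairs with ISO dates such as "2005-01-04".
    Keys are "2005" for period="year" and "200501" for period="month".
    """
    width = 4 if period == "year" else 7
    counts = defaultdict(lambda: defaultdict(int))
    for state, date in rows:
        key = date[:width].replace("-", "")
        counts[state][key] += 1
    return counts


def to_hbase_row(period_counts, family="c"):
    """Build the column dict for one state's wide row (bytes -> bytes)."""
    return {
        f"{family}:{period}".encode(): str(n).encode()
        for period, n in sorted(period_counts.items())
    }


def write_counts(table, counts, family="c"):
    """Write one wide row per state via `table.put(row, data)`."""
    for state, period_counts in counts.items():
        table.put(state.encode(), to_hbase_row(period_counts, family))
```

With this layout, all counts for a state sit in a single row, so a query for one state is a single row lookup (for example `table.row(b"CA")` in HappyBase) rather than a scan or a join.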

DAVID DRUMMOND
Earned a Ph.D. in Physics from UC Riverside, simulating fault-tolerant parallel quantum computing systems.

Love to travel and learn about everything!

