Using the cloud to process unstructured big data by Jason Cornez.

Posted on 19-Jan-2017

May 21, 2016

Using the Cloud to Process Unstructured Big Data

J on the Beach, Malaga, Spain

RavenPack: Mapping the World’s Big Data for Financial Applications

Jason Cornez ‒ CTO
jcornez@ravenpack.com

ravenpack.com | info@ravenpack.com | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90

• RavenPack delivers big data analytics to financial professionals

• Top hedge funds and investment banks use RavenPack for trading and risk management

• Patented, proprietary technology and award-winning research

• Archive of more than 300 million documents, spanning the past 20 years

RavenPack processes hundreds of thousands of documents each day.

We produce machine readable analytics for each document in real time.

Expected processing time for a typical document is less than 250ms.

RavenPack at a Glance

• Classification Overview

• Realtime Classification: Classic vs Cloud

• Historical Classification: Classic vs Cloud

• New Challenges: Spot Instances and The Weather

• New Opportunities

Contents

Extract meaning from Unstructured Text

• Tokenization

• Entity Detection

• Attribute Tagging

• Event Detection

• Consolidation

A stream-based classification framework allows us to add new classifiers into a stream of documents. As much as possible, classifiers run in parallel on separate threads.
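The idea of plugging classifiers into a document stream and running them on separate threads can be sketched as follows. This is only an illustration, not RavenPack's framework: the names `classify_stream`, `tokenize` and `count_capitalized` are hypothetical, and the two toy classifiers merely stand in for the real stages listed above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator

# A classifier takes the document text and returns its annotations.
Classifier = Callable[[str], dict]


def tokenize(text: str) -> dict:
    """Toy tokenizer: split on whitespace."""
    return {"tokens": text.split()}


def count_capitalized(text: str) -> dict:
    """Toy stand-in for a detector: count words that start with a capital."""
    return {"capitalized": sum(1 for w in text.split() if w[:1].isupper())}


def classify_stream(
    docs: Iterable[str], classifiers: Dict[str, Classifier]
) -> Iterator[dict]:
    """Run every classifier on each document in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(classifiers))) as pool:
        for doc in docs:
            # Submit all classifiers for this document, then wait for each one.
            futures = {name: pool.submit(fn, doc) for name, fn in classifiers.items()}
            yield {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    stream = ["Apple CEO Tim Cook speaks"]
    for result in classify_stream(stream, {"tokens": tokenize, "caps": count_capitalized}):
        print(result)
```

Adding a new classifier in this sketch means adding one more entry to the dictionary; classifiers that depend on each other's output would need to be chained rather than run side by side.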

Classification Overview

• Dictionary of nearly 400,000 entities

• Point-in-time aware

• Rules per entity type

• Extensive entity relationship modeling

• Supports metadata and other hints

• Equivalent terms and stop words

We support: company (Oracle Corp.), organization (European Union), geo-political place (Spain), currency (US Dollar), nationality (Spanish), people (Barack Obama), commodity (Crude Oil), position (CEO, President), team (Real Madrid), product (iPhone 6S), and more.
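"Point-in-time aware" can be read as: a name only resolves to an entity if that name was valid on the document's date. A minimal sketch of such a lookup is below. The record layout, the class name `EntityDictionary` and the sample company are all hypothetical; the slides do not describe the actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NameRecord:
    entity_id: str
    entity_type: str
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the name is still valid


class EntityDictionary:
    """Maps names to entities, honoring the validity window of each name."""

    def __init__(self, records: List[NameRecord]) -> None:
        self._by_name: Dict[str, List[NameRecord]] = {}
        for record in records:
            self._by_name.setdefault(record.name.lower(), []).append(record)

    def lookup(self, name: str, as_of: date) -> List[NameRecord]:
        """Return the records for `name` that were valid on the date `as_of`."""
        return [
            r
            for r in self._by_name.get(name.lower(), [])
            if r.valid_from <= as_of and (r.valid_to is None or as_of < r.valid_to)
        ]


if __name__ == "__main__":
    # Hypothetical company that changed its name on 2010-01-01.
    dictionary = EntityDictionary([
        NameRecord("E1", "company", "Acme Corp", date(2000, 1, 1), date(2010, 1, 1)),
        NameRecord("E1", "company", "Acme Holdings", date(2010, 1, 1)),
    ])
    print(dictionary.lookup("Acme Corp", date(2005, 6, 1)))  # one match
    print(dictionary.lookup("Acme Corp", date(2012, 1, 1)))  # no match
```

This matters for historical runs: a document from 2005 should be matched against the names in use in 2005, not today's names.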

Entity Detection

Example: People Detection

• Many people share the same or similar names

• Many people hold various positions at employers across time

• People have one or more nationalities

• People are related to other people

Melanie Griffith files for divorce from Banderas

Mai And Banderas Star In The New The King Of Fighters XIV Trailer

After year out, Tim Cook joins competitive Oregon State running back battle

Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store
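The two "Tim Cook" headlines above show why a name alone is not enough. One simple way to picture the disambiguation is to score each candidate by how many of its known hints (employer, position, and so on) appear in the text. The sketch below does exactly that and nothing more; the candidate identifiers and hint words are made up for illustration and the real system is certainly richer.

```python
from typing import Dict, Optional, Set

# Hypothetical candidate profiles; identifiers and hint words are illustrative only.
CANDIDATES: Dict[str, Set[str]] = {
    "tim-cook-executive": {"apple", "ceo", "ipad", "iphone"},
    "tim-cook-athlete": {"oregon", "state", "running", "back", "football"},
}


def disambiguate(headline: str, candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the candidate whose hint words overlap most with the headline.

    Returns None when no hint matches or when the best score is tied.
    """
    words = {w.strip(".,!?").lower() for w in headline.split()}
    scores = {cid: len(words & hints) for cid, hints in candidates.items()}
    best = max(scores, key=lambda cid: scores[cid])
    top = scores[best]
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


if __name__ == "__main__":
    print(disambiguate(
        "Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store",
        CANDIDATES,
    ))
    print(disambiguate(
        "After year out, Tim Cook joins competitive Oregon State running back battle",
        CANDIDATES,
    ))
```

Returning `None` on a tie or on zero evidence is a deliberate choice in this sketch: an unresolved mention is usually less harmful than a wrong one.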

Classic Model

• 6 Servers, 19 KVM virtual machines

• Limited Storage - Expensive to Upgrade

• Multiple Points of Failure

Use Case: Realtime Classification

[Diagram: Collectors, Classifier, RDBMS, Files, Snapshots, RT Feed]

Cloud Model using AWS

• CloudFormation to model the Stack

• Unlimited, Distributed Storage

• Easy redundancy, failover and backup

Use Case: Realtime Classification

[Diagram: Collectors, Classifiers, Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis, RT Feed, Snapshots]

Comment (Gonzalo Bahut): Using CloudFormation, we can replicate the Stack in the same or in different geographical regions, which gives high availability and client-oriented performance.

• Lose central RDBMS → Lose transactions

• S3 great for documents, but no index

• DynamoDB great for index, but...

    Must manage throughput

    No foreign keys or integrity constraints

    Eventual consistency

• Redshift amazing for OLAP, but not OLTP

    So use Kinesis to stream and then batch

• Schema-free is a myth

Applications are more flexible and scalable, but also more complex.
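The "stream and then batch" point can be illustrated without any AWS calls: records arrive one at a time, are grouped into fixed-size batches, and each batch is then written in a single bulk operation instead of many small inserts. The helper name `micro_batches` and the batch size are assumptions for this sketch; in a deployment like the one described, the records would come from Kinesis and each batch would typically be staged and bulk-loaded into Redshift.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of records into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(5), batch_size=2):
        # Stand-in for one bulk load per batch.
        print(f"bulk-loading {len(batch)} records: {batch}")
```

A production version would usually also flush on a timer, so that a slow stream does not hold records back indefinitely.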

Cloud Migration Challenges

Comment (Gonzalo Bahut): Reconsider "Eventual consistency". DynamoDB read consistency is configurable: we can make our reads consistent at the price of lower performance.
Comment (Gonzalo Bahut): I think the term "losing" is too severe. Nothing prevents us from running a big Oracle RDBMS in the cloud as well.

Classic Model

• Same Limited Set of Servers, Same RDBMS

• Can affect Realtime System, Backups

• Full archive, 4-6 Classifiers → 6 weeks!

Use Case: History Classification

[Diagram: RDBMS, Files, Classifiers]

Cloud Model using AWS

• Servers on Demand, Distributed Storage

• Independent of Realtime System

• Full archive, 100 Classifiers → 3 days!
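A rough back-of-the-envelope comparison of the two historical runs is sketched below. It assumes an archive of 300 million documents (the overview slide says "more than 300 million"), takes 5 as the midpoint of "4-6 Classifiers", and treats the quoted durations as exact, so the results are only indicative.

```python
SECONDS_PER_DAY = 86_400
ARCHIVE_DOCS = 300_000_000  # assumption taken from the overview slide


def docs_per_second_per_classifier(docs: int, days: float, classifiers: int) -> float:
    """Average sustained rate each classifier needs in order to finish in `days`."""
    return docs / (days * SECONDS_PER_DAY * classifiers)


# Classic model: 6 weeks with 4-6 classifiers (5 assumed).
classic_rate = docs_per_second_per_classifier(ARCHIVE_DOCS, days=42, classifiers=5)
# Cloud model: 3 days with 100 classifiers.
cloud_rate = docs_per_second_per_classifier(ARCHIVE_DOCS, days=3, classifiers=100)
wall_clock_speedup = 42 / 3

if __name__ == "__main__":
    print(f"classic: {classic_rate:.1f} docs/s per classifier")
    print(f"cloud:   {cloud_rate:.1f} docs/s per classifier")
    print(f"wall-clock speedup: {wall_clock_speedup:.0f}x")
```

Under these assumptions the per-classifier rate is of the same order in both models (about 16.5 versus 11.6 documents per second), so the 14x reduction in wall-clock time comes mainly from running about 20 times as many classifiers at once.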

Use Case: History Classification

[Diagram: Coordinator and Classifiers across multiple Availability Zones, with Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon Redshift]

Classic Model - Clear skies!

• Well-known resources

• Predictable workload

• Predictable behavior

• Stable Behavior

We have full control over the resources.

We expect a service to be started seldom and to run for a long time without interruption.

The Weather

Cloud Model - Spot Instances

• Bid for unused capacity

• Save money, control costs

• Great for jobs with no specific deadline

• Possible to bid above on-demand rates

We typically pay 1/2 to 1/10 of the “on-demand” rates.

We use spot instances for our historical classification runs.
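To make the "1/2 to 1/10" range concrete, the sketch below prices one historical run under clearly hypothetical inputs: 100 instances (one per classifier), 72 hours (the 3-day run), and an on-demand rate of 0.50 per instance-hour. None of these prices come from the talk; only the two fractions do.

```python
def run_cost(instances: int, hours: float, on_demand_rate: float,
             spot_fraction: float = 1.0) -> float:
    """Cost of a run when paying `spot_fraction` of the on-demand hourly rate."""
    return instances * hours * on_demand_rate * spot_fraction


ON_DEMAND_RATE = 0.50  # hypothetical price per instance-hour
INSTANCES = 100        # assumption: one instance per classifier
HOURS = 72             # assumption: the full 3-day run

on_demand_cost = run_cost(INSTANCES, HOURS, ON_DEMAND_RATE)
spot_cost_high = run_cost(INSTANCES, HOURS, ON_DEMAND_RATE, spot_fraction=1 / 2)
spot_cost_low = run_cost(INSTANCES, HOURS, ON_DEMAND_RATE, spot_fraction=1 / 10)

if __name__ == "__main__":
    print(f"on-demand: {on_demand_cost:.2f}")
    print(f"spot:      {spot_cost_low:.2f} to {spot_cost_high:.2f}")
```

The absolute numbers are meaningless without real prices; the point is only that the same run costs between a tenth and a half as much when the job can tolerate interruption.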

The Weather

Cloud Model - Warning! Uncertain Conditions

• Someone else’s resources

• Unpredictable behavior

• Easy to move the spot market

We have no control over the resources or who else might be using them. We expect a server can be killed with little notice.

The Weather

Cloud Model - Warning! Uncertain Conditions

• Do work in multiple zones

• Optimize image startup

• Group work into well-defined chunks

• Use on-demand instances for co-ordination

Expect inclement weather and be prepared for it!
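Grouping work into well-defined chunks is what makes a killed spot instance cheap: the coordinator simply hands the unfinished chunk to someone else. The following single-process sketch simulates that behavior. `WorkerLost`, `run_chunks` and the retry limit are hypothetical names and choices, and a real coordinator would track leases across machines rather than call a function directly.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable


class WorkerLost(Exception):
    """Raised when a (simulated) spot instance is terminated mid-chunk."""


def run_chunks(
    chunks: Iterable[str],
    process: Callable[[str], int],
    max_attempts: int = 5,
) -> Dict[str, int]:
    """Process every chunk, re-queueing any chunk whose worker was lost."""
    pending: Deque[str] = deque(chunks)
    attempts: Dict[str, int] = {}
    done: Dict[str, int] = {}
    while pending:
        chunk = pending.popleft()
        attempts[chunk] = attempts.get(chunk, 0) + 1
        try:
            done[chunk] = process(chunk)
        except WorkerLost:
            if attempts[chunk] >= max_attempts:
                raise
            # The chunk is the unit of retry: put the whole chunk back in the queue.
            pending.append(chunk)
    return done


if __name__ == "__main__":
    killed_once = set()

    def flaky_worker(chunk: str) -> int:
        # Simulate losing the instance the first time chunk "2016-02" is attempted.
        if chunk == "2016-02" and chunk not in killed_once:
            killed_once.add(chunk)
            raise WorkerLost(chunk)
        return len(chunk)

    print(run_chunks(["2016-01", "2016-02", "2016-03"], flaky_worker))
```

For this to be safe, processing a chunk has to be idempotent, since a chunk that was partly done when its worker disappeared will be processed again from the start.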

Dealing with Bad Weather

Download a Custom “Slice” of Analytics Data

• Provide a Web-API and Web Service

• Let client specify parameters

    Data Set and Time Range

    Entities and Events

    Filters

• Leverage Amazon Redshift and S3

• Compression and Multiple Output Formats
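The parameters a client might pass for such a "slice" can be pictured as a small request object that is validated and then turned into a parameterized query. Everything concrete here is an assumption made for illustration: the class `SliceRequest`, the table name `analytics`, the column names and the list of output formats do not come from the talk.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

ALLOWED_FORMATS = {"csv", "json"}  # hypothetical set of output formats


@dataclass
class SliceRequest:
    dataset: str
    start: date
    end: date
    entities: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    output_format: str = "csv"
    compress: bool = True

    def validate(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not be after end")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")


def build_query(req: SliceRequest) -> Tuple[str, list]:
    """Return (sql, params); table and column names are hypothetical."""
    req.validate()
    sql = "SELECT * FROM analytics WHERE dataset = %s AND event_date >= %s AND event_date <= %s"
    params: list = [req.dataset, req.start, req.end]
    # Optional filters are added as IN (...) lists with one placeholder per value.
    for column, values in (("entity_id", req.entities), ("event_type", req.events)):
        if values:
            sql += f" AND {column} IN (" + ", ".join(["%s"] * len(values)) + ")"
            params += values
    return sql, params


if __name__ == "__main__":
    request = SliceRequest("news", date(2016, 1, 1), date(2016, 1, 31), entities=["E1", "E2"])
    print(build_query(request))
```

Keeping client-supplied values in the parameter list rather than in the SQL text is the main design point of the sketch, since this endpoint would be exposed to outside users.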

Opportunity: Self-Service Data

[Diagram: Amazon API Gateway, Amazon EC2, Amazon Redshift, Amazon S3]

• Let Clients upload Proprietary Content to a Private and Secure VPC

• Provision Computing and Storage Resources on a Per Project Basis

• View Private Analytics in Isolation or Alongside Standard RavenPack Analytic DataSets

• Everything Goes Away when Project Completes

Opportunity: The RavenPack Cloud

[Diagram: Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon Redshift, Amazon CloudSearch]

Thank you! Gracias!
