Gilbane Boston 2011 big data

Get Ready for Big Data

Peter O'Kelly, Principal Analyst, O'Kelly Associates

Hadley Reynolds, Managing Director, Next Era Research

Kathleen Reidy, Senior Analyst, 451 Research

Wednesday, November 30, 2011, 2:40 – 4:00

http://gilbaneboston.com/speakers.html%23pokelly

http://gilbaneboston.com/speakers.html%23hreynolds

http://gilbaneboston.com/speakers.html%23kreidy

http://gilbaneboston.com/index.html

Agenda

• Big data in context
• Big structured data
• Big unstructured data
• Big opportunities and risks
• Q&A

Big Data in Context

• What is “big data”?
  – Unhelpfully, both “big data” and “NoSQL,” generally considered a key part of the big data wave, are defined more in terms of what they’re not than what they are
  – A typical big data definition (Wikipedia):
    • “[…] datasets that grow so large that they become awkward to work with using on-hand database management tools”

http://en.wikipedia.org/wiki/Big_data

Big Data in Context

• With thanks to the Business SOA blog:
  – “[…] describe Big Data in the same way that the Hitchhikers Guide to the Galaxy described space:
    – ‘Space,’ it says, ‘is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space, listen...’”

http://service-architecture.blogspot.com/2011/11/when-big-data-is-big-con.html

http://www.bbc.co.uk/cult/hitchhikers/

Big Data in Context

• Why is big data a big deal now?
  – Commodity hardware and the Internet
    • Capability and price/performance curves that continue to defy all economic “laws”
    • Also facilitating compelling cloud services
  – Maturation and uptake of open source software, e.g., Hadoop
    • Powerful and often no- or low-cost
  – IT market
    • Enthusiasm for “NoSQL” systems
    • Frustration with incumbent information management vendors
  – Useful new data sources/resources, e.g., social network activity graphs, the “Internet of things,” sensor networks…
  – Competitive and compliance imperatives

Big Data in Context

• A big data reality check
  – “Mindbogglingly”-scale information management is not new
    • Consider, e.g., VLDB, multi-billion document repositories, and the World Wide Web…
  – What is new and compelling
    • The combination of market dynamics producing new capability and price/performance curves
    • Cloud
      – No deep capital investment required to get started
      – Cloud-based information resources
    • Some innovative marketing, suggesting
      – Self-proclaimed next-generation big data systems are magical and revolutionary
      – Deployed systems are obsolete and wasteful

http://en.wikipedia.org/wiki/VLDB

A Big-Picture Framework

• A digital information item dichotomy
  – Resources (~unstructured information)
    • Digital artifacts optimized to convey stories
      – Organized in terms of narrative, hierarchy, and sequence
    • Examples: books, magazines, documents (e.g., PDF, Word), Web pages, XBRL documents, video, hypertext…
  – Relations (~structured information)
    • Application-independent descriptions of real-world things and relationships
    • Examples: business domain databases, e.g., customer, sales, HR…

[Figure: Resource vs. Relation. Labels: Word docs, DITA docs, XBRL docs, PDF docs, Operational db, Desktop db, Streaming db]

             Resources                      Relations
Conceptual   Resources and links            Entities, attributes, relationships, and identifiers
Logical      Model: hypertext               Model: extended relational
             Language: XQuery (ideally)     Language: SQL
Physical     Indexing (e.g., scalar data types, XML, full-text), locking and isolation levels, federation, replication, in-memory databases, columnar storage, table spaces, caching, and more



Big Structured Data

• NoSQL
• Hadoop
• RDBMS reconsidered
• Back to the bigger picture

NoSQL

• No clear consensus on what “NoSQL” means
  – Started with what it’s against, not what it’s about
    • And often finds a receptive audience due to frustration with RDBMS business-as-usual
  – The “NoSQL” meme is a moving target
    • Initially implied “Just say ‘no’ to SQL”
    • Later quietly redefined as “Not Only SQL”
    • What may be next: “New Opportunities for SQL”
      – I.e., some developers may reconsider the value of SQL and RDBMSs, after hitting NoSQL limitations


A NoSQL Taxonomy

• From the NoSQL Wikipedia article:

http://en.wikipedia.org/wiki/NoSQL

NoSQL Perspectives

• The “NoSQL” meme confusingly conflates
  – Document database requirements
    • Best served by XML DBMS (XDBMS)
  – Physical model decisions on which only DBAs and systems architects should focus
    • And which are more complementary than competitive with RDBMS/XDBMS
  – Object databases, which have floundered for decades
    • But with which some application developers are nonetheless enamored, for minimized “impedance mismatch,” despite significant information management compromises
  – Semantic models
    • Also more complementary than competitive with RDBMS/XDBMS

Hadoop

• Hadoop is often considered central to big data
  – Originating with Google’s MapReduce architecture, Apache Hadoop is an open source framework for distributed processing on networks of commodity hardware
• Commercial application domains include (from Wikipedia)
  – Log and/or clickstream analysis of various kinds
  – Marketing analytics
  – Machine learning and/or sophisticated data mining
  – Image processing
  – Processing of XML messages
  – Web crawling and/or text processing
  – General archiving, including of relational/tabular data, e.g., for compliance

http://en.wikipedia.org/wiki/Hadoop
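
The MapReduce idea behind Hadoop can be illustrated with a minimal, single-process sketch of the log analysis use case. This is not Hadoop's API: the function names, the two-field log format, and the sample lines are all assumptions made for illustration. A real job would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (status_code, 1) pair for one log line."""
    parts = line.split()
    # Assumed toy format: "<path> <status>"; malformed lines emit nothing.
    if len(parts) == 2:
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input, not taken from the slides.
LOG_LINES = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
    "/about.html 200",
    "garbled line here",
]

status_counts = run_job(LOG_LINES)
print(status_counts)
```

Because each map call looks at one line and each reduce call looks at one key, both steps can in principle run in parallel on separate nodes, which is what makes the pattern a fit for commodity hardware.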

Hadoop

• Hadoop is popular and rapidly evolving
  – Most leading information management vendors, including Microsoft, have embraced Hadoop
  – There is now a Hadoop ecosystem

http://www.zdnet.com/blog/microsoft/microsoft-drops-dryad-puts-its-big-data-bets-on-hadoop/11226

RDBMS Reconsidered

• RDBMS incumbents appear to be under siege, with
  – IT frustration with RDBMS business-as-usual
    • Counterproductive RDBMS vendor policies and attitudes
    • DBA modus operandi often seen as excessively conservative
  – Conventional wisdom about RDBMS limitations for, e.g.,
    • “Web scale”
    • “Agility”
    • The application/database “impedance mismatch”
  – The advent of open source and/or specialized DBMSs
    • E.g., MySQL is the M in the “LAMP stack”
    • “The end of the one-size-fits-all DBMS era”

RDBMS Reconsidered

• An RDBMS reality check
  – Leading RDBMS products and open source initiatives are very powerful and flexible
    • And will continue to evolve, e.g., with the mainstream deployment of massive-memory servers and solid state disk (SSD) storage
  – And they continue to expand
    • E.g., in-database processing, with, for example, analytics engines running within DBMS kernels
  – But the RDBMS incumbents nonetheless face unprecedented challenges
    • Which sometimes resonate with frustrated architects and developers because of negative experiences that have more to do with how RDBMSs were used than with what RDBMSs can effectively address

RDBMS in the Big-Picture Framework

             Resources                      Relations
Conceptual   Resources and links            Entities, attributes, relationships, and identifiers
Logical      Model: hypertext               Model: extended relational
             Language: XQuery               Language: SQL
Physical     Indexing (e.g., scalar data types, XML, full-text), locking and isolation levels, federation, replication, in-memory databases, columnar storage, table spaces, caching, and more

RDBMS Reconsidered

• A Forrester big data reality check (from “Stay Alert To Database Technology Innovation,” 11/19/2010):
  – “For 90% of BI use cases, which are often less than 50 terabytes in size, relational databases still are good enough” (p. 4)
  – “Traditional relational databases are still good enough for the majority of transactional use cases” (p. 5)

Back to the Bigger Picture

• Compared with traditional enterprise data management, big data is
  – Essentially a collection of specialized physical models for very large, analysis-oriented data management
  – Expanding to encompass resources as well as relations
  – More about the potential for displacing expensive and closed/proprietary distributed processing alternatives than displacing RDBMS or XDBMS

Structured Big Data: Recap

• Substantive, sustainable, and synergistic
  – RDBMS
  – XDBMS
  – Hadoop
  – The cloud as an information management platform
• Vaguely defined, transitory, and over-hyped
  – NoSQL



Big Unstructured Data

• Finding Facts about Data – IDC/EMC
• Patterns for Unstructured Big Data
• How-to issues – who will know?

http://www.emc.com/leadership/programs/digital-universe.htm


Facebook:
• 800M users
• 500M visitors/day
• $100B potential value @ IPO

http://inmaps.linkedinlabs.com/

Unstructured Big Data Patterns

• Search
• Social
• Mobile
• Online Activities/Digital Marketing
• Inquiry/Detection – Connecting Dots
• Question Answering

Mobile Adds:

• Location data points
• Voice searches
• Siri questions
• App history profile
• Browse history profile
• Search history profile
• Past purchase profile
• Camera-generated outputs/inputs
• Coupon delivery & merchandising
• Friends' locations
• Social search
• Local ad-match algo opportunities


Online Activities/Digital Marketing

• Inquiry/Detection – Connecting Dots
  – Intelligence
  – Law Enforcement
  – Fraud Detection (Government, Financial, Health, …)
  – eDiscovery


Social Media Monitoring


Question Answering


Question Answering Beyond Jeopardy

Twitter Analytics Questions

• What can we tell about a user from their tweets?
  – from the tweets of those they follow?
  – from the tweets of their followers?
  – from the ratio of followers/following
• What graph structures lead to successful networks?
• User reputation?
• Sentiment analysis?
• What features get a tweet retweeted?
  – How deep is the retweet tree?
• Long term duplicate detection
• Machine learning
• Language detection
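
Two of the questions above, the followers/following ratio and the depth of a retweet tree, reduce to small computations once the data is in hand. The sketch below is illustrative only: the function names, the dictionary layout, and the sample tweet ids are hypothetical, and it assumes the retweet data forms a tree with no cycles.

```python
def follower_ratio(followers, following):
    """Followers divided by following; None when the user follows nobody."""
    if following == 0:
        return None
    return followers / following


def retweet_depth(children, root):
    """Depth of the retweet tree under `root` (a lone tweet has depth 0)."""
    depth = 0
    stack = [(root, 0)]
    # Iterative traversal avoids recursion limits on very deep trees.
    while stack:
        node, level = stack.pop()
        depth = max(depth, level)
        for child in children.get(node, []):
            stack.append((child, level + 1))
    return depth


# Hypothetical data: tweet id -> ids of the tweets that retweeted it.
RETWEETS = {"t1": ["t2", "t3"], "t2": ["t4"], "t4": ["t5"]}

print(follower_ratio(300, 150))
print(retweet_depth(RETWEETS, "t1"))
```

At Twitter scale the same logic would have to run over a distributed graph rather than an in-memory dictionary, which is where the big data tooling discussed earlier comes in.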

http://www.mckinsey.com/en/Features/Big_Data.aspx



Big Data Opportunities

• Improved visibility and insights
  – Can explore previously impractical questions
• Real-time analytics
  – Less dependence on “dead data”
• Blur the boundaries between structured and unstructured information
  – Unified views of resources and relations
• Consolidation
  – Reduce the number of moving parts in your infrastructure
    • Along with related licensing and maintenance expenses
• Compliance: capture and maintain data & records previously beyond the firm's capabilities

Big Data Risks

• The potential for an ever-expanding set of information silos
  – Critical to relentlessly focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
  – Dramatic improvements in capabilities and price/performance provide new opportunities for self-inflicted damage, for organizations that don’t model or query effectively
• Cognitive overreach
  – The potential for information workers to create nonsensical queries based on poorly-designed and/or misunderstood information models
• Skills gaps create competitive disadvantages
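
As one hedged illustration of a nonsensical query built on a misunderstood information model, the sketch below uses Python's built-in sqlite3 module with a hypothetical two-table schema (the `orders` and `shipments` tables and their values are invented for this example). Joining orders to shipments repeats any order that shipped in more than one package, so a revenue total computed over the join is silently inflated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 went out in two shipments, order 2 in one.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
    """
)

# Misread model: the join repeats order 1, so its total is counted twice.
naive_revenue = conn.execute(
    "SELECT SUM(o.total) FROM orders o JOIN shipments s ON s.order_id = o.order_id"
).fetchone()[0]

# Query that matches the model: one row per order.
actual_revenue = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]

print(naive_revenue, actual_revenue)
conn.close()
```

Both queries run without error, which is the point: nothing in the system flags the first result as wrong, and at larger scale the discrepancy is harder to spot by eye.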


Q&A

Peter O'Kelly
Kathleen Reidy
Hadley Reynolds

Database market landscape

[Figure: database market landscape map. Category labels: Relational, Non-relational, Operational, Analytic, -as-a-Service, NewSQL, NoSQL (Document, Graph, Key value, Big tables), Data Grid/Cache. Products shown: Oracle, IBM DB2, SQL Server, PostgreSQL, MySQL, Ingres, SAP Sybase ASE, Hadoop, Teradata, Netezza, JustOne, EMC Greenplum, Aster Data, ParAccel, HP Vertica, SimpleDB, Amazon RDS, Xeround, Calpont, GenieDB, VoltDB, ScalArc, Lotus Notes, CouchDB, MongoDB, Objectivity, MarkLogic, InterSystems, Versant, Progress, McObject, HBase, Hypertable, Redis, Riak, Voldemort, BerkeleyDB, Membrain, InfiniteGraph, Neo4J, GraphDB, App Engine Datastore, Clustrix, Schooner MySQL, Tokutek, Akiban, CodeFutures, Continuent, ScaleBase, Translattice, SQL Azure, FathomDB, EnterpriseDB, Database.com, Infobright, SAP Sybase IQ, IBM InfoSphere, NimbusDB, VectorWise, HandlerSocket, Cassandra, Cloudant, Memcached, IBM eXtreme Scale, Oracle Coherence, GigaSpaces, Terracotta, GridGain, ScaleOut, Vmware GemFire, CloudTran, InfiniSpan, Couchbase, RavenDB, Drizzle, Piccolo, Dryad, Hadapt, Mapr, Brisk, MySQL Cluster]

Big Data Complexity Continuum

[Figure: chart with axes “Number & Complexity of Technologies” and “Time Horizon” (Historic; Current (Monitor); Future (Predict)). Labels: eCommerce, IDC 2005, Sentiment extraction, Speech to text, Intelligent Machines, Log Analysis, Predictions, Relationship Detection, Pattern Detection, Influence Networks, Brand monitoring, Climate Modeling And Prediction, Trend Analytics, Reputation management, Voice of Customer, Gov’t Intelligence Applications, Data mining, Medical diagnostics, Fraud Detection, Web search, Ad Targeting, Retargeting. © IDC]

Big Data Characteristics

[Figure: Big Data characteristics: Velocity, Value, Volume, Variety/Complexity]
