1
Introduction to Big Data and NoSQL
SQL Azure Saturday
April 21, 2012
Don Demsak
Advisory Solutions Architect
EMC Consulting
www.donxml.com
2
Meet Don
• Advisory Solutions Architect – EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email – [email protected]
• SlideShare - http://www.slideshare.net/dondemsak
4
How did we get here?
• Expensive
– Processors
– Disk space
– Memory
– Operating Systems
– Software
– Programmers
• Monoculture
– Limit CPU cycles
– Limit disk space
– Limit memory
– Limited OS Development
– Limited Software
– Programmers
• Mono-lingual
• Mono-persistence
5
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
– Atomicity
– Consistency
– Isolation
– Durability
9
3rd Step – Database Partitioning
[Diagram: three separate stacks, one per customer (Customer #1, #2, #3), each with its own Browser, Web Tier, B/L Tier, and Database]
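One way to picture this step is a lookup that routes each customer to its own database. The following is a minimal sketch only: the connection strings and the `databaseFor` helper are hypothetical names, not something shown in the talk.

```javascript
// Hypothetical routing table: one database per customer (names are illustrative).
const customerDatabases = new Map([
  ["customer-1", "db-server-1/customer1"],
  ["customer-2", "db-server-2/customer2"],
  ["customer-3", "db-server-3/customer3"],
]);

// Return the database that holds a given customer's data.
function databaseFor(customerId) {
  const target = customerDatabases.get(customerId);
  if (target === undefined) {
    throw new Error("No partition registered for " + customerId);
  }
  return target;
}

console.log(databaseFor("customer-2"));
```

Because every request names its customer, the web and B/L tiers never need to query more than one database per request.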
10
4th Step – Move to the cloud?
[Diagram: the same three per-customer stacks (Customer #1, #2, #3), each with Browser, Web Tier, and B/L Tier, with the Database replaced by SQL Azure Federation]
15
Where Did NoSQL Originate?
• 1998 - Carlo Strozzi
– NoSQL project - lightweight open-source relational DB with no SQL interface
• 2009 - Eric Evans (Rackspace) reintroduced the term when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases
16
NoSQL (loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) don’t guarantee ACID
17
Atlanta 2009
• No:sql(east) conference
– select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”
20
Document Store
• Apache Jackrabbit
• CouchDB
• MongoDB
• SimpleDB
• XML Databases
– MarkLogic Server
– eXist
21
Document?
• Okay, think of a web page...
– Relational model requires a column per tag
– Lots of empty columns
– Wasted space
• Document model just stores the pages as is
– Saves on space
– Very flexible
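The contrast can be sketched in a few lines. This is an illustration under assumptions: the two sample pages and their field names are invented, and the "relational" side is just an array of fixed-width rows rather than a real database.

```javascript
// Two "pages" with different tags (sample data, invented for illustration).
const pages = [
  { url: "/home", title: "Home", keywords: "big data" },
  { url: "/about", title: "About", author: "Ann" },
];

// Relational-style: every row carries every column, so missing tags become nulls.
const columns = [...new Set(pages.flatMap((p) => Object.keys(p)))];
const rows = pages.map((p) => columns.map((c) => (c in p ? p[c] : null)));
const emptyCells = rows.flat().filter((v) => v === null).length;

// Document-style: each page is stored as is, with only the fields it has.
const documents = pages.map((p) => JSON.stringify(p));

console.log(columns);
console.log(emptyCells);
console.log(documents);
```

The more the pages differ from each other, the more null cells the fixed-column layout accumulates, while each document only ever holds its own fields.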
22
Graph Storage
• AllegroGraph
• Core Data
• Neo4j
• DEX
• FlockDB
• Microsoft Trinity (research project)
– http://research.microsoft.com/en-us/projects/trinity/
23
What’s a graph?
• Graph consists of
– Nodes (‘stations’ of the graph)
– Edges (lines between them)
• FlockDB
– Created by the Twitter folks
– Nodes = Users
– Edges = Nature of relationship between nodes
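A graph of users and relationships can be sketched as a plain list of edges. The user names, the relationship types, and the `outgoing`/`incoming` helpers below are illustrative assumptions, not FlockDB's API.

```javascript
// Edges as (source, relationship, destination) triples; users and
// relationship names are invented for illustration.
const edges = [
  { from: "alice", type: "follows", to: "bob" },
  { from: "carol", type: "follows", to: "bob" },
  { from: "bob", type: "follows", to: "alice" },
  { from: "alice", type: "blocks", to: "carol" },
];

// Nodes that `node` points at directly over edges of one relationship type.
function outgoing(node, type) {
  return edges.filter((e) => e.from === node && e.type === type).map((e) => e.to);
}

// Nodes that point directly at `node` over edges of one relationship type.
function incoming(node, type) {
  return edges.filter((e) => e.to === node && e.type === type).map((e) => e.from);
}

console.log(outgoing("alice", "follows"));
console.log(incoming("bob", "follows"));
```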
24
Key/Value Stores
• On disk
• Cache in RAM
• Eventually Consistent
– Weak Definition
• “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
– Strong Definition
• “For a given update and a given replica, eventually either the update reaches the replica or the replica retires”
• Ordered
– Distributed Hash Table allows lexicographical processing
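The weak definition of eventual consistency can be sketched with two in-memory replicas. This is a toy model under assumptions: the `Replica` class, the `propagate` helper, and the integer version counter (standing in for a real timestamp or vector clock) are all hypothetical.

```javascript
// Each replica maps key -> { value, version }; the higher version wins.
// The version counter is an assumption standing in for a real clock.
class Replica {
  constructor() {
    this.data = new Map();
  }
  write(key, value, version) {
    const current = this.data.get(key);
    if (!current || version > current.version) {
      this.data.set(key, { value, version });
    }
  }
  read(key) {
    const entry = this.data.get(key);
    return entry ? entry.value : undefined;
  }
}

// Propagation pass: push every entry of `source` to `target`.
function propagate(source, target) {
  for (const [key, { value, version }] of source.data) {
    target.write(key, value, version);
  }
}

const a = new Replica();
const b = new Replica();
a.write("name", "Ann", 1);
const staleRead = b.read("name"); // replica b has not seen the update yet
propagate(a, b);
const convergedRead = b.read("name"); // after propagation both replicas agree
console.log(staleRead, convergedRead);
```

Between the write and the propagation pass a reader of replica `b` sees old data; once updates stop and propagation completes, both replicas return the same value.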
30
Big Data Definition
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
31
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)
32
Real World Example
• Twitter
– The challenges
• Needs to store many graphs
  – Who you are following
  – Who’s following you
  – Who you receive phone notifications from, etc.
• To deliver a tweet requires rapid paging of followers
• Heavy write load as followers are added and removed
• Set arithmetic for @mentions (intersection of users)
33
What did they try?
• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
– Nope
• Either good at
– Handling the write load
– Or paging large amounts of data
– But not both
34
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to
– Arrive out of order
– Or be processed more than once
• Failures should result in redundant work
– Not lost work!
35
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
– List of all edges in a graph
• Key is the edge; value is a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic
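Page-able set arithmetic over adjacency lists can be sketched as follows. This is not FlockDB's API: the `"node:relationship"` key layout, the user ids, and the `intersect` and `page` helpers are assumptions made for illustration.

```javascript
// Adjacency lists keyed by "node:relationship"; user ids are invented.
// Lists are kept sorted so they can be paged with a cursor.
const adjacency = new Map([
  ["1:followers", [2, 3, 5, 8, 13]],
  ["4:followers", [3, 5, 6, 13, 21]],
]);

// Intersection of two sorted lists via a linear merge.
function intersect(left, right) {
  const result = [];
  let i = 0;
  let j = 0;
  while (i < left.length && j < right.length) {
    if (left[i] === right[j]) {
      result.push(left[i]);
      i++;
      j++;
    } else if (left[i] < right[j]) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}

// Return up to `count` ids greater than `cursor`, plus the next cursor
// (null when the page was not full, meaning the list is exhausted).
function page(list, cursor, count) {
  const items = list.filter((id) => id > cursor).slice(0, count);
  const nextCursor = items.length === count ? items[items.length - 1] : null;
  return { items, nextCursor };
}

const common = intersect(adjacency.get("1:followers"), adjacency.get("4:followers"));
const firstPage = page(common, 0, 2);
const secondPage = page(common, firstPage.nextCursor, 2);
console.log(common, firstPage, secondPage);
```

Paging by cursor value rather than by offset means a page stays valid even if followers are added or removed between requests.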
36
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
– All queries can be answered by a single partition
• Write operations are idempotent
– Can be applied multiple times without changing the result
• And commutative
– Changing the order of operands doesn’t change the result
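Idempotent and commutative writes can be sketched with a "newest position wins" rule. The `position` field, the edge states, and the `applyWrite`/`replay` helpers are assumptions for illustration; the sketch also assumes no two writes to the same edge share a position.

```javascript
// Each write carries a position (for example, a timestamp). A write only
// takes effect if it is newer than what is stored, so applying it twice
// (idempotent) or in a different order (commutative) gives the same result.
function applyWrite(store, write) {
  const key = write.from + "->" + write.to;
  const current = store.get(key);
  if (!current || write.position > current.position) {
    store.set(key, { state: write.state, position: write.position });
  }
  return store;
}

// Apply a sequence of writes to an empty store.
function replay(writes) {
  return writes.reduce(applyWrite, new Map());
}

const follow = { from: "alice", to: "bob", state: "follow", position: 1 };
const unfollow = { from: "alice", to: "bob", state: "removed", position: 2 };

const inOrder = replay([follow, unfollow]);
const outOfOrder = replay([unfollow, follow]);
const duplicated = replay([follow, unfollow, unfollow, follow]);
console.log(inOrder.get("alice->bob").state);
console.log(outOfOrder.get("alice->bob").state);
console.log(duplicated.get("alice->bob").state);
```

With this rule a failed job can simply be retried: the worst case is redundant work, never a lost or corrupted edge.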
38
ACID
• Atomicity
– All or Nothing
• Consistency
– Valid according to all defined rules
• Isolation
– No transaction should be able to interfere with another transaction
• Durability
– Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
39
BASE
• Basically Available
– High availability but not always consistent
• Soft state
– Background cleanup mechanism
• Eventual consistency
– Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent
41
Big Data Approach
• MapReduce Pattern/Framework
– an Input Reader
– Map Function – To transform to a common shape (format)
– a partition function
– a compare function
– Reduce Function
– an Output Writer
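The six parts listed above can be sketched as a single-process pipeline. This is a toy illustration, not a real framework: the sample records, the hash used by `partition`, and the `run` driver are assumptions.

```javascript
// Input reader: yields records (sample documents invented for illustration).
function inputReader() {
  return [
    { tags: ["nosql", "azure"] },
    { tags: ["nosql", "graph"] },
    { tags: ["azure"] },
  ];
}

// Map: transform each record into key/value pairs of a common shape.
function map(record) {
  return record.tags.map((tag) => ({ key: tag, value: 1 }));
}

// Partition: decide which reducer bucket a key belongs to.
function partition(key, bucketCount) {
  let hash = 0;
  for (const ch of key) hash = (hash * 31 + ch.charCodeAt(0)) % bucketCount;
  return hash;
}

// Compare: order keys within a bucket so equal keys are adjacent.
function compare(a, b) {
  return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
}

// Reduce: combine all values for one key.
function reduce(key, values) {
  return { key, value: values.reduce((sum, v) => sum + v, 0) };
}

// Driver plus output writer: collects reduced results into a plain object.
function run(bucketCount) {
  const buckets = Array.from({ length: bucketCount }, () => []);
  for (const record of inputReader()) {
    for (const pair of map(record)) {
      buckets[partition(pair.key, bucketCount)].push(pair);
    }
  }
  const output = {};
  for (const bucket of buckets) {
    bucket.sort(compare);
    let i = 0;
    while (i < bucket.length) {
      let j = i;
      const values = [];
      while (j < bucket.length && bucket[j].key === bucket[i].key) {
        values.push(bucket[j].value);
        j++;
      }
      const result = reduce(bucket[i].key, values);
      output[result.key] = result.value;
      i = j;
    }
  }
  return output;
}

const counts = run(2);
console.log(counts);
```

In a real framework each bucket would be handled by a different machine; the result does not depend on how many buckets are used, because all pairs with the same key always land in the same bucket.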
42
MongoDB Example

// map function
m = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// reduce function
r = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++)
    total += values[i].count;
  return { count: total };
};

// execute
res = db.things.mapReduce(m, r, { out: "myoutput" });
44
Big Data on Azure
• Azure Table Storage
– Azure Service Bus
• SQL Azure Federations
• MongoDB on Azure
– http://www.mongodb.org/display/DOCS/MongoDB+on+Azure
• Hadoop on Azure
– https://www.hadooponazure.com/
45
Using Azure for Computing
[Diagram: a Client, a Master (Job/Task Scheduler), three Workers, and Data stores, connected via Sockets]
46
Moving to Event Based Architecture
[Diagram: Web Roles place requests (Req) on a Queue, and Worker Roles take them off the Queue. Caption: monitor queue length against user’s expectations]
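The queue-based flow can be sketched in memory. Everything here is a stand-in: the array plays the role of a cloud queue, and `webRoleAccept`, `workerRoleStep`, `workersNeeded`, and the "two messages per worker" threshold are hypothetical names and numbers, not Azure APIs.

```javascript
// In-memory stand-in for a cloud queue; names and thresholds are assumptions.
const queue = [];

// Web role: accept a request and enqueue it instead of processing it inline.
function webRoleAccept(request) {
  queue.push(request);
}

// Worker role: take one message off the queue and process it.
function workerRoleStep(processed) {
  const message = queue.shift();
  if (message !== undefined) processed.push(message.toUpperCase());
}

// Monitor: decide how many workers are needed to keep the backlog acceptable.
function workersNeeded(queueLength, messagesPerWorker) {
  return Math.max(1, Math.ceil(queueLength / messagesPerWorker));
}

["req-a", "req-b", "req-c", "req-d", "req-e"].forEach(webRoleAccept);
const needed = workersNeeded(queue.length, 2);

const processed = [];
while (queue.length > 0) {
  for (let w = 0; w < needed; w++) workerRoleStep(processed);
}
console.log(needed, processed);
```

Because the web roles only enqueue, a burst of requests lengthens the queue instead of slowing the front end, and the queue length becomes the signal for adding or removing workers.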
48
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234 2 $48 $96
707423234 1 $56 $56
125145 1 $24 $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
[Diagram: the order maps onto the relational tables Orders, Customers, Order Lines, and Credit Cards]
49
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234 2 $48 $96
707423234 1 $56 $56
125145 1 $24 $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
{"SalesOrdersView": { ID: 1001, Customer: Ann, LineItems: [] … }}
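The order shown on the slide can be written out as one aggregate document. This is a sketch under assumptions: the slide gives only values, so the field names inside `LineItems` and `PaymentDetails`, and the reading of the line-item columns as product id, quantity, and unit price, are guesses.

```javascript
// The order from the slide as one aggregate document. Field names inside
// LineItems and PaymentDetails are assumptions; the slide shows only values.
const salesOrder = {
  ID: 1001,
  Customer: "Ann",
  LineItems: [
    { productId: 32411234, quantity: 2, price: 48 },
    { productId: 707423234, quantity: 1, price: 56 },
    { productId: 125145, quantity: 1, price: 24 },
  ],
  PaymentDetails: { card: "AmEx", number: "12343", expiration: "07/2015" },
};

// Everything needed to total the order is inside the aggregate: no joins.
const orderTotal = salesOrder.LineItems.reduce(
  (sum, item) => sum + item.quantity * item.price,
  0
);
console.log(orderTotal);
```

Reading or writing the whole order touches one document, whereas the relational layout would need rows from several tables.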
51
Next Steps
• Learn a NoSQL product
– Great place to start – AppFabric Cache, Azure Table Storage, MongoDB
• Pick a new programming language to learn
– Not Java or C#/VB
– Node.js, JavaScript, F#