(Distributed) (Structured) Storage Systems
Mark Feltner
Big Data
2.5 Petabytes/day: Wal-Mart's transaction database
40 Terabytes/second: CERN
1 Terabyte/day: NYSE trading data
10 billion: Facebook photos
Overview
Theory
Algorithms
Implementations & Technology
Relational databases
ACID
Atomicity
All-or-nothing
Consistency
Data is always in a valid state
Isolation
Concurrently executed transactions result in the same state as if they had been executed serially
Durability
COMMIT means transaction is permanent across all clients
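The all-or-nothing property can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are hypothetical and exist only for this example; the point is that a failed transaction leaves no partial update behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction (illustrative helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                # Consistency rule for this example: no negative balances.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical schema and data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # violates the rule, so both updates are undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the second call, both UPDATE statements have been rolled back together, so the balances are the ones left by the first transfer.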
Non-relational databases
Key-value
Document-oriented
Graphs
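The three models can be contrasted with a rough sketch that stores the same song in each of them, using plain Python data structures. This is not the API of any particular product; the key format, field names, and edge labels are assumptions made for illustration.

```python
# Key-value: an opaque value looked up by a single key.
# The store knows nothing about what is inside the value.
kv_store = {"song:1": "Aces High|Iron Maiden|Powerslave|1984"}

# Document-oriented: self-describing records that can be queried by field.
documents = [
    {"_id": 1, "title": "Aces High", "artist": "Iron Maiden",
     "album": "Powerslave", "year": 1984},
    {"_id": 2, "title": "Raining Blood", "artist": "Slayer",
     "album": "Reign in Blood", "year": 1986},
]

def find_by_year(docs, year):
    """Return the titles of all documents whose `year` field matches."""
    return [doc["title"] for doc in docs if doc.get("year") == year]

# Graph: entities are nodes, relationships are labeled edges.
edges = [
    ("Aces High", "PERFORMED_BY", "Iron Maiden"),
    ("Aces High", "APPEARS_ON", "Powerslave"),
    ("Powerslave", "RELEASED_BY", "Iron Maiden"),
]

def neighbors(node, label):
    """Follow every outgoing edge with the given label from `node`."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

The key-value form is the simplest to distribute but can only be fetched by key; the document form supports field queries; the graph form makes relationship traversal the primary operation.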
Distributed Systems
Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
CAP Theorem
Consistency
Eventual consistency
“…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)
Availability
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)
Partition Tolerance
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)
(CA || CP || AP) ?
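The trade-off during a partition can be sketched with a toy two-replica store. Everything here is an assumption made for illustration: the class names, the version counter, and the "highest version wins" reconciliation rule are not taken from any real system, and real systems use more careful conflict resolution.

```python
class Replica:
    """One node holding a single value and a version counter."""
    def __init__(self):
        self.value = None
        self.version = 0

class TwoNodeStore:
    """Toy store with two replicas; `mode` is "CP" or "AP"."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True: messages between a and b are lost

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False  # give up availability to stay consistent
            # AP: answer anyway, letting the replicas diverge for now.
            replica.value, replica.version = value, replica.version + 1
            return True
        # No partition: update both replicas before acknowledging.
        version = max(replica.version, other.version) + 1
        for r in (replica, other):
            r.value, r.version = value, version
        return True

    def heal(self):
        """End the partition and converge on the highest version (assumed rule)."""
        self.partitioned = False
        newest = max((self.a, self.b), key=lambda r: r.version)
        for r in (self.a, self.b):
            r.value, r.version = newest.value, newest.version

cp = TwoNodeStore("CP")
cp.write(cp.a, "x")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "y")    # refused: replicas stay identical

ap = TwoNodeStore("AP")
ap.write(ap.a, "x")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "y")    # accepted: replicas now disagree
diverged = ap.a.value != ap.b.value
ap.heal()                            # replicas converge once the partition ends
```

The CP store stays consistent by refusing the write, while the AP store stays available and only becomes consistent again after the partition heals, which is the behavior the term eventual consistency describes.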
Algorithms
Row- versus column-orientation
Title                 Artist        Album           Year
Breaking the Law      Judas Priest  British Steel   1980
Aces High             Iron Maiden   Powerslave      1984
Kickstart My Heart    Motley Crue   Dr. Feelgood    1989
Raining Blood         Slayer        Reign in Blood  1986
I Wanna Be Somebody   W.A.S.P.      W.A.S.P.        1984
Row-oriented
Data Storage Model:
Breaking the Law, Judas Priest, British Steel, 1980, Aces High, Iron Maiden, Powerslave, 1984, Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989, Raining Blood, Slayer, Reign in Blood, 1986, I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984
Column-oriented
Data Storage Model:
Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody, Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P., British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P., 1980, 1984, 1989, 1986, 1984
Comparison of Row- vs. Column-Orientation
CREATE SELECT MAX, MIN, SUM, AVG, …
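The difference between the two layouts can be sketched by flattening the five-song table into a single list in each order. The function names are illustrative, and a flat Python list stands in for contiguous storage on disk.

```python
songs = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]
N_ROWS, N_COLS = len(songs), 4
YEAR = 3  # index of the Year column

# Row-oriented: all fields of record 0, then all fields of record 1, ...
row_store = [field for record in songs for field in record]

# Column-oriented: all titles, then all artists, then all albums, then all years.
column_store = [record[c] for c in range(N_COLS) for record in songs]

def max_year_row(store):
    """Aggregate over one column: must jump through every record."""
    return max(store[r * N_COLS + YEAR] for r in range(N_ROWS))

def max_year_column(store):
    """Aggregate over one column: the values sit in one contiguous slice."""
    return max(store[YEAR * N_ROWS:(YEAR + 1) * N_ROWS])

def record_from_row(store, r):
    """Fetch one whole record: a single contiguous slice."""
    return tuple(store[r * N_COLS:(r + 1) * N_COLS])

def record_from_column(store, r):
    """Fetch one whole record: one lookup per column."""
    return tuple(store[c * N_ROWS + r] for c in range(N_COLS))
```

Both layouts return the same answers; what changes is which access is contiguous. Reading or writing a whole record favors the row layout, while aggregates such as MAX, MIN, SUM, and AVG over one column favor the column layout.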
MapReduce
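The programming model can be sketched in a single process as a map phase, a grouping (shuffle) phase, and a reduce phase. This is only an illustration of the idea: the function names are hypothetical, and real frameworks distribute these phases across many machines and handle failures.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: collect values per key
    return {key: reducer(key, values)       # reduce phase
            for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Combine all counts emitted for one word."""
    return sum(values)

counts = map_reduce(["Reign in Blood", "Raining Blood"], word_mapper, sum_reducer)
```

Because each mapper call looks at one record and each reducer call looks at one key, both phases can be run in parallel on separate nodes without coordination.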
Technology
Implementations
BigTable
High performance
MapReduce
Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …
Hadoop
MapReduce
Yahoo!
World Record Holder!
Cassandra
Key-value
MapReduce
Facebook
Eventual consistency
Scalable, fault-tolerant
MySQL
Relational
LAMP
Redis
Key-value
What it lacks in durability, it makes up for in speed / simplicity.
HBase
MapReduce
Hadoop + HDFS
Java and REST API
Column-oriented
Excellent fault-tolerance
Replication
Streaming
Neo4J
Graph Database
MongoDB
Document-oriented
Conclusions
Pick the right tool for the job.