THE ANSWER TO THE QUESTION OF THE BIG DATA
eleks DevTalks #1
by Victor Haydin
Gordon Moore
                                       1975            2012
Cost of 1 TB storage                   $208 000 000    $110
Cost of 1 GFLOPS computing facility    $62 000 000     $1.50
Number of network hosts                57              > 1 000 000 000
World’s data amount                    ~130 GB         ~2.9 ZB
1 ZB = 1 000 000 000 000 000 000 000 B (10^21)
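To put the 1975 vs 2012 figures in proportion, the change factors can be computed directly. This is a rough sketch: it assumes 1 GB = 10^9 B (consistent with 1 ZB = 10^21 B), treats "> 1 000 000 000" hosts as its lower bound, and the names `figures` and `change_factor` are illustrative only.

```python
# (1975 value, 2012 value) for each figure quoted on the slide.
figures = {
    "Cost of 1 TB storage (USD)": (208_000_000, 110),
    "Cost of 1 GFLOPS (USD)": (62_000_000, 1.50),
    "Network hosts": (57, 1_000_000_000),      # lower bound for 2012
    "World's data (bytes)": (130e9, 2.9e21),   # assumes 1 GB = 10^9 B
}


def change_factor(old, new):
    """Return how many times larger the bigger value is than the smaller one."""
    return max(old, new) / min(old, new)


factors = {name: change_factor(old, new) for name, (old, new) in figures.items()}

if __name__ == "__main__":
    for name, factor in factors.items():
        print(f"{name}: ~{factor:.2e}x")
```

Under these assumptions storage became roughly 1.9 million times cheaper, while the amount of data grew by a factor on the order of 10^10.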
Commodity Hardware
Wikipedia: “Apache Hadoop is a software framework that supports data-intensive distributed applications”
Main Contributors
HDFS: Hadoop Distributed File System
Hardware Failure
Streaming Data Access
Large Data Sets
Simple Coherency Model (write-once)
Portability
Moving Computation is cheaper than moving Data
MapReduce
Map(k1,v1) → list(k2,v2)
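IEnumerable<KeyValuePair<string, int>> Map(string key, string value) {
    // emit (word, 1) for every word in the input value
    foreach (var w in value.Split(' '))
        yield return new KeyValuePair<string, int>(w, 1);
}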
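Reduce(k2, list(v2)) → list(v3)
KeyValuePair<string, int> Reduce(string key, IEnumerable<int> values) {
    // sum the partial counts collected for one word
    int sum = 0;
    foreach (var pc in values)
        sum += pc;
    return new KeyValuePair<string, int>(key, sum);
}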
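The word-count map and reduce functions above can be tried end to end without a cluster. The following is a minimal single-process sketch in Python, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the "shuffle" step that a real framework performs across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word."""
    return key, sum(values)


def run_job(records):
    """Single-process stand-in for the framework: map, shuffle, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)  # shuffle: group values by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    lines = [(0, "big data big answer"), (1, "big question")]
    print(run_job(lines))
    # {'answer': 1, 'big': 3, 'data': 1, 'question': 1}
```

Because each call to `map_fn` depends only on its own record and each call to `reduce_fn` only on one key's values, both phases could be spread across machines, which is what the framework does.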
Demo
Ecosystem
ZooKeeper
3K+ nodes, 36+ PB
45K nodes, 180-200 PB
vs
powered by
Future
Core:
• HDFS: high availability and scalability
• MapReduce: modularity and alternative ways to perform queries
Ecosystem development:
• Apache Bigtop: consolidation project
• HBase, Hive, Pig, ZooKeeper, Avro, Sqoop: stabilizing, interoperability
• Incubator: Flume, Oozie, Whirr
Demo
Q&A