Hadoop Big Data: A Big Picture
Page 1: Hadoop Big Data A big picture

Hadoop

A Big picture for Developers

Page 2: Hadoop Big Data A big picture

How much Data?

• Over 1 terabyte of data is generated by the NYSE every day.
• We send more than 144.8 billion email messages a day.
• Twitter users send more than 340 million tweets a day.
• Facebook users share more than 684,000 pieces of content a day.
• We upload 72 hours of new video to YouTube every minute.
• We spend $272,000 on web shopping a day.
• Google receives over 2 million search queries a minute.
• Apple receives around 47,000 app downloads a minute.
• Brands receive more than 34,000 Facebook "likes" a minute.
• Tumblr blog owners publish 27,000 new posts a minute.
• Instagram photographers share 3,600 new photos a minute.
• Flickr photographers upload 3,125 new photos a minute.
• We perform over 2,000 Foursquare check-ins a minute.
• Individuals and organizations launch 571 new websites a minute.
• WordPress bloggers publish close to 350 new blog posts a minute.

Page 3: Hadoop Big Data A big picture

Data Types

Structured
• SQL databases
• Data warehouses
• Enterprise systems (CRM, ERP, etc.)

Unstructured
• Analog data
• GPS tracking information
• Audio/video streams

Semi-structured
• XML
• E-mails
• EDI

Page 4: Hadoop Big Data A big picture

Worldwide Corporate Data Growth

[Chart: worldwide corporate data growth, 2008–2020, in petabytes, broken down into structured, semi-structured, and unstructured data.]

Page 5: Hadoop Big Data A big picture

Hadoop History

• Dec 2004: Dean and Ghemawat (Google) publish the MapReduce paper
• 2005: Doug Cutting and Mike Cafarella (Yahoo) create Hadoop, at first only to extend Nutch (the name is derived from Doug's son's toy elephant)
• 2006: Yahoo runs Hadoop on 5–20 nodes
• March 2008: Cloudera founded
• July 2008: Hadoop wins the terabyte sort benchmark (the first time a Java program won this competition)
• April 2009: Amazon introduces "Elastic MapReduce" as a service on S3/EC2
• June 2011: Hortonworks founded
• 27 Dec 2011: Apache Hadoop release 1.0.0
• June 2012: Facebook claims the "biggest Hadoop cluster", totalling more than 100 petabytes in HDFS
• 2013: Yahoo runs Hadoop on 42,000 nodes, computing about 500,000 MapReduce jobs per day
• 15 Oct 2013: Apache Hadoop release 2.2.0 (YARN)

Page 6: Hadoop Big Data A big picture

Terms

Google calls it → Hadoop equivalent

MapReduce → Hadoop
GFS → HDFS
Bigtable → HBase
Chubby → ZooKeeper

Page 7: Hadoop Big Data A big picture

HDFS

Single namespace for the entire cluster

Data coherency
– Write-once-read-many access model
– Clients can only append to existing files

Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes

Intelligent client
– Clients can find the location of blocks
– Clients access data directly from the DataNode
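These interactions are exposed through Hadoop's Java FileSystem API. Below is a minimal, illustrative sketch of the write-once/append-only model; the NameNode address and file path are assumptions, and append support depends on the cluster's configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write once: create a new file (overwrite=false fails if it exists)
            Path path = new Path("/data/events.log");
            try (FSDataOutputStream out = fs.create(path, false)) {
                out.writeBytes("first record\n");
            }

            // Clients may only append to an existing file, never rewrite it
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("appended record\n");
            }
            fs.close();
        }
    }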

Page 8: Hadoop Big Data A big picture

HDFS Architecture

Page 9: Hadoop Big Data A big picture

NameNode Metadata

Metadata in memory
– The entire metadata is kept in main memory
– No demand paging of metadata

Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor

A transaction log
– Records file creations, file deletions, etc.

Page 10: Hadoop Big Data A big picture

DataNode

A block server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to clients

Block report
– Periodically sends a report of all existing blocks to the NameNode

Facilitates pipelining of data
– Forwards data to other specified DataNodes

Page 11: Hadoop Big Data A big picture

Block Placement

Current strategy
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed

Clients read from the nearest replica.

Would like to make this policy pluggable.

Page 12: Hadoop Big Data A big picture

MapReduce

Move the computation to the data, instead of the data to the computation.

Page 13: Hadoop Big Data A big picture

Hadoop Cluster

Page 14: Hadoop Big Data A big picture

MapReduce

Page 15: Hadoop Big Data A big picture

MapReduce Logical Data Flow

Page 16: Hadoop Big Data A big picture

MapReduce flow

Input: the input data/file to be processed.

Split: Hadoop splits the incoming data into smaller pieces called "splits".

Map: MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.

Combine: an optional step, used to improve performance by reducing the amount of data transferred across the network. The combiner is logically the same as the reduce step and aggregates the output of the map() function before it is passed to the subsequent steps.

Shuffle & Sort: the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.

Reduce: aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.

Output: finally, the output of the reduce step is written to a file in HDFS.
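To make this flow concrete, here is the classic word-count job written against Hadoop's Java MapReduce API, with the reducer reused as an optional combiner; the input and output paths come from the command line.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: called once per input line; emits (word, 1) for every token
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce (also reused as the combiner): sums the counts per word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // optional local aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }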

Page 17: Hadoop Big Data A big picture

Data Integration (Sqoop)

Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

RDBMS ↔ HDFS
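As an illustration only (the JDBC URL, credentials, table, and target directory below are placeholders), a typical table import looks like:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 4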

Page 18: Hadoop Big Data A big picture

Data Integration (Flume)

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.
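Flume agents are assembled in a properties file. The following is a minimal sketch, with an assumed agent name (a1) and illustrative paths, that tails a local log file into HDFS:

    # One agent (a1) with a single source -> channel -> sink path
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Source: follow a local log file (command and path are illustrative)
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write the events into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/app-logs
    a1.sinks.k1.channel = c1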

Page 19: Hadoop Big Data A big picture

Data Integration (Chukwa)

A data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Page 20: Hadoop Big Data A big picture

Data Access (Pig)

A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
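As a brief illustration in Pig Latin (the file names are placeholders), the classic word count:

    -- Load each line of text as a single chararray field
    lines  = LOAD 'input.txt' AS (line:chararray);
    -- Split each line into words, one word per record
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- Group identical words and count each group
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO 'wordcount_out';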

Page 21: Hadoop Big Data A big picture

Data Access (Hive)

Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
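For example, a sketch of defining a Hive table over files already in HDFS and running an ad-hoc query; the table name, columns, and location are illustrative:

    -- Define a table over existing HDFS files (no data is moved)
    CREATE EXTERNAL TABLE page_views (
      view_time TIMESTAMP,
      user_id   STRING,
      url       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    -- Ad-hoc summarization, compiled to MapReduce jobs under the hood
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;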

Page 22: Hadoop Big Data A big picture

Data Storage (HBase)

A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
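A minimal sketch of random real-time reads and writes using the HBase Java client (assumes a recent client API and an existing table "users" with column family "info", both illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Random real-time write: one cell in row "u42"
                Put put = new Put(Bytes.toBytes("u42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);

                // Random real-time read of the same row
                Result row = table.get(new Get(Bytes.toBytes("u42")));
                byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }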

Page 23: Hadoop Big Data A big picture

Data Storage (Cassandra)

Built on ideas from Amazon’s Dynamo and Google’s Bigtable, Cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing a highly available service with no single point of failure. Cassandra offers capabilities that relational databases and other NoSQL databases cannot match.
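For a flavor of how Cassandra is used, a minimal CQL sketch; the keyspace, table, and values are illustrative:

    -- Replicate the keyspace across three nodes
    CREATE KEYSPACE IF NOT EXISTS metrics
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- Rows are partitioned by sensor_id and ordered by time within a partition
    CREATE TABLE IF NOT EXISTS metrics.readings (
      sensor_id text,
      ts        timestamp,
      value     double,
      PRIMARY KEY (sensor_id, ts)
    );

    INSERT INTO metrics.readings (sensor_id, ts, value)
      VALUES ('s-17', '2016-01-01 00:00:00', 21.5);

    SELECT * FROM metrics.readings WHERE sensor_id = 's-17';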

Page 24: Hadoop Big Data A big picture

HCatalog

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. It provides read and write interfaces for Pig and MapReduce and uses Hive’s command line interface for issuing data definition and metadata exploration commands. It also presents a REST interface to allow external tools access to Hive DDL (Data Definition Language) operations, such as “create table” and “describe table”.

INTERACTION, VISUALIZATION, EXECUTION, DEVELOPMENT

Page 25: Hadoop Big Data A big picture

Lucene

Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index.

Also available as Lucene.Net.
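A minimal indexing sketch with the Lucene Java API (assumes a recent Lucene version; the index path and field names are illustrative):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneIndexExample {
        public static void main(String[] args) throws Exception {
            // Open (or create) a full-text index in a local directory
            FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                // Add one document with two analyzed text fields
                Document doc = new Document();
                doc.add(new TextField("title", "Hadoop overview", Field.Store.YES));
                doc.add(new TextField("body", "HDFS stores data in large blocks...",
                                      Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }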


Page 26: Hadoop Big Data A big picture

Hama

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.


Page 27: Hadoop Big Data A big picture

Crunch

Simple and efficient MapReduce pipelines: the Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.


Page 28: Hadoop Big Data A big picture

Data serialization (Avro)

Avro is a very popular data serialization format in the Hadoop technology stack. MapReduce jobs in Java, Hadoop Streaming, Pig, and Hive can all read and/or write data in Avro format.
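Avro data is described by a JSON schema that is stored alongside the data. A small illustrative schema (the record and field names are made up):

    {
      "type": "record",
      "name": "PageView",
      "namespace": "com.example",
      "fields": [
        {"name": "userId", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "durationMs", "type": ["null", "long"], "default": null}
      ]
    }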

Page 29: Hadoop Big Data A big picture

Data serialization (Thrift)

Thrift is a framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, and Delphi.

Page 30: Hadoop Big Data A big picture

Data Intelligence (Drill)

A framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery.

Page 31: Hadoop Big Data A big picture

Data Intelligence (Mahout)

A library of machine learning algorithms focused primarily on collaborative filtering, clustering, and classification. Many of the implementations use the Apache Hadoop platform.

Page 32: Hadoop Big Data A big picture

Management & monitoring (Ambari)

A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Page 33: Hadoop Big Data A big picture

Management & monitoring (ZooKeeper)

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable.
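A minimal sketch of publishing and reading a shared configuration value through the ZooKeeper Java client; the connection string and znode path are illustrative:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            // Connect; the watcher here simply ignores session events
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});

            // Publish a configuration value under a znode if it does not exist
            if (zk.exists("/app-config", false) == null) {
                zk.create("/app-config", "threads=8".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any client in the cluster can now read the same value
            Stat stat = new Stat();
            byte[] data = zk.getData("/app-config", false, stat);
            System.out.println(new String(data) + " (version " + stat.getVersion() + ")");
            zk.close();
        }
    }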

Page 34: Hadoop Big Data A big picture

Management & monitoring (Oozie)

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
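A skeletal workflow.xml with a single MapReduce action, sketched after the standard Oozie examples; the names, paths, and properties are illustrative:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="mr-step"/>
      <action name="mr-step">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.input.dir</name>
              <value>/user/demo/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/user/demo/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>MapReduce step failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>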

Page 35: Hadoop Big Data A big picture

Other Tools

SPARK: ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.

STORM: a distributed real-time computation system for processing fast, large streams of data, adding reliable real-time data processing capabilities to Apache Hadoop 2.x.

SOLR: a platform for searching data stored in Hadoop. Solr enables powerful full-text search and near-real-time indexing on many of the world’s largest Internet sites.

TEZ: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.

Page 36: Hadoop Big Data A big picture

Technology Stack / Ecosystem

Page 37: Hadoop Big Data A big picture

Windows Application

Page 38: Hadoop Big Data A big picture

Why Hadoop?

• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day
  – Failure is expected, rather than exceptional
  – The number of nodes in a cluster is not constant
• Need common infrastructure
  – Efficient, reliable, open source (Apache License)
• The goals above are the same as Condor's, but the workloads are IO-bound, not CPU-bound

Page 39: Hadoop Big Data A big picture

Fact #1.

Hadoop consists of multiple products. We talk about Hadoop as if it’s one monolithic thing, but it’s actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.) The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL engine for low-latency data access to HDFS and Hive data.

Page 40: Hadoop Big Data A big picture

Fact #2.

Hadoop is open source but available from vendors, too. Apache Hadoop’s open source software library is available from ASF at www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools, maintenance, and technical support. A handful of vendors offer their own non-Hadoop-based implementations of MapReduce.

Page 41: Hadoop Big Data A big picture

Fact #3.

Hadoop is an ecosystem, not a single product. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products (e.g., database management systems and tools for analytics, reporting, and DI) that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.

Page 42: Hadoop Big Data A big picture

Fact #4.

HDFS is a file system, not a database management system (DBMS). Hadoop is primarily a distributed file system and therefore lacks capabilities we associate with a DBMS, such as indexing, random access to data, support for standard SQL, and query optimization. That’s okay, because HDFS does things DBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.

Page 43: Hadoop Big Data A big picture

Fact #5.

Hive resembles SQL but is not standard SQL. Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.

Page 44: Hadoop Big Data A big picture

Fact #6.

Hadoop and MapReduce are related but don’t require each other. Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.

Page 45: Hadoop Big Data A big picture

Fact #7.

MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic and other applications—not just analytics.

Page 46: Hadoop Big Data A big picture

Fact #8.

Hadoop is about data diversity, not just data volume. Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS and related Hadoop products. After all, many types of big data that require analysis are inherently file based, such as Web logs, XML files, and personal productivity documents.

Page 47: Hadoop Big Data A big picture

Fact #9.

Hadoop complements a DW; it’s rarely a replacement. Most organizations have designed their DWs for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs simply weren’t designed for. Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads at which they excel.

Page 48: Hadoop Big Data A big picture

Fact #10.

Hadoop enables many types of analytics, not just Web analytics. Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples—such as customer base segmentation, fraud detection, and risk analysis—can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.

Page 49: Hadoop Big Data A big picture

THANKS

