Hadoop, HBase, Pig, Hive, Bigtop, Avro, Flume & Whirr are trademarks of the Apache Software Foundation
The Future of Hadoop
Doug Cutting | A Founder of Apache Hadoop
Jeff Hammerbacher | Chief Scientist, Cloudera
Housekeeping
▪ All lines are on mute
▪ Ask questions at any time using the Questions panel on GoToMeeting
▪ Slides and recording will be available on www.cloudera.com/events
©2011 Cloudera, Inc. All Rights Reserved.
Presentation Outline
▪ 1. Context
▪ 2. Apache Bigtop
▪ 3. Apache Hadoop Core
▪ 4. Apache HBase, Hive, and Pig
▪ 5. Other components
▪ Questions and Discussion
1. Context
Context: Data
▪ 1.8 ZB will be created and replicated in 2011
▪ Up 9x in the last five years
▪ More than 90% of this data is unstructured
▪ Enterprises have some liability for 80% of this data
▪ Enterprises will spend $4T on managing data in 2011
▪ Source: IDC Digital Universe Report 2011
Context: Hadoop
▪ Apache Hadoop and related software are designed for this world
▪ Volume
▪ Commodity hardware and open source software lower cost and increase capacity
▪ Velocity
▪ Data ingest speed aided by append-only and schema-on-read design
▪ Variety
▪ Multiple tools to structure, process, and access data
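A minimal sketch of the append-only, schema-on-read idea from the Velocity bullet. It assumes tab-separated raw records; the field names and the helpers `append_record` and `read_with_schema` are hypothetical and are not part of Hadoop.

```python
import io

def append_record(log, line):
    """Ingest is a plain append: no parsing or validation at write time."""
    log.write(line.rstrip("\n") + "\n")

def read_with_schema(log, schema):
    """Apply a schema only when reading: split each raw line and cast fields."""
    log.seek(0)
    for raw in log:
        parts = raw.rstrip("\n").split("\t")
        # zip() stops at the shorter input, so a narrower schema ignores extra fields.
        yield {name: cast(value) for (name, cast), value in zip(schema, parts)}

log = io.StringIO()  # stands in for an append-only file
append_record(log, "alice\t3\t2011-08-01")
append_record(log, "bob\t5\t2011-08-02")

# Two readers interpret the same raw data differently.
schema_a = [("user", str), ("clicks", int), ("day", str)]
schema_b = [("user", str), ("clicks", float)]
rows_a = list(read_with_schema(log, schema_a))
rows_b = list(read_with_schema(log, schema_b))
```

Because nothing is parsed at write time, ingest cost stays low, and a new interpretation of old data needs only a new schema, not a reload.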
Context: HDFS and MapReduce
▪ Apache Hadoop = HDFS + MapReduce
▪ Similar to kernel of an operating system
▪ Referred to as “Hadoop Core”
▪ Related components are often deployed with Hadoop
▪ For example: HBase, Hive, Pig, Oozie, Flume, Sqoop
▪ Together, these components form a “Hadoop Stack”
▪ Not all components must be deployed
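To make the MapReduce half of "HDFS + MapReduce" concrete, here is a single-process word count that imitates the map, shuffle, and reduce phases. It is only an illustration of the programming model: in a real cluster the input lines would come from files in HDFS and each phase would run in parallel across machines. All function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```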
Context: Bigtop
▪ What standards should all components follow?
▪ How can we ensure all components of the stack work together?
▪ How can we find the right version of each component?
▪ How can we make it easy to install an additional component?
2. Apache Bigtop
Apache Bigtop
▪ Now incubating at Apache
▪ Hadoop ecosystem-wide project, including:
▪ Interoperability testing of components
▪ Packaging of compatible versions of components
▪ Like Fedora, Debian, or CentOS for the Hadoop ecosystem
▪ Releases are not a single artifact
▪ Rather a set of interdependent, compatible components
Apache Bigtop
▪ Current components
▪ Hadoop
▪ HBase
▪ Hive
▪ Pig
▪ Oozie
▪ Sqoop
▪ Flume
▪ ZooKeeper
▪ Whirr
Apache Bigtop
▪ Outputs
▪ Source
▪ RPM
▪ Deb
▪ Tests
▪ Integration
▪ Package
▪ Smoke
▪ Release 0.1.0 under vote now!
3. Apache Hadoop Core
Apache Hadoop Core
▪ Current stable releases based on branches from 0.20
▪ Upcoming release: 0.22
▪ Includes both security and new implementation of append
▪ Not expected to be run at scale or commercially supported
▪ Nearly ready for vote
▪ Upcoming release: 0.23
▪ Build and dependency management moved to Maven
▪ Branch to happen soon
HDFS
▪ Robustness
▪ HDFS-1073: Checkpointing of image and edits log
▪ Availability
▪ HDFS-1623: High availability
▪ Performance
▪ HDFS-941: Faster random reads
▪ HDFS-2080: Faster checksums
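For context on the checksum item: HDFS detects corruption by keeping a checksum for each fixed-size chunk of block data. The sketch below shows only what is being computed, using CRC32 from the Python standard library. The slide does not describe how HDFS-2080 makes this faster, so nothing here reflects that change. The 512-byte chunk size and the helper names are assumptions for illustration.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch

def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC32 per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def find_corrupt_chunks(data, expected, chunk_size=BYTES_PER_CHECKSUM):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

block = bytes(range(256)) * 8      # 2048 bytes, i.e. 4 chunks
sums = chunk_checksums(block)      # stored alongside the data at write time

damaged = bytearray(block)
damaged[1000] ^= 0xFF              # corrupt one byte inside chunk 1
bad = find_corrupt_chunks(bytes(damaged), sums)
```

Per-chunk checksums let a reader verify only the chunks it touches, which is why checksum cost shows up directly in read performance.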
HDFS
▪ Scalability
▪ HDFS-1052: Federation of the NameNode
▪ Source of diagram: http://www.hortonworks.com/an-introduction-to-hdfs-federation/
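With federation, several NameNodes each manage part of the namespace, so a client needs some way to decide which NameNode owns a given path. One way to picture this is a client-side mount table with longest-prefix matching. The sketch below assumes that approach; the mount points and NameNode names are hypothetical.

```python
# Hypothetical mount table: path prefix -> NameNode that owns it.
MOUNT_TABLE = {
    "/user": "namenode-a",
    "/data": "namenode-b",
    "/data/logs": "namenode-c",
}

def resolve(path, mounts=MOUNT_TABLE):
    """Return the NameNode for the longest mount point that is a prefix of path."""
    best = None
    for mount in mounts:
        # Match whole path components only, so "/data" does not match "/database".
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError("no NameNode mounted for " + path)
    return mounts[best]
```

For example, `resolve("/data/logs/2011/08")` picks `namenode-c` over `namenode-b` because `/data/logs` is the longer matching prefix.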
MapReduce
▪ Modularity
▪ MAPREDUCE-279: MapReduce 2.0
▪ Break JobTracker into ResourceManager and ApplicationMaster
▪ Replace TaskTracker with NodeManager
▪ Source of diagram: http://www.odbms.org/download/dean-keynote-ladis2009.pdf
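The split of responsibilities in MapReduce 2.0 can be pictured with a toy model: a cluster-wide ResourceManager that only hands out containers, a per-job ApplicationMaster that decides what runs in them, and a NodeManager on each machine. This is a sketch of the division of labor only. The class bodies, method names, and slot counts are hypothetical, and scheduling policy, container release, and failure handling are all omitted.

```python
class NodeManager:
    """Per-machine agent that hosts containers (takes over from the TaskTracker)."""
    def __init__(self, name, slots):
        self.name = name
        self.free = slots

class ResourceManager:
    """Cluster-wide allocator: grants containers and knows nothing about jobs."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, count):
        granted = []
        for node in self.nodes:
            while node.free > 0 and len(granted) < count:
                node.free -= 1
                granted.append(node.name)
        return granted

class ApplicationMaster:
    """Per-job coordinator: requests containers and tracks its own tasks."""
    def __init__(self, rm, tasks):
        self.rm = rm
        self.pending = list(tasks)
        self.placement = {}

    def run(self):
        while self.pending:
            containers = self.rm.allocate(len(self.pending))
            if not containers:
                break  # cluster is full; a real system would wait and retry
            for node_name in containers:
                self.placement[self.pending.pop(0)] = node_name
        return self.placement

rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 1)])
job = ApplicationMaster(rm, ["map-0", "map-1", "reduce-0", "reduce-1"])
placement = job.run()
```

Because job logic lives in the ApplicationMaster, the same ResourceManager could serve frameworks other than MapReduce.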
MapReduce
▪ Potential New Frameworks
▪ MAPREDUCE-2719: Distributed shell
▪ MAPREDUCE-2720: Distributed Java commands
▪ MPI: Communication-intensive parallelism
▪ Fast scans and aggregations
▪ OpenDremel
▪ Bulk Synchronous Parallel
▪ Giraph, Golden Orb, Hama, et al.
▪ Actor Model (streaming)
▪ S4, Akka, Storm, et al.
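As an illustration of the Bulk Synchronous Parallel model named above, the sketch below propagates the maximum value through a small graph. Each superstep has a compute phase in which every vertex reads the messages sent to it in the previous step, followed by a barrier after which new messages become visible. This is not the API of Giraph, Golden Orb, or Hama; the function name and graph are hypothetical, and the graph is assumed to list every vertex as a key.

```python
def bsp_max(graph, values):
    """Spread the largest value to every vertex using synchronized supersteps."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():       # compute phase, one vertex at a time
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:              # only vertices that changed send again
                    outbox[n].append(values[v])
        inbox = outbox                          # barrier: deliver messages for next step
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = bsp_max(graph, {"a": 3, "b": 6, "c": 2})
```

The computation stops when a superstep produces no messages, which is the usual termination condition in this model.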
4. HBase, Hive, and Pig
Apache HBase
▪ Upcoming release: 0.92.0
▪ Server-side triggers
▪ HBASE-2000: Coprocessors
▪ Availability
▪ HBASE-1730/4213: Online schema changes
▪ Performance
▪ HBASE-3857: HFile 2.0
▪ HBase book in September!
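The "server-side triggers" idea behind coprocessors can be sketched as hooks that run on the server before and after a write. HBase coprocessors are written in Java against HBase's own interfaces; the Python below is not that API. It is a toy model of the pattern, and the class and hook names are hypothetical.

```python
class Table:
    """Toy table that runs registered observers around every put."""
    def __init__(self):
        self.rows = {}
        self.observers = []

    def register(self, observer):
        self.observers.append(observer)

    def put(self, row_key, column, value):
        # Pre-hooks may veto the write by returning False.
        for obs in self.observers:
            if obs.pre_put(row_key, column, value) is False:
                return False
        self.rows.setdefault(row_key, {})[column] = value
        # Post-hooks run only after the write has been applied.
        for obs in self.observers:
            obs.post_put(row_key, column, value)
        return True

class RejectEmptyValues:
    def pre_put(self, row_key, column, value):
        return value != ""
    def post_put(self, row_key, column, value):
        pass

class AuditLog:
    def __init__(self):
        self.entries = []
    def pre_put(self, row_key, column, value):
        return True
    def post_put(self, row_key, column, value):
        self.entries.append((row_key, column))

table = Table()
audit = AuditLog()
table.register(RejectEmptyValues())
table.register(audit)
ok = table.put("row1", "cf:name", "hadoop")
rejected = table.put("row2", "cf:name", "")
```

Running such logic next to the data avoids a round trip to the client for validation or bookkeeping.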
Apache Hive
▪ Upcoming release: 0.8
▪ Data transfer
▪ HIVE-306: INSERT INTO
▪ HIVE-1918: EXPORT/IMPORT
▪ Indexes
▪ HIVE-1644: Automatically use indexes
▪ HIVE-1803: Bitmap indexes
▪ Data formats
▪ HIVE-895: Avro support
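For readers unfamiliar with bitmap indexes: the idea is to keep, for each distinct column value, one bit per row that marks whether the row has that value. Filters on several columns then become bitwise operations. The sketch below shows the general technique with Python integers as bitsets. It does not reflect how HIVE-1803 stores or uses its indexes, and the table contents are made up for illustration.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Map each distinct value of column to an integer whose bit i is set for row i."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

rows = [
    {"country": "US", "status": "active"},
    {"country": "DE", "status": "active"},
    {"country": "US", "status": "closed"},
    {"country": "US", "status": "active"},
]
by_country = build_bitmap_index(rows, "country")
by_status = build_bitmap_index(rows, "status")

# WHERE country = 'US' AND status = 'active' becomes a single bitwise AND.
hits = matching_rows(by_country["US"] & by_status["active"])
```

Bitmap indexes work best on columns with few distinct values, where each bitmap is dense enough to be worth keeping.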
Apache Pig
▪ Recent release: 0.9
▪ Scripting
▪ PIG-1479: Embedding Pig in Python
▪ PIG-1793: Macro expansion
▪ Debugging
▪ PIG-1712: ILLUSTRATE rework
▪ Data formats
▪ PIG-1748: Avro support
5. Other Components
Other Components
▪ Apache Incubator
▪ Sqoop, Flume, and Oozie now incubating
▪ Whirr graduated to a top-level Apache project
▪ Apache Avro
▪ Interoperability with Protocol Buffers and Thrift
▪ Column-oriented file format
▪ Python MapReduce implementation
▪ Apache ZooKeeper
▪ Multi-update
▪ Kerberos authentication of clients
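The point of a multi-update is that a batch of operations is applied atomically: either every operation succeeds or none takes effect. The sketch below shows that all-or-nothing behaviour on a plain dictionary. It is not the ZooKeeper client API; the `multi` function, the tuple format for operations, and `MultiError` are hypothetical, and real znodes also carry versions and a hierarchy that are ignored here.

```python
class MultiError(Exception):
    """Raised when any operation in a batch cannot be applied."""

def multi(store, ops):
    """Apply every operation in ops to store, or leave store untouched."""
    staged = dict(store)  # work on a copy so a failure changes nothing
    for op, path, *args in ops:
        if op == "create":
            if path in staged:
                raise MultiError("node exists: " + path)
            staged[path] = args[0]
        elif op == "set":
            if path not in staged:
                raise MultiError("no such node: " + path)
            staged[path] = args[0]
        elif op == "delete":
            if path not in staged:
                raise MultiError("no such node: " + path)
            del staged[path]
        else:
            raise ValueError("unknown operation: " + op)
    # Commit: every operation succeeded, so publish the staged state.
    store.clear()
    store.update(staged)

store = {"/lock": "worker-1"}
multi(store, [("create", "/job", "started"), ("delete", "/lock")])
```

A batch such as `[("set", "/job", "done"), ("delete", "/missing")]` raises `MultiError` and leaves `store` exactly as it was, even though the first operation on its own would have succeeded.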
Q & A
Visit www.hadoopworld.com
• November 8-9, 2011 in New York City
• Early bird discount ends September 5, 2011
Enter Today: www.facebook.com/cloudera
• Click the “Be a Cloudera Hero for Apache Hadoop” tab
• Share what you think Apache Hadoop can do for you
• Win a personal hackathon with Doug Cutting in San Francisco, CA