Date posted: 15-Jan-2015
Category: Technology
Uploaded by: koolkalpz
Tech - Meet
Kalpesh Pradhan (@kalpeshpradhan)
Sr. Developer, Hungama Digital Media Pvt. Ltd.
• Designed a solution for migrating a SQL database to NoSQL
• Implemented a search engine using Apache Cassandra and Solr
• Designed a solution for bringing social analytics in-house using Apache Cassandra
Who Am I?
Big data means a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data-processing applications.
In a survey by Gartner, the limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.
1 Exabyte = 1 048 576 terabytes
What is Big Data?
Science
◦ Meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research
Technology
◦ Internet search, social networks, server logs, user-action tracking on websites
Other Sources
◦ Stock markets, e-commerce transactions
Sources for Big Data?
Big data is changing the way people within organizations work together. It is creating a culture in which business and IT leaders must join forces to realize value from all data.
What is Big Data changing?
Insights from big data can enable organizations to:
◦ Make better decisions
◦ Deepen customer engagement
◦ Optimize operations
◦ Prevent threats and fraud
What is Big Data changing?
Competitive Advantage
• Data is emerging as the world's newest resource for competitive advantage.
• Analysis of data gives an organization a competitive edge in deriving its strategy.
• Example: Presenting a website according to the user's history.
Decision Making
• Big data helps an organization make decisions in a smarter way.
• Example: Jubilant FoodWorks
◦ Collects user information and orders
◦ Analyzes the data
◦ Predicts when a particular user will come back to order
◦ Predicts what the user is likely to order, based on past orders
◦ Equips the call-center agent with data that helps the customer order
Case Study: Domino's
Value of Data
• Data is always collected as raw information.
• The challenge is to derive value out of the collected data.
• Computing relevance from the collected data is a challenge.
• Example: Targeting a customer with a new credit-card scheme based on transaction history.
• NoSQL ("Not so SQL")
• Apache Hadoop
• Apache Cassandra
• MapReduce
• Apache HBase
• Apache Hive
• Pig Latin
Technologies
Yahoo
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is used in every Yahoo! web-search query.
Prominent Users
Facebook
In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 petabytes of storage. On June 13, 2012, they announced that the data had grown to 100 petabytes, and on November 8, 2012, they announced that the data gathered in the warehouse was growing by roughly half a petabyte per day.
1 petabyte = 1 048 576 gigabytes
Prominent Users
Apache Cassandra
• Free, open-source, NoSQL distributed database system
• Manages large amounts of structured, semi-structured, and unstructured data
• Scales to a very large size across many commodity servers with no single point of failure
• Allows maximum flexibility and performance at scale
About Cassandra
Prerequisites
Memory
◦ A minimum of 8GB of RAM is needed
◦ Recommended: 16GB – 32GB
◦ Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower (a larger heap has longer, more intense garbage-collection periods)
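The heap-sizing rule above can be sketched as a tiny helper (the function name and sample RAM sizes are hypothetical):

```python
def recommended_heap_gb(total_ram_gb):
    """Max heap = the lower of 8 GB or half of total RAM, per the guideline above."""
    return min(8, total_ram_gb / 2)

print(recommended_heap_gb(32))  # 8 (half of 32 GB exceeds the 8 GB cap)
print(recommended_heap_gb(12))  # 6.0 (half of total RAM is the lower value)
```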
Hardware
CPU
◦ Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound
◦ For dedicated hardware, 8-core processors are the current price-performance sweet spot
Hardware
Disk
◦ Ideally, Cassandra needs at least two disks: one for the commit log and one for the data directories. At a minimum, the commit log should be on its own partition.
◦ Commit log disk: does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O)
Hardware
Disk
◦ Most workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM)
◦ Use one or more disks, and make sure they are large enough for the data volume and fast enough both to satisfy reads that are not cached in memory and to keep up with compaction
Hardware
Number of Nodes
◦ Using a greater number of smaller nodes is better than using fewer larger nodes, because of potential bottlenecks on larger nodes during compaction
Hardware
Network
◦ Choose reliable, redundant network interfaces, and make sure your network can handle traffic between nodes without bottlenecks
◦ Recommended bandwidth is 1000 Mbit/s (gigabit) or greater
◦ Bind the Thrift interface (listen address) to a specific NIC (Network Interface Card)
Hardware
Network Ports
OpsCenter-specific:
◦ 50031 — OpsCenter HTTP proxy for Job Tracker
◦ 61620 — OpsCenter intra-node monitoring port
◦ 61621 — OpsCenter agent port
Intra-node ports: 1024+
Public ports:
◦ 22 — SSH
◦ 8888 — OpsCenter
◦ 7000 — Cassandra intra-node port
◦ 9160 — Cassandra client port
Hardware
Calculating Data Size
◦ As with all data-storage systems, the size of your raw data will be larger once it is loaded into Cassandra, due to storage overhead
◦ On average, raw data will be about twice as large on disk after it is loaded into the database, but it could be much smaller or larger depending on the characteristics of your data and column families
Hardware
Column Overhead — every column incurs 15 bytes of overhead. Since each row in a column family can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, add an additional 8 bytes (23 bytes of column overhead). So the total size of a regular column is:
◦ total_column_size = column_name_size + column_value_size + 15
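A quick worked example of the column-size formula above (the column name and value are hypothetical):

```python
def column_size(name, value, counter_or_expiring=False):
    """Per-column on-disk size estimate from the formula above:
    name bytes + value bytes + 15 bytes overhead (23 for counter/expiring columns)."""
    overhead = 23 if counter_or_expiring else 15
    return len(name.encode()) + len(value.encode()) + overhead

# A regular column named "email" holding "user@example.com":
print(column_size("email", "user@example.com"))  # 5 + 16 + 15 = 36 bytes
```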
Hardware
Row Overhead - Just like columns, every row also incurs some overhead when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.
Hardware
Primary Key Index — every column family also maintains a primary index of its row keys. Primary-index overhead becomes more significant when you have lots of skinny rows. The size of the primary row-key index can be estimated as follows (in bytes):
◦ primary_key_index = number_of_rows * (32 + average_key_size)
Hardware
Replication Overhead — the replication factor plays an obvious role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (only one copy of your data is stored in the cluster). If the replication factor is greater than 1, your total data-storage requirement will include replication overhead:
◦ replication_overhead = total_data_size * (replication_factor - 1)
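The sizing formulas above can be combined into a rough estimator. This is a sketch, not an official DataStax tool; it assumes column overhead is already included in the raw-data figure, and the sample numbers (1 GB of data, one million rows, 16-byte keys, replication factor 3) are hypothetical:

```python
def estimate_disk_usage(raw_data_bytes, number_of_rows, average_key_size,
                        replication_factor):
    """Rough disk-size estimate combining the overhead formulas above:
    23 bytes of row overhead per row, a primary index of
    number_of_rows * (32 + average_key_size) bytes, and replication
    overhead of total_data_size * (replication_factor - 1)."""
    row_overhead = number_of_rows * 23
    primary_key_index = number_of_rows * (32 + average_key_size)
    total_data_size = raw_data_bytes + row_overhead + primary_key_index
    replication_overhead = total_data_size * (replication_factor - 1)
    return total_data_size + replication_overhead

# Hypothetical cluster: 1 GB raw data, 1M rows, 16-byte keys, RF = 3
print(estimate_disk_usage(2**30, 1_000_000, 16, 3))
```

Note that adding the replication overhead back to the total is equivalent to multiplying the single-copy size by the replication factor.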
Hardware
Java Prerequisites
◦ Before installing Cassandra on Linux, Windows, or Mac, ensure that you have the most up-to-date version of Java installed on your machine
Software
Download the "DataStax Community Edition Server", a bundle containing the most up-to-date version of Cassandra along with all the utilities and tools we will need. You can also download it directly from a terminal window, using wget on Linux or curl on Mac, from the following URL:
http://downloads.datastax.com/community/dsc.tar.gz
Software
◦ A keyspace is the container for application data, similar to a database or schema in a relational database.
◦ Inside the keyspace are one or more column-family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Rows in a column family are not required to have the same set of columns.
Cassandra Data Model
Cassandra does not enforce relationships between column families the way that relational databases do between tables.
There are no formal foreign keys in Cassandra, and joining column families at query time is not supported.
Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from the application.
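As a rough illustration of the hierarchy described above, the keyspace → column family → row → column structure can be pictured as nested maps. The keyspace, column-family, and column names here are hypothetical examples, not a real schema:

```python
# Hypothetical sketch: Cassandra's keyspace/column-family/row/column
# hierarchy modeled as nested Python dicts. Note that the two rows
# carry different sets of columns, which Cassandra allows.
keyspace = {
    "users": {                      # column family (analogous to a table)
        "row-key-1": {              # application-supplied row key
            "name": "Alice",
            "email": "alice@example.com",
        },
        "row-key-2": {              # a different set of columns is fine
            "name": "Bob",
            "city": "Mumbai",
        },
    }
}

print(keyspace["users"]["row-key-2"]["city"])  # Mumbai
```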
Cassandra Data Model
http://en.wikipedia.org/wiki/Apache_Cassandra
http://www.datastax.com/docs/1.0/index
http://www.cloudera.com
Sources
Q & A