Date posted: 15-Jan-2015
Category: Technology
Uploaded by: koolkalpz
Tech - Meet
Kalpesh Pradhan (@kalpeshpradhan)
Sr. Developer, Hungama Digital Media Pvt. Ltd.
• Designed a solution for migrating a SQL database to NoSQL
• Implemented a search engine using Apache Cassandra and Solr
• Designed a solution for bringing social analytics in-house using Apache Cassandra
Who Am I?
Big data means a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data-processing applications.
In a survey by Gartner, the limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.
1 Exabyte = 1 048 576 terabytes
What is Big Data?
Science
◦ Meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research
Technology
◦ Internet search, social networks, server logs, user-action tracking on websites
Other Sources
◦ Stock markets, e-commerce transactions
Sources for Big Data?
Big data is changing the way people within organizations work together. It is creating a culture in which business and IT leaders must join forces to realize value from all data.
What is Big Data changing?
Insights from big data can enable organizations to:
◦ Make better decisions
◦ Deepen customer engagement
◦ Optimize operations
◦ Prevent threats and fraud
What is Big Data changing?
Competitive Advantage
• Data is emerging as the world's newest resource for competitive advantage.
• Analysis of data gives an organization a competitive edge in deriving its strategy.
• Example: Presenting a website according to the user's history.
Decision Making
• Big data helps an organization make decisions in a smarter way.
• Example: Jubilant FoodWorks
◦ Collects user information and orders
◦ Analyzes the data
◦ Predicts when a particular user will come back to order
◦ Predicts what the user is likely to order, based on past orders
◦ Equips the call-center agent with data that helps the customer order
Case Study: Domino's
Value of Data
• Data is always collected as raw information.
• The challenge is to derive value out of the collected data.
• Computing relevance from the collected data is a challenge.
• Example: Targeting a customer with a new credit-card scheme based on transaction history.
• NoSQL ("Not so SQL")
• Apache Hadoop
• Apache Cassandra
• MapReduce
• Apache HBase
• Apache Hive
• Pig Latin
Technologies
Yahoo
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is used in every Yahoo! web-search query.
Prominent Users
Facebook
In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 petabytes of storage. On June 13, 2012, they announced that the data had grown to 100 petabytes, and on November 8, 2012, they announced that the data gathered in the warehouse was growing by roughly half a petabyte per day.
1 petabyte = 1 048 576 gigabytes
Prominent Users
Apache Cassandra
• Free, open-source, NoSQL distributed database system
• Manages large amounts of structured, semi-structured, and unstructured data
• Scales to a very large size across many commodity servers with no single point of failure
• Allows maximum flexibility and performance at scale
About Cassandra
Prerequisites
Memory
◦ A minimum of 8GB of RAM is needed
◦ Recommended: 16GB – 32GB
◦ Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower (a larger heap has longer, more intense garbage-collection periods)
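The heap-sizing rule above can be sketched as a tiny helper (the function name and sample RAM sizes are hypothetical):

```python
def recommended_heap_gb(total_ram_gb):
    """Max heap = the lower of 8 GB or half of total RAM, per the guideline above."""
    return min(8, total_ram_gb / 2)

print(recommended_heap_gb(32))  # 8 (half of 32 GB exceeds the 8 GB cap)
print(recommended_heap_gb(12))  # 6.0 (half of total RAM is the lower value)
```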
Hardware
CPU
◦ Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound
◦ For dedicated hardware, 8-core processors are the current price-performance sweet spot
Hardware
Disk
◦ Ideally, Cassandra needs at least two disks: one for the commit log and one for the data directories. At a minimum, the commit log should be on its own partition.
◦ Commit log disk: does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O)
Hardware
Disk
◦ Most workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM)
◦ Use one or more disks, and make sure they are large enough for the data volume and fast enough both to satisfy reads that are not cached in memory and to keep up with compaction
Hardware
Number of Nodes
◦ Using a greater number of smaller nodes is better than using fewer larger nodes, because of potential bottlenecks on larger nodes during compaction
Hardware
Network
◦ Choose reliable, redundant network interfaces, and make sure your network can handle traffic between nodes without bottlenecks
◦ Recommended bandwidth is 1000 Mbit/s (gigabit) or greater
◦ Bind the Thrift interface (listen address) to a specific NIC (Network Interface Card)
Hardware
Network Ports
OpsCenter-specific:
◦ 50031 — OpsCenter HTTP proxy for Job Tracker
◦ 61620 — OpsCenter intra-node monitoring port
◦ 61621 — OpsCenter agent port
Intra-node ports: 1024+
Public ports:
◦ 22 — SSH
◦ 8888 — OpsCenter
◦ 7000 — Cassandra intra-node port
◦ 9160 — Cassandra client port
Hardware
Calculating Data Size
◦ As with all data-storage systems, the size of your raw data will be larger once it is loaded into Cassandra, due to storage overhead
◦ On average, raw data will be about twice as large on disk after it is loaded into the database, but it could be much smaller or larger depending on the characteristics of your data and column families
Hardware
Column Overhead — every column incurs 15 bytes of overhead. Since each row in a column family can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, add an additional 8 bytes (23 bytes of column overhead). So the total size of a regular column is:
◦ total_column_size = column_name_size + column_value_size + 15
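A quick worked example of the column-size formula above (the column name and value are hypothetical):

```python
def column_size(name, value, counter_or_expiring=False):
    """Per-column on-disk size estimate from the formula above:
    name bytes + value bytes + 15 bytes overhead (23 for counter/expiring columns)."""
    overhead = 23 if counter_or_expiring else 15
    return len(name.encode()) + len(value.encode()) + overhead

# A regular column named "email" holding "user@example.com":
print(column_size("email", "user@example.com"))  # 5 + 16 + 15 = 36 bytes
```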
Hardware
Row Overhead - Just like columns, every row also incurs some overhead when stored on disk. Every row in Cassandra incurs 23 bytes of overhead.
Hardware
Primary Key Index — every column family also maintains a primary index of its row keys. Primary-index overhead becomes more significant when you have lots of skinny rows. The size of the primary row-key index can be estimated as follows (in bytes):
◦ primary_key_index = number_of_rows * (32 + average_key_size)
Hardware
Replication Overhead — the replication factor plays an obvious role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (only one copy of your data is stored in the cluster). If the replication factor is greater than 1, your total data-storage requirement will include replication overhead:
◦ replication_overhead = total_data_size * (replication_factor - 1)
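The sizing formulas above can be combined into a rough estimator. This is a sketch, not an official DataStax tool; it assumes column overhead is already included in the raw-data figure, and the sample numbers (1 GB of data, one million rows, 16-byte keys, replication factor 3) are hypothetical:

```python
def estimate_disk_usage(raw_data_bytes, number_of_rows, average_key_size,
                        replication_factor):
    """Rough disk-size estimate combining the overhead formulas above:
    23 bytes of row overhead per row, a primary index of
    number_of_rows * (32 + average_key_size) bytes, and replication
    overhead of total_data_size * (replication_factor - 1)."""
    row_overhead = number_of_rows * 23
    primary_key_index = number_of_rows * (32 + average_key_size)
    total_data_size = raw_data_bytes + row_overhead + primary_key_index
    replication_overhead = total_data_size * (replication_factor - 1)
    return total_data_size + replication_overhead

# Hypothetical cluster: 1 GB raw data, 1M rows, 16-byte keys, RF = 3
print(estimate_disk_usage(2**30, 1_000_000, 16, 3))
```

Note that adding the replication overhead back to the total is equivalent to multiplying the single-copy size by the replication factor.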
Hardware
Java Prerequisites
◦ Before installing Cassandra on Linux, Windows, or Mac, ensure that you have the most up-to-date version of Java installed on your machine
Software
Download the "DataStax Community Edition Server", a bundle containing the most up-to-date version of Cassandra along with all the utilities and tools we will need. You can also download it directly from a terminal window, using wget on Linux or curl on Mac, from the following URL:
http://downloads.datastax.com/community/dsc.tar.gz
Software
◦ A keyspace is the container for application data, similar to a database or schema in a relational database.
◦ Inside the keyspace are one or more column-family objects, which are analogous to tables. Column families contain columns, and a set of related columns is identified by an application-supplied row key. Rows in a column family are not required to have the same set of columns.
Cassandra Data Model
Cassandra does not enforce relationships between column families the way that relational databases do between tables.
There are no formal foreign keys in Cassandra, and joining column families at query time is not supported.
Each column family has a self-contained set of columns that are intended to be accessed together to satisfy specific queries from the application.
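As a rough illustration of the hierarchy described above, the keyspace → column family → row → column structure can be pictured as nested maps. The keyspace, column-family, and column names here are hypothetical examples, not a real schema:

```python
# Hypothetical sketch: Cassandra's keyspace/column-family/row/column
# hierarchy modeled as nested Python dicts. Note that the two rows
# carry different sets of columns, which Cassandra allows.
keyspace = {
    "users": {                      # column family (analogous to a table)
        "row-key-1": {              # application-supplied row key
            "name": "Alice",
            "email": "alice@example.com",
        },
        "row-key-2": {              # a different set of columns is fine
            "name": "Bob",
            "city": "Mumbai",
        },
    }
}

print(keyspace["users"]["row-key-2"]["city"])  # Mumbai
```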
Cassandra Data Model
http://en.wikipedia.org/wiki/Apache_Cassandra
http://www.datastax.com/docs/1.0/index
http://www.cloudera.com
Sources
Q & A