Date post: 17-Feb-2018
Upload: phungmien
Page 1: NoSQL database - · PDF fileHbase NoSQL Database System Architecture. ... In a batch-processing systemsuch as Hadoop, ... it wasintended simply as a catchy Twitter hashtag for a meetup

NoSQL database

• NoSQL, also read as “Not Only SQL”, is a current approach to large-scale, distributed data management and database design.

• Its name is easily misunderstood to mean “not SQL”. On the contrary, NoSQL does not necessarily avoid SQL.

• Some NoSQL systems are entirely non-relational.

• Other NoSQL systems simply avoid selected relational functionality, such as fixed table schemas and join operations.

• Some analytic platforms, such as SQLstream and Cloudera Impala, still use SQL in their database systems, because SQL is a reliable and simple query language that performs well in real-time analytics on streaming Big Data.

• The mainstream Big Data platforms adopt NoSQL to break away from the rigidity of normalized RDBMS schemas.


NoSQL database for Unstructured or Non-relational Data

• Data storage and data management are separated into two independent parts.

• This is in contrast to relational databases.

i) In the storage part, which is also called key-value storage, NoSQL focuses on scalable, high-performance data storage.

ii) In the management part, NoSQL provides a low-level access mechanism.

• Data management tasks can be implemented in the application layer, rather than having data management logic spread across SQL or DB-specific stored procedure languages.

• NoSQL systems are very flexible for data modeling.

• NoSQL systems make application deployments easy to update.
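
The split between a storage part and a management part can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `KeyValueStore` is a toy in-memory stand-in, not the API of any real NoSQL product, and `save_user` is a hypothetical application-layer function.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Storage part: a toy in-memory key-value store with low-level access only."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


def save_user(store: KeyValueStore, user_id: int, profile: Dict[str, Any]) -> None:
    """Management part: validation lives in application code, not in the store."""
    if not profile.get("name"):
        raise ValueError("a user profile needs a non-empty name")
    store.put(f"user:{user_id}", profile)


store = KeyValueStore()
save_user(store, 1, {"name": "alice", "city": "Paris"})
# A differently shaped record is accepted without changing any table definition.
save_user(store, 2, {"name": "bob", "languages": ["en", "tr"]})
```

The store itself knows nothing about users. The rule that a profile needs a name is enforced only by the application function, which is the trade-off the slide describes.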


HBase NoSQL Database System Architecture (Apache Hadoop)

HBase is one of the most widely used NoSQL databases.


NoSQL database for unstructured or non-relational data

• An important property of most NoSQL databases is that they are commonly schema-free.

• The biggest advantage of schema-free databases is that applications can quickly modify the structure of data without rewriting tables.

• They offer greater flexibility when structured data is stored heterogeneously.

• Data integrity and validity are enforced in the data management layer.


NoSQL database for unstructured or non-relational data

• One of the most popular NoSQL databases is Apache Cassandra.

• Cassandra was released as open source in 2008.

• Cassandra was originally a proprietary Facebook database.

• Other NoSQL implementations and related technologies include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort.

• Companies that use NoSQL include Twitter, LinkedIn and Netflix.


Reliable, Scalable and Maintainable Applications

• A data-intensive application is built from standard building blocks.

• These blocks provide commonly needed functionality.

• For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)

  • Remember the result of an expensive operation, to speed up reads (caches)

  • Allow users to search data by keyword or filter it in various ways (search indexes)

  • Send a message to another process, to be handled asynchronously (stream processing)

  • Periodically crunch a large amount of accumulated data (batch processing)


• Traditional data systems are such a successful abstraction that we use them all the time without thinking too much.

• When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.

But reality is not that simple…

• There are many database systems with different characteristics, because different applications have different requirements.

• There are various approaches to caching, several ways of building search indexes, and so on.

• We still need to figure out which tools and which approaches are the most appropriate for the task at hand.

• It can be hard to combine several tools when you need to do something that a single tool cannot do alone.


An Example

• If you have an application-managed caching layer (using memcached or similar),

• or a full-text search server separate from your main database (such as Elasticsearch or Solr),

• it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.
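
What keeping a cache in sync means for application code can be sketched as follows. This is a minimal illustration under stated assumptions: two plain dictionaries stand in for the main database and for memcached, and the `read` and `write` helpers are hypothetical, not part of any client library.

```python
from typing import Dict, Optional

database: Dict[str, str] = {}   # stand-in for the main database
cache: Dict[str, str] = {}      # stand-in for memcached or similar


def read(key: str) -> Optional[str]:
    """Read through the cache, filling it from the database on a miss."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


def write(key: str, value: str) -> None:
    """Write to the database, then invalidate the cached copy."""
    database[key] = value
    cache.pop(key, None)


write("greeting", "hello")
first = read("greeting")     # miss: loaded from the database and cached
write("greeting", "goodbye")
second = read("greeting")    # the stale "hello" was invalidated by the write
```

If `write` skipped the invalidation step, the second read would still return the old value. Preventing that kind of inconsistency is the application's job here.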


An Architecture for a Data System that Combines Several Components


Data Systems

• When we combine several tools in order to provide a service, the service’s interface or API usually hides those implementation details from clients.

• We have then essentially created a new, special-purpose data system from smaller, general-purpose components.

• Our composite data system may provide certain guarantees, for example:

  • the cache will be correctly invalidated or updated on writes,

  • so that outside clients see consistent results.

• We are then not only application developers, but also data system designers.


Reliability, Scalability, Maintainability

Reliability

• The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Scalability

• As the system grows (in data volume, traffic volume or complexity), there should be reasonable ways of dealing with that growth.

Maintainability

• Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.


Describing Load

• Load can be described with a few numbers, which we call load parameters.

• The best choice of parameters depends on the architecture of your system. It may be:

  • requests per second to a web server,

  • the ratio of reads to writes in a database,

  • the number of simultaneously active users in a chat room,

  • the hit rate on a cache, or something else.

• Perhaps the average case is what matters for you,

• or perhaps your bottleneck is dominated by a small number of extreme cases.


Twitter as an example

(using data published in November 2012)

Two of Twitter’s main operations are:

Post tweet

• A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).

Home timeline

• A user can view tweets recently published by the people they follow (300k requests/sec).


Twitter as an example

• Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy.

However:

• Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out.*

• Each user follows many people, and each user is followed by many people.

* Fan-out is a term borrowed from electronic engineering, where it describes the number of logic gate inputs that the output of a single gate can feed. Here it refers to the number of requests to other services needed to serve one incoming request.


Two Different Approaches for Tweet Implementation

1. Posting a tweet simply inserts the new tweet into a global collection of tweets.

• When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time).

In a relational database, this would be a query along the lines of:

SELECT tweets.*, users.* FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user
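
The same read-time merge can be sketched in Python without a database. This is only an illustration of the first approach: the in-memory `tweets` list, the `follows` mapping and both function names are assumptions, not Twitter's actual data structures.

```python
from typing import Dict, List, Set, Tuple

Tweet = Tuple[int, str, str]          # (timestamp, sender, text)

tweets: List[Tweet] = []              # global collection of tweets
follows: Dict[str, Set[str]] = {}     # follower -> set of followees


def post_tweet(timestamp: int, sender: str, text: str) -> None:
    """Writing is cheap: append to the global collection."""
    tweets.append((timestamp, sender, text))


def home_timeline(user: str) -> List[Tweet]:
    """Reading does the work: filter by followees, then sort newest first."""
    followees = follows.get(user, set())
    matching = [t for t in tweets if t[1] in followees]
    return sorted(matching, key=lambda t: t[0], reverse=True)


follows["carol"] = {"alice", "bob"}
post_tweet(1, "alice", "first")
post_tweet(2, "dave", "not followed by carol")
post_tweet(3, "bob", "second")
timeline = home_timeline("carol")
```

Every call to `home_timeline` scans and sorts again, so the cost of this approach grows with the read rate rather than the write rate.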


Simple relational schema for implementing a Twitter home timeline


• With this first approach, Twitter’s systems struggled to keep up with the load of home timeline queries.

Therefore:

• The company switched to the following solution.


2. Maintain a cache for each user’s home timeline, like a mailbox of tweets for each recipient user.

• When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.

• The request to read the home timeline is then cheap, because the result has been computed ahead of time.

• This works better than the previous solution because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads.

• It is therefore preferable to do more work at write time and less at read time.
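
The second approach moves the work to write time, which might look like the sketch below. The names `followers`, `timeline_cache`, `post_tweet` and `home_timeline` are hypothetical. The sketch also assumes tweets are posted in timestamp order, so prepending keeps each cache sorted newest first.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Tweet = Tuple[int, str, str]   # (timestamp, sender, text)

followers: DefaultDict[str, Set[str]] = defaultdict(set)           # followee -> followers
timeline_cache: DefaultDict[str, List[Tweet]] = defaultdict(list)  # user -> cached timeline


def post_tweet(timestamp: int, sender: str, text: str) -> int:
    """Fan out on write; return how many cache writes the tweet caused."""
    tweet = (timestamp, sender, text)
    for follower in followers[sender]:
        timeline_cache[follower].insert(0, tweet)   # newest first
    return len(followers[sender])


def home_timeline(user: str) -> List[Tweet]:
    """Reading is cheap: the result was computed ahead of time."""
    return list(timeline_cache[user])


followers["alice"] = {"carol", "dave"}
followers["bob"] = {"carol"}
writes_first = post_tweet(1, "alice", "first")
writes_second = post_tweet(2, "bob", "second")
```

The return value of `post_tweet` makes the cost visible: one tweet causes one cache write per follower.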


Twitter’s data pipeline for delivering tweets to followers, with load parameters as of November 2012


Twitter Implementation

• Posting a tweet requires a lot of extra work.

• On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches.

• This average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers.

• This means that a single tweet may result in over 30 million writes to home timelines!

• Doing this in a timely manner (Twitter tries to deliver tweets to followers within 5 seconds) is a significant challenge.
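
The arithmetic behind these figures can be checked in a few lines of Python. The constants are the ones quoted on the slides. The last calculation, the write rate needed to deliver a single 30-million-follower tweet within the 5-second target, is an added illustration rather than a figure from the source.

```python
TWEETS_PER_SECOND = 4_600        # average post rate quoted on the slides
AVERAGE_FOLLOWERS = 75           # average deliveries per tweet
LARGEST_FOLLOWING = 30_000_000   # "over 30 million followers"
DELIVERY_TARGET_SECONDS = 5      # delivery goal quoted on the slides

# Average load on the home timeline caches.
cache_writes_per_second = TWEETS_PER_SECOND * AVERAGE_FOLLOWERS

# Write rate needed to deliver one such tweet within the target.
required_rate_for_one_tweet = LARGEST_FOLLOWING / DELIVERY_TARGET_SECONDS

print(cache_writes_per_second)        # 345000
print(required_rate_for_one_tweet)    # 6000000.0
```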


Twitter is moving to a hybrid of both approaches.

• Most users’ tweets continue to be fanned out to home timelines at the time when they are posted.

• A small number of users with a very large number of followers are excepted from this fan-out.

• Tweets from any such high-follower accounts that a user follows are fetched separately and merged with that user’s home timeline when it is read, as in the first approach.

• This hybrid approach is able to deliver consistently good performance.
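
A hybrid of the two approaches could be sketched as follows. This is an assumption-laden illustration: `FANOUT_LIMIT` is a made-up threshold (tiny so the example stays small), and all names are hypothetical. How Twitter actually decides which accounts skip the fan-out is not stated in the source.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Tweet = Tuple[int, str, str]   # (timestamp, sender, text)
FANOUT_LIMIT = 2               # hypothetical threshold for skipping fan-out

followers: DefaultDict[str, Set[str]] = defaultdict(set)
following: DefaultDict[str, Set[str]] = defaultdict(set)
timeline_cache: DefaultDict[str, List[Tweet]] = defaultdict(list)
celebrity_tweets: DefaultDict[str, List[Tweet]] = defaultdict(list)


def follow(follower: str, followee: str) -> None:
    followers[followee].add(follower)
    following[follower].add(followee)


def post_tweet(timestamp: int, sender: str, text: str) -> None:
    tweet = (timestamp, sender, text)
    if len(followers[sender]) > FANOUT_LIMIT:
        celebrity_tweets[sender].append(tweet)      # stored once, no fan-out
    else:
        for follower in followers[sender]:
            timeline_cache[follower].append(tweet)  # fan out on write


def home_timeline(user: str) -> List[Tweet]:
    """Merge the precomputed cache with tweets fetched at read time."""
    merged = list(timeline_cache[user])
    for followee in following[user]:
        merged.extend(celebrity_tweets.get(followee, []))
    return sorted(merged, key=lambda t: t[0], reverse=True)


for fan in ("u1", "u2", "u3"):
    follow(fan, "star")
follow("u1", "friend")
post_tweet(1, "friend", "hi")
post_tweet(2, "star", "big news")
```

Here `star` exceeds the threshold, so its tweet is written once and merged in at read time, while `friend`'s tweet is copied into the cache of its single follower.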


Describing Performance

• Once we have described the load on our system, we can investigate what happens when the load increases.

We can look at it in two ways:

• When we increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of the system affected?

• When we increase a load parameter, how much do we need to increase the resources to keep performance unchanged?


Describing Performance

Both questions require performance numbers.

• In a batch-processing system such as Hadoop, we usually care about throughput:

  • the number of records we can process per second, or

  • the total time it takes to run a job on a dataset of a certain size.

• In online systems, the response time of a service is usually more important:

  • the time between a client sending a request and receiving a response.
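
Both kinds of number are straightforward to compute, as this sketch suggests. The helper names `throughput` and `timed_call` are hypothetical, and the batch figures (1.2 million records in 60 seconds) are invented sample values.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def throughput(records_processed: int, elapsed_seconds: float) -> float:
    """Records per second for a batch job."""
    return records_processed / elapsed_seconds


def timed_call(handler: Callable[[], T]) -> Tuple[T, float]:
    """Run handler; return its result and the response time seen by the caller."""
    start = time.perf_counter()
    result = handler()
    return result, time.perf_counter() - start


batch_rate = throughput(records_processed=1_200_000, elapsed_seconds=60.0)
result, response_time = timed_call(lambda: sum(range(1000)))
```

`batch_rate` is a single figure for the whole job, while `response_time` is measured per request and will vary from call to call.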


Latency and Response time

• Latency and response time are often used synonymously, but they are not the same.

• The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.

• Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
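
The difference between the two terms shows up in a small simulation. The sketch assumes a single server that handles requests in arrival order and a fixed round-trip network delay. The function name `simulate` and the sample numbers are illustrative only.

```python
from typing import List, Tuple


def simulate(
    requests: List[Tuple[float, float]], network_delay: float
) -> List[Tuple[float, float]]:
    """Single FIFO server; requests are (arrival_time, service_time), sorted by arrival.

    Returns one (latency, response_time) pair per request.
    """
    results = []
    server_free_at = 0.0
    for arrival, service_time in requests:
        start = max(arrival, server_free_at)
        latency = start - arrival                  # time spent waiting to be handled
        server_free_at = start + service_time
        response_time = network_delay + latency + service_time
        results.append((latency, response_time))
    return results


timings = simulate([(0.0, 2.0), (1.0, 2.0), (5.0, 1.0)], network_delay=0.5)
# The second request waits 1.0 s behind the first, so its response time
# (3.5 s) exceeds its own service time (2.0 s) plus the network delay (0.5 s).
```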


Data Models and Query Languages

• When we want to store data structures, we express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.

• The engineers who built our database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network.

• The representation may allow the data to be queried, searched, manipulated and processed in various ways.


Relational Model vs. Document Model

• In the 2010s, NoSQL is the latest attempt to overthrow the relational model’s dominance.

• The term NoSQL is unfortunate, since it doesn’t actually refer to any particular technology: it was intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, non-relational databases in 2009.

• The term struck a nerve and quickly spread through the web startup community and beyond.

• A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively reinterpreted as Not Only SQL.


The Adoption of NoSQL databases

• A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput

• A widespread preference for free and open source software over commercial database products

• Specialized query operations that are not well supported by the relational model

• Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model


Representation of a LinkedIn profile as a JSON document

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
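
Because the whole profile lives in one document, an application can load it once and walk the nested lists without any joins. The sketch below uses Python's standard `json` module on an abridged copy of the document above. The variable names are illustrative.

```python
import json

# Abridged copy of the profile document shown above.
profile_json = """
{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ]
}
"""

profile = json.loads(profile_json)

# One-to-many relationships are plain nested lists inside the document.
organizations = [p["organization"] for p in profile["positions"]]
schools = [e["school_name"] for e in profile["education"]]
```

JSON `null` becomes Python `None`, so missing start and end years need no special column default.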


One-to-Many Relationships with Tree Structure


The company name is not just a string, but a link to a company entity.


Many-to-Many Relationships

• The data within each dotted rectangle can be grouped into one document.

• The links to organizations, schools and other users need to be represented as references, and require joins when queried.
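
When the database offers no joins, the application has to follow such references itself. The sketch below shows the idea. The numeric organization IDs, the two dictionaries and `resolve_positions` are hypothetical and exist only for this example.

```python
from typing import Any, Dict, List

# Organizations are stored once and referenced by ID (IDs are made up).
organizations: Dict[int, Dict[str, Any]] = {
    1: {"name": "Microsoft"},
    2: {"name": "Bill & Melinda Gates Foundation"},
}

profiles: Dict[int, Dict[str, Any]] = {
    251: {
        "first_name": "Bill",
        "positions": [
            {"job_title": "Co-chair", "organization_id": 2},
            {"job_title": "Co-founder, Chairman", "organization_id": 1},
        ],
    }
}


def resolve_positions(profile_id: int) -> List[Dict[str, str]]:
    """Application-side join: one extra lookup per referenced organization."""
    profile = profiles[profile_id]
    return [
        {
            "job_title": position["job_title"],
            "organization": organizations[position["organization_id"]]["name"],
        }
        for position in profile["positions"]
    ]


resolved = resolve_positions(251)
```

Renaming an organization now means changing one entry, at the price of an extra lookup on every read.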


Schema Flexibility in the Document Model

• Document databases are sometimes called schemaless.

• However, the code that reads the data usually assumes some kind of structure.

• There is an implicit schema, but it is not enforced by the database.

• Schema-on-read: the structure of the data is implicit, and only interpreted when the data is read.

• Schema-on-write: the traditional approach of relational databases, where the schema is explicit and the database ensures all data conforms to it.
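
The two terms can be contrasted in a few lines of Python. This is a sketch under stated assumptions: `REQUIRED_FIELDS` is a made-up schema, plain lists stand in for a table and a document collection, and the sample names are invented.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = {"user_id": int, "first_name": str}   # hypothetical explicit schema


def write_with_schema(table: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Schema-on-write: reject nonconforming records before they are stored."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} must be of type {expected.__name__}")
    table.append(record)


def read_first_name(document: Dict[str, Any]) -> str:
    """Schema-on-read: interpret the structure only when the data is read."""
    if "first_name" in document:
        return document["first_name"]
    return document.get("name", "").split(" ")[0]   # tolerate the older shape


table: List[Dict[str, Any]] = []
write_with_schema(table, {"user_id": 1, "first_name": "Ada"})

# Documents were stored without any check, so their shapes differ.
documents = [{"first_name": "Ada"}, {"name": "Grace Hopper"}]
names = [read_first_name(d) for d in documents]
```

With schema-on-write the error surfaces when bad data is inserted. With schema-on-read the reading code has to cope with every shape that was ever stored.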


Schema-on-Read vs. Schema-on-Write

• Schema-on-read is similar to dynamic (run-time) type-checking in programming languages.

• Schema-on-write is similar to static (compile-time) type-checking.

• Just as the advocates of static and dynamic type-checking have big debates about their relative merits, enforcement of schemas in databases is a contentious topic.


Document Database

• To change the data format (for example, to store the first name in its own field), we can simply start writing new documents with the new fields.

• The code in the application handles the case when old documents are read.

For example:

if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}


Statically Typed Database Schema

• We can perform a migration along the lines of:

ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL

• Schema changes have a bad reputation of being slow and requiring downtime.

• This reputation is not entirely deserved.

• Most relational database systems execute ALTER TABLE statement
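
The migration can be tried locally with Python's built-in `sqlite3` module, as in the sketch below. SQLite is used here only because it ships with Python. It has neither `split_part` nor `substring_index`, so the sketch substitutes `substr` and `instr`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [("Bill Gates",), ("Ada Lovelace",), ("Plato",)],   # sample data
)

# Step 1: change the schema.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")

# Step 2: backfill the new column from the existing full name.
conn.execute(
    """
    UPDATE users
    SET first_name = CASE
        WHEN instr(name, ' ') > 0 THEN substr(name, 1, instr(name, ' ') - 1)
        ELSE name
    END
    """
)

rows = conn.execute("SELECT name, first_name FROM users ORDER BY id").fetchall()
conn.close()
```

The `CASE` branch covers names without a space, which would otherwise yield an empty first name.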

