Date posted: 10-Feb-2017
Category: Engineering
Uploaded by: soner-altin
NOSQL & DOCUMENTDB
SONER ALTIN
@kahve
• Soner ALTIN
• BizDev @T2
• soner.in
• Organise hackathons (t2hackathon.com)
• Strong interest in Led Zeppelin
HISTORY OF DBMS AND RDBMS
Database management systems first appeared in the 1960s, as computers began to grow in power and speed. By the mid-1960s, several commercial systems on the market were capable of producing “navigational” databases. These navigational databases maintained records that could only be processed sequentially, which required a lot of computer resources and time.
Relational database management systems were first proposed by Edgar Codd in 1970. Because navigational databases could not be “searched”, Codd suggested another model for constructing a database: the relational model, which organizes data into tables and lets users “search” it by content rather than by following links from record to record.
60’s 70’s 80’s 90’s 00’s
A relational database is a digital database whose organization is based on the relational model of data
RDBMS: 40 YEARS!
1. A simple way of representing data and business models
2. An easy-to-use language to retrieve and query that data (SQL)
3. Bulletproof data integrity and security built right into the database without having to rely on application rules and logic.
ACCESS AND STORAGE
▸ It is generally easier to access data that is stored in a relational database, because the data is organized according to a mathematical model (relations, i.e. tables). Once we connect to a relational database, every element in it can be reached through queries, which is not always the case with other kinds of databases (the data elements may need to be accessed individually).
▸ Relational databases are harder to construct, but they are better structured and more secure. They follow the ACID (atomicity, consistency, isolation and durability) model when storing data. The relational database system will also impose certain regulations and conditions that may not allow you to manipulate data in a way that destabilizes the integrity of the system.
PERSISTENCE
REPORTING
TRANSACTIONS
SQL
INTEGRATION
3V - VOLUME, VARIETY, VELOCITY
▸ Five years ago, Amazon found that every 100ms of latency cost them 1% of sales. Google discovered that a half-second increase in search latency dropped traffic by 20%.
▸ The volume of data that must be handled today is skyrocketing. Facebook houses 1.5 PB (petabytes) of uploaded photos. Google processes 20 PB of data each day. Every 60 seconds over 204 million emails are exchanged, 3,600 photos are shared on Instagram and 2 million search queries are processed by Google. RDBMSs struggle in the face of such huge data volumes, and RDBMS solutions capable of handling such volumes are extremely expensive.
▸ Big Data also demands collection of an extremely wide variety of data types, but RDBMSs have inflexible schemas. The problem is that Big Data primarily comprises semi-structured data, such as social media sentiment analysis and text mining data, while RDBMSs are more suitable for structured data, such as weblog, sensor and financial data.
▸ In addition, Big Data is accumulated at a very high velocity. Since RDBMSs are designed for steady data retention, rather than for rapid growth, using RDBMSs for Big Data is prohibitively expensive.
60’s 70’s 80’s 90’s 00’s 10’s
TODAY
▸ Developers are working with applications that create massive volumes of new, rapidly changing data types — structured, semi-structured, unstructured and polymorphic data.
▸ Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams work in agile sprints, iterating quickly and pushing code every week or two, some even multiple times every day.
▸ Applications that once served a finite audience are now delivered as services that must be always-on, accessible from many different devices and scaled globally to millions of users.
▸ Organizations are now turning to scale-out architectures using open source software, commodity servers and cloud computing instead of large monolithic servers and storage infrastructure.
Structured                    | Unstructured             | Semi-structured
Pre-defined                   | God knows                | Pre-defined
Relational                    | Non-relational           | So so
Constant                      | Flexible                 | Easy to change
RDBMS                         | HDFS                     | *
CRM, Travel, Phone numbers    | Web, Video, Music, Photo | Tagging, Comments
5%                            | 15%                      | 80%
No need to scale horizontally | Fully scalable           | Fully scalable
/*
 * Copyright 2007 Yusuke Yamamoto
 */
/**
 * A data interface representing one single status of a user.
 *
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface Status extends Comparable<Status>, TwitterResponse, EntitySupport, java.io.Serializable {
    Date getCreatedAt();
    long getId();
    String getText();
    String getSource();
    boolean isTruncated();
    long getInReplyToStatusId();
    long getInReplyToUserId();
    String getInReplyToScreenName();
    GeoLocation getGeoLocation();
    Place getPlace();
    boolean isFavorited();
    boolean isRetweeted();
    int getFavoriteCount();
    User getUser();
    boolean isRetweet();
    Status getRetweetedStatus();
    long[] getContributors();
    int getRetweetCount();
    boolean isRetweetedByMe();
    long getCurrentUserRetweetId();
    boolean isPossiblySensitive();
    String getLang();
    Scopes getScopes();
    String[] getWithheldInCountries();
    long getQuotedStatusId();
    Status getQuotedStatus();
}
/*
 * Copyright 2007 Yusuke Yamamoto
 */
/**
 * A data interface representing Basic user information element
 *
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface User extends Comparable<User>, TwitterResponse, java.io.Serializable {
    long getId();
    String getName();
    String getScreenName();
    String getLocation();
    String getDescription();
    boolean isContributorsEnabled();
    String getProfileImageURL();
    String getBiggerProfileImageURL();
    String getMiniProfileImageURL();
    String getOriginalProfileImageURL();
    String getProfileImageURLHttps();
    String getBiggerProfileImageURLHttps();
    String getMiniProfileImageURLHttps();
    String getOriginalProfileImageURLHttps();
    boolean isDefaultProfileImage();
    String getURL();
    boolean isProtected();
    int getFollowersCount();
    Status getStatus();
    String getProfileBackgroundColor();
    String getProfileTextColor();
    String getProfileLinkColor();
    String getProfileSidebarFillColor();
    String getProfileSidebarBorderColor();
    boolean isProfileUseBackgroundImage();
    boolean isDefaultProfile();
    boolean isShowAllInlineMedia();
    int getFriendsCount();
    Date getCreatedAt();
    int getFavouritesCount();
    int getUtcOffset();
    String getTimeZone();
    String getProfileBackgroundImageURL();
    String getProfileBackgroundImageUrlHttps();
    String getProfileBannerURL();
    String getProfileBannerRetinaURL();
    String getProfileBannerIPadURL();
    String getProfileBannerIPadRetinaURL();
    String getProfileBannerMobileURL();
    String getProfileBannerMobileRetinaURL();
    boolean isProfileBackgroundTiled();
    String getLang();
    int getStatusesCount();
    boolean isGeoEnabled();
    boolean isVerified();
    boolean isTranslator();
    int getListedCount();
    boolean isFollowRequestSent();
    URLEntity[] getDescriptionURLEntities();
    URLEntity getURLEntity();
    String[] getWithheldInCountries();
}
/*
 * Copyright 2007 Yusuke Yamamoto
 */
/**
 * A data interface representing one single URL entity.
 *
 * @author Mocel - mocel at guma.jp
 */
public interface URLEntity extends TweetEntity, java.io.Serializable {
    String getText();
    String getURL();
    String getExpandedURL();
    String getDisplayURL();
    int getStart();
    int getEnd();
}
/**
 * @author Yusuke Yamamoto - yusuke at mac.com
 */
public interface Place extends TwitterResponse, Comparable<Place>, java.io.Serializable {
    String getName();
    String getStreetAddress();
    String getCountryCode();
    String getId();
    String getCountry();
    String getPlaceType();
    String getURL();
    String getFullName();
    String getBoundingBoxType();
    GeoLocation[][] getBoundingBoxCoordinates();
    String getGeometryType();
    GeoLocation[][] getGeometryCoordinates();
    Place[] getContainedWithIn();
}
https://dev.twitter.com/rest/reference/get/statuses/retweets_of_me
SCALABILITY
NON RELATIONAL
A non-relational (NoSQL) database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
REQUIREMENTS
▸ over 425 million unique users
▸ store 20 TB of JSON document data
▸ available globally to serve all markets
▸ store for 40+ apps / device combinations
▸ writes under 15 ms and reads in single-digit ms
CONTROL OVER AVAILABILITY
HORIZONTAL SCALABILITY
SIMPLICITY OF DESIGN
BIG DATA
REAL TIME APPLICATIONS
EASIER DEVELOPMENT
SCALABILITY VS FUNCTIONALITY
[Chart: scalability & performance plotted against depth of functionality. Labeled points: RDBMS, NoSQL, memcached, key/value store]
ECONOMICS
The goal of a business, of course, is to make money, and that’s accomplished by providing more for less. NoSQL databases drastically reduce the need for insanely big machines. Typically, they use clusters of cheap commodity servers to manage exploding data and transaction volumes. The cost-per-gigabyte or transaction/second for NoSQL can be considerably lower than the cost for RDBMSs, thereby dramatically reducing the cost of data processing and storage. Another area of key savings is in manpower. By lowering administrative costs one can free up developers to code new features that will generate more revenue.
SCHEMALESS - DATA UPDATE
The documents stored in the database can have varying sets of fields, with different types for each field. One could have the following objects in a single collection:
{ name : "Joe", x : 3.3, y : [1,2,3] }
{ name : "Kate", x : "abc" }
{ q : 456 }
Of course, when using the database for real problems, the data does have a fairly consistent structure. Something like the following would be more common:
{ name : "Joe", age : 30, interests : "football" }
{ name : "Kate", age : 25 }
One of the great benefits of dynamic objects is that schema migrations become very easy. With a traditional RDBMS, releases of code might contain data migration scripts. Further, each release should have a reverse migration script in case a rollback is necessary. ALTER TABLE operations can be very slow and result in scheduled downtime.
With a schemaless database, 90% of the time adjustments to the database become transparent and automatic. For example, if we wish to add GPA to the student objects, we add the attribute, resave, and all is well – if we look up an existing student and reference GPA, we just get back null. Further, if we roll back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was well written.
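The GPA example above can be sketched in a few lines of Java. This is only an in-memory illustration, assuming documents are plain maps; the class and method names (SchemalessSketch, student, gpaOf) are hypothetical and are not part of any database SDK. It shows that a document saved before the new attribute existed simply reads back as null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a schemaless "students" collection. */
final class SchemalessSketch {

    /** Builds a student document with only the fields the old code knew about. */
    static Map<String, Object> student(String name, int age) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", name);
        doc.put("age", age);
        return doc;
    }

    /** Reads the gpa attribute; documents that never had it yield null. */
    static Double gpaOf(Map<String, Object> doc) {
        Object value = doc.get("gpa");
        if (value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        return null;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> students = new ArrayList<>();
        students.add(student("Joe", 30));   // saved by the old release, no gpa

        Map<String, Object> kate = student("Kate", 25);
        kate.put("gpa", 3.7);               // new release adds the attribute and resaves
        students.add(kate);

        // Old and new documents live side by side; no migration script was run.
        for (Map<String, Object> doc : students) {
            System.out.println(doc.get("name") + " gpa=" + gpaOf(doc));
        }
    }
}
```

Code that reads the new attribute has to tolerate null, which is the price paid for skipping the migration.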
NOSQL
data model | performance | scalability | flexibility | complexity
column     | high        | high        | moderate    | low
document   | high        | variable    | high        | low
key-value  | high        | high        | high        | none
graph      | variable    | variable    | high        | high
NOSQL TYPES
data model | examples
column     | Cassandra, HBase
document   | DocumentDB, MongoDB, ElasticSearch
key-value  | Redis, MemcacheDB
graph      | Neo4J, OrientDB
fully featured RDBMS
transactional processing
rich query
managed as a service
elastic scale
internet accessible http/rest
schema-free data model
arbitrary data formats
schema free query
Relational and hierarchical query of application defined JSON data. Support for SQL queries with transforms, projections and inline evaluation of user defined JavaScript functions (UDFs). Automatic and consistent indexing of all properties.
JavaScript as a modern T-SQL
Transactional execution of application defined stored procedures and triggers directly against database collections. Native JavaScript support eliminating the impedance mismatch between application and database schema.
tunable consistency
Well-defined consistency levels to achieve the optimal tradeoff between consistency and performance. Four distinct consistency levels for queries and reads – Strong, Bounded-Staleness, Session and Eventual. Granular control over consistency, availability and latency.
fully managed
Simple to provision and access databases without managing VM or cluster infrastructure. Operated with 99.95% availability and automatically backed up to protect against regional failures.
PRICING
DocumentDB collections are available in the Standard service tier. Collections are billable entities, each billed hourly, based on the performance level assigned to the collection. Collections are set to one of three performance levels – S1, S2 or S3. You can also dynamically change the performance level of a collection – for example, create an S1 collection, scale up to S3 then back to S2.
TUNABLE CONSISTENCY
type              | latency         | performance
strong            | high            | low
bounded staleness | moderate        | moderate
session           | low for session | fast for session
eventual          | low             | fast
RAPID DEVELOPMENT
No setup cost
Auto scale
Highly available
No configuration management cost
Integration with all Azure services
SDK support for JavaScript, Java, Node.js, Python, and .NET.
PREPARATION
CONFIGURATION
QUERIES
{
  "id": "AndersenFamily",
  "lastName": "Andersen",
  "parents": [
    { "firstName": "Thomas" },
    { "firstName": "Mary Kay" }
  ],
  "children": [
    {
      "firstName": "Henriette Thaulow",
      "gender": "female",
      "grade": 5,
      "pets": [{ "givenName": "Fluffy" }]
    }
  ],
  "address": { "state": "WA", "county": "King", "city": "seattle" },
  "creationDate": 1431620472,
  "isRegistered": true
}
{
  "id": "WakefieldFamily",
  "parents": [
    { "familyName": "Wakefield", "givenName": "Robin" },
    { "familyName": "Miller", "givenName": "Ben" }
  ],
  "children": [
    {
      "familyName": "Merriam",
      "givenName": "Jesse",
      "gender": "female",
      "grade": 1,
      "pets": [
        { "givenName": "Goofy" },
        { "givenName": "Shadow" }
      ]
    },
    {
      "familyName": "Miller",
      "givenName": "Lisa",
      "gender": "female",
      "grade": 8
    }
  ],
  "address": { "state": "NY", "county": "Manhattan", "city": "NY" },
  "creationDate": 1431620462,
  "isRegistered": false
}
QUERIES
* Operator
SELECT * FROM Families f WHERE f.id = "AndersenFamily"
Result:
{ "id": "AndersenFamily", "lastName": "Andersen", "parents": [ { "firstName": "Thomas" }, { "firstName": "Mary Kay"} ], "children": [ { "firstName": "Henriette Thaulow", "gender": "female", "grade": 5, "pets": [{ "givenName": "Fluffy" }] } ], "address": { "state": "WA", "county": "King", "city": "seattle" }, "creationDate": 1431620472, "isRegistered": true }
Where
SELECT {"Name":f.id, "City":f.address.city} AS Family FROM Families f WHERE f.address.city = f.address.state
Result:
[{ "Family": { "Name": "WakefieldFamily", "City": "NY" } }]
Join
SELECT c.givenName FROM Families f JOIN c IN f.children WHERE f.id = 'WakefieldFamily' ORDER BY f.address.city ASC
Result:
[ { "givenName": "Jesse" }, { "givenName": "Lisa"} ]
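The Join query iterates over the children array inside each family document, rather than joining two separate tables. A rough Java sketch of that idea follows. It works on in-memory maps only, and every name in it (JoinSketch, doc, sampleFamilies, childGivenNames) is hypothetical; it is not how DocumentDB evaluates the query, just an illustration of the flattening it performs. The ORDER BY clause is left out, since only one family matches the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory illustration of "JOIN c IN f.children". */
final class JoinSketch {

    /** Small helper: builds a document from alternating key/value arguments. */
    static Map<String, Object> doc(Object... keyValues) {
        Map<String, Object> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            map.put((String) keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    /** Trimmed-down copies of the two family documents from the slides. */
    static List<Map<String, Object>> sampleFamilies() {
        Map<String, Object> andersen = doc(
                "id", "AndersenFamily",
                "children", Arrays.asList(doc("firstName", "Henriette Thaulow")));
        Map<String, Object> wakefield = doc(
                "id", "WakefieldFamily",
                "children", Arrays.asList(doc("givenName", "Jesse"), doc("givenName", "Lisa")));
        return Arrays.asList(andersen, wakefield);
    }

    /** Mimics: SELECT c.givenName FROM Families f JOIN c IN f.children WHERE f.id = familyId */
    static List<String> childGivenNames(List<Map<String, Object>> families, String familyId) {
        List<String> names = new ArrayList<>();
        for (Map<String, Object> family : families) {
            if (!familyId.equals(family.get("id"))) {
                continue;                                   // WHERE f.id = familyId
            }
            Object children = family.get("children");
            if (!(children instanceof List)) {
                continue;                                   // no children array, nothing to join
            }
            for (Object child : (List<?>) children) {       // JOIN c IN f.children
                if (child instanceof Map) {
                    Object givenName = ((Map<?, ?>) child).get("givenName");
                    if (givenName != null) {                // children without givenName are skipped here
                        names.add(givenName.toString());
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(childGivenNames(sampleFamilies(), "WakefieldFamily"));
    }
}
```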
QUERIES
Nested properties
SELECT { "state": f.address.state, "city": f.address.city }, { "name": f.id } FROM Families f WHERE f.id = "AndersenFamily"
Result:
[{ "$1": { "state": "WA", "city": "seattle" }, "$2": { "name": "AndersenFamily" } }]
Scalar expression
SELECT f.address.city = f.address.state AS AreFromSameCityState FROM Families f
Result:
[ { "AreFromSameCityState": false }, { "AreFromSameCityState": true } ]
ORDER BY
SELECT f.id, f.address.city FROM Families f ORDER BY f.address.city
Result:
[ { "id": "WakefieldFamily", "city": "NY" }, { "id": "AndersenFamily", "city": "seattle" } ]
QUERIES
{ "Type": "Stratovolcano", "Status": "Tephrochronology", "Location": { "type": "Point", "coordinates": [ -121.49, 46.206 ] } }
Geospatial ST_WITHIN
SELECT v.Type, v.Status, v.Location FROM volcanoes v WHERE ST_WITHIN(v.Location, { "type": "Polygon", "coordinates": [[ [-124.63, 48.36], [-123.87, 46.14], [-122.23, 45.54], [-119.17, 45.95], [-116.92, 45.96], [-116.99, 49.00], [-123.05, 49.02], [-123.15, 48.31], [-124.63, 48.36]]]} )
Geospatial ST_DISTANCE
SELECT v.Elevation, v.Type, v.Region, v["Volcano Name"] FROM volcanoes v WHERE ST_DISTANCE(v.Location, { "type": "Point", "coordinates": [-122.19, 47.36] }) < 100 * 1000 AND v.Type = "Stratovolcano" AND v["Last Known Eruption"] = "Last known eruption from 1800-1899, inclusive"
{ "Elevation": 4392, "Type": "Stratovolcano", "Region": "US-Washington", "Volcano Name": "Rainier" }
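The ST_DISTANCE query compares a distance in meters against 100 * 1000, i.e. 100 km. To get a feel for that threshold, here is a minimal Java sketch that estimates the distance between two GeoJSON-style [longitude, latitude] points with the haversine formula. The haversine formula on a spherical Earth is an approximation picked for illustration, not necessarily the method DocumentDB uses, and the names (DistanceSketch, haversineMeters, withinMeters) are hypothetical.

```java
/** Hypothetical helper approximating a distance check like ST_DISTANCE(...) < 100 * 1000. */
final class DistanceSketch {

    /** Mean Earth radius in meters (spherical approximation). */
    static final double EARTH_RADIUS_METERS = 6_371_000.0;

    /** Great-circle distance in meters between two points given as longitude/latitude in degrees. */
    static double haversineMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        double sinHalfPhi = Math.sin(dPhi / 2);
        double sinHalfLambda = Math.sin(dLambda / 2);
        double a = sinHalfPhi * sinHalfPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
    }

    /** True when the two points are closer than maxMeters. */
    static boolean withinMeters(double lon1, double lat1, double lon2, double lat2, double maxMeters) {
        return haversineMeters(lon1, lat1, lon2, lat2) < maxMeters;
    }

    public static void main(String[] args) {
        // Query point from the ST_DISTANCE slide and the sample volcano location from the ST_WITHIN slide.
        double meters = haversineMeters(-122.19, 47.36, -121.49, 46.206);
        System.out.println("distance in meters: " + Math.round(meters));
        System.out.println("within 100 km: " + withinMeters(-122.19, 47.36, -121.49, 46.206, 100 * 1000));
    }
}
```

Under this approximation the sample volcano document from the ST_WITHIN slide lies more than 100 km from the query point, so it would fall outside the distance filter.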
LET’S TRY SOME QUERIES
JAVA SPRING APP
TWITTER STREAMING APP
MICROSERVICE
TWITTER STREAMING APP
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-documentdb</artifactId>
    <version>1.5.1</version>
</dependency>
BIT.LY/DEVNOT-CODE
DEMO