Data Platform Airlift
21 October \\ Microsoft Lisbon Experience
What’s new in the Azure Data Platform
Ricardo Peres
Luis Calado
Azure DocumentDB
Azure Search
Azure Machine Learning Marketplace
Azure SQL Database
Azure Data Lake
Azure Data Factory
Agenda
Headlines
Core Concepts
Resources
Indexes
Querying
Paging
Updating
Transactions
Partition Resolvers
User Defined Functions
Stored Procedures
Triggers
Security
Limits
Search
Best Practices
Headlines
NoSQL database as a service for JSON documents
Schemaless
RESTful
Part of Azure – only available online
Highly scalable
Several bindings (.NET, JavaScript, Python, ...)
Core Concepts
Resources (1 of 3)
Documents that live in DocumentDB
All have a unique addressable URL (_rid or id):
https://{account}.documents.azure.com/dbs/{_rid-db}/colls/{_rid-col}/docs/{_rid-doc}
All live inside a collection
A collection lives inside a database
A database belongs to an account
A collection can take different kinds of documents
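The account, database, collection and document hierarchy above maps directly onto the URL. A minimal sketch, assuming a hypothetical helper name `buildDocumentUrl` and placeholder identifiers; it only concatenates strings and never calls the service:

```javascript
// Hypothetical helper (not part of any SDK): builds the addressable URL of a
// document from the account name and the identifiers of its parents.
function buildDocumentUrl(account, dbRid, collRid, docRid) {
  return 'https://' + account + '.documents.azure.com' +
    '/dbs/' + dbRid +
    '/colls/' + collRid +
    '/docs/' + docRid;
}

// Placeholder values, not real resource identifiers.
console.log(buildDocumentUrl('myaccount', 'db1', 'coll1', 'doc1'));
```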
Resources (2 of 3)
Either POCOs or inherit from Resource
Some built-in properties:
If an id property is not specified, one will be provided (Guid)
Case matters!
Property   User Settable   Purpose
_rid       No              System generated, unique and hierarchical identifier
_etag      No              HTTP etag required for optimistic concurrency control
_ts        No              Last updated timestamp
_self      No              Unique addressable URL
id         Yes             User defined unique name
Resources (3 of 3)
Can have attachments:
https://{account}.documents.azure.com/dbs/{_rid-db}/colls/{_rid-col}/docs/{_rid-doc}/attachments/{_id-attch}
Additional properties:
Property      User Settable   Purpose
contentType   Yes             The content type of the attachment
media         Yes             The URL link or file path where the attachment resides
Indexes (1 of 2)
Index update consistency can be configured per collection:
Consistent: indexes are updated synchronously
Lazy: indexes are updated asynchronously
None: the collection is not indexed
Indexes (2 of 2)
By default, all paths are indexed; this can be overridden
Three kinds of property indexes:
Hashed: for exact matches
Range: for range comparisons, ordering
Spatial: for geospatial queries
Three kinds of property value indexes (from JSON):
String (precision: 1-100 or -1)
Number (precision: 1-8 or -1)
Point
A collection can have several indexes at once
If a collection does not have an index, it cannot be queried except by id or self link!
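The precision ranges above are easy to get wrong when overriding the default policy. The sketch below expresses an illustrative indexing policy as a plain object and checks it against the ranges quoted on the slide. The object shape is modelled on the JSON indexing policy format, and the helper names `isValidPrecision` and `invalidIndexes` are hypothetical; only String and Number are checked, since the slide gives no precision for Point.

```javascript
// Illustrative policy: every path indexed, range index on numbers,
// hash index on strings. The values are examples, not recommendations.
const indexingPolicy = {
  indexingMode: 'consistent',
  includedPaths: [
    {
      path: '/*',
      indexes: [
        { kind: 'Range', dataType: 'Number', precision: -1 },
        { kind: 'Hash', dataType: 'String', precision: 3 }
      ]
    }
  ]
};

// Ranges from the slide: String 1-100 or -1, Number 1-8 or -1.
function isValidPrecision(dataType, precision) {
  const max = { String: 100, Number: 8 }[dataType];
  if (max === undefined) return false; // data type not covered by this sketch
  if (precision === -1) return true;
  return Number.isInteger(precision) && precision >= 1 && precision <= max;
}

// Returns the index definitions whose precision falls outside the ranges.
function invalidIndexes(policy) {
  const bad = [];
  policy.includedPaths.forEach(function (p) {
    p.indexes.forEach(function (ix) {
      if (!isValidPrecision(ix.dataType, ix.precision)) bad.push(ix);
    });
  });
  return bad;
}

console.log(invalidIndexes(indexingPolicy).length === 0 ? 'policy ok' : 'policy has invalid precisions');
```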
Querying – SQL (1 of 3)
Returns JSON
Joins only inside document (collections)
No comparison of different data types (undefined)
Math: +, -, *, /, %
Bitwise: |, &, ^, <<, >>, >>>
Logical: AND, OR, NOT
Comparison: =, !=, <, >, <=, >=, <>
String: ||
Ternary and coalesce: ?, ??
IN, BETWEEN, ORDER BY
Parameterized – no SQL injection
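Parameterization keeps the SQL text and the user-supplied values apart. A minimal sketch of building such a query specification as a `{ query, parameters }` object; the helper name `buildQuerySpec` and the sample values are assumptions for illustration:

```javascript
// Hypothetical helper: the SQL text keeps @placeholders and the values travel
// separately, so user input is never spliced into the query string.
function buildQuerySpec(sql, params) {
  return {
    query: sql,
    parameters: Object.keys(params).map(function (name) {
      return { name: '@' + name, value: params[name] };
    })
  };
}

// A value containing a quote stays a value; it cannot alter the query text.
const spec = buildQuerySpec(
  'SELECT * FROM Families f WHERE f.lastName = @lastName',
  { lastName: 'O"Brien' }
);
console.log(JSON.stringify(spec));
```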
Querying – SQL (2 of 3)
SQL functions:
Math: ABS, CEILING, EXP, FLOOR, LOG, LOG10, POWER, ROUND, SIGN, SQRT, SQUARE, TRUNC, ACOS, ASIN, ATAN, ATN2, COS, COT, DEGREES, PI, RADIANS, SIN, TAN
Type checking: IS_ARRAY, IS_BOOL, IS_NULL, IS_NUMBER, IS_OBJECT, IS_STRING, IS_DEFINED, IS_PRIMITIVE
String: CONCAT, CONTAINS, ENDSWITH, INDEX_OF, LEFT, LENGTH, LOWER, LTRIM, REPLACE, REPLICATE, REVERSE, RIGHT, RTRIM, STARTSWITH, SUBSTRING, UPPER
Array: ARRAY_CONCAT, ARRAY_CONTAINS, ARRAY_LENGTH, ARRAY_SLICE
Spatial: ST_DISTANCE, ST_WITHIN, ST_ISVALID, ST_ISVALIDDETAILED
Querying – SQL (3 of 3)
SQL ternary and coalesce: ?, ??
SELECT (c.grade < 5) ? "elementary" : "other" AS gradeLevel
FROM Families.children[0] c
SELECT f.lastName ?? f.surname AS familyName
FROM Families f
Projecting into new JSON objects:
SELECT { "state": f.address.state, "city": f.address.city, "name": f.id }
FROM Families f
WHERE f.id = "AndersenFamily"
Result: [ { "$1": { "state": "WA", "city": "seattle", "name": "AndersenFamily" } } ]
Creating arrays:
SELECT [f.address.city, f.address.state] AS CityState
FROM Families f
Result: [ { "CityState": [ "seattle", "WA" ] }, { "CityState": [ "NY", "NY" ] } ]
Returning single values:
SELECT VALUE "Hello World"
Result: [ "Hello World" ]
Querying - LINQ
LINQ functions:
Math: Abs, Acos, Asin, Atan, Ceiling, Cos, Exp, Floor, Log, Log10, Pow, Round, Sign, Sin, Sqrt, Tan, Truncate
String: Concat, Contains, EndsWith, IndexOf, Count, ToLower, TrimStart, Replace, Reverse, TrimEnd, StartsWith, Substring, ToUpper
Array: Concat, Contains, and Count
Spatial: Distance, Within, IsValid, and IsValidDetailed
Paging
Can specify maximum number of items to retrieve
Has more results / get next results
Ordering
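The "has more results / get next results" pattern can be sketched against an in-memory stand-in. Everything here is hypothetical (`queryPage`, `readAll`, and the numeric continuation value); a real client would pass a maximum item count and a continuation token to the service instead.

```javascript
// In-memory stand-in for a paged query: returns at most maxItemCount items
// plus a continuation value, or null when there are no more results.
function queryPage(items, maxItemCount, continuation) {
  if (!Number.isInteger(maxItemCount) || maxItemCount < 1) {
    throw new RangeError('maxItemCount must be a positive integer');
  }
  const start = continuation || 0;
  const results = items.slice(start, start + maxItemCount);
  const next = start + results.length;
  return { results: results, continuation: next < items.length ? next : null };
}

// Keeps asking for the next page while the source reports more results.
function readAll(items, maxItemCount) {
  const pages = [];
  let continuation = null;
  do {
    const page = queryPage(items, maxItemCount, continuation);
    pages.push(page.results);
    continuation = page.continuation;
  } while (continuation !== null);
  return pages;
}

console.log(JSON.stringify(readAll([1, 2, 3, 4, 5], 2)));
```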
Updating
Inserts
  From POCO
  From Stream
  Batching:
    Document Explorer
    Data Migration Tool
    Stored Procedures
Replaces
  Concurrency control from Etags
Deletes
  By self link or id
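Etag-based concurrency control on replaces can be illustrated without the service. The sketch below is an in-memory stand-in (`createStore` and its methods are hypothetical, not SDK calls): a replace succeeds only when the caller presents the etag it last read, and otherwise fails with HTTP status 412 (Precondition Failed).

```javascript
// In-memory stand-in for a collection; every write assigns a fresh _etag.
function createStore() {
  const docs = new Map();
  let version = 0;
  return {
    upsert: function (doc) {
      const stored = Object.assign({}, doc, { _etag: String(++version) });
      docs.set(doc.id, stored);
      return Object.assign({}, stored);
    },
    read: function (id) {
      return Object.assign({}, docs.get(id));
    },
    // Succeeds only if the caller's etag still matches the stored one.
    replace: function (doc, ifMatchEtag) {
      const current = docs.get(doc.id);
      if (!current) throw new Error('Not found: ' + doc.id);
      if (current._etag !== ifMatchEtag) {
        const err = new Error('Precondition failed: document was modified');
        err.code = 412;
        throw err;
      }
      return this.upsert(doc);
    }
  };
}

// Two readers hold the same etag; the second writer loses.
const store = createStore();
const first = store.upsert({ id: 'session-1', seats: 10 });
const second = store.read('session-1');
store.replace({ id: 'session-1', seats: 9 }, first._etag);
try {
  store.replace({ id: 'session-1', seats: 8 }, second._etag);
} catch (e) {
  console.log('stale write rejected with code', e.code);
}
```

On a rejected write, the usual response is to re-read the document, reapply the change and try again.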
Transactions
No explicit transactions
Implicit inside triggers and stored procedures – only at collection level
Partition Resolvers
Specified per database
Possibly several
Can decide on which collection a document is to be saved or retrieved from
Included:
HashPartitionResolver: distribute data evenly across collections
RangePartitionResolver<T>: when there is a “natural” ordering, such as with date and time
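The two strategies can be sketched as plain functions that map a partition key to a collection name. This illustrates the idea only: `hashString`, `hashResolver`, `rangeResolver` and the collection names are hypothetical, and the hash is not the one the .NET resolvers use.

```javascript
// Simple deterministic string hash (illustrative only).
function hashString(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return h;
}

// Hash strategy: spreads keys across the available collections.
function hashResolver(collections) {
  return function (partitionKey) {
    return collections[hashString(String(partitionKey)) % collections.length];
  };
}

// Range strategy: each collection owns an inclusive [from, to] key range.
function rangeResolver(ranges) {
  return function (partitionKey) {
    const match = ranges.find(function (r) {
      return partitionKey >= r.from && partitionKey <= r.to;
    });
    if (!match) throw new Error('No collection covers key ' + partitionKey);
    return match.collection;
  };
}

const byHash = hashResolver(['coll0', 'coll1', 'coll2']);
// ISO dates compare correctly as strings, which gives a natural ordering.
const byYear = rangeResolver([
  { from: '2014-01-01', to: '2014-12-31', collection: 'sessions2014' },
  { from: '2015-01-01', to: '2015-12-31', collection: 'sessions2015' }
]);
console.log(byHash('session-42'), byYear('2015-10-21'));
```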
User Defined Functions
JavaScript-based
Exist in collections
No side effects
// The verbatim string (@"...") lets the JavaScript body span several lines
var regexMatchUdf = new UserDefinedFunction {
    Id = "REGEX_MATCH",
    Body = @"function (input, pattern) {
        return input.match(pattern) !== null;
    }"
};
SELECT udf.REGEX_MATCH("ardo", s.Id) FROM Session s
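Because a UDF body is plain JavaScript with no side effects, it can be exercised locally before it is registered. A minimal sketch, assuming the standalone name `regexMatch` and made-up sample values:

```javascript
// Same logic as the UDF body: true when the pattern matches the input.
function regexMatch(input, pattern) {
  return input.match(pattern) !== null;
}

console.log(regexMatch('Ricardo', 'ardo')); // true
console.log(regexMatch('Luis', 'ardo'));    // false
```

Note the argument order: the first argument is the text being searched and the second is the pattern.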
Stored Procedures
JavaScript-based
Exist in collections
Can do batching
Implicit transactions
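function (gender) {
    var response = getContext().getResponse();
    var collection = getContext().getCollection();
    // Parameterized query: the input is never concatenated into the SQL text
    var query = {
        query: 'SELECT * FROM c WHERE c.Gender = @gender',
        parameters: [{ name: '@gender', value: gender }]
    };
    var accepted = collection.queryDocuments(collection.getSelfLink(), query, {},
        function (err, documents, options) {
            if (err) throw err;
            response.setBody(JSON.stringify(documents));
        }
    );
    // queryDocuments returns false when the request could not be queued
    if (!accepted) throw new Error('The query was not accepted by the server.');
}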
Triggers
JavaScript-based
Exist in collections
Two types:
Pre trigger
Post trigger
function updateTrigger() {
    var request = getContext().getRequest();
    var doc = request.getBody();
    doc['message'] = 'Added by trigger';
    request.setBody(doc);
}
Security
Access keys:
Master (single)
Read only (multiple)
Database users – specify use at DocumentClient level
Permissions for users over resources (resource tokens: default expiration is 1h, up to 5h):
All
Read
Resources:
Collections
Documents
Attachments
Stored procedures
Triggers
User defined functions
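The permission modes and token lifetimes above can be captured in a small client-side check. This is a sketch under stated assumptions: `resolveTokenLifetime` and `isAllowed` are hypothetical helpers, the operation names are invented, and the 1 h and 5 h figures are taken from the slide.

```javascript
const DEFAULT_TTL_SECONDS = 60 * 60;  // 1 hour default, per the slide
const MAX_TTL_SECONDS = 5 * 60 * 60;  // 5 hours maximum, per the slide

// Returns the lifetime to request for a resource token, or throws if the
// requested value falls outside the allowed range.
function resolveTokenLifetime(requestedSeconds) {
  if (requestedSeconds === undefined) return DEFAULT_TTL_SECONDS;
  if (!Number.isInteger(requestedSeconds) ||
      requestedSeconds <= 0 ||
      requestedSeconds > MAX_TTL_SECONDS) {
    throw new RangeError('Token lifetime must be between 1 and ' + MAX_TTL_SECONDS + ' seconds');
  }
  return requestedSeconds;
}

// permissionMode is 'All' or 'Read'; operation is a hypothetical label.
function isAllowed(permissionMode, operation) {
  if (permissionMode === 'All') return true;
  return permissionMode === 'Read' && operation === 'read';
}

console.log(resolveTokenLifetime(), isAllowed('Read', 'replace'));
```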
Limits
Feature                                                          Limit
Maximum Request Units / second / collection                      2500
Maximum execution time for stored procedure and trigger          5 s
Provisioned document storage / collection                        50 GB
Maximum collections per database account*                        100
Maximum document storage per database (100 collections)*         1 TB
Maximum length of the Id property                                255 chars
Maximum request size of document and attachment                  512 KB
Maximum number of JOINs per query*                               5
Number of stored procedures, triggers and UDFs per collection*   25
Number of users per database account                             500,000
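Two of these limits can be checked on the client before a write is attempted. A minimal sketch for Node.js, assuming the hypothetical helper name `checkDocumentLimits` and treating the size of the serialized JSON as an approximation of the request size:

```javascript
const MAX_ID_LENGTH = 255;             // chars, from the table above
const MAX_REQUEST_BYTES = 512 * 1024;  // 512 KB, from the table above

// Returns a list of problems; an empty list means both checks passed.
function checkDocumentLimits(doc) {
  const problems = [];
  if (typeof doc.id === 'string' && doc.id.length > MAX_ID_LENGTH) {
    problems.push('id exceeds ' + MAX_ID_LENGTH + ' characters');
  }
  // Assumption: UTF-8 size of the JSON text approximates the request size.
  const bytes = Buffer.byteLength(JSON.stringify(doc), 'utf8');
  if (bytes > MAX_REQUEST_BYTES) {
    problems.push('serialized document exceeds 512 KB');
  }
  return problems;
}

console.log(checkDocumentLimits({ id: 'session-1', title: 'DocumentDB' }));
```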
Search
Based on Elasticsearch and Lucene
.NET + REST APIs
Can retrieve data from DocumentDB
Best Practices
Cache the DocumentClient instance
Choose right collection index update policy
Index only properties that will be searchable and with appropriate values – watch out for ranges
Store small documents
Measure and tune request costs
Retrieve only what you need – paging, projections
Cache self links – they never change
Use partition resolvers for distributing burden
Beware throttling!
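Throttled requests come back with HTTP status 429 and a suggested wait time, so callers normally retry after that delay. The sketch below is a synchronous simulation: `retryOnThrottle`, the injected `sleep` function and the error shape (`code`, `retryAfterInMilliseconds`) are assumptions for illustration, and real client code would be asynchronous.

```javascript
// operation: returns a result, or throws an error whose code is 429 when
//            throttled (hypothetical error shape).
// sleep:     injected wait function, so the sketch stays synchronous.
function retryOnThrottle(operation, maxAttempts, sleep) {
  for (let attempt = 1; ; attempt++) {
    try {
      return operation();
    } catch (err) {
      // Give up on non-throttling errors or when attempts are exhausted.
      if (err.code !== 429 || attempt >= maxAttempts) throw err;
      sleep(err.retryAfterInMilliseconds || 1000);
    }
  }
}

// Simulated operation: throttled twice, then succeeds.
let calls = 0;
const waits = [];
const result = retryOnThrottle(function () {
  calls++;
  if (calls <= 2) {
    const err = new Error('Request rate is large');
    err.code = 429;
    err.retryAfterInMilliseconds = 50;
    throw err;
  }
  return 'ok';
}, 5, function (ms) { waits.push(ms); });
console.log(result, calls, waits);
```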
Meet the Competition
MongoDB
Open source + support model
No joins
Aggregations
Time to live
Offline deployment
Replication
Eventual consistency
ACID transactions
Map/Reduce
Several programming languages supported
RavenDB
Open source + support model
Joins across documents
Aggregations
Expiry
Offline deployment
Replication
Eventual consistency
ACID transactions
Map/Reduce
.NET, REST
References
Query Playground: https://www.documentdb.com/sql/demo
.NET Azure DocumentDB Samples: https://github.com/Azure/azure-documentdb-net
DocumentDB Studio: https://studiodocumentdb.codeplex.com/
Azure DocumentDB Data Migration Tool: http://www.microsoft.com/en-us/download/details.aspx?id=46436
Pricing: https://azure.microsoft.com/en-us/pricing/details/documentdb/
Connecting DocumentDB with Azure Search using indexers: https://azure.microsoft.com/en-us/documentation/articles/documentdb-search-indexer/
A search-as-a-service solution allowing developers to incorporate great search experiences into applications without managing infrastructure or needing to become search experts.
Type Ahead
Facets
Hit Highlighting
Spelling Mistakes
Geo-Spatial Search
Paging
Sorting & Scoring
New indexers (SQL Database and DocumentDB)
New language support (35 languages including pt-PT)
Index creation in the new Management Portal
New Regions
New APIs for index creation
• Distance
• Intersection
Full Text Search
Secure data with authentication, authorization and encryption
Extended Events
[Diagram: Azure Machine Learning roles and components. Azure Ops Team: Azure Portal & ML API service. Data Scientist: ML Studio. Developer: ML API service. Data: HDInsight, Azure Storage, training set from on-prem. Consumers: Power BI/dashboards, mobile apps, web apps.]
ML Studio and the Data Scientist
• Access and prepare data
• Create, test and train models
• Collaborate
• One click to stage for production via the API service
Azure Portal & ML API service and the Azure Ops Team
• Create ML Studio workspace
• Assign storage account(s)
• Monitor ML consumption
• See alerts when model is ready
• Deploy models to web service
ML API service and the Developer
• Tested models available as a URL that can be called from any endpoint
Business users easily access results from anywhere, on any device
[Diagram: Azure ML in the Microsoft cloud. Components: Event Hubs, Blob Storage, Azure Portal, ML Studio, ML API Service, ML Algorithms, ML Operationalization, ML Apps, Marketplace.]
[Diagram: Observation, Pattern, Theory, Hypothesis]
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
[Diagram: Top-Down: Theory, Hypothesis, Observation, Confirmation]
[Diagram: data warehouse implementation]
Understand Corporate Strategy
Gather Requirements: Business Requirements, Technical Requirements
Implement Data Warehouse: Physical Design, Dimension Modelling, ETL Design, ETL Development, Reporting & Analytics Design, Reporting & Analytics Development, Setup Infrastructure, Install and Tune
Data sources, ETL, Data warehouse, BI and analytics
Ingest: regardless of requirements
Store: in native format without schema definition
Analyze: using analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Store and analyse data of any kind and size
Develop faster, debug and optimise smarter
Interactively explore patterns in your data
No learning curve—use U-SQL, Spark, Hive, HBase and Storm
Managed and supported with an enterprise-grade SLA
Dynamically scales to match your business priorities
Enterprise-grade security with Azure Active Directory
Built on YARN, designed for the cloud
[Diagram: Azure Data Lake ecosystem (categories: Dev Tools, Data Integration Tools, Platforms, Applications)]
AZURE DATA LAKE
Dev Tools: Visual Studio, PowerShell
MS: Azure Data Factory, Azure Stream Analytics*
MS: HDInsight, Kona, Azure SQL DW*, AzureML*
3rd Party: Informatica*
3rd Party: Cloudera*, Hortonworks*, MapR*
Open Source: Sqoop, Flume
MS: RevolutionR*, PowerBI*
3rd Party: TBA
Last Name   First Name   Country   Age   …
Flasko      Mike         Canada    32
Anand       Subbaraj     USA       30
Gaurav      Malhotra     USA       72
…           …            …         …

Last Name   First Name   At risk of churning   …
Flasko      Mike         Yes
Anand       Subbaraj     No
Gaurav      Malhotra     Yes
…           …
[Diagram: pipeline stages: Data Sources, Ingest, Transform & Analyze, Publish. Data sources and ingest: Call Log Files, Customer Table. Transform & analyze: Customer Call Details, Customers Likely to Churn. Publish: Customer Churn Table.]