Date posted: 26-Jan-2015
Category: Technology
Uploaded by: mongosoup
Webinar: The rmongodb R package
Dr. rer. nat. Markus Schmidberger
January 30th, 2014
Email:
Twitter: @cloudHPC
Outline
Introduction to Big Data, MongoDB, MongoSoup, R
Introduction to R database packages such as rmongodb
rmongodb Live Demo
Summary & Outlook & Questions
Big Data
Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …
storing
processing
Storing: NoSQL - MongoDB
NoSQL: databases using looser consistency models to store data
MongoDB: the most popular NoSQL database system
document oriented
JSON-like documents with dynamic schemas
http://docs.mongodb.org/manual/reference/sql-comparison/
MongoDB - some commands
db.collection.find()
db.collection.find().pretty()
db.collection.find( { _id: 5 } )
db.collection.find( { pop: { $gt: 25 } } )
db.collection.insert( { item: "card", pop: 15 } )
db.collection.ensureIndex( { orderDate: 1, zipcode: -1 } )
db.collection.update( { _id: 1 }, { $set: { "name": "Warner" } } )
MongoSoup
German MongoDB as a Service
cloudControl Add-On
running on AWS EU-Region or in Munich (Germany)
all features available: shared / dedicated hosting, replicaset, sharding
24/7 support available
Processing: Analyzing with R and Hadoop
backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
efficient processing technology required: R, Hadoop, …
see my Strata London 2013 tutorial "Big Data Analyses with R"
Introduction to R
R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs on different operating systems
support via huge community
One statistical example
(cl <- kmeans(dat, 4))
K-means clustering with 4 clusters of sizes
17, 30, 22, 31
Cluster means:
[,1] [,2]
1 0.02846 -0.3379
2 0.76616 1.0020
3 1.37160 0.9707
4 -0.06849 0.1409
Clustering vector:
 [1] 4 2 4 4 1 1 4 1 4 4 1 4 4 4 4 4 1 4 4 2 4 4 4 4 4 4 4 1 4 4 1 1 1 1 2
[36] 1 1 4 4 4 1 1 4 4 4 1 1 1 4 4 3 2 3 2 3 3 2 2 3 2 3 2 2 3 2 2 3 2 2 3
[71] 3 2 2 3 3 2 2 2 2 2 2 2 3 2 2 4 3 2 3 2 2 3 3 3 3 3 3 2 3 2
Within cluster sum of squares by cluster:
[1] 1.836 4.660 1.994 3.047
(between_SS / total_SS = 84.1 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"
plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
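The slides never show how `dat` was created, so the example above cannot be run as printed. The following is a minimal sketch under the assumption that `dat` is a matrix of 100 two-dimensional points; the simulated data, the seed, and the `nstart` value are illustrative choices, not taken from the webinar, so the cluster sizes and means will differ from the output shown above.

```r
# Assumption: `dat` is simulated here as 100 two-dimensional points
# drawn around two centres; the original data set is not given.
set.seed(42)  # arbitrary seed, only for reproducibility
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# Fit k-means with 4 clusters; nstart > 1 reruns from several random starts.
cl <- kmeans(dat, centers = 4, nstart = 10)

print(cl$size)     # number of points per cluster
print(cl$centers)  # one row of coordinates per cluster centre

# Same plot as on the slide: points coloured by cluster, centres overlaid.
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

Storing the fit in `cl` is what makes the `cl$cluster` and `cl$centers` components available to the two plotting calls.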
R and Databases
SQL provides a standard language to filter, aggregate, group, sort data
SQL in new places: Hive, Impala, …
many R packages to connect to the SQL world
R stores relational data in data.frames (extended lists)
data(iris)
head(iris[,1:3], n=3)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
class(iris)
[1] "data.frame"
R package: sqldf
running SQL statements on R data frames
library(sqldf)
sqldf("select Sepal_Length, Sepal_Width, Petal_Length from iris limit 2")
Sepal_Length Sepal_Width Petal_Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
sqldf("select count(*) from iris")
count(*)
1 150
Other relational R packages
RMySQL
RPostgreSQL
ROracle
RJDBC
RODBC
RSQLite (SQLite engine is included)
One big problem: all packages read the full query results into R memory
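To make the memory concern concrete, here is a small sketch that is not part of the webinar. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite copy of `iris` as a stand-in for a real database. It contrasts materialising the whole result as one data.frame with walking through the result in chunks of 50 rows; whether chunked fetching is practical for a given driver and query is something to check per package.

```r
library(DBI)

# In-memory SQLite database holding a copy of the iris data (stand-in only).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

# Variant 1: the whole result set is materialised as one data.frame.
full <- dbGetQuery(con, "SELECT * FROM iris")

# Variant 2: fetch at most 50 rows at a time and keep only a running count.
res <- dbSendQuery(con, "SELECT * FROM iris")
rows_seen <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50)
  rows_seen <- rows_seen + nrow(chunk)
}
dbClearResult(res)
dbDisconnect(con)

print(nrow(full))
print(rows_seen)
```

In the second variant only one chunk is held in R at a time, which is the same idea the cursor-based MongoDB access builds on.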
R and MongoDB
on CRAN there are two packages to connect R with MongoDB
rmongodb supported by MongoDB, Inc.
powerful for big data
RMongo
easy to use
limited functionality
reads full query results in R memory
R package: RMongo
library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
dbShowCollections(mongo)
[1] "zips"           "ccp"            "system.users"   "system.indexes"
[5] "test_data"
dbGetQuery(mongo, "zips", "{'state':'AL'}", skip=0, limit=5)
   X_id state                       loc   pop       city
1 35004    AL  [ -86.51557 , 33.584132]  6055      ACMAR
2 35005    AL [ -86.959727 , 33.588437] 10616 ADAMSVILLE
3 35006    AL [ -87.167455 , 33.434277]  3205      ADGER
4 35007    AL [ -86.812861 , 33.236868] 14218   KEYSTONE
5 35010    AL [ -85.951086 , 32.941445] 19942   NEW SITE
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
[1] "ok"
# e.g. no command to remove collections
# e.g. no command to create indices
dbDisconnect(mongo)
R package: rmongodb
developed on top of the MongoDB-supported C driver
new maintainer:
new repository:
please provide feedback or contribute via Pull Requests
https://github.com/mongosoup/rmongodb
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de",
                      db="cc_JwQcDLJSYQJb",
                      username="JwQcDLJSYQJb",
                      password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x102e4aac0>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0
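Once a connection object exists, documents can be read through a cursor rather than as one complete result. The sketch below is an illustration, not the demo code from the webinar: `fetch_state` is a hypothetical helper, and the host `localhost` and the namespace `test.zips` are assumptions. The query is only attempted when the rmongodb package is installed and a server is reachable.

```r
# Hypothetical helper: collect up to `limit` documents for one state.
fetch_state <- function(mongo, ns, state, limit = 5L) {
  query  <- rmongodb::mongo.bson.from.list(list(state = state))
  cursor <- rmongodb::mongo.find(mongo, ns, query, limit = limit)
  docs <- list()
  # The cursor hands over one document per step, so only the documents
  # collected here are held in R memory.
  while (rmongodb::mongo.cursor.next(cursor)) {
    doc <- rmongodb::mongo.cursor.value(cursor)
    docs[[length(docs) + 1]] <- rmongodb::mongo.bson.to.list(doc)
  }
  rmongodb::mongo.cursor.destroy(cursor)
  docs
}

# Only attempted when rmongodb is installed and a server is reachable.
if (requireNamespace("rmongodb", quietly = TRUE)) {
  mongo <- rmongodb::mongo.create(host = "localhost")  # assumed local server
  if (rmongodb::mongo.is.connected(mongo)) {
    # Assumed namespace in the form "database.collection".
    str(fetch_state(mongo, "test.zips", "AL"))
    rmongodb::mongo.destroy(mongo)
  }
}
```

Because each document is converted inside the loop, the helper could just as well aggregate or filter per document instead of collecting everything into a list.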
Live Demo
Live demo with RStudio and MongoSoup
JSON <-> BSON <-> R
new functionality in development
still problems with sub-documents and JSON arrays
using jsonlite package helps
library(rmongodb)
library(jsonlite)
bson <- mongo.bson.from.JSON('{"state":"AL"}')
bson
state : 2 AL
list <- mongo.bson.to.list(bson)
list
$state
[1] "AL"
toJSON(list)
[1] "{ \"state\" : [ \"AL\" ] }"
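The array wrapping in the last output comes from how `toJSON` treats R vectors: a length-one vector is still a vector, so it is written as a JSON array. As a minimal sketch using only the jsonlite package (no database needed), the `auto_unbox` argument writes length-one vectors as JSON scalars instead. The variable names are illustrative, and this does not address the sub-document issues mentioned on the slide.

```r
library(jsonlite)

doc <- list(state = "AL")

# Default: every R vector becomes a JSON array, even one of length one.
boxed <- toJSON(doc)
# auto_unbox = TRUE turns length-one vectors into JSON scalars.
unboxed <- toJSON(doc, auto_unbox = TRUE)

print(boxed)
print(unboxed)

# Round trip back to an R list.
back <- fromJSON(unboxed)
str(back)
```

The unboxed form matches the JSON string that was originally passed to `mongo.bson.from.JSON`, which makes round trips easier to compare.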
Summary
R is a powerful statistical tool for analysing many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
some open issues for simple usability
Outlook
Fixing JSON to BSON issues
Provide efficient conversion from MongoDB to data.frames
Use new mongodb-c library
a lot of work: re-engineering rmongodb back-end
-> more speed, more functionality
continue developing the dplyrmongodb package: https://github.com/schmidb/dplyrmongodb
Questions & Answers
thanks a lot for your attention
demo code available as vignette in the rmongodb package on GitHub