Date posted: 27-Jan-2015
Category: Software
Uploaded by: jim-cooley
What is Ubix?
In Three Minutes…
Visual Analytics
Pure Spark and More
Machine Learning
Irregular Data
Integrated Platform
Pure Spark and More
Ubix is an integrated platform built on Apache Spark. It lets you take data from a variety of sources and run it through multiple analytics steps and transformations, producing powerful interactive visualizations of both historical and streaming data.
[Diagram: data sources (HDFS, S3, In Memory (RDD), Streaming) flow through Analytics and Transform steps to outputs (HDFS, S3, In Memory (RDD), Visualizations)]
“Why don’t I build it myself?”
You could do this…
“It’s just Spark, right?”
package org.apache.spark.examples

import java.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

/**
 * Usage: GroupByTest [numMappers] [numKVPairs] [KeySize] [numReducers]
 */
object GroupByTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("GroupBy Test")
    var numMappers = if (args.length > 0) args(0).toInt else 2
    var numKVPairs = if (args.length > 1) args(1).toInt else 1000
    var valSize = if (args.length > 2) args(2).toInt else 1000
    var numReducers = if (args.length > 3) args(3).toInt else numMappers

    val sc = new SparkContext(sparkConf)

    val pairs1 = sc.parallelize(0 until numMappers, numMappers).flatMap { p =>
      val ranGen = new Random
      var arr1 = new Array[(Int, Array[Byte])](numKVPairs)
      for (i <- 0 until numKVPairs) {
        val byteArr = new Array[Byte](valSize)
        ranGen.nextBytes(byteArr)
        arr1(i) = (ranGen.nextInt(Int.MaxValue), byteArr)
      }
      arr1
    }.cache
    // Enforce that everything has been calculated and in cache
    pairs1.count

    println(pairs1.groupByKey(numReducers).count)

    sc.stop()
  }
}

GroupBy on Spark

library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 2) {
  print("Usage: wordcount <master> <file>")
  q("no")
}

# Initialize Spark context
sc <- sparkR.init(args[[1]], "RwordCount")
lines <- textFile(sc, args[[2]])

words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words, function(word) { list(word, 1L) })

counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
WordCount in SparkR
And this…
Example… Or you could do this.
load csv -s http -p https://s3.amazonaws.com/ubix.datasets/titanic.csv -h -t passengers
pipe passengers | histogram -c age 10 | plot bar -x start_1 -y age
train lasso -t passengers -n cl_lass_pasngr_surv2 -i20 -s20 -l survived -f passengerclass,gender,siblingorstepsibling,parentorchild,ticketprice,age
pipe passengers | predict -n cl_lass_pasngr_surv2 -c passengerclass,gender,siblingorstepsibling,parentorchild,ticketprice,age | bin -c cl_lass_pasngr_surv2_prediction -k 2 | as pass_surv2
pipe passengers | plot parallel -x passengerkey -y passengerclass,gender,age,ticketprice,survived
“It’s just Spark, right?”
Comparing Ubix with Scala
Operation | Spark (Scala) | Ubix (Statements) | Ubix (Lines)
PageRank | 57 | 2 | 1
WordCount | 5 | 2 | 1
WordCount on a Stream | 174 | 2 | 1
Twitter Popular Tags | 85 | 5 | 1
Training an LR model | 134 | 2 | 1
Visual Analytics
170 functions and growing!
autocorr bin bind blomquistbeta cat columns cor count countby cov create datalibrary datasets density distinct drop ecdf extend fill filter fs
gfilter help histogram history interquartilerange job join kurtosis limit linear load max mean median mediandeviation merge min mode moment operator plot
product print prop quartile quartileskewness rankedmin read save sed sortby spearmanrho split splitbyindex stddev stream streams subscribe sum summary topn train
usage variance workspace zipwithindex
We’ve made it super easy to deploy your own Spark cluster!
Visual Analytics
Good data visualization takes the effort off the brain and puts it on the eyes. Anscombe’s Quartet goes further and demonstrates why visualization is critical to understanding data. Constructed in 1973 by the English statistician Francis Anscombe, the following four datasets have nearly identical basic descriptive statistics, yet plotting them immediately shows how different they are:
Visualizing Data: Anscombe’s Quartet
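The point of the quartet can be checked numerically. The following is a minimal Python sketch (plain standard library, not Ubix code) that computes the mean, sample variance and Pearson correlation for the four published datasets. The helper names (`mean`, `sample_variance`, `pearson`, `summarize`) are illustrative, not part of any Ubix API.

```python
from math import sqrt

# Anscombe's quartet as published in 1973. Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


def summarize(xs, ys):
    """Basic descriptive statistics, rounded to one decimal place."""
    return {
        "mean_x": round(mean(xs), 1),
        "var_x": round(sample_variance(xs), 1),
        "mean_y": round(mean(ys), 1),
        "var_y": round(sample_variance(ys), 1),
        "corr": round(pearson(xs, ys), 1),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, stats)
```

At this precision the four summaries come out the same, which is exactly why the numbers alone cannot distinguish a noisy line, a curve, a line with one outlier, and a vertical cluster with one leverage point.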
Ubix contains the following 28 visualizations in its toolkit, with more being introduced every week:
Ubix Data Visualization
Bubble Treemap Heatmap Tree Sankey Violin
Parallel Axis Scatter Matrix Sunburst Donut Stacked Area Contour
Circular Heat Bullet Dependency Wheel
Hexbin Bivariate Area Sparklines
Ubix Data Visualization (cont’d)
Scatter Plot Icicle Radial Sunburst Line Plot Bar Chart
Pie Chart Candlestick Horizontal
All of Ubix’s charts are interactive, so you can continue to explore your data visually. If displaying stream data, they also refresh so that you can see new data as it comes in.
Machine Learning
Binary Classification (logistic, svm)
Binary classification is a supervised learning problem in which we want to classify entities into one of two distinct categories or labels, e.g., predicting whether or not emails are spam. This problem involves running a learning algorithm on a set of labeled examples, i.e., a set of entities along with their underlying category labels. The algorithm returns a trained model that can then be used to predict the label for new entities whose label is unknown. Ubix supports both Logistic Regression and Linear Support Vector Machine based binary classifiers.

Linear Regression (linear, ridge, lasso)
Linear regression is another classical supervised learning model. In this case, each entity is associated with a numeric label (which can take more than two values, unlike binary classification), and we want to predict labels as closely as possible given numerical features of the entities. Ubix supports classical Linear Regression as well as its L1-regularized (Lasso) and L2-regularized (Ridge) variants.

Clustering (kmeans)
Clustering is an unsupervised learning problem in which we aim to group subsets of entities with one another based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster). Ubix supports a parallelized version of classic K-Means clustering.

Collaborative Filtering (recommender)
Collaborative filtering is commonly used for recommender systems; the classic example is Amazon’s purchase recommendations. These techniques aim to fill in the missing entries of a user-item association matrix based on preferences exhibited by similar users. Ubix supports model-based collaborative filtering via the Alternating Least Squares algorithm.
Machine Learning: Ubix can apply ML functions on Tables and Streams!
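The train-then-predict flow described for binary classification can be sketched in a few lines of plain Python. This is an illustrative logistic regression trained with stochastic gradient descent, not the Ubix or Spark MLlib implementation. The two features (link count and exclamation-mark count per email), the toy data, and the function names are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(examples, labels, learning_rate=0.1, epochs=2000):
    """Run the learning algorithm on labeled examples; return the trained model."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label  # gradient of the log loss w.r.t. score
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def predict(model, features):
    """Use the trained model to label an entity whose label is unknown."""
    weights, bias = model
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical training set: [links, exclamation marks]; 1 = spam, 0 = not spam.
EXAMPLES = [[0, 0], [1, 0], [0, 1], [4, 3], [5, 4], [3, 5]]
LABELS = [0, 0, 0, 1, 1, 1]

MODEL = train_logistic(EXAMPLES, LABELS)
print(predict(MODEL, [5, 5]), predict(MODEL, [0, 0]))
```

The split into `train_logistic` (labeled examples in, model out) and `predict` (model plus new entity in, label out) mirrors the algorithm/model distinction in the text. The Ubix `train` and `predict` commands shown earlier appear to follow the same two-phase pattern.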
Irregular Data
Graph-based Data (GraphX/Pregel)
Today, data processing is hardly complete without running into non-rectangular structured data: graphs. GraphLab and Pregel are quickly becoming the standard for rapid graph processing. By building on GraphX and introducing a host of primitives and functions for operating on graphs, Ubix provides an unparalleled capability to manipulate your data in whatever form it exists, or whatever form is most efficient for processing. Coercion happens automatically when matching graph and tabular data, so you don’t have to worry about it.

Streams (dstreams, akka, caml)
Ubix is designed around streams, but it processes historical or static data just the same. In fact, by changing a single statement, a processing flow can operate on streams or historical data interchangeably. Recognizing the importance of data connectivity, Ubix is integrated with multiple streaming protocols and protocol brokers, so that you can focus on processing the data and not on integrating it.

Unstructured Text (ReX, ElasticSearch/Solr)
Data wrangling and the processing of unstructured text are both key components of data understanding. Ubix supports integration with ElasticSearch/Solr and a host of text processing functions, including regular expression filtering, find/replace, processing of missing or empty values, and more.
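The idea that one processing flow can serve both streams and historical data can be illustrated with Python iterators. This is a conceptual sketch only, not how Ubix or Spark Streaming is implemented. `rolling_mean`, `HISTORIC` and `live_feed` are hypothetical names, and the generator merely stands in for a real streaming source.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(values: Iterable[float], window: int) -> Iterator[float]:
    """Emit the mean of the last `window` values as each new value arrives."""
    recent = deque(maxlen=window)
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)


# Historical data: a finite, already-stored sequence.
HISTORIC = [1.0, 2.0, 3.0, 4.0]


def live_feed() -> Iterator[float]:
    """Stand-in for a stream: yields values one at a time as they 'arrive'."""
    for value in (1.0, 2.0, 3.0, 4.0):
        yield value


# The flow itself is unchanged; only the statement naming the source differs.
from_table = list(rolling_mean(HISTORIC, window=2))
from_stream = list(rolling_mean(live_feed(), window=2))
print(from_table, from_stream)
```

Because the flow consumes values one at a time and never asks for the length of its input, it does not care whether the source is finite or still producing.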
Irregular Data: You can operate on your data in its most convenient representation, and combine data of multiple types.
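The text-processing operations named on this slide (handling of missing or empty values, find/replace, and regular expression filtering) can be sketched with Python's standard `re` module. The function names and sample records below are assumptions for illustration and do not reflect the actual Ubix commands or their options.

```python
import re


def fill_missing(records, default):
    """Replace None or blank entries with a default value."""
    return [default if r is None or not r.strip() else r for r in records]


def find_replace(records, pattern, replacement):
    """Apply a regular-expression substitution to every record."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, r) for r in records]


def regex_filter(records, pattern):
    """Keep only the records in which the pattern is found."""
    compiled = re.compile(pattern)
    return [r for r in records if compiled.search(r)]


# Hypothetical raw log lines with gaps and inconsistent spacing.
RAW = ["ERROR: disk full", None, "info: started", "", "ERROR:  disk   full"]

filled = fill_missing(RAW, "n/a")
normalized = find_replace(filled, r"\s+", " ")  # collapse runs of whitespace
errors = regex_filter(normalized, r"^ERROR")
print(errors)
```

Filling gaps first keeps the later steps simple, since every record is then guaranteed to be a string.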