2014.06.24.what is ubix

Date post: 27-Jan-2015
Upload: jim-cooley
Description:
"What is Ubix" - Ubix design preview.
Transcript
Page 1: 2014.06.24.what is ubix

What is Ubix

In Three Minutes…

Page 2: 2014.06.24.what is ubix

Visual Analytics

Pure Spark and More

Machine Learning

Irregular Data

Page 3: 2014.06.24.what is ubix

Integrated Platform

Pure Spark and More

Ubix is an integrated platform built on Apache Spark. It lets you take data from a variety of sources and run it through multiple analytics steps and transformations, producing powerful interactive visualizations of historical and streaming data.

[Diagram: data from HDFS, S3, in-memory RDDs, and streaming sources flows through Transform and Analytics steps to HDFS, S3, in-memory RDDs, and visualizations.]

Page 4: 2014.06.24.what is ubix

“Why don’t I build it myself?”

Pure Spark and More

You could do this…

Page 5: 2014.06.24.what is ubix

“It’s just Spark, right?”

Pure Spark and More

package org.apache.spark.examples

import java.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

/**
  * Usage: GroupByTest [numMappers] [numKVPairs] [KeySize] [numReducers]
  */
object GroupByTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("GroupBy Test")
    var numMappers = if (args.length > 0) args(0).toInt else 2
    var numKVPairs = if (args.length > 1) args(1).toInt else 1000
    var valSize = if (args.length > 2) args(2).toInt else 1000
    var numReducers = if (args.length > 3) args(3).toInt else numMappers

    val sc = new SparkContext(sparkConf)

    val pairs1 = sc.parallelize(0 until numMappers, numMappers).flatMap { p =>
      val ranGen = new Random
      var arr1 = new Array[(Int, Array[Byte])](numKVPairs)
      for (i <- 0 until numKVPairs) {
        val byteArr = new Array[Byte](valSize)
        ranGen.nextBytes(byteArr)
        arr1(i) = (ranGen.nextInt(Int.MaxValue), byteArr)
      }
      arr1
    }.cache
    // Enforce that everything has been calculated and in cache
    pairs1.count

    println(pairs1.groupByKey(numReducers).count)

    sc.stop()
  }
}

GroupBy on Spark

library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 2) {
  print("Usage: wordcount <master> <file>")
  q("no")
}

# Initialize Spark context
sc <- sparkR.init(args[[1]], "RwordCount")
lines <- textFile(sc, args[[2]])

words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words, function(word) { list(word, 1L) })

counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}

WordCount in SparkR

And this…

Page 6: 2014.06.24.what is ubix

Example… Or you could do this.

Pure Spark and More

load csv -s http -p https://s3.amazonaws.com/ubix.datasets/titanic.csv -h -t passengers

pipe passengers | histogram -c age 10 | plot bar -x start_1 -y age

train lasso -t passengers -n cl_lass_pasngr_surv2 -i20 -s20 -l survived -f passengerclass,gender,siblingorstepsibling,parentorchild,ticketprice,age

pipe passengers | predict -n cl_lass_pasngr_surv2 -c passengerclass,gender,siblingorstepsibling,parentorchild,ticketprice,age | bin -c cl_lass_pasngr_surv2_prediction -k 2 | as pass_surv2

pipe passengers | plot parallel -x passengerkey -y passengerclass,gender,age,ticketprice,survived
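The `histogram -c age 10` step above appears to group the `age` column into 10 bins before plotting. The deck does not document the exact semantics, so the following is only a sketch of equal-width binning in plain Python. The function name `equal_width_histogram` and the sample ages are assumptions made for illustration, not Ubix APIs or Titanic data.

```python
def equal_width_histogram(values, bins):
    """Count values into `bins` equal-width intervals spanning [min, max]."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    # Each row is (bin start, bin end, count).
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(bins)]


# Hypothetical ages, not taken from the Titanic dataset.
ages = [2, 18, 22, 22, 25, 31, 38, 40, 54, 62]
histogram = equal_width_histogram(ages, 10)
for start, end, count in histogram:
    print(f"{start:5.1f} - {end:5.1f}: {count}")
```

Each row carries a bin start and a count, which is roughly the shape a bar plot needs for its x and y axes.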

Page 7: 2014.06.24.what is ubix

“It’s just Spark, right?”

Pure Spark and More

Comparing Ubix with Scala

                         Spark    Ubix
Operation                Scala    Statements    Lines
PageRank                 57       2             1
WordCount                5        2             1
WordCount on a Stream    174      2             1
Twitter Popular Tags     85       5             1
Training an LR model     134      2             1

Page 8: 2014.06.24.what is ubix

Visual Analytics

170 functions and growing!

autocorr bin bind blomquistbeta cat columns cor count countby cov create datalibrary datasets density distinct drop ecdf extend fill filter fs

gfilter help histogram history interquartilerange job join kurtosis limit linear load max mean median mediandeviation merge min mode moment operator plot

product print prop quartile quartileskewness rankedmin read save sed sortby spearmanrho split splitbyindex stddev stream streams subscribe sum summary topn train

usage variance workspace zipwithindex

Page 9: 2014.06.24.what is ubix

We’ve made it super easy to deploy your own Spark cluster!

Page 10: 2014.06.24.what is ubix

Visual Analytics

Pure Spark and More

Machine Learning

Irregular Data

Page 11: 2014.06.24.what is ubix

Visual Analytics

Good data visualization takes the effort off the brain and puts it on the eyes. Anscombe’s Quartet goes further and demonstrates why visualization is critical to understanding data. Constructed in 1973 by the English statistician Francis Anscombe, its four datasets have nearly identical basic descriptive statistics, yet plotting them immediately shows how different they are:

Visualizing Data: Anscombe’s Quartet
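The point of the quartet can be checked numerically. The sketch below uses the published values of Anscombe's four datasets and only the Python standard library; the helper name `pearson` and the `summary` dictionary are illustrative choices, not part of Ubix.

```python
from statistics import mean, variance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Datasets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in QUARTET.items():
    summary[name] = (mean(xs), variance(xs), mean(ys), variance(ys), pearson(xs, ys))
    mx, vx, my, vy, r = summary[name]
    print(f"{name:>3}: mean x={mx:.2f} var x={vx:.2f} "
          f"mean y={my:.2f} var y={vy:.2f} r={r:.3f}")
```

All four rows print almost the same numbers, which is why the summary statistics alone cannot distinguish the datasets and a plot is needed.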

Page 12: 2014.06.24.what is ubix

Visual Analytics

Ubix contains the following 28 visualizations in its toolkit, with more being introduced every week:

Ubix Data Visualization

Bubble Treemap Heatmap Tree Sankey Violin

Parallel Axis Scatter Matrix Sunburst Donut Stacked Area Contour

Circular Heat Bullet Dependency Wheel

Hexbin Bivariate Area Sparklines

Page 13: 2014.06.24.what is ubix

Visual Analytics

Ubix Data Visualization (cont’d)

Scatter Plot Icicle Radial Sunburst Line Plot Bar Chart

Pie Chart Candlestick Horizontal

All of Ubix’s charts are interactive, so you can continue to explore your data visually. If displaying stream data, they also refresh so that you can see new data as it comes in.

Page 14: 2014.06.24.what is ubix

Visual Analytics

Pure Spark and More

Machine Learning

Irregular Data

Page 15: 2014.06.24.what is ubix

Machine Learning

Binary Classification (logistic, svm)
Binary classification is a supervised learning problem in which we want to classify entities into one of two distinct categories or labels, e.g., predicting whether or not emails are spam. This problem involves executing a learning algorithm on a set of labeled examples, i.e., a set of entities along with their underlying category labels. The algorithm returns a trained model that can then be used to predict the label for new entities whose underlying label is unknown. Ubix supports binary classifiers based on both Logistic Regression and linear Support Vector Machines.

Linear Regression (linear, ridge, lasso)
Linear regression is another classical supervised learning model. In this case, each entity is associated with a numeric label (which can have more than two values, unlike binary classification), and we want to predict labels as closely as possible given numerical features of the entities. Ubix supports classical Linear Regression as well as its L1-regularized (Lasso) and L2-regularized (Ridge) variants.

Clustering (kmeans)
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster). Ubix supports a parallelized version of the classic K-Means clustering.

Collaborative Filtering (recommender)
Collaborative filtering is commonly used for recommender systems; the classic example is Amazon’s purchase recommendations. These techniques aim to fill in the missing entries of a user-item association matrix based on preferences exhibited by similar users. Ubix supports model-based collaborative filtering via the Alternating Least Squares algorithm.

Machine Learning: Ubix can apply ML functions on Tables and Streams!
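As a rough illustration of the clustering idea described above, here is a single-machine sketch of classic K-Means (Lloyd's algorithm) in plain Python. It is not the parallelized implementation the deck refers to, which is not shown. The function name `kmeans`, the sample points, and the starting centroids are assumptions made for this example.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignments = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments


# Two visually separate groups of hypothetical points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(assignments)
```

Passing the starting centroids explicitly keeps the run deterministic; production implementations usually choose them randomly or with a seeding scheme.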

Page 16: 2014.06.24.what is ubix

Visual Analytics

Pure Spark and More

Machine Learning

Irregular Data

Page 17: 2014.06.24.what is ubix

Irregular Data

Graph-based Data (GraphX/Pregel)
Today, data processing is hardly complete without running into non-rectangular structured data: graphs. GraphLab and Pregel are quickly becoming the standard for rapid graph processing. Utilizing GraphX and introducing a host of primitives and functions for operating on graphs, Ubix provides an unparalleled capability to manipulate your data in whatever form it exists, or whatever form is most efficient for processing. Coercion happens automatically when matching graph and tabular data, so you don’t have to worry about it.

Streams (dstreams, akka, caml)
Ubix is designed around streams, but we process historical or static data just the same. In fact, by changing a single statement, the processing flows can operate on streams or historic data interchangeably. Recognizing the importance of data connectivity, Ubix is integrated with multiple streaming protocols and protocol brokers, so that you can focus on processing the data and not on integrating it.

Unstructured Text (ReX, ElasticSearch/Solr)
Data wrangling and the processing of unstructured text are both key components of data understanding. Ubix supports integration with ElasticSearch/Solr and a host of text processing functions, including regular expression filtering, find/replace, processing of missing or empty values, and more.

Irregular Data: You can operate on your data in its most convenient representation, and combine data of multiple types.
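The idea that one processing flow can run over historic data or a stream, with only the source statement changing, can be pictured with Python iterables. This is a sketch under that analogy only: the names `clean`, `keep_matching`, `pipeline`, and `live_stream` are hypothetical and are not Ubix or Spark APIs. It also touches the text functions mentioned above (regular expression filtering, find/replace, filling missing or empty values).

```python
import re
from typing import Iterable, Iterator, Optional


def clean(records: Iterable[Optional[str]], fill: str = "unknown") -> Iterator[str]:
    """Fill missing or empty values and collapse runs of whitespace."""
    for record in records:
        if record is None or not record.strip():
            yield fill
        else:
            # Find/replace: any run of whitespace becomes a single space.
            yield re.sub(r"\s+", " ", record.strip())


def keep_matching(records: Iterable[str], pattern: str) -> Iterator[str]:
    """Keep only the records that match the regular expression."""
    compiled = re.compile(pattern)
    return (r for r in records if compiled.search(r))


def pipeline(source: Iterable[Optional[str]]) -> Iterator[str]:
    """The processing flow; it does not care where `source` comes from."""
    return keep_matching(clean(source), r"^(error|unknown)")


# Historic data: a list that is fully available up front.
historic = ["error  disk full", "", "info ok", None, "error   timeout"]


def live_stream() -> Iterator[Optional[str]]:
    """Stand-in for a streaming source that yields records one at a time."""
    yield from historic


batch_result = list(pipeline(historic))
stream_result = list(pipeline(live_stream()))  # only the source changed
print(batch_result)
print(stream_result)
```

Because every stage consumes and produces iterators, records are processed as they arrive, and the flow itself never needs to know whether the source is finite.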

