A Recommendation System Illustrated with Spark


Lauri Niskanen

lauri.i.niskanen@gmail.com

Agenda

- Scala - some nice parts
- Scala - syntax illustrated with a few examples
- Akka framework - short introduction
- Spark framework, concepts and main features
- Recommendation system principles
- A simple recommendation system demo using Spark
- Summary

Scala, some nice parts

- Functional compiled programming with a strong type system
- Currying
- Immutability embraced
- Pattern matching constructs
- Anonymous functions and closures
- Scala collections

Scala syntax - some selected parts

Scala - The must-have Hello World example

val str:String="Hello Scala World"

str split(" ") foreach({l => println(l)})

Hello
Scala
World

Notes

- vals are immutable
- everything is typed, even though types can sometimes be omitted and are inferred by the compiler
- no operators => every operator is a method of some object
- dots are not needed when chaining methods => DSLs arise… (see the sketch below)
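The operator and chaining points can be tried directly in the REPL; a minimal sketch using only standard Scala:

// Every operator is a method call: 1 + 2 is sugar for calling + on Int
val a = 1 + 2                              // infix syntax: 3
val b = (1).+(2)                           // the same call written explicitly: 3
val words = "Hello Scala World" split " "  // same as "Hello Scala World".split(" ")
words foreach println                      // dotless chaining reads like a small DSL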

Scala - Spice up with currying

- functions can have multiple parameter lists, invoked at a later time
- handy to "initialize" with some other function
- Example: calculate distance values on the same data

import scala.math.{sqrt,abs,pow}
import scala.util.Random

case class Pair(x:Double,y:Double)

val euclDist=(a:Pair,b:Pair)=>sqrt(pow(a.x-b.x,2)+pow(a.y-b.y,2))
val manhDist=(a:Pair,b:Pair)=>abs(a.x-b.x)+abs(a.y-b.y)

val R=new Random()
val scoresA=(for (i <- Range(1,10)) yield Pair(R.nextDouble*5.0+0.5,R.nextDouble*5.0+0.5)).toList
val scoresB=(for (i <- Range(1,10)) yield Pair(R.nextDouble*5.0+0.5,R.nextDouble*5.0+0.5)).toList

def setData(x:List[Pair],y:List[Pair])(func:(Pair,Pair)=>Double)={
  x.zip(y).map({e=> func(e._1,e._2)}).sum
}

val distance=setData(scoresA,scoresB)(_)

println("Eucl.Distance: "+distance(euclDist))println("Manh.Distance: "+distance(manhDist))

Eucl.Distance: 21.093889034545143
Manh.Distance: 27.09414385739492

Scala - The RegExp matching art

Nice and simple extractor

import scala.util.matching.Regex

val testStr="5,Father of the Bride Part II (1995),Comedy"

val Movie= """(\d+),([^,]*)(.*)""".r

val movieTitle= testStr match {
  case Movie(id,title,rest)=> title
  case _ => ""
}

println(movieTitle)

Father of the Bride Part II (1995)

Scala - Close up with functional magic

case class Sales(amount:Int,price:Double,product:String)

val data:List[Sales]=List(
  Sales(10,20.5,"Apple"), Sales(21,39.0,"Apple"),
  Sales(10,18.0,"Orange"), Sales(30,27.0,"Orange"))

case class Summary(amount:Int,price:Double)

val averagePricesPerFruit=data
  .groupBy({e=>e.product})
  .mapValues({v=>v.foldLeft(Summary(0,0.0))({(acc,a)=>Summary(acc.amount+a.amount,acc.price+a.price)})})
  .mapValues({s=>s.price/s.amount})

println(averagePricesPerFruit)

Map(Orange -> 1.125, Apple -> 1.9193548387096775)

Is this readable? …well, not really. Let's see the same in stepwise motion.

Same in stepwise motion

case class Sales(amount:Int,price:Double,product:String)

val data:List[Sales]=List(
  Sales(10,20.5,"Apple"), Sales(21,39.0,"Apple"),
  Sales(10,18.0,"Orange"), Sales(30,27.0,"Orange"))

case class Summary(amount:Int,price:Double)

//GroupBy fruit
println(data.groupBy({e=>e.product}))

//Sum the amounts and prices for each fruit
println(data.groupBy({e=>e.product}).mapValues({v=>v.foldLeft(Summary(0,0.0))({(acc,a)=>Summary(acc.amount+a.amount,acc.price+a.price)})}))

//And in the last stage compute the average
println(data.groupBy({e=>e.product}).mapValues({v=>v.foldLeft(Summary(0,0.0))({(acc,a)=>Summary(acc.amount+a.amount,acc.price+a.price)})}).mapValues({s=>s.price/s.amount}))

Map(Orange -> List(Sales(10,18.0,Orange), Sales(30,27.0,Orange)), Apple -> List(Sales(10,20.5,Apple), Sales(21,39.0,Apple)))
Map(Orange -> Summary(40,45.0), Apple -> Summary(31,59.5))
Map(Orange -> 1.125, Apple -> 1.9193548387096775)

Scala - Why functional style and immutability matter

- You cannot bring all big data to your host environment for calculation
- You may need to compute results on partitions (segments) of data on remote machines before collecting the results
- Immutable data structures and closures (essentially functions with data) are passed on to remote partitions to do the work for you
- Strong static type systems are nice for production because runtime errors are limited: inputs and outputs are secured at compile time by strongly typed signatures
- Together these enable good chaining of functions in parallel operations (a sketch follows)
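A hedged sketch of that last point, assuming a SparkContext sc as provided by spark-shell; the values are tiny illustrative stand-ins:

val threshold = 4.0                                          // immutable value captured by the closure
val ratingsRDD = sc.parallelize(Seq(3.5, 4.5, 5.0, 2.0), 4)  // distributed over 4 partitions
// The filter function and the captured threshold are serialized and shipped to each
// partition; only the small count travels back to the driver.
val goodCount = ratingsRDD.filter(_ >= threshold).count()    // 2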

Akka - in a few slides

Akka is an asynchronous Actor Model system (actors encapsulate behavior and state) with message-based communication and a supervision structure inspired by frameworks in the Erlang programming language. Its attractiveness comes from:

Picture by Ryan Knight.

1. Lightweight (no dedicated thread per actor)
2. Isolation (of actors)
3. Transparent restart of an actor upon a failure
4. Messaging across devices or processes

Used under the hood in Spark, among others.

Akka - ping pong, re-starting example

Akka - ping pong ActorSystem and supervisor

import akka.actor.{Actor,ActorRef,ActorSystem,PoisonPill,ExtensionKey,Extension}
import akka.actor.{Props,ExtendedActorSystem}
import com.typesafe.config.{ConfigFactory,ConfigValueFactory}

object Wizard {
  def run()={
    val conf = ConfigFactory.load()
    val system = ActorSystem("ActorSystem")
    val supervisor = system.actorOf(Props(classOf[Supervisor]),name="Controller")
  }
}

class Supervisor extends Actor {
  //Just to retrieve the long address form
  val remoteAddr = RemoteAddressExtension(context.system).address
  val thisPath = self.path.toStringWithAddress(remoteAddr)
  println("\nConstructor: " + thisPath)

  val ponger=context.actorOf(Props(classOf[Ponger]),name="ponger")
  val pinger=context.actorOf(Props(classOf[Pinger],ponger),name="pinger")

  // Actor's message receive handling
  def receive = {
    case "STOP-THE-SYSTEM" =>
      println("STOP-THE-SYSTEM received from "+sender); context.system.shutdown()
    case msg =>
  }

  override def preStart() = { println("preRestart called for " + thisPath) }
  override def postRestart(reason: Throwable) = { println("postRestart called for " + thisPath) }
  override def postStop() = { println("postStop called for " + thisPath) }
}

Akka - ping pong cont

class Pinger(peer:ActorRef) extends Actor {
  val remoteAddr = RemoteAddressExtension(context.system).address
  val thisPath = self.path.toStringWithAddress(remoteAddr)
  println("Constructor: " + thisPath)

peer ! "TEST"

def receive = { case "I-WANT-OUT" => sender ! PoisonPill; context.parent ! "STOP-THE-SYSTEM" case someMsg => println(someMsg + " received from "+sender); sender ! "SHOW-ME-CRASH" }

  override def preStart() = { println("preRestart called for " + thisPath) }
  override def postRestart(reason: Throwable) = { println("postRestart called for " + thisPath) }
  override def postStop() = { println("postStop called for " + thisPath) }

}

Akka - ping pong cont2

class Ponger extends Actor {
  val remoteAddr = RemoteAddressExtension(context.system).address
  val thisPath = self.path.toStringWithAddress(remoteAddr)
  println("Constructor: " + thisPath)

def receive = { case "TEST" => println("TEST received from "+ sender); sender ! "ROGER" case "ONE-MORE" => println("ONE-MORE received from "+sender); sender ! "I-WANT-OUT" case "SHOW-ME-CRASH" => println("SHOW-ME-CRASH received from"+sender); 1/0 }

  override def preStart() = { println("preRestart called for " + thisPath) }
  override def postRestart(reason: Throwable) = {
    println("postRestart called with reason " + reason)
    context.actorSelection("akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/pinger") ! "I-WANT-OUT"
  }
  override def postStop() = { println("postStop called for " + thisPath) }
}

//This is just a means to show the complete path in the system
class RemoteAddressExtensionImpl(system: ExtendedActorSystem) extends Extension {
  def address = system.provider.getDefaultAddress
}

object RemoteAddressExtension extends ExtensionKey[RemoteAddressExtensionImpl]

Akka - config used

akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "127.0.0.1"
      port = 2552
    }
  }
}

Akka - example output

scala> Wizard.run
[INFO] [11/09/2015 19:13:45.200] [run-main-4] [Remoting] Starting remoting
[INFO] [11/09/2015 19:13:45.377] [run-main-4] [Remoting] Remoting started; listening on addresses :[akka.tcp://ActorSystem@127.0.0.1:2552]
[INFO] [11/09/2015 19:13:45.381] [run-main-4] [Remoting] Remoting now listens on addresses: [akka.tcp://ActorSystem@127.0.0.1:2552]
Constructor: akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller
Constructor: akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/ponger
preRestart called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/ponger
Constructor: akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/pinger
preRestart called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller
TEST received from Actor[akka://ActorSystem/user/Controller/pinger#-454979336]
preRestart called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/pinger
ROGER received from Actor[akka://ActorSystem/user/Controller/ponger#-1639700423]
SHOW-ME-CRASH received fromActor[akka://ActorSystem/user/Controller/pinger#-454979336]
postStop called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/ponger
[ERROR] [11/09/2015 19:13:45.405] [ActorSystem-akka.actor.default-dispatcher-2] [akka://ActorSystem/user/Controller/ponger] / by zero
java.lang.ArithmeticException: / by zero
  at Ponger$$anonfun$receive$3.applyOrElse(some.scala:76)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
  at Ponger.aroundReceive(some.scala:68)
Constructor: akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/ponger
postRestart called with reason java.lang.ArithmeticException: / by zero
STOP-THE-SYSTEM received from Actor[akka://ActorSystem/user/Controller/pinger#-454979336]
postStop called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/ponger
postStop called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller/pinger
postStop called for akka.tcp://ActorSystem@127.0.0.1:2552/user/Controller
[INFO] [11/09/2015 19:13:45.427] [ActorSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ActorSystem@127.0.0.1:2552/system/remoting-terminator] Shutting down remote daemon.
[INFO] [11/09/2015 19:13:45.429] [ActorSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ActorSystem@127.0.0.1:2552/system/remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO] [11/09/2015 19:13:45.459] [ActorSystem-akka.actor.default-dispatcher-4] [Remoting] Remoting shut down
[INFO] [11/09/2015 19:13:45.459] [ActorSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ActorSystem@127.0.0.1:2552/system/remoting-terminator] Remoting shut down.

Akka - example summary

- Asynchronous actors are lightweight
- Communication is simple and message based
- Comms work nicely across hosts
- Crash recovery is fast (let it crash)

Akka - Supervisor and actor creation

import akka.actor.{Actor,ActorRef,ActorSystem,PoisonPill,ExtensionKey,Extension}
import akka.actor.{Props,ExtendedActorSystem}
import com.typesafe.config.{ConfigFactory,ConfigValueFactory}

//Object to run the actor system
object Wizard {
  def run()={
    val conf = ConfigFactory.load()
    val system = ActorSystem("PingPongActorSystem")
    val supervisor = system.actorOf(Props(classOf[Supervisor]),name="Controller")
  }
}

// create a supervising actor, called Supervisor here
class Supervisor extends Actor {
  //Just to retrieve the long address form
  val remoteAddr = RemoteAddressExtension(context.system).address
  val thisPath = self.path.toStringWithAddress(remoteAddr)

  //Create a supervised actor
  val ponger=context.actorOf(Props(classOf[Ponger]),name="ponger")

  //Create another supervised actor and
  //pass the reference of the ponger in the constructor
  val pinger=context.actorOf(Props(classOf[Pinger],ponger),name="pinger")

Akka - Message handler and supervision callbacks

  // Actor's message receive handling
  def receive = {
    case "STOP-THE-SYSTEM" =>
      println("STOP-THE-SYSTEM received from "+sender); context.system.shutdown()
    case someMsg =>
      println("message "+someMsg+" received from "+sender)
  }

  //Right after starting the actor, its preStart method is invoked.
  override def preStart() = {
    println("preStart called for " + thisPath)
  }

  //The old actor is informed by calling preRestart with the exception which
  //caused the restart and the message which triggered that exception.
  override def preRestart(reason: Throwable, message: Option[Any]) = {
    println("preRestart called for " + thisPath)
  }

  //The new actor's postRestart method is invoked with the exception which caused the restart.
  override def postRestart(reason: Throwable) = {
    println("postRestart called for " + thisPath)
  }

  //After stopping an actor, its postStop hook is called.
  override def postStop() = {
    println("postStop called for " + thisPath)
  }
}

So what is Spark

- General engine for large-scale data processing
- Parallel and in-memory (or file-based, or a combination) data management system
- Libraries supporting streaming, data frames, SQL, graph analysis and Machine Learning
- APIs for Scala, Java, Python and R
- Written in Scala; utilizes the previously mentioned Akka framework underneath
- Data source support for HDFS, Cassandra, HBase …and good old plain text files

Spark history in 30 seconds

- Started as a research project in the UC Berkeley RAD Lab in 2009
- At introduction time it was already 10-20x faster than Hadoop
- Hadoop MapReduce was not good enough for iterative and interactive development
- Open sourced in 2010
- Became part of the Apache Foundation in 2013

Spark is active

Spark concepts

- Driver connects via a context to a cluster of Spark nodes coordinated by a cluster manager (transparent)
- Application is essentially your code: the driver plus your app's executors
- RDD (Resilient Distributed Dataset) represents immutable data on partitions
- Transformations are done via RDD functions, divided into smaller independent tasks
- Data (copies, mapped dirs, or via a cluster filesystem) and code (right versions) must exist on all cluster nodes
- A job is split into stages and further into parallel tasks

Spark Operations - Good to understand

Transformations

- operate on your RDDs on the cluster and return new RDDs
- not cached (cleared after usage) unless specifically cached (for reuse in the next operation)

Actions

- will actually execute the transformations
- RETURN DATA …make sure you know the expected size of the result

Lazy evaluation

All operations are lazy, i.e. only executed when needed, not at the definition point (a sketch follows).
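A minimal spark-shell sketch of lazy evaluation (the file name is illustrative):

val lines = sc.textFile("ratings.csv")        // transformation: nothing is read yet
val fives = lines.filter(_.contains(",5.0,")) // transformation: only the lineage is recorded
val n = fives.count()                         // action: only now is the file read and the filter run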

Spark Major Data Structures

RDD (Resilient Distributed Dataset), with transformations and actions such as the following (a short sketch follows the list):

- map - apply a function to each element in the RDD, return a new RDD
- flatMap - returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results
- intersection - returns an RDD with the common elements found in both RDDs
- collect - return all elements from the RDD
- reduce - combine the elements of the RDD together in parallel
- take - returns a number of elements from the RDD
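A small sketch of these in spark-shell (sc is the SparkContext provided by the shell):

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
nums.map(_ * 2).collect()                                          // Array(2, 4, 6, 8, 10)
sc.parallelize(List("a b", "c d")).flatMap(_.split(" ")).collect() // Array(a, b, c, d)
nums.intersection(sc.parallelize(List(3, 4, 9))).collect()         // Array(3, 4), order may vary
nums.reduce(_ + _)                                                 // 15
nums.take(2)                                                       // Array(1, 2)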

pairRDD is a key-value version of RDD with additional functions (sketched below):

- mapValues - applies a function to each value without changing the key
- keys - a new RDD of the keys in the given RDD
- join - a new RDD of the inner join of two RDDs
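A small spark-shell sketch; an RDD of tuples acts as a pairRDD automatically (the ids and titles are illustrative):

val ratingsByMovie = sc.parallelize(Seq((1, 4.0), (2, 3.5)))  // (movieId, rating)
val titlesByMovie  = sc.parallelize(Seq((1, "Toy Story (1995)"), (2, "Jumanji (1995)")))
ratingsByMovie.mapValues(_ / 5.0).collect()  // Array((1,0.8), (2,0.7)) - keys unchanged
ratingsByMovie.keys.collect()                // Array(1, 2)
titlesByMovie.join(ratingsByMovie).collect() // inner join on movieId: Array((1,(Toy Story (1995),4.0)), (2,(Jumanji (1995),3.5)))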

Data frames are the equivalent of relational tables in Spark SQL

An SQL-ish access style is supported; not covered in this presentation

For distributed environments there are Accumulator and Broadcast variables as well (a sketch follows).
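A hedged sketch using the Spark 1.x API shown elsewhere in these slides; the data is illustrative:

// Accumulator: workers may only add to it; the driver reads the result
val misses = sc.accumulator(0)
// Broadcast: a read-only value shipped once to every node
val titlesById = sc.broadcast(Map(1 -> "Toy Story (1995)"))
sc.parallelize(Seq(1, 2, 3)).foreach { id =>
  if (!titlesById.value.contains(id)) misses += 1
}
println(misses.value) // 2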

Spark job execution

Picture by Ashwini Kuntamukkala, Software Architect, SciSpike.

Configuring your Spark

/home/lniskanen/Spark/spark
├── bin
│   ├── spark-class
│   ├── spark-class2.cmd
│   ├── spark-class.cmd
│   ├── sparkR
│   ├── sparkR2.cmd
│   ├── sparkR.cmd
│   ├── spark-shell
│   ├── spark-shell2.cmd
│   ├── spark-shell.cmd
│   ├── spark-sql
│   ├── spark-submit
│   ├── spark-submit2.cmd
│   └── spark-submit.cmd
├── conf
│   ├── slaves
│   ├── slaves.template
│   ├── spark-defaults.conf.template
│   ├── spark-env.sh
│   └── spark-env.sh.template
└── sbin
    ├── slaves.sh
    ├── spark-config.sh
    ├── spark-daemon.sh
    ├── spark-daemons.sh
    ├── start-all.sh
    ├── start-history-server.sh
    ├── start-master.sh
    ├── start-mesos-dispatcher.sh
    ├── start-mesos-shuffle-service.sh
    ├── start-shuffle-service.sh
    ├── start-slave.sh
    ├── start-slaves.sh
    └── start-thriftserver.sh

3 directories, 31 files

Fire up spark cluster and Spark UI

$SPARK_HOME/sbin/start-all.sh

To view cluster and job status in your browser, open the Spark UI on your local host, port 8080:

http://127.0.1.1:8080/

Starting spark-shell (REPL)

lniskanen@Machine:~/Spark/spark$ ./bin/spark-shell --master spark://Machine:7077 --jars /home/lniskanen/ScalaApps/RecommendationSystem/recommender/target/scala-2.11/recommender_2.11-0.0.1.jar,/home/lniskanen/ScalaApps/RecommendationSystem/common/target/scala-2.11/common_2.11-0.0.1.jar
Spark context available as sc.
SQL context available as sqlContext.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.<TAB>
accumulable            clearCallSite                 getLocalProperty     master              setLocalProperty
accumulableCollection  clearFiles                    getPersistentRDDs    metricsSystem       setLogLevel
accumulator            clearJars                     getPoolForName       newAPIHadoopFile    sparkUser
addFile                clearJobGroup                 getRDDStorageInfo    newAPIHadoopRDD     startTime
addJar                 defaultMinPartitions          getSchedulingMode    objectFile          statusTracker
addSparkListener       defaultMinSplits              hadoopConfiguration  parallelize         stop
appName                defaultParallelism            hadoopFile           range               submitJob
applicationAttemptId   emptyRDD                      hadoopRDD            requestExecutors    tachyonFolderName
applicationId          externalBlockStoreFolderName  initLocalProperties  runApproximateJob   textFile
asInstanceOf           files                         isInstanceOf         runJob              toString
binaryFiles            getAllPools                   isLocal              sequenceFile        union
binaryRecords          getCheckpointDir              jars                 setCallSite         version
broadcast              getConf                       killExecutor         setCheckpointDir    wholeTextFiles
cancelAllJobs          getExecutorMemoryStatus       killExecutors        setJobDescription
cancelJobGroup         getExecutorStorageStatus      makeRDD              setJobGroup

scala> sc. |

SparkUI - example

Recommendation systems very shortly

According to Wikipedia:

"Recommender systems or recommendation systems (sometimes replacing 'system' with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item."

For separation, use some metric.

Distance:

$$\mathrm{Dist}_{Eucl} = \sqrt{\sum_i^n (x_i - y_i)^2}$$

Similarities:

$$\mathrm{Similarity}_{Eucl} = \frac{1}{1 + \sqrt{\sum_i^n (x_i - y_i)^2}}$$

$$\mathrm{Similarity}_{Pearson} = \frac{\sum_i^n X_i Y_i - \dfrac{\sum_i^n X_i \, \sum_i^n Y_i}{N}}{\sqrt{\left(\sum_i^n X_i^2 - \dfrac{\left(\sum_i^n X_i\right)^2}{N}\right)\left(\sum_i^n Y_i^2 - \dfrac{\left(\sum_i^n Y_i\right)^2}{N}\right)}}$$

Movie dataset from GroupLens

- Movie dataset kindly provided by the University of Minnesota
- GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota

It contains 20000263 ratings (only the first 100k rows are used here) and 465564 tag applications across 27278 movies. All selected users had rated at least 20 movies.

http://grouplens.org/datasets/
http://files.grouplens.org/datasets/movielens/ml-20m-README.html

Data set structure

Using only:

movies.csv

- title of the movie
- genre

ratings.csv

- one rating of one movie by one user, at least 20 ratings per user
- ratings (0.5 stars - 5.0 stars)

Movies data

import scala.io.Source

val movies="/home/lniskanen/Ammatti/DataScienceMeetup/2015_December/ml-20m/movies.csv"

val movieIter = io.Source.fromFile(movies).getLines()

movieIter.take(10).foreach(println)

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action

Ratings data

val ratings="/home/lniskanen/Ammatti/DataScienceMeetup/2015_December/ml-20m/ratings.csv"
val ratingsIter = io.Source.fromFile(ratings).getLines()
ratingsIter.take(10).foreach(println)

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

Project structure for the demo

/home/lniskanen/ScalaApps/RecommendationSystem
├── build.sbt
├── common
│   ├── src
│   │   └── main
│   │       └── scala
│   └── target
│       └── scala-2.11
│           ├── common_2.11-0.0.1.jar
│           └── common-assembly-0.0.1.jar
├── project
│   ├── build.properties
│   ├── Build.scala
│   └── target
├── recommender
│   ├── src
│   │   └── main
│   │       ├── resources
│   │       └── scala
│   └── target
│       └── scala-2.11
│           ├── recommender_2.11-0.0.1.jar
│           └── recommender-assembly-0.0.1.jar
├── src
│   └── main
│       ├── resources
│       └── scala
│           └── recommBootstrapper.scala
└── target
    └── scala-2.11
        ├── rmsbootstrapper_2.11-0.0.1.jar
        └── rmsBootstrapper-uber.jar

21 directories, 10 files

Project component relations

Data types used here

abstract sealed trait MovieType

case class Movie(movieId:Int,movie:String,genres:String) extends MovieType

case class Rating(userId:Int,movieId:Int,rating:Double,timeStamp:Long) extends MovieType

Case class is a construct in Scala for which the Scala compiler:

- prefixes parameters with val (immutable)
- generates equals and hashCode for object comparison
- also generates copy, and a companion object with apply and unapply methods

Using sealed in Scala means that all subtypes of the trait must be defined in this same file.
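A small sketch exercising the generated members of the Movie case class above:

val m1 = Movie(1, "Toy Story (1995)", "Comedy")  // apply: no 'new' needed
val m2 = m1.copy(genres = "Animation")           // copy with one field changed
m1 == Movie(1, "Toy Story (1995)", "Comedy")     // true: generated structural equals
val Movie(id, title, _) = m2                     // unapply at work in a pattern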

Parsing a CSV file

package recommender

import java.io.{File}
import scala.io.{Source,BufferedSource}

// The constructor is initialized with a parser function that
// accepts a string and returns some abstract type Option[T]
class CsvIterator[T](file:File,parse:(String)=>Option[T]) extends Iterator[Option[T]] {

  val bufSource:BufferedSource=Source.fromFile(file.getAbsoluteFile())
  val linesIter=bufSource.getLines()

  override def hasNext:Boolean={ linesIter.hasNext }

  override def next():Option[T]={ parse(linesIter.next()) }
}

Ready-packaged CSV readers are also available, such as OpenCsv in the Maven repository.

Loading the movies data into the Spark cluster

def loadMovies(filePath:String):Unit={
  // just replace the linux prefix ~ (if any) with the actual home directory path
  val path = filePath.replaceFirst("^~",System.getProperty("user.home"))

  //Loading data as an RDD; movies is a var of type Option[RDD[Movie]] defined elsewhere
  movies = Some(sc.makeRDD(
      new CsvIterator[Movie](new File(path),str2Movies)
        .filter({e=> e != None}).map({c=>c.get}).toSeq)
    .cache())

  println("movies loading finished...")
}

// Csv parsing function
def str2Movies(str:String):Option[Movie]={
  val MovieReg="""(\d+),(.+),(.+)""".r
  str match {
    case MovieReg(id,m,g) => Some(Movie(id.toInt,m,g))
    case _ => None
  }
}

In Scala, Option[T] is a container for an optional value of type T. If the value of type T is present, the Option[T] is an instance of Some[T], containing the present value of type T. If the value is absent, the Option[T] is the object None.
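A tiny sketch of working with Option values:

val present: Option[Int] = Some(42)
val absent:  Option[Int] = None
present.map(_ + 1)   // Some(43): the function is applied inside the container
absent.getOrElse(0)  // 0: a safe default instead of a null check
present.get          // 42: .get throws on None, hence the filter before .get in loadMovies above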

Example snippets - similarity

/**
 * Calculate euclidean "similarity" for the ratings of two users
 *
 * @param uid1 Rating.userId
 * @param uid2 Rating.userId
 * @return Similarity
 */
def eucl_sim(uid1:Int,uid2:Int):Double={

  // search records for uid1 and uid2, groupBy the movieId field
  val uid1Ratings=ratings.get.filter({i=> i.userId==uid1}).groupBy({r=>r.movieId})
  val uid2Ratings=ratings.get.filter({i=> i.userId==uid2}).groupBy({r=>r.movieId})
  // join ratings by movieId
  val commonRatings=uid1Ratings.join(uid2Ratings)

  val N=commonRatings.keys.count

  val dist = N match {
    case 0 => 10000 // no ratings in common, some arbitrary high distance
    case _ => commonRatings.mapValues({v=>pow(v._1.head.rating-v._2.head.rating,2)}).values.sum
  }

  //Return similarity
  1.0/(1.0+dist)
}

Example snippets - get common ratings

/**
 * Movies by uid2 not rated by uid1.
 *
 * @return count of items only in uid1, only in uid2, and common to both
 */
def disjointCommon(uid1:Int,uid2:Int):(Long,Long,Long)={
  // search records for uid1 and uid2, groupBy the movieId field
  val uid1Ratings=ratings.get.filter({i=> i.userId==uid1}).groupBy({r=>r.movieId}).cache()
  val uid2Ratings=ratings.get.filter({i=> i.userId==uid2}).groupBy({r=>r.movieId}).cache()

val notInUid1=uid2Ratings.subtractByKey(uid1Ratings).keys.count

val notInUid2=uid1Ratings.subtractByKey(uid2Ratings).keys.count

val common=uid1Ratings.join(uid2Ratings).map({k=>k._1}).collect.length

  (notInUid2,notInUid1,common)
}

Example snippets - show closest

/**
 * Get the number of common ratings between uid1 and uid2
 *
 * @param uid1 Rating.userId
 * @param uid2 Rating.userId
 */
def commonRatings(uid1:Int,uid2:Int):Long={
  lazy val uid1Ratings=ratings.get.filter({i=> i.userId==uid1}).groupBy({r=>r.movieId})
  lazy val uid2Ratings=ratings.get.filter({i=> i.userId==uid2}).groupBy({r=>r.movieId})
  uid1Ratings.join(uid2Ratings).map({k=>k._1}).collect.length
}

/**
 * Display the most similar movie rater compared to uid, based on euclidean similarity,
 * where the compared rater has at least 8 movies in common with the uid rater.
 *
 * @param uid Rating.userId
 */
def closest(uid:Int):Unit={
  val uids=ratings.get.groupBy({r=>r.userId}).keys.filter({i=>i != uid}).collect
  val m=uids.filter({k=> commonRatings(uid,k) > 7}).map({u=> (u,eucl_sim(uid,u))})

val max = m.maxBy({e=>e._2})

println("other recommenders: "+uids+ " and closest uid is "+ max) }

Example snippets - show movie recommendations

/**
 * Movies rated by uid2 but NOT rated by uid1.
 */
def notRatedByUid1(uid1:Int,uid2:Int):Array[Int]={
  // search records for uid1 and uid2, groupBy the movieId field
  val uid1Ratings=ratings.get.filter({i=> i.userId==uid1}).groupBy({r=>r.movieId})
  val uid2Ratings=ratings.get.filter({i=> i.userId==uid2}).groupBy({r=>r.movieId})

uid2Ratings.subtractByKey(uid1Ratings).keys.collect.toArray

}

/**
 * Returns a list of movie names for a given list of movieId's
 */
def movieNames(movieIds:Array[Int]):Array[String]={

val mkeys=movies.get.groupBy(_.movieId)

  val res:Array[String]=movieIds.map({id=>mkeys.lookup(id)})
    .map({s=> s.map({i=>i.toArray}).flatten })
    .flatten.map(_.movie).toArray

  res
}

def showMovieProposals(uid1:Int,uid2:Int):Unit={
  movieNames(notRatedByUid1(uid1,uid2)).foreach(println)
}

Demo - using a command line application

You can bundle Scala client apps into fat jars with all dependencies included, and then launch them from the command line with java.

~/ScalaApps/RecommendationSystem/target/scala-2.11$ java -Dlog4j.configuration=file:/home/lniskanen/Spark/spark/conf/log4j.properties -jar ./rmsBootstrapper-uber.jar
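A minimal sketch of one plausible way to produce such a fat jar with the sbt-assembly plugin (the plugin version and the "provided" scoping are assumptions, not taken from the demo's actual build files):

// project/plugins.sbt - bring in the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

// build.sbt
name := "rmsBootstrapper"
scalaVersion := "2.11.7"
// Spark is already on the cluster, so mark it "provided" to keep it out of the jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"

Running sbt assembly then produces the bundled jar under target/scala-2.11/.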

Demo - using spark-shell

./bin/spark-shell --master <master address> --jars <some1.jar>,<some2.jar>

./bin/spark-shell --master spark://Machine:7077 --jars /home/lniskanen/ScalaApps/RecommendationSystem/recommender/target/scala-2.11/recommender_2.11-0.0.1.jar,/home/lniskanen/ScalaApps/RecommendationSystem/common/target/scala-2.11/common_2.11-0.0.1.jar

Summary

- A glimpse of Scala, and reasons to use it in big data environments
- Basics of Akka, the framework under the hood of Spark
- Spark basic concepts
- How to get started with a Spark cluster and spark-shell
- Loading data into Spark and making basic queries
- A very basic example of a recommendation system application

In my trainings you can further learn how to

- get a user's expected rating for a given unwatched movie (product)
- find similar movies
- do error analysis for predictions and iterate towards better algorithms

Investigating GPU acceleration

- Accelerating calculations with GPUs using Libra SDK from Gpu Systems
- Libra SDK provides a GPU-agnostic interface to GPU programming

Gpu Systems is a realtime compute provider and accelerates computations using modern accelerators such as CPUs, GPUs and future devices while simplifying implementations.

A short list of my consultation and training services

- Getting started with ML projects - How to do data intelligence projects
- Predictive Analytics - Elaborated hands-on examples using R
- Recommendation Systems - Techniques behind the Netflix success
- Programming in R - The de facto data science programming language
- Programming in Scala - A functional and object-flavored language for reliable, scalable systems
- Reactive programming - Using the Akka framework with Scala
- Fast access to big data using in-memory technologies - Spark with Scala
- Becoming a Data Scientist - Competencies and Career Planning

Who am I

- Big Data SW Product Architect and Machine Learning, R and Scala programming consultant
- Symbio 2009-2014: Presales, Business Director, Global IT Director
- Ardites 2005-2009: Founder of Tampere operations, SW development business
- Nokia 1996-2005: Mobile phones R&D; Cellular Testing, SW Architect, SW Release Manager, Technology Director

LinkedIn: Lauri Niskanen
Intelligentpipe Oy

- FOR MORE INSIGHTS