Scala - THE language for Big Data

Posted on 15-Apr-2017



private static class Person {
    String firstName;
    String lastName;
}

private List<Person> firstNFamilies(int n, List<Person> persons) {
    final List<String> familiesSoFar = new LinkedList<>();
    final List<Person> result = new LinkedList<>();
    for (Person p : persons) {
        if (familiesSoFar.contains(p.lastName)) {
            result.add(p);
        } else if (familiesSoFar.size() < n) {
            familiesSoFar.add(p.lastName);
            result.add(p);
        }
    }
    return result;
}

case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  persons.filter(p => firstFamilies.contains(p.lastName))
}
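A minimal sketch of how the Scala version might be called; the sample names below are illustrative, not from the slides:

```scala
case class Person(firstName: String, lastName: String)

def firstNFamilies(n: Int, persons: List[Person]): List[Person] = {
  val firstFamilies = persons.map(p => p.lastName).distinct.take(n)
  persons.filter(p => firstFamilies.contains(p.lastName))
}

// Illustrative data: three families; we keep everyone belonging to the
// first two distinct last names encountered (Lovelace and Turing).
val people = List(
  Person("Ada", "Lovelace"),
  Person("Alan", "Turing"),
  Person("Grace", "Hopper"),
  Person("Byron", "Lovelace")
)

val result = firstNFamilies(2, people)
// result: List(Person(Ada,Lovelace), Person(Alan,Turing), Person(Byron,Lovelace))
```

Note how the Scala version expresses the same logic as the Java loop above in two declarative steps: collect the first n distinct last names, then keep the persons whose last name is among them.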

class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext) extends ParquetOutputCommitter(outputPath, context) { … }

Java class from org.apache.parquet:parquet-hadoop

Scala class from org.apache.spark:spark-core_2.10
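The point of the snippet above is Java interoperability: a Scala class can directly subclass a Java class from a different artifact. A self-contained sketch of the same pattern using a JDK class (the names here are illustrative, not taken from the Spark or Parquet sources):

```scala
// A Scala class extending the Java class java.io.ByteArrayOutputStream and
// overriding one of its methods — the same mechanism that lets Spark's
// DirectParquetOutputCommitter (Scala) extend ParquetOutputCommitter (Java).
class CountingStream extends java.io.ByteArrayOutputStream {
  var writes: Int = 0

  override def write(b: Int): Unit = {
    writes += 1
    super.write(b)
  }
}

val s = new CountingStream
s.write('h')
s.write('i')
// s.writes == 2, s.toString == "hi"
```

Scala compiles to ordinary JVM bytecode, so extending a Java class, implementing a Java interface, or calling Java methods requires no wrappers or glue code.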

http://vmturbo.com/wp-content/uploads/2015/05/ScaleUpScaleOut_sm-min.jpg


val numbers = 1 to 100000
val result = numbers.map(slowF)

val numbers = 1 to 100000
val result = numbers.par.map(slowF)

Parallelizes subsequent operations over the available CPU cores
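A runnable sketch of the `.par` idea; `slowF` is not defined on the slides, so the version below is a placeholder that merely simulates an expensive per-element computation (on Scala 2.13+, `.par` requires the separate scala-parallel-collections module):

```scala
// Placeholder for an expensive per-element function (illustrative).
def slowF(n: Int): Int = { Thread.sleep(1); n * 2 }

val numbers = 1 to 100

// Sequential: elements are processed one after another on a single thread.
val sequential = numbers.map(slowF)

// Parallel: .par wraps the range in a parallel collection, so map fans the
// work out across the available CPU cores.
val parallel = numbers.par.map(slowF)

// Element order and contents are preserved despite parallel execution.
assert(sequential.toList == parallel.toList)
```

The API is unchanged: the only difference between the two calls is `.par`, which is what makes parallel collections such a low-friction first step before moving to a cluster.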

val numbers = 1 to 100000
val result = sparkContext.parallelize(numbers).map(slowF)

Parallelizes subsequent operations across a scalable cluster by creating a Spark RDD, a Resilient Distributed Dataset

[Diagram: several Map tasks running in parallel; a failed Map task is retried]