Introduction to Big Data
Transcript
  • 1

    Introduction to Big Data
    Daniel Hagimont

    [email protected]

  • 2

    Context

    We generate more and more data
    Individuals and companies
    KB → MB → GB → TB → PB → EB → ZB → YB → ???

    A few numbers
    In 2013, Twitter generated 7 TB per day and Facebook 10 TB
    The Square Kilometre Array radio telescope produces 7 PB of raw data per second, 50 TB of analyzed data per day
    Airbus generates 40 TB for each plane test

    Digital data created worldwide
    2010: 1.2 ZB / 2011: 1.8 ZB / 2012: 2.8 ZB / 2020: 40 ZB
    90% of data were created in the last 2 years

  • 3

    Context

    Many data sources
    Multiplication of computing devices and connected electronic equipment
    Geolocation, e-commerce, social networks, logs, Internet of Things …

    Many data formats
    Structured and unstructured data

  • 4

    Applications domains

    Scientific applications (biology, climate …)
    E-commerce (recommendation)
    Equipment supervision (e.g. energy)
    Predictive maintenance (e.g. airlines)
    Espionage

    "The NSA has built an infrastructure that allows it to intercept almost everything. With this capability, the vast majority of human communications are automatically ingested without targeting." (E. Snowden)

    https://www.theguardian.com/us-news/nsa

  • 5

    New jobs

    Data Scientist
    Geek/hacker: knows how to develop, parameterize and deploy tools
    HPC specialist: parallelism is key
    IT specialist: knows how to manage and transform data
    Statistician: knows how to use mathematics to classify, group and analyze information
    Manager: knows how to define objectives and identify the value of information

  • 6

    Computing infrastructures

    The reduced cost of infrastructures

    Main actors (Google, Facebook, Yahoo, Amazon …) developed frameworks for storing and processing data
    We generally consider that we enter the Big Data world when processing cannot be performed with a single computer

  • 7

    Definition of Big Data

    Definition
    Rapid processing of large data volumes that can hardly be handled with traditional techniques and tools

    The three Vs of Big Data
    Volume, Velocity, Variety
    Two additional Vs
    Veracity, Value

  • 8

    General approach
    Main principle: divide and conquer
    Distribute I/O and computing across several devices

  • 9

    Solutions

    Two main families of solutions

    Processing in batch mode (e.g. Hadoop)
    Data are initially stored in the cluster
    Various requests are executed on these data
    Data don't change / requests change

    Processing in streaming mode (e.g. Storm)
    Data are continuously arriving in streaming mode
    Computations are executed on the fly on these data
    Data change / requests don't change

    This lecture focuses on the batch mode (Hadoop).

  • 10

    The map-reduce principle

    We have to manage many stores around the world
    A large document registers all the sales
    For each sale: day – city – product – price
    Objective: compute the total of sales per store

    The traditional method
    A Hashtable memorizes the total for each store (key = city, value = total)
    We iterate through all records
    For each record, if we find the city in the Hashtable, we add the price to its total (a sketch of this sequential method follows below)
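    A minimal Java sketch of this sequential method (the file name, record format and field order are assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class SequentialSales {
        public static void main(String[] args) throws Exception {
            // city -> running total of sales
            Map<String, Double> totals = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("sales.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // assumed record format: day;city;product;price
                    String[] fields = line.split(";");
                    String city = fields[1];
                    double price = Double.parseDouble(fields[3]);
                    totals.merge(city, price, Double::sum);  // add the price to the city's total
                }
            }
            totals.forEach((city, total) -> System.out.println(city + " : " + total));
        }
    }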

  • 11

    The map-reduce principle
    What happens if the document size is 1 TB?
    I/O are slow
    Memory saturation on the host
    Processing is too long

    Map-Reduce
    Divide the document in several fragments
    Several machines for computing on the fragments
    Mappers: execute in parallel on the fragments
    Reducers: aggregate the results from the mappers

    (Diagram: mappers feed their results to reducers)

  • 12

    The map-reduce principle

    Mappers
    Gather (city, price) pairs from a document fragment
    Send them to the reducers according to the city

    Reducers
    Each reducer is responsible for a set of cities
    Each reducer computes the total for each of its cities (see the sketch below)
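    A minimal sketch of the corresponding map and reduce functions, written as plain Java methods rather than the actual Hadoop API (the record format and types are assumptions; the real Hadoop classes are introduced from slide 17 onward):

    import java.util.function.BiConsumer;

    public class SalesMapReduce {
        // Mapper side: runs on one document fragment, emits (city, price) pairs
        static void map(String record, BiConsumer<String, Double> emit) {
            String[] fields = record.split(";");   // assumed format: day;city;product;price
            emit.accept(fields[1], Double.parseDouble(fields[3]));
        }

        // Reducer side: receives every price collected for one city, emits (city, total)
        static void reduce(String city, Iterable<Double> prices, BiConsumer<String, Double> emit) {
            double total = 0;
            for (double p : prices) total += p;
            emit.accept(city, total);
        }
    }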

  • 13

    Hadoop

    Supports the execution of Map-Reduce applications in a cluster
    The cluster can group tens, hundreds or thousands of nodes
    Each node provides storage and compute capacities

    Scalability
    It should allow storage of very large volumes of data
    It should allow parallel computing on such data
    It should be possible to add nodes

    Fault tolerance
    If a node crashes:
    ongoing computations should not fail (jobs are re-submitted)
    data should still be available (data is replicated)

  • 14

    Hadoop principles

    Two main parts
    Data storage: HDFS (Hadoop Distributed File System)
    Data processing: Map-Reduce

    Principle
    Copy data to HDFS – data is divided and stored on a set of nodes
    Process data where they are stored (Map) and gather results (Reduce)
    Copy results from HDFS (a typical command sequence is sketched below)
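    A hedged sketch of this workflow as shell commands (paths, jar name and main class are assumptions; the hdfs and hadoop commands come with the Hadoop distribution):

    # 1. Copy the input data to HDFS
    hdfs dfs -put sales.txt /user/me/input/

    # 2. Run the Map-Reduce job where the data is stored
    hadoop jar wordcount.jar WordCount /user/me/input /user/me/output

    # 3. Copy the results back from HDFS
    hdfs dfs -get /user/me/output ./results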

  • 15

    HDFS: Hadoop Distributed File System

    A new file system to read and write data in the cluster
    Files are divided in blocks distributed between nodes
    Large block size (initially 64 MB)
    Blocks are replicated in the cluster (3 times by default)
    Write-once-read-many: designed for one write / multiple reads
    HDFS relies on local file systems

  • 16

    HDFS architecture

  • 17

    Programming with Hadoop

    Basic entity: the key-value pair (KV)

    The map function
    Input: <K,V>
    Output: {<K,V>}
    The map function successively receives the <K,V> pairs of the local block

    The reduce function
    Input: <K,{V}>
    Output: {<K,V>}
    Each key received by a reducer is unique
    (The corresponding Hadoop class signatures are sketched below)
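    In the Hadoop Java API, these pairs appear as the generic parameters of the Mapper and Reducer classes. A sketch for the WordCount case developed on the next slides (the type choices follow the standard WordCount example):

    // Text, IntWritable come from org.apache.hadoop.io; Mapper, Reducer from org.apache.hadoop.mapreduce

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    // input:  <position in the file, line of text>, output: <word, 1>
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> { /* map() shown on slide 20 */ }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    // input:  <word, {1, 1, ...}>, output: <word, count>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> { /* reduce() shown on slide 22 */ }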

  • 18

    Execution scheme

  • 19

    WordCount example

    The WordCount application
    Input: a large text file (or a set of text files)
    Each line is read as a KV <position in the file, line content>
    Output: the number of occurrences of each word

    Input splits:
    Split 1: Hello World Bye World
    Split 2: Hello Hadoop Goodbye Hadoop

    Map1 output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
    Map2 output: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

    After merge (shuffle and sort):
    <Bye, [1]> <Goodbye, [1]> <Hadoop, [1,1]> <Hello, [1,1]> <World, [1,1]>

    Reduce output:
    <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

  • 20

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit <word, 1>
            }
        }
    }

    Map
    map(key, value) → List(key_i, value_i)

    Input: Hello World Bye World
    Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>

  • 21

    Shuffle and sort

    Shuffle: group the KVs whose K is identical
    Sort: sort by K
    Done by the framework

    Map1 output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
    Map2 output: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

    After shuffle and sort:
    <Bye, [1]> <Goodbye, [1]> <Hadoop, [1,1]> <Hello, [1,1]> <World, [1,1]>

  • 22

    Reduce
    reduce(key, List(value_i)) → List(key_i, value_i)

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // emit <word, total count>
        }
    }

    Input: <Bye, [1]> <Goodbye, [1]> <Hadoop, [1,1]> <Hello, [1,1]> <World, [1,1]>
    Output: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

  • 23

    Several reducers

    Map1 output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
    Map2 output: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

    After shuffle and sort, each reducer receives the keys it is responsible for:
    Reducer 1 input: <Bye, [1]> <Goodbye, [1]> <Hadoop, [1,1]>
    Reducer 1 output: <Bye, 1> <Goodbye, 1> <Hadoop, 2>
    Reducer 2 input: <Hello, [1,1]> <World, [1,1]>
    Reducer 2 output: <Hello, 2> <World, 2>

  • 24

    Combiner functions

    Reduce data transfer between map and reduce
    Executed at the output of a map
    Often the same function as the reduce

    Map1 output: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
    Map1 combiner output: <Hello, 1> <World, 2> <Bye, 1>
    Map2 output: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
    Map2 combiner output: <Hello, 1> <Hadoop, 2> <Goodbye, 1>

    After shuffle and sort, the reducers receive:
    <Bye, [1]> <Goodbye, [1]> <Hadoop, [2]> <Hello, [1,1]> <World, [2]>

  • 25

    Main program

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  • 26

    Execution in a cluster

    (Diagram: map tasks run on the DataNodes that store the HDFS blocks; reduce tasks gather the map outputs.)

  • 27

    Spark in a few words

    Evolution from Hadoop
    Speed: reducing read/write operations
    Up to 10 times faster when running on disk
    Up to 100 times faster when running in memory

    Multiple languages: Java, Scala or Python
    Advanced analytics: not only Map-Reduce
    SQL
    Streaming data
    Machine Learning
    Graph algorithms

  • 28

    Iterative scheme with Spark

    Keep data in memory as long as possible
    Store on disk only if memory is not sufficient
    Also, JVMs don't have to be restarted (unlike successive Hadoop jobs)

  • 29

    Programming with Spark (Java)

    Initialization

    Spark relies on Resilient Distributed Datasets (RDDs)
    Datasets that are partitioned across the nodes
    They can be operated on in parallel

    SparkConf conf = new SparkConf().setAppName("WordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);
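    For quick local tests, the master can also be set directly in the code rather than on the spark-submit command line (a sketch; the thread count is an arbitrary choice):

    // "local[2]" runs Spark locally with 2 worker threads; in a cluster the master
    // URL is normally passed with spark-submit --master instead.
    SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);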

  • 30

    Programming with Spark (Python)

    Initialization

    Spark relies on Resilient Distributed Datasets (RDDs)
    Datasets that are partitioned across the nodes
    They can be operated on in parallel

    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

  • 31

    Programming with Spark (Java)

    RDD created from a Java collection

    RDD created from an external storage (file)

    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> rdd = sc.parallelize(data);

    JavaRDD<String> rdd = sc.textFile("data.txt");

  • 32

    Programming with Spark (Python)

    RDD created from a Python list

    RDD created from an external storage (file)

    data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data)

    rdd = sc.textFile("data.txt")

  • 33

    Programming with Spark

    Driver program: the main program

    Two types of operation on RDDs
    Transformations: create a new RDD from an existing one
    e.g. map() passes each RDD element through a given function
    Actions: compute a value from an existing RDD
    e.g. reduce() aggregates all RDD elements using a given function and computes a single value

    Transformations are computed lazily, only when needed to perform an action (enables optimizations; see the short example below)
    A transformed RDD is recomputed for each action by default; it can be persisted in memory with persist()/cache(), and persisted partitions that don't fit in memory are recomputed when needed
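    A small illustration of this lazy behaviour (a sketch; the file name is an assumption):

    JavaRDD<String> lines = sc.textFile("data.txt");        // nothing is read yet
    JavaRDD<Integer> lengths = lines.map(String::length);   // transformation: still nothing computed
    int total = lengths.reduce(Integer::sum);               // action: triggers the read and both operations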

  • 34

    Programming with Spark (Java)

    Example with lambda expressions
    map(): apply a function to each element of an RDD
    reduce(): apply a function to aggregate all values of an RDD
    The function must be associative and commutative to be evaluated in parallel

    Or with (anonymous) Java function objects

    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);
    lineLengths.persist(StorageLevel.MEMORY_ONLY());   // keep lineLengths in memory for later reuse

    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
        public Integer call(String s) { return s.length(); }
    });
    int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
    });

  • 35

    Programming with Spark (Python)

    Example with lambda expressions
    map(): apply a function to each element of an RDD
    reduce(): apply a function to aggregate all values of an RDD
    The function must be associative and commutative to be evaluated in parallel

    Or with a named function

    lines = sc.textFile("data.txt")
    lineLengths = lines.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    lineLengths.persist()

    def lenFunc(s):
        return len(s)

    lines = sc.textFile("data.txt")
    lineLengths = lines.map(lenFunc)
    ...

  • 36

    Programming with Spark (Java)

    Execution of operations (transformations/actions) is distributed

    Variables in the driver program are serialized and copied on remote hosts (they are not global variables)

    Special Accumulator/Broadcast variables should be used instead (a correct version with an accumulator is sketched after the code below)

    int counter = 0;
    JavaRDD<Integer> rdd = sc.parallelize(data);

    // Wrong: don't do this!!
    rdd.foreach(x -> counter += x);
    System.out.println("Counter value: " + counter);
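    A correct version using a Spark accumulator (a sketch; the variable names are arbitrary):

    // Accumulators (org.apache.spark.util.LongAccumulator) are the supported way
    // to aggregate values from tasks back to the driver program
    LongAccumulator counter = sc.sc().longAccumulator("counter");
    rdd.foreach(x -> counter.add(x));
    System.out.println("Counter value: " + counter.value());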

  • 37

    Programming with Spark (Python)

    Execution of operations (transformations/actions) is distributed

    Variables in the driver program are serialized and copied on remote hosts (they are not global variables)

    Should use special Accumulator/Broadcast variables

    counter = 0
    rdd = sc.parallelize(data)

    # Wrong: don't do this!!
    def increment_counter(x):
        global counter
        counter += x

    rdd.foreach(increment_counter)
    print("Counter value: ", counter)

  • 38

    Programming with Spark (Java)

    Many operations rely on key-value pairs
    Example (count the number of occurrences of each line)

    mapToPair(): each element of the RDD produces a pair
    reduceByKey(): apply a function to aggregate the values for each key

    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

  • 39

    Programming with Spark (Python)

    Many operations rely on key-value pairs
    Example (count the number of occurrences of each line)

    map(): each element of the RDD produces a pair
    reduceByKey(): apply a function to aggregate the values for each key

    lines = sc.textFile("data.txt")
    pairs = lines.map(lambda s: (s, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

  • 40

    WordCount example (Java)

    JavaRDD<String> words = sc.textFile(inputFile).flatMap(s -> Arrays.asList(s.split(" ")).iterator());

    JavaPairRDD<String, Integer> counts =
        words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey((a, b) -> a + b);
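    Both operations above are transformations, so nothing is computed until an action is called; for instance, the result can be written out as follows (a sketch; outputFile is an assumed variable analogous to inputFile):

    counts.saveAsTextFile(outputFile);   // action: triggers the whole pipeline and writes the result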

  • 41

    WordCount example (Python)

    words = sc.textFile(inputFile).flatMap(lambda line : line.split(" "))

    counts = words.map(lambda w : (w, 1)).reduceByKey(lambda a, b: a + b)

  • 42

    Many APIs

  • 43

    Using Spark (Java)

    Install Spark
    tar xzf spark-2.2.0-bin-hadoop2.7.tgz
    Define environment variables
    export SPARK_HOME=<install dir>/spark-2.2.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

    Development with Eclipse
    Create a Java Project
    Add jars in the build path
    • $SPARK_HOME/jars/spark-core_2.11-2.2.0.jar
    • $SPARK_HOME/jars/scala-library-2.11.8.jar
    • $SPARK_HOME/jars/hadoop-common-2.7.3.jar
    • You could include all the jars, but that is not very clean

    Your application should be packaged in a jar
    Launch the application
    spark-submit --class <main class> --master <master> <application jar>
    Centralized: <master> = local or local[n]
    Cluster: <master> = URL to access the cluster's master
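    For example, a local run of the WordCount application might look like this (the jar name, class name and arguments are assumptions):

    spark-submit --class WordCount --master local[2] wordcount.jar input.txt output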

  • 44

    Using Spark (Python)

    Install Python3
    The default Python should refer to Python3
    Install Spark
    tar xzf spark-2.2.0-bin-hadoop2.7.tgz
    Define environment variables
    export SPARK_HOME=<install dir>/spark-2.2.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

    Development with Eclipse
    Go to Help/MarketPlace and install PyDev
    Go to Windows/Preferences … Python Interpreter
    Libraries/New Zip
    • add $SPARK_HOME/python/lib/py4j-0.10.7-src.zip
    • add $SPARK_HOME/python/lib/pyspark.zip
    Environment
    • add SPARK_HOME = <install dir>/spark-2.2.0-bin-hadoop2.7

  • 45

    Using Spark (Python)

    Development with Eclipse
    Create a PyDev Project
    Develop your modules

    Launch the application
    spark-submit --master <master> <python file>
    Centralized: <master> = local or local[n]
    Cluster: <master> = URL to access the cluster's master

  • 46

    Cluster mode

    Starting the master
    start-master.sh
    You can check its state and see its URL at http://master:8080

    Starting slaves
    start-slave.sh <master URL> -c 1     (-c 1 to use only one core)

    Files
    If not running on top of HDFS, you have to replicate your files on the slaves
    Otherwise your program should refer to the input file with an HDFS URL (see the example below)
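    For instance, when the input is stored in HDFS, the program refers to it with an hdfs:// URL (a sketch; host, port and path are assumptions):

    JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/user/me/input/data.txt");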

  • 47

    Conclusion

    Spark is just the beginning
    You should have a look at:
    Spark Streaming
    Spark SQL
    MLlib
    GraphX
    ...
