BUDT 758
Lecture 7 (03/12, 03/14): Pig and Sqoop
K. Zhang
Decisions, Operations & Information Technologies
Robert H. Smith School of Business
Spring 2018
(pic source: mapr.com/blog)
Sqoop lab
Pig lab (page 7-18)
• Use Pig for ETL processing
  ◦ Local mode
  ◦ Then launch to cluster
• Analyze ad campaign data with Pig
  ◦ Low cost sites
  ◦ High cost keywords
Go through the corresponding slides first
• Pig-Introduction.pdf
  ◦ Everything needed to complete the Pig Lab
• (optional) Pig-AdvancedFunctions.pdf
  ◦ More advanced functions of Pig
• Combine, join, and analyze sales data
  ◦ Answer more complex questions
• Is the ad campaign effective?
What we will cover next
• How to run Pig
• Pig Latin syntax
• Loading data, output, etc.
• Pig data concepts
  ◦ Fields, Tuples, Bag, Relation
• Commonly used operations and functions
Pig Latin Syntax
Pigs are lazy and …smart
• Pig optimizes your code
  ◦ Ex1: you define a field but never use it → Pig won't bother to load data for that field
  ◦ Ex2: you write step 1, step 2, step 3 → Pig might run them as step 1, step 3, step 2 if that is more efficient
• Pig will not do anything until output is required
  ◦ DUMP
  ◦ STORE
• Now you understand why LOAD in Pig seems so amazingly fast
  ◦ Even for terabytes of data
  ◦ That's because Pig only records the command rather than executing it
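The lazy behavior described above can be imitated with a small Python sketch. This is a toy model, not Pig's actual implementation: the `LazyRelation` class, the `load` helper, and the sample rows are all hypothetical names and data invented for illustration. It only shows that "loading" and "filtering" record steps in a plan, and that the data is read once output is requested.

```python
# Toy model of lazy evaluation; NOT how Pig is implemented internally.
# All names and sample rows here are hypothetical.
class LazyRelation:
    def __init__(self, rows_fn, plan):
        self._rows_fn = rows_fn      # called only when output is requested
        self.plan = plan             # human-readable list of recorded steps

    def filter(self, predicate, label):
        # Record the step; no rows are touched here.
        return LazyRelation(
            lambda: (row for row in self._rows_fn() if predicate(row)),
            self.plan + [f"FILTER {label}"],
        )

    def dump(self):
        # Output is required, so the recorded plan finally runs.
        return list(self._rows_fn())


reads = []  # tracks how many times the "file" is actually read


def read_rows():
    reads.append(1)
    return iter([("siteA", 10), ("siteB", 250), ("siteC", 40)])


def load(rows_fn, name):
    # Only records the LOAD step, which is why it returns instantly.
    return LazyRelation(rows_fn, [f"LOAD {name}"])


ads = load(read_rows, "ads")
cheap = ads.filter(lambda r: r[1] < 100, "cost < 100")
reads_before_dump = len(reads)   # nothing has been read yet
result = cheap.dump()            # the data is read only now
```

In this sketch, `cheap.plan` holds the recorded steps, which loosely mirrors how Pig builds up a plan before DUMP or STORE forces execution.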
Pig Data Concepts
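As a rough analogy only, the four concepts (field, tuple, bag, relation) can be mapped onto plain Python values. Pig has its own type system, so this is a sketch for intuition, and the sample ad records are hypothetical.

```python
# Python analogy only; the sample data is made up for illustration.
field = "siteA"                       # a field: a single data value
record = ("siteA", "shoes", 10)       # a tuple: an ordered set of fields
bag = [                               # a bag: a collection of tuples
    ("siteA", "shoes", 10),
    ("siteB", "boots", 250),
    ("siteA", "shoes", 10),           # duplicates are allowed in a bag
]
ads = bag                             # a relation: a named (outer) bag

first_field = record[0]               # fields can be addressed by position
```

In Pig Latin, positional access is written `$0`, `$1`, and so on, which corresponds to the `record[0]` lookup here.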
Extract and re-order columns
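In Pig Latin this kind of projection is typically written with `FOREACH ... GENERATE`. The Python sketch below shows the same idea on hypothetical rows of (site, keyword, cost): keep two columns and swap their order.

```python
# Hypothetical input rows: (site, keyword, cost).
ads = [("siteA", "shoes", 10), ("siteB", "boots", 250)]

# Keep cost and site, in that order, and drop the keyword column.
reordered = [(cost, site) for (site, keyword, cost) in ads]
```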
Group
• Slides: “Pig-Introduction.pdf” pages 9-85 to 9-88
• esp. 9-88
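For intuition before reading those pages: Pig's GROUP produces one tuple per key, holding the key and a bag of all the rows that share it. The sketch below imitates that shape in Python. The `group_by` helper and the sample rows are hypothetical, and the cost total is only one example of what might be computed per group.

```python
from collections import defaultdict

# Hypothetical input rows: (site, cost).
ads = [("siteA", 10), ("siteB", 250), ("siteA", 40)]


def group_by(rows, key_index):
    """Return (key, rows_with_that_key) pairs, sorted by key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return sorted(groups.items())


by_site = group_by(ads, 0)

# A per-group aggregate, comparable in spirit to applying SUM after GROUP.
total_cost = [(site, sum(cost for _, cost in rows)) for site, rows in by_site]
```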
Hope you are now a Pig-Master