White-Box Testing of Big Data Analytics with Complex User-Defined Functions
Muhammad Ali Gulzar (1), Shaghayegh Mardani (1), Madan Musuvathi (2), Miryung Kim (1)
(1) University of California, Los Angeles; (2) Microsoft Research
Transcript
Page 1: White-Box Testing of Big Data Analytics with Complex User …gulzar/assets/pdf/bigtest... · 2019. 10. 27. · Inadequate Testing of Big Data Analytics 1 Develop locally 2 Test locally

Page 2

Software Development Cycle of Big Data Analytics

Inadequate Testing of Big Data Analytics

1. Develop locally
2. Test locally with sample data
3. Execute the job on the cloud, hoping that it will work
4. Several hours later, the job crashes or produces wrong output
5. Go to Step 2

Repeat

Page 3

Motivating Example

Find the total number of trips made from UCLA using public transport, a personal vehicle, or on foot.

Trips Dataset (20 GB)

#,ORIG,DEST,DIST,TIME
1,90034,90024,10,1
2,90001,90024,16,1.4
…

Locations Dataset (100 MB)

Zip,Location
90034,"UCLA"
90024,"Westwood"
…

Big Data Application in Apache Spark

// Key each trip by its origin zip code; the value is speed = DIST / TIME.
val trips = sc.textFile("trips")
  .map { s => val c = s.split(","); (c(1), c(3).toInt / c(4).toInt) }
// Keep only the zip codes whose location is UCLA.
val locations = sc.textFile("zipcode")
  .map { s => val c = s.split(","); (c(0), c(1)) }
  .filter { s => s._2.equals("UCLA") }
// Join trips with locations, classify each trip by speed, and count per class.
val result = trips.join(locations).map { s =>
    if (s._2._1 > 40) ("car", 1)
    else if (s._2._1 > 15) ("public", 1)
    else ("onfoot", 1) }
  .reduceByKey(_ + _)

Page 4

Characteristics of Big Data Analytics

val trips = sc.textFile("trips")
  .map { s => val c = s.split(","); (c(1), c(3).toInt / c(4).toInt) }
val locations = sc.textFile("zipcode")
  .map { s => val c = s.split(","); (c(0), c(1)) }
  .filter { s => s._2.equals("UCLA") }
val result = trips.join(locations).map { s =>
    if (s._2._1 > 40) ("car", 1)
    else if (s._2._1 > 15) ("public", 1)
    else ("onfoot", 1) }
  .reduceByKey(_ + _)

• Relational skeleton
• Custom logic as user-defined functions
• String operations are common
• Fluid interchange between types

How do we test a big data application effectively and efficiently?

Page 5

Option 1: Sample Input Data

• random sampling
• top-n sampling
• top-k% sampling, etc.

Limitations:

• The sample may only exercise a limited set of program paths (low code coverage).
• The sample may not include the inputs leading to a program crash.
• A large sample may have higher coverage but increases local testing time.
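
The first limitation can be made concrete with a small sketch. It reuses the speed thresholds of the motivating example; the object name, the `coverage` helper, and the list of speeds are hypothetical and only serve as illustration. A top-3 sample of this data exercises a single branch, while the full data exercises all three.

```scala
object SampleCoverage {
  // Branch taken by the speed classifier of the motivating example.
  def path(speed: Int): String =
    if (speed > 40) "car" else if (speed > 15) "public" else "onfoot"

  // Fraction of the three branches exercised by a set of inputs.
  def coverage(speeds: Seq[Int]): Double =
    speeds.map(path).distinct.size / 3.0

  // Hypothetical speeds, in the order they might appear in the input file.
  val allSpeeds: Seq[Int] = Seq(10, 12, 8, 14, 30, 55)

  // Top-n sampling with n = 3 keeps only the first three records.
  val topThree: Seq[Int] = allSpeeds.take(3)
}
```

Here `SampleCoverage.coverage(SampleCoverage.topThree)` is one third, because every sampled speed falls into the "onfoot" branch.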

Page 6

Option 2: Traditional Test Generation for Java

• Big Data Analytics programs compile to Java bytecode
• But this includes the entire system (700 KLOC for Apache Spark)
• Symbolic execution without abstraction is infeasible and would not scale

Page 7

Our Approach: White-Box Testing

Input: Big Data Analytics Application

sc.textFile("hdfs").map(s => s.toInt).filter(w => w > 0).reduce(_ + _)

BigTest

Output: Test Input Data

PC       Input
X > 0    X = "1"
X ≤ 0    X = "0"

1. Decompose relational skeleton and UDFs
2. Logical specifications for relational operators
3. Symbolic execution of UDFs
4. Generate inputs by joint path constraints
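
The PC/Input table above can be reproduced with a minimal sketch. It is not BigTest's implementation: a scan over a few hypothetical candidate values stands in for the constraint solver, and the names `PathInputs`, `paths`, and `witness` are assumptions made for this example.

```scala
object PathInputs {
  // UDFs of the slide's pipeline: parse the record, then keep positive values.
  def parse(s: String): Int = s.toInt
  def keep(w: Int): Boolean = w > 0

  // The two path constraints from the slide, over the parsed value X.
  val paths: Seq[(String, Int => Boolean)] = Seq(
    "X > 0"  -> ((x: Int) => x > 0),
    "X <= 0" -> ((x: Int) => x <= 0)
  )

  // Naive stand-in for the solver: return the first candidate satisfying the constraint.
  def witness(pc: Int => Boolean): Option[String] =
    Seq(0, 1, -1, 2, -2).find(pc).map(_.toString)

  // One concrete input record per path constraint.
  def generate: Seq[(String, Option[String])] =
    paths.map { case (name, pc) => name -> witness(pc) }
}
```

With this candidate order, `PathInputs.generate` pairs "X > 0" with the input "1" and "X <= 0" with the input "0", so the two records together drive the filter through both of its outcomes.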

Page 8

Modelling Dataflow Operators

Steps: 1. Decomposition  2. Logical Specs  3. Symbolic Execution  4. Test Generation

[Figure: dataflow graph of the motivating example. Trips → Map; Zipcode → Map → Filter; both inputs feed Join (⨝) → Map → ReduceByKey.]

Page 9

Modelling Dataflow Operators

[Figure: dataflow graph of the motivating example. Trips → Map; Zipcode → Map → Filter (True and False branches); both inputs feed Join (⨝), which has two Non-Matching Keys branches; then Map → ReduceByKey.]

• Handle terminating and non-terminating cases of dataflow operators
• E.g., Join can introduce 3 cases
  • 2 cases in which keys from right and left do not match
  • 1 case in which right and left keys match
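
The three join cases listed above can be illustrated on plain Scala collections. This is a sketch under assumptions, not the logical specification BigTest uses: `JoinCases`, `classify`, and `join` are hypothetical names, and in-memory sequences stand in for Spark RDDs.

```scala
object JoinCases {
  sealed trait JoinCase
  case object Matched extends JoinCase    // key present on both sides
  case object LeftOnly extends JoinCase   // left key has no partner on the right
  case object RightOnly extends JoinCase  // right key has no partner on the left

  // Assign every key of the two inputs to exactly one of the three cases.
  def classify[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, JoinCase] = {
    val l = left.map(_._1).toSet
    val r = right.map(_._1).toSet
    (l union r).map { k =>
      val c: JoinCase =
        if (l(k) && r(k)) Matched
        else if (l(k)) LeftOnly
        else RightOnly
      k -> c
    }.toMap
  }

  // Inner join: only the matched case produces output; the other two terminate.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for { (k, a) <- left; (k2, b) <- right; if k == k2 } yield (k, (a, b))
}
```

For trips keyed by "90034" and "90001" and locations keyed by "90034" and "90024", only "90034" reaches the operators after the join; a test suite that wants to cover the join needs one record for each of the three cases.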

Page 10

Modelling User-defined Functions

[Figure: dataflow graph of the motivating example (Trips → Map; Zipcode → Map → Filter with True and False branches; Join ⨝ with Non-Matching Keys branches; Map → ReduceByKey), annotated with the path constraints of the UDFs.]

UDF path constraints shown in the figure:

s.split(",").length > 2
V > 40 => "car"
15 < V ≤ 40 => "public"
V ≤ 15 => "walk"

• Handle strings, collections, and tuples

Page 11

Join Dataflow and UDF (JDU) Path

[Figure: dataflow graph with one UDF per operator and the path constraint of each terminating or continuing branch. Paths in the figure are labelled T1 to T4.]

Operators and UDFs:

Trips (T) → Map: f_map1 → (K1, V1)
Zipcode (Z) → Map: f_map2 → Filter: f_filter → (K2, V2)
Join: ⨝ → (K1, (V1, V2))
Map: f_map3 → (S, 1)
ReduceByKey: f_Agg → (S, N)

Branch constraints:

Filter, False: ~f_filter(K2, V2)
Join, K1 ∉ Zipcode: f_filter(K2, V2) ∧ K1 ∉ Zipcode
Join, K2 ∉ Trips: f_filter(K2, V2) ∧ K2 ∉ Trips
Join, keys match: f_filter(K2, V2) ∧ K1 = K2

Example JDU path constraint:

Z.split(",")[1] = "Palms" ∧ Z.split(",").length > 1 ∧
T.split(",")[1] = Z.split(",")[0] ∧
T.split(",").length > 1 ∧ …

Page 12

Test Input Generation

JDU path constraint:

Z.split(",")[1] = "Palms" ∧ Z.split(",").length > 1 ∧
T.split(",")[1] = Z.split(",")[0] ∧
T.split(",").length > 1 ∧ …

SMT query:

(assert (= T (str.++ (str.++ line20 ",") line21)))
(assert (= Z
  (str.++ (str.++ " " ",")
  (str.++ (str.++ line11 ",")
  (str.++ (str.++ " " ",")
  (str.++ (str.++ line13 ",") line14))))))
(assert
  (and (not (= (str.to.int line14) 0))
  (and (isinteger line14)
  (and (isinteger line13)
  (and (= "Palms" line21)
  (and (= x11 line20)
  (and (<= s21 15)
  (and (<= s21 40)
  (and (= s21 x621)
  (and (= s1 x61)
       (= s22 x622)))))))))))
(assert
  (and (= x11 line11)
  (and (= x12 (/ (str.to.int line13) (str.to.int line14)))
  (and (= x61 x11)
  (and (= x621 x12)
  (and (= x622 x42)
  (and (= x71 "walk")
       (= x72 1))))))))

Generated Test Data

Trips:     _, "\x00", _, "0", "1"
Location:  "\x00", "Palms"
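
A quick way to sanity-check generated rows is to evaluate the JDU path constraint on them directly. The sketch below encodes the four conjuncts shown on this slide as a Scala predicate; the object and method names are hypothetical, the elided part of the constraint ("…") is not modelled, and the don't-care fields (`_`) of the generated trip row are filled with an arbitrary placeholder.

```scala
object JduPathCheck {
  // Conjuncts from the slide; T is the trip row and Z is the location row.
  // The length checks come first so that indexing cannot fail.
  def satisfies(t: String, z: String): Boolean = {
    val tc = t.split(",")
    val zc = z.split(",")
    zc.length > 1 && zc(1) == "Palms" &&
    tc.length > 1 && tc(1) == zc(0)
  }

  // Generated test data from the slide; "x" is a placeholder for the "_" fields.
  val key: String = "\u0000"
  val tripRow: String = Seq("x", key, "x", "0", "1").mkString(",")
  val locationRow: String = Seq(key, "Palms").mkString(",")
}
```

`JduPathCheck.satisfies(JduPathCheck.tripRow, JduPathCheck.locationRow)` holds, while a trip row whose origin differs from the location key falls into one of the non-matching join cases and fails the check.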

Page 13

Evaluation

RQ1: How much test coverage improvement can BigTest achieve?

RQ2: How many faults can BigTest detect?
• We built the first benchmark of faulty dataflow programs based on our survey of such programs on Q/A forums, e.g., StackOverflow.

RQ3: How much test data reduction does BigTest provide, and how long does BigTest take to generate test data?

Page 14

Experimental Setting

• We use seven subject programs from earlier works
• All subject applications have complex string operations, complex arithmetic, Tuple types for key-value pairs, and collections with custom logic.

Subject Program    Dataflow Operators                  # of Operators   JDU Paths (K=2)   # of UDFs
IncomeAggregate    map, filter, reduce                 3                6                 4
MovieRatings       map, filter, reduceByKey            4                5                 4
AirportLayover     map, filter, reduceByKey            3                14                4
CommuteType        map, filter, join, reduceByKey      6                11                5
PigMix-L2          map, join                           5                4                 6
GradeAnalysis      flatmap, filter, reduceByKey, map   5                30                3
WordCount          flatmap, map, reduceByKey           3                4                 3

Page 15

Study of Big Data Analytics Faults

• No existing benchmark of faulty applications
• We study the characteristics of real-world big data analytics bugs posted on StackOverflow and Apache Spark Mailing Lists.

Survey Statistics

Keywords Searched            Apache Spark exceptions, task errors, failures, wrong outputs
Posts Studied                Top 50
Posts with Coding Errors     23
Common Fault Types           7
Total Faulty Programs        31

Fault Types                  Example
Incorrect String Offset      str.substring(1,0)
Incorrect Column Selection   str.split(",")[1]
Wrong Delimiters             str.split("\t")[1]
Incorrect Branch Condition   if (age > 10 && age < 9)
Wrong Join Type              LeftOuterJoin
Key-Value Swap               (Value, Key)
Others                       Division by zero

Page 16

Real World Fault Injection

• Identified 7 common code fault types
• Manually inserted these faults into benchmarks
• Leads to a total of 31 faulty big data applications.

Original Program

val trips = sc.textFile("trips")
  .map { s =>
    val c = s.split(",");
    (c(1), c(3).toInt / c(4).toInt)
  }
val loc = sc.textFile("zipcode") . . . .

Faulty Program

val trips = sc.textFile("trips")
  .map { s =>
    val c = s.split(",");
-   (c(1), c(2).toInt / c(4).toInt) }
val loc = sc.textFile("zipcode") . . . .

c(5).

After injecting a fault of type "Incorrect Column Selection", the program extracts the column at index 5 instead of 4.
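
A column-selection fault only shows up on rows where the wrongly selected column holds a different value than the intended one. The sketch below compares the original map UDF with a faulty variant on single rows. The variant reads column 2 in place of column 3, which is an illustrative choice and not necessarily the exact fault shown on the slide; the names `ColumnFault` and `exposes` are hypothetical.

```scala
import scala.util.Try

object ColumnFault {
  // Original UDF from the slide: key by origin, value is DIST / TIME (columns 3 and 4).
  def original(s: String): (String, Int) = {
    val c = s.split(",")
    val speed = c(3).toInt / c(4).toInt
    c(1) -> speed
  }

  // Illustrative faulty variant: divides column 2 instead of column 3.
  def faulty(s: String): (String, Int) = {
    val c = s.split(",")
    val speed = c(2).toInt / c(4).toInt
    c(1) -> speed
  }

  // A row exposes the fault when the two variants disagree,
  // including the case where only one of them throws.
  def exposes(row: String): Boolean =
    Try(original(row)).toOption != Try(faulty(row)).toOption
}
```

The row `1,90034,90024,10,1` exposes the fault, whereas `1,90034,10,10,1` does not, because columns 2 and 3 happen to be equal. A sample made only of rows of the second kind would let the faulty program pass.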

Page 17

RQ1: Code Coverage

Sedge [ASE'13] generates examples for dataflow programs, but it handles a UDF as an uninterpreted function and does not model its internals.

[Figure: bar chart "JDU Path Coverage on Subject Programs"; y-axis: normalized JDU path coverage, 0 to 100.]

Subject Program    BigTest   Sedge   Entire Dataset
IncomeAggregate    100       17      67
MovieRatings       100       40      60
AirportLayover     100       14      29
CommuteType        100       18      55
PigMixL2           100       25      75
GradeAnalysis      100       13      77
WordCount          100       25      100

Page 18

RQ1: Code Coverage

BigTest improves JDU path coverage by 78% against Sedge and 34% against the entire dataset.

[Figure: bar chart "JDU Path Coverage on Subject Programs", comparing BigTest, Sedge, and the entire dataset; y-axis: normalized JDU path coverage, 0 to 100.]

Page 19

RQ2: Fault Detection Capability

BigTest detects 2X more faults than Sedge because it models the internal semantics of UDFs with the specifications of dataflow operators.

                   Total Seeded   Detected     Detected    Injected Fault Type
Applications       Faults         by BigTest   by Sedge    1    2    3    4    5    6    7
IncomeAggregate    3              3            1           ✓    NA   NA   ✓    NA   NA   ✓
MovieRating        6              6            6           ✓    ✓    ✓    ✓    NA   ✓    ✓
AirportLayover     6              6            4           ✓    ✓    ✓    ✓    NA   ✓    ✓
CommuteType        6              6            4           NA   ✓    ✓    ✓    ✓    ✓    ✓
PigMix-L2          4              4            2           NA   ✓    ✓    NA   ✓    ✓    NA
GradeAnalysis      4              4            3           NA   ✓    ✓    ✓    NA   NA   ✓
WordCount          2              2            0           NA   ✓    NA   NA   NA   NA   ✓

Page 20

RQ3: Test Size Reduction

[Figure: bar chart "Test Dataset Size"; y-axis: # of rows, log scale from 1E+00 to 1E+10.]

Subject Program    BigTest   Entire Dataset
IncomeAggregate    6         4.00E+09
MovieRatings       5         5.21E+05
AirportLayover     14        4.48E+08
CommuteType        11        3.20E+08
PigMixL2           4         2.40E+08
GradeAnalysis      30        4.00E+07
WordCount          6         1.11E+08

Compared to the entire dataset, BigTest achieves more JDU path coverage with 10^5X to 10^8X smaller test data, translating into a 194X testing speedup.

Page 21

Summary

• Need SE tools for big data analytics applications
• BigTest provides exhaustive, automatic, and fast testing
• Contributions:

1. Demonstrated the need to interpret UDFs
2. Model strings, collections, and tuples
3. Logical specifications for dataflow operators handling terminating and non-terminating cases
4. Provide the first symbolic execution engine for Apache Spark/Scala
5. Present a study of big data analytics bugs and the first bug benchmark

Publicly available at: https://github.com/maligulzar/BigTest

Page 22

RQ3: Breakdown of BigTest's Testing Time

[Figure: stacked bar chart "Breakdown of Testing Time" for IncomeAggregate, MovieRatings, AirportLayover, CommuteType, PigMixL2, GradeAnalysis, and WordCount; series: Theorem Solver, Constraints Generation, Testing; y-axis: time in seconds, 0 to 80.]

By running tests locally, BigTest improves the testing time (CPU seconds) by 194X, on average, compared to testing the entire dataset on a 16-node cluster.

Page 23

Inadequate Test Generation Tools for Big Data Analytics

Traditional Software Test Generation

def concat(append: Boolean, a: String, b: String): String = {
  var result: String = null
  if (append) result = a + b
  result.toLowerCase()  // throws NullPointerException when append is false
}

• Standalone application
• Symbolic execution compatible
• Well-defined semantics
• Logical execution is similar to physical execution

Big Data Analytics Test Generation

sc.textFile("hdfs").flatMap(s => s.split(",")).map(w => (w, 1)).reduceByKey(_ + _)

• Heavily depends on the framework
• No symbolic execution support for dataflow operators
• New operators with changing semantics
• Logical execution is different from physical execution

Page 24

Program Decomposition

• Challenge: Due to the complexity of DISC frameworks' code, symbolic execution is infeasible on DISC applications.
• Insight: The individual UDFs of DISC applications are relatively small (<100 LOC), making symbolic execution feasible.
• Solution: We decompose a DISC application using AST analysis into a set of individual UDFs and dataflow operators.

. . .
.map { s =>
  val c = s.split(","); (c(0), c(1))
}.filter {
  s => s._2.equals("Palms") }
. . .

map

class UDF_MAP {
  public static void main(String args[]) {
    apply(null);
  }
  static Tuple2 apply(String s) {
    String[] arr = s.split(",");
    return new Tuple2(arr[0], arr[1]);
  }
}

filter

class UDF_FILTER {
  public static void main(String args[]) {
    apply(null);
  }
  static Boolean apply(String s) {
    return s.equals("Palms");
  }
}

Page 25

Symbolic Execution of UDFs

• Challenges: Strings, Collections, and Objects are prevalent in DISC applications but not fully supported by symbolic execution tools, e.g., Java PathFinder.
• Insight: In DISC applications, most unbounded types are eventually bounded. We perform lazy SE on such types, e.g., Split(",") is an unbounded Array but Split(",")[1] is bounded.
• Solution: Using JPF, we symbolically execute UDFs in isolation to generate path constraints and effects. Loops and Arrays are bounded by K=2.

From Step 1

map

class UDF {
  public static void main(String args[]) {
    apply(null);
  }
  static Tuple2 apply(Tuple3 s) {
    if (s._2()._1() > 40)
      return new Tuple2("car", 1);
    else if (s._2()._1() > 15)
      return new Tuple2("public", 1);
    else
      return new Tuple2("onfoot", 1);
  }
}

[Figure: symbolic execution tree. sym > 40 leads to (car, 1); otherwise sym > 15 leads to (public, 1), else (onfoot, 1).]

Path Constraints   Effect
sym > 40           car, 1
40 ≥ sym > 15      public, 1
sym ≤ 15           onfoot, 1
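
The path-constraint table can be cross-checked against the UDF by brute force. The sketch below is only an illustration of what the table asserts, namely that the three constraints partition the input space and that each one determines the effect. It does not use JPF, and the names `SpeedPaths`, `Path`, and `consistent` are hypothetical.

```scala
object SpeedPaths {
  // UDF under test: classifies a trip by its speed (Scala version of the slide's UDF).
  def classify(speed: Int): (String, Int) =
    if (speed > 40) ("car", 1)
    else if (speed > 15) ("public", 1)
    else ("onfoot", 1)

  // One entry per row of the table: description, constraint, and effect.
  final case class Path(pc: String, holds: Int => Boolean, effect: (String, Int))

  val paths: Seq[Path] = Seq(
    Path("sym > 40", s => s > 40, ("car", 1)),
    Path("40 >= sym > 15", s => s > 15 && s <= 40, ("public", 1)),
    Path("sym <= 15", s => s <= 15, ("onfoot", 1))
  )

  // An input is consistent when exactly one constraint holds
  // and the UDF returns that path's effect.
  def consistent(speed: Int): Boolean = {
    val hit = paths.filter(_.holds(speed))
    hit.size == 1 && hit.head.effect == classify(speed)
  }
}
```

Checking `SpeedPaths.consistent` over a range such as -100 to 100, and in particular at the boundaries 15, 16, 40, and 41, confirms that one test input per row of the table is enough to exercise every branch of this UDF.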

Page 26

Logical Specifications of Dataflow Operators

• Challenges: Dataflow operators in DISC applications are accompanied by hundreds of thousands of lines of framework code, making symbolic execution infeasible.
• Insight: Dataflow operators have standard semantics but are implemented differently for optimization purposes.
• Solution: Using these semantics, we abstract their implementation into logical specifications and use the specifications to tie together the UDFs' symbolic trees.

From Step 2

[Figure: the operators map, map, filter, Join, and map, each shown with the logical specification of the operator and the symbolic tree of its UDF.]

