Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu
{pravind2, vvdeshpa, kogan}@ncsu.edu
COUL - semantic COmpUting research Lab
Introduction
Growing interest in exploiting RDF data for decision-making
Requires support for analytical-style querying
E.g.: Sales (cust, prod, price, loc, month, year)
* For each prod, count, for each month of 2008, the sales that were between the previous month's avg sale and the next month's avg sale
- More complex than traditional SPJ queries
- Often include multiple groupings and / or aggregations
- Next release of SPARQL expected to include such constructs
Sales counted within (prev_avg_sale, next_avg_sale):
Prod Month Count
Prod1 Feb 3
* Example from [1]
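The Sales example can be sketched in plain Python to show why it needs more than one grouping: one pass computes the monthly averages, a second pass counts sales against the neighbouring months' averages. The sample rows, the function names, and the reading of "between" as strictly between are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample of Sales(cust, prod, price, loc, month, year) rows.
SALES = [
    ("c1", "Prod1", 10.0, "NC", 1, 2008),
    ("c2", "Prod1", 20.0, "NC", 1, 2008),  # January average = 15
    ("c1", "Prod1", 16.0, "NC", 2, 2008),
    ("c3", "Prod1", 18.0, "NC", 2, 2008),
    ("c4", "Prod1", 40.0, "NC", 2, 2008),
    ("c2", "Prod1", 30.0, "NC", 3, 2008),  # March average = 30
]

def monthly_avg(sales, year):
    """First grouping: average price per (prod, month) for the given year."""
    sums = defaultdict(lambda: [0.0, 0])
    for _cust, prod, price, _loc, month, yr in sales:
        if yr == year:
            sums[(prod, month)][0] += price
            sums[(prod, month)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

def count_between_neighbours(sales, year):
    """Second grouping: per (prod, month), count sales strictly between
    the previous month's and the next month's average sale."""
    avg = monthly_avg(sales, year)
    counts = defaultdict(int)
    for _cust, prod, price, _loc, month, yr in sales:
        prev_avg = avg.get((prod, month - 1))
        next_avg = avg.get((prod, month + 1))
        if yr != year or prev_avg is None or next_avg is None:
            continue  # assumption: months without both neighbours are skipped
        if prev_avg < price < next_avg:
            counts[(prod, month)] += 1
    return dict(counts)

result = count_between_neighbours(SALES, 2008)
```

With this sample, only February has both neighbours, and two of its three sales fall between the January and March averages.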
Analytical Query Processing
Traditional OLAP techniques
- Requires star / snowflake schema
- Enterprise-scale
But Semantic Web data (RDF)
- Semi-structured (labeled graphs)
- Absence of star-like schema
- Billion triple data sets
Goal : Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.
MapReduce-based Data Processing
High-level dataflow languages - Pig Latin, DryadLINQ, HiveQL, JAQL
Hybrid approach - HadoopDB [5]
MapReduce in RDF processing
- Graph pattern queries [8], [9]
- Graph closure computation [10]
- RAPID [6]: succinct expression of complex queries; optimizes multiple groupings / aggregations
RDF data model
Statements (triples), graph representation
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
R1 avgDuration 97
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
UV1 userAgent SCOPE
UV1 cCode VNM
UV1 iCode VNM-KH
UV1 sKeyword comets
UV1 avgTime 3
Graph representation: two groups of triples, Rankings and UserVisits
Groups = Stars
SPARQL query: matching a graph pattern
Traditional Querying of RDF
Graph pattern matching
E.g. Get details about all pages visited by particular users between “1979/12/01” and “1979/12/30”
Example Analytical Query on RDF data
Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30
Pattern matching
- Star subgraphs: Rankings, UserVisits
- Join between the stars
Grouping based on the value of the srcIP property
Aggregation on the values of pageRank and adRevenue
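As a point of reference, the grouping and aggregation steps can be sketched in memory, assuming the two star subgraphs have already been matched. The sample stars, the function name, and the join condition pageURL = destURL are assumptions for illustration; the sketch also assumes each pageURL belongs to one Ranking.

```python
from collections import defaultdict

# Hypothetical star subgraphs, already matched (one dict per subject).
RANKINGS = [
    {"pageURL": "url1", "pageRank": 11},
    {"pageURL": "url2", "pageRank": 27},
]
USER_VISITS = [
    {"srcIP": "158.112.27.3", "destURL": "url1",
     "adRevenue": 339.08142, "visitDate": "1979/12/12"},
    {"srcIP": "158.112.27.3", "destURL": "url2",
     "adRevenue": 10.0, "visitDate": "1979/12/20"},
    {"srcIP": "159.222.21.9", "destURL": "url1",
     "adRevenue": 330.51248, "visitDate": "1980/02/02"},
]

def analytical_query(rankings, visits, start, end):
    """Join the stars, group by srcIP, then aggregate per group."""
    rank_by_url = {r["pageURL"]: r["pageRank"] for r in rankings}
    groups = defaultdict(list)
    for v in visits:
        # yyyy/mm/dd strings order correctly under plain string comparison
        if start <= v["visitDate"] <= end and v["destURL"] in rank_by_url:
            groups[v["srcIP"]].append((rank_by_url[v["destURL"]], v["adRevenue"]))
    return {
        ip: (sum(rank for rank, _ in rows) / len(rows),  # average pageRank
             sum(rev for _, rev in rows))                # total adRevenue
        for ip, rows in groups.items()
    }

result = analytical_query(RANKINGS, USER_VISITS, "1979/12/01", "1979/12/30")
```

Only the first srcIP has visits inside the date range, so it is the only group in the result.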
Pig : Data Processing
Express data processing tasks using high-level query primitives
- usability, code reuse, automatic optimization
Pig Latin data model: atom, tuple, bag (nesting)
Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
Extensibility support via UDFs
Operators compile into MapReduce jobs
Partition REL A using values in age column ($1)
SPLIT A into minors IF $1 < 18, majors IF $1 >= 18;
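A rough Python analogue of the SPLIT statement, as a sketch only: each tuple is routed to every output relation whose condition holds, so a tuple can land in several outputs or in none. The helper name `split` and the sample tuples are hypothetical.

```python
def split(relation, **conditions):
    """Route each tuple to every named output whose predicate is true."""
    outputs = {name: [] for name in conditions}
    for row in relation:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs

# Hypothetical REL A with the age in column $1.
A = [("alice", 17), ("bob", 18), ("carol", 42)]

parts = split(
    A,
    minors=lambda t: t[1] < 18,
    majors=lambda t: t[1] >= 18,
)
```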
Equijoin on REL A (column 0) and REL B (column 1)
JOIN A by $0, B by $1;
Compiling Pig Latin's JOIN to MapReduce
JOIN A by $1, B by $0;

REL A
$0 $1
C1 P1
C1 P2
C2 P1

REL B
$0 $1
P1 18
P2 25

map: annotate each tuple based on the join key ($1 of REL A, $0 of REL B)
reduce: package tuples with the same join key
Reducer 1 (key P1): (C1, P1, P1, 18), (C2, P1, P1, 18)
Reducer 2 (key P2): (C1, P2, P2, 25)

Join result
$0 $1 $2 $3
C1 P1 P1 18
C2 P1 P1 18
C1 P2 P2 25
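The compilation of this JOIN can be simulated in a few lines of Python. This is a sketch of a generic reduce-side join, not Pig's actual generated code: the map step annotates each tuple with its join key and a source tag, a stand-in shuffle groups by key, and each reducer pairs tuples from the two relations. The function names and tags are assumptions.

```python
from collections import defaultdict

REL_A = [("C1", "P1"), ("C1", "P2"), ("C2", "P1")]
REL_B = [("P1", 18), ("P2", 25)]

def map_phase(rel_a, rel_b):
    """Annotate every tuple with its join key and a tag naming its relation."""
    for row in rel_a:
        yield row[1], ("A", row)  # join key of REL A is $1
    for row in rel_b:
        yield row[0], ("B", row)  # join key of REL B is $0

def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect values per key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_phase(buckets):
    """Per key, pair every REL A tuple with every REL B tuple."""
    for key in sorted(buckets):
        a_rows = [row for tag, row in buckets[key] if tag == "A"]
        b_rows = [row for tag, row in buckets[key] if tag == "B"]
        for a in a_rows:
            for b in b_rows:
                yield a + b

joined = list(reduce_phase(shuffle(map_phase(REL_A, REL_B))))
```

The tag is what lets a reducer tell the two inputs apart once they share a bucket.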
Pattern Matching in Pig : Approach 1
Triple store (loaded three times as triples1, triples2, triples3)
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
RankingsStarPattern = JOIN triples1 BY Sub, triples2 BY Sub, triples3 BY Sub;
(triples1, triples2, triples3: three copies of the triple relation, joined to form the Rankings star)
Rankings star pattern = 3-way self-join
UserVisits star pattern = 5-way self-join
Issues
- Self-joins on very large relations: high I/O costs
- Generates meaningless tuples that need an additional filtering step, e.g. (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
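The effect of the self-join can be seen on a three-triple input. The sketch below uses a nested-loop join purely for illustration (the function names are hypothetical, and this is not how Pig executes a join): joining three copies of the relation on Sub yields every combination of the subject's triples, and a separate filter is needed to keep the one meaningful row.

```python
from itertools import product

TRIPLES = [
    ("R1", "type", "Ranking"),
    ("R1", "pageRank", "11"),
    ("R1", "pageURL", "Url1"),
]

def three_way_self_join(triples):
    """Join three copies of the triple relation on Sub."""
    return [t1 + t2 + t3
            for t1, t2, t3 in product(triples, repeat=3)
            if t1[0] == t2[0] == t3[0]]

def is_rankings_star(row):
    """Additional filtering step: keep the row whose three property
    columns are exactly (type, pageRank, pageURL) with type Ranking."""
    return (row[1], row[4], row[7]) == ("type", "pageRank", "pageURL") \
        and row[2] == "Ranking"

joined = three_way_self_join(TRIPLES)
stars = [row for row in joined if is_rankings_star(row)]
```

For a subject with three triples the join produces 27 rows, of which one survives the filter.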
Approach 2: Vertical Partitioning
LOAD all the RDF triples
SPLIT the triples into one relation per property:
typeRanking: (R1, type, Ranking), (R2, type, Ranking)
destURL: (UV1, destURL, url1), (UV2, destURL, url1)
pageURL: (R1, pageURL, url1), (R2, pageURL, url2)
pageRank: (R1, pageRank, 11), (R2, pageRank, 27)
typeUV: (UV1, type, userVisits), (UV2, type, userVisits)
srcIP: (UV1, srcIP, 158.112.27.3), (UV2, srcIP, 159.222.21.9)
adRev: (UV1, adRev, 339.08142), (UV2, adRev, 330.51248)
visitDate: (UV1, visitDate, 1979/12/12), (UV2, visitDate, 1980/02/02)
FILTER visitDate: (UV1, visitDate, 1979/12/12), (UV4, visitDate, 1979/12/02)
Ranking = JOIN(compute Star Pattern)
UserVisits = JOIN(compute Star Pattern)
JOIN between Ranking, UserVisits
GROUP BY srcIP
FOREACH group GENERATE aggregations
Issues
- SPLIT: concurrent sub-flows
- Risk of disk spills: I/O costs
- Structure of intermediate relations
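Vertical partitioning can be sketched as follows, under stated assumptions: the partition naming (`type` triples split further by class, so `typeRanking`), the helper names, and the simplification that each property has one value per subject are all illustrative choices, not taken from the slides.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageURL", "url1"), ("R1", "pageRank", "11"),
    ("R2", "type", "Ranking"), ("R2", "pageURL", "url2"), ("R2", "pageRank", "27"),
    ("UV1", "type", "userVisits"), ("UV1", "destURL", "url1"),
]

def vertical_partition(triples):
    """SPLIT by property: one (Sub, Obj) relation per property name."""
    parts = defaultdict(list)
    for sub, prop, obj in triples:
        name = f"type{obj}" if prop == "type" else prop
        parts[name].append((sub, obj))
    return parts

def star_join(parts, names):
    """Join the named partitions on Sub; a subject must occur in all of them.
    Assumes single-valued properties, so each partition maps Sub -> Obj."""
    tables = [dict(parts[name]) for name in names]
    subjects = set(tables[0]).intersection(*tables[1:])
    return {s: tuple(t[s] for t in tables) for s in sorted(subjects)}

parts = vertical_partition(TRIPLES)
ranking = star_join(parts, ["typeRanking", "pageURL", "pageRank"])
```

Each star still costs one join over its partitions, which is where the sequential intermediate results come from.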
Compilation to MapReduce Jobs
[Figure: the Pig plan (FILTER, JOIN, GROUP BY, FOREACH over Rankings and UserVisits) compiled into four MapReduce jobs: map1/reduce1, map2/reduce2, map3/reduce3, map4/reduce4]
Step 1: Pattern Matching
Step 2: Grouping
Step 3: Aggregation
Our Approach : RAPID+
Goal : Minimize I/O costs
Strategy:
Concurrent computation of star patterns using grouping-based algorithm
Can improve efficiency using Operator-coalescing and Look-ahead processing
Concurrent Star Pattern Matching
Use a grouping-based algorithm on a triple storage model
- GROUP BY Subject
More efficient with prior filtering of irrelevant triples
- Filter irrelevant properties
Query: Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

Input triples (Ranking and UserVisits)
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
R1 avgDuration 97
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
UV1 userAgent SCOPE
UV1 cCode VNM
UV1 iCode VNM-KH
UV1 sKeyword comets
UV1 avgTime 3

Triples kept after filtering
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
Concurrent Star Pattern Matching -2
Filter irrelevant triples by coalescing LOAD and FILTER operators
input = LOAD '\data' USING loadFilter(pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits);
Using Pig Latin: LOAD and FILTER are separate operators in map1
Our approach (operator coalescing): a single loadFilter operator in map1
Savings by coalescing:
- Context switching
- Parameter passing
- Multiple handling of same data
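The idea of coalescing LOAD and FILTER can be illustrated with a loader that filters while it parses, so irrelevant triples are never handed to a second operator. This is a sketch only: the whitespace-separated input format, the `load_filter` name, and the interpretation of `type:Class` entries are assumptions, not the actual loadFilter implementation.

```python
import io

# Hypothetical filter spec mirroring the loadFilter arguments: plain entries
# are property names, "type:Class" entries restrict which type triples pass.
SPEC = ["pageRank", "pageURL", "type:Ranking", "destURL",
        "adRevenue", "srcIP", "visitDate", "type:UserVisits"]

def load_filter(lines, spec):
    """Parse and filter in a single pass over the input."""
    wanted_props = {s for s in spec if ":" not in s}
    wanted_types = {s.split(":", 1)[1] for s in spec if s.startswith("type:")}
    for line in lines:
        sub, prop, obj = line.split(None, 2)
        obj = obj.strip()
        if prop in wanted_props or (prop == "type" and obj in wanted_types):
            yield sub, prop, obj

DATA = io.StringIO(
    "R1 type Ranking\n"
    "R1 pageRank 11\n"
    "R1 avgDuration 97\n"
    "UV1 srcIP 158.112.27.3\n"
    "UV1 userAgent SCOPE\n"
)
triples = list(load_filter(DATA, SPEC))
```

Each triple is touched once, which corresponds to the "multiple handling of same data" saving listed above.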
Grouping-based Pattern Matching
Sub Prop Obj
R1 type Ranking
R1 pageRank 11
R1 pageURL Url1
UV1 type UserVisits
UV1 srcIP 158.112.27.3
UV1 destURL url1
UV1 adRevenue 339.08142
UV1 visitDate 1979/12/12
GROUP BY Subject
BUT heterogeneous bags
starSubgraphs = GROUP input BY $0;
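A plain-Python analogue of this GROUP statement, as a minimal sketch (the function name is hypothetical): grouping on the subject column yields one bag per subject, and the bags differ in size and in the properties they hold, which is the heterogeneity noted above.

```python
from collections import defaultdict

TRIPLES = [
    ("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "Url1"),
    ("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
    ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
    ("UV1", "visitDate", "1979/12/12"),
]

def group_by_subject(triples):
    """Analogue of GROUP input BY $0: one bag of triples per subject."""
    bags = defaultdict(list)
    for triple in triples:
        bags[triple[0]].append(triple)
    return dict(bags)

star_subgraphs = group_by_subject(TRIPLES)
# Bag sizes differ per subject: a Rankings star and a UserVisits star.
shapes = {sub: len(bag) for sub, bag in star_subgraphs.items()}
```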
Filtering the Groups
BUT all possible sub patterns are computed
Filter non-matching sub patterns
- Value-based filtering: validate each subgraph against the filter condition (e.g. visitDate between 1979/12/01 and 1979/12/30)
- Structure-based filtering: eliminate subgraphs with missing properties (e.g. missing srcIP)
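The two filters can be sketched as predicates over a bag of triples. The function names, the sample bags UV2 and UV3, and the required-property set are assumptions for illustration; the sketch checks only the UserVisits star and does not validate the type value.

```python
USER_VISITS_PROPS = {"type", "srcIP", "destURL", "adRevenue", "visitDate"}

def matches_structure(bag, required_props):
    """Structure-based filter: every required property is present in the bag."""
    return required_props <= {prop for _sub, prop, _obj in bag}

def matches_values(bag, start="1979/12/01", end="1979/12/30"):
    """Value-based filter: every visitDate in the bag lies in [start, end]."""
    return all(start <= obj <= end
               for _sub, prop, obj in bag if prop == "visitDate")

def filter_user_visits(bags):
    """Keep only bags that pass both filters."""
    return {sub: bag for sub, bag in bags.items()
            if matches_structure(bag, USER_VISITS_PROPS) and matches_values(bag)}

BAGS = {
    "UV1": [("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
            ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
            ("UV1", "visitDate", "1979/12/12")],
    # Hypothetical bag with a missing srcIP: fails the structure-based filter.
    "UV2": [("UV2", "type", "UserVisits"), ("UV2", "destURL", "url1"),
            ("UV2", "adRevenue", "330.51248"), ("UV2", "visitDate", "1979/12/15")],
    # Hypothetical bag with an out-of-range date: fails the value-based filter.
    "UV3": [("UV3", "type", "UserVisits"), ("UV3", "srcIP", "159.222.21.9"),
            ("UV3", "destURL", "url2"), ("UV3", "adRevenue", "12.5"),
            ("UV3", "visitDate", "1980/02/02")],
}
kept = filter_user_visits(BAGS)
```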
Joining the Stars : Look-ahead Processing
Star Pattern Matching Cycle
- map: annotate based on Subject
- reduce: group by Subject; process each bag with structure-based and value-based filtering
Next Cycle (Joining the Stars)
- map: process each bag; annotate based on the value of the join property
- reduce: join between the star subgraphs
With look-ahead processing
- reduce of the star pattern matching cycle: group by Subject; process each bag with structure-based and value-based filtering; annotate based on the value of the join property
- No repeated processing
Example : Look-ahead Processing
Star Pattern Matching
- Structure-based filtering
- Value-based filtering
- Look-ahead: annotate bag based on join key
Joining the Stars
- Join between the star subgraphs
- Eliminate properties irrelevant for future processing (join and filter properties)
- Minimize size of intermediate results
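Look-ahead processing can be sketched as a reduce step that, while it still holds a filtered bag, already keys it by the join value and drops properties later steps do not need. This is an illustrative simulation, not the RAPID+ implementation: the `STARS` settings, the function names, and the join on pageURL = destURL are assumptions.

```python
from collections import defaultdict

# Hypothetical per-star settings: the join property and the properties
# still needed after the join (for grouping and aggregation).
STARS = {
    "Ranking":    {"join_prop": "pageURL", "keep": {"pageRank"}},
    "UserVisits": {"join_prop": "destURL", "keep": {"srcIP", "adRevenue"}},
}

def reduce_star_cycle(bag):
    """Reduce of the star-matching cycle for a bag that passed both filters.
    Look-ahead: emit (join value, projected star) so the next cycle's map
    does not have to process the bag again."""
    props = {prop: obj for _sub, prop, obj in bag}
    star = STARS[props["type"]]
    projected = {p: props[p] for p in star["keep"]}
    return props[star["join_prop"]], (props["type"], projected)

def reduce_join_cycle(annotated):
    """Next cycle: group pre-annotated stars by join value and pair them."""
    by_key = defaultdict(lambda: defaultdict(list))
    for key, (star_type, projected) in annotated:
        by_key[key][star_type].append(projected)
    for key in sorted(by_key):
        for ranking in by_key[key]["Ranking"]:
            for visit in by_key[key]["UserVisits"]:
                yield {**ranking, **visit}

BAGS = [
    [("R1", "type", "Ranking"), ("R1", "pageRank", "11"), ("R1", "pageURL", "url1")],
    [("UV1", "type", "UserVisits"), ("UV1", "srcIP", "158.112.27.3"),
     ("UV1", "destURL", "url1"), ("UV1", "adRevenue", "339.08142"),
     ("UV1", "visitDate", "1979/12/12")],
]
joined = list(reduce_join_cycle(reduce_star_cycle(bag) for bag in BAGS))
```

The joined row carries only pageRank, srcIP, and adRevenue; the join and filter properties have already been dropped.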
Comparison : Pig vs RAPID+
Pig Approach | RAPID+
Multiple map-reduce cycles: N star subgraphs, N cycles | Single cycle: N star subgraphs, 1 cycle
Potential for increased I/O: (i) disk spills (SPLIT operator); (ii) materialization of several intermediate results due to sequential computation of star patterns | Minimized I/O: (i) filtering in triple storage model + load-filter coalescing; (ii) concurrent computation of star patterns (single intermediate result)
Would require advanced optimization techniques, e.g. introducing a project operator to eliminate unneeded columns | Smaller intermediate result sizes: eliminates tuples and columns not necessary in future steps of processing
Not applicable | Minimizes repeated tuple handling by look-ahead processing
Case Study
Setup: 5-node / 20-node Hadoop clusters on NCSU's Virtual Computing Lab [13]
Dataset: Synthetic benchmark data set [4]
Tasks: Baseline case
- Task A (PM): basic pattern matching (2 star patterns and a join between the stars)
- Task B (PM+GA): pattern matching with grouping and aggregation (two look-ahead processing opportunities)
Experimental Results
[Figure: Cost analysis for Task A (PM), 5-node cluster]
[Figure: Cost analysis for Task B (PM+GA), 5-node cluster]
[Figure: Scalability study, 5-node vs 20-node clusters; 1.8GB per node, 2.8GB per node]
Conclusion and Ongoing work
Promising results even for baseline case
Further opportunities for improvement
- First-class operators vs UDFs
- Exploit combiners during aggregations
- More efficient data structures for processing bags
- Further look-ahead optimizations during multiple groupings and aggregations
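To illustrate what exploiting combiners could look like for the example query, here is a generic sketch, not the authors' design: an average cannot be combined directly, so each mapper's combiner emits partial (sum, count) pairs per srcIP and the reducer forms the average at the end. All names and sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit (srcIP, (pageRank sum, count, adRevenue sum)) per joined row."""
    for src_ip, page_rank, ad_revenue in rows:
        yield src_ip, (page_rank, 1, ad_revenue)

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output per key before the shuffle."""
    partial = defaultdict(lambda: (0, 0, 0.0))
    for key, (rank_sum, count, revenue_sum) in pairs:
        r, c, a = partial[key]
        partial[key] = (r + rank_sum, c + count, a + revenue_sum)
    return list(partial.items())

def reduce_phase(pairs):
    """Merge the partial aggregates, then compute the average per key."""
    merged = dict(combine(pairs))
    return {key: (r / c, a) for key, (r, c, a) in merged.items()}

# Two hypothetical input splits of (srcIP, pageRank, adRevenue) rows.
split1 = [("ip1", 10, 1.0), ("ip1", 20, 2.0)]
split2 = [("ip1", 30, 3.0), ("ip2", 5, 0.5)]

combined = combine(map_phase(split1)) + combine(map_phase(split2))
result = reduce_phase(combined)
```

Four mapper records shrink to three before the shuffle here; the saving grows with the number of rows per key in each split.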
References
[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. "The MD-join: an operator for Complex OLAP". ICDE 2001, 108–121
[2] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In Proc. of OSDI '04, 2004
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing". In Proc. of ACM SIGMOD 2008, pp. 1099–1110
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis". In Proc. of SIGMOD 2009
[5] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, and A. Silberschatz. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads". VLDB 2009
[6] R. Sridhar, P. Ravindra, and K. Anyanwu. "RAPID: Enabling scalable ad-hoc analytics on the semantic web". ISWC 2009
[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language". OSDI 2008
[8] A. Newman, Y. Li, and J. Hunter. "Scalable Semantics – The Silver Lining of Cloud Computing". IEEE Fourth International Conference on eScience, 2008
[9] A. Newman, J. Hunter, Y-F. Li, C. Bouton, and M. Davis. "A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data". HCLS'08 at WWW 2008
[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. "Scalable Distributed Reasoning using MapReduce". In Proceedings of ISWC '09, 2009
[11] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. "Scalable Semantic Web Data Management Using Vertical Partitioning". VLDB 2007
[12] E. Prud'hommeaux and A. Seaborne. "SPARQL query language for RDF". Technical report, World Wide Web Consortium (2005). http://www.w3.org/TR/rdf-sparql-quer
[13] VCL Setup at NC State University, https://vcl.ncsu.edu/
[14] HiveQL, http://hadoop.apache.org/hive/
[15] JAQL, http://code.google.com/p/jaql
[16] RDF, http://www.w3.org/RDF/
Thank You!