Taxonomy-Based Software Construction for Algorithmic Families

Bruce W. Watson

with Loek Cleophas & Derrick Kourie

bruce@bruce-watson.com

VaMoS 2016 2016.01.28

Aim & Motivation

RuZA Workshop, CSIR 2014/08/14

Aim: Give an overview and examples of taxonomies and of the TABASCO approach (TAxonomy-BAsed Software COnstruction)

Motivation: Understand and bring order to an (algorithmic) domain, and construct reusable (algorithmic) software for it

Random Quotes

Bjarne Stroustrup: “infrastructure software” has stronger quality and elegance requirements

C.A.R. (Tony) Hoare: “…[your] taxonomies are to the field of algorithmics what the Standard Model is to Particle Physics…”

TABASCO Exercises

• Keyword pattern matching
• Finite automata construction
• Deterministic finite automata minimization
• Minimal acyclic deterministic finite automata construction
• Lempel-Ziv-style compression
• Tree automata (pattern matching & acceptance)
• Graph representations
• Approximate & 2D pattern matching

Case Study: Generalised Stringology

• Regular Grammar and Regular Expression
  – Different types, transformations between them

• Problems
  – Membership/Acceptance
  – Keyword Pattern Matching (KPM)

• Finite Automaton
  – Nondeterministic with/without epsilon-transitions, deterministic

• Theoretical Results (1950s)
  – Equivalence of NFA and DFA (subset construction)
  – Equivalence of RG, RE, and FA
  – Solve by constructing and using FA based on RG/RE
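The subset construction mentioned above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the map-based NFA representation, the class name SubsetConstruction and the example automaton (strings over {a, b} ending in "ab") are hypothetical and not taken from the toolkits discussed in this talk, and epsilon-transitions are assumed to have been removed already.

```java
import java.util.*;

/** Subset construction sketch: turns an epsilon-free NFA into an equivalent DFA. */
final class SubsetConstruction {

    /** Hypothetical example NFA accepting strings over {a, b} that end in "ab" (final state 2). */
    static Map<Integer, Map<Character, Set<Integer>>> exampleNfa() {
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        nfa.put(0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)));
        nfa.put(1, Map.of('b', Set.of(2)));
        return nfa;
    }

    /** Each DFA state is a set of NFA states; only states reachable from {start} are built. */
    static Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int start) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> init = Set.of(start);
        dfa.put(init, new HashMap<>());
        work.add(init);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            // Per symbol, take the union of the successors of all NFA states in 'current'.
            Map<Character, Set<Integer>> out = new HashMap<>();
            for (int q : current) {
                for (Map.Entry<Character, Set<Integer>> e
                        : nfa.getOrDefault(q, Map.of()).entrySet()) {
                    out.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            for (Map.Entry<Character, Set<Integer>> e : out.entrySet()) {
                Set<Integer> target = Set.copyOf(e.getValue()); // immutable, safe as map key
                dfa.get(current).put(e.getKey(), target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new HashMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    /** Runs the DFA; a DFA state is accepting if it contains at least one NFA final state. */
    static boolean accepts(Map<Set<Integer>, Map<Character, Set<Integer>>> dfa,
                           Set<Integer> start, Set<Integer> nfaFinals, String input) {
        Set<Integer> state = start;
        for (char c : input.toCharArray()) {
            state = dfa.get(state).get(c);
            if (state == null) return false; // no transition on c: reject
        }
        return !Collections.disjoint(state, nfaFinals);
    }
}
```

For the example NFA, determinize(exampleNfa(), 0) builds the three reachable DFA states {0}, {0,1} and {0,2}.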


Case Study: Generalised Stringology (cont.)

• In practice (1960s - now):
  – Many applications
    • Natural language text search
    • DNA processing
    • Network intrusion and virus detection
  – Many FA constructions, acceptance/KPM algorithms: O(10²)
    • More efficient; for specific situations
  – Difficult to find, understand, compare
  – Separation between theory and practice
  – Hard to compare and choose implementations

Motivation Arbology (Tree Formal Languages)

• Regular Tree Grammar (and Regular Tree Expression)
  – Different types, transformations between them

• Problems
  – Membership/Tree (Grammar) Acceptance (TGA), Tree Parsing
  – Tree Pattern Matching (TPM)

• Finite Tree Automaton (TA)
  – Nondeterministic with/without epsilon-transitions, deterministic
  – Undirected, root-to-frontier (RF), frontier-to-root (FR)

• Theoretical Results (1960s)
  – Equivalence of TAs (except DRFTA) (subset construction)
  – Equivalence of RTG, RTE, and TA (except DRFTA)
  – Solve by constructing and using TA based on RTG

Motivation Tree Formal Languages/Algorithmics

• In practice (ca. 1975 - now):
  – Quite a few application domains as well
    • Code generation
    • Term rewriting
    • Model transformation
  – Many TA constructions, TGA/TPM algorithms
    • More efficient; for specific situations
  – Difficult to find, understand, compare
  – Separation between theory and practice
  – Hard to compare and choose implementations

Generic Domain “Attractions”

• Well-established theory
• Algorithmic problems: related, with related solutions
• Many algorithms
• Many applications

Generic Domain Deficiencies

• Inaccessibility of theory and algorithms
• Difficulty of understanding and comparing algorithms
  – Difference in style
  – Difference in formality
• Separation between theory and practice
• Lack of large collection of implementations
• Difficulty of choosing between algorithms

TABASCO— Domain Deficiencies & Taxonomies

• Classification (in particular Taxonomy)
  – Show commonality & variation in algorithm & data representation
  – Show correctness
  – Easily find and compare algorithms

TABASCO— Domain Deficiencies & Toolkits

• Toolkit, GUI, DSL
  – Give insight into algorithm properties, performance
  – Understand and compare algorithms in practice
  – Allow easy choice and use

TABASCO—Steps

Process consists of multiple steps:
1. Selection of domain
2. Literature survey
3. Classification construction
4. Toolkit design
5. Toolkit implementation
6. Benchmarking
7. DSL/GUI design
8. DSL/GUI implementation

Classifications Biological Taxonomies

• Classify organisms
• From abstract, general to concrete, specific
• Properties (details) explicit
• Allow comparison

[Figure: fragment of a biological taxonomy tree. Eukaryotes branches into Plantae and Animalia; below Animalia, Mammalia contains Proboscidea (Elephantidae, Loxodonta africana) and Primates (Homo sapiens).]

Classifications Algorithm Taxonomies

[Figure: tree acceptance taxonomy graph with labels t-acceptor, match-set, tabulate, filter, s-path, sp-matcher, aca-spm, drfta-spm.]

• Similar to biological taxonomies
• Algorithm taxonomies classify algorithms based on essential details
• Depicted as tree/DAG: nodes refer to algorithms, branches to details
• Algorithms solving one algorithmic problem
  – From abstract, general to concrete, specific
  – Root represents high-level algorithm
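As a rough illustration of "nodes refer to algorithms, branches to details", a taxonomy can be modelled as a tree whose edges carry detail labels, so that an algorithm is identified by the sequence of details on its root path. The sketch below is an assumption-laden toy: the class TaxonomyNode and the node names are hypothetical, only the detail labels (t-acceptor, fr, rf, det) come from the slides, and a real taxonomy may be a DAG rather than a tree.

```java
import java.util.*;

/** Toy taxonomy: nodes are algorithms, edges are labelled with the detail that refines them. */
final class TaxonomyNode {
    final String name;
    /** Detail label -> refined (more concrete) algorithm. */
    final Map<String, TaxonomyNode> children = new LinkedHashMap<>();

    TaxonomyNode(String name) { this.name = name; }

    /** Adds (or returns the existing) child reached by applying the given detail. */
    TaxonomyNode refine(String detail, String childName) {
        return children.computeIfAbsent(detail, d -> new TaxonomyNode(childName));
    }

    /** Depth-first search; returns the detail labels on the path from 'node' to 'target'. */
    static Optional<List<String>> rootPath(TaxonomyNode node, String target) {
        if (node.name.equals(target)) return Optional.of(new ArrayList<>());
        for (Map.Entry<String, TaxonomyNode> e : node.children.entrySet()) {
            Optional<List<String>> sub = rootPath(e.getValue(), target);
            if (sub.isPresent()) {
                sub.get().add(0, e.getKey()); // prepend the detail of this branch
                return sub;
            }
        }
        return Optional.empty();
    }

    /** Hypothetical fragment of a tree acceptance taxonomy; node names are illustrative. */
    static TaxonomyNode example() {
        TaxonomyNode root = new TaxonomyNode("abstract tree acceptor");
        TaxonomyNode ta = root.refine("t-acceptor", "automaton-based acceptor");
        ta.refine("fr", "frontier-to-root acceptor")
          .refine("det", "deterministic frontier-to-root acceptor");
        ta.refine("rf", "root-to-frontier acceptor");
        return root;
    }
}
```

Two algorithms can then be compared by looking at the common prefix of their root paths: the shared details are their commonality, the remainder their variation.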

[Figure: taxonomy graph of keyword pattern matching algorithms. Labels: AC-OPT, AC-FAIL, KMP-FAIL, INDICES, NLAU, OLAU, BMCW, NLA, S, F, FO, (SO), EGC, RSA, RFA, RFO, (RSO), FWD, REV, OM, NONE, SFC, FAST, SLFC. Algorithm families: Aho-Corasick, Commentz-Walter, Boyer-Moore, Knuth-Morris-Pratt.]

Taxonomies Presentation & Correctness: Top-down

• Root represents high-level algorithm
  – With pre-/postcondition, invariants, ...
  – Correctness easily shown
• Adding detail
  – Obtains refinement/variation (from literature or new)
  – Branch connecting algorithm node to child node
  – Associated correctness arguments: correctness-preserving
• Correctness of root and of details on root path imply correctness of node: correctness-by-construction approach (Dijkstra et al., Eindhoven; Kourie & Watson, 2012)

Taxonomies Presentation & Correctness: Top-down (cont.)

• Allow comparison
  – Commonalities lead to common path from root*
• Multiple paths to same solution possible
• Main goal: improve understanding of algorithms and their relations, i.e. commonalities and variabilities
• Taxonomy forms domain model, classification
  – So do feature model, formal concept lattice, topic map, ...

Dijkstra refinement

IF statement

Refine {P} S {Q} to

    { P }
    if G0 → { P ∧ G0 } S0 { Q }
    [] G1 → { P ∧ G1 } S1 { Q }
    fi
    { Q }

if P ⇒ G0 ∨ G1

For example

    { pre: m and n are integers }
    if m ≥ n → x := m; y := n
    [] m ≤ n → x := n; y := m
    fi
    { post: x = m max n ∧ y = m min n }

Note nondeterminism
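A minimal Java rendering of the max/min example, offered as a sketch rather than as the authors' code: the class and method names are hypothetical. Java has no guarded commands, so the overlap of the two guards at m = n is resolved here by always taking the first branch; either choice satisfies the postcondition, which is checked at run time.

```java
/** Max/min example from the IF refinement rule, with the postcondition checked at run time. */
final class MaxMinRefinement {

    /** Returns {x, y} such that x = max(m, n) and y = min(m, n). */
    static int[] maxMin(int m, int n) {
        int x;
        int y;
        // The guards m >= n and m <= n overlap when m == n; if/else picks the first branch.
        if (m >= n) {
            x = m;
            y = n;
        } else { // here m < n, which implies the second guard m <= n
            x = n;
            y = m;
        }
        // Postcondition: x = m max n and y = m min n.
        if (x != Math.max(m, n) || y != Math.min(m, n)) {
            throw new AssertionError("postcondition violated");
        }
        return new int[] {x, y};
    }
}
```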


DO loops

For invariant I and variant expression V we get

    { P }
    { I }
    do G → { I ∧ G } S { I ∧ (V decreased) }
    od
    { I ∧ ¬G }
    { Q }

Remember to check P ⇒ I and I ∧ ¬G ⇒ Q
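The DO rule can be illustrated with an ordinary Java loop that checks its invariant and variant at run time. The example (summing an array) and all names are assumptions chosen for illustration: the invariant I is "s equals the sum of a[0..i)", the guard G is "i ≠ a.length", and the variant V is "a.length − i".

```java
/** Array summation annotated with a run-time checked loop invariant and variant. */
final class InvariantLoop {

    static int sum(int[] a) {
        int i = 0;
        int s = 0; // establishes I: s == sum of a[0..i) with i == 0
        while (i != a.length) { // guard G
            int variantBefore = a.length - i;
            s += a[i];
            i++;
            check(s == prefixSum(a, i), "invariant I");
            check(a.length - i >= 0 && a.length - i < variantBefore, "variant V");
        }
        // I and not G: s == sum of a[0..a.length), which is the postcondition Q.
        return s;
    }

    /** Reference computation used only to check the invariant. */
    private static int prefixSum(int[] a, int end) {
        int total = 0;
        for (int k = 0; k < end; k++) total += a[k];
        return total;
    }

    private static void check(boolean condition, String what) {
        if (!condition) throw new AssertionError(what + " violated");
    }
}
```

The run-time checks stand in for the proof obligations; in the correctness-by-construction style they would be discharged on paper instead of being executed.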

Taxonomies Example: Keyword Pattern Matching

• Detail choice and order depend on personal preference & domain understanding
• Inclusion of different orders for a single algorithm leads to a directed acyclic graph
• Initial version by Watson & Zwaan (1992-1996)
• Revised & extended
  – Cleophas (2003)
  – Cleophas, Watson & Zwaan (2004; 2010)

[Figure: revised taxonomy graph of keyword pattern matching algorithms. Labels: ACAC-OPT, AC-FAIL, KMP-FAIL, INDICES, NLAU, OLAU, BMCW, NLA, SHOBPLMIN, BMHBMH, S, F, FO, SO, RSA, RFA, RFO, (RSO). Annotations: backward (suffix, factor, factor oracle-based); forward (prefix-based); shift functions (leading to sublinear algorithms); choice of f(P) & dR,f (automaton recognizing f(P)^R).]

Algorithm (and Problem) Details (e.g.)

okw (Problem detail 4.77) The set of keywords contains one keyword.

indices (Algorithm detail 4.82) Represent substrings by indices into the complete strings, converting a string-based algorithm into an indexing-based algorithm.

cw (Algorithm detail 4.90) Consider any shift distance that does not lead to the missing of any matches. Such shift distances are called safe.

nla (Algorithm detail 4.103) The left and right lookahead symbols are not taken into account when computing a safe shift distance. The computation of a shift distance is done by using two precomputed shift functions applied to the current longest partial match.

lla (Algorithm detail 4.104) The left lookahead symbol is taken into account when computing a safe shift distance.

cw-opt (Algorithm detail 4.108) Compute a shift distance using a single precomputed shift function applied to the current longest partial match and the left lookahead symbol.

bmcw (Algorithm detail 4.116) Compute a shift distance using a single precomputed shift function which is applied to the current longest partial match and the left lookahead symbol. The function yields shifts that are no greater than the function in detail (cw-opt).

near-opt (Algorithm detail 4.121) Compute a shift distance using a single precomputed shift function applied to the current longest partial match and the left lookahead symbol. The function is derived from the one in detail (bmcw), and it yields shifts which are no greater.

norm (Algorithm detail 4.127) Compute a shift distance as in (nla) but additionally use a third shift function applied to the lookahead symbol. The shift distance obtained is that of the normal Commentz-Walter algorithm.

bm (Algorithm detail 4.135) Compute a shift distance using one shift function applied to the lookahead symbol, and another shift function applied to the current longest partial match. The shift distance obtained is that of the Boyer-Moore algorithm.

rla (Algorithm detail 4.137) The right lookahead symbol is taken into account when computing a safe shift distance.
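To make the notion of a safe, precomputed shift function concrete, here is a sketch of the Boyer-Moore-Horspool variant for a single keyword (the okw case). It is deliberately a simpler relative of the shift functions listed above, not the (bm) or (cw) detail itself: it uses only one shift function, applied to the text symbol aligned with the last keyword position. The class and method names are hypothetical.

```java
import java.util.*;

/** Single-keyword matcher in the Boyer-Moore-Horspool style, using one precomputed shift function. */
final class HorspoolSketch {

    /** Returns the start positions of all (possibly overlapping) occurrences of keyword in text. */
    static List<Integer> findAll(String text, String keyword) {
        int m = keyword.length();
        int n = text.length();
        List<Integer> matches = new ArrayList<>();
        if (m == 0 || m > n) return matches;

        // Shift function: distance from the rightmost occurrence of each symbol
        // (ignoring the last keyword position) to the end of the keyword.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(keyword.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the keyword against the current window from right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == keyword.charAt(j)) j--;
            if (j < 0) matches.add(pos);
            // Safe shift: at least 1, at most m, and never skips over a match.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return matches;
    }
}
```

For example, HorspoolSketch.findAll("abracadabra", "abra") returns the positions 0 and 7. Because every shift is bounded above by the keyword length, longer keywords allow longer jumps through the text.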


Taxonomies Example: Tree Acceptance

[Figure: tree acceptance taxonomy graph with labels t-acceptor, match-set, tabulate, filter (tfilt, sfilt, ifilt, cfilt), s-path, sp-matcher, aca-spm, drfta-spm. Branches are annotated with the following groups of literature.]

• van Dinther, 1987
• Brainerd, 1967 & 1969; Turner, 1986; van Dinther, 1987; Weisgerber & Wilhelm, 1989; Hemerik & Katoen, 1989; Ferdinand, Seidl & Wilhelm, 1994; Wilhelm & Maurer, 1995
• Chase, 1987; Hemerik & Katoen, 1989; Ferdinand, Seidl & Wilhelm, 1994; Cleophas, 2008
• Aho, Ganapathi & Tjiang, 1985, 1988; van de Meerakker, 1988; Weisgerber & Wilhelm, 1989; Ferdinand, Seidl & Wilhelm, 1994; Wilhelm & Maurer, 1995; Cleophas, Hemerik & Zwaan, 2005 & 2006

Tree Acceptance Taxonomy One Algorithm Path

    |[ const G = ...; t : ...;
       var b : B
     | b := t ∈ L(G)
    ]|

(t-acceptor)

    |[ const G = ...; t : ...;
       var b : B
     | let M = ... be a ta such that L(M) = L(G);
       b := t ∈ L(M)
    ]|

(t-acceptor, fr, det)

    |[ const G = ...; t : ...;
       var b : B
     | let M = ... be a dfrta such that L(M) = L(G);
       b := Traverse(ε) ∈ Qra
    ]|

    func Traverse(↓ n : D) : Q =
    |[ var q1, ..., qn : Q
     | let a = t(n);
       if n > 0 → Traverse := Ra(Traverse(n · 1), ..., Traverse(n · n))
       [] n = 0 → Traverse := Ra()
       fi
    ]|
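The Traverse function can be tried out with a small, self-contained Java sketch of a deterministic frontier-to-root tree automaton. Everything concrete here is an assumption made for illustration: the automaton evaluates Boolean expression trees, the state set Q is {false, true}, the root accepting states Qra are {true}, and the names DfrtaSketch, Node and R are hypothetical rather than taken from the toolkit.

```java
import java.util.*;
import java.util.function.Function;

/** Deterministic frontier-to-root tree automaton over Boolean expression trees. */
final class DfrtaSketch {

    /** A ranked tree node: a symbol and one child per argument position. */
    static final class Node {
        final String symbol;
        final List<Node> children;

        Node(String symbol, Node... children) {
            this.symbol = symbol;
            this.children = List.of(children);
        }
    }

    /** Transition functions R_a: for each symbol a, map the child states to the node's state. */
    static final Map<String, Function<List<Boolean>, Boolean>> R = Map.of(
            "true", qs -> true,
            "false", qs -> false,
            "not", qs -> !qs.get(0),
            "and", qs -> qs.get(0) && qs.get(1),
            "or", qs -> qs.get(0) || qs.get(1));

    /** Computes the state at node n from the states of its subtrees (frontier to root). */
    static boolean traverse(Node n) {
        List<Boolean> childStates = new ArrayList<>();
        for (Node child : n.children) {
            childStates.add(traverse(child));
        }
        // Leaves have no children, so R_a is applied to an empty list, as in Ra().
        return R.get(n.symbol).apply(childStates);
    }

    /** The tree is accepted if the state at the root is in Qra, here {true}. */
    static boolean accepts(Node t) {
        return traverse(t);
    }
}
```

With this automaton, the tree and(true, not(false)) is accepted and or(false, false) is rejected. Note that the automaton is given directly; constructing it from a regular tree grammar is a separate step.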


Construction of automaton separate issue

Taxonomies Tree Acceptance and Tree Pattern Matching

[Figure: the tree acceptance taxonomy (labels t-acceptor, match-set, tabulate, filter, s-path, sp-matcher, aca-spm, drfta-spm) shown next to the tree pattern matching taxonomy (labels t-matcher, ra-loops, match-set, tabulate, filter, s-path, sp-matcher, aca-spm, drfta-spm).]

Taxonomies Tree Automata Constructions

• About 50 different constructions in tree acceptance and tree pattern matching taxonomies
  – differ in e.g. direction, epsilons, determinism, advanced techniques

• Construction presentation
  – uniform style
  – defines state set, transition relation, ...
  – gives example
  – correctness arguments
  – related constructions and literature
  – identified by sequence of labels indicating details, e.g. (TPM-TA:ALL-SUB:REM-Epsilon:FR:SUBSET)

Automata Construction Taxonomy

[Figure: taxonomy graph with labels rem-ε, rem-ε-dual, e-mark, b-mark, filt, sym, subset, use-s, Wfilt, Xfilt; vertices BS (6.39), MYG (6.44), ASU (6.86), Ant. (6.55), Brz. (6.57), 6.19, 6.43, 6.63, 6.65, 6.68, p. 158.]

Figure 6.1: A taxonomy of finite automata construction algorithms. The larger graph represents the main part of the taxonomy, while the smaller graph represents the two instantiations of the filt detail that are discussed in this dissertation. The numbers appearing at some of the vertices correspond to the algorithm or construction numbers in the text of this chapter. In some cases, the algorithm is not presented explicitly, and the page number is given instead. The use of duality is clearly shown by the symmetry in the graph. The algorithms in the dashed-line subtree (on the right of the graph) are not treated in this dissertation, since they are the duals of algorithms in the left half and it is not clear that the duals would be more efficient or enlightening.

DFA Minimization

[Figure: family trees of minimization algorithms. Vertices: ASU (7.21), Hopcroft-Ullman (7.24), Hopcroft (7.26), Brzozowski (§ 7.2), (7.18), (7.19), (7.22), (7.23), (7.27), (7.28), (§ 7.4.6), (§ 7.4.1–7.4.5, 7.4.7), (§ 7.4.1–7.4.5), (p. 207), (p. 212). Refinement labels: Equivalence of states (§ 7.3), equivalence relation, eq. classes, imperative program, optimized list update, pointwise, memoization, approx. from below, approx. from above, Improved, layerwise, unordered, state pairs.]

Figure 7.1: The family trees of finite automata minimization algorithms. Brzozowski’s minimization algorithm is unrelated to the others, and appears as a separate (single vertex) tree. Each algorithm presented in this chapter appears as a vertex in this tree. For each algorithm that appears explicitly in this chapter, the construction number appears in parentheses (indicating where it appears in this chapter). For algorithms that do not appear explicitly, a reference to the section or page number is given. Edges denote a refinement of the solution (and therefore explicit relationships between algorithms). They are labeled with the name of the refinement.

Taxonomies Advantages and Disadvantages

+ Algorithm comparison easier
+ Clear and correct algorithm presentation
+ Orders field, usable as teaching aid
+ Well suited for exploratory algorithmics
+ Formal specifications
+ Aids in construction of toolkit
- Takes much time and effort (abstraction (bottom-up!), sequential addition of details)
- Overkill for some domains?


Toolkit vs Taxonomy

    func Traverse(↓ n : D) : Q =
    |[ var q1, ..., qn : Q
     | let a = t(n);
       if n > 0 → Traverse := Ra(Traverse(n · 1), ..., Traverse(n · n))
       [] n = 0 → Traverse := Ra()
       fi
    ]|

    // Returns AbstractTAState so that results can be stored in childStates without a cast.
    private static AbstractTAState Traverse(AbstractDFRTA M, Node n) {
        // Compute the states of all subtrees first (frontier-to-root).
        AbstractTAState[] childStates = new AbstractTAState[n.children().size()];
        for (int i = 0; i < n.children().size(); i++) {
            childStates[i] = Traverse(M, n.children().get(i));
        }
        AbstractTAState state;
        if (n.children().size() > 0) {
            // Internal node: corresponds to the guard n > 0.
            state = M.nextState(childStates, (RankedSymbol) n.symbol());
        } else {
            // Leaf: corresponds to the guard n = 0; childStates is empty.
            state = M.nextState(childStates, (RankedSymbol) n.symbol());
        }
        return state;
    }

Algorithm Performance

[Plot: x-axis "Shortest keyword length" (0 to 14); curves CW-WBM, CW-NORM, AC-OPT, AC-FAIL.]

Figure 13.8: Algorithm performance (in megabytes/second) versus the length of the shortest keyword in a given set. The performance of the CW-WBM and CW-NORM algorithms is almost coincidental (shown as the ascending solid line).

The performance of the CW algorithms, which declined with increasing keyword set size, was consistently better than the AC-OPT algorithm. In some cases, the CW-NORM algorithm displayed a five to ten-fold improvement over the AC-OPT algorithm.

13.4.2 Performance versus minimum keyword length

For each algorithm, the average number of megabytes processed per second was graphed against the length of the shortest keyword in a set. For the multiple-keyword tests the graphs are superimposed in Figure 13.8.

Predictably, the AC-OPT algorithm has performance that is independent of the keyword set. The AC-FAIL algorithm has slightly lower performance, improving with longer minimum keywords. The average performance of the CW algorithms improves almost linearly with increasing minimum keyword lengths. The low performance of the CW algorithms for short minimum keyword lengths is explained by the fact that the CW-WBM and CW-NORM shift functions are bounded above by the length of the minimum keyword (see Chapter 4). For sets with minimum keywords no less than four characters, the CW algorithms outperform the AC algorithms.

As predicted, the CW-NORM algorithm outperforms the CW-WBM algorithm. The performance ratio of the CW-WBM algorithm to the CW-NORM algorithm is shown in Figure 13.9. The figure indicates that the performance gap is wide with small minimum keyword lengths, and diminishes with increasing minimum keyword lengths. (This effect

Ongoing and Future Work

• Existing taxonomies and toolkits developed over 20 years
  – update and integrate
    • e.g. >50 new keyword pattern matching algorithms in 2001-2010
  – need to be selective...
  – multiple DSLs and GUIs on top
    • bioinformatics, computational linguistics, network intrusion detection
    • student view?

• Application to other algorithmic or data structure fields

Concluding Remarks

Overview of taxonomy construction and TAxonomy-BAsed Software COnstruction
  – algorithmic domains
  – bringing order and improving understanding
    • bonus: exploratory algorithmics
  – also aimed at development of large-scale toolkit
    • allows comparison in practice
      – benchmarking data
      – algorithm selection
    • with DSLs/GUIs to simplify usage
  – TABASCO is the only such method that takes correctness-by-construction into account

References

• L. Cleophas, B.W. Watson, D.G. Kourie, A. Boake & S. Obiedkov, TABASCO: Using Concept-Based Taxonomies in Domain Engineering. SACJ, 37:30–40, December 2006.

• L. Cleophas & B.W. Watson, Taxonomy-based software construction of SPARE Time: a case study. In IEE Proceedings – Software, 152(1), February 2005.

• L. Cleophas & B.W. Watson, Applying and spicing up TABASCO: taxonomy-based software and how to increase its usability. In Formal Aspects of Computing: Essays dedicated to Derrick Kourie on the occasion of his 65th Birthday, 173–183, Shaker Verlag, 2013.

• D.G. Kourie & B.W. Watson, The Correctness-by-Construction Approach to Programming, Springer, 2012.

• B.W. Watson, D.G. Kourie & L. Cleophas, Experience with Correctness-by-Construction. To appear in Science of Computer Programming, special issue on New Ideas and Emerging Results in Understanding Software, 2013.
