
Architecture optimisation

Doctor of Philosophy
School of Computer Science and Engineering

The University of New South Wales

Nicholas FitzRoy-Dale

March 2010


Abstract

This dissertation describes architecture optimisation, a novel optimisation technique. Architecture optimisation improves the performance of software components or applications by modifying the way in which they communicate with other components, or with the operating system. This is a significantly different focus to traditional compiler optimisations, which typically operate on a single application and do not attempt to change the way it interacts with the rest of the system. To perform an architecture optimisation, the author of a programming interface writes a small, domain-specific optimisation specification which describes both the conditions necessary for the architecture optimisation to be valid, and the way in which such an optimisation should be performed. This specification is then used as input to an architecture optimiser, which applies the optimisation to a particular application. Architecture optimisation does not require application source code, effectively decoupling optimisation from compilation.

To demonstrate its usefulness, an implementation of architecture optimisation, named Currawong, is described. Currawong is a complete architecture optimiser, supporting two languages (Java and C) and two completely different software platforms (the Android smartphone operating system, and CAmkES, a research-focused component-based system). Currawong is applied to several optimisable applications on both platforms, and achieves significant performance improvements.

The two major contributions of the work are a concise specification language for architecture optimisations, and early proof that the technique is useful for real-world applications, in the form of benchmark results demonstrating significant (up to 2x) performance improvements.


ORIGINALITY STATEMENT

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Nicholas FitzRoy-Dale
March 31, 2010


Acknowledgements

Thank you to my partner, Catie Flick, for her love, insight, and surprising willingness to learn about component system arcana. Thank you to my parents, Robyn FitzRoy and Chris Dale, for their unconditional love and support.

Thank you to my supervisor, Gernot Heiser, for his encouragement, and for his enviable ability to consistently provide on-target, insightful comments. Thank you to my co-supervisor, Ihor Kuz, for his expertise, attention to detail, and unflagging dedication to the cause. Thank you also to Charles Gray and Ben Leslie, whose ideas and feedback were vital to the genesis of this work.

Thank you to the members of the ERTOS group, both past and present, for your friendship and support over the years. It’s been a privilege to work with such cheerful, resourceful, and intelligent people.

Finally, thank you to the anonymous reviewers of this dissertation. Your comments improved it significantly.


Contents

Abstract

Acknowledgements

1. Introduction
   1.1. Why optimise?
   1.2. Currawong: an architecture optimiser
   1.3. Related approaches
   1.4. Scope
   1.5. Contributions
   1.6. Overview

2. Related work
   2.1. Traditional compiler optimisation
   2.2. Dynamic optimisation
   2.3. Source-based rewriting
        2.3.1. Refactoring
        2.3.2. Beyond structural modification
   2.4. Pre-execution binary rewriting
   2.5. Optimisation in component-based systems
   2.6. Dynamic upgrade
   2.7. Metalanguages
   2.8. Conclusion

3. Background
   3.1. The CAmkES component system
        3.1.1. Binary CAmkES
   3.2. Android
        3.2.1. Android is written in multiple languages
   3.3. Running examples
        3.3.1. The componentised video player
        3.3.2. Examples in Android

4. Performance anti-patterns
   4.1. Performance improvement examples
        4.1.1. sendfile()
        4.1.2. API call specialisation
        4.1.3. Integrated Layer Processing
        4.1.4. Protocol header optimisation
        4.1.5. FBufs: high-bandwidth transfer
   4.2. Summary of performance anti-patterns
        4.2.1. Context switching
        4.2.2. Copying
        4.2.3. Overly-generic or inflexible APIs
        4.2.4. Unsuitable data structures
        4.2.5. Reprocessing data
   4.3. Remedies
        4.3.1. Remedy 1: Combining protection domains
        4.3.2. Remedy 2: Replacing components or libraries
        4.3.3. Remedy 3: Component interposition
        4.3.4. Remedy 4: Modifying component-to-component APIs
   4.4. Conclusions

5. Design
   5.1. Currawong overview
        5.1.1. Use of source code
        5.1.2. Multiple-API limitations
        5.1.3. A domain-specific programming language
   5.2. Motivating examples
        5.2.1. Example 1: CAmkES same-domain decoder
        5.2.2. Example 2: CAmkES RGB conversion
        5.2.3. Example 3: CAmkES protocol translation
        5.2.4. Example 4: Android touch events
        5.2.5. Example 5: Android redraw
        5.2.6. Summary of example requirements
   5.3. Providing an application representation
   5.4. Matching and checking
        5.4.1. The importance of specification
        5.4.2. Recognising anti-patterns with Currawong
   5.5. Transformation
        5.5.1. Application architecture transformation
        5.5.2. Application code transformation
   5.6. Specification language requirements
   5.7. Currawong Specification Language
        5.7.1. Specification structure
        5.7.2. Syntax
   5.8. The Currawong API
        5.8.1. Structure search
        5.8.2. Control-flow search
        5.8.3. Data-flow search
        5.8.4. Transformation
        5.8.5. Summary
   5.9. Transformation and looping
   5.10. Output

6. Implementation
   6.1. Overview of Currawong
   6.2. Implementing Currawong specification language
        6.2.1. Parser
        6.2.2. Interpreter
   6.3. Matching
        6.3.1. Structure search
        6.3.2. Control-flow search
        6.3.3. Data-flow search
   6.4. Supporting code transformation
   6.5. Android-specific portions
        6.5.1. Unpacking, disassembling, and reassembling
        6.5.2. Application representation
        6.5.3. Matching
        6.5.4. Transformation and output
   6.6. CAmkES-specific portions
        6.6.1. Unpacking and application representation
        6.6.2. Matching
        6.6.3. Transformation and output
   6.7. Example Java optimisation
        6.7.1. Application representation
        6.7.2. Match object generation
        6.7.3. Method renaming
        6.7.4. Application rewriting
   6.8. Discussion

7. Evaluation
   7.1. Introduction
        7.1.1. Test hardware
        7.1.2. Methodology
   7.2. CAmkES
        7.2.1. Same-domain decoder
        7.2.2. Eliminate RGB conversion
        7.2.3. Protocol translation
   7.3. Android
        7.3.1. Touch events
        7.3.2. Redraw
   7.4. Costs of running Currawong
        7.4.1. Application run-time cost
        7.4.2. Currawong execution cost
   7.5. Discussion

8. Conclusion
   8.1. Summary
        8.1.1. Designing to be optimised
   8.2. Achievements
   8.3. Future work

A. Glossary

B. Summary: Currawong API
   B.1. Application object
        B.1.1. match(+Name)
        B.1.2. rename call(+Scope, +Old, +New)
   B.2. CAmkES-specific Application object rules
        B.2.1. access(+Scope, +Object, =Funcs)
        B.2.2. AddToPD(+Satellite, +Planet)
        B.2.3. DisjointComponentPDs(+PD1, +PD2)
        B.2.4. replace component(+Old, +New)
        B.2.5. replace connector(+Old, +New)
        B.2.6. interpose(+Connector, +Component)
   B.3. Java-specific Application object rules
        B.3.1. add module(+Name)
        B.3.2. merge(+Match, +Merge)
        B.3.3. merge all(+Match, +Merge)
   B.4. Match object
        B.4.1. feature(+Path)


List of Figures

1.1. System software architecture optimisation process

2.1. A Coccinelle semantic patch, replacing local generation of “y” with a parameter
2.2. Renaming a property in Eclipse (example from Eclipse documentation)
2.3. A TXL replacement rule
2.4. A binary delta to add a method to all classes implementing an interface

3.1. System design with CAmkES
3.2. Android architecture
3.3. Interaction with the System Server in Android
3.4. Componentised video player

4.1. File transmission using read() and write()
4.2. File transmission using sendfile()
4.3. File access using the POSIX API
4.4. File access using Synthesis
4.5. Reading from a file using Pebble portals
4.6. The componentised video player with a chain of image filters between client and display
4.7. The componentised video player with a chain of image filters with data access mediated by a controller
4.8. A network packet with multiple headers
4.9. Componentised video player showing protection domains
4.10. Video player: client and decoder occupy the same protection domain
4.11. Component interposition
4.12. API modification via interposition

5.1. Currawong overview
5.2. A small component (bold portions represent information preserved by the compiler)
5.3. Protection domain merging in the componentised video player
5.4. Component replacement in the componentised video player
5.5. Component interposition in the componentised video player
5.6. File system memory-sharing optimisation
5.7. Touch events optimisation before (A) and after (B)
5.8. Redraw pathway (A) before optimisation and (B) after
5.9. An example template
5.10. Unnecessary data manipulation in a video player (API calls have a double border)
5.11. Verification as search in Broadway [Guyer and Lin 2005]
5.12. Application architecture transformation
5.13. An optimisation specification written in CSL
5.14. Finding functions in a class through pattern-matching (A), unification (B), and iteration (C)
5.15. Implicit return values in CSL rules
5.16. CSL templating example
5.17. Matching in Currawong
5.18. Control-flow matching API
5.19. Data-flow matching example
5.20. Adding code with Currawong

6.1. Currawong workflow (optimisation specification perspective)
6.2. Currawong implementation (system agnostic version)
6.3. Control flow between CSL parsers
6.4. Using the ”.feature” rule to access a Java method
6.5. Data-dependent control flow modification
6.6. Code transformation process
6.7. Currawong implementation (Android extensions)
6.8. Baksmali’s disassembly of an automatically-generated constructor
6.9. In-memory class hierarchy for a portion of the Android Lunar Lander game
6.10. Currawong implementation (CAmkES extensions)
6.11. The CAmkES assembly process (Currawong is inside the dotted portion)
6.12. A Binary CAmkES component (ELF file)
6.13. Example application using the MouseEventHandler API
6.14. Optimisation specification for the MouseEventHandler optimisation
6.15. The MouseEventHandler example application, internal representation
6.16. Disassembled code for the MouseEventHandler example application
6.17. The MouseEventHandler match object
6.18. The method rename description

7.1. Componentised video player
7.2. Protection domain merging in the componentised video player
7.3. The “Merge protection domains” optimisation specification
7.4. Component replacement in the componentised video player
7.5. The “Eliminate RGB conversion” optimisation specification
7.6. Component interposition in the componentised video player
7.7. The “Protocol translation” optimisation specification
7.8. Touch events optimisation before (A) and after (B)
7.9. The “Touch events” optimisation specification
7.10. Redraw optimisation before (A) and after (B)
7.11. The Android redraw optimisation


List of Tables

1.1. Three design challenges for smartphone system software stacks

6.1. Example type mappings, CSL to Python

7.1. Summary of CAmkES-specific examples
7.2. The “Merge protection domains” optimisation, results
7.3. The “Eliminate RGB conversion” optimisation, results
7.4. The “Protocol translation” optimisation, results
7.5. The “touch events” optimisation, results
7.6. The “Redraw” optimisation, results


1. Introduction

By what course of calculation can these results be arrived at by the machine in the shortest time?

– Charles Babbage [Babbage 1864]

This dissertation describes system software architecture optimisation, a technique to improve the performance of applications by modifying the way in which they communicate with the rest of the system.

1.1. Why optimise?

It is difficult to design efficient system software. This is particularly true for system software that forms the operating system on a device which is tightly resource-constrained, such as a mobile Internet device. Three factors contribute to the problem. The first factor is support for third-party applications: the system must support applications written by programmers other than the system designer, the exact requirements of which are unknown at design time. This means that the system designer cannot take performance shortcuts based on total knowledge of the system. The second factor is a large code base: modern operating systems must support a wide variety of functionality, with the result that they tend to be rather large. This means that keeping track of “corner cases”, or undesirable interactions between portions of the software under certain conditions, is harder than it used to be when the average system was smaller. The third factor is unknown or moving hardware targets: architectures which were designed around the requirements and capabilities of one hardware platform are frequently ported to other platforms, often taking a lowest-common-denominator approach to features in the process.

Even though these three design issues are quite distinct, they all nonetheless tend to result in the same kind of application performance problem. Many applications for resource-constrained devices are IO-bound (the case for most applications), or they do a lot of IO but also perform a lot of computation (the case for games). As a result, the efficiency of data transfer between application and system stack (constituting all non-application parts of the system, such as system libraries) has a significant impact on the overall performance of the application. This efficiency is directly impacted by the three design issues above.

We can, therefore, generalise a little about these three design issues: they can all affect performance when one portion of the system attempts to communicate with another portion of the system across an interface. In the first case (support for unknown applications), the system designer must anticipate the kinds of things application programmers may want to do, so that interfaces between applications, system libraries, and the operating system are efficient. The second case (the overall size and complexity of the stack) is similar to the first: designers must anticipate ways in which different portions of the system will be used together by an application. The third case (changing hardware) requires the designer to anticipate the nature of future hardware, so that applications using interfaces designed for the current hardware generation are also efficient on the next generation. These issues are summarised in Table 1.1.

Design issue and resultant performance issue:

• Support for third-party applications: performance shortcuts must be avoided, because they could compromise system security.

• Large code base: applications may use the system in unexpected ways, highlighting inefficient corner cases.

• Unknown hardware: the system may be unintentionally optimised for a particular device.

Table 1.1. Three design challenges for smartphone system software stacks

Although the performance issues described above apply to a wide variety of resource-constrained systems, this dissertation focuses on smartphones. Smartphones represent a new but rapidly-growing portion of the computing market. Compared with laptops and netbooks, these devices have lower-powered processors, less capable graphical hardware, and significantly smaller batteries. Nonetheless, their users expect them to perform the same demanding and varied tasks that they would expect of more expensive systems. The typical modern smartphone can render Web pages, run graphically-intensive games, play video, and work with complex documents. Unlike older “dumb phones”, where all the software functionality was determined by the manufacturer, smartphones can run third-party applications. They behave, essentially, like miniature general-purpose computers. The high performance requirements placed on these devices, combined with their modest hardware resources, present a unique challenge for the design of their system software.

The easiest way to improve software performance is to apply standard compiler-style optimisations, such as strength reduction and loop unrolling. These certainly help: with these kinds of optimisations enabled at compilation, code runs approximately twice as fast as equivalent unoptimised code, without any special effort required on the part of the programmer. Standard compiler-style optimisation is, however, usually limited to a single module—typically represented by a single text file in popular languages such as C, C++, and Java. This limitation means that standard compiler-style optimisations do not address the particular problems of smartphone system software stack design.

Another simple response to the problems posed is that if the system is designed correctly, the performance problems are avoided. This is certainly true to an extent: system designers have a good idea of the ways in which a system will probably be used, based on knowledge of previous systems, knowledge of the hardware that will be in the device, and so on. However, intuitively, it seems unlikely that such an optimal system can be designed to cater to all unknown applications and future hardware. For example, the API for accessing camera image data in Android phones uses the YUV colour space by default. Future phones may use the RGB colour space instead, requiring applications either to support both formats or suffer a performance hit.
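
To make the colour-space example concrete, the sketch below (written for this discussion, not taken from the thesis) shows the shape of an Android camera preview handler that receives YUV frames and converts them to RGB itself; the conversion helper and drawing routine are hypothetical. If a future device delivered RGB frames directly, this per-frame conversion would become pure overhead of exactly the kind architecture optimisation targets.

    import android.hardware.Camera;

    // Hand-written illustration only: the helper methods are hypothetical.
    class PreviewHandler implements Camera.PreviewCallback {
        @Override
        public void onPreviewFrame(byte[] yuvFrame, Camera camera) {
            // The default preview format is YUV, so an RGB-based application
            // pays a conversion cost on every frame it receives.
            int[] rgb = convertYuvToRgb(yuvFrame);
            draw(rgb);
        }

        private int[] convertYuvToRgb(byte[] yuv) {
            // Conversion body omitted; this is the format-shuffling work that
            // an architecture optimisation could remove or relocate.
            return new int[yuv.length];
        }

        private void draw(int[] rgbPixels) { /* application-specific */ }
    }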

The three system design challenges described above all arise when control or data flows across an API boundary between an application and system software (or between two applications, via the system). This dissertation introduces an optimisation technique, system software architecture optimisation, that specifically (and uniquely) addresses performance issues which arise when applications interact with their environment. The name “system software architecture optimisation” is very specific but rather long, so the technique is usually referred to simply as “architecture optimisation” in the rest of this dissertation.

Architecture optimisation is a form of context-sensitive optimisation: performing an architecture optimisation requires some semantic knowledge of the code. In this sense it is a specific example of high-level optimisation, defined by Veldhuizen and Gannon as “optimizations which require some understanding of the operation being performed” [Veldhuizen and Gannon 1998].


Figure 1.1. System software architecture optimisation process

Architecture optimisation involves the following roles:

The system user: This is the end user of the optimised system.

The optimiser user: This is the person or process using the optimisation tool to produce an optimised application. This may be the system user.

The optimisation writer: This may be the system designer, writing an optimisation for a system sub-system; or it may be a hardware manufacturer, writing an optimisation for a particular piece of hardware.

In general, an architecture optimisation involves the following operations:

1. The optimisation writer thinks of a way in which cross-interface interaction could be improved, or identifies a particular type of cross-interface interaction which causes problems. The problem might exist only for a very specific type of application, or it may only arise on a particular combination of hardware and software.


2. The optimisation writer formalises the optimisation. This formalisation includes a description of the exact conditions under which the optimisation is valid, plus the optimisation itself. Both the conditions for optimisation and the optimisation itself are written in a domain-specific language, and refer to specific features of the system. For example, the optimisation criteria may state that the optimisation is only valid when a specific API call is performed by the application, and the optimisation itself may specify a different API call for the application to make.

3. The optimiser user selects an unoptimised application.

4. The optimiser user runs the architecture optimiser, or an automatic process causes the architecture optimiser to be run at an appropriate time. This optimiser takes as input an optimisation description, or set of optimisation descriptions, and an unoptimised application. The optimiser determines whether the optimisation is valid, and, if it is, modifies the application according to the optimisation description.

5. The output of the optimiser is a new application, which is installed on the system, ready for use by the system user.

In step 4, the optimiser verifies the optimisation. To facilitate this process, the optimisation begins with zero or more statements of fact about the application. The optimiser verifies that each statement of fact is correct about the particular application being optimised. The statements of fact fall into two categories:

• Statements about the path-insensitive data flow of an application. These are statements regarding structural features of the application code. For example, “The application makes a call to library function x” is a statement of this form.

• Statements about the path-sensitive data flow of an application. These are statements which require the optimiser to make statements about the values of data in the application relevant to the optimisation. The statement “When the application calls x(a), a is always equal to 1” is a statement of this type.

Once the optimisation has been verified, the application is modified by applying one or more remedies. A remedy is a modification to the application (for example, by changing the class that the application makes use of). These are covered in more detail in Chapters 4 and 5.
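
As an illustration (a hand-written sketch, not code or an optimisation specification from the thesis; all names are hypothetical), the fragment below shows an application for which both kinds of statement of fact hold, and the kind of rewrite a remedy might then perform.

    // Both statements of fact hold for this fragment:
    //   path-insensitive: the application calls Decoder.decodeFrames()
    //   path-sensitive:   the second argument is always equal to 1
    interface Decoder {
        void decodeFrames(byte[] stream, int frameCount);
    }

    class Player {
        void play(Decoder decoder, byte[] stream) {
            decoder.decodeFrames(stream, 1);   // always decodes a single frame
        }
    }

    // A remedy such as "replace the class the application makes use of" could
    // then rewrite the call to target a specialised single-frame decoder, e.g.:
    //     fastDecoder.decodeSingleFrame(stream);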

The language in which the optimisation is written is customisable and extensible, both to add support for new types of verification, and to add support for modifying applications in a particular system. A graphical representation of the whole optimisation process is given in Figure 1.1.

The above sequence identifies the optimisation writer, optimiser user, and system user roles, but does not specify whether these roles should be filled by the same person, or by separate people. The system is designed so that the optimiser could run at application installation stage, possibly without knowledge of either the application writer or the end user. Architecture optimisation is completely independent of other forms of optimisation, and can thus, for example, be used in conjunction with compiler-style optimisation to synergistic effect.

1.2. Currawong: an architecture optimiser

The design presented in this dissertation is implemented in Currawong, an architecture optimiser named after the distinctive Australasian bird. Currawong is an architecture optimisation testbed, capable of performing a variety of architectural optimisations. Currawong rigidly enforces separation of the roles illustrated in Figure 1.1: the optimisation writer, optimiser user, and system user can be completely separate entities. Currawong is used by the Optimisation Writer and Optimiser User roles: by the former in order to develop an appropriate optimisation, and by the latter to apply the optimisation to a particular application. Importantly, the final role, System User, does not involve Currawong at all: because Currawong optimisation is a pre-runtime process, optimised applications do not rely on Currawong at run-time.

A key component of Currawong is a method by which optimisations are specified. Currawong focuses on keeping specifications concise and high-level, so it uses a declarative and extensible specification language, but hides the details of the process from the optimisation writer. For example, an optimisation writer can specify a particular class or function call to optimise, but need not specify exactly how to find that piece of code, nor the details of making changes to the system.

Currawong supports multiple methods for verifying that an optimisation is correct, some of which are built in to the syntax of the specification language. It is capable of performing architecture optimisation on two systems: the Android smartphone platform, as well as a microkernel-based research component system.

An important difference between Currawong and many other optimisation tools is that this technique does not require application source code. This is a highly practical choice, because it means that application developers do not need to be involved in the process of optimising their own applications. Obviously, it is preferable for developers to support their own applications, but there are many reasons why a particular developer may not apply a particular architecture optimisation:

• The optimisation may be specific to hardware or software that the developer does not have. For example, an architecture optimisation may only work with the latest version of the phone’s operating system, or may only support a particular manufacturer’s touchscreen.

• The developer may have discontinued support for the application.

• The developer may simply not realise that their application can be optimised in this way.

End users, rather than application developers, can thus apply their own optimisations. Alternatively, suitable system-specific architecture optimisations could be applied to applications without any user intervention required.


Non-reliance on application source code is particularly useful for API evolution, which is the practice of updating the API of a component or library to assist with maintenance of that library, to add support for new features, or to improve the library’s performance. Because API modifications are a core motivation for architecture optimisation, and architecture optimisation techniques encompass API evolution techniques, API evolution may be considered to be a subset of architecture optimisation.

Currawong’s effectiveness is demonstrated by a variety of optimisations for several applications on two different systems.

1.3. Related approaches

Architecture optimisation builds upon two large areas of related work: library optimisation, and component-based software engineering. Work on domain-specific and library-level optimisation is motivated by some of the same goals that motivated this work, and implements similar solutions. The general idea is that a library author supplies a number of domain-specific optimisations along with the library. The application programmer then uses a custom-written optimising compiler to apply the optimisations to application code. As with architecture optimisation, the advantage to the application programmer is that he or she need not know the best way to make use of the library, because that information is effectively supplied with the library in the form of annotations. This approach provides some, but not all, of the advantages of architecture optimisation. Because library-level optimisations involve compiler modifications, they are restricted to a single source language. Additionally, library-level optimisations require application source code, making them unacceptable for API evolution purposes. Because of these limitations, the optimisation specification language is irrevocably linked to the target language, making the specifications difficult to generalise.

Optimisation of communication across an interface boundary is a core area of research in the general area of component-based systems, and several approaches attack the same general problem by providing a flexible component system: either one that offers applications a choice in the way they communicate with other components (or the operating system), or one that decides, through simulation or other heuristic, which communications method would be best. Architecture optimisation builds on that approach. These techniques are typically only applicable at system creation time, when the source code is available. Additionally, this work supports performance analysis of a system for which only binary code is available, and thus delivers more flexibility with respect to when the optimisation can be performed.

1.4. Scope

The variety of use cases presented above shows that architecture optimisation is a rather general concept. Architecture optimisation could apply to application-to-application, rather than application-to-operating-system, optimisation; it may also apply to systems other than smartphones, such as laptops and netbooks; and the same techniques may be usefully applied in distributed systems. This is too broad a scope on which to build an experimental framework, so some degree of scope narrowing is required.

As discussed above, smartphones are an attractive optimisation platform. To recapitulate: smartphones have significantly slower processors and slower graphics accelerators than desktop devices, but are expected to provide desktop-like performance in areas such as gaming and Web browsing. Smartphones provide rich APIs, but, because the genre is relatively new, it is not necessarily clear to developers which APIs are most performant for a given application. Compounding this confusion is the heterogeneity of hardware available, resulting in varying performance capabilities between phones. Finally, smartphone operating systems are evolving rapidly, resulting in a heterogeneity of installed operating system versions. Smartphones also have the practical advantage of large and easily-searchable ecosystems of third-party applications, in the form of “application stores” for the various platforms. This dissertation presents performance benchmarks from real applications taken from such a store.

I selected applications which moved large amounts of data between application and operating system as domain applications to optimise. I chose these sorts of applications specifically because they rely on efficient application-to-system data transfer in order to perform acceptably, and thus stood to benefit from architectural optimisation.

Architecture optimisation is aimed at the middle ground between individual program modules and the entire system. It is not concerned with techniques that can be performed just as effectively by existing optimising compilers. It is also not concerned with exhaustive whole-system analysis techniques (such as model checking). The state of the art in the area of techniques such as model checking is not at the point where such computationally-intensive techniques can be aimed at the very large amount of code that makes up a smartphone system software stack plus its applications.

For similar performance-related reasons, the examples presented in this dissertation involve only modest amounts of static analysis. There are two reasons for this. A simple and pleasant reason is that for the class of optimisation for which Currawong is intended, a large amount of static analysis turned out not to be necessary. A second reason is that because architecture optimisation involves whole-program analysis, potentially with many hundreds of optimisation descriptions, each resulting in one or more program modifications, it is important to keep processing time to a minimum, to minimise the state explosion problem.

The case studies presented here are limited to two systems: the Android commercial smartphone operating system, and a simple demonstration operating system based on the OKL4 microkernel. These choices are largely arbitrary, but they are both useful for demonstration purposes as both are particularly open systems.

Some discussion should be given to the matter of energy conservation. This is often supplied as a justification for reducing the CPU workload on mobile platforms. However, power management is outside the scope of this dissertation, and energy consumption is mentioned only indirectly—I made no attempt to quantitate energy saving as a result of optimisation, for example. Efficient energy usage is simply another benefit of reducing overall CPU and peripheral power consumption through architecture optimisation techniques.


1.5. Contributions

The major contributions of this work are:

• the concept and description of binary-only, interface-level optimisation, and case studies demonstrating, via implementation of the technique, that it is viable and useful;

• a domain-specific and target-language-independent language for the architecture optimisation description;

• an implementation of the technique described, named Currawong; and

• a binary-only component specification for microkernel-based embedded systems.

1.6. Overview

The dissertation proceeds with a study of related work in Chapter 2. Chapter 3 (Background) provides a brief introduction to the software stacks referenced in the dissertation—the OKL4 microkernel and the Android software stack in particular. Chapter 4 (Performance anti-patterns) gives a detailed motivation for the work and introduces a set of running examples. The motivation is used as a basis for design of the system, which is given in Chapter 5. An implementation of an architecture optimiser, named Currawong, is described in Chapter 6 (Implementation). An evaluation of Currawong is then presented in Chapter 7 (Evaluation). Finally, a summary and overall evaluation of the work is given in Chapter 8 (Conclusions).


2. Related work

Machines take me by surprise with great frequency.

– Alan Turing [Turing 1956]

Before the major optimisation methods are discussed, a short introduction to traditional compiler optimisation is given, to serve as a point of reference for other optimisation techniques. I then identify three major optimisation methods, and discuss each one in sequence.

The first method, dynamic optimisation, concerns rewriting that occurs in real time (that is, at the same time as the execution of the code). The second method, source-based rewriting, works with source code. This type of rewriting either involves a separate optimiser, or makes use of a special type of compiler capable of performing both high-level optimisation and compilation. The final method, pre-execution binary rewriting, is a compromise between dynamic and source-based approaches.

Currawong also builds on work in the fields of component-system optimisation, dynamic upgrade, and program transformation metalanguages. Each of these areas is discussed separately, after the three major optimisation methods.

2.1. Traditional compiler optimisation

Traditional compiler optimisation refers to the set of techniques employed by modern optimising compilers to improve the compiler’s output in some way, most commonly to make the compiled code faster. Architecture optimisation does not rely on optimising compiler techniques. They are, however, often used as a point of reference, so it is worth discussing them here.

Optimising compilers have a long history. The second compiler in the world, released in 1957 for the FORTRAN language, was capable of simple optimisations. For example, this compiler performed common subexpression elimination, in which sub-expressions of an arithmetic expression were only evaluated once, even if they appeared multiple times [IBM 1956].
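
A minimal sketch of common subexpression elimination (written for this discussion, not taken from the thesis or from the FORTRAN compiler): the shared subexpression a + b is computed once and reused.

    class CseExample {
        // Before: (a + b) is evaluated twice.
        static int before(int a, int b, int c) {
            return (a + b) * c + (a + b);
        }

        // After: the compiler evaluates the common subexpression once.
        static int after(int a, int b, int c) {
            int sum = a + b;
            return sum * c + sum;
        }
    }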

Modern optimising compilers perform a wide variety of optimisations, which may be classified in various ways. For example, intra-procedural optimisations work only with a single function, whereas inter-procedural optimisations make use of multiple functions. In the former category are optimisations that operate on loops, such as loop unrolling, in which the body of a loop is copied one or more times (with a corresponding reduction in the number of loop iterations) to reduce loop overhead and increase instruction parallelism. In the latter category are optimisations such as function inlining, in which the body of a called function is copied verbatim into the body of the calling function, in order to eliminate function-call overhead [Bacon et al. 1994].
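
The following hand-written sketch (not from the thesis) illustrates loop unrolling by a factor of four; for simplicity it assumes the array length is a multiple of four.

    class UnrollExample {
        // Original loop: one addition and one loop test per element.
        static int sum(int[] xs) {
            int total = 0;
            for (int i = 0; i < xs.length; i++) {
                total += xs[i];
            }
            return total;
        }

        // Unrolled by four: four additions per iteration, a quarter of the
        // loop overhead, and more opportunity for instruction parallelism.
        static int sumUnrolled(int[] xs) {
            int total = 0;
            for (int i = 0; i < xs.length; i += 4) {
                total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3];
            }
            return total;
        }
    }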

Some optimisations require information about the run-time behaviour of the application. For example, if the compiler knows that a particular branch instruction is very likely to be taken, it can reorganise the code so that the cost (in terms of cache misses) of taking the branch is low. Historically, compilers have not had access to run-time performance profiles of the application, so information on the probability of various branches had to be provided by programmers, in the form of compiler-specific instructions (annotations) embedded in the source code. However, some modern optimising compilers can compile code in a special mode to enable trace-based profiling. This causes the application to generate a log file containing run-time performance information, such as the number of times various branches were taken. After the application has been run (perhaps multiple times), the compiler re-compiles the application, taking advantage of the supplied performance information [Gatlin 2010].

2.2. Dynamic optimisation

Dynamic optimisation is the technique of optimising code as it runs, rather than before it is run. Dynamic optimisation can apply to both native code and bytecode. Dynamic optimisation is most well-known as one of a number of techniques used by just-in-time compilers, or JITs, to produce efficient native code from bytecode. JITing is, itself, most well-known for optimisation of Java bytecode. In this setting, a lot of useful information is known to the JIT at run time, so there is potential for significant performance improvement. For example, the entire application and all dependent libraries are available to the optimiser, making multi-function optimisation feasible; all data types are known (if it is a Java virtual machine), allowing for domain-specific optimisation; and commonly-taken code paths (“hot paths”) are known, acting as a guide to direct the optimiser. JIT is not at all a new idea: despite the recent popularity, JIT systems have been available for various languages since the 1960s [Aycock 2003].

The challenge with JIT compilers has always been in balancing advanced optimisation against increased application latency: when a portion of the application is to be optimised, the compiler must run before the application can continue. Perversely, the better the compiler is at identifying (and, therefore, being required to just-in-time compile) critical portions of code which must run quickly, the more acute this problem becomes. This issue is so important that recent versions of Sun’s Java HotSpot virtual machine actually contain two JIT implementations: a server compiler, which sacrifices some latency to achieve better optimisation; and a client compiler, which makes the sacrifice in the opposite direction, discarding optimisations that would significantly increase perceived latency [Kotzmann et al. 2008].

HotSpot can perform traditional multifunction optimisations, such as function inlining; and can co-locate frequently-referenced objects to improve cache performance (i.e. using temporal locality to optimise spatial locality). This latter example, which requires run-time locality information, demonstrates a benefit of JIT compilers in general over traditional before-execution compilers (“pre-compilers”). This advantage is, however, being slowly eroded by modern profile-guided optimisation [Gatlin 2010].

Given the inherent performance challenge described above, dynamic optimisers are at their worst when optimising native code. In this situation, the dynamic optimiser starts at a disadvantage: running any dynamic optimiser code means not running application code, so the act of optimising itself reduces the performance of the application. This is not so noticeable with JIT compilers for virtual machines, because the JIT activity can take place concurrently with other activities necessary to execute the non-native code (such as creation of data structures, or even during interpretation).

Mojo is a dynamic optimising compiler for native code on the Windows platform [Chen et al. 2000]. The Mojo compiler provides a good reference point for this sort of dynamic optimiser. Mojo follows other dynamic optimisation systems, such as Dynamo [Bala et al. 2000], by attempting to find “hot traces” in executed code, where a hot trace is a commonly-executed sequence of instructions. Mojo then attempts to rewrite the code implementing the hot trace to eliminate jump instructions on the critical path. Off-path jumps execute as normal. In their paper, Mojo’s authors describe their experience applying Mojo to a number of large commercial applications for the Microsoft Windows operating system—a tough challenge. Unfortunately, Mojo’s performance was lacklustre: application execution times generally increased significantly under Mojo. The authors attribute this to the overhead imposed by the optimisation engine itself.

One commonality among dynamic optimisation systems is their relative immaturity compared with source-code-based optimisation techniques, discussed in more detail below. The systems are generally not extensible, for example, and do not implement domain-specific optimisations. Further, it seems that implementing good domain-specific optimisations while not slowing the system down is a significant, and perhaps impossible, technical challenge. This is particularly relevant on embedded systems, where processor and memory resources are tight.

2.3. Source-based rewriting

Source-based rewrite engines represent the largest and most diverse class of all optimisation types examined. Unlike dynamic rewriting, which is commercially successful, most source-based rewrite engines remain highly specialised research systems. The two major sub-categories discussed below are those that only modify the structure of the code (refactoring engines) and those that also modify the code’s behaviour.

2.3.1. Refactoring

Refactoring represents the simplest type of source-based transformation. It is a program transformation that changes the structure of a program but does not modify its behaviour [Fowler et al. 1999]. A simple example of refactoring involves renaming a class. Once the class itself has been renamed, all references to it are updated to use the new name. Although refactoring is not directly used to perform optimisation, refactoring can frequently assist with API evolution, by helping programmers modify their source code in response to API changes. One study of several large Java programs which support plug-ins showed that 80% of the API changes to the plug-in interfaces could be modelled as refactorings, and the authors conclude that better support for automated refactoring tools would help programmers deal with changing APIs [Dig and Johnson 2005].

API evolution is an important and difficult problem, and is of direct interest to this thesis. A recent study showed that API developers deal with API evolution in a variety of ad hoc ways. One de facto standard is to use deprecation. A class or method that has been deprecated is still fully functional, and its behaviour does not change. It is, however, marked as "deprecated" in API documentation. Depending on the programming language, the compiler may also display a warning if a programmer uses deprecated functionality. Deprecation places the onus on application programmers to update their applications, an expectation that sometimes backfires: deprecated classes in the Java API, for example, have sometimes been "un-deprecated" because of their continued use; or, if they are eventually removed, application developers may use older versions of the API, with the result that the library developer ends up supporting multiple versions of the library [Henkel and Diwan 2005].

Another ineffective solution is simply to expand the API, supporting both the legacy interface and the new one. This takes the onus of support off the application programmer—the correct thing to do—but results in ever-larger APIs, with corresponding support and comprehensibility problems, particularly for new programmers ("Which of these two similar solutions should I use?"). The authors of Catchup!, an experimental automated refactoring tool [Henkel and Diwan 2005], suggest that the difficulties associated with API evolution act to strongly discourage API developers from making incompatible changes, and speculate that reducing the cost to upgrade client code may ultimately result in better APIs. The Catchup! authors also suggest that support for more advanced static analysis would make their tool more useful. In particular, they identify support for simple temporal constraints, such as "calls method X, then calls method Y", as a useful addition.

Despite Catchup!'s age (it is 10 years old, at time of writing), there has been little work in the area since then. One more recent project, ReBA [Dig et al. 2008], presents an alternative approach: a client-specific "compatibility library" is generated. This stub library supports the old API interface, but transparently translates calls to use the new interface. This approach has limited optimisation potential—it was not designed for optimisation—and requires source code for the application.

2.3.2. Beyond structural modification

Aspect-oriented programming

Aspect-oriented programming (AOP) resides somewhere between refactoring, as discussed above, and active libraries, discussed below. It is a software engineering technique in which the code implementing an application is divided into several aspects—where an "aspect" represents a distinct functionality of the application. Aspects are separated out in this manner even if—in fact, especially if—the functionality they encompass cannot cleanly be encapsulated in a class or module using traditional modular design techniques. The goal of AOP is to enable better "separation of concerns", by representing the concerns as aspects, rather than as traditional modules, classes, and functions [Kiczales et al. 1996].

AOP attempts to address what has become known as the "library scaling problem", after Biggerstaff [Biggerstaff 1994]—the observation that as components become larger, they become more specialised: less generally useful, but more useful in their particular domain; and, conversely, as components become smaller, they become more generally useful, but less useful for any particular purpose. AOP attempts to deal with this quandary by providing another alternative to the traditional module-based system that defines libraries, addressing what Biggerstaff refers to as "limits of representation".

Canonical examples of aspects include code which ensures that locks are taken and released at appropriate times; calls to log the progress of the application for debugging purposes; and code to manage the memory allocation of the application. The theory is that once these aspects are removed from the core application, the remaining code, or base aspect, becomes much easier to understand. The application becomes less prone to bugs, because programmers maintaining one aspect need not worry about the other aspects: most significantly, programmers working on one aspect can focus entirely on it without being distracted by code implementing other aspects.

At build time, all the aspects are combined into a single application using a source-to-source compiler called an aspect weaver. Each aspect includes one or more point-cut specifications, which are essentially a declarative description of where code implementing the aspect should be included in the base aspect. The aspect weaver uses these point-cut descriptions to weave a complete application.

Notably, point-cuts can only insert code—they cannot modify what is already there. This follows from the core concept of AOP as a program composition methodology.

AOP is interesting from a program transformation perspective, nonetheless, because the point-cut specifications must somehow refer to specific positions in the code at which the new functionality (known as advice) must be added. In effect, each AOP implementation (i.e. each implementation of an aspect weaver) defines a matching scheme by which portions of code may be identified.

Point-cuts typically identify structural references, such as "class name starts with set" or "within the com.example namespace".

Active libraries

Veldhuizen and Gannon coined the term high-level optimization to refer to optimisations that require "some knowledge of the operation being performed" (as mentioned above) [Veldhuizen and Gannon 1998]. They distinguish this from low-level optimisation, which can be performed "without any knowledge of what the code is supposed to do". Traditional compilers perform a known set of low-level optimisations over which developers have very little control, but a relatively recent trend in research compilers is to support experimentation with (at least) low-level optimisations without requiring the experimenter to have expert knowledge of the compiler's code. These compilers, notably SUIF [Wilson et al. 1994] and LLVM [Lattner and Adve 2004], add support for user-specifiable transformations within a larger compiler infrastructure, and have optimisation research as a specific goal. These compilers provide structured access to their intermediate representation(s) (IR) so that optimiser writers can experiment with custom "passes".

This approach works well for research into low-level optimisations: LLVM in particular boasts an impressive set of well-modularised optimisations. Access to the intermediate representation would seem, however, to be the wrong level to implement high-level optimisation. Related work in this area has tended to focus on expressing properties about the code to be optimised either in a domain-specific language devoted to specifying code properties, or in the implementation language itself. Both options are preferable to the intermediate representation as they more closely resemble the original code.

The most coherent expression of this idea was popularised by Veldhuizen and Gannon as Active Libraries [Veldhuizen and Gannon 1998]. The concept as originally described is quite general: any library that attempts to guide its compiler to produce domain-specific optimisations counts as "active". This definition includes certain types of C preprocessor usage, or C++ templating features, but also covers techniques requiring special compilers or preprocessors—in particular, partial evaluation (the technique of determining the values of one or more parameters to a function at compile-time, inlining the function, and then compiling as if the identified parameter were a constant).
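As a simplified illustration of partial evaluation (this example is illustrative only, and is not drawn from Veldhuizen and Gannon), consider a generic integer power function and the residual function a partial evaluator could produce when the exponent is known at compile time:

/* A generic integer power function: neither parameter is known until
 * run time, so the loop must execute n times. */
int ipow(int x, unsigned n)
{
    int result = 1;
    while (n-- > 0)
        result *= x;
    return result;
}

/* The residual function a partial evaluator could emit for the call
 * ipow(x, 3): the loop has been unrolled and the parameter n has been
 * eliminated entirely. */
int ipow_3(int x)
{
    return x * x * x;
}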

Research on active libraries has tended to focus on libraries for scientific computing, because they are often structured as a set of functions operating on a small number of data structures. This format makes them particularly amenable to the wide application of a small number of optimisations. For example, a parallel matrix operation library contains a number of functions to perform various matrix operations. Each function takes one or more matrices as arguments. An active library can make use of the mathematical properties of matrix functions to produce application-specific optimisations. For example, the library could replace multiplication by the identity matrix with a no-op.

The techniques proposed by Veldhuizen and Gannon were refined and extended in the Broadway domain-specific compiler [Guyer and Lin 2005]. Broadway is a source-to-source compiler which uses dataflow analysis techniques to implement domain-specific optimisations. The compiler requires application source code plus an annotation file for a library used by the application. That annotation file consists of annotations, each of which applies to a particular function in the library (Broadway also supports global annotations, which apply to all functions; they do not behave significantly differently to per-function annotations, so are not discussed further here). These annotations specify a number of properties about the function: they comprise an extensible domain-specific language. In particular, the data flow behaviour of the function in terms of its input and output parameters is described, so that Broadway can construct a dataflow lattice for the whole system comprising application and library.

Broadway annotations primarily perform optimisation by allowing inlining, and by replacing function calls with more domain-specific calls if certain properties of the input parameters can be shown to hold. One of the optimisations applied by Broadway is the identity-matrix example described above.


Broadway is flexible for two reasons. Firstly, the underlying analysis technique is generic and proven, in that the dataflow analysis technique, and its capabilities and constraints, are well-known and described in other literature. Secondly, annotations are supplied with libraries, rather than with the compiler itself, allowing for an unbounded number of library-specific optimisations. The authors argue that while it may be difficult to come up with the annotation, that difficulty is only encountered once, by the library authors, and the benefit can then be enjoyed by all users of the library. Another benefit of this division of labour is that the optimisation is done by a domain expert—the library author—rather than an application programmer who may not be as familiar with the internals of the library.

It should be noted that the idiosyncratic structure of scientific-computing libraries—a large function set operating on a small data-structure set—is not common in other types of libraries, and there is no guarantee that the techniques which can be applied to validate optimisations in scientific libraries will apply equally well to other libraries. Indeed, it seems unlikely: Broadway shows its most promising results when applied to linear algebra and matrix operation libraries; when it was applied to various non-scientific libraries it was restricted to simple inlining optimisations and to error-checking. For example, Broadway could warn at run-time if an application attempted to read or write to a file that was known to be closed. This is not to imply that the actual optimisations implemented by Broadway (i.e. various types of specialisation) are unsuitable but, rather, that Broadway-style dataflow analysis may not, by itself, be the best approach for optimisation of a heterogeneous collection of libraries.

Several projects address the problem of API evolution using methods that extend beyond function-call matching. Coccinelle [Brunel et al. 2009], for example, defines a domain-specific semantic patch language in order to support API evolution. In the Coccinelle system, a semantic patch describes a set of source-code level changes to code which must be made to support an API evolution. The "semantic" portion of the name refers to the way these patches are applied: rather than doing a simple textual match, Coccinelle parses both patch and code and uses static analysis to determine whether the meaning of the patch is present in the code. This allows Coccinelle to apply semantic patches even when the code being patched does not superficially resemble the text of the patch. Figure 2.1 shows a Coccinelle semantic patch to add a parameter to a function, replacing code which would generate the information as a local variable.

Currawong's templating system, discussed in Section 5.3, echoes the semantic patch ideas of Coccinelle.

2.4. Pre-execution binary rewriting

Pre-execution binary rewriting, also known as link-time binary rewriting, provides a compromise between the source-based and dynamic transformation methods. In some ways, it is the best of both worlds. Since the optimiser can run before the application, the speed of optimisation is less critical, which means that the optimiser can do more work and perhaps produce better optimisations. No dependence on source code means that authors of frameworks or libraries can release updated optimisations without requiring application authors to recompile their code. The major problem with binary transformation is the lack of type information, a problem because type information is a useful way to specify constraints on optimisation. This issue is solved in various ways depending on the implementation language.

@@
function a_proc_info;
identifier x,y;
@@

int a_proc_info(int x
+ ,scsi *y
  ) {
- scsi *y;
  ...
- y = scsi_get();
- if(!y) { ... return -1; }
  ...
- scsi_put(y);
  ...
  }

Figure 2.1. A Coccinelle semantic patch, replacing local generation of "y" with a parameter.

The refactoring systems discussed above operate on source code, but binary-only refactoring systems have also been developed. Keller and Holzle describe a binary component adaptation system—their term for an extended refactoring system which works exclusively with binary code [Keller and Holzle 1998]. Their system loads Java byte code and scans for structural features (such as class and method names). These features are then modified according to a small declarative language. The focus on binaries makes the system a little more flexible, because the refactorings which are necessary to deal with API evolution do not need to be implemented by the programmer. Instead, the system automatically refactors code at load time, so that it works with the installed API. This was taken one step further with PROSE [Popovici et al. 2002], which extends aspect-oriented programming techniques to compiled Java code.

Native binary rewriting is more challenging than rewriting of byte code. In a binary, the high-level control flow information is lost, as are the data types. These must be re-created or inferred. Some of these challenges are addressed by the DIABLO binary rewriting system [Put et al. 2005]. DIABLO is a "link-time rewriting framework", which can reduce the space used by binaries by removing unused code. It finds dead code by examining all jumps taken by the application to produce a "whole-program control flow graph". Dead code can be inferred by determining, for each byte of program code, whether it can be referenced from the graph. Unfortunately, most systems-oriented languages are known to be difficult to analyse in this regard due to their support for function pointers. There is no indication that DIABLO deals with these correctly.


2.5. Optimisation in component-based systems

The same techniques that have been applied to libraries can also be applied to component-based systems. The VEST system is an implementation of AOP for real-time component-based systems. In VEST, aspects can cross-cut component boundaries across control-flow paths (i.e. the criteria for code insertion can involve multiple components) [Stankovic et al. 2003]. Because it is designed to produce real-time systems, VEST also employs schedule checking tools to ensure the execution time of the resulting transformed system is within acceptable bounds.

Of the large selection of component-based systems for embedded systems [Friedrich et al. 2001], many support some degree of optimisation at the level of component description. A simple example is the Knit framework for the Flux OS Toolkit component system [Reid et al. 2000, Ford et al. 1997]. Knit acts as a component-aware linker, allowing inlining across component boundaries, as well as performing other non-optimisation-focused tasks. An alternative approach, the Pebble component-based operating system, supports hardware-mediated memory protection, but allows components to communicate via portals. Portal communication involves executing a small amount of code in the kernel. Normally this code is the same for all communicating processes in the system but, in Pebble, components can specify their own portal traversal code. This code executes with kernel privileges, so Pebble uses a number of mechanisms to make sure that safety properties are enforced: portal traversal code is written in a domain-specific language and compiled and checked by Pebble [Bruno et al. 1999].

2.6. Dynamic upgrade

Some problems addressed by Currawong resemble the problem of dynamic update, that is, replacing one or more binary components in a running system through run-time binary modification. One such system, the K42 Dynamic Update system [Baumann et al. 2005], relies on details of K42's object system to implement dynamic update with minimal service interruption. In K42, upgradable parts of the system are implemented as clustered objects, which expose a functional interface. Other parts of K42 access clustered objects via indirection through an object translation table (OTT). Dynamic update is implemented by a hot-swapper making use of the object translation table. To begin an update, the hot-swapper interposes mediator objects in the OTT. These mediators can at first track, and then delay, access to the clustered object being upgraded, allowing the hot-swapper to wait for quiescence in the system, prevent threads from accessing the clustered object whilst it is being upgraded, and finally to replace references to the mediator object in the OTT with references to the new object.
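The following sketch illustrates the general shape of this indirection-plus-interposition technique in C. It is illustrative only: the names and structure are invented for the example and do not reflect K42's actual clustered-object or OTT implementation.

#include <stddef.h>

typedef int (*read_fn)(void *obj, char *buf, size_t len);

/* One entry of a (hypothetical) object translation table: callers reach
 * the object only through this entry, never directly. */
struct ott_entry {
    void    *obj;   /* the object's state                       */
    read_fn  read;  /* current implementation of the interface  */
};

static int ott_read(struct ott_entry *e, char *buf, size_t len)
{
    return e->read(e->obj, buf, len);
}

/* A mediator that refuses new calls while an upgrade is in progress;
 * a real mediator would track callers and delay them instead. */
static int mediator_read(void *obj, char *buf, size_t len)
{
    (void)obj; (void)buf; (void)len;
    return -1;
}

/* Hot-swap: interpose the mediator, wait for quiescence, then install
 * the new implementation and redirect the table entry to it. */
static void hot_swap(struct ott_entry *e, void *new_obj, read_fn new_read)
{
    e->read = mediator_read;   /* 1. interpose the mediator                 */
    /* 2. wait until no thread is still inside the old object (omitted)     */
    e->obj  = new_obj;         /* 3. install the new object's state         */
    e->read = new_read;        /* 4. redirect future calls to the new code  */
}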

Upstart [Ajmani et al. 2006] also implements dynamic upgrade for distributed systems using a mediator-style approach. Upstart-upgradable systems communicate using an RPC protocol which is dispatched by a mediator running on each node. When performing an upgrade, nodes replace calls to functions representing the object with calls to mediator objects which perform one of a number of tasks related to the upgrade. These systems and Currawong both perform updates on binary code through interposition of functions.

2.7. Metalanguages

This section discusses languages that describe program transformations. To avoid confusion, I follow Kleene's nomenclature by distinguishing between the object language—the language in which code to be transformed is written—and the metalanguage—the language which describes those transformations [Kleene 1967].

The most popular refactoring tools rely on Eclipse [Eclipse Foundation 2010a]. Eclipse is a large program-development system comprising an integrated development environment (IDE), complete with editor, language reference, and debugger; a refactoring engine; and a refactoring API supporting plug-ins (to enable third-party refactorings), as well as many other features.

Most fundamentally, however, Eclipse is an IDE. As a result, Eclipse-based refactoring tools tend to be focused on graphical user interaction and not automated—the programmer must indicate which refactorings to perform. The metalanguages of choice for programs which perform only refactoring—i.e., structural manipulation—tend to be rather similar: they are usually imperative, and follow a top-down parsing model based around a syntax tree (to some greater or lesser degree of abstraction).

A good example of this style of refactoring language is Eclipse itself [Eclipse Foundation 2010b]. Eclipse refactorings are written in Java and run as a sequence of callbacks—in Java terms, refactorings are classes implementing an Eclipse refactoring interface. A small portion of a complete refactoring to rename a property in Eclipse is shown in Figure 2.2.

private Change createRenameChange() {
    // create a change object for the file that contains the property
    // which the user has selected to rename
    IFile file = info.getSourceFile();
    TextFileChange result = new TextFileChange( file.getName(), file );
    // a file change contains a tree of edits. Add the root of them
    MultiTextEdit fileChangeRootEdit = new MultiTextEdit();
    result.setEdit( fileChangeRootEdit );
    // edit object for the text replacement in the file
    ReplaceEdit edit = new ReplaceEdit( info.getOffset(),
                                        info.getOldName().length(),
                                        info.getNewName() );
    fileChangeRootEdit.addChild( edit );
    return result;
}

Figure 2.2. Renaming a property in Eclipse (example from Eclipse documentation)

Because Eclipse refactorings must integrate with the rest of Eclipse, which is primarily a programmer's IDE, the refactoring language is more complex than necessary: in the example, most of the code consists of creation and manipulation of objects that describe the change to Eclipse in terms of an edits tree, obscuring the code transformation being performed. Other required portions of the refactoring language include exception handling, and code related to keeping the user interface updated—all of which could be expressed separately. Obviously, there is room for improvement.

Several languages exist which address the lack-of-clarity issue common to Eclipse-style program transformation by replacing the imperative-style transformation language with a declarative one. The Tree Transformation Language (TXL [Cordy 2006]) provides one such implementation. TXL was not designed for refactoring, or indeed for program transformation at all per se; it was designed to help with rapid prototyping of programming languages. However, the metalanguage used by TXL is quite interesting from a program-transformation perspective, so it is worth discussing here.

TXL is a programming language transformer. A complete TXL specification effectively defines two object languages: the original object language Oo, and the transformed object language Ot. Oo is specified using an EBNF-like grammar, as used by parser generators such as yacc. Ot is specified in one or more replacement rules, which are written in the metalanguage. A simple replacement rule is shown in Figure 2.3. This rule replaces statements of the form V := V + E with a statement of the form V += E (in other words, it translates standard assignments to augmented assignments). TXL uses a simple extensible metalanguage capable of including embedded portions of Oo and Ot. This feature, known as "pattern matching" within the functional-language community, provides a very concise, expressive method to represent code changes. Also notable is TXL's notion of types, which are derived from the parse tree; as a result, TXL's concept of types may differ from the object language's.

rule simplifyAssignments
    replace [statement]
        V [reference] := V + E [term]
    by
        V += E
end rule

Figure 2.3. A TXL replacement rule.

TXL is by design language-agnostic, with the result that it is incapable of performing most context-sensitive transformations (except when the context refers to type information derivable from the parse tree). Keller and Holzle's language for refactoring of Java bytecode [Keller and Holzle 1998] (discussed above) also resembles a simple pattern-matching language. This language is, however, Java-specific, and by virtue of being so specialised it is even more concise than TXL. The language consists of the Java grammar plus a number of refactoring-specific keywords. The writer of the refactoring constructs a file describing the refactorings to perform, known as a delta file, an example of which is shown in Figure 2.4. The delta is Java-like, apart from the keyword "delta", and the prefix "add method". This delta concisely expresses the modification to perform, in a language which Java programmers already understand.


delta interface Enumeration {
    add method public Object lastElement() {
        /* Java code omitted */
    }
}

Figure 2.4. A binary delta to add a method to all classes implementing an interface

Metalanguages for programming-language-independent specification have been developed, with the justification that presenting a consistent language-independent interface will aid in the development of higher-level refactoring tools. An example is FAMIX, a metalanguage which supports transformations in Java, C++, and Smalltalk [Tichelaar et al. 2000]. This approach seems rather self-limiting. A language to describe multiple object languages must either limit itself to the language features shared by every object language—the lowest-common-denominator approach—or it must support all features of every object language—the inclusive approach. Neither option is appealing. In the lowest-common-denominator case, not all possible object language transformations will be possible in the metalanguage. In the inclusive case, the metalanguage is more complicated than necessary for any particular object language, and some transformations that can be expressed in the metalanguage cannot be implemented in object languages which do not support a particular metalanguage feature (such as multiple inheritance).

2.8. Conclusion

There is a lot of work in the area of program analysis for bug finding or security checking [Engler et al. 2001, Yang et al. 2006, Ball et al. 2006, Ball and Rajamani 2001, Holzmann 1997], but there is relatively little dealing with optimisation or API evolution as a general problem. Perhaps the most promising area of work is in the field of active libraries, but these tend to be limited for two major reasons: firstly, they generally require source code; and, secondly, they tend to focus on a single analysis technique.

Requiring program source code for optimisation makes providing advanced analysis techniques, such as data-flow analysis, relatively easy. However, the reliance on source code heavily restricts the applicability of the techniques. Deployed software is almost always distributed in binary form. Even if the most advanced analysis techniques cannot be applied to binary code, there is room for an optimiser that supports a simpler technique but remains binary-only.

There is no categorical best approach to determining information about a program for transformation purposes. Rather, the best approach depends on the nature of the transformation. Candidates discussed above include control-flow matching, data-flow analysis, and structural analysis. An ideal program transformer would support multiple analysis methods, and allow them to be used interchangeably. There is little related work in this area, though preliminary documentation on the Cake transformation system (developed concurrently with, but independently of, this thesis) indicates that others have identified the same problems: Cake supports multiple methods for determining information about a program [Kell 2009].

Finally, optimisation and transformation systems tend to be restricted to a single language. This is unrealistic in a world where almost all systems are multi-language. For example, applications for the Android system are written in Java but frequently extended with C and C++; applications for the iPhone are typically (but not necessarily) written in Objective-C, but may also make use of C and C++ libraries (these two systems are discussed in more detail below). It seems that a multi-language, binary-only optimiser would provide an interesting way to explore optimisations not possible with current work.

In summary, Currawong is novel in two major ways:

• It demonstrates the novel concept of binary-only, interface-level optimisation; and

• It uses a simplified optimisation specification language capable of representing a variety of optimisations in a consistent way.


3. Background

I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.

– Douglas Adams

This work grew from research on optimal component selection in an experimental microkernel-based component system. The core ideas were then applied to a real-world system in order to demonstrate their applicability to complex systems. Consequently, reference is made throughout this document both to the research component system, named CAmkES, and to the real-world system—the Android mobile operating system. This chapter gives a brief introduction to both of these systems and introduces some motivating examples which are used throughout the thesis.

3.1. The CAmkES component system

The Component Architecture for Microkernel-based Embedded Systems, normally referred to by the rather intimidating abbreviation CAmkES [Kuz et al. 2007], is a tool for rapid development of component-based embedded systems. It facilitates the component-based software engineering (CBSE) software development technique, in which systems are modelled as a set of interacting components. In CAmkES, components may be separated from each other using hardware-mediated memory protection: an implementation restriction currently requires one component per protection domain (see below), which is essentially equivalent to a single component per process.

The microkernel used by CAmkES is OKL4 [Heiser et al. 2007], a modern variant of the high-performance L4 family originally described by Liedtke [Liedtke 1995]. L4 does not have an explicit process (or address space) concept, but instead maintains only the concepts of threads and protection domains, where a thread is a schedulable flow of control, and a protection domain represents all the regions of virtual address space accessible to the thread. This arrangement encourages memory sharing. In the rest of this dissertation, all three terms "thread", "process", and "protection domain" are used when appropriate.

Components interact using connectors, which are small pieces of code, usually automatically generated, which perform the system-specific operations necessary to facilitate interaction between components. This interaction is defined statically at both a component level and at a whole-system level. At the component level, possible interactions with the component are defined by one or more interfaces, written in a subset of CORBA IDL [Object Management Group 2004]. At the whole-system level, all component-component interactions are specified in a CAmkES-specific architecture definition language (ADL).

A system description written in ADL is transformed into a bootable system by the ADL compiler. This compiler runs when the system is being built and generates system-specific connector code from connector templates, which describe the connector's behaviour generically. It also generates a program to initialise the system by performing kernel-specific tasks, such as creating threads and initialising shared memory regions.

Figure 3.1 shows three diagrammatic representations of CAmkES systems. These diagrams are a graphical representation of the ADL-based system description. In the figure, the system architecture labelled A represents a three-component system, where each component is represented by a large rounded rectangle. Components communicate via the FS and Block connectors, which are functional connectors—they simulate a function-call interface. The Client and FAT File System components are also connected via a shared memory region (shown as small squares), as are the FAT File System and Block Device components. Note that connectors are implemented using one of a number of system-level communication primitives: for example, functional connectors can be implemented using IPC if the communicating components are in separate protection domains, but may use direct function calls if the components are in the same protection domain.

In the same figure, the system architectures labelled B and C show different ways that the basic architecture can be modified using CAmkES. System architecture B shows component replacement. Here, the file system component has been changed from FAT File System to EXT2 File System. In CAmkES, components can be interchanged as long as they support the same interfaces. System architecture C demonstrates the interposition pattern to adapt old components to new interfaces. Here, the client uses the OldFS interface, but the FAT File System uses the FS interface. To solve the problem, a new component, named Shim, is interposed between the two, to transparently convert uses of the old interface to the new interface and vice-versa.

Component-based systems make a good testbed for optimisation systems precisely because of this well-defined separation: some classes of optimisation can be implemented simply by inserting shim components, replacing components with alternatives which implement the same interface, or by changing the code which implements connectors between components.
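As a sketch of what the core of such a shim component might look like in C, consider the following. The OldFS and FS interfaces, and the difference between them (an implicit versus an explicit file offset), are invented for this illustration and are not the actual CAmkES interfaces used later in this dissertation.

#include <stddef.h>
#include <sys/types.h>

#define MAX_HANDLES 16

/* Hypothetical new FS interface, provided by the file system component:
 * the caller supplies an explicit offset with every read. */
extern int fs_read_at(int handle, void *buf, size_t len, off_t offset);

/* Hypothetical old FS interface, expected by the client: each read
 * advances an implicit per-handle offset. The shim maintains that
 * offset and forwards each call to the new interface. */
static off_t current_offset[MAX_HANDLES];

int oldfs_read(int handle, void *buf, size_t len)
{
    int n = fs_read_at(handle, buf, len, current_offset[handle]);
    if (n > 0)
        current_offset[handle] += n;
    return n;
}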

3.1.1. Binary CAmkES

Several features were added to CAmkES over the course of this work. The new system is named Binary CAmkES, but is usually referred to simply as CAmkES in this dissertation. The most important new feature is support for composition of binary-only modules. CAmkES requires source code for all components when building the system, because it generates system-specific C / C++ header files from a template which the components must use. Binary CAmkES solves the same problem using a linker. In this regard it was inspired by, and extends, Knit [Reid et al. 2000].


Figure 3.1. System design with CAmkES (diagram: three system architectures, A, B, and C, built from Client, FAT or EXT2 File System, Block Device, and Shim components, connected by IPC interfaces and shared memory regions)

In Binary CAmkES, each component consists of a single object file on disk. The component file is named using a naming scheme which guarantees uniqueness.

3.2. Android

Android is a complete environment for running code on mobile phones and other mobile devices. Google describes Android as [Google Inc. 2010]

[A] software stack for mobile devices that includes an operating system, middleware and key applications. The Android SDK provides the tools and APIs necessary to begin developing applications on the Android platform using the Java programming language.

Android’s architecture makes it an interesting target for optimisation research:

• It is a real system: Android is based on industry-standard components such as the Linux kernel and Java, so optimisations applied to the system will be applied to code that has already been optimised using traditional techniques—there should be few "low-hanging fruit";


• It has a large API: Architecture optimisation operates at the API level, so a large API provides more opportunities for optimisation;

• It is testable: A large number of free applications are available for Android, including applications which require as much performance as possible, such as games; and

• It is componentised: Android represents a practical compromise between performance and a highly componentised system architecture.

Nobody could claim that it is a textbook example of a componentised system (for example, almost all user-level services run within a single process), but Android was nonetheless obviously influenced by component-based software engineering principles. Android supports installation of third-party software. Separate programs normally run as separate Linux processes, but can communicate with other processes using Binder, an Android-specific inter-process communication mechanism. Like CAmkES interfaces, Binder interfaces are well-defined (using Java as a specification language). Android's core functionality is split between system libraries (written in Java and C), the System Server, which manages access to devices and launches applications, and a POSIX-like layer. This highly-layered approach provides multiple optimisation opportunities, although the examples shown later in this work focus on the system library-to-application interface.

Android was designed to run on a wide variety of hardware. However, the three most popular Android-based smartphones available at the time of writing, the HTC Dream, the Motorola Droid, and the Nexus One, share the following characteristics:

• Capacitative touchscreen (suited to finger-based, rather than stylus-based, input);

• Screen resolution of at least 480x320 pixels; and

• Hardware-accelerated graphics using (at least) OpenGL ES 1.1.

This de facto hardware standard has acted as a set of minimum hardware specifications for developers, and many assume that these characteristics will be present for all Android-capable devices. For example, all of the applications discussed below rely on the above screen resolution, most rely on a touchscreen, and some require OpenGL. Even relatively small hardware details drive application design: for example, some applications assume that fingers (instead of a stylus) will be used for touchscreen input, thus assuming that the touchscreen is capacitative rather than resistive.

The Android architecture as provided by Google is shown in Figure 3.2 [Google Inc. 2010]. This diagram shows the layering described above and, in particular, emphasises applications' dependence on the Android application framework. This diagram does not, however, do a particularly good job of showing the amount of communication performed by a single application: all application input and output makes use of the System Server, which is essentially just an Android application itself (albeit one with special privileges). This communication takes place using Android's IPC mechanism, Binder. This means that, for example, every time the user touches the screen, a Java class representing the event is generated in the System Server and transmitted, via Binder, to the application. Similarly, whenever the application wishes to refresh the display, it must communicate with the System Server.

Figure 3.2. Android architecture (diagram: layers comprising Applications; the Application Framework; Libraries and the Android Runtime, including the Dalvik VM; and the Linux Kernel, with drivers, Binder IPC, and power management)

An application's interaction with the System Server is shown in Figure 3.3. The application transmits bitmaps to the System Server for display using shared memory (represented by small squares in the diagram). The application communicates via function-call-based IPC to instruct the drawing server to update the display (the "Drawing ctl." connector). The System Server communicates via the same IPC mechanism to notify the application of input events, such as a finger moving on the touch screen. Figure 3.3 is somewhat simplified and omits many other interactions between the application and the System Server.

Figure 3.3. Interaction with the System Server in Android (diagram: the application passes bitmaps via shared memory and drawing-control calls via IPC; the System Server delivers input events via IPC)

3.2.1. Android is written in multiple languages

An important characteristic of Android is that most of the system is written in Java, but portions are written in C. In fact, all Android applications contain at least some Java, and many are completely Java. It is very common for smartphone applications to be multi-language. For example, iPhone applications are written in a combination of Objective-C and C; and applications written for Windows Mobile 7 Series phones will run in Microsoft's Common Language Runtime, which supports many languages (the most common being VB.NET, C#, and Managed C++).

3.3. Running examples

A single running example is used to illustrate architecture optimisation in a research component system, and several Android applications are used to demonstrate some of the same techniques in Android.

3.3.1. The componentised video player

Component-system-based optimisations will be demonstrated on the video player shown in Figure 3.4. This figure is a graphical representation of the underlying CAmkES architecture description (in ADL).

Figure 3.4. Componentised video player (diagram: Client, FileSystem, Decoder, and Display components connected via FileSystem, Codec, and FrameBuffer interfaces)

In the video player, the Client component continuously requests blocks of encoded video data from the FileSystem component. This data is passed to the Decoder component. The decoded frames are eventually sent to the Display component. Each component is connected with two connectors: one remote-procedure-call connector (which provides a function-call interface); and one dataport connector (which provides a shared memory region).


Any optimisation involving this system ultimately results in a change to the ADL: a component being changed, a component being added (interposed) between two components, or a component being removed.

3.3.2. Examples in Android

Android optimisation was demonstrated using several applications written specifically to test Currawong, as well as two Android games. The following criteria were used to select the games:

• Popular: the game should be in the "popular" list in the Australian version of "Android Market", Google's online application store, as of February, 2010.

• Graphically intensive: the game should have a high rate of screen updates, i.e. it should be an action game. This requirement ensures that the application performs a large number of application-to-system interactions.

• Optimisable: the game should be able to be optimised by one of the two high-level optimisations written for this dissertation. Chapter 7 describes the two high-level optimisations in detail.

The first two games which met all the above criteria were selected from the "Popular" list of the Android Market. The games are:

• Bonsai Blast: A puzzle game, in which the player is required to connect three or more bubbles of the same colour to eliminate all bubbles on the screen.

• Space War: A top-down shoot-em-up.

Android evaluation is discussed in more detail in Chapter 7.


4. Performance anti-patterns

Error is endlessly diversified. It has no reality but is the pure and simple creation of the mind that invents it.

– Benjamin Franklin

No two software systems are the same, so, when a performance problem is encountered, it is tempting to conclude that it is unique to the system under examination. This is not necessarily the case. Often, a performance issue can be viewed as an example of a well-defined class of similar problems. In this chapter, I first look at examples of architectural fixes to particular system architecture optimisation problems—in effect, instances of manually-implemented and domain-specific system software architecture optimisation. I briefly discuss the relevance of each example. I then discuss the general problems motivating this related work, to produce a set of performance anti-patterns. Here I use pattern in the sense described by Gamma et al. in "Design Patterns" [Gamma et al. 1995]—that is, a general solution to a design problem. A performance anti-pattern, then, is the opposite: a design-level cause of performance problems.

In the final section of this chapter, I discuss potential remedies for the identified performance anti-patterns, where a remedy for a performance problem is either a description of how the problem may be fixed, or an implementation, in code, of that fix. The anti-patterns and their remedies form a motivation for a general-purpose architecture optimiser, the design of which is discussed in the next chapter.

4.1. Performance improvement examples

4.1.1. sendfile()

Many operating systems provide a function to send the contents of a file over a TCP socket (some implementations are more generic, but this is the core functionality). This function, often called sendfile or similar, provides highly-specialised support for Internet file servers (such as Web servers). The sendfile function provides a significant performance improvement over the traditional method of sending a file through a socket: one study claims an improvement of 51% [Nahum et al. 2002].

sendfile's impressive performance is the result of a large number of improvements on older methods of file transmission. Figure 4.1 is a UML activity diagram showing the traditional method by which network servers send files on UNIX-like systems, using the read() and write() system calls. The file is transmitted in blocks. Sending each block requires two system calls, resulting in four context switches per transmission. More significantly, each block is copied four times: from the disk into the kernel's file system cache, from the cache into the buffer provided by the application, from the application's buffer into a region of memory accessible to the network card, and finally onto the network itself.

Figure 4.1. File transmission using read() and write() (activity diagram: for each block, the kernel reads the block to a buffer via DMA, copies it to user space, copies it to a network buffer, and sends it over the network via DMA)

By contrast, the sendfile() API attempts to minimise both context switches and copies. Figure 4.2 shows the behaviour of sendfile(). Each block of data is copied only twice, rather than four times, and only two context switches are performed for the entire file, rather than four per block.

Figure 4.2. File transmission using sendfile() (activity diagram: a single sendfile() call; for each block, the kernel reads the block to a buffer via DMA and sends it over the network via DMA)

On Android smartphones, which are based on Linux, sendfile() exists as an API for applications to use, and is thus directly relevant as a motivating optimisation.
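To make the contrast concrete, the following sketch shows both transmission methods on Linux. It is a simplified illustration: error handling is omitted, and the sendfile() call relies on the Linux convention that passing a NULL offset pointer makes the kernel use, and advance, the file's own offset.

#include <unistd.h>
#include <sys/types.h>
#include <sys/sendfile.h>

#define BUFSIZE 65536

/* Traditional transmission: each block is read into a user-space buffer
 * and written back out, costing two system calls and two copies per block. */
void send_with_read_write(int file_fd, int sock_fd)
{
    char buf[BUFSIZE];
    ssize_t n;

    while ((n = read(file_fd, buf, sizeof buf)) > 0)
        write(sock_fd, buf, (size_t)n);
}

/* sendfile()-based transmission: the kernel moves each block from the
 * file to the socket without it ever entering user space. */
void send_with_sendfile(int file_fd, int sock_fd, size_t length)
{
    size_t remaining = length;

    while (remaining > 0) {
        ssize_t sent = sendfile(sock_fd, file_fd, NULL, remaining);
        if (sent <= 0)
            break;
        remaining -= (size_t)sent;
    }
}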

4.1.2. API call specialisation

Several experimental systems provide a method to fine-tune their APIs on a per-application basis. This concept as applied to a traditional UNIX-like kernel was popularised by the Synthesis kernel [Pu and Massalin 1988]. Rather than expose a traditional system call API, Synthesis instead exposes an interface to produce domain-specific system calls as required by the application. The actual system call is generated using partial evaluation based on the arguments supplied to the generator.

The biggest performance improvement in Synthesis relates to file access. Figures 4.3 and 4.4 illustrate the differences between Synthesis and a traditional UNIX system—the authors use 4.3BSD in their paper. In Figure 4.3, the open() system call (line 2) performs a number of security checks and creates entries in various data structures to manage the file, returning an integer file descriptor to the application. The read() call (line 3) performs the same checks, accesses data structures to discover the kernel's data structure representing the file, and calls through several layers (system call interface, virtual file system interface, concrete file system, device driver) to retrieve the data. Finally, the close() call (line 4) causes the kernel to deallocate any memory allocated to the file descriptor.

1 char data[BUFSIZE];
2 int fd = open("/dev/mem", O_RDONLY);
3 read(fd, data, BUFSIZE);
4 close(fd);

Figure 4.3. File access using the POSIX API

1 char data[BUFSIZE];
2 struct SIO_if *file = IO_create(FILEACCESS, "/dev/mem", FA_RDONLY);
3 read(file, data, BUFSIZE);
4 SIO_terminate(file);

Figure 4.4. File access using Synthesis

By contrast, the Synthesis IO_create system call (Figure 4.4, line 2) causes the dynamic compilation of a specialised function to perform reading from the file. In this particular case, the specialisation routines notice that the file being referenced is the /dev/mem device, which presents a simple byte-for-byte view into memory. Because the procedure to access data from /dev/mem is so simple (for example, the file-index-to-device-location translation is trivial), the resulting specialised function bypasses both the virtual file system and concrete file system layers to interact directly with the device. The specialised function thus executes significantly less code.

This is a contrived example, as the authors themselves acknowledge, but it serves to demonstrate the potential for highly-specialised system calls. This same optimisation (implemented in a completely different way) was implemented by Gabber et al. in the Pebble operating system [Bruno et al. 1999]. In Pebble, as discussed in Chapter 2, communication between protection domains takes place via portals, small pieces of code (and, optionally, data), which run in kernel context. Portals are written in a tiny declarative language, designed in such a way that only the application invoking the portal code must trust that portal.

Figure 4.5 shows the process of opening and then reading from a file in Pebble. The open() call results in a call to a library function, which communicates via a portal to the file server (a separate application). The server creates several new portals in the application's portal space: one each for reading, writing, seeking in, and closing the file. The call returns a "file descriptor" which is really nothing more than the index into the portal table of the first portal in the newly-created set. When the application calls read(), the associated library routine uses the file descriptor as a portal index, and invokes the appropriate portal. The portal code can contain within it a pointer to the file data structure in the file server's address space. Thus, the creation of custom code avoids several table look-ups: the file server knows that the file data structure it is passed is correct, because the portal code is trusted.

Figure 4.5. Reading from a file using Pebble portals (diagram: the application's open() invokes a portal to the file server, which creates a file control block and caller portals for read, write, seek, and close; a subsequent read() invokes the read portal, which maps the target buffer and reads from the device)

Android does not contain an easily-identifiable example of API call specialisation. However, the CAmkES componentised video player, introduced in Section 3.3.1, contains a good example of the technique, in which standard file system read() and write() calls are specialised according to the requirements of the player so that memory can be shared between components. Section 5.2.3 contains a full treatment.

4.1.3. Integrated Layer Processing

Several authors have recognised the inefficiencies resulting from layered protocols. A common occurrence is that data are traversed more than once, resulting in suboptimal stack performance. For example, a network stack may copy data in one function, and compute a checksum in another (lower-level) function. Both these operations require examining every byte of the source data.

One remedy is to perform, at the same time, operations which would normally be performed in different functions. This approach is known as Integrated Layer Processing, or ILP [Smith 1990, Abbott and Peterson 1993, Clark et al. 1989]. Abbott and Peterson provide a good treatment of the problem [Abbott and Peterson 1993]. They describe a generalised data-processing function which they call a word filter. A word filter accepts a single machine word as input, and outputs zero or more machine words. Multi-stage operations on data can thus be described using a chain of word filters: a relatively simple example is computing a checksum on a data packet while simultaneously copying it. Word filters are assembled at compile time and, for speed reasons, are written in a similar way to C-style macros. The authors achieve significant (17% to 36%) improvements by implementing their ideas.
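The following sketch illustrates the underlying idea (a simple additive checksum stands in for a real one, and the code is an illustration rather than Abbott and Peterson's word-filter macros): the integrated version touches each word exactly once, while it is still in the cache.

#include <stddef.h>
#include <stdint.h>

/* Separate passes: the buffer is traversed twice, so for large buffers the
 * words read by the first pass have left the cache before the second pass
 * needs them again. */
uint32_t copy_then_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
    for (size_t i = 0; i < nwords; i++)
        sum += dst[i];
    return sum;
}

/* Integrated version: one traversal performs both operations, touching
 * each word exactly once while it is still cached. */
uint32_t copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = src[i];
        dst[i] = w;
        sum += w;
    }
    return sum;
}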

Optimisation opportunities of this type occur in image processing. It is common to post-process images in some way in order to change their appearance. Simple transformations involve changing the image's contrast or brightness—such transformations involve applying a mathematical formula to each pixel in the image.


More complex transformations are applied to groups of adjacent pixels. For example, convolution filters apply a 2D convolution in the spatial domain to an image by performing a 2D convolution operation on each pixel in the image, taking its neighbours into account. Common image convolutions include blurring, sharpening, and edge detection.

Figure 4.6 shows a filter chain applied to the componentised video player. Here, each filter acts as a framebuffer, accepting image data, processing it, and forwarding it to the next filter in the chain. This very straightforward arrangement results in each filter processing an entire image. This results in a streaming-data cache usage pattern, in which data for the first bytes of an image, as used by the first filter, have already left the cache by the time they are to be re-used for the second filter. This is the anti-pattern addressed by ILP.

Figure 4.7 shows a possible ILP-centric optimisation of the same filter chain. In this optimisation, an ILP-aware FilterControl component is interposed between the filter, Client, and Display components. Because it is connected to all filters, FilterControl can instruct each filter to process only a small portion of the framebuffer at one time. This means that the small portion of framebuffer loaded into the data cache for the first filter can also be used for the second filter, thus avoiding the streaming-cache anti-pattern.
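
The control structure that FilterControl imposes can be sketched as follows (hypothetical code; the names and interfaces here are illustrative only, not the component interfaces described later). Instead of letting each filter traverse the whole frame, the controller pushes one cache-sized strip of rows through every filter before moving to the next strip:

    #define STRIP_ROWS 16   /* rows per strip, chosen so a strip fits in the data cache */

    struct frame {
        int width;
        int height;
        unsigned char *pixels;
    };

    /* A filter processes 'rows' rows of the frame in place, starting at 'first_row'. */
    typedef void (*filter_fn)(struct frame *f, int first_row, int rows);

    static void run_filters_in_strips(struct frame *f, filter_fn *filters, int nfilters)
    {
        for (int row = 0; row < f->height; row += STRIP_ROWS) {
            int rows = f->height - row;
            if (rows > STRIP_ROWS)
                rows = STRIP_ROWS;

            /* Every filter sees this strip while it is still cache-hot. */
            for (int i = 0; i < nfilters; i++)
                filters[i](f, row, rows);
        }
    }

A real implementation must also handle filters (such as convolutions) that read a few rows beyond the strip boundary, but the cache behaviour is the same.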

[Figure: component diagram — Client, Decoder, a chain of filter components (Sharpen, Increase brightness), and Display, connected through Codec and FrameBuffer interfaces.]

Figure 4.6. The componentised video player with a chain of image filters between client and display

4.1.4. Protocol header optimisation

In some multi-module systems, particularly networking stacks, a common pattern is for a module to accept a packet, add some data to that packet, and then pass the packet on to a lower-level function. A particularly common case is when headers are added to the packet. Figure 4.8 shows a typical TCP/IP packet ready to be transmitted over an Ethernet network. The packet contains three headers, each of which may have been added by separate sub-systems.


[Figure: component diagram — Client and Decoder connect to a FilterControl component, which mediates access by the Sharpen, Increase brightness, and Display components to per-filter framebuffers.]

Figure 4.7. The componentised video player with a chain of image filters with data access mediated by a controller

A naïve approach to adding these headers is to copy the packet multiple times, allocating new storage space each time. This is obviously inefficient. Another alternative is to over-allocate storage space for the packet, leaving empty space at the beginning for extra headers. This is the approach traditionally taken by Linux, with its sk_buff structure [Miller 2010].
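
A minimal sketch of the headroom approach follows (a simplification for illustration; the real sk_buff interface is considerably richer, and the structure and function names here are invented). The buffer is allocated with spare bytes at the front, and “adding” a header simply moves the data pointer backwards:

    #include <stdlib.h>
    #include <string.h>

    struct pkt_buf {
        unsigned char *head;   /* start of allocated storage        */
        unsigned char *data;   /* start of valid packet data        */
        size_t         len;    /* number of valid bytes from 'data' */
    };

    /* Allocate a buffer with 'headroom' spare bytes before the payload. */
    static struct pkt_buf *pkt_alloc(size_t size, size_t headroom)
    {
        struct pkt_buf *b = malloc(sizeof *b);
        if (b == NULL)
            return NULL;
        b->head = malloc(size + headroom);
        if (b->head == NULL) {
            free(b);
            return NULL;
        }
        b->data = b->head + headroom;
        b->len  = 0;
        return b;
    }

    /* Prepend a header without touching the payload: move 'data' backwards.
     * Assumes enough headroom remains; a real implementation would check. */
    static unsigned char *pkt_push(struct pkt_buf *b, const void *hdr, size_t hlen)
    {
        b->data -= hlen;
        b->len  += hlen;
        memcpy(b->data, hdr, hlen);
        return b->data;
    }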

[Figure: a packet laid out as Ethernet header, IP header, TCP header, then data.]

Figure 4.8. A network packet with multiple headers

A more generic approach is to represent the packet as a list of contiguous regions. When the packet is transmitted, the regions are either copied so that they are literally contiguous, or, if the network hardware supports scatter-gather I/O, the pieces are sent to the hardware as-is, relying on the scatter-gather hardware to assemble the pieces into a complete packet. This technique became popular with BSD 4.3's mbufs [Leffler and McKusick 1989]. Several other networking stacks support a similar structure. For example, the light-weight IP stack (LWIP [Dunkels 2001]) uses pbufs, which are based heavily on mbufs, as its core packet structure.
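
The chained alternative can be sketched as a linked list of segments (again simplified relative to real mbufs or pbufs; the names are invented). Prepending a header allocates a new, small segment and links it to the front, leaving the payload untouched; the chain is then either walked by scatter-gather hardware or flattened with a single copy at transmit time:

    #include <stdlib.h>
    #include <string.h>

    struct seg {
        struct seg *next;     /* next segment in the chain, or NULL */
        const void *data;     /* start of this segment's bytes      */
        size_t      len;      /* number of bytes in this segment    */
    };

    /* Prepend a header segment to an existing chain.  Returns the new head. */
    static struct seg *chain_prepend(struct seg *chain, const void *hdr, size_t hlen)
    {
        struct seg *s = malloc(sizeof *s);
        if (s == NULL)
            return NULL;
        s->data = hdr;
        s->len  = hlen;
        s->next = chain;
        return s;
    }

    /* Fallback for hardware without scatter-gather: one copy into 'out'. */
    static size_t chain_flatten(const struct seg *chain, unsigned char *out)
    {
        size_t off = 0;
        for (const struct seg *s = chain; s != NULL; s = s->next) {
            memcpy(out + off, s->data, s->len);
            off += s->len;
        }
        return off;
    }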


4.1.5. FBufs: high-bandwidth transfer

It is often desirable to transfer a large amount of data between protection domains. These sorts of communications occur more frequently in microkernel-based systems—a multi-server network stack is the canonical example—but may also occur in monolithic systems, particularly in systems dealing with video and audio data.

The traditional way to pass data between protection domains is to copy it. This forms the basis of UNIX pipes and fifos, and is also the only supported data-copying mechanism in early microkernels. Copying becomes inefficient when significant amounts of data must be transferred. Pipes are also rather difficult to customise—the buffer size of the pipe, in particular, can't generally be adjusted. A too-small buffer results in unnecessary context switches between data producer and data consumer; a too-large buffer wastes memory. Thus pipes pose problems in terms of both data flow and control flow: data flow, because of the imposed copy; and control flow, because of the lack of control over the extent and frequency of context switches between communicating processes.

The solution is to use shared memory—in particular, a circular buffer. This is a more complicated data structure and its use raises a number of design issues. Which process allocates the memory? Which process “owns” the buffer (and must, presumably, free it)? What if two processes attempt to access the same data at the same time? What if more than two processes wish to access the same data? Druschel and Peterson proposed FBufs to address these problems [Druschel and Peterson 1993]. FBufs has been reimplemented and extended multiple times [Pai et al. 2000, Mosberger and Peterson 1996, Chu 1996, de Bruijn and Bos 2008], and implementations exist in entirely user-level applications for non-research operating systems (for example, the vo video output library, part of the MPlayer video player, uses an implementation of the technique [MPlayer authors 2010]). Druschel and Peterson demonstrate an eight-fold performance increase over data copying in their paper. Interestingly, they propose a number of optimisations to make FBufs even faster, at the expense of a certain amount of flexibility (for example, they require that FBufs be mapped at the same virtual address in all participating processes). These optimisations produce a further twelve-fold increase over their original FBufs implementation, for a total improvement of roughly one hundred times over copying.

FBufs builds on many other concepts, including the BSD mbufs described above.
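
A minimal single-producer, single-consumer ring over a shared region illustrates the core of the idea (a sketch only, with invented names; FBufs itself adds buffer ownership transfer, aggregation, and security machinery on top of shared mappings):

    #include <stdatomic.h>
    #include <stddef.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 4096

    /* This structure lives in memory mapped into both protection domains. */
    struct ring {
        _Atomic size_t head;                 /* next slot the producer will fill */
        _Atomic size_t tail;                 /* next slot the consumer will read */
        unsigned char  slot[RING_SLOTS][SLOT_BYTES];
    };

    /* Producer side: reserve the next free slot, or return NULL if the ring is full. */
    static unsigned char *ring_reserve(struct ring *r)
    {
        if (atomic_load(&r->head) - atomic_load(&r->tail) == RING_SLOTS)
            return NULL;                     /* full: wait for the consumer      */
        return r->slot[atomic_load(&r->head) % RING_SLOTS];
    }

    static void ring_publish(struct ring *r)
    {
        atomic_fetch_add(&r->head, 1);       /* make the slot visible            */
    }

    /* Consumer side: peek at the oldest unread slot, or return NULL if empty. */
    static unsigned char *ring_peek(struct ring *r)
    {
        if (atomic_load(&r->head) == atomic_load(&r->tail))
            return NULL;
        return r->slot[atomic_load(&r->tail) % RING_SLOTS];
    }

    static void ring_release(struct ring *r)
    {
        atomic_fetch_add(&r->tail, 1);
    }

Data written into a slot by the producer is consumed in place: no copy crosses the protection boundary, and the number of context switches is governed by how full the ring is allowed to become rather than by a fixed pipe buffer size.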

4.2. Summary of performance anti-patterns

Each of the performance problems described above is caused by at least one of the following five performance anti-patterns. Frequently, a particular performance problem will involve more than one such anti-pattern.

1. Context switching

2. Copying

3. Overly-generic or inflexible APIs


4. Unsuitable data structures

5. Reprocessing data

Each anti-pattern is discussed with reference to the network video player described in Chapter 3. For this discussion, it is necessary to extend the original video player diagram, Figure 3.4, to explicitly show each component's protection domain. The new video player is shown in Figure 4.9. In this figure, the dashed regions represent protection domains. In this version of the componentised video player, each component resides in one, and only one, protection domain.

[Figure: component diagram — FileSystem, Client, Decoder, and Display components, each in its own dashed protection domain, connected through FileSystem, Codec, and FrameBuffer interfaces.]

Figure 4.9. Componentised video player showing protection domains

4.2.1. Context switching

Both sendfile() and FBufs reduce context switch frequency. Context switches reduce application performance both directly and indirectly. The direct cost is due to the kernel code which must be run to implement the switch, and is usually relatively small. The indirect cost is due to the cache impact of the switch: if the total working set size of all programs exceeds the cache size, programs will suffer a performance hit as code and data which were evicted from the cache are re-loaded. On some architectures the entire cache is flushed during a context switch, causing a performance impact potentially 100 times greater than the direct cost. The best modern coverage of this phenomenon, for IA-32 architectures, is given by Li et al. [Li et al. 2007].

However, when processes being context-switched are interacting, we can expect a less pessimistic result. For example, consider the interaction between the Client and Decoder components in Figure 4.9. In this system, a context switch occurs from Client to Decoder whenever the client requests that some data be decoded. When the decoded data are ready, another context switch occurs, from Decoder to Client. However, because Client relies on Decoder to perform a service, the code in Decoder would have to be run even if Decoder were in the same protection domain as Client—so there is little unnecessary code cache pollution.


Because the same code is being run, the impact on the data cache will also be minimal (assuming the two components use shared memory when running in separate protection domains). Therefore we can expect that the context-switching cost between two interacting components is, in a well-designed system, mostly due to the direct cost of a context switch.

In most cases, the direct cost of a context switch is not significant [FitzRoy-Dale and Kuz 2009, Li et al. 2007]. Nonetheless, a large amount of context switching tends to be symptomatic of other performance issues. The general solution to the problem of excessive context switching due to IPC between two components is to decouple the control flow between the two components, so that, to take an example from the componentised video player, the Decoder and the Client can both work independently.

4.2.2. Copying

sendfile(), FBufs, and protocol header optimisation all reduce the amount of copying performed in the system. Unnecessary data copying has an obvious performance impact: it consumes processor cycles and fills data caches. It also has some impact on other caches, such as instruction caches and the translation lookaside buffer, due to the execution of the copying code. There are three options to avoid copying:

1. Remove one component's involvement in the process. The sendfile() optimisation essentially relieves the application of any involvement in the task of sending a file. This automatically has the effect of removing extraneous copies, because the API responsible for the copying (i.e. the API comprising the POSIX read() and write() functions) is no longer required.

2. Use shared memory. FBufs takes this approach. Switching to shared memory means addressing considerations such as memory ownership and concurrent access.

3. Use better data structures. This is discussed in Section 4.2.4.

4.2.3. Overly-generic or inflexible APIs

Overly-generic APIs can result in multiple performance issues. Both Synthesis and Pebble aim to eliminate one of the most common: parameter validation and data lookup. They both achieve this by pre-validating the parameters and performing any necessary look-ups ahead of time. For example, Synthesis custom-compiles a read function with file-specific information compiled into the function.
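
In spirit, the specialised call replaces a generic entry point, which must validate its arguments and look up state on every invocation, with a routine that has those results bound in at open time. A hypothetical sketch (the names and structures are invented for illustration):

    #include <stddef.h>
    #include <sys/types.h>

    struct open_file { int readable; /* ... device state, offsets, ... */ };

    static struct open_file file_table[16];

    static struct open_file *lookup_fd(int fd)
    {
        return (fd >= 0 && fd < 16) ? &file_table[fd] : NULL;
    }

    static ssize_t do_read(struct open_file *f, void *buf, size_t len)
    {
        (void)f; (void)buf;
        return (ssize_t)len;            /* stand-in for the real device read */
    }

    /* Generic path: every call repeats the lookup and the validation. */
    static ssize_t generic_read(int fd, void *buf, size_t len)
    {
        struct open_file *f = lookup_fd(fd);
        if (f == NULL || !f->readable)
            return -1;
        return do_read(f, buf, len);
    }

    /* Specialised path, in the spirit of Synthesis and Pebble portals: the
     * lookup and validation were performed once, when the file was opened,
     * and the resulting state is bound directly into the read routine. */
    static ssize_t specialised_read(struct open_file *f, void *buf, size_t len)
    {
        return do_read(f, buf, len);
    }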

Both Synthesis and Pebble achieve their performance goals by providing a single, application-customised API. An alternative is simply to provide a large variety of APIs for the same task, each with differing performance benefits. The significant disadvantages to this approach were covered in Chapter 1. To summarise: multiple similar APIs make application development more confusing for the developer, and make supporting the API more difficult for the API writer.


4.2.4. Unsuitable data structures

Unsuitable data structures result in excess data manipulation. In the protocol header optimisation example, this resulted in data copying when a header was added to a packet. The network stack was simply using the wrong data structure for the job, so the data structure had to be changed. The two alternatives presented (sk_buffs with additional space at the beginning, or mbufs, which can be chained) both address the problem by using a more capable data structure.

4.2.5. Reprocessing data

Reprocessing of data is similar to data copying in terms of performance loss, but without the associated cache issues: the data are not copied, but portions of data are accessed multiple times, unnecessarily. The Integrated Layer Processing example in Section 4.1.3 avoids data reprocessing through architectural means, by performing multiple operations for each word when processing a network packet.

4.3. Remedies

Once a performance anti-pattern has been identified, an appropriate remedy can be applied. Remedies describe the actual changes that must be made to the system in order to effect an optimisation.

It is sometimes possible to optimise a component-based system without changing any component's implementation. Instead, changes are only made to the connections between components. This style of inter-component optimisation bypasses the potentially-complex task of verifying component source code, because the source code is not modified. This feature also makes inter-component optimisations easier to reason about.

There is a small and well-defined set of inter-component modifications. An inter-component modification changes the protection domains in which components reside; or modifies the component graph by adding, removing, or changing components. Only some of these modifications are optimisations; others may be used to enforce security or correctness requirements. For example, it is sometimes useful to place two components into separate protection domains. A typical reason to do this is to isolate a sensitive component from untrusted code. However, communication between the two components must now cross a protection-domain boundary, reducing performance.

In this section, the following set of component-system modifications are discussed as remedies:

1. Combine protection domains

2. Replace component or library

3. Interpose component


Inter-component optimisations are attractively simple, but not all optimisations can be expressed this way. Sometimes architecture optimisations require modifying a component's implementation. Component code is significantly more complicated than a component architecture, so correctly modifying code is a more significant challenge than correctly modifying a component architecture. This dissertation does not attempt to provide a general code-verification system. Instead, it focuses on a single relatively-safe form of code modification:

4. Modify component-to-component APIs

The important distinction between the first three remedies and this one is that the first three remedies are structural remedies: they do not rely on modification to component code, but instead focus on recognising and modifying the connections between components. The fourth remedy is a code remedy: it requires analysis, and, possibly, modification of source code. The details of this fourth intra-component optimisation, and a discussion of its relative safety, are provided in later chapters.

In general, the modifications required to implement a remedy are small. However, as described in the introduction, some types of optimisations must be verified. The exact nature of the verification depends on both the intent of the optimisation and the particular remedy being applied.

In this chapter, I distinguish between verification that requires knowledge of data in the system, data-sensitive verification, and verification that does not require knowledge of data, data-insensitive verification. An optimisation that only requires knowing that an application calls a particular API function can be verified using data-insensitive verification, because no knowledge of the API parameters, or the rest of the program's state, is required. If, however, performing the optimisation safely requires knowing, for example, that the first parameter is always a non-null pointer, then data-sensitive verification is required: the optimiser must prove a fact about the parameter before optimisation can take place. This distinction is covered in more detail in Chapter 5.
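
A small, hypothetical call site makes the distinction concrete (decode() here is an invented API function):

    struct frame { unsigned char *pixels; };

    extern int decode(struct frame *out, int flags);   /* the optimisable API call */

    void client_step(void)
    {
        struct frame f;
        decode(&f, 0);
    }

A data-insensitive check need only establish that the application contains a call to decode(). A data-sensitive check, such as proving that the first argument is never NULL, must reason about the values reaching the call: trivial here, where the argument is the address of a local variable, but in general it requires data-flow analysis.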

Each remedy is described below. For each description, an example of a system optimisation making use of that remedy is given. The verification conditions for the optimisation are also described, informally. A complete formalisation is given in the next chapter.

For the sake of clarity, the supplied examples are illustrated with reference to the running example of the componentised video player, shown with protection domains in Figure 4.9. However, each remedy's application in a real-world, less-idealised system, such as Android, is also discussed.

It is important to note that there is not a one-to-one correlation between optimisation remedies and performance anti-patterns: the examples of performance anti-patterns shown in this chapter could be addressed in various different ways. Currawong is, however, limited to the four remedies described above.

4.3.1. Remedy 1: Combining protection domains

A simple way to improve performance in a system involving two isolated components is to place them both into the same protection domain.


Figure 4.10 shows the video player with the video decoder (Decoder component) and the user interface (Client component) occupying the same protection domain. In Android or another Unix-based system, this scenario is analogous to replacing a separate decoding process with a shared library loaded by the Client. Combining protection domains is a way to reduce context switching and copying.

[Figure: component diagram — as Figure 4.9, but with the Client and Decoder components sharing a single dashed protection domain.]

Figure 4.10. Video player: client and decoder occupy the same protection domain

Combining protection domains is a very generic and potentially highly-dangerous remedy. The hardware-enforced memory protection boundary between the components is now removed, and a buggy or malicious component can now crash its communication partner by overwriting its data. For example, the Decoder component could overwrite the Client component's private data and cause it to crash or behave unexpectedly. Along similar lines, if the Client component had specific privileges (such as being able to communicate with the Display component), and if these privileges are managed on a per-protection-domain basis, the Decoder component could now also make use of those privileges. A component written with malicious intent could exploit any of these vulnerabilities for its own gain.

In general, if we wish to ensure that the new system has exactly the same safety properties as the old, the optimiser must guarantee complete noninterference between components in the new protection domain: a complex path-dependent optimisation. However, in some cases shortcuts can be taken. For example, if one of the combined components is written in a high-level language and is run on a virtual machine (such as Java), the system optimiser can choose to trust the correctness of the virtual machine running the component, and focus instead on verifying the behaviour of the high-level-language code.

4.3.2. Remedy 2: Replacing components or libraries

Another simple way to improve system performance is to replace a component (in a componentised system) or a library (in a noncomponentised system) with a different one which performs the same function, but is optimised in some desirable way. The extent to which this can be done depends on the structure of the existing code. Consider the protocol header optimisation case described in Section 4.1.4.


This optimisation removed a copying requirement on network buffers by replacing the data structure representing the buffers. If all operations on the buffers were performed through a small functional API, then replacing this API is a simple matter. If, however, operations on the buffers are performed directly, then any non-source-code-level optimisation may be infeasible. Component replacement is a very generic remedy and could conceivably be used to reduce the impact of any of the performance anti-patterns listed. However, it is particularly well-suited to handling of overly-generic or inflexible APIs (anti-pattern 3) or unsuitable data structures (anti-pattern 4). Another simple example of this sort of optimisation is adding hardware-specific support to a system: replacing the Decoder component in the video player example with one which supports hardware-accelerated decoding, for instance.

If the functionality of the replaced component is indeed exactly the same as that of the new component, then the verification conditions are very simple: only path-independent analysis is required. If, however, the new component imposes additional non-functional requirements on its clients, those requirements must be checked before the component can be said to be safe. For example, suppose the original Decoder and Client components communicated by copying data, but the new Decoder component uses shared memory. In this case, the Decoder may have additional requirements of Client—such as that of mutual exclusivity of access to the shared region. Here, the optimiser performs the data-sensitive verification that the Client component does not access the shared region when it is in use by the Decoder component.

4.3.3. Remedy 3: Component interposition

Component interposition involves adding another component to the component graph, connecting two or more components to the new component, and then placing the new component in the protection domain of an existing component. In terms of high-level modifications to the system, it is thus more complex than the previous two optimisations (combining protection domains, and replacing components). However, it is conceptually simple: the new component acts as an interface between two or more components. An example of component interposition is shown in Figure 4.11.

The figure illustrates the following optimisation: some videos played by the video player do not require decoding, in the sense that they are already in a form that may be sent directly to the output hardware. In this case, it may be acceptable to include a very simple “null” decoder in the protection domain of the Client component, which forwards data to the Decoder component in another protection domain in the cases where the data actually does require decoding. If the video does not require decoding, the Null component recognises this fact and simply returns the original data to the client: no protection-domain crossing is required.

The component interposition pattern is particularly well-suited to the “Overly-generic or inflexible APIs” anti-pattern (number 3).

The verification requirements for component interposition are simple (path-independent) if the newly-interposed component is trusted, that is, if the optimiser user decides that the interposed component will not interfere negatively with other components in its protection domain.


[Figure: component diagram — as Figure 4.9, but with a Null decoder component interposed in the Client's protection domain, offering the Codec interface and forwarding to the Decoder only when decoding is actually required.]

Figure 4.11. Component interposition

This scenario is not as unlikely as it might be for the general-case protection-domain-combining remedy described above. This is because it is expected that interposed components be used to translate from one API to another (remedying anti-pattern 3, “Inflexible APIs”) or otherwise perform some small, well-contained task. A smaller amount of code translates directly to decreased likelihood of bugs in that code.

4.3.4. Remedy 4: Modifying component-to-component APIs

This remedy involves modifying the way components communicate. In our component system, this is equivalent to changing a component connector. In Android, this is instead equivalent to causing the application to make different calls to the Android API package.

The simplest way to perform this modification is to use component interposition. That is, a component supporting both the old and new API is interposed between two communicating components. The new component translates API calls from the old API to the new, and back again. This scenario is illustrated in Figure 4.12 (interposition is also shown in Figure 3.1, system C, in the previous chapter). To accomplish interface interposition in Android, one would write a class that mimicked the old API, and rewrite the application code to make use of the new class.

Component interposition to perform API rewriting is tempting because it requires minimal modification to the components themselves and, as described above, can be verified using efficient data-insensitive techniques, as long as the interposed component is trusted.

Interposition is not, however, the most efficient solution. At the very least, additional function calls (to the interposed code) are required. More dramatically, simulating the old API in terms of the new API simply may not be the best solution. For example, the system in Figure 4.12 shows an interposed component, Shim, translating from the FileSystem API to the OldFS API.


[Figure: component diagram — a Shim component interposed in the Client's protection domain translates between the OldFS interface used by the Client and the FileSystem interface offered by the file system component.]

Figure 4.12. API modification via interposition

Imagine that the FileSystem API supports memory mapping of files, but the OldFS API does not. If the Client could take advantage of memory-mapped files, it may perform more efficiently. However, it cannot make use of memory mapping as long as it continues to use the OldFS API.

The more efficient alternative is for some portions of the Client code to support the new API. Doing so may be quite simple—the API evolution may simply involve additional parameters, or renamed functions. In the former case, the code modifications involved could simply supply default values. In the latter case, even less work is required: the function names could be identified within the code, and simply renamed in place (this is, to some degree, a simplification; more detail is given in Chapter 6). In these cases, the verification requirements remain simple and data-insensitive.
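
For the added-parameter case, the rewrite can be as simple as routing old-style calls through a thin wrapper that supplies a default value for the new argument (the function names below are invented for illustration):

    #include <stddef.h>
    #include <sys/types.h>

    /* New API: read takes an extra flags argument. */
    extern ssize_t fs_read_v2(int fd, void *buf, size_t len, unsigned flags);

    /* Call sites written against the old three-argument read can be redirected
     * here; the wrapper supplies a default for the parameter they do not know about. */
    static ssize_t fs_read_v1(int fd, void *buf, size_t len)
    {
        return fs_read_v2(fd, buf, len, 0 /* default flags */);
    }

The renamed-function case is simpler still: the old symbol name at each call site is replaced with the new one.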

However, more complicated API modification may require more complex, data-sensitive verification. For example, the memory mapping interface described above may only be supported if the client does not attempt to modify the memory-mapped file. Verifying this behaviour requires data-sensitive analysis.

4.4. Conclusions

This chapter analysed a large and wide-ranging set of optimisations. These were then generalised into a small set of general performance anti-patterns. The anti-patterns served as motivation for a number of optimisation remedies. For many optimisations, the actual system modification involved is relatively small—changing a component's protection domain, for example—but the verification effort to ensure the optimisation has not introduced bugs into the system may be non-trivial.


5. Design

Before you start [designing], it is a good idea to stop and think about what your computer can and cannot do.

– Computer Spacegames [Isaaman et al. 1982]

This chapter describes the design of Currawong, a tool to perform architecture optimisation. Given an application and a description of the optimisation, Currawong does two things: it finds suitable locations to apply an optimisation within the application, and then it applies the described optimisation at those locations.

There is no point in an optimisation system which cannot perform useful optimisations. In Compilers: Principles, Techniques, and Tools [Aho et al. 1986], Aho, Sethi, and Ullman identify three characteristics of optimisations:

1. They must preserve the meaning of programs, in the sense that the optimised program must give the same output for all inputs as the unoptimised program;

2. They must, on average, speed up programs by a measurable amount; and

3. They must be worth the effort, in the simple sense that the benefit gained from the optimisations offsets the time invested in implementing them.

Adding domain-specific optimisations to general-purpose compilers fails criterion 3 above: it is not worth the effort. Robison makes a convincing economic argument: it is simply not economically viable for compiler writers to spend time writing and testing optimisations that only benefit a small percentage of their customers [Robison 2001]. He describes the general problem with the memorable statement that “Compile-time program optimizations are similar to poetry: more are written than are actually published in commercial compilers”.

Architecture optimisation's answer to this economic argument follows the Active Libraries model discussed in Chapter 2: give the task of writing appropriate optimisations to the authors of the optimisable component. This strategy encourages those who will directly benefit from the optimised code to write their own optimisations, and thus directly addresses Robison's economic argument. However, the more general concern remains—if architecture optimisations are too difficult to write, or too difficult to apply, they will not be used. A goal of Currawong's design, then, was to make optimisation specifications both easy to write and easy to understand.


5.1. Currawong overview

Currawong locates anti-patterns, and applies remedies. Specifically, Currawong locates the anti-patterns summarised in Section 4.2. Once an anti-pattern has been found, Currawong then applies one of the remedies described in Section 4.3.

Two additional criteria guided Currawong's design: ability to work without application source code, and support for multiple languages. In Chapter 1, I argued that architecture optimisation is most useful when the architecture optimiser does not require source code for the application being optimised, because the optimisation can then be used even if the application developer is unwilling, or unable, to apply it. Support for binary-only optimisation is, therefore, a Currawong design criterion.

Although many of the transformation systems described in Chapter 2 are capable of non-trivial static analysis, most do not optimise binary-only systems. Even fewer recognise multiple implementation languages. This is not a failing of the related work, but it reflects a different focus: either on enriching a source-level application programming interface (API) for application programmers (as championed by the active libraries approach), or on helping programmers keep up-to-date with minor API changes (as used in refactoring-based approaches). In Chapter 1 I noted that in many cases the programmer will not, in fact, benefit significantly from optimising his or her application, but the end user will. Currawong, therefore, differs from the active-library and refactoring approaches in that it attempts to support the system designer or end user rather than the programmer.

In Section 3.2.1, I argued that it was common for applications to be written in more than one programming language and that, consequently, architecture optimisation should support multiple languages also. Using a single architecture optimisation implementation capable of supporting multiple languages means that optimisations which span multiple languages could, in theory, be implemented. For example, one could apply an optimisation to an Android application, written in Java, which makes use of an extension library written in C. Unfortunately, time constraints meant that no examples of this optimisation type could be included in this dissertation. Currawong supports multiple languages, but cross-language optimisation was not implemented.

In summary, Currawong should:

1. Recognise anti-patterns;

2. Rewrite applications to apply remedies;

3. Perform its work without the target’s source code; and

4. Support multiple languages.

Architecture optimisation is a program transformation technique. Therefore Currawong must, at the very least, behave like a program transformer. That is, it should accept a program as input; perform some transformation on that program; and produce a transformed program as output. Currawong's high-level design extends this simple model by adopting features from the program transformers surveyed in Chapter 2.


[Figure: flow diagram with six numbered steps — (1) parse application, (2) parse specification, (3) matching, (4) transform application, (5) repeat matching and transformation while they succeed, (6) output the transformed application.]

Figure 5.1. Currawong overview

The major components of Currawong are shown in Figure 5.1. The input to Currawong is an application and an optimisation specification. The output is an application that has had the optimisation described by the specification applied to it as many times as possible. It may not have been possible to apply the optimisation even once, in which case the output of Currawong is an application identical to the one supplied as input.

Currawong consists of four major stages. Each of these stages corresponds with one or two numbered steps in Figure 5.1.

Parsing (steps 1 and 2 in Figure 5.1): The application is parsed and Currawong generates an abstraction. Currawong also loads and parses the optimisation specification. The result of this process is a parsed optimisation along with an application representation.

Matching (3): The optimisation specification is used in conjunction with the application abstraction to locate a portion of the application code to optimise. The optimisation specifies one or more match criteria which determine whether or not applying an optimisation at a given point is the correct thing to do. Currawong verifies that these match criteria actually apply.

Transformation (4 and 5): Currawong transforms the application according to the optimisation specification. The matching and transformation steps repeat until no more optimisations can be applied.

Output (6): Finally, Currawong writes the new application to disk, if any transformations were applied.

The optimisation examples in this chapter focus on performing optimisations for Java and C. These are the two languages supported by the implementation of Currawong as described in the next chapter (Chapter 6). They were selected because they are the major languages used for Android applications and for CAmkES applications, respectively.

5.1.1. Use of source code

A key design goal for Currawong is that it not rely on the target application's source code being available when optimising. The major advantage of this approach is that optimisation can be decoupled from program development—this advantage is described in detail above.


However, optimisers typically rely on information only present in the source code in order to perform safe code optimisation. What is the impact, if any, upon the expressive power of Currawong optimisations if source code is not available? Further, what about source code for other parts of the system, such as the system libraries? To understand the trade-offs, it is necessary to understand the information that is gained or lost when source code is compiled and linked. The discussion below focuses on C code first, and then discusses the differences between C and Java.

    #define EFFECT_NONE    0
    #define EFFECT_BLUR    1
    #define EFFECT_SHARPEN 2

    struct video_frame {
        int depth;
        int length;
        uint8_t *data;
    };

    struct video_renderer {
        int token;
    };

    int return_error(void);
    int send_to_frame_buffer(struct video_frame *data);

    int render(struct video_renderer *self, struct video_frame *data, int effect)
    {
        switch (effect) {
        case EFFECT_NONE:
            break;
        case EFFECT_BLUR:
            render_blur(self, data);
            break;
        case EFFECT_SHARPEN:
            render_sharpen(self, data);
            break;
        default:
            return -1;
        }

        send_to_frame_buffer(data);

        return 0;
    }

Figure 5.2. A small component (bold portions represent information preserved by the compiler)

A lot of information is discarded during the compilation process. Specifically, type information, control flow, language features, and meta-information are lost. The most important loss for programs like Currawong is that of type information. Figure 5.2 shows a simple component to apply visual effects to a frame of video.


The bolded portions of code in that figure represent information preserved after compilation. As the figure shows, all type information is discarded: both data types, such as the video frame structure, and function type information, such as the parameters and return type of the render function.

Types are not the only detail discarded by the compiler. Language features disappear as well. In C, pre-processor macros are fully expanded, with the result that symbolic names (such as EFFECT_BLUR in Figure 5.2) are replaced with numbers. Control flow and program structure are obscured, either due to optimisation passes in the compiler, or simply as a result of reimplementing a high-level feature in binary code. In this example, the switch statement is replaced by a series of comparisons and jumps, and the send_to_frame_buffer call is duplicated. The compiler may choose to inline specific functions, obscuring the program structure.
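
To make this concrete, the compiled form of render in Figure 5.2 might have a shape roughly like the following (a hand-written sketch of typical compiler output, written as C for readability and reusing the declarations from the figure; the real output would of course be machine code):

    int render(struct video_renderer *self, struct video_frame *data, int effect)
    {
        if (effect == 1)                    /* EFFECT_BLUR, now just a number    */
            goto blur;
        if (effect == 2)                    /* EFFECT_SHARPEN, now just a number */
            goto sharpen;
        if (effect != 0)
            return -1;
        send_to_frame_buffer(data);         /* call duplicated along each path   */
        return 0;
    blur:
        render_blur(self, data);
        send_to_frame_buffer(data);
        return 0;
    sharpen:
        render_sharpen(self, data);
        send_to_frame_buffer(data);
        return 0;
    }

The symbolic effect names and the switch structure are gone, but the calls to external functions such as send_to_frame_buffer remain identifiable by name.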

Program meta-information is also discarded by the compiler. Variable names are removed, as are the names of C functions which do not have external linkage. Comments are removed, and the arrangement of code into files and directories is obscured.

However, the compilation and linking process also adds information. The most important gain for Currawong is that the overall program structure is revealed. It is difficult, in general, to determine exactly which functions are used within a program from its source code, without simulating the entire compilation and linking process. The problem can be appreciated by considering a program which implements multiple versions of the same function, with the linked version being selected by arguments passed to make. Using the application binary eliminates a tedious and error-prone stage required of source-based optimisers.

Additionally, timing information is more accessible in a binary than it is from source code: it is possible to compute best- and worst-case execution times from binary code. Doing the same thing using source code is much harder, as one must essentially re-implement the compiler so as to know the number of CPU clock cycles that each high-level instruction would take to execute.

Two properties unique to system software architecture optimisation mitigate the impact of information discarded by the compilation and linking process. The first property is that Currawong optimises around API boundaries. In terms of a compiled application, this amounts to performing optimisations around function calls made by an application, often to functions in an external library. Function calls to external library code are usually easy to detect in binary code (a well-known exception to this rule is discussed below). Furthermore, function calls to external libraries must preserve the name of the called function in the application binary, for linking purposes. Consequently, Currawong can identify both the name of the API function being called, and the point in code at which it is called.

The other property which improves Currawong's optimisation power is related to the expected author of high-level optimisations. For system software architecture optimisation, the expected optimisation writer is not the author of the optimisation tool, but is instead the system designer, hardware manufacturer, or other person familiar with the particular API being optimised. Because the optimisation writer is familiar with the API, she may bring domain-specific knowledge of that API to Currawong when writing optimisations.


This knowledge comes both directly, by including parts of the API in a Currawong optimisation specification, and indirectly, through knowledge of the behaviour of a particular API.

5.1.2. Multiple-API limitations

As described above, Currawong does not infer optimisations by itself. Instead, it uses optimisation descriptions provided by an optimisation writer—either the system designer, or the author of an API for a set of optimisable components. This requirement extends the expressive power of Currawong significantly, because optimisations can make use of domain-specific knowledge known to the optimisation writer, but not easily discoverable from the code. However, reliance on an external source for optimisation specification makes some categories of optimisation infeasible.

To appreciate the limitations of the approach, consider the componentised video player first shown in Figure 3.4. This system comprises four components: decoder, display, client, and file system. In this system, data flows from the file system, to the client, to the decoder, and finally to the display. It may be useful to implement a whole-system optimisation that involves multiple components. For example, the decoder component may wish to modify the rate at which it receives data from the file system. However, doing so involves optimising two interfaces: the first between decoder and client, and the second between client and file system. At this point, we have an optimisation problem that involves three components. The optimisation writer would have to have knowledge of both the decoder-to-client and the client-to-file-system interfaces in order to write an optimisation specification. This is not always realistic. In general, optimisations which span multiple interfaces are not amenable to the single-external-author approach chosen by Currawong. Overcoming this multiple-API limitation in a reusable way is an interesting but difficult problem, and is outside the scope of this dissertation.

Nonetheless, the multiple-API problem is not necessarily as bad as it may appear. In some systems it may not even arise, because the only relevant API is known in its entirety to the optimiser writer. This is the case for Android-based phones, where a large API is maintained by a single vendor. In systems where it does arise, some optimisations can still be performed by taking advantage of the concept of connectors. The advantage of connectors with respect to optimisation was described in Chapter 3. In this context, connectors essentially provide a single system-wide API: as long as components communicate using known connectors, it doesn't matter if the optimisation writer is not familiar with the particular components.

5.1.3. A domain-specific programming language

Most of Currawong's processes are performed as a result of evaluating an optimisation specification. The optimisation specification determines which parts of an application should be examined, how they should be checked, and how the application should be transformed. Currawong provides a declarative programming language: optimisations written for Currawong make use of an API provided by Currawong, concerned with application input, checking, transformation, and output.



Currawong's API must provide everything that an optimisation specification may require to analyse and transform an application. This chapter discusses the design in object-oriented terms. The major portion of the API is provided by a single object, the Application object, which is made available to the optimisation specification. This object provides a representation of the application being optimised, supports checking of an optimisation (by providing support for recognising anti-patterns), and is capable of transforming the application.

The rest of this chapter discusses the individual portions of the Application object: code representation (Section 5.3), matching and checking (Section 5.4), and transformation (Section 5.5). After the API has been established, the specification language design is motivated (Section 5.6) and described (Section 5.7).

5.2. Motivating examples

To give context to the design decisions made in this chapter, five complete examples are presented below. Each example demonstrates an anti-pattern and remedy from the set introduced in Chapter 4. These examples are revisited in more detail in the evaluation in Chapter 7.

The first three examples demonstrate architecture optimisation of compiled C code and refer to the componentised video player example for the CAmkES demonstration component system, described in Figure 3.4.

The remaining examples demonstrate architecture optimisation of byte-compiled Java code and refer to the Android platform.

5.2.1. Example 1: CAmkES same-domain decoder

[Figure: (A) Client and Decoder in separate protection domains, communicating over an IPC connector exporting com.example.Decoder.Codec; (B) the same components in a single protection domain, communicating over a direct connector.]

Figure 5.3. Protection domain merging in the componentised video player


This domain-specific componentised video player optimisation moves the Decoder component into the same protection domain as the Client component. This puts Decoder and Client into the same address space, in monolithic operating system terms, and has the effect of reducing the communication overhead between the two components. Figure 5.3 shows the process: the original system portion is shown in section A of the figure, and the rewritten system is shown in section B. In these diagrams, as per previous examples, large, rounded rectangles represent components; a dashed outline represents a protection domain; small squares represent shared memory; and the circle and semicircle icon represents communication via a functional interface.

In order to perform this optimisation, Currawong must identify the two components (Decoder and Client), verify that the components are currently not in the same protection domain, and then place them both in the same protection domain. Communicating components in the same protection domain will probably use a different connector (the CAmkES feature allowing component system designers to explicitly state the method of communication between components) compared with communicating components in separate protection domains, so Currawong must take this into account when changing protection domains. In summary, the optimisation must:

1. Identify components by name;

2. Identify protection domains of components;

3. Add and remove components from protection domains; and

4. Ensure that an appropriate connector is used.

5.2.2. Example 2: CAmkES RGB conversion

Component replacement was identified in Chapter 4 as a general-purpose remedy for performance problems. For this example, we imagine that the video display hardware used by the video player supports two formats for display of information: RGB, a sixteen-bits-per-pixel format in which five bits represent the red component of a pixel, six bits represent the green component, and the final five represent blue; and YUV, an eight-bits-per-pixel format in which four bits represent the overall brightness of the pixel, and the remaining four bits represent the pixel's colour.

In this system, the native image format coming from the Decoder component is YUV. Before being transmitted to the client, the image is translated to RGB format. Performing such a transformation involves computing the result of a simple mathematical function for every pixel of the image. The client then sends the new RGB image to the display. If, however, it can be shown that the Client component does not care which format is being used (i.e., the Client component does not actually manipulate the image data), the Decoder component can be replaced with a component which produces output in the YUV format. Doing so would eliminate the memory overhead of processing each pixel, as well as the computational overhead of performing a YUV-to-RGB conversion.

The optimisation is shown graphically in Figure 5.4. The original player, shown in section A of the figure, is transformed to the player shown in section B.


[Figure: (A) the original player, with Client connected to Decoder via a Codec interface and to Display via a FrameBuffer interface; (B) the Decoder replaced by a DecoderYUV component, with the Client-to-Display connection changed to a FrameBufferYUV interface.]

Figure 5.4. Component replacement in the componentised video player

The Decoder component is replaced with a decoder which natively produces YUV, and the Client's connection to the Display is modified in order to identify the changed data format to the Display component. There is one major task that Currawong must perform to implement this optimisation:

1. Determine whether a particular region of memory is accessed by a given portion of code—in this case, whether the decoded image is accessed by the Client component.

5.2.3. Example 3: CAmkES protocol translation

The final CAmkES-oriented remedy, component interposition, is demonstrated through an improvement to the file-reading portion of the video player. The optimisation is represented graphically in Figure 5.5.

[Figure: (A) Client connected directly to the FileSystem component via a FileSystem interface; (B) an SHMFS component interposed in the Client's protection domain between the Client and the FileSystem component.]

Figure 5.5. Component interposition in the componentised video player

In the original system, compressed video data are read from the FileSystem component at the request of the Client component. The data are copied to a shared memory buffer for use by the client. The Client, however, uses a standard POSIX-style read() call to get at the data, which means that the client supplies its own buffer. The data in the shared memory area must therefore be copied, again, to the client. Figure 5.6, section A, illustrates this process.


[Figure: (A) the Client's read() call triggers IPC through the connector; the FileSystem copies data to the shared area, and the connector copies it again into the Client's buffer; (B) the connector instead arranges for the Client's buffer itself to be shared, so the FileSystem's copy into the shared area is the only copy.]

Figure 5.6. File system memory-sharing optimisation

A better approach is to be more like the process shown in section B of Figure 5.6, in which the buffer supplied to the Client's read call is shared with the FileSystem component. The simple approach is unsafe, because the starting and ending addresses of shared regions must be multiples of the system's page size (4096 bytes on the systems evaluated in this dissertation). If this is not the case, we must either share more memory than necessary, which is a security risk, or less than necessary, which would violate assumptions made by the component implementor (and would, consequently, almost certainly crash the component).

As with the second optimisation, Currawong must perform some basic static analysis here. It must (a sketch of the alignment condition follows the list):

1. Determine whether the memory used by the Client component is both page-aligned and equal in size to some multiple of the system page size;

2. Replace the connector between two components with a shared-memory connector.
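
The condition itself is simple to state (a sketch, assuming the 4096-byte page size of the evaluation platforms; Currawong must establish it statically, but the run-time form makes the requirement clear):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* A buffer may be shared in place only if it starts on a page boundary and
     * covers a whole number of pages.  Otherwise sharing would either expose
     * unrelated data residing in the same pages, or share less memory than the
     * FileSystem component expects. */
    static bool shareable_without_copy(const void *buf, size_t len)
    {
        return ((uintptr_t)buf % PAGE_SIZE) == 0
            && len > 0
            && (len % PAGE_SIZE) == 0;
    }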

5.2.4. Example 4: Android touch events

As described in Chapter 3, many services are provided to Android applications via IPC to a privileged process called the System Server. Among the services provided is notification of touch events. These are generated when the application user interacts with the touch-sensitive display by pressing or dragging one or more fingers on the display. Figure 5.7, section A, shows the activity performed by the standard system when a touch event is to be transmitted to an application.

When the user is interacting with the display, a lot of data is generated (compared with other input methods): the display hardware is capable of generating over 100 events per second, but in recent versions of Android the delivery rate is limited to 35 events per second [Hills 2009]. IPC in Android uses an Android-specific mechanism called Binder. Binder IPC is notoriously slow [Hills 2009], which both reduces application responsiveness (by a small amount) and contributes to overall CPU usage. On mobile devices, increased CPU usage results in decreased battery life, so reducing the performance overhead of delivering touch events could increase the per-charge longevity of the device.

For this example, as well as the next Android example, Currawong can rely on structural features of the application, without having to resort to code analysis, in order to implement an optimisation.


[Figure: (A) before optimisation, the System Server waits for a touch event, marshals it for IPC and sends it to the active application, whose IPC thread enqueues it for the GUI thread to handle; (B) after optimisation, an event thread in the application enqueues touch events directly for the GUI thread.]

Figure 5.7. Touch events optimisation before (A) and after (B)

It must:

1. Determine that an application uses touch events; and

2. Add code to the application to support custom decoding of touch events.

5.2.5. Example 5: Android redraw

The second architecture optimisation applied to Android deals with the way applications draw graphics to the display. Some applications, particularly games, require high-frequency updates to large portions of the display. Android offers a variety of methods to accomplish this, and the best technique varies with the nature of the application (for example, certain kinds of 2D games are best served by the 3D API).

The simplest and most general-purpose way to do arbitrary 2D drawing in Android is to use the Surface technique. To use this approach, the application declares a class which inherits from the Surface API class. This class declares a method, named onDraw, which the application class can override. The application performs its drawing when onDraw is called. When the application wishes to update its display, it calls the invalidate method, also defined in the Surface class. The advantage of the Surface approach is the simplicity of the API: the two functions described are literally all that is necessary to manage the display. The disadvantage is performance: the Surface class does not allocate memory from the pool directly accessible to the display, so it must be copied to display-accessible memory on each redraw. This is expensive. The whole process is shown in Figure 5.8, section A.

Figure 5.8. Redraw pathway (A) before optimisation and (B) after.
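
As a rough illustration of the Surface drawing style, the sketch below assumes (as the template in Figure 5.9 later suggests) that the technique amounts to subclassing android.view.View. The class and helper names are hypothetical; onDraw and invalidate are the standard methods the text refers to.

    // Hedged sketch of the simple "Surface" drawing style described above.
    class GameView extends android.view.View {
        GameView(android.content.Context context) {
            super(context);
        }

        @Override
        protected void onDraw(android.graphics.Canvas canvas) {
            drawFrame(canvas);  // application-specific drawing
            invalidate();       // schedule the next redraw
        }

        private void drawFrame(android.graphics.Canvas canvas) {
            // ... draw the current frame onto the supplied canvas ...
        }
    }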

A slightly more complex approach is to use the SurfaceView technique. This technique centres around a different class, SurfaceView, which directly allocates display-accessible memory. Applications using this technique are slightly more complex than those using the Surface class (SurfaceView is best implemented using another thread, for example), but they are more efficient, because they avoid the copy associated with Surface objects.

The difference between Surface and SurfaceView objects is technical, non-obvious, and somewhat hardware-specific. Consequently, some Android games which should be using the SurfaceView technique use the Surface technique instead.

The redraw optimisation rewrites applications using the Surface technique to use what is effectively the SurfaceView technique. A new package is added to the application which implements the requirements for interacting with a SurfaceView. This code starts a new thread in the application. That new thread calls onDraw, which allows the application to behave as if it were still using the Surface method. Thus the changes required to the application are minimal. The optimisation is shown in graphical form in Figure 5.8, section B.
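
A minimal sketch of the kind of helper thread such an added package could contain is shown below. This is an assumption-laden illustration, not the code Currawong injects: RenderThread and FrameDrawer are hypothetical names, while lockCanvas and unlockCanvasAndPost are the standard SurfaceHolder calls for drawing into a SurfaceView surface.

    // Hedged sketch of an added render thread that drives the application's
    // existing drawing code through a SurfaceView surface.
    class RenderThread extends Thread {
        // Implemented by the rewritten application class; in the real
        // optimisation this would dispatch to the original onDraw code.
        interface FrameDrawer {
            void drawFrame(android.graphics.Canvas canvas);
        }

        private final android.view.SurfaceHolder holder;
        private final FrameDrawer drawer;
        private volatile boolean running = true;

        RenderThread(android.view.SurfaceHolder holder, FrameDrawer drawer) {
            this.holder = holder;
            this.drawer = drawer;
        }

        void shutdown() {
            running = false;
        }

        @Override
        public void run() {
            while (running) {
                android.graphics.Canvas canvas = holder.lockCanvas();
                if (canvas == null) {
                    continue;  // surface not ready yet
                }
                try {
                    drawer.drawFrame(canvas);  // the application's drawing code
                } finally {
                    holder.unlockCanvasAndPost(canvas);
                }
            }
        }
    }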

As with the previous Android optimisation, this optimisation can be performed by analysis of structural features of the application. Currawong must:

1. Determine that an application is using the slower drawing style; and

2. Add code to the application to use the faster drawing style.

5.2.6. Summary of example requirements

The examples were chosen as representative implementations of the design remedies outlined in Section 4.3. Recall that these remedies are:


1. Combine protection domains;

2. Replace component or library;

3. Interpose component; and

4. Modify component-to-component APIs.

The first three examples (the CAmkES examples) are remedied, respectively, by the first three remedies in the above list. The two Android examples are both remedied by the “Modify component-to-component APIs” remedy.

5.3. Providing an application representation

All remedies require some way to reference portions of the application. Recall that the four remedies may be divided into structural remedies and code remedies, where the former category makes use of structural features of the application and its component system, and the latter category also relies on code analysis and modification. Currawong should therefore provide a way to examine and modify application structure (for the structural remedies), as well as a way to reference portions of an application as a starting-point for further code analysis (for code remedies).

In fact, the way applications are represented is mostly kept hidden from the optimisation specification. Instead, a high-level approach was chosen: a portion of the application is matched, and then this match is transformed. The specifics of exactly how the application is stored in memory, and even the method by which it is analysed during the checking stage, are details of the implementation.

class _ extends View {
    protected void onDraw(Canvas _) { }
}

Figure 5.9. An example template

Because Currawong acts on binary code rather than source code, this choice—to use a high-level representation—entails some process for converting binary code to something equivalent to application source code. Currawong does this in different ways depending upon the language.

For Java applications, Currawong uses open-source tools to convert the compiled bytecode to and from an assembly-language representation. This process is discussed in more detail in Section 6.5.1. Briefly, however, all structural information is retained during the Java compilation process, so the original division of a Java-based application into modules, classes, inner classes, methods, and fields can be perfectly replicated in Currawong’s application representation. Each method is stored along with an assembly-language representation of its implementation. This code is supplied as input to an instruction-level simulator for data-flow search, discussed in more detail in the implementation chapter, Section 6.5.3.


For C applications, Currawong cannot perfectly replicate the structure of the original code, because it is not preserved by the compiler. Instead, Currawong extracts as much information as possible from the symbol information present in all executable programs. More information is presented in Section 6.6.1.

Optimisation specifications do not refer to application code directly. Instead, they refer to a matched portion. Application matching is achieved by presenting a template to the application representation. This results in the generation of zero or more Match objects, which represent portions of the application matching the template.

A sample template is shown in Figure 5.9. In fact, this is a simplified portion of a Java template from Example 5: Android redraw. It specifies a match on a class which inherits from (“extends”) View. This class must contain a method called onDraw of type void, which accepts a single Canvas object as a parameter. Note, however, that neither the name of the class, nor the name of the Canvas parameter, is specified in the template: instead, the special variable _ is used, to indicate that it does not matter what actually occupies that syntactic position. In other words, classes will match this template regardless of what they are called (if they match the other criteria). The detailed design of Currawong templates is described in Section 5.7.2, because the design relies on the types of checking that Currawong can perform.

The advantage of the templating approach is that complicated specifications (as seen above: matching a function of a certain name, within a named class, taking a particular type of parameter, and so on) can be expressed both concisely and readably.

5.4. Matching and checking

Before it can optimise an application, an architecture optimiser must recognise a performance problem. Chapter 4 codified five generic types of performance problem as performance anti-patterns: context switching, copying, overly-generic or inflexible APIs, unsuitable data structures, and reprocessing of data. This section discusses how these anti-patterns might be specified and identified.

When designing a program that will recognise anti-pattern implementations, the obvious strategy is to create a program that simply recognises each anti-pattern. Such a program would detect any instance of context switching, copying, and so on. This is a good solution, perhaps, for a specialised error-detecting program, but it is not ideal for an architectural optimiser: for reasons discussed in more detail below, recognising such generic patterns is both difficult and slow. Even if generic anti-patterns could be recognised efficiently, it is difficult to imagine what could be done with them. The logical thing to do would be to apply an anti-pattern remedy, but anti-pattern remedies are very domain-specific and tend to be the result of human ingenuity rather than mechanical application of a template. Therefore, humans must be involved in the entire process of anti-pattern recognition and remedy application. Currawong allows optimisation writers to direct the optimisation process through the use of specification.


5.4.1. The importance of specification

A key idea of architecture optimisation is to make use of domain-specific knowledge to reduce analysis requirements. In Currawong, domain-specific knowledge is expressed via a specification language. The importance of a specification language is described here by first considering the alternative.

Figure 5.10. Unnecessary data manipulation in a video player (API calls have a double border)

Figure 5.10 shows the activity between three components in the componentised video player system (this is a portion of the complete system which was introduced in Figure 3.4). In this example, the Client calls decode_frame, which is part of the API provided by the Decoder component. Frames are decoded into the YUV colour space, but converted to the more programmer-friendly RGB colour space before being returned to the Client. The Client then calls draw_frame, which is part of the Display component’s API. However, the Display component requires data in the YUV colour space, so the newly-decoded data is converted again before being displayed. This is an example either of API mis-use, or of a poorly-designed API.
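
The client side of this interaction might look something like the following sketch. The interfaces and signatures are assumed purely for illustration; only the decode_frame / draw_frame call sequence comes from the figure.

    // Hypothetical client code exhibiting the data-reprocessing pattern of
    // Figure 5.10: decode_frame returns RGB, draw_frame converts it back.
    class VideoClient {
        interface Frame { }
        interface Decoder { Frame decode_frame(byte[] compressed); }
        interface Display { void draw_frame(Frame frame); }

        private final Decoder decoder;
        private final Display display;

        VideoClient(Decoder decoder, Display display) {
            this.decoder = decoder;
            this.display = display;
        }

        void showFrame(byte[] compressed) {
            Frame rgbFrame = decoder.decode_frame(compressed);  // decoded to YUV, converted to RGB
            display.draw_frame(rgbFrame);                       // converted back to YUV internally
        }
    }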

This figure describes an example of a data-reprocessing anti-pattern of the kind discussed in Chapter 4. Suppose we were to attempt to locate this type of problem automatically, without any domain-specific knowledge of the API provided by the Decoder and Display components. Doing so would be a difficult problem. At the very least, it would require pointer analysis of the entire application—a prospect that is at best time-consuming, and at worst impossible without source code (because type information is lost).

In fact, in this particular example, the analysis requirement doesn’t stop at the boundaries of the Client component, but continues into the Decoder and Display components, as the Client makes calls to those components. An analogy could be drawn between this component system and a standard monolithic system in which Client would represent the application, and Decoder and Display represent libraries used by the application. In the worst case, the generic analyser would have to perform a static analysis of all code in the system to detect this type of problem.


The example of Figure 5.10 shows that a generic anti-pattern detector would be difficult to write, but even if such detectors were easy to write and executed quickly, a generic anti-pattern detector would be difficult to use effectively. This is because finding anti-patterns is only half of Currawong’s task. The other half of the work involves creating a remedy for the anti-pattern and applying it. However, all of the optimisations discussed in Chapter 4 are domain-specific remedies, custom-written for specific problems. None of them could have been derived automatically from a set of generic rules. For example, FBufs (discussed in Section 4.1.5) is an optimisation which reduces context switching and copying overhead. However, coming up with FBufs required domain-specific knowledge and ingenuity. It is not something that could have been produced mechanically from a program written to apply simple rules (such as “reduce copying overhead”).

Currawong’s solution to this dilemma is to incorporate the process that domain experts already apply when implementing optimisations in code: to recognise the characteristics of a particular anti-pattern, rather than recognising the anti-pattern itself. The optimisation writer (who is, presumably, a domain expert) specifies the characteristics of the anti-pattern using a specification language.

Custom specification means that optimisation writers can make use of their own domain-specific knowledge of the system when writing optimisation specifications. Exploiting this knowledge can significantly reduce the amount of analysis required by Currawong.

The benefit of specification is best illustrated by referring again to Figure 5.10. Automatically analysing the system to detect the data reprocessing without any API-specific knowledge would be very difficult. It seems likely that the analyser would have to determine that yuv_to_rgb and rgb_to_yuv are inverses, for example.

If, however, we assume that this performance issue is known to the author of the Decoder and Display components, the optimisation author can write an architectural optimisation which detects usage of the Decoder and Display API functions decode_frame and draw_frame, respectively. Safely detecting that this optimisation is possible is then reduced to determining that the Client calls the appropriate API functions, and ensuring that draw_frame is always called with the output of decode_frame. By incorporating domain-specific knowledge of the Decoder and Display APIs within a specification, the optimisation writer significantly simplifies the optimisation process. In the nomenclature of the previous chapter, specification has significantly reduced the data-sensitive requirements of the matching for this example.

5.4.2. Recognising anti-patterns with Currawong

The complexity of the matching and checking being performed depends on the nature of the program transformer. Refactoring engines, as described in Section 2.3.1, simply examine the structure of the code. Active library implementations, however, must do more work to ensure that a portion of code is suitable. Broadway, an active library implementation discussed in Section 2.3.2, performs data-flow analysis in addition to the syntactic matching performed by refactoring engines.

Some of the surveyed program transformers, then, implicitly perform several kinds of checking. In these systems, however, the checking being performed tends to be intertwined with the rest of the system: Broadway, for example, is designed around data-flow analysis of expressions related to function arguments, and this design decision is reflected in the syntax of Broadway’s specification language.

In Currawong, one may wish to only apply an optimisation to a particular class, or to a particular type of application-to-system interface. This desire would result in a specification that says something about the structure of the application—“only apply the optimisation to portions of the application that call these functions, or that implement this class”, for example. However, the optimisation writer may also wish to make control-flow-sensitive statements about where to apply optimisations, such as “only apply this optimisation if function A has been called before function B”. The criteria for applying the optimisation may even be more complex, and may include data-flow-sensitive statements, such as “only apply this optimisation if the return value of function A is passed as the first parameter to function B”.

In Currawong terms, all these criteria—structural, control-flow, and data-flow—are called matches, and the overall process of matching and checking is what ultimately results in an optimisation being applied to an application. Matching is shown as step 3 in Figure 5.1. Currawong’s design attempts to strike a balance between keeping these match types separable, and making them easy to express and use. The first type of match, structure search, is supported by the syntax of Currawong’s specification language (complete details are given in Section 5.6). The second type, control-flow search, as well as the third type, data-flow search, are explicitly separated from structure search, in order to improve specification readability.

The three match techniques are discussed below. The previous chapter introduced the concept of data-insensitive and data-sensitive checking. In these terms, the first two methods are data-insensitive, and the third method is data-sensitive.

Structure search

Structure search refers to locating a portion of code by reference to its structure, that is, by reference to the hierarchical division into modules, classes, and functions common to object-oriented programming. Structure search refers to definitions (such as “this is a class named X”) and references (such as “create an instance of class X”). Because architectural optimisation deals with API usage, it is far more common for a structure search to indicate references, rather than definitions.

Normally, the terms class, function, and module refer to source code. However, compiled code (and bytecode) tends to retain references to its code structure as (essentially) a side effect of support for run-time features like dynamic linking, debugging, or reflection. It is, therefore, both meaningful and extremely convenient to refer to familiar structural features, even in compiled or tokenised code.

The following are examples of architectural optimisations which make use of structure search:

1. In the Decoder class, rename the function decode_frame to old_decode_frame.


2. Replace all instantiations of the form com.example.player.Decoder() with instantiations of the form com.example.player2.LegacyDecoder()

3. Replace all calls to com.example.player.Decoder.decode_frame with calls to com.example.player.Decoder.decode_frame_toyuv.

The matching portion of the above examples resembles the type of activities performed by refactorings. This should not be surprising, since both activities refer to code structure (Section 2.3.1 defines refactorings as structural transformation). In fact, both the matching and transformation portions of refactorings are structural—structure search forms the match portion of a refactoring.

Despite the simplicity of the concept, structure search in Java and in component systems has interesting properties that make it more powerful than might be expected. One of these properties is the ability to identify unique portions of code. Consider example 3 above: the Decoder class is referred to by its fully qualified name of com.example.player.Decoder. Following Java convention, this refers to a class named Decoder, in the package named com.example.player. Since Java package names are, by convention, globally unique, this is sufficient to uniquely identify the Decoder class. If the optimisation writer is also the author of the class, as is intended, then a positive identification of the class is all that is needed to make assumptions about the behaviour of the class and, thus, to write optimisation specifications involving that class. Without the ability to uniquely identify code portions, the optimisation writer would have to identify code by its behaviour, a much less straightforward process. This is another example of Currawong’s emphasis on reducing analysis requirements by replacing data-sensitive checks with data-insensitive checks.

The optimisation problem introduced above (Figure 5.10) can be checked almost entirely using structure search. In this example, the data returned by one API call (decode_frame) is returned in the incorrect format for another API call (draw_frame). An optimisation writer who knows the behaviour of these two functions can simply recognise them using structure matching. Unfortunately, a successful structure search is not sufficient to safely apply the optimisation, because we must also verify that the data returned by the first call is not used anywhere else (since it would be in the wrong format). This seems to be a common characteristic of structure search in optimisation problems: it serves as a basis for, and lowers the requirement for, other types of checking.

Simpler architectural optimisations can be solved entirely using structure matching. Dig and Johnson have shown that many types of API evolution can be represented as refactorings [Dig and Johnson 2005]. As discussed above, a successful structure search is all that is necessary to perform a refactoring, so the types of API evolution described by Dig and Johnson can be catered for by structure search.

Structure search operates at a purely textual level. Fundamentally, it is a text search technique, and standard text-searching operators can be utilised. For example, optimisation authors could make use of wildcard-style searching to identify all functions which include a common subexpression, such as *display* to identify all functions which include the word “display”, or could even make use of regular expressions to support more complicated specifications. This style of specification is common in aspect-oriented programming environments such as AspectJ [Kiczales et al. 2001]. Particularly popular is the special case *, which matches all functions (usually within a certain higher-level scope, such as a class). This type of specification suits a common AOP pattern, which is to add a small piece of code to the beginning or end of many functions (to perform logging, for example). It appears to be less useful for architecture optimisation, which focuses on an API in which the names of all functions are known to the specification author.

One real-world problem with structure matching as presented is the existence of multiple versions of classes, functions, or modules. In Java, for example, packages are versioned. Dealing with versioning information is an implementation detail and does not meaningfully change the design. One solution is to allow the version information to also be matched. The version information could be represented as an attribute attached to the versioned scope (package name in the Java case) which is also made available for matching purposes in all lower scopes (classes and methods in the Java case), making it an inherited attribute in attribute grammar terms [Aho et al. 1986].

Control-flow search

The second type of match supported by Currawong is control-flow search. Control-flow search refers to specifying a portion of code based on the way control flows through it. For example, control-flow search can identify a sequence of function invocations (such as “call function A, then call function B”). Because control-flow search refers to syntactic features of the code, it can be specified as a sequence of structure searches (e.g. “find the location where function A is called, and then function B is called subsequently”). Control-flow search is rarely used by itself, because control flow provides limited additional information beyond what can already be found through structure search. Architectural optimisations that require control-flow search tend also to require knowledge of data flow within the system.

The following are examples of architectural optimisations which make use of control-flow search:

1. If memhog_1 and memhog_2 are called in the same function, but create_mempool is not called, add create_mempool() before memhog_1, and add destroy_mempool() after memhog_2.

2. If there is a function-call sequence decode_frame; draw_frame, replace decode_frame with decode_frame_yuv.

3. If functions from the Decoder component and the Display component are used in the same function, place the Decoder component into the same protection domain as the calling function.

These examples make use of control flow information to make quite significant changes to code. Unfortunately, each of these examples is potentially unsafe:


1. Of the three, the first example probably has the greatest chance of being correct. However, the optimiser implementing this example must ensure that the function cannot exit in any unexpected ways. If it does, the cleanup action (destroy_mempool) may not be executed. In Java, this amounts to ensuring that the cleanup action is executed even if an exception is thrown (a sketch of the exception-safe form appears after this list). In C, this may be easy to ensure (if the function does not call any other functions), or it may be completely impossible (if, for example, the application calls another function, such as longjmp, which modifies the instruction pointer unpredictably).

2. The second example changes the target of a function call, implementing a solution to the optimisation problem shown in Figure 5.10. However, the example does not check that decode_frame and draw_frame refer to the same data (i.e., that the parameter passed to the second function refers to the object created by the first function): without this check we cannot guarantee that the optimisation will behave as intended.

3. The third example combines the protection domains of two components. If the newly-added component is malicious or buggy, it could corrupt the private data of the component whose protection domain it has joined.
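
For the first example, the rewritten Java function would need to take roughly the following shape. This is a hedged sketch using the illustrative names from the example above, not output produced by Currawong.

    // Exception-safe form of the memory-pool rewrite: destroy_mempool runs
    // even if the code between the two calls throws.
    class MempoolRewrite {
        void rewritten() {
            create_mempool();
            try {
                memhog_1();
                // ... the rest of the original function body ...
                memhog_2();
            } finally {
                destroy_mempool();
            }
        }

        // Stubs standing in for the real API functions.
        void create_mempool() { }
        void memhog_1() { }
        void memhog_2() { }
        void destroy_mempool() { }
    }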

In the second and third examples above, control-flow search is a necessary, but not sufficient, part of the matching process. These examples may become safe when combined with domain-specific knowledge. For example, it may be judged acceptable to add the Decoder component to the protection domain of its caller, even without verifying its correctness. In general, however, control-flow search is not used by itself, but is instead combined with data tracking to provide data-flow search, discussed below.

Structure search and control-flow search are rigidly defined here so that there is no overlap. However, it must be acknowledged that the distinction is somewhat artificial: a compelling argument could be made that method invocations count as control flow rather than structure. I have two responses to this: firstly, as far as architectural optimisations are concerned, method-invocation sites are treated as if they were a structural feature, rather than a control flow feature, and are thus described as such. Secondly, the separation here into three match types is primarily for ease of exposition and does not reflect any fundamental separation in Currawong’s design. In other words, the categorisation of the match methods is far less important than what they do.

Data-flow search

Some optimisations require knowledge of the system beyond structure or control-flow information in order to guarantee correctness. That extra knowledge comprises statements about some, or all, of the data in the system—before, during, or after the optimised portion. The procedure for making this kind of statement is termed data-flow search in this dissertation.

Just like structure search and control-flow search, data-flow search may also be specified declaratively. For example, the Broadway domain-specific optimiser performs data-flow analysis. Optimisations in Broadway can be predicated on data-sensitive properties of function parameters, and this is specified declaratively. Figure 5.11 shows an example included in Guyer and Lin’s paper [Guyer and Lin 2005]. This declarative specification method is both concise and easy to comprehend.

procedure fgets(s, size, f) {
    when (size == 1)
        replace-with %{ (*${s}) = fgetc(f); }%
}

Figure 5.11. Verification as search in Broadway [Guyer and Lin 2005]

The following three architecture optimisations require data-flow search:

1. If there is a function-call sequence X = decode_frame(); draw_frame(X), replace decode_frame with decode_frame_yuv.

2. If data returned from a function are not accessed by the caller, replace the function call with a call to another function which does not return that data.

3. If functions from the Decoder component and the Display component are used in the same function, place the Decoder component into the same protection domain as the calling function. Verify that the Decoder component cannot interfere with the Display component in unexpected ways.

These examples all require tracking of data within the application being examined. The first example requires a proof that two variables refer to the same object: the result of decode_frame should be passed to draw_frame. In C, this could be achieved by showing that two pointers point to the same location. The equivalent in Java is to demonstrate that the object returned from the first function is the same as the object passed to the second function. The second example involves slightly more work: the returned data must not be used anywhere, but the variable referencing that data can be copied, supplied as an argument to functions, and so on. The third example is still more complex: the architecture optimiser must perform a pointer safety analysis of the entire component, to prove that it cannot inadvertently overwrite memory.
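
The following sketch shows the shapes of code the first check would accept and reject; all types and signatures are assumed for illustration.

    // Hedged illustration of the object-identity condition.
    class IdentityExamples {
        interface Frame { }
        interface Decoder { Frame decode_frame(byte[] input); }
        interface Display { void draw_frame(Frame frame); }

        void accepted(Decoder decoder, Display display, byte[] input) {
            Frame f = decoder.decode_frame(input);
            display.draw_frame(f);           // same object: the rewrite is safe
        }

        void rejected(Decoder decoder, Display display, byte[] input, Frame cached) {
            Frame f = decoder.decode_frame(input);
            display.draw_frame(cached);      // different object: the rewrite is unsafe
        }
    }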

These analyses are relatively simple forms of symbolic execution, a method of program analysis through execution in which actual data values are not tracked, but in which constraints applying to data of interest are accumulated as execution proceeds [King 1976]. Section 6.3.3 contains more information about this technique as used by Currawong.
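
As a toy illustration of the idea (and nothing more: this is not Currawong's implementation), object-identity tracking can be thought of as binding each variable to a symbolic token and asking whether two variables carry the same token.

    // Toy symbolic state for object-identity tracking.
    final class SymbolicState {
        private final java.util.Map<String, Integer> tokenOf = new java.util.HashMap<>();
        private int nextToken = 0;

        // A fresh, unconstrained object, e.g. the result of decode_frame().
        void bindFresh(String variable) {
            tokenOf.put(variable, nextToken++);
        }

        // Record an assignment "dst = src": both now denote the same object.
        void bindAlias(String dst, String src) {
            Integer token = tokenOf.get(src);
            if (token == null) {
                bindFresh(src);
                token = tokenOf.get(src);
            }
            tokenOf.put(dst, token);
        }

        // The question behind the CSL condition "var1 = var2".
        boolean sameObject(String var1, String var2) {
            Integer t1 = tokenOf.get(var1);
            return t1 != null && t1.equals(tokenOf.get(var2));
        }
    }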

The diversity of data-flow search techniques is handled through abstraction: an analysis API is exposed to optimisation specifications, through which data-flow analyses can be performed. In other words, Currawong attempts to keep the implementation details of its data-flow search hidden from the specification. Instead, specific analyses are exposed via language constructs. For example, the task of checking pointer identity between two variables within a function is indicated using the CSL syntax var1 = var2.


The content of the search API is explained in more detail below. This section discussed the first part of the Currawong process: matching and checking. The second part, transformation, is discussed next.

5.5. Transformation

Section 4.3 described the transformation remedies that should be applied by Currawong in high-level terms. They are:

1. Combine protection domains

2. Replace component or library

3. Interpose component

4. Modify APIs

To actually implement these remedies, Currawong must make material changes to the application. Transforming an application involves both reference, that is, locating the portion of the application to be modified, and application: performing the modification on the referenced portion of the application.

This section discusses the requirements of each remedy in terms of what kind of API is required of the Currawong Application object. The specifics of the API are covered in the discussion of the specification language in Section 5.8.4, because they are language-dependent.

The remedies are discussed in two broad categories: those that do not involve modifying application code (architecture-level transformation), and those that do (code transformation).

5.5.1. Application architecture transformation

The first three remedies—combine protection domains, replace component or library, and interpose component—involve no modifications to application code. Instead, these remedies, summarised in Figure 5.12, operate at the boundaries between portions of an application (or component system).

Figure 5.12. Application architecture transformation

To accommodate this sort of transformation, the application representation should include a model of the component system, including components, connections between components, and protection domains. The optimisation specification should be able to use this model to combine protection domains, to replace one component with another, and to interpose a component.
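
A hedged sketch of the sort of model this implies is given below; the classes and method names are illustrative only and are not Currawong's representation or API.

    // Toy architecture model supporting the three architecture-level remedies.
    final class ArchitectureModel {
        static final class Component {
            final String name;
            int protectionDomain;   // components sharing an id share a domain
            Component(String name, int domain) { this.name = name; this.protectionDomain = domain; }
        }

        static final class Connection {
            Component from, to;
            Connection(Component from, Component to) { this.from = from; this.to = to; }
        }

        final java.util.List<Component> components = new java.util.ArrayList<>();
        final java.util.List<Connection> connections = new java.util.ArrayList<>();

        // Remedy 1: place every component from b's protection domain into a's.
        void combineProtectionDomains(Component a, Component b) {
            int merged = b.protectionDomain;
            for (Component c : components) {
                if (c.protectionDomain == merged) c.protectionDomain = a.protectionDomain;
            }
        }

        // Remedy 2: redirect every connection involving oldC to newC.
        void replaceComponent(Component oldC, Component newC) {
            for (Connection c : connections) {
                if (c.from == oldC) c.from = newC;
                if (c.to == oldC) c.to = newC;
            }
        }

        // Remedy 3: route an existing connection through a shim component.
        void interpose(Connection c, Component shim) {
            connections.add(new Connection(shim, c.to));
            c.to = shim;
        }
    }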

5.5.2. Application code transformation

Not all systems are as neatly componentised as the example above. Android, for example, does not use a distinct architecture definition language, and most services are provided by a single component. When dealing with real systems such as Android, it is often simpler to discuss architectural transformations in terms of minor code modifications. In Android, for example, it is difficult to combine protection domains of components, particularly when one of them provides system services—a simpler approach is to add additional code to the application and modify the application code to use the new addition rather than the existing system services. The fourth transformation method—modify APIs—relies on code modification, at least at a simplistic level. It recognises that code modification is sometimes convenient.

Unfortunately, the question “what sorts of code modification should Currawong support?” is difficult to answer, because code modification is, fundamentally, an open-ended programming task. However, the optimisations described in this dissertation tend to have only two simple code-modification requirements.

The first requirement is for function-call renaming, in which all calls to a certain method are changed so that a different method is called instead; and function renaming, in which a function’s name is changed. Note that, given suitable component interposition, these two operations are equally powerful. Currawong includes both for the sake of convenience only.
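
The following sketch illustrates the first of these operations. The Decoder interface and method names are assumed (they follow the decode_frame example used earlier); the before/after methods simply show the call site as it would look on either side of the rewrite.

    // Hedged illustration of function-call renaming.
    class CallRenaming {
        interface Decoder {
            byte[] decode_frame(byte[] input);
            byte[] decode_frame_yuv(byte[] input);
        }

        // Before the rewrite, the call site targets decode_frame:
        byte[] before(Decoder decoder, byte[] input) {
            return decoder.decode_frame(input);
        }

        // After function-call renaming, the same site targets decode_frame_yuv;
        // nothing else in the method changes.
        byte[] after(Decoder decoder, byte[] input) {
            return decoder.decode_frame_yuv(input);
        }
    }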

The second requirement is to add code to the application. Code addition is limited to adding extra functions to existing bodies of code.

In summary, there are two major types of transformation: those which involve actually modifying code, and those which work at a level above the code, such as at the component-system level for CAmkES. It is desirable, but not always practical, to avoid code modification.


5.6. Specification language requirements

The above sections provided a background, describing the requirements of Currawong’s matching and transformation API. On the basis of that background, the requirements for the programming language that implements that API can now be described.

Optimisation writers describe both matching and transformation for a given optimisation using Currawong Specification Language (CSL). CSL is a templated, declarative, extensible logic language. Each of these features plays an important role in making CSL maximally expressive with minimal overhead.

MatchOnDraw is Java {
    class $ClassName extends android.view.View {
        protected void onDraw(Canvas _) { }
    }
}

MergeOnDraw is Java {
    class $ClassName extends Android.view.SurfaceView
            implements Android.view.SurfaceHolder.Callback {
        private int _cw_tok;
        public void surfaceDestroyed(SurfaceHolder s) { }
        public void surfaceCreated(SurfaceHolder s) { }
        public void surfaceChanged(SurfaceHolder s,
                int fmt, int w, int h) {
            _cw_tok = au.com.nicta.cw.Draw2D.init(this);
        }
        protected void invalidate() {
            au.com.nicta.cw.Draw2D.invalidate(_cw_tok);
        }
    }
}

MergeOnDrawInit is Java {
    class $ClassName {
        $ClassName {
            getHolder().addCallback(this);
        }
    }
}

optimise(ondraw, App) is
    Match = App.match(MatchOnDraw),
    App.add_package('au.com.nicta.cw'),
    App.merge(Match, MergeOnDraw),
    App.merge_all(Match, MergeOnDrawInit),
    App.rename_method(Match.ClassName, 'onDraw', '_cw_onDraw'),
    App.merge(Match, MergeOnDrawInit).

Figure 5.13. An optimisation specification written in CSL.


A complete optimisation specification is given in Figure 5.13. It describes an Android optimisation, the Android redraw optimisation, which is discussed in more detail throughout this section. Evaluation of the effectiveness of this optimisation, as well as the motivation behind its application, is covered in Chapter 7.

Most of this specification’s details can be ignored at this stage (a complete description is given in Section 5.2.5). It was included to give an idea of the concision of a declarative specification language such as CSL, simply in terms of the small number of code lines required to describe an optimisation. Note the basic format of the specification: most of it is written in a Java-like templating language, but the final portion (which contains the transformation rules) is written in CSL.

Concision is a worthwhile goal by itself, but it takes on special significance in transformation languages. Transformation specifications tend to be small and rather simple to begin with, so any unnecessary code is particularly noticeable. Tree transformations written in TXL [Cordy 2006] are a good example of the size of these sorts of programs, although architecture optimisation specifications are slightly more complex than TXL programs.

Similarly, match criteria tend to be restricted to small, well-defined problem domains, and are amenable to compact representations. In particular, the match criteria described above can all be formulated as search problems. For these reasons, a logic programming language was selected as the basis of CSL. The fundamental characteristic of logic programming languages is that they are goal-directed: they use a proof mechanism to perform deductions in a goal-oriented manner, guided by rules [Kowalski 1988]. Logic languages are expressive, concise, and well-suited to search problems. Because they are declarative, programs written in logic languages express what rather than how, that is, they describe a particular problem, rather than explain in detail how to go about solving it. For this reason, CSL is based on a logic language. In particular, it is based on Prolog, and, like Prolog, CSL uses unification as its primary execution mechanism.

A:  class Example {
        void $Y(int) { }
    }

B:  class('Example', Functions),
    filter(Methods, (void, _, [int])).

C:  for class in AllClasses:
        if class.name == 'Example':
            for method in class.methods:
                if len(method.params) == 1 and
                        method.params[0].data_type == 'int' and
                        method.return_type == 'void':
                    return function

Figure 5.14. Finding functions in a class through pattern-matching (A), unification (B), and iteration (C).

Structure search is a good example of a problem that is well-suited to declarative expression. Figure 5.14 shows three ways to perform structure search. The task is to find the methods taking a single int as a parameter and returning void, within a class named Example. Method A uses pattern-matching; method B, unification (for Currawong’s purposes, a syntactic transformation of pattern-matching); and method C, iteration. The iterative approach reveals details about the implementation of the data structure used: that the object representing a class contains a field named methods, that the class objects can be iterated over, and so on. The iterative version is also significantly longer than the other two versions.

By contrast, the unification version (B) is much smaller, making use of functional programming techniques, such as list filtering, to express intent more clearly. However, it is still rather difficult to understand the intent of the search, because the search terms (“Example”, “int”, and so on) are obscured by the syntax for pattern-matching and filtering. This approach also reveals details about the format of the data structure used for representing classes, i.e. that it is a two-element structure containing a class name and a list of functions.

The pattern-matching form, A, is an improvement over both alternatives in terms of both length and readability. This form uses a specification written in the language being searched (Java in this case), making it easy to read. It is immediately obvious that certain aspects of the template (such as the function name matched) are variable. Method A also completely hides the underlying data representation.

This example demonstrates that syntax is important. Methods A and B are similar (they are both declarative specifications), but they are syntactically very different. CSL supports method A using a templating approach: specifications in the style of method A are translated to the method B form at run-time. The details of this approach, particularly the method by which the template is translated to CSL terms, are discussed in Section 5.7.2.

The above requirements motivate the design of the specification language proposed for Currawong: Currawong Specification Language.

5.7. Currawong Specification Language

CSL is a variant of Prolog [International Standards Organisation 1995] with extensions for templating and object orientation. The base language is a minimal Prolog: a language using first-order unification as its execution mechanism, supporting lists, structures, terms, and atoms. CSL by itself supports strings, floating-point numbers, and arbitrary-precision integers as its base data types, but these are augmented with complex built-in data types representing the application under investigation, and search result matches.

5.7.1. Specification structure

The specification is defined with the Prolog rule named optimise(+Name, +App). The first argument is the specification name, as an atom. The second parameter is the application to be optimised. As a shortcut, this description follows the de facto method of referring to clauses by their name, a forward slash, and their arity (the number of parameters they accept). In this case, the optimise clause accepts two parameters, and is written optimise/2. The body of optimise/2 is written, as is standard in Prolog, as a list of expressions separated by commas and terminated with a dot. As evaluation proceeds, a number of rules within the Application object are evaluated, which have the side effect of modifying the Application object. If evaluation of optimise/2 succeeds, these modifications are made to the actual on-disk application.

5.7.2. Syntax

CSL is syntactically similar to Prolog. CSL atoms begin with a lower-case alphabetic character, and variables begin with an upper-case alphabetic character. A minor change is that clauses are defined using the keyword is rather than the syntax :-. As with Prolog, clauses consist of a sequence of expressions separated by commas and terminated with a dot (.).

The following sections describe the major differences between CSL and standard Prolog, as well as the two major built-in data types, Application and Match, which provide the API for specification-directed architecture optimisation.

Objects and data types

Encapsulated data types (i.e., objects) are a convenient addition to CSL, because they make specifications more concise. There are two different kinds of uses of encapsulated data types in Figure 5.13. The first kind, as exemplified by App.match(), App.add_package(), and so on, looks like a function call in an object-oriented language. The second kind, as shown in Match.ClassName, looks like a field reference in an object-oriented language.

To add support for this kind of data type, the language is first extended to support immediate evaluation. Prolog already offers a limited form of immediate evaluation—the is keyword—but its semantics are inconvenient for CSL’s purposes.

A:  expr(Arg1) is expression, expression, expression.

B:  expr(Arg1, Result) is expression, expression, Result = expression.

Figure 5.15. Implicit return values in CSL rules.

The rules for CSL immediate evaluation are as follows:

1. All rules are treated as if they have an implicit unbound additional parameter. This parameter is unified with the last successfully-evaluated expression in the rule. Figure 5.15 illustrates this behaviour. Rules of the form shown in part A of that figure are treated as if they were of the form shown in part B.

2. When an expression of the form Expr((...)) is encountered, evaluate the expression immediately.

3. Continue evaluation as if Expr((...)) was syntactically replaced by the return value from the expression after it has been evaluated.


Immediate evaluation provides a way for rules to act as macros, returning other expressions. This forms the basis for an object-oriented type system. The type system itself can then be implemented as syntactic sugar: in other words, it can be implemented through purely syntactic transformations on the code.

To support object field access of the form Object.field, proceed as follows:

1. CSL defines an object type to be the two-element list [DataType, Value], where DataType must be bound to a string, and Value can be bound to anything at all, or can be unbound.

2. When an expression of the form Object.field is encountered, unify Object with the list [DataType, Value]. This unification must succeed exactly once.

3. Construct a structure name consisting of the string to which DataType is bound, an underscore, and field. For example, if the object access is App.ClassName, and App unifies with [java_application, _], the structure name would be java_application_ClassName. Call this new name StructureName.

4. Continue the evaluation as if the object access were replaced by StructureName((Value)).

This method easily extends to support structure access of the form Object.structure(Arguments). To do this, the replacement in Step 4 becomes StructureName(Value, Arguments).

Template support

A typical architecture optimisation includes structure search—the process of identifying a portion of code through reference to its structural features, such as class names, package names, and function names (covered in detail in Section 5.4.2). A convenient way to express a structure search term is to write the desired code that should be matched directly in the object language—that is, in the language that the application being optimised was written in. For example, to reference a particular class in Java, one would like to write class ClassName. For similar reasons, it would also be convenient to specify the data-flow conditions for data-flow search directly in the object language. Finally, some forms of transformation involve adding code; obviously the best way to describe the code to be added is simply to write it in the desired language.

CSL therefore has a unique need for portions of optimisation specifications to be written in the object language. This is achieved in CSL through templating.

CSL’s templating language takes advantage of the immediate evaluation syntax described above. In place of any expression in CSL, the optimisation writer may use the syntax TemplateName { Code }. This is translated by the parser into template_TemplateName "Code", i.e. everything inside the braces is supplied as a string to the rule template_TemplateName. This rule is then immediately evaluated, as per the steps described in Section 5.7.2, and the result is syntactically inserted into the specification code.


MatchOnDraw is Java {
    class $ClassName extends android.view.View {
        protected void onDraw(Canvas _);
    }
}

Figure 5.16. CSL templating example

Figure 5.13 shows two examples of templating. The first example, reproduced for convenience as Figure 5.16, specifies a structural match. This particular template matches any Java class which extends android.view.View, which contains a protected method named onDraw, accepting a single parameter of type Canvas, and returning void.

Using templates to perform structure search raises two problems. The first is a referencing issue: having matched a portion of the application, how can one refer to the matched portion? The second problem is a generality issue: how can we ensure that a template isn’t overly specific, failing to match portions of an application that it should?

The referencing issue can be re-stated as the requirement that the optimisation specification should have some way to reference portions of the matched code. Match objects deal with this problem by defining a hierarchy that matches the structure of the code, and is accessible using object attribute references (see below for more information on Match objects).

The generality issue is addressed through support for variables in the templated code. Any Java or C identifier in a template, if prefixed with a dollar sign, is treated as a variable. The example in Figure 5.13 uses this feature to identify the name of the class. Variables can be used in place of any identifier in the object language. Instead of specifying a class name to match, the variable $ClassName is supplied instead. These variables can be accessed by other portions of the optimisation specification. A special-case variable, named _, can also appear in templates. This variable functions as a wildcard—it can be used multiple times within a template, but is never bound, and cannot be accessed from the optimisation specification.

The idea of using the object language as a specification system is not new: many other transformation systems support direct injection of object code into the specification language. Broadway supports specification of data-flow conditions in the object language, as can be seen in Figure 5.11.

The logic-language-based aspect-oriented programming system TyRuBa [De Volder 1999] also supports specification in the object language. TyRuBa’s basic syntax for this type of reference, which CSL borrows, is to surround object language strings with brackets. TyRuBa’s support for object-language references is quite rudimentary: it treats the target language snippet as simple text, and matches it against the code in this way (after normalising whitespace). By contrast, templates in CSL are a complete domain-specific specification language which must obey formal rules.

Real systems are built using multiple programming languages. Therefore, an architecture optimiser for real systems must support multiple programming languages, too. CSL’s templating system was designed to accommodate multiple programming languages by requiring explicit specification of the template name when creating templates. This means that different languages can be supplied to different templates—for example, one implementation could support Java and C through appropriately-named templates.

Summary

The major features of CSL are as follows.

CSL is templated so as to provide a domain-specific way for the code under examination to be checked and modified. Templates let optimisation authors write portions of the optimisation specification in the object language. In Figure 5.13, the clauses beginning with MatchOnDraw and MergeOnDraw are templates.

CSL is declarative both because it provides a concise way to express information, and because it resembles the natural way that programmers talk about program modification: “replace any call to method X with a call to method Y”. This high-level approach also means that optimisations need not be aware of the implementation details relating to program analysis. This, in turn, means that the implementation details may change without affecting the optimisations making use of them: it insulates optimisation specifications from changes to Currawong’s implementation.

CSL is extensible as it is a complete programming language. This means that vendor-specific optimisation support routines can be supplied externally to Currawong. In the current implementation of Currawong, the Android- and CAmkES-specific portions of CSL need not be, and in fact are not, built in.

Finally, CSL is a logic language, using first-order unification as its method of execution. Using unification for execution gives CSL some powerful properties. For example, an optimisation specification can make use of the back-tracking behaviour of unification to match two mutually-interdependent, but structurally separate, portions of the application. Doing so requires just two lines of specification code.

5.8. The Currawong API

CSL’s abstraction of object language and application behaviour is handled by an API. Most of the API is provided by two kinds of object: the Application object and Match objects. CSL’s Application object implements the API described earlier in this chapter, starting with Section 5.3.

An Application object represents the application being examined, and is system-specific: an Android Application object is different from a CAmkES Application object. As described above, Application objects abstract the details of the access to code; the layout of code on the disk; the process required to perform static analysis on that code; and the method by which code is transformed.

The following sections discuss the Currawong API with respect to the way it is used by an optimisation specification. Figure 5.17 shows this process: matching starts with structure search (1), optionally followed by additional structure searches as part of a control-flow search (2), optionally followed by one or more data-flow searches (3), followed by rewriting (4).

Figure 5.17. Matching in Currawong

5.8.1. Structure search

Matching in Currawong starts with structure search. To perform a structure search, the optimisation specification supplies a template to the Application.match rule. The result of this rule is zero or more unifications resulting in Match objects. Match objects represent a portion of code within the Application which has matched a particular template.

Match objects are a very important part of Currawong. They perform three functions:

1. They are the basis of structural and control-flow search;

2. They provide information to aid data-flow search; and

3. They provide a reference for code transformation.

Match objects can serve these three roles by taking advantage of templating. The concept of templating as a way to represent applications was introduced in Section 5.3. Templating as a language extension was described in Section 5.7.2.

In the context of structure search, Match objects are quite simple. A template is applied to the application under examination. The resulting Match object makes structural information from the match, as well as information about the location of any function calls within the match, available to the optimisation specification.

Structural information refers to the kind of structural information described in Section 5.4.2, i.e., classes, functions, and modules. Other structural information may be present, depending on the system being analysed. For example, if the system is a component system, an additional structural level may be supplied describing the system architecture.

The locations of function calls (if the structure search matches any code-containing structure) are a special case of structural information. However, rather than describing a scope, this information refers to a piece of application code.

Match objects also make any variables matched during template matching available to the optimisation specification. The content of this information (variable matches and structural information) is deliberately opaque to the optimisation specification—the only thing the specification can do with it is pass it to APIs which perform further matching, or which perform transformation. This is discussed in more detail below.

5.8.2. Control-flow search

Control-flow search can be expressed as a series of structure searches (Section 5.4.2 covers this in detail). More specifically, control-flow search can be expressed as a sequence of increasingly-constrained structure searches. The first structure search is completely unconstrained; the second is constrained in that it must occur after the first one in a plausible program control flow; the third must occur after the second, which implies that it also occurs after the first; and so on. Currawong represents this scenario by supporting an additional parameter to the match rule through which the match which chronologically precedes the desired match is supplied. In this manner a chain of matches may be generated. Figure 5.18 shows an example of this API. In this example, the first match command produces a match object based on the template named FirstTemplate. The second match command refines this match object, reducing the matched code to that which matches FirstTemplate and, subsequently, SecondTemplate. The final match command further refines the match object, producing Match3.

Match1 = Application.match(FirstTemplate),
Match2 = Application.match(SecondTemplate, Match1),
Match3 = Application.match(ThirdTemplate, Match2).

Figure 5.18. Control-flow matching API

Control-flow search does not place any additional requirements on the resulting Match objects.

5.8.3. Data-flow search

Data-flow search involves making statements about data as control moves through an application. Currawong's data-flow search support focuses on tracking object identity. Compared with systems that were written specifically to perform data-flow search, object identity tracking is quite limiting. Systems like Yang's EXE, for example, can also keep track of the values of all the data within the system under analysis [Yang et al. 2006]. However, restricted data-flow search abilities are less limiting for Currawong than one might expect, due to the nature of Currawong. Firstly, many viable optimisations simply do not require extensive data-flow search—as Section 5.4.1 discusses, often the requirement for data-flow search can be ameliorated by structural search combined with domain-specific knowledge of the system. Secondly, object identity tracking seems to address many types of real-world optimisation: the first two examples given in Section 5.4.2 can be verified by object identity tracking.

To track object identity, Currawong adds objects to the list of elements tracked by Match objects. In Java, almost everything is an object (apart from basic types, such as int). In C, objects refer to pointers. Because C discards type information, object tracking in C is limited compared with object tracking in Java, but it is sufficient to track parameters passed to API functions.

MatchMethods is Java {
    class $ClassName extends android.view.View {
        void redraw() {
            $arg1 = api.function1();
            api.function2($arg2);
        }
    }
}

optimise(dataflow, App) is
    Match = App.match(MatchMethods),
    Match.arg1 = Match.arg2.

Figure 5.19. Data-flow matching example

Two types of pointer tracking are supported: to check for equivalence, and to determine access.

Figure 5.19 illustrates the pointer-tracking concept for equivalence. In this example, the template portion produces two variables: arg1, which is the result of calling function1, and arg2, which is the only parameter passed to function2. These two variables are then unified in the optimise clause (with the line Match.arg1 = Match.arg2). Currawong implements a custom unification process for this data type which triggers a pointer analysis.

The example highlights another advantage of equivalence tracking as a data-flow search method: it integrates well into the language. In CSL, equivalence tracking can be initiated by attempting to unify two variables from a Match object.

Checking for object access is initiated using a rule defined on the Match object, Match.access/3. This rule produces a set of objects which use a particular pointer, starting from the scope object supplied as the first argument, and unifies it with the third argument.

The most important aspect of any implementation of data-flow analysis is that it is conservative: if the analysis gives an incorrect result, the error is always in the direction of concluding that the property being checked does not hold when in fact it does. This conservatism ensures that optimisations are never applied when they should not be (at least, on the basis of data-flow analysis). If the reverse were true, then optimised applications could perform unexpectedly.

5.8.4. Transformation

Matching is one portion of a CSL specification. The other portion specifies the transformation to perform. Unsurprisingly, transformation portions of optimisation specifications tend to reference the same code that was referenced by the match portion of the specification.

Section 5.5 described application transformation in terms of reference (identifying a portion of the application to be transformed) and application (applying the transformation). That section also outlined the types of transformation to be supported: architecture-level transformation, function and function-call renaming, and adding code.

The API for architecture-level transformation and function and function-call renaming is straightforward—a single rule for each expected action (such as renaming a function definition). This API is provided in Appendix B. However, the API for adding code does not follow the same convention. Instead, Currawong supports code merging.

MergeOnDraw is Java {
    class $ClassName {
        private int _cw_tok;
        public void surfaceChanged(SurfaceHolder _, int _, int _, int _) {
            _cw_tok = au.com.nicta.cw.Draw2D.init(this);
        }
    }
}
...
Match = App.match(MatchOnDraw),
App.merge(Match, MergeOnDraw).

Figure 5.20. Adding code with Currawong

Code merging is used in the example provided in Figure 5.13. The relevant portion is reproduced as Figure 5.20 for convenience. The syntax is the same as that required for specifying templates in Currawong, and the new code can incorporate templated variables. The new code is then applied to the application in the context of a Match object. When applying code, all variables (that is, all tokens beginning with $) in the new code are replaced by the appropriate matched variables. Any class-level variables in the new code are then added to the matching class in the Match object. New functions are inserted into the class if necessary. In the example, a Match object is created from the MatchOnDraw template. Then, code specified in a template, MergeOnDraw, is added to the application, using the Match object to guide placement. Variables present in the Match object, such as ClassName, are applied to the merge template—so ClassName is filled with the appropriate class. The field _cw_tok and the method surfaceChanged are then added to the class named by ClassName.

Currawong cannot currently replace or remove code directly. So far it has been sufficient, when code should be replaced, to rename the function implementing the unwanted code, and then insert a new function with the same name as the original. However, it does not seem infeasible to extend Currawong to support removal of code, should the need arise.


5.8.5. Summary

A fundamental part of Currawong's design is to provide support to the optimisation specification via an API which abstracts the details of checking and transformation as much as possible. Currawong's API supports the optimisation specification by providing three types of match: structure search, control-flow search, and data-flow search. The API also enables the optimisation specification to transform code through simple refactoring-style manipulations, as well as via more complex code modification.

5.9. Transformation and looping

Architecture optimisation specifications may match within an application zero times, once, or more than once. Handling the zero-match case is simple: Currawong can leave the application unchanged. Handling the single-match case is almost as simple: Currawong transforms the application according to the match, and exits.

However, problems arise if a specification matches multiple times. The major problem is that of interference between matches. Perhaps a specification matches twice on the original application, but the process of transforming the application according to the first match invalidates the second match.

There are two ways to deal with this problem. The first approach, optimisation-guaranteed noninterference, is to require that optimisation specification writers guarantee that matches do not interfere with each other. The advantage of this approach is that all potential optimisations can be extracted from a single evaluation of the optimisation specification (making use of backtracking). A single evaluation pass means a faster architectural optimiser. However, optimisation-guaranteed noninterference puts a significant burden on the optimisation writer—arguably an inappropriate one, since she must now divert her attention from writing the optimisation and instead focus on correctly using the architectural optimiser's API.

The second way to deal with potential conflicts is to perform matching and transformation sequentially: take the first match, perform the appropriate transformation, and then check for new matches. This approach, sequential application, simplifies optimisation specifications—optimisation writers do not need to worry about conflicting matches, because the transformations effected by the first match will ensure that the second match does not occur. This means that the only requirement of optimisation writers is that they ensure that optimisation specifications only match when a valid optimisation can be applied. Since this requirement is the basic requirement of an optimisation in the first place, sequential application is a better alternative for optimisation writers. It is also simple to implement within architecture optimisers: the core optimisation pass can be implemented as a loop which repeatedly evaluates the optimisation specification, transforms the code according to the first result, and repeats, until the optimisation specification does not produce any result.
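This loop can be written down directly. The following Python sketch is illustrative only; the evaluate and apply method names are assumptions rather than Currawong's actual API.

    # Illustrative sketch of sequential application (method names are assumptions).
    def optimise_sequentially(application, specification):
        while True:
            results = specification.evaluate(application)   # zero or more matches
            if not results:
                break                                        # nothing left to optimise
            application.apply(results[0])                    # transform using the first match
        return application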

Sequential application may be improved in various ways. One option is to apply the best transformations, rather than simply applying the first one, where "best" means "results in the largest performance improvement in the rewritten application". This approach is outside the scope of this dissertation.

5.10. Output

The final component of the architectural optimisation is writing changes to disk by actually modifying the application. Optimisation writers have no control over this process: if the optimisation succeeds, it is applied automatically. This leaves the implementation considerable flexibility in the way in which application rewriting occurs.

The exact process by which output is performed is left as an implementation detail, because it is system-specific.


6. Implementation

On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" In one case a member of the Upper, and in the other a member of the Lower, House put this question. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

– Charles Babbage [Babbage 1864]

Currawong's implementation follows the design described in Chapter 5. It is a multi-platform architectural optimiser, supporting both the CAmkES software stack and the Android mobile operating system software stack. In addition, Currawong supports two languages—C and Java—and satisfies the criteria for an architecture optimiser in that it does not rely on application source code, supports multiple verification methods, and implements Currawong Specification Language.

Chapter 5 presented the design of Currawong's API, and then used this API design to motivate the design of Currawong specification language. In this chapter, an overview of Currawong is given first (Section 6.1), followed by the implementation of the language (Section 6.2), and then specifics of the API: verification (Section 6.3) and transformation (Section 6.4). Verification and transformation are first described in a system-agnostic way, after which system-specific extensions for Java and Android (Section 6.5), and C and CAmkES (Section 6.6), are described.

6.1. Overview of Currawong

Currawong is a program mutator controlled by a domain-specific programming language. The programs it runs are optimisation specifications; the input to the program is an unoptimised application; and the output (if the optimisation was successful) is an optimised application.

Figure 6.1 shows the workings of Currawong from the perspective of the optimisation specification. From this perspective, Currawong provides an API to the optimisation specification. The specification makes use of this API to find a good place to apply the optimisation and to transform the application after locating a suitable place. Most of the API provided by Currawong to the optimisation specification is embedded within the Application object, which represents the application being examined. Interaction with the Application object results in the generation of a number of ancillary objects (such as the Match object). This high-level approach to optimisation keeps the specification concise.


Figure 6.1. Currawong workflow (optimisation specification perspective): the specification interacts with the Application object to perform structure search (producing Match objects), data-flow search (via symbolic execution), and transformation (via application re-writing)

Figure 6.2. Currawong implementation (system-agnostic version): the application (1) and specification (2) are loaded (3, 4) into an app rep (5) and spec rep (6) stored in a database; evaluation (7) produces changes (8), which transformation (9) applies to yield the output application (10)

Figure 6.2 shows Currawong's architecture. Currawong is mostly written in the Python [Python Software Foundation 2010] programming language, but it makes use of several existing tools as part of its operation. Its design is based around that of a traditional logic-based application: the central data structure is a database of CSL clauses. Currawong execution proceeds as follows: firstly, the application (1 in the diagram) and specification (2) are both loaded (3 and 4). Currawong stores an abstracted form of the application, or app rep (5), as well as the parsed version of the optimisation specification, or spec rep (6), into a database. Here "database" is used in the Prolog sense, i.e., a searchable list of facts (statements about the program or specification) and rules (procedures to follow in order to deduce facts). Currawong then evaluates the spec rep (6). The result of the evaluation (7), if it is successful, is a set of changes to be applied to the application (8). These changes are applied (9), and the resulting new application (10) is written to disk.

    CSL type     Internal representation
    atom         ('atom', string value)
    variable     ('variable', string name)
    structure    ('structure', string name, terms...)

Table 6.1. Example type mappings, CSL to Python

Step 7, Evaluation, is where Currawong performs the optimisation work. Since Currawong specification language is declarative, "executing" the optimisation specification rule is equivalent to evaluating it (an example of a rule, optimise/2, was shown in Figure 5.13). In other words, the optimisation specification is presented as a search problem with zero or more solutions, where each solution represents a complete evaluation of the optimisation specification.

6.2. Implementing Currawong specification language

Currawong's Currawong specification language (CSL) support is built around a custom Prolog interpreter to which CSL-specific features were added. This method of implementation provided flexibility when designing CSL, because it is easy to extend the language—either by adding built-in functions, or by extending the grammar. For example, an early design experiment, later rejected, was to add support for constructing linear temporal logic terms as an integral part of the grammar. This approach provides flexibility at the expense of overall execution time—a suitable compromise for Currawong, but a non-prototype architecture optimiser would probably make use of an existing Prolog implementation for speed reasons.

Currawong uses a simple mapping between Python types and Prolog / CSL types. These mappings are used by both the interpreter and the parser. Each CSL type instance is represented internally by a Python tuple, where the first element is a string containing the type's name, and subsequent elements are the instance's value. Table 6.1 shows a subset of the type mappings to demonstrate the idea.
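As an illustration of this mapping, a hypothetical CSL term colour(Pen, red) (the term itself is invented; the tuple layout follows Table 6.1) could be represented as:

    # Hypothetical CSL term colour(Pen, red) in the tuple layout of Table 6.1.
    term = ('structure', 'colour',
            ('variable', 'Pen'),   # an unbound CSL variable
            ('atom', 'red'))       # a CSL atom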

6.2.1. Parser

The CSL implementation consists of three parsers: a parser for the core language, a parser for C templates, and a parser for Java templates. Each parser is generated by a backtracking LL(k) parser generator written by the author. Because both the generator and the parser itself are written in Python, the generator is well-suited to rapid prototyping: there is no compilation step. Changes to the grammar are automatically integrated into the parser without a separate parser-building step, and adding code to execute when certain grammar constructs are encountered is seamless.

The core language parser produces, as output, a list of structures of the type shown in Table 6.1. Both the other parsers produce Python objects representing the parsed syntax tree. The syntax trees are deliberately kept opaque to the optimisation specification, which is required to use rules (i.e., call functions) on the Application object to investigate specific properties of the parsed source, rather than examining it directly.

Figure 6.3. Control flow between CSL parsers: the CSL parser hands C/C++ and Java sub-specifications to the corresponding template parsers and resumes when each sub-specification ends

Figure 6.3 shows the interaction between the parsers. CSL requires that templated sections begin and end with braces ({ and }). This allows the CSL parser to treat templated sections within the specification as unparsed data. When such a template is encountered, CSL reads the braced portion, then calls the relevant template-specific parser to produce a parsed representation. The object returned by the template parser is stored within the parsed representation of the CSL as a custom datatype. Note that this differs from the templating design outlined in Chapter 5: parsing of the template is performed when the specification as a whole is parsed, rather than at run-time.

6.2.2. Interpreter

Currawong executes optimisation specifications using an interpreter. This implements standard Prolog unification without the occurs check. Unification in this version of Currawong is implemented using Python generators, which are essentially co-routines. The core interpreter is very small (approximately 900 LOC, including comments), which aided debugging.
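The sketch below shows how unification over the tuple representation can be expressed with generators; it is a minimal illustration under the assumption of the Table 6.1 layout, not Currawong's interpreter, which handles many more cases.

    # Minimal generator-based unification (simplified; no occurs check).
    def walk(term, env):
        # Follow variable bindings until an unbound variable or non-variable is reached.
        while term[0] == 'variable' and term[1] in env:
            term = env[term[1]]
        return term

    def unify(a, b, env):
        a, b = walk(a, env), walk(b, env)
        if a == b:
            yield env
        elif a[0] == 'variable':
            yield dict(env, **{a[1]: b})
        elif b[0] == 'variable':
            yield dict(env, **{b[1]: a})
        elif a[0] == b[0] == 'structure' and a[1] == b[1] and len(a) == len(b):
            def unify_args(i, env):
                if i == len(a):
                    yield env
                else:
                    for env2 in unify(a[i], b[i], env):
                        yield from unify_args(i + 1, env2)
            yield from unify_args(2, env)

    # Example: unifying colour(Pen, red) with colour(blue, red) binds Pen to blue.
    bindings = next(unify(('structure', 'colour', ('variable', 'Pen'), ('atom', 'red')),
                          ('structure', 'colour', ('atom', 'blue'), ('atom', 'red')), {}))

Because callers consume solutions lazily, abandoning a generator part-way through is all that is needed to backtrack.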

Besides implementing support for objects (as described in Section 5.7.2), the interpreter is extensible: additional built-in functions may be written (in Python) and made available to the interpreter; the interpreter in turn makes these functions available to the optimisation specification as built-in rules.

6.3. Matching

Application representations are exposed to the specification through an API: no attempt is made to give the specification access to the internal representation of the parsed application. The advantage of this approach is that both the internal representation, and consequently the static analysis methods which operate on that internal representation, can be modified without creating incompatibilities for existing optimisation specifications.

Matching is highly language-specific. Therefore Currawong includes two implementations of structure search, control-flow search, and data-flow search (the three matching types, as discussed in the previous chapter): one for Java, and one for C. The Android-specific portion verifies bytecode compiled from Java source, and the CAmkES-specific portion verifies code compiled from C source. The common features of these implementations are discussed below. The Java- and C-specific portions are discussed in Sections 6.5 (Android) and 6.6 (CAmkES), respectively.

6.3.1. Structure search

A structure search matches a template to a portion (or portions) of the application. Chapter 5 outlines this process in general terms.

In Currawong, the presence of a template in the optimisation specification causes Currawong to create a custom data-containing object representing that template as matched against portions of code. Creation of this object is triggered by evaluation of an Application.match rule as discussed in Section 5.8.1. One template can therefore result in the creation of multiple objects over the course of evaluation, if it matches multiple portions of code.

The match rule is syntactically translated to application_do_match(Application, Templatename) (the Application object does not make use of the macro feature). When this rule is itself evaluated, a Match object is prepared by the appropriate (Java or C) application representation and returned.

Section 5.8.1 outlines the types of things that a structure search should produce: structural information, the locations of function calls within any matched methods, and any variables included within the structure. Match objects therefore contain methods to access these data.

Referencing structure information and function calls

As described above, structural information provided by a Match object is made available to the specification language through the Match.feature/1 rule (see Appendix B for a complete list of Match object rules). This rule accepts a miniature domain-specific language to reference parts of the Match object's structure.

The feature specification language allows one to identify a particular structural portion of code. Figure 6.4 shows some examples of the specification language in action. Here, several Match.feature declarations are shown, each referring to the template MatchMethods above them. Note that the numbered lines are explanatory and are not a part of the specification language.

The optimisation author uses the feature specification language to identify a feature by specifying one or more scope names (i.e. module, class, or function names, separated by a dot). The default scope is assumed to be the scope of the highest-level container within the match; in Figure 6.4 this is the class $ClassName.


MatchMethods is Java {
    class $ClassName {
        void method() {
            dosomething();
        }
    }
}

1  Match.feature("method")
2  Match.feature("")
3  Match.feature("$ClassName.method")
4  Match.feature("_.method")
5  Match.feature("method.dosomething")

Figure 6.4. Using the ”.feature” rule to access a Java method

If the Match object cannot locate the specified feature, it tries again from the scope above the current default scope: i.e. if the default scope is currently a class, the search is retried with the default scope set to the scope that contains classes, i.e., in Java, the package in which the class is defined. Therefore examples 1 and 3 in the figure are equivalent—both match the method named method, but the latter example does it after first attempting, and failing, to locate the contained scope within the class.

Two additional features are provided: templated variables may be used as part of the specification, as $ClassName is, above; and a "scope name" of a single underscore character (_) may be specified as a wildcard to indicate that any name is acceptable.

This mechanism is extended to describe the location of function calls within a match: the name of the function call may be specified as an additional scoped name after a function scope is matched. Support for referencing additional calls to the same function after the first one is currently not implemented; a simple solution is to extend the templating syntax to allow the template writer to uniquely name each call, so that the correct one can be specified unambiguously within the feature specification language.

The result of evaluation of a Match.feature rule is another custom datatype: a match reference. Match references are made available to the optimisation specification as opaque types, so that they may be passed as input parameters to transformation rules. The details of the reference are, however, hidden from the specification language. A match reference must provide sufficient information to the rest of the system so that it can be used as part of a transformation. There are four types of match reference required for structure search:

1. scope match references identify a particular scope, that is, a module, class, function, or similar;

2. invocation match references identify a particular function invocation.

3. data match references identify a particular variable within a function.


4. variable match references identify variables matched in the template (that is, names which were prefixed with a dollar sign, such as $ClassName).

These reference types are object-language-dependent, so their implementation is discussed below. In general terms, scope-related match references refer to a specific class in the hierarchical application representation built after scanning an application, whereas invocation references include a function-level scope match reference as well as some way of indicating the particular invocation within that reference (in both cases, an offset from the beginning of the function to the relevant instruction is used).

Variable match references are a special case. Variable match references are made available to the optimisation specification using the mechanism described in Section 5.7.2—i.e., evaluation of Match.VariableName results in immediate evaluation of Match_match(Object), where Object is the custom object representing the actual variable reference. Because, however, variables can be specified at various different locations within the template, one of two different types of variable match reference is returned, depending on the semantic role of the variable within the matched template:

1. If the variable refers to a scope (i.e. if the variable is a class name in a Java template) then the result of evaluation of the variable is a scope match reference.

2. If the variable refers to a function parameter then the result of evaluation of the variable is a data match reference.

6.3.2. Control-flow search

No additional support is required of Match objects in order to implement control-flow search. To support the API described in Section 5.8.2, match objects must reference the previous (in terms of control flow) match object. Actually determining the path from one Match object to the next is up to the language-specific application representation.

6.3.3. Data-flow search

Currawong supports data-flow search with two symbolic execution engines: one for Java, and one for C. As discussed briefly in Section 5.4.2, symbolic execution is an execution method in which all possible values for a given control-flow path are simultaneously considered [King 1976]. In order to achieve this, regular variables within a program are replaced by symbolic variables, which can represent a set of values. In the case of integers, a symbolic variable may represent a range of numbers. In the case of pointers, a symbolic variable may represent a range of locations. The set of constraints on the symbolic variables along the current execution path is known as the path condition or path constraint.

When a control-flow-modifying instruction is encountered, the path condition must be checked to determine in which direction execution should continue. Consider the code in Figure 6.5. When the if statement is reached in a normal execution, control flows in exactly one of two directions depending upon the value of a. However, under symbolic execution, there is an additional possibility. If the value of a is not known,


int func(int a) {
    if (a < 3) {
        return a;
    } else {
        return 0;
    }
}

Figure 6.5. Data-dependent control flow modification

then the symbolic execution engine does not know which path should be taken. The only safe possibility is to explore both paths. This is known as an execution fork: the symbolic execution engine first updates its path condition to include the assumption that a < 3. It then executes the true path of the if statement. However, the symbolic execution engine must also investigate the false path of the if statement: the path condition is in this case updated to include the assumption that a >= 3. Thus as control-flow statements are encountered the amount of knowledge about variables in the program increases.
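A small Python sketch of this fork for the comparison in Figure 6.5 follows; it is illustrative only, and Currawong's engine tracks registers and richer constraints than shown.

    # Illustrative fork at "if (a < 3)"; the path condition is the list of
    # assumptions accumulated along the current path.
    def fork_on_less_than_3(path_condition, a):
        if isinstance(a, int):                        # concrete operand: no fork
            return [(path_condition, a < 3)]
        # Symbolic operand: explore both paths, each under an extra assumption.
        return [
            (path_condition + ['a < 3'], True),       # true branch
            (path_condition + ['a >= 3'], False),     # false branch
        ]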

Implementing a complete symbolic execution engine, let alone two, is a large research project in itself. Currawong therefore includes proof-of-concept engines capable of performing the task described in Section 5.8.3—that is, determining whether parameters passed to one API function represent the same object as those passed to a previous function. The advantage of using symbolic execution even in this constrained context is that additional data-flow verification support can be added to Currawong without requiring modification of any optimisation specification.

Currawong's data-flow search support does not require any additions to Match objects. It makes use of data match references (supplied from a Match object resulting from a structure search).

6.4. Supporting code transformation

Code transformation involves both reference and application, as discussed in Sections 5.5 and 5.8.4.

Figure 6.6. Code transformation process: the specification matches a template (Currawong produces the corresponding Match), applies transformations (Currawong records each code change), and ends (Currawong transforms the application and stores the changes to disk)

Figure 6.6 shows the transformation process. The specification begins by performing a template match using a match rule. This causes Currawong to produce a Match object relating to the match. The specification then applies one or more transformations, which are recorded by Currawong. After applying transformations, the specification ends. At this point, Currawong transforms the application by modifying its code and meta-data.

Reference is achieved through Match objects, using the four match reference types described above. When transformation rules are evaluated, objects representing the transformation are created. These objects are created by side effect when the rule is evaluated. If the optimisation specification is evaluated successfully, all transformation rules are used to rewrite the application in a language-specific way.

Objects which describe transformations contain the following information (a sketch follows the list):

• A description of the type of transformation: for example, to rename a function;

• A reference to a match object to which the transformation should be applied;

• Any transformation-specific information: the new function name, in the case of a function-renaming transformation.
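A minimal Python sketch of such a record is given below; the class and field names are assumptions, not Currawong's actual identifiers.

    # Hypothetical transformation record; records are accumulated by side effect
    # during evaluation and replayed only if the whole specification succeeds.
    class Transformation:
        def __init__(self, kind, match_ref, **details):
            self.kind = kind            # e.g. 'rename_function'
            self.match_ref = match_ref  # match reference identifying the target
            self.details = details      # e.g. {'new_name': 'onDraw_original'}

    pending = []
    pending.append(Transformation('rename_function',
                                  'com.example.Game.onDraw',   # placeholder reference
                                  new_name='onDraw_original'))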

6.5. Android-specific portions

In this section, the name Currawong Java is used to refer to Currawong with Android and Java extensions.

Android applications are written using a standard Java development environment consisting of the standard Java compilation tools (such as javac [Oracle 2010]) and the Eclipse IDE [Eclipse Foundation 2010a]. Applications for Android are compiled to a set of .class files, as is standard practice. After compilation, however, each .class file is translated from Java bytecode to Dalvik bytecode, a custom bytecode supported only by the virtual machine supplied as a core portion of Android, Dalvik (Figure 3.2). This translation is apparently performed due to licensing issues, rather than for any technical reason. (Dalvik files are designed to be more efficient than .class files on mobile hardware, but this translation could instead be performed on the mobile device—indeed, Dalvik files still do undergo further optimisation after installation on a device.)

After translation, the bytecode files—retaining the Java convention of a single file per class—are combined into a single uncompressed archive, named classes.dex ("dex" is a shortened form of "Dalvik Executable"). This file is itself placed into an archive, along with any resources required by the application (such as sound files or images). This archive, called an Android Package (APK), is signed by the developer, and can be installed onto an Android device.

Figure 6.7 shows Currawong with extensions for Android. Currawong is applied to APK files. To apply an optimisation, Currawong first extracts the Dalvik bytecode from the APK. It then disassembles the bytecode and parses it to create an application representation. The optimisation specification is evaluated and, if it results in changes to the application, Currawong modifies the bytecode files directly, either by adding classes, or by modifying already-present classes. Finally, Currawong re-generates the APK. In summary, Currawong includes the following extensions to handle Android applications:


Figure 6.7. Currawong implementation (Android extensions): the APK is unpacked and disassembled (A1) and stored as an app rep (A2); after evaluation and Android-specific transformation (A3), the result is assembled and packed (A4) into an optimised APK (A5)

1. Unpacking and disassembling, and subsequently assembling and re-packing (A1, A4 and A5 in the figure);

2. Creation of application representation, and support for static analysis of Android-specific byte codes (A2);

3. Support for transformation of Android applications (A3).

Figure 6.7 shows a high-level overview of this process. The Android-specific portions are discussed in detail below.

6.5.1. Unpacking, disassembling, and reassembling

Access to Dalvik bytecode is achieved through third-party utility programs. The APK file itself is a zip archive. Currawong uses the Dalvik bytecode disassembler Baksmali, and its corresponding assembler, Smali, to decode and encode Dalvik VM bytecode [JesusFreke 2010]. Invocation of these utility programs comprises the bulk of the work performed by the "Unpack" stage in Figure 6.7.
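The unpack and repack steps can be approximated with the Python sketch below; the jar names and command-line options are assumptions and vary between Baksmali/Smali releases.

    # Rough sketch of the unpack/repack flow (tool invocations are assumptions).
    import subprocess, zipfile

    def unpack(apk_path, out_dir):
        with zipfile.ZipFile(apk_path) as apk:
            apk.extract('classes.dex', out_dir)                  # APKs are zip archives
        subprocess.check_call(['java', '-jar', 'baksmali.jar',   # disassemble to .smali files
                               '-o', out_dir + '/smali', out_dir + '/classes.dex'])

    def repack(smali_dir, dex_path):
        subprocess.check_call(['java', '-jar', 'smali.jar',      # assemble back to classes.dex
                               '-o', dex_path, smali_dir])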


.method public constructor <init>()V
    .registers 1
    .prologue
    .line 38
    invoke-direct {p0}, Landroid/app/Activity;-><init>()V
    return-void
.end method

Figure 6.8. Baksmali’s disassembly of an automatically-generated constructor

Figure 6.8 shows an example of the code generated by Baksmali. This is an automatically-generated constructor method for an Android application. In Baksmali disassemblies, lines beginning with a dot control the assembler, or contain comments, so there are only two actual instructions in this snippet: invoke-direct and return-void. This constructor method simply calls the constructor of its class's parent, via invoke-direct, and then returns to the caller, via return-void. Notably, this disassembly retains a lot of information about the original code. Parameters and return values are enumerated and typed, and function invocations are specified by name.

In the "Store" stage, Baksmali's output is parsed by a custom parser, a process facilitated by the regular structure of the disassembly: each file produced by the disassembler represents a separate Java class. These are read from disk, and an in-memory hierarchy of objects, mirroring the namespace set out in the application, is created. The outermost level, Application, contains one or more Package objects, each of which contains Class objects. These in turn contain Fields (class-level global variables) and Methods. This hierarchy is illustrated by Figure 6.9. This figure shows Currawong's in-memory object hierarchy for a single class, containing one method and one field. The particular class shown is taken from the Lunar Lander example game provided with the Android software development kit.
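A minimal Python sketch of this hierarchy follows; the class and attribute names are assumptions based on the description, and Currawong's actual classes carry more state.

    # Hypothetical sketch of the in-memory application hierarchy.
    class Application:
        def __init__(self):
            self.packages = []           # Package objects

    class Package:
        def __init__(self, name):
            self.name = name             # e.g. 'com.example.android.lunarlander'
            self.classes = []            # JavaClass objects

    class JavaClass:
        def __init__(self, name, filename):
            self.name = name             # e.g. 'LunarLander'
            self.filename = filename
            self.fields = []             # class-level globals, e.g. 'MENU_EASY:I'
            self.methods = []            # Method objects

    class Method:
        def __init__(self, signature, code):
            self.signature = signature   # e.g. 'init()V'
            self.code = code             # disassembled instructions
            self.simulation = None       # built later, for data-flow analysis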

6.5.2. Application representation

Figure 6.9 also shows the way the disassembled Java classes are represented internally. Each method contains within it both the original code and a simulation. The original method is used for structure and control-flow search. It is also used for rewriting purposes. The simulation is used for data-flow analysis (Section 6.5.3).

6.5.3. Matching

Currawong Java implements structure, control-flow, and data-flow search. Currawong Java produces Match objects that directly reference both the application representation and the template object.

<Application>
    <Package> com.example.android.lunarlander
        <Class> LunarLander
            - File name
            - Disassembled non-method code
            <Field> MENU_EASY:I
            <Method> init()V
                - Disassembled code
                - Simulation

Figure 6.9. In-memory class hierarchy for a portion of the Android Lunar Lander game

Template objects are produced by the Java-specific parser. Internally they are a nested set of objects corresponding with the template in the optimisation specification. Objects used to represent a template include scope objects, which represent scopes such as modules, classes, and functions; call objects, which represent function calls; and parameter objects, which represent parameters to function calls. All template objects contain a scope object as the outermost object. For example, if the first level of the optimisation specification is a class declaration, the first scope object represents a templated class. Each object has domain-specific features:

• Class scope objects include references to their name and any classes they extend;

• Function scope objects include the function name, return type, and a list of parameters;

• Call objects include the called function's name, the return type, and a list of parameters;

• Parameter objects include the parameter’s name and type.

An associative array mapping variable names to appropriate templated objects is constructed and associated with the template.

Structure search

The goal of structure search is to produce the four types of match reference required by the optimisation specification for verification and rewrite purposes. These reference types, described in Section 6.3.1, comprise references on scopes, invocations, data, and variables.


Currawong supports scope match references on class names and function names. To find matching scopes, Currawong recursively descends the scopes of the application representation (Figure 6.9) until it finds a scope which matches the outermost templated scope. At this point, Currawong produces a Match object. The match object is filled with information in the following way:

1. Each nested scope in the template is checked against the appropriate nested scope in the application. If a scope (i.e. a class or a function declaration) is specified in the template but is not found in the application, the Match object is discarded and structure search at this location fails.

2. Whenever a child scope is encountered, its description is added to the Match object. Class scopes and method scopes are currently added.

3. Invocation and data match references are added to the appropriate scopes. To discover these references, the disassembled method code is scanned, and method calls are detected.

4. When scope references, invocation references, and data match references are added, a check is made against the template to see if the relevant portion of the template makes use of a variable match reference (that is, a name prefixed with a dollar sign). If a variable match reference is present, an entry is created inside the Match object, referring to the appropriate scope, invocation, or data match reference.

Control-flow search

Control-flow search operates at the level of functions and is based on a control-flow object called the reachable set. To build the reachable set for a given control-flow search, the following algorithm is used (a Python sketch follows the list):

1. Start with an empty set named reachable-from, which lists scopes which can be reached from the calling scope through function calls; and an empty queue, the work queue.

2. Add the start object, which should be a function-level scope, to the work queue.

3. Take an item from the work queue. Add the item to the reachable-from set.

4. Get a list of outgoing function calls from the item.

5. For each outgoing function call, find the appropriate scope object in the application representation and add it to the work queue, as long as it is not already in reachable-from.

6. Continue until the work queue is empty.

7. The resulting set reachable-from constitutes the set of scopes accessible from the start object.
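The algorithm transcribes directly into Python; in the sketch below, outgoing_calls and find_scope are placeholders for Currawong's actual lookups.

    # Sketch of reachable-set construction (helper functions are placeholders).
    from collections import deque

    def outgoing_calls(scope):
        # Placeholder: the function calls made within this scope's code.
        return getattr(scope, 'calls', [])

    def find_scope(application, call):
        # Placeholder: resolve a called name to a scope object, or None if external.
        return application.get(call)

    def build_reachable_set(start_scope, application):
        reachable_from = set()
        work_queue = deque([start_scope])
        while work_queue:
            item = work_queue.popleft()
            reachable_from.add(item)
            for call in outgoing_calls(item):
                target = find_scope(application, call)
                if target is not None and target not in reachable_from:
                    work_queue.append(target)
        return reachable_from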


The reachable set is used in control-flow search. As described in Section 5.8.2, control-flow search requires a template object indicating the target of the match, as well as a "predecessor" match object indicating the object prior to this one.

To perform control-flow search, a reachable set is built where the start object corresponds with the function-level scope of the predecessor Match object. A structure search is then performed using the supplied Template. The result is accepted if and only if the resulting top-level scope is in the reachable set.

Data-flow search

To construct the simulation of a method, Currawong begins by examining each instruction in the disassembled code. Instructions are similar to assembly language: each instruction is composed of an opcode and zero or more arguments, where an opcode represents one of the instructions in the Dalvik virtual machine's instruction set. For each instruction in the disassembly, Currawong creates a simulation of that instruction. An instruction simulation consists of a reference to a function which implements the behaviour of the instruction.
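A sketch of this arrangement in Python is given below; the handler table is hypothetical and covers only a few opcode classes.

    # Hypothetical sketch: an instruction simulation pairs an opcode's handler with
    # its arguments; handlers update the symbolic register file.
    UNKNOWN = object()                     # the symbolic "unknown value"

    def sim_move(regs, dst, src):
        regs[dst] = regs[src]              # copy regardless of contents

    def sim_const(regs, dst, value):
        regs[dst] = value                  # concrete value

    def sim_invoke(regs, *affected):
        for r in affected:                 # calls are not followed
            regs[r] = UNKNOWN

    HANDLERS = {'move': sim_move, 'const': sim_const, 'invoke-virtual': sim_invoke}

    def simulate_instruction(opcode, args):
        return (HANDLERS[opcode], args)    # one instruction simulation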

To perform data-flow analysis, Currawong uses symbolic execution. The implementation of symbolic execution in Currawong does not improve meaningfully on any existing symbolic execution tool—in fact, in many ways it is significantly less well-developed. This is not a disadvantage: the proof-of-concept symbolic execution support implemented here is sufficient to demonstrate that symbolic execution is a reasonable option for data-flow analysis in Currawong, and is sufficient to verify optimisations in simple examples (see Chapter 7).

To perform symbolic execution, a symbolic register set is constructed of appropriate size for the matched function. Each opcode is then examined in the function simulation in sequence. The classes of supported opcodes, and their effects on registers, are described below.

Move instructions: copy registers. Registers are copied without regard to their contents.

Return instructions: terminate the execution.

Constant declarations: set the value of a given register to a concrete value.

Object creation functions: set the value of the given register to the unknown value (in symbolic execution terms, the "unknown value" indicates that a given variable could have any possible concrete value).

Comparisons: see below.

Array and object field access functions: set the value of the given register to the unknown value.

Invocations: set the value of each affected register to the unknown value (calls are not followed).


Other opcodes: are ignored.

Comparisons always involve two registers. When a comparison is encountered, the engine must decide which control-flow path to take. The decision is made according to the following rules (a sketch of the final rule follows the list):

• If the comparison does not involve integers, examine both paths.

• If both operands are concrete, perform the comparison concretely and take the appropriate single path.

• If one or both operands are symbolic, attempt to decide which path should be taken, by first attempting to prove that the comparison is true, and then by attempting to prove that it is false. Currawong makes use of the python-constraint constraint solver in order to make this decision [Labix 2010]. If the truth or falsity of the comparison can be proved, the infeasible path is omitted from the execution.
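The final rule can be sketched with python-constraint as follows; the sketch is illustrative, and the way Currawong encodes symbolic values into the solver is not shown.

    # Sketch: try to prove that a comparison always holds (or never holds) for a
    # symbolic integer whose possible values are known only as a domain.
    from constraint import Problem

    def always_holds(domain, predicate):
        # The predicate always holds iff no counterexample exists in the domain.
        problem = Problem()
        problem.addVariable('a', list(domain))
        problem.addConstraint(lambda a: not predicate(a), ['a'])
        return problem.getSolution() is None

    a_domain = range(0, 10)                                      # assumed knowledge about a
    take_true_only = always_holds(a_domain, lambda a: a < 3)     # False: a may be >= 3
    take_false_only = always_holds(a_domain, lambda a: a >= 3)   # False: a may be < 3
    # Neither direction is provable here, so both paths must be explored.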

6.5.4. Transformation and output

Transformation in Currawong Java consists of performing refactorings, and adding code.

Performing refactorings

Refactorings, such as method renaming, are performed on the disassembled application. Both method definitions and method invocations have distinctive, fully-typed signatures which can be recognised via purely-textual means. Modifications are made in-place to the in-memory representation of the code.

Adding code

Currawong adds code indirectly, by adding packages to the application, and directly, through the use of the Application.Merge rule. Indirect addition of code is a trivial process: no modification to the application is needed, because Java includes a mechanism for packages to self-initialise when they are loaded. Currawong therefore simply copies the named package to the application's APK file when re-building.

Currawong supports translation of a small number of Java instructions to Smali assembly language in order to implement direct addition of code to modules. Currently supported are the ability to add fields to classes (to implement class variables), the ability to add functions to classes, and the ability to call functions.

Generating output

Once the application has been rewritten in memory, Smali assembly language files for each class are written to disk. Currawong uses the Smali assembler to convert these files to a classes.dex file suitable for Dalvik.


The final task Currawong must perform is to sign the code. Android preserves application integrity through code signing. When a class has been modified, the application must be re-signed before it can be used. Currawong can apply a signature automatically—currently a special application developer signature is applied. This re-signing has negative practical implications. In particular, updates to the application cannot be automatically applied. A simple potential solution to this problem is to provide a way for the optimiser user to indicate to the update mechanism that the signature mismatch is due to application of an optimisation, and thus the application should still be considered for updates.

6.6. CAmkES-specific portions

In this section, the name Currawong C is used to refer to Currawong with CAmkES and C extensions. In CAmkES, a system is specified using a domain-specific Architecture Definition Language (ADL). The ADL determines which components comprise a system, and how the components should communicate. Figure 6.10 shows an overview of Currawong C.

Figure 6.10. Currawong implementation (CAmkES extensions): the ADL (C1) is parsed (C2) alongside the component binaries; evaluation incorporates verification of CAmkES components (C3); and transformation includes ADL transformation (C4, C5), producing new ADL and components

Unlike Android applications, CAmkES applications (more specifically, Binary CAmkES applications) are compiled directly to machine-specific code. This means that an architecture optimiser must be capable of reading the binary files which contain objects, performing verification of these objects, and changing their behaviour.

The additions to Currawong to support CAmkES and C are labelled with the letter "C" in the figure:


1. ADL (C1) is parsed (C2) in addition to the usual component loading;

2. Determining the suitability of an optimisation includes extensions for verification of CAmkES components (C3);

3. Transformation requires domain-specific knowledge of the ADL (C4 and C5).

Figure 6.11. The CAmkES assembly process: ADL and components pass first through optimisation (Currawong), and then through the CAmkES assembly step, which produces a system

Figure 6.11 describes the role of Currawong with respect to the CAmkES system assembly process. CAmkES acts as a type of linker, combining component implementation files according to the description in the ADL. Currawong executes before the CAmkES assembly process occurs.

Currawong is currently limited to supporting the ARM processor, because this is the processor in the test environment used to evaluate Currawong (more information about the particular environment is given in Chapter 7).

6.6.1. Unpacking and application representation

Figure 6.12. A Binary CAmkES component (ELF file): header, code, data, and Currawong info, with symbols including BCAMKES_INTERFACES, BCAMKES_INCOMING, BCAMKES_OUTGOING, and encode_frame

Binary CAmkES components are stored as object (.o) files in ELF format [TIS Committee 1995]. Each component object has a fully-qualified name using a reverse-DNS scheme. That is, a component author assigns a name to her component by coming up with a short name, appending it to a domain name under her control, and reversing the result. For example, the codec component produced by the owners of example.com would be named com.example.codec.o. This naming scheme provides a global guarantee of uniqueness.

CAmkES stores information relevant to the component system in the component files themselves. Figure 6.12 illustrates the additional information stored by CAmkES inside component files. Each component includes three extra symbols, BCAMKES_INTERFACES, BCAMKES_INCOMING, and BCAMKES_OUTGOING. These symbols reference data also stored within the object file. These three symbols define the component's interfaces, as described in Section 3.1.

Currawong C builds an application representation by examining the object file's symbol table. The symbol table provides a list of functions defined by the component, as well as a list of functions the component expects to be able to call. ELF files can contain multiple symbol tables. Typically, components contain at least two: an information-rich symbol table, suitable for static linking and for debugging; and a minimal symbol table, suitable for dynamic linking. Currawong uses this latter table only, as the former one may not be present.
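The same information can be obtained with an off-the-shelf ELF library; the sketch below uses pyelftools, which is an assumption—the dissertation does not name the ELF reader Currawong uses.

    # Sketch using pyelftools (an assumption, not necessarily Currawong's approach).
    from elftools.elf.elffile import ELFFile

    def component_symbols(path):
        with open(path, 'rb') as f:
            elf = ELFFile(f)
            # '.dynsym' is assumed here to be the minimal symbol table described above.
            symtab = elf.get_section_by_name('.dynsym')
            return [sym.name for sym in symtab.iter_symbols() if sym.name]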

6.6.2. Matching

Currawong C, as with Currawong Java, supports structure search, control-flow search, and data-flow search.

Production of Match objects follows a process similar to that of Currawong Java, but the method through which information is gathered about a component differs. These differences vary depending on the type of verification being performed.

Structure search

Currawong C supports the four types of reference described in the system-agnostic portion of the design, in Section 6.3.1. However, in addition to examining individual components in order to determine structure, Currawong C also examines the ADL specification describing the component system's layout. As discussed in Chapter 3, ADL describes a component-based system in terms of components and the connections between them.

The following two scopes are supported:

• Component scope: This is a CAmkES scope. As discussed in Chapter 3, each component in binary CAmkES has a unique name. A simple syntactic transformation of the component's file name is used as the component's fully-qualified name.

• Function scope: This is a C scope. To find functions within a file, the component object's symbol table is scanned. This produces both a list of function names (for matching purposes) and a set of pointers to binary code (for symbolic execution).

Currawong produces invocation match references for C code by scanning the binary code discovered during scope search. Specifically, Currawong searches the code for instances of the bl and blx (branch and link) instructions, which are used to call functions.


Currawong only searches for branches to functions which are mentioned in the component's symbol table. The intent behind this approach is that functions referred to in the component's symbol table are those functions which are externally visible or implemented in other modules; that is, functions which comprise the publicly accessible API of the component, or which comprise the API of another component upon which the present component relies.

Data match references are produced from the list of invocation match references as well as the list of function scopes.
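
As a rough illustration of this kind of call-site scan, the sketch below decodes ARM bl and blx instructions and computes their targets. It is not Currawong's code: the masks follow the ARM instruction-set encoding, Thumb code is ignored, and the callback and symbol filtering are assumptions for the example.

    /* Illustrative sketch only: find ARM bl/blx call sites in a block of
     * 32-bit ARM code and report their targets. */
    #include <stdint.h>
    #include <stddef.h>

    static int32_t branch_offset(uint32_t instr)
    {
        /* Sign-extend the 24-bit immediate and convert to a byte offset. */
        return ((int32_t)(instr << 8) >> 8) << 2;
    }

    void scan_calls(const uint32_t *code, size_t bytes, uint32_t base,
                    void (*found)(uint32_t call_site, uint32_t target))
    {
        for (size_t i = 0; i < bytes / 4; i++) {
            uint32_t instr = code[i];
            uint32_t pc = base + i * 4;
            int is_bl  = (instr & 0x0F000000u) == 0x0B000000u; /* bl<cond>       */
            int is_blx = (instr & 0xFE000000u) == 0xFA000000u; /* blx, immediate */
            if (is_bl || is_blx)
                found(pc, pc + 8 + branch_offset(instr));  /* ARM pc = addr + 8 */
        }
    }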

Control-flow search

Unlike Currawong Java, which builds a separate reachable set for control-flow search, Currawong C uses its symbolic execution engine to perform control-flow search. This is because component binary files do not contain enough information to trivially build a reachable set. In practice, this is not an issue, because the only control-flow search that is required of binary components tends to be in the context of data-flow analysis.

Data-flow search

Like Currawong Java, Currawong C implements a proof-of-concept symbolic execution engine. The start and end points of execution are defined by invocation match references. Currawong implements a somewhat abstract machine model: registers are modelled, but memory accesses are not, with the result that loads from memory result in the associated register being assigned the unknown value. This approach works for small loops and for consecutive functions which are not separated by a large amount of code (and for which, therefore, data remain in registers).

As per the Java implementation, Currawong C models integers, which are normally the basis for loops. It can also determine whether a pointer returned from (or passed to) one function is the same as a pointer returned from (or passed to) another function. However, Currawong's implementation cannot track more complex properties, such as "the data in a specific region has not changed between two calls".
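
As an illustration of the machine model described above, the following is a minimal sketch of one plausible shape for the abstract register state; the type names and helpers are invented for the example and are not Currawong's.

    #include <stdint.h>

    /* Each ARM register holds either a known constant, a symbolic pointer
     * identity, or the unknown value. */
    enum abs_kind { ABS_UNKNOWN, ABS_CONST, ABS_PTR };

    struct abs_val {
        enum abs_kind kind;
        uint32_t value;          /* constant, or symbolic pointer id */
    };

    struct abs_state {
        struct abs_val r[16];    /* r0-r15 */
    };

    /* Memory is not modelled, so a load simply makes the destination
     * register unknown. */
    static void model_load(struct abs_state *s, int rd)
    {
        s->r[rd].kind = ABS_UNKNOWN;
    }

    /* Two registers denote the same pointer only if both carry the same
     * symbolic id: enough to check that the pointer passed to one call is
     * the pointer returned by another. */
    static int same_pointer(const struct abs_state *s, int ra, int rb)
    {
        return s->r[ra].kind == ABS_PTR && s->r[rb].kind == ABS_PTR
            && s->r[ra].value == s->r[rb].value;
    }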

6.6.3. Transformation and output

Currawong C implements transformations in a similar way to Currawong Java, by storing the set of transformations and applying them when evaluation of the optimisation specification succeeds. However, Currawong C supports a different set of transformations to Currawong Java:

Renaming functions: Currawong C renames functions by modifying the symbol table of the component defining the function, i.e. without modifying code in any way (a sketch of this kind of symbol-table edit is given after this list).

Renaming function calls: A new symbol is added to the affected component's symbol table. This symbol is marked as not defined within the component object. Code referencing the old symbol (containing the original function name) is updated to reference the new symbol.


Interposing components: The new component is added to the ADL, and connections described by the ADL between the two original components are modified to include the new, interposed component.

After modification, Currawong writes new versions of the components and ADL to disk.
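
The sketch below shows the general idea behind the first transformation: renaming a defined function by editing the string its ELF symbol points at, leaving the code untouched. It is illustrative only (not Currawong's code) and assumes a writable, 32-bit little-endian object image and a new name no longer than the old one; a real implementation would instead grow the string table and adjust offsets.

    #include <elf.h>
    #include <string.h>

    int rename_symbol(unsigned char *image, const char *old, const char *new_name)
    {
        Elf32_Ehdr *eh = (Elf32_Ehdr *)image;
        Elf32_Shdr *sh = (Elf32_Shdr *)(image + eh->e_shoff);

        for (int i = 0; i < eh->e_shnum; i++) {
            if (sh[i].sh_type != SHT_SYMTAB)
                continue;
            Elf32_Sym *sym = (Elf32_Sym *)(image + sh[i].sh_offset);
            char *strtab   = (char *)(image + sh[sh[i].sh_link].sh_offset);
            int nsyms      = sh[i].sh_size / sizeof(Elf32_Sym);

            for (int j = 0; j < nsyms; j++) {
                char *name = strtab + sym[j].st_name;
                if (strcmp(name, old) == 0 && strlen(new_name) <= strlen(old)) {
                    strcpy(name, new_name);  /* only the name changes; code is untouched */
                    return 0;
                }
            }
        }
        return -1;  /* not found, or new name too long for an in-place rewrite */
    }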

6.7. Example Java optimisation

This section describes a Java-based architecture optimisation. Consider a callback-based event-processing API for mouse movements. To use this API, applications provide a class that implements a special interface, MouseEventHandler, through which mouse movement events are delivered. When the application has finished processing an event, it calls the special method next(), which informs the event processor that the application is ready to receive more events. An example of code making use of this interface is shown in Figure 6.13.

public class Handler implements api.MouseEventHandler {
    public void mouseEvent(Context c, Event e) {
        ...
        c.next();
    }
}

Figure 6.13. Example application using the MouseEventHandler API

Suppose that a more efficient API was created, which would only be used by applications which always called next in the function that handled the mouse event. To use the new API, applications should call efficient_next rather than next.

MatchEvent is Java {
    class $ClassName implements api.MouseEventHandler {
        public void mouseEvent(Context $Context, Event $Event) {
            $Context.next();
        }
    }
}

optimise(mouseevents, App) is
    Match = App.match(MatchEvent),
    App.rename_call(Match.feature("$ClassName.mouseEvent"),
                    "$Context.next", "$Context.efficient_next()").

Figure 6.14. Optimisation specification for the MouseEventHandler optimisation

Figure 6.14 shows the optimisation specification. The Match template closely resembles the code to be matched. The optimisation specification first performs the match, and then renames the method call.

To perform this optimisation, the following steps take place:

1. Application representation: The application is loaded and an application representation is generated;

2. Match object generation: A Match object is generated corresponding to the MatchEvent template in Figure 6.14;

3. Method renaming: The method is recognised and renamed;

4. Application rewriting: The application is modified and written to disk.

These steps are discussed individually.

6.7.1. Application representation

The application is supplied as an Android .apk file. This file is unpacked and disassembled. Figure 6.15 shows the resulting scope object.

<Application>
  <Package> com.example.mouseevents
    <Class> Handler
      - File name: Handler.smali
      - Disassembled non-method code
      <Method> mouseEvent(Context c, Event e)V
        - Disassembled code
        - Simulation

Figure 6.15. The MouseEventHandler example application, internal representation

The disassembled code is stored within the scope object. Figure 6.16 shows the disassembly for the function shown in Figure 6.13.

6.7.2. Match object generation

Currawong generates a Match object according to the procedure described in Section 6.3. The resulting Match object for this application is shown in Figure 6.17. In this figure, numbers within angle brackets represent distinct objects. The first portion of the Match object resembles the scope hierarchy of the actual Application, and is used to provide support for the Match.feature/1 rule. The second portion of the Match object contains all the variables supplied in the template, as well as the corresponding classes to which they relate.

Scope: class <1>
  - Name: Handler
  - Extends: None
  - Implements: api.MouseEventHandler

  Scope: function <2>
    - Name: mouseEvent
    - Returns: void
    - Parameters: [com.example.Context <3>, com.example.Event <4>]

    Call <5>
      - Name: <3>.next
      - Line: 6

Variables:
  ClassName: <1>
  Context: <3>
  Event: <4>

Figure 6.17. The MouseEventHandler match object


.method public mouseEvent(Lcom/example/Context;Lcom/example/Event;)V
    .registers 3
    .parameter "c"
    .parameter "e"

    .prologue
    .line 6
    invoke-virtual {p1, p2}, Lcom/example/Context;->next(Lcom/example/Event;)V

    .line 8
    return-void
.end method

Figure 6.16. Disassembled code for the MouseEventHandler example application

6.7.3. Method renaming

Once the Match object has been generated, method renaming can take place. Evaluation of the rule results in an object being added to the logic database which contains data similar to that shown in Figure 6.18. Note that the numbers in angle brackets here refer to object references contained within the Match object of Figure 6.17.

Modification: rename-call
  - Match: <Reference to the Match object>
  - Call: <5>
  - New name: <3>.efficient_next()

Figure 6.18. The method rename description

6.7.4. Application rewriting

When the optimisation specification has been completely evaluated, the application is rewritten. In this case, the rename-call Modification object is examined. To rename the call, the disassembled method is modified in place. Currawong knows which line of the disassembly to modify, because it is contained within the Call object, which is itself part of the Match object, which is referenced from the Modification object.

The rewritten assembly language code is written back to disk, the application is assembled using the Smali assembler, the resulting file is added to an Android application package (.apk), and the new package is signed.

6.8. Discussion

This implementation of Currawong satisfies the requirements of an architecture optimiser as described in Chapter 5: it supports multiple languages, operates on binary files, and is capable of applying architecture-level optimisations after applying a number of techniques to ensure that the optimisation will behave as expected. Although it is a prototype implementation, a number of powerful techniques are supported.

Currawong uses symbolic execution to implement data-flow verification. The advantage is that the engine can be continuously extended and improved without requiring modifications to the optimisation specification. However, it is by no means the only choice. One interesting alternative is to support data-flow verification at run time: data-flow conditions would be compiled to the target language (Dalvik or binary code) and inserted as verification conditions before the transformed code. The advantage of this approach is that data-flow verification that would be difficult or impossible to perform statically could be possible if performed dynamically. One can imagine an architecture optimiser making use of a hybrid scheme: static analysis where possible, falling back to dynamic analysis.

Even if Currawong continues to use symbolic execution, a strong case can be made that Currawong should use existing symbolic execution engines, such as Java PathFinder [Havelund and Pressburger 2000] or EXE [Cadar et al. 2008]. This suggestion is difficult to argue with: using an existing execution engine would give Currawong better data-flow search capabilities. Unfortunately, integrating existing symbolic execution engines was ultimately rejected due to a lack of readily-integrable code. Although many engines exist for Java code, none exists for Dalvik code. Converting from Dalvik to Java, while certainly achievable, is non-trivial (Dalvik is register-based while Java is stack-based, for example). Similarly, several symbolic execution engines exist for C source code, but the author found no publicly available symbolic execution engines for ARM binaries while implementing Currawong. Recent (2009) work by Chipounov et al. indicates promising progress in the area of symbolic execution on the QEmu system simulator [Chipounov et al. 2009]. Their work is currently limited to the x86 architecture, but could be extended to other architectures with relative ease.

Despite the potential for improvements to Currawong's data-flow analysis, the current system, as it stands, is a capable architecture optimiser, performing a wide variety of architecture optimisations in two languages and across two entirely different systems. The next chapter demonstrates Currawong's capabilities as applied to a variety of CAmkES and Android-based applications.


7. Evaluation

In considering any new subject, there is frequently a tendency, first, to over-rate what we find to be already interesting or remarkable; and, secondly, by a sort of natural reaction, to undervalue the true state of the case.

– Ada Augusta [Menabrea and Augusta 1842]

Can Currawong be used to optimise real systems? In this chapter, Currawong is applied to applications running on two very different systems: a strongly component-oriented research system and a real-world system. The results show that Currawong is capable of high-performance architecture optimisation of both systems.

7.1. Introduction

An architecture optimiser must meet the design guidelines outlined in Section 5.1: that is, it should recognise anti-patterns, apply remedies, work without source code, and support multiple languages. Chapter 6 described Currawong, which meets these criteria. Meeting design goals is not the same as being effective, however. This chapter demonstrates Currawong's effectiveness at performing real architecture optimisations.

This chapter has three goals. The first goal is to demonstrate that Currawong can perform a variety of optimisations across multiple systems. To demonstrate this, Currawong is applied to two systems. The first system implements the architecture described in Section 3.3.1, the componentised video player, on the CAmkES component-based system. The second system is the Android mobile operating system: optimisations are applied to a variety of Android applications. This division demonstrates Currawong's performance both in an ideal componentised system (CAmkES) and on publicly available, commercial-quality code (Android).

The second goal of this chapter is to demonstrate that the type of optimisations that Currawong can perform are worthwhile. Unfortunately there is no simple answer to the question "what degree of performance improvement is worthwhile?" The answer depends on the application and on the platform. For example, a 5% performance improvement to the main loop of a fast-paced action game may be worthwhile simply in terms of increased run-time on a battery-operated device; but a 5% performance improvement to the configuration panel of an application is probably not worthwhile.

Nonetheless, I set some intuitive lower bounds for whether or not a performance improvement was worthwhile for these tests. All performance improvements measured here are active for the majority of the application's runtime: componentised video player performance improvements are active whilst playing video, and Android performance improvements are active for the entire duration of the application. Furthermore, most of the performance improvements benchmarked here apply to applications which display video, and result in reducing the amount of data, specifically image data, that must be processed (either by copying, or by transformation). Thus most of the performance improvements here would improve the frame rate of the application, if it were CPU-bound, or reduce the application's CPU usage, if it were not. A 10% performance improvement for a CPU-bound application running at 30 frames per second could increase the frame rate by three frames per second. Intuitively, this seems to represent the lower bound of "worthwhile".

The third and final goal of the chapter is to demonstrate the limitations of Currawong, or, at least, to place it properly in context. Some of the demonstrations below fail the significance test, but would probably pass it if placed in a different context.

7.1.1. Test hardware

Experiments were run on an HTC Dream smartphone, also known as the ADP1 or G1 [HTC Corporation 2010b]. This phone includes two processor cores: an ARM 9 and an ARM 11. The ARM 9 implements what is known as "baseband" functionality: low-level interaction with the various mobile radios on the phone in order to provide basic phone service. The ARM 11 core implements all other functionality: this is the core on which the Android operating system, and all applications, run. Both cores run at a maximum speed of 528 MHz. The phone incorporates an LCD running at a resolution of 320 by 480 pixels with 16-bit colour.

For the CAmkES tests, the phone ran on the OKL4 microkernel, version 3.0 [OK Labs 2010]. This version of the microkernel was ported to the HTC Dream smartphone by the author [NICTA 2010]. For the Android tests, the phone ran a standard Android 1.6 system image as supplied by the manufacturer [HTC Corporation 2010a]. The Linux kernel was modified to provide convenient access to the phone's cycle counter, but was otherwise not changed.

7.1.2. Methodology

To perform the tests, each application was started from a freshly-booted phone. Each experiment ran for several seconds, and experiments were repeated to ensure that the result was consistent. Results are reported in terms of millions of CPU cycles taken. This number was obtained from the CCNT register of the ARM processor's performance monitoring unit.

An interesting quirk of this particular phone is that the ARM9 and ARM11 cores share access to the memory bus, with the result that code executing on the ARM9 can impact the performance of the ARM11. This problem was mitigated by not running tests during the first twenty seconds after boot (because testing showed that this was when the ARM 9 was most active) and by verifying that the results of multiple runs of the same test showed little variance.
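
For reference, reading the cycle counter on an ARM11 core amounts to a single coprocessor read. The sketch below is illustrative rather than the code used in these experiments, and assumes the kernel has enabled user-mode access to the counter, as described above.

    /* Read the ARM11 (ARMv6) cycle counter, CCNT, from coprocessor 15. */
    static inline unsigned int read_ccnt(void)
    {
        unsigned int cycles;
        __asm__ volatile("mrc p15, 0, %0, c15, c12, 1" : "=r"(cycles));
        return cycles;
    }

    /* Cycles consumed by a region of interest, modulo counter wrap-around. */
    unsigned int measure(void (*workload)(void))
    {
        unsigned int before = read_ccnt();
        workload();
        return read_ccnt() - before;
    }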


7.2. CAmkES

The three optimisations implemented for CAmkES were designed to replicate potential real-world optimisations. However, the system on which they were implemented is a prototype. The absolute numbers presented here are, therefore, less important than the capability being demonstrated.

[Figure omitted from extraction: the Client component connected to the FileSystem, Decoder, and Display components through FileSystem, Codec, and FrameBuffer connectors respectively.]
Figure 7.1. Componentised video player

The CAmkES examples are based around the componentised video player, introduced in Chapter 3. The video player's architecture is reproduced in Figure 7.1.

Currawong's support for architectural optimisation of component systems is demonstrated through three examples, each of which starts with the basic componentised video player (Figure 7.1) and transforms it by merging protection domains, replacing components, or interposing components. Table 7.1 summarises the examples and their purpose. The first column of the table lists anti-patterns; the second column ("Example") names the example below which addresses the anti-pattern; and the final column ("Remedy") describes the remedy employed by the example.

Anti-pattern          Example                     Remedy
Context switching     Same-domain decoder         Merge protection domains
Copying               Eliminate RGB conversion    Replace component
Overly-generic API    New file system API         Interpose component
Unsuitable data       New file system API         Interpose component
Reprocessing          Eliminate RGB conversion    Replace component

Table 7.1. Summary of CAmkES-specific examples

The first example, same-domain decoder, addresses a possible context switching anti-pattern by combining the protection domains of two components. The second example, eliminate RGB conversion, deals with unnecessary data reprocessing and copying by replacing the Decoder component. The final example, new file system API, addresses issues with an overly-generic API and unsuitable data structures by interposing a component between the Client and the FileSystem components.

7.2.1. Same-domain decoder

[Figure omitted from extraction: (A) Client and Decoder in separate protection domains, connected by an IPC connector of type com.example.Decoder.Codec; (B) Client and Decoder sharing a protection domain, connected by a direct connector of the same type.]
Figure 7.2. Protection domain merging in the componentised video player

This optimisation, introduced in Section 5.2.1, merges the protection domains of two components.

The CSL used to perform the optimisation is shown in Figure 7.3. The specification first searches for a component that matches the template shown in MatchDecoder (lines 1 to 12 in the figure). This specification makes use of CAmkES-specific extensions to the templating language to indicate architectural features of the system. The template matches a component which connects to a specific Decoder component, specified using a fully-qualified name, and which makes use of the decode function supplied by the functional connector between the two components. After this match is found, Currawong verifies that the two components are in separate protection domains (line 16). If they are, the specification instructs Currawong to combine the protection domains (line 17).

The system-level communication interface used for the functional connectors (that is, whether to use IPC or direct function calls) is not specified: selection of an appropriate method is left to Currawong. If the components share a protection domain, direct function calls are used. If they do not, IPC is used.

 1 MatchDecoder is C {
 2   component com.example.Decoder $Decoder;
 3
 4   component $Client {
 5     connector com.example.Decoder.Codec $Codec;
 6     connect $Codec to $Decoder;
 7
 8     anonymous $Func {
 9       $Codec.decode(_);
10     }
11   }
12 }
13
14 optimise(mergedecoder, App) is
15   Match = App.match(MatchDecoder),
16   App.DisjointComponentPDs(Match.Client, Match.Decoder),
17   App.AddToPD(Match.Decoder, Match.Client).

Figure 7.3. The "Merge protection domains" optimisation specification

This example demonstrates the benefits of a declarative approach to specification. This specification relies purely on structure search, and in fact a large amount of searching is performed: producing a template requires a search across components in Currawong's ADL representation, and ensuring that PDs are disjoint is a unification. However, this searching is not made explicit in the specification, for an overall improvement in specification readability.

The optimisation was tested by passing 160 frames of data from the Client to the Decoder for decoding. The Decoder simply copied the data to an output buffer, which was passed back to the client via shared memory. Thus the only operation being tested was the cost of memory copying and IPC overhead. This procedure was repeated thirty times.

The results are shown in Table 7.2. For each test, the value presented is the average of thirty runs. This value represents millions of CPU cycles counted. A chi-square test for uniform distribution [...]. "Switches" records the total number of context switches performed in the case where Client and Decoder components reside in separate protection domains. The "Separate-PD" column represents the original case, where Client and Decoder components are separated. The "Same-PD" column represents the optimised case, in which Client and Decoder reside in the same protection domain. "Speedup" records the percentage difference between the separate-protection-domain and same-protection-domain tests, where 100.00 is exactly the same speed. Each test was run multiple times with different buffer sizes; large buffers result in less communication between Client and Decoder. These results show that the benefit due to optimisation in this case is negligible.

Buffer size   Switches   Separate-PD / stddev   Same-PD / stddev   Speedup %
1 frame       320        228.34 / 0.03          226.32 / 0.02      100.89
2 frames      160        227.08 / 0.01          226.13 / 0.02      100.42
4 frames      80         225.49 / 0.02          225.02 / 0.01      100.21
8 frames      40         223.00 / 0.01          222.72 / 0.02      100.13

Table 7.2. The "Merge protection domains" optimisation, results

Discussion

Performing this particular optimisation on this particular system is not worthwhile. However, this "null result" is still interesting, as far as architecture optimisation is concerned. Combining protection domains was identified in Chapter 4 as a method that many domain-specific optimisation techniques use to improve performance: why is it so uninspiring here?

The problem lies in separating an optimisation remedy from its environment. The OKL4 microkernel platform on which CAmkES is based is very efficient at switching between protection domains. Other systems may have different priorities. Android, for example, is almost two orders of magnitude slower than OKL4 at performing IPC. A ping-pong test, in which a message is sent from one component to another, and then a reply is sent back to the original component, takes 1 592 CPU cycles on OKL4, but 95 053 CPU cycles on Android [Hills 2009]. Eliminating context switches due to IPC on Android is therefore a worthwhile goal. This possibility is explored further in Section 7.3.1.

Overall, the message from this null result is that context is important: optimisations that may be worthwhile for one system are ineffective on another.

7.2.2. Eliminate RGB conversion

[Figure omitted from extraction: (A) Client connected to Decoder and Display via Codec and FrameBuffer connectors; (B) the Decoder replaced by DecoderYUV and the FrameBuffer connector replaced by FrameBufferYUV.]
Figure 7.4. Component replacement in the componentised video player

The RGB conversion elimination optimisation was introduced in Section 5.2.2. This optimisation cannot rely entirely on structure search. If the Client component makes use of the data supplied by the Decoder, then the optimisation cannot be performed: the Client would not understand the new data format. Therefore, data-flow search is used to ensure that the Client simply passes data from Decoder to Display, without storing it or accessing it.

Figure 7.5 shows the complete optimisation specification. This specification is rather long and warrants some explanation:

 1 MatchDecoderDisplay is C {
 2   component com.example.Decoder $Decoder;
 3   component com.example.Display $Display;
 4
 5   component $Client {
 6     connector com.example.Decoder.Codec $Codec;
 7     connect $Codec to $Decoder;
 8     connector com.example.Display.FrameBuffer $FrameBuffer;
 9     connect $FrameBuffer to $Display;
10
11     anonymous $Func {
12       $Codec.decode($Frame1);
13       $FrameBuffer.update($Frame2);
14     }
15   }
16 }
17
18 optimise(norgb, App) is
19   Match = App.match(MatchDecoderDisplay),
20   Match.Frame1 = Match.Frame2,
21   RestrictAccess = [Match.feature("$Func.$Codec.decode"),
22                     Match.feature("$Func.$Display.update")],
23   Match.access(Match.feature("$Func"),
24                Match.Frame1, RestrictAccess),
25
26   App.replace_component(Match.Decoder,
27                         "com.example.DecoderYUV"),
28   App.replace_connector(Match.FrameBuffer,
29                         "com.example.Display.FrameBufferYUV").

Figure 7.5. The "Eliminate RGB conversion" optimisation specification

1. The match specification (lines 1 to 16 in the figure) references three components. Decoder (line 2) and Display (line 3) are specified by fully-qualified name, ensuring that the optimisation is only applied to systems using these unique components. No other match information is required of these components; this follows as a general consequence of the principle that unique matching reduces specification requirements (Section 5.4.2).

2. The third component reference (lines 5 to 16) does not specify a unique reference. The implication is that this optimisation specification applies to any component which makes use of the named Decoder and Display components. This match specifies a number of additional structure matches for this component. The client must have connectors of a named, unique type to both the Display and Decoder components (lines 6 to 9). The client must also define a function which calls specific functions on both connectors (lines 11 to 14).

3. The optimise method first performs a structure search (line 19).

4. Following the structure search, two data-flow searches are performed. The first verifies that the object passed to decode() is the same object that is passed to update() (line 20); in other words, that decoded data is being passed to the frame buffer.

5. The second data-flow search verifies that the data being passed from one component to another is only used within the two functions named. A list is built containing the functions which are allowed to access the data (lines 21 and 22). This list is then passed to the access rule described in Section 5.8.3 and Appendix B. This rule verifies that only the named functions access the Frame1 object.

6. If both data-flow searches succeed, the system is modified: the Decoder component is replaced with a DecoderYUV component (lines 26 and 27), and the connection between Client and Display is similarly replaced (lines 28 and 29).

To evaluate the optimisation specification, the Decoder component was modified to perform YUV-to-RGB conversion.

The results are shown in Table 7.3. As before, numbers shown are in millions of CPU cycles. "Speedup" shows the performance improvement of the no-conversion case relative to the with-conversion case.

Before optimisation / stdev   After optimisation / stdev   Speedup %
263.12 / 2.69                 114.65 / 1.84                229.50

Table 7.3. The "Eliminate RGB conversion" optimisation, results

Discussion

This optimisation specification eliminates a data-reprocessing step and also a data copy. These are both bus-intensive operations, so eliminating them is particularly useful on handheld devices: smartphones in general, and the G1 phone used in this test in particular, have slow buses. The final optimised application is faster than the original example, because the YUV encoding uses less space (one byte per pixel instead of two).
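
To make the eliminated reprocessing step concrete, the sketch below shows the kind of per-pixel work a YUV-to-RGB565 conversion typically involves. It is illustrative only, using the common BT.601 integer approximation; it is not the code of the Decoder component used here, and the sample layout is an assumption for the example.

    #include <stdint.h>

    static uint8_t clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

    /* Convert one pixel's Y, U, V samples to a 16-bit RGB565 value. */
    static uint16_t yuv_to_rgb565(uint8_t y, uint8_t u, uint8_t v)
    {
        int c = y - 16, d = u - 128, e = v - 128;
        uint8_t r = clamp8((298 * c + 409 * e + 128) >> 8);
        uint8_t g = clamp8((298 * c - 100 * d - 208 * e + 128) >> 8);
        uint8_t b = clamp8((298 * c + 516 * d + 128) >> 8);
        return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
    }

Replacing the Decoder and the FrameBuffer connector removes this multiply-and-clamp work for every pixel of every frame, along with the associated copy.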

The current generation of smartphone displays do not have native YUV LCDs. However, it is common for the native format of a smartphone's included camera to be some variant of YUV, and a small variant of this optimisation which preserves the native data format if at all possible is quite plausible for modern devices.

This example demonstrates several design aspects of the Currawong specification language that work to reduce the verbosity of the optimisation specification. For example, the structure search described in step 3 of the above description is quite complex: the structure search must first find a particular architectural layout (three components, two connectors), then it must examine one component for a particular sequence of function calls to other components. This work is handled by the implementation. CSL's Prolog heritage means that the specification is ultimately handled internally as a series of nested searches. Hiding the complexity of the nested searches means that Currawong does the right thing even in strange contingencies: for example, a component could satisfy all the criteria except for the final data-flow search, and Currawong would ensure that it did not result in a false match.

7.2.3. Protocol translation

The final component-based remedy, component interposition, implements a memory-sharing optimisation in the file system component of the video player. It was introduced in Section 5.2.3.

[Figure omitted from extraction: (A) the Client connected directly to the FileSystem component via a FileSystem connector; (B) an SHMFS component interposed between the Client and the FileSystem component.]
Figure 7.6. Component interposition in the componentised video player

The optimisation relies on domain-specific knowledge combined with static analysis. The memory allocator in this system is known to supply page-aligned memory regions for allocations equal to, or greater than, the page size. If the optimisation specification can verify that the allocated memory region used by the Client component for its data has a size which is a multiple of the page size, it can therefore assume that the data region is also page-aligned.

The optimisation specification is shown in Figure 7.7. Here the template specification (lines 1 to 16 in the figure) identifies two components: the FileSystem and Client components. Once again, one component is not identified by fully-qualified name, but the other one is: this pattern allows optimisations to be "for" a particular component, while still capable of being applied to any client of that component. The specification identifies two functions within the component: memory allocation (line 9) and invocation of the read operation on the FileSystem component (line 13).

The optimisation specification first performs a structure search (line 19). It then attempts to verify that the size of the memory allocation is a multiple of the page size (line 20). If this succeeds, it then verifies that the pointer returned by malloc is the one that is passed in to read (line 21).

Once the match is complete, the actual transformation is simple, as it makes use of the built-in interpose rule to add a component.

The results of this optimisation are shown in Table 7.4. Here a single 3-megabyte file was transferred from FileSystem to Client. Two buffer sizes were evaluated. As before, results are presented in millions of CPU cycles. Sizes are in bytes, and "Speedup" is the performance of the optimised case relative to the unoptimised case.


 1 MatchFS is C {
 2   component com.example.FileSystem $FSComponent;
 3
 4   component $Client {
 5     connector com.example.FileSystem.FileSystem $Files;
 6     connect $Files to $FSComponent;
 7
 8     anonymous $Init {
 9       $AllocBuffer = malloc($Size);
10     }
11
12     anonymous $Control {
13       $FSComponent.read(_, $ReadBuffer, _);
14     }
15   }
16 }
17
18 optimise(fsshm, App) is
19   Match = App.match(MatchFS),
20   is_pagesize(Match.Size),
21   Match.AllocBuffer = Match.ReadBuffer,
22
23   App.interpose(Match.Files, "com.example.SHMFS").
24
25 is_pagesize(Size) is
26   mod(Size, 4096, 0).

Figure 7.7. The “Protocol translation” optimisation specification

Discussion

As expected, halving the number of copies involved in this process doubles the throughput in the 64-kilobyte buffer case. Interestingly, however, the improvement is not as dramatic in the 4-kilobyte buffer case. This is because data copied into the shared memory region are overwritten before the information in that region is written back to main memory. Thus the example, when applied to smaller packet sizes, mostly remains within the CPU cache.

It is quite conceivable that this component could be interposed even without the attendant data-flow analysis. In that case, the component could dynamically check whether the memory was page-aligned. The cost of performing this check will be small compared with the cost of reading data from the file system.
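
A minimal sketch of such a dynamic check follows; it is illustrative only (not code from the interposed component), and the page size and fallback strategy are assumptions for the example.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u

    /* True if the buffer can be shared page-by-page rather than copied. */
    static int can_share_pages(const void *buf, size_t size)
    {
        return ((uintptr_t)buf % PAGE_SIZE) == 0 && (size % PAGE_SIZE) == 0;
    }

    /* The interposed component would then either share the pages with the
     * file system, or fall back to the original copying path. */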

Buffer size   Copying / stdev   Sharing / stdev   Speedup %
4096          30.50 / 0.04      25.64 / 0.01      118.95
65536         43.23 / 0.02      22.59 / 0.01      191.37

Table 7.4. The “Protocol translation” optimisation, results


7.3. Android

To demonstrate performance on a commercial smartphone operating system, Currawong was applied to several Android applications. Architecture optimisations for Android require more code modification than those for CAmkES, because Android is not as strongly componentised as CAmkES.

Two architecture optimisations are shown. The first, "touch events", addresses a context switching anti-pattern by combining protection domains. The second, "redraw", addresses a copying anti-pattern by modifying the API.

7.3.1. Touch events

The touch events optimisation, described in Section 5.2.4, puts the application in control of reading and processing its own touch events. In the Android system, touch events received by the in-kernel touchscreen driver are delivered to the character device file event0. Applications with appropriate permission can therefore open this file and process touch events directly.

[Figure omitted from extraction: (A) before optimisation, the System Server waits for a touch event, marshals it for IPC, and sends it to the active application, whose IPC thread enqueues it for the application GUI thread to handle; (B) after optimisation, an application event thread waits for touch events and enqueues them directly for the GUI thread.]
Figure 7.8. Touch events optimisation before (A) and after (B)

To implement the optimisation, custom code was written as an Android native code library, to be loaded into applications using the standard Java Native Interface method [Liang 1999]. This code creates a new application thread, which continuously reads touch events and delivers them to the application. Figure 7.8, section B, shows the process involved.
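
The core of such a native event thread is a simple read loop over the input device node. The sketch below is illustrative rather than the library used here; the device path and the delivery callback (which would hand events up to the Java layer via JNI) are assumptions for the example.

    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/input.h>

    /* Read raw kernel input events and hand each one to the application. */
    void touch_event_loop(void (*deliver)(const struct input_event *))
    {
        int fd = open("/dev/input/event0", O_RDONLY);
        if (fd < 0)
            return;

        struct input_event ev;
        while (read(fd, &ev, sizeof ev) == (ssize_t)sizeof ev)
            deliver(&ev);   /* e.g. translate to onTouch() calls via JNI */

        close(fd);
    }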


 1 MatchOnTouch is Java {
 2   class $ClassName implements android.view.View.OnTouchListener;
 3 }
 4
 5 MergeOnTouch is Java {
 6   class $ClassName {
 7     $ClassName
 8     {
 9       au.com.nicta.cw.OnTouch.register(this);
10     }
11   }
12 }
13
14 optimise(ontouch, App) is
15   Match = App.match(MatchOnTouch),
16   App.add_module('au.com.nicta.cw'),
17   App.merge_all(Match, MergeOnTouch).

Figure 7.9. The “Touch events” optimisation specification

Name                   Unopt / stdev    Opt / stdev    Speedup %
Continuous Scrolling   49.23 / 10.17    43.19 / 5.34   113.98

Table 7.5. The “touch events” optimisation, results

The specification is shown in Figure 7.9. Finding an appropriate location to perform the optimisation is a simple structural match (lines 1 to 3, and line 15, in the figure). This match reflects the Android API: Java classes which are to receive touch events always implement the OnTouchListener interface.

The custom event-handling code takes care of reading events, encoding them into a format suitable for OnTouchListeners, and delivering them to the application. The only application requirement is that it inform the new event-delivery thread of the target for events, which it does by calling a function, register(), provided by the new code (line 9 in the figure).

Unlike the CAmkES optimisations, the Android specification adds a small amount of code using Currawong's code merging feature. When the merge is applied (in line 17), the merged object, MergeOnTouch, is applied to code corresponding to the portions of the Match object which match the structure of the merged object. In other words, the register call is added to the application wherever appropriate (but only in areas covered by the Match object).

To evaluate the optimisation, a sample application was written which displays an image which is larger than the physical display. Dragging a finger across the display scrolls the image. The test involved dragging a finger across the display for several seconds and measuring the number of CPU cycles consumed. To ensure consistency of input, events from the touchscreen were recorded and replayed to the application, so each test received exactly the same number of inputs at exactly the same rate.

The results of the evaluation are shown in Table 7.5. Numbers are given in millions of CPU cycles. The first column shows the number of cycles required by the original application; the second column shows the number of cycles taken by the optimised version; and the final column, "Speedup", shows the percentage improvement of the optimised version.

Discussion

This type of architecture optimisation, if applied to other system devices, would result in a system resembling the Exokernel [Engler et al. 1995]: application-level, decentralised management of resources. As with Exokernel, the result is a reduction in CPU usage. The improvement can be viewed in two ways. In one sense, it is very significant: relocation of the processing code for a single device significantly lowered the CPU usage of the application. However, like all optimisations, this one should be considered in context. Of all the devices whose data are transferred via Binder, the touch device generates the largest number of events per second, and continuous on-screen movement represents a worst case for this particular data path. The conclusion that should be drawn is that the optimiser user should consider the nature of the application to which an optimisation is to be applied.

The new code added by the touch events optimisation reads from the Android Linux device node event0. Normally, this is forbidden by the Android system. This evaluation did not include the permission management that would be necessary for a production version of the optimisation. This does not impact the result. Permission management would be simple to implement: the System Server would only need to appropriately manage the UNIX-level file permissions on the relevant device node. Even if this process was slow, however, it would not significantly reduce the benefit of the optimisation, because it would only be necessary to check permissions associated with the node when switching applications, i.e., off the critical path.
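
For illustration, such a permission hand-off could be as small as the sketch below, performed when the foreground application changes. This is not Android or Currawong code; the device path and the per-application uid are assumptions for the example.

    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Give the foreground application read access to the touch device
     * node, revoking it from everyone else. */
    int grant_touch_access(uid_t app_uid)
    {
        if (chown("/dev/input/event0", app_uid, (gid_t)-1) != 0)
            return -1;
        return chmod("/dev/input/event0", 0400);   /* owner read only */
    }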

7.3.2. Redraw

The Android redraw optimisation replaces a slow but simple and easy-to-debug drawing interface with a faster but more complex interface. It was described in Section 5.2.5, and is illustrated in Figure 7.10.

The optimisation specification is shown in Figure 7.11. Once again, the matching portion is a simple structure search (lines 1 to 6 in the figure). Most of the remainder of the specification adds code to the matched class. The first portion of merged code (lines 8 to 26) adds a number of functions to the class to partially implement the code necessary to support a SurfaceView. The second portion of merged code (lines 28 to 34) also supports the SurfaceView API. It is provided separately because it may be applied multiple times, once for each class constructor.

[Figure omitted from extraction: (A) before optimisation, the framework calls onDraw(), the application re-draws the surface, and the System Server updates the display; (B) after optimisation, the same steps are driven from a Currawong thread inside the application, with an additional copy to display memory.]
Figure 7.10. Redraw optimisation before (A) and after (B)

Results of the evaluation are shown in Table 7.6. Three applications were tested. As before, numbers are shown in millions of CPU cycles. The "Unopt" column shows the performance of the application prior to optimisation; the "Opt" column shows the performance after optimisation; and the "Speedup" column shows the percentage improvement due to optimisation.

Name            Unopt / stdev      Opt / stdev        Speedup %
Redraw 60 FPS   78.12 / 3.51       37.67 / 0.34       207.38
Space War       724.93 / 3.37      413.13 / 10.94     175.47
Bonsai Blast    1384.33 / 32.95    1179.23 / 13.64    117.39

Table 7.6. The "Redraw" optimisation, results

The first application, Redraw 60 FPS, is a test application custom-written for this optimisation. It simply draws to the display as quickly as possible. In Android, display updates are rate-limited to 60 frames per second, so this is the speed at which this application refreshes the display. Because it does very little other than re-draw the display, Redraw 60 FPS represents a best-case scenario for the redraw optimisation.

The other applications evaluated are both proprietary applications written by third parties and made available to all Android users via the Android Marketplace, a centralised application repository.

Space War is a top-down scrolling "shoot-em-up" game. The player controls a space ship which flies along a vertically-scrolling display. Other space ships appear from the top and sides of the display, and the player must shoot these other ships and avoid incoming laser fire to win the game.

Bonsai Blast is a puzzle game in which multi-coloured balls travel in single file along a pre-determined path, and the player must launch balls of the correct colour into the appropriate place in the line of balls in order to score points.

 1 MatchOnDraw is Java {
 2   class $ClassName extends android.view.View {
 3     protected void onDraw(Canvas _)
 4     { }
 5   }
 6 }
 7
 8 MergeOnDraw is Java {
 9   class $ClassName extends Android.view.SurfaceView
10       implements Android.view.SurfaceHolder.Callback {
11     private int _cw_tok;
12
13     public void surfaceDestroyed(SurfaceHolder s) { }
14
15     public void surfaceCreated(SurfaceHolder s) {}
16
17     public void surfaceChanged(SurfaceHolder s,
18                                int fmt, int w, int h) {
19       _cw_tok = au.com.nicta.cw.Draw2D.init(this);
20     }
21
22     protected void invalidate() {
23       au.com.nicta.cw.Draw2D.invalidate(_cw_tok);
24     }
25   }
26 }
27
28 MergeOnDrawInit is Java {
29   class $ClassName {
30     $ClassName {
31       getHolder().addCallback(this);
32     }
33   }
34 }
35
36 optimise(ondraw, App) is
37   Match = App.match(MatchOnDraw),
38   App.add_package('au.com.nicta.cw'),
39   App.merge(Match, MergeOnDraw),
40   App.merge_all(Match, MergeOnDrawInit),
41   App.rename_method
42       (Match.ClassName, 'onDraw', '_cw_onDraw'),
43   App.merge(Match, MergeOnDrawInit).

Figure 7.11. The Android redraw optimisation

To evaluate each game, a custom program was written which counts the number of CPU cycles which elapse within a three-second window of game play. Care was taken to ensure that measurements between unoptimised and optimised runs measured the same part of the game. Getting this right was a matter of trial and error. For example, the first level of Bonsai Blast gets harder (more balls appear on-screen) approximately fifteen seconds into the start of the game. At this point, the game's CPU usage increases significantly. To ensure consistency, each run was measured multiple times, from the same location within the game.

Discussion

The redraw optimisation showed surprisingly promising performance improvements, even in applications which already make heavy use of the CPU, and it can be applied to commercial applications without changing their semantics.

The results here show noticeably different speedup percentages across the three applications. There are two reasons. Firstly, the amount of additional processing performed by the application influences the proportional speed-up. Secondly, Bonsai Blast has a lower frame rate than Space War, and Space War has a lower frame rate than Redraw 60 FPS. A lower frame rate means less optimisable work per second, which results in a reduced performance improvement.

Bugfix or optimisation?

It could be reasonably argued that an application's use of the Surface technique rather than the SurfaceView technique constitutes a bug, i.e., the application is simply using the wrong API. This is in contrast to an optimisation, in which the API is used correctly but inefficiently. This distinction is blurry, and goes to the heart of system software optimisation as a whole. In many cases, the SurfaceView case included, what constitutes a bugfix in one context is an optimisation in another context. In this specific example, applications written using the slower technique perform correctly and with acceptable performance. Furthermore, the application code is simpler and easier to maintain. In this context, using the slower technique is not a bug, but a design trade-off, presumably one of many made during development. Ultimately, the bugfix-or-optimisation question must always be framed in an appropriate context before we can determine whether Currawong is performing bugfixes, design-space exploration, or optimisation.

7.4. Costs of running Currawong

There are two types of execution cost that should be considered in relation to Currawong: the run-time cost imposed on the optimised application by Currawong, and the execution cost of running Currawong itself.

7.4.1. Application run-time cost

When considering run-time application execution-time costs, it is instructive to again divide the overall cost into that imposed by the optimisation system itself, and that which arises as a result of implementing an optimisation.

The smaller the run-time cost imposed by Currawong, the better: obviously, lower execution-time overheads result in better application performance, all else being equal. There is a subtler benefit also: the less run-time overhead Currawong imposes, the more predictable it becomes. This is an important benefit to optimisation writers: an optimiser that unpredictably introduces its own overheads makes writing optimisation specifications more difficult.

Consequently, Currawong does not impose any run-time overhead on optimised systems. This is achieved both through use of static (rather than dynamic) analysis for verification, and by direct re-writing of the application statically, rather than, for example, by rewriting the application at run time, or by making use of tracing or debugging features of the operating system to interpose new code implementing a particular optimisation.

Currawong does not, however, make any guarantees about the performance of the modifications implemented by the optimisation specification. This is a design decision based on the assumption that the optimisation writer is a domain expert who has a good understanding of the performance characteristics of the API she is optimising.

7.4.2. Currawong execution cost

The other type of execution cost that should be considered is that taken by running Currawong. For this version of Currawong, that cost is negligible, even for optimisations that require significant analysis or which make heavy use of CSL's Prolog heritage to perform unification. Structure and data-flow search for each example here was very fast (less than one second). Total run-time was slightly longer, as it involved decompiling, parsing, and then reassembling the application. The reason for this has nothing to do with efficiency of implementation: Currawong simply does not perform a significant amount of static analysis. A future version of Currawong, with stronger support for data-flow search, would take longer.

Even if Currawong took a long time to execute, this would, in some sense, not matter: Currawong only needs to be run once, by the optimiser user; after that, the optimised application can be executed many times. Therefore the cost incurred by the optimiser user can be amortised across each run of the optimised application. Yang et al. make a similar point when discussing their static analysis tool for finding bugs in kernel code [Yang et al. 2006]: even a high per-run cost is offset by the very large pay-off from a successful execution.

Yang's analysis in the above paper is a good data point illustrating the trade-off Currawong makes between depth of analysis and performance: Yang's symbolic evaluation engine takes approximately an hour to analyse functions with, in his words, "complex control flow but little symbolic looping".

7.5. Discussion

The aim of the chapter was to clearly demonstrate both that Currawong cannot improve a system that will not benefit from the optimisation being applied, and that, in systems which do benefit, significant performance gains can be made. The results presented in this chapter prove that both goals were achieved.


It is helpful to consider whether the optimisations performed by Currawong could beachieved as effectively using other means. Several alternate methods exist:

1. Source-code modification. The application authors could have performed any ofthese optimisations themselves while building the application. The optimisationcould either have been performed manually, or by using the source-to-source op-timisation techniques described in Chapter 2. In some cases, this approach is bet-ter than using Currawong, because the programmer has the potential to performa wide range of algorithmic optimisations currently inaccessible to Currawong.For example, rather than attempting to eliminate RGB conversion, as described inSection 7.2.2, each component could be rewritten so that no component actuallyrequires data in RGB format. However, a major motivating factor for Currawong,as described in Chapter 1, is the absence of source code, so that optimisationscan be applied even if the application developer was unaware of them at the timeof compilation. By definition, source-code-based techniques cannot be applied tooptimisation problems for which the source code is not present, which is the sameproblem domain that Currawong addresses. Even if source code is available, binarymodification confers a number of advantages related to target-platform adaptibil-ity. For example, binary modifications can be performed when the exact hardwarecharacteristics of the target device are known. This may not be the case for source-code-based or compile-time modifications. The additional information can be usedto do hardware-dependent things, such as, for example, to adjust RAM usage to alevel suitable for the device. The same advantage applies to adapting the compo-nent to software running on the platform. Keller and Holzle demonstrate this typeof adaptation in their Binary Component Adaptation system [Keller and Holzle1998].

2. Improved system libraries. The authors of system libraries which were the target ofoptimisations could improve them. This solution is similar to the previous one, inthat it requires source code access, and this access may not be available. Even whenthey can be updated, modifying system libraries to include optimisations is a riskyprocess: effectively every client of the APIs exposed by those libraries becomes op-timised. This calls for more care than is required from Currawong, which is at leastcapable of performing static analysis to determine whether an application shouldbe optimised. Nonetheless, careful modification of system libraries is an ideal so-lution, if it is possible: the converse of the “all applications become optimised”problem is that optimisation is applied to new applications without requiring anintermediate optimiser, such as Currawong.

3. Binary optimisation techniques. Systems such as Keller and Hölzle’s, discussed in Section 2.4, could be used to apply these optimisations. Keller and Hölzle’s system has a small run-time cost, which Currawong avoids; later systems, such as PROSE [Popovici et al. 2002], avoid this cost as well. However, these systems are presented as binary implementations of AOP and are, unlike Currawong, limited to structural modifications.

The performance gains due to Currawong come without significant modification to the application, without requiring application source code, and without requiring the involvement of the original application authors. Applying the novel technique of architecture optimisation to complete systems can bring about significant performance improvement.

8. Conclusion

I wish to God these calculations had been executed by steam!

– Charles Babbage, possibly apocryphal

This dissertation described architecture optimisation, a novel high-level optimisation technique. Architecture optimisation is viable both on research systems and on commercial operating systems on which significant thought has, presumably, already been given to performance tuning.

An architecture optimiser, Currawong, was implemented and tested. Currawong can optimise two completely different types of system, deal with multiple languages, and optimise without requiring source code, all without imposing a run-time cost. These characteristics distinguish it from all other optimisation techniques. Unlike traditional compiler optimisations, Currawong is extensible; unlike library optimisations, Currawong can optimise without source code, and across a larger domain; unlike refactorings, Currawong can modify the behaviour of code. Currawong delivered good performance results across a variety of optimisation categories and on both systems.

This chapter summarises the main themes of the dissertation and discusses directions in which this type of work could be taken in the future.

8.1. Summary

Currawong is a tool to apply domain-specific, or high-level, optimisations to applications. It achieves its performance improvements by optimising at API boundaries. API boundaries are a good place to implement high-level optimisations, because an API call is a rich source of application-level information. For example, the fact that Java applications in the Android “Redraw” example implemented a specific method of a well-known interface meant that the optimiser could assume many things about the nature of the application: that it would be drawing to the screen, that it would be performing drawing calls on a particular object, that this object would be supplied in a callback fashion through a call to a known function within the application, and so on. Every piece of knowledge that can be assumed in this way is one less fact that must be verified through static analysis.

The introduction to Chapter 5 reproduces Aho, Sethi, and Ullman’s three criteria for optimisations: they must preserve program meaning; they must, on average, speed up programs; and they must be worth the effort. It is interesting to note that Currawong does not, by itself, meet those criteria: Aho et al. would probably classify Currawong as an optimisation framework, or simply as a compiler. Only when combined with a well-written optimisation specification does Currawong become an optimiser.

This distinction may seem trivial, but it hides a design decision: Currawong relies heavily on the optimisation writer to produce correct optimisations. Currawong itself will obligingly apply “optimisations” that insert Trojan horses into code, slow it down rather than speed it up, or merely introduce subtle bugs. This is, of course, a deliberate design decision: the author of the API is best placed to write optimisations that take advantage of that API. Currawong helps the author by providing a static analysis toolkit. However, it remains to be seen whether API authors are up to the task of providing high-quality optimisations.

In many other ways, Currawong behaves like other optimisation systems. Currawong’s matching emphasises conservatism: if a property cannot be verified, Currawong assumes that it does not hold. This conservatism is, of course, an important part of ensuring that optimisations preserve program meaning. Optimisations can also be additive: multiple optimisations can be applied to the same application, resulting in performance improvements in multiple areas.

8.1.1. Designing to be optimised

Architecture optimisation is applied to binary applications, by users (or system creators), after the code to be optimised has been written and packaged. One of the advantages of this approach is that the impact of bad design decisions can be ameliorated post hoc, by applying architecture optimisation. Realistically, however, this is only true to an extent.

Consider the protocol header optimisation described in Section 4.1.4. In this optimisation, the structure which defines a network packet was changed. The old structure required reallocation and copying whenever the packet size was increased; the new structure deals with the problem more efficiently by treating a packet as a linked list of regions. Assume that this optimisation is to be applied to two communicating components in a component system: one which consumes network buffers (such as a network card driver) and one which produces them (such as a TCP/IP stack).
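
To make the discussion concrete, the sketch below shows, in C, one way such a region-based packet structure might look. It is purely illustrative: the names and the exact layout used by the optimisation in Section 4.1.4 may well differ. The point is that growing the packet (here, by prepending a protocol header) links in a newly-allocated region rather than reallocating and copying the entire buffer.

#include <stdlib.h>
#include <string.h>

/* Illustrative only: one region of a packet. */
struct pbuf_region {
    struct pbuf_region *next;  /* next region of this packet, or NULL */
    void *data;                /* bytes held by this region */
    size_t len;                /* number of bytes in data */
};

/* A packet is a chain of regions plus a cached total length. */
struct pbuf {
    struct pbuf_region *head;
    size_t total_len;
};

/* Prepend a header by linking in a new region; existing regions are
 * neither reallocated nor copied. Returns 0 on success, -1 on failure. */
static int pbuf_prepend(struct pbuf *p, const void *hdr, size_t len)
{
    struct pbuf_region *r = malloc(sizeof *r);
    if (r == NULL)
        return -1;
    r->data = malloc(len);
    if (r->data == NULL) {
        free(r);
        return -1;
    }
    memcpy(r->data, hdr, len);
    r->len = len;
    r->next = p->head;
    p->head = r;
    p->total_len += len;
    return 0;
}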

A standard design for this interaction would be to share memory between the producer and consumer, and to leave management of the buffer format to the components. However, if this design is chosen, creating a high-level optimisation becomes rather difficult. The optimiser would have to recognise that the output of standard C memory-management routines (such as malloc and realloc) was being applied to packet buffers, possibly involving the tracking of pointers to packet buffer objects throughout the entire application. If, instead, access to packet buffers were provided through a function-call API, applying an architecture optimisation would be as simple as searching for uses of that API.

It seems that the best way to write optimisable APIs is to provide as high-level an interface as possible: rather than expose the format of a data structure, provide an interface to the structure via a functional API; rather than allow applications to make arbitrary modifications, provide a set of single-purpose functions to do the job for them.
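
A hypothetical sketch of what this advice means for the packet-buffer example is shown below; the names are invented for illustration and do not correspond to an actual CAmkES or Currawong interface. In the first style the buffer format is exposed and clients manipulate it with general-purpose memory routines, so an optimiser must track every pointer to the data. In the second style every operation goes through a small set of single-purpose functions, so an optimisation specification only needs to search for calls to those functions.

#include <stddef.h>

/* Style 1: exposed structure. Clients may grow packets with realloc and
 * pass the data pointer anywhere, so little can be assumed about usage. */
struct raw_packet {
    unsigned char *data;
    size_t len;
};

/* Style 2: opaque handle with single-purpose operations. Each capability
 * of the buffer corresponds to exactly one well-known function name. */
typedef struct packet packet_t;

packet_t *packet_alloc(size_t initial_len);
int packet_prepend_header(packet_t *p, const void *hdr, size_t len);
int packet_append_payload(packet_t *p, const void *buf, size_t len);
size_t packet_length(const packet_t *p);
void packet_free(packet_t *p);

An architecture optimisation such as the one in Section 4.1.4 could then be expressed in terms of these function names alone, with no pointer tracking required.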

These criteria for API design are familiar: they were first proposed by Parnas in 1972 as part of a set of guidelines for writing code that is easy to maintain and change [Parnas 1972]. It is perhaps unsurprising that the old rules still apply in this new context.

8.2. Achievements

This dissertation describes an optimisation technique that is both novel and practical. Other high-level optimisation techniques, such as active libraries, have limited domain applicability by design, because they are intended to assist programmers. Currawong instead assists end users, the real beneficiaries of optimised applications, because it does not require programmer involvement when applying optimisations. This in turn means that applications which are no longer supported by their authors, or which run on hardware on which the author could not test, can still benefit. Currawong combines the simplicity of refactoring with the static analysis of active libraries to support a wide range of optimisation types.

Currawong is a success with regard to performance. Currawong’s excellent performance on Android is particularly notable because this large and commercially-supported system has already been through several performance-enhancing revisions.

8.3. Future work

Currawong is a successful demonstration of a novel technique, but it is by no means perfect.

Architecture optimisation’s benefits are also, to some extent, its disadvantages. The requirement that Currawong work without source code limits both its ability to perform static analysis and its ability to make code modifications. This shortcoming is somewhat mitigated by the design principle that domain-specific knowledge reduces analysis requirements. The two effective Android optimisations presented in Chapter 7, for example, rely only on structure search, and it is a strength of Currawong that significant performance improvements can be achieved even without sophisticated static analysis techniques. Nonetheless, stronger support for static analysis, probably in the form of a better symbolic execution engine, is an obvious target for future work.

One of architecture optimisation’s key design goals was support for multiple languages. This is an important goal, because most systems are written in multiple languages, and Currawong met it through its support for both Java and C. However, it is worth revisiting the assumptions implicit in this design goal. One implicit assumption is that architecture optimisations are written in specific languages. For example, the CAmkES optimisations were both specified and written in C, while the Android optimisations were specified in Java and written in a combination of Java and C. Requiring optimisations to be specified in a particular language has the advantage that the specification can closely match the actual code and structure of the targeted application. However, none of the optimisations evaluated in Chapter 7 is necessarily language-specific. It is reasonable to expect the RGB-to-YUV conversion problem, for example, to crop up in a Java program, or the Android redraw optimisation to be applicable to C code. A logical extension of multiple-language support is therefore a general-purpose, language-independent specification format.

Currawong and its specification language, the author’s Python-based parser, and the Bcamkes binary component system will all be released as open source software.

A. Glossary

Android A software stack for mobile devices that includes an operating system, middleware and key applications. More information is available at http://developer.android.com/guide/basics/what-is-android.html.

Binary CAmkES A dialect of CAmkES supporting a number of additional features, including support for composition of binary-only components.

Binder A remote procedure call mechanism for Android. Binder consists of a (relatively) small kernel implementation, a user-space library, and JNI hooks for use from Java.

CAmkES The Component Architecture for microkernel-based Embedded Systems is a component system which runs on various dialects of L4. More information is available from the CAmkES home page, at http://www.ertos.nicta.com.au/software/camkes/.

architecture A machine-readable high-level description of a componentised system. In CAmkES, this is the description of the componentised system in Architecture Definition Language.

component system A software development tool, and the run-time portion of that tool, which allows a system to be created as a set of one or more communicating components. Contrast with componentised system.

componentised system See system.

object language The programming language in which an application or component is written. Contrast with specification language.

platform Another name for software stack.

software stack A collection of interacting programs that, combined, provide a fully-functional operating environment for applications. For example, the Android software stack consists of the Linux kernel, a set of run-time libraries, a virtual machine, system components written in Java, and a number of user-accessible applications. A basic CAmkES software stack consists of an L4 microkernel and CAmkES run-time code.

specification language The language in which a specification is written. May be different to the object language.

system A binary implementation of an architecture (see architecture).

B. Summary: Currawong API

This appendix provides a brief overview of the code matching, checking, and modification rules provided by Currawong.

B.1. Application object

B.1.1. match(+Name)

Produce a Match object
Searches for the named template and returns a Match object representing the portion of the application corresponding to that template.
Input: Name is a reference to a previously-defined template.
Output: The return value is an opaque type representing a match object. This rule either fails, or succeeds exactly once.

B.1.2. rename_call(+Scope, +Old, +New)

Rename all references to a function call
Searches Scope and all child scopes for any references to function invocation for the function named Old, and replaces them with function invocations to the function named New.
Input: Scope is a reference to a scope from a Match object. Old and New are strings containing function names.
Output: None. The rule always succeeds.

B.1.3. rename_method(+Scope, +Old, +New)

Rename a method definition
Searches Scope and all child scopes for a declaration of the function Old, and renames it to New.
Input: Scope is a reference to a scope from a Match object. Old and New are strings containing function names.
Output: None. The rule always succeeds.

B.2. CAmkES-specific Application object rules

B.2.1. access(+Scope, +Object, =Funcs)

Checks object usage
Checks that Object is only used by Funcs within Scope.

Input: Scope is a reference to a matched scope (obtained via Match.feature/1). Object is a match variable representing data (i.e. a function parameter). Funcs is a list of functions.
Output: The list of functions which use Object is returned in Funcs. If Funcs is bound, and the list is the same, the rule succeeds. If Funcs is unbound, it is bound to the list of functions, and the rule succeeds.

B.2.2. AddToPD(+Satellite, +Planet)

Combines protection domains
Satellite is placed in Planet’s protection domain.
Input: Satellite and Planet are component references from a match specification.
Output: None. The rule always succeeds.

B.2.3. DisjointComponentPDs(+PD1, +PD2)

Succeeds if PD1 and PD2 have disjoint protection domains
This rule accepts two component references and succeeds if the components are in completely disjoint protection domains.
Input: PD1 and PD2 are component references from a match specification.
Output: None. The rule succeeds if PD1 and PD2 are in disjoint protection domains, and fails otherwise.

B.2.4. replace_component(+Old, +New)

Replace a component with another component
Replaces a component referenced by a match specification with a new component.
Input: Old is a reference to a component scope from a Match object. New is a string containing a fully-qualified component name.
Output: None. The rule always succeeds. In the current implementation, Currawong returns an error to the user and exits if the new component is not found.

B.2.5. replace_connector(+Old, +New)

Replace a connector with another connector
Replaces a connector referenced by a match specification with a new connector.
Input: Old is a reference to a connector scope from a Match object. New is a string containing a fully-qualified connector name.
Output: None. The rule always succeeds. In the current implementation, Currawong returns an error to the user and exits if the new connector is not found.

B.2.6. interpose(+Connector, +Component)

Interpose a component along a connector

A new component of type Component is created. The named connector, Connector, is disconnected from its destination component and connected to the new component. A new connector of the same type as Connector is created and connected from the new component to the original connector’s destination. The new component is placed in its own protection domain.
Input: Connector is a reference to a connector scope from a Match object. Component is a reference to a component scope from a Match object. These must refer to the same Match object, and Connector must be one of the connectors used by Component.
Output: None. The rule always succeeds. Failure conditions are as per replace_connector.

B.3. Java-specific Application object rules

B.3.1. add_module(+Name)

Adds the named module to the application
The module identified by Name is physically copied to the application, so as to make it available to new code.
Input: Name is the fully-qualified name of the new module.
Output: None. The rule always succeeds.

B.3.2. merge(+Match, +Merge)

Adds code to a match
The named template object’s code and data is added to the portions of the application identified by the Match object according to the rules described in Section 5.8.4. Merging is performed at most once.
Input: Match is a match object. Merge is a template.
Output: None. If the merge cannot complete (because the structural features identified in the template cannot be correlated with the structural features in the Match object), the rule fails. Otherwise, it succeeds.

B.3.3. merge_all(+Match, +Merge)

Adds code to a match, multiple times
This performs the same function as merge/2, but the merge is performed as many times as possible. This is used in the example described in Section 7.3.1 to add code to every class initialiser.

B.4. Match object

B.4.1. feature(+Path)

Access a structural feature of a Match object.
Match objects make the structure of the matched code portion available to the optimisation specification via the feature rule.

Input: Path is a string containing a path access specifier. The access specifier should indicate a scoped name corresponding to scopes in the generating template. Scopes may include templated variables, which are substituted. Scopes may also include an underscore (_) as a wildcard, in which case any scope may be substituted. Scopes are searched starting from the contents of the outermost scope of the match; if this fails, the scope above that scope is searched, and so on. If multiple matches are found, only the first one is returned.
Output: The return value is an opaque type representing the feature, to be passed to code-modifying rules.
Examples:
Match.feature("method"): Match the single scope named “method”.
Match.feature(""): Match the outermost scope.
Match.feature("$ClassName"): Match the scope indicated by the $ClassName match variable.
Match.feature("_.method"): Match the scope “method”, which is contained within any scope.

Bibliography

Mark B. Abbott and Larry L. Peterson. Increasing network throughput by integrating protocol layers. IEEE/ACM Transactions on Networking, 1(5), February 1993.

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

Sameer Ajmani, Barbara Liskov, and Liuba Shrira. Modular software upgrades for distributed systems. In Proceedings of ECOOP 2006, volume 4067 of Lecture Notes in Computer Science, pages 452–476. Springer Berlin / Heidelberg, 2006.

John Aycock. A brief history of just-in-time. ACM Computing Surveys, 35(2), June 2003.

Charles Babbage. Passages from the Life of a Philosopher. Rutgers University Press, 1864.

David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), December 1994.

Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: a transparent dynamic optimization system. ACM SIGPLAN Notices, 35(5), May 2000.

Thomas Ball and Sriram K. Rajamani. The SLAM toolkit. In Proceedings of the 13th International Conference on Computer Aided Verification, July 2001.

Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichtenberg, Con McGarvey, Bohus Ondrusek, Sriram K. Rajamani, and Abdullah Ustuner. Thorough static analysis of device drivers. SIGOPS Operating Systems Review, October 2006.

Andrew Baumann, Gernot Heiser, Jonathan Appavoo, Dilma Da Silva, Orran Krieger, Robert W. Wisniewski, and Jeremy Kerr. Providing dynamic update in an operating system. In Proceedings of the 2005 USENIX Annual Technical Conference, pages 32–32, Berkeley, CA, USA, 2005. USENIX Association.

Ted J. Biggerstaff. The library scaling problem and the limits of concrete component reuse. In IEEE International Conference on Software Reuse, November 1994.

Julien Brunel, Damien Doligez, René Rydhof Hansen, Julia L. Lawall, and Gilles Muller. A foundation for flow-based program matching: using temporal logic and model checking. SIGPLAN Notices, 44:114–126, January 2009.

John Bruno, Jose Brustoloni, Eran Gabber, Avi Silberschatz, and Christopher Small. Pebble: a component-based operating system for embedded applications. In Proceedings of the USENIX Workshop on Embedded Systems, March 1999.

Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: automatically generating inputs of death. ACM Transactions on Information and System Security, 12(2), November 2008.

Wen-Ke Chen, Sorin Lerner, Ronnie Chaiken, and David M. Gillies. Mojo, a dynamic optimization system. In Proceedings of the 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization, December 2000.

Vitaly Chipounov, Vlad Georgescu, Cristian Zamfir, and George Candea. Selective symbolic execution. In Workshop on Hot Topics in Dependable Systems, May 2009.

Hsiao-keng Jerry Chu. Zero-copy TCP in Solaris. In USENIX ’96: Proceedings of the 1996 USENIX Annual Technical Conference, 1996.

David D. Clark, Van Jacobson, John Romkey, and Howard Salwen. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6):23–29, 1989.

James R. Cordy. The TXL source transformation language. Science of Computer Programming, 2006.

Willem de Bruijn and Herbert Bos. Beltway Buffers: Avoiding the OS Traffic Jam. In Proceedings of the 27th Conference on Computer Communications (INFOCOM 2008), April 2008.

Kris De Volder. Aspect-oriented logic meta programming. In Pierre Cointe, editor, Meta-Level Architectures and Reflection, Second International Conference, Reflection’99, volume 1616 of Lecture Notes in Computer Science. Springer Verlag, 1999.

Danny Dig and Ralph Johnson. The role of refactorings in API evolution. In Proceedings of the 21st IEEE International Conference on Software Maintenance. IEEE Computer Society, 2005.

Danny Dig, Stas Negara, Vibhu Mohindra, and Ralph Johnson. Reba: refactoring-aware binary adaptation of evolving libraries. In ICSE ’08: Proceedings of the 30th international conference on Software engineering. ACM, 2008.

Peter Druschel and Larry L. Peterson. Fbufs: a high-bandwidth cross-domain transfer facility. SIGOPS Operating Systems Review, 27(5), 1993.

Adam Dunkels. Design and Implementation of the lwIP TCP/IP Stack. PhD thesis, Swedish Institute of Computer Science, 2001.

Eclipse Foundation. Eclipse.org home page, February 2010a. URL http://eclipse.org/.

Eclipse Foundation. Unleashing the power of refactoring, February 2010b. URL http://www.eclipse.org/articles/article.php?file=Article-Unleashing-the-Power-of-Refactoring/index.html.

Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as deviant behavior: a general approach to inferring errors in systems code. SIGOPS Operating Systems Review, 35(5), 2001.

Dawson R. Engler, Frans Kaashoek, and James O’Toole Jr. Exokernel: an operating system architecture for application-level resource management. In SOSP ’95: Proceedings of the fifteenth ACM symposium on Operating Systems Principles, New York, NY, USA, 1995. ACM.

Nicholas FitzRoy-Dale and Ihor Kuz. Towards automatic performance optimisation of componentised systems. In Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems. ACM, 2009.

Bryan Ford, Jay Lepreau, Stephen Clawson, Kevin Van Maren, Bart Robinson, and Jeff Turner. The Flux OS Toolkit: Reusable Components for OS Implementation. In Proceedings of the 6th Workshop on Hot Topics in Operating Systems (HotOS-VI). IEEE Computer Society, 1997.

Martin Fowler, Kent Beck, John Brant, William Opdyke, and Don Roberts. Refactoring: improving the design of existing code. Addison-Wesley Professional Computing Series, 1999.

L. Fernando Friedrich, John Stankovic, Marty Humphrey, Michael Marley, and John Haskins. A survey of configurable, component-based operating systems for embedded applications. IEEE Micro, 21(3), 2001.

Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional Computing Series, 1995.

Kang Su Gatlin. Profile-guided optimization with Microsoft Visual C++, 2010. URL http://msdn.microsoft.com/en-us/library/aa289170(VS.71).aspx.

Google Inc. What Is Android?, February 2010. URL http://developer.android.com/guide/basics/what-is-android.html.

Samuel Z. Guyer and Calvin Lin. Broadway: A compiler for exploiting the domain-specific semantics of software libraries. In Proceedings of the IEEE, volume 93. IEEE, 2005.

Klaus Havelund and Thomas Pressburger. Model checking Java programs using Java PathFinder. International Journal on Software Tools for Technology Transfer (STTT), 2(4), 2000.

Gernot Heiser, Kevin Elphinstone, Ihor Kuz, Gerwin Klein, and Stefan M. Petters. Towards trustworthy computing systems: taking microkernels to the next level. SIGOPS Operating Systems Review, 41(4), 2007.

Johannes Henkel and Amer Diwan. Catchup!: capturing and replaying refactorings to support API evolution. In ICSE ’05: Proceedings of the 27th international conference on Software engineering. ACM, 2005.

Michael Hills. Native OKL4 Android stack. BE Thesis, NICTA, 2009.

Gerard J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5), 1997.

HTC Corporation. HTC Developer Center for the ADP1, 2010a. URL http://developer.htc.com/adp.html.

HTC Corporation. HTC Dream, 2010b. URL http://www.htc.com/www/product/dream/specification.html.

IBM. The FORTRAN automatic coding system for the IBM 704 EDPM. International Business Machines Corporation, October 1956.

International Standards Organisation. Information technology – Programming languages – Prolog – Part 1: General core. Technical Report 13211-1, ISO, 1995.

Daniel Isaaman, Jenny Tyler, and Martin Newton. Computer Spacegames. Usborne Publishing Ltd., 1982.

JesusFreke. Smali and Baksmali. http://code.google.com/p/smali/, 2010.

Stephen Kell. Configuration and adaptation of binary software components. In Proceedings of the 31st International Conference on Software Engineering. IEEE, 2009.

Ralph Keller and Urs Hölzle. Binary component adaptation. In Proceedings of the European Conference on Object-Oriented Programming, volume 1445, page 307, 1998.

Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. ACM Computing Surveys, 28, 1996.

Gregor Kiczales, Eric Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William G. Griswold. An overview of AspectJ. In European Conference on Object-Oriented Programming, 2001.

James C. King. Symbolic execution and program testing. Communications of the ACM, 19(7), 1976.

Stephen Cole Kleene. Mathematical Logic. Wiley, 1967.

Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. Design of the Java HotSpot™ client compiler for Java 6. ACM Transactions on Architecture and Code Optimization, 5(1), 2008.

Robert A. Kowalski. The early years of logic programming. Communications of the ACM, 31(1), 1988.

Ihor Kuz, Yan Liu, Ian Gorton, and Gernot Heiser. CAmkES: A component model for secure microkernel-based embedded systems. Journal of Systems and Software, 80(5), 2007.

Labix. python-constraint, 2010. URL http://labix.org/python-constraint.

Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the international symposium on code generation and optimization. IEEE Computer Society, 2004.

Samuel J. Leffler and Marshall Kirk McKusick. The Design and Implementation of the 4.3 BSD UNIX Operating System. Addison-Wesley, 1989.

Chuanpeng Li, Chen Ding, and Kai Shen. Quantifying the cost of context switch. In ExpCS ’07: Proceedings of the 2007 workshop on Experimental computer science. ACM, 2007.

Sheng Liang. Java Native Interface: Programmer’s Guide and Reference. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999. ISBN 0201325772.

Jochen Liedtke. On Micro-Kernel Construction. In Symposium on Operating Systems Principles, 1995.

Luigi Federico Menabrea and Ada Augusta. Sketch of The Analytical Engine, invented by Charles Babbage. In Bibliothèque Universelle de Genève, number 42. Genève, October 1842.

David Miller. How SKBs work, 2010. URL http://vger.kernel.org/~davem/skb.html.

David Mosberger and Larry Peterson. Making paths explicit in the Scout operating system. In OSDI ’96: Proceedings of the USENIX 2nd Symposium on OS Design and Implementation, volume 30. ACM Association for Computing Machinery, 1996.

MPlayer authors. MPlayer HQ, 2010. URL http://www.mplayerhq.hu/.

Erich Nahum, Tsipora Barzilai, and Dilip D. Kandlur. Performance issues in WWW servers. IEEE/ACM Transactions on Networking, 10(1), 2002.

NICTA. OKL4 for the HTC Dream, 2010. URL http://ertos.nicta.com.au/software/okl4htcdream/.

Object Management Group. CORBA 3.0.3, Common Object Request Broker Architecture (Core Specification), 2004-03-01, 2004.

OK Labs. OKL4 Microkernel, 2010. URL http://wiki.ok-labs.com/Microkernel.

Oracle. Java home page, February 2010. URL http://java.sun.com/.

Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. IO-Lite: a unified I/O buffering and caching system. ACM Transactions on Computing Systems, 18(1), 2000.

David L. Parnas. On the criteria to be used in decomposing systems into modules. Communications of the ACM, 1972.

Andrei Popovici, Thomas Gross, and Gustavo Alonso. Dynamic weaving for aspect-oriented programming. In AOSD ’02: Proceedings of the 1st international conference on Aspect-oriented software development. ACM, 2002.

Calton Pu and Henry Massalin. The Synthesis Kernel. Computing systems: the journal of the USENIX Association, 1(1), 1988.

Ludo Van Put, Dominique Chanet, Bruno De Bus, Bjorn De Sutter, and Koen De Bosschere. DIABLO: a reliable, retargetable and extensible link-time rewriting framework. In Proceedings of the Fifth international symposium on signal processing and information technology, 2005.

Python Software Foundation. Python Programming Language – Official Website, 2010. URL http://python.org.

Alastair Reid, Matthew Flatt, Leigh Stoller, Jay Lepreau, and Eric Eide. Knit: Component composition for system software. In Proceedings of the 4th ACM Symposium on Operating Systems Design and Implementation, 2000.

Arch D. Robison. Impact of economics on compiler optimization. In JGI ’01: Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande. ACM, 2001.

Douglas R. Smith. KIDS: A semiautomatic program development system. IEEE Transactions on Software Engineering, 16(9), 1990.

John A. Stankovic, Ruiquing Zhu, Ram Poornalingam, Chenyang Lu, Zhedong Yu, Marty Humphrey, and Brian Ellis. VEST: An aspect-based composition tool for real-time systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, 2003.

Sander Tichelaar, Stéphane Ducasse, Serge Demeyer, and Oscar Nierstrasz. A meta-model for language-independent refactoring. In Proceedings of the International Symposium on Principles of Software Evolution, 2000.

TIS Committee. Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification Version 1.2, May 1995.

Alan Turing. Can a machine think? In The World of Mathematics, volume 4. Simon & Schuster, 1956.

Todd L. Veldhuizen and Dennis Gannon. Active libraries: Rethinking the roles of compilers and libraries. In Proceedings of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing. SIAM Press, 1998.

Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: an infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12), 1994.

Junfeng Yang, Can Sar, Paul Twohey, Cristian Cadar, and Dawson Engler. Automatically generating malicious disks using symbolic execution. In Proceedings of the 2006 IEEE Symposium on Security and Privacy. IEEE Computer Society, 2006.
