STRINGS AND AUTOMATA MODULO THEORIES
Margus Veanes
SMT'15, San Francisco, July 18, 2015



MOTIVATION

• Symbolic execution
  – Path feasibility analysis involving string constraints
  – Regular expression matching
• Security vulnerabilities ([OWASP] top 1, 3 culprits)
  – SQL injection attacks
  – XSS attacks
  – DoS attacks
    • e.g. regex injection
  – Directory traversal attacks
  – …
• Data processing
  – Parallelization
  – Deforestation
• Malware detection

Example URL shown on the slide:
http://foo.bar.system/scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir+c:\

“EARLY” WORK RELATED TO STRING ANALYSIS

• Tools
  – Mona: Henriksen-Jensen-Jørgensen-Klarlund-Paige-Rauhe-Sandholm, TACAS’95
    • Built on BRICS automata library
  – JSA: Christensen-Møller-Schwartzbach, SAS’03 (uses BRICS)
  – Haderach: Shannon-Hajra-Lee-Zhan-Khurshid, MUTATION’07 (uses BRICS)
• Theory
  – Bjørner, PhD Thesis’98 (decision procedure for queues)
  – Blumensath-Grädel, LICS’00 (automatic structures)
  – Benedikt-Libkin-Schwentick-Segoufin, LICS’01 (regular string relations)
  – Khoussainov-Nies-Rubin-Stephan, LICS’04 (automatic Boolean algebras)
  – Bala, STACS’04 (regular term matching)
  – Kunc, DLT’2007 (complexity of language equations)

THE RISE OF THE STRING ANALYZERS

• String theory encodings in SMT:
  – Pex-LL: Bjørner-Tillmann-Voronkov, TACAS’09 (strings + SMT)
  – Reggae: Li-Xie-Tillmann-deHalleux-Schulte, ASE’09 (symbolic exploration of regex code)
  – Z3-str: Zheng-Zhang-Ganesh, ESEC/FSE 2013 (plugin to Z3)
  – CVC4-str: Liang-Reynolds-Tinelli-Barrett-Deters, CAV’14 (DPLL(TSLRp))
  – S3: Trinh-Chu-Jaffar, CCS’14 (uses Z3-str-star)
• Automata related:
  – Stranger: Yu-Alkhalaf-Bultan-Ibarra-Cova, SPIN’08, TACAS’09, TACAS’10 (automata based)
  – DPRLE: Hooimeijer-Weimer, PLDI’09 (subset checking)
  – Hampi: Kiezun-Ganesh-Guo-Hooimeijer-Ernst, ISSTA’09 (best paper award) (reduction to BV)
  – Kaluza (in Kudzu): Saxena-Akhawe-Hanna-Mao-McCamant-Song, Oakland’10 (Hampi + mult. var.)
  – Rex: Veanes-deHalleux-Tillmann-Bjørner-deMoura, ICST’10, LPAR’2010 (language acceptors)
  – Bek: Hooimeijer-Livshits-Molnar-Saxena-Veanes-Bjørner, USENIX Security'11, POPL’12 (transducers)
  – Bex: D’Antoni-Veanes, VMCAI’13, CAV’13 (lookahead)
  – PASS: Li-Ghosh, HVC 2013 (best paper award) (array based)
  – SMC: Luu-Shinde-Saxena-Demsky, PLDI’14 (model counting)
• CAV’15:
  – ABC: Aydin-Bang-Bultan (automata based counting, using Stranger and BRICS)
  – NORN: Abdulla-Atig-Chen-Holik-Rezine-Rümmer-Stenman, also CAV’14 (Horn clauses, BRICS)
  – Z3-str+: Zheng-Ganesh-Subramanian-Tripp-Dolby-Zhang (string + regex + length)

TWO QUESTIONS

• What are characters?
• What are strings?

smileycipher(“hello world”) = “😧😤😫😫😮 😶😮😱😫😣”

Is this a string function?

WHAT ARE CHARACTERS?

1. Elements of a finite alphabet Σ?
   – Only primitive operation is =: Σ × Σ → Bool
   – What about Unicode, e.g., 😀 😁 http://unicode.org/charts/PDF/U1F600.pdf
     • |Σ| = 1,112,064
   – For succinctness allow total order ≺: Σ × Σ → Bool and ranges [a-b] (denotes {x | a ≼ x ≼ b})
     • This affects the notion of automaton over Σ!
     • Why not other operations as well?
2. Bit-vectors, say char (BV16)?
   – With primitive operations like &: char × char → char
   – “😀” = “\uD83D\uDE00” (UTF-16 surrogate pair)
     • Σ has its own theory, namely bv theory!
3. Integers (code points)?
   – 😀 = 0x1F600 = 128512
   – e.g. 😀 + 1 = 😁 = 0x1F601
     • Σ has its own theory, namely int theory!
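The bit-vector view and the integer view of the same character can be related concretely. The following is a minimal Python sketch, not part of the talk; the helper name `to_utf16_surrogates` is hypothetical. It recomputes the surrogate pair and the code-point arithmetic quoted above.

```python
def to_utf16_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 | (offset >> 10)       # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits
    return high, low

grinning = ord("😀")                     # integer view: one code point
pair = to_utf16_surrogates(grinning)     # bit-vector view: two 16-bit chars
next_face = chr(grinning + 1)            # integer arithmetic on a character

# Unicode scalar values: all code points minus the surrogate block.
scalar_values = 0x110000 - (0xDFFF - 0xD800 + 1)
```

Here `pair` is the pair of 16-bit values written “\uD83D\uDE00” on the slide, and `scalar_values` is the alphabet size 1,112,064.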

WHAT ARE STRINGS?

• Finite sequences of characters (char)
  – CVC4-str: singleton string = char
• Restricted arrays of int to char
  – Pex-LL, PASS: array<int,char> ≠ char, so singleton string ≠ char
• Finite lists of characters
  – Pex-Rex: list<char> ≠ char, so singleton string ≠ char
• Finite queues
  – transducers

The answer depends on the context and the required operations.
  – First, Last, Rest, Append, Substring, Length, …

ANALYSIS TASKS

• Consider character type C, string type S<C>, and regular expression type R<C>.
  – When is DPLL(T_C, T_S<C>, T_R<C>) possible/feasible?
• What about (finite state) transducers?
  – Regular transformations of type S<T_in> → S<T_out>
  – Typically T_in = T_out = bit-vectors
  – Many string transformations are such:
    • sanitizers, encoders

HTML ENCODER

[Code listing not preserved. Callout on the slide: arithmetic operations on characters.]

FOR EACH DOMAIN SPECIFIC TASK

Design a language that
• has only the features required by the task
• is simple to use
• enables automatic reasoning about what the programs do
• compiles into efficient code

THE REST OF THE TALK

• Symbolic Automata and Transducers
• BEK and string sanitizers
• BEX and string encoders
• Data parallel BEK/BEX for string processing

SYMBOLIC FINITE AUTOMATA

SYMBOLIC FINITE AUTOMATON (SFA)

• Labels are predicates
• One symbolic transition:
  p --[λx. 'a' ≤ x ≤ 'd']--> q
• It denotes many concrete transitions from p to q, one for each character in 〚λx. 'a' ≤ x ≤ 'd'〛:
  'a', 'b', 'c', 'd'

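SFA EXECUTION EXAMPLE

• States p and q, with four transitions guarded by λx. x mod 2 = 0 (twice) and λx. x mod 2 = 1 (twice)
• Transitions exercised by the run below: p stays in p on odd input, p moves to q on even input, q moves to p on odd input

Input:  1 2 5 3
States: p p q p p

p is final ⇒ accept the input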
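The run above can be replayed with a small Python sketch in which guards are ordinary predicates. The three transitions used by the run are taken from the example; the fourth (q staying in q on even input) is an assumption, since the run does not exercise it. The names `TRANSITIONS`, `FINAL` and `run` are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

Pred = Callable[[int], bool]

# state -> list of (guard, target state)
TRANSITIONS: Dict[str, List[Tuple[Pred, str]]] = {
    "p": [(lambda x: x % 2 == 1, "p"),
          (lambda x: x % 2 == 0, "q")],
    "q": [(lambda x: x % 2 == 1, "p"),
          (lambda x: x % 2 == 0, "q")],   # assumed self-loop, not used below
}
FINAL = {"p"}

def run(word: List[int], start: str = "p") -> Tuple[List[str], bool]:
    """Return the visited states and whether the word is accepted."""
    states = [start]
    for x in word:
        targets = [t for guard, t in TRANSITIONS[states[-1]] if guard(x)]
        if not targets:                    # no enabled transition: reject
            return states, False
        states.append(targets[0])          # guards are mutually exclusive here
    return states, states[-1] in FINAL

trace, accepted = run([1, 2, 5, 3])
```

One predicate per transition covers infinitely many concrete integers, which is the point of the symbolic representation.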

SYMBOLIC FINITE AUTOMATA
What is the alphabet?

ALPHABET IS AN EFFECTIVE BOOLEAN ALGEBRA

(D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬)

• D: domain
• P: predicates
• 〚_〛: P → 2^D

ALPHABET EXAMPLE

2^{a,b} = (D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬) where
• D = {a,b}
• P = {∅, {a}, {b}, {a,b}}
• 〚_〛 = id
• ⊥ = ∅, ⊤ = {a,b}
• ∨ = ∪, ∧ = ∩, ¬ = complement (c)

regex: a*b(a|b)*

SFA over 2^{a,b}: states p and q with transitions
  p --{a}--> p
  p --{b}--> q
  q --{a,b}--> q

ALPHABET EXAMPLE: 2^BVk

• D = {n | 0 ≤ n < 2^k}
• P = BDDs of depth k
• Boolean operations are BDD operations

Below, 〚β_i〛 = {n ∈ D | i'th bit of n is 1}

β_i has fixed size, independent of i

ALPHABET EXAMPLE: SMT^INT

• D = Integers
• P = integer linear arithmetic formulas (with one fixed free variable)
• 〚φ ∧ ψ〛 = 〚φ〛 ∩ 〚ψ〛
• 〚⊥〛 = ∅, 〚¬φ〛 = D \ 〚φ〛
• Satisfiability: 〚φ〛 ≠ ∅

BOOLEAN ALGEBRA INTERFACE IN C#

using System.Collections.Generic;

public interface IBoolAlg<P>
{
    P Top { get; }
    P Bot { get; }
    P Not(P pred);
    P Or(P pred1, P pred2);
    P And(P pred1, P pred2);
    bool IsSat(P predicate);
}

public interface IBoolAlgExt<P, D> : IBoolAlg<P>
{
    IEnumerable<D> Den(P pred);  // elements denoted by pred
    P One(D elem);               // predicate for the single element elem
}

UNIT ALPHABET EXAMPLE IN C#

One-letter alphabet

class A1 : IBoolAlg<bool>
{
    public bool Top { get { return true; } }
    public bool Bot { get { return false; } }
    public bool Not(bool pred) { return !pred; }
    public bool Or(bool pred1, bool pred2) { return pred1 || pred2; }
    public bool And(bool pred1, bool pred2) { return pred1 && pred2; }
    public bool IsSat(bool pred) { return pred; }
}

ANOTHER ALPHABET EXAMPLE IN C#

16-letter alphabet

using System;

class A16 : IBoolAlg<UInt16>
{
    public UInt16 Top { get { return 0xFFFF; } }
    public UInt16 Bot { get { return 0; } }
    // Bitwise operators on UInt16 yield int in C#, so results are cast back.
    public UInt16 Not(UInt16 pred) { return (UInt16)~pred; }
    public UInt16 Or(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 | pred2); }
    public UInt16 And(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 & pred2); }
    public bool IsSat(UInt16 pred) { return pred != 0; }
}

ALPHABET TRANSFORMATIONS

• Effective Boolean algebras can be extended
  – e.g. disjoint union
• Effective Boolean algebras can be restricted
  – e.g. restriction wrt. a given predicate

DISJOINT UNION OF ALPHABETS IN C#

public class PairAlg<S, T> : IBoolAlg<Pair<S, T>>
{
    IBoolAlg<S> A;
    IBoolAlg<T> B;
    public Pair<S, T> Bot { get { return new Pair<S, T>(A.Bot, B.Bot); } }
    …
    public Pair<S, T> Or(Pair<S, T> a, Pair<S, T> b)
    {
        return new Pair<S, T>(A.Or(a[0], b[0]), B.Or(a[1], b[1]));
    }
    public bool IsSat(Pair<S, T> p)
    {
        // A pair is satisfiable if either component is.
        return A.IsSat(p[0]) || B.IsSat(p[1]);
    }
}
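For readers who want to experiment outside C#, the same idea can be sketched in Python. This is an illustrative analogue, not the Automata-.NET API: `BitsetAlg` generalizes the one-letter and 16-letter algebras to n-bit masks, and `PairAlg` mirrors the disjoint union, where a pair is satisfiable if either component is.

```python
class BitsetAlg:
    """Predicates are n-bit masks; bit i set means letter i satisfies the predicate."""
    def __init__(self, n: int):
        self.top = (1 << n) - 1
        self.bot = 0
    def neg(self, p): return self.top & ~p
    def or_(self, p, q): return p | q
    def and_(self, p, q): return p & q
    def is_sat(self, p): return p != 0

class PairAlg:
    """Disjoint union of two algebras; predicates are pairs (pa, pb)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.top = (a.top, b.top)
        self.bot = (a.bot, b.bot)
    def neg(self, p): return (self.a.neg(p[0]), self.b.neg(p[1]))
    def or_(self, p, q): return (self.a.or_(p[0], q[0]), self.b.or_(p[1], q[1]))
    def and_(self, p, q): return (self.a.and_(p[0], q[0]), self.b.and_(p[1], q[1]))
    def is_sat(self, p): return self.a.is_sat(p[0]) or self.b.is_sat(p[1])

a1, a16 = BitsetAlg(1), BitsetAlg(16)
union = PairAlg(a1, a16)        # alphabet with 1 + 16 letters
only_left = (1, 0)              # the single letter of the first alphabet
```

The usual Boolean laws carry over componentwise, for example `only_left ∧ ¬only_left` is unsatisfiable and `only_left ∨ ¬only_left` equals `union.top`.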

SFA VS. CLASSICAL AUTOMATA?

• SFAs can support infinite alphabets
• For some cases SFAs are exponentially more succinct than NFAs

Example (recall the BDDs β_i from before): [figure not preserved]

Equivalent NFA requires 2^k transitions.

SYMBOLIC FINITE AUTOMATA
Algorithms over SFAs.

ALGORITHMS OVER SFAS

• Language intersection
  – Uses product of automata
• Language complementation
  – Requires determinization
• Minimization
  – Extensions of Moore/Hopcroft [POPL’14]
• Regex → SFA construction
  – Uses BDDs to represent Unicode character sets
  – Requires BDD/interval-set conversions
    • May cause exponential blowup: recall the BDDs β_i

LANGUAGE INTERSECTION

• Uses DFS and product of transitions
• A: p1 --φ--> q1 and B: p2 --ψ--> q2 give A×B: (p1,p2) --φ∧ψ--> (q1,q2)
• The product transition is deleted when φ∧ψ is unsat

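INTERSECTION EXAMPLE

let φ_k(x) ≝ ((x mod k) = 0)

[Figure not preserved: SFA A (states a1, a2) and SFA B (states b1, b2) with guards built from φ_2, φ_3 and φ_6, and their product A×B over pairs of states. A product transition whose conjoined guard is unsatisfiable is crossed out.]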
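The product construction can be sketched in a few lines of Python. This is a hedged illustration, not the talk's implementation: predicates are modeled as subsets of an assumed finite sample domain so that IsSat is a non-emptiness check (a real implementation would call an SMT solver), and the two automata below are hypothetical, chosen only because they use the φ_k predicates.

```python
DOMAIN = frozenset(range(12))            # assumed finite sample domain

def phi(k):
    """Denotation of φ_k(x) = ((x mod k) = 0) over DOMAIN."""
    return frozenset(x for x in DOMAIN if x % k == 0)

def product(init_a, trans_a, init_b, trans_b):
    """DFS over reachable state pairs; transitions are (source, guard, target)."""
    start = (init_a, init_b)
    seen, stack, result = {start}, [start], []
    while stack:
        pa, pb = stack.pop()
        for sa, ga, ta in trans_a:
            if sa != pa:
                continue
            for sb, gb, tb in trans_b:
                if sb != pb:
                    continue
                guard = ga & gb            # conjunction of the two guards
                if not guard:              # unsat: delete the transition
                    continue
                result.append(((pa, pb), guard, (ta, tb)))
                if (ta, tb) not in seen:
                    seen.add((ta, tb))
                    stack.append((ta, tb))
    return result

# Hypothetical automata (not the ones on the slide).
A = [("a1", phi(6), "a2"), ("a1", phi(2), "a1")]
B = [("b1", DOMAIN - phi(3), "b2"), ("b1", phi(3), "b1")]
AB = product("a1", A, "b1", B)
edges = {(src, dst) for src, _, dst in AB}
```

The candidate transition from (a1,b1) to (a2,b2) has guard φ_6 ∧ ¬φ_3, which is unsatisfiable, so it never appears in `AB`.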

LANGUAGE COMPLEMENTATION

First determinize, then swap final and nonfinal states.

[Figure not preserved: state p with transitions to q and r; determinization yields the subset states {p}, {q}, {q,r} and {r}; transitions with unsat guards are deleted.]
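One way to realize this is a subset construction over minterms, that is, the satisfiable Boolean combinations of the outgoing guards. The Python sketch below assumes a finite domain so that predicates are sets; the guards `PHI` and `PSI` and all function names are illustrative assumptions.

```python
DOMAIN = frozenset(range(10))            # assumed finite character domain

def minterms(guards):
    """Satisfiable combinations of guards, each taken positively or negated."""
    regions = [DOMAIN]
    for g in guards:
        regions = [r for part in regions for r in (part & g, part - g) if r]
    return regions

def determinize(init, finals, transitions):
    start = frozenset([init])
    seen, work, det_trans = {start}, [start], []
    while work:
        subset = work.pop()
        out = [(g, t) for (s, g, t) in transitions if s in subset]
        for region in minterms([g for g, _ in out]):
            # A minterm lies inside a guard or is disjoint from it.
            target = frozenset(t for g, t in out if region <= g)
            det_trans.append((subset, region, target))
            if target not in seen:
                seen.add(target)
                work.append(target)
    det_finals = {s for s in seen if s & finals}
    return start, seen, det_finals, det_trans

def complement(init, finals, transitions):
    start, states, det_finals, det_trans = determinize(init, finals, transitions)
    return start, states, states - det_finals, det_trans   # swap final/nonfinal

def accepts(start, finals, det_trans, word):
    state = start
    for x in word:
        state = next(t for s, region, t in det_trans if s == state and x in region)
    return state in finals

PHI, PSI = frozenset(range(0, 6)), frozenset(range(3, 9))  # overlapping guards
T = [("p", PHI, "q"), ("p", PSI, "r")]
c_start, c_states, c_finals, c_trans = complement("p", {"q"}, T)
```

The empty subset acts as a sink state, which makes the deterministic automaton total; without that, swapping final and nonfinal states would not yield the complement.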

MINIMIZATION (SYMBOLIC MOORE)

D := (F × (Q\F)) ∪ ((Q\F) × F)
foreach (p’,q’) ∈ D, (p,q) ∉ D
    if (IsSat(guard(p,p’) ∧ guard(q,q’)))
        add (p,q) to D

Picture: p --φ--> p’ and q --ψ--> q’, where p’ and q’ are distinguishable.
Then p and q are distinguishable if IsSat(φ ∧ ψ).
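The loop above can be run to a fixpoint as follows. This Python sketch assumes a deterministic, total SFA and again models guards as subsets of a small finite domain, so IsSat(φ ∧ ψ) becomes a non-empty intersection; the example automaton is hypothetical.

```python
DOMAIN = frozenset(range(10))
EVEN = frozenset(x for x in DOMAIN if x % 2 == 0)
ODD = DOMAIN - EVEN

def distinguishable_pairs(states, finals, transitions):
    """transitions: list of (source, guard, target); returns the relation D."""
    D = {(p, q) for p in states for q in states
         if (p in finals) != (q in finals)}
    changed = True
    while changed:
        changed = False
        for p, guard_p, p2 in transitions:
            for q, guard_q, q2 in transitions:
                if (p2, q2) in D and (p, q) not in D and (guard_p & guard_q):
                    D.add((p, q))
                    D.add((q, p))
                    changed = True
    return D

# Hypothetical SFA: s1 and s2 behave identically and can be merged.
STATES = {"s0", "s1", "s2", "s3"}
FINALS = {"s3"}
TRANS = [("s0", EVEN, "s1"), ("s0", ODD, "s2"),
         ("s1", DOMAIN, "s3"), ("s2", DOMAIN, "s3"),
         ("s3", DOMAIN, "s3")]
D = distinguishable_pairs(STATES, FINALS, TRANS)
mergeable = {(p, q) for p in STATES for q in STATES
             if p < q and (p, q) not in D}
```

Pairs left outside `D` are equivalent states; here only s1 and s2 remain, so the minimal automaton has three states.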

REGEX → SFA

• Classical algorithm extended to work with predicates
  – First produces εSFA (SFA with ε-moves)
  – Then ε-moves are eliminated using the standard ε-elimination algorithm
  – Requires interval-set → BDD algorithm for converting character classes

Example: [\0x0-\0xFF] = BDD whose bits in pos. > 7 are 0

ONLINE SFA ALGORITHM EXAMPLES

• http://www.rise4fun.com/Bex/zE

SYMBOLIC FINITE TRANSDUCERS

SYMBOLIC FINITE TRANSDUCER (SFT)

• Labels are guarded transformation functions
• Symbolic transition from p to q (hexadecimal constants):
  λx. 0x80 ≤ x ≤ 0x7FF / [0xC0 | x[10:6], 0x80 | x[5:0]]
  – the part before "/" is the guard
  – the output terms use bitvector operations; x[i:j] denotes bits i down to j of x
• It denotes 1920 concrete transitions from p to q:
  ‘\x80’/“\xC2\x80”, …, ‘\x7FF’/“\xDF\xBF”
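The symbolic transition above is the two-byte case of UTF-8 encoding, which can be checked with a short Python sketch. The function names `guard` and `output` are illustrative; the bit slices are written with shifts and masks.

```python
def guard(x: int) -> bool:
    """Guard of the symbolic transition: 0x80 <= x <= 0x7FF."""
    return 0x80 <= x <= 0x7FF

def output(x: int) -> list:
    """Output of the transition: [0xC0 | x[10:6], 0x80 | x[5:0]]."""
    return [0xC0 | ((x >> 6) & 0x1F), 0x80 | (x & 0x3F)]

# Number of concrete transitions the single symbolic transition stands for.
concrete = [x for x in range(0x10000) if guard(x)]

# Compare with Python's own UTF-8 encoder on every character in the range.
agrees = all(bytes(output(x)) == chr(x).encode("utf-8") for x in concrete)
```

`len(concrete)` is the 1920 concrete transitions mentioned above, and `output(0x80)` and `output(0x7FF)` give the two endpoints “\xC2\x80” and “\xDF\xBF”.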

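SFT EXECUTION EXAMPLE

• States p and q, with four transitions labeled
  x mod 2 = 0 / [x, x]
  x mod 2 = 1 / [x-1]
  x mod 2 = 0 / []
  x mod 2 = 1 / [x-1]
• Transitions exercised by the run below: in p, odd x outputs [x-1] and stays in p; in p, even x outputs [x, x] and moves to q; in q, odd x outputs [x-1] and moves to p

Input tape:  1 2 5 3
States:      p p q p p
Output tape: 0 2 2 4 2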
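The same run can be reproduced with a small Python sketch. Three of the four rules are determined by the example; placing the rule x mod 2 = 0 / [] on q as a self-loop is an assumption, since the input never exercises it. `RULES` and `transduce` are illustrative names.

```python
# state -> list of (guard, output function, target state)
RULES = {
    "p": [(lambda x: x % 2 == 0, lambda x: [x, x], "q"),
          (lambda x: x % 2 == 1, lambda x: [x - 1], "p")],
    "q": [(lambda x: x % 2 == 0, lambda x: [], "q"),      # assumed target
          (lambda x: x % 2 == 1, lambda x: [x - 1], "p")],
}

def transduce(word, start="p"):
    """Return (visited states, output tape) for the given input tape."""
    state, trace, out = start, [start], []
    for x in word:
        for guard, func, target in RULES[state]:
            if guard(x):
                out.extend(func(x))       # append this step's output symbols
                state = target
                trace.append(state)
                break
        else:
            raise ValueError(f"no transition from {state} on {x}")
    return trace, out

trace, tape = transduce([1, 2, 5, 3])
```

Each step may emit zero, one or several output symbols, which is why the output tape is longer than the input tape here.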

SYMBOLIC FINITE TRANSDUCERS
Properties and algorithms

WHY SFTS?

• They have good algebraic properties (POPL'12)
  – SFTs are closed under composition
  – Equivalence is decidable in the single-valued case
  – Domain of an SFT is an SFA
    • SFAs are closed under Boolean operations
• Useful for various analysis tasks

SFT COMPOSITION

A∘B = λx.B(A(x))

A:   a1 --[x>0 / [x+1, x+2]]--> a2
B:   b1 --[x<5 / []]--> b2 --[x<4 / [x, x]]--> b3
A∘B: (a1,b1) --[x>0 ∧ x+1<5 ∧ x+2<4 / [x+2, x+2]]--> (a2,b3)
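The composed label can be sanity-checked by comparing it with running A and then B. The Python sketch below hard-codes the three transitions of the example and assumes that a2 and b3 are the only final states, which the extracted diagram does not state; `None` stands for "input rejected".

```python
def run_A(xs):
    """A: a1 --[x>0 / [x+1, x+2]]--> a2 (a2 assumed final)."""
    if len(xs) != 1 or not xs[0] > 0:
        return None
    return [xs[0] + 1, xs[0] + 2]

def run_B(ys):
    """B: b1 --[x<5 / []]--> b2 --[x<4 / [x, x]]--> b3 (b3 assumed final)."""
    if len(ys) != 2 or not ys[0] < 5 or not ys[1] < 4:
        return None
    return [ys[1], ys[1]]

def run_AB(xs):
    """Composed transition: guards of B are applied to the outputs of A."""
    if len(xs) != 1:
        return None
    x = xs[0]
    if x > 0 and x + 1 < 5 and x + 2 < 4:
        return [x + 2, x + 2]
    return None

def sequential(xs):
    mid = run_A(xs)
    return None if mid is None else run_B(mid)

same = all(run_AB([x]) == sequential([x]) for x in range(-5, 10))
```

Over the integers the composed guard is satisfied only by x = 1, so an SMT solver would report it satisfiable and the product transition is kept.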

SFT ALGORITHMS

• Composition: feeding the output of SFT A into SFT B (in → SFT A → SFT B → out) is again a single SFT A∘B (in → SFT A∘B → out)
• Equiv. checking for single-valued SFTs (undecidable in general): given SFT A and SFT B, either they are equivalent or an “input string” is produced showing that A and B are not equivalent

Algorithms use SMT for satisfiability checking of character formulas

PROPERTY ANALYSIS (USENIX SEC'11)

• Does it matter if a sanitizer is applied twice? Idempotence: is A∘A equivalent to A? If not, an “input string” is produced showing that A is not idempotent.
• Does order of sanitizers matter? Commutativity: is A∘B equivalent to B∘A? If not, an “input string” is produced showing that A and B are not commutative.

APPLICATIONS

APPLICATIONS OF SFAS/SFTS

• SFAs:
  – Regex support in parameterized unit testing
  – Fuzz testing of regexes
  – Password generation
• SFTs:
  – Analysis of string encoders/decoders
  – Security analysis of sanitizers

APPLICATION 1
REGEXES IN PARAMETERIZED UNIT TESTING

• Rex component in Pex
• Generate values for s that reach the return branches
  – s is a string of Unicode characters (16-bit bit-vectors)

bool IsValidEmail(string s)
{
    string r1 = @"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$";
    string r2 = @"^\d.*$";
    if (System.Text.RegularExpressions.Regex.IsMatch(s, r1))
        if (System.Text.RegularExpressions.Regex.IsMatch(s, r2))
            return false; // branch 1
        else
            return true;  // branch 2
    else
        return false;     // branch 3
}

Branch 1: solve s ∈ L(r1) ∩ L(r2) [e.g. s = “3@a.b”]
Branch 2: solve s ∈ L(r1) \ L(r2) [e.g. s = “a@b.c”]
Branch 3: solve s ∉ L(r1) [e.g. s = “a@..c”]
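The three example inputs can be checked against the two regexes with Python's `re` module, used here as a stand-in for .NET's `Regex.IsMatch` (the two engines differ in some details, but not on these patterns and inputs). The function `branch` is a hypothetical helper that reports which return branch an input reaches.

```python
import re

R1 = r"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$"
R2 = r"^\d.*$"

def branch(s: str) -> int:
    """Return the number of the return branch that IsValidEmail(s) would reach."""
    if re.search(R1, s):          # search mirrors IsMatch: anchors are in the pattern
        if re.search(R2, s):
            return 1              # s in L(r1) and in L(r2)
        return 2                  # s in L(r1) but not in L(r2)
    return 3                      # s not in L(r1)

reached = [branch(s) for s in ("3@a.b", "a@b.c", "a@..c")]
```

This only confirms given witnesses; producing such witnesses automatically is what the automata-based solving in Rex is for.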

APPLICATION 2
PASSWORD GENERATION

Given constraints:
• Length is k: "^[\x21-\x7E]{k}$"
• Contains 2 capital letters: "[A-Z].*[A-Z]"
• Contains a digit: "\d"
• Contains a non-word character: "\W"

Generate random instances with uniform distribution that match all the above conditions.

k=4: http://www.rise4fun.com/Rex/4nE

http://www.rise4fun.com/Bek/c3j
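As a point of comparison, uniform sampling over the matching strings can also be obtained by plain rejection sampling, which is a different and much less efficient technique than the automata-based generation in Rex. The Python sketch below draws candidates uniformly from [\x21-\x7E]^k and keeps the first one that satisfies every constraint; `constraints` and `sample_password` are hypothetical names.

```python
import random
import re

ALPHABET = [chr(c) for c in range(0x21, 0x7F)]   # printable ASCII \x21-\x7E

def constraints(k: int):
    return [re.compile(r"^[\x21-\x7E]{%d}$" % k),  # length is k
            re.compile(r"[A-Z].*[A-Z]"),           # two capital letters
            re.compile(r"\d"),                     # a digit
            re.compile(r"\W")]                     # a non-word character

def sample_password(k: int, rng: random.Random, max_tries: int = 100000) -> str:
    """Rejection sampling: uniform over all strings meeting every constraint."""
    checks = constraints(k)
    for _ in range(max_tries):
        candidate = "".join(rng.choice(ALPHABET) for _ in range(k))
        if all(check.search(candidate) for check in checks):
            return candidate
    raise RuntimeError("no matching candidate found")

password = sample_password(4, random.Random(0))
```

Because every candidate is equally likely and rejection depends only on the candidate itself, accepted strings are uniformly distributed over the intersection of the four languages. The acceptance rate drops quickly for tighter constraints, which is where an automaton for the intersection pays off.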

APPLICATION 3
SAFETY ANALYSIS

Example: suppose good output = “NoEars”
• NoEars = [^\uDE38-\uDE40]*
• bad output: WithEars = Complement(NoEars)

Does there exist an input x that causes “ears” in the output?

∃x (smileycipher(x) ∈ WithEars)?

{x | smileycipher(x) ∈ WithEars}

http://www.rise4fun.com/Bek/5sHO

EXTENSIONS

EXTENSIONS OF SFAS AND SFTS

• ESFT
  – SFA/SFTs with look-ahead [CAV'13]
  – BEX language
• STT
  – Symbolic automata/transducer over trees
  – FAST language [PLDI’14]
• k-SFT
  – SFT with lookback [POPL’15]

ESFAS AND ESFTS

• Unlike in the classical case, look-ahead breaks many properties
  – e.g. equivalence of ESFAs is undecidable

Example ESFT (base64 encoder) with a single state q and one transition that reads 3 symbols and writes 4 symbols:

x1≤0xFF ∧ x2≤0xFF ∧ x3≤0xFF / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]

Input:  M a n M a n
Output: T W F u T W F u

http://www.rise4fun.com/Bex/tutorial/guide
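The transition above can be executed directly. In the Python sketch below, `esft_rule` is a literal transcription of the guard and the four output terms; mapping the resulting 6-bit values to letters through the standard Base64 alphabet is an extra step assumed here, and padding for inputs whose length is not a multiple of 3 is left out.

```python
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def esft_rule(x1: int, x2: int, x3: int) -> list:
    """Read 3 symbols, write 4 six-bit values, as in the transition above."""
    if not (x1 <= 0xFF and x2 <= 0xFF and x3 <= 0xFF):   # guard
        raise ValueError("guard violated")
    return [x1 >> 2,
            ((x1 & 3) << 4) | (x2 >> 4),
            ((x2 & 0xF) << 2) | (x3 >> 6),
            x3 & 0x3F]

def encode(data: bytes) -> str:
    """Apply the rule to each 3-byte block (no padding in this sketch)."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch handles only multiples of 3 bytes")
    digits = []
    for i in range(0, len(data), 3):
        digits.extend(esft_rule(*data[i:i + 3]))
    return "".join(B64[d] for d in digits)

encoded = encode(b"ManMan")
```

On this input the result agrees with the standard library's `base64.b64encode`, matching the M a n to T W F u example.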

FAST (TREE TRANSDUCERS)

• Trees are common input/output data structures
  – XML query, type-checking, etc…
  – Natural Language translators (from parse tree to parse tree)
  – Compilers/optimizers (from parse tree to parse tree)
  – Tree manipulating programs: data structures algorithms, ontologies, etc…
  – Augmented Reality
  – http://www.rise4fun.com/Fast/tutorial/guide

OUR RECIPE FOR EACH TASK

• DSL (example program below)
• Transducer Model (Automata-.NET, Z3)
  – Transformation
  – Analysis: analysis question, "Does it do the right thing?"
• Code Gen: C#, JavaScript, C

s := iter(c in t)[b := false;] {
    case (!b && c in "[\"\\]"):
        b := false;
        yield('\\', c);
    case (c == '\\'):
        b := !b;
        yield(c);
    case (true):
        b := false;
        yield(c);
};

Automata-.NET will be open source on GitHub under MIT license

Some references:

BEK
• Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11
• Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjørner, POPL12

BEX
• Static analysis of string encoders and decoders. D’Antoni, Veanes, VMCAI13
• Equivalence of extended symbolic finite transducers. D’Antoni, Veanes, CAV13
• Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits, POPL15

QUESTIONS?

Links to related online tutorials:
  – Bek: http://rise4fun.com/Bek/tutorial
  – Bex: http://rise4fun.com/Bex/tutorial
  – Rex: http://rise4fun.com/rex/
  – Fast: http://rise4fun.com/Fast/tutorial