Probabilistic Programming with Imperative Factor Graphs


Joint work with Karl Schultz, Sameer Singh, Michael Wick, Sebastian Riedel. Some slide material from Avi Pfeffer.

Andrew McCallum

Department of Computer Science, University of Massachusetts Amherst

Probabilistic Programming with Imperative Factor Graphs

Uncertainty

• Uncertainty is ubiquitous
  - Partial information
  - Noisy sensors
  - Non-deterministic actions
  - Exogenous events

• Reasoning under uncertainty is a central challenge for building intelligent systems.

Probability

• Probability provides a mathematically sound basis for dealing with uncertainty.

• Combined with utilities, provides a basis for decision-making under uncertainty.

Probabilistic Modeling in the Last Few Years

• Models ever growing in richness and variety
  - hierarchical
  - spatio-temporal
  - relational
  - infinite

• Developing the representation, inference and learning for a new model is a significant task.

Conditional Random Fields

Finite-state model. Undirected graphical model, trained to maximize the conditional probability of the output (state) sequence given the input (observation) sequence.

(Figure: linear-chain graphical model; state sequence y1 ... y8 above observation sequence x1 ... x8, with transition factors between adjacent states and observation factors linking each state to its observation.)

p(\vec{y}|\vec{x}) = \frac{1}{Z_{\vec{x}}} \prod_{t=1}^{|\vec{x}|} \phi(y_t, y_{t-1}) \, \phi(x_t, y_t), \qquad \phi(\cdot) = \exp\Big(\sum_k \lambda_k f_k(\cdot)\Big)

(Linear-chain) [Lafferty, McCallum, Pereira 2001]

Conditional Random Fields: Applications

Wide-spread interest, positive experimental results in many applications:
- Noun-phrase, named-entity tagging [HLT'03], [CoNLL'03]
- Protein structure prediction [ICML'04]
- IE from bioinformatics text [Bioinformatics '04], ...
- Asian word segmentation [COLING'04], [ACL'04]
- IE from research papers [HLT'04]
- Object classification in images [CVPR '04]

Skip-chain CRF [Sutton, McCallum 2005]

Joint NER across sentences; capture long-distance dependencies.

(Figure: ". . . Senator Joe Green said today . Green chairs the ..." with a skip edge connecting the labels of the two "Green" mentions.)

Factorial CRF [Sutton, McCallum '04]

Joint part-of-speech tagging, NP chunking, and NER: stacked layers of labels (part-of-speech, noun-phrase boundaries, named-entity tags) over the English words, e.g. "Those surfers like San Jose".

Inference by loopy belief propagation.

Pairwise Affinity CRF [McCallum & Wellner 2003]

(Figure: the mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she" connected pairwise by coreference variables labeled C (coreferent) or N (not coreferent).)

Entity Resolution

(Figure, animated over several slides: the five "mentions" (Mr. Hill, Amy Hall, Dana Hill, Dana, she) are grouped into "entities", with alternative partitions shown.)

CRF for Co-reference [McCallum & Wellner 2003]

(Figure: the same mentions, now with a pairwise coreference variable y_ij, labeled C or N, for each pair.)

p(\vec{y}|\vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big(\sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij})\Big) \quad \text{+ mechanism for preserving transitivity}

Make pair-wise merging decisions jointly by:
- calculating a joint probability
- including all edge weights
- enforcing transitivity (y_ij ∧ y_jk ⇒ y_ik, so the C/N labels always describe a consistent partition).

Pairwise Affinity is not Enough

(Figure: the same pairwise model, but now over the mentions "Amy Hall", "she", "she", "she", "she": pairwise affinities alone cannot express that a cluster consisting almost entirely of pronouns is suspicious.)

Pairwise Comparisons Not Enough

Examples:

• Are ∀ mentions in the cluster pronouns?

• Entities have multiple attributes (name, email, institution, location); need to measure "compatibility" among them.

• Having 2 "given names" is common, but not 4.
  - e.g. Howard M. Dean / Martin, Dean / Howard Martin

• Need to measure the size of the clusters of mentions.

• ∃ a pair of last-name strings that differ by edit distance > 5?

We need to ask ∃, ∀ questions about a set of mentions.

We want first-order logic!
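For concreteness, here is a minimal sketch (hypothetical Mention type and helper names, not part of any model in this talk) of the kinds of ∃/∀ questions above, written as ordinary Scala predicates over a cluster of mentions:

  object ClusterQuestions {
    case class Mention(string: String, isPronoun: Boolean, lastName: Option[String])

    // ∀: is every mention in the cluster a pronoun?
    def allPronouns(cluster: Seq[Mention]): Boolean = cluster.forall(_.isPronoun)

    // Levenshtein edit distance, used to compare last-name strings.
    def editDistance(s: String, t: String): Int = {
      val d = Array.tabulate(s.length + 1, t.length + 1)((i, j) =>
        if (i == 0) j else if (j == 0) i else 0)
      for (i <- 1 to s.length; j <- 1 to t.length)
        d(i)(j) = List(d(i - 1)(j) + 1,
                       d(i)(j - 1) + 1,
                       d(i - 1)(j - 1) + (if (s(i - 1) == t(j - 1)) 0 else 1)).min
      d(s.length)(t.length)
    }

    // ∃: is there a pair of last-name strings that differ by edit distance > 5?
    def hasFarLastNames(cluster: Seq[Mention]): Boolean =
      cluster.combinations(2).exists { case Seq(a, b) =>
        (a.lastName, b.lastName) match {
          case (Some(x), Some(y)) => editDistance(x, y) > 5
          case _                  => false
        }
      }
  }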


Partition Affinity CRF

Ask arbitrary questions about all entities in a partition with first-order logic...

(Figure, animated over several slides: factors attach to entire clusters of the partition over "Amy Hall" and the four "she" mentions, not merely to pairs.)

How can we perform inference and learning in models that cannot be “unrolled” ?

Can't use belief propagation. Can't use standard integer linear programming.

Don't represent all alternatives... just one at a time.

(Figure: two alternative partitions of the mentions; a stochastic jump, drawn from a proposal distribution, moves between configurations.)

Markov Chain Monte Carlo: Metropolis-Hastings & SampleRank

Metropolis-Hastings for MAP

Given a factor graph with target variables y and observed x:

Maximum a Posteriori (MAP) inference: \arg\max_{y \in F} P(Y = y | x)

...over a model

P(y|x) = \frac{1}{Z_x} \prod_{y_i \in y} \psi(x, y_i)

...using a proposal distribution q(y'|y) : F \times F \to [0, 1].

MH for MAP:
1. Begin with some initial configuration y_0 ∈ F.
2. For i = 1, 2, 3, ... draw a local modification y' ∈ F from q.
3. Probabilistically accept the jump as a Bernoulli draw with parameter α:

\alpha = \min\Big(1, \; \frac{p(y')}{p(y)} \, \frac{q(y|y')}{q(y'|y)}\Big)

(F is the feasible region defined by deterministic constraints, e.g. clustering consistency, parse-tree projectivity.)


MAP inference can be performed by annealing: apply a decreasing temperature to the ratio of p(y)'s.
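A minimal Scala sketch of this procedure, under assumed hypothetical functions (score returns an unnormalized log-probability; propose returns the modified configuration along with both log q terms); a sketch, not FACTORIE's API:

  import scala.util.Random

  def mhMap[Y](y0: Y,
               steps: Int,
               score: Y => Double,                  // unnormalized log-probability
               propose: Y => (Y, Double, Double),   // (y', log q(y'|y), log q(y|y'))
               rng: Random = new Random): Y = {
    var y = y0
    var best = y0
    var temperature = 1.0
    for (_ <- 1 to steps) {
      val (yPrime, logQFwd, logQBwd) = propose(y)
      // log acceptance: (score(y') - score(y)) / T + log q(y|y') - log q(y'|y)
      val logAlpha = (score(yPrime) - score(y)) / temperature + logQBwd - logQFwd
      if (math.log(rng.nextDouble()) < logAlpha) y = yPrime
      if (score(y) > score(best)) best = y
      temperature *= 0.999  // annealing: sharpens p(y)^(1/T) toward the MAP configuration
    }
    best
  }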

M-H Natural Efficiencies

1. The partition function cancels:

\frac{p(y')}{p(y)} = \frac{p(Y = y' | x; \theta)}{p(Y = y | x; \theta)} = \frac{\frac{1}{Z_x} \prod_{y'_i \in y'} \psi(x, y'_i)}{\frac{1}{Z_x} \prod_{y_i \in y} \psi(x, y_i)} = \frac{\prod_{y'_i \in y'} \psi(x, y'_i)}{\prod_{y_i \in y} \psi(x, y_i)}

2. Unchanged factors cancel:

= \frac{\Big(\prod_{y'_i \in \delta y'} \psi(x, y'_i)\Big)\Big(\prod_{y_i \in y' \setminus \delta y'} \psi(x, y_i)\Big)}{\Big(\prod_{y_i \in \delta y} \psi(x, y_i)\Big)\Big(\prod_{y_i \in y \setminus \delta y} \psi(x, y_i)\Big)} = \frac{\prod_{y'_i \in \delta y'} \psi(x, y'_i)}{\prod_{y_i \in \delta y} \psi(x, y_i)}

where δy is the "diff", i.e. the variables in y that have changed.

How do we learn parameters for this model?

SampleRank: motivation

QUESTION: how do we learn θ for MH?

Problem with traditional ML: inference sits in the inner-most loop of learning.
- Maximum likelihood requires inference for marginals.
- Perceptron requires inference for decoding.

Want: push parameter updates (not inference) into the inner-most loop of learning.

Idea: use MH as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk.


Probabilistic Modeling in the Last Few Years

• Models ever growing in richness and variety
  - hierarchical
  - spatio-temporal
  - relational
  - infinite

• Developing the representation, reasoning and learning for a new model is a significant task.

Probabilistic Programming Languages

• Make it easy to represent rich, complex models, using the full power of programming languages
  - data structures

- control mechanisms

- abstraction

• Inference and learning come for free (or sort of)

• Give you the language to think of and create new models

Small Sampling of Probabilistic Programming Languages

• Logic-based
  - Markov Logic, BLOG, PRISM

• Functional
  - IBAL, Church

• Object-oriented
  - Figaro, Infer.NET

BLOG

#Researcher ~ NumResearchersPrior();

Name(r) ~ NamePrior();

#Paper ~ NumPapersPrior();

FirstAuthor(p) ~ Uniform({Researcher r});

Title(p) ~ TitlePrior();

PubCited(c) ~ Uniform({Paper p});

Text(c) ~ NoisyCitationGrammar (Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));

• Generative model of objects and relations.

• Handles unknown number of objects

• Inference by MCMC.

[Milch et al, 2005]

Church

• Tell generative story-line in Scheme.

• Do MCMC inference over execution paths. For example, a Dirichlet process in Church:

  (define (DP alpha proc)
    (let ((sticks (mem (lambda x (beta 1.0 alpha))))
          (atoms  (mem (lambda x (proc)))))
      (lambda () (atoms (pick-a-stick sticks 1)))))

  (define (pick-a-stick sticks J)
    (if (< (random) (sticks J))
        J
        (pick-a-stick sticks (+ J 1))))

  (define (DPmem alpha proc)
    (let ((dps (mem (lambda args
                      (DP alpha (lambda () (apply proc args)))))))
      (lambda argsin ((apply dps argsin)))))

[Goodman, Mansinghka, Roy, Tenenbaum, 2009]

Figaro [Pfeffer, 2009]

• Generative model of objects and relations.

• Object-oriented (also in Scala!)
  - "Models" are the basic building block, composed of other models, derived by inheritance.
  - Models are objects with conditions, constraints and relations to other objects.
  - Model = data + factors; they are intertwined.

Figaro [Pfeffer, 2009]

The classic "smokers" example: people smoke with probability 0.6, and friends are 3 times as likely to share a smoking habit as to differ:

  Smoke(x)                                  1.5
  ¬Friends(x,y) ∨ ¬Smoke(x) ∨ Smoke(y)      3
  ¬Friends(x,y) ∨ Smoke(x) ∨ ¬Smoke(y)      3

...and the same model in Figaro:

  class Person { val smokes = Flip(0.6) }
  val alice, bob, clara = new Person
  alice.smokes.condition(true)
  val friends = List((alice, bob), (bob, clara))
  def constraint(pair: (Boolean, Boolean)) =
    if (pair._1 == pair._2) 3.0 else 1.0
  for ((p1, p2) <- friends)
    Pair(p1.smokes, p2.smokes).constrain(constraint)

Markov Logic: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005] [Paskin & Russell 2002] [Taskar et al 2003]

Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity; e.g. 1000 constants and arity-3 clauses already yield on the order of 10^9 ground factors.

My Approach

• I'm going to immediately dismiss the generative models.
  - Interesting, but not what performs best in NLP.

Want

• Discriminatively trained factor graphs.

• Best previous example of this: Markov Logic.

Logic + Probability

• Significant interest in this combination
  - Poole, Muggleton, De Raedt, Sato, Domingos, ...

• We now hypothesize that in much of this previous work the "logic" aspect is mostly a red herring.
  - The power comes from repeated relational structures and tied parameters.
  - Logic is one way to specify these structures, but not the only one, and perhaps not the best.
  - In deterministic programming, Prolog was replaced by imperative languages:
    ✦ programmers have to keep the imperative solver in mind after all
    ✦ much domain knowledge is procedural anyway
  - Logical inference is replaced by probabilistic inference.

Declarative Model Specification

• One of the biggest advances in AI & ML.

• Gone too far? Much domain knowledge is also procedural.

• Logic + Probability → Imperative + Probability
  - Rising interest: Church, Infer.NET, ...

• Our approach:
  - Preserve the declarative statistical semantics of factor graphs.
  - Provide imperative hooks to define structure, parameterization, inference, learning. Efficient. Easy to use.
  - "Imperatively-Defined Factor Graphs" (IDFs)

Our Design Goals

• Represent factor graphs
  - emphasis on discriminative undirected models

• Scalability
  - input data, output configuration, factors, tree-width
  - observed data that cannot fit in memory
  - super-exponential number of factors

• Efficient discriminative parameter estimation
  - sensitive to the expense of inference

• Leverage object-oriented benefits
  - modularity, encapsulation, inheritance, ...

• Integrate declarative & procedural knowledge
  - natural, easy to use
  - upcoming slides: 3 examples of injecting imperativ-ism into factor graphs

FACTORIE

• Factor Graphs, Imperative, Extensible

• Implemented as a library in Scala [Martin Odersky]
  - object-oriented & functional
  - type inference
  - lazy evaluation
  - everything an object (int, float, ...)
  - nice syntax for creating "domain-specific languages"
  - runs in the JVM (complete interoperation with Java)
  - "Haskell++ in a Java style"

• Library, not a new "little language"
  - all familiar Java constructs & libraries available to you
  - integrate data pre-processing & evaluation with model specification
  - Scala makes the syntax not too bad
  - but not as compact as a dedicated language (BLOG, MLN)

Stages of FACTORIE programming

1. Define templates for data (i.e. classes)
   - Use data structures just like in deterministic programming.
   - Only special requirement: provide "undo" capability for changes (a minimal sketch of this follows below).

2. Define templates for factors
   - Distinct from the data representation above; makes it easy to modify model scores independently.
   - Use & transform the data's natural relations to define the factors' relations.

3. Optionally, define MCMC proposal functions that leverage domain knowledge.
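A minimal sketch of the undo requirement, using hypothetical names (DiffList, UndoableVar) rather than FACTORIE's actual API: every change records how to reverse itself, so a rejected MCMC jump can be rolled back.

  import scala.collection.mutable.ArrayBuffer

  // Records each change made by a proposal so it can be rolled back.
  class DiffList {
    private val undos = ArrayBuffer[() => Unit]()
    def record(undoFn: () => Unit): Unit = undos += undoFn
    def undo(): Unit = { undos.reverseIterator.foreach(f => f()); undos.clear() }
  }

  // A variable whose mutations are recorded in a DiffList.
  class UndoableVar[T](initial: T) {
    private var current = initial
    def value: T = current
    def set(newValue: T, d: DiffList): Unit = {
      val old = current
      d.record(() => current = old)  // remember how to reverse this change
      current = newValue
    }
  }

  object UndoDemo extends App {
    val d = new DiffList
    val label = new UndoableVar("O")
    label.set("PER", d)
    println(label.value)  // PER
    d.undo()              // proposal rejected: roll back
    println(label.value)  // O
  }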

Scala

• New variable
    var myHometown : String
    var myAltitude = 10523.2

• New constant
    val myName = "Andrew"

• New method
    def climb(increment: Double) = myAltitude += increment

• New class
    class Skier extends Person

• New trait (like a Java interface, but with implementations)
    trait FirstAid { def applyBandage = ... }

• New class with trait
    class BackcountrySkier extends Skier with FirstAid

• New static object [generic]
    object GlobalSkierTable extends ArrayList[Skier]

Example: Linear-Chain CRF for Segmentation

(Figure: a chain of Labels over Words, "Bill loves skiing Tom loves snowshoeing", labeled T F F T F F for segment-beginning tokens.)

Data is defined with ordinary classes; the VarSeq trait provides sequence relations such as label.prev and label.next:

  class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
    val token: Token
  }
  class Token(word: String) extends EnumVariable(word) with VarSeq {
    val label: Label
    def longerThanSix = word.length > 6
  }

Avoid representing relations by indices. Do it directly with members, pointers... any data structure.

Then define the factor templates:

  // Factor templates
  object StateTemplate extends Template1[Label]
  object StateTokenTemplate extends Template2[Label, Token]
  object StateTransitionTemplate extends Template2[Label, Label]

Imperativ-ism #1: Jump Function

• Proposal "jump function"
  - Make changes to world state

• Sometimes simple, sometimes not
  - Sample from a Gaussian with mean at the old value
  - Sample a cluster to split, then run stochastic greedy agglomerative clustering

• Gibbs sampling, one variable at a time
  - poor mixing

• Rich jump function (sketched below)
  - Natural place to embed domain knowledge about which variables should change in concert.
  - Avoid some expensive deterministic factors with property-preserving jump functions (e.g. coref transitivity, dependency-parsing projectivity).
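For illustration, a minimal Scala sketch of one such rich jump function for coreference, with hypothetical types (not FACTORIE's API): move a randomly chosen mention to another cluster, or to a fresh one, so several pairwise decisions change in concert while transitivity is preserved by construction.

  import scala.util.Random

  object CorefJumps {
    case class Mention(string: String)
    type Partition = Vector[Set[Mention]]

    // Propose: pick a random mention and move it to another (possibly new) cluster.
    // The output is always a valid partition, so transitivity never needs a factor.
    def proposeMove(p: Partition, rng: Random): Partition = {
      if (p.isEmpty) return p
      val from = rng.nextInt(p.length)
      val members = p(from).toVector
      if (members.isEmpty) return p
      val m = members(rng.nextInt(members.length))
      val to = rng.nextInt(p.length + 1)        // index p.length means "fresh cluster"
      val removed = p.updated(from, p(from) - m)
      val moved =
        if (to == p.length) removed :+ Set(m)
        else removed.updated(to, removed(to) + m)
      moved.filter(_.nonEmpty)                  // drop clusters emptied by the move
    }
  }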

Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores. Scores of factors that didn't change cancel.

• To efficiently score:
  - Proposal method runs.
  - Automatically build a list of variables that changed.
  - Find factors that touch changed variables.
  - Find other (unchanged) variables needed to calculate those factors' scores.

• How to find factors from variables & vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors.
  - But that is complex to maintain as the structure changes.
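A minimal sketch of this scoring operation with hypothetical types (the real library differs): score the touched factors with the proposal applied, undo, score again, redo; the difference is the log of the model-score ratio.

  trait Factor { def logScore: Double }
  trait Model {
    // factors that touch any of the changed variables (found via the templates)
    def factorsTouching(changed: Seq[AnyRef]): Seq[Factor]
  }
  trait Proposal {
    def changedVariables: Seq[AnyRef]
    def undo(): Unit   // restore the old values
    def redo(): Unit   // reapply the new values
  }

  // log p(y')/p(y): only factors whose scores could have changed are evaluated.
  def logScoreDelta(model: Model, proposal: Proposal): Double = {
    val factors = model.factorsTouching(proposal.changedVariables)
    val after = factors.map(_.logScore).sum   // proposal currently applied
    proposal.undo()
    val before = factors.map(_.logScore).sum
    proposal.redo()
    after - before
  }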

Imperativ-ism #2: Model Structure

• Maintain no map structure between factors and variables.

• Finding factors is easy. Usually # templates < 50.

• Primitive operation: given a factor template and one changed variable, find the other variables.

• In the factor Template object, define imperative methods that do this:
  - unroll1(v1) returns (v1, v2, v3)
  - unroll2(v2) returns (v1, v2, v3)
  - unroll3(v3) returns (v1, v2, v3)
  - I.e., use a Turing-complete language to determine structure on the fly.
  - If you want to use a data structure instead, access it in the method.
  - If you want a higher-level language for specifying structure, write it in terms of this primitive.

• Other nice attribute:
  - Easy to do value-conditioned structure. Case-Factor Diagrams, etc.

Example: Linear-Chain CRF for Segmentation (with unroll methods)

  class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
    val token: Token
  }
  class Token(word: String) extends EnumVariable(word) with VarSeq {
    val label: Label
    def longerThanSix = word.length > 6
  }

  // Factor templates
  object StateTemplate extends Template1[Label]
  object StateTokenTemplate extends Template2[Label, Token] {
    def unroll1(label: Label) = Factor(label, label.token)
    def unroll2(token: Token) = throw new Error // Tokens shouldn't change
  }
  object StateTransitionTemplate extends Template2[Label, Label] {
    def unroll1(label: Label) = Factor(label, label.next)
    def unroll2(label: Label) = Factor(label.prev, label)
  }

Imperativ-ism #3: Neighbor-Sufficient Map

• "Neighbor variables" of a factor
  - the variables touching the factor

• "Sufficient statistics" of a factor
  - a vector whose dot product with the weights of a log-linear factor gives the factor's score

• These are usually confounded. Separate them.

• Skip-chain NER: instead of 5x5 parameters over a label pair, just 2, via the map (label1, label2) → (label1 == label2).

(Figure: Labels over the words "Bill loves Paris Bill the painter ...", tagged PER O LOC PER O O, with a skip edge between the labels of the two "Bill" mentions.)


Example: Skip-Chain CRF for Segmentation

  class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
    val token: Token
  }
  class Token(word: String) extends EnumVariable(word) with VarSeq {
    val label: Label
    def longerThanSix = word.length > 6
  }

  // Factor templates
  object StateTemplate extends TemplateWithStatistics1[Label]
  object StateTokenTemplate extends TemplateWithStatistics2[Label, Token] {
    def unroll1(label: Label) = Factor(label, label.token)
    def unroll2(token: Token) = throw new Error // Tokens shouldn't change
  }
  object StateTransitionTemplate extends TemplateWithStatistics2[Label, Label] {
    def unroll1(label: Label) = Factor(label, label.next)
    def unroll2(label: Label) = Factor(label.prev, label)
  }
  // Neighbors are a Label pair, but the sufficient statistic is a single Bool
  object SkipTemplate extends Template2[Label, Label] with Statistics1[Bool] {
    def unroll1(label: Label) =
      for (other <- label.seq; if label.token == other.token)
        yield Factor(label, other)
    def statistics(label1: Label, label2: Label) = Stat(label1 == label2)
  }

Example: Dependency Parsing

  class Word(str: String) extends EnumVariable(str)
  class Node(word: Word, parent: Node) extends PrimitiveVariable(parent)

  object ChildParentTemplate extends Template1[Node] with Statistics2[Word, Word] {
    def statistics(n: Node) = Stat(n.word, n.parent.word)
  }

  // Factor between each node and the nearest verb above it in the tree;
  // when a node's parent changes, its descendants' factors change too.
  object NearestVerbTemplate extends Template1[Node] with Statistics2[Word, Word] {
    def statistics(n: Node) = Stat(n.word, closestVerb(n).word)
    def closestVerb(n: Node) = if (isVerb(n.word)) n else closestVerb(n.parent)
    def unroll1(n: Node) = n.selfAndDescendants
  }

Extensibility

• Many variable types provided:
  - boolean, int, float, String, categorical, ...

• Create new ones! (a sketch follows below)
  - set-valued variable
  - finite-state machine as a variable [JHU]

• Create new factor types
  - Poisson, Dirichlet, ...
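A minimal sketch of a user-defined set-valued variable, reusing the hypothetical DiffList from the earlier sketch (not FACTORIE's actual variable API):

  // A variable whose value is a set; additions and removals are undoable.
  class SetVariable[A](private var members: Set[A] = Set.empty[A]) {
    def value: Set[A] = members
    def add(a: A, d: DiffList): Unit =
      if (!members.contains(a)) {      // only record changes that actually happen
        d.record(() => members -= a)
        members += a
      }
    def remove(a: A, d: DiffList): Unit =
      if (members.contains(a)) {
        d.record(() => members += a)
        members -= a
      }
  }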

Experimental Results

• Joint segmentation & coreference of research-paper citations
  - 1295 mentions, 134 entities, 36487 tokens

• Compare with MLNs (Alchemy)
  - same observable features

• FACTORIE results:
  - ~25% reduction in error (segmentation & coref)
  - 3-20x faster
  - coref results: (table on slide)

FACTORIE Summary

• Factor graphs...
• ...object-oriented
  - data types and factor template types, with inheritance
• ...scalable
  - factors created on demand; only score diffs
• ...with imperative hooks
  - jump function; override variable.set() for coordination
  - model structure
  - neighbor variables → sufficient statistics
• ...discriminative
  - efficient online training by SampleRank
  - generative modeling also provided (LDA = ~12 lines)
• Combines declarative & procedural knowledge

Conclusion: Some Reasons To Use Probabilistic Programming

• Simple
  - Save time & avoid debugging complex, hand-built ML code.
  - Say exactly what you want, in the way you want to say it.

• Flexible
  - Encourages research exploration by making it easier to try new modeling ideas.
  - The language provides the right "hinge-points" to give you the flexibility you want, without the underlying cruft.

• Glue that binds many reasoning paradigms together.

• Allows probabilistic modeling to be integrated with all the other traditional deterministic programming.

Some of this text from Avi Pfeffer

Key Questions for Probabilistic Programming

• What are good design patterns for probabilistic programming?

• What are the skills required to be an effective probabilistic programmer?

• How can probabilistic programmers work well with domain experts and end users?

• What kind of tools can we develop to support probabilistic programming (debuggers, profilers etc.)?

• How can probabilistic programs be learned (especially structure)?

• How can we make inference more efficient (especially in memory) to scale up to even larger domains?

Some of this text from Avi Pfeffer




Parameter Estimation in Large State Spaces

• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...)...

• ...which in turn requires "expectations of marginals," P(y1 | x1, x2, x3, ...).

• But getting marginal distributions by sampling can be inefficient due to the large sample space.

• Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.

• But even finding the model's predicted best output is expensive.

• We propose: SampleRank [Culotta, Wick, Hall, McCallum, HLT 2007]. Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)

Ranking vs Classification Training

• Instead of training
    [Powell, Mr. Powell, he] → YES
    [Powell, Mr. Powell, she] → NO

• ...rather...
    [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]

• In general, the higher-ranked example may contain errors:
    [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Ranking Intermediate Solutions: Example

A walk through five successive configurations; an UPDATE is made whenever the direction of the model's score change disagrees with the direction of the truth score change:

  1. (initial configuration)
  2. Δ Model = -23,  Δ Truth = -0.2   (agree: no update)
  3. Δ Model =  10,  Δ Truth = -0.1   → UPDATE
  4. Δ Model = -10,  Δ Truth = -0.1   (agree: no update)
  5. Δ Model =  -3,  Δ Truth =  0.3   → UPDATE

• Like Perceptron: proof of convergence under marginal separability.

• More constrained than maximum likelihood: parameters must correctly rank incorrect solutions!

• Very fast to train.
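A minimal sketch of this update rule in Scala, assuming a linear model with explicit feature vectors and a truth score such as per-configuration F1 against the labels (hypothetical names; a sketch, not FACTORIE's trainer):

  // One SampleRank step: compare the current configuration and the MH proposal.
  // If the model's ranking disagrees with the truth's ranking, take a
  // perceptron-style step on the difference of the two feature vectors.
  def sampleRankStep(weights: Array[Double],
                     fCurrent: Array[Double],   // features of current configuration
                     fProposal: Array[Double],  // features of proposed configuration
                     truthCurrent: Double,      // e.g. F1 against labeled truth
                     truthProposal: Double,
                     learningRate: Double = 1.0): Unit = {
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.indices.map(k => a(k) * b(k)).sum
    val modelPrefersProposal = dot(weights, fProposal) > dot(weights, fCurrent)
    val truthPrefersProposal = truthProposal > truthCurrent
    if (modelPrefersProposal != truthPrefersProposal) {
      val sign = if (truthPrefersProposal) 1.0 else -1.0
      for (k <- weights.indices)
        weights(k) += learningRate * sign * (fProposal(k) - fCurrent(k))
    }
  }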

Comparison to Contrastive Divergence

(Figure: Contrastive Divergence with n=2 [Hinton 2002], Persistent Contrastive Divergence [Tieleman 2008], and SampleRank compared side by side; each diagram indicates which configurations, proposal vs. truth, supply the sufficient statistics for the parameter update.)

SampleRank on Coreference

• ACE 2004

• All nouns: 28,122 mentions, 14,047 entities
  e.g. he, the President, Clinton, Mrs. Clinton, Washington

B³ scores:

  2005  Ng                                       69.5%
  2007  Culotta, Wick, Hall, McCallum            79.3%
  2008  Bengtson, Roth                           80.8%
  2009  Wick, McCallum   MCMC + SampleRank       81.5%
        Contrastive Divergence                   75.1%
        Persistent Contrastive Divergence        74.9%
        Perceptron                               76.3%
