
Soft-Constrained Inference for Named Entity Recognition

E. Fersini^a, E. Messina^a, G. Felici^b, D. Roth^c

^a DISCo, University of Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy

^b CNR, Institute for Systems Analysis and Computer Science, Viale Manzoni 30, 00185 Roma, Italy

^c Department of Computer Science, University of Illinois at Urbana-Champaign, 2700 Prairie Meadow Dr., Champaign, IL 61822

Abstract

Much of the valuable information in supporting decision making processes originates in text-based documents. Although these documents can be effectively searched and ranked by modern search engines, actionable knowledge needs to be extracted and transformed into a structured form before being used in a decision process. In this paper we describe how the discovery of semantic information embedded in natural language documents can be viewed as an optimization problem aimed at assigning a sequence of labels (hidden states) to a set of interdependent variables (textual tokens). Dependencies among variables are efficiently modeled through Conditional Random Fields, an undirected graphical model able to represent the distribution of labels given a set of observations. The Markov property of these models prevents them from taking into account long-range dependencies among variables, which are indeed relevant in Natural Language Processing. In order to overcome this limitation, we propose an inference method based on an Integer Programming formulation of the problem, where long-distance dependencies are included through non-deterministic soft constraints.

1. Introduction

The data used by Decision Support Systems are assumed to be structured and quantifiable. However, with the growth of the web and the spread of Document Management Systems, most of the valuable information is embedded in textual documents that need to be processed to extract relevant information in a machine-readable form before it becomes actionable. In most cases this activity involves the analysis of human language texts by means of Natural

Please cite: E. Fersini, E. Messina, G. Felici, D. Roth. Soft-constrained inference for named entity recognition. Information Processing and Management, 50(5), pp. 807-819, 2014. doi:10.1016/j.ipm.2014.04.005

Language Processing (NLP) techniques. Named Entity Recognition (NER) is the task aimed at identifying and associating atomic elements in a given text with a set of predefined categories such as names of persons, organizations, locations, dates, quantities and so on.

Early NER systems were defined as rule-based approaches with a set of fixed, manually coded rules provided by domain experts [37, 30, 39, 1]. Considering the cost, in terms of human effort, of revealing and formulating hand-crafted rules, several research communities, ranging from Statistical Analysis to Natural Language Processing and Machine Learning, have provided valuable contributions toward automatically deriving models able to detect and categorize pre-defined entities. The first attempts, aimed at deriving these rules in the form of Boolean conditions, were based on inductive learners, where rules can be learnt automatically from labelled examples. The inductive rule learning approach has been instantiated according to different learning paradigms: bottom-up [6, 5, 10], top-down [44, 35, 28, 24] and interactive rule learning [25, 4, 3].

An alternative approach to inductive rule learners is represented by statistical methods, where the NER task is viewed as a decision making process aimed at assigning a sequence of labels to a set of either joint or interdependent variables, among which complex relationships may also hold. This decision making paradigm can be addressed in two different ways: (1) at segment level [41, 15, 21], where the NER task is managed as a segmentation problem in which each segment corresponds to an entity label; (2) at token level [46, 42, 36, 38, 27], where an entity label is assigned to each token of the sentence. In the first case the output of the decision process is a sequence of segments. More formally, a segmentation s of an input sentence x = x_1, ..., x_N is a sequence of segments s_1, ..., s_p with p ≤ N. Each segment s_j consists of a start position l_j, an end position u_j, and a label y belonging to a set of entity labels Y. The second decision making paradigm is represented by token-level models, where the unstructured text is tackled as a sequence of tokens and the output of the decision process is a sequence of labels y = y_1, ..., y_N.

Nowadays, the state of the art in modeling a NER problem is represented by Linear-Chain Conditional Random Fields [27]. This model, thanks to its advantages over generative approaches, has been extensively investigated to extract named entities from different unstructured sources such as judicial transcriptions [20, 19], medical reports [14, 16], user generated contents [34, 43] and so on. The efficiency of Linear-Chain Conditional Random Fields (CRF) is strictly related to the underlying Markov assumption: given the observation of a token, the corresponding hidden state (label) depends only on the labels of its adjacent tokens. In order to efficiently enhance the descriptive power of CRF, during the last ten years several approaches have been proposed to enlarge the information set exploited during training and inference. In particular, two main research directions have been investigated: (1) relaxing the Markov assumption [41, 21] to model long-distance relationships and (2) introducing additional domain knowledge in terms of logical constraints during the inference phase [26, 40, 8, 7]. Considering that the relaxation of the Markov assumption implies an increasing computational complexity of the training and inference phases, in


this paper we focus our attention on an integer linear programming (ILP) formulation of the inference problem that includes soft constraints obtained by learning declarative rules from data.

The introduction of constraints allows us to improve the global label assignment by correcting mistakes of local predictions. The label assignment problem is therefore solved through a constrained optimization problem where the extra knowledge related to complex relationships among variables is represented through a set of logical rules easily introduced as linear inequalities. This approach, as shown by the experimental results, makes it possible to significantly improve the performance of CRF in NER tasks.

The outline of the paper is the following. In Section 2 a brief review of CRF is presented, along with a background overview of the training phase and the most relevant inference approaches able to include domain constraints. In Section 3 the proposed soft-constrained inference approach is detailed, focusing on learning constraints from data and presenting its mathematical programming formulation. In Section 4 the experimental investigation on benchmark datasets is described, while in Section 5 conclusions and ongoing research are summarized.

2. Conditional Random Fields

A Conditional Random Field is an undirected graphical model that defines the conditional distribution P(y|x) of the predicted labels (hidden states) y = y_1, ..., y_N given the corresponding tokens (observations) x = x_1, ..., x_N. Now, consider X as the random variable over data sequences (natural language sentences) to be labeled, and Y as the random variable over the corresponding label sequences over a finite label alphabet Y. The joint distribution P(X, Y) is represented through a conditional model P(Y|X) from paired observation and label sequences, and the marginal probability p(X) is not explicitly modeled. The formal definition of CRF [27] is given below:

Definition 1 (Conditional Random Fields). Let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a Conditional Random Field when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

Thus, a CRF is a random field globally conditioned on the observation X. Throughout the paper we tacitly assume that the graph G is fixed. A Linear-Chain Conditional Random Field is a Conditional Random Field in which the output nodes are linked by edges in a linear chain. The graphical representation of a general CRF and a Linear-Chain CRF is reported in Figure 1. In the following, Linear-Chain CRF are assumed.

According to the Hammersley-Clifford theorem [23, 11], given C as the set of all cliques in G, the conditional probability distribution of a sequence of labels



Figure 1: Graphical representation of CRF. On the right a general CRF is represented, while on the left a linear-chain CRF is depicted: white circles represent hidden states y (the output sequence of labels) and the grey ones denote the observation x (the input sequence of tokens).

y given a sentence x can be written as:

p(y|x) = \frac{1}{Z(x)} \prod_{C \in \mathcal{C}} \Phi_C(x_C, y_C)    (1)

where \mathcal{C} is the set of maximal cliques of G, \Phi_C is the potential function defined over the clique C, whose vertices x_C and y_C correspond to observations x and hidden states y, and Z(x) is the partition function for global normalization. Formally, Z(x) is defined as:

Z(x) = \sum_{y} \prod_{C \in \mathcal{C}} \Phi_C(x_C, y_C)    (2)

In order to compute the conditional probability distribution (Eq. 1), a set of potential functions must be defined over configurations of the maximal cliques in the underlying graph. The potential functions, which state the prior probability that elements of the clique C have certain values, can be conveniently simplified as inner products between a parameter vector ω and a set of feature functions f. Considering that each clique C ∈ \mathcal{C} can encode one or more potential functions (in this case we consider the log-linear ones), the probability of a sequence of labels y given the sequence of observations x can be rewritten as:

p(y|x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{N} \sum_{k=1}^{K} \omega_k f_k(y_t, y_{t-1}, x, t) \right)    (3)

where f_k(y_t, y_{t-1}, x, t) is an arbitrary feature function over its arguments and ω_k is a feature weight that is a free parameter of the model. Feature functions are fixed in advance and are used to verify some properties of the input text, while the weights ω_k have to be learned from data and are used to tune the discriminative power of each feature function. In particular, when for a token x_t a given feature function f_k is active, i.e. a given property is verified, the corresponding weight ω_k indicates how to take f_k into account: (1) if ω_k > 0 it increases the probability of the tag sequence y; (2) if ω_k < 0 it decreases the probability of the tag sequence y; (3) if ω_k = 0 it has no effect whatsoever.


The partition function Z(x) assumes the form:

Z(x) = \sum_{y} \exp\left( \sum_{t=1}^{N} \sum_{k=1}^{K} \omega_k f_k(y_t, y_{t-1}, x, t) \right)    (4)

The conditional probability distribution of Linear-Chain CRF can be estimated by exploiting two different kinds of feature functions, such that p(y|x) can be rewritten as follows:

p(y|x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{N} \left[ \sum_{i=1}^{|I|} \lambda_i s_i(y_t, x, t) + \sum_{j=1}^{|J|} \mu_j t_j(y_{t-1}, y_t, x, t) \right] \right)    (5)

where I and J represent the given and fixed sets of state feature functions s_i(y_t, x, t) and transition feature functions t_j(y_{t-1}, y_t, x, t), while λ_i and μ_j are the corresponding weights to be estimated from training data. State feature functions and transition feature functions model, respectively, the sequence of observations x with respect to the current state y_t and the transition from the previous state y_{t-1} to the current state y_t. The parameters λ_i and μ_j are used to weight the corresponding state and transition feature functions.

The choice of the feature functions strongly depends on the application context. For instance, an example of state feature function is

s_i(y_t, x, t) = \begin{cases} 1 & \text{if } y_t = \text{PERSON and } x_t = \text{John} \\ 0 & \text{otherwise} \end{cases}

while a transition feature function could be

t_j(y_{t-1}, y_t, x, t) = \begin{cases} 1 & \text{if } y_{t-1} = \text{PERSON and } y_t = \text{DATE} \\ 0 & \text{otherwise} \end{cases}

In our case, where the CRF are assumed to be in linear-chain form, transition feature functions verify dependencies between a label y at time t and its previous values, while state feature functions evaluate Word Features, which check whether a given token is present in the dictionary for a particular state, and Start Features and End Features, which check whether a given label is a start and/or an end label.
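The two example feature functions above can be written as plain Python predicates. This is a hypothetical illustration (function names and the example sentence are ours, not the authors' code):

```python
# Illustrative sketch: the state and transition feature functions of
# Eq. (5) written as Python predicates over (label, sentence, position).

def s_person_john(y_t, x, t):
    """State feature s_i: fires when the token 'John' carries label PERSON."""
    return 1 if y_t == "PERSON" and x[t] == "John" else 0

def t_person_date(y_prev, y_t, x, t):
    """Transition feature t_j: fires on a PERSON -> DATE label transition."""
    return 1 if y_prev == "PERSON" and y_t == "DATE" else 0

x = ["John", "born", "1970"]
active = s_person_john("PERSON", x, 0)   # the property is verified at t=0
```

In a trained model each such predicate is paired with a learned weight (λ_i or μ_j) that scales its contribution to the score of a labeling.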

2.1. Training: estimating the parameters of CRF

The learning problem in CRF relates to the identification of the best feature functions s_i(y_t, x, t) and t_j(y_{t-1}, y_t, x, t) by estimating the corresponding weights λ_i and μ_j. These parameters can be quantified either by exploiting some background domain knowledge or by learning from training data. When no background knowledge is available, several learning approaches can be adopted for estimating λ_i and μ_j. Among them, a widely used approach consists of maximizing the conditional log-likelihood of the training data. Given a training set


T composed of training samples (x, y), with y ranging in the set Y of entity labels, the conditional (penalized) log-likelihood is defined as follows:

L(T) = \sum_{(x,y) \in T} \log p(y|x, \lambda, \mu) - \sum_{i=1}^{|I|} \frac{\lambda_i^2}{2\sigma_\lambda^2} - \sum_{j=1}^{|J|} \frac{\mu_j^2}{2\sigma_\mu^2}
     = \sum_{(x,y) \in T} \left[ \sum_{t=1}^{N} \left( \sum_{i=1}^{|I|} \lambda_i s_i(y_t, x, t) + \sum_{j=1}^{|J|} \mu_j t_j(y_{t-1}, y_t, x, t) \right) - \log Z(x) \right] - \sum_{i=1}^{|I|} \frac{\lambda_i^2}{2\sigma_\lambda^2} - \sum_{j=1}^{|J|} \frac{\mu_j^2}{2\sigma_\mu^2}    (6)

where the terms \sum_{i=1}^{|I|} \frac{\lambda_i^2}{2\sigma_\lambda^2} and \sum_{j=1}^{|J|} \frac{\mu_j^2}{2\sigma_\mu^2} correspond to a Gaussian prior on λ and μ used for avoiding over-fitting.

The objective function L(T) is concave, and therefore the parameters λ and μ have a unique set of globally optimal values. A standard approach to parameter learning computes the gradient of the objective function to be used in an optimization algorithm [27, 45, 33, 31]. Among them we choose the quasi-Newton approach known as Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [31], whose main advantage is a dramatic speedup in case of a huge number of feature functions.

2.2. Inference: finding the most probable state sequence

The inference problem in CRF corresponds to finding the most likely sequence of hidden states y*, given the set of observations x = x_1, ..., x_N. This problem can be solved approximately or exactly by determining y* such that:

y^* = \arg\max_{y} p(y|x)    (7)

The most common approach to tackle the inference problem is represented by the Viterbi algorithm [2]. Now, consider δ_t(y_t = y|x) as the probability of the most likely path generating the sequence y_1, y_2, ..., y_t with y_t = y. This path can be derived from one of the most probable paths that could have generated the subsequence y_1, y_2, ..., y_{t-1}. Formally, given δ_t(y_t = y|x) as:

\delta_t(y_t = y|x) = \max_{y_1, y_2, ..., y_{t-1}} p(y_1, y_2, ..., y_{t-1}, y_t = y|x)    (8)

we can derive the induction step as:

\delta_{t+1}(y_{t+1} = y'|x) = \max_{y \in \mathcal{Y}} \left[ \delta_t(y|x) \, \Phi_{t+1}(x, y, y') \right] = \max_{y \in \mathcal{Y}} \left[ \delta_t(y|x) \exp\left( \sum_{k=1}^{K} \omega_k f_k(y_t = y, y_{t+1} = y', x, t) \right) \right]    (9)

The recursion terminates in y^* = \arg\max_{y} [\delta_T(y)], allowing backtracking to recover y^*.
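The recursion of Eqs. (8)-(9) and the final backtracking can be sketched as follows. Here `log_phi(t, y_prev, y)` stands for the log clique potential (the weighted feature sum); the table used below is a toy, not a trained model:

```python
# Sketch of the Viterbi recursion for a linear-chain model.

def viterbi(n, labels, log_phi):
    """Return the most probable length-n label sequence."""
    delta = {y: log_phi(0, None, y) for y in labels}  # base case, Eq. (8)
    back = []
    for t in range(1, n):                             # induction, Eq. (9)
        new_delta, ptr = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[yp] + log_phi(t, yp, y))
            new_delta[y] = delta[best_prev] + log_phi(t, best_prev, y)
            ptr[y] = best_prev
        delta = new_delta
        back.append(ptr)
    y = max(delta, key=delta.get)                     # termination
    path = [y]
    for ptr in reversed(back):                        # backtrack to recover y*
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

def toy_log_phi(t, y_prev, y):
    """Toy potentials: emission scores plus a PER -> OTH transition bonus."""
    emit = {(0, "PER"): 2.0, (0, "OTH"): 0.0, (1, "PER"): 0.0, (1, "OTH"): 1.0}
    return emit[(t, y)] + (0.5 if (y_prev, y) == ("PER", "OTH") else 0.0)
```

For a chain of length N with m labels the recursion costs O(N m^2), against the O(m^N) of exhaustive enumeration.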


3. Introducing Background Knowledge in CRF

CRF, in their native form, are able to capture some local properties through the definition of transition and state feature functions. However, more complex relationships among output variables (labels to be predicted) may exist when addressing NER problems. For instance, when annotating scientific citations, the label Author should appear somewhere before the label Title.

Two different strategies are suitable for including such relationships in the probabilistic model: the feature way and the constraint way. The first one is concerned with the training phase, through the definition of feature functions able to capture different kinds of relationships [41]. The second one relates to the introduction of constraints during the inference phase for preserving the necessary relationships over the output prediction [26, 40, 8, 9]. While the first strategy might lead to intractable training and inference, due to the necessity of learning additional parameters or defining higher order models, the second paradigm allows us to keep the model simple by enclosing expressive constraints directly in the inference phase. In this context we can distinguish between constraining the Viterbi algorithm [26, 13] and formulating the inference process as a constrained optimization problem.

The main idea underlying the Constrained Viterbi algorithm is concerned with the clamping of some hidden variables to particular values. The resulting algorithm alters the induction step outlined in equation (9) such that y* is constrained to pass through a given sub-path C. The related constraint C can be encoded in the induction step as follows:

\delta_{t+1}(y_{t+1}|x) = \begin{cases} \max_{y \in \mathcal{Y}} \left[ \delta_t(y) \exp\left( \sum_{k=1}^{K} \omega_k f_k(y_t, y_{t+1}, x, t) \right) \right] & \text{if } C \text{ is satisfied} \\ 0 & \text{otherwise} \end{cases}    (10)

For time steps not involved in C, equation (9) is used instead, restricting the algorithm to consider only paths that respect the defined constraint.
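The clamping idea of Eq. (10) amounts to assigning log-score minus infinity (probability zero) to states that violate the sub-path C. A minimal sketch (ours; `clamp` maps a time step to its forced label, a simple instance of C):

```python
# Sketch of Constrained Viterbi: clamped time steps only admit one label.
NEG_INF = float("-inf")

def constrained_viterbi(n, labels, log_phi, clamp):
    def ok(t, y):
        return clamp.get(t, y) == y   # True when t is unclamped or y matches
    delta = {y: (log_phi(0, None, y) if ok(0, y) else NEG_INF) for y in labels}
    back = []
    for t in range(1, n):
        new_delta, ptr = {}, {}
        for y in labels:
            if not ok(t, y):          # C violated: probability zero, Eq. (10)
                new_delta[y], ptr[y] = NEG_INF, labels[0]
                continue
            best_prev = max(labels, key=lambda yp: delta[yp] + log_phi(t, yp, y))
            new_delta[y] = delta[best_prev] + log_phi(t, best_prev, y)
            ptr[y] = best_prev
        delta = new_delta
        back.append(ptr)
    y = max(delta, key=delta.get)
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

def toy_log_phi(t, y_prev, y):
    emit = {(0, "PER"): 2.0, (0, "OTH"): 0.0, (1, "PER"): 0.0, (1, "OTH"): 1.0}
    return emit[(t, y)] + (0.5 if (y_prev, y) == ("PER", "OTH") else 0.0)
```

With an empty `clamp` the algorithm reduces to plain Viterbi; clamping a step can flip the whole decoded sequence.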

Although the Constrained Viterbi algorithm is suitable to deal with short-distance relationships in an efficient way, the introduction of non-local and non-sequential constraints makes this solution no longer applicable. In order to deal with more complex relationships, the inference process can be formulated as a constrained optimization problem, and in particular as a shortest path problem. A simple representation of a labeling shortest path problem is given in Figure 2.

Given n tokens and m labels that each token can take, we can define a graph Ω = (Φ, Ψ) composed of nm + 2 nodes and (n − 1)m² + 2m edges¹. The label of each token is represented by a node φ_{ty}, where 0 ≤ t ≤ n − 1 and 0 ≤ y ≤ m − 1, while the arc connecting two adjacent nodes φ_{(t−1)y} and φ_{ty′} is denoted by a directed edge ψ_{t,yy′} with associated cost c_{yy′,t} = −log(M_t(y, y′|x))².

¹ Two special nodes denoting the start and end positions of the path are additionally introduced.

² The weight of each edge corresponds to the exponential in equation (5).



Figure 2: Graphical representation of the labeling shortest path problem for the sentence "Betsy loves Rome". Nodes (P=Person, L=Location, O=Other) represent labels to be assigned to each token of the sentence. The red line highlights the optimal label solution path.

The problem consists in minimizing the cost of visiting the nodes φ_{ty′} along the entire path:

\arg\min_{y} \, -\sum_{t=0}^{n-1} \log M_t(y, y'|x) \;=\; \arg\max_{y} \sum_{t=0}^{n-1} \log M_t(y, y'|x)    (11)

The goal defined in equation (11), which corresponds to finding the shortest path along the graph Ω, can be formulated as an Integer Linear Programming problem, which allows us to introduce additional background knowledge in terms of constraints. Let e_{t,yy′} denote a decision variable restricted to be 1 if the edge ψ_{t,yy′} is in the shortest path and 0 otherwise. The ILP formulation of the shortest path problem³ can be formalized as follows:

\max Z(e) = \sum_{\substack{0 \le t \le n-1 \\ 0 \le y, y' \le m-1}} \log M_t(y, y') \cdot e_{t,yy'}    (12)

subject to:

\sum_{0 \le y_1 \le m-1} e_{t-1,y_1 y} - \sum_{0 \le y_2 \le m-1} e_{t,y y_2} = 0    (13)

\sum_{0 \le y \le m-1} e_{start,0y} = 1 \quad \text{and} \quad \sum_{0 \le y \le m-1} e_{end,y0} = 1    (14)

e_{start,0y}, \; e_{t,y_1 y}, \; e_{end,y0} \in \{0, 1\}    (15)

\forall t, y \text{ s.t. } 0 \le t \le n-1, \; 0 \le y, y_1, y_2 \le m-1    (16)

³ The shortest path problem solution corresponds to the output of the Viterbi algorithm.
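The equivalence stated in footnote 3 can be checked numerically without an ILP solver: choosing edge variables along one start-to-end path turns the objective (12) into a sum of edge weights over a labeling, which a Viterbi-style dynamic program maximizes. A dependency-free sketch under toy weights (ours, standing in for the trained M_t):

```python
from itertools import product

# The ILP optimum of Eq. (12) equals the Viterbi/DP optimum over the trellis.

def brute_force_value(n, labels, log_M):
    """Maximal path weight by enumerating every labelling (the ILP optimum)."""
    def total(path):
        return sum(log_M(t, path[t - 1] if t > 0 else None, path[t])
                   for t in range(n))
    return max(total(p) for p in product(labels, repeat=n))

def dp_value(n, labels, log_M):
    """Same optimum via dynamic programming over the trellis of Figure 2."""
    best = {y: log_M(0, None, y) for y in labels}
    for t in range(1, n):
        best = {y: max(best[yp] + log_M(t, yp, y) for yp in labels)
                for y in labels}
    return max(best.values())

def log_M(t, y_prev, y):
    """Toy edge weights: per-position emission plus a transition bonus."""
    emit = {"PER": [2.0, 0.0, 0.5], "OTH": [0.0, 1.0, 0.3]}
    return emit[y][t] + (0.5 if (y_prev, y) == ("PER", "OTH") else 0.0)
```

The ILP formulation pays off only once extra constraints are added: the plain problem is solvable by the DP above, but Viterbi cannot absorb the non-local constraints introduced next.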


Any relationship that should be preserved over the output prediction can be smoothly represented as a Boolean function and therefore introduced as a linear inequality constraint [50]. Nevertheless, this approach is based on two main assumptions: (1) constraints must be manually defined by a domain expert; (2) constraints must be satisfied, i.e. they are hard constraints. The resulting implications concern the time-consuming activity of defining Boolean functions and the satisfiability of the constraints with respect to training and testing data (otherwise infeasible solutions could be obtained). In order to overcome these weak points, a combination of learning constraints from data and a two-stage ILP approach is proposed.

3.1. Soft Constrained Inference

In this section we outline in detail the approach proposed in this paper. First we describe a general approach to extract from the available data additional knowledge in the form of logic rules. Then we describe how such knowledge can be embedded in a mathematical model as soft constraints, by relaxing the inference procedure and using a set of additional constraints whose violation is minimized in the objective function.

3.1.1. Learning constraints from data

In order to provide a completely automated Named Entity extraction process, with no need of human effort to define and include domain-specific complex relationships, we defined a procedure for learning constraints from data. In a sequential labeling task, the problem of finding valuable knowledge can be viewed as a discovery process aimed at revealing common patterns within training data and subsequently extracting logical relationships from the identified patterns. The identification of common patterns and their extraction from data in the form of logic rules is formulated, as originally proposed in [17] and [18], as a sequence of minimum-cost satisfiability problems. These problems are solved efficiently with an ad hoc algorithm based on the decomposition strategies presented in [49]. The method adopts a standard learning paradigm where the different tag labels are the classes, and the logic rules that are found separate the samples of one class from the samples of the other classes, optimizing the information used to define the logic rules with the aim of controlling the performance of the system in a desired direction. The formulas so obtained are in disjunctive normal form (DNF), i.e. they are composed of the disjunction of one or more conjunctive clauses. A specific characteristic of this method is that these clauses have decreasing discrimination power, and thus they can be pruned according to their power and to the importance that is given to the rules explaining the outliers that may be present in the training data. Moreover, each conjunctive clause can be easily associated with an integer linear constraint through standard techniques.

The logical relationships extracted from data may fall into one of the following categories:


• Adjacency: if label A is associated with token x_t, then label B must be associated with token x_{t+1}:

\sum_{0 \le y_1 \le m-1} e_{t-1,y_1 A} - \sum_{0 \le y_2 \le m-1} e_{t,B y_2} \le 0    (17)

• Precedence: if label A is associated with token x_t, then label B should be associated with token x_{t+z}:

\sum_{0 \le y_1 \le m-1} e_{t-1,y_1 A} - \sum_{\substack{0 \le y_2 \le m-1 \\ 0 \le z \le n-t}} e_{t+z,B y_2} \le 0    (18)

• State Change: if x_t is a given delimiter punctuation mark (d), then label A and label B should be associated with tokens x_{t-1} and x_{t+1} respectively:

2 \sum_{0 \le y_1 \le m-1} e_{t-1,y_1 d} - \sum_{0 \le y_2 \le m-1} e_{t-2,y_2 A} - \sum_{0 \le y_3 \le m-1} e_{t,B y_3} \le 0    (19)

• Begin-End: if label A is associated with token x_0, then label B should be associated with token x_{n-1}:

\sum_{0 \le y_1 \le m-1} e_{1,A y_1} - \sum_{0 \le y_2 \le m-1} e_{n-1,y_2 B} \le 0    (20)

• Presence and Precedence: if label A appears, then label B cannot appear before label A:

m(t-2) \, e_{t,A y_1} - \sum_{\substack{1 \le z \le t-2 \\ 0 \le y_2 \le m-1}} (1 - e_{z,y_2 B}) \le 0    (21)

with 2 ≤ t ≤ n and 0 ≤ y_1 ≤ m − 1.
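On a decoded label sequence, checking whether a rule of one of these families holds reduces to a simple scan. The sketch below (hypothetical helper names, ours) evaluates an Adjacency and a Precedence rule; a value of 1 marks a violated rule, playing the role the σ_h indicators play in the inference formulation:

```python
# Checking two rule families on a predicted label sequence.

def violates_adjacency(y, a, b):
    """Adjacency: every token labelled `a` must be followed by `b`."""
    return any(y[t] == a and (t + 1 >= len(y) or y[t + 1] != b)
               for t in range(len(y)))

def violates_precedence(y, a, b):
    """Precedence: a token labelled `a` must be followed, at some later
    position, by a token labelled `b`."""
    return any(y[t] == a and b not in y[t + 1:] for t in range(len(y)))

y_pred = ["AUTHOR", "AUTHOR", "TITLE", "DATE"]
sigma = [int(violates_adjacency(y_pred, "TITLE", "DATE")),
         int(violates_precedence(y_pred, "AUTHOR", "TITLE"))]
```

The ILP encodings (17)-(21) express exactly these scans as linear inequalities over the edge variables e_{t,yy′}.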

These rules are not necessarily satisfied by all training sentences, and a fortiori by data during the inference phase. Therefore, introducing these constraints could easily lead to feasibility problems. For this reason we enclosed in each linear constraint h a variable σ_h ∈ {0, 1}, which indicates when a given logical relationship cannot be preserved over the output prediction. The constraints associated with the rules can thus be represented in a compact way with the notation:

L \cdot e - \sigma \le 0

where L is a matrix composed of H rows containing the coefficients of the components of the vector e for each of the H constraints, and σ is the binary vector of size H representing the violation of the constraints.


3.1.2. Relaxing Inference using ILP

In order to introduce the automatically learned logical rules to constrain the inference phase without incurring feasibility problems, a two-step approach is proposed. The first step is aimed at determining the optimal solution of the shortest path problem by solving the formulation presented in equations (12)-(16), i.e. determining the optimal solution e*_{t,yy′} equivalent to the one computed by the Viterbi algorithm. In the second step, we introduce a set of constraints in order to identify a labeling solution consistent with the logical rules previously extracted, while ensuring a good approximation of the first-step Viterbi solution. In order to ensure feasibility, we allow for a violation of these constraints, which however must be penalized. The second step is therefore targeted at minimizing the cost of violating these constraints, and is formulated as follows:

\min Z(\sigma) = \sum_{h} c_h \sigma_h    (22)

subject to:

\sum_{\substack{0 \le t \le n-1 \\ 0 \le y, y' \le m-1}} \log M_t(y, y'|x) \cdot e_{t,yy'} \ge \tau Z(e^*)    (23)

\sum_{0 \le y_1 \le m-1} e_{t-1,y_1 y} - \sum_{0 \le y_2 \le m-1} e_{t,y y_2} = 0    (24)

\sum_{0 \le y \le m-1} e_{start,0y} = 1 \quad \text{and} \quad \sum_{0 \le y \le m-1} e_{end,y0} = 1    (25)

L \cdot e - \sigma \le 0    (26)

e_{start,0y}, \; e_{t,y_1 y}, \; e_{end,y0} \in \{0, 1\}    (27)

\forall t, y \text{ s.t. } 0 \le t \le n-1, \; 0 \le y, y_1, y_2 \le m-1    (28)

\sigma_h \in \{0, 1\}, \quad h = 1, ..., H    (29)

The objective function (22) is penalized whenever a logical relationship is violated, i.e. when σ_h = 1. Constraints (24)-(29) play the same role as in the shortest path problem, while constraint (23) states that the variables e_{t,yy′} should assume values that guarantee a solution close to the shortest path one. This lower bound ensures that the current solution of Z(σ), which originates a new configuration of e_{t,yy′}, is coherent with the solution determined in the first step according to the threshold 0 ≤ τ ≤ 1. The parameter vector c represents the cost of violating a given constraint. In particular, the element c_h is proportional to the occurrence of a clause in the training data and represents the (log) probability that the corresponding constraint h is violated. Given a


clause l representing a logical relationship among labels (for instance, label A should appear before label B), the cost of violating all the constraints related to l is computed as follows:

c_l = \log \left( \frac{|D(l)|}{|D(l)| + |\bar{D}(l)|} \right)    (30)

where D(l) denotes the set of true clauses and \bar{D}(l) represents the set of clauses that are not satisfied in the training data.
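The two-step procedure can be illustrated by brute force on a tiny instance (a sketch of ours, not the paper's ILP solver: paths are enumerated instead of solving (22)-(29), and `score`, the single rule, and its cost are illustrative generic positive penalties standing in for c_h):

```python
from itertools import product

# Step 1 finds the unconstrained optimum Z(e*); step 2 keeps labellings
# scoring at least tau * Z(e*) (constraint (23)) and, among them, minimises
# the weighted violations sum_h c_h * sigma_h (objective (22)).

def soft_constrained_decode(n, labels, score, rules, costs, tau=0.9):
    paths = list(product(labels, repeat=n))
    z_star = max(score(p) for p in paths)            # step 1: Viterbi optimum
    feasible = [p for p in paths if score(p) >= tau * z_star]
    if not feasible:                                 # guard for negative z*
        feasible = [max(paths, key=score)]
    def penalty(p):                                  # sum_h c_h * sigma_h
        return sum(c for rule, c in zip(rules, costs) if rule(p))
    return min(feasible, key=penalty)

# Toy instance: two tokens, two labels, one rule forbidding "A A".
score = lambda p: {("A", "A"): 10.0, ("A", "B"): 9.5,
                   ("B", "A"): 1.0, ("B", "B"): 1.0}[p]
rules = [lambda p: p == ("A", "A")]
```

With τ = 0.9 the near-optimal labeling ("A", "B") is preferred because it satisfies the rule, while τ = 1.0 forces the unconstrained optimum back, showing how τ trades model score against constraint satisfaction.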

4. Experimental Results

The proposed method has been tested with success on different datasets and benchmark models. In order to provide a complete understanding of the experimental campaign, we first introduce the performance criteria used to evaluate the quality of the generated solutions, the datasets exploited for training and testing, the settings of the models considered for the comparative analysis, and finally the computational results.

4.1. Performance criteria

The performance in terms of effectiveness has been measured by using four well-known evaluation metrics, i.e. F-Measure, Precision, Recall and Accuracy. The F-Measure metric represents a combination of Precision and Recall typical of Information Retrieval. Given a set of labels Y, we compute the Precision and Recall for each label y ∈ Y as:

Precision(y) = \frac{\#\text{ of tokens successfully predicted as } y}{\#\text{ of tokens predicted as } y}    (31)

Recall(y) = \frac{\#\text{ of tokens successfully predicted as } y}{\#\text{ of tokens effectively labelled as } y}    (32)

The F-Measure for each class y ∈ Y is computed as the harmonic mean of Precision and Recall:

F(y) = \frac{2 \cdot Recall(y) \cdot Precision(y)}{Recall(y) + Precision(y)}    (33)

The Accuracy measure can be summarized as follows:

Accuracy = \frac{\sum_{y \in \mathcal{Y}} \#\text{ of tokens correctly labelled as } y}{\text{total number of tokens}}    (34)

Considering that the datasets used in this experimental evaluation are composed of unbalanced samples, i.e. the class distribution of each label is not uniform, both micro and macro averages have been computed for Precision, Recall and F-Measure. Micro and macro measures differ in the computation of global performance: macro-averaging gives equal weight to each label category (independently of the category size), while micro-averaging weights the contribution of each label class according to its size.
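The token-level metrics of Eqs. (31)-(34) and the macro average can be sketched as follows (our illustration; note that for single-label tagging the micro-averaged precision coincides with token Accuracy):

```python
from collections import Counter

# Per-label Precision/Recall/F, macro-averaged F, and token Accuracy.

def prf_per_label(gold, pred, labels):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted p, but the gold label was g
            fn[g] += 1   # gold label g was missed
    out = {}
    for y in labels:
        prec = tp[y] / (tp[y] + fp[y]) if tp[y] + fp[y] else 0.0
        rec = tp[y] / (tp[y] + fn[y]) if tp[y] + fn[y] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[y] = (prec, rec, f1)
    macro_f = sum(f for _, _, f in out.values()) / len(labels)
    accuracy = sum(tp.values()) / len(gold)   # Eq. (34); == micro precision
    return out, macro_f, accuracy
```

On skewed label distributions the macro average can diverge sharply from accuracy, which is why both are reported.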


4.2. Datasets

In order to compare the proposed inference method against the traditional ones, we performed experiments on four datasets: US50, CoNLL-2003, Cora and Advertisements. The first dataset, named US50, represents a set of US Postal Addresses. It contains 740 instances, where 50 are used for training and 690 for testing. Each postal address has been manually annotated according to the following labels: Street, City, Civic Number, State and Zip Code.

US50            Train   Test
Street            228   1782
City               61    866
Civic Number       45    597
State              50    689
Zip Code           50    689
Tot.              434   4623

Table 1: US50 label distribution

A further benchmark is the CoNLL-2003 corpus [47], which consists of Reuters news stories annotated with four entity types: Person, Location, Organization and Miscellaneous. It is composed of 945 instances used for training and 246 for testing.

CoNLL-2003      Train   Test
Person           6600   1617
Location         7140   1668
Organization     6321   1661
Miscellaneous    3438    702
Tot.            23499   5648

Table 2: CoNLL-2003 label distribution

The Cora citation benchmark [32] is composed of 500 citations of research papers annotated with 13 different labels: Author, Title, Publisher, Book Title, Date, Journal, Volume, Tech, Institution, Pages, Editor, Location and Notes. The benchmark has been split for training and testing the models: 300 instances have been used as the training set, while the remaining 200 instances for testing.

The last benchmark concerns the Advertisements corpus [22], which comprises announcements for apartment rentals in the San Francisco Bay Area.


Cora          Train   Test
Author         1674    505
Title          2178    703
Publisher       140     46
Booktitle      1049    466
Date            381    126
Journal         382    100
Volume          184     59
Tech            144     47
Institution     196     71
Pages           419    135
Editor          180     48
Location        178     81
Note            105     14
Tot.           7210   2401

Table 3: Cora label distribution

This dataset includes the following tags to be detected: Features, Size, Neighborhood, Photos, Address, Available, Contacts, Rent, Restrictions, Utilities, Roommates and Other. It is composed of 100 instances used for training and 100 for testing.


Advertisements   Train   Test
Feature           4075   4179
Size               589    510
Neighborhood      1395   1374
Photos             100    100
Address            204    223
Available          113    153
Contact           1096    995
Rent               484    476
Restrictions       288    298
Utilities          163    159
Roommates           15    180
Other              357    485
Tot.              8879   9432

Table 4: Advertisements label distribution

4.3. Model Settings

In order to evaluate the contribution of the proposed inference method, we considered two main approaches for a comparative analysis. The experimental evaluation includes: the Viterbi algorithm on a traditional CRF model (CRF-Viterbi), the Nested Viterbi algorithm on the SemiCRF segment-based model (SemiCRF-Viterbi), and Soft-Constrained inference on a traditional CRF model (CRF-Soft). This allows us to compare the proposed inference method with the main state-of-the-art solutions: CRF-Viterbi as the baseline token-based approach, and SemiCRF-Viterbi as the reference segment-based approach able to consider long-range dependencies. Concerning the model settings, the following features have been included during the training phase of the traditional CRF models:

• EdgeFeatures: denote transition feature functions that depend solely upon the current label and the previous one.

• StartFeatures: evaluate whether the label under consideration is a start state.

• EndFeatures: evaluate whether the label under consideration is an end state.

• WordFeatures: check if the current token belongs to the dictionary created on-the-fly from the training set, and provide a count of the number of times the token occurs in a state (label).


• UnknownFeatures: check if the current token was not observed at training time.

• ConcatRegexFeatures: match the token against character patterns. Character patterns are regular expressions that check whether the token is a capitalized word, a number, a lower-case word, whether it contains any special characters, and the like.

Concerning SemiCRF, we used the same set of word-level features, as well as their logical extensions to segments. In particular, we exploited indicators for the sentence inside a segment and the capitalization pattern inside a phrase, as well as indicators for words and capitalization patterns in 3-word windows before and after the segment. We also used indicators for each segment length, and combined all word-level features with indicators for the beginning and end of a segment.
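As a concrete illustration of the character patterns mentioned above, the following sketch extracts token-shape features with regular expressions (the pattern names and regexes here are our own illustrative choices, not necessarily the ones used in the experiments):

```python
import re

# Illustrative character patterns; the actual feature set used in the
# experiments may differ.
PATTERNS = {
    "CAPITALIZED": re.compile(r"^[A-Z][a-z]+$"),   # e.g. "Milano"
    "ALL_CAPS":    re.compile(r"^[A-Z]+$"),        # e.g. "USA"
    "NUMBER":      re.compile(r"^\d+$"),           # e.g. "20126"
    "LOWER":       re.compile(r"^[a-z]+$"),        # e.g. "street"
    "HAS_SPECIAL": re.compile(r"[^A-Za-z0-9]"),    # any non-alphanumeric char
}

def shape_features(token):
    """Return the names of all character patterns the token matches."""
    return [name for name, rx in PATTERNS.items() if rx.search(token)]
```

For instance, `shape_features("20126")` fires only the NUMBER pattern, which is exactly the kind of evidence that helps label Zip Code tokens in US50.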

4.4. Results

The proposed inference relaxation has been compared with two state-of-the-art approaches, i.e. the Viterbi algorithm on CRF (CRF-Viterbi) and SemiCRF (SemiCRF-Viterbi). Tables 5-8 report the Precision (P), Recall (R), F-Measure (F) and Accuracy (A) on the considered datasets, in terms of both macro and micro averages. The performance is detailed by showing first the results of CRF-Viterbi and SemiCRF-Viterbi, and then illustrating the contribution of the considered soft constraints in the relaxed ILP formulation.

                                Macro-Average          Micro-Average
                              P      R      F        P      R      F      A
CRF-Viterbi                 89.48  90.06  86.57    89.79  80.76  80.77  80.80
SemiCRF-Viterbi             89.79  91.10  86.88    89.95  81.12  81.15  81.39
Adjacency                   89.61  90.31  86.91    89.92  81.24  81.29  81.24
Precedence                  90.74  90.42  87.52    90.40  81.57  81.74  82.12
State Change                90.02  91.05  87.07    90.31  82.66  82.80  82.91
Begin-End                   90.00  90.89  87.95    90.95  82.50  82.95  83.25
Presence and Precedence     90.63  91.13  87.93    90.99  82.42  82.19  82.23
CRF-Soft (All Constraints)  91.23  91.90  88.75    91.90  83.39  83.90  84.55

Table 5: Performance comparison on the US50 dataset

Concerning the US50 dataset, the results reported in Table 5 show that the proposed inference approach improves the prediction accuracy with respect to the traditional approaches, both when each constraint is included separately and when all the constraints are considered together. The constraints that provide the most relevant improvement in terms of accuracy are State Change and Begin-End: while some punctuation marks, such as the comma, clearly denote a State Change between adjacent labels, a Begin-End constraint helps to capture recurrent patterns in US address specifications.


If we focus on the results in Table 6 for the CoNLL-2003 dataset, we can easily deduce that free natural language text is more complex to model than the structured or semi-structured instances of US50. Although the overall results obtained by the proposed approach are slightly better than the others (CRF-Soft with all constraints achieves 73.49% accuracy against 71.77% for the traditional CRF and 65.30% for SemiCRF), we can point out that some constraints do not provide any gain in terms of performance. In particular, the Begin-End and Presence and Precedence constraints are redundant with respect to the generated solutions, leading to a labeling output that corresponds to the one generated by the traditional Viterbi algorithm. On the contrary, interesting constraints relate to State Change, where some punctuation marks are relevant, i.e. the cost of violating the corresponding constraint is high, for conditioning the solution. Among them we can, for instance, find a constraint stating that given an Organization name, its Location is usually specified subsequently within brackets, e.g. Organization='Dynamo Batumi' (Location='Georgia').

                                Macro-Average          Micro-Average
                              P      R      F        P      R      F      A
CRF-Viterbi                 82.01  70.74  75.95    83.10  71.78  77.01  71.77
SemiCRF-Viterbi             84.39  63.67  72.45    84.82  65.30  73.66  65.30
Adjacency                   82.39  70.89  74.32    83.85  71.91  77.95  71.94
Precedence                  82.41  70.65  76.02    84.05  72.86  78.04  72.13
State Change                82.56  70.57  76.00    84.37  73.12  78.24  72.98
Begin-End                   82.01  70.74  75.95    83.10  71.78  77.01  71.77
Presence and Precedence     82.01  70.74  75.95    83.10  71.78  77.01  71.77
CRF-Soft (All Constraints)  82.55  72.02  76.92    84.92  73.50  78.78  73.49

Table 6: Performance comparison on the CoNLL-2003 dataset

Concerning the Cora benchmark, the computational results in Table 7 confirm once more that the learned constraints allow the relaxed inference to achieve good performance on the assessed criteria, ensuring significant improvements with respect to state-of-the-art solutions. When all the modeled constraints are introduced in the soft ILP formulation, the proposed approach achieves the highest results. The only exception is the macro-average precision, which is biased by classes of small size such as Publisher, Tech and Note. If we analyze the performance more deeply, we can notice that the labels that benefit most from the learned constraints, and therefore from the relaxed ILP problem formulation, are Author, Title and Booktitle. In particular, for these labels the proposed approach ensures F-Measure values of about 95.19%, 95.31% and 92.61%, against 92.22%, 90.19% and 90.20% for traditional Viterbi. The results on macro- and micro-average F-Measure for these named entities are affected by the size of the training and test sets: sizable training entities allow robust constraint learning and subsequently an effective inference phase on test instances.


                                Macro-Average          Micro-Average
                              P      R      F        P      R      F      A
CRF-Viterbi                 85.64  81.71  83.58    88.22  87.78  88.05  87.78
SemiCRF-Viterbi             84.93  81.98  83.42    88.52  87.94  88.31  88.35
Adjacency                   83.96  83.37  83.66    89.57  89.03  89.50  89.85
Precedence                  85.01  84.29  84.64    89.00  88.20  88.88  88.53
State Change                87.78  82.65  85.13    90.20  89.83  90.05  91.43
Begin-End                   87.91  82.33  85.02    89.74  89.15  89.64  90.36
Presence and Precedence     86.59  83.57  85.05    89.18  89.58  89.44  90.10
CRF-Soft (All Constraints)  88.85  86.42  87.61    92.88  92.79  92.95  93.99

Table 7: Performance comparison on the Cora benchmark

Regarding the Advertisements dataset, the results are reported in Table 8. This benchmark is characterized not only by free natural language text, as in the CoNLL-2003 dataset, but also by an informal expressive form. Despite the high variability of the natural language, the proposed approach is able to capture some global relationships as constraints to be included in the inference phase. For instance, the State Change constraint captures the relationship between Size and Feature (e.g. Size='1350 square feet'. Feature='new carpet and more'), while the Begin-End constraint constrains the labeling solution to end with the entity Contact.

                                Macro-Average          Micro-Average
                              P      R      F        P      R      F      A
CRF-Viterbi                 65.35  59.63  66.97    75.59  76.36  75.50  76.36
SemiCRF-Viterbi             67.28  59.30  66.64    75.56  77.39  75.86  77.39
Adjacency                   66.35  60.35  67.19    76.91  77.22  76.58  77.47
Precedence                  66.30  60.20  67.00    76.40  77.00  76.40  76.98
State Change                72.74  64.43  65.13    79.27  78.08  77.98  78.08
Begin-End                   71.73  65.18  67.36    79.15  78.15  78.90  78.05
Presence and Precedence     65.35  59.63  66.97    75.59  76.36  75.50  76.36
CRF-Soft (All Constraints)  74.15  67.80  68.98    80.15  79.65  79.12  78.50

Table 8: Performance comparison on the Advertisements benchmark

As a general conclusion, it is interesting to highlight that the gains of the relaxed ILP formulation are larger when the dataset shows complex structural relationships among token labels. In this case the Viterbi algorithm could be “biased” by the transition feature functions encoded in log Mt(y, y′|x), which capture only local relationships, while the proposed approach is able to find a more appropriate labeling solution thanks to the global relationships encoded as constraints.
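This effect can be illustrated with a deliberately tiny brute-force sketch (not the paper's ILP solver: here the argmax is enumerated exhaustively, and the labels, scores, and penalty weight are invented for illustration). A plain Viterbi-style argmax maximizes only the local emission and transition scores, while the soft-constrained objective additionally subtracts a weight for each violated global constraint, mimicking the weighted violation variables of the relaxed formulation:

```python
from itertools import product

# Illustrative label set; any finite set of hidden states works.
LABELS = ["Street", "City", "Zip"]

def local_score(seq, emission, transition):
    """Sum of per-token emission scores and adjacent-pair transition scores,
    i.e. the only information a first-order Viterbi decoder can use."""
    s = sum(emission[t][y] for t, y in enumerate(seq))
    s += sum(transition[(a, b)] for a, b in zip(seq, seq[1:]))
    return s

def decode(emission, transition, constraints=()):
    """Exhaustive argmax over label sequences; each violated soft constraint
    subtracts its weight from the objective (mimicking violation variables)."""
    best, best_score = None, float("-inf")
    for seq in product(LABELS, repeat=len(emission)):
        score = local_score(seq, emission, transition)
        for holds, weight in constraints:
            if not holds(seq):
                score -= weight  # pay the cost of violating the constraint
        if score > best_score:
            best, best_score = seq, score
    return list(best)
```

For example, a Begin-End-style soft constraint `(lambda seq: seq[-1] == "Zip", 5.0)` pushes the decoder toward sequences ending in Zip, unless the local evidence against that ending outweighs the penalty of 5.0.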


At the end of the computational comparison across several datasets, we would like to stress once more that the proposed approach consists in adding information (constraints) during the inference phase, without modifying the learning step. On the contrary, most of the works proposed in the literature are based on adding knowledge during the learning procedure. The advantage of constraining (and relaxing) the inference phase, instead of forcing the training, is twofold: (1) lower computational complexity and (2) adaptability to changes in the domain setting (the probabilistic model generated during the learning phase can be maintained, while inference can be adapted to a modified context). To the best of our knowledge, the most recent work on constrained inference in discriminative probabilistic models is Learning Plus Inference (L+I), presented in [9], where the two datasets Advertisements and Cora have been exploited for validation purposes. If we compare the performance achieved by the proposed inference approach with that obtained by L+I, we can easily note that our relaxed ILP formulation achieves higher performance: our approach ensures an accuracy of 93.99% against 89.72% on the Cora dataset, and 78.50% against 75.28% on the Advertisements dataset. This highlights that the proposed approach is able to directly derive constraints from data and include them in the relaxed ILP formulation, performing inference with a more expressive model.

5. Conclusions and Ongoing Research

In this paper we have presented an ILP approach for relaxing inference in Information Extraction problems addressed with Conditional Random Fields. The proposed approach is based on the automatic extraction of background knowledge from training data and on the incorporation of such knowledge into the inference problem. The proposed ILP formulation is aimed at incorporating the additional knowledge hidden in long-distance dependencies through non-deterministic soft constraints. The adoption of a relaxation of the inference problem, and the use of properly weighted violation variables for the additional constraints, is designed to preserve these complex relationships during output prediction. Experimental results show that our method significantly outperforms the current state-of-the-art approaches, obtaining remarkable performance gains across several domains and types of data (benchmark and real datasets).

Concerning ongoing research, an interesting direction relates to how to effectively label named entities when hundreds of labels are involved. As outlined in [12] and [29], the probabilistic model defined by a CRF in its native form is not appropriate for dealing with Named Entity Recognition tasks in large-scale domains. This is mainly due to the computational limitations of the learning algorithms. Traditional approaches to training CRFs are based on gradient-based methods and, although these approaches are relatively efficient, the learning phase becomes impracticable due to the amount of data to be considered. Traditional training algorithms are usually able to work on relatively small-scale domains, with a reduced number of training examples and small label sets. For much larger tasks, with hundreds of labels and millions of examples, the current training methods


prove intractable. In order to perform large-scale named entity recognition with a large number of labels and examples, a possible solution could be to perform a heuristic search ([48]) to find the optimal feature subset to be used for training.

References

[1] Appelt, D. E., Hobbs, J. R., Bear, J., Israel, D. J., Tyson, M., 1993. FASTUS: A finite-state processor for information extraction from real-world text. In: Proc. of the International Joint Conference on Artificial Intelligence. pp. 1172–1178.

[2] Baum, L. E., Petrie, T., 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37 (6), pp. 1554–1563.

[3] Beckerle, M., Martucci, L. A., Ries, S., 2010. Interactive access rule learning: Generating adapted access rule sets. In: Proc. of the 2nd International Conference on Adaptive and Self-adaptive Systems and Applications. pp. 104–110.

[4] Bohannon, P., Merugu, S., Yu, C., Agarwal, V., DeRose, P., Iyer, A., Jain, A., Kakade, V., Muralidharan, M., Ramakrishnan, R., Shen, W., 2009. Purple SOX extraction management system. SIGMOD Rec. 37, 21–27.

[5] Califf, M. E., Mooney, R. J., 1999. Relational learning of pattern-match rules for information extraction. In: Proc. of the 16th National Conference on Artificial Intelligence and the 11th Innovative Applications of Artificial Intelligence Conference. pp. 328–334.

[6] Califf, M. E., Mooney, R. J., 2003. Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res. 4, 177–210.

[7] Chang, M., Ratinov, L., Roth, D., 2008. Constraints as prior knowledge. In: ICML Workshop on Prior Knowledge for Text and Language Processing. pp. 32–39.

[8] Chang, M.-W., Ratinov, L., Roth, D., 2007. Guiding semi-supervision with constraint-driven learning. In: Proc. of the 45th Annual Meeting of the Association of Computational Linguistics. Prague, Czech Republic, pp. 280–287.

[9] Chang, M.-W., Ratinov, L., Roth, D., 2012. Structured learning with constrained conditional models. Machine Learning 88 (3), 399–431.

[10] Ciravegna, F., 2001. Adaptive information extraction from text by rule induction and generalisation. In: Proc. of the 17th International Joint Conference on Artificial Intelligence - Volume 2. pp. 1251–1256.


[11] Clifford, P., 1990. Markov random fields in statistics. In: Disorder in Physical Systems: A Volume in Honour of John M. Hammersley. Oxford University Press.

[12] Cohn, T., Smith, A., Osborne, M., 2005. Scaling conditional random fields using error-correcting codes. In: Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics. pp. 10–17.

[13] Culotta, A., Kristjansson, T., McCallum, A., Viola, P., 2006. Corrective feedback and persistent learning for information extraction. Artificial Intelligence 170 (14–15), 1101–1122.

[14] Cvejic, I., Zhang, J., Marx, J., Tjoe, J., 2012. Automated search for patient records: classification of free-text medical reports using conditional random fields. In: Proc. of the 2nd ACM SIGHIT International Health Informatics Symposium. ACM, New York, NY, USA, pp. 691–696.

[15] Daume, III, H., Marcu, D., 2005. Learning as search optimization: approximate large margin methods for structured prediction. In: Proc. of the 22nd International Conference on Machine Learning. pp. 169–176.

[16] Deleger, L., Molnar, K., Savova, G., Xia, F., Lingren, T., Li, Q., Marsolo, K., Jegga, A. G., Kaiser, M., Stoutenborough, L., Solti, I., 2013. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. JAMIA 20 (1), 84–94.

[17] Felici, G., Truemper, K., 2002. A MINSAT approach for learning in logic domains. INFORMS Journal on Computing 14, 20–36.

[18] Felici, G., Truemper, K., 2005. The Lsquare system for mining logic data. In: Encyclopedia of Data Warehousing and Mining. Idea Group Publishing, pp. 693–697.

[19] Fersini, E., Messina, E., 2013. Named entities in judicial transcriptions: Extended conditional random fields. In: Proc. of the 14th International Conference on Intelligent Text Processing and Computational Linguistics. pp. 317–328.

[20] Fersini, E., Messina, E., Archetti, F., Cislaghi, M., 2013. Semantics and machine learning: A new generation of court management systems. In: Fred, A., Dietz, J., Liu, K., Filipe, J. (Eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management. Vol. 272 of Communications in Computer and Information Science. Springer Berlin Heidelberg, pp. 382–398.

[21] Galen, A., 2006. A hybrid Markov/semi-Markov conditional random field for sequence segmentation. In: Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing. pp. 465–472.


[22] Grenager, T., Klein, D., Manning, C. D., 2005. Unsupervised learning of field segmentation models for information extraction. In: Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics. pp. 371–378.

[23] Hammersley, J., Clifford, P., 1971. Markov fields on finite graphs and lattices. Unpublished.

[24] Ho, T. B., Nguyen, D. D., 2003. Chance discovery and learning minority classes. New Generation Computing 21 (2), 149–161.

[25] Khaitan, S., Ramakrishnan, G., Joshi, S., Chalamalla, A., 2008. RAD: A scalable framework for annotator development. In: Proc. of the International Conference on Data Engineering. pp. 1624–1627.

[26] Kristjansson, T., Culotta, A., Viola, P., McCallum, A., 2004. Interactive information extraction with constrained conditional random fields. In: Proc. of the 19th National Conference on Artificial Intelligence. pp. 412–418.

[27] Lafferty, J. D., McCallum, A., Pereira, F. C. N., 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. of the 18th International Conference on Machine Learning. pp. 282–289.

[28] Landwehr, N., Kersting, K., Raedt, L. D., 2007. Integrating naive Bayes and FOIL. J. Mach. Learn. Res. 8, 481–507.

[29] Lee, C., Hwang, Y.-G., Oh, H.-J., Lim, S., Heo, J., Lee, C.-H., Kim, H.-J., Wang, J.-H., Jang, M.-G., 2006. Fine-grained named entity recognition using conditional random fields for question answering. In: Information Retrieval Technology. Springer, pp. 581–587.

[30] Lehnert, W., McCarthy, J., Soderland, S., Riloff, E., Cardie, C., Peterson, J., Feng, F., Dolan, C., Goldman, S., 1993. UMass/Hughes: description of the CIRCUS system used for MUC-5. In: Proc. of the 5th Conference on Message Understanding. pp. 277–291.

[31] Malouf, R., 2002. A comparison of algorithms for maximum entropy parameter estimation. In: Proc. of the 6th Conference on Natural Language Learning - Vol. 20. pp. 1–7.

[32] McCallum, A., Freitag, D., Pereira, F. C. N., 2000. Maximum entropy Markov models for information extraction and segmentation. In: Proc. of the 17th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 591–598.

[33] Nocedal, J., Wright, S., 2000. Numerical Optimization. Springer.


[34] Qi, L., Chen, L., 2010. A linear-chain CRF-based learning approach for web opinion mining. In: Proc. of the 11th International Conference on Web Information Systems Engineering. Springer-Verlag, pp. 128–141.

[35] Quinlan, J. R., 1990. Learning logical definitions from relations. Machine Learning 5 (3), 239–266.

[36] Ratnaparkhi, A., 1999. Learning to parse natural language with maximum entropy models. Machine Learning 34 (1-3), 151–175.

[37] Rau, L., 1991. Extracting company names from text. In: Proc. of the 7th IEEE Conference on Artificial Intelligence Applications. pp. 29–32.

[38] Richardson, M., Domingos, P., 2006. Markov logic networks. Machine Learning 62, 107–136.

[39] Riloff, E., 1993. Automatically constructing a dictionary for information extraction tasks. In: Proc. of the 11th National Conference on Artificial Intelligence. pp. 811–816.

[40] Roth, D., Yih, W.-t., 2005. Integer linear programming inference for conditional random fields. In: Proc. of the 22nd International Conference on Machine Learning. pp. 736–743.

[41] Sarawagi, S., Cohen, W. W., 2004. Semi-Markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems. pp. 1185–1192.

[42] Seymore, K., McCallum, A., Rosenfeld, R., 1999. Learning hidden Markov model structure for information extraction. In: AAAI'99 Workshop on Machine Learning for Information Extraction. pp. 37–42.

[43] Shariaty, S., Moghaddam, S., 2011. Fine-grained opinion mining using conditional random fields. In: Proc. of the 11th International Conference on Data Mining Workshops. IEEE Computer Society, pp. 109–114.

[44] Soderland, S., 1999. Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233–272.

[45] Sutton, C. A., 2008. Efficient training methods for conditional random fields. Ph.D. thesis, University of Massachusetts.

[46] Takeuchi, K., Collier, N., 2002. Use of support vector machines in extended named entity recognition. In: Proc. of the 6th Conference on Natural Language Learning - Vol. 20. pp. 1–7.

[47] Tjong Kim Sang, E. F., De Meulder, F., 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Daelemans, W., Osborne, M. (Eds.), Proceedings of CoNLL-2003. Edmonton, Canada, pp. 142–147.


