Edits Manual 2.1

Date post: 07-Apr-2018
  • 8/6/2019 Edits Manual 2.1

    Edit Distance Textual Entailment Suite (EDITS)
    User Manual - Version 2.1

    Milen Kouylekov and Matteo Negri
    Fondazione Bruno Kessler, FBK-irst

    {kouylekov,negri}@fbk.eu

    1 Introduction

    Textual Entailment (TE) has been proposed as a unifying generic framework for modeling language variability and semantic inference in different Natural Language Processing (NLP) tasks. The Recognizing Textual Entailment (RTE) task (Dagan and Glickman, 2007) consists in deciding, given two text fragments (respectively called Text - T, and Hypothesis - H), whether the meaning of H can be inferred from the meaning of T, as in:

    T: Yahoo acquired Overture
    H: Yahoo owns Overture

    The system has been designed following three basic requirements:

    Modularity. The system architecture is such that the overall processing task is broken up into major modules. Modules can be composed through a configuration file, and extended as plug-ins according to individual requirements. The system's workflow, the behavior of the basic components, and their I/O formats are described in comprehensive documentation available upon download.

    Flexibility. The system is general-purpose, and suited for any TE corpus provided in a simple XML format. In addition, both language-dependent and language-independent configurations are allowed by algorithms that manipulate different representations of the input data.

    Adaptability. Modules can be tuned over training data to optimize performance along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and NO entailment judgements). In addition, an optimization component based on genetic algorithms is available to automatically set parameters starting from a basic configuration.

    EDITS is open source, and available under the GNU Lesser General Public License (LGPL). The tool is implemented in Java, runs on Unix-based operating systems, and has been tested on Mac OS X, Linux, and Sun Solaris. The latest release of the package can be downloaded from:

    http://edits.fbk.eu

    EDITS comes pre-packaged with:

    System Graphical Interface - an editor capable of handling the system core data structures;

    Set of System Configurations - configurations used to run EDITS, described in Section 4.2;

    Set of Cost Schemes - XML files used by the entailment engine, described in Section 5;

    Entailment Rules - XML files that contain knowledge extracted from WordNet, VerbOcean and Wikipedia, represented as rules;

    Set of RTE Datasets - the publicly available RTE corpora (RTE 1, 2, 3) in ETAF format, described in Section 4.1;

    Trained Models - configured and trained entailment engines used by the FBK Textual Entailment group;

    HTML Reference - an HTML document describing all the modules available in the system;

    INSTALL.txt - a file that describes the installation of EDITS.

    2 System Overview

    Figure 1: Entailment Engine

    The EDITS package allows users to:

    Create an Entailment Engine (Figure 1) by defining its basic components (i.e. algorithms, cost schemes, rules, and optimizers);

    Train such an Entailment Engine over an annotated RTE corpus (containing T-H pairs annotated in terms of entailment) to learn a Model;

    Use the Entailment Engine and the Model to assign an entailment judgment and a confidence score to each pair of an un-annotated test corpus.

    EDITS implements a distance-based framework which assumes that the probability of an entailment relation between a given T-H pair is inversely proportional to the distance between T and H (i.e. the higher the distance, the lower the probability of entailment). Within this framework the system implements and harmonizes different approaches to distance computation, providing both edit distance algorithms and similarity algorithms (see Section 3.1). Each algorithm returns a normalized distance score (a number between 0 and 1). At a training stage, distance scores calculated over annotated T-H pairs are used to estimate a threshold that best separates positive from negative examples. The threshold, which is stored in a Model, is used at a test stage to assign an entailment judgment and a confidence score to each test pair.
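
    The training and test stages described above can be illustrated with a minimal Python sketch. It is not the EDITS implementation: the function names are hypothetical, the threshold search is a plain accuracy maximization over the observed scores, and the confidence value (distance from the threshold) is an assumption made only for illustration.

```python
from typing import List, Tuple


def estimate_threshold(scored: List[Tuple[float, bool]]) -> float:
    """Pick the distance threshold that maximizes accuracy on training pairs.

    scored holds (distance, entails) tuples with distance in [0, 1].
    A pair is predicted YES when its distance is <= threshold.
    """
    best_threshold, best_accuracy = 0.0, -1.0
    # Only the observed distances need to be tried as candidate thresholds.
    for candidate in sorted({distance for distance, _ in scored}):
        correct = sum((distance <= candidate) == entails
                      for distance, entails in scored)
        accuracy = correct / len(scored)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold


def judge(distance: float, threshold: float) -> Tuple[str, float]:
    """Assign a judgment; confidence is assumed to be the margin to the threshold."""
    label = "YES" if distance <= threshold else "NO"
    return label, abs(distance - threshold)


training = [(0.1, True), (0.2, True), (0.6, False), (0.8, False)]
threshold = estimate_threshold(training)
print(threshold, judge(0.15, threshold)[0], judge(0.7, threshold)[0])
```

    Low distances fall on the YES side of the threshold, which matches the assumption that distance and entailment probability move in opposite directions.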

    In the creation of a distance Entailment Engine, algorithms are combined with cost schemes (see Section 3.2) that can be optimized to determine their behavior (see Section 3.3), and optional external knowledge represented as rules (see Section 3.4). Besides the definition of a single Entailment Engine, a unique feature of EDITS is that it allows for the combination of multiple Entailment Engines in different ways (see Section 4.4).

    Basic components are already provided with EDITS, allowing users to create a variety of entailment engines. Fast prototyping of new solutions is also supported by the possibility of extending the modular architecture of the system with new algorithms, cost schemes, rules, or plug-ins to new language processing components.

    3 Basic Components

    This section overviews the main components of a distance Entailment Engine, namely: i) algorithms, ii) cost schemes, iii) the cost optimizer, and iv) entailment/contradiction rules.

    3.1 Algorithms

    Algorithms are used to compute a distance score between T-H pairs. EDITS provides a set of predefined algorithms, including edit distance algorithms, and similarity algorithms adapted to the proposed distance framework. The choice of the available algorithms is motivated by their large use documented in RTE literature.

    Edit distance algorithms cast the RTE task as the problem of mapping the whole content of H into the content of T. Mappings are performed as sequences of editing operations (i.e. insertion, deletion, substitution of text portions) needed to transform T into H, where each edit operation has a cost associated with it. The distance algorithms available in the current release of the system are:

    Token Edit Distance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H;

    Tree Edit Distance: an implementation of the algorithm described in (Zhang and Shasha, 1990), with edit operations defined over single nodes of a syntactic representation of T and H.
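
    As an illustration of the token-based variant, the following Python sketch computes a Levenshtein distance over token sequences with configurable operation costs. The function names are hypothetical, and the normalization (dividing by the cost of deleting all of T and inserting all of H) is an assumption; the manual only states that scores are normalized to the range 0 to 1.

```python
from typing import Sequence


def token_edit_distance(t: Sequence[str], h: Sequence[str],
                        ins: float = 10.0, dele: float = 10.0,
                        sub: float = 20.0) -> float:
    """Minimum cost of transforming token sequence t into h."""
    n, m = len(t), len(h)
    # d[i][j] = cheapest way to turn the first i tokens of t into the first j of h.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0.0 if t[i - 1] == h[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,          # delete a token of T
                          d[i][j - 1] + ins,           # insert a token of H
                          d[i - 1][j - 1] + sub_cost)  # substitute or match
    return d[n][m]


def normalized_distance(t: Sequence[str], h: Sequence[str],
                        ins: float = 10.0, dele: float = 10.0,
                        sub: float = 20.0) -> float:
    """Scale the raw distance to [0, 1] (assumed normalization, see lead-in)."""
    worst = len(t) * dele + len(h) * ins
    return token_edit_distance(t, h, ins, dele, sub) / worst if worst else 0.0


T = "Yahoo acquired Overture".split()
H = "Yahoo owns Overture".split()
print(token_edit_distance(T, H), normalized_distance(T, H))
```

    With the default costs, replacing "acquired" with "owns" costs 20 whether it is done as one substitution or as a deletion followed by an insertion.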

    Similarity algorithms are adapted to the EDITS distance framework by transforming measures of the lexical/semantic similarity between T and H into distance measures. These algorithms are also adapted to use the three edit operations to support overlap calculation, and define term weights. For instance, substitutable terms in T and H can be treated as equal, and non-overlapping terms can be weighted proportionally to their insertion/deletion costs. The following similarity algorithms are available:

    Word Overlap: computes an overall (distance) score as the proportion of common words in T and H. In the current implementation the algorithm uses the cost scheme to find the least costly substitution of a word from H with a word from T. One word from T can substitute more than one of the words in H. The score returned by the algorithm is the sum of the costs of all substitutions divided by the number of words in H;

    Jaro-Winkler distance: a similarity algorithm between strings, adapted to similarity between words. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained Jaro-Winkler metric from 1 (i.e. score(A,B)=1-JW(A,B));

    Cosine Similarity: a common vector-based similarity measure. The EDITS implementation uses the cost scheme to define if two words are the same (they have a 0 cost of substitution) and the weight of words (the cost of insertion for a word in H and the cost of deletion for a word in T);

    Longest Common Subsequence: searches the longest possible sequence of words appearing both in T and H in the same order, normalizing its length by the length of H. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B)=1-(LCS(A,B)/words(B)));

    Jaccard Coefficient: compares the intersection of words in T and H to their union. The algorithm uses the cost scheme to define if two words are the same (they have a 0 cost of substitution). The entailment score is obtained by subtracting the obtained similarity from 1 (i.e. score(A,B)=1-JSC(A,B));

    Rouge: our implementation of a set of metrics for summarization evaluation (Lin, 2004).
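
    Two of the similarity-to-distance conversions listed above can be sketched in a few lines of Python. This is an illustrative sketch rather than the EDITS code: word equality is plain string equality here, standing in for "a 0 cost of substitution" in the cost scheme, and the function names are hypothetical.

```python
from typing import Sequence


def lcs_length(t: Sequence[str], h: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    table = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i, t_word in enumerate(t, start=1):
        for j, h_word in enumerate(h, start=1):
            if t_word == h_word:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(t)][len(h)]


def lcs_distance(t: Sequence[str], h: Sequence[str]) -> float:
    """score(A,B) = 1 - LCS(A,B) / words(B), with B being the hypothesis."""
    return 1.0 - lcs_length(t, h) / len(h) if h else 0.0


def jaccard_distance(t: Sequence[str], h: Sequence[str]) -> float:
    """score(A,B) = 1 - |A intersect B| / |A union B| over word sets."""
    a, b = set(t), set(h)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0


T = "Yahoo acquired Overture".split()
H = "Yahoo owns Overture".split()
print(lcs_distance(T, H), jaccard_distance(T, H))
```

    For the example pair, two of the three hypothesis words are covered by the common subsequence, and two of the four distinct words are shared.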

    3.2 Cost Schemes

    Cost schemes are used to define the cost of each edit operation. Cost schemes are defined as XML files that explicitly associate a cost (a positive real number) to each edit operation applied to elements of T and H. Elements, referred to as A and B, can be of different types, depending on the algorithm used. For instance, Tree Edit Distance will manipulate nodes in a dependency tree representation, whereas Token Edit Distance and similarity algorithms will manipulate words. Figure 2 shows an example of a cost scheme, where edit operation costs are defined as follows:

    Insertion(B)=10 - inserting an element B from H to T, no matter what B is, always costs 10;
    Deletion(A)=10 - deleting an element A from T, no matter what A is, always costs 10;
    Substitution(A,B)=0 if A=B - substituting A with B costs 0 if A and B are equal;
    Substitution(A,B)=20 if A!=B - substituting A with B costs 20 if A and B are different.

    insertion: cost 10
    deletion: cost 10
    substitution, condition (equals A B): cost 0
    substitution, condition (not (equals A B)): cost 20

    Figure 2: Example of XML Cost Scheme

    In the distance-based framework adopted by EDITS, the interaction between algorithms and cost schemes plays a central role. Given a T-H pair, the distance score returned by an algorithm directly depends on the cost of the operations applied to transform T into H (edit distance algorithms), or on the cost of mapping words in H to words in T (similarity algorithms). Such interaction determines the overall behavior of an Entailment Engine, since distance scores returned by the same algorithm with different cost schemes can be considerably different. This allows users to define (and optimize, as explained in Section 3.3) the cost schemes that best suit the RTE data they want to model. For instance, when dealing with T-H pairs composed of texts that are much longer than the hypotheses (as in the RTE5 Campaign), setting low deletion costs avoids penalizing short Hs fully contained in the Ts.

    To facilitate the usage of the system, EDITS provides a mechanism for the automatic generation of cost schemes. This mechanism creates a simple cost scheme that is compatible with the distance algorithm used and with the entailment/contradiction resources accessible to the system. More information about this mechanism can be found in Section 4.6.

    EDITS provides two predefined cost schemes:

    Simple Cost Scheme - the one shown in Figure 2, setting fixed costs for each edit operation.

    IDF Cost Scheme - insertion and deletion costs for a word w are set to the inverse document frequency of w (IDF(w)). The substitution cost is set to 0 if a word w1 from T and a word w2 from H are the same, and IDF(w1)+IDF(w2) otherwise.
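
    The IDF Cost Scheme can be sketched in Python as follows. The IDF formula used here, log(N/df), is the common textbook definition and is an assumption: the manual does not state which variant EDITS uses, and the function names and the default value for unseen words are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(documents: List[List[str]]) -> Dict[str, float]:
    """IDF(w) = log(N / df(w)), where df counts documents containing w."""
    document_frequency = Counter()
    for document in documents:
        document_frequency.update(set(document))
    total = len(documents)
    return {word: math.log(total / df) for word, df in document_frequency.items()}


def insertion_cost(word: str, idf: Dict[str, float], unseen: float = 0.0) -> float:
    """Inserting (or deleting) a word costs its IDF; unseen words get a default."""
    return idf.get(word, unseen)


def substitution_cost(w1: str, w2: str, idf: Dict[str, float]) -> float:
    """0 for identical words, IDF(w1) + IDF(w2) otherwise."""
    if w1 == w2:
        return 0.0
    return insertion_cost(w1, idf) + insertion_cost(w2, idf)


documents = [["yahoo", "acquired", "overture"],
             ["yahoo", "owns", "overture"],
             ["yahoo", "search"]]
idf = idf_table(documents)
print(substitution_cost("acquired", "owns", idf))
```

    Words that occur in every document ("yahoo" above) get an IDF of 0, so inserting or deleting them is free, while rarer words are more expensive to edit.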

    In the creation of new cost schemes, users can express edit operation costs, and conditions over the A and B elements, using a meta-language based on a Lisp-like syntax (e.g. (+ (IDF A) (IDF B)), (not (equals A B))). The system also provides functions to access data stored in hash files. For example, the IDF Cost Scheme accesses the IDF values of the 100K most frequent English words (calculated on the Brown Corpus) stored in a file distributed with the system. Users can create new hash files to collect statistics about words in other languages, or other information to be used inside the cost scheme. The definition of a cost scheme is described in detail in Section 5. EDITS provides a mechanism for automatic generation of cost schemes, as described in Section 4.6.
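
    To make the Lisp-like syntax concrete, here is a small Python sketch of an evaluator for expressions such as the two quoted above. It only illustrates how such expressions can be parsed and evaluated; the EDITS meta-language has its own function set (documented in Section 5), and the environment below, including the IDF values, is hypothetical.

```python
from typing import Any, Dict, List


def tokenize(source: str) -> List[str]:
    """Split an s-expression into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Any:
    """Build a nested list from the token stream (consumes tokens)."""
    token = tokens.pop(0)
    if token == "(":
        expression = []
        while tokens[0] != ")":
            expression.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return expression
    try:
        return float(token)
    except ValueError:
        return token  # a variable or function name


def evaluate(expression: Any, env: Dict[str, Any]) -> Any:
    """Numbers evaluate to themselves, names are looked up, lists are calls."""
    if isinstance(expression, float):
        return expression
    if isinstance(expression, str):
        return env[expression]
    function = env[expression[0]]
    return function(*(evaluate(arg, env) for arg in expression[1:]))


# Hypothetical environment: a toy IDF table and the elements A and B.
toy_idf = {"acquired": 2.0, "owns": 1.5}
env = {
    "+": lambda *xs: sum(xs),
    "not": lambda x: not x,
    "equals": lambda a, b: a == b,
    "IDF": lambda word: toy_idf.get(word, 0.0),
    "A": "acquired",
    "B": "owns",
}
print(evaluate(parse(tokenize("(+ (IDF A) (IDF B))")), env))
print(evaluate(parse(tokenize("(not (equals A B))")), env))
```

    The same expression can serve as a cost (when it returns a number) or as a condition (when it returns a boolean), which is how the two examples in the text differ.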

    3.3 Cost Optimizer

    A cost optimizer is used to adapt cost schemes (either those provided with the system, or new ones defined by the user) to specific datasets. The optimizer is based on cost adaptation through genetic algorithms, as proposed in (Mehdad, 2009). To this aim, cost schemes can be parametrized by externalizing the edit operation costs as parameters. The optimizer iterates over training data using different values of these parameters until an optimal set is found (i.e. the one that performs best on the training set). The optimization mechanism is described in Section 4.5.
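
    The search loop can be illustrated with a deliberately small genetic algorithm in Python. This is a generic sketch under stated assumptions, not the optimizer shipped with EDITS: the population size, crossover and mutation choices are arbitrary, parameters are assumed to lie in [0, 1], and the fitness function is a stand-in for "accuracy on the training set with these operation costs".

```python
import random
from typing import Callable, List, Tuple


def optimize(fitness: Callable[[List[float]], float], n_params: int,
             pop_size: int = 20, generations: int = 30,
             mutation: float = 0.1, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return the best parameter vector and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # the fittest half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Gaussian mutation, clipped to the assumed [0, 1] range.
            child = [min(1.0, max(0.0, gene + rng.gauss(0.0, mutation)))
                     for gene in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history


def toy_fitness(params: List[float]) -> float:
    """Stand-in for training accuracy; peaks when every parameter is 0.5."""
    return -sum((p - 0.5) ** 2 for p in params)


best, history = optimize(toy_fitness, n_params=3)
print(best, history[-1])
```

    In a real setting, each fitness evaluation would retrain the threshold with the candidate insertion, deletion and substitution costs and report the resulting accuracy, which is why optimization is much slower than plain training.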

    3.4 Rules

    Rules are used to provide the Entailment Engine with knowledge (e.g. lexical, syntactic, semantic) about the probability of entailment or contradiction between elements of T and H. Rules are invoked by cost schemes to influence the cost of substitutions between elements of T and H. Typically, the cost of the substitution between two elements A and B is inversely proportional to the probability that A entails B.

    Rules are stored in XML files called Rule Repositories, with the format shown in Figure 3. Each rule consists of three parts: i) a left-hand side, ii) a right-hand side, iii) a probability that the left-hand side entails (or contradicts) the right-hand side.

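    left-hand side: acquire, right-hand side: own, probability: 0.95

    left-hand side: beautiful, right-hand side: ugly, probability: 0.88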

    Figure 3: Example of XML Rule Repository

    The format of the entailment rules repository is described in Section 6.
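
    The way a rule can lower a substitution cost is sketched below in Python. The mapping from probability to cost, max_cost * (1 - p), is an assumption chosen because it decreases as the probability grows; the manual does not give the exact formula, and the names are hypothetical. The rule value comes from the acquire/own example in Figure 3.

```python
from typing import Dict, Tuple

# Entailment rules as (left-hand side, right-hand side) -> probability.
ENTAILMENT_RULES: Dict[Tuple[str, str], float] = {
    ("acquire", "own"): 0.95,
}


def substitution_cost(a: str, b: str,
                      rules: Dict[Tuple[str, str], float],
                      max_cost: float = 20.0) -> float:
    """Cost of substituting a (from T) with b (from H).

    Identical elements cost 0; otherwise the cost shrinks as the
    probability that a entails b grows (assumed linear mapping).
    """
    if a == b:
        return 0.0
    probability = rules.get((a, b), 0.0)  # no rule means no evidence of entailment
    return max_cost * (1.0 - probability)


print(substitution_cost("acquire", "own", ENTAILMENT_RULES))
print(substitution_cost("acquire", "sell", ENTAILMENT_RULES))
```

    Rules are directional: the lookup uses the pair (a, b), so a rule for "acquire entails own" does not reduce the cost of substituting "own" with "acquire".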

    4 Using the System

    This section provides basic information about the use of EDITS, which can be run with commands in a Unix shell.

    4.1 EDITS Input

    The input of the system is an entailment corpus represented in the EDITS Text Annotation Format (ETAF), a simple XML internal annotation format (DTD in Figure 4). ETAF is used to represent both the input T-H pairs, and the entailment and contradiction rules. ETAF allows texts to be represented at two different levels: i) as sequences of tokens with their associated morpho-syntactic properties, or ii) as syntactic trees with structural relations among nodes. Plug-ins for several widely used annotation tools (including TreeTagger, Stanford Parser, and OpenNLP) can be downloaded from the system's website. Users can also extend EDITS by implementing plug-ins to convert the output of other annotation tools into ETAF.

    id CDATA #REQUIRED
    entailment (YES|NO|UNKNOWN) #REQUIRED
    task CDATA #IMPLIED
    length CDATA #IMPLIED
    >

    name CDATA #IMPLIED
    from CDATA #REQUIRED
    to CDATA #REQUIRED
    >

    name CDATA #REQUIRED
    source CDATA #IMPLIED
    >

    name CDATA #REQUIRED
    start CDATA #IMPLIED
    end CDATA #IMPLIED
    source CDATA #IMPLIED
    >

    Figure 4: DTD of the ETAF annotation format

    The basic level of annotation represents texts as sequences of tokens with morpho-syntactic features. Common properties are token, lemma, morpho and part of speech, but other linguistic annotations may be used at this level, including named entities, weights of tokens (like IDF), and many others. For example, the following is the morpho-syntactic representation of the sentence "Edison invented the Kinetoscope.", with user-defined attribute names for two POS tagging sets (pos - TextPro tagset and wnpos - WordNet part of speech), token, lemma, full morphological analysis (full morpho) and sentence boundaries.

    Edison invented the Kinetoscope.

    nEdisonedisonNP0

    invent+v+part+pastinvented+adj+zero+invent+v+indic+past

    v invented

    inventsVVD

    the+adv the+artthetheAT0

    nKinetoscopekinetoscope

    NN1

    .+punc..PUN

    Figure 5: Morpho-Syntactic Annotation Example

    The second level of annotation represents texts as syntactic trees with their structural features. Both nodes (terminal and non-terminal) and edges with syntactic relations are represented. Nodes are typically described with their morpho-syntactic properties. The example below shows the output of the Stanford Parser for the sentence "Edison invented the Kinetoscope." converted into ETAF.

    Edison invented the Kinetoscope.

    EdisonEdisonNNP

    invented

    inventVBD

    thetheDT

    KinetoscopeKinetoscopeNN

    Figure 6: Syntactic Annotation Example

    Publicly available RTE corpora (RTE 1-3, and EVALITA 2009), annotated in ETAF at both annotation levels, are delivered together with the system to be used as first experimental datasets.

    The annotation of an entailment corpus with one of the annotation tools known by the system is done with the following command:

    edits -a -name-of-the-tool -o output-file input-file

    where "-a" indicates that EDITS must annotate a file. The "-name-of-the-tool" option indicates the annotation tool (e.g. "stanford-parser", "texpro" or "opennlp") used to perform the annotation. The "-o" option indicates the file where EDITS will store the result of the annotation. For example, the following command will annotate the RTE2 corpus with the Stanford Parser.

    edits -a -stanford-parser -o RTE2_dev-annotated.xml rte/RTE2_dev.xml

    ETAF annotated files can be visualised with the EDITS graphical interface. For example snapshots, see Section 8.

    4.2 Configuration

    The creation of an Entailment Engine is done by defining its basic components (algorithms, cost schemes, optimizer, and rules) through an XML configuration file.

    Figure 7: An Example of a Configuration File

    The configuration file in Figure 7 is divided into modules, each having a set of options. This configuration defines a distance Entailment Engine that combines Tree Edit Distance as a core distance algorithm, and the predefined IDF Cost Scheme that will be optimized on training data with the Particle Swarm Optimization algorithm (pso) as in (Mehdad, 2009). Adding external knowledge to an entailment engine can be done by extending the configuration file with a reference to a rules file (e.g. wordnet-rules.xml) as follows:

    The DTD of the format and more information about the configuration file appear in Section 7. Configuration files can be created, visualized and modified with the EDITS graphical interface. For example snapshots, see Section 8.

    4.3 Training and Test

    4.3.1 Training

    Given a configuration file and an RTE corpus annotated in ETAF, the user can run the training procedure to learn a model. This is done using the following command:

    edits -r -c configuration_file -sm model annotated_entailment_corpus

    where "-r" instructs the system to train a model using the entailment engine defined in the configuration file specified by the "-c" option on the annotated entailment corpus file or directory provided as input. The obtained model will be saved in the file specified by the "-sm" option. For example, to train a model on the RTE3 development dataset with the predefined configuration specified in the file share/configurations/conf2.xml, we use the following command:

    edits -r -c share/configurations/conf2.xml -sm RTE3dev-model-conf2 rte/etaf/morpho-syntax/RTE3dev.xml

    The output of the training phase is a model: a zip file that contains the learned threshold, the configuration file, the cost scheme, and the entailment/contradiction rules used to calculate the threshold. The explicit availability of all this information in the model allows users to share, replicate and modify experiments (see note 1).

    At the end of the training phase, a summary of the system's performance over the training set is printed on screen. This summary reports: (i) the distance model, including the distance threshold; (ii) the accuracy of the annotation of the whole training set; (iii) separate precision, recall and F-measure scores for the YES and NO training pairs. An example summary is shown in Figure 8.

    Calculated Threshold: 0.7948717948717948

    ******************************

    Accuracy: 0.61

    #################################################################
    # Examples # Precision # Recall # FMeasure # Confidence # Class #
    # 412      # 0.6244    # 0.6092 # 0.6167   # 0.1454     # YES   #
    # 388      # 0.5955    # 0.6108 # 0.6031   # 0.3313     # NO    #
    #################################################################

    ###################
    #     # YES # NO  #
    # YES # 251 # 161 #
    # NO  # 151 # 237 #
    ###################

    Figure 8: Example of Training Summary
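
    The figures in the summary can be recomputed from the confusion matrix alone. The Python sketch below assumes that rows are gold classes and columns are system predictions, which is consistent with the row totals (412 YES and 388 NO examples); the function names are hypothetical.

```python
from typing import Dict, Tuple

# Confusion matrix from Figure 8: matrix[gold][predicted] = number of pairs.
MATRIX: Dict[str, Dict[str, int]] = {
    "YES": {"YES": 251, "NO": 161},
    "NO": {"YES": 151, "NO": 237},
}


def accuracy(matrix: Dict[str, Dict[str, int]]) -> float:
    """Share of pairs whose predicted class equals the gold class."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def class_metrics(matrix: Dict[str, Dict[str, int]],
                  label: str) -> Tuple[float, float, float]:
    """Precision, recall and F-measure for one class."""
    true_positives = matrix[label][label]
    predicted = sum(row[label] for row in matrix.values())  # column total
    gold = sum(matrix[label].values())                      # row total
    precision = true_positives / predicted
    recall = true_positives / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


print(round(accuracy(MATRIX), 4))
for label in ("YES", "NO"):
    print(label, [round(value, 4) for value in class_metrics(MATRIX, label)])
```

    Rounded to four decimals, the output reproduces the Accuracy, Precision, Recall and FMeasure columns of Figure 8.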

    The training stage also allows users to tune performance along several dimensions (e.g. overall Accuracy, Precision/Recall trade-off on YES and/or NO entailment judgments). By default the system maximizes the overall accuracy (distinction between YES and NO pairs). To make such adjustments the user should modify the configuration file, adding the following option to the definition of the distance entailment engine:

    ......

    1. Our policy is to publish online the models we use for participation in the RTE Challenges. We encourage other users of EDITS to do the same, thus creating a collaborative environment that allows new users to quickly modify working configurations and replicate results.

    More information on the values of the option and other options of the distance entailment engine can be found in the HTML reference guide downloadable with the system.

    4.3.2 Test

    Given a model and an ETAF-annotated RTE corpus as input, the test procedure produces a file containing for each pair: i) the decision of the system (YES, NO), ii) the confidence of the decision, iii) the entailment score, iv) the sequence of edit operations made to calculate the entailment score. The command used to invoke the test procedure is the following:

    edits -e -m model -o edits_result entailment_corpus

    where "-e" instructs the system to load the entailment engine stored in the file specified by the "-m" option and to annotate the entailment relation for the pairs of the input file. EDITS will store the result for each entailment pair in the output file specified by the "-o" option. An example output is the following XML fragment:

    Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .

    Le Beau Serge was directed by Chabrol.

    Figure 9: Simple EDITS output

    The entailment relation found by EDITS is reported in the entailment attribute. If the pairs in the corpus already had an entailment relation assigned (as in training data or an annotated test set), the additional attribute benchmark is added, recording the original value. The user can also obtain the edit distance operations made by the system for each pair by controlling the verbosity of the output with the "-ot" option. This allows the user to produce the Extended Output (-ot=extended) shown in Figure 10 or the Full Output shown in Figure 11.

    Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .

    Le Beau Serge was directed by Chabrol.

    Figure 10: Extended Output

    Claude Chabrol (born June 24, 1930) is a French movie director and has become well-known in the 40 years since his first film, Le Beau Serge , for his chilling tales of murder, including Le Boucher .

    Le Beau Serge was directed by Chabrol.

    claudenClaudeclaude+pn-NP0

    ....

    bybyby+prep by+adv by+adj+zero-PRP

    ...

    chabrolnChabrol-NN1

    chabrolnChabrol-NN1

    Figure 11: Full output

    EDITS annotated files can be visualised with the EDITS graphical interface. For example snapshots, see Section 8.

    4.4 Combining Engines

    A relevant feature of EDITS is the possibility to combine multiple Entailment Engines into a combined entailment engine as shown in Figure 12. This can be done by grouping their definitions as sub-modules in the configuration file. EDITS allows users to define customized combination strategies, or to use two predefined combination modalities provided with the package, namely: i) Linear Combination, and ii) Classifier Combination. The two modalities combine in different ways the entailment scores produced by multiple independent engines, and return a final decision for each T-H pair.

    Figure 12: Combined Entailment Engine

    Linear Combination returns an overall entailment score as the weighted sum of the entailment scores returned by each engine:

    score_{combined}(T, H) = \sum_{i=1}^{n} weight_i \cdot score_i(T, H)

    In this formula, weight_i is an ad-hoc weight parameter for each entailment engine. Optimal weight parameters can be determined using the same optimization strategy used to optimize the cost schemes, as described in Sections 3.3 and 4.5.
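
    The weighted sum can be written directly in Python. This sketch only illustrates the arithmetic: the function name is hypothetical, and whether EDITS requires the weights to sum to 1 is not stated in the manual, so no normalization is applied here.

```python
from typing import Sequence


def linear_combination(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Overall score = sum over engines of weight_i * score_i."""
    if len(scores) != len(weights):
        raise ValueError("one weight is needed per engine score")
    return sum(weight * score for weight, score in zip(weights, scores))


# Two engines (for example one tree-based and one cosine-based) scored the same pair.
engine_scores = [0.2, 0.6]
engine_weights = [0.5, 0.5]
print(linear_combination(engine_scores, engine_weights))
```

    The combined score is then compared against a learned threshold in the same way as the score of a single engine.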

    Classifier Combination is based on using the entailment scores returned by each engine as features to train a classifier (see Figure 12). To this aim, EDITS provides a plug-in that uses the Weka machine learning workbench as a core. By default the plug-in uses an SVM classifier, but other Weka algorithms can be specified as options in the configuration file.

    The configuration file in Figure 13 describes a combination of two engines (i.e. one based on Tree Edit Distance, the other based on Cosine Similarity), used to train a classifier with Weka.

    Figure 13: Configuration file of Combined Entailment Engine

    A linear combination can be easily obtained by changing the alias of the highest-level module (weka) into linear.

    Entailment engines can be combined by merging their configurations using the EDITS graphical interface. For example snapshots, see Section 8.

    4.5 Optimization

    The goal of the optimization process is to change the values of certain parameters (optimizable parameters) of an entailment engine in order to make it perform better on the training set. The procedure is performed by modules called engine-optimizers. EDITS provides two such modules, "genetic" and "pso", as plug-ins. For both modules the optimization procedure uses some form of genetic search to find the optimal values. The optimizable parameters are specific to the optimized engine. The optimizable parameters of the simple entailment engine are the OP constants of the cost scheme (more information in Section 5.2.1). The optimizable parameters of the linear combination engine are the weights of each sub-engine. The Weka-based entailment engine cannot be optimized. Figure 14 shows the configuration file conf-optimize.xml (found in the share/configurations folder), which represents a simple entailment engine using an optimizable cost scheme (scheme-optimize.xml) that will be optimized when the training process is activated.

    Figure 14: Configuration file that represents an entailment engine that will be optimized during training

    The option "optimize-parameters" indicates to EDITS that it should use the engine-optimizer module "genetic" to tune the performance of the entailment engine. Figure 15 shows the result of the optimization process, with the new values of the optimizable parameters.

    bin/edits -r -c share/configurations/conf-optimize.xml rte/etaf/morpho-syntax/RTE2_dev.xml

    Parameter: OPinsertion Value: 0.7482828808106775
    Parameter: OPdeletion Value: 0.08147454355821715
    Parameter: OPsubstitution Value: 0.7209179287932125

    Calculated Threshold: 0.5636193641242588

    *******************************

    Accuracy: 0.61625

    #################################################################
    # Examples # Precision # Recall # FMeasure # Confidence # Class #
    # 400      # 0.5947    # 0.73   # 0.6554   # 0.2658     # YES   #
    # 400      # 0.6505    # 0.5025 # 0.567    # 0.2097     # NO    #
    #################################################################

    ###################
    #     # YES # NO  #
    # YES # 292 # 108 #
    # NO  # 199 # 201 #
    ###################

    Figure 15: Result of Optimization Process

    4.6 Automatic Generation of Cost Schemes

    EDITS provides a mechanism for quick experimentation with different distance algorithms by allowing the user to skip the specification of a cost scheme and still have a fully functional entailment engine. In this case EDITS automatically generates a cost scheme that is adapted to the algorithm and to the entailment rule resources defined in the configuration file. For example, the cost scheme in Figure 17 is automatically generated for the simple entailment engine that uses the token edit distance algorithm, with the configuration file shown in Figure 16.

    Figure 16: Simple Entailment Engine that uses Token Edit Distance Algorithm

    insertion: cost (* OPinsertion (size (words T)))

    deletion: cost (* OPdeletion (size (words H)))

    substitution, condition (equals (a.token A) (a.token B)): cost OPsubstitution1

    substitution: cost (* OPsubstitution2 (+ (size (words T)) (size (words H))))

    Figure 17: Automatically Generated Cost Scheme

    The generated cost scheme is also optimizable. Users are encouraged to start from automatically generated cost schemes as templates for developing future cost schemes.

    4.7 Automatic Generation of Configuration File

    EDITS allows the user to make quick experiments without providing a configuration file. For example, the following command:

    bin/edits -r rte/etaf/morpho-syntax/RTE2_dev.xml

    will automatically generate a configuration file, like the one shown in Figure 16, that describes a simple entailment engine using token edit distance as the distance algorithm and an automatically generated cost scheme.

    The following command will create a simple entailment engine with tree edit distance as the distance algorithm:

    bin/edits -r -tree rte/etaf/syntax/RTE2_dev.xml

    The following command will generate a configuration file that describes a simple entailment engine with tree edit distance as the distance algorithm and will optimize the automatically created cost scheme with the genetic entailment engine optimizer:

    bin/edits -r -op -genetic -tree rte/etaf/syntax/RTE2_dev.xml

    4.8 Experiment

    /hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_dev.xml

    /hardmnt/queneau0/tcc/kouylekov/edits/rte/etaf/morpho-syntax/RTE2_test.xml

    Figure 18: Example of Experiment File

    The experiment file allows the user to reduce the length of command lines and to quickly repeat the same experiment. An example of an experiment file is shown in Figure 18. This file defines an experiment in which EDITS will use the configuration defined in the file "share/configurations/conf1.xml" to train a model on the RTE2 development set and will test this model on the RTE2 test set. The user can specify in the experiment file all the options that handle the models created during training and the output of a test. The DTD of an experiment file is shown in Figure 19.

    configuration CDATA #IMPLIED

    model CDATA #IMPLIED
    configuration CDATA #IMPLIED
    output CDATA #IMPLIED
    output-type CDATA #IMPLIED
    use-memory CDATA #IMPLIED
    overwrite CDATA #IMPLIED
    add-data CDATA #IMPLIED

    >

    Figure 19: DTD of Experiment File

    New experiments can be easily created using the EDITS graphical interface. For example snapshots, see Section 8.


    5 Defining Cost Schemes for Edit Operations

    According to the distance-based approach, T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold. The underlying assumption is that pairs between which an entailment relation holds have a low cost of transformation. EDITS allows the user to define the cost of each edit operation carried out by the distance algorithm in order to find the best (i.e. least costly) sequence of edit operations that transforms T into H. The basic data structure in EDITS for the definition of costs is the cost scheme. One or more cost schemes can be associated with each edit operation, and they are collected in a cost scheme file that can be created by the user.

    A cost scheme is invoked by the edit distance algorithm with three parameters: (i) an edit operation, (ii) an element of T, called the source and referred to through the variable A, and (iii) an element of H, called the target and referred to through the variable B. Each cost scheme for a certain edit operation consists of three parts:

    1. Name - Every cost scheme must have a unique, user-defined name.

    2. Condition - A set (possibly empty) of constraints over the source and the target elements, which need to be satisfied in order to activate the cost scheme. Each constraint is expressed in a lisp-like syntax, and all constraints must be satisfied (i.e. they have to return true) in order for the cost scheme to be applied.

    3. Cost - A fixed value, or a function that returns a numerical value, expressing the cost of the edit operation applied to the source and to the target.

    name CDATA #REQUIRED
    type (string|number|boolean) #REQUIRED
    value CDATA #REQUIRED

    >

    Figure 20: XML Cost Scheme DTD.

    A cost function can take as parameters the source element, the target element, the text T, and the hypothesis H. EDITS adopts a combination of XML annotations and functional expressions to define the cost schemes. The XML Document Type Definition (DTD) of the cost scheme file is reported in Figure 20. A simple example of a pre-defined cost scheme file (simple-scheme.xml, introduced in Section 3.2) is shown in Figure 21.

    10

    10

    (equals (attribute "token" A) (attribute "token" B))
    0

    (not (equals (attribute "token" A) (attribute "token" B)))
    20

    Figure 21: Simple Cost Scheme

    This cost scheme applies to elements of T and H, referred to respectively as A and B, which are annotated as words (see ETAF in Section 3). The function (attribute "token" A) returns the token of the source element. Within the example, there are four edit operations (1 for insertion, 1 for deletion, and 2 for substitution), assigned different costs:

    insertion(B) = 10 - inserting an element B, no matter what B is, always costs 10.

    deletion(A) = 10 - deleting an element A, no matter what A is, always costs 10.

    substitution(A,B) = 0 if A = B - substituting A with B costs 0 if the token of A and the token of B are equal (i.e. they are the same string).

    substitution(A,B) = 20 if A != B - substituting A with B costs 20 if the tokens of A and B are not equal.
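    The costs listed above can be exercised with a standard dynamic-programming edit distance over tokens. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes a textbook Levenshtein-style recurrence rather than the actual EDITS implementation, which is not shown in this manual.

```python
from typing import Sequence

# Costs taken from the simple scheme in Figure 21.
INSERTION_COST = 10
DELETION_COST = 10


def substitution_cost(a: str, b: str) -> int:
    """Return 0 when the two tokens are the same string, 20 otherwise."""
    return 0 if a == b else 20


def token_edit_distance(t_tokens: Sequence[str], h_tokens: Sequence[str]) -> int:
    """Cheapest cost of transforming the tokens of T into the tokens of H."""
    rows, cols = len(t_tokens) + 1, len(h_tokens) + 1
    dist = [[0] * cols for _ in range(rows)]
    # Turning a prefix of T into an empty H requires only deletions.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + DELETION_COST
    # Turning an empty T into a prefix of H requires only insertions.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + INSERTION_COST
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + DELETION_COST,
                dist[i][j - 1] + INSERTION_COST,
                dist[i - 1][j - 1]
                + substitution_cost(t_tokens[i - 1], h_tokens[j - 1]),
            )
    return dist[-1][-1]
```

    Note that with these values a substitution of two different tokens (20) costs the same as a deletion followed by an insertion (10 + 10), so the two alternatives are interchangeable for the distance algorithm.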

    (and (equals (attribute "lemma" A) (attribute "lemma" B)) (equals (attribute "pos" A) (attribute "pos" B)))
    0

    Figure 22 shows a more complex example of a cost scheme for the substitution operation. In the example, a token from T is substituted by a token from H with a cost equal to 0 if their lemmas and parts of speech are equal.

    As shown in the previous examples, EDITS allows the cost of the edit operations to be defined by means of user-defined attributes. In the example in Figure 23, the cost scheme exploits the pre-computed frequency of a token to calculate the cost of insertion, according to the intuition that the most frequent words should have a lower cost of insertion.

    (not (null (attribute "freq" B)))
    (* (/ 1 (number (attribute "freq" B))) 20)

    Figure 23: Insertion Based on Frequency
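    The arithmetic of the frequency-based scheme can be checked with a few lines of Python. This is a sketch under stated assumptions: the word is modelled as a plain dictionary of attributes, the function name is hypothetical, and returning None stands for "the condition failed, so this scheme is not activated".

```python
from typing import Optional


def frequency_insertion_cost(word: dict) -> Optional[float]:
    """Mirror of the Figure 23 scheme: cost = (1 / freq) * 20."""
    freq = word.get("freq")
    if freq is None:
        # Condition (not (null (attribute "freq" B))) is not satisfied.
        return None
    # Attribute values are strings, hence the conversion (the "number" function).
    return (1 / float(freq)) * 20
```

    A word with frequency 4 therefore costs 5 to insert, while a word with frequency 20 costs 1. A frequency of 0 would cause a division by zero, so such values would need to be excluded by an additional condition.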

    Cost schemes can be easily created, modified and viewed using the EDITS graphical interface. For example snapshots, see Section 8.

    5.1 Data Types and Functions


    Conditions and costs are defined using a set of functions, expressed in a lisp-like syntax. Such functions can take as parameters the source A, the target B, the text T, and the hypothesis H. This means that all the information about T and H derived from their linguistic processing (e.g. part of speech, syntactic structure, etc.) is available for defining conditions and cost functions. As an example, typical constraints involve checking the token and the part of speech of A and B, while typical cost functions are computed considering the lexical similarity between A and B, possibly normalizing this value over the length of T and H.

    5.1.1 Data Types

    Basic elements for defining constraints and costs in a cost scheme are derived from the three representation levels defined in ETAF. EDITS provides functions for the most relevant elements defined in the ETAF linguistic representation. The arguments of such functions are the variables (i.e. A, B, T, H) which are instantiated within a specific cost scheme. Functions use the following primitive data types:

    1. Word - represents a token in T and H and is instantiated by the variables A and B in a cost scheme.

    2. Node - represents a tree element in T and H and is instantiated by the variables A and B in a cost scheme.

    3. Tree - represents a syntactic tree; it is obtained using the function tree.

    4. Number - a real number, for example: 0, 3.14 etc.

    5. Boolean - True or False.

    6. String - a sequence of characters, for example: "Dolomiti" or "Milen".

    7. List - a sequence of elements (words, nodes, numbers, booleans, etc.).

    8. Set - a group of elements of a given type (strings, numbers, or booleans) loaded from a file.

    9. Hash - an object that maps keys to values, loaded from a file. Keys are strings, while values are either strings, numbers, or booleans.

    10. Edge - represents an edge in the dependency tree of T and H and is instantiated by the variables A and B in a cost scheme.

    Sets and Hashes are objects that have to be read from an external file (i.e. they cannot be created inside a cost scheme or read from the entailment corpus). The format of a file containing a Hash is shown in Figure 24. The type is the data type of the values in the file, and each key is separated from its value by a tab. A fragment from the file "share/cost-scheme/idf.txt" containing the IDF of words is shown in Figure 25.

    type
    key-1 value-1
    ...
    key-n value-n

    Figure 24: Hash file format

    number
    speak 12.23
    ride 3.23
    ...
    read 10.32

    Figure 25: Hash file example
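    The Hash file layout described above (a type line followed by tab-separated key/value lines) is simple enough to parse by hand. The following Python sketch shows one way to read it; the function name and the conversion table are assumptions made for illustration and do not reflect how EDITS loads these files internally.

```python
def parse_hash(text: str) -> dict:
    """Parse a Hash file: first line is the value type, then key<TAB>value lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    value_type = lines[0].strip()
    converters = {
        "string": str,
        "number": float,
        "boolean": lambda raw: raw.strip().lower() == "true",
    }
    convert = converters[value_type]
    table = {}
    for line in lines[1:]:
        # The key is separated from the value by a single tab.
        key, raw_value = line.split("\t", 1)
        table[key] = convert(raw_value)
    return table
```

    Applied to the three complete entries visible in Figure 25, the function returns a dictionary mapping "speak", "ride" and "read" to their numeric values.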


    The format of a file containing a Set is shown in Figure 26. The type is the data type of the elements in the Set. A fragment from the file "share/cost-scheme/stopwords.txt" containing stop words is shown in Figure 27.

    type
    element-1
    ...
    element-n

    Figure 26: Set file format

    string
    speak
    ride
    ...
    read

    Figure 27: Set file example
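    A Set file can be read in the same spirit and then queried for membership, which is what the set-contains function does inside a cost scheme. The Python sketch below is only an illustration; the names are hypothetical and only the "string" and "number" types are handled.

```python
def parse_set(text: str) -> set:
    """Parse a Set file: first line is the element type, then one element per line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    element_type, elements = lines[0], lines[1:]
    if element_type == "number":
        return {float(element) for element in elements}
    return set(elements)


def set_contains(loaded_set: set, value) -> bool:
    """Membership test, analogous in spirit to (set-contains id value)."""
    return value in loaded_set
```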

    Hashes and Sets are defined as options in the configuration of the CostScheme module. They are accessed by referring to their ID, using the functions set-contains and hash-value described in the following section. The fragment in Figure 28 represents a simple definition of a hash and a set inside a configuration file.

    Figure 28: Definition of a Hash in configuration file

    5.1.2 Functions

    Functions over AnnotatedText

    (string AnnotatedText) - returns the text of the AnnotatedText (i.e. the text of T or H).

    (tree AnnotatedText) - returns the syntactic tree of the AnnotatedText.

    (words AnnotatedText) - returns the list of words in the AnnotatedText.

    Functions for accessing Entailment Rules

    (entail SimpleRulesObject1 SimpleRulesObject2) - checks for the existence of an entailment rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type. The allowed types are: String, Word and Node. If the rule exists, then the probability associated with the rule is returned, otherwise the output of the function is null.

    (contradict SimpleRulesObject1 SimpleRulesObject2) - checks for the existence of a contradiction rule (see Section 6) where the left hand side of the rule matches SimpleRulesObject1 and the right hand side of the rule matches SimpleRulesObject2. The two arguments must be of the same data type. The allowed types are: String, Word and Node. If the rule exists, then the probability associated with the rule is returned, otherwise the output of the function is null.

    Functions over Trees

    (nodes Tree) - returns the list of nodes of a tree.

    (parent Node Tree) - returns the parent of a node in the syntactic tree of T or H.

    (children Node Tree) - returns the children of a node in the syntactic tree of T or H.

    (from Edge) - returns the from node of an edge in the syntactic tree of T or H.

    (to Edge) - returns the to node of an edge in the syntactic tree of T or H.

    Functions over Nodes

    (word Node) - returns the word of the node.

    (label Node) - returns the label (i.e. the syntactic category) of the node.

    (edge Node) - returns the edge (i.e. the syntactic relation) entering the node.

    (is-label-node Node) - returns true if the node contains a label.

    (is-word-node Node) - returns true if the node contains a word.

    Functions over Words

    (attribute String Word) - returns the value of the attribute String of the word. If the attribute is missing, the function returns null.

    Functions with string arguments

    (equals String1 String2) - returns True if String1 is equal to String2.

    (equals-ignore-case String1 String2) - compares two strings ignoring their case.

    (capitalized String) - returns True if the string is capitalized.

    (starts-with String1 String2) - returns True if String1 starts with String2. For instance, (starts-with "reading" "read") is True.

    (ends-with String1 String2) - returns True if String1 ends with String2.

    (contains String1 String2) - returns True if String1 contains String2. For instance, (contains "reenacting" "act") returns True.

    (number String) - reads a number from String. For instance, (number "3.14") returns 3.14.

    (boolean String) - reads a boolean from String. The possible arguments are "true" and "false".

    (to-lower-case String) - converts String to lower case.

    (length String) - returns the number of characters in String.

    (char String Number) - returns the character in String at the position corresponding to Number.

    (substring String Number1 Number2) - returns the sub-string of String from the position corresponding to Number1 to the position Number2.

    (distance String1 String2 :normalize) - returns the Levenshtein distance between String1 and String2. If the :normalize parameter is present, the function returns a normalized distance (with respect to the length of the two arguments) between 0 and 1.
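    To make the behaviour of the distance function concrete, the following Python sketch computes the Levenshtein distance between two strings and an optional normalized variant. The manual does not state the exact normalization formula, so dividing by the length of the longer string is an assumption made here purely for illustration; it does keep the result between 0 and 1.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Unit-cost Levenshtein distance computed row by row."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete c1
                    current[j - 1] + 1,            # insert c2
                    previous[j - 1] + (c1 != c2),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def distance(s1: str, s2: str, normalize: bool = False) -> float:
    """Raw distance, or distance divided by the longer length (assumed normalization)."""
    raw = levenshtein(s1, s2)
    if not normalize:
        return raw
    longest = max(len(s1), len(s2))
    return raw / longest if longest else 0.0
```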

    Functions with numeric arguments

    (= Number1 Number2) - returns True if Number1 is equal to Number2.

    (< Number1 Number2) - returns True if Number1 is less than Number2.

    (> Number1 Number2) - returns True if Number1 is greater than Number2.

    (+ Number1 ... Numbern) - sums the numbers (example: (+ 1 2) is equal to 3).

    (- Number1 ... Numbern) - subtracts the remaining numbers from Number1 (example: (- 5 2 1) is equal to 2).

    (* Number1 ... Numbern) - multiplies the numbers (example: (* 2 2) is equal to 4).

    (/ Number1 ... Numbern) - divides Number1 by the remaining numbers (example: (/ 24 3 2) is equal to 4).

    Functions with boolean arguments

    (and Boolean1 ... Booleann) - returns True if all the arguments are True.

    (or Boolean1 ... Booleann) - returns True if at least one of the arguments is True.

    (not Boolean) - returns True if the argument is False, and False if it is True.

    Conditional Functions

    (if Boolean Object1 Object2) - if the Boolean is equal to True then the function returns Object1, otherwise Object2. If Object2 is not defined, the function returns null.

    Functions with list arguments

    (member List Object) - returns True if List contains Object. For example: the list (1 2 3) contains the number 1; the list ("sum" "plus" "minus") contains the string "plus".

    (size List) - returns the number of elements in List.

    (nth Number List) - returns the n-th (Number) element of List. The first element of a list is returned by (nth 1 list), the last with (nth (- (size list) 1) list).

    (subseq List Number1 Number2) - returns the sub-list of the list from the position corresponding to Number1 to the position Number2.

    Functions handling Hash and Set

    (hash-value String1 String2) - returns the value of the hash with id equal to String1 for the key String2.

    (set-contains String Object) - returns True if the set with id equal to String contains Object.

    Null handling functions

    (null Object) - returns True if the argument is null. For example, to express that a word A from T does not have an attribute "freq", the expression (null (attribute "freq" A)) can be used.
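    The way such lisp-like expressions combine can be illustrated with a very small evaluator. The Python sketch below is not the EDITS interpreter: it is a hypothetical toy that supports only a handful of the functions listed above, models words as dictionaries of attributes, and evaluates all arguments eagerly (including those of "if").

```python
import re
from functools import reduce

# A token is a quoted string, a parenthesis, or a bare symbol/number.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()]+')


def parse(source: str):
    """Turn an expression string into nested lists, floats, names and strings."""
    tokens = TOKEN.findall(source)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # skip the closing parenthesis
            return items
        if token.startswith('"'):
            return ("str", token[1:-1])  # string literal
        try:
            return float(token)
        except ValueError:
            return token  # variable or function name

    return read()


FUNCTIONS = {
    "attribute": lambda name, word: word.get(name),
    "equals": lambda a, b: a == b,
    "null": lambda value: value is None,
    "not": lambda value: not value,
    "and": lambda *values: all(values),
    "or": lambda *values: any(values),
    "number": float,
    "if": lambda test, then, otherwise=None: then if test else otherwise,
    "+": lambda *values: sum(values),
    "-": lambda first, *rest: first - sum(rest),
    "*": lambda *values: reduce(lambda a, b: a * b, values),
    "/": lambda first, *rest: reduce(lambda a, b: a / b, rest, first),
}


def evaluate(expression, env: dict):
    """Evaluate a parsed expression; env binds variables such as A and B."""
    if isinstance(expression, tuple):
        return expression[1]
    if isinstance(expression, float):
        return expression
    if isinstance(expression, str):
        return env[expression]
    name, *arguments = expression
    return FUNCTIONS[name](*(evaluate(argument, env) for argument in arguments))
```

    For instance, evaluating the condition of the simple scheme with A and B bound to two words that share the same token yields True, and the arithmetic examples given in the function list ((- 5 2 1) and (/ 24 3 2)) evaluate to 2 and 4.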

    5.2 Using constants

    COST

    COST

    Figure 29: Example of Cost Scheme constants.

    Constants are used in a cost scheme to externalize values that can be used by more than one operation of the cost scheme. In the XML scheme in Figure 29, the constant "COST" is used as the cost of both the insertion and substitution operations. Each constant must have a type that defines how the cost scheme interprets its value. The possible types are "string", "number" and "boolean". EDITS provides two constants, T and H, that the user can use inside cost schemes. These constants hold the annotation of T and H, respectively. For example, to set the cost of deletion to the number of words in T, the user must use the following XML fragment:

    (size (words T))

    5.2.1 Optimizing Cost Schemes

    Constants also play an important role in optimizing the cost scheme. Constants whose names start with the capital letters OP are considered by the system as parameters of the cost scheme that can be optimized. A cost scheme is optimizable if it contains at least one such parameter. An example of a cost scheme with optimizable constants is the fragment in Figure 30 (which can be found in share/cost-schemes/optimize-scheme.xml). In this scheme, the costs of insertion, deletion, and substitution (when A and B are not equal) are optimized.

    OPinsertion

    OPdeletion

    (equals (attribute "token" A) (attribute "token" B))
    0

    (not (equals (attribute "token" A) (attribute "token" B)))
    OPsubstitution

    Figure 30: Optimizable Cost Scheme
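    The naming convention for optimizable constants can be illustrated as follows. This Python sketch selects the constants whose names start with OP and tunes them with an exhaustive grid search; the grid search is a deliberately simple stand-in chosen for this example and is not the genetic optimizer that EDITS provides. All names, candidate values and the scoring function are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def optimizable_names(constants: Dict[str, float]) -> List[str]:
    """Constants whose names start with the capital letters OP."""
    return sorted(name for name in constants if name.startswith("OP"))


def grid_search(
    constants: Dict[str, float],
    candidates: Sequence[float],
    score: Callable[[Dict[str, float]], float],
) -> Dict[str, float]:
    """Try every combination of candidate values for the OP constants."""
    names = optimizable_names(constants)
    best, best_score = dict(constants), score(constants)
    for values in product(candidates, repeat=len(names)):
        trial = dict(constants)
        trial.update(zip(names, values))
        trial_score = score(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
    return best
```

    In practice the scoring function would train and evaluate an entailment engine with the trial values (for example, returning its accuracy on a development set); constants without the OP prefix are left untouched.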

    5.3 Matching Edges

    From version 2.1, all algorithms that process tokens (the only algorithm that works on trees is tree edit distance) have been adapted to work with the edges of the syntactic tree. To enable this, the user must set the match-edges option either in the configuration file or on the command line. For certain algorithms, such as token edit distance or longest common subsequence, this is not semantically motivated, as the edges do not have a predefined order.

    bin/edits -train -overlap -match-edges


    6 Defining rules in EDITS

    EDITS allows the use of sets of rules, both entailment rules and contradiction rules, in order to provide specific knowledge (e.g. lexical, syntactic, semantic) about transformations between T and H. Rules can be manually created, or they can be extracted from any available resource (e.g. WordNet, Wikipedia, DIRT) and stored in XML files which are called Rule Repositories. Each rule in EDITS consists of five parts:

    1. Name - a unique identifier of the rule within a certain rule repository. This is used for logging purposes only, in order to help the user understand which rules have been applied by the system for a certain pair. If not provided by the user, the rule name is automatically generated by the system.

    2. Type (entailment) - specifies the type of rule: entailment or contradiction.

    3. t - a text T, i.e. the left hand side of the rule.

    4. h - a hypothesis H, i.e. the right hand side of the rule.

    5. Probability - a probability that the rule maintains either the entailment or the contradiction between T and H. In both entailment and contradiction rules, a probability equal to 0 means that the relation between T and H is unknown, while a probability equal to 1 means that the entailment/contradiction between T and H is fully preserved.

    6.1 Rule format

    Both T and H can be defined using the Edits Text Annotation Format (ETAF) described in Section 4. ETAF allows text portions to be represented at three different levels of annotation: just as strings (i.e. the STRING object), as sequences of tokens with their morpho-syntactic features (i.e. the WORD object), and as syntactic trees (i.e. the NODE object). Rules in EDITS can be defined using any of the three data types, provided that the same type is used consistently in T and H, i.e. either STRING, or WORD, or NODE. The XML Document Type Definition (DTD) of the rules file is reported in Figure 30.

    name CDATA #IMPLIED
    from CDATA #REQUIRED
    to CDATA #REQUIRED
    >

    >

    name CDATA #REQUIRED
    start CDATA #IMPLIED
    end CDATA #IMPLIED
    source CDATA #IMPLIED
    >

    Figure 30: DTD of the rule file

    In the current release of EDITS, only rules that contain just one element in both t and h (i.e. lexical rules) are allowed.

    6.1.1 Entailment Rules

    Entailment rules preserve, with some degree of confidence, the entailment relation between T and H. The following are examples of entailment rules at the different levels allowed by ETAF.

    invented

    pioneered

    1.0

    Figure 31: String Entailment Rule

    The string entailment rule in Figure 31 states that the word "invented" entails the word "pioneered" with a probability equal to 1.0.

    invent v

    pioneer v

    1.0

    Figure 32: Morpho-Syntactic Entailment Rule

    The entailment rule in Figure 32 states that the lemma "invent" entails the lemma "pioneer" with a probability equal to 1.0.


    dobj

    home

    dobj

    habitation

    1.0

    Figure 33: Syntactic Entailment Rule

    The entailment rule in Figure 33 states that the node "home" in the dependency relation of direct object with its syntactic head (e.g. "John bought a home", where "home" is the direct object of the verb "buy") entails the node "habitation" in the dependency relation of direct object with its syntactic head (e.g. "John bought a habitation", where "habitation" is the direct object of the verb "buy") with a probability equal to 1.0.

    6.1.2 Contradiction rules

    Contradiction rules represent, with some degree of confidence, the semantic incompatibility between T and H. The following are examples of contradiction rules at the different levels allowed by ETAF.

    beautiful

    ugly

    1.0

    Figure 34: String Contradiction Rule

    The contradiction rule in Figure 34 states that the string "beautiful" contradicts the string "ugly" with a probability equal to 1.0.

    extend v

    shorten v

    1.0

    Figure 35: Morpho-Syntactic Contradiction Rule

    The contradiction rule in Figure 35 states that the lemma "extend" contradicts the lemma "shorten" with a probability equal to 1.0.

    amod

    white

    amod

    black

    1.0

    Figure 36: Syntactic Contradiction Rule

    The contradiction rule in Figure 36 states that the node "white" in the dependency relation of adjectival modifier with its syntactic head (e.g. "Mary wears a white T-shirt", where "white" is the adjective modifying the noun "T-shirt") contradicts the node "black" in the dependency relation of adjectival modifier with its syntactic head (e.g. "Mary wears a black T-shirt", where "black" is the adjective modifying the noun "T-shirt") with a probability equal to 1.0.

    6.2 Rules repository

    In order to be used by EDITS, both entailment and contradiction rules have to be stored in a rule repository. EDITS allows the user to declare and use multiple XML rule files as sets of entailment or contradiction rules that can be referred to using user-defined identifiers.

    As an example, the declaration below defines two rule repositories, each containing one rule file, called entailment-rules-wordnet and contradiction-rules-wordnet respectively. An example of an entailment rules repository configuration can be found in Figure 37.


    Figure 37: Entailment Rules Repository Configuration

    6.3 Rule activation

    The basic way of activating a rule is through one of the two functions, entail and contradict, which can be used within a cost scheme (see Section 5). The two functions check for the existence of an entailment or a contradiction rule between the values assumed by A and B in a cost scheme. If a rule exists in the specified repository which matches both A and B, then the probability associated with the rule is returned, otherwise null. The two functions accept four parameters:

    1. The first two parameters, X and Y, are portions of T and H managed by the distance algorithm and the cost scheme.

    2. The name of a set of rules in the rules repository where the search has to be carried out. This parameter is optional.

    3. The search modality. Two search modalities are allowed: First, which selects the first rule that matches the X and Y parameters; Max, which selects the rule that matches X and Y with the highest probability.

    As an example, the following call to the entail function:

    (entail X Y "wordnet-entailment" :max)

    searches for the rule with the highest probability among those that are activated by the X and Y parameters and that are contained in a rules repository called entailment-rules-wordnet.
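    The difference between the two search modalities can be sketched in Python as follows. The Rule class, the in-memory list used as a repository, and the string-equality matching are all simplifying assumptions made for this illustration; they do not describe how EDITS stores or matches rules internally.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Rule:
    t: str              # left hand side of the rule
    h: str              # right hand side of the rule
    probability: float


def entail(x: str, y: str, rules: Sequence[Rule], mode: str = "max") -> Optional[float]:
    """Return the probability of a matching rule, or None (null) if none matches."""
    matches = [rule.probability for rule in rules if rule.t == x and rule.h == y]
    if not matches:
        return None
    # "first" keeps repository order; "max" picks the highest probability.
    return matches[0] if mode == "first" else max(matches)
```

    With a repository that contains two rules for the same pair, the First modality returns the probability of whichever rule appears first, while Max returns the larger of the two.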

    A rule is activated when the X parameter of the entail/contradict function matches the T part of the rule, i.e. the left hand side, and the Y parameter of the function matches the H part of the rule, i.e. the right hand side. All the elements of the X/Y argument have to match all the elements of the rule. If the rule contains variables, their assignments to the corresponding elements of the X/Y argument need to be satisfied.

    The entail and contradict functions are called in cost schemes, typically in the cost scheme defined for the substitution edit operation. Figure 38 shows a substitution scheme that calculates the cost of substituting A with B as the complement (1 - p) of the probability p of the entailment rule between A and B in the repository called entailment-rules-wordnet.

    (- 1 (entail A B "wordnet-entailment" :max))

    Figure 38: Substitution using Entailment Function
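    The arithmetic of this substitution scheme is shown in the short Python sketch below. The rule repository is modelled as a dictionary keyed by (source, target) pairs, and the fallback cost used when no rule matches is an assumption introduced for this example: the manual does not specify what the expression yields when entail returns null.

```python
from typing import Dict, Tuple


def substitution_cost(
    a: str,
    b: str,
    rule_probabilities: Dict[Tuple[str, str], float],
    fallback: float = 1.0,
) -> float:
    """Cost of substituting a with b: 1 minus the rule probability."""
    probability = rule_probabilities.get((a, b))
    if probability is None:
        return fallback  # assumed behaviour when no rule matches
    return 1 - probability
```

    A rule with probability 1.0 makes the substitution free, whereas lower probabilities make it progressively more expensive.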


    7 EDITS Configuration File

    The purpose of a configuration file is to define the three basic modules (i.e. distance algorithm, cost scheme and rule repositories), and their corresponding parameters, that will actually be used while running EDITS on a certain dataset. Only modules defined in an EDITS Configuration File (ECF) can be used for training and testing with the command bin/edits (see Sections 4.2 and 4.3).

    A module may require that another module is defined in order to work. Such dependencies are expressed in a configuration file through nested modules. The whole EDITS configuration is itself considered a module, the most global one, called the entailment engine, which requires three nested modules, respectively for the distance algorithm, the cost scheme and the rules repository.

    The XML Document Type Definition (DTD) of the configuration file is reported in Figure 39.

    type CDATA #IMPLIED
    id CDATA #IMPLIED
    alias CDATA #IMPLIED
    className CDATA #IMPLIED
    >

    name CDATA #REQUIRED
    id CDATA #IMPLIED
    value CDATA #IMPLIED
    >

    idref CDATA #REQUIRED
    >

    Figure 39: DTD of the configuration file

    7.1 Module Configuration

    Modules are defined by the following pieces of information:

    alias - an internal identifier known to the system. All modules with their internal identifiers are listed in the HTML reference;

    className - a path to the Java class of the module, referring to the code that will be executed when the module is activated;

    id - a unique identifier for the module, assigned by the user;

    type - indicates the category of the module being defined. Accepted values for the type attribute are entailment-engine, distance-algorithm, cost-scheme, rules-repository etc. (see the HTML reference);

    option - sets the options of the module.

    Examples of configuration files can be found in the share/configuration folder.

    7.2 Usage of Constants


    EDITS allows option values that are frequently used in the configuration file to be referred to through constants, declared at the beginning of the configuration file. For example, the configuration in Figure 40 uses the ${DATA_PATH} variable to indicate the access path to cost scheme resources.

    /home/epack/edits

    Figure 40: Example of variable in the configuration file.

    All configurations can access the path of the EDITS installation using the constant "EDITS_PATH".
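    The effect of such ${NAME} references can be illustrated with Python's standard string.Template class, which uses the same placeholder syntax. This is only a sketch of the substitution idea: how EDITS itself resolves constants is not described here, and the helper name and the example option value are hypothetical.

```python
from string import Template
from typing import Dict


def expand_constants(option_value: str, constants: Dict[str, str]) -> str:
    """Replace every ${NAME} in an option value with the constant's value."""
    # substitute() raises KeyError if a referenced constant is not declared.
    return Template(option_value).substitute(constants)
```

    With DATA_PATH set to /home/epack/edits as in Figure 40, the value ${DATA_PATH}/share/cost-schemes/optimize-scheme.xml expands to /home/epack/edits/share/cost-schemes/optimize-scheme.xml.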


    8 EDITS Graphical Interface

    This section contains several snapshots of the EDITS graphical interface. The graphical interface is started with the command:

    edits -g

    The graphical interface is a simple editor for configurations, cost schemes and experiments. It can also display entailment corpora, EDITS output and EDITS models. The interface presents a desktop in which different views are opened as windows. The user can copy and cut objects from one window and paste them into another window of the same type.

    Figure 41: New Configuration Simple Engine

    The interface in Figure 41 shows a dialog for creating a new entailment engine. The user must select one algorithm to create a configuration for a simple entailment engine, or more than one for a combined entailment engine. In the latter case, a combination strategy must also be selected, as demonstrated in Figure 42.

    Figure 42: New Configuration Combined Engine


    Figure 43: Configuration Editor - Conf1.xml

    EDITS provides an interface for editing a configuration file, as presented in Figure 43. The configuration file is represented as a tree. The user can interact with it using the context menu that appears when right-clicking a node of the tree.


    Figure 44: Cost Scheme Editor - Simple Cost Scheme

    EDITS provides an interface for creating and editing a cost scheme file, as presented in Figure 44. The cost scheme file is represented as a tree. The user can interact with it using the context menu that appears when right-clicking a node of the tree.


    Figure 45: ETAF Corpus Representation

    EDITS provides an interface for browsing entailment corpora, as presented in Figure 45. The files are represented as a table containing the entailment pairs in the rows and their contents in the columns. The user can view the annotation of each pair by clicking the view button. The morpho-syntactic annotation of a pair is presented as a table in Figure 46. The syntactic annotation of a pair is presented as a tree in Figure 47.


    Figure 46: ETAF Morpho-Syntactic Annotation of an Entailment Pair


    Figure 47: ETAF Syntactic Annotation of a Pair

    EDITS provides an interface for browsing EDITS output, as presented in Figure 48. The table is similar to the interface for browsing entailment corpora. New columns are added to represent the additional information (score, confidence, etc.) that comes with the EDITS output. The view button provides a simple viewer, shown in Figure 49, for the edit operations log attached to each pair.


    Figure 48: EDITS output


    Figure 49: EDIT Operations

    EDITS provides an interface for creating and editing an experiment file, as presented in Figure 50. The interface allows the user to change the basic elements of an experiment and to execute the experiment to obtain a result, as shown in Figure 51.


    Figure 50: Experiment interface


    Figure 51: Experiment Result

    EDITS provides an interface for viewing the contents of an EDITS model, as presented in Figure 52.


    Figure 52: EDITS Model View

    References

    Ido Dagan, Oren Glickman (2004), Probabilistic Textual Entailment: Generic Applied Modeling of Language Variability, in Proceedings of the PASCAL Workshop of Learning Methods for Text Understanding and Mining, Grenoble, France.

    Ido Dagan, Oren Glickman, Bernardo Magnini (2005), The PASCAL Recognizing Textual Entailment Challenge, in Proceedings of the First PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, U.K., 11-13 April.

    Dan Klein, Christopher D. Manning (2003), Fast Exact Inference with a Factored Model for Natural Language Parsing, in Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.

    Milen Kouylekov, Bernardo Magnini (2005), Tree Edit Distance for Recognizing Textual Entailment, in Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 21-23 September.

    Vladimir I. Levenshtein (1965), Binary codes capable of correcting deletions, insertions, and reversals, in Doklady Akademii Nauk SSSR, 163(4), pp. 845-848.

    Emanuele Pianta, Christian Girardi, Roberto Zanoli (2008), The TextPro tool suite, in Proceedings of LREC, 6th edition of the Language Resources and Evaluation Conference, Marrakech, Morocco, 28-30 May.

    Kaizhong Zhang, Dennis Shasha (1990), Fast Algorithm for the Unit Cost Editing Distance Between Trees, in Journal of Algorithms, vol. 11, December 1990.

    Yashar Mehdad (2009), Automatic Cost Estimation for Tree Edit Distance Using Particle Swarm Optimization, in Proceedings of ACL-IJCNLP 2009.

    Matteo Negri, Milen Kouylekov (2009), Question Answering over Structured Data: an Entailment-Based Approach to Question Analysis, in Proceedings of RANLP-2009.

    Yashar Mehdad, Matteo Negri, Elena Cabrio, Milen Kouylekov, Bernardo Magnini (2009), Recognizing Textual Entailment for English EDITS @ TAC 2009, to appear in Proceedings of TAC 2009.

    Chin-Yew Lin (2004), ROUGE: A Package for Automatic Evaluation of Summaries, in Workshop on Text Summarization Branches Out, ACL 2004.

