
Expressive Ontology Learning as Neural Machine Translation

Giulio Petrucci a,b,∗, Marco Rospocher a, Chiara Ghidini a

a FBK, Via Sommarive 14, 38123, Trento, Italia. b University of Trento, Via Sommarive 14, 38123, Trento, Italia.

Abstract

Automated ontology learning from unstructured textual sources has been proposed in the literature as a way to support the difficult and time-consuming task of knowledge modeling for semantic applications. In this paper we propose a system, based on a neural network in the encoder-decoder configuration, that translates natural language definitions into Description Logics formulæ through syntactic transformation. The model has been evaluated to assess its capacity to generalize over different syntactic structures, tolerate unknown words, and improve its performance when the training set is enriched with new annotated examples. The results obtained in our evaluation show how approaching the ontology learning problem as a neural machine translation task can be a valid way to tackle long-term expressive ontology learning challenges such as language variability, domain independence, and high engineering costs.

1. Introduction

The task of encoding human knowledge into a formal representation, e.g. into an ontology, is a pivotal aspect in the development of Semantic Web based applications. Well-established methodologies heavily rely on a number of manual activities, performed by knowledge engineers or ontologists, tailored to obtain significant knowledge on the domain at hand. Examples of these activities are interviews with domain experts and/or the selection of knowledge from existing, often textual, sources such as technical specifications, glossaries, and encyclopaedic entries. Such knowledge is then manually encoded into a formal representation such as a set of OWL axioms.

Manually encoding knowledge can be extremely costly and time-consuming, especially for applications where knowledge sources are continuously increasing in volume and variety, thus contributing to the well-known Knowledge Acquisition Bottleneck [1]. In order to relieve part of the burden from human operators, the ontology engineering community has been pursuing the goal of automatically acquiring formal knowledge from unstructured text.

∗ Corresponding author.
Email addresses: [email protected] (Giulio Petrucci), [email protected] (Marco Rospocher), [email protected] (Chiara Ghidini)

During the first decade of the 2000s, several efforts aimed at investigating this area, exploiting the increasing resources and tools available in the field of Natural Language Processing (NLP). The outcomes of these efforts are nicely summarised by Völker et al. in [2]:

"state-of-the-art in lexical ontology learning is able to generate ontologies that are largely informal or lightweight ontologies in the sense that they are limited in their expressiveness"

The most notable exception towards the automatic extraction of highly expressive knowledge and complex axioms was provided by LExO [2]. This tool applies catalogues of hand-crafted rules on top of the output of a statistical Natural Language Processing toolkit in order to syntactically transform a textual sentence into a Description Logics formula.

After a few years of slowdown, some recent works have started to emerge, aiming at the automatic extraction of highly expressive knowledge from text based entirely on machine learning approaches (see [3, 4]). Among them, the one proposed in [3] exploits a combination of two Recurrent Neural Networks to translate natural language definitions into Description Logics axioms, using only raw text as input features. While limited in terms of the syntactic structure of the sentences it was able to handle, the work in [3] showed that a Recurrent Neural Network based approach is capable of handling the basic syntactic structures of definitory sentences.

Building on such experience, this paper proposes an original usage of a recently proposed architecture [5] to increase the ability to deal with unknown input words and with definitory sentences having arbitrarily complex syntactic structures. In detail, the paper provides:

• a novel system that exploits one single neural network for the automatic translation of natural language definitions into Description Logics axioms. Using the so-called pointer network (see [6, 5]), our architecture has the ability to copy input words as extralogical symbols of the output formula: this makes the system extremely robust with respect to unknown input words. Moreover, the output vocabulary of the target language only consists of the logical symbols, since all the extralogical ones can simply be copied from the input. This makes the system extremely flexible in handling the syntactic structure of the input sentence.

• An extensive evaluation of the proposed architecture. In detail, we evaluate the ability of our system to: (i) correctly handle the grammatical structures of definitory language, (ii) tolerate unknown words, and (iii) generalize over unseen examples. For the evaluation we rely on three metrics which are intended to measure the correctness of the extracted formula and the effort that a knowledge engineer should spend in correcting (any) inaccurate formula produced by the system.

• A number of datasets which have been used to train and evaluate our architecture under different settings. Some of these datasets were built as bootstrap data for our system and partially address the lack of commonly accepted, large-size datasets for this task. In fact, they can act as a valuable starting point for building more accurate, or domain-dependent, training sets, e.g. by enriching them with more and more real-world sentence-formula pairs. Together with these synthetically generated datasets, we built a manually curated dataset comprising 500 sentences and their corresponding axioms. Part of this dataset has been used to extend the training set made of bootstrap data, while the remaining part has been used as a test set to assess the actual improvement of the model on real-world examples.

The main advantages of our system with respect to state-of-the-art ones are: (i) it is an entirely machine learning based system which does not require hand-crafted rules; (ii) it has a high tolerance towards unknown words, thus paving the way to a domain-independent tool; and (iii) it is able to handle complex syntactic constructs typical of definitory sentences and to further extend the complexity of the syntactic constructs it covers just by extending the training set.

The paper is structured as follows. In Sec. 2 we describe our approach in detail, and provide details of the source and target languages we consider in our work, while in Sec. 3 we give a functional description of the neural network model we used. Sec. 4 contains an in-depth description of all the datasets that we built to train and evaluate the model. The evaluation is detailed in Sec. 5, while Sec. 6 reports the experimental settings and the results. In Sec. 7, we discuss other approaches present in the literature and their differences with respect to the work we are proposing in this paper. Finally, Sec. 8 concludes the paper and outlines some directions for future work.

2. Approaching Ontology Learning as a Neural Machine Translation Task

In our investigation, we designed and evaluated an approach that is capable of turning a natural language definition into a logical formula through a syntactic transformation (see also [2]). We look at this task as a particular type of Machine Translation task, in which the source language is a particular subset of natural language (see 2.1) and the target language is a logical language, namely the Description Logics language ALCQ (see 2.2). The translation from a sentence expressed in the source natural language into a formula in the target logical language is performed by a single neural network architecture by means of a syntactic transformation (see 2.3).

2.1. The source language

We identify our source language as definitory language, intuitively denoting with this expression the set of sentences in plain English that can be used by a human to express the characteristics of a species, i.e. to give an intensional characterization of a set of entities.

Investigating the literature for a theory of definition, e.g. [7] and related works, we found that many shared structural features of a definition can be traced back to the "Organon" by Aristotle, and especially to the sixth book of the "Topics", see [8]. Following such theory, scholastic logicians came to the formulation that, for any species, "the definition is given by the closer genus and the specific difference",1 or, shortly, by genus and differentia. As an example, let us consider the well-known Aristotelian statement according to which:

A human being is an animal that has the capacity to reason. (1)

In this statement, the cluster of words "human being" acts as the definiendum, i.e. the textual surface identification of the concept we are going to define. The word "animal" identifies the genus proximum, i.e. the species that is close to the one we are defining and such that everything that can be predicated of an individual of that species can also be predicated of any individual of the species we are defining. Namely, the genus proximum can be considered the closest hypernym of the definiendum. The final part of the definition is the differentia specifica, i.e. the description of some peculiar characteristics of the definiendum that differentiate it from the genus. In our example, such duty is fulfilled by the cluster of words "has the capacity to reason." The authors of [7], tackling a problem of hypernym extraction, consider as proper definitions all the sentences following the Aristotelian scheme where at least the text surface realization of the genus is not empty.

Thus, according to [7], a definition representing a pure taxonomical relation—like "cars are vehicles"—is considered a valid definition. Conversely, a sentence defining a species with an empty genus—like "cars have four wheels"—is not considered a proper definition. To the extent of this work, we deviate from [7] and consider the latter type of sentences valid definitions too. Indeed, such sentences provide useful characterizations of the definiendum, even without expressing its genus explicitly, that can be of interest for an ontological representation.

1 "definitio fit per genus proximum et differentiam specificam."

We call all the sentences of interest for our work definitory sentences, since they define a salient feature of a set of entities. Consequently, the definitory language is the particular subset of the whole natural language that comprises all the definitory sentences. From now on, we will just use the word sentence to indicate a statement in our definitory language. This constrains the input language to all the sentences that express the definition of a concept, but this constraint does not affect the syntactic structure of the input sentence: in other words, the language we process is not a controlled language.

2.2. The target language

We set the Description Logics language ALCQ as our target language. This choice can be motivated in terms of different theoretical and practical reasons. The reason why we aim at a Description Logics language is that these are the standard logic languages behind OWL-DL, the computable and tractable fragment of OWL2, which, in turn, is the de facto ontological language standard for the Semantic Web community (for an introduction to OWL, see [9]). The Description Logics language on which OWL-DL is built is SHOIN(D). The expressiveness of this language goes way beyond the definition of concepts, which is what can be expressed by our source language. As a consequence, we pruned all those constructs concerning roles and not directly related to the definition of concepts. Namely, we removed all the constructs related to nominals and individuals, since they do not fall within the scope of the present work, and the local reflexivity construct, which expresses knowledge more related to relations than to concepts. The result of this pruning activity is indeed the ALCQ language.

Hereafter, when using the term formula we will refer to a well-formed ALCQ formula expressing a concept definition.

2.3. The translation process

Our approach relies on a single neural network, falling in the so-called Seq2Seq (sequence-to-sequence) category, in which a sequence of symbols is processed in order to produce another sequence of symbols. More in detail, our model implements a recurrent encoder-decoder scheme (see [10]). This network is made of two main components stacked on top of each other. The first one is called the encoder and processes the input sequence of words in natural language, namely a sentence, building a latent representation of this sequence as a whole. The second one is called the decoder. It sits on top of the encoder, processes the representation of the input sequence built by the encoder, and emits a sequence of logical symbols, namely a formula, as its output. The main advantage of this architecture is that it is capable of handling input and output sequences that are structurally decoupled from each other. The network model is described in detail in Sec. 3.

2 https://www.w3.org/TR/2012/REC-owl2-syntax-20121211/

As an example, let us consider the following definition:

A bee is an insect that produces honey. (2)

We can encode the description of a bee and its main characteristics in the following Description Logics formula:

bee ⊑ insect ⊓ ∃produces.honey . (3)

Our system should be able to accept (2) as its input and produce (3) as its output. We say that (3) is a syntactic transformation of (2) if all the extralogical symbols in (3) are also words present in (2). To ensure that the output sequence is the result of a purely syntactic transformation of the input sequence, we set up our system to work in a quite extreme scenario in which only the logical symbols of the target logical language are encoded in the output vocabulary of the network. Thus, the network ends up containing no extralogical symbols at all. As a consequence, when emitting the j-th symbol of the output formula, our system will have basically two options:

• to select a logical symbol from the output vocabulary: this produces the output symbols ⊑, ⊓, ∃ and . in (3);

• to select an extralogical symbol, i.e. a word from the input sentence, so that the words bee, insect, produces and honey are copied from (2) to (3).

We refer to this setting as the quasi-zero-vocabulary setting, to indicate the absence of extralogical symbols from the output vocabulary. Since the output vocabulary contains only the few logical symbols of the target language, henceforth we call it the shortlist. The translation process we have described is graphically depicted in Fig. 1, where the position of a word in the sentence is denoted with the prefix #, the copy() function indicates that the network is copying from the input at some position, and the emit() function indicates that the network is emitting the logical symbol given as its argument.

Note that the generation of the output symbols is not structurally dependent on the input sentence. Namely, the decision of which output symbol must be produced next depends on the input sentence as a whole and on the output symbols predicted so far. As described in more detail in Section 3, the network first reads the whole sentence in the so-called encoding phase, and then starts to produce the output symbols. The approach we follow enables the system to go beyond a 1-to-1 translation of input symbols into output symbols. Indeed, input symbols are individually taken into account only when deciding the next output symbol amounts to deciding which extralogical symbol has to be chosen: in this case, the network scans the input sentence in order to pick the most suitable one.

This architecture, a single encoder-decoder with the capability of copying symbols from the input sequence, has been proposed in [5] for a machine translation task, to handle rare or unknown words, named entities, and so on. Because of the quasi-zero-vocabulary setting we adopt in our work, we do not need any specific care when handling unknown words, since all the words but the logical symbols are considered unknown by the decoder.
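To make the copy-versus-emit behaviour concrete, the following minimal Python sketch resolves a sequence of predicted indices against a toy shortlist and the example sentence (2); the shortlist ordering, the zero-based positions, and the prediction sequence are illustrative assumptions, not the paper's actual encoding.

```python
# Illustrative sketch (not the authors' code): resolving predictions in the
# quasi-zero-vocabulary setting. The shortlist holds only logical symbols;
# every other symbol is copied from the input sentence.
SHORTLIST = ["⊑", "⊓", "∃", "."]  # hypothetical shortlist of logical symbols

def resolve_symbol(pred_index, sentence_tokens):
    """Map a predicted index to a logical symbol or to a copied input word."""
    if pred_index < len(SHORTLIST):
        return SHORTLIST[pred_index]                      # emit a logical symbol
    return sentence_tokens[pred_index - len(SHORTLIST)]   # copy an input word

sentence = "A bee is an insect that produces honey .".split()
# Hypothetical prediction sequence reproducing formula (3), zero-based positions:
predictions = [4 + 1, 0, 4 + 4, 1, 2, 4 + 6, 3, 4 + 7]
print([resolve_symbol(p, sentence) for p in predictions])
# ['bee', '⊑', 'insect', '⊓', '∃', 'produces', '.', 'honey']
```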

3. Network Description

The neural network model we use to tackle our ontology learning problem is based on the one proposed in [5] for handling rare and unknown words in a neural machine translation task. The model is illustrated in Fig. 2 and can be thought of as a complex system in which different components mutually interact. For each component of the overall architecture, we give an intuitive description of the role it has in the whole translation process, we describe the mathematical model in detail, and we list all the learned parameters. The global parameter set for the whole network is the union of all these parameter sets.


Figure 1: From language to logical form in the quasi-zero-vocabulary setting. The decoder turns "A bee is an insect that produces honey." into bee ⊑ insect ⊓ ∃produces.honey via the sequence copy(#2), emit(⊑), copy(#5), emit(⊓), emit(∃), copy(#7), emit(.), copy(#8).

3.1. Terminology and notation

We indicate vectors with bold lowercase letters, e.g. x. When writing a vector explicitly in its components, we use square brackets, as in x = [x1, ..., xn]. The element-wise product between two vectors is indicated using the ⊙ symbol. To indicate the concatenation of two vectors, we use the ⊕ operator, so that given a = [a1, ..., an] and b = [b1, ..., bm], we write a ⊕ b = [a1, ..., an, b1, ..., bm]. We use bold uppercase letters to indicate matrices, e.g. W. The transposition operation for vectors and matrices is indicated with the T superscript, as in xT. Uppercase letters are used to represent sets other than the set of real numbers, denoted with R. Given a finite set A, we indicate the number of its elements with |A|. We use the Greek letter θ to indicate the set of trainable parameters of the model, while the subset of parameters of each given layer is denoted with θ and a proper subscript, e.g. θenc for the parameter set of the encoder, θdec for the decoder, and so on. We indicate sequences of objects—to distinguish them from vectors defined on some vector space—with comma-separated items, as in s = s1, ..., sn or H = h1, ..., hn. The position of a symbol within a sequence is called a timestep.

We use the term sentence to indicate a sequence of words in plain English, which is fed into the network as its input. Each sentence ends with a conventional symbol <EOS> that indicates the end of the sentence. We define the vocabulary W as the list of the symbols, or words, that can appear in a sentence. In our case, such a list comprises all the words of the English language that are known by the model, the special symbol <EOS>, a special symbol <UNK> that is used in place of unknown words, and some other special symbols that can come in handy. For instance, we replace all the numbers with the symbol NUM, following the approach in [3, 11], to generalize over numbers as much as possible. We use the notation |W| to indicate the number of words in the vocabulary. Given a word w, we indicate with W(w) its index, namely a natural number representing its position within the vocabulary. We use the term formula to indicate a sequence of symbols that can be read as an ALCQ formula and is produced by the network as its output. Each symbol of a formula can come from a set of output symbols called the shortlist, L, containing all the logical symbols of the logical language, or can be a word from the input sentence. Analogously to the words of the input sentence, given a logical symbol l, we indicate its position within L as L(l). The number of symbols in the shortlist is |L|.
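As a minimal illustration of the vocabulary mapping W just described, the sketch below indexes a toy word list together with the special symbols <EOS>, <UNK>, and NUM; the word list and the helper name are assumptions made only for this example.

```python
# Minimal sketch (toy data) of the vocabulary W: known words plus the special
# symbols <EOS>, <UNK>, and NUM described in the text.
SPECIALS = ["<EOS>", "<UNK>", "NUM"]
KNOWN_WORDS = ["a", "bee", "is", "an", "insect", "that", "produces", "honey", "."]
VOCAB = {w: i for i, w in enumerate(SPECIALS + KNOWN_WORDS)}

def encode(sentence):
    """Turn a sentence into the index sequence W(w1), ..., W(wTs)."""
    indices = []
    for word in sentence.lower().split():
        if word.isdigit():
            word = "NUM"                       # generalize over numbers
        indices.append(VOCAB.get(word, VOCAB["<UNK>"]))
    indices.append(VOCAB["<EOS>"])             # every sentence ends with <EOS>
    return indices

print(encode("A bee is an insect that produces honey ."))
```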

3.2. The input layer

To be processed by a neural network, each word in a sentence is mapped onto a vector of real numbers. Such vectors are called word vectors or word embeddings. The goal of the input layer is to accept a sentence as a sequence of words and map each of these words onto its corresponding word vector.


Figure 2: Full network architecture: decoding the j-th symbol. Arrows represent variables, scalars or vectors, flowing between the different modules. Dashed arrows represent variables flowing from the previous timestep.

More in detail, a sentence s to be fed into the network is represented as a list of Ts natural numbers, the i-th of which is the index of the i-th word in the vocabulary:

s = W(w1), ..., W(wTs). (4)

We replace each word with a word vector in Re, where e (the embedding dimension) is a hyperparameter to be set at design time. The word vectors of all the words in the vocabulary can be considered as the |W| rows of the so-called embedding matrix, namely a matrix E ∈ R|W|×e.

Each word index, say k, can be represented as a vector ek ∈ {0, 1}|W|, where all the components are equal to 0 and the k-th is equal to 1. Such a representation is called a one-hot representation. Operationally, projecting the index of the i-th word wi of the sentence onto the corresponding word vector is equivalent to a multiplication between the embedding matrix and the one-hot vector corresponding to that index, say k. So the word vector for the i-th word in the sentence will be given by:

xi = Eek, (5)

where k = W(wi) and xi ∈ Re. The input layer produces a list of word vectors that can be written as:

x = x1, ..., xTs. (6)

Word vectors represent words in some latent feature space. Such vectors are distributed representations of words, being defined on a continuous vector space, Re. In our work, the word embeddings have not been pre-trained but are learned jointly with the main translation task. The embedding matrix is randomly initialized and its values, i.e. the word vectors, are learned during the training process, being the only learned parameters of the input layer, so that θinput = E.
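The lookup of eq. (5) can be illustrated with a short NumPy sketch; the vocabulary size matches the one reported in Section 4.1, while the embedding dimension and the random initialization are placeholder assumptions.

```python
import numpy as np

# Sketch of the input layer, eq. (5): multiplying the embedding matrix by a
# one-hot vector is operationally just a row selection. Sizes are illustrative.
vocab_size, embedding_dim = 5359, 64
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))  # randomly initialized, then learned

def embed(word_index):
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0
    return E.T @ one_hot              # same result as E[word_index]

x_i = embed(42)
assert np.allclose(x_i, E[42])
```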

3.3. Recurrent neural networks

In this subsection, we introduce the idea of Recurrent Neural Networks, namely a particular class of neural architectures capable of modeling the temporal evolution of an input or an output signal. The activity of a Recurrent Neural Network at a certain time depends on the current input and on the state of the network itself up to that moment.

This behavior is particularly suitable when dealing with sequential input and output signals, as in the case of natural or logical languages, where the function and the meaning of each symbol depend on the previous and following ones, through syntactic and semantic dependencies. Indeed, Recurrent Neural Networks have proven extremely effective in Natural Language Processing related problems—for an overall discussion on Recurrent Neural Networks, see [12, Chapter 10].

A neural network can be considered recurrent when some of its layers have a characteristic function, or cell function, that keeps track of the previous evolution of the system up to the current timestep. A generic recurrent cell function g accepts as its input at timestep i some representation of the current timestep, say xi, and its own result from the previous one, say yi−1 (and possibly some other arguments), as in:

yi = g(xi, yi−1, . . .). (7)

We use the term activation to indicate the cell function output.

Several cell functions have been proposed in the literature. Following experiments with several of them, in our work we exploit the Gated Recurrent Unit (GRU, see [10]). Gated Recurrent Units provide our recurrent neural networks with a short-term memory effect. As will become clearer in Section 6.5, this choice was made after some preliminary experiments that involved more complex cell functions. In those experiments we observed comparable performance among the various cell models tested, and therefore we chose GRUs because of their simplicity (i.e., a model with fewer parameters to set and substantially shorter training times).

The cell behavior is driven by two gate functions, whose values range in the [0, 1] interval: the reset gate r and the update gate z. At the i-th timestep, they are defined as follows:

ri = σ(Wrxi + Uryi−1), (8)

zi = σ(Wzxi + Uzyi−1), (9)

where Wr, Ur, Wz, Uz are weight matrices that are learned during training, xi is the current input of the cell, yi−1 is the previous cell activation, and σ(·) indicates the element-wise logistic sigmoid function:

σ(x) = 1 / (1 + e−x), (10)

which squeezes the value of each component of the vector between 0 and 1. The inner state of the cell is represented by a vector ỹi defined as:

ỹi = tanh(Wxi + r ⊙ Uyi−1), (11)

where W and U are learned weight matrices, yi−1 is the activation of the cell at the previous timestep, and tanh(·) is the hyperbolic tangent function. Intuitively, when the reset gate ri gets close to 0, the piece of information carried by the feedback from the previous activation, namely Uyi−1, tends to be ignored.

The cell activation is given by the function g, defined as follows:

yi = g(xi, yi−1) = z ⊙ yi−1 + (1 − z) ⊙ ỹi, (12)

where 1 is a vector of the same size as z with all components set to 1. Intuitively, the update gate z balances the amount of information to be kept from the previous activation, yi−1, and from the current inner state, ỹi.

The whole cell model is synthetically depicted in Fig. 3. The set of parameters to be learned in the training phase can be summed up as θgru = Wr, Ur, Wz, Uz, W, U.

Figure 3: Gated Recurrent Unit.
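Equations (8)-(12) can be collected into a small NumPy sketch of a single GRU cell; the random weights stand in for the trained parameters in θgru, and the class layout is an illustrative choice rather than the paper's implementation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class GRUCell:
    """Minimal GRU cell following eqs. (8)-(12), with random stand-in weights."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        w, u = (hidden_size, input_size), (hidden_size, hidden_size)
        self.Wr, self.Ur = rng.normal(size=w), rng.normal(size=u)
        self.Wz, self.Uz = rng.normal(size=w), rng.normal(size=u)
        self.W, self.U = rng.normal(size=w), rng.normal(size=u)

    def step(self, x_i, y_prev):
        r = sigmoid(self.Wr @ x_i + self.Ur @ y_prev)            # reset gate, eq. (8)
        z = sigmoid(self.Wz @ x_i + self.Uz @ y_prev)            # update gate, eq. (9)
        y_tilde = np.tanh(self.W @ x_i + r * (self.U @ y_prev))  # inner state, eq. (11)
        return z * y_prev + (1.0 - z) * y_tilde                  # activation, eq. (12)

cell = GRUCell(input_size=4, hidden_size=3)
y = cell.step(np.ones(4), np.zeros(3))
```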

More than one recurrent cell can be stacked one on top of the other, making the model more powerful. Given N stacked cells, at each timestep the activation of the (k−1)-th cell is the input of the k-th one, as in:

y^k_i = g(y^{k−1}_i, y^k_{i−1}, . . .), (13)

and the final activation is the activation of the topmost cell, y^N_i. Further on, we will always write the equations of all the recurrent layers of our network as if they were single-layered, with explicit reference to the input of the bottom-most and the activation of the top-most layer, to keep the notation as essential as possible.

When using multiple stacked cells, we randomly set to zero a fraction of the input/output values during the training phase. This technique is known as dropout (see [13]) and has been proven useful to avoid overfitting, preventing co-adaptation on the training data.

The full parameter set is given by the union of the parameters of each layer, namely θrec = θ1, θ2, ..., θN.

3.4. The encoder

Each word vector represents a single word. However, in an actual sentence, the function and the meaning of each word depend also on the other words. The encoder is in charge of processing the word vectors and turning them into a set of other vectors, called encoder states, that take into account, for each word, the contribution of the others. For this purpose, the encoder is modeled as a recurrent function, i.e. a function that accepts as inputs the word vector xi of the current word and its own output at the previous step, hi−1. So, at the i-th timestep, the encoder state hi is given by:

hi = g(xi, hi−1), (14)

where as the cell function g(·, ·) we use the Gated Recurrent Unit presented in Section 3.3.

Encoder states are vectors of real numbers and their size is a hyperparameter to be set at design time. Once the encoder states have been calculated for each timestep, i.e. for each word in the sentence, they can be fed into the next module of the network: the decoder. We represent the set of all the encoder activations as the sequence:

H = h1, ..., hTx. (15)

In practice, the encoder can be built with several recurrent cells stacked one on top of the other. Denoting with N the number of stacked recurrent layers in the encoder, and with θk the parameter set of the k-th one, the set of all the parameters to be learned in the encoder is θenc = θ1, θ2, ..., θN.

3.5. The attention module

The attention module is in charge of focusing on the portion of the internal state of the network that is most important at the current timestep. The capacity to focus the network on some particular portion of its internal state at a certain timestep is achieved by exploiting the Attention Mechanism, first presented in [14], where it was used to train a network to implicitly align components of sentences from different languages in a Neural Machine Translation task.

Let us consider the set of encoder activations of our network H, as defined in (15), now acting as attention states. We can learn an alignment function align(·, ·) that accepts a query vector, denoted with d ∈ Rd, and one of the attention states in H, and returns a real number ai representing the alignment score between the two input vectors:

ai = align(d, hi). (16)

We collect all the alignment scores between a given query vector and all the attention states into a vector a = [a1, ..., aTx]. We apply a so-called softmax function to such a vector so that its elements are normalized to fall in [0, 1] and to sum up to 1:

αi = exp(ai) / ∑_{k=1}^{Tx} exp(ak). (17)

Such normalized scores are called attention weights. Note that the number of attention weights that are computed is equal to the number of encoder states and, consequently, to the number of words in the input sentence. Intuitively, a higher weight means that a particular encoder state—and the corresponding input word—is more important.

Such weights are then used to compute the weighted sum of all the encoder states into a single vector cj, the context vector, of the same size as the encoder states, which is intended to summarize the whole sentence into a single global representation. We can define a function weights(·, ·) accepting a query vector and a set of attention states as its inputs, and returning a weight vector of size Tx collecting all the attention weights:

w = weights(d, H) = [α1, ..., αTx]. (18)


We want to stress that this vector, differently from all the others, has a dimension that is not fixed but depends on the length Tx of the current input sentence. Moreover, since the value of each weight is between 0 and 1, and they all sum up to 1, we can use the weight vector to model a probability distribution over the input sentence. This is what will allow our network to copy extralogical words from the sentence into the formula, as we explain in 3.7.3.

Finally, the alignment function we use to compute the alignment score in (16) is the one presented in [15], where the alignment score is given by:

ai = vaT tanh(Wahi + Uad). (19)

This equation fully defines the set of parameters θatt = va, Wa, Ua to be learned for the attention model. The dimension v of the vector va is a hyperparameter to be set at design time and we call it the attention inner size. Note that this hyperparameter constrains the shape of all the other parameters of the attention mechanism: being involved in a dot product with vaT, the result of the tanh(·) must be a vector of the same dimension. This imposes that Wa ∈ Rv×h and Ua ∈ Rv×d. The meaning and the role of the query vector are fully clarified in Section 3.6, such vector being the activation of the decoder at the previous timestep.
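A compact NumPy sketch of eqs. (16)-(19) follows; encoder states, the query vector, and all weights are random placeholders, and the sizes are illustrative rather than the values used in the experiments.

```python
import numpy as np

# Sketch of the attention module: score each encoder state against the query
# (the previous decoder state), normalize with a softmax (eq. 17), and build
# the context vector as the weighted sum of the encoder states.
h_size, d_size, v_size, Tx = 8, 8, 6, 5
rng = np.random.default_rng(1)
Wa = rng.normal(size=(v_size, h_size))
Ua = rng.normal(size=(v_size, d_size))
va = rng.normal(size=v_size)
H = [rng.normal(size=h_size) for _ in range(Tx)]  # encoder states h_1 ... h_Tx
d = rng.normal(size=d_size)                       # query: previous decoder state

scores = np.array([va @ np.tanh(Wa @ h_i + Ua @ d) for h_i in H])  # eq. (19)
weights = np.exp(scores) / np.exp(scores).sum()                    # eq. (17)
context = sum(w * h_i for w, h_i in zip(weights, H))               # context vector c_j
```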

3.6. The decoder

Acting as the counterpart of the encoder, the decoder is in charge of translating the encoder states into a set of decoder states to be consumed, at each timestep, by the output layer. The final formula f being a sequence of Tf symbols, the decoder produces one activation vector for each of these symbols.

As for the encoder, the decoder is a recurrent layer, built with N cells stacked one on top of the other; at each timestep, the activation of the topmost one represents the decoder state. The decoder state dj is a vector in Rd, where d is a hyperparameter. At the j-th timestep, the decoder function can be written as:

dj = g(x^dec_j, dj−1), (20)

where x^dec_j is the decoder input at the current timestep, defined as follows:

x^dec_j = dj−1 ⊕ cj ⊕ yj−1. (21)

The decoder input is the concatenation of the decoder output at the previous timestep, dj−1, the context vector cj computed over the encoder activations using that very same vector as query, namely attention(dj−1, H), and an approximation of the previous output of the model, yj−1, which we call the feedback and describe in detail in 3.7.4.

We want to remark that the timesteps in the decoder are completely unrelated to the timesteps in the input sentence, which is first processed as a whole and then fed into the decoder through the encoder states.

As for the encoder, θdec = θ1, θ2, ..., θN is the set of parameters of the decoder to be learned.

3.7. The output layer

At each timestep, the output layer has to decide whether the current symbol in the formula is a logical symbol from the shortlist or an extralogical symbol copied from the input sentence and, in both cases, which is the correct symbol. The output layer consists of three interacting modules, described next.

3.7.1. The switch network

The switch network is the module of the decoder that decides whether the current symbol will be a logical or an extralogical one. Note that the switch network does not decide which will actually be the next symbol of the formula. We can model such a module as a function z(·, ·) accepting as inputs the context vector and the decoder state for the current timestep and returning as output a real number zj between 0 and 1:

zj = z(cj, dj). (22)

We can consider this value as an estimation of the probability of the current symbol coming from the shortlist or not, given the input sentence, summarized in cj, and the evolution of the network output so far, represented by dj. In other words, the closer this value is to 1, the more likely the symbol at the current timestep will be a logical one; the closer to 0, the more likely it will be a word copied from the sentence. We implemented this component as a single-layered perceptron:

zj = σ(Wz(dj ⊕ cj) + bz), (23)

where the sigmoid function σ(·) guarantees that the result is in [0, 1]. So, the decoder output is concatenated with the context vector and the resulting vector is fed into the perceptron that computes the switch signal.

A remark about the notation: Wz and bz have been written as a matrix and a vector respectively to keep the notation coherent. Anyway, since zj is a scalar, bz is a vector with one component, i.e. a scalar as well, Wz is a matrix in R1×(d+h), i.e. a row vector, and the concatenation of the decoder activation and the context vector should be read as a column vector of size (d + h). The learned parameters of this component are θz = Wz, bz.

3.7.2. The shortlist softmax

The decision about which logical symbol has to be chosen from the shortlist is up to the shortlist softmax. This module is a function u(·) that accepts the decoder state as its input and returns a vector of real numbers of the same size as the shortlist:

uj = u(dj). (24)

The value of each element of uj is in the range [0, 1], and all the elements sum up to one. Thus, they can represent the estimation of a probability distribution over the logical symbols. If the value of zj is close to one (i.e., emit a logical symbol), the symbol whose position within the shortlist corresponds to the position of the highest value within uj will be chosen.

We implemented the shortlist softmax component as a single-layered perceptron projecting the decoder output onto a vector in R|L|, i.e. of the same dimension as the shortlist:

uj = softmax(Wudj + bu), (25)

with Wu ∈ R|L|×d and bu ∈ R|L|. The learned parameters of this component are θu = Wu, bu.

3.7.3. The location softmax

If the value of zj is close to zero, the network will emit an extralogical symbol by copying it from the input sentence. We use the weight vector wj coming from the attention module to decide which word from the input sentence has to be copied into the output formula as the current extralogical symbol. Recalling (17), the weight vector can be interpreted as a probability distribution over the set of attention states, given by the encoder activations. Such activations are as many as the words in the input sentence, namely Tx. The component with the maximum value is the one corresponding to the position within the sentence of the word to be copied into the output formula. As already remarked, the weight vector has variable length, meaning that its length is not fixed but depends on the length of the input sentence. Using such a vector as the location softmax allows us to copy words from whatever position, without the constraint of making every input sentence of a prefixed size, either by padding or by truncating it.

3.7.4. The final output

At each timestep, we can combine the outputs of all the components of the output layer seen so far—namely the switch network, the shortlist softmax, and the location softmax—into a single output that models a single probability distribution over all the possible symbols that can appear in the formula.

Recalling that zj is the probability for the current symbol to be a logical symbol, 1 − zj is the probability for it to be an extralogical one, uj is a probability distribution over the symbols in the shortlist, and wj is the probability distribution over the words in the sentence, we can combine them into the output vector, given by the following expression:

yj = zj · uj ⊕ (1 − zj) · wj, (26)

where the · operator is the multiplication between a scalar and each component of a vector.

The elements of uj and of wj, taken separately, are in the range [0, 1] and sum up to 1. Multiplying them by zj and 1 − zj respectively, with zj in the range [0, 1], makes their elements sum up to zj and 1 − zj respectively. As a consequence, all the elements of the vector yj resulting from their concatenation are in the range [0, 1] and sum up to 1. For this reason, and since the size of yj equals the length of the shortlist plus the length of the sentence, yj can be seen as a probability distribution over all the possible symbols, both logical and extralogical, that can end up in the output formula for the current input sentence. If the position of the maximum value of yj, namely

fj = argmax(yj), (27)

is between 1 and |L|, namely between 1 and the size of uj, the corresponding symbol will be emitted from the shortlist; otherwise, the word in the sentence at position argmax(yj) − |L| will be copied.
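The interplay of eqs. (22)-(27) is summarized in the sketch below; all weights, the decoder state, the context vector, and the attention weights are random stand-ins, and the shortlist and sentence sizes are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Sketch of the output layer: the switch value z_j scales the shortlist
# distribution u_j and the location distribution w_j (eq. 26); the argmax over
# their concatenation selects a logical symbol or an input position (eq. 27).
L_size, d_size, h_size, Tx = 4, 8, 8, 9
rng = np.random.default_rng(2)
Wz, bz = rng.normal(size=(1, d_size + h_size)), rng.normal(size=1)
Wu, bu = rng.normal(size=(L_size, d_size)), rng.normal(size=L_size)

d_j = rng.normal(size=d_size)        # current decoder state
c_j = rng.normal(size=h_size)        # current context vector
w_j = softmax(rng.normal(size=Tx))   # location softmax: attention weights over the sentence

z_j = 1.0 / (1.0 + np.exp(-(Wz @ np.concatenate([d_j, c_j]) + bz)))  # switch, eq. (23)
u_j = softmax(Wu @ d_j + bu)                                         # shortlist softmax, eq. (25)
y_j = np.concatenate([z_j * u_j, (1.0 - z_j) * w_j])                 # eq. (26)

f_j = int(np.argmax(y_j))                                            # eq. (27)
if f_j < L_size:
    print("emit the logical symbol at shortlist position", f_j)
else:
    print("copy the input word at position", f_j - L_size)
```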

Recalling (21), we need to feed the current output back into the decoder for the next timestep. Since the output vector has variable length |L| + Tx, we cannot use it as-is, since the network parameters would be of undefined dimension. So, we choose a fixed size Tx, which is a hyperparameter, and we truncate or pad with zeroes the output vector to fit this size, obtaining the feedback vector yj. We want to remark here that the padding/truncation does not affect the input sentence, which is always entirely fed into the network regardless of its actual length. The padding/truncation affects only the estimation, at a given timestep, of the probability distribution over the possible output symbols that is fed back into the decoder at the next timestep. The truncation is a particularly critical operation since it cuts off part of the probability distribution and ignores the information that an extralogical symbol has been copied from the last part of the sentence. However, the decoder is supposed to compensate for this loss since, at every timestep, it also consumes the attention context, which is a summarization of the whole sentence.
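The padding/truncation of the feedback vector can be sketched in a few lines; the function name and the fixed size are illustrative assumptions.

```python
import numpy as np

def make_feedback(y_j, fixed_size):
    """Pad with zeros or truncate the variable-length output vector y_j."""
    if len(y_j) >= fixed_size:
        return y_j[:fixed_size]                     # truncate
    return np.pad(y_j, (0, fixed_size - len(y_j)))  # pad with zeros
```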

3.8. Training objective

We train the network by minimizing the categorical cross-entropy between the predicted formula f = f1, ..., fTf and the gold standard one, indicated with f̂ = f̂1, ..., f̂Tf.
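As a rough sketch (assumed inputs, not the paper's code), the objective sums the negative log-probability that each predicted distribution assigns to the corresponding gold symbol:

```python
import numpy as np

def cross_entropy(predicted_dists, gold_indices, eps=1e-12):
    """Categorical cross-entropy over a predicted formula of length Tf."""
    return -sum(np.log(y[j] + eps) for y, j in zip(predicted_dists, gold_indices))
```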

3.9. Implementation

The model has been implemented in Python 3.5 using TensorFlow 1.2.0. The source code is available at https://github.com/dkmfbk/dket and is licensed under the GNU General Public License v3.0.

4. Datasets

The neural network we are presenting in this work must be trained to translate definitory sentences, ideally from any domain, into ALCQ formulæ through syntactic transformations. As such, the training set for the task is a collection of 〈sentence, formula〉 pairs. We motivate the desiderata about the characteristics of such a training set with an example. Let us consider a training set made of the following sentences:

1. A bee is an insect that produces honey;

2. Every bee is also an insect and produces honey;

3. A cow is a mammal that eats grass.

Sentences 1 and 2 convey the same meaning, which, in the perspective of an ontological formulation, can be considered equivalent to the one encoded in formula (3). Their grammatical structures, instead, are different. The words "bee", "insect", "produces", "is" and "honey" occur in both sentences. Such words denote concepts involved in the definition—namely the nouns "bee", "insect", and "honey"—or relationships occurring among these concepts—as in the case of the verbs "is" or "produces". On the contrary, words like "every", "a", "also", "that", and "and" are different across the two sentences, but these differences do not reflect a difference in their meanings; they only generate different sentence structures from the syntactic standpoint. Vice versa, sentences 1 and 3 share exactly the same syntactic structure, but the words "cow", "mammal", "eat", and "grass" give sentence 3 a very different meaning from that of sentence 1. We could formalize sentence 3 with the following formula:

cow ⊑ mammal ⊓ ∃eat.grass . (28)

Word classes like nouns, verbs, adjectives and some adverbs are considered, from a linguistic standpoint, content words, since they describe the actual content of a sentence. In contrast, word classes like articles, pronouns, conjunctions, determiners, most adverbs, and so on, are considered function words: they express grammatical relationships between words without carrying any lexical meaning. Roughly speaking, two sentences that convey the same meaning will tend to share the same content words, even if their function words are different, like sentences 1 and 2 above. Vice versa, two sentences that are similar with respect to their grammatical structure, having similar function words, can present different content words and thus have very different meanings, like sentences 1 and 3 above.

Referring to this example, we can say that our ideal dataset should be a significant sample of definitory language in the sense that:

• it covers as many syntactic structures as possible that can be used by a human to correctly express a definition: in this way, the network can be trained to recognize the purpose of different function words and the syntactic structure they define;

• it covers a significant portion of the human vocabulary: in this way, the network can be trained to recognize and interpret as many content words as possible.

Such a dataset would be an exhaustive sample of the problem space, allowing our model to generalize across different grammatical structures, so that, given the three sentences above, a fourth sentence like "a bee is also an insect and produces honey" is correctly interpreted, exploiting something learned from sentence 1 and something learned from sentence 2. Moreover, given a sentence like "a cow is a mammal and produces milk", the model should be able to leverage what it has already learned from the training set and infer that the unknown word "milk" represents something that is produced by the cow, similarly to what "honey" is for bees. We want to stress here that all the extralogical symbols of a formula are copied from the input sentence. So, once correctly interpreted, our system can use the symbol milk in a formula like:

cow ⊑ mammal ⊓ ∃produces.milk . (29)

To the best of our knowledge, the Semantic Web and Knowledge Engineering communities lack such a dataset, and building it entirely manually from scratch would be an extremely costly and time-consuming process.3 To evaluate and train our approach, we took a first step to fill this gap, with two complementary efforts:

• first, we developed several synthetic datasets: following the best practices of some notable examples in the literature (e.g. [3, 17, 18, 19, 20]), we set up a data generation pipeline so that the datasets to train and evaluate the approach could be synthetically generated. The data generation process is described in Section 4.1. The main goal of the resulting datasets is to assess the capability of our neural network architecture to perform the language-to-logic translation on an approximation of the definitory language;

• second, we developed a manually curated dataset, comprising 500 pairs made of a sentence and a corresponding ALCQ formula, covering different domains. This dataset and its creation are described in Section 4.2. The main goal of this dataset is to assess the capability of the model to gain more knowledge of the real problem space through the extension of the training set by means of manually annotated examples.

3 As a reference, building the Penn TreeBank corpus for part-of-speech tagging required the manual annotation of 40,000 training sentences and 2,400 test sentences, the equivalent of 4.5 million words (see [16]).

4.1. Generating the synthetic dataset

We generated definitory sentences from a hand-crafted grammar. To design a grammar capable of generating such sentences, we qualitatively analyzed different catalogues of definitions, Wikipedia entries, and comments from well-known ontologies that could be formalized into a formula, in order to get an overview of the typical grammatical complexity and variability used by humans in writing definitory sentences.

In general, the typical sentence that can be generated by the grammar comprises a left-hand side, a noun phrase describing the definiendum, and a right-hand side, consisting of a noun phrase expressing the genus and a verb phrase expressing the differentia. For the differentia, the predicate is expressed with one or two transitive verbs, while the range of the predicate is expressed with one or more noun phrases, similar to those used for the definiendum or the genus. Moreover, verbs in the differentia can be constrained with one or two cardinality restrictions. As for definitions with an empty differentia, namely purely taxonomical statements, the ones generated by our grammar have a right-hand side that is a conjunction or a disjunction of two noun phrases, each of which is the textual realization of a concept. Finally, the grammar can also generate definitions with an empty genus, where the differentia has the structure described above.

We ended up having a context-free grammar made of 158 production rules, capable of generating more than 16.5 million different strings that we call sentence templates.4 A sentence template is a definitory sentence where all the content words appearing in the textual realization of concepts, roles, and cardinality restrictions have been anonymized with their corresponding part-of-speech tags—NN for nouns, JJ for adjectives, NUM for numbers occurring in cardinality restriction clauses, and VB for verbs—acting as placeholders for actual words. As an example, the sentence template behind sentence (2) from Section 2 is the following:

A NN is an NN that VB NN. (30)

4 The total running time to generate all such sentence templates on a D2v3 Azure Virtual Machine, running Ubuntu 16.04 LTS as an operating system, was about 8 hours.

In this sentence template, the involved concepts are each represented by a single noun. However, according to our grammar, the textual realization of a single concept can range from a simple single noun, NN, to a more complex noun phrase, like NN NN JJ of JJ NN, and so on. Considering that each concept can be expressed by the grammar with 10 different surface realizations, and that in the sentence template above there are 3 concepts, we end up with 1000 different sentence templates obtained only by changing the way concepts are expressed in this example—without taking into account any other construct (e.g., conjunction or disjunction of concepts) or even minimal linguistic variations, like changing "a" with "every", which would already double the possible templates. In this way, taking into account the surface realizations of concepts and roles, the conjunction or disjunction of roles and concepts, the cardinality restriction clauses and their conjunction or disjunction, and so on, we can understand how the number of possible sentence templates can quickly grow to millions, even when starting from a limited amount of production rules, namely 158.

After having generated a sentence template, our process turns it into an actual sentence through a process of actualization, where part-of-speech tags are replaced with actual words, randomly selected from a catalogue of 2841 nouns, 1629 adjectives and 897 verbs.5 The actualization process can take into account the option of producing unknown words in a sentence. In this case, the placeholder is filled with the symbol <UNK> and not with an actual word. Finally, the special symbol <EOS>, denoting the end of the sentence, is appended at the end of each actualized sentence. Taking into account also the special symbols <EOS>, <UNK>, and NUM, together with the content words and some other function words, which are fixed in the grammar, we ended up with a vocabulary of 5359 words.

5 Resulting sentences are grammatically correct but possibly unrealistic with respect to their meaning from a human point of view: a sentence template like "A JJ NN is also something that VB at least NUM NN" could be actualized into a sentence like "A smoking hurricane is also something that pump at least 18 orange stampede." Note that this does not hamper our evaluation, since the translation happens just via syntactic transformation.

The very same grammar has been used to build aparser in charge of turning a sentence template intoa parse-tree. Such tree is then navigated to createthe formula corresponding with the sentence. Thisformula is a sequence of symbols that are logicalsymbols or reference to positions within the sen-tence, meaning that some words must be copied tobecome extra-logical symbols. In this way, the sameformula will work also from the actualized sentencetemplate, since the actualization just turns place-holders into actual words, leaving them in the verysame position within the sentence. So, the formulafor the sentence template in (30) will be:

#2 ⊑ #5 ⊓ ∃#7.#8 (31)

where the symbol #p means that the word at position p must be copied into the formula. It is easy to verify that the formula is invariant to whatever placeholder fillers are used.
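The invariance can be checked mechanically: the hypothetical snippet below resolves the positional pointers of formula (31) both against the sentence template and against one of its actualizations, obtaining the same formula structure in both cases (the helper name is illustrative only):

    def resolve(formula, tokens):
        """Replace each '#p' pointer with the token at 1-based position p."""
        return [tokens[int(sym[1:]) - 1] if sym.startswith("#") else sym
                for sym in formula]

    formula = ["#2", "⊑", "#5", "⊓", "∃", "#7", ".", "#8"]

    template = "A NN is an NN that VB NN .".split()
    actual   = "A bee is an insect that produce honey .".split()

    print(resolve(formula, template))  # ['NN', '⊑', 'NN', '⊓', '∃', 'VB', '.', 'NN']
    print(resolve(formula, actual))    # ['bee', '⊑', 'insect', '⊓', '∃', 'produce', '.', 'honey']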

With this process we can create pairs of sentences and corresponding formulæ that we call grammar generated examples. Such examples have the following characteristics:

• the sentences have a syntactic complexity that is similar to that of the sentences in the inspected resources mentioned before;

• given a sentence, all the extra-logical symbols of the corresponding output formula come from, and only from, the input sentence;

• the input language can be seen as plain English where a few pre-processing operations have been performed to reduce the number of inflected forms (a sketch of these normalizations is given after the list):

– the indefinite article "An" is always replaced with "A";

– all numbers are replaced with the special symbol NUM;

– all nouns and transitive verbs are lemmatized;

– "does not" and "doesn't" are replaced with "do not."
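These normalizations could be implemented roughly as follows; this is a hedged sketch in which the regular expressions and the lemmatizer hook are assumptions, not the actual pre-processing pipeline:

    import re

    def normalize(sentence, lemmatize=lambda w: w):
        """Apply the pre-processing steps described above.

        `lemmatize` stands in for whatever lemmatizer is used for nouns
        and transitive verbs (identity by default in this sketch).
        """
        sentence = re.sub(r"\bAn\b", "A", sentence)                    # "An" -> "A"
        sentence = re.sub(r"\bdoes not\b|\bdoesn't\b", "do not", sentence)
        sentence = re.sub(r"\b\d+\b", "NUM", sentence)                 # numbers -> NUM
        return " ".join(lemmatize(w) for w in sentence.split())

    print(normalize("An animal doesn't eat at least 18 products"))
    # 'A animal do not eat at least NUM products'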

Concerning the way we generated the formulæ, we are aware that the very same sentence can be formalized in many different ways, even only within the scope of a syntactic transformation.


Building our dataset, we fixed some guidelines that have been strictly followed to set up the annotation process, e.g. a concept like "long and winding road" is formalized as long road ⊓ winding road, and so on. Since ours is a system that learns from examples, the trained model will learn to translate sentences into formulæ according to such guidelines. Clearly, different guidelines in the annotation of the training examples would lead to a substantially different dataset, and thus to a substantially different formalization schema learned by the model. This does not hamper our evaluation, since the goal of this work is neither to define nor to verify that the model learns the best axiomatization scheme, but just the one underlying the training set.

The whole process of generating a synthetic dataset containing pairs of sentences and their corresponding formulæ is schematically depicted in Fig. 4.

Figure 4: Grammar generated example generation process. [The figure shows the pipeline: generate a random sentence template from the grammar (e.g., "A NN is an NN."); parse and annotate the sentence template to obtain the formula ("A NN is an NN.", #2 ⊑ #5); actualize the sentence template with content words ("A bee is an insect.", #2 ⊑ #5).]

The generation process described here has been used to produce different datasets to train and test the architecture proposed in Section 3. These datasets are available for download at https://github.com/dkmfbk/dket. Table 1 summarizes some of the main characteristics of these datasets, such as their size in terms of number of examples, the distribution of unknown words, the minimum, maximum, and average sentence length, and the percentage of examples that contain an existential, universal, or cardinality restriction construct.

They can be split into two groups. The first group comprises all the datasets (suffixed with 'C') for which all the words are known, namely there is a bijective correspondence between the words appearing in the text corpus and the words in the vocabulary. They are exploited for the closed-vocabulary evaluation (see 6.1). The second group comprises all the datasets (suffixed with 'O') that contain some unknown words. This particular setting is used to train the network so that, at run time, it can deal with words that have not been seen during the training phase. Technically, some words of the training set are replaced with <UNK>. In this case, more words, namely all the unknown ones, will be mapped to the same vocabulary entry, namely the one for the <UNK> symbol. In our datasets we replaced 10% of the nouns, 5% of the adjectives, and 5% of the verbs with the <UNK> symbol. The datasets containing unknown words have been used for the open-vocabulary evaluation (see 6.2). Both for the closed-vocabulary and for the open-vocabulary evaluation, datasets comprising 30000 examples (30kC, 30kO), and hence larger than any training set, have been generated and used for validating the model.
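A minimal sketch of how such a masking could be performed is given below; the replacement rates follow the ones reported above, while the part-of-speech lookup and the function name are hypothetical:

    import random

    # Replacement rates used for the 'O'-suffixed datasets.
    UNK_RATES = {"NN": 0.10, "JJ": 0.05, "VB": 0.05}

    def mask_unknown_words(tokens, pos_tags, rates=UNK_RATES):
        """Replace a fraction of the content words with the <UNK> symbol.

        `pos_tags` gives, for each token, its part of speech ('NN', 'JJ',
        'VB', or anything else for function words).
        """
        return ["<UNK>" if random.random() < rates.get(tag, 0.0) else tok
                for tok, tag in zip(tokens, pos_tags)]

    tokens = "A bee is an insect that produce honey .".split()
    tags   = ["DT", "NN", "VBZ", "DT", "NN", "WDT", "VB", "NN", "."]
    print(mask_unknown_words(tokens, tags))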

Concerning the templates used to produce the actual input examples in the datasets, they were generated with a uniform random sampling across all the possible productions of the grammar. This ensured that the different datasets are a uniform sample across the whole input space (language) and that there is no meaningful overlap between the sentences used for training and those used for validating the model.

A similar approach to building a dataset for an analogous task has been followed in [3]. However, the datasets used in that work have been considered not suitable for the scope of this work, as clarified in Sec. 7.

4.2. Collecting manually built examples

In addition to the synthetic dataset, we also developed a manually curated dataset consisting of 500 sentence-formula pairs. This dataset was collaboratively developed by three ontology engineers in three different manners: first, definitory sentences and corresponding axioms were collected from available ontologies


name    size    UNK words                  min. len.   max. len.   avg. len.   exist.    univ.     card. restr.
2kC     2000    -                          6           30          16.21       40.60%    15.15%    10.65%
5kC     5000    -                          6           30          16.35       41.02%    15.82%    10.96%
10kC    10000   -                          6           31          16.42       41.18%    16.15%    11.19%
20kC    20000   -                          6           32          16.42       41.46%    15.73%    11.34%
30kC    30000   -                          6           32          16.47       41.83%    15.53%    11.50%
2kO     2000    10% NN, 5% JJ, 5% VB       7           30          16.32       42.05%    15.75%    9.95%
5kO     5000    10% NN, 5% JJ, 5% VB       7           33          16.37       41.78%    15.66%    11.10%
10kO    10000   10% NN, 5% JJ, 5% VB       7           33          16.39       41.92%    15.69%    10.86%
20kO    20000   10% NN, 5% JJ, 5% VB       6           33          16.41       41.69%    15.66%    11.19%
30kO    30000   10% NN, 5% JJ, 5% VB       5           33          16.44       41.48%    15.84%    11.65%

Table 1: Datasets obtained through the data generation process.

(such as the Pizza Ontology,6 the SSN Ontology,7 the VSAO Ontology,8 the Wildlife Ontology,9 the Transport Disruption Ontology,10 the OBCS Ontology,11 the Ontology Transportation Network,12 and other ontologies on Atoms13) and their documentation (e.g., ontology comments); second, the ontology engineers formalized in ALCQ sentences from textual resources such as glossaries (e.g., the NeON14 and the SEKT15

projects); third, they were asked to create and formalize typical definitory sentences just for the present task.

The process for creating this manually curated dataset was as follows. One ontology engineer provided a sentence-formula pair, either collecting it from an existing ontology or formalizing a selected definitory sentence, while the other two validated the provided formalization. In case of disagreement among the three ontology engineers, the example was further discussed and collaboratively revised.

After its creation, the dataset was manually split into a training part (75M, 75 pairs) and a testing part (425M, 425 pairs), trying to preserve the same distribution of axiom structures (e.g., simple subclasses, subclasses with one existential/universal/cardinality restriction, and so on) among the two resulting datasets.

6 https://protege.stanford.edu/ontologies/pizza/pizza.owl
7 https://www.w3.org/2005/Incubator/ssn/ssnx/ssn
8 https://bioportal.bioontology.org/ontologies/VSAO
9 https://www.bbc.co.uk/ontologies/wo
10 https://transportdisruption.github.io/transportdisruption.html
11 Ontology of Biological and Clinical Statistics, https://www.ncbi.nlm.nih.gov/pubmed/27627881
12 https://www.pms.ifi.lmu.de/rewerse-wga1/otn/OTN.owl (at the time of writing, this URL is not accessible)
13 http://dumontierlab.com/ontologies.php
14 http://www.neon-project.org
15 http://sekt-project.com

The manually curated dataset was created with the purpose of assessing the capability of the model, initially trained only on synthetic data, to improve its performance when incrementally fed with real-world data. In particular, we were interested in evaluating whether adding to the synthetic data very few samples (75), drawn from a different distribution of natural language sentences, was enough to cover the other (425) samples of that distribution. That is, the 75/425 split is not to be intended as training set versus validation set, but as the extension of the training set versus the validation set.

These datasets are available for download together with the ones presented before. Their main characteristics are summarized in Table 2, both for the dataset as a whole and for the splits. Besides the name and the size, the table reports the minimum, maximum, and average length of the sentences each dataset contains, and the distribution of the existential, universal, and cardinality Description Logics constructs within the datasets.

Note that none of the sentence-formula pairs in this manually curated dataset is included in any of the bootstrap datasets in Table 1. Furthermore, only a small fraction (∼4%) of the examples in the manually curated testing dataset adheres to the syntactic grammar rules used for generating the bootstrap datasets, thus making it an appropriate dataset to test the capability of our model to be extended with additional, new training examples, as investigated in Section 6.3.

4.3. Encoding the dataset

A final note is about the way such examples are encoded in order to be machine readable, i.e. suitable to be fed into the network. First, we collect all the words that have been used into a set W, called vocabulary, and all the logical symbols into the shortlist L.


name    size   min. len.   max. len.   avg. len.   exist.    univ.    card. restr.
500M    500    5           40          12.26       49.60%    4.20%    9.20%
75M     75     5           28          11.72       42.67%    2.67%    9.33%
425M    425    5           40          12.36       50.82%    4.47%    9.18%

Table 2: Manually curated dataset and splits.

We replace every word of the sentence with its index within the vocabulary, turning the sentence into a sequence of natural numbers ranging from 1 to |W|, the size of the vocabulary. The encoding of the formula is slightly more complex. If a symbol in the formula is a logical one, we indicate it with its index within L, as for the words. If the symbol must be copied from the p-th word of the input sentence, we replace it with |L| + p. Since |L| is fixed and equal for all sentences, there is no ambiguity in the interpretation of the symbol when turning the formula back into human readable form. This encoding of the output formula is in accordance with the description of the output of the network, formalized in (26).

An explicit example of this encoding process is given in Table 3, where W(·) and L(·) are the functions that return the position of their arguments in the vocabulary W and the shortlist L respectively, and #p indicates that the p-th word in the input sentence must be copied into the output formula.
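A sketch of this encoding, under the assumption that W and L are plain Python lists; the helper names are illustrative rather than the actual code:

    # Hypothetical, tiny vocabulary and shortlist; in the real datasets
    # |W| = 5359 and L contains all the ALCQ logical symbols.
    W = ["<EOS>", "<UNK>", "A", "an", "bee", "insect", "is", "."]
    L = ["⊑", "⊓", "⊔", "∃", "∀", "."]

    def encode_sentence(tokens):
        """Each word becomes its 1-based index in the vocabulary W."""
        return [W.index(tok) + 1 for tok in tokens]

    def encode_formula(symbols):
        """Logical symbols -> 1-based index in L; '#p' pointers -> |L| + p."""
        return [len(L) + int(s[1:]) if s.startswith("#") else L.index(s) + 1
                for s in symbols]

    print(encode_sentence("A bee is an insect .".split()))
    print(encode_formula(["#2", "⊑", "#5"]))   # -> [8, 1, 11] with |L| = 6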

5. Evaluation: goals and metrics

In this section we present the main goal of the study, articulated in different research questions. Moreover, the quantitative metrics used to measure the performance of the model with respect to the evaluation objectives are introduced.

5.1. Goal of the study and research questions

The main goal of this work is to design a single neural architecture which is capable of transforming natural language definitions into Description Logics formulæ through a series of syntactic transformations. The model we propose is a single encoder-decoder architecture endowed with a pointing mechanism, processing sentences represented using only raw text as input feature and producing a formula where all the extra-logical symbols are copied from the input sentence. We evaluated two main characteristics of the model, namely:

• the ability to generalize over the grammar and the lexicon;

• the ability to improve by extending the training set.

We refined the objectives of our evaluation into three research questions.

5.1.1. Ability to generalize

Neural networks, like all machine learning algorithms, are a particular class of statistical learning models. Every machine learning based solution starts with a sampling operation of the real world phenomenon we want to learn. The result of this sampling operation is a collection of examples that are used to train and validate the model. The larger the number of drawn examples, and the more homogeneously they are spread across the whole problem space, the more the network will be able to generalize over such examples and correctly process also sentences that have not been seen during the training phase. In our scenario, this means being able to deal with different sentence structures and words that have not been seen during the training phase. Increasing the number of training examples is often a good practice to boost the performance of the model, but it is typically an expensive operation. So, one pivotal feature of any machine learning model is the ability to generalize as much as possible in the presence of a limited amount of examples. These considerations lead to the following two research questions:

RQ1. To what degree is the network capable of generalizing over the syntactic structure of definitory language?

RQ2. To what degree is the network capable of tolerating unknown words?

5.1.2. Ability to improve by adding training examples

Ideally, when we train a Machine Learning system, the training examples should be as similar as possible to the ones that the system will have to deal with at run time. In this way, the model can exploit the regularities learned during the training phase. This objective is achievable if the training set is a good sample of the actual problem space.


            human readable form        machine readable form
sentence    "A bee is an insect."      W(A), W(bee), W(is), W(an), W(insect), W(.)
formula     #2 ⊑ #5                    |L| + 2, L(⊑), |L| + 5

Table 3: Sentence and formula encoding.

Since for our task there is no dataset that fulfills this requirement, we relied on synthetically generated data, containing many alternative variants in which humans can express definitions, to train and evaluate our models.

Nonetheless, new examples may become available that are similar to a specific problem space. Indeed, the long term goal of our work is to devise a system that can be effectively bootstrapped with grammar generated data, and is capable of improving its performance when trained again after new examples have been added to the training set. As a matter of fact, the addition of new examples should normalize the training set and make it closer to the actual problem space.

These considerations led us to the following research question:

RQ3. To what extent is the model capable of improving its performance on unseen sentence structures with the addition of a few annotated examples?

5.2. Evaluation Metrics

In all our experiments, we quantitatively evaluated the performance of the model using three different metrics.

A formula $f$ is a sequence of $T_f$ symbols. For each formula $f = f_1, \ldots, f_{T_f}$ generated by the model for a sentence $s$, we indicate the gold truth formula with $\bar{f} = \bar{f}_1, \ldots, \bar{f}_{T_{\bar{f}}}$, i.e. the one that the model should have predicted for such a sentence. We say that a formula and its gold truth are equal, $f \equiv \bar{f}$, if $f_j = \bar{f}_j$ for each $j$ in $[1, T_f]$. Given a list of $M$ sentences $S = s_1, s_2, \ldots, s_M$, we indicate with $\mathcal{F} = f^1, f^2, \ldots, f^M$ the list of predicted formulæ, such that $f^k$ is the formula that the model actually generated with $s_k$ as input. We indicate as $\bar{\mathcal{F}} = \bar{f}^1, \bar{f}^2, \ldots, \bar{f}^M$ the list of gold truth formulæ, such that $\bar{f}^k$ is the correct formula for $s_k$.

5.2.1. Average Per-Formula Accuracy

The first metric is the most straightforward measure of the quality of the translation. We define the Average Per-Formula Accuracy (FA) as the ratio between the number of correctly predicted formulæ and the total number of formulæ, $M$:

$$\mathrm{FA}(\mathcal{F}, \bar{\mathcal{F}}) = \frac{\sum_{k=1}^{M} \begin{cases} 1, & \text{if } f^k \equiv \bar{f}^k \\ 0, & \text{otherwise} \end{cases}}{M} \qquad (32)$$

The value of FA is reported as a percentage. The higher the value of FA, the better the network is performing.
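For reference, FA can be computed in a few lines; the following is a minimal Python sketch (not the authors' evaluation code) assuming each formula is represented as a list of symbols or of integer codes:

    def formula_accuracy(predicted, gold):
        """Average Per-Formula Accuracy: percentage of exactly matching formulæ."""
        correct = sum(1 for f, g in zip(predicted, gold) if f == g)
        return 100.0 * correct / len(gold)

    # formula_accuracy([["#2", "⊑", "#5"]], [["#2", "⊑", "#5"]]) -> 100.0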

5.2.2. Average Edit Distance

The second metric is intended to take into account a realistic scenario in which a human operator exploits the system to translate a set of definitions into a set of formulæ. The main idea is that the fewer corrections the human operator has to perform on the system output, the more accurate the system can be considered. Since we are dealing with syntactic transformation, so that all the extra-logical symbols of the target language are taken from the input sentence, the set of such possible corrections will be limited for each sentence-formula pair. In this scenario, the Levenshtein Distance can be extremely meaningful, since it measures the number of transformations that must be applied to the predicted formula in order to turn it into the correct one. We define the Average Edit Distance (ED) as:

$$\mathrm{ED}(\mathcal{F}, \bar{\mathcal{F}}) = \frac{\sum_{k=1}^{M} \delta(f^k, \bar{f}^k)}{M} \qquad (33)$$

where $\delta(f^k, \bar{f}^k)$ is the Levenshtein Distance between the predicted formula $f^k$ and its gold truth $\bar{f}^k$. Since $\delta(f^k, \bar{f}^k)$ is a natural number greater than or equal to 0, the value of ED is a real number greater than or equal to 0. The lower the value of ED, the better the network is performing.
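ED can be sketched similarly, on top of a textbook dynamic-programming Levenshtein distance over symbol sequences; again, this is an illustrative sketch rather than the actual implementation:

    def levenshtein(a, b):
        """Minimum number of insertions, deletions, and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, start=1):
            curr = [i]
            for j, y in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,              # deletion of x
                                curr[j - 1] + 1,          # insertion of y
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    def edit_distance(predicted, gold):
        """Average Edit Distance over a list of predicted/gold formula pairs."""
        return sum(levenshtein(f, g) for f, g in zip(predicted, gold)) / len(gold)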

5.2.3. Average Per-Token Accuracy

The last metric is the Per-Token Accuracy (TA), which indicates how many symbols have been correctly predicted across the whole list $\mathcal{F}$.


This metric is useful to give an overall evaluation of the translation capabilities of the model. It is defined as:

$$\mathrm{TA}(\mathcal{F}, \bar{\mathcal{F}}) = \frac{\sum_{k=1}^{M} \sum_{j=1}^{T_{f^k}} \begin{cases} 1, & \text{if } f^k_j = \bar{f}^k_j \\ 0, & \text{otherwise} \end{cases}}{\sum_{k=1}^{M} T_{f^k}} \qquad (34)$$

where $f^k_j$ is the $j$-th symbol of the $k$-th predicted formula, $\bar{f}^k_j$ the $j$-th symbol of the $k$-th gold truth formula, and $T_{f^k}$ the length of the $k$-th formula. The value of TA is reported as a percentage. The higher the value of TA, the better the network is performing.
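A corresponding sketch for TA follows; in line with (34), symbols are compared position by position up to the length of each predicted formula, and positions falling beyond the end of the gold formula are simply counted as errors in this sketch:

    def token_accuracy(predicted, gold):
        """Average Per-Token Accuracy across all the formulæ in the list."""
        correct, total = 0, 0
        for f, g in zip(predicted, gold):
            total += len(f)
            correct += sum(1 for j, sym in enumerate(f)
                           if j < len(g) and sym == g[j])
        return 100.0 * correct / total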

6. Evaluation: experimental settings and results

In this section we present all the experiments we ran to evaluate the model and answer the investigated research questions. All the results of the experiments are presented in terms of the evaluation metrics defined in the previous section. All the experiments were run on NC6 Microsoft Azure virtual machines with Ubuntu 16.04 as operating system and provided with one NVidia Tesla K80 core, each taking between 6 and 7 hours.

6.1. Closed Vocabulary Evaluation

In our first evaluation, we trained different model configurations to learn to translate sentences into formulæ in a closed vocabulary setting, i.e., in a setting where all the words in the evaluation set appear also in the training data. The objective of this evaluation was to test the capability of the network to deal with the syntactic structure of the definitory language, in order to answer RQ1.

We trained the network on the 'C'-suffixed training sets presented in Section 4.1, and evaluated it against the corresponding test set of 30000 examples. We tested different model configurations, i.e., different hyperparameter settings, and settled on the best configuration, reported in Table 4 together with the settings for the training algorithm, that is, the procedure used to update the values of the weights during the error backpropagation phase, Stochastic Gradient Descent in our case.

The predicted formulæ over the set of 30000 evaluation examples have been compared against their gold truth values and the metrics presented in Section 5.2 have been computed. Results for the different (sizes of the) training sets are reported in Table 5.

setting                  value
embedding size           256
encoder cell             GRU
encoder size             512
encoder dropout          0.2
attention inner size     256
decoder cell             GRU
decoder size             512
decoder dropout          0.2
output feedback size     30
training algorithm       Gradient Descent
learning rate            0.2
training steps           20000
batch size               200

Table 4: Settings for the closed vocabulary evaluation.
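For readability, the settings of Table 4 can also be captured in a plain configuration dictionary; the key names below are purely illustrative and do not reflect the actual codebase available at https://github.com/dkmfbk/dket:

    # Hyperparameters of Table 4 as a (hypothetical) configuration dictionary.
    CLOSED_VOCABULARY_CONFIG = {
        "embedding_size": 256,
        "encoder": {"cell": "GRU", "size": 512, "dropout": 0.2},
        "attention_inner_size": 256,
        "decoder": {"cell": "GRU", "size": 512, "dropout": 0.2},
        "output_feedback_size": 30,
        "training": {
            "algorithm": "gradient_descent",
            "learning_rate": 0.2,
            "steps": 20000,
            "batch_size": 200,
        },
    }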

training    FA      ED     TA
2kC         61.1%   2.48   91.8%
5kC         84.4%   0.60   97.5%
10kC        88.8%   0.47   98.7%
20kC        81.7%   0.46   98.3%

Table 5: Results for the closed vocabulary evaluation. Testing: 30kC.

We achieved an Average Per-Formula Accuracy of 88.8% with 10000 examples (dataset 10kC), together with an Average Edit Distance of 0.47. Increasing the number of training examples, we lose 7.1% of Average Per-Formula Accuracy, gaining 0.01 in the Average Edit Distance. A possible explanation of this behavior is that the network sees all the possible words during training, tends to overfit to the sentence structure, and stops learning. Since closed-vocabulary settings would be very rare or not possible outside an in-vitro scenario, we have not further investigated this phenomenon, leaving this analysis to future work.

The obtained results provide evidence for the thesis that the syntactic complexity of the definitory language can be handled with just one neural network with a pointing mechanism in the scope of a syntactic transformation into a formula, offering support for a positive answer to RQ1.

6.2. Open Vocabulary Evaluation

To answer RQ2, we ran other experiments in a different setting, that we called open vocabulary, in which words in a sentence can be unknown. This is a typical problem when dealing with natural language, since the training set, even when extremely large, will not be able to cover the whole human vocabulary, leaving out rare words, named entities, and so on.


training    FA      ED     TA
2kO         62.4%   1.51   93.8%
5kO         85.9%   0.63   98.0%
10kO        85.4%   0.51   98.4%
20kO        88.7%   0.38   99.0%

Table 6: Results for the open vocabulary evaluation. Testing: 30kO.

To the extent of the language model generated by our source grammar, content words like nouns, adjectives, and verbs can be unknown. Being tolerant to such words is a key asset for our system. Indeed, moving from one domain to another, the lexicon can change, and training a system only on sentences coming from a specific lexicon, without paying attention to its capability of scaling out to other words, can affect the performance on sentences coming from different domains. Intuitively, such variability mostly concerns content words, especially names, while the grammatical structure and the behavior of the function words are basically constant (see sentences 1 and 3 from the driving example of Sec. 4). So, besides the lexical variability, there is still some sort of regularity across different domains. To answer RQ2, we evaluated the ability of our model to leverage the latter to compensate for the former, measuring its tolerance to unknown words.

We trained the network on the 'O'-suffixed training sets presented in Section 4.1, and evaluated it against the corresponding test set of 30000 examples. The model configurations were the same as those reported in Table 4, while results are reported in Table 6.

Using 20000 training examples (dataset 20kO), we achieved an Average Per-Formula Accuracy of 88.7%, together with an Average Edit Distance of 0.38. Even with half of the training examples (dataset 10kO), we lose only 3.3% of Average Per-Formula Accuracy and 0.13 of Average Edit Distance. Providing support for a positive answer to RQ2, this confirms that our system is capable of handling the variability of the natural language lexicon, by exploiting its knowledge of the grammatical structure of definitory sentences to infer the proper way of dealing with unknown words.

As in the closed vocabulary case, also in the open vocabulary setting the number of examples needed to reach a good result is limited. The results in this setting are nonetheless slightly different from those of the Closed Vocabulary evaluation summarized in Table 5.

Indeed, in the Open Vocabulary setting, all three metrics improve monotonically with the enlargement of the training set size, so that the best performance is achieved training the model with 20000 training examples. In contrast, in the Closed Vocabulary setting, only the Average Edit Distance follows such a trend, as the Average Per-Formula and Per-Token accuracy score their best when using 10000 training examples. In both evaluations we can nonetheless identify a huge performance gap between the 2k and the 5k datasets, suggesting that 5000 examples can already be considered an effective sampling of the problem space.

6.3. Training Set Extension Evaluation

The grammar generated data are intended to represent, in the context of our evaluation, an approximation of real definitory sentences. To the extent of this approximation, the evaluation presented so far shows how our approach is capable of generalizing over a sample of sentences from this approximate set. When the more ambitious goal of making our approach viable for real world problems is considered, the grammar generated datasets exploited so far are intended to be used only to bootstrap the model, providing our neural network based approach with basic sentence structures and formula translations. Indeed, we envision an application scenario for our approach where the training set is enriched with some additional domain/application specific human annotated examples, so that the network can combine the generality of the basic sentence structures of the bootstrap datasets with the domain/application specificity of the provided annotated examples.

We tried to investigate this aspect by answering question RQ3 with an experiment that replicates, on a limited but significant scale, this process of enriching a training set. In particular, we compared the performance of our system when trained on the four 2kO–20kO synthetic datasets and on the same four datasets enriched with the 75M dataset containing 75 manually annotated examples. Such a comparative evaluation has been performed against 425M, the manually built reference set of 425 examples. The reason why we performed a comparative evaluation is that we wanted to assess the ability of the system built over synthetic data to improve its performance when augmenting the training set with a small amount of examples (namely the 75M dataset) coming from the problem space of the reference set, rather than assessing the capability of the original system to generalize over a manually curated reference set.


We recall here that the 75M and 425M datasets were extracted, preserving as much as possible the same distribution of axiom structures among them, from the 500 manually curated sentence-formula pairs described in Section 4.2.

training     CF    FA      ED     TA
2kO          35    8.2%    4.80   46.73%
2kO+75M      143   33.7%   3.44   59.58%
5kO          38    8.9%    4.58   48.03%
5kO+75M      126   29.6%   3.55   59.15%
10kO         39    9.2%    4.59   48.38%
10kO+75M     82    19.3%   4.06   54.77%
20kO         38    8.9%    4.55   49.29%
20kO+75M     55    12.9%   4.53   49.5%

Table 7: Evaluation with manual training set extension (75M) for different bootstrap datasets. Testing: 425M. CF indicates the number of correctly generated formulæ.

Table 7 reports the results of the evaluation, explicitly showing also the number of correctly generated formulæ. We obtained the best performances using a learning rate of 0.4 for training the model with 2000, 5000, and 10000 bootstrap examples, and of 0.2 for training the model with 20000 bootstrap examples. Some other experiments have been carried out by stratifying the 75M dataset in the augmented training set (in particular, by adding it more than once, up to 10 times), but no significant variation in the results was observed.

Considering the raw performance of the model trained only with synthetic bootstrap training data (the 2kO, 5kO, 10kO, and 20kO rows of Table 7), we notice that the performance increases almost monotonically with the number of training examples. This indicates that the bootstrap data are useful to provide the model with a certain catalogue of syntactic constructs and a large enough input vocabulary that can be leveraged to process sentences coming from the real problem space. In particular, the lack of overall improvement between 10kO and 20kO indicates that 10000 bootstrap examples already capture most of the regularities that are shared between the synthetic and the real problem spaces.

In all cases, adding the 75M examples to the bootstrap training set, we observe an improvement in the performance of the model in terms of all the evaluation metrics used. This provides support for a positive answer to RQ3, showing that the addition of manually annotated real world examples can actually drive the model to a better comprehension of the actual problem space sampled, in our case, by the 425M reference set.

Analyzing the outcome of this evaluation, we can note that the size of such an improvement changes with the size of the bootstrap training set. In particular, using fewer bootstrap training examples, we have a higher increase in performance. This happens because the new examples, which are closer to the test examples than the bootstrap examples, are a larger fraction of the extended training set when dealing with a smaller bootstrap training set. In contrast, with a larger bootstrap training set, the performance gain is more contained.

Nonetheless, in the perspective of a continuous process of extension of the training set and retraining of the model, it can be meaningful to exploit a large bootstrap, capable of providing the model with an initial coverage of a wide range of syntactic constructs and of a large vocabulary, with many words used in as many contexts as possible.

We acknowledge that what we provided so far is just a limited, yet significant, example of the behavior of the model in the context of a training set extension, and a deeper investigation is left as future work.

6.4. Evaluation against baselines

To complement our analysis, and to put the overall performance of our RNN-based approach in perspective, we compare it with two baseline systems against the same manually curated dataset (425M). Our model outperforms both baselines according to all the evaluation metrics. The results are reported in Table 8.

system               CF   FA      ED     TA
Grammar Parser       17   4.19%   -      -
Tag&Transduce [3]    0    0.00%   11.7   10.39%
20kO                 38   8.9%    4.55   49.29%

Table 8: Comparison of the performances of the proposed approach (training: 20kO) against two baselines.

The first baseline (Grammar Parser) is the system that we used to annotate the examples in the synthetic data generation pipeline described in Section 4.1. The sentences of the 425M reference set are transformed into suitable sentence templates by applying the inverse of the actualization process described in Section 4.1. Such re-built sentence templates are then fed into a syntactic parser built on the same grammar used in the data generation process.


Since the parser can only either perfectly parse a sentence or fail, for this baseline we report only the Average Per-Formula Accuracy and the number of correct formulæ. Only 4.19% of the sentences can be parsed by the grammar. This gives us a measure of the difference between the distributions from which the manually curated data and the bootstrap data come, explaining the decrease in performance we register when we go from testing the model against the 30kC (see Table 5) or 30kO (see Table 6) test set to testing it against the 425M reference set. Indeed, in the former cases the training data and the test data come exactly from the same uniform distribution, the one across the language that can be generated by our grammar, while the 4.19% of Average Per-Formula Accuracy registered by the Grammar Parser baseline indicates that the difference between the bootstrap training data and the test data from the 425M reference set is remarkable. Nonetheless, even if only trained on grammar generated data, the network is already somehow capable of generalizing over the sentence structures defined by the grammar, and correctly processes roughly twice as many sentences as the latter is capable of converting. The comparison with the grammar parser confirms that our model learns more than the actual templates used for building the bootstrap dataset (and used for training).

The second baseline is a previously proposed neural network-based system (Tag&Transduce, see [3]) for ontology learning. The comparison with Tag&Transduce [3] confirms that the pointing mechanism is more robust than the tagging and transduction solution, overcoming some of the limitations of the latter. In particular, the Tag&Transduce model is made of two different networks, each of which is in charge of a single task:

• the tagging network annotates each word, indicating whether it is an extra-logical symbol or not. As an example, the sentence "a bee is an insect" (or, more precisely, the preprocessed sequence "bee is insect") would be tagged with C0 w C1, where the tag w indicates that "is" is not an extra-logical symbol, and the tags C0 and C1 indicate that the corresponding words ("bee", "insect") are extra-logical symbols in the considered definition;

• the transduction network converts a sentence into a formula template, namely a formula in which all the extra-logical symbols are replaced by placeholders. As an example, the sequence "a bee is an insect" is transduced into the formula template C0 ⊑ C1.

The final formula, Bee ⊑ Insect in our example, is deterministically rebuilt downstream by combining the output of the two networks. Since the two networks are trained separately to optimize the Per-Token Accuracy on each task, the errors in the reconstruction of the whole formula are not taken into account during the training phase. Hence, even high performances on the two different tasks can result in lower values for the evaluation metrics on the formula as a whole. A deeper analysis of this model and its behavior within this task is provided in [21] (in particular in Section 5.4.4), while a qualitative comparison between the Tag&Transduce system and our model is reported in Section 7.2.
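To make the comparison concrete, the deterministic rebuilding step of this baseline can be sketched as follows; this is a simplified illustration of the idea described in [3], not its actual implementation:

    def rebuild(tagged_words, template):
        """Combine word/tag pairs with a formula template to obtain the formula.

        `tagged_words` pairs each word with its tag (e.g. 'C0', 'C1', or 'w'
        for non extra-logical words); `template` is the transduced formula
        with placeholders.
        """
        fillers = {tag: word for word, tag in tagged_words if tag != "w"}
        return [fillers.get(sym, sym) for sym in template]

    tagged = [("a", "w"), ("bee", "C0"), ("is", "w"), ("an", "w"), ("insect", "C1")]
    print(rebuild(tagged, ["C0", "⊑", "C1"]))   # ['bee', '⊑', 'insect']

Note that, since the fillers dictionary maps each tag to a single word, the bijectivity limitation discussed in Section 7.2 is already visible in this sketch.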

6.5. Additional remarks

During our work, we investigated further aspects not directly aimed at answering our research questions. We comment on our supplementary findings here.

Different Cell Functions. Before finally opting for GRUs, we experimented with different cell functions, more powerful than GRUs, in our neural network architecture. Being more powerful, these cell functions are characterized by a larger number of parameters. This implies that the resulting model is more complex and the training phase is more time and resource consuming. For instance, we tested Long Short-Term Memory (LSTM, see [22]) cells but could not measure any improvement over GRUs in the network performance. For this reason, we suspended the investigation of more complex models and did not experiment with more complex architectures, such as the bidirectional encoders exploited in the work of [5] that inspired the design of our neural network.

Impact of Random Initialization. In order to verify whether the random initialization of the weights could affect the performance of the network, we ran the same experiments several times. Similarly to the case of the different cell functions, we did not observe any significant variation.

Realistic versus unrealistic training examples. A further experiment was done to compare the impact of adding grammatically correct but likely unrealistic examples (with respect to their meaning from a human point of view) versus grammatically correct and realistic ones.


For this experiment, we leveraged the manually curated datasets described in Section 4.2.

We took the 75M dataset, which consists of 75 grammatically correct and realistic sentences, and operated on it to generate unrealistic examples in a way similar to the one used to generate the synthetic data: we identified nouns, adjectives, and verbs and randomly scrambled them across the corpus, switching a noun with another noun, and so on. As an example, a sentence like "a bee is an insect that produces honey" could be turned into the sentence "a fly is a tool that drives water". The resulting dataset is labeled 75M*. We added the 75M and 75M* datasets to the same starting bootstrap dataset (20kO) and compared the performances. Results are reported in Table 9, compared with the corresponding results already reported in Table 7.
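A sketch of the scrambling procedure used to obtain 75M* is given below; the part-of-speech information is assumed to be available, and the function name is purely illustrative:

    import random

    def scramble_content_words(tagged_sentences, classes=("NN", "JJ", "VB")):
        """Randomly permute content words of the same class across the corpus.

        `tagged_sentences` is a list of sentences, each a list of (word, pos)
        pairs; nouns are switched with nouns, adjectives with adjectives, etc.
        """
        pools = {c: [w for s in tagged_sentences for w, p in s if p == c]
                 for c in classes}
        for pool in pools.values():
            random.shuffle(pool)
        iters = {c: iter(pool) for c, pool in pools.items()}
        return [[(next(iters[p]) if p in classes else w, p) for w, p in s]
                for s in tagged_sentences]

    # Example: nouns may swap places across the two toy sentences.
    corpus = [[("a", "DT"), ("bee", "NN"), ("is", "VBZ"), ("an", "DT"), ("insect", "NN")],
              [("a", "DT"), ("fly", "NN"), ("is", "VBZ"), ("a", "DT"), ("tool", "NN")]]
    print(scramble_content_words(corpus))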

The performance of the model trained with the addition of the 75 unrealistic sentences is comparable to that of the model trained with the addition of the 75 realistic examples. This result provides evidence for the hypothesis that the meaningfulness, with respect to the test examples, of the training set extension does not have a relevant impact, and that a major role is played by the syntax. More in detail, the slightly better performance obtained with the unrealistic training set extension could be explained by the fact that these examples are a more homogeneous sample, lying between the grammar generated data and the manually curated ones. This hypothesis will be further inspected in future investigations.

training      # correct   FA       ED     TA
20kO          38          8.94%    4.55   49.29%
20kO+75M      55          12.94%   4.53   49.59%
20kO+75M*     62          14.60%   4.22   50.72%

Table 9: Performances of extending the training set with realistic vs. unrealistic examples.

7. Related Work

In this section, we discuss relevant related works, highlighting the main differences with respect to the contribution presented in this paper. Our work belongs to the area of ontology learning and population (see [23, 24] for a broader overview).

Ontology learning targets the extraction of terminological (TBox) knowledge, while ontology population deals with the extraction of assertional (ABox) knowledge according to specific ontologies. Notable examples of the latter include FRED [25] and PIKES [26], while various approaches and tools have tackled the former, though mainly addressing lightweight ontology learning, and in particular subtasks such as the identification of concepts (e.g., KX [27]) or the extraction of concept taxonomies (e.g., FRED [25], OntoLearn [28]).

Focusing on expressive ontology learning, i.e. the extraction of ontological axioms more complex than concept subclasses, we can roughly split the state-of-the-art approaches into two groups. The first one comprises pattern-based approaches heavily relying on hand-crafted rules and resources. The second one comprises purely machine-learning based approaches.

7.1. Pattern-based approaches

LExO [2] is one of the most notable examples in the literature describing a semi-automatic approach to extract expressive axioms from complex natural language definitions through syntactic transformation. The pivotal intuition, which we exploited as well in this work, is that syntax matters. First, the input definition is processed with a syntactic parser. Afterwards, the resulting parse tree is manipulated through a set of transformation rules that turn the input sentence into an OWL axiom. The approach has been used in [29] as the foundational idea to build a methodology with the ambitious goal of covering the whole ontology life-cycle management.

Recently, the authors of [30] proposed an effective approach for ontology enrichment based on the mapping of natural language text onto DL formulæ. The approach requires a hand-crafted Tree-Adjoining Grammar and a hand-crafted lexicon to map a sentence into its corresponding logical form. The assessment of the system is performed by verbalizing the DL formula back to natural language and evaluating the BLEU score between the original sentence and the re-verbalized one. Such verbalization is performed through a third component of the system, called surface realizer, which must be provided. All such components must be manually designed. This approach is tailored to a System Installation Design Principles (SIDP) text corpus, which contains mostly alethic statements, while our approach is focused on definitory sentences.

The main difference between this work and the approaches presented above lies in the fact that they strongly rely on manually designed components.


Pattern-based approaches for ontology learning are indeed rigid with respect to the grammatical structure of the text they can process. Therefore, as acknowledged in [2], several linguistic phenomena such as conjunctions, negations, disjunctions, quantifier scope, ellipsis, anaphora, etc., can be particularly hard to parse and interpret. Extending a catalog of hand-crafted rules to handle as many linguistic phenomena as possible can be an extremely expensive task, leading to unsustainable costs in engineering, maintaining, and evolving the system. In contrast, our approach has no hand-crafted rules and its evolution is intended to happen through the extension of the training set by adding manually annotated training examples.

7.2. Machine Learning-based approaches

The authors of [4] present a machine learning based approach to formalize textual definitions of biomedical concepts. Training examples are generated via distant supervision, aligning definitions coming from several collections of biomedical texts (such as MeSH,16 MEDLINE,17 and Wikipedia articles on the topic) with a catalog of relations coming from well known resources on the same domain, like Snomed CT18 and the SemRep19 corpus. First, all the concepts are tagged with their semantic types; then such semantic types, together with a minimal set of linguistic features, are used to train and evaluate different classifiers. Differently from ours, this approach is focused on a specific domain and there is no tolerance to unknown words. In addition, the target logic language is the one used in Snomed CT, namely EL++, which is less expressive than ALCQ.

Recently, some other approaches have tried to use neural networks trained in an end-to-end fashion in order to transform natural language sentences into logical formulæ. The most relevant to our work is the approach presented in [3], which provided evidence that Recurrent Neural Networks can be exploited to manage the syntactic structure of definitory text.

16 https://meshb.nlm.nih.gov
17 https://www.nlm.nih.gov/bsd/pmresources.html
18 http://www.snomed.org/snomed-ct
19 https://semrep.nlm.nih.gov/

Such an approach uses two different neural networks: one to transduce a sentence into a formula structure with placeholders for concepts and roles, and the other to tag each word in the sentence with the proper placeholder; in this way, it is possible to rebuild the final formula. As an example, the sentence "a bee is an insect" will be transduced into the formula template C0 ⊑ C1 and, in parallel, tagged as "a/w bee/C0 is/w an/w insect/C1", so that the final formula can be rebuilt as bee ⊑ insect, w being the empty tag. However, the work in [3] is affected by three prominent limitations that led us to propose the network architecture presented in this paper.

The first one lies in the bijective nature of tagging: each word can be tagged only once and with only one label. So, consider a sentence like "a property is a quality of an event or object" and assume that we want to translate it into the formula property ⊑ quality of event ⊔ quality of object. The corresponding formula template generated by the transducer network would be C0 ⊑ C1 ⊔ C2. The tagging network should tag the words "quality" and "of" twice, first with C1 and then with C2, which is impossible. Such a limitation has been completely overcome by our approach through the use of the pointing mechanism, which allows us to point at a single word as many times as we need.

The second limitation is given by the fact that the tags representing the placeholders for concepts, roles, and numbers are indexed with a number. Supposing that in the training examples the maximum number of concepts in a single sentence is 2, the network will be able to use the tags C0 and C1. At run time, if a sentence with 3 involved concepts is fed into the network, it will not be able to emit a C2 tag for the third concept, since such a symbol is not in the output vocabulary of the network. In a scenario of continuous training set enrichment and model retraining, like the one we outlined in this work, it is not possible to add more complex (i.e. with a larger number of tags per sentence) examples incrementally without changing the output dimension of the network, since new tags will have to be added to the output vocabulary as well. The approach described in this paper is not subject to any of these issues, since there are no numbered tags and all the network decisions in terms of emission are restricted to the logical symbols or the words of the current sentence.

The last limitation concerns the training set. In [3], the dataset was built relying on the ACE (Attempto Controlled English) verbalization [31] of existing ontological axioms.


That is, the sentences of the example pairs are written according to an artificially defined subset of natural language (with restricted vocabulary, syntax, or semantics), thus resulting in sentences which are far away from how humans would typically express a definition: for instance, the axioms in (3) would be verbalized as "Every bee is an insect and produces some honey" (cf. with (2)). Furthermore, the basic constructs in the example pairs were further combinatorially combined with "and" and "or" to extend the dataset in size and variability. Instead, in the approach presented in this paper we aimed at making the dataset as realistic as possible, drawing examples from human annotations in well known ontologies, lexicons, Wikipedia, and so on.

Outside the Ontology Learning and Semantic Web communities, the authors of [11] exploit a neural network to translate natural language sentences into a logical representation of database queries. Even if the application scenario is quite different from ours, the two architectures share some similarities, since the approach of [11] relies on an encoder-decoder configuration endowed with attention. The main difference between the two network models is in the pointing mechanism, which is not present in [11]. As a consequence, rare and unknown words must be replaced by their types and a numeric identifier (e.g. NUM0 for the first number in the sentence, etc.). This subjects the model to the same limitation we pointed out about the numbered tags in the approach presented in [3]. By introducing the pointing mechanism, we completely overcome such a limitation, since unknown words can be copied into the output sequence as they are.

8. Conclusions and Future Work

In this work, we presented a single neural network in the recurrent encoder-decoder configuration and trained it to translate a natural language definition into the corresponding ALCQ formula through syntactic transformation. In particular, we exploited a pointing mechanism to make the network capable of producing any extra-logical symbol just by copying it from the input sentence, choosing the right one via neural attention.

Since there was no dataset suitable for our task, we followed other notable examples in the literature and used appropriate synthetically generated data in order to assess the generalization capabilities of the model.

In parallel, we developed a smaller dataset of 500 manually curated real-world examples to be used to test the capability of improvement through the extension of the training set. In our evaluation process we assessed the capacity of the network to i) generalize over the syntactic structure of the language of interest, ii) tolerate unknown words, and iii) improve its performance with the addition of annotated examples to the synthetic bootstrap data. We ran several tests, with different samplings of the problem space (i.e. using training sets of different sizes), to measure the generalization capabilities of the network.

Answering RQ3, we gained evidence that, once bootstrapped with grammar generated data, the performance of the model can be improved over real-world data by augmenting the training set with new manually annotated examples. However, we also gained evidence that the bootstrap data used in this work are not, per se, an exhaustive sample of the definitory language (which, anyway, was not the goal of our work).

Some future effort will be put in the direction of improving the quality of the bootstrap data with more and better examples. Such an improvement will have to cover both the increasing complexity of the input language and the number of logical constructs, so as to cover the whole OWL-DL language. The most straightforward way would be to improve the grammar underlying the generation of the bootstrap data. We are not planning to follow this direction since, in the long run, we would face a maintenance cost comparable to that of the hand-crafted rule based approaches. Moreover, this approach would reflect any prior bias of the knowledge engineer in charge of the grammar development, with the risk of ending up with a biased representation of reality. Instead, we plan to investigate the direction of using some distant supervision based approach, as in [4], or the exploitation of generative models, as in [32], to generate annotated examples which, although still artificial, can be even more realistic than the current grammar generated ones. Upstream, some semi-automatic approach to align natural language definitions and Description Logics formulæ will be explored in the future to set up a robust and effective bootstrap data generation process.

The network is trained in an end-to-end fashion: raw text is the only feature used to represent the input, and no further annotation (such as POS tags, dependency labels, and so on) that could be produced upstream by some Natural Language Processing toolkit is used. Such toolkits are pre-trained on a portion of natural language which is different from the one we are interested in.


Even if adding more features can ideally improve the generalization capabilities of the model, this misalignment could introduce a significant amount of noise, capable of canceling out such an improvement. As a consequence, the introduction of such features must be tested carefully.

We remarked in Section 3.2 that word embeddings are not pre-trained but are learned jointly with the main translation task. What emerges from the results in Table 9 is that the model learns some sort of coarse-grained semantics for the different content word classes, since randomly scrambling the content words in the extension of the training set leads to a comparable performance improvement. Further analyses in this direction are foreseen as future work, such as a comparative analysis with word embeddings pre-trained on different corpora, or the evaluation of the embeddings learned jointly with the main task against a suitable semantic similarity benchmark.

Another interesting direction of investigation lies in the possibility of training the model beyond the simple syntactic transformation, making it capable of recognizing the semantics of the input sentence and of formalizing it in the most suitable way. This would enable the identification of equivalent concept definitions and their translation into the same formula. As an example, the two sentences "a bee produces honey" and "a bee makes honey" would be translated into the formula Bee ⊑ ∃ produce.(honey), where the verb "produce" is typically preferred to the more generic "make". In order to achieve this goal, the model must be trained beyond the quasi-zero vocabulary setting, since it must be aware of the meaning of those extra-logical symbols, like the word "produce" in the example above, that cannot be copied from the second input sentence.

From the architectural point of view, different and more complex models could be beneficial to the task. In particular, we plan to experiment with bidirectional recurrent layers (see [33]) as encoders. Such encoders process the input sentence both forward and backward, so, at each timestep, they have information about the portion of the sentence that has been seen so far and about the portion that has yet to be seen. This complex and powerful representation of the whole sentence could positively affect the translation of the sentence into the formula.

In the end, we are confident that the approach we proposed in this paper can be an effective tool for the Knowledge Engineering and Semantic Web community, and we hope that our contribution will encourage a shared effort in building a widely accepted dataset for the task.

In this perspective, our next improvement will go in the engineering direction, making the trained model usable and easy to integrate into other knowledge management suites (e.g. MoKi [34]).

Acknowledgements

All the computing resources used for the present work were granted by the Microsoft Azure for Research Award programme.

References

[1] Judith Reitman Olson and Henry H. Rueter. Extracting expertise from experts: Methods for knowledge acquisition. Expert Systems, 4(3):152–168, 1987. ISSN 1468-0394.

[2] Johanna Völker, Pascal Hitzler, and Philipp Cimiano. Acquisition of OWL DL axioms from lexical resources. In Enrico Franconi, Michael Kifer, and Wolfgang May, editors, The Semantic Web: Research and Applications, 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, June 3-7, 2007, Proceedings, volume 4519 of Lecture Notes in Computer Science, pages 670–685. Springer, 2007.

[3] Giulio Petrucci, Chiara Ghidini, and Marco Rospocher. Ontology learning in the deep. In Eva Blomqvist, Paolo Ciancarini, Francesco Poggi, and Fabio Vitali, editors, Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Bologna, Italy, November 19-23, 2016, Proceedings, volume 10024 of Lecture Notes in Computer Science, pages 480–495, 2016.

[4] Alina Petrova, Yue Ma, George Tsatsaronis, Maria Kissa, Felix Distel, Franz Baader, and Michael Schroeder. Formalizing biomedical concepts from textual definitions. J. Biomedical Semantics, 6:22, 2015.

[5] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016.

[6] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2692–2700, 2015.

[7] Roberto Navigli and Paola Velardi. Learning word-class lattices for definition and hypernym extraction. In Jan Hajic, Sandra Carberry, and Stephen Clark, editors, ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 1318–1327. The Association for Computer Linguistics, 2010.


[8] Aristotele. Organon. Laterza, Bari, Italy, 1970. Introduction, translation, and commentary by Giorgio Colli.

[9] Grigoris Antoniou and Frank Van Harmelen. Web on-tology language: Owl. In Handbook on ontologies, pages67–92. Springer, 2004.

[10] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734. ACL, 2014.

[11] Li Dong and Mirella Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016.

[12] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive Computation and Machine Learning. MIT Press, 2016.

[13] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[15] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. Grammar as a foreign language. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2773–2781, 2015.

[16] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[17] Samuel R. Bowman, Christopher Potts, and Christopher D. Manning. Recursive neural networks for learning logical semantics. CoRR, abs/1406.1827, 2014.

[18] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1828–1836, 2015.

[19] Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698, 2015.

[20] Wojciech Zaremba and Ilya Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014.

[21] Giulio Petrucci. Learning to Learn Concept Description. PhD thesis, University of Trento, 2018.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[23] Jens Lehmann and Johanna Völker, editors. Perspectives On Ontology Learning. Studies in the Semantic Web. AKA / IOS Press, 2014.

[24] Philipp Cimiano, Alexander Mädche, Steffen Staab, and Johanna Völker. Ontology learning. In Handbook on Ontologies, pages 245–267. Springer, 2009.

[25] Francesco Draicchio, Aldo Gangemi, Valentina Presutti, and Andrea Giovanni Nuzzolese. FRED: from natural language text to RDF and OWL in one click. In ESWC (Satellite Events), volume 7955 of Lecture Notes in Computer Science, pages 263–267. Springer, 2013.

[26] Francesco Corcoglioniti, Marco Rospocher, and Alessio Palmero Aprosio. Frame-based ontology population with PIKES. IEEE Trans. Knowl. Data Eng., 28(12):3261–3275, 2016.

[27] Sara Tonelli, Marco Rospocher, Emanuele Pianta, and Luciano Serafini. Boosting collaborative ontology building with key-concept extraction. In Semantic Computing (ICSC), 2011 Fifth IEEE International Conference on, pages 316–319, 2011.

[28] Paola Velardi, Stefano Faralli, and Roberto Navigli. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics, 39(3):665–707, 2013.

[29] Johanna Völker, Peter Haase, and Pascal Hitzler. Learning expressive ontologies. In Paul Buitelaar and Philipp Cimiano, editors, Ontology Learning and Population: Bridging the Gap between Text and Knowledge, volume 167 of Frontiers in Artificial Intelligence and Applications, pages 45–69. IOS Press, 2008.

[30] Bikash Gyawali, Anastasia Shimorina, Claire Gardent, Samuel Cruz-Lara, and Mariem Mahfoudh. Mapping natural language to description logic. In Eva Blomqvist, Diana Maynard, Aldo Gangemi, Rinke Hoekstra, Pascal Hitzler, and Olaf Hartig, editors, The Semantic Web - 14th International Conference, ESWC 2017, Portoroz, Slovenia, May 28 - June 1, 2017, Proceedings, Part I, volume 10249 of Lecture Notes in Computer Science, pages 273–288, 2017.

[31] Tobias Kuhn. The understandability of OWL statements in controlled English. Semantic Web, 4(1):101–115, 2013.

[32] Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. Semantic parsing with semi-supervised sequential autoencoders. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1078–1087. The Association for Computational Linguistics, 2016.

[33] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673–2681, 1997. doi: 10.1109/78.650093. URL https://doi.org/10.1109/78.650093.

[34] Chiara Ghidini, Marco Rospocher, and Luciano Serafini. Modeling in a wiki with MoKi: Reference architecture, implementation, and usages. International Journal On Advances in Life Sciences, 4(3&4):111–124, 2012.
