
Information and Computation 185 (2003) 89–104

www.elsevier.com/locate/ic

Learning by switching type of information

Sanjay Jain a,∗ and Frank Stephan b

a School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543, Singapore
b Mathematisches Institut, Im Neuenheimer Feld 294, Universität Heidelberg, 69120 Heidelberg, Germany

Received 11 January 2002; revised 19 July 2002

Abstract

The present work is dedicated to the study of modes of data-presentation in the range between text and informant within the framework of inductive inference. In this study, the learner alternatingly requests sequences of positive and negative data. We define various formalizations of valid data presentations in such a scenario. We resolve the relationships between these different formalizations, and show that one of these is equivalent to learning from informant. We also show a hierarchy formed (for each of the formalizations studied) by considering the number of switches between requests for positive and negative data.
© 2003 Elsevier Science (USA). All rights reserved.

1. Introduction

Astronomers observing the sky with telescopes cannot obtain all available information but have to focus their study on selected areas and might from time to time change to another area of the sky. Forty years after the discovery of Uranus, it was found that Uranus was not following the predicted orbit exactly. Taking into account the influence of the other known planets, the astronomer Alexis Bouvard came up with the hypothesis that there exists a further unknown planet which disturbs the orbit of Uranus. John Couch Adams and Urbain Jean Joseph Le Verrier both computed independently the position of the unknown planet. In 1846, Le Verrier communicated his results to Johann Gottfried Galle, who then found Neptune with his telescope at the given position.

A preliminary version of this paper appeared in ALT 2001. We would like to thank the anonymous referees for useful comments. Sanjay Jain was supported in part by NUS grant number R252-000-127-112. Frank Stephan was supported by the Deutsche Forschungsgemeinschaft (DFG) under the Heisenberg grant Ste 967/1–1.

∗ Corresponding author. Fax: +65-6779-4580.
E-mail addresses: [email protected] (S. Jain), [email protected] (F. Stephan).

0890-5401/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved.
doi:10.1016/S0890-5401(03)00082-8


Similar to astronomy, one can also consider in inductive inference the scenario that the learner cannot track all available data but has to focus on some type of data and can switch the focus of attention only a few times. The purpose of the present work is to formalize such switching between the two main modes of data-presentation in inductive inference, namely between reading positive data, which are elements of the set to be learned, and negative data, which are the non-elements. There are several ways to formalize this, and it is investigated how these formalizations relate to each other and how they fit into the hierarchy of the already established notions of learning from positive data (text) or from both positive and negative data (informant).

In the scenario of learning from positive data, the learner is fed all the elements and no non-elements of a language L (the so called text of L), in any order, at most one element at a time. The learner, as it is receiving the data, outputs a sequence of grammars. The learner is said to identify (learn, infer) L just in case the sequence of grammars converges to a grammar for L. A class of languages is learnable if some machine learns each language in the class. This is essentially the paradigm of identification in the limit (called TxtEx) introduced by Gold [11]. Gold also considered the situation of learning from informant, where the learner receives both positive and negative data, that is, elements of the graph of the characteristic function of L (called informant for L) as input. This leads to the identification criterion known as InfEx.

Gold [11] showed a central result that learning from text is much more restrictive than learning from informant. Gold gave an easy example of a class which can be learned from informant but not from text: the collection consisting of one infinite recursively enumerable set together with all its finite subsets.

The main motivation for this work is to explore the gap between these two extreme forms of data-presentation. Previous authors have already proposed several methods to investigate this gap; some of these are described below.

Gasarch and Pleszkoch [10] considered allowing learners access to non-recursive oracles. However, Jain and Sharma [14] showed that even the most powerful oracles do not permit learning all recursively enumerable (or even all recursive) sets from texts, whereas the oracle K allows one to learn all recursively enumerable sets from informant.

Restrictions on the texts (such as allowing only primitive recursive texts or ascending texts) reduce their non-regularity and permit further information to be passed on implicitly [20,23]. For example, ascending texts permit the reconstruction of the complete negative information in the case of infinite sets, but fail to do so in the case of finite sets. Thus the class consisting of one infinite set and all its finite subsets is still not learnable from ascending text. On the other hand, the class of all recursively enumerable languages can be learned from primitive recursive texts. Merkle and Stephan [18] also considered strengthening the text by permitting additional queries to retrieve information not contained in standard texts.

Motoki [19] and later Baliga et al. [2] added to the positive information of the text some, but not all, negative information about the language to be learned. They considered two notions of supplying the negative data: (a) there is a finite set of negative information S ⊆ L̄ such that the learner always succeeds in learning the language L from input S plus a text for L, and (b) there is a finite set S ⊆ L̄ such that the learner always succeeds in learning the language L from a text for L plus a text for a set H disjoint from L which contains S, that is, which satisfies S ⊆ H ⊆ L̄. In case (a) one is able to learn the class of all recursively enumerable languages. Thus, the notion (b) is the more interesting one.

The present work treats positive and negative data symmetrically and several of our notions are much less powerful than those notions considered by Baliga et al. [2]. The most convenient way to define these notions is to use the idea of a minimum adequate teacher as, for example, described by Angluin [1]. A learner requests positive or negative data-items from a teacher which has – depending on the exact formalization – to fulfill certain requirements. These formalizations (and also the number of switches permitted) then define the model. We consider three formalizations (called BasicSwEx, RestartSwEx, NewSwEx) of the requirements a teacher needs to satisfy. The naturalness of this approach is witnessed by the fact that all classes separating the various formalizations can be defined in easy topological terms. Due to the topological nature of the separating classes, these results hold even if the learners are non-computable. Out of the three formalizations, NewSwEx turns out to be the most natural definition in the gap between TxtEx-learning and learning from informant. RestartSwEx (without constraints on the number of switches) coincides with learning from informant, whereas BasicSwEx has some strange properties.

2. Preliminaries

Notation. Any unexplained recursion theoretic notation can be found in Rogers' textbook [21]. The symbol N denotes the set of natural numbers {0, 1, 2, 3, . . .}. The symbols ∅, ⊆, ⊂, ⊇, and ⊃ denote empty set, subset, proper subset, superset, and proper superset, respectively. The cardinality of a set S is denoted by card(S). The domain and range of a partial function ψ are denoted by domain(ψ) and range(ψ), respectively.

An infinite sequence is a mapping from N to N ∪ {#}; a finite sequence is a mapping from {y ∈ N : y < x} (for some x ∈ N) to N ∪ {#}. In the first case, the length of the sequence is ∞, whereas in the second case its length is x. We denote the length of a sequence η by |η|. Sequences may take a special value # to indicate a pause (when considered as a source of data). Therefore the notion content is introduced to denote the set of the numbers contained within the range of some finite or infinite sequence. The content of a sequence η is defined as content(η) = range(η) ∩ N. Furthermore, if x ≤ |η|, then η[x] denotes the restriction of η to the domain {y ∈ N : y < x}. We let σ and τ range over finite sequences. We denote the sequence formed by the concatenation of τ at the end of σ by στ. Furthermore, we use σx to denote the concatenation of the sequence σ and the sequence of length 1 which contains the element x.

By ϕ we denote a fixed acceptable programming system for the partial computable functions mapping N to N [17,21]. By ϕi we denote the partial recursive function computed by the program with number i in the ϕ-system. Such a program i is a (characteristic) index for a set L if

    ϕi(x) = 1, if x ∈ L;
    ϕi(x) = 0, otherwise.

Programs for enumeration procedures (so called r.e. indices) are not considered in the present work. From now on, we call the recursive subsets of N just languages and only consider characteristic indices and not enumeration procedures. The symbols L, H range over languages. L̄ denotes the complement, N − L, of L. The symbol ℒ ranges over classes of languages.

Learning theory often also considers learning non-recursive but still recursively enumerable sets. In this work we restrict ourselves to the recursive case since, for the notions of learning considered in this paper: (I) all inclusions hold for the case of recursive sets iff they hold for the case of recursively enumerable sets; (II) recursive sets already permit us to construct candidates for separations of learning criteria – our diagonalization proofs use mainly the topological properties. Furthermore, recursive sets have, compared to recursively enumerable sets, the advantage that their complement also possesses a recursive enumeration. This is an interesting property to have, as we are considering positive and negative information in a symmetric way.

Notation from Learning Theory. The main scenario of inductive inference is that a learner reads more and more data on an object and outputs a sequence of hypotheses which eventually converge to the object to be learned.

Definition 2.1 [11]. A text T for a language L is an infinite sequence such that its content is L, that is, T contains all elements of L but none of L̄. T[n] denotes the finite initial sequence of T with length n.

Definition 2.2 [11]. A learner (or learning machine) is an algorithmic device which computes a mapping from finite sequences into N.

We let T range over texts and M range over learning machines. M(T[n]) is interpreted as M's conjecture for the input language based on data T[n]. We say that M converges on T to i (written: M(T)↓ = i) iff (∀∞ n) [M(T[n]) = i].

There are several criteria for a learning machine to be successful on a language. Below we define learning in the limit as introduced by Gold [11].

Definition 2.3 [11].
(a) M TxtEx-learns a language L from text T iff, for some index i for L, for almost all n, M(T[n]) = i.
(b) M TxtEx-learns a language L (written: L ∈ TxtEx(M)) just in case M TxtEx-learns L from each text for L.
(c) M TxtEx-learns a class ℒ of languages (written: ℒ ⊆ TxtEx(M)) just in case M TxtEx-learns each language from ℒ.
(d) TxtEx = {ℒ : some learner M TxtEx-learns ℒ}.
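To make the definition concrete, the following minimal sketch (our own illustration, not from the paper) shows a TxtEx-style learner for the class of all finite languages. In the formal model conjectures are ϕ-indices; for readability, the sketch outputs the conjectured set itself.

    # A toy learner for the class of all finite languages: conjecture exactly
    # the content seen so far. On any text for a finite language L, the
    # conjectures converge, in the sense of Definition 2.3, to L itself.

    def finite_language_learner(prefix):
        """M(T[n]): the content of the prefix; '#' marks a pause and carries no data."""
        return frozenset(x for x in prefix if x != "#")

    # A text for L = {2, 5}, with pauses; the conjectures stabilize on {2, 5}.
    text = [2, "#", 5, "#", 2, 5, 5]
    for n in range(1, len(text) + 1):
        print(n, sorted(finite_language_learner(text[:n])))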

The following propositions on learning from text are useful in proving some of our results.

Proposition 2.4 (Based on Proposition 2.2A by Osherson, Stob and Weinstein [20]). Let L be any infinite language and Pos be a finite subset of L. Then {H : Pos ⊆ H ⊆ L ∧ card(L − H) ≤ 1} ∉ TxtEx.

Proposition 2.5 [11]. Let L be any infinite language. If ℒ contains L and the sets L ∩ {0, 1, . . . , n} for infinitely many n ∈ N, then ℒ ∉ TxtEx.

We now generalize the concept of learning and permit the learners to explicitly request positive or negative data from a teacher, in order to define learning by switching between the types of information received.

Definition 2.6. Learning is a game between a learner M and a teacher T. Both alternately send information in the following way: in the kth round (for ease of notation we start with round 0), the learner first sends a request rk ∈ {+,−}; the teacher then answers with a data item xk; thereafter the learner outputs a hypothesis ek. There are three types of interactive protocols between the learner and the teacher; every teacher satisfying the protocol is permitted.


(a) The basic switch-protocol. The teacher has two texts T+ and T− of L and L̄, respectively. After receiving rk the teacher transmits Trk(k).

(b) The restarting switch-protocol. The teacher has two texts T+ and T− of L and L̄, respectively. After receiving rk the teacher computes the current position l = card({h : 0 ≤ h < k ∧ rh = rk}) and transmits Trk(l).
Intuitively, in the restarting switch-protocol, one may think of the learner as asking for the "next item" of the selected text (of the language or its complement).

(c) The newtext switch-protocol. The teacher sends an xk ∈ L ∪ {#} if rk = + and an xk ∈ L̄ ∪ {#} if rk = −. Furthermore, if there is a k such that rh = rk for all h ≥ k, and either k = 0 or rk−1 ≠ rk, then the sequence xk, xk+1, . . . is a text for L (if rk = +) or a text for L̄ (if rk = −).

Intuitively, in the newtext switch-protocol, the teacher starts with a new text for L or L̄ every time a switch occurs.

A class ℒ is learnable according to the given protocol iff there is a learner M such that, for every L ∈ ℒ and for every teacher satisfying the protocol for this L, the hypotheses of the learner M converge to an index e of L. The corresponding learning-criteria are denoted by BasicSwEx, RestartSwEx, and NewSwEx, respectively.

Note that M is a TxtEx-learner iff M always requests positive data (rk = + for all k). Therefore, all three notions are generalizations of TxtEx-learning.
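For illustration, the following sketch (our own toy model; the names and the finite "texts" are assumptions made for brevity) plays one request sequence against teachers for protocols (b) and (c); infinite texts are simulated by cycling through finite lists.

    # Toy teachers for the restarting and newtext switch-protocols of Definition 2.6.
    from itertools import cycle

    def restart_teacher(t_pos, t_neg):
        """Protocol (b): one cursor per text; request r is answered with the next
        unread item of the selected text, so nothing is lost across switches."""
        pos, neg = cycle(t_pos), cycle(t_neg)
        return lambda r: next(pos) if r == "+" else next(neg)

    def newtext_teacher(pos_items, neg_items):
        """Protocol (c): after every switch the teacher may start over with a
        fresh text; this toy teacher simply restarts the chosen text from scratch."""
        state = {"last": None, "it": None}
        def answer(r):
            if r != state["last"]:  # a switch (or the very first request) occurred
                state["it"] = cycle(pos_items if r == "+" else neg_items)
                state["last"] = r
            return next(state["it"])
        return answer

    # The same request sequence yields different answer streams:
    for make in (restart_teacher, newtext_teacher):
        teach = make([0, 2, 4], [1, 3, 5])
        print([teach(r) for r in "++-++"])  # restart: 0,2,1,4,0  newtext: 0,2,1,0,2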

In the following we define restrictions on the number of switches similar to those which have been imposed on the number of mind changes by Case and Smith [7] and Freivalds and Smith [8]. We consider counting the number of switches by ordinals. The learner has a counter for an ordinal, which is downcounted at every switch. Due to the well-ordering of the ordinals, the counter can be downcounted only finitely often. In order to ensure that the learner is computable, we consider throughout this work only recursive ordinals. In particular, we use a fixed notation system, Ords, and a partial ordering of ordinal notations [16,21,22]. We use ≼, ≺, ≽, and ≻ to compare ordinals according to the partial ordering mentioned above. We do not go into the details of the notation system used, but instead refer the reader to the methods outlined in the papers [5,8,15,16,21,22].

Definition 2.7. BasicSw∗Ex denotes the variant of BasicSwEx where the requests of M have to converge to some r whenever M deals with a teacher following the basic switch-protocol, for any given L ∈ ℒ.

For an ordinal notation α, we now define the variant BasicSwαEx of BasicSwEx. The learner (as in Definition 2.6) is additionally equipped with a counter. The value of the counter at the beginning of round k is denoted by γk. Now, in addition to the properties required for BasicSwEx-learnability, we require:
(1) γ0 = α.
(2) If rk+1 = rk, then γk+1 = γk.
(3) If rk+1 ≠ rk, then γk+1 ≺ γk.

Similarly, one defines the four notions RestartSw∗Ex, NewSw∗Ex, RestartSwαEx, and NewSwαEx for the restarting and newtext switching protocols.

One can consider generalizations of the above notions by replacing Ex by other convergence criteria such as BC [6] or FEx [4].


Remark 2.8. The notions BasicSwEx, RestartSwEx, and NewSwEx might change a bit if, instead of arbitrary texts, some restrictive variants are used.

A fat text for a language L is a text in which every element of L appears infinitely often (and non-elements of L never appear). Therefore, arbitrarily long initial segments of the text may be missing without losing essential information. For the criteria of inference considered in this paper, one may consider learning from "fat information" where all the texts considered in Definition 2.6 are fat texts. In this situation, one can, to a certain degree, compensate the loss of information when switching in the basic switch-protocol. The notions NewSw∗Ex and RestartSw∗Ex do not change if one considers fat information, but the notion BasicSw∗Ex increases its power and becomes equivalent to NewSw∗Ex – note that in the standard "non-fat" case, by Proposition 3.1 and Theorem 3.2 below, the notion BasicSw∗Ex is properly contained in NewSw∗Ex. A similar result applies if one replaces ∗ by an ordinal α in the previous statement.
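A fat text can be produced from any enumeration of L by a simple dovetailing; the sketch below (our own illustration, assuming L is given by an enumeration) replays every element listed so far in each round, so each element recurs infinitely often.

    # A minimal fat-text generator: dovetail over an enumeration of L so that
    # every listed element is repeated in all later rounds.
    from itertools import count, islice

    def fat_text(enumeration):
        seen = []
        for x in enumeration:
            seen.append(x)
            for y in seen:   # replay everything listed so far
                yield y

    evens = (2 * n for n in count())          # an enumeration of L = {0, 2, 4, ...}
    print(list(islice(fat_text(evens), 15)))  # 0, 0, 2, 0, 2, 4, 0, 2, 4, 6, ...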

It can be shown that learning from recursive texts does not give any advantage in TxtEx-criteria; see, for example, the textbook by Jain et al. [13]. All diagonalization results considered in this paper can be done using recursive texts.

Gold [11] showed that the class of all recursively enumerable sets can be learned from primitive recursive texts, which are generated by primitive recursive functions. Thus, the generalized criteria considered in this paper coincide with learning from text if one considers only primitive recursive texts as input in Definition 2.6.

Remark 2.9. Consider the class ℒ which contains the four subsets of {0, 1}. This class is TxtEx-learnable, but not learnable by a BasicSwEx-learner which is required to make at least one switch on every possible data-sequence.

To see this, assume that the learner starts with requesting positive examples. As 0^∞ is a valid text for the language {0}, if the teacher answers 0 on requests for positive examples, eventually the learner must switch and ask for a negative example. Suppose the switch occurs at the nth round. But then the learner cannot distinguish between the following two situations:
(1) the teacher is giving the answers for the language {0}, where T+ = 0^∞ and T− = 1 2 3 . . .;
(2) the teacher is giving the answers for the language {0, 1}, where T+ = 0^n 1 0^∞ and T− = 2 2 3 . . .
The T− in the above two cases differ only at the first position and the T+ differ only at the (n + 1)th position. As r0 was + and rn was −, the learner never receives the entries at exactly these positions and so is not able to distinguish between the two cases.

If the learner starts by requesting negative data, it can be trapped similarly.

As the above remark shows, although BasicSwEx is more powerful than TxtEx, it still has the severe restriction that information might be lost – it might happen that a given learner receives, due to switches, a data sequence which satisfies the protocol for several possible languages. This cannot occur for the criteria of NewSwEx-learning (for a finite number of switches) and RestartSwEx-learning (for a finite or infinite number of switches), which from this point of view are more natural.

3. Basic relations between the concepts

Within this section, we investigate the basic relations between the various criteria of learning by switching type of information.


Proposition 3.1.
(a) For all ordinals α: BasicSwαEx ⊆ NewSwαEx ⊆ RestartSwαEx.
(b) BasicSw∗Ex ⊆ NewSw∗Ex ⊆ RestartSw∗Ex.
(c) BasicSwEx ⊆ NewSwEx ⊆ RestartSwEx.

Proof. We first show that any teacher using the newtext switch-protocol also satisfies the basic switch-protocol. Thus every learner succeeding with every teacher satisfying the basic switch-protocol also succeeds with every teacher using the newtext switch-protocol. It follows that the inclusion holds for any constraints on the number of switches permitted, as the learner does not change.

Consider the interaction between the learner and teacher for any language L. Let rk denote the request of the learner and xk denote the answer of the teacher in the kth round, where the answers by the teacher satisfy the newtext switch-protocol. To show that the teacher also satisfies the basic switch-protocol we need to construct texts T+ (for L) and T− (for L̄) such that xk = Trk(k). This can be done by induction. Let s+(k) and s−(k) be the number of the k′ < k for which rk′ is positive or negative, respectively. Now one defines

    T+(k) = xk,    if rk = +;
    T+(k) = s−(k), if rk = − and s−(k) ∈ L;
    T+(k) = #,     if rk = − and s−(k) ∉ L;

    T−(k) = xk,    if rk = −;
    T−(k) = s+(k), if rk = + and s+(k) ∈ L̄;
    T−(k) = #,     if rk = + and s+(k) ∉ L̄.

Note that all elements of T+ are either # or in L, since they are either given by the newtext teacher or explicitly required to be in L. Furthermore, if almost all rk are positive, then the newtext protocol guarantees that all elements of L show up and that T+ is a text for L. If infinitely many rk are negative, then the function s− is not bounded and so there is, for every x ∈ L, a k such that x = s−(k) and rk = −. It follows that x goes into the text. Thus T+ is a text for L and similarly one can verify that T− is a text for L̄.

Also, any teacher using the restarting switch-protocol can be used to simulate answers using the newtext switch-protocol – by appropriately repeating the already given positive/negative elements before giving any new elements presented in the restarting switch-protocol. The proposition follows. □
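The construction of T+ and T− in the proof above is easy to mechanize; the sketch below (our own rendering, with a membership test for L assumed to be available) converts a request/answer transcript of a newtext teacher into the two texts, following the displayed definition.

    # Build texts T+ (for L) and T- (for the complement) from a transcript of
    # pairs (r_k, x_k), so that x_k = T_{r_k}(k) as in the proof of Proposition 3.1.

    def basic_texts(transcript, in_l):
        t_pos, t_neg = [], []
        s_pos = s_neg = 0          # s+(k), s-(k): earlier rounds with +/- requests
        for r, x in transcript:
            if r == "+":
                t_pos.append(x)
                # pad T- with the candidate s+(k) if it lies outside L, else '#'
                t_neg.append(s_pos if not in_l(s_pos) else "#")
                s_pos += 1
            else:
                t_neg.append(x)
                # pad T+ with the candidate s-(k) if it lies inside L, else '#'
                t_pos.append(s_neg if in_l(s_neg) else "#")
                s_neg += 1
        return t_pos, t_neg

    t_pos, t_neg = basic_texts([("+", 0), ("+", 2), ("-", 1), ("+", 4)],
                               lambda x: x % 2 == 0)   # L = even numbers
    print(t_pos, t_neg)   # [0, 2, 0, 4] ['#', 1, 1, '#']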

In the following it is shown that the hierarchy from Proposition 3.1 (c) is strict, that is,

TxtEx ⊂ BasicSwEx ⊂ NewSwEx ⊂ RestartSwEx.

Besides this main goal, the influence of restricting the number of switches to be finite, or even to respect an ordinal bound, is investigated.

Note that the inclusion TxtEx ⊆ BasicSw0Ex follows directly from the definition. Furthermore, the class {L ⊆ N : card(L̄) ≤ 1}, by Proposition 2.4, is not TxtEx-learnable; however, as the class contains only cofinite sets, it can be learned via some learner always requesting negative data. Thus the inclusion TxtEx ⊂ BasicSw0Ex is strict.

Combining finite and cofinite sets is the basic idea to separate newtext switching from basic switching using parts (a) and (c) of Theorem 3.2 below. The class used to show this separation is quite natural:


ℒfin,cofin = {L : card(L) < ∞ or card(L̄) < ∞}.

Theorem 3.2 below also characterizes the optimal number of switches needed to learn ℒfin,cofin (where possible): one can do it for the criteria NewSw∗Ex and RestartSw∗Ex with finitely many switches, but an ordinal bound on the number of switches is impossible.

Theorem 3.2.
(a) ℒfin,cofin ∈ NewSw∗Ex.
(b) For all ordinals α: ℒfin,cofin ∉ RestartSwαEx.
(c) ℒfin,cofin ∉ BasicSwEx.

Proof.
(a) The machine M works in stages. At any point of time it keeps track of the elements of L and of L̄ that it has received.

Construction. Initially let Pos = ∅, Neg = ∅ and go to stage 0.
Stage s: If card(Pos) ≤ card(Neg),
    then request a positive example x;
        update Pos = Pos ∪ {x} − {#};
        conjecture the finite set Pos;
    else request a negative example x;
        update Neg = Neg ∪ {x} − {#};
        conjecture the cofinite set N − Neg.
Go to stage s + 1.
End stage s.

It is straightforward to enforce that the learner always represents each conjectured set with the same index. Having this property, it is easy to verify that M NewSw∗Ex-learns ℒfin,cofin.

(b) Suppose by way of contradiction that M RestartSwαEx-learns the class ℒfin,cofin. Since every finite sequence of data can be extended to one of a set in ℒfin,cofin, M has to behave correctly on all data sequences and does not switch without downcounting the ordinal. There is a minimal ordinal β which M can reach in some downcounting process. For this β, there is a corresponding round k, a sequence of requests by M and a sequence of answers given by a teacher such that M's ordinal counter is β after the kth round; let Pos be the positive data and Neg be the negative data provided by the teacher until reaching β. As β is minimal, M does not make any further downcounting but stabilizes to one type of request, say to requesting positive data; the case of requesting only negative data is similar. Let L = N − Neg. If H satisfies Pos ⊆ H ⊆ L and card(L − H) ≤ 1, then M is required to learn H without a further switch. So M would essentially be a TxtEx-learner for {H : Pos ⊆ H ⊆ L ∧ card(L − H) ≤ 1}, a contradiction to Proposition 2.4.

(c) Suppose by way of contradiction that M BasicSwEx-learns ℒfin,cofin. Due to symmetry one can assume that the first request of M is + and that the teacher gives # as an answer. Now consider the special case that T− is either #^∞ or y #^∞ for some number y. The set to be learned is either N or N − {y} and the only remaining relevant information is the text T+. Thus if one could learn ℒfin,cofin under the criterion BasicSwEx, then one could also TxtEx-learn the class {L ⊆ N : card(L̄) ≤ 1}, a contradiction to Proposition 2.4. □
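The stage construction in part (a) translates directly into code; the following runnable sketch (a toy harness with our own representation of conjectures) balances the counts of positive and negative examples exactly as above.

    # The NewSw*Ex-learner for the class of finite and cofinite sets from part (a):
    # while the positive side does not outnumber the negative side, request positive
    # data and conjecture the finite set Pos; otherwise request negative data and
    # conjecture N - Neg.

    def fin_cofin_learner(teacher, rounds):
        """teacher(r) yields an element of L (r == '+'), of the complement
        (r == '-'), or '#'. ('fin', S) stands for S; ('cofin', S) for N - S."""
        pos, neg, conjectures = set(), set(), []
        for _ in range(rounds):
            if len(pos) <= len(neg):
                pos |= {teacher("+")} - {"#"}
                conjectures.append(("fin", frozenset(pos)))
            else:
                neg |= {teacher("-")} - {"#"}
                conjectures.append(("cofin", frozenset(neg)))
        return conjectures

On a finite language the set Pos eventually freezes, so only finitely many switches occur and the conjecture ('fin', Pos) stabilizes; on a cofinite language the positive count eventually exceeds the size of the complement forever, so the learner stays on negative requests and the conjectured complement stabilizes.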


Item (c) in Theorem 3.2 can be improved to show that even classes which are very easy for NewSwEx cannot be BasicSwEx-learned.

Corollary 3.3. NewSw1Ex ⊈ BasicSwEx.

Proof. The proof of Theorem 3.2 (c) shows that the class

    {L ⊆ N : card(L) ≤ 1 or card(L̄) ≤ 1}

is not BasicSwEx-learnable; the sets with card(L) ≤ 1 are added for the case that the request r0 in the proof of Theorem 3.2 (c) is −. It remains an easy verification that the considered class is NewSw1Ex-learnable: a machine first asks for positive examples and outputs an index for the set consisting of the examples seen so far, unless it discovers that there are at least two elements in the language, at which point it switches to requesting negative examples in order to find the at most one negative example. □

The following theorem shows the strength of the restarting switch-protocol by showing that it has the same learning power as the criterion InfEx, where the learner gets the full information on the set L to be learned by reading its characteristic function instead of a text for it; see [11].

Theorem 3.4. RestartSwEx = InfEx.

Proof. Clearly, RestartSwEx ⊆ InfEx. In order to show that InfEx ⊆ RestartSwEx, we show how to construct an informant for the input language using a teacher which follows the restarting switch-protocol. Clearly, this suffices to prove the theorem. The learner requests positive and negative information alternatingly. This gives the learner a text for L as well as for L̄, which allows one to construct an informant for the input language L. □
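The simulation in this proof amounts to merging the two texts into labelled pairs; a minimal sketch (our own toy model, with the teacher interface assumed) follows.

    # Alternate +/- requests against a restart-protocol teacher and record the
    # characteristic-function values seen so far; in the limit this enumerates
    # the graph of chi_L, i.e. an informant for L.

    def informant_prefix(teacher, rounds):
        labels = {}
        for k in range(rounds):
            r = "+" if k % 2 == 0 else "-"        # request types alternate
            x = teacher(r)
            if x != "#":
                labels[x] = 1 if r == "+" else 0  # chi_L(x)
        return sorted(labels.items())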

The following theorem shows that the newtext switching protocol can simulate the restarting switching protocol if the number of switches is required to be bounded by an ordinal.

Theorem 3.5. For all ordinals α, RestartSwαEx = NewSwαEx.

Proof. By Proposition 3.1, it suffices to show the inclusion RestartSwαEx ⊆ NewSwαEx.

Note that for RestartSwαEx and NewSwαEx learning, we may assume without loss of generality that the learning machine makes a finite number of switches on all inputs (i.e., even for inputs for languages outside the class, or for teachers not following the protocol). Furthermore, if the machine makes only finitely many switches, then it is easy to verify that any teacher following the newtext switch-protocol also follows the restarting switch-protocol. The theorem follows. □

In contrast to Theorem 3.5, the following theorem shows the advantage of the restarting switching protocol, compared to the newtext switching protocol, if the number of switches is not required to be bounded.


Theorem 3.6. RestartSw∗Ex ⊈ NewSwEx.

Proof. Let Odd denote the set of odd numbers. Let

    ℒ1 = {Odd} ∪ {Odd − {2x + 1} : x ∈ N},
    ℒ2 = {Odd ∪ {0}} ∪ {Odd ∪ {0} ∪ {2x + 2} : x ∈ N},
    ℒ = ℒ1 ∪ ℒ2.

It is easy to see that ℒ1 can be learned using negative data, and ℒ2 can be learned using positive data. Thus, a machine can RestartSw∗Ex-identify ℒ by first finding out (by alternatingly requesting positive and negative data) whether 0 belongs to the input language L or not. After this the machine uses just positive data (if 0 ∈ L) or just negative data (if 0 ∉ L) to identify L.
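This RestartSw∗Ex-strategy can be sketched as follows (our own toy rendering; hypotheses are omitted and only the request behaviour is shown): probe both texts alternately until 0 is classified, then commit.

    # Request strategy for the class of Theorem 3.6: since 0 appears eventually
    # in T+ (if 0 is in L) or in T- (if 0 is not in L), alternating requests
    # classify 0 after finitely many rounds; afterwards the learner never
    # switches again, so the total number of switches is finite.

    def request_trace(teacher, rounds):
        zero_in_l, trace = None, []
        for k in range(rounds):
            if zero_in_l is None:
                r = "+" if k % 2 == 0 else "-"   # probing phase
            else:
                r = "+" if zero_in_l else "-"    # committed phase
            x = teacher(r)
            if x == 0:
                zero_in_l = (r == "+")
            trace.append(r)
        return trace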

To show that ℒ ∉ NewSwEx, we use the fact that any infinite subset of ℒ1 containing the language Odd cannot be learned from positive data alone (Proposition 2.4) and that any infinite subset of ℒ2 containing the language Odd ∪ {0} cannot be learned from negative data alone (symmetric version of Proposition 2.4).

Suppose by way of contradiction that ℒ ∈ NewSwEx as witnessed by M. Let Even denote the set N − Odd of even numbers. We then consider the following cases.

Case 1: There exists a way of answering the requests of M such that positive requests are answered by elements from Odd, negative requests are answered by elements from Even − {0}, and M makes infinitely many switches.
In this case, clearly M cannot distinguish between the cases of the input language being Odd and the input language being Odd ∪ {0}.

Case 2: Not Case 1. Let x0, x1, . . . , xk be an initial sequence of answers such that
• for i ≤ k, if ri = +, then xi ∈ Odd,
• for i ≤ k, if ri = −, then xi ∈ Even − {0},
• if the teacher is consistent with Odd and Odd ∪ {0}, then M does not make a further switch, that is, the following two conditions hold:
  ◦ if rk+1 = + and the teacher takes its future examples xk+1, xk+2, . . . from the set Odd, then rj = rk+1 for all j > k;
  ◦ if rk+1 = − and the teacher takes its future examples xk+1, xk+2, . . . from the set Even − {0}, then rj = rk+1 for all j > k.
Note that there exist such k, x0, x1, . . . , xk, since otherwise one could construct an infinite sequence of answers as needed for Case 1, by infinitely often extending a given sequence to force a switch by the learner – leading to infinitely many switches by the learner.

Case 2a: rk+1 = +.
In this case, M has to learn the set Odd and every set Odd − {2x + 1}, where 2x + 1 ∉ {x0, x1, . . . , xk}, from positive data. This is impossible by Proposition 2.4.

Case 2b: rk+1 = −.
This is similar to Case 2a. M needs to learn the set Odd and every set Odd ∪ {0, 2x}, where 2x ∉ {0, x0, x1, . . . , xk}, from negative data. Again this is impossible by the symmetric version of Proposition 2.4. □

The previous result completes the proof that all inclusions of the hierarchy TxtEx ⊂ BasicSwEx ⊂ NewSwEx ⊂ RestartSwEx are proper.


4. Counting the number of switches

Theorem 4.1 and Corollary 4.2 below show a hierarchy based on the number of switches allowed to the learner.

Theorem 4.1. For α ≻ β, NewSwαEx ⊈ RestartSwβEx.

Proof. Extend ≺ to Ords ∪ {−1} by letting −1 ≺ γ for every γ ∈ Ords. There is a computable function od from N to Ords ∪ {−1} such that
• for every γ ≼ α there are infinitely many x ∈ N such that od(x) = γ;
• there are infinitely many x ∈ N such that od(x) = −1;
• the set {(x, y) : od(x) ≺ od(y)} is recursive.

A set F = {x1, x2, . . . , xk} ⊆ N is α-admissible iff
• 0 < x1 < x2 < . . . < xk;
• α ≽ od(x1) ≻ od(x2) ≻ . . . ≻ od(xk) ≽ −1.

The empty set is also α-admissible. By definition no infinite set is α-admissible (also note that the second condition postulates a descending chain of ordinals, which is always finite). Let the class ℒα be defined by

    LF = {x : card({0, 1, . . . , x} ∩ F) is odd};
    ℒα = {LF : F is α-admissible}.

Note that the set L∅ is just ∅. Intuitively, for F = {x1, x2, x3, . . . , xk}, where 0 < x1 < x2 < . . . < xk, one can consider the set of natural numbers to be divided into blocks: Bi = {x ∈ N : xi ≤ x < xi+1}, for i ≤ k, where we take x0 = 0 and xk+1 = ∞. The odd blocks B2i+1 belong to LF and the even blocks B2i belong to L̄F.

Now we show that the class ℒα witnesses the separation.

Claim. ℒα ∈ NewSwαEx.

Proof of Claim. The machine M has variables n for the number of switches done so far, E for the finite set of examples seen after the last switch, mn for the maximal element seen so far, and γn for the value of the ordinal-counter after n switches. The initialization before stage 0 is E = ∅, n = 0, m0 = 0 and γ0 = α; maxordinals(Y) denotes the maximum element of a non-empty finite set Y of ordinals with respect to their ordering. Intuitively, for F = {x1, x2, . . . , xk}, the aim of the algorithm below is to eventually have mn ≥ xk−1 (without downcounting the ordinal counter below the ordinal 0). It will be shown later that mn together with data of the type opposite to that of mn is enough to identify the language LF. Go to stage 0.

Construction. Stage s (what is done when the sth example x is read).
(1) If n is even, request a positive example x; if n is odd, request a negative example x.
(2) If x ∉ {#, 0, 1, . . . , mn} and X = {y ≤ x : 0 ≼ od(y) ≺ γn} is not empty,
    then switch the data type by doing the following:
        reset E = ∅;
        update n = n + 1;
        let mn = x and γn = maxordinals({od(y) : y ∈ X});
    else let E = E ∪ {x} − {#}.
(3) If E ⊈ {0, 1, . . . , mn}, then let a be the least example outside the set {0, 1, . . . , mn} which has shown up after the nth switch; else let a = mn.
(4) If n is even and a = mn, then conjecture E;
    if n is even and a > mn, then conjecture E ∪ {a, a + 1, . . .};
    if n is odd and a = mn, then conjecture N − E;
    if n is odd and a > mn, then conjecture (N − E) ∩ {0, 1, . . . , a}.
(5) Go to stage s + 1.

It is clear that the ordinal is downcounted at every switch of the data presentation. Thus the ordinal bound on the number of switches is satisfied.

Assume that F is α-admissible, k = card(F), F = {x1, x2, . . . , xk}, and the input language is LF. Let Bi = {x ∈ N : xi ≤ x < xi+1}, for i ≤ k, where we take x0 = 0 and xk+1 = ∞.

Below let n denote the limiting value of n in the above algorithm. At every switch, M downcounts the ordinal from α through γ1, γ2, . . . to γn and thus keeps the ordinal bound. The values m0, m1, . . . , mn satisfy the condition that LF(mh) ≠ LF(mh+1), and they belong to different blocks Bi. Note that the values mh with odd h are positive examples and the values mh with even h are negative examples; m0 = 0 and thus m0 ∉ LF by definition. Thus, by the definition of LF it follows that m1 ≥ x1, m2 ≥ x2, . . . , mn ≥ xn. By induction, one can easily verify that γh ≽ od(xh), for h = 1, 2, . . . , n.

After making the nth switch, mn has the opposite type of information compared to the examples seen from then on.

Thus, if no information x > mn arrives after the nth switch, it follows that x0, x1, . . . , xk ≤ mn and thus every y, such that the type of information of y is opposite to the one of mn, will eventually belong to E. If n is even, then LF = E (in the limit) and the algorithm is correct. If n is odd, then LF = N − E (in the limit) and the algorithm is correct again.

If some x > mn arrives after the last switch, then one knows that M abstains from switching due to the fact that whenever an example x > mn arrives then, at step 2, X = ∅. Since γn ≽ od(xn) ≻ od(xn+1) and xn+1 ≤ x for any x > mn which arrives after the last switch, we must have that od(xn+1) = −1, and thus k = n + 1. Thus, the least example a > mn to show up satisfies a ≥ xk. Furthermore, every x ≥ a satisfies LF(x) = LF(a) and it is sufficient to know which of the x ≤ a are in LF and which are not in LF. This is determined in the limit and thus the sets conjectured by M are correct.

It is straightforward to ensure that M always outputs the same index for the same set and thus does not only semantically but also syntactically converge to an index of LF. □

Claim. If a RestartSwαEx-learner M starts with requesting a negative example first, then M cannot RestartSwαEx-learn the whole class ℒα.

Proof of Claim. Let data of type n be negative data if n is even and positive data if n is odd. So, for this claim, data of type n is what M requests after n switches. In the following, a set F is constructed such that M does not RestartSwαEx-learn LF.


Construction of F. The inductive construction starts with F = ∅, n = card(F) and M requesting examples of type n. There is a finite sequence σ0σ1 . . . σn defined inductively such that one of the following cases applies:

Switch: For some σn consisting of examples of type n for LF, M requests examples of type n after having received σ0σ1 . . . σn−1τ for all proper prefixes τ of σn, but requests an example of type n + 1 after having received σ0σ1 . . . σn−1σn.

LS: For some σn consisting of examples of type n for LF, σ0σ1 . . . σn is a locking-sequence for LF in the following sense:
• for every prefix τ of σn, M requests an example of type n after having received σ0σ1 . . . σn−1τ, and
• for every extension τ of σn consisting of examples of type n for LF, M requests an example of type n after having received σ0σ1 . . . σn−1τ, and
• for all extensions τ of σn consisting of examples of type n for LF, M conjectures LF as its output after having received σ0σ1 . . . σn−1τ.

Fail: There is a text T of data of type n for LF such that, for all τ ⊆ T, M on the sequence σ0σ1 . . . σn−1τ requests an example of type n. Furthermore, M on σ0 . . . σn−1T does not converge to a grammar for LF.

Note that the above cases are not mutually exclusive. Now the construction of F is continued as follows, based on the first case which applies:

Switch: After having seen σ0σ1 . . . σn, M downcounts the ordinal to a new value γ′ ≺ γ. Let xn+1 be such that
• od(xn+1) = γ′;
• xn+1 > y for all y ∈ F ∪ content(σ0σ1 . . . σn) ∪ {0}
and add xn+1 to F. Continue the construction with the next inductive step.

LS: Choose a number xn+1 such that
• od(xn+1) = −1;
• xn+1 > y for all y ∈ F ∪ content(σ0σ1 . . . σn) ∪ {0}
and complete the construction by adding xn+1 to F.

Fail: Leave F untouched and complete the construction.
End construction.

Verification. Note that in the inductive process, adding a number xn+1 to F never makes any previously given examples invalid; therefore it is legal to make these modifications during the construction. Furthermore, in the case that it is not possible to satisfy the case "Switch" in the construction at some stage n, one has that after having seen the example-sequence σ0σ1 . . . σn−1 (which is the empty sequence in the case n = 0), M requests only data of type n as long as it sees examples consistent with LF. Therefore, using the locking sequence argument as introduced by Blum and Blum [3], see also [9,20], either (I) there is a finite sequence σn of examples of type n for LF such that σ0σ1 . . . σn is a locking sequence for LF, that is, case "LS" holds, or (II) case "Fail" holds. So it is possible to do the inductive definition in every step.

As the sequence od(x1), od(x2), . . . is a falling sequence of ordinals, it must be finite and therefore the construction eventually ends in the cases "LS" or "Fail". In the case "Fail" it is clear that the F constructed gives an LF not learned by M.

If case "LS" holds, then let F′ = F − {xn+1}. Note that all y ≥ xn+1 are examples of type n for LF′, and LF and LF′ do not differ on any z ∈ {0, 1, . . . , xn+1 − 1}. Thus the information provided to M is consistent with both LF and LF′. It follows that, given any text T of type n for LF, M converges on σ0σ1 . . . σnT to an index of LF′ and thus does not learn LF. □

The first claim shows that ℒα is RestartSwαEx-learnable, while the second claim shows that such a learner cannot start by requesting a negative example first. Therefore, if M were a RestartSwβEx-learner for ℒα and β ≺ α, then M would have to start with requesting a positive example. However, then one could consider a new RestartSwαEx-learner M′ which first requests a negative example (without loss of generality assumed to be #), then switches to positive data, downcounts the ordinal from α to β, and from then on copies the behaviour of M with an empty prehistory. It would then follow that M can RestartSwβEx-learn ℒα iff the new learner M′ RestartSwαEx-learns ℒα and starts with requesting a negative example. However, this contradicts the second claim above. Thus no such M can exist, and the assertion that ℒα witnesses NewSwαEx ⊈ RestartSwβEx, for all β ≺ α, is proved. □

Corollary 4.2. Suppose α ≻ β. Then BasicSwαEx ⊈ RestartSwβEx; in particular:
(a) BasicSwαEx ⊈ BasicSwβEx.
(b) NewSwαEx ⊈ NewSwβEx.
(c) RestartSwαEx ⊈ RestartSwβEx.

Proof. The main idea is to use the cylindrification ℒα^cyl of the class ℒα from Theorem 4.1 in order to show that

    ℒα^cyl ∈ BasicSwαEx − RestartSwβEx.

Then (a), (b), and (c) follow immediately.

Let 〈·, ·〉 code pairs of natural numbers bijectively into natural numbers: 〈x, y〉 = ((x + y) · (x + y + 1)/2) + x. The cylindrification of a set L is then defined by L^cyl = {〈x, y〉 : x ∈ L, y ∈ N} and ℒα^cyl = {L^cyl : L ∈ ℒα}, where ℒα is as defined in Theorem 4.1.

Note that any text for L^cyl (respectively for its complement) is essentially a fat text for L (respectively for L̄). Therefore the fact ℒα ∈ NewSwαEx implies that ℒα^cyl ∈ BasicSwαEx by using Remark 2.8. On the other hand, ℒα^cyl ∉ RestartSwβEx since ℒα ∉ RestartSwβEx, again by Remark 2.8. □
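The coding in this proof is the standard Cantor-style pairing; the sketch below (our own illustration) implements 〈·, ·〉, its inverse, and the membership test for L^cyl.

    # Cantor-style pairing <x, y> = (x + y)(x + y + 1)/2 + x and the cylinder of L.

    def pair(x, y):
        return (x + y) * (x + y + 1) // 2 + x

    def in_cylinder(in_l, z):
        """z is in L^cyl iff z = <x, y> with x in L; decode x via the diagonal."""
        d = 0
        while (d + 1) * (d + 2) // 2 <= z:   # largest d with d(d+1)/2 <= z
            d += 1
        x = z - d * (d + 1) // 2
        return in_l(x)

    evens = lambda x: x % 2 == 0
    print([z for z in range(16) if in_cylinder(evens, z)])  # codes <x, y>, x even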

5. Conclusion

The starting point of the present work was the fact that there is a large gap between the data-presentation by a text and by an informant: a text gives only positive data while an informant gives complete information on the set to be learned. So notions of data presentation between these two extreme cases were proposed and the relations between them were investigated. The underlying idea of these notions is that the learner may switch between receiving positive and negative data, but these switches are either finite in number or may cause the loss of information.

For example, the BasicSwEx-learner can at every stage only follow one of the texts T+ and T− of positive and negative information on the set L to be learned and might therefore miss important information on the other side.

The results of the present work resolve all the relationships between the different switching criteria proposed in this paper. In particular it was established that the inclusion


TxtEx ⊂ BasicSwEx ⊂ NewSwEx ⊂ RestartSwEx

is everywhere proper. Furthermore, the notion RestartSwEx coincides with learning from informant. When we consider restricting the number of switches to meet an ordinal bound, RestartSwαEx coincides with NewSwαEx. The hierarchy induced by measuring the number of switches with recursive ordinals is proper.

In summary, the notion NewSwEx and its variants obtained by bounding the number of switches turned out to be the most natural definitions in the gap between TxtEx-learning and learning from informant. The notion of BasicSwEx-learning lies between TxtEx-learning and learning from informant, but has some strange side-effects: requiring some minimum number of switches may be more harmful than requiring no switches, as pointed out in Remark 2.9. On the other hand, RestartSwEx coincides with other notions, as mentioned above.

Note that these criteria differ from learning from negative open text as considered by Baliga et al. [2]; this is notion (b) from the introduction. Learning from open negative text is weaker than learning from informant and thus different from RestartSwEx. On the other hand, the class ℒfin,cofin and the class ℒ from Theorem 3.6 are both learnable from negative open text and so separate this notion from the other switching criteria mentioned in this paper.

There is an application of learning by switching type of information to the field of learning algebraic substructures of vector spaces. Harizanov and Stephan [12] investigated when it is possible to learn the class ℒ of all recursively enumerable subspaces of the space V∞/V. Here V∞ is the standard recursive vector space over the rationals with countably infinite dimension and V is a given recursively enumerable subspace of V∞. The space V∞/V is called k-thin iff there is a subspace W ∈ ℒ such that V/W is k-dimensional and, for every U ∈ ℒ, U is an infinite dimensional subspace of V∞/V iff W ⊆ U. While ℒ is TxtBC-learnable iff V∞/V is finite dimensional, ℒ is NewSwBC-learnable iff either ℒ is already TxtBC-learnable or V∞/V is 0-thin or 1-thin. InfBC-learning is much more powerful; it covers the case of all k-thin spaces, but there is no effective algebraic characterization of the spaces where ℒ is learnable. So NewSwBC-learning turned out to be the only notion where learnability of the class of recursively enumerable subspaces has an interesting and non-trivial algebraic characterization.

References

[1] D. Angluin, Learning regular sets from queries and counter-examples, Information and Computation 75 (1987) 87–106.
[2] G. Baliga, J. Case, S. Jain, Language learning with some negative information, Journal of Computer and System Sciences 51 (5) (1995) 273–285.
[3] L. Blum, M. Blum, Toward a mathematical theory of inductive inference, Information and Control 28 (1975) 125–155.
[4] J. Case, The power of vacillation in language learning, SIAM Journal on Computing 28 (6) (1999) 1941–1969.
[5] J. Case, S. Jain, M. Suraj, Not-so-nearly-minimal-size program inference, in: K. Jantke, S. Lange (Eds.), Algorithmic Learning for Knowledge-Based Systems, Lecture Notes in Artificial Intelligence, vol. 961, Springer, Berlin, 1995, pp. 77–99.
[6] J. Case, C. Lynes, Machine inductive inference and language identification, in: M. Nielsen, E.M. Schmidt (Eds.), Proceedings of the 9th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, vol. 140, Springer, Berlin, 1982, pp. 107–115.
[7] J. Case, C. Smith, Comparison of identification criteria for machine inductive inference, Theoretical Computer Science 25 (1983) 193–220.
[8] R. Freivalds, C. Smith, On the role of procrastination in machine learning, Information and Computation 107 (1993) 237–271.
[9] M. Fulk, Prudence and other conditions on formal language learning, Information and Computation 85 (1990) 1–11.
[10] W. Gasarch, M. Pleszkoch, Learning via queries to an oracle, in: R. Rivest, D. Haussler, M. Warmuth (Eds.), Proceedings of the Second Annual Workshop on Computational Learning Theory, Morgan Kaufmann, Los Altos, CA, 1989, pp. 214–229.
[11] E.M. Gold, Language identification in the limit, Information and Control 10 (1967) 447–474.
[12] V. Harizanov, F. Stephan, On the learnability of vector spaces, Forschungsberichte Mathematische Logik 55/2002, Mathematical Institute, University of Heidelberg, 2002.
[13] S. Jain, D. Osherson, J. Royer, A. Sharma, Systems that Learn: An Introduction to Learning Theory, second ed., MIT Press, Cambridge, MA, 1999.
[14] S. Jain, A. Sharma, On the non-existence of maximal inference degrees for language identification, Information Processing Letters 47 (1993) 81–88.
[15] S. Jain, W. Menzel, F. Stephan, Classes with easily learnable subclasses, in: N. Cesa-Bianchi, M. Numao, R. Reischuk (Eds.), Algorithmic Learning Theory: Proceedings of the 13th International Conference, Lecture Notes in Artificial Intelligence, vol. 2533, Springer, Berlin, 2002, pp. 218–232.
[16] S. Kleene, Notations for ordinal numbers, The Journal of Symbolic Logic 3 (1938) 150–155.
[17] M. Machtey, P. Young, An Introduction to the General Theory of Algorithms, North-Holland, New York, 1978.
[18] W. Merkle, F. Stephan, Refuting learning revisited, Theoretical Computer Science 298 (2003) 145–177.
[19] T. Motoki, Inductive inference from all positive and some negative data, Information Processing Letters 39 (4) (1991) 177–182.
[20] D. Osherson, M. Stob, S. Weinstein, Systems that Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists, MIT Press, Cambridge, MA, 1986.
[21] H. Rogers, Theory of Recursive Functions and Effective Computability, McGraw-Hill, New York, 1967 (reprinted, MIT Press, 1987).
[22] G.E. Sacks, Higher Recursion Theory, Springer, Berlin, 1990.
[23] R. Wiehagen, Identification of formal languages, in: Mathematical Foundations of Computer Science, Lecture Notes in Computer Science, vol. 53, Springer, Berlin, 1977, pp. 571–579.

