[IEEE 2006 Canadian Conference on Electrical and Computer Engineering - Ottawa, ON, Canada...

MACHINE LEARNING FOR CLASSIFYING LEARNING OBJECTS

Girish R Ranganathan
University of New Brunswick, Canada
email: fl59h@unb.ca

Yevgen Biletskiy
University of New Brunswick, Canada
email: biletski@unb.ca

Dawn MacIsaac
University of New Brunswick, Canada
email: dmac@unb.ca

Abstract

Building an ontology for learning objects can be useful for translating such objects between learning contexts. Such translations are important because they afford learners and educators the opportunity to survey a wide selection of learning and teaching material. For instance, university instructors are sometimes required to assess curriculum from courses delivered by other programs or universities, even internationally. Often, the only learning object available to do so is the course outline made available in HTML format on a web page. Generally there is an abundance of metadata available from such learning objects, and this information can be used to generate useful components of the ontology. Other useful information can be derived by first establishing the domain of the object, Electricity and Computing for instance, or possibly History. Once extracted, the information representing learning objects can be stored as elements in an XML template. The purpose of this work was to develop and implement a machine learning strategy for classifying course outlines into pre-defined domains and sub-domains in order to provide this information to an ontology repository designed to aid in the translation of such objects. First, some typical domains were identified. Then, 20-30 course outlines were chosen to represent each sub-domain. Next, frequency tables of words common to the course outlines for a given sub-domain were generated in order to compile an ordered list of synonyms used to represent the sub-domains. Finally, a new set of course outlines was randomly selected for classification based on an analysis of the synonym content of each. Establishing the frequency tables and completing the synonym analysis were automated completely, thereby constituting the machine learning strategy.

Keywords. e-Learning; ontology; learning object; LOM; clustering; learner's profile.

1. Introduction

The wide proliferation of Internet and e-Learning technologies gives users access to a large number of electronic learning documents (learning objects) generated world-wide. Because learning objects are heterogeneous and built within different cultural contexts, their use by learners is often ineffective. The present research is related to the development of a delivery technology that improves the delivery of semantically heterogeneous learning objects by adapting their contents to the specific cultural contexts of learners [1, 2]. The work presented in this paper is devoted to the initial phase of the proposed delivery technology: information extraction from learning objects. Once the content of a learning object is extracted and stored in a meaningful format (i.e. XML-based), it can be converted to any learner's context using automatic and semiautomatic conversion procedures.

1-4244-0038-4 2006 IEEE CCECE/CCGEI, Ottawa, May 2006

The particular focus of the present work is on information extraction and classification of learning objects from the domain of Electrical and Computer Engineering (ECE). The proposed classification of learning objects into sub-domains is important for enhancing automatic and semiautomatic conversion procedures.

2. Information Extraction from Learning Objects

There is a variety of learning objects, such as university curricula, course outlines, transcripts, and calendars, which are rich in metadata and common attributes. The contents of these objects can be extracted and stored by attribute in predefined XML templates for use in context conversion procedures. There are various commercial tools for information extraction and conversion to XML, such as Altova's XML Spy, CambridgeDocs' xDoc XML, Any2XML, Republica's X-Fetch Wrapper, Exegenix Conversion Solutions (ECS), Texterity's Text Cafe, GATE (General Architecture for Text Engineering), IBM's UIMA (Unstructured Information Management Architecture) SDK, etc. However, most of these products require significant manual effort in script generation to tailor the resulting XML document to its potential uses. Our work proposes to reduce the amount of manual work while still creating meaningful XML documents.

For now, our focus is restricted to semi-structured university course outlines, but it may later be extended to other kinds of learning objects and to structured and even unstructured documents. The present work is focused on the extraction of the content and metadata from course outlines, which are HTML documents stored on the respective ECE course websites. The targeted generic XML template will meaningfully represent both structural and contextual information from the original HTML document.

There are three general methods by which information extraction can be accomplished: manual extraction (writing custom code for each new type of document), which is labour intensive; wrapper learning (manually setting anchors within example documents to generate a set of rules for extraction); and fully automatic extraction (pattern recognition) [3]. In this work we used pattern recognition and machine learning techniques.

Our objective was to extract the document's content and metadata and store them in a meaningful XML document with a pre-defined template, which is presented in Appendix A. The proposed template adopts the IEEE LOM [5] and CanCore [4] standards with major changes in order to reduce the attributes to the most useful ones and to include the learning object sub-domain classification.

Most of the information extraction approach has been previously reported in [6]. The present paper is devoted to one step in forming the LOM: the classification of learning objects. Information regarding the sub-domain to which a learning object belongs may be helpful in establishing an efficient transformation between learning contexts.
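As a rough illustration of the first stage only, the following Python sketch pulls the visible text out of an HTML course outline so that it can later be tokenized into keywords. It is not the extraction method reported in [6]; the class name `TextExtractor`, the function `html_to_text`, and the sample page are hypothetical, and only the standard-library `html.parser` module is assumed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up course outline page, used only to exercise the sketch.
SAMPLE_PAGE = (
    "<html><head><title>ECE 101</title><style>p{color:red}</style></head>"
    "<body><h1>Electric Circuits</h1><p>Prerequisite: MATH 100</p></body></html>"
)
print(html_to_text(SAMPLE_PAGE))
```

A real course outline would also need its structure (headings, tables, lists) mapped onto template attributes, which this sketch does not attempt.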

3. Classification of Learning Objects

The proposed classification methodology adopts a widely known clustering technique based on minimizing the mean squared error distance from a learning object to the center of a cluster. The proposed approach uses machine learning and consists of the following six steps.

(1) We identified five sub-domains (sub-classes) of the ECE domain (cluster): Electrical Circuits, Electronics, Digital Systems, Communications, and Software Engineering. The number of sub-domains could certainly be larger, but we limited it in order to validate the approach.

(2) We then manually selected 15-20 course outlines for each sub-domain to form a corpus of domain-specific training examples as well as five corpora of sub-domain-specific training examples.

(3) These training examples were analyzed by machine learning and filtered, and a list of the most commonly occurring keywords $\{k_i\}$ with their frequencies of occurrence was obtained for each sub-domain, $\{n_{ji}\}$, and for the whole domain, $\{n_i\}$. We thus created a library of keyword frequency tables for each sub-domain and for the domain.

(4) Because the number of keywords differs between the whole domain and the sub-domains, these frequencies were normalized as follows:

$$\lambda_j = \frac{N}{N_j} = \frac{\sum_{i=1}^{m} n_i}{\sum_{i=1}^{m} n_{ji}};$$

where $\lambda_j$ - normalization coefficient for each sub-domain $j$;
$n_i$ - frequency of occurrence of the i-th keyword in the domain corpus;
$n_{ji}$ - frequency of occurrence of the i-th keyword in the j-th sub-domain corpus;
$i$ - keyword index in the list of keywords;
$j$ - sub-domain index (range [1,5] in our case);
$m$ - number of keywords in the lists;
$N_j$ - total number of keywords in the j-th sub-domain corpus;
$N$ - total number of keywords in the domain corpus.

Thus, we obtain a library of normalized frequency tables of keywords for each sub-domain and the domain.

(5) Then we arbitrarily selected course outlines, and keyword frequency tables were built from them automatically by the same machine learning procedure. The frequency table from an arbitrary document is also normalized:

$$\lambda_a = \frac{N}{N_a}; \qquad (3)$$

where $\lambda_a$ - normalization coefficient for an arbitrary document $a$;
$N_a$ - total number of keywords in the arbitrary document $a$.

(6) Finally, the distance between the arbitrarily selected course outline and each sub-domain is calculated using the mean squared error distance:

$$d_j = \sum_{i=1}^{m} \left(\lambda_j \times n_{ji} - \lambda_a \times n_{ai}\right)^2;$$

where $n_{ai}$ - frequency of occurrence of the i-th keyword in the arbitrary document $a$.

Thus, the shortest distance $d_j$ identifies the j-th sub-domain as the class of the learning object in the ECE domain.
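Steps (3)-(6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java implementation evaluated in this paper: the tokenizer, the keyword list, the frequency counts, and all function names are made up for the example, and the domain table is assumed to be the element-wise sum of the sub-domain tables.

```python
import re
from collections import Counter


def keyword_frequencies(text, keywords):
    """Count occurrences of each keyword in the text (n_i, n_ji or n_ai)."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenizer
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


def normalization_coefficient(domain_freqs, freqs):
    """lambda = N / N_x, with N and N_x the sums of the two frequency lists."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("no keywords found, cannot normalize")
    return sum(domain_freqs) / total


def distance(domain_freqs, sub_freqs, doc_freqs):
    """Sum over i of (lambda_j * n_ji - lambda_a * n_ai) squared."""
    lam_j = normalization_coefficient(domain_freqs, sub_freqs)
    lam_a = normalization_coefficient(domain_freqs, doc_freqs)
    return sum((lam_j * n_ji - lam_a * n_ai) ** 2
               for n_ji, n_ai in zip(sub_freqs, doc_freqs))


def classify(domain_freqs, sub_tables, doc_freqs):
    """Return the name of the sub-domain with the shortest distance d_j."""
    return min(sub_tables,
               key=lambda name: distance(domain_freqs, sub_tables[name], doc_freqs))


# Made-up keyword list and frequency tables, for illustration only.
KEYWORDS = ["circuit", "transistor", "protocol"]
SUB_TABLES = {"Electronics": [2, 6, 0], "Communications": [1, 0, 7]}
# Assumption: the domain corpus is the union of the sub-domain corpora.
DOMAIN_FREQS = [sum(column) for column in zip(*SUB_TABLES.values())]

outline = "The transistor amplifier circuit uses one transistor per stage."
doc_freqs = keyword_frequencies(outline, KEYWORDS)
print(classify(DOMAIN_FREQS, SUB_TABLES, doc_freqs))  # -> Electronics
```

In this toy case the outline lies at distance 32/9 from Electronics and at roughly 320.9 from Communications, so Electronics is chosen.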

4. Evaluation

The approach has been implemented as a Java program. To evaluate the developed system, we considered about 20 arbitrary course outlines from each of the five sub-domains (for example, 21 arbitrary course outlines from the sub-domain "Communications") and determined whether each outline was assigned to its own sub-domain or to another one. The results are given in Table 1. The least successful sub-domain is "Digital Systems" because of its overlap with other sub-domains.

Table 1. Effectiveness of classifying course outlines

[Table body not recoverable. Legible entries: Electronics, 12 outlines classified correctly, 66.67%; Communications, 12 of 21 outlines classified correctly, 57.14%.]
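The effectiveness figures in Table 1 appear to be the percentage of outlines from a sub-domain that were assigned to that same sub-domain; the legible Communications figures (12 of 21, 57.14%) are consistent with this reading. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def effectiveness(correct, total):
    """Percentage of outlines assigned to their own sub-domain, to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


# Communications: 12 of the 21 outlines were classified correctly.
print(effectiveness(12, 21))  # -> 57.14
```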

5. Conclusions

This work presented a machine learning process for classifying learning objects into their respective domains and sub-domains, which plays a significant role in delivering learning objects to learners' contexts. For this, we selected 15-20 course outlines in each of a number of sub-domains of one domain and created a corpus for each sub-domain and one for the whole domain. We then generated the keywords for each corpus and normalized their frequencies. Next, we selected some arbitrary course outlines and repeated the above procedure for them. The distance of each arbitrary course outline from each sub-domain was then calculated using the mean squared error distance, as explained above. Finally, based on this distance, we determined the sub-domain to which the arbitrary course outline belongs.

Future work will be devoted to improving the efficiency of the developed system by using more information than just the course description for classification, by enlarging the vocabulary of common words, and by considering their various forms.

References

[1] Biletskiy Y., Vorochek O., Medovoy A. Building Ontologies for Interoperability Among Learning Objects and Learners, Lecture Notes in Computer Science, Springer-Verlag, vol. 3029/2004, pp. 977-986.

[2] Yevgen Biletskiy, Harold Boley, Luqian Zhu. A RuleML-Based Ontology for Interoperation between Learning Objects and Learners. UCFV Research Review, Issue 1, 2006. Available: http://journals.ucfv.ca/ojs/rr/article-PDFs/biletskiy-boley-zhu.pdf.

[3] Bing Liu, Kevin Chen-Chuan Chang. (2004) Editorial. Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), pp. 1-4.

[4] CanCore: Canadian Core Learning Resource Metadata Application Profile. Available: http://www.cancore.ca.

[5] IEEE 1484.12.1: Standard for Learning Object Metadata. Piscataway, NJ, 2002.

[6] Yevgen Biletskiy, Tim Scribner. Conversion of Learning Objects to Meaningful XML. The 8-th IASTED International Conference on Communication, Internet and Information Technology, Cambridge, MA, USA, 2005.

APPENDIX A: XML TEMPLATE AND ITS GRAPHICAL REPRESENTATION

The XML template used to store the extracted course outlines is given as follows (its graphical representation is given in Fig. 1).

<document>
  <properties></properties>
  <structure>
    <course>
      <courseName></courseName>
      <courseCredit></courseCredit>
      <courseTeacher>
        <teacherContacts></teacherContacts>
      </courseTeacher>
      <courseResources>
        <courseTextbook></courseTextbook>
        <courseUrl></courseUrl>
        <coursePrerequisite></coursePrerequisite>
        <courseDescription></courseDescription>
      </courseResources>
      <courseLecture></courseLecture>
      <courseLab></courseLab>
      <courseSchedule>
        <scheduleTopic></scheduleTopic>
      </courseSchedule>
    </course>
  </structure>
</document>

Figure 1. Template for information extraction from course outlines
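As an illustration of how extracted values might be written into this template, the following Python sketch builds the document with the standard-library `xml.etree.ElementTree` module. Only element names that appear in the template text are used; the nesting is an assumption where the source text is incomplete, and the function name, the layout table, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Children of <course>, each mapped to its own child elements (assumed nesting).
COURSE_LAYOUT = {
    "courseName": [],
    "courseCredit": [],
    "courseTeacher": ["teacherContacts"],
    "courseResources": ["courseTextbook", "courseUrl",
                        "coursePrerequisite", "courseDescription"],
    "courseLecture": [],
    "courseLab": [],
    "courseSchedule": ["scheduleTopic"],
}


def build_course_document(values):
    """Build the template tree, filling leaf elements from the values dict."""
    document = ET.Element("document")
    ET.SubElement(document, "properties")
    structure = ET.SubElement(document, "structure")
    course = ET.SubElement(structure, "course")
    for parent_tag, child_tags in COURSE_LAYOUT.items():
        parent = ET.SubElement(course, parent_tag)
        if not child_tags:
            # Leaf element: store the extracted value, or leave it empty.
            parent.text = values.get(parent_tag, "")
        for child_tag in child_tags:
            ET.SubElement(parent, child_tag).text = values.get(child_tag, "")
    return document


# Made-up values standing in for content extracted from a course outline.
doc = build_course_document({
    "courseName": "Digital Systems",
    "courseUrl": "http://example.org/ece",
})
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Attributes missing from a given outline are simply left empty, so every stored document keeps the same shape.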
