
[IEEE 2006 Canadian Conference on Electrical and Computer Engineering - Ottawa, ON, Canada...

Page 1: [IEEE 2006 Canadian Conference on Electrical and Computer Engineering - Ottawa, ON, Canada (2006.05.7-2006.05.10)] 2006 Canadian Conference on Electrical and Computer Engineering -

MACHINE LEARNING FOR CLASSIFYING LEARNING OBJECTS

Girish R Ranganathan
University of New Brunswick, Canada
email: [email protected]

Yevgen Biletskiy
University of New Brunswick, Canada
email: [email protected]

Dawn MacIsaac
University of New Brunswick, Canada
email: [email protected]

Abstract

Building an ontology for learning objects can be useful for translating such objects between learning contexts. Such translations are important because they afford learners and educators the opportunity to survey a wide selection of learning and teaching material. For instance, university instructors are sometimes required to assess curriculum from courses delivered by other programs or universities, even internationally. Often, the only learning object available to do so is the course outline made available in HTML format on a web page. Generally there is an abundance of metadata available from such learning objects, and this information can be used to generate useful components of the ontology. Other useful information can be derived by first establishing the domain of the object, Electricity and Computing for instance, or possibly History. Once extracted, the information representing learning objects can be stored as elements in an XML template. The purpose of this work was to develop and implement a machine learning strategy for classifying course outlines into pre-defined domains and sub-domains in order to provide this information to an ontology repository designed to aid in the translation of such objects. First, some typical domains were identified. Then, 20-30 course outlines were chosen to represent each sub-domain. Next, frequency tables of words common to the course outlines for a given sub-domain were generated in order to compile an ordered list of synonyms used to represent the sub-domains. Finally, a new set of course outlines was randomly selected for classification based on an analysis of the synonym content of each. Establishing the frequency tables and completing the synonym analysis were completely automated, thereby constituting the machine learning strategy.

Keywords. e-Learning; ontology; learning object; LOM; clustering; learner's profile.

1. Introduction

The wide proliferation of Internet and e-Learning technologies gives users access to a large number of electronic learning documents (learning objects) generated world-wide. Due to the heterogeneity of learning objects built within different cultural contexts, the use of these learning objects by learners is often ineffective. The present research is related to the development of a delivery technology that improves the delivery of semantically heterogeneous learning objects through adaptation of their contents to the specific cultural contexts of learners [1, 2]. The work presented in this paper is devoted to the initial phase of the proposed delivery technology: information extraction from learning objects. Once the content of a learning object is extracted and stored in a meaningful format (i.e. XML-based), it can be converted to any learner's context using automatic and semiautomatic conversion procedures.

1-4244-0038-4 2006 IEEE CCECE/CCGEI, Ottawa, May 2006

The particular focus of the present work is on information extraction and classification of learning objects from the domain of Electrical and Computer Engineering (ECE). The proposed classification of learning objects into sub-domains is important for enhancing automatic and semiautomatic conversion procedures.

2. Information Extraction from Learning Objects

There are a variety of learning objects, such as university curricula, course outlines, transcripts, and calendars, which are rich in metadata and common attributes. The contents of these objects can be extracted and stored by attribute in predefined XML templates for use in context conversion procedures. There are various commercial tools for information extraction and conversion to XML, such as Altova's XML Spy, CambridgeDocs' xDoc XML, Any2XML, Republica's X-Fetch Wrapper, Exegenix Conversion Solutions (ECS), Texterity's Text Cafe, GATE (General Architecture for Text Engineering), IBM's UIMA (Unstructured Information Management Architecture) SDK, etc. However, most of these products require significant manual effort in script generation to tailor the resulting XML document to its potential uses. Our work proposes to reduce the amount of manual work while still creating meaningful XML documents.

For now, our focus is restricted to semi-structured university course outlines, but it may later be broadened to structured and even unstructured documents. The present work is focused on the extraction of the content and metadata from course outlines, which are HTML documents stored on the respective ECE course websites. The targeted generic XML template will meaningfully represent both structural and contextual information from the original HTML document.

There are three general methods by which information extraction can be accomplished: manual extraction (writing custom code for each new type of document), which is labour intensive; wrapper learning (manually setting anchors within example documents to generate a set of rules for extraction); and fully automatic extraction (pattern recognition) [3]. In this work we used pattern recognition and machine learning techniques.
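As a rough illustration of the step that precedes any automatic analysis of an HTML course outline, the following Python sketch reduces a page to its visible text. It is not the authors' extraction method (their approach is reported in [6]); the names `OutlineTextExtractor` and `extract_text` and the sample page are hypothetical, and only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class OutlineTextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = OutlineTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical course outline page, for illustration only.
sample_html = """
<html><head><title>Sample course</title><style>p {color: red;}</style></head>
<body><h1>Electric Circuits</h1><p>Topics: resistors, capacitors.</p></body></html>
"""
outline_text = extract_text(sample_html)
print(outline_text)
```

The resulting plain text is the kind of input from which keyword frequencies could then be counted; mapping the text onto the attributes of the XML template is a separate step that this sketch does not attempt.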


Our objective was to extract the document's content and metadata and store them in a meaningful XML document with a pre-defined template, which is presented in Appendix A. The proposed template adopts the IEEE LOM [5] and CanCore [4] standards, with major changes in order to reduce the attributes to the most useful ones and to include the learning object sub-domain classification.

Most of the information extraction approach has been previously reported in [6]. The present paper is devoted to one step of forming the LOM: the classification of learning objects. Information regarding the sub-domain to which a learning object belongs may be helpful in establishing an efficient transformation between learning contexts.

3. Classification of Learning Objects

The proposed classification methodology adopts a widely known clustering technique based on minimizing the mean squared error distance from a learning object to the center of a cluster. The proposed approach uses machine learning and consists of the following six steps.

(1) We identified five sub-domains/sub-classes of the ECE domain/cluster (Electrical Circuits, Electronics, Digital Systems, Communications and Software Engineering). The number of sub-domains can certainly be larger, but we limit it here in order to test the approach.

(2) We then manually selected 15-20 course outlines for each sub-domain to form a corpus of domain-specific training examples as well as five corpora of sub-domain-specific training examples.

(3) These training examples were analyzed and filtered automatically, and the list of the most commonly occurring keywords $\{k_i\}$, together with their frequencies of occurrence, was obtained for each sub-domain $\{n_{ji}\}$ and for the whole domain $\{n_i\}$. We thus created a library of keyword frequency tables for each sub-domain and for the domain.

(4) Because the number of keywords is different in the whole domain and in the sub-domains, these frequencies have been normalized as follows:

$$\lambda_j = \frac{N}{N_j} \tag{1}$$

$$N = \sum_{i=1}^{m} n_i; \qquad N_j = \sum_{i=1}^{m} n_{ji} \tag{2}$$

where
$\lambda_j$ - normalization coefficient for each sub-domain $j$;
$n_i$ - frequency of occurrence of the $i$-th keyword in the domain corpus;
$n_{ji}$ - frequency of occurrence of the $i$-th keyword in the $j$-th sub-domain corpus;
$i$ - a keyword index in the list of keywords;
$j$ - a sub-domain index (range [1,5] in our case);
$m$ - number of keywords in the lists;
$N_j$ - total number of keywords in the $j$-th sub-domain corpus;
$N$ - total number of keywords in the domain corpus.

Thus, we obtain a library of normalized frequency tables of keywords for each sub-domain and the domain.

(5) Then we arbitrarily selected course outlines and built keyword frequency tables from them in the same automated way. The frequency table from an arbitrary document is also normalized:

$$\lambda_a = \frac{N}{N_a} \tag{3}$$

where
$\lambda_a$ - normalization coefficient for an arbitrary document $a$;
$N_a$ - total number of keywords in the arbitrary document $a$.

(6) Finally, the distance between the arbitrarily selected course outline and each sub-domain is calculated using the mean squared error distance:

$$d_j = \sum_{i=1}^{m} (\lambda_j \times n_{ji} - \lambda_a \times n_{ai})^2 \tag{4}$$

where $n_{ai}$ - frequency of occurrence of the $i$-th keyword in the arbitrary document $a$.

So, the shortest distance $d_j$ identifies the $j$-th sub-domain as the sub-class of the learning object in the ECE domain.
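The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the function names, the four keywords, and all counts are hypothetical, only two of the five sub-domains are shown, and the domain frequencies are assumed to be the sum of the sub-domain frequencies.

```python
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Steps 3 and 5: frequency of each listed keyword in a text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [words[k] for k in keywords]


def normalization_coefficient(domain_total, counts):
    """Eqs. (1) and (3): lambda = N / N_x, with N_x the sum of the counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no keywords found; cannot normalize")
    return domain_total / total


def distance(sub_counts, doc_counts, domain_total):
    """Eq. (4): sum of squared differences of normalized frequencies."""
    lam_j = normalization_coefficient(domain_total, sub_counts)
    lam_a = normalization_coefficient(domain_total, doc_counts)
    return sum((lam_j * n_ji - lam_a * n_ai) ** 2
               for n_ji, n_ai in zip(sub_counts, doc_counts))


def classify(doc_counts, sub_domain_tables, domain_total):
    """Step 6: the sub-domain with the shortest distance d_j."""
    return min(sub_domain_tables,
               key=lambda name: distance(sub_domain_tables[name],
                                         doc_counts, domain_total))


# Hypothetical keyword list and training frequency tables (n_ji).
keywords = ["circuit", "voltage", "software", "testing"]
tables = {
    "Electrical Circuits": [40, 30, 2, 3],
    "Software Engineering": [1, 2, 35, 25],
}
# Assumption: the domain corpus is the union of the sub-domain corpora,
# so n_i is the column sum of the sub-domain tables and N is its total.
domain_counts = [sum(column) for column in zip(*tables.values())]
N = sum(domain_counts)

outline = "Circuit analysis: voltage, current, circuit theorems and voltage dividers."
doc_counts = keyword_counts(outline, keywords)
label = classify(doc_counts, tables, N)
print(doc_counts, label)
```

Because both the sub-domain table and the document table are scaled to the same total N, the comparison depends on the relative keyword distribution rather than on the length of the outline.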

4. Evaluation

The approach has been implemented as a Java program. To evaluate the developed system, we considered about 20 arbitrary course outlines from each of the five sub-domains (for example, 21 arbitrary course outlines from the sub-domain "Communications") and determined whether each was assigned to its own or to another sub-domain. The results are given in Table 1. The least successful sub-domain is "Digital Systems" because of its overlap with other sub-domains.

Table 1. Effectiveness of classifying course outlines

[Table body garbled in this copy. Legible entries: Electronics, 66.67% classified correctly; Communications, 12 of 21 outlines, 57.14%.]
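The percentages in Table 1 appear to be per-sub-domain success rates. As a small worked check, assuming the legible Communications row means that 12 of the 21 test outlines were assigned to the correct sub-domain, the rate can be recomputed as follows (the function name `success_rate` is hypothetical):

```python
def success_rate(correct, total):
    """Percentage of outlines assigned to their own sub-domain."""
    return round(100.0 * correct / total, 2)


# Assumed reading of the Communications row: 12 correct out of 21.
communications_rate = success_rate(12, 21)
print(communications_rate)  # 57.14
```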


5. Conclusions

This work presented a machine learning process for the classification of learning objects into their respective domains and sub-domains, which plays a significant role in delivering the learning objects to learners' contexts. For this, we selected 15-20 course outlines in each of a number of sub-domains in one domain and created a corpus for each of the sub-domains and one for the whole domain. We then generated the keywords for each corpus and normalized their frequencies. Next, we selected some arbitrary course outlines and repeated the above procedure for them. The distance of each arbitrary course outline from each of the sub-domains was then calculated using the mean squared error distance, as explained above. Finally, based on this distance, we determined the sub-domain to which the arbitrary course outline belongs.

Future work will be devoted to improving the efficiency of the developed system by using more information than just the course description for classification, by increasing the vocabulary of common words, and by considering their various forms.

References

[1] Biletskiy Y., Vorochek O., Medovoy A. Building Ontologies for Interoperability Among Learning Objects and Learners, Lecture Notes in Computer Science, Springer-Verlag, vol. 3029/2004, pp. 977-986.

[2] Yevgen Biletskiy, Harold Boley, Luqian Zhu. A RuleML-Based Ontology for Interoperation between Learning Objects and Learners. UCFV Research Review, Issue 1, 2006. Available: http://journals.ucfv.ca/ojs/rr/article-PDFs/biletskiy-boley-zhu.pdf.

[3] Bing Liu, Kevin Chen-Chuan Chang. (2004) Editorial. Special Issue on Web Content Mining. SIGKDD Explorations, 6(2), pp. 1-4.

[4] CanCore: Canadian Core Learning Resource Metadata Application Profile. Available: http://www.cancore.ca.

[5] IEEE 1484.12.1: Standard for Learning Object Metadata. Piscataway, NJ, 2002.

[6] Yevgen Biletskiy, Tim Scribner. Conversion of Learning Objects to Meaningful XML. The 8th IASTED International Conference on Communication, Internet and Information Technology, Cambridge, MA, USA, 2005.

APPENDIX A: XML TEMPLATE AND ITS GRAPHICAL REPRESENTATION

The XML template used to store the extracted course outlines is given as follows (its graphical representation is given in Fig. 1).

<document>
  <properties>
    <title></title>
    <author></author>
    <lastAuthor></lastAuthor>
    <created></created>
    <pages></pages>
    <words></words>
    <domain></domain>
    <sub-domain></sub-domain>
    <characters></characters>
    <company></company>
    <version></version>
    <language></language>
  </properties>
  <structure>
    <course>
      <courseName>
        <courseCode></courseCode>
        <courseTitle></courseTitle>
      </courseName>
      <courseCredit>
        <ch></ch>
        <wh></wh>
      </courseCredit>
      <courseTeacher>
        <teacherName></teacherName>
        <teacherLocation></teacherLocation>
        <teacherContacts>
          <teacherEmail></teacherEmail>
          <teacherPhone></teacherPhone>
          <teacherUrl></teacherUrl>
        </teacherContacts>
      </courseTeacher>
      <courseResources>
        <courseTextbook>
          <textAuthor></textAuthor>
          <textTitle></textTitle>
        </courseTextbook>
        <courseUrl></courseUrl>
        <coursePrerequisite></coursePrerequisite>
        <courseDescription></courseDescription>
      </courseResources>
      <courseLecture>
        <lectureLocation></lectureLocation>
        <lectureTime></lectureTime>
      </courseLecture>
      <courseLab>
        <labLocation></labLocation>
        <labTime></labTime>
      </courseLab>
      <courseSchedule>
        <scheduleTopic>
          <topicName></topicName>
          <topicText></topicText>
          <topicWeek></topicWeek>
        </scheduleTopic>
      </courseSchedule>
    </course>
  </structure>
</document>
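To show how a classification result might be written into this template, the following Python sketch fills a small subset of its elements, including `<domain>` and `<sub-domain>`, using the standard-library `xml.etree.ElementTree` module. The function name `build_outline_record` and all field values are hypothetical; the authors' own system was implemented in Java, and the remaining template elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_outline_record(title, domain, sub_domain, course_code, course_title):
    """Build a partial <document> record following the template's nesting."""
    document = ET.Element("document")

    properties = ET.SubElement(document, "properties")
    ET.SubElement(properties, "title").text = title
    ET.SubElement(properties, "domain").text = domain
    ET.SubElement(properties, "sub-domain").text = sub_domain

    structure = ET.SubElement(document, "structure")
    course = ET.SubElement(structure, "course")
    course_name = ET.SubElement(course, "courseName")
    ET.SubElement(course_name, "courseCode").text = course_code
    ET.SubElement(course_name, "courseTitle").text = course_title
    return document


# Hypothetical values, for illustration only.
record = build_outline_record(
    title="Sample course outline",
    domain="Electrical and Computer Engineering",
    sub_domain="Electronics",
    course_code="ECE 0000",
    course_title="Sample Electronics Course",
)
print(ET.tostring(record, encoding="unicode"))
```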


Figure 1. Template for information extraction from course outlines
