
Using Discretization and Bayesian Inference Network Learning for Automatic Filtering Profile Generation

Authors: Wai Lam and Kon Fan Low

Announcer: Kyu-Baek Hwang

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

Information Filtering

The Filtering Profile

An information filtering system deals with users who have a relatively stable, long-term information need.

An information need is usually represented by a filtering profile.

Construction of the Filtering Profile

- Collect training data through interactions with users. Ex) gathering user feedback about relevance judgments for a certain information need or topic.
- Analyze this kind of training data and construct the filtering profile with machine learning techniques.
- Use this filtering profile to determine the relevance of a new document.

The Uncertainty Issue

It is difficult to specify absolutely whether a document is relevant to a topic, as it may only partially match the topic. Ex) “the economic policy of government”

The probabilistic approach is appropriate for this kind of task.

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

An Overview of the Approach

- Transformation of each document into an internal form
- Feature selection
- Discretization of the feature values
- Gathering training data by interactions with users
- Bayesian network learning

These steps are carried out for each topic.

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

Document Representation

All stop words are eliminated. Ex) “the”, “are”, “and”, etc.

Stemming of the remaining words. Ex) “looks” → “look”, “looking” → “look”, etc.

A document is represented in vector form. Each element in the vector is either the word frequency or the word weight. The word weight is calculated as follows:

$$w_i = f_i \log \frac{N}{n_i}$$

where $f_i$ is the frequency of term $i$ in the document, $N$ is the total number of documents, and $n_i$ is the number of documents that contain term $i$.
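To make the weighting concrete, here is a minimal Python sketch of this scheme; the function and argument names are illustrative, not from the paper.

```python
import math
from collections import Counter

def word_weights(doc_tokens, doc_freq, num_docs):
    """Compute w_i = f_i * log(N / n_i) for the terms of one document.

    doc_tokens: stemmed, stop-word-free tokens of the document
    doc_freq:   term -> n_i, the number of documents containing the term
    num_docs:   N, the total number of documents in the collection
    """
    tf = Counter(doc_tokens)  # f_i: within-document term frequency
    # Natural log; the base only rescales all weights uniformly.
    return {term: f * math.log(num_docs / doc_freq[term])
            for term, f in tf.items() if doc_freq.get(term)}
```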

Word Frequency Representation of a Document

Term id   Term      Frequency
21        gover     3
17        annouc    1
98        take      3
34        student   4
…         …         …

Feature Selection

The expected mutual information measure is given as

$$I(W_i, C_j) = \sum_{b \in \{0,1\}} \left[ P(W_i = b, C_j) \log \frac{P(W_i = b, C_j)}{P(W_i = b)\,P(C_j)} + P(W_i = b, \tilde{C}_j) \log \frac{P(W_i = b, \tilde{C}_j)}{P(W_i = b)\,P(\tilde{C}_j)} \right]$$

where $W_i$ is a feature, $C_j$ denotes the fact that the document is relevant to topic $j$, and $\tilde{C}_j$ denotes non-relevance.

Mutual information measures the information contained in the term $W_i$ about topic $j$.

A document is represented as follows:

$$T_j = (T_{j1}, \ldots, T_{jp}).$$
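As a concrete illustration, here is a hedged Python sketch that evaluates $I(W_i, C_j)$ from a 2×2 contingency table of document counts; the helper name and count convention are assumptions, not the authors' code.

```python
import math

def expected_mutual_information(n11, n10, n01, n00):
    """I(W_i, C_j) from a 2x2 contingency table of document counts.

    n11: documents containing term i AND relevant to topic j
    n10: documents containing term i, not relevant
    n01: documents without term i, relevant
    n00: documents without term i, not relevant
    """
    n = n11 + n10 + n01 + n00
    total = 0.0
    for (n_bc, n_b, n_c) in [
        (n11, n11 + n10, n11 + n01),  # W_i = 1, C_j
        (n10, n11 + n10, n10 + n00),  # W_i = 1, ~C_j
        (n01, n01 + n00, n11 + n01),  # W_i = 0, C_j
        (n00, n01 + n00, n10 + n00),  # W_i = 0, ~C_j
    ]:
        if n_bc > 0:  # empty cells contribute nothing
            p_bc = n_bc / n
            # Natural log; the base only rescales the feature ranking.
            total += p_bc * math.log(p_bc / ((n_b / n) * (n_c / n)))
    return total
```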

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

Discretization Scheme

The goal of discretization is to find a mapping m such that the feature value is represented by a discrete value.

The mapping is characterized by a series of threshold levels (0, w1, …, wk) where 0 < w1 < w2 < … < wk.

The mapping m has the following property:

$$m(q) = \begin{cases} 0, & \text{if } 0 \le q < w_1,\\ i, & \text{if } w_i \le q < w_{i+1},\\ k, & \text{if } w_k \le q, \end{cases}$$

where q is the feature value.
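Under this reading of the thresholds, the mapping is a simple bucket lookup. A minimal Python sketch (names are illustrative):

```python
import bisect

def discretize(q, thresholds):
    """Map feature value q to a level 0..k given thresholds [w1, ..., wk].

    Returns 0 for q < w1, i for w_i <= q < w_{i+1}, and k for q >= w_k.
    """
    return bisect.bisect_right(thresholds, q)

# Ex) three levels via the thresholds 5.5 and 10.5 from the next slide:
# discretize(3, [5.5, 10.5]) -> 0
# discretize(7, [5.5, 10.5]) -> 1
# discretize(12, [5.5, 10.5]) -> 2
```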

Predefined Level Discretization

One determines the discretization level k and the threshold values in advance. Ex) Integers between 0 and 15 are discretized into three levels by the threshold values 5.5 and 10.5.

Lloyd’s Algorithm

Consider the distribution of feature values.

Step 1: determine the discretization level k.
Step 2: select the initial threshold levels (y1, y2, …, y(k−1)).
Step 3: repeat the following for all i:
- Calculate the mean feature value μ_i of the ith region.
- Generate all possible threshold levels between μ_i and μ_{i+1}.
- Select the threshold level which minimizes the following distortion measure:

$$d = \sum_i \sum_j (q_{ij} - \mu_i)^2$$

where $q_{ij}$ is the jth feature value falling in the ith region.

Step 4: if the distortion measure of this new set of threshold levels is less than that of the old set, go to Step 3.
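A compact Python sketch of this iteration, assuming every region stays non-empty; it uses the standard Lloyd shortcut of placing each threshold midway between adjacent region means, which is the squared-error minimizer that the exhaustive search in Step 3 would find.

```python
def lloyd_thresholds(values, k, iters=50):
    """One-dimensional Lloyd quantizer: iteratively refine k-1 thresholds."""
    values = sorted(values)
    # Step 2: initial thresholds from evenly spaced quantiles.
    thresholds = [values[(i * len(values)) // k] for i in range(1, k)]
    for _ in range(iters):  # Steps 3-4: refine until no change
        # Partition values into k regions by the current thresholds.
        regions = [[] for _ in range(k)]
        for q in values:
            regions[sum(q >= w for w in thresholds)].append(q)
        means = [sum(r) / len(r) if r else 0.0 for r in regions]
        # New threshold between regions i and i+1: midpoint of their means.
        new = [(means[i] + means[i + 1]) / 2 for i in range(k - 1)]
        if new == thresholds:
            break
        thresholds = new
    return thresholds
```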

Relevance Dependence Discretization (1/3)

Consider the dependency between the feature and the relevance of the topic.

The relevance information entropy is given as

$$\mathrm{Ent}(S) = -P(C_j, S)\log P(C_j, S) - P(\tilde{C}_j, S)\log P(\tilde{C}_j, S)$$

where S is the group of feature values.

Relevance Dependence Discretization (2/3)

The partition entropy of the region induced by w is defined as

$$E(w; S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$

where $S_1$ is the subset of S with feature values smaller than w and $S_2 = S - S_1$.

The more homogeneous the region, the smaller the partition entropy.

The partition entropy controls the recursive partition algorithm.

Relevance Dependence Discretization (3/3)

A criterion for the recursive partition algorithm is as follows:

$$\mathrm{Gain}(m; S) > \frac{\log_2(|S| - 1)}{|S|} + \frac{\Delta(m; S)}{|S|}$$

where $\Delta(m; S)$ is defined as

$$\Delta(m; S) = \log_2(3^k - 2) - \left[\,k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)\,\right]$$

where k, k1, and k2 are the numbers of relevance classes in the partitions S, S1, and S2, respectively.
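Putting the three slides together, here is a hedged Python sketch of the recursive entropy-based partitioning with the stopping criterion reconstructed above (essentially Fayyad and Irani's MDL test); the names and the 0/1 relevance-label convention are assumptions.

```python
import math

def ent(labels):
    """Relevance entropy Ent(S) of a list of 0/1 relevance labels."""
    n, pos = len(labels), sum(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (pos, n - pos) if 0 < c < n)

def best_cut(pairs):
    """Boundary w minimizing the partition entropy E(w; S).

    pairs: list of (feature_value, relevant) tuples; returns (E, w, i).
    """
    pairs = sorted(pairs)
    best = None
    for i in range(1, len(pairs)):
        s1 = [r for _, r in pairs[:i]]
        s2 = [r for _, r in pairs[i:]]
        e = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(pairs)
        if best is None or e < best[0]:
            w = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (e, w, i)
    return best

def mdl_cuts(pairs, cuts):
    """Recursively partition while the information gain passes the test."""
    if len(pairs) < 2:
        return
    e, w, i = best_cut(pairs)
    pairs = sorted(pairs)
    s = [r for _, r in pairs]
    s1, s2 = s[:i], s[i:]
    gain = ent(s) - e                       # Ent(S) - E(w; S)
    k, k1, k2 = len(set(s)), len(set(s1)), len(set(s2))
    delta = math.log2(3 ** k - 2) - (k * ent(s) - k1 * ent(s1) - k2 * ent(s2))
    if gain > (math.log2(len(s) - 1) + delta) / len(s):
        cuts.append(w)
        mdl_cuts(pairs[:i], cuts)
        mdl_cuts(pairs[i:], cuts)
```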

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

Bayesian Inference for Document Classification

The probability of $C_j$ given the document follows from Bayes’ theorem:

$$P(C_j \mid T_{j1}, \ldots, T_{jp}) = \frac{P(T_{j1}, \ldots, T_{jp} \mid C_j)\,P(C_j)}{P(T_{j1}, \ldots, T_{jp})}.$$
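In the two-class (relevant vs. non-relevant) case, the denominator expands over the two hypotheses, so the posterior is just a normalized product of likelihood and prior. A tiny Python sketch (the function name is assumed):

```python
def posterior(likelihood_rel, likelihood_nonrel, prior_rel):
    """P(C_j | T1..Tp) via Bayes' theorem, expanding the evidence
    P(T1..Tp) over the relevant and non-relevant cases."""
    num = likelihood_rel * prior_rel
    return num / (num + likelihood_nonrel * (1.0 - prior_rel))
```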

Background of Bayesian Networks

The process of inference is to use the evidence of some of the nodes that have observations to find the probability of some of the other nodes in the network.

[Diagram: an example Bayesian network over the class node C and feature nodes T1–T5, representing the joint distribution P(C, T1, T2, T3, T4, T5).]

Learning Bayesian Networks

Parametric learning: the conditional probability for each node is estimated from the training data.

Structural learning: best-first search guided by the MDL score (sketched below).

A classification-based network simplifies the structural learning process.
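The slide does not show the exact search procedure, so the following Python sketch illustrates one plausible greedy best-first search over edge additions, scored by an MDL function like the one defined on the next slides. A fixed variable ordering (parents must precede children) keeps every candidate structure acyclic; all names are illustrative.

```python
def greedy_structure_search(ordered_vars, score, max_parents=2):
    """Greedy best-first structure search: starting from the empty graph,
    repeatedly add the single parent->child edge that most reduces the
    MDL score, until no addition improves it.

    ordered_vars: variables in a fixed order; edges only run from an
                  earlier variable to a later one, so the graph stays a DAG.
    score:        callable mapping {child: [parents]} to an MDL score
                  (lower is better).
    """
    structure = {v: [] for v in ordered_vars}
    current = score(structure)
    while True:
        best = None
        for ci, child in enumerate(ordered_vars):
            if len(structure[child]) >= max_parents:
                continue
            for parent in ordered_vars[:ci]:      # earlier variables only
                if parent in structure[child]:
                    continue
                structure[child].append(parent)   # try the edge
                s = score(structure)
                structure[child].remove(parent)   # undo
                if s < current and (best is None or s < best[0]):
                    best = (s, child, parent)
        if best is None:
            return structure
        current, child, parent = best
        structure[child].append(parent)
```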

MDL Score for Bayesian Networks

The MDL (Minimum Description Length) score for a Bayesian network B is defined as

$$L(B) = \sum_X L_{total}(X, \Pi_X)$$

where X is a node in the network and $\Pi_X$ is its parent set.

The score for each node is calculated as

$$L_{total}(T_{ji}, \Pi_{T_{ji}}) = L_{network}(T_{ji}, \Pi_{T_{ji}}) + L_{data}(T_{ji}, \Pi_{T_{ji}}).$$

Complexity of the Network Structure

$L_{network}$ is the network description length; it corresponds to the topological complexity of the network and is computed as follows:

$$L_{network}(T_{ji}, \Pi_{T_{ji}}) = \frac{\log_2 N}{2}\,(s_{ji} - 1) \prod_{T \in \Pi_{T_{ji}}} s_T$$

where N is the number of training documents and $s_{ji}$ is the number of possible states the variable $T_{ji}$ can take.

Accuracy of the Network Structure

The data description length is given by the following formula:

$$L_{data}(T_{ji}, \Pi_{T_{ji}}) = -\sum_{T_{ji},\,\Pi_{T_{ji}}} M(T_{ji}, \Pi_{T_{ji}}) \log_2 \frac{M(T_{ji}, \Pi_{T_{ji}})}{M(\Pi_{T_{ji}})}$$

where M(·) is the number of cases that match a particular instantiation in the training data.

The more accurate the network, the shorter this length.
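Combining the two description lengths, here is a hedged Python sketch of the per-node and total MDL score as reconstructed above; the data layout (a list of dicts mapping variable names to discrete states) and all names are assumptions.

```python
import math
from collections import Counter

def mdl_node_score(data, node, parents, states):
    """L_total for one node: network length + data length, as defined above.

    data:    list of dicts mapping variable name -> observed state
    node:    variable name of the node T_ji
    parents: list of parent variable names (the parent set)
    states:  dict mapping variable name -> number of possible states s
    """
    n = len(data)
    # Network length: (log2 N / 2) * number of free parameters at this node.
    params = (states[node] - 1) * math.prod(states[p] for p in parents)
    l_network = (math.log2(n) / 2) * params
    # Data length: -sum M(x, pa) * log2( M(x, pa) / M(pa) ).
    joint = Counter((d[node], tuple(d[p] for p in parents)) for d in data)
    parent_counts = Counter(tuple(d[p] for p in parents) for d in data)
    l_data = -sum(m * math.log2(m / parent_counts[pa])
                  for (x, pa), m in joint.items())
    return l_network + l_data

def mdl_score(data, structure, states):
    """Total MDL score L(B): sum of per-node scores; lower is better."""
    return sum(mdl_node_score(data, node, parents, states)
               for node, parents in structure.items())

# This plugs into the earlier search sketch, e.g.:
# greedy_structure_search(vars, lambda s: mdl_score(data, s, states))
```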

Contents

- Introduction
- Overview of the approach
- Automatic document pre-processing
- Feature selection
- Feature discretization
- Learning Bayesian networks
- Experiments and results
- Conclusions and future work

The Process of Information Filtering Based on Bayesian Network Learning

- Gather the training documents.
- For all training documents, determine the relevance to each topic.
- Perform feature selection for each topic (5 and 10 features were used in the experiments).
- Discretize the feature values.
- Learn a Bayesian network for each topic.
- Set the probability threshold value for the relevance decision.

Each Bayesian network corresponds to the filtering profile.

Document Collections

Reuters-21578: 29 topics. In chronological order, the first 7,000 documents were chosen as the training set and the remaining 14,578 documents were used as the test set.

FBIS (Foreign Broadcast Information Service): 38 topics used in TREC (Text REtrieval Conference). In chronological order, 60,000 documents were chosen as the training set and the remaining 70,471 documents were used as the test set.

Evaluation Metrics for Information Retrieval

The contingency table:

                         True relevant   True non-relevant
Algorithm relevant           n1               n2
Algorithm non-relevant       n3               n4

The metrics are defined as

$$R\ (\text{recall}) = \frac{n_1}{n_1 + n_3}$$

$$S\ (\text{precision}) = \frac{n_1}{n_1 + n_2}$$

$$F_\beta = \frac{(\beta^2 + 1)\,S\,R}{\beta^2 S + R}$$

$$\text{Utility} = A\,n_1 + B\,n_2 + C\,n_3 + D\,n_4. \quad \text{Ex) Utility} = 3 n_1 - 2 n_2.$$
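A small Python sketch computing these metrics from the table's four counts; the n1..n4 convention follows the table, and the 3/−2 utility weights follow the slide's example.

```python
def filtering_metrics(n1, n2, n3, n4, beta=1.0):
    """Recall, precision, F_beta, and linear utility from the 2x2 table.

    n1: relevant documents the filter accepted      (true positives)
    n2: non-relevant documents the filter accepted  (false positives)
    n3: relevant documents the filter rejected      (false negatives)
    n4: non-relevant documents the filter rejected  (true negatives)
    """
    recall = n1 / (n1 + n3) if n1 + n3 else 0.0
    precision = n1 / (n1 + n2) if n1 + n2 else 0.0
    if precision + recall:
        f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    else:
        f = 0.0
    utility = 3 * n1 - 2 * n2   # A=3, B=-2, C=D=0, as in the slide's example
    return recall, precision, f, utility
```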

Filtering Performance of the Bayesian Network on the Reuters Collection

Comparison of the Bayesian Network Approach and the Naïve Bayesian Approach

Filtering Performance of the Bayesian Network on the FBIS Collection

Comparison of the Bayesian Network Approach and the Naïve Bayesian Approach

[The result charts on these slides are not reproduced in the transcript.]

Conclusions and Future Work

The approach showed better performance than the naïve Bayesian approach.

Future work includes discretization methods, structural learning, and handling large data sets.