
Autonomous News Clustering and Classification for an Intelligent Web Portal

ISMIS 2008, 2008/05/23

Traian Rebedea, Stefan Trausan-Matu

“Politehnica” University of Bucharest, Department of Computer Science and Engineering

{trebedea, trausan}@cs.pub.ro



Overview
Introduction
  • Motivation
  • Intelligent News Processing
Theoretical Background
  • Text Clustering
  • Text Classification
Intelligent News Classification in Romanian
  • Functionality & Architecture
  • News Clustering
  • News Classification
Conclusions



Introduction
WWW – increased number of users and web sites
Great volume of online information
Information redundancy on the Web
  • Different sources – slight variations
News available as web syndication
  • XML-based formats – RSS, ATOM
  • Applications: aggregators – various flavours
  • Aggregators do not (usually) exploit the large volume of information and redundancy



Motivation
Obtain an autonomous news portal using:
  • Web syndication
  • Advanced text processing methods
      • Clustering
      • Classification
Large volumes of data
  • Find a method to determine the importance of the processed data (pieces of news)
News headlines
  • Different sources
  • Many stories / headlines acquired from feeds – high volume & redundancy
Intelligent News Processing



Intelligent News Processing
Objective: find the most important piece of news
Alternatives:
  • Manually assign an importance to each piece of news – difficult, time consuming
  • Number of readers for each news headline
      • Does not (1) reduce the number of news headlines, (2) solve news redundancy, (3) offer an automatic method for computing the importance of a particular piece of news
      • Each piece of news is attached to the source – no alternative sources on the same subject
  • Intelligent processing of news
      • Automatically determines the main headlines
      • Offers a classification of news subjects by the number of different pieces of news that compose each subject – an objective measure provided by (news) specialists (news agencies, newspapers, TV stations, etc.)



Intelligent News Processing (2)
NLP techniques – machine learning
News fetched from various sources using web syndication
News clustering – used to determine the most important subjects
News classification – assign each piece of news to a category
Different approaches:
  • Google News, Topix, NewsJunkie (Microsoft), European Media Monitor (EMM)
  • Some of them also consider: assigning labels to news (persons, companies, events), information novelty










Theoretical Background
Clustering and classification – widely used in NLP
Vector space model – Boolean, frequency, TF-IDF vectors
High dimensionality – number of distinct terms in all the analyzed pieces of news
  • Curse of dimensionality
  • Similarity measures: based on cosine
  • Inverse of distance metrics do not offer good results
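The vector space model and cosine similarity mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming documents are already tokenized and using the common `tf * log(N / df)` weighting; the function names and the sample tokens are hypothetical and not taken from the portal's implementation.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: term -> weight) for tokenized docs."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: tf * log(n / df); a term found in every doc gets 0.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical tokenized headlines (diacritics already removed).
docs = [
    ["guvern", "buget", "vot"],
    ["guvern", "buget", "taxe"],
    ["fotbal", "meci", "gol"],
]
vecs = tf_idf_vectors(docs)
```

Sparse dictionaries are used here because the number of distinct terms is large while each news item contains only a few of them.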



Text Clustering
Partition the data into subsets – clusters
  • Data in the same group has common characteristics
  • Grouping process is applied based on the proximity of the elements that need to be clustered – similarity measure
  • Large volumes – exploits the redundancy
Different techniques:
  • Bottom-up (agglomerative) / top-down (divisive)
  • Hierarchical / flat – relationships between groups
  • Assignment: hard / soft



Hierarchical Clustering
Usually uses:
  • Hard assignment
  • Greedy technique
Computing similarity between clusters:
  • Most similar elements – single link (fast)
  • Least similar elements – complete link (good results)
  • Average similarity of all the elements in a group – average link (fast & good results)
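The three ways of computing similarity between clusters can be compared on a small precomputed similarity matrix. The sketch below is illustrative only: the matrix values are made up, and average link is taken here as the mean over all cross-cluster pairs, which is one common definition and an assumption about what the slide intends.

```python
from itertools import product

def single_link(sim, A, B):
    """Similarity of the most similar pair of elements across the two clusters."""
    return max(sim[i][j] for i, j in product(A, B))

def complete_link(sim, A, B):
    """Similarity of the least similar pair of elements across the two clusters."""
    return min(sim[i][j] for i, j in product(A, B))

def average_link(sim, A, B):
    """Mean similarity over all cross-cluster pairs (assumed definition)."""
    return sum(sim[i][j] for i, j in product(A, B)) / (len(A) * len(B))

# Hypothetical symmetric similarity matrix for four news items.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.4, 0.3],
    [0.2, 0.4, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
A, B = [0, 1], [2, 3]
```

For any pair of clusters the three values are ordered as complete link <= average link <= single link, which is why single link merges most eagerly and complete link most conservatively.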



Text Classification
Assign predefined labels (categories) to textual items
Supervised learning – 2 stages
  • Training the classifier – training set of items
  • Using it for assigning labels to new items
Training => data model => classify items
Text classification:
  • News and e-mail categorization
  • Automatic classification of large text documents
  • Text is unstructured and the number of features is very high (> 1000) – unlike databases (usually < 100)



Text Classification (2)
Different methods:
  • Separation of the space: NN
  • Probability distribution: decision trees, Bayes, SVMs
Nearest neighbour (NN) – easy to train and use
Training phase – simple and fast – indexing the training data for each category
Classifying a new item – the k most similar indexed documents are determined. The item is assigned to the class that has the most documents among them
Improvements:
  • score for each class
  • offsets for each class added to the score
Classifier can be trained to find the best values for k and the offsets
Disadvantages: more time and memory needed for classification than with probabilistic classifiers
Use greedy feature selection
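A k-NN classifier with per-class scores and offsets, as outlined on this slide, might look like the sketch below. It assumes sparse term-frequency vectors and cosine similarity; the `offsets` dictionary, the function names and the training examples are hypothetical, and offsets are only applied to classes that appear among the k neighbours.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(item, training, k=3, offsets=None, use_scores=True):
    """training is a list of (vector, label) pairs; returns the winning label."""
    offsets = offsets or {}
    # "Indexing" is just keeping the training items; find the k most similar.
    neighbours = sorted(training, key=lambda ex: cosine(item, ex[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        # Simple k-NN counts votes; the scored variant sums the similarities.
        scores[label] += cosine(item, vec) if use_scores else 1.0
    for label in scores:
        scores[label] += offsets.get(label, 0.0)
    return max(scores, key=scores.get)

# Hypothetical training items.
training = [
    ({"gol": 2, "meci": 1}, "Sports"),
    ({"meci": 1, "echipa": 1}, "Sports"),
    ({"buget": 2, "taxe": 1}, "Economy"),
    ({"vot": 1, "guvern": 2}, "Politics"),
]
item = {"meci": 1, "gol": 1}
```

Every classification compares the new item with all indexed training items, which illustrates the time and memory cost listed as a disadvantage.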










Intelligent News Classification
Purpose: develop an online news portal able to function with a minimum of human intervention
Makes use of:
  • Web syndication
  • NLP techniques
Advantages:
  • autonomy from an administrator
  • news is presented according to the importance of the headlines over a period of time



Functionality & Architecture
1. Automated collecting of web syndications (periodically);
2. Save fresh pieces of news in the database;
3. Process the textual information of each fresh piece of news in order to determine the feature vector associated with the news;
4. Group the news using a text clustering algorithm;
5. Classify each group of news within a predefined category, using a regularly retrained classifier;
6. Generate web pages corresponding to the most important subjects / headlines, grouped in various ways, including in each category of news.
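The first step, collecting news from syndication feeds, can be illustrated with a small RSS 2.0 parser built on the Python standard library. This is a sketch under assumptions: the feed content, URLs and the `parse_rss` helper are hypothetical, and fetching over HTTP, scheduling and database storage are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a fetched feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example source</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
      <description>Body of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <description>Body of the second story.</description>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Return one dict per <item> with its title, link and description."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return items

news = parse_rss(SAMPLE_FEED)
```

Each returned dictionary corresponds to one fresh piece of news that the later steps would store, vectorize, cluster and classify.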



Functionality & Architecture (2)
These actions may be run in a single stage / sequentially, as well as individually, at different moments in time
  • Operation is determined by the quantity of processed data
  • Functionality can be parallelized
Functionality of the portal may be broken into two different modules that are relatively independent:
  • Agent module, that processes the news items and generates the web pages
  • Web module that displays the information and implements search and personalization capabilities
  • The two modules communicate using a database



Functionality & Architecture (3)



News Clustering
Preprocessing phase:
  • Remove diacritical marks and other special characters that are not used by all the news sources
  • Remove HTML tags and entities
  • Eliminate stop words
  • Tokenization
  • Stemming – special Romanian stemmer
      • Inflection rules are numerous and very complicated
      • They affect the inner structure of the words, not only the trailing part
      • Use a small set of solid rules (reduces the number of terms by 20-25%)
Vector space model – Boolean, frequency
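The preprocessing steps listed above, except stemming, can be sketched with the Python standard library. The stop-word list below is a tiny hypothetical sample, not the one used by the portal, and the Romanian stemmer is omitted because its rules are not given in the slides.

```python
import html
import re
import unicodedata

# Tiny illustrative stop-word list (assumption, diacritics already removed).
STOP_WORDS = {"si", "in", "la", "de", "a", "cu", "pe", "din"}

def strip_diacritics(text):
    """Decompose characters and drop the combining marks (e.g. 'ș' -> 's')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(raw):
    """Return the list of lowercase tokens left after cleaning a news text."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = html.unescape(text)                # decode HTML entities
    text = strip_diacritics(text).lower()     # remove diacritical marks
    tokens = re.findall(r"[a-z0-9]+", text)   # tokenization
    return [t for t in tokens if t not in STOP_WORDS]

raw = "<b>Știri</b> din România &amp; Europa: bugetul a fost votat în Parlament"
tokens = preprocess(raw)
```

Removing diacritics before stop-word filtering means the stop-word list only needs unaccented forms, which matches the goal of handling sources that do not all use diacritics.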



News Clustering – Algorithm
Hierarchical algorithm
  • Agglomerative
  • Hard assignment
  • Average link
Used two thresholds:
  • Higher value – to merge very similar items
      • Used in order to create very cohesive clusters
  • Lower value – continue the process
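A greedy agglomerative loop with hard assignment and average link can be sketched as below. The slides do not specify how the two thresholds interact, so this sketch makes a simplifying assumption and uses a single stopping threshold; the threshold value, function names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_link(items, A, B):
    """Mean cosine similarity over all cross-cluster pairs of item indices."""
    total = sum(cosine(items[i], items[j]) for i in A for j in B)
    return total / (len(A) * len(B))

def agglomerate(items, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(items))]  # start with singletons
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_link(items, clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # no remaining pair is similar enough
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]  # hard assignment: plain union
        del clusters[y]
    return clusters

# Hypothetical term-frequency vectors for four news items.
items = [
    {"guvern": 1, "buget": 1},
    {"guvern": 1, "buget": 1, "vot": 1},
    {"meci": 1, "gol": 1},
    {"meci": 1, "gol": 2},
]
result = agglomerate(items, threshold=0.5)
```

The size of each resulting cluster is what the portal would use as the importance of the corresponding subject.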



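News Clustering - Similarity
Used different measures:
  • Frequency:
      • Inverse of a distance:
        \[ \mathrm{sim}_1(a,b) = e^{-d^2(a,b)} \]
        \[ \mathrm{sim}_2(a,b) = \frac{1}{1 + d(a,b)} \]
      • Cosine similarity:
        \[ \mathrm{sim}_3(a,b) = \frac{\sum_{i=1}^{m} x_a[i]\, x_b[i]}{\sqrt{\sum_{i=1}^{m} x_a[i]^2}\,\sqrt{\sum_{i=1}^{m} x_b[i]^2}} \]
  • Boolean:
      • Jaccard similarity:
        \[ \mathrm{sim}_4(a,b) = \frac{|x_a \text{ AND } x_b|}{|x_a \text{ OR } x_b|} \]
      • Dice's coefficient:
        \[ \mathrm{sim}_5(a,b) = \frac{2\,|x_a \text{ AND } x_b|}{|x_a| + |x_b|} \]
      • Other:
        \[ \mathrm{sim}_6(a,b) = \frac{|x_a \text{ AND } x_b|}{\min(|x_a|,|x_b|)\,\log_2\!\left(1 + \frac{\max(|x_a|,|x_b|)}{\min(|x_a|,|x_b|)}\right)} \]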
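The Jaccard and Dice measures named on this slide, together with one inverse-distance similarity, can be sketched as follows. Boolean vectors are represented as sets of terms; the Euclidean distance and the `1 / (1 + d)` form are assumptions, since the slide does not say which distance is used.

```python
import math

def jaccard(a, b):
    """Shared terms divided by the terms present in either item."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """Twice the shared terms divided by the sum of the two set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def euclidean(u, v):
    """Euclidean distance between sparse frequency vectors (assumed metric)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0) - v.get(t, 0)) ** 2 for t in terms))

def inverse_distance(u, v):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean(u, v))

# Hypothetical term sets and frequency vectors.
a = {"guvern", "buget", "vot"}
b = {"guvern", "buget", "taxe"}
u = {"guvern": 2, "buget": 1}
v = {"guvern": 1, "buget": 1, "taxe": 1}
```

For the two sample sets, two of four distinct terms are shared, so Jaccard is 2/4 and Dice is 4/6.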



News Clustering - Results

• Used > 30 news sources
• Frequency-based implementation worked better
• News presented to the users:
    • The most important headlines – number of items in a group
    • The importance of a piece of news – similarity with the feature vector of the headline



News Classification
Categories: Romania, Politics, Economy, Culture, International, Sports, High-Tech and High Life
Classify the news clusters, not each piece of news individually
  • Advantage: a cluster holds more feature information – more likely to be correctly classified
Training data – specialized RSS channels



Training Data

Training data: 3279 news items
  • Cross validation
  • 2/3 – used for training
  • 1/3 – used for evaluation
  • unequal distribution



Classifiers
k-NN classifiers:
  • Simple k-NN (most similar item)
  • k-NN with scores
  • Center-based NN
      • Training phase – slower
      • Classification phase – faster
Various values for k = 1, 3, 5, …
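The center-based NN variant can be sketched as a nearest-centroid classifier: training averages the vectors of each category, and classification compares a new item with one center per category. The sketch assumes sparse frequency vectors, mean centroids and cosine similarity; the names and sample data are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centers(training):
    """Average the vectors of each category into a single center vector."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in training:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {t: w / counts[label] for t, w in terms.items()}
        for label, terms in sums.items()
    }

def classify(item, centers):
    """Assign the item to the category whose center is most similar."""
    return max(centers, key=lambda label: cosine(item, centers[label]))

# Hypothetical training items.
training = [
    ({"gol": 2, "meci": 1}, "Sports"),
    ({"meci": 1, "echipa": 1}, "Sports"),
    ({"buget": 2, "taxe": 1}, "Economy"),
    ({"buget": 1, "banca": 1}, "Economy"),
]
centers = train_centers(training)
```

Training does more work than plain indexing, but each classification needs only one comparison per category, consistent with the slower training and faster classification noted on the slide.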



Classification Results

Nearest Center (center-based NN) had the best accuracy
Frequency-based vector space with cosine similarity produced slightly better results than the Boolean-based vector space
k-NN with scores (denoted by sum in the table above) produced slightly better results than simple k-NN



Classification Results (2)

Confusion matrix for the NC classifier
Average recall = 0.59, Average precision = 0.62, Accuracy = 0.64, F1 = 0.61
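Average recall, average precision, accuracy and F1 can all be derived from a confusion matrix. The sketch below assumes macro averaging over categories and an F1 computed from the averaged precision and recall; the slides do not state which averaging was used, and the 3x3 matrix is made up for illustration, not the portal's results.

```python
def macro_metrics(confusion):
    """confusion[i][j] = number of items of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    recalls, precisions = [], []
    for i in range(n):
        actual = sum(confusion[i])                          # row total
        predicted = sum(confusion[r][i] for r in range(n))  # column total
        recalls.append(confusion[i][i] / actual if actual else 0.0)
        precisions.append(confusion[i][i] / predicted if predicted else 0.0)
    recall = sum(recalls) / n
    precision = sum(precisions) / n
    # Assumed definition: harmonic mean of the averaged precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical confusion matrix for three categories.
confusion = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 8],
]
metrics = macro_metrics(confusion)
```

With an unequal distribution of training items per category, macro averages weight every category equally, so they can differ noticeably from accuracy.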



Conclusions
Alternative to classical news portals
  • Solve the problems of large amounts of news and of information redundancy, by using the latter as an advantage
Web syndication and natural language processing techniques are used in order to achieve a human-independent functionality
Clustering is used to exploit similar news and group them into a single topic – presented to the user
Automatic classification of the news topics – advantage over a single piece of news
Further development:
  • Improve the clustering and classification techniques
  • Language independent or multilingual



Thank You!