2008/05/23 Autonomous News Clustering and Classification
for an Intelligent Web Portal ISMIS2008
Autonomous News Clustering and Classification for
an Intelligent Web Portal
Traian Rebedea, Stefan Trausan-Matu
“Politehnica” University of Bucharest, Department of Computer Science and Engineering
{trebedea, trausan}@cs.pub.ro
Overview
Introduction
• Motivation
• Intelligent News Processing
Theoretical Background
• Text Clustering
• Text Classification
Intelligent News Classification in Romanian
• Functionality & Architecture
• News Clustering
• News Classification
Conclusions
Introduction
WWW – increased number of users and web sites
Great volume of online information
Information redundancy on the Web
• Different sources – slight variations
News available as web syndication
• XML-based formats – RSS, Atom
• Applications: aggregators – various flavours
• Aggregators do not (usually) exploit the large volume of information and redundancy
Motivation
Obtain an autonomous news portal using:
• Web syndication
• Advanced text processing methods
    • Clustering
    • Classification
Large volumes of data
• Find a method to determine the importance of the processed data (pieces of news)
News headlines
• Different sources
• Many stories / headlines acquired from feeds – high volume & redundancy
Intelligent News Processing
Intelligent News Processing
Objective: find the most important piece of news
Alternatives:
• Manually assign an importance to each piece of news – difficult, time consuming
• Number of readers for each news headline
    • Does not (1) reduce the number of news headlines, (2) solve news redundancy, (3) offer an automatic method for computing the importance of a particular piece of news
    • Each piece of news is attached to the source – no alternative sources on the same subject
• Intelligent processing of news
    • Automatically determines the main headlines
    • Offers a classification of news subjects by the number of different pieces of news that compose each one – an objective measure provided by (news) specialists (news agencies, newspapers, TV stations, etc.)
Intelligent News Processing (2)
NLP techniques – machine learning
News fetched from various sources using web syndication
News clustering – used to determine the most important subjects
News classification – assign each piece of news to a category
Different approaches:
• Google News, Topix, NewsJunkie (Microsoft), Europe Media Monitor (EMM)
• Some of them also consider: assigning labels to news (persons, companies, events), information novelty
Theoretical Background
Clustering and classification – widely used in NLP
Vector space model – Boolean, frequency, TF-IDF vectors
High dimensionality – number of distinct terms in all the analyzed pieces of news
• Curse of dimensionality
• Similarity measures: based on cosine
• Inverse of distance metrics do not offer good results
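A minimal sketch of the vector space model named on this slide, assuming already tokenized documents and one common TF-IDF weighting (raw term frequency times log(N / df)); the slides do not say which weighting variant was used, and the function names and toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: term -> weight) for tokenized docs."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (hypothetical): two on the same subject, one unrelated.
docs = [
    ["election", "results", "parliament"],
    ["parliament", "election", "vote"],
    ["football", "cup", "final"],
]
vecs = tf_idf_vectors(docs)
```

Sparse dictionaries keep only the terms present in a document, which matters when the vocabulary has thousands of distinct terms.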
Text Clustering
Partition the data into subsets – clusters
• Data in the same group has common characteristics
• Grouping process is applied based on the proximity of the elements that need to be clustered – similarity measure
• Large volumes – exploits the redundancy
Different techniques:
• Bottom-up (agglomerative) / top-down (divisive)
• Hierarchical / flat – relationships between groups
• Assignment: hard / soft
Hierarchical Clustering
Usually uses:
• Hard assignment
• Greedy technique
Computing similarity between clusters:
• Most similar elements – single link (fast)
• Least similar elements – complete link (good results)
• Average similarity of all the elements in a group – average link (fast & good results)
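The three linkage criteria can be sketched as follows. This is an illustration, not the authors' code: it assumes average link is taken over all cross-cluster pairs, and it uses a toy similarity on numbers (`toy_sim`, a hypothetical helper) in place of a text similarity.

```python
def single_link(c1, c2, sim):
    """Similarity of the two most similar elements across the clusters."""
    return max(sim(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, sim):
    """Similarity of the two least similar elements across the clusters."""
    return min(sim(a, b) for a in c1 for b in c2)

def average_link(c1, c2, sim):
    """Mean similarity over all cross-cluster pairs (assumed definition)."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def toy_sim(a, b):
    """Toy similarity on numbers: closer values are more similar."""
    return 1.0 / (1.0 + abs(a - b))

left, right = [0, 1], [2, 5]
```

With these toy clusters, single link looks only at the pair (1, 2), complete link only at (0, 5), and average link at all four pairs.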
Text Classification
Assign predefined labels (categories) to textual items
Supervised learning – 2 stages
• Training the classifier – training set of items
• Using it for assigning labels to new items
Training => data model => classify items
Text classification:
• News and e-mail categorization
• Automatic classification of large text documents
• Text is unstructured and the number of features is very high (> 1000), unlike database records (usually < 100)
Text Classification (2)
Different methods:
• Separation of the space: NN
• Probability distribution: decision trees, Bayes, SVMs
Nearest neighbour (NN) – easy to train and use
Training phase – simple and fast – indexing the training data for each category
Classifying a new item – the most similar k indexed documents are determined. The item is assigned to the class that has the most documents
Improvements:
• Score for each class
• Offsets for each class added to the score
Classifier can be trained to find the best values for k and the offsets
Disadvantages: more time and memory needed for classification than with probabilistic classifiers
Use greedy feature selection
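A minimal sketch of k-NN with per-class scores and offsets. It assumes the score of a class is the sum of the similarities of its documents among the k nearest neighbours, and that offsets apply only to classes present among those neighbours; the slides do not spell out either detail. The Jaccard similarity on term sets and the toy training data are illustrative choices.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_with_scores(item, training, sim, k=3, offsets=None):
    """Return the label with the highest summed similarity among the k nearest."""
    offsets = offsets or {}
    ranked = sorted(training, key=lambda ex: sim(item, ex[0]), reverse=True)
    scores = defaultdict(float)
    for features, label in ranked[:k]:
        scores[label] += sim(item, features)
    # Assumed form of the offsets: a per-class constant added to the score.
    for label in list(scores):
        scores[label] += offsets.get(label, 0.0)
    return max(scores, key=scores.get)

# Hypothetical indexed training documents: (term set, category).
TRAINING = [
    ({"goal", "match"}, "Sports"),
    ({"match", "team"}, "Sports"),
    ({"vote", "law"}, "Politics"),
    ({"vote", "party"}, "Politics"),
]
query = {"match", "goal", "vote"}
```

Calling `knn_with_scores(query, TRAINING, jaccard, k=4)` favours Sports here, while a large enough offset for Politics flips the decision, which is why k and the offsets are worth tuning on held-out data.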
Intelligent News Classification
Purpose: develop an online news portal able to function with a minimum of human intervention
Makes use of:
• Web syndication
• NLP techniques
Advantages:
• Autonomy towards an administrator
• The methodology used to present the news, based on the importance of the headlines over a period of time
Functionality & Architecture
1. Automated, periodic collection of web syndication feeds;
2. Save fresh pieces of news in the database;
3. Process the textual information of each fresh piece of news in order to determine the feature vector associated with it;
4. Group the news using a text clustering algorithm;
5. Classify each group of news within a predefined category, using a regularly retrained classifier;
6. Generate web pages corresponding to the most important subjects / headlines, grouped in various ways, including in each category of news.
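Steps 1 and 2 can be sketched with the Python standard library. This is only an illustration: it assumes RSS 2.0 feeds, uses an inline sample feed instead of a network fetch, and stands in for the database with a set of already seen links; `parse_rss`, `fresh_items` and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Parliament votes on budget</title>
      <link>http://example.org/news/1</link>
      <description>The budget passed after a long debate.</description>
    </item>
    <item>
      <title>Cup final tonight</title>
      <link>http://example.org/news/2</link>
      <description>Two teams meet in the final.</description>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Extract title, link and description from every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return items

def fresh_items(items, seen_links):
    """Keep only the items whose link has not been stored before."""
    return [it for it in items if it["link"] not in seen_links]

news = parse_rss(SAMPLE_RSS)
```

In a running portal the `seen_links` set would be replaced by a lookup in the news database, and the feed text would come from a periodic download of each source.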
Functionality & Architecture (2)
These actions may be run in a single stage / sequentially, as well as individually, at different moments in time
• Operation is determined by the quantity of processed data
• Functionality can be parallelized
Functionality of the portal may be broken into two different modules that are relatively independent:
• Agent module, that processes the news items and generates the web pages
• Web module that displays the information and implements search and personalization capabilities
• The two modules communicate using a database
Functionality & Architecture (3)
News Clustering
Preprocessing phase:
• Remove diacritical marks and other special characters that are not used by all the news sources
• Remove HTML tags and entities
• Eliminate stop words
• Tokenization
• Stemming – special Romanian stemmer
    • Inflexion rules are numerous and very complicated
    • They affect the inner structure of the words, not only the trailing part
    • Use a small set of solid rules (reduce the number of terms by 20-25%)
Vector space model – Boolean, frequency
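The preprocessing steps other than stemming can be sketched as below. The stop-word list is a tiny hypothetical sample, not the one used by the portal, and the Romanian stemmer is deliberately left out because its rules are not given on the slides.

```python
import html
import re
import unicodedata

# Tiny, hypothetical stop-word sample (already without diacritics).
STOP_WORDS = {"si", "de", "la", "in", "un", "o", "a", "cu"}

def strip_diacritics(text):
    """Decompose characters and drop the combining marks (e.g. 'ă' -> 'a')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(raw):
    """Turn a raw news snippet into a list of lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = html.unescape(text)                # decode HTML entities
    text = strip_diacritics(text).lower()
    tokens = re.findall(r"[a-z0-9]+", text)   # tokenization
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<p>Știri &amp; politică în România</p>")
```

Stripping diacritics before matching stop words means the stop-word list itself has to be stored without diacritics, as it is here.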
News Clustering – Algorithm
Hierarchical algorithm
• Agglomerative
• Hard assignment
• Average link
Used two thresholds:
• Higher value – to merge very similar items
    • Used in order to create very cohesive clusters
• Lower value – continue the process
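A minimal sketch of agglomerative, hard-assignment, average-link clustering. The slide does not describe how the two thresholds interact, so this sketch uses a single stopping threshold and should not be read as the authors' exact procedure; the Jaccard similarity and the toy term sets are also assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_link(c1, c2, sim):
    """Mean similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerative(items, sim, threshold):
    """Greedily merge the two most similar clusters until none reaches the threshold."""
    clusters = [[x] for x in items]           # start with one cluster per item
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), average_link(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i].extend(clusters[j])       # hard assignment: j is absorbed by i
        del clusters[j]                       # safe because i < j
    return clusters

news = [
    {"election", "vote", "party"},
    {"election", "vote", "result"},
    {"cup", "final", "goal"},
    {"cup", "final", "team"},
]
groups = agglomerative(news, jaccard, threshold=0.3)
```

The pairwise search makes every merge step quadratic in the number of clusters, which is acceptable for a sketch but would need caching of similarities for large batches of news.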
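News Clustering - Similarity
Used different measures:
• Frequency:
    • Inverse of a distance: $sim_1(a,b) = e^{-d(a,b)^2}$ and $sim_2(a,b) = \frac{1}{1 + d(a,b)}$
    • Cosine similarity: $sim_3(a,b) = \frac{\sum_{i=1}^{m} x_a[i]\, x_b[i]}{\sqrt{\sum_{i=1}^{m} x_a[i]^2}\,\sqrt{\sum_{i=1}^{m} x_b[i]^2}}$
• Boolean:
    • Jaccard similarity: $sim_4(a,b) = \frac{|x_a \text{ AND } x_b|}{|x_a \text{ OR } x_b|}$
    • Dice's coefficient: $sim_5(a,b) = \frac{2\,|x_a \text{ AND } x_b|}{|x_a| + |x_b|}$
    • Other: $sim_6(a,b)$, built from $|x_a \text{ AND } x_b|$, $\min(|x_a|,|x_b|)$, $\max(|x_a|,|x_b|)$ and a $\log_2(1 + \dots)$ term [exact form not recoverable from the slide]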
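The Jaccard and Dice measures, and one inverse-distance similarity, can be sketched as follows. Boolean vectors are represented as term sets, and the distance is assumed to be Euclidean since the slide does not name it; the function names are illustrative.

```python
import math

def jaccard(a, b):
    """|a AND b| / |a OR b| on Boolean vectors represented as term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """2 |a AND b| / (|a| + |b|) on term sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def inverse_distance(u, v):
    """1 / (1 + d(u, v)), with d assumed to be the Euclidean distance."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return 1.0 / (1.0 + d)

a = {"election", "vote", "party"}
b = {"vote", "party", "law"}
```

For the same pair of sets Dice is never smaller than Jaccard, so thresholds tuned for one measure do not transfer directly to the other.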
News Clustering - Results
• Used > 30 news sources
• Frequency-based implementation worked better
• News presented to the users:
    • The most important headlines – number of items in a group
    • The importance of a piece of news – similarity with the feature vector of the headline
News Classification
Categories: Romania, Politics, Economy, Culture, International, Sports, High-Tech and High Life
Classify the news clusters, not each piece of news individually
• Advantage: a cluster holds more feature information – more likely to be correctly classified
Training data – specialized RSS channels
Training Data
Training data: 3279 news items
• Cross validation
• 2/3 – used for training
• 1/3 – used for evaluation
• Unequal distribution
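One way to read the 2/3 and 1/3 split combined with cross validation is three-fold cross validation, where each third is held out once. That reading is an assumption, as are the shuffling seed and the function name in this sketch.

```python
import random

def three_fold_splits(items, seed=0):
    """Yield (train, test) pairs where each third of the data is held out once."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    for held_out in range(3):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test

# Item ids standing in for the 3279 news items mentioned on the slide.
splits = list(three_fold_splits(list(range(3279))))
```

Because the categories are unevenly represented, a stratified variant that splits each category separately would keep the class proportions equal across folds.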
Classifiers
k-NN classifiers:
• Simple k-NN (most similar item)
• k-NN with scores
• Center-based NN
    • Training phase – slower
    • Classification phase – faster
Various values for k = 1, 3, 5, …
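A minimal sketch of a center-based (nearest center) classifier, assuming each class center is the mean of its training vectors and that cosine similarity is used for comparison; the slides do not state how the centers were computed, and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def train_centers(examples):
    """Average the vectors of each class into one center per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in examples:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {t: w / counts[label] for t, w in terms.items()}
        for label, terms in sums.items()
    }

def classify(vec, centers):
    """Assign the label of the most similar class center."""
    return max(centers, key=lambda label: cosine(vec, centers[label]))

EXAMPLES = [
    ({"goal": 2, "match": 1}, "Sports"),
    ({"match": 2, "team": 1}, "Sports"),
    ({"vote": 2, "law": 1}, "Politics"),
]
centers = train_centers(EXAMPLES)
```

Training costs more than plain indexing because the centers must be computed, but classification compares a new item with one vector per class instead of with every training document.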
Classification Results
Nearest Center (center-based NN) had the best accuracy
Frequency-based vector space with cosine similarity produced slightly better results than the Boolean-based vector space
k-NN with scores (denoted "sum" in the results table) produced slightly better results than simple k-NN
Classification Results (2)
Confusion matrix for the NC classifier
Average recall = 0.59, Average precision = 0.62, Accuracy = 0.64, F1 = 0.61
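Averaged metrics of this kind can be derived from a confusion matrix as sketched below. The sketch assumes rows are true classes, columns are predicted classes, and the averages are unweighted (macro) means over classes; the 2x2 matrix is made-up toy data, not the matrix from the slide.

```python
def macro_metrics(cm):
    """Accuracy, macro recall and macro precision from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    recalls, precisions = [], []
    for i in range(n):
        row_total = sum(cm[i])                        # items truly in class i
        col_total = sum(cm[r][i] for r in range(n))   # items predicted as class i
        recalls.append(cm[i][i] / row_total if row_total else 0.0)
        precisions.append(cm[i][i] / col_total if col_total else 0.0)
    return {
        "accuracy": correct / total,
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
    }

toy = [[8, 2],
       [1, 9]]
metrics = macro_metrics(toy)
```

With unequal class sizes, accuracy and the macro averages can differ noticeably, because the macro averages give every category the same weight regardless of how many items it has.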
Conclusions
Alternative to classical news portals
• Solve the problems of large amounts of news and of information redundancy, by using the latter as an advantage
Web syndication and natural language processing techniques are used in order to achieve a human-independent functionality
Clustering is used to exploit similar news and group them into a single topic – presented to the user
Automatic classification of the news topics – advantage over a single piece of news
Further development:
• Improve the clustering and classification techniques
• Language independent or multilingual
Thank You!