Time.Mk

Post on 09-Jun-2015

A New Approach to News

Prof. Igor Trajkovski, Ph.D.

TIME.mk Proprietary

Motivation

– Traditionally, news readers first pick a publication and then look for headlines that interest them.

News Pipeline

Crawling, Extraction

Clustering

Story Classification

Scoring

Crawling and Extraction

Crawl

• Most of the Macedonian news sites don’t have RSS feeds.

• One-level crawl from a set of hubs:

– (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1

– (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2

– (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100

– (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26

– (None) http://www.kirilica.com.mk/

• Many hubs per source.

– Some sources have fixed hub addresses (A1, Makfax, etc.); for others, the hub addresses must be determined at runtime (Dnevnik, Utrinski, etc.)

• Hubs annotated with section name (topic):

– Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronic, None

• At the moment, hubs of the sources are provided manually.
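
A one-level crawl of this kind can be sketched as follows. This is a minimal illustration, not the TIME.mk crawler: the class and function names are hypothetical, the hub HTML is passed in as a string so that no particular download method is assumed, and the same-site filter is an assumption about how article links are told apart from external ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (absolute URL, link text) pairs from one hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> element currently open
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def one_level_crawl(hub_url, hub_html, section):
    """Return candidate articles linked from a hub, tagged with the hub's section."""
    parser = LinkCollector(hub_url)
    parser.feed(hub_html)
    host = urlparse(hub_url).netloc
    seen, found = set(), []
    for url, text in parser.links:
        # Assumption: articles live on the hub's own site; skip the hub
        # itself and links that were already collected.
        if urlparse(url).netloc == host and url != hub_url and url not in seen:
            seen.add(url)
            found.append({"url": url, "link_text": text, "section": section})
    return found
```

Each returned record keeps the link text and the hub's section name, since both are used later: the link text for title matching and the section for story classification.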

Article Extraction

Heuristics:

• Title matches link text and/or HTML title, and is above the body

• Body is a big run of unformatted Cyrillic text, below title

• Image is extracted from the hub page and has an attached link with exactly the same address as the article

Segment into title / body / image

The same procedure is used to extract all articles from all sources!
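
The title and body heuristics can be sketched as below. This is only an illustration under stated assumptions: the page is assumed to be already split into plain-text blocks in document order, the function names are hypothetical, and the minimum length used when matching against the HTML title is an arbitrary guard against short navigation labels. Image selection is left out.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def cyrillic_chars(text):
    """Number of Cyrillic characters in a text block."""
    return len(CYRILLIC.findall(text))


def segment_article(blocks, link_text, html_title):
    """Pick the title and body from page text blocks given in document order.

    Returns (title, body), or (None, None) if no title candidate is found.
    """
    def norm(s):
        return " ".join(s.split()).lower()

    link, page_title = norm(link_text), norm(html_title)

    # Title: first block equal to the hub link text, or contained in <title>.
    title_idx = None
    for i, block in enumerate(blocks):
        b = norm(block)
        if b and (b == link or (len(b) > 10 and b in page_title)):
            title_idx = i
            break
    if title_idx is None:
        return None, None

    # Body: the block below the title with the most Cyrillic characters.
    body = max(blocks[title_idx + 1:], key=cyrillic_chars, default="")
    return blocks[title_idx], body
```

Because nothing in the sketch depends on a site's markup, the same function can be applied to every source, which is the point the slide makes.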

Clustering

Partition news articles into disjoint subsets (clusters), such that:

– News articles within a cluster are very similar

– News articles in different clusters are very different
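
The slides do not name the clustering algorithm. As one possible illustration, the sketch below uses single-pass clustering with cosine similarity over sparse {word: weight} vectors (for example the TF-IDF weights from the Word weights slide). The algorithm choice, the function names, and the 0.5 threshold are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(articles, threshold=0.5):
    """Assign each article to the most similar existing cluster if its
    centroid is at least `threshold` similar, else start a new cluster.

    Returns a list of clusters, each a list of article indices.
    """
    clusters = []    # article indices per cluster
    centroids = []   # summed word weights per cluster
    for i, vec in enumerate(articles):
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            clusters[best].append(i)
            for t, w in vec.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            clusters.append([i])
            centroids.append(dict(vec))
    return clusters
```

Every article ends up in exactly one cluster, so the result is a disjoint partition. The summed centroid is not divided by the cluster size because cosine similarity ignores vector length.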

Word weights

The weight of a word is a function of its frequency within a document and across all documents

TF(w) = frequency of word w in a news article

• Intuition: a word appearing more frequently in a text is more likely to be related to its “meaning”

IDF(w) = log(N / n_w) + 1

• where N = number of news articles and n_w = number of news articles containing w

• Intuition: words appearing in many news articles are generally not very informative (e.g., “и” (and), “е” (is), “на” (on), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.)

TFIDF: weight of a word in a news article is product of these quantities:

• TFIDF(w) = TF(w) × IDF(w)

Example: A1, 17:15h, MK: Кривична пријава против Андреј Петров (Criminal charges against Andrej Petrov)

Top weighted terms: петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
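
The weighting above can be computed as in the following sketch. Two points are assumptions rather than facts from the slides: the logarithm is taken as the natural log, and the tokenizer truncates words to six letters, which is only a guess based on how the example terms look (кривич, комиси, дистри). The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text, prefix_len=6):
    """Lowercase, keep runs of letters, truncate each word to a short prefix."""
    return [w[:prefix_len] for w in re.findall(r"[^\W\d_]+", text.lower())]


def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document,
    with TF(w) = count of w in the document and IDF(w) = log(N / n_w) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))   # count each word once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append(
            {w: tf[w] * (math.log(n_docs / doc_freq[w]) + 1.0) for w in tf}
        )
    return weights


def top_terms(weight, k=10):
    """The k heaviest terms of one document, heaviest first."""
    return sorted(weight.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

A word that occurs in every article gets IDF = log(1) + 1 = 1, so its weight reduces to its raw count, while a word found in a single article gets the largest boost.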

Story Classification

Based on hub classification tags

• Cluster 1: news tagged Macedonia, Macedonia, Region, Macedonia, NONE → cluster tagged Macedonia

• Cluster 2: news tagged Culture, Fun, Macedonia, Culture, Macedonia → cluster tagged Culture
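
The slide does not spell out the rule that turns hub tags into a cluster topic. The sketch below assumes a plurality vote in which untagged articles do not vote and ties go to the tag that appears first in the cluster. That assumption reproduces both example clusters on the slide, but it may not be the rule TIME.mk actually uses. The function name is hypothetical.

```python
from collections import Counter


def classify_cluster(hub_tags):
    """Pick a cluster topic from the hub tags of the articles in the cluster.

    Returns "None" if no article carries a real tag.
    """
    votes = [t for t in hub_tags if t.upper() != "NONE"]
    if not votes:
        return "None"
    counts = Counter(votes)
    # Counter keeps insertion order and max() returns the first maximal
    # item, so ties are resolved in favor of the earliest-seen tag.
    return max(counts, key=counts.get)
```

For example, `classify_cluster(["Macedonia", "Macedonia", "Region", "Macedonia", "NONE"])` returns `"Macedonia"`.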

Scoring

Cluster Scoring Logic

Cluster Score = quality-of-sources * freshness-of-news

Quality of a source: How useful is this source?

- Non-duplicate fraction

- Participation in large stories

- First publisher of a top story
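
The product form of the cluster score can be illustrated as follows. Only the product itself comes from the slide. The way the three quality signals are combined (a plain average), the way source qualities are aggregated (a sum over distinct sources), and the exponential freshness decay with a six-hour half-life are all assumptions made for this sketch, as are the function and parameter names.

```python
def source_quality(non_dup_fraction, large_story_share, first_publisher_share):
    """Assumed combination: plain average of three signals, each in [0, 1]."""
    return (non_dup_fraction + large_story_share + first_publisher_share) / 3.0


def cluster_score(articles, quality_by_source, now_hours, half_life_hours=6.0):
    """articles: list of (source, published_hours) pairs for one cluster.

    Score = quality of sources * freshness of news.
    """
    sources = {src for src, _ in articles}
    # Assumed aggregation: each distinct source contributes its quality once.
    quality = sum(quality_by_source.get(src, 0.0) for src in sources)
    # Assumed freshness: halves every `half_life_hours` since the newest article.
    newest = max(published for _, published in articles)
    freshness = 0.5 ** ((now_hours - newest) / half_life_hours)
    return quality * freshness
```

With these choices, a story covered by many good sources ranks high while it is fresh and sinks steadily once no new articles arrive.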

Article Scoring Logic

Article Score

– Used for ranking within a cluster

– Function of:

• Age

• Quality of source

• Title overlap with cluster centroid

• Article size

• …
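
The slide lists the inputs of the article score but not the function. The sketch below shows one way the listed inputs could be combined. The multiplicative form, the half-life, the size cap, and the choice of the centroid's top ten terms are assumptions, and all names are hypothetical.

```python
def title_overlap(title_tokens, centroid, top_k=10):
    """Fraction of title tokens found among the centroid's top_k heaviest terms.

    centroid: {word: weight} vector of the cluster.
    """
    if not title_tokens:
        return 0.0
    top = set(sorted(centroid, key=centroid.get, reverse=True)[:top_k])
    return sum(t in top for t in title_tokens) / len(title_tokens)


def article_score(age_hours, quality, overlap, size_chars,
                  half_life_hours=6.0, full_size_chars=2000):
    """Assumed combination of age, source quality, title overlap and size."""
    freshness = 0.5 ** (age_hours / half_life_hours)
    size = min(size_chars / full_size_chars, 1.0)
    # Overlap and size only modulate the score between 50% and 100%.
    return freshness * quality * (0.5 + 0.5 * overlap) * (0.5 + 0.5 * size)
```

Within a cluster, articles would then be sorted by this score in descending order to choose which one is shown first.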

Stats and Future Work

News sources activity

[Bar chart: activity of each news source over a period of 2 working days; vertical axis from 0 to 200. Sources on the horizontal axis: ДНЕВНИК, УТРИНСКИ ВЕСНИК, НОВА МАКЕДОНИЈА, ВРЕМЕ, ВЕЧЕР, A1, КАНАЛ5, МАКФАКС, НЕТПРЕС, ВЕСТ, КИРИЛИЦА, СИТЕЛ, АЛСАТ-М, ФОРУМ, АЛФА ТВ, КУРИР, ИДИВИДИ, МТВ, КАЈГАНА, БРОКЕР, МАКДЕНЕС, BBC, SETIMES, ON.NET, ЗАЗАБАВА, ТЕЛМА]

Visitors

[Chart: number of visitors per month, Jul to Dec; vertical axis from 0 to 7000; data labels on the chart: 100, 180, 700, 1500, 4500 and 6500]

Events marked on the chart:

• Article about TIME.mk in Нова Македонија

• ON.net started to present TIME.mk news

• 365.com.mk started to present TIME.mk news

• Discussions on MK forums

• Launch of TIME.mk, 1 July 2008

Regional expansion - Slovenia

Next stop: Serbia

Next to come …

• Search of the archive

• RSS feeds

• Click metrics & personalization

– cluster ranking adjustable to user preferences

• News alerts

– emails with links to news that contains the provided keywords

• Weekly and Monthly news threads

• New topics: Technology, Health, etc.

• Inclusion of other news sources (currently only 26)

• Automatic Hub discovery

• Improvements in the clustering algorithms (more sophisticated NLP)

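– СДСМ = СДС, премиер = груевски, нафта = бензин, АМС = агенција за млади и спорт, струја = електрична енергија, etc. (SDSM = SDS, prime minister = Gruevski, oil = petrol, AMS = Agency for Youth and Sport, electricity = electric power)

– Go beyond duplicate detection by measuring the introduction of new facts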
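
One simple way to treat such equivalent terms as the same word before clustering is a lookup table applied to the text. This sketch is not part of the current system and is much cruder than the "more sophisticated NLP" the slide has in mind: the table holds only the pairs listed above, the choice of which side is canonical is arbitrary, and inflected forms are not handled. The names are hypothetical.

```python
import re

# Variant -> canonical form, using the equivalences listed on the slide.
# Which side counts as canonical is an arbitrary choice for this sketch.
CANONICAL = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
    "агенција за млади и спорт": "амс",
    "електрична енергија": "струја",
}

# Longest variants first, matched only at word boundaries so that
# "сдс" is not replaced inside "сдсм".
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(CANONICAL, key=len, reverse=True))
    + r")\b"
)


def normalize(text):
    """Lowercase the text and replace known variants with their canonical form."""
    return _PATTERN.sub(lambda m: CANONICAL[m.group(0)], text.lower())
```

Running the normalizer before tokenization would let two articles that use different names for the same entity share terms and therefore score as more similar.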

Acknowledgments

- Pajo & Biba for registering TIME.mk in MARNET

- Karolina for offering DNS services and HTML/CSS tricks

- Igor (Zuljo) for implementing the new design

- Nikola and Daniel for implementing text extraction for TIMES.si

- many many users for suggesting improvements by sending tons of emails with bugs on TIME.mk pages

Thank You!

Q&A