1
T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration
Yoshiharu Ishikawa and Mikine Hasegawa
Nagoya University
2
Outline
  Background and objective
  Related work
  Novelty-based document clustering
  Overview of T-Scroll system
  Evaluation
  Conclusions and future work
3
Background
Time-series of documents
  Examples: news articles delivered on the Internet, online academic journals
  Delivered continually, every day
Problems
  A large number of documents: appropriate summarization is required
  Topics change over time: topic detection/tracking and trend extraction are useful
4
Objectives
Development and evaluation of T-Scroll (Trend/Topic-Scroll)
  User interface for visualizing the transition of topics extracted from a time-series of documents
System features
  Built on a document clustering system that outputs new clustering results periodically
  Clusters are displayed along the time axis like a scroll
  Links are shown between related clusters to represent topic transitions
  Several features support interactive exploratory analysis
5
7
Visualization of a time-series of documents
Only a few systems visualize trends in a time-series of documents
ThemeRiver (Havre et al., IEEE Trans. VCG, 2002) [4]
  Visualizes topic streams like a river
  Focuses on visual impact
  No features for analysis and browsing
TimeMine (Swan and Allan, SIGIR'00) [5]
  Extracts topics from a time-series of documents
  Displays timelines on the screen to represent topics
8
ThemeRiver
Analysis of the articles related to Cuba (1960 – 1961)
9
TimeMine Swan & Allan (U. of Massachusetts)
10
Analysis of time-dependent clusters
Mei & Zhai (KDD'05) [6]
  Statistical approach for discovering major topics from a time-series of documents
  Probabilistic modeling
MONIC (Spiliopoulou et al., KDD'06) [7]
  Detects various types of patterns in cluster transitions
  Examples: splitting/merging of clusters, cluster size changes
  Based on the analysis of historical snapshots of clusters
12
Novelty-based document clustering (1)
Developed by our group (ECDL'01 [8], WWW Journal 2007 [10], etc.)
Clusters documents incrementally based on their similarity and novelty
Features
  Similarity considers novelty
    Assigns high weights to recent documents and low weights to old ones
    Document weights decay as time passes, based on the concept of obsolescence (aging)
  Old documents whose weights are smaller than a threshold are deleted
  Incremental processing: low update cost
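The aging and deletion steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the exponential weight λ^(τ − T_i) given on the "Document similarity (1)" slide, and the names `Doc`, `weight`, and `prune`, the time unit (days), and the values of `lam` and `threshold` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    acquired: float  # acquisition time T_i (assumed unit: days)


def weight(doc, now, lam):
    """Decayed weight lam ** (now - T_i), with forgetting factor 0 < lam < 1."""
    return lam ** (now - doc.acquired)


def prune(docs, now, lam, threshold):
    """Keep only documents whose weight is still at or above the threshold."""
    return [d for d in docs if weight(d, now, lam) >= threshold]


# Illustrative values only: three documents acquired on days 0, 5, and 9.
docs = [Doc("a", 0.0), Doc("b", 5.0), Doc("c", 9.0)]
kept = prune(docs, now=10.0, lam=0.8, threshold=0.2)
print([d.doc_id for d in kept])  # → ['b', 'c']
```

Document "a" is ten days old, so its weight (0.8^10 ≈ 0.11) has fallen below the threshold and it is dropped before the next clustering run.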
13
Novelty-based document clustering (2)
Periodic clustering is performed on a time-series of documents
[Figure: clusters at two consecutive clustering times τ along the time axis. Example topics: "Yeltsin's Death", "New President Sarkozy", "Blair to Resign", and other articles. At the later clustering time, "Yeltsin's Death" and other documents are obsolete.]
14
Document similarity (1)
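Document weight: $dw_i|_\tau = \lambda^{\tau - T_i}$
  $dw_i$: the weight of document $d_i$ at time $\tau$
  $T_i$: acquisition time of document $d_i$
  $\tau$: current time
  $\lambda$ ($0 < \lambda < 1$): forgetting factor, which determines the forgetting speed
The weight of a document decreases exponentially as time passes.
Assumption: each delivered document gradually loses its value as time passes
[Figure: $dw_i$ plotted against time $t$; the weight is 1 at the acquisition time $T_i$ and has decayed to $\lambda^{\tau - T_i}$ at the current time $\tau$]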
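As a worked example of how the forgetting factor interacts with the deletion threshold mentioned earlier: the symbol $\varepsilon$ for the threshold and the numeric values below are illustrative assumptions, not taken from the slides. A document is kept while its weight is at least $\varepsilon$, so it is deleted once

$$
\lambda^{\tau - T_i} < \varepsilon
\quad\Longleftrightarrow\quad
\tau - T_i > \frac{\ln \varepsilon}{\ln \lambda}
$$

The inequality flips because $\ln \lambda < 0$ for $0 < \lambda < 1$. With $\lambda = 0.8$ per day and $\varepsilon = 0.2$, the bound is $\ln 0.2 / \ln 0.8 \approx 7.21$, so a document survives about seven days: $0.8^{7} \approx 0.21 \ge 0.2$, while $0.8^{8} \approx 0.17 < 0.2$.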
15
Document similarity (2)
Similarity score of documents $d_i$ and $d_j$
  Based on the novelty of documents and the word occurrence patterns in the documents
  Extension of the tf-idf method
  New documents have a high impact on the clustering result
Document clustering: k-means method

$$
sim(d_i, d_j) = \Pr(d_i, d_j) = \Pr(d_i)\,\Pr(d_j)\,\frac{d_i \cdot d_j}{len_i \times len_j}
$$
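The similarity score can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes documents are sparse term-weight vectors, that $len_i$ is the Euclidean length of the vector for $d_i$, and that $\Pr(d_i)$ is passed in as a precomputed document weight. The function name `similarity` and the example vectors are hypothetical.

```python
import math


def similarity(vec_i, vec_j, pr_i, pr_j):
    """sim(d_i, d_j) = Pr(d_i) * Pr(d_j) * (d_i . d_j) / (len_i * len_j).

    vec_i, vec_j: dicts mapping term -> weight (assumed representation).
    pr_i, pr_j: document weights Pr(d_i), Pr(d_j).
    """
    dot = sum(w * vec_j.get(term, 0.0) for term, w in vec_i.items())
    len_i = math.sqrt(sum(w * w for w in vec_i.values()))
    len_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if len_i == 0.0 or len_j == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return pr_i * pr_j * dot / (len_i * len_j)


# Two documents with identical content: the score depends only on novelty.
doc = {"sarkozy": 1.0, "president": 1.0}
recent = similarity(doc, doc, pr_i=0.5, pr_j=0.5)
stale = similarity(doc, doc, pr_i=0.1, pr_j=0.1)
print(recent > stale)  # → True
```

The same pair of texts scores lower once both documents have aged, which is how recent documents come to dominate the clustering result.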
17
T-Scroll: Idea
Periodic clustering results are displayed like a scroll
Links represent related cluster pairs
18
19
System functionalities (1)
Cluster labels: terms are selected based on the score
  $score(t_j) = \sum_{d_i \in C_p} \Pr(d_i) \cdot tf_{ij}$
  $\Pr(d_i)$: document weight, $tf_{ij}$: term frequency count
Cluster sizes: the ellipse size roughly corresponds to the number of documents
Links: a link is shown between two clusters if their score is greater than the threshold
  $score(C_i, C_j) = \Pr(d \in C_j \mid d \in C_i) = \dfrac{|C_i \cap C_j|}{|C_i|}$
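The label and link scores can be sketched as follows. This is a minimal sketch under stated assumptions: clusters are sets of document ids, each document is a list of term tokens, and the names `label_scores` and `link_score` and the example data are hypothetical.

```python
from collections import Counter


def label_scores(cluster_docs, pr):
    """score(t_j) = sum over d_i in C_p of Pr(d_i) * tf_ij.

    cluster_docs: dict doc_id -> list of term tokens in that document.
    pr: dict doc_id -> document weight Pr(d_i).
    """
    scores = Counter()
    for doc_id, terms in cluster_docs.items():
        for term, tf in Counter(terms).items():
            scores[term] += pr[doc_id] * tf
    return scores


def link_score(c_i, c_j):
    """score(C_i, C_j) = |C_i ∩ C_j| / |C_i| for clusters given as sets of ids."""
    if not c_i:
        return 0.0  # assumption: an empty cluster links to nothing
    return len(c_i & c_j) / len(c_i)


cluster = {
    "d1": ["sarkozy", "president", "sarkozy"],
    "d2": ["president", "election"],
}
scores = label_scores(cluster, pr={"d1": 0.5, "d2": 0.25})
print(scores.most_common(1)[0][0])  # → sarkozy

earlier = {"d1", "d2", "d3", "d4"}
later = {"d3", "d4", "d5"}
print(link_score(earlier, later))  # → 0.5
```

A link between the two clusters would then be drawn whenever `link_score` exceeds the chosen threshold. Note that the score is asymmetric: it is normalized by the size of the first cluster only.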
20
System functionalities (2)
Cluster quality: visualized using different colors for the cluster border lines, from red (good) to purple (bad)
  $quality(C_p) = |C_p| \cdot avg\_sim(C_p)$
  $avg\_sim(C_p) = \dfrac{1}{|C_p|\,(|C_p| - 1)} \sum_{d_i, d_j \in C_p,\; d_i \neq d_j} sim(d_i, d_j)$
A high score is achieved if (1) the cluster size is large, and (2) the documents contained in the cluster are similar
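The quality score can be sketched as below. This is an illustration, not the system's code: the similarity function is passed in as a parameter, the names `avg_sim` and `quality` mirror the formula, and returning 0.0 for clusters with fewer than two documents is an assumption the slides do not state.

```python
from itertools import permutations


def avg_sim(cluster, sim):
    """Average of sim(d_i, d_j) over all ordered pairs with d_i != d_j."""
    n = len(cluster)
    if n < 2:
        return 0.0  # assumption: no pairs means no measurable cohesion
    total = sum(sim(a, b) for a, b in permutations(cluster, 2))
    return total / (n * (n - 1))


def quality(cluster, sim):
    """quality(C_p) = |C_p| * avg_sim(C_p)."""
    return len(cluster) * avg_sim(cluster, sim)


def toy_sim(a, b):
    """Toy similarity for illustration: every pair of documents scores 0.5."""
    return 0.5


print(quality(["d1", "d2", "d3"], toy_sim))  # → 1.5
```

Multiplying by the cluster size is what lets a large, moderately cohesive cluster outrank a tiny but very tight one.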
21
System functionalities (3)
Drill-down/roll-up: the user can interactively specify the interval between two consecutive clusterings (e.g., one day, one week)
Displaying the keyword list: the user can browse the keyword list for a specified cluster
Access to the original documents
Keyword-based emphasis: clusters that contain a user-specified keyword are emphasized
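Keyword-based emphasis amounts to a lookup over the per-cluster keyword lists. The sketch below assumes each cluster id maps to its keyword list and that matching is case-insensitive. Both the data layout and the function name `clusters_to_emphasize` are hypothetical.

```python
def clusters_to_emphasize(cluster_keywords, query):
    """Return ids of clusters whose keyword list contains the query keyword."""
    q = query.strip().lower()
    return [
        cluster_id
        for cluster_id, keywords in cluster_keywords.items()
        if q in (k.lower() for k in keywords)
    ]


cluster_keywords = {
    "c1": ["Yeltsin", "Russia"],
    "c2": ["Sarkozy", "election", "France"],
    "c3": ["Blair", "election"],
}
print(clusters_to_emphasize(cluster_keywords, "Election"))  # → ['c2', 'c3']
```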
22
Demo
23
System implementation
T-Scroll module
  Written in Perl: generates an SVG file
  The browser displays the generated SVG file
  The SVG file includes scripts (JavaScript) used for interactive manipulation
Clustering module
  Written in Ruby
  Novelty-based incremental document clustering
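To make the SVG generation step concrete, here is a small sketch in Python. The actual module is written in Perl, so this is a stand-in and not the authors' code. It assumes time runs left to right with one column per clustering time, scales ellipses by the square root of the document count, and reduces the quality color to a red/purple flag. The function `render_scroll` and its parameters are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SVG_NS = "http://www.w3.org/2000/svg"


def render_scroll(snapshots, links, col_width=160, row_height=70):
    """Render clustering snapshots as an SVG string.

    snapshots: one list per clustering time of (label, n_docs, is_good) tuples.
    links: pairs ((t0, k0), (t1, k1)) of (time index, cluster index) to connect.
    """
    def center(t, k):
        return col_width * t + col_width // 2, row_height * k + row_height // 2

    parts = []
    for (t0, k0), (t1, k1) in links:  # draw links first so ellipses sit on top
        x0, y0 = center(t0, k0)
        x1, y1 = center(t1, k1)
        parts.append(f'<line x1="{x0}" y1="{y0}" x2="{x1}" y2="{y1}" stroke="gray"/>')
    for t, clusters in enumerate(snapshots):
        for k, (label, n_docs, is_good) in enumerate(clusters):
            cx, cy = center(t, k)
            rx = 10 + 6 * math.sqrt(n_docs)  # size grows with document count
            stroke = "red" if is_good else "purple"
            parts.append(
                f'<ellipse cx="{cx}" cy="{cy}" rx="{rx:.1f}" ry="{rx / 2:.1f}" '
                f'fill="white" stroke="{stroke}"/>'
            )
            parts.append(
                f'<text x="{cx}" y="{cy}" text-anchor="middle">{escape(label)}</text>'
            )
    width = col_width * len(snapshots)
    height = row_height * max((len(c) for c in snapshots), default=1)
    return (f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}">'
            + "".join(parts) + "</svg>")


snapshots = [
    [("Yeltsin", 12, True), ("Blair", 5, False)],
    [("Sarkozy", 20, True), ("Blair", 8, True)],
]
links = [((0, 1), (1, 1))]
svg = render_scroll(snapshots, links)
root = ET.fromstring(svg)  # parsing confirms the output is well-formed XML
print(len(root.findall(f"{{{SVG_NS}}}ellipse")))  # → 4
```

Interactive behavior, which the real system implements with JavaScript embedded in the SVG file, is left out of this sketch.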
24
System architecture
[Figure: system architecture. Components: RSS Feed Module, Clustering Module, T-Scroll (T-Scroll Main Module and SVG Output Module in Perl, SVG Control Module in JavaScript), SVG file (includes JavaScript), browser with plug-in, and user. Labeled flows: news articles are the input and the clustering result is the output of the clustering side; the clustering result is the input to T-Scroll, which outputs the SVG file; the user issues command inputs and interactive manipulation and views the cluster display.]
26
Evaluation
10 users
Data set
  Japanese news articles collected from news web sites from Sept. 2006 to Feb. 2007
  100 articles per day
  Clustering was performed at six-hour intervals
Evaluation criteria
  Overall impressions
  Evaluation of each function
  Observability of topics
  Comparison with ThemeRiver
27
Overall impression
Each user gives scores from 0 to 5
[Figure: bar chart of scores (0 to 5) for Usability, Understandability, Usefulness, and Design]
28
Evaluation on each function
[Figure: bar chart of scores (0 to 5) for each function: Scroll, DocNum, Label, Quality, Keyword, TitleList, Emphasis, and Interval]
29
Observability of topics (1)
Can users observe the major topics in Nov. 2006?
We specified five major topics; each user scored how clearly he or she could observe each topic
[Figure: bar chart of scores (0 to 5) for Topic 1 through Topic 5]
30
Observability of topics (2)
10 users (different from the former experiments)
Users reported the topics they observed, with a score for each, without being given any information in advance
Topics 1 to 5 are the major topics used in the previous experiment
Topic 2 (big hurricane) was regarded as a normal weather topic
[Figure: chart of the number of answers and the score (axis 0 to 8) for Topics 1, 4, 3, 6, 7, 5, 8, 9, 10, and 11]
31
Comparison with ThemeRiver (1)
A ThemeRiver-like display was created manually for the news articles in Dec. 2006
11 users (different from the previous experiments)
Questions to users
  Overall impressions
  Observability of topics
32
33
Comparison with ThemeRiver (2)
Overall impression

Category                        No. of replies
T-Scroll is better              2
T-Scroll is slightly better     3
Almost same                     3
ThemeRiver is slightly better   3
ThemeRiver is better            0
34
Comparison with ThemeRiver (3)
Can users observe the five major topics that we selected?

Category     No. of replies
Good         0
Possible     3
No good      4
Impossible   4
35
Summary of experiments
Overall impressions
  Good, but usability needs improvement
  Some users commented on the response speed
System functionalities
  Several features (quality info, article lists, etc.) are useful in practice
  Appropriate labels are necessary: label selection should be improved
Comparison with ThemeRiver
  ThemeRiver has visual impact, but its display tends to become complicated when there are many topics
37
Conclusions and future work
Development and evaluation of the T-Scroll system
  Based on a novelty-based incremental clustering method
  Scroll-like display for showing changing trends
  Several features for interactive analysis
Evaluation
  Overall impression
  Observability of topics
  Comparison with ThemeRiver
Future work
  More sophisticated keyword (label) selection
  Improvement of interactive speed