Page 1: Analysis of Social Media Streams · 2.2.1 Social Media Streams –Summarization Word-Variance Based Approach TU Dresden, 21.01.2014 Analysis of Social Media Streams; Florian Weidner

Analysis of Social Media Streams

Florian Weidner Dresden, 21.01.2014

Outline

1.Introduction

2.Social Media Streams

• Clustering

• Summarization

3.Topics

• Detection

• Tracking

4.Conclusion

1. Introduction

• A lot of data → hidden and obvious information

• Important for users, organizations, …

• Algorithms for static data are well researched

• However: processing of streams is still "in its early stages" [1]

State-of-the-art overview

2. Social Media Streams

• High frequency

• Continuous

• Different kinds of data

• Text, links, pictures, meta-data…

• Human language is a problem!

2.1 Social Media Streams - Clustering

[Figure: example clustering of Tweets A–F under the hashtags #bigdata, #catfact, and #clustering]

• Find groups of similar instances without prior knowledge!

• Curse of dimensionality

• Outliers

2.1.1 Social Media Streams – Clustering: Cluster Droplets, Similarity & Fading Functions

• Cluster Droplet (CD): statistical information (recency, #tweets, weights,…)

• Similarity function: cosine similarity, dice coefficient,…

• Fading Function: decay of cluster
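
The two similarity functions named above, together with one possible fading function, can be sketched as follows. This is a minimal illustration, not the formulation of any cited paper: the bag-of-words representation, the exponential decay with a half-life, and all function and parameter names are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dice_coefficient(a: set, b: set) -> float:
    """Dice coefficient between two term sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def fading_weight(age: float, half_life: float) -> float:
    """Assumed fading function: the weight halves every `half_life` time units."""
    return 2.0 ** (-age / half_life)


# Hypothetical example data: one incoming Tweet and one cluster profile.
tweet = Counter("big data needs stream clustering".split())
cluster = Counter("stream clustering of big text data".split())

print(cosine_similarity(tweet, cluster))
print(dice_coefficient(set(tweet), set(cluster)))
print(fading_weight(age=10.0, half_life=10.0))
```

A cluster whose faded weight drops below some cut-off could then be discarded, which is one way to read "decay of cluster".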

2.1.2 Social Media Streams – Clustering: Variable Feature Sets

• Feature Set

• Validity Index (VI)

• Clustering Threshold (CT)

• Reselection Threshold (RT)

2.1.2 Social Media Streams – Clustering: Variable Feature Sets

1. Get Text

2. Insert into cluster

3. Calculate VI

4. Compare with CT & RT

2.2 Social Media Streams - Summarization

• Input stream is huge → summarize based on intervals

• Cluster can still contain a huge amount of data → summarize clusters

• Single sentence vs. Multiple sentence

• New text vs. Text from stream

• Noise

2.2.1 Social Media Streams – Summarization: Word-Variance Based Approach

Phrase Reinforcement Algorithm builds a tree

Output:

Set of sentences which summarize stream!

2.2.1 Social Media Streams – Summarization: Word-Variance Based Approach

1. A tragedy: Ted Kennedy died today of cancer

2. Ted Kennedy died today

3. Ted Kennedy was a leader

4. Ted Kennedy died at Age 77
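
The tree-building idea can be sketched on these four sentences. This is a deliberately simplified illustration and not the published Phrase Reinforcement algorithm: it only looks at the words to the right of the topic phrase, uses raw occurrence counts as node weights, and ignores stop words and the left-hand context. The function names are hypothetical.

```python
def build_right_tree(sentences, topic):
    """Build a word tree of everything that follows `topic` in each sentence.

    Each node is a dict: {"count": int, "children": {word: node}}.
    """
    root = {"count": 0, "children": {}}
    topic_words = topic.lower().split()
    n = len(topic_words)
    for sentence in sentences:
        words = sentence.lower().split()
        # Find the first position where the topic phrase occurs.
        start = next(
            (i for i in range(len(words) - n + 1) if words[i:i + n] == topic_words),
            None,
        )
        if start is None:
            continue
        root["count"] += 1
        node = root
        for word in words[start + n:]:
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def heaviest_path(node):
    """Return (weight, words) of the root-to-leaf path with the largest summed count."""
    best_weight, best_words = 0, []
    for word, child in node["children"].items():
        weight, words = heaviest_path(child)
        weight += child["count"]
        if weight > best_weight:
            best_weight, best_words = weight, [word] + words
    return best_weight, best_words


sentences = [
    "A tragedy: Ted Kennedy died today of cancer",
    "Ted Kennedy died today",
    "Ted Kennedy was a leader",
    "Ted Kennedy died at Age 77",
]

tree = build_right_tree(sentences, "Ted Kennedy")
weight, words = heaviest_path(tree)
print("ted kennedy " + " ".join(words))
```

Under these simplifications the branch "died today of cancer" wins because "died" occurs three times and "today" twice; a different weighting scheme could prefer a shorter path.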

2.2.2 Social Media Streams – Summarization: Distance Metrics

• Tweet-Cluster-Vector (timestamp, meta)

• Goal: extract k Tweets which cover as much content as possible

• Distance of Tweet to cluster centroid

• Size of cluster

• Centrality scores
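
One way to combine cluster size and centroid distance when extracting k Tweets is sketched below. It is an assumption-laden illustration: Tweets are given as precomputed numeric vectors, Euclidean distance stands in for whatever metric a real system uses, and the selection rule (one Tweet from each of the k largest clusters) is hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def summarize(clusters, k):
    """Pick one representative Tweet from each of the k largest clusters.

    `clusters` is a list of clusters; each cluster is a list of
    (tweet_text, vector) pairs. The representative is the Tweet whose
    vector lies closest to the cluster centroid.
    """
    largest = sorted(clusters, key=len, reverse=True)[:k]
    summary = []
    for cluster in largest:
        center = centroid([vec for _, vec in cluster])
        text, _ = min(cluster, key=lambda item: euclidean(item[1], center))
        summary.append(text)
    return summary


# Hypothetical two-dimensional Tweet vectors.
clusters = [
    [("quake hits city", [1.0, 0.0]), ("big quake", [0.8, 0.2]), ("earthquake!", [0.9, 0.1])],
    [("cat facts", [0.0, 1.0]), ("more cat facts", [0.1, 0.9]), ("cats?", [0.5, 0.5])],
    [("lunch", [0.5, 0.5])],
]
print(summarize(clusters, k=2))
```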

3. Topics

• Abstract topic vs. real-life topic (event)

• Small-scale vs. large-scale → short duration and little information vs. long-lasting and a lot of data

• Semantic features important!

• For events, the location is important!

• Semantic features and weblinks

3.1 Topics - Detection

• Topic augmentation → external topic as input

• Topic detection → w/o prior knowledge

• Clustering is important/simplifies the topic detection

3.1.1 Topics – Detection: Word-Variance

• Topics are time-dependent!

• Simple solution: increase of certain words (e.g. "earthquake")

→ Count words in intervals and compare!

3.1.1 Topics – Detection: Word-Variance

1. Preprocessing

2. Calculate word frequencies of incoming data for each time window

3. If there is a significant increase (threshold), keep word

4. Calculate correlations for all remaining words and cluster them
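
The four steps above can be sketched as follows. The slide does not say how "significant increase" or "correlation" is measured, so this sketch makes assumptions: a smoothed count ratio between two consecutive windows stands in for the increase test, and Jaccard overlap of the Tweets in which two words occur stands in for correlation. The stop-word list, thresholds, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {"a", "an", "the", "in", "is", "of", "to", "and"}  # assumed, tiny list


def preprocess(text):
    """Step 1: lowercase, strip punctuation, drop stop words."""
    words = (w.strip(".,!?:;#\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOPWORDS]


def window_counts(tweets):
    """Step 2: number of Tweets per word within one time window."""
    counts = Counter()
    for text in tweets:
        counts.update(set(preprocess(text)))
    return counts


def bursty_words(previous, current, ratio=2.0, min_count=2):
    """Step 3: keep words whose count grew by at least `ratio` between windows."""
    prev, cur = window_counts(previous), window_counts(current)
    # Add-one smoothing so words unseen in the previous window do not divide by zero.
    return {w for w, c in cur.items() if c >= min_count and c / (prev[w] + 1) >= ratio}


def group_words(words, tweets, min_overlap=0.5):
    """Step 4: merge bursty words that tend to occur in the same Tweets."""
    tokens = [set(preprocess(t)) for t in tweets]
    occurs = {w: {i for i, tok in enumerate(tokens) if w in tok} for w in words}
    groups = {w: {w} for w in words}
    for a, b in combinations(sorted(words), 2):
        union = occurs[a] | occurs[b]
        jaccard = len(occurs[a] & occurs[b]) / len(union) if union else 0.0
        if jaccard >= min_overlap and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for w in merged:
                groups[w] = merged
    return {frozenset(g) for g in groups.values()}


previous = ["good morning dresden", "coffee time", "traffic in the city"]
current = [
    "earthquake in the city",
    "strong earthquake tremor",
    "earthquake tremor felt",
    "coffee time",
    "football tonight",
    "football match tonight",
]

bursty = bursty_words(previous, current)
topics = group_words(bursty, current)
print(bursty)
print(topics)
```

With this toy data, "coffee" is filtered out because its count does not rise, while the remaining words fall into two groups that could be read as two candidate topics.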

3.1.2 Topics – Detection: Location

• Filter and cluster incoming data according to their location (just longitude/latitude)

• Weight Tweets and clusters with the help of features (textual, other)

If weight > threshold → Topic
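
A rough sketch of this location-based idea follows. Everything beyond "cluster by longitude/latitude, weight, compare with a threshold" is an assumption: a fixed grid replaces a real spatial clustering method, and the weight of a Tweet is a hypothetical textual feature (one point per Tweet plus one per event keyword).

```python
import math
from collections import defaultdict

EVENT_WORDS = {"flood", "evacuation", "concert"}  # hypothetical textual feature


def grid_cell(lat, lon, cell_size=0.1):
    """Map a coordinate to a grid cell of `cell_size` degrees."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def tweet_weight(text):
    """Stand-in weight: 1 per Tweet plus 1 per event keyword it contains."""
    words = set(text.lower().split())
    return 1 + len(words & EVENT_WORDS)


def detect_local_topics(tweets, threshold=5.0):
    """Return the grid cells whose summed Tweet weight exceeds `threshold`."""
    weights = defaultdict(float)
    for text, lat, lon in tweets:
        if lat is None or lon is None:  # filter: only geotagged Tweets
            continue
        weights[grid_cell(lat, lon)] += tweet_weight(text)
    return {cell: w for cell, w in weights.items() if w > threshold}


# Hypothetical Tweets as (text, latitude, longitude).
tweets = [
    ("flood at the river", 51.05, 13.74),
    ("evacuation because of flood", 51.06, 13.75),
    ("flood everywhere", 51.04, 13.73),
    ("nice weather", 52.52, 13.41),
    ("no location", None, None),
]
print(detect_local_topics(tweets))
```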

3.1.3 Topics – Detection: Authority Score & Tweet Influence

• Key users + selected users

• Key words + selected words → Repository

Authority Score: Importance of the authors of the tweets in the cluster

Topical Tweet Influence: How many important keywords are in the cluster?
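
The slide gives only verbal definitions of the two scores, so the formulas below are assumptions chosen to match those descriptions: the authority score is taken as the mean authority of the Tweet authors in a cluster, and the tweet influence as the share of words in the cluster that appear in the keyword repository. The per-user authority values and all names are hypothetical.

```python
def authority_score(cluster, user_authority):
    """Mean authority of the authors of the Tweets in the cluster."""
    if not cluster:
        return 0.0
    return sum(user_authority.get(author, 0.0) for author, _ in cluster) / len(cluster)


def tweet_influence(cluster, keywords):
    """Fraction of word occurrences in the cluster that are repository keywords."""
    words = [w for _, text in cluster for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)


# Hypothetical repository: key users with authority values, and key words.
user_authority = {"@press_office": 1.0, "@employee": 0.5}
keywords = {"product", "launch", "recall"}

# A cluster is a list of (author, tweet_text) pairs.
cluster = [
    ("@press_office", "product recall announced"),
    ("@employee", "recall affects new product"),
    ("@random", "what happened"),
]

print(authority_score(cluster, user_authority))
print(tweet_influence(cluster, keywords))
```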

3.1.3 Topics – Detection: Authority Score & Tweet Influence

1. Cluster incoming data "frequently" with similarity function

2. Calculate Topical User Authority Score & Topical Tweet Influence of each cluster

3. Weight words and rank them → emerging topic

4. Machine learner (6 features) → hot emerging topic

3.3 Topics and Events - Tracking

• Track topic during a period of time → display (only) related content

• Track spatial development → evaluate geotags and keywords

3.3.1 Topics and Events – Tracking: Tracking of an interesting topic

[Figure: tracking of an interesting topic. Labels: incoming Tweets; Query for topic; Content Model (quality features, semantic features); Feedback Model (compare Tweet with x previous and best descriptive Tweets); Background Corpus; Foreground Corpus; Display; "???"]

4. Conclusion

• No holistic solution

• Filtered stream

• Utilization of data sources

→ just single-purpose solutions

• Many restrictions!

• Few open-source frameworks (a lot of conceptual work)

Many different solutions:

• Cluster Droplets, Fading & Similarity Functions

• Variable Feature Sets

• Word-Variance

• Distance

• Scores (Authority, Tweet Influence)

• Content & Feedback Model

Thank you for your attention!

5. References

[1] Gong L. - Text Clustering algorithm based on adaptive feature selection, Expert Systems with Applications, 2011

[2] Aggarwal C. - On clustering massive text and categorical data streams, Knowledge and Information Systems, 2009

[3] Sharifi B. - Summarizing Microblogs Automatically, HLT '10, 2010

[4] Chakrabarti D. - Event Summarization Using Tweets, AAAI '11, 2011

[5] Shou L. - Sumblr: continuous summarization of evolving tweet streams, ACM SIGIR '13, 2013

[6] Olariu A. - Hierarchical clustering in improving microblog stream summarization, Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing, 2013

[7] Chen Y. - Emerging topic detection for organizations from microblogs, ACM SIGIR '13, 2013

[8] Hong Y. - Exploiting topic tracking in real-time tweet streams, UnstructuredNLP '13, 2013

[9] Hong L. - Discovering geographical topics in the twitter stream, WWW '12, 2012
