
Notes on mining social media updated

Date post: 11-May-2015
Upload: gary-myers-kmb-unit-york-university
Page 1: Notes on mining social media updated

Notes On: MINING SOCIAL MEDIA COMMUNITIES AND CONTENT

by Akshay Java

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2008

Page 2: Notes on mining social media updated

• Open Access Link: http://bit.ly/9PCfuQ

• Key Words: social media, folksonomies (tags), social graph, structural vs. semantic information/knowledge

• Disclaimer: Most notes are taken directly from the paper/article and should be appropriately referenced/cited directly from the author(s)

Page 3: Notes on mining social media updated

Introduction

• Social media is described as…

an umbrella term that defines the various activities that integrate technology, social interaction, and the construction of words, pictures, videos and audio. This interaction, and the manner in which information is presented, depends on the varied perspectives and “building” of shared meaning, as people share their stories, and understandings.

Institute for language and information technologies. http://ilit.umbc.edu/

Page 4: Notes on mining social media updated

Social Media

• Social Media has radically changed the way we communicate and share information both within and outside our social networks

Page 5: Notes on mining social media updated

Folksonomies

• Free-form tags (also known as folksonomies)

Page 6: Notes on mining social media updated

“Social Graph”

• The social graph can be described as the sum of all declared social relationships across the participants in a given network
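
As a rough illustration of this definition, the sketch below stores declared relationships as an undirected adjacency structure. The names (`build_social_graph`, the sample people) are made up for the example and do not come from the thesis; treating every relationship as symmetric is also an assumption, since some networks declare one-way links.

```python
from collections import defaultdict

def build_social_graph(relationships):
    """Build an undirected graph from declared (person, person) pairs.

    Assumption: every declared relationship is symmetric.
    """
    graph = defaultdict(set)
    for a, b in relationships:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical declared relationships in a small network.
declared = [("ann", "bob"), ("bob", "cho"), ("ann", "cho"), ("cho", "dee")]
graph = build_social_graph(declared)

# Degree = number of declared relationships per participant.
degree = {person: len(friends) for person, friends in graph.items()}
```

The "sum of all declared relationships" is then simply the full edge set of this structure.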

Page 7: Notes on mining social media updated

“User-generated content”

• Content produced in social media is often referred to as “user-generated content”

Page 8: Notes on mining social media updated

(Criticism: Lack of Reference)

• User-generated content is said to contribute about five times as much content to the Web today as edited content

• (No reference was given for this claim!)

Page 9: Notes on mining social media updated

Motivating Thesis Question

• The motivating question that has guided this thesis is the following: “How can we analyze the structure and content of social media data to understand the nature of online communication and collaboration in social applications?”

Page 10: Notes on mining social media updated

Thesis Statement

• It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, and content

• This thesis is based on two key observations…

• Understanding communication in social media requires identifying and modeling communities

• Communities are a result of collective, social interactions and usage

Page 11: Notes on mining social media updated

Semantic Web language OWL

• Why OWL? An acronym of Web Ontology Language

• The Semantic Web is a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web. The Semantic Web will build on XML's ability to define customized tagging schemes and RDF's flexible approach to representing data. The first level above RDF required for the Semantic Web is an ontology language that can formally describe the meaning of terminology used in Web documents. If machines are expected to perform useful reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema. The OWL Use Cases and Requirements Document provides more details on ontologies, motivates the need for a Web Ontology Language in terms of six use cases, and formulates design goals, requirements and objectives for OWL.

• http://www.w3.org/TR/2004/REC-owl-features-20040210/#s1.2

Page 12: Notes on mining social media updated

Thesis Focus on Blogs & Wikis

• We soon realized that processing blogs and social media data required new techniques to be developed

• The problem of spam blogs in social media

• Blogs empower users with a channel to freely express themselves

• The open, unrestricted format of blogs means that the user is now able to express themselves and freely air opinions

Page 13: Notes on mining social media updated

Open Retrieval/Access

• Opinion retrieval is thus an important application of social media analysis

Page 14: Notes on mining social media updated

TREC Conference Blog Track

• http://trec.nist.gov/

Page 15: Notes on mining social media updated

TREC Tracks

• A TREC workshop consists of a set of tracks, areas of focus in which particular retrieval tasks are defined. The tracks serve several purposes. First, tracks act as incubators for new research areas: the first running of a track often defines what the problem really is, and a track creates the necessary infrastructure (test collections, evaluation methodology, etc.) to support research on its task. The tracks also demonstrate the robustness of core retrieval technology in that the same techniques are frequently appropriate for a variety of tasks. Finally, the tracks make TREC attractive to a broader community by providing tasks that match the research interests of more groups.

Page 16: Notes on mining social media updated

TREC Tracks

• Each track has a mailing list. The primary purpose of the mailing list is to discuss the details of the track's tasks in the current TREC. However, a track mailing list also serves as a place to discuss general methodological issues related to the track's retrieval tasks. Further, some tracks have track-specific web pages that provide history and background material regarding the track's focus. Thus, this page lists contact information for all the TREC tracks, whether or not the track is scheduled to be run in the current TREC. TREC track mailing lists are open to all; you need not participate in TREC to join a list. Most lists do require that you become a member of the list before you can send a message to it.

• The set of tracks that will be run in a given year of TREC is determined by the TREC program committee. The committee has established a procedure for proposing new tracks.

Page 17: Notes on mining social media updated

Goal of Thesis TREC Track

• The goal of this track was to build and evaluate a retrieval system that would find blog posts that express some opinion (either positive or negative) about a given topic or query word

Page 18: Notes on mining social media updated

The BlogVox System

• The BlogVox system retrieves opinionated blog posts specified by ad hoc queries. BlogVox was developed for the 2006 TREC blog track by the University of Maryland, Baltimore County and the Johns Hopkins University Applied Physics Laboratory using a novel system to recognize legitimate posts and discriminate against spam blogs. It also processes posts to eliminate extraneous non-content, including blog-rolls, link-rolls, advertisements and sidebars. After retrieving posts relevant to a topic query, the system processes them to produce a set of independent features estimating the likelihood that a post expresses an opinion about the topic. These are combined using an SVM-based system and integrated with the relevancy score to rank the results.

• http://trec.nist.gov/pubs/trec15/papers/umbc-jhu.blog.final.pdf
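
The last step described for BlogVox (opinion features combined, then integrated with the relevancy score) can be sketched roughly as follows. This is not the BlogVox implementation: a fixed linear weighting stands in for the SVM, and the feature names, weights and the blending parameter `alpha` are all hypothetical values chosen for illustration.

```python
def opinion_score(features, weights, bias=0.0):
    """Linear stand-in for the SVM: weighted sum of opinion features."""
    return bias + sum(weights[name] * value for name, value in features.items())

def rank_posts(posts, weights, alpha=0.5):
    """Rank posts by a blend of topical relevance and opinion score.

    alpha is an assumed mixing weight, not a value from the paper.
    """
    scored = []
    for post in posts:
        opinion = opinion_score(post["features"], weights)
        combined = alpha * post["relevance"] + (1 - alpha) * opinion
        scored.append((combined, post["id"]))
    return [post_id for _, post_id in sorted(scored, reverse=True)]

# Hypothetical feature weights and candidate posts.
weights = {"sentiment_word_ratio": 0.7, "first_person_ratio": 0.3}
posts = [
    {"id": "p1", "relevance": 0.9,
     "features": {"sentiment_word_ratio": 0.1, "first_person_ratio": 0.0}},
    {"id": "p2", "relevance": 0.7,
     "features": {"sentiment_word_ratio": 0.8, "first_person_ratio": 0.5}},
    {"id": "p3", "relevance": 0.2,
     "features": {"sentiment_word_ratio": 0.9, "first_person_ratio": 0.9}},
]
ranking = rank_posts(posts, weights)
```

In this toy data the most relevant post (`p1`) drops to the bottom because it carries almost no opinion signal, which is the behaviour an opinion retrieval task asks for.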

Page 19: Notes on mining social media updated

The BlogVox System

• BlogVox has resulted in the development of novel techniques for identifying trust and influence in online social media systems

Page 20: Notes on mining social media updated

General Content

• This dissertation is dedicated to social media content analysis and outlines both the semantic analysis system and the opinion retrieval system

Page 21: Notes on mining social media updated

Microblogging

• The activity of posting regular updates on a microblog

• A variety of microblogging sites have sprung up

• http://en.wiktionary.org/wiki/microblogging

Page 22: Notes on mining social media updated

(Is this claim accurate?)

• This is the first study in the literature that has analyzed the microblogging phenomenon.

• (Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, New York, NY, USA, 2007. ACM.)

Page 23: Notes on mining social media updated

Social Graphs & Algorithms

• The thesis presents how to use the special structure of social media and the nature of social graphs to develop efficient algorithms for community detection.

• http://en.wikipedia.org/wiki/Social_graph

• http://en.wikipedia.org/wiki/Algorithm

Page 24: Notes on mining social media updated

SVD or Matrix Factorization Methods

• http://en.wikipedia.org/wiki/Singular_value_decomposition

• http://en.wikipedia.org/wiki/Non-negative_matrix_factorization

Page 25: Notes on mining social media updated

Social Media Tags & Graphs

• One important property of social media datasets is the availability of tags. Tags or folksonomies, as they are typically called, are free-form descriptive terms that are associated with any resource. Lately, folksonomies have become an extremely popular means to organize and share information. Tags can be used for videos, photos or URLs. While structural analysis is the most widely used method for community detection, the rich meta-data available via tags can provide additional information that helps group related nodes together. However, techniques that combine tag information (or more generally content) with the structural analysis typically tend to be complicated. We present a simple, yet effective method that combines the metadata provided by tags with structural information from the graphs to identify communities in social media. The main contribution of this technique is a simplified and intuitive approach to combining tags and graphs.
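
These notes do not record how the thesis actually combines the two signals, so the following is only a generic sketch of the idea: add a tag-overlap term (Jaccard similarity) to a link-based term to get a pairwise affinity that a clustering step could then use. The function names, the weight `beta` and the sample data are assumptions for illustration, not the author's method.

```python
def jaccard(a, b):
    """Tag overlap between two tag sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def combined_affinity(links, tags, beta=0.5):
    """Affinity = 1 if two nodes are linked (else 0) + beta * tag overlap.

    beta is an assumed trade-off between structure and tags.
    """
    nodes = sorted(tags)
    linked = {frozenset(edge) for edge in links}
    affinity = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            structural = 1.0 if frozenset((u, v)) in linked else 0.0
            affinity[(u, v)] = structural + beta * jaccard(tags[u], tags[v])
    return affinity

# Hypothetical blogs: a chain of links a-b-c-d with two tag themes.
links = [("a", "b"), ("b", "c"), ("c", "d")]
tags = {
    "a": {"python", "web"},
    "b": {"python"},
    "c": {"cooking"},
    "d": {"cooking", "travel"},
}
affinity = combined_affinity(links, tags)
```

All three links look identical structurally, but the tag term makes the `b`-`c` link the weakest one, which suggests where a community boundary would fall.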

Page 26: Notes on mining social media updated

General Content

• This thesis outlines the structural analysis of social graphs

• Focuses on the (social media) user perspective by analyzing feed subscriptions across a large population of users

• Analyzes the subscription patterns of over eighty-three thousand publicly listed Bloglines users

• http://bloglines.com/

Page 27: Notes on mining social media updated

(Criticism: Lack of Reference)

• According to some estimates, “the size of the Blogosphere continues to double every six months” and there are over seventy million blogs (with many that are actively posting)

Page 28: Notes on mining social media updated

Few Feeds

• However, our studies indicate that of all these blogs and feeds, the ones that really matter are relatively few. What blogs and feeds these users subscribe to and how they organize their subscriptions revealed interesting properties and characteristics of the way we consume information. For instance, most users have relatively few feeds in their subscriptions, indicating an inherent limit to the amount of attention that can be devoted to different channels.

Page 29: Notes on mining social media updated

User-Defined Folder Names

• Many users organize their feeds under user-defined folder names. Aggregated across a large number of users, these folder names are good indicators of the topics (or categories) associated with each blog. The study uses this collective intelligence to measure a readership-based influence of each feed for a given topic.
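
A minimal sketch of this aggregation, assuming subscription data is available as (user, folder name, feed) triples: folder names are normalized into topics and each feed's influence for a topic is taken to be the number of distinct users who filed it there. The data layout, the normalization and the names are assumptions; the thesis may weight or merge folder names differently.

```python
from collections import Counter, defaultdict

def topic_readership(subscriptions):
    """Count, per topic, how many distinct users filed each feed under it.

    subscriptions: iterable of (user, folder_name, feed) triples (assumed layout).
    """
    seen = set()
    counts = defaultdict(Counter)
    for user, folder, feed in subscriptions:
        topic = folder.strip().lower()  # crude normalization of folder names
        key = (user, topic, feed)
        if key in seen:  # count each user at most once per topic and feed
            continue
        seen.add(key)
        counts[topic][feed] += 1
    return counts

# Hypothetical subscriptions.
subs = [
    ("u1", "Politics", "feedA"),
    ("u2", "politics", "feedA"),
    ("u3", "Politics ", "feedB"),
    ("u1", "Tech", "feedC"),
    ("u2", "tech", "feedC"),
    ("u3", "Tech", "feedC"),
]
counts = topic_readership(subs)
top_politics = counts["politics"].most_common(1)
```

Ranking feeds by these per-topic counts gives a simple readership-based answer to "which feeds matter for topic X".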

Page 30: Notes on mining social media updated

Feed Distillation Task

• The task of identifying the most relevant feed for a given topic or query term is now known as the “feed distillation task” in the literature

Page 31: Notes on mining social media updated

Thesis Contributions

• Following are the main contributions of this thesis:

• We provide a systematic study of the social media landscape by analyzing the content, structure and special properties.

• Developed and evaluated innovative approaches for community detection.
  – We present a new algorithm for finding communities in social datasets.
  – SimCut, a novel algorithm for combining structural and semantic information.

• First to comprehensively analyze two important social media forms
  – We analyze the subscription patterns of a large collection of blog subscribers. The insights gained in this study were critical in developing a blog categorization system and a recommendation system, and provide a basis for further, recent studies on feed subscription patterns.
  – We analyze the microblogging phenomena and develop a taxonomy of user intentions and types of communities present in this setting.

• Finally we have built systems, infrastructure and datasets for the social media research community.

Page 32: Notes on mining social media updated

The Social Web (Web 2.0)

• The World Wide Web today has become increasingly social

Page 33: Notes on mining social media updated

(Reference Cited)

Here Comes Everybody: The Power of Organizing Without Organizations by Clay Shirky

http://tiny.cc/pfvc9 (Book Overview) http://tiny.cc/ykqxq (YouTube)

Page 34: Notes on mining social media updated

(Criticism: Lack of Reference)

• Content on the Web today

• According to recent estimates, while edited content such as CNN or Reuters news reports amounts to about 2G per day, user-generated content produced today is four to five times as much

Page 35: Notes on mining social media updated

So, what makes the Web “social”?

• Web 1.0: most websites and homepages that exist are a one-way communication medium

• Web 2.0: blogs and social media sites changed this by adding functionality to comment and interact with the content – be it blogs, music, videos or photos

Page 36: Notes on mining social media updated

The Blogosphere

• There are a number of studies that have specifically analyzed its structure and content

• Blogging provides a channel to express opinions, facts and thoughts

• Through these pieces of information, also known as memes, bloggers influence each other and engage in conversations that ultimately lead to exchange of ideas and spread of information

Page 37: Notes on mining social media updated

The Blogosphere

• By analyzing the graphs generated through such interactions, we can answer several questions about the structure of the blogosphere:

• Community structure
• Spread of influence
• Opinion detection
• Formation, friendship networks
• Information cascades

Page 38: Notes on mining social media updated

(Criticism: Lack of Reference)

• As of 2006 there were over 52 million blogs and presently there are in excess of 70 million blogs

• The number of blogs is doubling every six months and a large fraction of these blogs are active

Page 39: Notes on mining social media updated

Blogs

• Blogs are estimated to enjoy a significant readership: according to a recent report by Forrester Research, one in four Americans reads blogs, and a large fraction of users also participate by commenting

• Blogs are typically published through blog hosting sites or tools like WordPress that can be self-hosted

• Blogs can be subscribed to by RSS (Really Simple Syndication) feeds

Page 40: Notes on mining social media updated

(Reference Cited)

Click: What Millions of People are Doing Online and Why It Matters by Bill Tancer

http://www.billtancer.com/

Page 41: Notes on mining social media updated

Social Networking Sites

• In a recent study of Facebook users, Dr. Zeynep Tufekci concluded that Facebook users are very open about their personal information

• A surprisingly large fraction openly disclose their real names, phone numbers and other personal information

• (Zeynep Tufekci. Can you see me now? audience and disclosure regulation in online social network sites, 2008)

• (Zeynep Tufekci. Grooming, gossip, facebook and myspace: What can we learn about social networking sites from non-users. In Information, Communication and Society, volume 11, pages 544–564, 2008)

Page 42: Notes on mining social media updated

(Reference Cited)

Snoop: What Your Stuff Says About You by Dr. Sam Gosling

http://snoopology.com/

Page 43: Notes on mining social media updated

Social Networking Sites

• Dr. Sam Gosling talks about how personal spaces like bedrooms, office desks and even Facebook profiles reveal a whole lot about the real self

• The research indicates that, using just the information from a Facebook profile page, people can accurately score the profile owner's openness, conscientiousness, extraversion, agreeableness, and neuroticism (also known as the five-factor model in psychology)

Page 44: Notes on mining social media updated

Tagging & Folksonomies

• The term folksonomy is derived from folk and taxonomy and is attributed to Thomas Vander Wal

• http://www.vanderwal.net/

Page 45: Notes on mining social media updated

(Reference Cited)

• Heymann et al. inquire about the effectiveness of tagging and applications of social bookmarking in Web search

• (Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can social bookmarking improve web search? In WSDM ’08: Proceedings of the international conference on Web search and web data mining, pages 195–206, New York, NY, USA, 2008. ACM)

Page 46: Notes on mining social media updated

(Reference Cited)

• Brooks and Montanez have also studied the phenomenon of user-generated tags and evaluated the effectiveness of tagging

• (Christopher H. Brooks and Nancy Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW, 2006)

Page 47: Notes on mining social media updated

Tagging & Folksonomies

• Studies have also shown that tagging can explain user behavior

Page 48: Notes on mining social media updated

Tagging & Folksonomies

• Cattuto et al. model users as simple agents that tag documents with a frequency-bias and have the notion of memory, such that they are less likely to use older tags

• (Ciro Cattuto, Vittorio Loreto, and Luciano Pietronero. Collaborative tagging and semiotic dynamics. CoRR, abs/cs/0605015, 2006)
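
The agent model described above can be sketched as a small simulation. The version below is a simplification written for these notes, not the authors' code: with probability `p_new` the agent invents a new tag, otherwise it copies a tag from its own history, where the tag used x steps ago is weighted by 1 / (x + tau) so that older tags are less likely to be reused. The parameter values are arbitrary assumptions.

```python
import random
from collections import Counter

def simulate_tag_stream(steps, p_new=0.1, tau=5.0, seed=0):
    """Simulate a tag stream with frequency bias and fading memory."""
    rng = random.Random(seed)
    stream = ["tag0"]
    next_id = 1
    for _ in range(steps - 1):
        if rng.random() < p_new:
            # Invent a brand-new tag.
            stream.append(f"tag{next_id}")
            next_id += 1
        else:
            # Copy a past tag; position i was used (n - i) steps ago.
            # Frequent tags occupy more positions, so they are picked more often.
            n = len(stream)
            weights = [1.0 / ((n - i) + tau) for i in range(n)]
            stream.append(rng.choices(stream, weights=weights, k=1)[0])
    return stream

stream = simulate_tag_stream(200)
tag_frequencies = Counter(stream)
```

Inspecting `tag_frequencies.most_common()` for longer runs is one way to see the skewed, heavy-tailed tag usage that this kind of model is meant to reproduce.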

Page 49: Notes on mining social media updated

Tagging & Folksonomies

• AutoTagging is a collaborative filtering-based recommendation system for suggesting appropriate tags

Page 50: Notes on mining social media updated

Tagging & Folksonomies

• TagAssist is a system that recommends tags related to a given blog post

Page 51: Notes on mining social media updated

Growth of the Blogosphere

• Ravi Kumar et al. have studied the evolution of the blog graph and find that the size of the blogosphere grew drastically in 2001

• But only a small percentage of blogs have the most in-links

• (Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the bursty evolution of blogspace. In WWW, pages 568–576, 2003)

Page 52: Notes on mining social media updated

“Forest Fire” Model

• Leskovec et al. present the “Forest Fire” model to explain the growth and evolution of dynamic social network graphs

• (Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, New York, NY, USA, 2005. ACM)

• Two empirical observations motivate this model:
1. Densification: out-degree increases over time as the networks evolve
2. “Shrinking diameter”: the diameter of the network decreases over time

• The “Forest Fire” model tries to mimic the way information spreads in networks
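
To make the spreading intuition concrete, here is a heavily simplified, undirected variant of a forest-fire style growth process. It is a sketch under stated assumptions, not the model as published: each new node picks one random existing node (an "ambassador"), the fire then spreads to each unburned neighbor independently with probability `p_burn`, and the new node links to everything that burned. The published model uses directed edges and separate forward and backward burning parameters.

```python
import random

def forest_fire_graph(n, p_burn=0.3, seed=0):
    """Grow an undirected graph with a simplified forest-fire process."""
    rng = random.Random(seed)
    neighbors = {0: set()}
    for new in range(1, n):
        ambassador = rng.randrange(new)  # uniformly chosen existing node
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            node = frontier.pop()
            for nb in sorted(neighbors[node]):
                # The fire spreads along each edge with probability p_burn.
                if nb not in burned and rng.random() < p_burn:
                    burned.add(nb)
                    frontier.append(nb)
        # The new node links to every node the fire reached.
        neighbors[new] = set()
        for target in burned:
            neighbors[new].add(target)
            neighbors[target].add(new)
    return neighbors

def edge_count(graph):
    """Number of undirected edges."""
    return sum(len(nbs) for nbs in graph.values()) // 2

graph = forest_fire_graph(200, p_burn=0.3)
average_degree = 2 * edge_count(graph) / len(graph)
```

With `p_burn=0` the process yields a tree; raising `p_burn` lets later nodes attach to many earlier ones, which is the mechanism behind rising degree and short paths.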

Page 53: Notes on mining social media updated

Information Cascades

• The forest fire model was also shown to describe information cascades in blog graphs

• Information cascades are a chain of links from one blog to another that describe a conversation

Page 54: Notes on mining social media updated

Behavioral Model

• The blogger is treated as both a reader and a writer

Page 55: Notes on mining social media updated

80/20 Distribution

• Herring et al. performed an empirical study of the interconnectivity of a sample of blogs and found that conversations on the blogosphere are sporadic, highlighting the importance of the ‘A-list’ bloggers and their roles in conversations

• (Susan C. Herring, Inna Kouper, John C. Paolillo, Lois Ann Scheidt, Michael Tyworth, Peter Welsch, Elijah Wright, and Ning Yu. Conversations in the blogosphere: An analysis “from the bottom up”. In HICSS ’05: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05) - Track 4, page 107.2, Washington, DC, USA, 2005. IEEE Computer Society)

• A-list bloggers are those that enjoy a high degree of influence in the blogosphere

• These are the blogs that correspond to the head of the long tail (or power-law) distribution of the blogosphere

• These constitute a small fraction of all the blogs that receive the most attention or links
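
One quick way to check for this kind of head-heavy distribution is to measure what share of all inlinks goes to the top fraction of blogs. The sketch below assumes inlink counts are already available as a list; the numbers are invented for illustration and are not measurements from the study.

```python
def top_share(inlink_counts, fraction=0.2):
    """Share of all inlinks received by the top `fraction` of blogs."""
    ranked = sorted(inlink_counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of the "head"
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical inlink counts for ten blogs.
inlinks = [500, 300, 40, 30, 25, 20, 15, 10, 5, 5]
share = top_share(inlinks, fraction=0.2)
```

Here the top two of ten blogs receive 800 of the 950 inlinks, roughly 84%, which is the pattern the 80/20 heading refers to.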

Page 56: Notes on mining social media updated

Feed Distillation

• Technorati lists the top 100 blogs on the blogosphere

• These lists, while serving a generic ranking purpose, do not indicate the most popular blogs in different categories

• This task was explored by Java et al. to identify the “Feeds that Matter”

• (Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, and Tim Oates. Feeds That Matter: A Study of Bloglines Subscriptions. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007). Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2007)

• The TREC 2007 blog track defines a new task called the feed distillation task

• Feed distillation, as defined in TREC 2007, is the task of identifying blogs with recurrent interest in a given topic

Page 57: Notes on mining social media updated

Thesis Context

• Presents two techniques for community analysis

• Most of the existing approaches to community detection are based on link analysis and ignore the folksonomy meta-data that is easily available in social media

• Presents a novel method to combine the link analysis for community detection with information available in tags and folksonomies, yielding more accurate communities

Page 58: Notes on mining social media updated

Influence & Trust

• Influence on the Web is often a function of topic

• Measures of a blog’s authority are mostly based on the number of inlinks

• This can sometimes be slightly misleading since a single post from a popular blogger on any topic may make it the top-most blog for that topic, even if the blog has little to do with the given subject

Page 59: Notes on mining social media updated

Discussion

• The broader impact of this work is to understand online, human communications and study how various elements of social media tools and platforms facilitate this goal

• The study spans a period of three years and is a snapshot into the World Wide Web’s changing landscape

• Sees the emergence of social media and its mainstream adoption as a key factor that has brought about a substantial change in how we interact with each other

• The study found that blogs are an important component of social media

• The goal has been to understand social behavior through the Web

Page 60: Notes on mining social media updated

Discussion

• The approach of the study takes a simplistic view of a community

• Defines a community as a set of nodes that have more links to each other than to the rest of the network
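
This definition can be checked directly on a small graph by counting links inside a candidate set against links that cross its boundary. The helper and the sample edges below are illustrative assumptions; they test the definition on a given set and are not the community detection algorithm from the thesis.

```python
def link_counts(edges, community):
    """Return (internal, external) link counts for a candidate community."""
    members = set(community)
    internal = external = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1   # both endpoints inside the set
        elif inside == 1:
            external += 1   # link crosses the boundary
    return internal, external

def is_community(edges, community):
    """True if the set has more internal links than boundary links."""
    internal, external = link_counts(edges, community)
    return internal > external

# Hypothetical graph: two triangles joined by a single link c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
counts = link_counts(edges, {"a", "b", "c"})
```

For the set {a, b, c} there are three internal links and one boundary link, so it qualifies, whereas a set such as {c, d} that straddles the bridge does not.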

Page 61: Notes on mining social media updated

Further Research

• Discovering partial membership and multi-dimensional communities is a challenging problem and something worth investigating further

Page 62: Notes on mining social media updated

Blog Search Implications

• As social media content becomes even more pervasive, more Web search engine queries also return a number of blog posts within their results

• It is an open question as to how this affects Web search ranking

