Social Tagging
Kristina Lerman USC Information Sciences Institute
Thanks to Anon Plangprasopchok for providing material for this lecture.
Social Web
Social Web is a platform for people to create, organize, and share information
Create Information
• People create content (resources)
  • Text posts: blogs, Twitter, …
  • Images: Flickr, Picasa, …
  • Videos: YouTube, Vimeo, …
  • News stories: Digg, Reddit, Slashdot, …
  • Bookmarks: Delicious, CiteULike, Bibsonomy, …
  • Personal profiles: Facebook, MySpace, …
  • Maps: OpenStreetMap, …
  • Locations: Foursquare, …
Organize Information
• People organize resources
• Annotate with metadata
  • tags: descriptive labels
  • geotags: geographic coordinates
• Add to folders: organize content within personal hierarchies
  • E.g., sets and collections on Flickr
• Other types of metadata may include
  • Discussions, comments, reviews
  • Ratings, votes, …
• Social tagging is the most popular form of annotation
Social Tagging: Delicious
[Screenshot of a Delicious bookmark, showing the content (webpage), the user, and the tags]
Social Tagging: Flickr

[Screenshot of a Flickr photo page for a Rainbow bee-eater (Merops ornatus), Mackay Gardens, Queensland, Australia, showing its tags, submitter, public groups (pools such as Birds, Australian Birds, Birds of Queensland), discussion, and private albums (sets)]
Share Information
• People share resources
• Social networks: broadcast to social connections
  • Friends on Facebook, …
  • Fans/Followers on Twitter, Digg, …
• Group affiliations
• Hotlists: emerge from collective activity
  • E.g., Digg front page, Flickr Explore, Flickr Trends, …
Social Networks: Facebook
Social Networks: Flickr
Harvesting Knowledge from Social Tagging
[Diagram: users, tags, and resources linked by annotations – a tagged web page on Delicious, a tagged photo on Flickr]
• RR graph: PageRank
• UU graph: social network analysis
• RUT hypergraph: harvesting knowledge from social tagging
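The RUT hypergraph can be thought of as a set of (resource, user, tag) triples, and the RR and UU graphs as projections of it. A minimal sketch (the triples below are made up) of building the resource–resource projection, linking two resources when the same user bookmarked both:

```python
from collections import defaultdict
from itertools import combinations

# Each annotation is a (resource, user, tag) triple, i.e. one
# hyperedge in the resource-user-tag (RUT) hypergraph.
triples = [
    ("geocoder.us", "alice", "geocoding"),
    ("geocoder.us", "bob", "maps"),
    ("wunderground.com", "alice", "weather"),
    ("wunderground.com", "carol", "weather"),
]

# Project onto the resource-resource (RR) graph: two resources are
# linked if at least one user bookmarked both of them.
user_resources = defaultdict(set)
for resource, user, tag in triples:
    user_resources[user].add(resource)

rr_edges = set()
for resources in user_resources.values():
    for a, b in combinations(sorted(resources), 2):
        rr_edges.add((a, b))

print(rr_edges)  # alice links geocoder.us and wunderground.com
```

The user–user (UU) projection is symmetric: link two users who bookmarked the same resource.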
Overview
Harvesting knowledge from social tagging
• “Structure of Collaborative Tagging Systems”
  • Statistics of tagging activity
  • Consensus about the meaning of a document quickly emerges from the opinions of many users
• “Exploiting Social Annotation for Automatic Resource Discovery”
  • Learn hidden topics in a collection of tagged documents
  • Use hidden topics to find relevant documents
Social Tagging
• Tags are labels attached to content
  • Chosen from an uncontrolled personal vocabulary
  • Help users browse, filter, and search information more efficiently
• Collaborative/social tagging
  • Anyone can attach labels to resources (not only experts or producers of content)
  • Collectively, tags represent a semantic annotation of a resource (an alternative to the Semantic Web)
Tagging and Taxonomies
• Taxonomy – hierarchical, exclusive organization of objects
  • Linnaean classification: felidae → panthera → tiger; felidae → felis → cat
• File system: articles about cats in Africa
  • c:\articles\cats
  • c:\articles\africa
  • c:\articles\africa\cats
  • c:\articles\cats\africa
  • Must search multiple folders to find all relevant content
• Tagging – non-hierarchical, inclusive organization of objects
  • Articles tagged ‘cats’, ‘africa’
  • Query: ‘cats’ AND ‘africa’
  • But, will not find articles tagged with ‘cheetah’
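The contrast above can be sketched in a few lines: tag-based retrieval is a set-containment query over each item's tag set, and the ‘cheetah’ caveat falls out of it. The article names and tags below are illustrative.

```python
# Each article carries an unordered set of tags; a query is a
# conjunction of tags (the tags and articles here are made up).
articles = {
    "article1": {"cats", "africa"},
    "article2": {"cats"},
    "article3": {"cheetah", "africa"},
}

def search(query_tags):
    """Return articles whose tag sets contain every query tag."""
    return sorted(name for name, tags in articles.items()
                  if query_tags <= tags)

# 'cats' AND 'africa' finds article1 in one query (no folder walk),
# but misses article3, which is tagged with the more specific 'cheetah'.
print(search({"cats", "africa"}))
```

Unlike a folder hierarchy, no single path such as `c:\articles\cats\africa` has to be chosen in advance; but synonymy and specificity (cheetah vs cats) remain unhandled.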
Kinds of Tags
• What it is about (topic) – identify who or what the document is about: ‘cat’, ‘africa’
• What it is – what kind of thing it is: ‘article’, ‘blog’, ‘book’
• Who owns it – who owns/created the content: ‘nikographer’
• Refining categories – tags that refine or qualify other categories, especially numbers
• Qualities or characteristics – express an opinion: ‘funny’, ‘interesting’
• Self-reference – ‘mystuff’
• Task organizing – ‘toread’, ‘jobsearch’
Social Tagging Dimensions
• Tagging rights: who can tag?
  • Self-tagging – only the resource owner (blog posts; Flickr by convention)
  • Free-for-all – anyone can tag a resource (Delicious)
• Consolidation: assisted tag generation?
  • Blind tagging – user enters tags independently of other users
  • Suggestive tagging – system suggests tags based on annotations of other users
• Resource type
  • Text – Web pages, blog posts, bibliographic material, …
  • Multimedia – images, videos, …
• Source of content
  • User-owned – e.g., images on Flickr
  • Scavenged from the Web – e.g., Delicious
• Connectivity: links between users
  • Reciprocity – undirected links (Facebook) vs directed links (Flickr, Delicious)
  • Link type – friend vs contact relationship (on Flickr) indicates degree of trust
User Motivations
What are users’ motivations to tag?
• Organizational
  • Mark items for future personal retrieval
• Social
  • Mark items for others to find, e.g., concert photos on Flickr
    • Can result in spamming
  • Express an opinion, e.g., ‘funny’ tag on a video
• Collective value emerges from the tagging decisions of individual users
  • How can users be incentivized to contribute high-quality annotations?
Social Tagging on del.icio.us
• Social bookmarking site del.icio.us
  • Users can tag any Web page (URL)
  • Delicious suggests tags based on existing tags for the URL
  • Delicious aggregates popular tags
  • Anyone can see the bookmarks of others
  • Users can create social links
• Value of social tagging
  • Users bookmark for their own benefit
    • Organization
    • Retrieval
  • A useful public good emerges
    • Tag suggestions
    • Lists of popular URLs and tags (hotlists)
Tagging on del.icio.us
[Screenshot of a del.icio.us bookmark, showing the content (webpage), the user, and the tags]
Dynamics of del.icio.us
• Delicious dynamics [Golder & Huberman]
  • User activity
  • Tag vocabulary growth
• Datasets
  • Bookmarks collected over 4 days in June 2005
  • Sample of users who posted bookmarks in this period
Dynamics of User Interests
• Tags reflect how a user’s interests and knowledge change over time
• Tag1 and Tag2 are consistent interests of the user
• Tag3 is a new interest
  • Or a new way to differentiate between concepts/interests
[Plot: number of times each tag (tag1, tag2, tag3) has been used, by bookmark, for one user]
Stable Patterns in Tagging
• Consider a single URL as it is tagged by more users
  • Each tag’s proportion represents the combined description of the URL by many users
  • After ~100 bookmarks, the relative frequency of each tag is fixed
[Plot: tag proportion (wrt all tags) vs number of bookmarks for a URL]
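The quantity plotted is simple to compute from bookmark data. A minimal sketch with a synthetic tag stream (the tags and counts are invented, not from the Golder & Huberman dataset):

```python
from collections import Counter

# Synthetic stream of tags applied to one URL by successive users.
# In the Golder & Huberman data, each tag's share of all tags
# stabilizes after roughly 100 bookmarks.
tag_stream = (["python", "programming"] * 60 +
              ["python", "tutorial"] * 30)

def tag_proportions(tags):
    """Each tag's relative frequency with respect to all tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Compare an early prefix of the stream with the full stream to see
# how a tag's proportion evolves as bookmarks accumulate.
early = tag_proportions(tag_stream[:20])
late = tag_proportions(tag_stream)
print(early)
print(late)
```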
Findings
• Consensus about a URL’s topics
  • Emerges quickly – after ~100 users bookmark it
  • URLs do not have to become popular for tags to be useful
  • Minority opinions can stably coexist with popular ones
  • Can be used to categorize/organize URLs
• Reasons for consensus
  • Imitation – users imitate the tag selection of others
    • But stable patterns also exist for less common tags (not shown to users)
  • Shared knowledge
    • Can we learn it?
Learning from Social Tagging/Annotation
Goal: Learn concepts from social annotations created by many users
• Annotations by an individual user may be inaccurate and incomplete…
• Annotations from many different users may complement each other, making them meaningful in aggregate
[Example: two Flickr photos tagged “Jaguar” – one an animal, one a car (by A lion Rohrs and sparky2000). Do the tags denote the same concept?]
Learning Concepts from Tags
Goal of Learning Algorithm
Group semantically related tags and resources: a group ≈ a concept

[Diagram: tags and resources clustered into concepts such as “Animal”, “Car”, “Flower”]
Challenges in Learning from Annotations
• Sparse data – 4-7 tags per bookmark; 3.74 tags per photo [Rattenbury07+]
• Ambiguity – jaguar: car vs. animal
• Polysemy – window: hole in a wall vs. glass pane that fills it
• Synonymy – kid vs. child
• Disagreement – cats\africa vs. africa\cats
• Different levels of specificity – dog vs. beagle
• Multiple facets – a bird tagged by appearance, location, scientific/colloquial name
Document Modeling Approaches
• ‘Bag-of-words’ – tf-idf
  • Document as a vector of word frequencies
  • Small reduction in document description length
  • Does not handle synonymy and polysemy
• Latent semantic indexing – LSI
  • Identifies a subspace of tf-idf features that captures most of the variance in a corpus
  • Reduction in document description length (# principal components)
  • Handles polysemy and synonymy
• Topic modeling – pLSI, LDA
  • Documents as random mixtures over (hidden) topics, where each topic is a distribution over words
  • Large reduction in description length (# topics)
  • Inference: given a document corpus, estimate the parameters of the model, then compute the distribution of hidden topics for a given document
[Diagram: generative process – a document (r) mixes possible topics (z); each topic generates words (t) from the possible words]

pLSI (Hofmann99); LDA (Blei03+)
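This generative view can be run in a few lines. A minimal sketch using scikit-learn's `LatentDirichletAllocation` on made-up tag "documents" (the tags and the 2-topic setting are illustrative; the lecture's Delicious corpora use 80 topics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each resource's aggregated tags form one "document"; these strings
# are invented examples in the spirit of the maps/video topics above.
tag_docs = [
    "maps geography directions atlas maps world",
    "maps earth latitude longitude geography",
    "video youtube movies media download",
    "video torrent p2p movies media",
]

# Build the tag-count matrix (bag of tags).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tag_docs)

# Fit LDA; fit_transform returns each document's distribution
# over the hidden topics (rows sum to 1).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(doc_topics.round(2))
```

Inspecting `lda.components_` (per-topic word weights) recovers the high-probability tags for each learned topic.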
A Stochastic Process of Word Generation

Learned topics – high-probability words in each topic:
• travel, flights, airline, flight, airlines, guide, aviation, …
• map, maps, world, earth, latitude, longitude, directions, address, geography, distance, zip, usa, gmaps, atlas, …
• video, download, bittorrent, p2p, youtube, media, torrent, torrents, movies, …
Apply LDA to Tagging
[Diagram: each resource is a document and its tags are the words; LDA groups tags into topics such as “Animal”, “Car”, “Flower”]
Application to Resource Discovery
• Resource discovery
  • Given a seed source, find other data sources that provide the same functionality
  • E.g., find geocoders like http://geocoder.us, which returns the geographic coordinates of a specified US address
• Benefits
  • Increase robustness of information integration (II) applications
    • If http://geocoder.us fails, substitute another source
  • Increase coverage of II applications
    • http://geocoder.ca geocodes US AND Canadian addresses
Source Discovery and Modeling [Ambite et al, 2009]

[Diagram: pipeline of discovery → invocation & extraction → semantic typing → source modeling, applied to sources such as http://wunderground.com and unisys. Background knowledge: seed URL; sample input values (e.g., “90254”); patterns and domain types (e.g., unisys(Zip,Temp,Humidity,…)); definitions of known sources and sample values (e.g., unisys(Zip,Temp,…) :- weather(Zip,…,Temp,Hi,Lo))]
Exploiting Social Annotation for Resource Discovery
Approach: Use topic modeling of social annotation obtained from Delicious to find sources similar to a given seed URL
[Diagram: seed URL → candidate URLs with their users and tags → probabilistic learning model (e.g., LDA) learns concepts → each URL’s distribution over concepts → compute URL similarity → rank candidates by similarity to the seed]
Obtain Annotation corpus from Delicious
• Crawling strategy
  • For each seed, retrieve its 20 most popular tags
  • For each tag, retrieve the sources annotated with that tag
  • For each source, retrieve all of its tags
Corpus of Annotated Resources
• Use LDA to learn 80 topics in each corpus
• Each URL’s distribution over topics is used to compute its similarity to the seed
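Once every URL has a distribution over the learned topics, ranking reduces to comparing distributions. A minimal sketch using cosine similarity (one reasonable choice; the distributions and URLs below are invented, and the original work may use a different similarity measure):

```python
import math

# Hypothetical 3-topic distributions; the real setting uses 80 topics.
seed = [0.7, 0.2, 0.1]
candidates = {
    "geocoder.ca": [0.6, 0.3, 0.1],
    "wunderground.com": [0.1, 0.1, 0.8],
}

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Rank candidate URLs by similarity to the seed's topic distribution.
ranking = sorted(candidates,
                 key=lambda url: cosine(seed, candidates[url]),
                 reverse=True)
print(ranking)  # geocoder.ca ranks above wunderground.com
```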
Topic Modeling of Social Annotations
• Manually label the top 100 URLs ranked by similarity to the seed URL
• Compare to Google’s “find similar URLs” functionality
Source Discovery Results
Discussion
• Users express their knowledge through the tags they create while annotating content
• Apply document modeling techniques to social annotation data
  • Infer hidden topics in the annotated data
  • Use the topics for the source discovery task
  • Outperforms standard Web search
• Next – extract more complex types of knowledge from social annotations
  • Sentiment
  • Folksonomies