CIKM 2011 | Invited Talk
Model-Driven Research in Social Computing
Ed H. Chi
Google Research
Work done while at Palo Alto Research Center (PARC)
2011-10-27 CIKM 2011 Invited Talk 1
Some Google Social Stats
• 250,000 words are written each minute on Blogger - that’s 360 million words a day
• Every 16 seconds, people view enough photos from Picasa Web Albums to cover an entire football field
• Every 8 minutes, more photos are viewed on Picasa Web Albums than exist in the entire Time-LIFE photo collection
YouTube Stats
• 150 years of YouTube video are watched every day on Facebook (up 2.5x y/y)
• Every minute, 400+ tweets contain YouTube links (up 3x y/y) [Q1 2011]
• 100M+ people take a social action with YouTube (likes, shares, comments, etc.) every week (10/15/10)
Google+ Stats
• 40 million people have joined Google+ since launch.
• People are 2x-3x more likely to share content with one of their circles than to make a public post.
Social Stream Research
• Analytics
– Factors impacting retweetability [Suh et al, IEEE Social Computing 2010]
– Location field of user profiles [Hecht et al, CHI 2011]
– Organic Q&A behaviors [Paul et al, ICWSM’11]
– Languages used in Twitter [Hong et al, ICWSM’11]
• Improving Stream Experience
– Topic-based summarization & browsing of tweets [Bernstein et al, UIST2010]
– Tweet recommendation [Chen et al, CHI2010 & CHI2011]
Invisible Brokerage Signals across Language Barriers
Joint work w/ Lichan Hong, Gregorio Convertino [Hong et al., ICWSM July 2011]
Motivation for Studying Languages
• Twitter is an international phenomenon
– Most research focused on English users
– Question about generalization to non-English
– Understand cross-language usage differences
– Design implications for international users
• Research Questions:
– What is the language distribution in Twitter?
– How do users of different languages use Twitter?
– How do bilingual users spread information across languages?
Data Collection & Processing
Twitter stream
04/18/10-05/16/10 (4 weeks)
62M tweets
Google Language API & LingPipe
104 languages
Top 10 languages
Top 10 Languages in Twitter
Language Tweets % Users
English 31,952,964 51.1 5,282,657
Japanese 11,975,429 19.1 1,335,074
Portuguese 5,993,584 9.6 993,083
Indonesian 3,483,842 5.6 338,116
Spanish 2,931,025 4.7 706,522
Dutch 883,942 1.4 247,529
Korean 754,189 1.2 116,506
French 603,706 1.0 261,481
German 588,409 1.0 192,477
Malay 559,381 0.9 180,147
Human-Coding Study
• 2,000 random tweets from 62M tweets
• 2 human judges for each of the top 10 languages
– native speakers or proficient
– discuss to resolve disagreement
• Hard to find Indonesian & Malay judges
• Presented 2,000 tweets to each judge
• Judge selected tweets in his/her language
Machine vs. Human
Language T-P T-N F-N F-P Cohen’s Kappa
English 974 971 20 35 0.95
Japanese 370 1,595 0 35 0.94
Portuguese 170 1,803 19 8 0.92
Indonesian 106 1,875 15 4 0.91
Spanish 96 1,889 11 4 0.92
Dutch 18 1,978 2 2 0.90
Korean 24 1,976 0 0 1.00
French 13 1,980 0 7 0.79
German 12 1,979 2 7 0.72
Malay 8 1,979 4 9 0.55
T-P: true positive, T-N: true negative, F-N: false negative, F-P: false positive
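The kappa column can be recomputed from the four counts in each row. The sketch below uses the standard two-rater Cohen's kappa formula, treating the language detector and the human judges as the two raters; the function name `cohens_kappa` and the choice of rows are illustrative, not taken from the talk.

```python
def cohens_kappa(tp: int, tn: int, fn: int, fp: int) -> float:
    """Cohen's kappa for a 2x2 machine-vs-human agreement table."""
    total = tp + tn + fn + fp
    # Observed agreement: both raters say "yes" or both say "no".
    observed = (tp + tn) / total
    # Marginal counts for each rater.
    machine_yes, machine_no = tp + fp, tn + fn
    human_yes, human_no = tp + fn, tn + fp
    # Agreement expected by chance from the marginals.
    expected = (machine_yes * human_yes + machine_no * human_no) / total**2
    if expected == 1.0:
        return 1.0  # degenerate table: both raters always give the same label
    return (observed - expected) / (1.0 - expected)


# Counts copied from the Machine vs. Human table (T-P, T-N, F-N, F-P).
rows = {
    "Korean": (24, 1976, 0, 0),
    "Malay": (8, 1979, 4, 9),
}

for language, counts in rows.items():
    print(language, round(cohens_kappa(*counts), 2))
```

Kappa corrects raw agreement for chance, which matters here because most of the 2,000 tweets are true negatives for every language except English.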
Accuracy of Language Detection
• Two Types of Errors
– Got ur dirct msg.i’m lukng 4wrd 2 twt wit u too.so,wat doing ha…(detected as Afrikaans)
– High error rate for tweets of 1-2 words
Machine vs. Human
Language T-P T-N F-N F-P Cohen’s Kappa
French 13 1,980 0 7 0.79
German 12 1,979 2 7 0.72
Malay 8 1,979 4 9 0.55
• French: 5/7 F-Ps have 2 words
• German: 1/2 F-Ns has 1 word; 6/7 F-Ps are in English
• Malay: 3/4 F-Ns & 7/9 F-Ps are in Indonesian
Common Twitter Conventions
• hashtag
• URL
• mention
• reply (per-tweet metadata)
• retweet
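Four of these conventions can be spotted in the tweet text itself, while replies come from per-tweet metadata. The sketch below is a rough illustration only. The regular expressions, the `RT @user` retweet pattern, and the names `detect_conventions` and `usage_rates` are assumptions and not the rules used in the study.

```python
import re

# Simplified patterns (assumptions): real tweets have more edge cases.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")
RETWEET_RE = re.compile(r"(?<!\w)RT\s+@\w+")


def detect_conventions(text: str) -> dict:
    """Report which text-level conventions appear in one tweet."""
    # Remove URLs first so a '#' inside a link is not counted as a hashtag.
    without_urls = URL_RE.sub(" ", text)
    return {
        "url": URL_RE.search(text) is not None,
        "hashtag": HASHTAG_RE.search(without_urls) is not None,
        "mention": MENTION_RE.search(without_urls) is not None,
        "retweet": RETWEET_RE.search(without_urls) is not None,
    }


def usage_rates(tweets: list) -> dict:
    """Percentage of tweets that use each convention."""
    totals = {"url": 0, "hashtag": 0, "mention": 0, "retweet": 0}
    for text in tweets:
        for name, present in detect_conventions(text).items():
            totals[name] += present
    return {name: 100.0 * count / len(tweets) for name, count in totals.items()}


sample = [
    "Best of Show CES 2011: The Motorola Atrix http://tcrn.ch/e0g3Oh",
    "RT @alice: loving #cikm2011",  # hypothetical tweet for illustration
]
print(usage_rates(sample))
```

Running `usage_rates` per language over a labeled corpus would give per-language percentages of the kind reported in the tables that follow.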
Use of URLs in 62M Tweets
Language URLs
All 21%
English 25%
Japanese 13%
Portuguese 13%
Indonesian 13%
Spanish 15%
Dutch 17%
Korean 17%
French 37%
German 39%
Malay 17%
• Chi-square tests confirmed that differences by language are significant.
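To see what such a test involves, the sketch below computes a Pearson chi-square statistic for URL use in German vs. Korean tweets. The counts are reconstructed from the rounded percentages and the tweet totals in the Top 10 Languages table, so they are approximations for illustration only; the function names are hypothetical, and 3.841 is the standard critical value for 1 degree of freedom at p = 0.05.

```python
def chi_square_statistic(table: list) -> float:
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if language and URL use were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def url_row(tweets: int, url_share: float) -> list:
    """[tweets with a URL, tweets without], rebuilt from a rounded share."""
    with_url = round(tweets * url_share)
    return [with_url, tweets - with_url]


# Approximate counts: tweet totals and rounded URL percentages from the slides.
table = [
    url_row(588_409, 0.39),  # German
    url_row(754_189, 0.17),  # Korean
]

CRITICAL_DF1_P05 = 3.841  # chi-square critical value, 1 degree of freedom
statistic = chi_square_statistic(table)
print(statistic > CRITICAL_DF1_P05)
```

With samples this large, even small percentage differences produce a very large statistic, so the practical size of the gap (39% vs. 17%) is the more informative number.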
Significant Cross-Language Differences
Language URLs Hashtags Mentions Replies Retweets
All 21% 11% 49% 31% 13%
English 25% 14% 47% 29% 13%
Japanese 13% 5% 43% 33% 7%
Portuguese 13% 12% 50% 32% 12%
Indonesian 13% 5% 72% 20% 39%
Spanish 15% 11% 58% 39% 14%
Dutch 17% 13% 50% 35% 11%
Korean 17% 11% 73% 59% 11%
French 37% 12% 48% 36% 9%
German 39% 18% 36% 25% 8%
Malay 17% 5% 62% 23% 29%
Chi Square tests confirmed that differences by language are significant
Implications
Language URLs Hashtags Mentions Replies Retweets
All 21% 11% 49% 31% 13%
Korean 17% 11% 73% 59% 11%
German 39% 18% 36% 25% 8%
• Use of Twitter for social networking vs. information sharing differs across languages
• Design of recommendation engines
– Korean users: promote conversational tweets
– German users: promote tweets with URLs
Studying Bilingual Brokers
• Importance of brokers
– Structural holes (Burt’92), LiveJournal (Herring et al’07)
• Define bilingual brokers as users who tweeted in a pair of languages
• Caveat
– Under-estimated due to 4-week time limit
– Over-estimated due to language detection errors
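Under this definition, counting brokers per language pair is a grouping exercise. The sketch below assumes the input is a list of `(user, detected_language)` records, one per tweet; the record layout, the user names, and the function name `count_bilingual_brokers` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_bilingual_brokers(records: list) -> Counter:
    """Count users per unordered language pair they tweeted in."""
    languages_by_user = defaultdict(set)
    for user, language in records:
        languages_by_user[user].add(language)

    pair_counts = Counter()
    for languages in languages_by_user.values():
        # A user with three languages is a broker for all three pairs.
        for pair in combinations(sorted(languages), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical records: (user, language detected for one tweet).
records = [
    ("ana", "English"), ("ana", "Portuguese"), ("ana", "English"),
    ("ben", "English"),
    ("cho", "English"), ("cho", "Korean"), ("cho", "Japanese"),
]
print(count_bilingual_brokers(records))
```

Both caveats show up directly in this computation: a short collection window shrinks each user's language set, while a single misdetected tweet adds a spurious language and therefore spurious pairs.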
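Number of Bilingual Brokers
(E: English, J: Japanese, P: Portuguese, I: Indonesian, S: Spanish, D: Dutch, K: Korean, F: French, G: German, M: Malay)
E J P I S D K F G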
J 140,730
P 488,545 13,228
I 230,023 4,825 29,405
S 359,117 10,139 112,524 36,068
D 150,041 6,383 30,855 34,906 30,916
K 19,722 6,384 906 2,014 1,109 972
F 194,931 10,463 53,607 34,586 49,445 33,568 1,244
G 110,748 6,053 22,106 21,471 21,989 22,162 786 24,763
M 148,365 4,208 31,184 135,427 31,967 29,331 1,518 30,257 18,301
Sharing URLs Across Languages
E J P I S D K F G M
E - 3,013 18,399 985 4,986 1,144 212 1,791 1,647 540
J 3,013 - 77 37 58 29 43 59 46 18
P 18,399 77 - 74 1,644 198 2 453 168 123
I 985 37 74 - 67 64 1 53 38 279
S 4,986 58 1,644 67 - 139 0 286 139 53
D 1,144 29 198 64 139 - 2 112 126 48
K 212 43 2 1 0 2 - 3 3 1
F 1,791 59 453 53 286 112 3 - 157 53
G 1,647 46 168 38 139 126 3 157 - 40
M 540 18 123 279 53 48 1 53 40 -
Sharing Hashtags Across Languages
E J P I S D K F G M
E - 8,178 33,197 14,969 27,284 6,685 798 9,410 7,208 5,517
J 8,178 - 331 135 351 218 149 352 260 100
P 33,197 331 - 535 4,682 604 13 1,231 580 400
I 14,969 135 535 - 762 684 25 713 415 6,046
S 27,284 351 4,682 762 - 819 28 1,468 708 463
D 6,685 218 604 684 819 - 26 851 769 424
K 798 149 13 25 28 26 - 25 18 20
F 9,410 352 1,231 713 1,468 851 25 - 879 411
G 7,208 260 580 415 708 769 18 879 - 265
M 5,517 100 400 6,046 463 424 20 411 265 -
Implications
• Indicators of connection strength between languages
– Number of bilingual brokers
– Acts of brokerage: sharing URLs & hashtags
• English well connected to others, and may function as a hub
• Need to improve cross-language communications
Visible Social Signals from Shared Items
Kudos to Jilin Chen, Rowan Nairn
[Chen et al, CHI2010] [Chen et al., CHI2011]
Eddi: Summarizing Social Streams
Information Gathering/Seeking
• The Filtering Problem:
– “I get 1,000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?”
• The Discovery Problem:
– “There are millions of URLs posted daily on Twitter. Am I missing something important there outside my own Twitter stream?”
Stream Recommender
• Zerozero88.com
– Twitter as the platform
– URLs as the medium
– Produces your personal headlines
URL Sources
User Topic Profiles
Topic Relevance Scores
Local Social Network
Social Network Scores
Recommendation Engine
Ø Multiply scores
Ø Rank URLs using multiplied scores
Ø Recommend highest ranked URLs
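The three engine steps (multiply, rank, recommend) can be sketched in a few lines. The candidate URLs and scores below are made up, and the tuple layout and function name `recommend` are assumptions for illustration, not the system's actual interface.

```python
def recommend(candidates: list, k: int = 5) -> list:
    """Rank candidate URLs by topic score x social score and keep the top k.

    Each candidate is a (url, topic_score, social_score) tuple.
    """
    # Step 1: multiply the two scores for every candidate URL.
    combined = [(url, topic * social) for url, topic, social in candidates]
    # Step 2: rank URLs by the multiplied score, highest first.
    combined.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: recommend the highest ranked URLs.
    return [url for url, _ in combined[:k]]


# Hypothetical candidates: (url, topic relevance score, social network score).
candidates = [
    ("http://example.com/a", 0.9, 1.0),
    ("http://example.com/b", 0.4, 5.0),
    ("http://example.com/c", 0.7, 2.0),
]
print(recommend(candidates, k=2))
```

Multiplying rather than adding means a URL needs a non-trivial score on both dimensions to rank well; a score of zero on either one removes it from contention.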
URL Sources
• Considering all URLs was impossible
• FoF: URLs from followee-of-followees
– Social Local News is Better
• Popular: URLs that are popular across whole Twitter
– Popular News is Better
Component: URL Sources
Possible Design Choices: FoF (followee-of-followees), Popular
Topic Relevance Scores
Funny YouTube Video
1.3 5.5 0.5
Funny Game …
4.0 2.1 …
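The slides show term-weight vectors but do not say how two of them are compared. One common choice, assumed here purely for illustration, is cosine similarity between a URL's term vector and a user's term vector. The assignment of the slide's numbers to a "URL" and a "user" profile is also a guess.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    shared_terms = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


# Weights taken from the slide; which vector is the URL and which is the
# user is an assumption made for this example.
url_profile = {"funny": 1.3, "youtube": 5.5, "video": 0.5}
user_profile = {"funny": 4.0, "game": 2.1}

print(cosine_similarity(url_profile, user_profile))
```

Only the shared term ("funny") contributes to the score, which is why sparse term vectors built from short tweets are a problem for this kind of matching.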
Topic Profile of URLs
• Built from tweets that contain the URL
• However, tweets are short
– term vectors for URLs are often too sparse
• Adopt a term expansion technique using a search engine
Example tweet: “Best of Show CES 2011: The Motorola Atrix http://tcrn.ch/e0g3Oh”
Add to Profile: smartphone, mobility, …
Topic Profile of Users
• Self-Topic: content profile based on my posts
– My Interest as Information Producer
• Followee-Topic: content profile based on my followees’ posts
– My Interest as Information Gatherer
• None, for comparison purposes
Component: Topic Relevance Scores
Possible Design Choices: Self-Topic, Followee-Topic, None
My Followees
[Figure: each followee has a Profile and a set of top Terms; steps shown: Collect & Profile, Find Top Key Terms, Aggregate Profile]
A term is weighted higher in your profile if more of your followees have the term as their top key terms
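The aggregation rule stated above can be sketched as a count over followees. The cutoff `k`, the sample profiles, and the function names are assumptions; the slides do not say how many top terms are kept per followee or whether the counts are normalized.

```python
from collections import Counter


def top_terms(profile: dict, k: int) -> list:
    """The k highest-weighted terms of one followee's profile."""
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def followee_topic_profile(followee_profiles: list, k: int = 2) -> Counter:
    """Weight each term by how many followees have it among their top k terms."""
    weights = Counter()
    for profile in followee_profiles:
        weights.update(top_terms(profile, k))
    return weights


# Hypothetical followee profiles (term -> weight).
followee_profiles = [
    {"android": 5.0, "tablet": 3.0, "coffee": 1.0},
    {"android": 4.0, "hci": 2.5, "tablet": 0.5},
    {"hci": 6.0, "android": 2.0, "travel": 1.5},
]
print(followee_topic_profile(followee_profiles, k=2))
```

Counting followees instead of summing raw term weights keeps one very prolific followee from dominating the aggregate profile.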
Social Network Scores
• “Popular Vote” among my followees-of-followees
– People “vote” for a URL by tweeting it
– URLs with more votes in total are assigned a higher score
– Votes are weighted using social network structure
• None, for comparison purposes
Component: Social Network Scores
Possible Design Choices: Social Voting, None
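The voting idea can be sketched as a weighted tally. The slides say only that votes are weighted by social network structure; the weighting used here (a voter counts once for each of my followees who follows them) is an assumption chosen for illustration, as are the names and the toy graph.

```python
from collections import Counter


def local_influence_weights(my_followees: set, follows: dict) -> Counter:
    """Assumed weight: how many of my followees follow each account."""
    weights = Counter()
    for followee in my_followees:
        for account in follows.get(followee, set()):
            weights[account] += 1
    return weights


def social_vote_scores(tweeted_by: dict, weights: Counter) -> dict:
    """Sum the weights of everyone who 'voted' for a URL by tweeting it."""
    return {
        url: sum(weights.get(voter, 0) for voter in voters)
        for url, voters in tweeted_by.items()
    }


# Hypothetical follow graph: I follow a, b, c; they follow x and y.
my_followees = {"a", "b", "c"}
follows = {"a": {"x", "y"}, "b": {"x"}, "c": {"x"}}
tweeted_by = {"http://example.com/1": {"x"}, "http://example.com/2": {"y"}}

weights = local_influence_weights(my_followees, follows)
print(social_vote_scores(tweeted_by, weights))
```

In this toy graph, the URL tweeted by `x` outranks the one tweeted by `y` because more of my followees follow `x`.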
The Intuition: Local Influence
[Figure: follow graph around “Me”, with groups labeled “15 People” and “5 People”]
Whose URLs should be weighted higher?
Possible Recommender Designs
Component: URL Sources
Possible Design Choices: FoF (followee-of-followees), Popular
Component: Topic Relevance Scores
Possible Design Choices: Self-Topic, Followee-Topic, None
Component: Social Network Scores
Possible Design Choices: Social Voting, None
• 2 (URL source) x 3 (topic score) x 2 (social score) = 12 possible algorithm designs in total
• Random selection if for both scores we chose None
Recommendation Engine
Ø Multiply scores
Ø Rank URLs using multiplied scores
Ø Recommend highest ranked URLs
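The design space is the Cartesian product of the three components, which a short enumeration makes concrete. The option labels come from the slides; the variable names are illustrative.

```python
from itertools import product

URL_SOURCES = ["FoF", "Popular"]
TOPIC_SCORES = ["Self-Topic", "Followee-Topic", "None"]
SOCIAL_SCORES = ["Social Voting", "None"]

# Every combination of one choice per component.
designs = list(product(URL_SOURCES, TOPIC_SCORES, SOCIAL_SCORES))

# With no topic score and no social score there is nothing to rank by,
# so these designs reduce to random selection from the URL source.
random_baselines = [
    design for design in designs
    if design[1] == "None" and design[2] == "None"
]

print(len(designs))
for design in random_baselines:
    print(design)
```

The two random-selection designs (one per URL source) serve as baselines against which the scored designs can be compared.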
Study Design
• Within-subject design
• Each subject evaluated 5 URL recommendations from each of the 12 algorithms
– Show 60 URLs in random order, and ask for binary rating
– 60 ratings x 44 subjects = 2,640 ratings in total
Summary of Results
[Figure: results chart comparing FoF URLs and Popular URLs; annotations: Best Performing, Social Vote Only]
Algorithms Differ Not Only in Accuracy!
• Relevance vs. Serendipity in recommendations
• From a subject in the pilot interview of zerozero88:
– “There is a tension between the discovery and the affirming aspect of things. I am getting tweets about things that I am already interested in. Something I crave …, is an element of surprise or whimsy. ... I am getting a lot of things I am interested in, but that is not necessarily a good thing for me personally”
Design Rule
• Interaction costs determine number of people who participate
– Surplus of attention & motivation at small transaction costs
• Therefore:
• Important to keep interaction costs low
– Recommendation
– Summarization
• Or bring new benefits
[Figure: # People willing to participate (vertical axis) vs. Cost of participation (horizontal axis)]