Content-Based Social Network Analysis of Online Communities
Anatoliy Gruzd, Caroline Haythornthwaite
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Social Network/ing Symposium, Toronto, 2007
The Problem
Online communities are creating a growing volume of texts contributed by a growing number of participants
100 million posters in Usenet (Marc Smith, quoted in CNET, 2003)
3.12 terabytes of data *daily* on Usenet (2007)
By 2010, >70% of digital content will be user-generated, and the majority of it will still be text-based (Technology Consultancy IDC)
Figure: Growth of Usenet (Wikipedia, Oct. 2007)
Making Sense of Community Action
How can we help make sense of community action and interaction based solely on textual interchanges?
How can we make the social structures evident for participants, and managers or teachers?
How can we make advances from linear streams of text to visualized patterns of interaction?
Figure: Growth of blog activity, March 2003-2006
• 175,000 new blogs a day (2006)
Mapping Online Communities
Mappings and internal examinations tend to be based on one aspect of ties
Links between sites
Reports of friendship or work relations
FOAF declarations
With a concentration on quantity over content
Figure: Flickrverse, Gustavog, 2006. Based on 50 connections between people.
http://www.flickr.com/photo_zoom.gne?id=9708628&context=set-222111&size=l
Mapping Online Communities (2)
Emerging mappings include attention to
Poster activity
Actor profiles as posters
Content of sites, e.g., words in common on different sites (Gloor & Zhao, 2006)
Figure: Welser, Gleave, Fisher & Smith, 2007, in JOSS
Extracting Network Information
Determine who is talking to whom
Applying social network analysis techniques
Determine what they are talking about
Applying natural language processing techniques
Merge these to produce network detection that better represents ongoing processes
Our Goal
Use natural language processing (NLP) to
enhance the current techniques of building social networks
gain more information and insight about Nodes, Relations, and Ties
Current focus is on bulletin boards
Current example is an online learning environment
Procedures are being derived to use for groups with unknown membership
Adding more with NLP
Revealing network information
1. Node discovery
2. Tie discovery
3. Relation discovery
4. Role & Group discovery
Network visibility rather than aggregate behavior
Important for revealing structures to
Participants, to understand the ‘lay of the (cyber)land’
Instructors (or managers), to oversee participation and intervene as necessary
Adding relational information
Few (yet) derive relations from content, which can reveal
Networks based on multiple relations
Change in discourse over time
Changes in associations among network members by relation and time
Few deal with the vagaries of CMC texts
Bulletin boards, chat
Incorrect spelling, partial sentences, inventive punctuation
Deriving who is talking to whom from content analysis
Or local language conventions
Acronyms, group naming conventions, group word use conventions, nicknames for people and processes
Nodes and Ties
Focus today on node and tie discovery
Identifying who the actors in the network are
Identify nodes, i.e., people
Make the tie(s) between nodes
Two approaches
Chain Network, based on chain of posting
Name Network, based on names used in the text
Chain Network: definition options
Example post chain: A (thread starter), B, C (last poster), D (sender). Weights below are for the ties from D to A, B, and C.
1. Connect a sender to the last person in the post chain only (undirected)
   Weights (A, B, C): 0, 0, 1
2. Connect a sender to the last and first (=thread starter) person in the chain, and assign equal weight values (e.g. 1) to both ties
   Weights (A, B, C): 1, 0, 1
3. Same as option 2, but a tie between a sender and the first person is half weight (e.g. 0.5)
   Weights (A, B, C): .5, 0, 1
4. Connect a sender to all people in the reference chain with decreasing weights
   Weights (A, B, C): .25, .5, 1
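The four definition options above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `chain_tie_weights` is hypothetical, the options are numbered 1 to 4 in the order listed, and halving the weight at each step back (option 4) is an assumption chosen because it reproduces the .25 / .5 / 1 example.

```python
def chain_tie_weights(chain, option):
    """Return {earlier_poster: weight} for the sender of the next post.

    chain  -- posters already in the reference chain, thread starter first
    option -- 1 to 4, in the order the options are listed on the slide
    """
    if not chain:
        return {}
    first, last = chain[0], chain[-1]
    if option == 1:
        # Tie to the last person in the chain only.
        return {last: 1.0}
    if option in (2, 3):
        # Tie to the last person and to the thread starter.
        weights = {last: 1.0}
        if first != last:
            weights[first] = 1.0 if option == 2 else 0.5
        return weights
    if option == 4:
        # Tie to everyone in the chain; the weight is halved per step back
        # (assumption matching the .25 / .5 / 1 example).
        weights = {}
        for steps_back, poster in enumerate(reversed(chain)):
            weight = 0.5 ** steps_back
            # If a poster appears more than once, keep the strongest tie.
            weights[poster] = max(weights.get(poster, 0.0), weight)
        return weights
    raise ValueError("option must be 1, 2, 3 or 4")


# D replies in a thread where A, B and C have already posted.
for option in (1, 2, 3, 4):
    print(option, chain_tie_weights(["A", "B", "C"], option))
```

Keeping the option as a parameter makes it easy to build the same chain network under each definition and compare the results.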
Chain Networks: missed info.
Ex.1 Previous post is by Gabriel, Sam replies:
‘Nick, Ann, Gina, Gabriel: I apologize for not backing this up with a good source, but I know from reading about this topic that libraries…’
Ex.2 Previous posts by Gabriel, Sam, Gina, and Eva, then:
‘Gina, I owe you a cookie. This is exactly what I wanted to know. I was already planning on taking 302 next semester, and now I have something to look forward to!’
Ex.3 Post by Fred:
‘I wonder if that could be why other libraries around the world have resisted changing – it's too much work, and as Dan pointed out, too expensive.’
Name networks
Making use of node and tie information that is in the text of the postings
Issues
Disambiguating names/nicknames from text
Disambiguating names of people from names of people being discussed (e.g., subject)
Detection of aliases for a given person and disambiguation of two or more users with the same name
Hand coding: categories
Network Participants
<from> = person indicated in ‘from’ line of post heading (NB only info that is system generated)
<addressee> = direct reference to other ('I agree with you Todd')
<reference> = indirect reference to other ('Todd has a good point')
<self-reference> = poster references themselves in some way (braindead library student, high school teacher, etc.)
<signature> = name as given by the message author on their post
Named non-participants
<subject>, <subject 2>, or <subject 3> = name is a subject of the discussion, either as one name (Dewey), 2 (Brewster Kahle) or 3 (Charles R. Darwin)
<non-group reference> = reference to a person who is not in the group, nor the subject, e.g., a former professor
Error
<error> = new name appears because of error (e.g., Lackie as a subject instead of Leckie; or part of a prevpost line does not conform to the usual format)
Previous Posts (if not removed from dataset)
<previous-poster> = when the previous message is included, this indicates the poster (‘Janice wrote: ’) (system generated)
<copy> = name appears because it is included with the previous message
Examples of hand-coding
Just a note to clarify something in yesterday's lecture/chat session. I mentioned that Monday's NY Times had an article on <#1><subject> Brewster. I want to clarify that the article concerns the copyright extension law and the current Supreme Court case <#1><subject> Eldred v. <#1><subject> Ashcroft, set to begin today, I believe. <#1><subject 2> Brewster Kahle is currently touring the country in a bookmobile … For more info on this … you can refer to the Web site that <#1><reference> Jodie mentioned yesterday… <#1><signature>LA
NB. Jodie may not even appear in the contributors to this thread
Several of our programs at UC <#7><subject> Davis have well-intentioned lower division research methods classes that introduce then never reinforce basic skills.
Need to disambiguate “UC Davis” from someone called “Davis”
Research (to paraphrase my hero, <#8><subject> Shrek) is like onions. Not because it stinks, but because it is made up of layers.
“Shrek” as a name will not appear in conventional name lists.
Automated Node & Tie Discovery Method
1. Determine names in the dataset, and assign a probability value
2. Determine email address to name relationship
3. Assign tie weight to each discovered tie
Automated Node Discovery
Named Entity Recognition
Discovery of personal names
The 1990 US Census http://www.census.gov/genealogy/names
Capitalization
Distinguishing between names of people in and outside the class
Having a list of names doesn’t always work, e.g., if someone uses their middle name, which is not on the name list, or they use a short name or nickname
Method: associate names with email addresses in the class, relying on content-based (e.g. context words) and structure-based (e.g. word position) features of names
Issues
Many names - same person
Same name - many people
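A minimal sketch of the first step, finding candidate personal names from capitalization plus a name list, might look like the following. It is an illustration only: `KNOWN_FIRST_NAMES` is a tiny hypothetical stand-in for a list such as the 1990 US Census names, and the confidence values 0.9 and 0.3 are arbitrary placeholders, not the probabilities used by the authors.

```python
import re

# Hypothetical stand-in for a real first-name list (e.g. census names).
KNOWN_FIRST_NAMES = {"dustin", "sam", "nick", "charlie", "wilma"}

# A token is a run of letters, optionally with apostrophes or hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'-]*")


def candidate_names(text, known_names=KNOWN_FIRST_NAMES):
    """Return (token, confidence) pairs for every capitalized token."""
    candidates = []
    for match in TOKEN.finditer(text):
        word = match.group()
        if not word[0].isupper():
            continue
        # Placeholder confidences: higher when the token is on the name list.
        confidence = 0.9 if word.lower() in known_names else 0.3
        candidates.append((word, confidence))
    return candidates


print(candidate_names("Hi Dustin, thanks. Shrek is my hero"))
```

Note that "Hi" and "Shrek" both come back as low-confidence candidates: capitalization alone over-generates, and a name list alone misses nicknames and unusual names, which is why the slides combine them with context and position features.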
Automated Node Discovery (2)
EXAMPLE
From: [email protected] (=Wilma)
Reference Chain: [email protected] (=Dustin) => [email protected] (=Sam)
Hi Dustin, Sam, Nick and all, I appreciate your posts from this and last week […]. I keep thinking of poor Charlie who only wanted information on “dogs“ Sam has been talking about. […] Wilma.

Words to the Left | Name    | Words to the Right | Position % | Score “TO” | Score “FROM”
* Hi              | Dustin  | Sam, Nick,         | 0          | 0.322      | -0.004
* Dustin,         | Sam     | Nick and           | 1          | 0.321      | -0.002
Dustin, Sam,      | Nick    | and all,           | 2          | 0.320      | -0.001
of poor           | Charlie | who only           | 50         | 0.05       | 0.04
on “dogs“         | Sam     | has been           | 65         | 0.285      | 0.07
*                 | Wilma   | *                  | 88         | 0.0012     | 0.116
* - end of the line
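The table above pairs each name occurrence with its neighbouring words and its relative position in the message. A rough sketch of extracting those features follows. The function name `name_context_features` is hypothetical, the position is computed as a simple word-offset percentage (an assumption, so it will not reproduce the exact values in the table), and the "TO"/"FROM" scoring itself is not specified on the slides and is not attempted here.

```python
def name_context_features(text, names, window=2):
    """For each occurrence of a known name, collect context and position."""
    words = text.split()
    features = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation before comparing with the name set.
        bare = word.strip(".,;:!?\"'“”")
        if bare not in names:
            continue
        features.append({
            "name": bare,
            "left": words[max(0, i - window):i],
            "right": words[i + 1:i + 1 + window],
            # Word offset as an integer percentage of the message length.
            "position_pct": 100 * i // len(words),
        })
    return features


message = "Hi Dustin, Sam, Nick and all, thanks. Wilma"
for row in name_context_features(message, {"Dustin", "Sam", "Nick", "Wilma"}):
    print(row)
```

Names near the start of a message with greeting words to the left look like addressees, while a name at the very end with nothing to the right looks like a signature; features of this kind are what a "TO"/"FROM" scorer would consume.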
Automated Tie Discovery
Associate each sender in the class with all names mentioned in his/her emails. For example,
Wilma ---> Dustin = [email protected]
Wilma ---> Charlie
no email for Charlie, so not a person in the conversation group (e.g., when Steve and I took Professor Sid’s course last year)
Wilma ---> no mention of a name; info on tie is only in the Chain network; could be the start of a thread, a change of topic within a thread, or a general posting
Assign tie weight
Pair counts
Mutual information
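The two weighting schemes named above can be sketched as follows. The slides do not give a formula for "mutual information"; this sketch assumes pointwise mutual information, log2(p(sender, name) / (p(sender) p(name))), estimated from mention counts. The function name `tie_weights` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tie_weights(mentions):
    """mentions: list of (sender, mentioned_name) pairs, one per mention.

    Returns (pair_counts, pmi), both keyed by (sender, mentioned_name).
    """
    pair_counts = Counter(mentions)
    sender_counts = Counter(sender for sender, _ in mentions)
    name_counts = Counter(name for _, name in mentions)
    total = len(mentions)

    pmi = {}
    for (sender, name), count in pair_counts.items():
        p_pair = count / total
        p_sender = sender_counts[sender] / total
        p_name = name_counts[name] / total
        # Assumed definition: pointwise mutual information in bits.
        pmi[(sender, name)] = math.log2(p_pair / (p_sender * p_name))
    return pair_counts, pmi


mentions = [
    ("Wilma", "Dustin"),
    ("Wilma", "Dustin"),
    ("Wilma", "Sam"),
    ("Sam", "Dustin"),
]
counts, pmi = tie_weights(mentions)
print(counts)
print(pmi)
```

Pair counts reward frequent posters and frequently mentioned people; a mutual-information weight discounts ties that are expected from activity levels alone, so the two can rank the same ties differently.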
Chain vs. Name Networks
Get added information from the name network
Ex. BBoards #06,07,08
Nodes: 37
Messages: 346
Chain network ties: 223
Name network ties: 215 / 429
Shared ties: 140
QAP Pearson Correlation: 0.453 (p = .000)
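For readers unfamiliar with the comparison statistics above, here is a small sketch of counting shared ties and running a QAP-style permutation test on the Pearson correlation of two adjacency matrices. It is a generic illustration, not the tool the authors used: the function names, the one-sided p-value, the directed counting of shared ties, and the toy matrices are all assumptions.

```python
import math
import random


def offdiag(matrix, order):
    """Off-diagonal cells of matrix after relabelling nodes by order."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(n) if i != j]


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def shared_ties(a, b):
    """Count directed cells where both networks have a tie."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(n)
               if i != j and a[i][j] and b[i][j])


def qap_correlation(a, b, permutations=1000, seed=0):
    """Observed correlation and the share of permutations that match or beat it."""
    identity = list(range(len(a)))
    x = offdiag(a, identity)
    observed = pearson(x, offdiag(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        # Rows and columns of b are permuted together, preserving its structure.
        if pearson(x, offdiag(b, perm)) >= observed:
            hits += 1
    return observed, hits / permutations


chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
name = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
print(shared_ties(chain, name), qap_correlation(chain, name))
```

Permuting node labels rather than individual cells is what distinguishes QAP from an ordinary correlation test: it keeps each network's structure intact while breaking the correspondence between the two.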
Figure: An ego network for Brent (Name Network and Chain Network). Visualization powered by http://www.netvis.org
Figure: An ego network for Tyler (Name Network and Chain Network). Visualization powered by http://www.netvis.org
kurt -> Kurt Cobain, a lead singer for the rock band Nirvana
dewey -> John Dewey, philosopher & educator
santa_monica -> Santa Monica Public Library
mark -> mark up language
Conclusion
Uses and benefits of content-based networks
Discovery of social network behavior rather than posting behavior
Discovery of social interactions between group members that happened outside the group (e.g. fishing trip)
Discovery of relations between group members and people outside the group (e.g. a shared friend from another department)
Expert/Co-discussant discovery
Study of perceived social networks without directly collecting survey data from participants (?)
References and Further Reading
Related papers
Haythornthwaite, C. & Gruzd, A. (2007). A noun phrase analysis tool for mining online community. In C. Steinfield, B.T. Pentland, M. Ackerman & N. Contractor (eds.). Communities and Technologies 2007 (pp. 67-86). London: Springer.
Welser, H.T., Gleave, E., Fisher, D. & Smith, M. (2007). Visualizing the signatures of social roles in online discussion groups. Journal of Social Structure, 8(2). http://www.cmu.edu/joss/content/articles/volume8/Welser/