
The Social Science Approach to Web Mining

The Social Science Approach to Web Mining Sun-Ki Chai Dept. of Sociology David Chin, Scott Robertson, Kar-Hai Chu, Aaron Herres Dept. of Information & Computer Sciences University of Hawaii at Manoa IEEE SocialCom, Vancouver, August 28, 2009

The Social Science Approach to Web Mining

Sun-Ki Chai, Dept. of Sociology

David Chin, Scott Robertson, Kar-Hai Chu, Aaron Herres

Dept. of Information & Computer Sciences

University of Hawaii at Manoa

IEEE SocialCom, Vancouver, August 28, 2009

Social Science Web Crawler Project Team

Sun-Ki Chai (Dept. of Sociology)
David Chin (Dept. of Information & Computer Sciences)
Scott Robertson (Dept. of Information & Computer Sciences)
Mooweon Rhee (Dept. of Management and Industrial Relations)
Min-Sun Kim (Dept. of Speech Communications)
Jang Hyun Kim (Dept. of Speech Communications)

Research Assistants: Kar-Hai Chu, Aaron Herres & Dong-Wan Kang

United States Patent # 7499965 by Sun-Ki Chai

Current Research Supported by the Air Force Office of Scientific Research and the Office of Naval Research

Organization

• Social Science methods for web mining (70 min.) (Sun-Ki Chai)

• Differences between social science and computer science approaches (10 min.) (David Chin)

• A crawler example from our project (20 min.) (Aaron Herres, Kar-Hai Chu, David Chin)

• Discussion (20 min.) (Everyone!)

What We Are and Are Not Teaching

• This is not primarily a tutorial on social network constructs and how to calculate them – computer scientists and engineers already know how to do that or can master this quickly.

• It is about how social network and other formal social science frameworks can be used to mine the web in a way that is consistent with mainstream social science theory and methodology

• It provides a short history of some of these frameworks, and illustrates the issues they were originally designed to address, and the issues that arise when they are transferred to web-mining.

What is Different about Mainstream Social Science Methodology, and How is it Applied to Web-Mining?

• Mainstream Social Science methods rely on strict application of a deductive/nomothetic approach to human phenomena
– General Theory from existing literature generates Hypotheses about specific class of empirical phenomena
– Operationalization of Hypotheses translates to falsifiable statements about measurable Indicators
– Data collection Methods ensure Representative Sample of Indicators
– Hypotheses shown to be true or false by relationships in Data
– Theories are confirmed or disconfirmed, subject to modification

• Much of social science work on the web is an attempt to adapt proven “terrestrial” theories to online phenomena

• Social science web research is judged by its ability to generate testable, confirmed hypotheses about behavioral, social structural (esp. stratification), and cultural consequences for individuals and groups represented by online data.

Social Network Models: The Early Days

• Notion of formally analyzing social networks originates in “classical” sociology, particularly the work of Georg Simmel (1908)
– Group size; dyads vs. triads
– Tertius Gaudens (third who enjoys)

• Also central in gestalt and attitudinal psychology
– Jacob Moreno’s sociograms (1931)
– Fritz Heider’s balance theory (1958)

The Development and Use of Centrality Measures

• Centrality as a developing concept
– Early concepts of centrality
  • Alex Bavelas (1948; 1950) - communication
  • Marvin Shaw (1954) - small group behavior
– Degree, betweenness, closeness codified
  • Linton Freeman (1979)
– Eigenvector → PageRank
  • Phil Bonacich 1987
– Information
  • Stephenson and Zelen 1989
– Flow (Betweenness)
  • White and Smith 1989 (contested)
– Individual and group, local and global centrality

• Centrality as predictor of prestige/status in organizations (primarily business organizations).
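
As a rough illustration of two of the measures Freeman codified, the sketch below computes normalized degree and closeness centrality for a small undirected graph. The adjacency-dictionary format, the function names, and the star-shaped example network are assumptions made for illustration, not part of the original slides; closeness is computed only over reachable nodes.

```python
from collections import deque

def degree_centrality(graph):
    """Share of the other n-1 nodes that each node is directly tied to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def closeness_centrality(graph):
    """(Number of reachable nodes) / (sum of shortest-path distances to them)."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

# Hypothetical star network: one hub tied to three otherwise unconnected actors.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
print(degree["hub"], closeness["hub"], closeness["a"])
```

In the star, the hub scores highest on both measures, which is the kind of position the early communication studies associated with status.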

Subcommunities: What are Their Implications for Status and Action?

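• N-Clique – maximal subcommunity in which every pair of actors is connected by a shortest path of length ≤ N (paths may pass through non-members)

• N-Clan – N-Clique with paths restricted to member nodes.

• N-Club – maximal subcommunity with diameter ≤ N using member nodes only; it need not itself be an N-Clique

• K-plex – for each node, no more than K nodes in the group (counting itself) not directly tied to it

• K-core – each actor directly tied with at least K others in the group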
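
The K-core is the easiest of these constructs to compute, so a minimal sketch is given here: repeatedly drop any actor with fewer than K remaining ties until no such actor is left. The adjacency-dictionary input (assumed undirected and symmetric), the function name `k_core`, and the example network are illustrative assumptions.

```python
def k_core(graph, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    # Work on a copy so the caller's graph is left untouched.
    core = {node: set(nbrs) for node, nbrs in graph.items()}
    changed = True
    while changed:
        changed = False
        for node in list(core):
            if len(core[node]) < k:
                # Removing a node lowers its neighbors' degrees,
                # so another pass over the graph is needed.
                for nbr in core[node]:
                    core[nbr].discard(node)
                del core[node]
                changed = True
    return set(core)

# Hypothetical network: a closed triangle a-b-c with one pendant actor d.
network = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
two_core = k_core(network, 2)
three_core = k_core(network, 3)
print(sorted(two_core), sorted(three_core))
```

The pendant actor drops out of the 2-core, and no one in this network has three ties inside a common group, so the 3-core is empty.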

Borgatti's Science camp network

Likewise for Equivalence

• Structural Equivalence: two or more nodes share the same set of neighbors

• Automorphic Equivalence: there exists a partition of nodes into sets that, if simultaneously exchanged, would recreate the graph

• Regular Equivalence: exists partition into sets such that, for any two sets, no ties exist between any of their members or ties are exclusive and exhaustive
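
Structural equivalence is simple enough to check directly. The sketch below tests whether two actors have identical neighbor sets, ignoring any tie between the two actors themselves (a common convention, adopted here as an assumption). The graph format and the names are hypothetical.

```python
from itertools import combinations

def structurally_equivalent(graph, u, v):
    """True if u and v are tied to exactly the same other actors."""
    return graph[u] - {v} == graph[v] - {u}

def equivalent_pairs(graph):
    """All unordered pairs of structurally equivalent actors."""
    return [
        (u, v)
        for u, v in combinations(sorted(graph), 2)
        if structurally_equivalent(graph, u, v)
    ]

# Hypothetical organization: two clerks who both report only to the same manager.
org = {
    "manager": {"clerk1", "clerk2"},
    "clerk1": {"manager"},
    "clerk2": {"manager"},
}
pairs = equivalent_pairs(org)
print(pairs)
```

The two clerks are interchangeable in this sense, while the manager is not equivalent to either of them.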

Innovation in Organizations (Burt 1977)

Idea Contagion in Organizations (Galaskiewicz 1991)

Linking Citation Networks (Doreian 1989)

Structural Holes vs. Closure: Most Prominent Empirical Debate in Sociology

• Structural Holes (Burt 1995) as a challenge to Centrality
– Connecting two otherwise unconnected communities (tertius jungens)
– Consistent with the idea of exit opportunity
– But strong emphasis placed on information

• Longtime debate on relative causal importance of structural holes vs. closure (Coleman 1990) for prestige, SES

J.S. Coleman, Foundations of Social Theory, p. 318-320

Structural Holes vs. Closure: Who’s Better Off?

[Figure: example network with nodes A through K and M]

M occupies structural hole; K, J, I have closure (for that matter, A, C, H have highest degree centrality).

Thanks to Mooweon Rhee for graph example.

What is the resolution?

• Studies in business organizations show that structural holes are the better predictor of status (Burt, 1992; Podolny, 2005).

• In educational organizations, closure is a strong predictor of student achievement, but these studies do not consider structural holes (Coleman, 1990; Parke et al., 2002)

• In East Asian business organizations, structural holes are not as good a predictor of status as in the West (Xiao and Tsui, 2007)

• Closure increases group status overall; structural holes increase individual status relative to the rest of the group, paradoxically more so in a closed group. Hence linear models are inappropriate. (Chai and Rhee, 2009)

Software Tools for Social Network Analysis

• Desktop Applications
– UCINet (Freeman and Borgatti) <http://www.analytictech.com/>
– NetMiner (Kim KH/Cyram) <http://www.netminer.com/>
– Pajek (Batagelj, Mrvar) <http://vlado.fmf.uni-lj.si/pub/networks/pajek/>

• Code Libraries
– Jung <http://jung.sourceforge.net/>
– Prefuse <http://prefuse.org/>
– Network Workbench <http://nwb.slis.indiana.edu/>

• Visualization Tools
– Social Action <http://www.cs.umd.edu/hcil/socialaction/>
– NetViz (Cyram) <http://www.cyram.com>

Social Network Theory in Sociology As Predictor of Stratification, Ideas

• Development of technical constructs should immediately lead to testable hypotheses about important social outcomes

• In structure: greatest emphasis on predicting stratification outcomes – popularity, prestige/status, influence clearly distinguished.

• In culture: greatest emphasis is on dynamic analysis of spread of ideas (values, beliefs)

What’s Needed: Integration with Behavioral and Preference/Belief Change Models

• Network structure and content cannot be taken as static and exogenous: they are the result of human choices

• Consequences of virtual network for stratification and spread of ideas depend on the actions of the nodes embedded in the network, who in turn are the creations of the real-world actors

• Thus it is unavoidable that we incorporate theories about preferences, beliefs, and decision-making of actors to generate behavioral predictions.

Exchange Theory

• Attempt in sociology to synthesize rational choice and behaviorist ideas
– Homans (1958, 1961) views social interactions as exchanges
– Blau (1964) examines how hierarchical relations can transform exchange into coercion
– Emerson (1968) developed pioneering notion of exchanges embedded in social relationships

Experiments on Networks, Exchange and Power (Sociological Social Psychology)

Emerson’s “Children”:
• Karen Cook
• Linda Molm
• David Willer
• Toshio Yamagishi

Main concepts: dependence, power, trust, fairness

Formalisms apply game theory framework

Formal Rational Choice

• Expected utility theory (solitary action)
– Preferences: strict order, completeness, asymmetry (= irreflexivity and acyclicity), and transitivity
  • Cardinal/Interval Preferences represented by Utility function
  • Conventionally: Egoistic, Materialistic, Isomorphic
– Beliefs: based on observations and legal (logical, probabilistic) inferences
  • Conventionally: No other source of beliefs than the above
– Decision-Making: optimization - maximize expected utility in light of beliefs

• Game theory (collective action)
– Strategic uncertainty
– Common knowledge of rationality
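
The decision rule for solitary action can be sketched in a few lines: beliefs are probabilities over outcomes, preferences are a utility function, and the actor picks the action with the highest expected utility. The action names, payoffs, and utility functions below are invented for illustration only.

```python
import math

def expected_utility(lottery, utility):
    """Sum of probability-weighted utilities over (outcome, probability) pairs."""
    return sum(prob * utility(outcome) for outcome, prob in lottery)

def best_action(actions, utility):
    """Action whose lottery maximizes expected utility under the given beliefs."""
    return max(actions, key=lambda name: expected_utility(actions[name], utility))

# Hypothetical choice: a sure payoff of 50 vs. a 50/50 gamble between 120 and 0.
actions = {
    "safe": [(50, 1.0)],
    "risky": [(120, 0.5), (0, 0.5)],
}

def risk_neutral(x):
    return x  # utility linear in the payoff

def risk_averse(x):
    return math.sqrt(x)  # concave utility: each extra unit matters less

neutral_choice = best_action(actions, risk_neutral)
averse_choice = best_action(actions, risk_averse)
print(neutral_choice, averse_choice)
```

The same beliefs lead to different choices once the utility function changes, which is why the slides stress that assumptions about preferences have to be stated explicitly.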

Game Theory: Problems of Theoretical Indeterminacy and Intractability

• Common Solution concepts
– Static: Nash equilibrium, core, coalition-proof
– Dynamic: Subgame perfect Nash equilibrium

• Methods for deriving solutions
– Static: iterated dominance
– Dynamic: backwards induction
– Numerous games have no solution to any common equilibrium concept
– Numerous games have too many solutions.

• For infinitely iterated games just about any set of strategies can be portrayed as subgame perfect (Maskin and Fudenberg 1992)

• Games with continuous choice sets will often have an infinite number of solutions
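
The indeterminacy problem can be made concrete with two textbook 2x2 games. The sketch below enumerates pure-strategy Nash equilibria only (mixed strategies are deliberately left out, so "no equilibrium" here means none in pure strategies). The payoff-dictionary format and function name are assumptions for illustration.

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Pure-strategy Nash equilibria of a two-player game.

    payoffs maps (row_strategy, col_strategy) to (row_payoff, col_payoff).
    """
    rows = sorted({r for r, _ in payoffs})
    cols = sorted({c for _, c in payoffs})
    equilibria = []
    for r, c in product(rows, cols):
        u_row, u_col = payoffs[(r, c)]
        # Neither player can gain by deviating unilaterally.
        row_ok = all(payoffs[(alt, c)][0] <= u_row for alt in rows)
        col_ok = all(payoffs[(r, alt)][1] <= u_col for alt in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

# Matching pennies: no pure-strategy equilibrium.
matching_pennies = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
# Pure coordination: two equilibria, and the theory cannot say which is played.
coordination = {
    ("L", "L"): (1, 1), ("L", "R"): (0, 0),
    ("R", "L"): (0, 0), ("R", "R"): (1, 1),
}
none_found = pure_nash_equilibria(matching_pennies)
two_found = pure_nash_equilibria(coordination)
print(none_found, two_found)
```

The coordination game is the case where a tiebreaker such as a shared focal point is needed to select among equally valid solutions.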

Formal Models of Culture and Cultural Change

• How to guarantee a unique game solution? Culture
– Culture as both tiebreaker (focal point, template, toolkit) and game-changer (altruism, norm convergence)

• General Cultural Typologies
– Individualism/Collectivism ∈ Modern/Traditional
– Time Preferences and Risk Aversion
– Grid/Group Cultural Theory (Douglas, Wildavsky)
– Multidimensional Typologies (Parsons, Hofstede)

• General Models of Cultural Change
– Coherence Model (Chai)
  • Dissonance
  • Social Construction
  • Narrative Theories

What is the role of Rational/Intentional Choice in Online Analysis?

• Networks are dynamically changing entities, and these changes are the result of purposive human actions

• The effect of particular network configurations on nodes depends on the type of content being transferred via each link. Depending on this, different network measures are more likely to provide bridges to power, prestige, influence

• This means that we must model nodes as agents

• It also means that we must have some understanding of the ideas (and possibly goods) that are being espoused and transmitted through a network tie

• But understanding circumstances requires a richer set of data than network location: we must analyze content
– the main purpose of doing so is to extract beliefs and values in a systematic form
– in order to do this we must use methods that have been shown to be both valid and reliable

Social Science Content Analysis

• Formal, quantitative analysis of content dates back to the 1930s work of Harold D. Lasswell
– Focus on the nature of political communication (1935), and later, specifically on propaganda (1939)
– Specifically motivated by the rise of political extremism in the West and the skillful use of propaganda to spread ideas
– First to devise a systematic formal analysis that went beyond word frequency counts through the use of conceptual dictionaries
– Funded by LoC for “Wartime Communications Project” that content analyzed a large portion of all political speeches during period leading up to and during WWII
– “Lasswell” dictionary still in use as a measure of ideology

Other Milestones in Content Analysis History

• Payne Fund Studies (1928) – examined content of movies and effects on children’s attitudes and knowledge

• Victor Raimy (1948) – first automated affective (sentiment) analysis – conversation between counselor and client

• Robert Bales (1950) – interaction process analysis – ties with symbolic interaction

• Harold Garfinkel (1967) – conversation analysis in ethnomethodology

• Philip Stone (1966) – first general concept computerized text analysis – Harvard Third Psychosociological Dictionary

• Rick Holmes and Joe Woelfel (1982) – demonstration of content analysis without large mainframe – focus on communication theory

How Social Science Content Analysis Process Differs from Text Mining/Sentiment Analysis

• Content analysis is a general method for extracting social meaning from artifacts (texts, pictures, videos, physical goods)

• Theories of meaning are the backdrop for deriving latent (meaning) content from manifest (observable at the surface) symbols

• Even when automated content analysis is the goal, subjective coding is almost always one step of the process, but it should be guided by theory
– Whenever possible, subject matter experts are chosen for coding
– Accuracy of subjective encoding is checked by intercoder reliability
  • Scott's π, Cohen's κ, Krippendorff's α

• After the coding is compiled into a codebook or “dictionary”, it is applied across a wide sample of artifacts, measuring individual term and concept frequencies, then repeatedly testing for accuracy.
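
Of the intercoder reliability statistics listed above, Cohen's κ for two coders is the simplest to sketch: observed agreement corrected for the agreement expected by chance from each coder's label frequencies. The function name and the toy codings below are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    # Share of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected if each coder labeled independently
    # according to his or her own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment codes assigned by two subject matter experts.
coder_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 2))  # 0.5
```

Here the coders agree on 6 of 8 items (0.75), chance agreement is 0.5, and κ is therefore 0.5, noticeably lower than the raw agreement rate.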

Absolutely Essential to Ensure Social Representativeness of Your Data

• Defining the artifact: choose unit of analysis mapping most directly onto social phenomenon being modeled
– For the web, if we are looking to measure individual or group sentiments, the page is an inappropriate unit

• Determining appropriate study population
– Be selective in identifying only web sites that represent your target real-world population, but identify all that do so.
  • all of those who may mention the issue in passing?
  • members of virtual communities centering around the issue?

• Census or sample?
– If your population is very large, you may have to look at only a subset.
– If sample, what is your sampling frame?
– What is your sampling method – simple random, stratified, cluster, etc.?

• How do you correct for bias?
– Deposit and Survival bias: stratifying on bias characteristics
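
One way to read "stratifying on bias characteristics" is to draw the same fraction from each stratum of the sampling frame, so that no stratum (for example, older sites that happened to survive) dominates the sample. The sketch below assumes a hypothetical frame of (site, stratum) records; the field layout, function name, and age labels are illustrative only.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Draw the same fraction of units from every stratum of the frame."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Take at least one unit so small strata are never dropped entirely.
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical sampling frame: 20 long-lived sites and 80 recent ones.
frame = [(f"site{i}", "old" if i < 20 else "new") for i in range(100)]
sample = stratified_sample(frame, stratum_of=lambda unit: unit[1], fraction=0.1)
old_count = sum(1 for _, age in sample if age == "old")
new_count = sum(1 for _, age in sample if age == "new")
print(old_count, new_count)
```

A simple random sample of ten sites could by chance contain no old sites at all; the stratified draw fixes the composition in advance.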

Steps to Testing the Reliability and Validity of an Automated Content Analysis Dictionary

• After a dictionary is applied across a large sample within your study population, you can statistically examine the validity of word patterns chosen for a particular higher-level social construct.
– Criterion validity and construct validity measured on frequencies: e.g. Pearson’s r, Spearman’s ρ, Cronbach’s α

• Attitudes measured through content analysis can generate predictions of actions through the general cultural frameworks and intentional action theories mentioned before.
– These predictions may be about virtual individual behavior, i.e. building of networks and changing content
– They may be about the terrestrial collective behavior of the groups represented by the sites

• The action predictions can be tested in at least two major ways:
– Retrospective testing against existing datasets corresponding to the study population or terrestrial counterpart.
– Experimental testing in simulated environments, with incentive-outcome link designed to correspond to that in modeled environments.
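
As one example of the statistics named above, Cronbach's α asks whether the terms assigned to a construct rise and fall together across documents. The sketch below treats each dictionary term as an "item" and each document as an observation; the frequencies are invented, and the function name is an assumption.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of per-document scores."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per document across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance / variance(totals))

# Hypothetical frequencies of three "anxiety" terms in five documents.
term_frequencies = [
    [0, 2, 3, 5, 6],   # "worried"
    [1, 2, 4, 4, 7],   # "afraid"
    [0, 1, 3, 6, 5],   # "nervous"
]
alpha = cronbach_alpha(term_frequencies)
print(round(alpha, 3))
```

Values near 1 suggest the terms behave as a single scale; a low value would be a reason to revisit which terms belong to the construct.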

Bottom Line

• Social Science Content Analysis Methodology is distinctive because:
– Selection of concepts is guided by established general attitudinal theories and frameworks.
– Coding schemes are designed to match subjective opinions of human SMEs.
– Results of multi-SME coding are tested for intercoder reliability and thus refined.
– Data collection rules ensure analysis is applied to appropriate units within an appropriate population.
– Results of analysis are tested using internal measures of criterion and construct validity.
– Models of intentional action are applied to extracted attitudes, leading to behavioral predictions that are tested against retrospective or experimental data.

Qualitative Content Analysis

• Is usually not designed to count frequency of concepts or ideas in texts.

• Labeling of texts is initially done subjectively, typically by one individual, then computer aided “decision-support”

• Since there is no desire to compile an objective codebook, long passages can be “tagged” with conceptual labels

• Often, the purpose is to organize the texts for the author rather than to generate data that will directly be subject to analysis.

• Qualitative research in general is often aimed not at hypothesis testing or prediction, but at developing valid taxonomies and tagging rules

Software for Content Analysis

• Quantitative (many offerings, usually with built-in dictionaries)
– General Inquirer (Stone, loading on 100 basic concepts) <http://www.wjh.harvard.edu/~inquirer/>
– CATPAC (Woelfel, main ideas) <http://www.terraresearch.com/>
– Linguistic Inquiry and Word Count (psychological state) <http://www.liwc.net>
– KEDS/Tabari (events) <http://web.ku.edu/keds/>

• Intercoder Reliability
– PRAM, AGREE, ReCal

• Qualitative (two dominant products)
– NVivo <http://www.qsr-software.com/>
– ATLAS/ti <http://www.atlasti.de/>

Organization

• Social Science methods for web mining (70 min.) (Sun-Ki Chai)

• Differences between social science and computer science approaches (10 min.) (David Chin)

• A crawler example from our project (20 min.) (Aaron Herres, Kar-Hai Chu, David Chin)

• Discussion (20 min.) (Everyone!)

Data or Principles?

• Social scientists prefer principles
– Data in social sciences often hard to obtain
– Data usually affected by multiple factors
– Curve fitting with no clear theory is disparaged

• Computer scientists prefer data
– Measurement culture
– Theory should emerge from data
– Unverifiable theories are disparaged

Examples

• Economics: rational agent theory
– People act to maximize utility
– Contradicting data handled by tweaking theory

• Content Analysis
– LIWC (psychology)
  • Starts with dictionary from experts
  • Then data is used to validate/adjust dictionary
– Sentiment analysis (computer science)
  • Starts with human-labeled data for machine learning

Examples (cont.)

• Boundary of a virtual community
– Social scientists look for theoretical basis
  • Data is used to validate theory
– Computer scientists look for data
  • Theory explains data

Finished or Just Starting?

• Theory development is an end product for social science

• Complete theory is a starting point for computer scientists

• Example
– Requirements document must contain very complete specifications (from a theory)

Organization

• Social Science methods for web mining (70 min.) (Sun-Ki Chai)

• Differences between social science and computer science approaches (10 min.) (David Chin & Scott Robertson)

• A crawler example from our project (20 min.) (Aaron Herres, Kar-Hai Chu, David Chin)

• Discussion (20 min.) (Everyone!)

Example: Applying Social Science Methods

• Social Science Web Crawler

• Crawler integrates a wide range of validated social science theories on social networks, language, attitudes, culture, and behavior.

Starting Out: Locating the Right Virtual Community

• User enters a few “seed” sites to start the exploration.

• Control exactly how many sites to look for, how deep to go into each site, or select one of our pre-made profiles

The System at Work: Discovering a Virtual Community

• Our interface provides real-time feedback as it explores the web, including visual map and listing of the virtual community as it grows.

• System allows users to halt processing at any time, save the state, and resume at a later point.

Community Crawler

• Site level unit of analysis
– all crawls and calculations are performed at the site level (user defined)
– corresponds more closely to human usage and sense of virtual community

• Combining link and content analyses
– compare against content analysis for queries

• Using existing, validated content analysis “dictionaries” for:
– Sentiment
– Ideology
– Event analysis
– Emotional health
– Behavioral tendencies
– Etc.

Content analysis

• User defined combination of site level counts for:
– term frequencies (how many times a term appears); this can also include more abstract concept frequencies using dictionaries
– document frequencies (how many documents a term appears in) to generate term vectors which represent each site

• Cosine similarity to measure the closeness between existing community sites and a potential candidate site.

• Similarity(community, candidate) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

• Vectors A and B are based on user defined term weights (e.g. TF x IDF model) for each term
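
The similarity computation above can be sketched as follows. This is not the project's crawler code: the whitespace tokenizer, the particular TF x IDF weighting (raw term count times log of sites over document frequency), and the toy site texts are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(site_texts):
    """One sparse TF x IDF vector (term -> weight) per site."""
    counts = {site: Counter(text.lower().split()) for site, text in site_texts.items()}
    n_sites = len(counts)
    # Document frequency: number of sites in which each term appears.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        site: {term: tf * math.log(n_sites / doc_freq[term]) for term, tf in c.items()}
        for site, c in counts.items()
    }

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical site texts: one community site and two candidate sites.
vectors = tf_idf_vectors({
    "community": "forum autism support parents forum",
    "candidate": "autism parents support group",
    "unrelated": "car engine repair forum",
})
related = cosine_similarity(vectors["community"], vectors["candidate"])
unrelated = cosine_similarity(vectors["community"], vectors["unrelated"])
print(related > unrelated)
```

A candidate whose score exceeds a user-chosen threshold would be a plausible addition to the community; the threshold itself is a modeling decision the slides leave to the user.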

[Figure: community map showing seed sites, community sites, candidate sites, and a single page; candidate sites are labeled LINKS=14 CONTENT=0.89, LINKS=10 CONTENT=0.65, and LINKS=8 CONTENT=0.78]

Specialized Exploration: The Forum Analyzer

• Forum
– A message board or online discussion site

• Forum Analyzer
– Measures activity level
– Estimates the strength of community
– Detects opinion and sentiment

[Figure: bar chart of Mean Reply Depth and Participation Rate for forums SF1, SF2, Buick, Civic, Mac, BrainTalk, Autism]

Community Metrics in Forums

• Web metrics
– Standard web traffic statistics and data

• Community metrics
– Member interaction and communication structure
– Response time, mean reply depth, active participation rate
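
The slides do not define these community metrics precisely, so the sketch below adopts working definitions that are assumptions: reply depth is the number of reply hops from a post back to the thread's opening post, and active participation rate is the share of registered members who posted at least once. The post records and member count are invented.

```python
def reply_depths(posts):
    """Depth of every reply, where a thread's opening post has depth 0."""
    parent = {post["id"]: post["parent"] for post in posts}
    depths = {}

    def depth(post_id):
        if post_id not in depths:
            up = parent[post_id]
            depths[post_id] = 0 if up is None else depth(up) + 1
        return depths[post_id]

    # Opening posts are excluded: only replies count toward the mean.
    return [depth(post_id) for post_id in parent if parent[post_id] is not None]

def mean_reply_depth(posts):
    depths = reply_depths(posts)
    return sum(depths) / len(depths) if depths else 0.0

def participation_rate(posts, member_count):
    """Share of registered members who wrote at least one post."""
    return len({post["author"] for post in posts}) / member_count

# Hypothetical thread: one opening post and three replies, in a forum of 6 members.
thread = [
    {"id": 1, "parent": None, "author": "ann"},
    {"id": 2, "parent": 1, "author": "bob"},
    {"id": 3, "parent": 2, "author": "ann"},
    {"id": 4, "parent": 1, "author": "cy"},
]
depth_score = mean_reply_depth(thread)
rate = participation_rate(thread, member_count=6)
print(round(depth_score, 2), rate)
```

Deeper reply chains indicate back-and-forth conversation rather than isolated answers, which is one reason such a measure can serve as an indicator of community strength.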

Content Analysis: Contrast Between Different Kinds of Forums

• Pronoun usage
– I, we, they

• Emotions
– neg emotion, anxiety, sadness, anger

[Figure: bar chart of pronoun usage (I, we, they) for forums SF1, SF2, DSM1, Civic, Flu, Lung]

[Figure: bar chart of negative emotion, anxiety, anger, and sadness for forums SF1, SF2, DSM1, Civic, Flu, Lung]

Analysis of Member Communication Networks in Forums

Centralized vs. Distributed

Application: Forum Analyzer

What The Technology Provides

Human behavior data based on established social science theories and metrics.

These can be freely composed, weighted, and added together in an intuitive way.

Content and network information combined seamlessly to provide the most accurate answers.

Bibliography

Bavelas, A. 1948. A Mathematical Model for Group Structure. Applied Anthropology 7, 16-30.

Bavelas, A. 1950. Communication Patterns in Task-oriented Groups. Journal of the Acoustical Society of America 57, 271-282.

Blau, Peter. 1964. Exchange and Power in Social Life. New York: Wiley.

Bonacich, P. 1987. Power and Centrality: A Family of Measures. American Journal of Sociology 92, 1170-1182.

Brin, S., Page, L. 1998. The Anatomy of a Large-scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30, 107-117.

Burt, R.S. 1995. Structural Holes: The Social Structure of Competition. Cambridge, Mass.: Harvard University Press.

Burt, R.S. 1977. Power in a Social Topology. Social Science Research 6(1), 1-83.

Chai, S. and Liu, M. Confucian Capitalism and the Paradox of Closure and Structural Holes in East Asian Firms.

Coleman, J. S. 1990. Foundations of Social Theory. Cambridge, Mass.: Belknap Press of Harvard University Press.

Emerson, Richard M. 1987. Towards a Theory of Value in Social Exchange. In Cook and Levi (eds.) Social Exchange Theory. Harvard: Harvard University Press. 11-47.

Freeman, L.C. 1979. Centrality in Networks: Conceptual Clarification. Social Networks 1, 215-39.

Freeman, L.C., Borgatti, S.P., White, D.R. 1991. Centrality in Valued Graphs: A Measure of Betweenness Based on Network Flow. Social Networks 13, 141-15.

Bibliography pt. 2

Galaskiewicz, J. and R.S. Burt. 1991. Interorganization Contagion in Corporate Philanthropy. Administrative Science Quarterly 36.

Garfinkel, H. 1967. Studies in Ethnomethodology. Englewood Cliffs, NJ: Prentice-Hall.

Heider, F. 1958. The Psychology of Interpersonal Relations. New York: John Wiley & Sons.

Homans, G. 1958. Social Behavior as Exchange. American Sociological Review 83, 1-11.

Homans, G. 1961. Social Behavior: Its Elementary Forms. New York: Harcourt Brace and World.

Hummon, N.P. and P. Doreian. 1989. Connectivity in a Citation Network: The Development of DNA Theory, 39-63.

Podolny, J. 2005. Status Signals: A Sociological Study of Market Competition. Princeton University Press.

Lasswell, H.D. 1935. Politics: Who Gets What, When, How. Princeton, NJ: Princeton University Press.

Lasswell, H.D. 1939. Propaganda: A Chicago Study. New York: Knopf.

Moreno, J. L. 1951. Sociometry, Experimental Method and the Science of Society: An Approach to a New Political Orientation. Beacon House, Beacon, New York.

Parke, R.D., Simpkins, S.D., McDowell, D.J., Kim, M., Killian, C., Dennis, J., Flyr, M.L., Wild, M., & Rah, Y. 2002. Relative contributions of families and peers to children's social development. In P.K. Smith & C.H. Hart (Eds.), Blackwell Handbook of Childhood Social Development: 156-177. New York: Blackwell Publishers.

Bibliography pt. 3

Pennebaker, J. W., Francis, M.E., Booth, R.J. 2001. Linguistic Inquiry and Word Count (LIWC): LIWC2001. Mahwah: Lawrence Erlbaum Associates.

Simmel, G. 1964 [1908]. The Sociology of Georg Simmel. Free Press.

Stephenson, K.A. and Zelen, M. 1989. Rethinking Centrality: Methods and Examples. Social Networks 11, 1-37.

Stone, P.J. 1963. A computer approach to content analysis: studies using the General Inquirer system. Proceedings of May 21-23, 1963 spring joint AEA conference.

Raimy, V.J. 1987. Self-Reference in Counseling Interviews in 1967 Studies in Ethnomethodology. Englewood Cliffs, NJ: Prentice-Hall.

Woelfel, J., Cody, M. J., Gillham, J., & Holmes, R. A. 1980. Basic premises of multidimensional attitude change theory: An experimental analysis. Human Communication Research 6(2), 153-167.

Xiao, Z. and Tsui, A.S. 2007. When brokers may not work: The cultural contingency of social capital in Chinese high-tech firms. Administrative Science Quarterly 52(1), 1-31.

