
Date post: 30-Dec-2015
Upload: sharon-bishop

Digging Deep for Hidden Information in the Web

Part 1: Automated blog analysis
Part 2: Automated hyperlink analysis

Part 1 Automated Blog Analysis

Analysing Public Science Debates through Blogs and Online News Sources

Part 1 Contents

Background
  Blogs
  Online news sources
  RSS
Tracking public science debates
Detecting public science debates

Background

Blogs, public opinion, online news, RSS

Background

There are millions of bloggers
Bloggers are almost normal human beings
Automatically tracking bloggers’ postings may give insights into public opinion

Blog tracking companies

IBM WebFountain
Intelliseek BlogPulse
  “Monitor, measure and leverage consumer-generated media”
Others growing…

RSS Format

Rich Site Summary / Really Simple Syndication
  XML technology
  Used for frequently updated information sources (blogs, news, academic journals)
RSS Readers
  Users subscribe to the RSS feeds of favourite blogs/sites/journals/searches
  Notified when updates available
  User-controlled ‘push’ technology
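To make the RSS format concrete, here is a minimal sketch of what a reader or monitor does with a feed: it parses the XML and pulls out each item's title, link and publication date. The sample feed, its URLs and the function name `parse_items` are invented for illustration; real feeds vary (RSS 1.0 and Atom use different element names), so treat this as a sketch for a simple RSS 2.0 document only.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed (hypothetical content and URLs).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example science blog</title>
    <item>
      <title>Stem cell research update</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Mon, 06 Jun 2005 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>GM food in the news</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Tue, 07 Jun 2005 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item> element with its title, link and date."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


for entry in parse_items(SAMPLE_FEED):
    print(entry["published"], "|", entry["title"])
```

A monitor would fetch each subscribed feed periodically and store only the items it has not seen before.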

Tracking Public Science Debates

Blog keyword searches

Technorati
  “Searches weblogs by keyword and for links”
  Stem cell research
Blogdigger
  stem cell research
IceRocket
  Allows advanced searches
  Allows genuine date range search (Google only allows “last updated” date range searches)

Track evolution over time

What is changing about interest in stem cell research/GM food?
Are experts good at identifying changes in public interest?
How can experts be sure/can they be supported with quantitative information?
Can blogs be used to generate time series reflecting changes in “public interest”?
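One way to picture the time series asked about above is the daily share of blog posts that mention a topic. The sketch below assumes posts are available as (date, text) pairs; the sample posts and the function name `keyword_time_series` are invented for illustration and do not come from any of the tools named in the slides.

```python
from collections import Counter
from datetime import date


def keyword_time_series(posts, keyword):
    """Return {day: fraction of that day's posts that mention keyword}."""
    totals = Counter()
    hits = Counter()
    needle = keyword.lower()
    for day, text in posts:
        totals[day] += 1
        if needle in text.lower():
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in sorted(totals)}


# Hypothetical sample data.
posts = [
    (date(2005, 6, 1), "Thoughts on stem cell research funding"),
    (date(2005, 6, 1), "My holiday photos"),
    (date(2005, 6, 2), "Stem cell research in the news again"),
    (date(2005, 6, 2), "More on stem cell research ethics"),
]
series = keyword_time_series(posts, "stem cell research")
for day, share in series.items():
    print(day.isoformat(), f"{share:.0%}")
```

Using a proportion rather than a raw count keeps the series comparable when the number of posts collected per day changes.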

Free science debate graphs

Solves the trend identification problem?
Blogpulse
  Offers free automatic blog searches and keyword-generated click-search graphs
  Stem cell research
  GM food
  Mobile phone radiation

Research graphs

Time-consuming to collect data
Give control over the data source

Detecting Public Science Debates

How to detect a new debate?

Heuristic methods
  E.g. read papers, scan relevant blogs
Automatic methods
  E.g. look for sudden increase in usage of science-related words in blogs?
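The automatic approach can be sketched as follows: for every word, compute the share of each day's posts that contain it, then score the word by its largest jump above the average of all earlier days. This scoring rule is an assumption chosen for illustration, not the algorithm of any tool named in these slides, and the sample posts are invented.

```python
import re
from collections import defaultdict


def daily_word_proportions(posts_by_day):
    """posts_by_day: one list of post texts per day, in date order.

    Returns {word: [share of day 0 posts with word, share of day 1 posts, ...]}.
    """
    counts = defaultdict(lambda: [0] * len(posts_by_day))
    for i, posts in enumerate(posts_by_day):
        for text in posts:
            # set() so a word counts once per post, however often it is repeated
            for word in set(re.findall(r"[a-z]+", text.lower())):
                counts[word][i] += 1
    return {
        word: [c / len(posts) if posts else 0.0
               for c, posts in zip(per_day, posts_by_day)]
        for word, per_day in counts.items()
    }


def max_daily_increase(values):
    """Largest rise of one day's value above the mean of all earlier days."""
    best = 0.0
    for i in range(1, len(values)):
        baseline = sum(values[:i]) / i
        best = max(best, values[i] - baseline)
    return best


def hot_words(posts_by_day, top_n=5):
    """Words ranked by their biggest single-day increase in usage."""
    series = daily_word_proportions(posts_by_day)
    scored = {word: max_daily_increase(v) for word, v in series.items()}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sample: two quiet days, then a burst.
days = [
    ["the weather is nice", "a quiet day"],
    ["the garden is nice", "a slow day"],
    ["hurricane warning issued", "the hurricane is near"],
]
print(hot_words(days, top_n=3))
```

A ranked list like this still needs manual scanning, because words can surge for reasons unrelated to any debate.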

Free hot topic searches

Blog keyword search (sort by date)
  Technorati
    “Searches weblogs by keyword and for links”
    Stem cell research
  Blogdigger blog search
Hot topic searches
  Blogdex – top contagious information
  Bloglines – today’s hot topics (most popular links)
Searches find the really big science debates?

Specialist research tools

Commercial software
  Intelliseek/IBM
Mozdeh RSS monitor
  Generates sub-collections
  Generates word time series
  Allows keyword searches
  Identifies hot topics

Mozdeh Science Concern Corpus

A collection of blog postings containing a fear word AND a science word
Trend detection used to identify hot “science fear” topics
Data cleaning to remove spam
Need manual scanning of list of words experiencing biggest usage increase
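The corpus rule (a fear word AND a science word) can be sketched as a simple filter. The two word lists below are short illustrative placeholders, not the lists used for the actual corpus, and the function names are hypothetical.

```python
import re

# Illustrative placeholder lists only; the real study's lists are not given here.
FEAR_WORDS = {"fear", "worry", "risk", "danger", "concern"}
SCIENCE_WORDS = {"science", "research", "scientist", "gene", "cell"}


def tokens(text):
    """Lower-cased set of alphabetic words in a posting."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_science_concern(text):
    """True if the posting has at least one fear word AND one science word."""
    words = tokens(text)
    return bool(words & FEAR_WORDS) and bool(words & SCIENCE_WORDS)


def build_corpus(postings):
    """Keep only postings that satisfy the fear-AND-science rule."""
    return [p for p in postings if is_science_concern(p)]


# Hypothetical sample postings.
postings = [
    "I fear where stem cell research is heading",
    "Great research results today",
    "I worry about the weather",
]
print(build_corpus(postings))
```

Exact word matching like this misses inflected forms such as "fears", so a real filter would probably need stemming or longer lists.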

Classification of top 5 words

Word        Max. daily increase (feeds)   Classification
stem        19%                           Science fear (stem cell research)
orlean      16%                           Information (about hurricane)
hurricane   16%                           Duplicate of ‘orlean’
katrina     15%                           Duplicate of ‘orlean’
june        14%                           Temporal descriptor

Classification of top 200 words

[Bar chart: number of the top 200 words in each category, axis from 0 to 80. Categories: Fear of Science, Information, Progress, Threat Prediction, Other, Duplicate, Temporal Descriptor, Random.]

Hot science fear words

7.5% of top 200 words represent new public fears of science stories
E.g. new medical cure
The words come from multiple stories

Unexpected results?

Social science research
  Sudden burst of discussion over fears of the economic theories of Karl Rove, an influential advisor to George Bush
Computer security
  Concern over spyware features in a software vendor’s products
  Research showing that consumers’ PIN numbers could be revealed by poor printing

Conclusions

Many free tools support exploration of Consumer Generated Media
Also room for specialist research tools

References

http://www.blogpulse.com/
http://www.blogpulse.com/www2006-workshop/
http://www.creen.org/

Thelwall, M., Prabowo, R. & Fairclough, R. (2006, to appear). Are raw RSS feeds suitable for broad issue scanning? A science concern case study. Journal of the American Society for Information Science and Technology.

Acknowledgement

The work was supported by a European Union grant for activity code NEST-2003-Path-1. It is part of the CREEN project (Critical Events in Evolving Networks, contract 012684, http://www.creen.org/)

Part 2: Automated hyperlink analysis

Link analysis as a social science technique

Link Analysis Manifesto

Links are:
  A wonderful new source of information about relationships between people, organisations and information
  An easy to collect data source
But:
  Results should be interpreted with care

Part 2 Contents

Academic link analysis, mainly from an information science perspective
A general social science link analysis methodology
Commercial applications

Why Count Links?

Individual hyperlinks may reflect connections between web page contents or creators
Counts of large numbers of hyperlinks may reflect wider underlying social processes
Links may reflect phenomena that have previously been difficult to study
  E.g. informal scholarly communication

Why Count University Links?

To map patterns of communication between researchers in a country
Which universities collaborate a lot?
Which universities collaborate with government or industry?
Which universities are using the web effectively?

Counting links

Search engines will count them for you!
Yahoo! advanced queries, e.g. links from Wolves Uni. to Oxford Uni. or back
  domain:ox.ac.uk AND linkdomain:wlv.ac.uk
  domain:wlv.ac.uk AND linkdomain:ox.ac.uk
Google link queries
  Find links to specific URLs, e.g. links to the University home page
  link:www.wlv.ac.uk
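For a study covering many universities, the pairwise queries shown above can be generated rather than typed. The sketch below only assembles query strings in the syntax on the slide; it does not contact any search engine, and whether a given engine still accepts these operators is not something this sketch can confirm. The function names are hypothetical, and the reading of the operators (domain: restricts where the linking pages are, linkdomain: names the site being linked to) is an assumption.

```python
from itertools import permutations


def interlink_query(source_domain, target_domain):
    """Query string for pages in source_domain that link to target_domain.

    Uses the 'domain:' and 'linkdomain:' operators exactly as written on the slide.
    """
    return f"domain:{source_domain} AND linkdomain:{target_domain}"


def all_pairwise_queries(domains):
    """One query per ordered (source, target) pair of distinct domains."""
    return {
        (src, dst): interlink_query(src, dst)
        for src, dst in permutations(domains, 2)
    }


queries = all_pairwise_queries(["wlv.ac.uk", "ox.ac.uk"])
for pair, query in queries.items():
    print(pair, "->", query)
```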

Counting links

Can use a special purpose web crawler or robot
  Visits all the pages in a web site
  Counts the links in the site
  Can use “advanced” counting methods
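The counting step of such a crawler can be sketched with the Python standard library. The example assumes the pages have already been downloaded into a dictionary of URL to HTML, so it does no network access. The class and function names are hypothetical, and counting by target host is just one possible counting method.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def count_outlinks_by_host(pages):
    """pages: {page_url: html}. Count links pointing to other hosts."""
    counts = Counter()
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        source_host = urlparse(url).netloc
        for link in parser.links:
            target_host = urlparse(link).netloc
            # Skip internal links and non-web links such as mailto:.
            if target_host and target_host != source_host:
                counts[target_host] += 1
    return counts


# Hypothetical downloaded page.
pages = {
    "http://www.wlv.ac.uk/index.html": (
        '<a href="http://www.ox.ac.uk/">Oxford</a> '
        '<a href="/about.html">About us</a> '
        '<a href="mailto:info@wlv.ac.uk">Contact</a> '
        '<a href="http://www.ox.ac.uk/research">Oxford research</a>'
    ),
}
print(count_outlinks_by_host(pages))
```

Counting each linking page or each linking site only once, rather than every individual link, would reduce the influence of links replicated across many pages.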

Some Inter-University Hyperlink Patterns

Mainly for the UK and Europe

Links to UK universities against their research productivity

The reason for the strong correlation is the quantity of Web publication, not its quality

This is different to citation analysis

Most links are only loosely related to research

90% of links between UK university sites have some connection with scholarly activity, including teaching and research
  But less than 1% are equivalent to citations
So link counts do not measure research dissemination but are more a natural by-product of scholarly activity
  Cannot use link counts to assess research
  Can use link counts to track an aspect of communication

UK universities tend to link to their neighbours

Universities cluster geographically

Language is a factor in international interlinking

English is the dominant language for Web sites in the Western EU
In a typical country, 50% of pages are in the national language(s) and 50% in English
Non-English speaking countries extensively interlink in English

[Bar chart: total university Web pages by language, axis from 0 to 14,000,000. Languages: English, German, Spanish, Swedish, Dutch, Greek, French, Italian, Norwegian, Finnish, Portuguese, Danish, Others. Data labels: 12,379,256; 2,888,072; 1,094,442; 1,008,353; 962,092; 941,420; 885,432; 488,172; 458,961; 444,974; 172,804; 86,107; 328,644.]

[Stacked bar chart: university Web page languages (0% to 100%) by country: fr, it, de, es, gr, no, nl, pt, ch, be, dk, at, se, uk, ie, fi. Legend: Others, French, Dutch, Swedish, German, English.]

Patterns of international communication

Counts of links between EU universities in Swedish are represented by arrow thickness.

Counts of links between EU universities in French are represented by arrow thickness.

Which language???

Which language???

Which language?

Who is isolated?

International link patterns

The next slide is a (Kamada-Kawai) network of the interlinking of the “top” 5 universities in AEAN countries (Asia and Europe) with arrows representing at least 100 links and universities not connected removed.

The rich get richer on the web

Link creation obeys the ‘rich get richer’ law
  Sites which already have a lot of links attract the most new links
  Some sites have a huge number of links: most have one or none
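A small simulation can illustrate how a ‘rich get richer’ rule produces a few heavily linked sites and many with almost no links. The model below is an assumption for illustration only: each new link picks a target with probability proportional to the target's current link count plus one. It is not fitted to any of the data in these slides, and the function name is hypothetical.

```python
import random


def simulate_link_creation(n_sites, n_links, seed=0):
    """Add n_links one at a time; popular sites are more likely to be chosen.

    Returns a list with the number of inlinks each site ends up with.
    """
    rng = random.Random(seed)
    inlinks = [0] * n_sites
    # Each site appears once to start with, then once more per link received,
    # so a uniform draw from 'urn' is proportional to (inlinks + 1).
    urn = list(range(n_sites))
    for _ in range(n_links):
        target = rng.choice(urn)
        inlinks[target] += 1
        urn.append(target)
    return inlinks


counts = simulate_link_creation(n_sites=200, n_links=2000)
ranked = sorted(counts, reverse=True)
print("five most linked sites:", ranked[:5])
print("median site:", ranked[len(ranked) // 2])
print("sites with at most one link:", sum(1 for c in counts if c <= 1))
```

Comparing the most linked sites with the median site shows how uneven the outcome is, even though every site started out equal.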

Rich get richer example: Links from Australian university pages

The anomalies are also interesting

Part 3: A General Social Science Link Analysis Methodology

A general framework for using link counts in social sciences research
  For research into link creation or
  Together with other sources, for research into other online or offline phenomena
Applicable when there are enough links relevant to the research question to count
  For collections of large web sites or
  For large collections of small web sites

Nine stages for a research project

1. Formulate an appropriate research question, taking into account existing knowledge of web structure

2. Conduct a pilot study

3. Identify web pages or sites that are appropriate to address the research question

4. Collect link data from a commercial search engine or a personal crawler, taking appropriate accuracy safeguards

5. Apply data cleansing techniques to the links, if possible, and select an appropriate counting method

6. Partially validate the link count results through correlation tests, if possible

7. Partially validate the interpretation of the results through a link classification exercise

8. Report results with an interpretation consistent with link classification exercise, including either a detailed description of the classification or exemplars to illustrate the categories

9. Report the limitations of the study and parameters used in data collection and processing
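Stage 6 above (partial validation through correlation tests) can be sketched as a rank correlation between link counts and some independent measure, such as the research productivity figures mentioned earlier. Spearman's coefficient is chosen here as an assumption because link counts tend to be highly skewed; the slides do not say which test was used. The numbers in the example are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical figures for five universities.
inlink_counts = [120, 4500, 800, 15000, 950]
research_scores = [2.1, 5.5, 3.0, 6.8, 4.2]
print(round(spearman(inlink_counts, research_scores), 3))
```

A high coefficient supports only the claim that the two measures move together; the link classification exercise in stage 7 is still needed to say what the links mean.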

The theoretical perspective for link counting

In order to be able to reliably interpret link counts, all links should be created individually and independently, by humans, through equivalent gravity judgments (e.g., about the quality of the information in the target page).

Additionally, links to a site should target pages created by the site owner or somebody else closely associated with the site.

Commercial applications

Of link analysis

Commercial applications

Find out who links to your web site
  More links mean more visitors
  Check if your web site is being recognised
Find out who isn’t linking to your site
  But is linking to a competitor’s web site!
  Gives ideas about where to get new customers or links from
Takes an hour of advanced searches
  Simple but very valuable!
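Once the two lists of linking sites have been collected, the "who links to a competitor but not to you" comparison is a set difference. The domain names below are invented placeholders, the function name is hypothetical, and the lists are assumed to have been gathered by hand from search results.

```python
def link_gap(links_to_us, links_to_competitor):
    """Sites that link to the competitor but not to us, in alphabetical order."""
    return sorted(set(links_to_competitor) - set(links_to_us))


# Hypothetical lists of linking sites.
links_to_us = ["directory.example", "news.example"]
links_to_competitor = ["directory.example", "review.example", "trade-body.example"]

for site in link_gap(links_to_us, links_to_competitor):
    print("possible new link source:", site)
```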

Conclusion

There is a lot of hidden information in the web: in blogs and hyperlinks

Co-authors

Ray Binns, Viv Cothey, Ruth Fairclough, Gareth Harries, Xuemei Li, Peter Musgrove, Teresa Page-Kennedy, Nigel Payne, Rudy Prabowo, Liz Price, David Stuart, David Wilkinson, Alesia Zuccala. University of Wolverhampton.
Rong Tang. Catholic University of America.
Han-Woo Park. YeungNam University, South Korea.
Paul Wouters, Andrea Scharnhorst. The Virtual Knowledge Studio for the Humanities and Social Sciences, Amsterdam, The Netherlands.

