Page 1:

Relevance and Quality of Health Information on the Web

Tim Tang
DCS Seminar

October, 2005

Page 2:

Outline

• Motivation and aims
• Experiments & results
  – Domain-specific vs. general search
  – A quality-focused crawler
• Conclusion & future work

Page 3:

Why health information on the Web?

• The Internet is a free publishing medium
• High user demand for health information
• Health information varies widely in quality
• Incorrect health advice is dangerous

Page 4:

Problems

• Normal definition of relevance: topical relevance
• Normal way to search: word matching

Q: Are these sufficient for health information?

A: Not completely; we also need quality, i.e., the usefulness of the information.

Page 5:

Problem: Quality of health info

Search results contain health information of widely varying quality.

Page 6:

Wrong advice

Page 7:

Dangerous information

Page 8:

Dangerous information

Page 9:

Dangerous Information

Page 10:

Problem: Commercial sites

Health information for commercial purposes

Page 11:

Commercial promotion

Page 12:

Problem: Types of search engine

The difference between domain-specific search and general-purpose search.

Page 13:

Querying BPS

Page 14:

Querying Google: Irrelevant information

Page 15:

Problem of domain-specific portals

Domain-specific portals may be good, but …

They often require intensive effort to build and maintain (discussed further in Experiment 2).

Page 16:

Aims

• To analyse the relative performance of domain-specific and general-purpose search engines

• To discover how to provide effective domain-specific search, particularly in the health domain

• To automate the quality assessment of medical web sites

Page 17:

Two experiments

• First: compare search results for health information between general and domain-specific engines

• Second: build and evaluate a quality-focused crawler for a health topic

Page 18:

The First Experiment

A comparison of the relative performance of general-purpose search engines and domain-specific search engines

In Journal of Information Retrieval ‘05 – Special Issue

with Nick Craswell, Dave Hawking, Kathy Griffiths and Helen Christensen

Page 19:

Domain specific vs. General engines

• General search engines: Google, Yahoo, MSN search, …

• Domain-specific engines: search services for scientific papers, for health, or for a single topic within the health domain

• A depression portal: BluePages (http://bluepages.anu.edu.au)

Page 20:

BluePages Search (BPS)

Page 21:

BPS result list

Page 22:

Engines

– Google
– GoogleD (Google with “depression”)
– BPS
– 4sites (4 high-quality depression sites)
– HealthFinder (HF): a health portal search service
– HealthFinderD (HFD): HF with “depression”

Page 23:

Queries

• 101 queries about depression:
  – 50 treatment queries suggested by domain experts
  – 51 non-treatment queries collected from 2 query logs: a domain-specific query log and a general query log
• Examples:
  – Treatment queries: acupuncture, antidepressant, chocolate
  – Non-treatment queries: depression symptoms, clinical depression

Page 24:

Experiment details

• Run the 101 queries on the 6 engines.
• For each query, collect the top 10 results from each engine.
• All results were judged by research assistants: degree of relevance, and whether the advice is recommended.
• Relevance and quality were then compared across all engines.

Page 25:

Results

Engine    Relevance  Quality
GoogleD   0.407      78
BPS       0.319      127
4sites    0.225      143
Google    0.195      28
HFS       0.0756     0

Page 26:

Findings

• Google is not good in either relevance or quality.
• GoogleD retrieves more relevant pages, but fewer high-quality pages.
• 4sites and BPS provide good quality but have poor coverage.

It is important to have a domain-specific portal that provides both high quality and high coverage. How can coverage be improved?

Page 27:

Experiment 2

Building a high-quality domain-specific portal using focused crawling techniques

In CIKM ’05

with Dave Hawking, Nick Craswell, Kathy Griffiths

Page 28:

A Quality Focused Crawler

• Why?
  – The first experiment shows that quality can be achieved using domain-specific portals.
  – The current method for building such a portal is expensive.
  – Focused crawling may be a good way to build a health portal with high coverage while reducing human effort.

Page 29:

The problems of BPS

• Domain experts spent two weeks manually judging health sites to decide what to include.
• Only 207 web sites are included, i.e., many useful web pages are left out.
• Maintenance is tedious: web pages change, cease to exist, new pages appear, etc.
• Also, the first experiment showed high quality but quite low coverage.

Page 30:

Focused Crawling (FC)

• Designed to selectively fetch content relevant to a specified topic of interest using the Web’s hyperlink structure.

• Examples of topics: sport, health, cancer, scientific papers, etc.

Page 31:

FC Process

[Diagram: the focused-crawling loop. URLs are dequeued from the URL frontier and their pages downloaded; the link extractor emits {URLs, link info}; the classifier turns these into {URLs, scores}, which are enqueued back into the frontier.]

Link info = anchor text, URL, source page’s content, and so on.
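To make the loop concrete, here is a minimal sketch in Python. It is our illustration, not the author's code: fetch_page, extract_links, and score_link are assumed helpers standing in for the download, link-extractor, and classifier boxes in the diagram.

```python
import heapq

def crawl(seed_urls, score_link, fetch_page, extract_links, max_pages=10000):
    # Priority queue over negated scores: best-scoring URLs are crawled first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)                 # dequeue best URL
        page = fetch_page(url)                           # download
        if page is None:
            continue
        pages[url] = page
        for link_url, link_info in extract_links(page):  # link extractor
            if link_url not in seen:
                seen.add(link_url)
                # classifier: {URL, link info} -> score, then enqueue
                heapq.heappush(frontier,
                               (-score_link(link_url, link_info), link_url))
    return pages
```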

Page 32:

FC: simple example

• Crawling pages about psychotherapy

Page 33:

Relevance prediction

• Anchor text: the text appearing in a hyperlink
• Text around the link: 50 bytes before and after the link
• URL words: words obtained by parsing the URL

Page 34:

Relevance Indicators

• URL: http://www.depression.com/psychotherapy.html

=> URL words: depression, com, psychotherapy

• Anchor text: psychotherapy
• Text around the link:

– 50 bytes before: section, learn

– 50 bytes after: talk, therapy, standard, treatment
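A rough sketch of extracting these three indicators for one link, assuming the link's position in the source text is known; the function name, tokenisation, and window handling are our simplifications, not the paper's code (the slide's example also filters out tokens such as "www" and "html").

```python
import re
from urllib.parse import urlparse

def link_features(url, anchor_text, source_text, link_pos, window=50):
    parsed = urlparse(url)
    # URL words: split host and path on non-alphanumeric characters
    url_words = [w.lower() for w in
                 re.split(r"[^a-zA-Z0-9]+", parsed.netloc + parsed.path) if w]
    before = source_text[max(0, link_pos - window):link_pos]   # 50 bytes before
    after_start = link_pos + len(anchor_text)
    after = source_text[after_start:after_start + window]      # 50 bytes after
    tokens = lambda s: re.findall(r"[a-z]+", s.lower())
    return {"url_words": url_words,
            "anchor": tokens(anchor_text),
            "before": tokens(before),
            "after": tokens(after)}

# link_features("http://www.depression.com/psychotherapy.html",
#               "psychotherapy", text, pos)["url_words"]
# -> ['www', 'depression', 'com', 'psychotherapy', 'html']
```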

Page 35:

Methods

• Machine-learning approach: train and test on relevant and irrelevant URLs using the features discussed above.
• Evaluated different learning algorithms: k-nearest neighbour, Naïve Bayes, C4.5, Perceptron.
• Result: the C4.5 decision tree was the best at predicting relevance.
• The same method was applied to predict quality, but it was not successful.
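For illustration, a toy version of such a classifier using scikit-learn. Hedges: scikit-learn's DecisionTreeClassifier implements CART, a close relative of C4.5, and the bag-of-words features and tiny training set below are invented stand-ins for the link-context features above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy link-context texts labelled relevant (1) or irrelevant (0).
texts = ["psychotherapy talk therapy depression treatment",
         "cheap flights hotel booking deals",
         "antidepressants cognitive therapy depression",
         "celebrity gossip photos news"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)
print(model.predict(["exercise as a treatment for depression"]))  # likely [1]
```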

Page 36:

Quality prediction

• Using evidence-based medicine, and

• Using the relevance feedback (RF) technique

Page 37:

Evidence-based Medicine

• Interventions supported as effective by a systematic review of the evidence.
• Examples of effective treatments for depression:
  – Antidepressants
  – ECT (electroconvulsive therapy)
  – Exercise
  – Cognitive behavioral therapy
• These treatment names were divided into single-word and two-word terms.

Page 38:

Relevance Feedback

• A well-known IR approach: query by example.
• Basic idea: run an initial query, get feedback from users about which documents are relevant, then add words from the relevant documents to the query.
• Goal: add terms to the query in order to retrieve more relevant results.

Page 39:

RF Algorithm

• Identify the N top-ranked documents
• Identify all terms from these documents
• Select the terms with the highest weights
• Merge these terms with the original query
• Identify the new top-ranked documents for the new query

(Usually, 20 terms are added in total.)
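A sketch of the term-selection step in this loop; the tf-idf-style weighting below is a common choice and an assumption on our part, since the slide does not give the exact weighting formula.

```python
import math
from collections import Counter

def expand_query(query_terms, ranked_docs, n_top=10, n_new=20):
    top_docs = ranked_docs[:n_top]                    # N top-ranked docs (token lists)
    tf = Counter(t for doc in top_docs for t in doc)  # term frequencies
    df = Counter(t for doc in top_docs for t in set(doc))  # document frequencies
    weights = {t: tf[t] * math.log(1 + len(top_docs) / df[t]) for t in tf}
    ranked = sorted(weights, key=weights.get, reverse=True)  # highest weights first
    new_terms = [t for t in ranked if t not in query_terms][:n_new]
    return list(query_terms) + new_terms              # merged query
```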

Page 40:

Our Modified RF approach

• Not for relevance, but Quality• No only single terms, but also phrases• Generate a list of single terms and 2-word phrases and

their associated weights • Select the top weighted terms and phrases• Cut-off points at the lowest-ranked term that appears in

the evidence-based treatment list• 20 phrases and 29 single words form a ‘quality query’
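A sketch of the modified step, reusing a term-weighting function like the one above: weight unigrams and bigrams from example pages, rank them, and cut the list off at the lowest-ranked entry found in the evidence-based treatment list. The helper names are ours, not the paper's.

```python
def quality_query(docs, treatment_terms, weigh_terms):
    # docs: token lists from high-quality example pages;
    # weigh_terms: callable returning {term_or_phrase: weight}.
    unigrams = [t for doc in docs for t in doc]
    bigrams = [" ".join(doc[i:i + 2]) for doc in docs for i in range(len(doc) - 1)]
    weights = weigh_terms(unigrams + bigrams)
    ranked = sorted(weights, key=weights.get, reverse=True)
    # cut off at the lowest-ranked entry that is an evidence-based treatment
    hits = [i for i, t in enumerate(ranked) if t in treatment_terms]
    return ranked[:hits[-1] + 1] if hits else ranked
```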

Page 41:

Terms representing the topic “depression”

Term               Weight
depression         13.3
health             6.9
treatment          5.7
mental             5.4
patient            3.3
medication         3.0
ECT                2.4
antidepressants    1.9
mental health      1.2
cognitive therapy  0.84

Page 42:

Predicting Quality

• For downloaded pages, quality score (QScore) is computed using a modification of the BM25 formula, taking into account term weights.

• Quality of a page is then predicted based on the quality of all downloaded pages linking to it.

(Assumption: good pages are usually interconnected.)

• Predicted quality score of a page with n downloaded source pages:

  PScore = (Σ QScore_i) / n
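The PScore formula transcribes directly into code; the qscores values for downloaded pages are assumed to come from the modified BM25 scoring against the quality query described above.

```python
def predict_quality(inlink_urls, qscores):
    # inlink_urls: downloaded pages linking to the target page;
    # qscores: {url: QScore} for all downloaded pages.
    sources = [u for u in inlink_urls if u in qscores]
    if not sources:
        return 0.0  # no downloaded source pages yet
    return sum(qscores[u] for u in sources) / len(sources)  # PScore = ΣQScore / n
```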

Page 43:

Combining relevance and quality

• We need a way of balancing relevance and quality
• Combining quality and relevance scores is new
• Our method uses the product of the two scores
• Other ways to combine these scores will be explored in future work
• The quality-focused crawler relies on this combined score to order the crawl queue
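As a sketch, the product combination and its use in queue ordering (names are ours; the crawler loop is the one sketched earlier):

```python
def combined_score(relevance_score, quality_pscore):
    # Product of the two scores, as described above.
    return relevance_score * quality_pscore

# The quality crawler would then enqueue links with, e.g.:
#   heapq.heappush(frontier, (-combined_score(r, q), url))
```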

Page 44:

The Three Crawlers

• A web crawler (spider):
  – A program which browses the Web in a methodical, automated manner
  – Usually used by a search engine to index web pages and provide fast searches
• We built three crawlers:
  – The breadth-first crawler: traverses the link graph in FIFO order (serves as the baseline for comparison)
  – The relevance crawler: targets relevance, ordering the crawl queue using the C4.5 decision tree
  – The quality crawler: targets both relevance and quality, ordering the crawl queue using the combination of the C4.5 decision tree and RF techniques

Page 45:

Results

Page 46:

Relevance

Page 47:

Relevance Results

• The relevance and quality crawls each stabilised after 3000 pages, at 80% and 88% relevance respectively.

• The BF crawl continued to degrade over time, dropping to 40% relevance at 10,000 pages.

• The quality crawler outperformed the relevance crawler due to the incorporation of the RF quality scores.

Page 48:

Quality

Page 49:

High quality pages

AAQ = Above Average Quality: top 25%

Page 50:

Low quality pages

BAQ = Below Average Quality: bottom 25%

Page 51:

Quality Results

• The quality crawler performed significantly better than the relevance crawler (about 50% better towards the end of the crawl).
• All the crawls did well at fetching high-quality pages; the quality crawler did especially well, with more than 50% of its pages being of high quality.
• Only about 5% of the quality crawl’s pages came from low-quality sites, while the BF crawl’s proportion was about 3 times higher.

Page 52:

Findings

• Topical relevance can be predicted well using link anchor context.

• Link anchor context could not be used to predict quality.

• The relevance feedback technique proved useful for quality prediction.

Page 53:

Overall Conclusions

• Domain-specific search engines can offer better-quality results than general search engines.

• The current way to build a domain-specific portal is expensive. We have successfully used focused crawling, a relevance decision tree, and relevance feedback to build a high-quality portal cheaply.

Page 54:

Future work

• So far we have experimented with only one health topic. We plan to repeat the experiments on another topic and to generalise the technique to other domains.

• Other ways of combining relevance and quality should be explored.

• Experiments comparing our quality crawl with other health portals are needed.

• Removing spam from the crawl is another important step.

