
Swapnil soni Thesis_Presentation

Date post: 08-Aug-2015
Upload: swapnil-soni
Transcript
Page 1: Swapnil soni Thesis_Presentation

Domain-Specific Document Retrieval Framework for Near Real-time Social Health Data

Master's Thesis
Swapnil Soni

Committee

Prof. Amit P. Sheth (Advisor)

Prof. Krishnaprasad Thirunarayan

Dr. Tanvi Banerjee

Collaborator: Ashutosh Jadhav

Contact:
LinkedIn: https://www.linkedin.com/in/swapnilsoniknoesis
Home page: http://knoesis.org/researchers/swapnil/
http://knoesis.org/

Page 2: Swapnil soni Thesis_Presentation

Outline

Background, Objective

Problem Statement

Data Collection

Pattern Extraction Analysis

Results and Evaluation

Demonstration

Conclusion

Page 4: Swapnil soni Thesis_Presentation

Background

Online health resources are easily accessible and provide information about most health topics.

These resources can help non-experts make more informed decisions and play a vital role in improving health literacy.

Sources: Pew Research, http://www.pewinternet.org/2010/03/24/health-information/ and http://www.pewinternet.org/2011/02/01/health-topics-3/

Page 5: Swapnil soni Thesis_Presentation

Background

According to Pew Research*:

45% of U.S. adults are dealing with at least one chronic condition

Of those who are living with two or more conditions, 45% have diabetes

*http://www.pewinternet.org/files/old-media/Files/Reports/2013/PIP_TrackingforHealth%20with%20appendix.pdf

Page 6: Swapnil soni Thesis_Presentation

Health Information Seeking

Online health information-seeker

Web search engine: learn about basic facts; get a deeper understanding of a topic of interest

Social media: real-time content; popular trends

Sources:
Choudhury et al., Seeking and Sharing Health Information Online: Comparing Search Engines and Social Media, ACM, 2014
Teevan et al., #TwitterSearch: A Comparison of Microblog Search and Web Search, 2011

Page 7: Swapnil soni Thesis_Presentation

Health Information Seeking

Online health information-seeker

Social media search engine

Real-time content

Relevant

Reliable information

Page 8: Swapnil soni Thesis_Presentation

Health Information Seeking Challenges

Keyword-based techniques are based on the interpretation of keywords

Search results may not be real-time

Page 9: Swapnil soni Thesis_Presentation

Example: How to control diabetes

Twitter

Keyword-based

Not real-time

Page 10: Swapnil soni Thesis_Presentation

Objective

To provide a platform where users can ask health-related questions and retrieve near real-time, reliable, and relevant health information shared on social media.

Page 11: Swapnil soni Thesis_Presentation

Twitter as a Data Source

In the US, 18% of internet users use Twitter.

There are 500 million tweets per day and around 75K verified healthcare professional accounts from all over the world.

152K: number of health tweets posted every day by healthcare professionals.

Twitter has become a new source of information overload in healthcare

Page 12: Swapnil soni Thesis_Presentation

Problem

How to extract near real-time, reliable, and relevant documents from the health information shared on Twitter for a given user query?

Page 14: Swapnil soni Thesis_Presentation

Data Collection

Questions

Real-time Twitter data

Page 15: Swapnil soni Thesis_Presentation

Categories of Questions

Predefined questions:

The most frequently asked questions, selected from Mayo Clinic, WebMD, etc.

Dynamic questions:

The user can ask any question

Page 16: Swapnil soni Thesis_Presentation

System Architecture

[Architecture diagram. Components: User interface; Database (patterns, URL social media rank); Twitter; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); URL content extractor; URL share & like counts extractor; Hadoop-based Pattern extractor; Similarity score Calc; Patterns Rank Calc. Data-flow steps are numbered 1 to 5.]

Page 17: Swapnil soni Thesis_Presentation

Apache Storm

It is a distributed, real-time computation system.

Spouts and bolts are the basic components in Storm for real-time processing of data.

Networks of spouts and bolts are packaged into a "topology", which is submitted to a Storm cluster.

Page 18: Swapnil soni Thesis_Presentation

Topology Architecture

Crawler Spout: a spout that crawls tweets in real time based on keywords

Language identifier Bolt: allows only English tweets

Object Modeling: converts the tweet object to a Java object

Hashtag extractor: retrieves hashtags from the tweets

URL Extractor: extracts URLs from tweets

URL resolver: expands each URL from its short form to its original form
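
The spout-and-bolt chain described on this slide can be sketched without Storm as a plain Python generator pipeline. This is only an illustration of the data flow, not the thesis implementation (the slides describe Storm components and Java objects). The tweet dictionaries, the `lang` field, the regular expressions, and the `SHORT_URLS` lookup table are all hypothetical stand-ins; a real resolver would follow HTTP redirects.

```python
import re
from dataclasses import dataclass, field

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")

# Hypothetical lookup table standing in for a redirect-following resolver.
SHORT_URLS = {"http://bit.ly/abc": "http://example.org/diabetes-remedies"}


@dataclass
class Tweet:
    text: str
    hashtags: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def crawler_spout(raw_tweets, keywords):
    """Emit raw tweets that mention at least one tracked keyword."""
    for raw in raw_tweets:
        if any(k in raw["text"].lower() for k in keywords):
            yield raw


def language_identifier(stream):
    """Let only English tweets through (assumes a 'lang' field)."""
    return (raw for raw in stream if raw.get("lang") == "en")


def object_modeling(stream):
    """Convert raw tweet dictionaries into typed Tweet objects."""
    return (Tweet(text=raw["text"]) for raw in stream)


def hashtag_extractor(stream):
    for tweet in stream:
        tweet.hashtags = HASHTAG_RE.findall(tweet.text)
        yield tweet


def url_extractor(stream):
    for tweet in stream:
        tweet.urls = URL_RE.findall(tweet.text)
        yield tweet


def url_resolver(stream):
    """Replace short URLs with their expanded form when one is known."""
    for tweet in stream:
        tweet.urls = [SHORT_URLS.get(u, u) for u in tweet.urls]
        yield tweet


def run_topology(raw_tweets, keywords):
    """Chain the spout and the bolts in the order listed on the slide."""
    stream = crawler_spout(raw_tweets, keywords)
    for bolt in (language_identifier, object_modeling,
                 hashtag_extractor, url_extractor, url_resolver):
        stream = bolt(stream)
    return list(stream)
```

Each stage consumes the previous stage's output lazily, which mirrors how a tuple flows from one bolt to the next in a topology.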

Page 20: Swapnil soni Thesis_Presentation

Component: Pattern Extractor

[System architecture diagram, repeated. Components: User interface; Database (patterns, URL social media rank); Twitter; Apache Storm processing pipeline (Language Identifier, URL extractor, URL resolver, DBHandler); URL content extractor; URL share & like counts extractor; Hadoop-based Pattern extractor; Similarity score Calc; Patterns Rank Calc. Data-flow steps are numbered 1 to 5.]

Page 21: Swapnil soni Thesis_Presentation

Extractors

Content Extractor:

Extracts content from the URLs present in the tweets.

URL Share & Like Counts Extractor:

Popularity of a source: to measure content popularity, we used the social media share and like counts of the URLs (Facebook share count, Facebook like count, Twitter share count).

Reliability of a source: Google domain page rank of the URLs.

Page 22: Swapnil soni Thesis_Presentation

Pattern Extractor

Pattern-based Mining

Question

Construct an AQL query

Triple: subject, predicate, and object

A noun or noun phrase, or a verb or verb

Page 23: Swapnil soni Thesis_Presentation

Annotation Query Language (AQL)

AQL is a language for building queries that pull structured information from unstructured or semi-structured text.

The syntax of AQL is similar to that of Structured Query Language (SQL).

[Diagram labels: AQL file; AOG; SystemT; Data folder. Captions: "Contains all the patterns." and "Result contains pattern."]

Page 24: Swapnil soni Thesis_Presentation

Pattern Extractor: Example

Question: How to control diabetes?

Synsets: UMLS, WordNet

Patterns:
X control diabetes
X control blood sugar
X handle blood sugar
X handle diabetes

Example tweet: 5 easy natural remedies to control diabetes : If you are a diabetic or know someone who is a diabeti... http://bit.ly/13oypg4
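
One way to picture how a single question yields several "X verb object" patterns is to take the cross product of the synonyms of the verb and of the object. The sketch below is an assumption about the mechanics, not the thesis code: `SYNSETS` is a hypothetical hand-written table standing in for the WordNet and UMLS lookups named on the slide, and `expand_patterns` is an illustrative helper name.

```python
from itertools import product

# Hypothetical synonym table; the slide obtains synsets from WordNet and UMLS.
SYNSETS = {
    "control": ["control", "handle"],
    "diabetes": ["diabetes", "blood sugar"],
}


def expand_patterns(verb, obj, synsets=SYNSETS):
    """Return 'X <verb> <object>' patterns for every synonym combination."""
    verbs = synsets.get(verb, [verb])      # fall back to the word itself
    objects = synsets.get(obj, [obj])
    return [f"X {v} {o}" for v, o in product(verbs, objects)]


patterns = expand_patterns("control", "diabetes")
```

With the table above, `patterns` contains the same four patterns the slide lists for "How to control diabetes?".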

Page 25: Swapnil soni Thesis_Presentation

Predefined Questions: Pattern Mining

This module extracts triples (patterns) from unstructured text (tweets and the content of their URLs) based on predefined questions (AQL queries).

The text analytics engine executes the AQL queries at an interval of six hours.

Page 26: Swapnil soni Thesis_Presentation

Dynamic Query Processing Architecture

Question: How to control diabetes by exercise?

Part-of-speech tagger: control (verb), diabetes (noun), exercise (noun)

Query builder (WordNet synsets)

Query executor

Patterns:
diabetes control exercise
exercise control diabetes
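
The dynamic-question flow (tag the words, keep the verb and the nouns, build candidate patterns) can be sketched as follows. This is a minimal illustration under stated assumptions: `LEXICON` is a hypothetical hand-written dictionary standing in for a real part-of-speech tagger, synonym expansion is left out, and the noun-verb-noun ordering is inferred from the two patterns shown on the slide.

```python
from itertools import permutations

# Hypothetical lexicon standing in for a real part-of-speech tagger.
LEXICON = {"control": "verb", "diabetes": "noun", "exercise": "noun"}


def tag(question):
    """Return (word, part-of-speech) pairs for the words the lexicon knows."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return [(w, LEXICON[w]) for w in words if w in LEXICON]


def build_patterns(question):
    """Build 'noun verb noun' patterns from every ordered pair of nouns."""
    tagged = tag(question)
    verbs = [w for w, pos in tagged if pos == "verb"]
    nouns = [w for w, pos in tagged if pos == "noun"]
    return [f"{a} {v} {b}" for v in verbs for a, b in permutations(nouns, 2)]


patterns = build_patterns("How to control diabetes by exercise?")
# patterns == ["diabetes control exercise", "exercise control diabetes"]
```

Words missing from the lexicon ("how", "to", "by") are simply dropped, which plays the role of stop-word removal in this toy version.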

Page 27: Swapnil soni Thesis_Presentation

Results

Question: How to control diabetes?
Extracted pattern: control blood sugar
Paragraph: Exercise is a healthy way to lower and control blood sugar levels within your body. Doing exercise and lifting weights will improve your condition significantly. http://t.co/88ulxDPFTo

Question: How to control diabetes?
Extracted pattern: insulin into the blood stream to handle
Paragraph: When a meal is eaten, the pancreas will send larger amounts of insulin into the blood stream to handle the food http://t.co/WsCWiNqhb9

Question: How to control diabetes?
Extracted pattern: remove sugar
Paragraph: Since people with Type 2 diabetes tend to accumulate sugar in their blood due to their inability to efficiently remove sugar from the blood http://t.co/aHqKJjrTPY

Question: What are the symptoms of diabetes?
Extracted pattern: Diabetics tend to get
Paragraph: Diabetics should be very cautious when having a pedicure. Diabetics tend to get bad infections in the feet, so you must be very aware of any puncture or cut you notice on your feet. http://t.co/HqJBjBtrXC

Question: What are the causes of diabetes?
Extracted pattern: can cause diabetes
Paragraph: Smoking isn’t healthy for anyone but can be very dangerous if you’re a diabetic. This habit produces many poor health issues. Smoking makes a person’s insulin resistant, in can cause diabetes to develop http://t.co/Ca5SaXRL6w

Page 28: Swapnil soni Thesis_Presentation

Pattern Rank Calculator Architecture

[Diagram. Inputs: patterns; URL share and like counts. Components: Similarity Score Calculator; Pattern Rank Calculator; Database. Steps are labeled 1a, 1b, 2, and 3.]

Page 29: Swapnil soni Thesis_Presentation

Features Set

Popularity: Facebook share counts; Facebook like counts; Twitter share count

Relevancy: vector-based similarity score

Reliability: Google domain rank

Page 30: Swapnil soni Thesis_Presentation

Query Expansion

Original query: How to control diabetes?

Expanded queries:
How to control diabetes?
How to control blood sugar?
How to handle blood sugar?
How to handle diabetes?

Example documents: "Exercise controls diabetes"; "Natural way to handle blood sugar"

TF-IDF scores shown on the slide: 0.81, 0.0, 0.81, 0.77
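
To make the idea of scoring a document against every expanded query concrete, here is a small, dependency-free TF-IDF and cosine-similarity sketch. It is an assumption-laden illustration: the smoothed IDF formula, the tokenizer (no stemming, no stop-word removal), and the choice to keep the best score across query variants are ours, so the output is not expected to reproduce the 0.81 or 0.77 values on the slide.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and strip simple punctuation; no stemming or stop words."""
    words = (w.strip("?.,!").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per text, with smoothed IDF."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(counts)
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in df}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_score(queries, document):
    """Highest similarity between the document and any query variant."""
    vectors = tfidf_vectors(list(queries) + [document])
    doc_vec = vectors[-1]
    return max(cosine(q_vec, doc_vec) for q_vec in vectors[:-1])


expanded = [
    "How to control diabetes?",
    "How to control blood sugar?",
    "How to handle blood sugar?",
    "How to handle diabetes?",
]
score = best_score(expanded, "Natural way to handle blood sugar")
```

A document phrased as "handle blood sugar" shares almost no terms with the original query, so expansion is what lets it score well.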

Page 31: Swapnil soni Thesis_Presentation

Experiments: ML Classifiers

Features: social media share and like counts + Jaccard index on query expansion

Classifier       Precision   Recall
NaiveBayes       0.638       0.639
SupportVector    0.63        0.667
RandomForest     0.678       0.694
AdaBoostM1       0.376       0.556

Features: social media share and like counts + TF-IDF on query expansion

Classifier       Precision   Recall
NaiveBayes       0.753       0.722
SupportVector    0.687       0.75
RandomForest     0.793       0.806
AdaBoostM1       0.501       0.583
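
The Jaccard index used as the alternative relevancy feature measures the overlap between two token sets: the size of their intersection divided by the size of their union. The sketch below shows that computation on expanded queries; the tokenizer and the choice to keep the best score over all query variants are assumptions made for illustration.

```python
def tokens(text):
    """Lowercased set of words with simple punctuation stripped."""
    return {w.strip("?.,!").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """|A intersect B| / |A union B| over the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_jaccard(expanded_queries, document):
    """Highest Jaccard index between the document and any query variant."""
    return max(jaccard(q, document) for q in expanded_queries)


# One shared token ("control") out of four distinct tokens.
example = jaccard("control diabetes", "control blood sugar")  # 0.25
```

Because Jaccard ignores term frequency and rarity, every shared word counts equally, which is one plausible reason a weighted scheme such as TF-IDF could separate relevant documents better.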

Page 33: Swapnil soni Thesis_Presentation

Evaluation

Reliability, Relevancy, and Real-time

Pattern Generator

Query Expansion based on Relevance Feedback

Page 34: Swapnil soni Thesis_Presentation

Evaluation: Reliability, Relevancy & Real-time

Reliability:

• Based on the Google domain PageRank of each URL (extracted news article)

• Filtration criterion: the URL's Google domain PageRank must be greater than 4

Relevancy:

• Based on a qualitative approach

• For a given question, user survey participants judge the relevancy of the result sets from 1) Twitter search, 2) Social Health Signals, and 3) Google time-bound search, and assign a relevancy score from 1 (low) to 3 (high)

Real-time:

• Timeliness (trends) of a retrieved document. We considered only 6 hours of data to find information for a user's query

• Example: breaking news on diabetes
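
The reliability and real-time criteria amount to two filters over candidate documents. The sketch below assumes each document is a dictionary with hypothetical `url`, `pagerank`, and `seen_at` keys, and it interprets "6 hours of data" as a sliding window ending at the query time; both are assumptions rather than details confirmed by the slides.

```python
from datetime import datetime, timedelta

MIN_PAGERANK = 4               # documents must score strictly above this
WINDOW = timedelta(hours=6)    # assumed sliding window ending at query time


def filter_documents(docs, now):
    """Keep documents that are both reliable and recent.

    Each doc is a dict with hypothetical keys 'url', 'pagerank', 'seen_at'.
    """
    return [
        d for d in docs
        if d["pagerank"] > MIN_PAGERANK
        and timedelta(0) <= now - d["seen_at"] <= WINDOW
    ]


now = datetime(2015, 8, 8, 12, 0)
docs = [
    {"url": "http://example.org/a", "pagerank": 6, "seen_at": now - timedelta(hours=1)},
    {"url": "http://example.org/b", "pagerank": 4, "seen_at": now - timedelta(hours=1)},
    {"url": "http://example.org/c", "pagerank": 7, "seen_at": now - timedelta(hours=9)},
]
kept = [d["url"] for d in filter_documents(docs, now)]
# kept == ["http://example.org/a"]
```

The second document fails because its rank is not strictly greater than 4, and the third because it falls outside the six-hour window.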

Page 35: Swapnil soni Thesis_Presentation

Evaluation: Relevancy

Collected the top 10 results from these sources: Twitter search, Social Health Signal, and Google time-bound search

Queries (frequently asked queries):

1) How to control diabetes?

2) What are the causes of diabetes?

3) What are the symptoms of diabetes?

Page 36: Swapnil soni Thesis_Presentation

Evaluation: Relevancy

Presented the top 10 results from all three sources for each query to participants

Participants judged each document for a query on a scale of 1 to 3 (1: not good, 2: good, 3: very good)

To calculate the average rank, we used the following formula*:

*http://help.surveymonkey.com/articles/en_US/kb/What-is-the-Rating-Average-and-how-is-it-calculated
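
The formula itself was an image on the slide. A rating average of the kind the cited help page describes is a weighted mean, which can be written as follows; the symbols $w_i$ and $x_i$ are our notation, and the worked numbers are hypothetical.

```latex
\text{rating average} = \frac{\sum_{i=1}^{n} w_i \, x_i}{\sum_{i=1}^{n} x_i}
% w_i : weight of answer choice i (here 1, 2, 3)
% x_i : number of responses that selected choice i
%
% Hypothetical example with 10 judged documents:
% 4 rated 1, 1 rated 2, 5 rated 3
% (1 \cdot 4 + 2 \cdot 1 + 3 \cdot 5) / 10 = 21 / 10 = 2.1
```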

Page 37: Swapnil soni Thesis_Presentation

Evaluation: Relevancy

How to control diabetes?

Result 1

Result 2

Result 3

Score 1 Score 2 Score 3

Page 38: Swapnil soni Thesis_Presentation

Evaluation: Relevancy

[Three rating tables, shown in slide order. The slide labels them Bad, Good, and Very Good, but the transcript does not preserve which label belongs to which table.]

Table 1
          Twitter search   Social Health Signal   Google time-bound
Query 1   40%              50%                    10%
Query 2   10%              60%                    50%
Query 3   40%              50%                    10%

Table 2
          Twitter search   Social Health Signal   Google time-bound
Query 1   10%              10%                    40%
Query 2   30%              10%                    10%
Query 3   30%              20%                    70%

Table 3
          Twitter search   Social Health Signal   Google time-bound
Query 1   50%              40%                    50%
Query 2   60%              30%                    40%
Query 3   30%              30%                    20%

Page 39: Swapnil soni Thesis_Presentation

Evaluation Metrics: Relevancy

nDCG@K (Normalized Discounted Cumulative Gain)

nDCG@K can handle multiple levels of relevance

It gives more weight to a document at a higher rank position than to one at a lower rank position
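
A short sketch of how nDCG@K rewards relevant documents near the top of a ranking. The slides do not say which DCG variant was used, so the discount below (the first position undiscounted, later positions divided by log2 of the rank) is an assumption, as is normalizing by the ideal ordering of the same list of ratings.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K with the classic discount: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(
        rel if rank == 1 else rel / math.log2(rank)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@K divided by the DCG@K of the same ratings sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0


# Ratings on the 1 (low) to 3 (high) scale, listed in ranked order.
good_order = ndcg_at_k([3, 2, 1], k=3)   # already ideal, so 1.0
bad_order = ndcg_at_k([1, 2, 3], k=3)    # same ratings, worse ranking
```

Both lists contain the same ratings, yet the second scores lower because its most relevant document sits at the bottom, which is exactly the position sensitivity the slide describes.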

Page 40: Swapnil soni Thesis_Presentation

Evaluation Metrics: Relevancy

[Three nDCG tables, shown in slide order. The slide labels them Query 1, Query 2, and Query 3, but the transcript does not preserve which label belongs to which table.]

Table 1
       Twitter-Search   Social Health Signal   Google Time-Bound
DCG    9.68             12.72                  9.12
IDCG   13.33            13.76                  9.81
nDCG   0.726            0.924                  0.929

Table 2
       Twitter-Search   Social Health Signal   Google Time-Bound
DCG    9.67             13.15                  10.03
IDCG   10.55            14.15                  12.76
nDCG   0.91             0.92                   0.78

Table 3
       Twitter-Search   Social Health Signal   Google Time-Bound
DCG    10.75            11.47                  10.76
IDCG   12.69            13.45                  10.89
nDCG   0.84             0.85                   0.98

Page 41: Swapnil soni Thesis_Presentation

Evaluation: Popularity

Query: How to control diabetes?

Google time-bound

Facebook (Share + Like) Counts   Twitter Share Counts
4                                0
0                                0
0                                0
0                                4
0                                0
0                                0
1                                2
52                               1211
229                              0

Social Health Signal

Facebook (Share + Like) Counts   Twitter Share Counts
3910                             1843
213                              8
81                               90
0                                128
149                              826
0                                24
0                                20
0                                24
0                                2

Page 42: Swapnil soni Thesis_Presentation

Google Time-Bound Search

Obesity cause diabetes Overweight cause diabetes

Page 43: Swapnil soni Thesis_Presentation

Social Health Signal

Obesity less Mortality Risk

URL Title: Replacing sugary drinks with water may reduce diabetes risk
Extracted Pattern: 'obese is a major risk'

URL Title: More Evidence Links Diabetes to Alzheimer's Disease
Extracted Pattern: Overweight May Decrease Mortality Risk

URL Title: The facts about sugar
Extracted Pattern: 'overweight can increase your risk'

URL Title: Having Diabetes Can Increase Your Alzheimer's Risk Via Blood Glucose And Brain Plaque Link
Extracted Pattern: obesity can also increase our risk

URL Title: Diabetes Study Suggests a Little Extra Weight Tied to Longer Survival
Extracted Pattern: risk for dying than overweight

Page 45: Swapnil soni Thesis_Presentation

Future Work

Evaluation

Relevancy on more Queries

Pattern Generator

Query Expansion based on Relevance Feedback

Semantic Categorization

Performance improvement for dynamic queries

Page 46: Swapnil soni Thesis_Presentation

Conclusion

◦ Twitter has become a popular tool for seeking health information.

◦ It is a very difficult task to extract relevant and reliable health documents from Twitter in near real time.

◦ We address this problem by using state-of-the-art approaches such as:

◦ Semantics-based pattern mining

◦ TF-IDF relevancy score on query expansion

◦ Content popularity: social media share and like counts

◦ Reliability: Google domain page rank

Page 47: Swapnil soni Thesis_Presentation

Acknowledgements

Dr. Amit Sheth Dr. T.K Prasad Dr. Tanvi Banerjee Ashutosh Jadhav

Page 48: Swapnil soni Thesis_Presentation

Thanks!

Questions?

Page 49: Swapnil soni Thesis_Presentation

Social Health Signal

Screenshots

Page 50: Swapnil soni Thesis_Presentation

Home Screen

Page 51: Swapnil soni Thesis_Presentation

Search & Explore Screen

Page 52: Swapnil soni Thesis_Presentation

Top 10 URLs Screen

Page 53: Swapnil soni Thesis_Presentation

Tweet Locations Screen

