Machine Learning Applications to Modeling Web Searcher Behavior
Eugene Agichtein
Intelligent Information Access Lab (IRLab)
Emory University
Talk Outline
• Overview of the Emory IR Lab
• Intent-centric Web Search
• Contextualized search intent detection
• One example medical application
Eugene Agichtein, Emory University, IR Lab
Intelligent Information Access Lab (IRLab)
Qi Guo, Ablimit Aji
• Modeling information seeking behavior
• Searching the Web and social media
• Text and data mining for medical informatics
In collaboration with:
- Beth Buffalo (Neurology)
- Charlie Clarke (Waterloo)
- Ernie Garcia (Radiology)
- Phil Wolff (Psychology)
- Hongyuan Zha (GaTech)
Julia Kiseleva, Dmitry Lagun, Qiaoling Liu, Yu Wang
Our Approach to Intelligent Information Access
[Figure: application areas (Information sharing, Health Informatics, Cognitive Diagnostics, Intelligent search) built on Data-Driven Model Discovery (machine learning/data mining)]
Search logs: queries, clicks
Web-scale Text Mining
Extract entities, relationships, events from text
Estimate accuracy of web content
Some Applications:
– Incorporating extracted information into (web) search
– Finding implicit connections between events, entities
– Visualizing and exploring large text collections
[DL’00, ICDE 2003 “best student paper”, SIGMOD 2006 “best paper”, … ]
[Figure: example extraction, Disease Outbreaks, The New York Times; area: Intelligent search]
18 November 2009, Eugene Agichtein, Emory University, IR Lab
Social Media Language Analysis
Social Media != WSJ Text
Text Mining/NLP Challenges:
• Content quality
• Authority/expertise
• User goals
• Subjectivity
• Sentiment
• Temporal Sensitivity
• Effort and Incentives
Information sharing
Hybrid Web/Social Search
Intelligent search + Information sharing
Talk Outline
• Overview of the Emory IR Lab
• Intent-centric Web Search
• Contextualized search intent detection
• One example medical application
Some Key Challenges for Web Search
• Query interpretation (infer intent)
• Ranking (high dimensionality)
• Evaluation (system improvement)
• Result presentation (information visualization)
Task-Goal-Search Model
Example query: car safety ratings consumer reports
Problem Statement
• Given a sequence of user actions, predict the user’s goal, task, and future actions
– Tasks and goals are defined next
• Example applications:
– Predict document relevance (ranking, result presentation, summarization)
– Predict next query (query suggestion, spelling correction)
– Predict user satisfaction (market share)
Intent (Goal) Classes, top level only
User intent taxonomy (Broder 2002)
– Informational – want to learn about something (~40% / 65%)
– Navigational – want to go to that page (~25% / 15%)
– Transactional – want to do something (web-mediated) (~35% / 20%)
• Access a service
• Downloads
• Shop
– Gray areas
• Find a good hub
• Exploratory search “see what’s there”
[from SIGIR 2008 Tutorial, Baeza-Yates and Jones]
History nonya food
Singapore Airlines
Jakarta weather
Kalimantan satellite images
Nikon Finepix
Car rental Kuala Lumpur
Information Retrieval Process: Implementation
[Figure: the information retrieval process. Stages: Source Selection, Query Formulation, Search, Selection, Examination, Delivery. Intermediate results: Resource, Query (example: car safety ratings), Ranked List (shown on the Search Engine Result Page, SERP), Documents. Feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection.]
Search Actions
• Keystrokes
– query, scroll, CTRL-C, …
• GUI:
– scrolling, button press, clicks
• Mouse:
– moving, scrolling, button down/up
• Browser:
– new tab, close, back/forward
All of these can be easily captured on the SERP (JavaScript)
How Do We Know “True” User Intent?
• Ask the user (surveys, field studies, pop-ups)
– Does not scale, users get annoyed
• Observe user actions and guess
– Intent usually obvious to humans but not always
• Detect signals from the user’s brain (fMRI, EEG) and attempt to interpret neuron activity
Adapted from [Daniel M. Russell, 2007]
What Eye Movement Can Tell
• Eye tracking gives information about searcher interests:
– Eye position
– Pupil diameter
– Saccades and fixations
[Figure labels: Reading; Visual Search; Camera]
“An Eye Tracker on Every Table”
• Eye tracking equipment is bulky and expensive
• Can we infer gaze position from observable actions?
• Exploratory study from Google (Rodden et al.) says maybe: mouse position is sometimes related to eye position
Relationship Between Mouse and Gaze Position
• Searchers might use the mouse to focus reading attention, bookmark promising results, or not at all.
• Behavior varies with task difficulty and user expertise
[K. Rodden, X. Fu, A. Aula, and I. Spiro, Eye-mouse coordination patterns on web search results pages, Extended Abstracts of ACM CHI 2008]
Assume “Transitivity” Holds
• Given:
– Gaze position ==> user intent
– Mouse movement ==> gaze position
• Then: Mouse movement ==> user intent
• Restate problem:
– Given user actions, infer the current user’s intent, focusing on the individual user’s actions
Collecting Search Data: EMU
• Firefox + LibX plugin
• Track whitelisted sites, e.g., Emory, Google, Yahoo search…
• All SERP events logged (asynchronous HTTP requests)
• 150 public use machines, 5,000+ opted-in users
[Figure: EMU data flow. Labels: HTTP Server, HTTP Log, Usage Data, Data Mining & Management, Train Prediction Models]
Research vs. Purchase Intent
• 12 subjects (grad students and staff) asked to:
1. Research a product they want to purchase eventually (Research intent)
2. Search for the best deal on an item they want to purchase immediately (Purchase intent)
• Eye tracking and browser instrumentation performed in parallel for some of the subjects
• EyeTech systems TM3 (rental): avoid!
– At reasonable resolution, samples at only ~12-15 Hz
– Loses calibration after a few minutes
Qi Guo
Mouse Features: Simple
• First representation:
– Trajectory length
– Horizontal range
– Vertical range
[Figure: a mouse trajectory annotated with its horizontal range, vertical range, and trajectory length]
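The three features of the first representation can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the trajectory is a time-ordered list of (x, y) screen positions in pixels, and the function name `simple_mouse_features` is hypothetical.

```python
import math

def simple_mouse_features(points):
    """Compute trajectory length, horizontal range, and vertical range.

    points: time-ordered list of (x, y) mouse positions (assumed format).
    """
    if not points:
        return {"trajectory_length": 0.0, "horizontal_range": 0.0, "vertical_range": 0.0}
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Trajectory length: sum of Euclidean distances between consecutive samples.
    length = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return {
        "trajectory_length": length,
        "horizontal_range": float(max(xs) - min(xs)),
        "vertical_range": float(max(ys) - min(ys)),
    }

features = simple_mouse_features([(0, 0), (3, 4), (3, 10)])
print(features)
```

The resulting dictionary can be flattened into a fixed-length feature vector for a classifier.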
Mouse Features: Full
• Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.
[Figure: a mouse trajectory divided into segments 1-5]
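A rough sketch of the segmented representation follows. The slides do not say how the five segments are delimited or how each per-segment feature is defined, so everything here is an assumption: samples are (t, x, y) tuples, segments are contiguous chunks with near-equal sample counts, acceleration is the change in speed over the segment duration, rotation is the summed absolute change in heading, and slope is expressed as an angle so vertical moves do not divide by zero. All function names are hypothetical.

```python
import math

SEGMENT_NAMES = ["initial", "early", "middle", "late", "end"]

def split_into_segments(samples, n_segments=5):
    """Split samples into contiguous chunks of near-equal size (assumed rule)."""
    n = len(samples)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [samples[bounds[i]:bounds[i + 1]] for i in range(n_segments)]

def segment_features(segment):
    """Speed, acceleration, rotation, and slope angle for one segment."""
    zero = {"speed": 0.0, "acceleration": 0.0, "rotation": 0.0, "slope": 0.0}
    if len(segment) < 2:
        return zero
    speeds, headings = [], []
    for (t0, x0, y0), (t1, x1, y1) in zip(segment, segment[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip samples with non-increasing timestamps
        dx, dy = x1 - x0, y1 - y0
        speeds.append(math.hypot(dx, dy) / dt)
        if dx or dy:
            headings.append(math.atan2(dy, dx))
    duration = segment[-1][0] - segment[0][0]
    if not speeds or duration <= 0:
        return zero
    rotation = 0.0
    for a, b in zip(headings, headings[1:]):
        d = b - a
        # Wrap the heading change into (-pi, pi] before accumulating.
        rotation += abs(math.atan2(math.sin(d), math.cos(d)))
    return {
        "speed": sum(speeds) / len(speeds),
        "acceleration": (speeds[-1] - speeds[0]) / duration,
        "rotation": rotation,
        "slope": math.atan2(segment[-1][2] - segment[0][2],
                            segment[-1][1] - segment[0][1]),
    }

def full_mouse_features(samples):
    """Concatenate per-segment features into one flat dictionary."""
    features = {}
    for name, segment in zip(SEGMENT_NAMES, split_into_segments(samples)):
        for key, value in segment_features(segment).items():
            features[f"{name}_{key}"] = value
    return features

samples = [(float(t), 10.0 * t, 0.0) for t in range(10)]
print(full_mouse_features(samples))
```

Because consecutive chunks do not share a boundary sample, movement between the last sample of one segment and the first sample of the next is ignored; overlapping the chunks by one sample would be an equally reasonable choice.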
Use Support Vector Machine (SVM) Classifier
• SVMs maximize the margin around the separating hyperplane.
• A.k.a. large margin classifiers
• The decision function is fully specified by a subset of training samples, the support vectors.
• Quadratic programming problem
• Seen by many as the most successful current text classification method
[Figure: separating hyperplane with maximized margin; the support vectors are highlighted]
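To make the large-margin idea concrete, here is a small linear SVM trained on toy data. Note the substitution: instead of the quadratic programming formulation mentioned on the slide, this sketch uses Pegasos-style stochastic subgradient descent on the regularized hinge loss, which targets the same objective for the linear case. No bias term is fitted, and the feature values are made-up, hypothetical standardized mouse features rather than data from the study.

```python
import random

def train_linear_svm(points, labels, lam=0.01, n_steps=2000, seed=0):
    """Pegasos-style training of a linear SVM without a bias term.

    points: list of feature tuples; labels: +1 or -1.
    lam: L2 regularization strength (hypothetical default).
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(points))
        x, y = points[i], labels[i]
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Shrink w: this is the subgradient step for the L2 regularizer.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # If the example violates the margin, also step toward it.
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    """Classify by the side of the hyperplane w.x = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy, linearly separable data (illustrative values only).
X = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (-2.0, -2.0), (-1.0, -3.0), (-3.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(X, y)
print(w, [predict(w, x) for x in X])
```

In practice an off-the-shelf SVM package with kernel support would be the more likely choice; the sketch only shows how margin violations drive the weight updates.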
From HMMs to MEMMs to CRFs
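Notation: $\vec{s} = s_1, s_2, \ldots, s_n$ (states), $\vec{o} = o_1, o_2, \ldots, o_n$ (observations)
[Figure: graphical models of the HMM, MEMM, and CRF over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
HMM:
$$P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$
MEMM:
$$P(\vec{s} \mid \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}, o_t) \propto \prod_{t=1}^{|\vec{o}|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$
CRF:
$$P(\vec{s} \mid \vec{o}) \propto \frac{1}{Z_{\vec{o}}} \prod_{t=1}^{|\vec{o}|} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$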
Conditional Random Fields (CRFs)
[from Lafferty, McCallum, Pereira 2001]
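The linear-chain CRF scores a whole state sequence with weighted transition features f_j(s_t, s_{t-1}) and observation features g_k(s_t, o_t), then normalizes over all state sequences for the given observations. The sketch below illustrates this on a tiny example by brute-force enumeration. The state labels, observed actions, and all weights are hypothetical, chosen only for illustration; real implementations compute the normalizer with the forward algorithm instead of enumerating sequences.

```python
import itertools
import math

STATES = ["research", "purchase"]  # hypothetical intent labels

# Hypothetical weights lambda_j for indicator transition features,
# keyed by (previous state, current state). Missing pairs have weight 0.
TRANSITION_WEIGHTS = {
    ("START", "research"): 0.5,
    ("research", "research"): 1.0,
    ("research", "purchase"): 0.5,
    ("purchase", "purchase"): 1.0,
    ("purchase", "research"): -0.5,
}

# Hypothetical weights mu_k for indicator observation features,
# keyed by (current state, observed action).
OBSERVATION_WEIGHTS = {
    ("research", "read_review"): 1.5,
    ("purchase", "click_price"): 1.5,
    ("purchase", "add_to_cart"): 2.0,
}

def sequence_score(states, observations):
    """Sum of weighted transition and observation features over all positions."""
    score = 0.0
    previous = "START"
    for state, observation in zip(states, observations):
        score += TRANSITION_WEIGHTS.get((previous, state), 0.0)
        score += OBSERVATION_WEIGHTS.get((state, observation), 0.0)
        previous = state
    return score

def sequence_probabilities(observations):
    """P(states | observations) for every state sequence, by enumeration."""
    candidates = list(itertools.product(STATES, repeat=len(observations)))
    scores = [sequence_score(c, observations) for c in candidates]
    z = sum(math.exp(s) for s in scores)  # global normalizer Z for these observations
    return {c: math.exp(s) / z for c, s in zip(candidates, scores)}

observations = ["read_review", "click_price", "add_to_cart"]
probabilities = sequence_probabilities(observations)
best = max(probabilities, key=probabilities.get)
print(best, round(probabilities[best], 3))
```

The single global normalizer is what distinguishes the CRF from the MEMM, which normalizes separately at each position.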
Application: Predict Ad Receptiveness
Hypothesis: the right time to show search ads is when the searcher is receptive to seeing ads
Results: Ad Click Prediction
• 200%+ precision improvement (within task)
Challenges
• Separate context from intent (e.g., smart phones)
• User variability: individual differences, tasks
• Scale of data: representation, compression
• Privacy: client-side data similar to other PII
• Obtaining realistic user data: see above
– EMU toolbar tracking since 2007 in Emory Libraries (biased)
Current and Future Work
• Detect mouse “reading” behavior
• Unsupervised intent clustering
• User vs. task
• Personalized behavior models
• Long-term interests/effects
• User mental state (frustration, cognitive impairment, …)
Towards Web-based Visual Paired Comparison Test
• VPC can be used to detect mild cognitive impairment (MCI), years before Alzheimer’s disease (AD), but requires eye tracking equipment
• Goal: develop web-based version of VPC
– NIH ADRC Pilot Grant, jointly with Beth Buffalo
• Approach: exploit connection between mouse movement and gaze position.
– Force usage of mouse to reveal image
• Or parts of image
– Develop robust machine learning techniques to predict cognitive impairment based on (noisy) mouse data
Dmitry Lagun
with Beth Buffalo, Neurology/Yerkes
Initial Results
• Preference for novel image (59%) consistently observed
• Still exploring parameter space and metrics to optimize
• Results sensitive to MTurk worker instructions, incentives, other factors (?)
• Looking for advice on remote behavioral testing…
[Figure labels: Eye tracking; Mouse (oculus) center]
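One way a novelty preference score could be computed from mouse data is sketched below. The slides do not define the metric, so this assumes the common VPC convention: the fraction of viewing time spent on the novel image out of the total time spent on either image. The sample format (t, x, y) for the oculus center, the rectangular image regions, and the function name are all hypothetical.

```python
def novelty_preference(samples, novel_box, familiar_box):
    """Fraction of viewing time spent on the novel image.

    samples: time-ordered list of (t_seconds, x, y) oculus-center positions.
    novel_box, familiar_box: (left, top, right, bottom) image regions.
    Returns None if neither image was viewed.
    """
    def inside(x, y, box):
        left, top, right, bottom = box
        return left <= x < right and top <= y < bottom

    novel_time = 0.0
    familiar_time = 0.0
    # Attribute each interval between samples to the position at its start.
    for (t0, x, y), (t1, _, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        if inside(x, y, novel_box):
            novel_time += dt
        elif inside(x, y, familiar_box):
            familiar_time += dt
    total = novel_time + familiar_time
    return novel_time / total if total > 0 else None

samples = [
    (0.0, 50, 50), (1.0, 50, 50), (2.0, 50, 50),
    (3.0, 250, 50), (4.0, 250, 50), (5.0, 400, 400),
]
print(novelty_preference(samples, novel_box=(200, 0, 300, 100), familiar_box=(0, 0, 100, 100)))
```

Under this convention a score above 0.5 indicates a preference for the novel image.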
Summary: From Behavior to State of Mind
• Approach:
– Machine learning methods for detecting searcher intent
– Calibrated and augmented with lab studies
• Foundational contributions:
– Methods to mine and integrate wide range of interactions
– Data-driven discovery of user state-of-mind
• Impact:
– Intelligent, intuitive search and information sharing
– Potential for new research tools and techniques
Main References
• Azin Ashkan, Charles L. A. Clarke, Eugene Agichtein, and Qi Guo. Classifying and Characterizing Query Intent. In ECIR 2009.
• Qi Guo and Eugene Agichtein. Exploring Client-Side Instrumentation for Personalized Search Intent Inference: Preliminary Experiments. In Proc. of AAAI 2008 Workshop on Intelligent Techniques for Web Personalization (ITWP 2008).
• Qi Guo, Eugene Agichtein, Azin Ashkan, and Charles L. A. Clarke. In the Mood to Click? Inferring Searcher Advertising Receptiveness. In Proc. of WI 2009.
• Other papers here: http://www.mathcs.emory.edu/~eugene/