
Automatically Identifying Localizable Queries

Center for E-Business Technology, Seoul National University

Seoul, Korea

Nam, Kwang-hyun

Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Michael J. Welch, Junghoo Cho

University of California, Los Angeles

SIGIR 2008

Copyright 2009 by CEBT

Contents

Introduction

Motivation

Our Approach

Identify candidate localizable queries

Select a set of relevant features

Train and evaluate supervised classifier performance

Evaluation

Individual Classifiers

Ensemble Classifiers

Conclusion and future work

Discussion


Introduction

Typical queries

Insufficient to fully specify a user’s information need

Localizable queries

Some queries are location sensitive

– “italian restaurant” -> “[city] italian restaurant”

– “courthouse” -> “[county] courthouse”

– “drivers license” -> “[state] drivers license”

These queries are submitted by users looking for information or services relevant to their current location.

Our task

Identify the queries which contain locations as contextual modifiers


Motivation

Why automatically localize?

Reduce burden on the user

– No special “local” or “mobile” site

Improve search result relevance

– Not all information is relevant to every user

Increase clickthrough rate

Improve local sponsored content matching


Motivation

Significant fraction of queries are localizable

Roughly 30%

But users only explicitly localize them about ½ of the time

– 16% of queries would benefit from automatic localization

Users agree on which queries are localizable

Queries for goods and services

– E.g. “food supplies”, “home health care providers”

– But “calories coffee”, “eye chart” are not.


Our Approach

Identify candidate localizable queries

Select a set of relevant features

Train and evaluate supervised classifier performance


Identifying Base Queries

Queries are short and unformatted

Use string matching

Compare against locations of interest

– Using U.S. Census Bureau data

Extract base query

– Where the matched portion of text is tagged with the detected location type (state, county, or city)

To ensure accuracy, false positives from this matching are filtered out later by the classifier

Simple, yet effective
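
A minimal sketch of this tagging step in Python, assuming a tiny in-memory gazetteer in place of the full U.S. Census Bureau place list (the GAZETTEER entries and the tag_query helper are illustrative, not the authors' code):

```python
import re

# Illustrative gazetteer; the paper uses U.S. Census Bureau place data.
GAZETTEER = {
    "malibu": "city",
    "california": "state",
    "kentucky": "state",
    "los angeles county": "county",
}

def tag_query(query):
    """Replace matched place names with typed tags and return (base, tags)."""
    tags = []
    base = query.lower()
    # Match longer place names first so "los angeles county" wins over shorter matches.
    for place in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(place) + r"\b"
        if re.search(pattern, base):
            tags.append(f"{GAZETTEER[place]}:{place}")
            base = re.sub(pattern, " ", base)
    base = " ".join(base.split())  # collapse leftover whitespace
    return base, tags

# Example based on the slides:
print(tag_query("public libraries malibu california"))
# -> ('public libraries', ['state:california', 'city:malibu'])
```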


Example: Identifying Base Queries


[Figure: an example query containing "malibu" and "california" can be tagged as city:malibu, as state:california, or as both city:malibu and state:california]


Example: Identifying Base Queries

Three distinct base queries

Remove stop words and group by base

Allows us to compute aggregate statistics


Base | Tag
public libraries california | city:malibu
public libraries malibu | state:california
public libraries | city:malibu, state:california
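
A sketch of the stop-word removal and grouping step, using the three tagged instances from the table above as hypothetical input (the stop-word list and record layout are illustrative):

```python
from collections import Counter, defaultdict

STOP_WORDS = {"in", "the", "of", "a", "for"}  # illustrative stop-word list

def normalize_base(base):
    """Drop stop words so superficially different bases can be grouped together."""
    return " ".join(w for w in base.split() if w not in STOP_WORDS)

# Tagged instances (base query, location tags), e.g. from the table above.
tagged = [
    ("public libraries california", ["city:malibu"]),
    ("public libraries malibu", ["state:california"]),
    ("public libraries", ["city:malibu", "state:california"]),
]

# Group by base and aggregate tag counts; this feeds the statistics used later.
tag_counts = defaultdict(Counter)
for base, tags in tagged:
    for tag in tags:
        tag_counts[normalize_base(base)][tag] += 1

for base, counts in sorted(tag_counts.items()):
    print(base, dict(counts))
```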


Our Approach

Identify candidate localizable queries

Select a set of relevant features

Train and evaluate supervised classifier performance


Distinguishing Features

Hypothesis: localizable queries should

Be explicitly localized by some users

Occur several times

– From different users

Occur with several different locations

– Each with about equal probability


Localization Ratio

Users vote for the localizability of query qi by contextualizing it with a location l

Drawbacks

Susceptible to small sample sizes

Unable to identify false positives resulting from incorrectly tagged locations


ri = Qi(L) / Qi , ri ∈ [0, 1]

ri : localization ratio for qi
Qi : the count of all instances of qi
Qi(L) : the count of all query instances tagged with some location l ∈ L
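
A minimal sketch of the localization ratio as reconstructed above (ri = Qi(L) / Qi); the counts below are made up for illustration:

```python
def localization_ratio(total_count, localized_count):
    """ri = Qi(L) / Qi: the fraction of a base query's instances that
    users explicitly contextualized with a location."""
    if total_count == 0:
        return 0.0
    return localized_count / total_count

# Hypothetical counts for a base query such as "italian restaurant".
Qi = 1200     # all instances of the base query, localized or not
Qi_L = 450    # instances tagged with some location l in L
print(localization_ratio(Qi, Qi_L))  # 0.375, always within [0, 1]
```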


Location Distribution

Informally: given an instance of any localized query ql with base qb, the probability that ql contains location l is approximately equal across all locations that occur with qb.

To estimate the distribution, we calculate several measures

mean, median, min, max, and standard deviation of occurrence counts


ql : localized query
qb : base query
L(qb) : the set of location tags
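
A sketch of these distribution measures, assuming a dict of per-location occurrence counts for a single base query (the counts are hypothetical):

```python
import statistics

def distribution_features(location_counts):
    """Summarize how evenly a base query's localized instances spread over locations."""
    counts = list(location_counts.values())
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
        "stdev": statistics.pstdev(counts),
    }

# A roughly uniform spread (low stdev, mean close to median) suggests the
# base query is genuinely localizable.
print(distribution_features({"city:malibu": 9, "city:pasadena": 11, "city:glendale": 10}))
```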


Location Distribution

The “fried chicken” problem: the tag counts for the base query “fried chicken” are heavily skewed toward state:kentucky (from “Kentucky Fried Chicken”), so its location distribution is far from uniform


Tag | Count
city:chester | 6
city:colorado springs | 1
city:cook | 1
city:crown | 1
city:lousiana | 4
city:louisville | 2
city:rice | 2
city:waxahachie | 1
state:kentucky | 163
state:louisiana | 4
state:maryland | 2
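
Applying the same summary measures to the counts in the table above illustrates the skew (a rough check, not the paper's exact computation):

```python
import statistics

# Occurrence counts from the "fried chicken" table above.
counts = [6, 1, 1, 1, 4, 2, 2, 1, 163, 4, 2]

print(statistics.mean(counts))    # ~17, pulled up by state:kentucky
print(statistics.median(counts))  # 2
print(max(counts))                # 163
print(statistics.pstdev(counts))  # large: far from a uniform distribution
```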


Clickthrough Rates

Assumption

A greater clickthrough rate is indicative of higher user satisfaction

– T. Joachims et al., “Accurately interpreting clickthrough data as implicit feedback”, SIGIR ’05.

Calculated clickthrough rates for both the base query and its localized forms

Binary clickthrough function

Clickthrough rate for localized instances was 17% higher than for non-localized instances
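
A sketch of the binary clickthrough computation, assuming each log row records whether the instance received at least one click and whether it was explicitly localized (the row format is illustrative):

```python
def clickthrough_rate(instances):
    """Binary clickthrough: an instance contributes 1 if it received any click, else 0."""
    if not instances:
        return 0.0
    return sum(1 for inst in instances if inst["clicked"]) / len(instances)

# Hypothetical log rows for one base query.
log = [
    {"query": "italian restaurant",          "localized": False, "clicked": False},
    {"query": "italian restaurant",          "localized": False, "clicked": True},
    {"query": "malibu italian restaurant",   "localized": True,  "clicked": True},
    {"query": "pasadena italian restaurant", "localized": True,  "clicked": True},
]

base_ctr = clickthrough_rate([r for r in log if not r["localized"]])
local_ctr = clickthrough_rate([r for r in log if r["localized"]])
print(base_ctr, local_ctr)  # 0.5 1.0 for this toy log
```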


Our Approach

Identify candidate localizable queries

Select a set of relevant features

Train and evaluate supervised classifier performance


Classifier Training Data

Selected a random sample of 200 base queries generated by the tagging step

Filtered out base queries where

nL <= 1 (at most one distinct location modifier)

uq = 1 (only issued by a single user)

q = 0 (base form was never issued to the search engine)

From the remaining 102 queries

48 positive (localizable) examples

54 negative (non-localizable) examples
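
A sketch of this filtering step, assuming per-base aggregates with hypothetical field names nL (distinct location tags), uq (distinct users), and q (count of the bare base form):

```python
# Hypothetical per-base aggregates; the thresholds mirror the criteria above.
candidates = [
    {"base": "italian restaurant", "nL": 37, "uq": 120, "q": 800},
    {"base": "calories coffee",    "nL": 1,  "uq": 45,  "q": 300},  # dropped: nL <= 1
    {"base": "smith family blog",  "nL": 3,  "uq": 1,   "q": 12},   # dropped: uq == 1
    {"base": "xyzzy hotels",       "nL": 4,  "uq": 9,   "q": 0},    # dropped: q == 0
]

training_pool = [c for c in candidates if c["nL"] > 1 and c["uq"] > 1 and c["q"] > 0]
print([c["base"] for c in training_pool])  # ['italian restaurant']
```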


Evaluation Setup

Evaluated supervised classifiers on precision and recall using 10-fold cross validation

Precision: fraction of the queries classified as localizable that are truly localizable

Recall: percent of localizable queries identified

Focused attention on positive precision

False positives more harmful than false negatives

Recall scores account for manual filtering
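
A sketch of this evaluation setup with scikit-learn's 10-fold cross-validation, scoring precision and recall on the positive (localizable) class; the feature matrix and labels here are random placeholders, not the paper's data:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Placeholder features standing in for the localization ratio, distribution
# statistics, and clickthrough rates; labels: 1 = localizable, 0 = not.
rng = np.random.default_rng(0)
X = rng.random((102, 8))
y = rng.integers(0, 2, size=102)

scores = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X, y, cv=10,
    scoring={"precision": "precision", "recall": "recall"},
)
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```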


Individual Classifiers

Naïve Bayes

Gaussian assumption doesn’t hold for all features

– Kernel-based naïve Bayes classifier is used.

Decision Trees

Emphasized localization ratio, location distribution measures, and clickthrough rates


Classifier | Precision | Recall
Naïve Bayes | 64% | 43%
Decision Tree (Information Gain) | 67% | 57%
Decision Tree (Normalized Information Gain) | 64% | 56%
Decision Tree (Gini Coefficient) | 68% | 51%


Individual Classifiers

SVM (Support Vector Machine)

A set of related supervised learning methods used for classification and regression

Improvement over NB and DT, but opaque

Neural Network

Best individual classifier, but also opaque


Classifier | Precision | Recall
SVM | 75% | 62%
Neural Network | 85% | 52%


Ensemble Classifiers

Observation

False positive classifications didn’t fully overlap for individual classifiers

Combined DT, SVM, and NN using a majority voting scheme
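
A sketch of a majority-voting ensemble in scikit-learn, combining a decision tree, an SVM, and a small neural network as stand-ins for the classifiers above; the data is a random placeholder:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((102, 8))          # placeholder feature vectors
y = rng.integers(0, 2, size=102)  # placeholder localizable / non-localizable labels

# Hard voting: a base query is labeled localizable only when a majority of the
# three classifiers agree, which trades some recall for higher precision.
ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(random_state=0)),
        ("nn", MLPClassifier(max_iter=2000, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```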


Classifier | Precision | Recall
Combined | 94% | 46%


Conclusion

Method for classifying queries as localizable

Scalable, language independent tagging

Determined useful features for classification

Demonstrated that simple components can be combined into a highly accurate system

Exploited variation in classifiers by applying majority voting


Future Work

Optimize feature computation for real-time

Many features fit into the MapReduce framework

Investigate using dynamic features

Updating classifier models

Explicit feedback loops

Generalize definition of “location”

Landmarks, relative locations, GPS

Integration with search system


Discussion

Pros

Addresses an interesting problem that is helpful for web search

Good performance

Cons

Some of the content is hard to follow

– One of the equations is omitted

– The terms are not explained

No explanation of why ‘localizable’ is treated as the positive class

False positives remain a concern


