Semantic Search keynote at CORIA 2015

Date post: 18-Jul-2015
Upload: peter-mika
Semantic Search: from document retrieval to Virtual Assistants PRESENTED BY Peter Mika, Director of Research, Yahoo Labs March 20, 2015
Transcript
Page 1: Semantic Search keynote at CORIA 2015

Semantic Search: from document retrieval to Virtual Assistants

PRESENTED BY Peter Mika, Director of Research, Yahoo Labs ⎪ March 20, 2015

Page 2: Semantic Search keynote at CORIA 2015

The Semantic Web (2001-)


Part of Tim Berners-Lee’s original proposal for the Web

Beginning of a research community

› Formal ontology

› Logical reasoning

› Agents, web services

Rough start in deployment

› Misplaced expectations

› Lack of adoption

Page 3: Semantic Search keynote at CORIA 2015

The Semantic Web, May 2001

“At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules.”

(The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.)


Misplaced expectations?

Page 4: Semantic Search keynote at CORIA 2015

Lack of adoption

Standardization ahead of adoption

› URI, RDF, RDF/XML, RDFa, JSON-LD, OWL, RIF, SPARQL, OWL-S, POWDER …

Chicken and egg problem

› No users/use cases, hence no data

› No data, hence no users/use cases

By 2007, some modest progress

› Metadata in HTML: microformats

› Linked Data: simplifying the stack

Page 5: Semantic Search keynote at CORIA 2015

Web search by 2007


Large classes of queries are solved to perfection

Improvements in web search are harder and harder to come by

› Relevance models, hyperlink structure and interaction data

› Combination of features using machine learning

› Heavy investment in computational power

• real-time indexing, instant search, datacenters and edge services

Page 6: Semantic Search keynote at CORIA 2015

Language issues

› Multiple interpretations

• jaguar

• paris hilton

› Secondary meaning

• george bush (and I mean the beer brewer in Arizona)

› Subjectivity

• reliable digital camera

• paris hilton sexy

› Imprecise or overly precise searches

• jim hendler

Complex needs

› Missing information

• brad pitt zombie

• florida man with 115 guns

• 35 year old computer scientist living in barcelona

› Category queries

• countries in africa

• barcelona nightlife

› Transactional or computational queries

• 120 dollars in euros

• digital camera under 300 dollars

• world temperature in 2020

Poorly solved information needs remain

Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.

Page 7: Semantic Search keynote at CORIA 2015

Web search by 2007


Are there even any true keyword queries?

› Lyrics, quotes and bugs… anything else?

Remaining challenges are not computational, but in modeling user cognition

› Need a deeper understanding of the query, the content and/or the world at large

Page 8: Semantic Search keynote at CORIA 2015

Microsearch internal prototype (2007)

[Screenshot annotations]

› Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)

› Conferences he plans to attend and his vacations from homepage, plus bio events from LinkedIn

› Geolocation

Page 9: Semantic Search keynote at CORIA 2015

Enhanced Results

Computing abstracts is hard

› Summarization of HTML

• Template detection

• Selecting relevant snippets

• Composing readable text

› Efficiency constraints

Structured data to replace or complement text summary

› Key/value pairs

› Deep links

› Image or Video

Page 10: Semantic Search keynote at CORIA 2015

Yahoo SearchMonkey (2008)

1. Extract structured data

› Semantic Web markup

• Example:

<span property="vcard:city">Santa Clara</span>

<span property="vcard:region">CA</span>

› Information Extraction

2. Presentation

› Fixed presentation templates

• One template per object type

› Applications

• Third-party modules to display data (SearchMonkey)
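
The extraction step for embedded markup can be sketched in a few lines. This is a minimal illustration, not SearchMonkey's actual pipeline: it uses Python's standard `html.parser` and assumes flat, non-nested elements that carry a `property` attribute, as in the vCard example on this slide. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements with a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name of the element currently open

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop is not None:
            self._current = prop

    def handle_data(self, data):
        # Attach the first non-empty text chunk to the open property.
        if self._current is not None and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


def extract_properties(html):
    """Return the key/value pairs found in an HTML fragment."""
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


snippet = (
    '<span property="vcard:city">Santa Clara</span>'
    '<span property="vcard:region">CA</span>'
)
pairs = extract_properties(snippet)
```

The resulting key/value pairs are the kind of structured data that a fixed presentation template could then render next to the text summary.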

Page 11: Semantic Search keynote at CORIA 2015

Effectiveness of enhanced results

Explicit user feedback

› Side-by-side editorial evaluation (A/B testing)

• Editors are shown a traditional search result and enhanced result for the same page

• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)

Implicit user feedback

› Click-through rate analysis

• Long dwell time limit of 100s (Ciemiewicz et al. 2010)

• 15% increase in ‘good’ clicks

› User interaction model

• Enhanced results lead users to relevant documents (IV) even though they are less likely to be clicked than textual results (III)

• Enhanced results effectively reduce bad clicks!

See

› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
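
The click-through analysis can be illustrated with a small sketch. It assumes that a 'good' click is one whose dwell time reaches the 100 s limit mentioned above; the exact definition used in the cited work may differ, and the dwell times below are made-up illustrative numbers, not data from the study.

```python
LONG_DWELL_SECONDS = 100  # dwell time limit quoted on the slide


def good_click_rate(dwell_times):
    """Fraction of clicks whose dwell time reaches the long-dwell limit."""
    if not dwell_times:
        return 0.0
    good = sum(1 for seconds in dwell_times if seconds >= LONG_DWELL_SECONDS)
    return good / len(dwell_times)


def relative_increase(baseline, treatment):
    """Relative change of `treatment` over `baseline` (0.15 means +15%)."""
    return (treatment - baseline) / baseline


# Hypothetical dwell times (seconds) for clicks on each result style.
textual_clicks = [5, 120, 30, 200]
enhanced_clicks = [150, 120, 30, 200]

textual_rate = good_click_rate(textual_clicks)
enhanced_rate = good_click_rate(enhanced_clicks)
lift = relative_increase(textual_rate, enhanced_rate)
```

Comparing the two rates in this way is what a statement such as "15% increase in 'good' clicks" summarizes.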

Page 12: Semantic Search keynote at CORIA 2015

Adoption among consumers of web content

Google announces Rich Snippets - June, 2009

› Faceted search for recipes - Feb, 2011

Bing tiles – Feb, 2011

Facebook’s Like button and the Open Graph Protocol (2010)

› Shows up in profiles and news feed

› Site owners can later reach users who have liked an object

Page 13: Semantic Search keynote at CORIA 2015

schema.org

Agreement on a shared set of schemas for common types of web content

› Bing, Google, and Yahoo! as initial founders (June, 2011)

• Yandex joins schema.org in Nov, 2011

› Similar in intent to sitemaps.org

• Use a single format to communicate the same information to all three search engines

schema.org covers areas of interest to all search engines

› Business listings (local), creative works (video), recipes, reviews and more

› Microdata, RDFa, JSON-LD syntax

Collaborative effort

› Growing number of 3rd party contributions

› schema.org discussions at [email protected]
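
As an illustration of the JSON-LD syntax, the sketch below builds schema.org markup for a recipe and wraps it in the script tag that a publisher would embed in a page. The property names (`recipeIngredient`, `aggregateRating`, `ratingValue`, `reviewCount`) are schema.org terms; the helper functions and the sample values are hypothetical.

```python
import json

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def recipe_jsonld(name, ingredients, rating_value, review_count):
    """Describe a recipe with schema.org vocabulary as a JSON-LD object."""
    return {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": name,
        "recipeIngredient": list(ingredients),
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }


def as_script_tag(data):
    """Serialize a JSON-LD object into an embeddable script element."""
    return SCRIPT_OPEN + json.dumps(data, indent=2) + SCRIPT_CLOSE


markup = recipe_jsonld("Pancakes", ["flour", "milk", "eggs"], 4.5, 12)
tag = as_script_tag(markup)
```

Because every participating search engine reads the same vocabulary, a single block like this serves all of them.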

Page 14: Semantic Search keynote at CORIA 2015

Adoption among publishers of content

R.V. Guha: Light at the end of the tunnel (ISWC 2013 keynote)

› Over 15% of all pages now have schema.org markup

› Over 5 million sites, over 25 billion entity references

› In other words

• Same order of magnitude as the web

See also

› P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012

• Based on Bing US corpus

• 31% of webpages, 5% of domains contain some metadata

› WebDataCommons

• Based on CommonCrawl Nov 2013

• 26% of webpages, 14% of domains contain some metadata

Page 15: Semantic Search keynote at CORIA 2015

Semantic Search at Yahoo


Page 16: Semantic Search keynote at CORIA 2015

Yahoo’s Knowledge Graph

[Figure: a fragment of the knowledge graph]

Entities: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, 10% off tickets, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers

Relations: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by

Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
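
A knowledge graph of this kind can be sketched as a set of subject/predicate/object triples. The edges below are one reading of the figure on this slide, so their direction is an assumption; the index layout and function names are hypothetical and say nothing about how the Yahoo Knowledge Base is actually stored.

```python
from collections import defaultdict

# Edges as (subject, predicate, object); directions are assumed from the figure.
TRIPLES = [
    ("Carlos Zambrano", "plays for", "Chicago Cubs"),
    ("Chicago Cubs", "plays in", "Chicago"),
    ("Barack Obama", "lives in", "Chicago"),
    ("Brad Pitt", "partner", "Angelina Jolie"),
    ("Steven Soderbergh", "directs", "Ocean's Twelve"),
    ("Brad Pitt", "casts in", "Ocean's Twelve"),
    ("George Clooney", "casts in", "Ocean's Twelve"),
    ("Brad Pitt", "casts in", "Fight Club"),
    ("Fight Club", "music by", "Dust Brothers"),
]


def build_indexes(triples):
    """Index triples for lookups by (subject, predicate) and (predicate, object)."""
    by_subject = defaultdict(set)
    by_object = defaultdict(set)
    for subject, predicate, obj in triples:
        by_subject[(subject, predicate)].add(obj)
        by_object[(predicate, obj)].add(subject)
    return by_subject, by_object


BY_SUBJECT, BY_OBJECT = build_indexes(TRIPLES)


def co_stars(actor):
    """Entities that appear in at least one film together with `actor`."""
    films = BY_SUBJECT.get((actor, "casts in"), set())
    others = set()
    for film in films:
        others |= BY_OBJECT.get(("casts in", film), set())
    return others - {actor}


brad_pitt_co_stars = co_stars("Brad Pitt")
```

Traversals like `co_stars` are the basis for features such as related-entity suggestions.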

Page 17: Semantic Search keynote at CORIA 2015

Information extraction and reconciliation

Information extraction

› Automated information extraction

• e.g. wrapper induction

› Metadata from HTML pages

• Focused crawler

› Public datasets (e.g. DBpedia)

› Proprietary data

Data fusion

› Manual mapping from the source schemas to the ontology

› Supervised entity reconciliation

• Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013

• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012

Ontology management

› Editorially maintained OWL ontology with 300+ classes

› Covering the domains of interest of Yahoo

Curation and quality assessment

› Editors and user feedback still play a large role

Page 18: Semantic Search keynote at CORIA 2015

Semantic Search

Active research field at the intersection of IR, NLP, DB and SemWeb

› ESAIR at SIGIR, SemSearch at ESWC/WWW, EOS and JIWES at SIGIR, Semantic Search at VLDB

Exploiting semantic understanding in the retrieval process

› User intent and resources are represented using semantic models

• Not just symbolic representations

› Semantic models are exploited in the matching and ranking of resources

Tasks

› information extraction

› information reconciliation/tracking

› query understanding

› retrieving/ranking entities/attributes/relations

› result presentation

Page 19: Semantic Search keynote at CORIA 2015

Semantic Search – a process view

[Figure: process diagram]

Query Construction

• Keywords

• Forms

• NL

• Formal language

Query Processing

• IR-style matching & ranking

• DB-style precise matching

• KB-style matching & inferences

Result Presentation

• Query visualization

• Document and data presentation

• Summarization

Query Refinement

• Implicit feedback

• Explicit feedback

• Incentives

Document Representation

Knowledge Representation

Semantic Models

Resources

Documents

Page 20: Semantic Search keynote at CORIA 2015

Semantic understanding


Documents

› Text in general

• Exploiting natural language structure and semantic coherence

› Specific to the Web

• Exploiting structure of web pages, e.g. annotation of web tables

Queries

› Short text and no structure… nothing to do?

Page 21: Semantic Search keynote at CORIA 2015

Semantic understanding of queries


Entities play an important role

› [Pound et al, WWW 2010], [Lin et al WWW 2012]

› ~70% of queries contain a named entity (entity mention queries)

• brad pitt height

› ~50% of queries have an entity focus (entity seeking queries)

• brad pitt attacked by fans

› ~10% of queries are looking for a class of entities

• brad pitt movies

Entity mention query = <entity> {+ <intent>}

› Intent is typically an additional word or phrase to

• Disambiguate, most often by type e.g. brad pitt actor

• Specify action or aspect e.g. brad pitt net worth, toy story trailer
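
The decomposition of an entity mention query into entity and intent can be sketched with a dictionary lookup. This is a toy illustration under strong assumptions: the entity appears at the start of the query, and a small hand-written dictionary (`ENTITY_TYPES`, hypothetical) stands in for a real knowledge base and entity linker.

```python
# Hypothetical entity dictionary mapping surface forms to types.
ENTITY_TYPES = {
    "brad pitt": "Actor",
    "toy story": "Movie",
    "moneyball": "Movie",
}


def parse_query(query, entities=ENTITY_TYPES):
    """Split a query into a leading entity and the remaining intent words.

    Tries the longest prefix first, so multi-word entity names win over
    shorter ones. Returns None when no known entity starts the query.
    """
    tokens = query.lower().split()
    for end in range(len(tokens), 0, -1):
        candidate = " ".join(tokens[:end])
        if candidate in entities:
            intent = " ".join(tokens[end:]) or None
            return {
                "entity": candidate,
                "type": entities[candidate],
                "intent": intent,
            }
    return None


net_worth = parse_query("brad pitt net worth")
trailer = parse_query("toy story trailer")
```

A production system would also handle misspellings, entities that occur mid-query, and ambiguous names, none of which this sketch attempts.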

Page 22: Semantic Search keynote at CORIA 2015

Entities and Intents


Query: moneyball trailer

› moneyball: object of the query (entity), of type Movie

› trailer: what the user wants to do with it (intent)

Page 23: Semantic Search keynote at CORIA 2015

oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org

Annotation over sessions

Sports team

Movie

Actor

Page 24: Semantic Search keynote at CORIA 2015

Common tasks in Semantic Search

› entity retrieval

• entity search: SemSearch 2010/11

• list search: SemSearch 2011

• list completion: TREC ELC task

• related entity finding: TREC REF-LOD task

› question-answering: QALD 2012/13/14

› document retrieval: e.g. Dalton et al SIGIR 2014

Page 25: Semantic Search keynote at CORIA 2015

Application: entity displays in web search

Entity-seeking queries make up 40-50% of the query volume

› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780

› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598

Show a summary of the most likely information-needs

› Including related entities for navigation

› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013

Page 26: Semantic Search keynote at CORIA 2015

Application: personalization in online news

Entity linking

Entity ranking according to relevance to the document

Page 27: Semantic Search keynote at CORIA 2015

New applications

Page 28: Semantic Search keynote at CORIA 2015

Mobile search on the rise

Information access on-the-go requires hands-free operation

› Driving, walking, gym, etc.

• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]

~50% of queries are coming from mobile devices (and growing)

› Changing habits, e.g. iPad usage peaks before bedtime

› Limitations in input/output

[1] http://answers.google.com/answers/threadview?id=392456

[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622

Page 29: Semantic Search keynote at CORIA 2015

Mobile search challenges and opportunities


Interaction

› Question-answering

› Support for interactive retrieval

› Spoken-language access

› Task completion

Contextualization

› Personalization

› Geo

› Context (work/home/travel)

• Try getaviate.com

Page 30: Semantic Search keynote at CORIA 2015

Interactive, conversational voice search

Parlance EU project

› Complex dialogs within a domain

• Requires complete semantic understanding

Complete system (mixed license)

› Automated Speech Recognition (ASR)

› Spoken Language Understanding (SLU)

› Interaction Management

› Knowledge Base

› Natural Language Generation (NLG)

› Text-to-Speech (TTS)

Video

Page 31: Semantic Search keynote at CORIA 2015

Task completion


We would like to help our users in task completion

› But we have trained our users to talk in nouns

• Retrieval performance decreases by adding verbs to queries

› We need to understand what the available actions are

Modeling actions

› Understand what actions can be taken on a page

› Help users in mapping their query to potential actions

› Applications in web search, email etc.

Schema.org v1.2, including Actions, published April 16, 2014
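
To make "what actions can be taken on a page" concrete, the sketch below reads schema.org Actions from an item description. `potentialAction`, `WatchAction` and `target` are schema.org terms; the helper function, the sample item and the example.com URL are hypothetical.

```python
def available_actions(item):
    """List (action type, target) pairs advertised by a schema.org item."""
    actions = item.get("potentialAction", [])
    if isinstance(actions, dict):  # a single action may be given without a list
        actions = [actions]
    return [(action.get("@type"), action.get("target")) for action in actions]


# Hypothetical markup for a movie page that offers a way to watch the film.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Moneyball",
    "potentialAction": {
        "@type": "WatchAction",
        "target": "https://example.com/watch/moneyball",
    },
}

movie_actions = available_actions(movie)
```

A search engine or email client that knows the available actions can then offer them directly, for example as a button next to a result.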

Page 32: Semantic Search keynote at CORIA 2015

Applications

› Email (Gmail)

› SERP (Yandex)

Page 33: Semantic Search keynote at CORIA 2015

Q&A

Many thanks to members of the Semantic Search team at Yahoo Labs Barcelona and to Yahoos around the world

Contact me

[email protected]

› @pmika

› http://www.slideshare.net/pmika/

