Semantic Search overview at SSSW 2012

Date post: 06-May-2015
Upload: peter-mika
Description:
Presentation at the Summer School on the Semantic Web, 2012 edition: http://sssw.org/2012/
Semantic Search Peter Mika Senior Research Scientist Yahoo! Research With contributions from Thanh Tran (KIT)
Transcript
Page 1: Semantic Search overview at SSSW 2012

Semantic Search

Peter Mika

Senior Research Scientist

Yahoo! Research

With contributions from Thanh Tran (KIT)

Page 2: Semantic Search overview at SSSW 2012

- 2 -

Yahoo! serves over 700 million users in 25 countries

Page 3: Semantic Search overview at SSSW 2012

- 3 -

Yahoo! Research: visit us at research.yahoo.com

Page 4: Semantic Search overview at SSSW 2012

- 4 -

Yahoo! Research Barcelona

• Established January, 2006

• Led by Ricardo Baeza-Yates

• Research areas

– Web Mining

• content, structure, usage

– Social Media

– Distributed Systems

– Semantic Search

Page 5: Semantic Search overview at SSSW 2012

- 5 -

Search is really fast, without necessarily being intelligent

Page 6: Semantic Search overview at SSSW 2012

- 6 -

Why Semantic Search? Part I

• Improvements in IR are harder and harder to come by

– Machine learning using hundreds of features

• Text-based features for matching

• Graph-based features provide authority

– Heavy investment in computational power, e.g. real-time indexing and instant search

• Remaining challenges are not computational, but in modeling user cognition

– Need a deeper understanding of the query, the content and/or the world at large

– Could Watson explain why the answer is Toronto?

Page 7: Semantic Search overview at SSSW 2012

- 7 -

Poorly solved information needs

• Multiple interpretations

– paris hilton

• Long tail queries

– george bush (and I mean the beer brewer in Arizona)

• Multimedia search

– paris hilton sexy

• Imprecise or overly precise searches

– jim hendler

– pictures of strong adventures people

• Searches for descriptions

– countries in africa

– 32 year old computer scientist living in barcelona

– reliable digital camera under 300 dollars

Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.

Page 8: Semantic Search overview at SSSW 2012

- 8 -

Example: multiple interpretations

Page 9: Semantic Search overview at SSSW 2012

- 9 -

Why Semantic Search? Part II

• The Semantic Web is now a reality

– Large amounts of RDF data

– Heterogeneous schemas, quality

– Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain

• Searching data instead of, or in addition to, searching documents

– Direct answers

– Novel search tasks

Page 10: Semantic Search overview at SSSW 2012

- 10 -

Example: direct answers in search

[Screenshot callouts]

• Information box with content from and links to Yahoo! Travel

• Points of interest in Vienna, Austria

• Since August 2010, ‘regular’ search results are ‘Powered by Bing’

• Faceted search for Shopping results

• Information from the Knowledge Graph

Page 11: Semantic Search overview at SSSW 2012

- 11 -

Novel search tasks

• Aggregation of search results

– e.g. price comparison across websites

• Analysis and prediction

– e.g. world temperature by 2020

• Semantic profiling

– Ontology-based modeling of user interests

• Semantic log analysis

– Linking query and navigation logs to ontologies

• Support for complex tasks (search apps)

– e.g. booking a vacation using a combination of services

Page 12: Semantic Search overview at SSSW 2012

- 13 -

Interactive search and task completion

Page 13: Semantic Search overview at SSSW 2012

- 14 -

Why Semantic Search? Part III

• There is a use case

– Consumers want to understand content

– Publishers want consumers to understand their content

• Semantic Web standards seem to be a good fit

http://en.wikipedia.org/wiki/Underpants_Gnomes

Page 14: Semantic Search overview at SSSW 2012

- 16 -

Example: Facebook’s Like and the Open Graph Protocol

• The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities

– Shows up in profiles and news feed

– Site owners can later reach users who have liked an object

– Facebook Graph API allows 3rd party developers to access the data

• Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’

Page 15: Semantic Search overview at SSSW 2012

- 17 -

Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa

– Simplify the work of developers by restricting the freedom in RDFa

• Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment

• Only HTML <head> accepted

• http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
…
</head>
...
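
To illustrate how a consumer might read this markup, here is a minimal sketch that collects the og: properties from a page's meta tags using Python's standard html.parser module. The class name OpenGraphParser, the helper extract_open_graph and the shortened PAGE string are assumptions made for this example; a real RDFa processor does considerably more (prefix resolution, typing, error handling).

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


# Shortened copy of the slide's example page (illustrative input only).
PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""


def extract_open_graph(html):
    """Return a dict mapping og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


og = extract_open_graph(PAGE)
print(og["og:title"], og["og:type"])
```

Because the protocol restricts the markup to meta tags in the HTML head, a simple tag scanner like this is enough for a first approximation, which is the simplification the slide refers to.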

Page 16: Semantic Search overview at SSSW 2012

- 18 -

Example: schema.org

• Agreement on a shared set of schemas for common types of web content

– Bing, Google, and Yahoo! as initial supporters

– Similar in intent to sitemaps.org (2006)

• Use a single format to communicate the same information to all three search engines

• Support for microdata

• schema.org covers areas of interest to all search engines

– Business listings (local), creative works (video), recipes, reviews

– User defined extensions

• Each search engine continues to develop its products

Page 17: Semantic Search overview at SSSW 2012

- 19 -

Documentation and OWL ontology

Page 18: Semantic Search overview at SSSW 2012

- 20 -

Current state of metadata on the Web

• 31% of webpages, 5% of domains contain some metadata

– Analysis of the Bing Crawl (US crawl, January, 2012)

– RDFa is the most common format

• By URL: 25% RDFa, 7% microdata, 9% microformat

• By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat

– Adoption is stronger among large publishers

• Especially for RDFa and microdata

• See also

– P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012

– H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012

Page 19: Semantic Search overview at SSSW 2012

- 21 -

Exponential growth in RDFa data

Percentage of URLs with embedded metadata in various formats

Five-fold increase between March 2009 and October 2010

Another five-fold increase between October 2010 and January 2012

Page 20: Semantic Search overview at SSSW 2012

Semantic Search

Page 21: Semantic Search overview at SSSW 2012

- 23 -

Semantic Search: a definition

• Semantic search is a retrieval paradigm that

– Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content

– Exploits this understanding at some part of the search process

• Web search vs. vertical/enterprise/desktop search

• Related fields:

– XML retrieval

– Keyword search in databases

– Natural Language Retrieval

Page 22: Semantic Search overview at SSSW 2012

- 24 -

Semantics at every step of the IR process

[Figure: the IR process between the Web and the IR engine. Labels: The Web, The IR engine, Crawling and indexing, Indexing, Ranking θ(q,d), Query interpretation, Result presentation]

Page 23: Semantic Search overview at SSSW 2012

Crawling and Indexing

Page 24: Semantic Search overview at SSSW 2012

- 26 -

Data on the Web

• Most web pages on the Web are generated from structured data

– Data is stored in relational databases (typically)

– Queried through web forms

– Presented as tables or simply as unstructured text

• The structure and semantics (meaning) of the data is not directly accessible to search engines

• Two solutions

– Extraction using Information Extraction (IE) techniques (implicit metadata)

• Supervised vs. unsupervised methods

– Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)

• Particularly interesting for long tail content

Page 25: Semantic Search overview at SSSW 2012

- 27 -

Information Extraction methods

• Natural Language Processing

• Extraction of triples

– Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.

– Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.

• Filling web forms automatically (form-filling)

– Madhavan et al. Google's Deep-Web Crawl. VLDB 2008

• Extraction from HTML tables

– Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008

• Wrapper induction

– Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997

Page 26: Semantic Search overview at SSSW 2012

- 28 -

Semantic Web

• Sharing data across the Web

– Publish information in standard formats (RDF, RDFa)

– Share the meaning using powerful, logic-based languages (OWL, RIF)

– Query using standard languages and protocols (HTTP, SPARQL)

• Two main forms of publishing

– Linked Data

• Data published as RDF documents linked to other RDF documents and/or using SPARQL end-points

• Community effort to re-publish large public datasets (e.g. DBpedia, open government data)

– RDFa

• Data embedded inside HTML pages

• Recommended for site owners by Yahoo, Google, Facebook

Page 27: Semantic Search overview at SSSW 2012

- 29 -

Crawling the Semantic Web

• Linked Data

– Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other formats) to extract URIs to be crawled

– Semantic Sitemap/VOID descriptions

• RDFa

– Same as HTML crawling, but data is extracted after crawling

– Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.

• SPARQL endpoints

– Endpoints are not linked, need to be discovered by other means

– Semantic Sitemap/VOID descriptions

Page 28: Semantic Search overview at SSSW 2012

- 30 -

Data fusion

• Ontology matching

– Widely studied in Semantic Web research, see e.g. the list of publications at ontologymatching.org

• Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies

• Entity resolution

– Logic-based approaches in the Semantic Web

– Studied as record linkage in the database literature

• Machine learning based approaches, focusing on attributes

– Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data

• Improvements over only attribute-based matching

• Blending

– Merging objects that represent the same real world entity and reconciling information from multiple sources

Page 29: Semantic Search overview at SSSW 2012

- 31 -

Data quality assessment and curation

• Heterogeneity, quality of data is an even larger issue

– Quality ranges from well-curated data sets (e.g. Freebase) to microformats

• In the worst of cases, the data becomes a graph of words

– Short amounts of text: prone to mistakes in data entry or extraction

• Example: mistake in a phone number or state code

• Quality assessment and data curation

– Quality varies from data created by experts to user-generated content

– Automated data validation

• Against known-good data or using triangulation

• Validation against the ontology or using probabilistic models

– Data validation by trained professionals or crowdsourcing

• Sampling data for evaluation

– Curation based on user feedback

Page 30: Semantic Search overview at SSSW 2012

- 32 -

Indexing

• Search requires matching and ranking

– Matching selects a subset of the elements to be scored

• The goal of indexing is to speed up matching

– Retrieval needs to be performed in milliseconds

– Without an index, retrieval would require streaming through the collection

• The type of index depends on the query model to support

– DB-style indexing

– IR-style indexing

Page 31: Semantic Search overview at SSSW 2012

- 33 -

IR-style indexing

• Index data as text

– Create virtual documents from data

– One virtual document per subgraph, resource or triple

• typically: resource

• Key differences to Text Retrieval

– RDF data is structured

– Minimally, queries on property values are required

Page 32: Semantic Search overview at SSSW 2012

- 34 -

Horizontal index structure

• Two fields (indices): one for terms, one for properties

• For each term, store the property on the same position in the property index

– Positions are required even without phrase queries

• Query engine needs to support the alignment operator

• ✓ Dictionary is number of unique terms + number of properties

• Occurrences is number of tokens * 2

Page 33: Semantic Search overview at SSSW 2012

- 35 -

Vertical index structure

• One field (index) per property

• Positions are not required

– But useful for phrase queries

• Query engine needs to support fields

• Dictionary is number of unique terms

• Occurrences is number of tokens

• ✗ Number of fields is a problem for merging, query performance
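
The two layouts can be contrasted with a small in-memory sketch. It is only an illustration under simplifying assumptions: the DOCS data, the function names and the dictionary-based "indices" are invented for this example, and a real engine would use compressed posting lists rather than Python sets. The horizontal variant keeps two position-aligned fields and answers a property:term query with an alignment check; the vertical variant keeps one inverted index per property.

```python
from collections import defaultdict

# Hypothetical resource descriptions: doc id -> list of (property, value) pairs.
DOCS = {
    "d1": [("name", "peter mika"), ("affiliation", "yahoo research")],
    "d2": [("name", "yahoo research"), ("location", "barcelona")],
}


def build_horizontal(docs):
    """Two aligned fields: position p in both fields describes the same token."""
    token_idx = defaultdict(lambda: defaultdict(set))  # term -> doc -> positions
    prop_idx = defaultdict(lambda: defaultdict(set))   # property -> doc -> positions
    for doc, pairs in docs.items():
        pos = 0
        for prop, value in pairs:
            for term in value.split():
                token_idx[term][doc].add(pos)
                prop_idx[prop][doc].add(pos)
                pos += 1
    return token_idx, prop_idx


def search_horizontal(index, prop, term):
    """Alignment operator: term and property must occur at the same position."""
    token_idx, prop_idx = index
    hits = set()
    for doc, positions in token_idx.get(term, {}).items():
        if positions & prop_idx.get(prop, {}).get(doc, set()):
            hits.add(doc)
    return hits


def build_vertical(docs):
    """One inverted index (field) per property; no positions are stored."""
    fields = defaultdict(lambda: defaultdict(set))  # property -> term -> docs
    for doc, pairs in docs.items():
        for prop, value in pairs:
            for term in value.split():
                fields[prop][term].add(doc)
    return fields


def search_vertical(fields, prop, term):
    """A fielded lookup: pick the property's index, then the term's postings."""
    return set(fields.get(prop, {}).get(term, set()))


horizontal = build_horizontal(DOCS)
vertical = build_vertical(DOCS)
print(search_horizontal(horizontal, "name", "yahoo"))
print(search_vertical(vertical, "affiliation", "yahoo"))
```

The sketch mirrors the trade-off on the slides: the horizontal layout stores each token twice and needs positions, while the vertical layout creates as many fields as there are properties.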

Page 34: Semantic Search overview at SSSW 2012

- 36 -

Distributed indexing

• MapReduce is ideal for building inverted indices

– Map creates (term, {doc1}) pairs

– Reduce collects all docs for the same term: (term, {doc1, doc2…})

– Sub-indices are merged separately

• Term-partitioned indices

• Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
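
The map and reduce steps above can be sketched in a single process as follows. This is not the implementation from the cited paper; the function names, the shuffle helper (which stands in for the grouping a MapReduce framework performs between the two phases) and the COLLECTION data are assumptions for illustration, and the merging of sub-indices is not shown.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def shuffle(pairs):
    """Group mapper output by term, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for term, doc_id in pairs:
        groups[term].append(doc_id)
    return groups


def reduce_phase(term, doc_ids):
    """Reduce: collect all documents for the same term into one posting list."""
    return term, sorted(set(doc_ids))


def build_index(collection):
    """Run map, shuffle and reduce sequentially over a small collection."""
    pairs = [
        pair
        for doc_id, text in collection.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(term, docs) for term, docs in shuffle(pairs).items())


# Hypothetical three-document collection.
COLLECTION = {
    "doc1": "semantic search",
    "doc2": "semantic web",
    "doc3": "web search engine",
}
index = build_index(COLLECTION)
print(index["semantic"])
```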

Page 35: Semantic Search overview at SSSW 2012

Query Processing

Page 36: Semantic Search overview at SSSW 2012

- 38 -

What is search?

• The search problem

– A data collection consisting of a set of items (units of retrieval)

– Information needs expressed as queries

– Ambiguity in the interpretation of the data and/or the queries

• Search is the task of efficiently finding items that are relevant to the information need

– Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)!

Page 37: Semantic Search overview at SSSW 2012

- 40 -

Types of data models (1)

• Textual

– Bag-of-words

– Represent documents, text in structured data,…, real-world objects (captured as structured data)

– Lacks “structure”

• Text structure, e.g. linguistic structure, outlines, hyperlinks etc.

• Structure in structured data representation

Example text: “In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.”

Bag of terms (with term statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, …

Page 38: Semantic Search overview at SSSW 2012

- 41 -

Types of data models (2)

• Graph structure

– Relationships in the data

• Hyperlinks

• Typed relationships

– Ontology

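[Figure: example graph with the instance Bob and the ontology classes Person and Picture, connected by the typed relationship creator]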

Page 39: Semantic Search overview at SSSW 2012

- 42 -

Types of data models (3)

• Hybrid

– RDF data embedded in text (RDFa)

Page 40: Semantic Search overview at SSSW 2012

- 43 -

Formalisms for querying semantic data (1)

Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”

Page 41: Semantic Search overview at SSSW 2012

- 44 -

Formalisms for querying semantic data (2)

• Unstructured

– NL

– Keywords: shared apartment Berlin Alice

Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”

Page 42: Semantic Search overview at SSSW 2012

- 45 -

Formalisms for querying semantic data (3)

• Fully-structured

– SPARQL: BGP, filter, optional, union, select, construct, ask, describe

PREFIX ns: <http://example.org/ns#>

SELECT ?x

WHERE { ?x ns:knows ?y. ?y ns:name "Alice".

?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }

Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”

Page 43: Semantic Search overview at SSSW 2012

- 46 -

Formalisms for querying semantic data (4)

• Hybrid: both content and structure constraints

?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

“shared apartment Berlin Alice”

Page 44: Semantic Search overview at SSSW 2012

- 47 -

Summary: data and queries in Semantic Search

Query

• Keywords

• NL Questions

• Form- / facet-based Inputs

• Structured Queries (SPARQL)

Data

• OWL ontologies with rich, formal semantics

• Structured RDF data

• Semi-Structured RDF data

• RDF data embedded in text (RDFa)

Ambiguities: confidence degree, truth/trust value…

Semantic Search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!

Page 45: Semantic Search overview at SSSW 2012

- 48 -

Processing hybrid graph patterns (1)

[Figure: example data graph combining text and RDF data]

Text attached to Alice: “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”

Node labels: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, sunset.jpg, “Beautiful Sunset”, “trouble with bob”, “Semantic Search”, 2009, 34

Edge labels: knows, works, creator, author, title, located, year, age

Hybrid query: ?y ns:name "Alice". ?x ns:knows ?y. “apartment shared Berlin Alice” ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”

Page 46: Semantic Search overview at SSSW 2012

- 49 -

Matching keyword query against text

• Retrieve documents

• Inverted list (inverted index)

– keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}

• AND-semantics: top-k join

[Figure: the inverted lists for “shared”, “berlin” and “alice” are joined to answer the query “shared Berlin Alice”; document D1 appears in each of the three lists]
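
A minimal sketch of AND-semantics over inverted lists follows. It is a simplification: instead of a true top-k join that can stop early, it intersects the lists completely and then keeps the k best documents by summed score. The INVERTED data, its scores and the function name and_query are invented for the example.

```python
import heapq

# Hypothetical inverted lists: keyword -> {doc: score}.
INVERTED = {
    "shared": {"D1": 0.4, "D2": 0.1, "D3": 0.3},
    "berlin": {"D1": 0.5, "D3": 0.2},
    "alice": {"D1": 0.6, "D3": 0.1, "D4": 0.9},
}


def and_query(inverted, keywords, k):
    """Return the k best documents that contain every keyword."""
    lists = [inverted.get(kw, {}) for kw in keywords]
    if not lists:
        return []
    # Intersect starting from the shortest list to keep the candidate set small.
    lists.sort(key=len)
    candidates = set(lists[0])
    for postings in lists[1:]:
        candidates.intersection_update(postings)
    # Score each surviving document by the sum of its per-keyword scores.
    scored = ((sum(p[doc] for p in lists), doc) for doc in candidates)
    return [(doc, round(score, 6)) for score, doc in heapq.nlargest(k, scored)]


print(and_query(INVERTED, ["shared", "berlin", "alice"], k=2))
```

D4 has the highest score for “alice” but is dropped because it does not occur in the other two lists, which is the effect of AND-semantics.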

Page 47: Semantic Search overview at SSSW 2012

- 50 -

Matching structured query against structured data

• Retrieve data for triple patterns

• Index on tables

• Multiple “redundant” indexes to cover different access patterns

• Join (conjunction of triples)

• Blocking, e.g. linear merge join (requires sorted input)

• Non-blocking, e.g. symmetric hash-join

• Materialized join indexes

[Figure: SP-index and PO-index lookups joined for the query ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"]

The patterns Per1 ns:works ?v and ?v ns:name "KIT" are matched by the triples Per1 ns:works Ins1 and Ins1 ns:name KIT, which join on ?v = Ins1.
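
The lookup-then-join idea can be sketched with plain dictionaries standing in for the redundant indexes. The TRIPLES data (apart from the Per1/Ins1/KIT triples shown on the slide), the extra predicate-only index and the function names are assumptions for illustration; a real triple store would use sorted or B+-tree indexes and a query optimizer to pick the join order.

```python
from collections import defaultdict

# Hypothetical data; only the Per1/Ins1/KIT triples come from the slide.
TRIPLES = [
    ("Per1", "ns:knows", "Per2"),
    ("Per1", "ns:works", "Ins1"),
    ("Per2", "ns:works", "Ins2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "FluidOps"),
]


def build_indexes(triples):
    """Build redundant indexes that cover different access patterns."""
    sp = defaultdict(list)  # (subject, predicate) -> objects
    po = defaultdict(list)  # (predicate, object) -> subjects
    p = defaultdict(list)   # predicate -> (subject, object) pairs
    for s, pred, o in triples:
        sp[(s, pred)].append(o)
        po[(pred, o)].append(s)
        p[pred].append((s, o))
    return sp, po, p


def people_working_at(indexes, org_name):
    """Answer ?z ns:works ?v . ?v ns:name org_name with a hash join on ?v."""
    _, po, p = indexes
    # PO-index lookup binds ?v for the selective pattern ?v ns:name org_name.
    orgs = set(po.get(("ns:name", org_name), []))
    # Probe the hash set with every (?z, ?v) binding of ?z ns:works ?v.
    return sorted(z for z, v in p.get("ns:works", []) if v in orgs)


def employer_names(indexes, person):
    """Answer person ns:works ?v . ?v ns:name ?n with two SP-index lookups."""
    sp, _, _ = indexes
    names = []
    for org in sp.get((person, "ns:works"), []):
        names.extend(sp.get((org, "ns:name"), []))
    return sorted(names)


indexes = build_indexes(TRIPLES)
print(people_working_at(indexes, "KIT"))
print(employer_names(indexes, "Per1"))
```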

Page 48: Semantic Search overview at SSSW 2012

- 51 -

Matching keyword query against structured data

• Retrieve keyword elements

• Using inverted index

– keyword → {<el1, score, ...>, <el2, score, ...>, …}

• Exploration / “Join”

• Data indexes for triple lookup

• Materialized index (paths up to graphs)

• Top-k Steiner tree search, top-k subgraph exploration

[Figure: the keywords Alice, Bob and KIT are matched to data elements and connected through the triples Alice ns:knows Bob, Bob ns:works Inst1 and Inst1 ns:name KIT]

Page 49: Semantic Search overview at SSSW 2012

- 52 -

Matching structured query against text

• Offline IE

• Online IE, i.e., “retrieve” is as follows

• Derive keywords to retrieve relevant documents

• On-the-fly information extraction, i.e., phrase pattern matching “X name Y”

• Retrieve extracted data for structured part

• Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.

Query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

[Figure: keywords derived from the query: name, knows, KIT]

Page 50: Semantic Search overview at SSSW 2012

- 53 -

Matching structured query against text

• Index

• Inverted index for document retrieval and pattern matching

• Join index: inverted index for storing materialized joins between keywords

• Neighborhood indexes for phrase patterns

Query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"

[Figure: index entries for the keywords KIT, name and knows]

Page 51: Semantic Search overview at SSSW 2012

- 54 -

Query processing – main tasks

• Retrieval

– Documents, data elements, triples, paths, graphs

– Inverted index,…, but also other (B+ tree)

– Index documents, triples, materialized paths

• Join

– Different join implementations, efficiency depends on availability of indexes

– Non-blocking join good for early result reporting and for “unpredictable” Linked Data / data streams scenario

Query

Data

Page 52: Semantic Search overview at SSSW 2012

- 55 -

Query processing – more tasks

• More complex queries: disjunction, aggregation, grouping, analytics…

• Join order optimization

• Approximate

– Approximate the search space

– Approximate the results (matching, join)

• Parallelization

• Top-k

– Use only some entries in the input streams to produce k results

• Multiple sources

– Federation, routing

– On-the-fly mapping, similarity join

• Hybrid

– Join text and data

Query

Data

Page 53: Semantic Search overview at SSSW 2012

Ranking

Page 54: Semantic Search overview at SSSW 2012

- 58 -

Ranking – problem definition

Query

Data

• Ambiguities arise when representation is incomplete / imprecise

• Ambiguities at the level of

– elements (content ambiguity)

– structure between elements (structure ambiguity)

Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notion of relevance.

Page 55: Semantic Search overview at SSSW 2012

- 59 -

Content ambiguity

[Figure: the example data graph and hybrid query from “Processing hybrid graph patterns (1)”, annotated with the questions below]

• What is meant by “Berlin” in the query? What is meant by “Berlin” in the data? A city with the name Berlin? A person?

• What is meant by “KIT” in the query? What is meant by “KIT” in the data? A research group? A university? A location?

Page 56: Semantic Search overview at SSSW 2012

- 60 -

Structure ambiguity

[Figure: the example data graph and hybrid query from “Processing hybrid graph patterns (1)”, annotated with the questions below]

• What is the connection between “Berlin” and “Alice”? Friend? Co-worker?

• What is meant by “works”? Works at? Employed?

Page 57: Semantic Search overview at SSSW 2012

- 61 -

Ambiguity

• Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches

– Syntactic, e.g. works vs. works at

– Semantic, e.g. works vs. employ

• “Aboutness”, i.e., contain some elements which represent the correct interpretation

– Ambiguities arise when matching elements of different granularities

– Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j?

– E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”

• Strictly speaking, ranking is performed after syntactic / semantic matching is done!

Page 58: Semantic Search overview at SSSW 2012

- 62 -

Features: What to use to deal with ambiguities?

What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”?

• Content features

– Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR)

– Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis)

• Structure features

– Consider relevance at level of fields

– Link-based popularity

Page 59: Semantic Search overview at SSSW 2012

- 63 -

Ranking paradigms

• Explicit relevance model

– Foundation: probability ranking principle

– Ranking results by the posterior probability (odds) of being observed in the relevant class:

– P(w|R) varies in different approaches

• binary independence model

• Two-Poisson model

• BM25

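$$P(D|R) = \prod_{w \in D} P(w|R) \prod_{w \notin D} \bigl(1 - P(w|R)\bigr)$$

and analogously for the non-relevant class N, using P(w|N). Results are ranked by the odds

$$\frac{P(D|R)}{P(D|N)}$$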
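
As one concrete instance of this family, here is a small sketch of BM25 scoring. The slides only name BM25; the exact variant below (the idf formula with the added 1 inside the logarithm, the defaults k1 = 1.2 and b = 0.75, whitespace tokenization) is a common textbook choice and an assumption of this example, as are the function name and the sample documents. It assumes a non-empty collection.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs (id -> text) against a keyword query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization dampens term frequency in long documents.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical mini-collection.
DOCS = {
    "d1": "berlin apartment berlin",
    "d2": "paris apartment",
    "d3": "semantic search",
}
scores = bm25_scores("berlin apartment", DOCS)
print(sorted(scores, key=scores.get, reverse=True))
```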

Page 60: Semantic Search overview at SSSW 2012

- 64 -

Ranking paradigms

• No explicit notion of relevance: similarity between the query and the document model

– Vector space model (cosine similarity)

– Language models (KL divergence)

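$$\mathrm{Sim}(q,d) = \mathrm{Cos}\bigl((w_{1,d},\dots,w_{t,d}),\,(w_{1,q},\dots,w_{t,q})\bigr)$$

$$\mathrm{Sim}(q,d) = -KL(\theta_q \,\|\, \theta_d) = -\sum_{t \in V} P(t|\theta_q)\,\log\frac{P(t|\theta_q)}{P(t|\theta_d)}$$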
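
Both similarity notions can be computed in a few lines. The sketch below represents term-weight vectors and language models as dictionaries from terms to numbers; the function names and the sample inputs are assumptions of this example. The KL-based score requires a smoothed document model, so that every term with non-zero query probability also has non-zero document probability.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neg_kl(p_q, p_d):
    """Negative KL divergence of the query model from the document model.

    Requires p_d[t] > 0 wherever p_q[t] > 0 (i.e. a smoothed document model).
    """
    return -sum(p * math.log(p / p_d[t]) for t, p in p_q.items() if p > 0)


# Hypothetical query and document representations.
query_vec = {"berlin": 1.0, "apartment": 1.0}
doc_vec = {"berlin": 2.0, "apartment": 2.0}
query_lm = {"berlin": 0.5, "apartment": 0.5}
doc_lm = {"berlin": 0.9, "apartment": 0.1}

print(cosine(query_vec, doc_vec))
print(neg_kl(query_lm, doc_lm))
```

A score of zero from neg_kl means the two models are identical; more negative values mean the document model is a worse fit for the query model.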

Page 61: Semantic Search overview at SSSW 2012

- 65 -

Model construction

• How to obtain

– Relevance models?

– Weights for query / document terms?

– Language models for document / queries?

Page 62: Semantic Search overview at SSSW 2012

- 66 -

Content-based features

• Document statistics, e.g.

– Term frequency

– Document length

• Collection statistics, e.g.

– Inverse document frequency

– Background language models

$$P(t|\theta_d) = \lambda\,\frac{tf}{|d|} + (1-\lambda)\,P(t|C)$$

$$w_{t,d} = \frac{tf}{|d|} \cdot idf$$

• An object is more likely about “Berlin” when

– it contains a relatively high number of mentions of the term “Berlin”

– the number of mentions of this term in the overall collection is relatively low
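
The two content-based quantities can be sketched as follows: a length-normalized tf-idf weight, and a document language model smoothed with the collection (background) model. The mixing weight lam, the logarithmic idf definition, the function names and the DOCS data are assumptions for this example rather than values taken from the slides.

```python
import math

# Hypothetical tokenized collection.
DOCS = {
    "d1": "berlin apartment in berlin".split(),
    "d2": "apartment in paris".split(),
    "d3": "semantic search in practice".split(),
}


def tf_idf(term, doc_id, docs):
    """Length-normalized term frequency times inverse document frequency."""
    tokens = docs[doc_id]
    df = sum(1 for doc_tokens in docs.values() if term in doc_tokens)
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    return tokens.count(term) / len(tokens) * idf


def smoothed_prob(term, doc_id, docs, lam=0.8):
    """Mix the document model with the background collection model."""
    tokens = docs[doc_id]
    collection = [t for doc_tokens in docs.values() for t in doc_tokens]
    p_doc = tokens.count(term) / len(tokens)
    p_coll = collection.count(term) / len(collection)
    return lam * p_doc + (1 - lam) * p_coll


print(tf_idf("berlin", "d1", DOCS), tf_idf("in", "d1", DOCS))
print(smoothed_prob("berlin", "d2", DOCS))
```

The term “in” occurs in every document, so its idf and therefore its weight is zero, while “berlin” is frequent in d1 and rare in the collection. Smoothing gives d2 a small non-zero probability for “berlin” even though the term does not occur in it.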

Page 63: Semantic Search overview at SSSW 2012

- 67 -

Structure-based features

• Consider structure of objects

– Content-based features for structured objects, documents and for general tuples

$$P(t|\theta_d) = \sum_{f \in F_d} \alpha_f\, P(t|\theta_f)$$

• An object is more likely about “Berlin” when

– one of its (important) fields contains a relatively high number of mentions of the term “Berlin”
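
A field-weighted mixture of this kind can be sketched in a few lines. The field names, the weights and the maximum-likelihood estimate per field are assumptions of this example; in practice the field weights would be tuned or learned and each field model would be smoothed.

```python
def field_mixture_prob(term, fields, weights):
    """Weighted sum of per-field maximum-likelihood term probabilities."""
    total = 0.0
    for name, text in fields.items():
        tokens = text.lower().split()
        if tokens:
            total += weights.get(name, 0.0) * tokens.count(term) / len(tokens)
    return total


# Hypothetical field weights: the title counts more than the body.
WEIGHTS = {"title": 0.7, "body": 0.3}

city = {"title": "Berlin", "body": "an apartment in berlin shared with alice"}
person = {"title": "Alice", "body": "berlin berlin"}

print(field_mixture_prob("berlin", city, WEIGHTS))
print(field_mixture_prob("berlin", person, WEIGHTS))
```

The object that mentions the term in its highly weighted title field scores higher than the one that mentions it only in the body, even though the body of the second object consists entirely of that term.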

Page 64: Semantic Search overview at SSSW 2012

- 68 -

Structure-based features (2)

• PageRank

– Link analysis algorithm

– Measuring relative importance of nodes

– Link counts as a vote of support

– The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)

• ObjectRank

– Types and semantics of links vary in structured data setting

– Authority transfer schema graph specifies connection strengths

– Recursively compute authority transfer data graph

An object about “Berlin” is more important than another when

• a relatively large number of objects are linked to it
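
The recursive definition of PageRank can be approximated by power iteration, sketched below. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without outgoing links and the LINKS graph are conventional or illustrative choices, not details from the slides. ObjectRank would additionally weight each link by the authority transfer rate of its edge type, which is not shown here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a graph given as node -> list of linked nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same base share from the random jump.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node splits its rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical link graph in which three objects link to "berlin".
LINKS = {"a": ["berlin"], "b": ["berlin"], "c": ["berlin", "a"], "berlin": []}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```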

Page 65: Semantic Search overview at SSSW 2012

- 69 -

In practice

• Many more aspects of relevance

– User profiles

– History

– Context, e.g. geo-location

– etc.

• Combination of features using Machine Learning

– Several hundred features in modern search engines

• Pre-compute static features such as PageRank/ObjectRank

• Two-phase scoring for efficiency

– Round 1: easy to compute features

– Round 2: more expensive features
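
Two-phase scoring can be sketched as a cheap first pass that selects a shortlist, followed by an expensive second pass that reorders only that shortlist. The function name, its parameters and the toy scoring functions in the usage example are hypothetical.

```python
def two_phase_rank(candidates, cheap_score, expensive_score, shortlist_size, k):
    """Rank with a cheap scorer first, then rescore only the shortlist."""
    # Round 1: easy-to-compute features over all matching candidates.
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:shortlist_size]
    # Round 2: more expensive features, evaluated on the shortlist only.
    rescored = sorted(shortlist, key=expensive_score, reverse=True)
    return rescored[:k]


# Toy usage: length is the cheap feature, the count of "a" the expensive one.
expensive_calls = []


def toy_cheap(doc):
    return len(doc)


def toy_expensive(doc):
    expensive_calls.append(doc)
    return doc.count("a")


top = two_phase_rank(["aaa", "b", "abcd", "aab", "zzzzz"], toy_cheap, toy_expensive, 3, 2)
print(top, len(expensive_calls))
```

The expensive scorer is called once per shortlisted candidate rather than once per match, which is where the efficiency gain comes from. The price is that a candidate ranked low by the cheap features can never reach the final list.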

Page 66: Semantic Search overview at SSSW 2012

Evaluation

Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound, Henry Thompson, Roi Blanco, Thanh Tran Duc

Page 67: Semantic Search overview at SSSW 2012

- 71 -

Semantic Search challenge (2010/2011)

• Two tasks

– Entity Search

• Queries where the user is looking for a single real world object

• Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.

– List search (new in 2011)

• Queries where the user is looking for a class of objects

• Billion Triples Challenge 2009 dataset

• Evaluated using Amazon’s Mechanical Turk

– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010

– Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011

Page 68: Semantic Search overview at SSSW 2012

- 72 -

Evaluation form

Page 69: Semantic Search overview at SSSW 2012

- 73 -

Other evaluations

• TREC Entity Track

– Related Entity Finding

• Entities related to a given entity through a particular relationship

• Retrieval over documents (ClueWeb 09 collection)

• Example: (Homepages of) airlines that fly Boeing 747

– Entity List Completion

• Given some elements of a list of entities, complete the list

• Question Answering over Linked Data

– Retrieval over specific datasets (DBpedia and MusicBrainz)

– Full natural language questions of different forms

– Correct results defined by an equivalent SPARQL query

– Example: Give me all actors starring in Batman Begins.

Page 70: Semantic Search overview at SSSW 2012

Search interface

Page 71: Semantic Search overview at SSSW 2012

- 75 -

Search interface

• Input and output functionality

– helping the user to formulate complex queries

– presenting the results in an intelligent manner

• Semantic Search brings improvements in

– Query formulation

– Snippet generation

– Suggesting related entities

– Adaptive and interactive presentation

• Presentation adapts to the kind of query and results presented

• Object results can be actionable, e.g. buy this product

– Aggregated search

• Grouping similar items, summarizing results in various ways

• Filtering (facets), possibly across different dimensions

– Task completion

• Help the user to fulfill the task by placing the query in a task context

Page 72: Semantic Search overview at SSSW 2012

- 76 -

Query formulation

• “Snap-to-grid”: suggest the most likely interpretation of the query

– Given the ontology or a summary of the data

– While the user is typing or after issuing the query

– Example: Freebase suggest, TrueKnowledge

Page 73: Semantic Search overview at SSSW 2012

- 77 -

Enhanced results/Rich Snippets

– Use mark-up from the webpage to generate search snippets

• Originally invented at Yahoo! (SearchMonkey)

– Google, Yahoo!, Bing, Yandex now consume schema.org markup

• Validators available from Google and Bing

Page 74: Semantic Search overview at SSSW 2012

- 79 -

Aggregated search: facets

Page 75: Semantic Search overview at SSSW 2012

- 80 -

Aggregated search: Sig.ma

Page 76: Semantic Search overview at SSSW 2012

- 81 -

Related entities

Related actors and movies

Page 77: Semantic Search overview at SSSW 2012

- 83 -

Resources

• Books

– Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011

• Survey papers

– Thanh Tran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012.

• Conferences and workshops

– ISWC, ESWC, WWW, SIGIR, CIKM, SemTech

– Semantic Search workshop series

– Exploiting Semantic Annotations in Information Retrieval (ESAIR)

– Entity-oriented Search (EOS) workshop

• Upcoming

– Joint Intl. Workshop on Entity-oriented and Semantic Search (JIWES) at SIGIR 2012

– ESAIR 2012 at CIKM 2012

Page 78: Semantic Search overview at SSSW 2012

- 84 -

The End

• Many thanks to Thanh Tran (KIT) and members of the SemSearch group at Yahoo! Research in Barcelona

• Contact

[email protected]

– Internships available for PhD students (deadline in January)

