Search Computing
Stefano Ceri
WI-IAT’09 Keynote
Prof. Stefano Ceri, Database Management
Talk Outline
Genesis of Search Computing
Background research (Next Generation Search – a PRIN Project)
– Join of two search services
– Multi-domain query optimization
– Mash-up based interaction
– (Top-K extraction in rank aggregation)
Search Computing (SeCo) Project
– Architecture
– Technology watch and business plan
– (Preliminary) results after 6 months
– SeCo teams
GENESIS OF THE PROJECT
Search Computing, an EU-funded Project
European Research Council (ERC) runs the EU program IDEAS
– Funding body set up to support investigator-driven frontier research
2 calls each year:
– Starting Grant: for the most talented scientists and scholars with scientific leadership potential
• in 2008, 9200 total proposals, 300 funded
– Advanced Grant: for “exceptional research leaders”
• in 2008, >1000 proposals in “science & engineering”, 100 funded
Genesis of Search Computing
My “Gong Show” challenge at 2003 Lowell Workshop:
“Find an ethnic restaurant in a nice place close to Milano”.
Logically a composition of domains:
– Restaurants (ethnic)
– Geo-locations (nice place close to Milano)
Composing maps with “geo-located” information is now solved by many services, e.g. on top of Yahoo Local, Google Local…
… but in general no system is capable of composing
arbitrary semantic domains
Motivating Examples
“Who are the strongest candidates in Europe for
competing on software ideas?”
“Who is the best doctor who can cure insomnia in a
close-by hospital?”
“Where can I attend an interesting scientific conference in
my field and at the same time relax on a beautiful beach
nearby?”
This information is available on the Internet, but no software system is capable of computing the answer.
Queries span multiple semantic domains and require composing the rankings of results.
Their Common Aspect
Multi-domain queries
The answers are on the Web
A knowledgeable user would do the query step-by-step:
– Search database conferences, get their city
– Check that the city average temperature is warm enough
– Search low-cost flights via a broker for that city
– Search luxury hotels via another broker
After hours of painful search the user might actually
succeed!
Can this be done better?
Results before Search Computing
FUNDING (2007-08): PRIN NGS (New Generation Search)
– Politecnico Milano (National Coordination)
– University of Roma 3
– Free University of Bolzano
The brick: join of two search services
– Information Systems, March 2008
The framework: multi-domain query optimization
– International Very Large Data Bases Conference, Auckland (NZ), August 2008
The interface: mash-up based interaction
– IEEE-Internet Computing, November 2008
Optimality: top-K extraction in rank aggregation
– Currently submitted
JOIN OF TWO SEARCH
SERVICES
JOIN of Web Services
Input: items resulting from TWO web service calls, possibly ranked
Output: composed items resulting from the concatenation of matching items, presented in a “global ranking order”
Matching condition using:
– value equality,
– partial set matching
– term matching within a vocabulary
…..
Services are known, their matching function is predefined: this is not service discovery!
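The three matching conditions listed above can be illustrated with a small Python sketch. The function names, the use of Jaccard overlap for partial set matching, and the toy vocabulary are assumptions made for illustration; the slides do not specify how each match is computed.

```python
def value_equality(a, b):
    """Match index 1.0 when the two values are equal (ignoring case and padding), else 0.0."""
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0


def partial_set_match(a, b):
    """Jaccard overlap between two sets of terms (an assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def vocabulary_match(a, b, vocabulary):
    """Match two terms when the vocabulary maps both to the same concept."""
    concept_a = vocabulary.get(a.lower())
    concept_b = vocabulary.get(b.lower())
    return 1.0 if concept_a is not None and concept_a == concept_b else 0.0


# Hypothetical vocabulary mapping surface terms to concepts.
CUISINE_VOCABULARY = {"sushi": "japanese", "ramen": "japanese", "tapas": "spanish"}
```

Because the services and their matching function are fixed in advance, each pair of services would be registered with exactly one of these functions rather than discovering it at query time.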
Join
[Figure: ranked results bx1 … bx5 of Service X and by1 … by5 of Service Y, combined into joined results r1, r2, r3]
Matching items
Popular rock CDs
Score: measures ranking
Match: measures similarity
[Table: CDs returned by Amazon paired with CDs returned by iTunes, with the values shown next to each pair]
– A BIGGER BANG, THE ROLLING STONES (Amazon) / A BIGGER BANG, THE ROLLING STONES (iTunes): 1.000, 0.980, 3
– YOU COULD HAVE IT SO MUCH..., FRANZ FERDINAND (Amazon) / YOU COULD HAVE IT SO MUCH..., FRANZ FERDINAND (iTunes): 1.000, 0.950, 6
– CONFESSIONS ON A DANCE FLOOR, MADONNA / YOU CAN DANCE, MADONNA: 0.306, 0.556, 46
– ...
• Sources: Amazon and iTunes
Relevant news in two newspapers
[Table: news items from Corriere della Sera paired with news items from La Repubblica, with Score and Match values]
– “Florence. Rape in the home garage. A night of terror for a 15-year-old girl. She could give only a rough description of the attacker.” / “Florence. 15-year-old raped in garage. She had come home after visiting a friend and had parked her moped in the garage. Attacked by a man.” Values: 0.656, 0.502
– “Iraq, eleven million go to the polls. Results in two weeks. Seventy percent of eligible voters cast a ballot.” / “Iraq, eleven million go to the polls. Very high turnout: 70%. First figures: the Shiite alliance and Allawi's list are in the lead.” Values: 0.822, 0.394
• Sources: Corriere.it and Repubblica.it
Join of two Search Services
Given: two services $s_i$ and $s_j$, and a query $q$ which is decomposed into two queries $q_i$ and $q_j$
The join of the two services is obtained by composing the results $x_i$ and $y_j$ returned by $q_i$ and $q_j$, producing a sequence $r$ of elements:
– $r_k = c(x_i, y_j)$, $\rho_k$
$\rho_k$ is the relevance index of each result item $r_k$:
– $\rho_k = \rho_i \, \rho_j \, m_{ij}$
– $\rho_i$ = ranking of the result produced by $s_i$
– $\rho_j$ = ranking of the result produced by $s_j$
– $m_{ij}$ = match index between $x_i$ and $y_j$
Assuming that $x_i$ and $y_j$ are produced in ranking order, the objective is to produce $r_k$ in approximate ranking order; often rankings are opaque
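A minimal Python sketch of this definition follows. It assumes that the relevance index of a composed item is the product of the two rankings and the match index, that rankings and match indexes lie in [0, 1], and that both result lists fit in memory. The name `join_ranked` and the data layout are illustrative, not part of the original work.

```python
from itertools import product


def join_ranked(xs, ys, match):
    """Compose every matching pair (x, y) and rank it by rho_x * rho_y * m_xy.

    xs, ys: lists of (item, ranking) pairs, rankings in [0, 1].
    match:  function returning the match index m_xy in [0, 1].
    """
    composed = []
    for (x, rho_x), (y, rho_y) in product(xs, ys):
        m = match(x, y)
        if m > 0:  # keep only matching items
            composed.append(((x, y), rho_x * rho_y * m))
    # A full sort is possible only because both lists are already in memory;
    # with real services the results arrive chunk by chunk.
    return sorted(composed, key=lambda r: r[1], reverse=True)
```

With live services the full result lists are not available up front, which is why the output can only be produced in approximate ranking order.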
The model
[Figure: the model, showing a tile tij]
Tile-extraction-optimal algorithm
1. Compute the first tile (performing an initial request-response)
2. Compute the estimate of rankings for all candidate tiles C
3. Choose the tile with the highest estimate for the expected relevance
4. If the tile has either i or j equal to 1, then perform a request-response to the relevant service
5. Output those results which are above a relevance threshold
6. Go to step 2 unless (a) the search space is exhausted or (b) the user stops the search
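The steps above can be sketched in Python as follows. This is a simulation under stated assumptions, not the published algorithm: each service is a pre-computed list of chunks, a tile (i, j) pairs chunk i of one service with chunk j of the other (0-based here), a tile becomes a candidate once its left and lower neighbours are explored, and the relevance estimate of a tile is the product of the best scores its two chunks can contain. The function name `tile_join` and the `max_tiles` parameter, which stands in for the user stopping the search, are hypothetical.

```python
def tile_join(chunks_x, chunks_y, match, threshold=0.0, max_tiles=None):
    """Explore the tile space of two chunked, ranked services.

    chunks_x, chunks_y: lists of non-empty chunks; each chunk is a list of
    (item, score) pairs in descending score order. Reading a chunk from the
    list stands in for one request-response to the service.
    """
    fetched = {"x": {}, "y": {}}
    source = {"x": chunks_x, "y": chunks_y}
    calls = {"x": 0, "y": 0}
    explored, results = set(), []

    def fetch(side, idx):
        if idx not in fetched[side]:      # one request-response per new chunk
            fetched[side][idx] = source[side][idx]
            calls[side] += 1
        return fetched[side][idx]

    def bound(side, idx):
        # Best score the chunk can contain: its top score if already fetched,
        # otherwise the last score of the previous (already fetched) chunk.
        if idx in fetched[side]:
            return fetched[side][idx][0][1]
        return fetched[side][idx - 1][-1][1]

    def process(i, j):
        for x, sx in fetch("x", i):       # step 4: fetches only when needed
            for y, sy in fetch("y", j):
                relevance = sx * sy * match(x, y)
                if relevance > threshold:  # step 5
                    results.append((relevance, x, y))
        explored.add((i, j))

    process(0, 0)                          # step 1
    while max_tiles is None or len(explored) < max_tiles:  # step 6(b)
        candidates = [                     # step 2
            (i, j)
            for i in range(len(chunks_x))
            for j in range(len(chunks_y))
            if (i, j) not in explored
            and (i == 0 or (i - 1, j) in explored)
            and (j == 0 or (i, j - 1) in explored)
        ]
        if not candidates:                 # step 6(a)
            break
        i, j = max(candidates,             # step 3
                   key=lambda t: bound("x", t[0]) * bound("y", t[1]))
        process(i, j)
    return results, calls
```

With this candidate rule a new chunk is requested only when a tile on the border of the grid (first row or first column) is chosen, which mirrors step 4.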
Unavailable ranking
When ranking is unavailable on either one or both services, we further characterize the service according to its expected behavior in the following two classes:
– Step ranking. We assume that, by performing a limited number H of service requests, most of the relevant entries will be retrieved
– Linear ranking. We assume that the entry relevance decreases roughly linearly, with no step.
Nested Loop
Merge scan
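Only the titles of these two slides survive. One plausible reading, assumed here and not confirmed by the surviving text, is that the two methods differ in the order in which tiles of the chunk grid are visited: nested loop pairs a fixed number h of chunks from a step-ranked service with each chunk of the other service in turn, while merge scan advances on both services and visits tiles diagonal by diagonal. The function names below are hypothetical.

```python
def nested_loop_order(h, n_other):
    """Tile order when the h chunks of the step-ranked service are paired
    with each chunk of the other service in turn."""
    return [(i, j) for j in range(n_other) for i in range(h)]


def merge_scan_order(n_x, n_y):
    """Tile order when both services advance together, diagonal by diagonal."""
    order = []
    for d in range(n_x + n_y - 1):
        for i in range(d + 1):
            j = d - i
            if i < n_x and j < n_y:
                order.append((i, j))
    return order
```

Under this reading, nested loop would suit a service with step ranking, where the first H requests already retrieve most relevant entries, and merge scan would suit two services with roughly linear ranking.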
SEARCH SERVICE
INTEGRATION FRAMEWORK
Search Service Integration Framework
Objective: a Web Service Management System
The system accepts queries, optimizes them transparently to the user, and produces the result
This is the follow-up of research done at Stanford [VLDB06] but with significant changes
– Focus on search services
– Ranking as first-class citizen
– Physical optimization
“Best DB conference” multi-domain query
Reference query:
– “Find all database conferences in the next six months in locations where the average temperature is at least 28°C and for which a cheap travel solution (including a luxury accommodation) exists”
Answering the query requires:
– finding interesting conferences in the desired timeframe via online services by the scientific community;
– understanding whether the conference location is served by low-cost flights;
– finding luxury hotels close to the conference location with available rooms; and
– checking the expected average temperature of the location
A unified model for heterogeneous data sources
Ranking
– Search services: return answers in ranking order
– Exact services: return indistinguishable tuples (no ranking)
Cardinality
– Expected result size per invocation (ERSPI):
• Proliferative services (ERSPI > 1)
• Selective services (0 < ERSPI ≤ 1)
Accessibility
– Services have access patterns
– Paging of result sets:
• Bulk vs. Chunked services, with given chunk size
Cost parameters
– Response time
– Invocation cost
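The properties of this unified model can be collected in a single descriptor per service. The sketch below is one possible encoding; the class name `ServiceProfile`, its field names, and the units are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceProfile:
    name: str
    ranked: bool                # True for search services, False for exact services
    erspi: float                # expected result size per invocation
    chunk_size: Optional[int]   # None for bulk services
    response_time: float        # per invocation (unit assumed: seconds)
    invocation_cost: float      # per invocation (unit assumed: arbitrary currency)

    def __post_init__(self):
        if self.erspi <= 0:
            raise ValueError("ERSPI must be positive")

    @property
    def kind(self) -> str:
        return "search" if self.ranked else "exact"

    @property
    def proliferative(self) -> bool:
        # ERSPI > 1 is proliferative; 0 < ERSPI <= 1 is selective.
        return self.erspi > 1

    @property
    def chunked(self) -> bool:
        return self.chunk_size is not None
```

For example, a hotel search service might be registered as `ServiceProfile("hotel", True, 20, 10, 0.8, 0.0)`, while a temperature check would be an exact, selective, bulk service; both sets of numbers are made up.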
Cost Metrics
Cost metric: sums the costs of operators in the plan
– Request-response cost metric: a special case of sum cost metric where each invocation costs 1 and joins have a negligible cost (e.g., trivial value equalities)
– Monetary cost metric: minimizes cost of accessing changing services
Execution time metric: measures the expected time from query input to k-th result output
– Time-to-screen metric: minimizes time to produce the first line in output
Bottleneck metric: minimizes the dominant cost (for streaming/continuous scenarios)
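A short Python sketch of three of these metrics, computed over an annotated plan. The plan layout (a list of operators with a number of calls and a cost per call) and all numbers are hypothetical; the execution time metric is left out because it depends on how operators overlap in time.

```python
def sum_cost(plan):
    """Sum cost metric: total cost of all operator invocations in the plan."""
    return sum(op["calls"] * op["cost_per_call"] for op in plan)


def request_response_cost(plan):
    """Special case of the sum cost: each service invocation costs 1, joins cost 0."""
    unit_plan = [
        {**op, "cost_per_call": 1 if op["kind"] == "service" else 0}
        for op in plan
    ]
    return sum_cost(unit_plan)


def bottleneck_cost(plan):
    """Cost of the single most expensive operator (the dominant cost)."""
    return max(op["calls"] * op["cost_per_call"] for op in plan)


# Made-up plan: two service operators and one join.
EXAMPLE_PLAN = [
    {"name": "conference", "kind": "service", "calls": 2, "cost_per_call": 5},
    {"name": "flight", "kind": "service", "calls": 10, "cost_per_call": 3},
    {"name": "join", "kind": "join", "calls": 20, "cost_per_call": 0},
]
```

An optimizer would evaluate the chosen metric on each alternative plan and keep the cheapest one.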
Service registration and query formulation
The example query (in Datalog-like syntax):
Services with alternative access patterns
This formulation does not take access patterns into account
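To show what taking access patterns into account involves, the sketch below orders service calls so that every service is invoked only after its input attributes are bound. The service signatures are hypothetical (the original query text did not survive extraction), and the greedy strategy is a simple feasibility check, not the optimizer described in the talk.

```python
def executable_order(services, bound_vars):
    """Return an order in which every service has all its inputs bound, or None.

    services:   dict name -> (input attributes, output attributes)
    bound_vars: attributes already bound by the user's query
    """
    bound = set(bound_vars)
    pending = dict(services)
    order = []
    while pending:
        ready = [name for name, (inputs, _) in pending.items()
                 if set(inputs) <= bound]
        if not ready:
            return None  # no access pattern is satisfiable
        name = ready[0]
        _, outputs = pending.pop(name)
        bound |= set(outputs)  # outputs become available to later services
        order.append(name)
    return order


# Hypothetical access patterns for the conference query.
SERVICES = {
    "weather": ({"city"}, {"temperature"}),
    "flight": ({"city", "start", "end"}, {"price"}),
    "hotel": ({"city", "start", "end"}, {"hotel"}),
    "conference": ({"topic"}, {"city", "start", "end"}),
}
```

Here `executable_order(SERVICES, {"topic"})` puts the conference service first, because it is the only one whose inputs are bound by the query itself.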
Query plans
Represented as DAGs (directed acyclic graphs)
– Nodes: components of the query plan (service call, join)
– Arcs: precedence constraints + data flow
– Annotations: number of fetches per service, estimate of in-out tuples
Join Methods
– Two nodes connected by an arc: pipeline execution
– Explicit node with two services as input: parallel execution, tagged with the join method (e.g., NL/MS)
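A query plan of this kind can be represented as a mapping from each node to its predecessors. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive an execution order that respects the arcs. The plan shape is an assumed example for the conference query, with flight and hotel running in parallel and feeding an explicit merge-scan join node.

```python
from graphlib import TopologicalSorter

# node -> set of nodes that must precede it (arcs = precedence + data flow)
PLAN = {
    "conference": {"IN"},
    "weather": {"conference"},      # pipeline: conference feeds weather
    "flight": {"weather"},
    "hotel": {"weather"},
    "join(MS)": {"flight", "hotel"},  # explicit join node: parallel execution
    "OUT": {"join(MS)"},
}


def execution_order(plan):
    """Return one node order compatible with all precedence constraints.

    Raises graphlib.CycleError if the plan is not acyclic.
    """
    return list(TopologicalSorter(plan).static_order())
```

Nodes that appear only as predecessors, such as `IN`, are added to the graph automatically by `TopologicalSorter`.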
Query plan example
Defines a strategy for accessing services
Annotation of query plans
Annotations indicate:
– The number of tuples in output of each service
– The number of fetches for each chunked service
– The join strategy for each parallel join
Based on estimators, worked out from OUT node to IN node
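The backward estimation can be illustrated on a simple pipeline. The sketch assumes one chunked service fed by the IN node, followed by services that are invoked once per input tuple and return ERSPI tuples on average. The function name `annotate_pipeline`, the plan shape and the numbers are hypothetical; the actual estimators are not given in the slides.

```python
import math


def annotate_pipeline(k, source, filters):
    """Work back from OUT to IN, estimating tuples and fetches.

    k:       tuples wanted at the OUT node
    source:  (name, chunk_size) of the chunked service fed by the IN node
    filters: list of (name, erspi) in IN-to-OUT order
    """
    annotations = {}
    needed = k
    for name, erspi in reversed(filters):
        tuples_in = math.ceil(needed / erspi)  # inputs needed to emit `needed`
        annotations[name] = {"tuples_in": tuples_in, "tuples_out": needed}
        needed = tuples_in
    name, chunk_size = source
    annotations[name] = {
        "tuples_out": needed,
        "fetches": math.ceil(needed / chunk_size),
    }
    return annotations
```

For instance, asking for 10 results through two selective services with ERSPI 0.5 and 0.25 requires 80 tuples from the source, that is 4 fetches with a chunk size of 20.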
Generation and evaluation of alternatives
Results of the optimal plan
Screenshot of the results found by our prototype
MASHUP INTERFACE
FOR SEARCH SERVICES
Developer-Oriented interface
Mashing up software services is becoming very popular
among developers
We propose a “declarative” mashup language for search
services as a simple interface of the Web Service
Management System, hiding all the optimization
Mashup interface for Search Service Integration
Wrapped Web sites in the prototype:
Booking.com (www.booking.com) for hotels
Expedia (www.expedia.it) for flights
AccuWeather (http://www.dapper.net/) for weather conditions
TicketOne (www.ticketone.it) for events
GoogleMaps (maps.google.com) [Distance Calculator & Find Businesses]
Bed-and-breakfasts (www.bedandbreakfast.it)
35mm.it (http://programmazione.35mm.it/) for movie locations
IMDB (www.imdb.com) for movie descriptions
Another domain: bioinformatics
Find the human amino acid sequences with at least two occurrences of the same protein domain, broaden the set with similar protein sequences, and then check that they are involved in at least one pathway of either human or mouse
SEARCH COMPUTING
PRELIMINARY ARCHITECTURE
Search Computing architecture: overall view
[Figure: architecture diagram. Components: Front End, Query Analysis, Query To Domain Mapper, Query Planner, Query Engine (operators OP 1, OP 2, ..., OP N), WS-Framework (towards the WS World), Result Transformation, Domain Framework, Domain Repository, Service Repository; each component has a Cache. Main query flow: High-Level Query, Sub-queries, Low-level queries, Concrete Query Plan, Merged Results, Final User Results. Components are also linked by <Uses> relations.]
Example of the main query flow:
High level query: “Where can I attend a DB scientific conference close to a beautiful beach reachable with cheap flights?”
Sub query 1: “Where can I attend a DB scientific conference?”
Sub query 2: “place close to a beautiful beach?”
Sub query 3: “place reachable with cheap flight?”
Low level query 1: ConfSearch(“DB”,placeX,dateY)
Low level query 2: TourSearch(“Beach”,PlaceX)
Low level query 3: Flight(“cost<200”,PlaceX,DateY)
Query plan, then services invocations and operators execution, then results
Presented results:
MSVVEIS’08 – Barcelona – Iberia
LID’08 – Rome – Alitalia
RCIS’08 – Marrakech – AirFrance
Search Computing architecture: configurability of the implementation
[Figure: the architecture diagram of the overall view, with additional labels: Admin Interface, Low-level queries, Sub-queries, Concrete Query Plan]
Search Computing architecture: incremental prototyping
[Figure: the architecture diagram of the overall view, repeated]
Prototype 1: Core behaviour of the system.
• Engine-based execution of queries
• Domain repository
• Service repository
• Coarse result presentation
Prototype 2: Planning
• Automatic optimized query planning
Prototype 3: Mapping and presentation
• mapping to domains
• presentation of results
Prototype 4: High level queries
SEARCH COMPUTING
BUSINESS MODEL &
TECHNOLOGY WATCH
Overall Approach
[Figure: Business & Technology Watch for SeCo]
Scenarios: Usage Models; Technological competition; Competition in services
THEORETICAL MODEL; BUSINESS MODEL; IMPLEMENTATION (partnership)
BUSINESS STRATEGY: Markets (technologies, services, and which segments); Business models; Functionalities (different models of search computing); Which incentives; Pricing Models
TECHNOLOGY STRATEGY: Degree of flexibility; Outsiders Leveraging; Value Capture
Side labels: Literature; database; International collaboration; Case studies
Business & Technology Watch Approach
A screenshot of the technology blog
http://blog.search-computing.it/
THE FIRST SIX MONTHS
Dissemination
SeCo Project Portal
http://www.search-computing.it/
Search Computing at month 6
Search is a very competitive arena
– Just to name a few newcomers: Bing, Wolfram-Alpha, Kosmix
Academic research in search is hard
– Even simple ideas require a lot of investment to be proven
After initial brainstorming and lots of thinking, our current approach is to stay away from core research in:
– Global indexing and crawling
– Semantic web
– Communities
... and instead focus on our strengths:
– Data management
– Query optimization and execution (on scalable architectures)
– Software/service technologies and tools
– Process modelling and mining
Reasoning about Search 09
UNIVERSAL APPROACHES
– Indexing + global page ranking: Google
– Classification: Yahoo, Bing
– Semantics: Wolfram-Alpha
DOMAIN-EXPERT APPROACHES
– Fixed horizontal composers: Kosmix – broadcasts the same query to multiple engines and collates results.
– Domain-specific meta searchers: Tuifly – broadcasts the same query to multiple engines, collects and ranks results.
– Fixed vertical composers: Expedia – given known compositional patterns between flights, hotels, cars, travel-related events – broadcasts modified queries to data sources, collects, composes and ranks results.
– Search computing systems: extend the compositional pattern used by fixed vertical composers such as Expedia in a very specific context (travel) to arbitrary contexts, with multiple domains, many search engines, and greater query variance, but with known composition methods.
What are the assets of search computing?
A standard model for search services (service-mart), with almost-flat
representation and with suitable parameters for computing query cost/time.
A registration strategy consisting of providing, for each pair of services and
each composition semantics, a “composition set-up” (the service properties that
should be compared).
A standard model for composition, based on the notion of join between web
services, and several composition operations (join methods) for associating
a query with execution strategies.
A query optimization strategy, consisting of methods for determining a plan,
i.e. selecting the involved services, inferring the compositional semantics to be
used, and determining the best composition operations.
A service scheduling environment, consisting of methods for executing a
plan, i.e. iterating service invocation, computing compositions, evaluating global
rankings, determining stop conditions, caching results, enabling backtracking
and recomputing.
Liquid presentation of query results, enabling browsing of results and
sophisticated controls for, e.g., asking for more results, rolling up, drilling down,
augmenting the query in given dimensions.
Service Marts
Service orchestration
Panta Rhei
[Figure: plan over the flight and hotel services with an NL join; annotations B, (2,2), (1,m), (0,1), n-1]
Liquid queries
Current focus:
Simplified instantiations of search computing
Parametric query, fixed choice of services, fixed composition.
– E.g.: “best trips for a soccer supporter who wants to follow the team on a road game to a given “city” and also find in the city of the game a good hotel, cheap and fast roundtrip transports, and a “rock music” event within “2” days from the game.”
Composable query, variable choice of services, fixed composition once services are chosen.
– E.g.: queries allowing users to focus their interests on offers in June about monuments, sport events, concerts, museums, hiking trails, beaches, fairs,… thus finding a city or area within Europe where the top offers matching their interests are present, and at the same time there is an affordable option for roundtrip and hotel stay in the city or area, and good average climate in June.
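The parametric case can be pictured as a template whose services and composition are fixed and whose only degrees of freedom are the quoted parameters. The sketch below encodes the soccer-supporter example this way; the class, the service names and the parameter names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParametricQuery:
    services: tuple      # fixed choice of services
    composition: tuple   # fixed pairs of services to be joined
    parameters: tuple    # the only values the user may supply

    def instantiate(self, **values):
        """Bind the parameters; reject missing or unknown ones."""
        missing = set(self.parameters) - set(values)
        unknown = set(values) - set(self.parameters)
        if missing or unknown:
            raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
        return {
            "services": self.services,
            "composition": self.composition,
            "bindings": dict(values),
        }


# Hypothetical encoding of the road-game example.
SOCCER_TRIP = ParametricQuery(
    services=("game", "hotel", "transport", "event"),
    composition=(("game", "hotel"), ("game", "transport"), ("game", "event")),
    parameters=("city", "event_type", "max_days"),
)
```

A composable query would differ only in that the `services` tuple is chosen by the user before the composition is fixed.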
From queries to processes
Start from a multi-domain query:
– Search for the best movie-theatre combination where the movie must be an action movie (ranked by stars) and the theatre must be close to home.
and then …
– Once several candidate movies are located, look for their actors, their director, other films directed by that director, and so on…
– Once several candidate theatres are located, look for a close-by pizzeria, for transportation, for parking…
with search options…
– enable a user to dynamically impose search ordering (first choose the movie then the theatres)
– enable backtracking (if theatre location is not satisfactory after investigation go back and change theatre)
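User-imposed ordering and backtracking can be modelled with a stack of choices: the order of the pushes is the search ordering, and popping the last choice is a backtrack. This is only an illustrative sketch; the class name `Exploration` and the example values are made up.

```python
class Exploration:
    """Keeps the user's choices in the order they were made."""

    def __init__(self):
        self._choices = []  # stack of (dimension, value)

    def choose(self, dimension, value):
        """Fix one dimension of the search (e.g. the movie, then the theatre)."""
        self._choices.append((dimension, value))

    def backtrack(self):
        """Undo the most recent choice and return it, or None if there is none."""
        return self._choices.pop() if self._choices else None

    def current(self):
        """Dimensions fixed so far, in the order chosen."""
        return dict(self._choices)
```

For example, after choosing a movie and then a theatre, one `backtrack()` discards the theatre but keeps the movie, so a different theatre can be chosen.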
From queries to goals
If a query is being asked, what does the user really want?
Composition set-ups can give the most likely directions of extension of the current query
– These can be observed/mined from multiple query instantiations and then suggested in “ranking order”
Disaggregation of global rankings and association with results can suggest the most promising direction of improvement for the user:
– The one with fewer answers
– The one whose ranking in the answers did not drop much
Disaggregation of the query + results by services may enable inspecting/changing one at a time
– Exploring the search space by steps, with a sort of “pivotal” exploration (at every new search, a new dimension comes into focus but the dimension previously in focus is fixed)
Future Focus (far away)
Search Computing in the Universe
An “exploratory” approach to search computing, starting from NL queries and combining:
– Lightweight semantics (inspired by Wordnet) for service description
– Open-source NLP queries and processors for query analysis and decomposition (aiming at using a mix of syntactic methods and of light semantic annotations for mapping sentence chunks to domains of interest)
– Distance-based and clustering methods for mapping queries to services.
Will measure distance between “discovered mappings” and “intended mappings” as the query broadens to cover more and more domains.
Will enable us to understand the pros and cons of a service-oriented approach to search in a global sense
The next six months
Foundations, foundations, foundations!
– Service marts: motivation, theory, design, source wrapping, materialization and incremental maintenance
– Join methods: theory, efficient implementation, bio-inspired methods
– Optimization: plan selection through decision trees, strategy analysis and comparison
– Execution (panta rhei): producer-consumer system with service scheduling, context, caching, run-time controls
– Interfaces: liquid queries and liquid results
Together with a fast prototyping attitude and the objective of delivering the core of the technology in the next six months
Software engineering methods and tools for search computing
– For enabling application deployment with core technologies
Human-computer interaction for search computing
– For enabling navigation on result combinations
Search Computing Workshop, June 17-19, 2009
Search Computing Challenges and Directions
(LNCS, Ceri-Brambilla eds)
Part 1: Vision
– Ceri: Search computing
– Baeza-Yates: Next generation search
– Weikum: Search for knowledge
Part 2: Technology Watch for Search Computing
– Della Valle-Buganza-Gatti: The search engine industry
– Casati-Daniel-Soi: Mashup technologies
– Baumgartner-Campi-Gottlob-Herzog: Web data extraction
– Hedeler-Belhajjame-Campi-Embury-Fernandez-Paton: Dataspaces
– Bozzon-Fraternali: Multimedia and multimodal information retrieval
Part 3: Issues in Search Computing
– Campi-Ceri-Gottlob-Ronchi: Service marts
– Braga-Campi-Grossniklaus: Join methods and query optimization
– Ilyas-Martinenghi-Tagliasacchi: Rank aggregation
– Braga-Grossniklaus-Ceri: Panta Rhei, a query execution environment
– Brambilla-Ceri-Fraternali-Manolescu: Liquid queries and liquid results
– Brambilla-Ceri: Software engineering of search computing applications
– Masseroli-Paton-Spasic: Search computing and the life sciences
SeCo Teams
Theory and Methods (Davide Martinenghi, Marco Tagliasacchi).
Design of solid methods (with known performance and guarantees)
returning top-k query results.
Service Registration and Management (Alex Campi, Stefania
Ronchi, Andrea Maesani). Registration of new search services, their
semantic description, and the production of relevant parameters. We
envision informal and quick deployment and registration of search
services.
Query Processing and Execution Engine (Daniele Braga, Michael
Grossniklaus, Davide Barbieri, Adnan Abid, Mahmoud Abu Helou,
several MS students, Francesco Corcoglioniti). Operation-based
model of SeCo enabling the mapping of user queries into execution
plans and the selection and execution of "optimal" execution plans.
SeCo Teams
Tools (Marco Brambilla, Alessandro Bozzon, several Ms students).
Developer-oriented and user-oriented tools, demonstrators of the
"current" technology throughout the project.
Business Models and Technology Watch (Emanuele Della Valle,
Roberto Verganti, Tommaso Buganza, Nicola Gatti, Sofia Ceppi).
Setting the strategic directions of the project, offering a
"technological watch" and then discussing "scenarios" that can lead
to better results from all perspectives, including business.
Interaction Design (Piero Fraternali, Sara Comai, Maristella
Matera, Davide Mazza). Paradigms for improving interaction, design
of effective feedback methods for involving users in producing the
answer to queries.
Concept Team (Stefano Ceri and all the team coordinators).
Coordinating the project, deciding planning and milestones, deciding
about technological standards, and integrating the various parts.
Managing resource allocation, including human resources.
Questions?