Post on 08-May-2015
My Beliefs
The key challenge in web search is structured search
Part 1: What is structured search?
The key challenge in structured search is collecting data
Part 2: Data distribution & idea of Data Cloud
Part 3: Demo: numeric data distribution
The key challenge in collecting data is incentive design
Part 4: Economics of data distribution
Structured Search
Data
Structured data
Entity unit:
• Identifier
• Metadata:
– Explicit key-value pairs
– Relational properties
– Evaluation
Semi-structured data
Content unit:
• Body: text, video, audio, or image
• Metadata:
– Explicit key-value pairs
– Relational properties
– Evaluation
Data = data of entities + data of content
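The two unit types above can be sketched as plain records. This is only an illustration of the slide's outline: the class names, field names, and sample values are assumptions, not a schema from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Metadata:
    """Metadata shared by both unit types (field names are illustrative)."""
    key_values: Dict[str, str] = field(default_factory=dict)        # explicit key-value pairs
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target identifier)
    evaluation: Dict[str, float] = field(default_factory=dict)      # e.g. ratings or votes


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body (text, video, audio, or image) plus metadata."""
    body: str        # the text itself, or a reference to a media file
    media_type: str  # "text", "video", "audio", or "image"
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical sample data: one entity and one piece of content that refers to it.
concert = EntityUnit(
    identifier="event:example-concert",
    metadata=Metadata(
        key_values={"city": "SF", "price_usd": "15"},
        relations=[("performed_by", "artist:example-band")],
        evaluation={"popularity": 0.8},
    ),
)
review = ContentUnit(
    body="Great show last night.",
    media_type="text",
    metadata=Metadata(relations=[("about", concert.identifier)]),
)
```

Keeping the same metadata shape on both unit types mirrors the slide, where entities and content differ only in having an identifier versus a body.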
Structured Search
Factoid search
"What's the value of property X of object Y?"
– Entity hubs
– Domain hubs
Structured object search
"All concerts this weekend in SF under $20, sorted by popularity"
– Time focus
– Ranking focus
– Relations focus
Structured content search
"All videos with Tom Brady"
"All comments and blog posts about Bing"
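A structured object query such as the concert example can be read as a filter plus a ranking over entity records, and a factoid query as a single property lookup. The sketch below assumes a small in-memory list; the records, field names, and function names are all made up for illustration.

```python
from datetime import date

# Hypothetical event records; every field name and value is invented for this sketch.
events = [
    {"name": "Indie Night", "city": "SF", "date": date(2015, 5, 9), "price": 15.0, "popularity": 120},
    {"name": "Jazz Brunch", "city": "SF", "date": date(2015, 5, 10), "price": 25.0, "popularity": 300},
    {"name": "Open Mic", "city": "SF", "date": date(2015, 5, 10), "price": 5.0, "popularity": 40},
    {"name": "Rock Fest", "city": "LA", "date": date(2015, 5, 9), "price": 10.0, "popularity": 500},
]


def search_events(records, city, start, end, max_price):
    """Filter by city, date window, and price, then rank by popularity (descending)."""
    hits = [
        r for r in records
        if r["city"] == city and start <= r["date"] <= end and r["price"] < max_price
    ]
    return sorted(hits, key=lambda r: r["popularity"], reverse=True)


def factoid(records, name, prop):
    """Factoid search: the value of property `prop` of the object called `name`."""
    for r in records:
        if r["name"] == name:
            return r.get(prop)
    return None


weekend = search_events(events, "SF", date(2015, 5, 9), date(2015, 5, 10), 20.0)
print([r["name"] for r in weekend])              # ['Indie Night', 'Open Mic']
print(factoid(events, "Jazz Brunch", "price"))   # 25.0
```

The date window carries the time focus and the sort key carries the ranking focus; a relations focus would need links between records, which this flat list does not model.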
Yury’s Wishlist
Business-generated data
• Products, services, news, wishlists, contact data
Reality stream, sensors
• What has happened, and where
Expert knowledge
• Glossary, issues, typical solutions, object databases, related objects graph
Events
• Sport, concerts, education, corporate, community, private
Market graph & signals
• Like, interested, use, following, want to buy; votes and ratings
Search as a Platform
[Diagram. Applications: Classic search, App 1, App 2, App 3, App 4. Data: Web index, Structured data. Processing: Query analysis, Post analysis]
Data Cloud
How to collect all structured data in one place?
Data Producers
• People: forums, wiki, mail groups, blogs, social networks
• Enterprises: product profiles, corporate news, professional content
• Sensors: GPS modules, web cameras, traffic sensors, RFID
• Transactional data
Data Distributors
A data distributor is any technical solution that accumulates, organizes, and provides access to structured and semi-structured data
Data publisher: the original distributor of some data
Data retailer: a consumer-facing distributor of some data
Data Consumers
• Humans
– Email
– Aggregators: news, friend feeds, RSS readers
– Search
– Browsing / random walks
• Intelligence projects
– Recommendation systems
– Trend mining
Data Cloud
A data cloud is a centralized, fully functional data distribution service
Success metric for data cloud strategy = the total “value” of data on the cloud
To-Cloud Solutions
• Extraction
– DBpedia.org, "web tables"
• Semantic markup, data APIs
– Yahoo! SearchMonkey
• Feeds
– Yahoo! Shopping
– Disqus.com, js-kit.com, Facebook Connect
• Direct publishing
On-Cloud Solutions
• Ontology maintenance
– Freebase
• Normalization, de-duplication, antispam
• Named entity recognition, metadata inference, ranking
• Data recycling (cross-references)
– Amazon Public Data Sets
– Viral license
• Hosted search
– Yahoo! BOSS
From-Cloud Solutions
• Search, audience
– Y! SearchMonkey, Google Base
• Data API, dump access, update stream
• Custom notifications
– Gnip.com
• Data cloud as a primary backend
• Access control
– Ad distribution (AT&T and Yahoo! Local deal)
Demo: webNumbr.com
Joint work with Paul Tarjan
webNumbr.com: Import
• Crawl numbers from the web
– URL + XPath + regex
• Create "numbr pages"
– Update their values every hour
– Keep the history
Anyone can create a numbr: http://webnumbr.com/create
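The "URL + XPath + regex" recipe can be sketched as follows. This is not webNumbr's actual code, which the slides do not show. It assumes the page has already been fetched and is well-formed XML, because the standard library's ElementTree supports only a subset of XPath and cannot parse arbitrary HTML. The function name, sample page, and pattern are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def extract_number(markup, xpath, pattern):
    """Select one node with XPath, then pull the first regex match out of its text."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        return None
    text = "".join(node.itertext())
    match = re.search(pattern, text)
    if match is None:
        return None
    # Drop thousands separators before converting to a number.
    return float(match.group(0).replace(",", ""))


# Stand-in for a fetched page (a real crawler would download it from the URL).
page = """<html><body>
  <div class="stats"><span id="followers">Followers: 1,234</span></div>
</body></html>"""

history = []  # (timestamp, value) pairs; one entry would be appended per hourly update
value = extract_number(page, ".//span[@id='followers']", r"\d[\d,]*(?:\.\d+)?")
if value is not None:
    history.append((datetime.now(timezone.utc), value))
print(value)  # 1234.0
```

XPath narrows the page down to one element and the regex isolates the digits inside it, so small wording changes around the number do not break the extraction.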
webNumbr.com: Export
• Embed code
• Graphs
• Search & browse
• RSS
Economics of Data Distribution
Joint work with Ravi Kumar and Andrew Tomkins
Network Effect in Two-Sided Markets
Two-sided market = every product serves consumers of two types, A and B
Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers and vice versa
Examples: operating systems, credit cards, e-commerce marketplaces
"Two-sided network effects: A theory of information product design". G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne
Basic model
• Distributors D_1, …, D_k
• Each producer or consumer joins exactly one distributor
• Initial shares (p_1, c_1), …, (p_k, c_k)
• A new consumer selects distributor D_i with probability proportional to p_i
• A new producer selects distributor D_i with probability proportional to c_i
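The basic model is easy to simulate. The sketch below makes one assumption the slides do not state: each newcomer is equally likely to be a producer or a consumer. The function names and initial counts are illustrative.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Grow the market one newcomer at a time under the linear preference rule."""
    rng = random.Random(seed)
    p, c = list(producers), list(consumers)
    k = len(p)
    for _ in range(steps):
        if rng.random() < 0.5:  # assumption: producers and consumers arrive at equal rates
            # New consumer: picks distributor i with probability proportional to p[i].
            i = rng.choices(range(k), weights=p)[0]
            c[i] += 1
        else:
            # New producer: picks distributor i with probability proportional to c[i].
            i = rng.choices(range(k), weights=c)[0]
            p[i] += 1
    return p, c


def shares(counts):
    """Convert absolute counts into market shares that sum to 1."""
    total = sum(counts)
    return [x / total for x in counts]


p, c = simulate([5, 3, 2], [4, 4, 2], steps=10_000)
print(shares(p), shares(c))
```

Running it with different seeds and initial counts gives a feel for how much the final shares depend on the early arrivals.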
Basic model
a1 a4a2 a3
a1 a4a3a2
Market Shares Dynamics
Theorem 1: Market shares will stabilize
Theorem 2: With a super-linear preference rule, one of the distributors will tip
Theorem 3: With a sub-linear preference rule, market shares will flatten
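The slides do not define the super-linear and sub-linear rules. A common reading, assumed here, is that a distributor of size x is chosen with probability proportional to x^alpha, with alpha > 1 for super-linear and alpha < 1 for sub-linear. The sketch below only compares the leader's pick probability with its current share. It illustrates the intuition and proves nothing.

```python
def choice_probabilities(counts, alpha):
    """Probability of picking each distributor when weight = count ** alpha."""
    weights = [x ** alpha for x in counts]
    total = sum(weights)
    return [w / total for w in weights]


counts = [60, 40]  # the leader currently holds a 60% share
linear = choice_probabilities(counts, 1.0)        # leader picked with probability 0.6
super_linear = choice_probabilities(counts, 2.0)  # leader picked with probability ~0.692
sub_linear = choice_probabilities(counts, 0.5)    # leader picked with probability ~0.551
print(linear[0], super_linear[0], sub_linear[0])
```

With alpha > 1 the leader attracts newcomers faster than its share, which pushes toward tipping. With alpha < 1 it attracts them more slowly than its share, which pushes shares toward equality.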
External Factor
Preference rule with external factor:
e_i + c_i / (c_1 + … + c_k)
Theorem 4: Market shares will stabilize at e_1 : e_2 : … : e_k
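A heuristic fixed-point check, not the authors' proof, shows why the limit in Theorem 4 is plausible. Assume the same external factors e_i apply to both sides, write x_i and y_i for the limiting producer and consumer shares of D_i, and let E = e_1 + … + e_k > 0. Because shares sum to 1, the rule's weights sum to E + 1. If the limiting shares reproduce themselves under the normalized rule:

```latex
\begin{align*}
x_i &= \frac{e_i + y_i}{E + 1}, \qquad y_i = \frac{e_i + x_i}{E + 1} \\
x_i - y_i &= -\frac{x_i - y_i}{E + 1} \;\Longrightarrow\; x_i = y_i \\
x_i (E + 1) &= e_i + x_i \;\Longrightarrow\; x_i = \frac{e_i}{E}
\end{align*}
```

So the only self-consistent shares are proportional to e_1 : e_2 : … : e_k, whatever the initial shares were.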
Coalition
Data Cloud
Coalitions
Theorem 5: If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors
Corollary
Coalitions are not monotone
Example: 5 : 4 : 1 : 1
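As a small arithmetic aid, the hypothesis of Theorem 5 can be checked for a given set of market sizes. The sketch reads 5 : 4 : 1 : 1 as relative sizes of four distributors, which is an assumption. It tests only the share threshold and does not model profit or the non-monotonicity in the corollary. The function name is hypothetical.

```python
from math import sqrt


def below_threshold(sizes):
    """Return (shares, threshold, ok); ok is True when every share is below 1/sqrt(k)."""
    k = len(sizes)
    total = sum(sizes)
    shares = [s / total for s in sizes]
    threshold = 1 / sqrt(k)
    return shares, threshold, all(s < threshold for s in shares)


shares, threshold, ok = below_threshold([5, 4, 1, 1])
print(shares[0], threshold, ok)  # largest share 5/11 (about 0.4545), threshold 0.5, True
```

For k = 4 the threshold is 0.5, so even the largest distributor in the example (share 5/11) satisfies the hypothesis.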
Model Variations
• Same-side network effect
• Different p-to-c and c-to-p rules
• Multi-homing (overlapping audiences)
• n^2 vs. n log n revenue models
• Mature market: newcomer rate = departing rate
• Diverse market (many types of producers and consumers)
• Arriving and departing distributors
• Directed coalitions
Challenges
Marketing
• Data demand?
• Data offerings?
• Requirements for distribution technology?
Incentive design
• Incentives for data sharing?
• Centralized or distributed?
– For profit or non-profit?
• Data licensing and ownership?
• Monetizing data cloud?
More Challenges
Prototyping:
• Data marketplace: open data & data demand
• Search plugins: related objects, glossaries, object timelines
• Publishing tools for structured data
• Data client: structured news, bookmarking, notifications
Tech design:
• Access management
• Namespace design
User interface:
• Structured search UI
• Discovery UI
Thanks!
Follow my research:
http://twitter.com/yurylifshits
http://yury.name/blog