Content Integration for E-Business

Joe Hellerstein

New Generation of e-Business on the Internet

• Companies moving beyond marketing, storefronts

• Attempting to do operations on the Internet

– procurement

– supply chain

– customer relationships

– etc.

• In a cross-enterprise environment

• Requires cross-enterprise content integration

– catalog integration is the procurement instance of this problem

Content Integration

• Content integration across enterprises

– Not the “in-house” data warehousing problem

– Not the Enterprise App Integration (EAI) problem

• “Operational” data must be integrated

– As opposed to historical (trend) data

– E.g. pricing, availability, supply chain

• Structured and unstructured data

– Not just relational or XML queries

– Not just text search

– A combination of the two: logic meets statistics

The “Butterfly”

• Everybody’s favorite picture c. 1/2000:

[Figure: the “butterfly” diagram, labeled Suppliers, Marketplace, Buyers]

• At question (6/2001) is how many butterflies, who owns them

– Not a startup opportunity (Transora vs. Chemdex)

– Perhaps one of the wings is smaller than the other (Home Depot)

Road Map

• Setting

• Scenarios & Terminology

• Characteristics and Challenges of Content Integration

• Research Evangelism

Some Scenarios for Content Integration

• Catalog Management: Integration and Syndication

– “MRO” (Maintenance, Repair and Operations) a la Grainger

– Thousands of suppliers, run by a “content manager”

• Availability and Pricing

– Travel industry

– Necessitates live, cross-enterprise querying

• Supply Chain Management

– E.g. auto industry

– Increase in production requires the entire supply chain (“the cows”)

– Contractual information along with catalog and availability

Marketing: The EcoSystem and its Terminology

• Enterprise Application Integration (EAI): App Glue

– Imperative, message-oriented programming (scripting languages)

– Transactional networking (persistent queues)

– Gateways to popular packaged apps

– Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc.

• Data Integration: Warehousing and associated processes

– Intra-enterprise, for “business intelligence” (historical trends)

– Vendors: Informatica, Ascential, DBMS vendors

• Content Management: Tools for content creation

– Web page and graphic design

– Versioning and configuration management

– Vendors: Vignette, Interwoven, etc.

Road Map

• Setting

• Scenarios & Terminology

• Characteristics and Challenges of Content Integration

– Content Access, Mapping and Transformation

– Query Processing

• Research Evangelism

Content Integration: Characteristics and Challenges

• New integration challenges for e-business

– cross-enterprise

– operational

– data-centric (not app-centric)

– structured/unstructured

• Two main thrusts

– Content Access, Mapping and Transformation

– Query Processing

Content Access: Relationships with Providers

• Varying relationships with content providers

– Direct DBMS access (typically in-house)

– Direct access to federated apps (SAP, etc.)

– Gateway vendors a la Merant, NEON, Attunity, etc.

– Arm’s-length relationships

– HTML screen scraping

– XML messaging

Relationships evolve over time!

MySimon example

Content Mapping

• Syntactic and semantic integration

– Formatting/normalization is one piece of the puzzle

– XML, HTML, Relational, etc.

– Semantics is much harder

– E.g. “price”. E.g. “delivery”.

• Semantics gate the process

– A “content manager” must own the transformation task

– Ease of use critical

– Home Depot has 60,000 suppliers!

– Standards can help a bit (e.g. UDDI)

– But graphical tools are the name of the game
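
To make the syntactic/semantic split concrete, here is a minimal sketch of a declarative mapping from one supplier's records into a target schema. Everything named in it is an assumption for illustration: the supplier layout, the target fields (`unit_price_usd`, `delivery_days`), and the `MAPPING` structure do not come from any particular product. Renaming a field is the syntactic part; knowing that this supplier's "price" means cents per case of 12, or that "delivery" means business days, is the semantic part that a content manager has to supply.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical mapping: target field -> (source field, transformation).
Rule = Tuple[str, Callable[[Any], Any]]

MAPPING: Dict[str, Rule] = {
    # Syntactic: rename the field and trim whitespace.
    "name": ("ProductName", lambda s: s.strip()),
    # Semantic (assumed): this supplier quotes price in cents per case of 12.
    "unit_price_usd": ("price", lambda cents: round(int(cents) / 100 / 12, 2)),
    # Semantic (assumed): this supplier writes delivery as "<n> days".
    "delivery_days": ("delivery", lambda s: int(s.split()[0])),
}

def transform(record: Dict[str, Any], mapping: Dict[str, Rule]) -> Dict[str, Any]:
    """Apply each rule to build a record in the target schema."""
    return {target: fn(record[source]) for target, (source, fn) in mapping.items()}

supplier_row = {"ProductName": " India ink, 30ml ", "price": "4788", "delivery": "3 days"}
result = transform(supplier_row, MAPPING)
print(result)
```

A graphical tool would in effect be an editor for something like `MAPPING`, with one such table per supplier.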

Cohera Workbench

Schemas and Taxonomies

• Cross-enterprise = multiple schemas

– Even if standards prevail (very optimistic)

– Early e-catalog systems were locked into one schema

– Great for service companies, e.g. Requisite

– Tools are sounding the death knell

• Taxonomies are critical

– Natural for browsing, especially with dirty data

– “Black Ink”, “India ink”, “fountain pen ink, black”

– Taxonomy per vertical markets, plus standards like UNSPSC

– Office Supplies->Ink and lead refills->India ink

– Taxonomy as data: query it, browse it, etc.

• Integration task includes taxonomy integration!
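
A minimal sketch of "taxonomy as data", assuming a toy taxonomy held as root-to-leaf paths. Only the India ink path is taken from the slide; the other two nodes, the node ids, and the function names are made up for illustration. The `suggest` function ranks nodes by plain token overlap (Jaccard similarity), which is one simple way to produce mapping hints for dirty names, and `children` shows the same data being browsed.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical taxonomy stored as data: node id -> path from the root.
TAXONOMY: Dict[str, List[str]] = {
    "india_ink": ["Office Supplies", "Ink and lead refills", "India ink"],
    "pen_refills": ["Office Supplies", "Ink and lead refills", "Pen refills"],
    "copy_paper": ["Office Supplies", "Paper products", "Copy paper"],
}

def tokens(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of alphabetic words."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return set(cleaned.split())

def suggest(name: str, top: int = 2) -> List[Tuple[str, float]]:
    """Rank taxonomy nodes by Jaccard overlap between the name and the node's path."""
    name_tokens = tokens(name)
    scored = []
    for node, path in TAXONOMY.items():
        path_tokens = tokens(" ".join(path))
        overlap = len(name_tokens & path_tokens) / len(name_tokens | path_tokens)
        scored.append((node, overlap))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]

def children(prefix: List[str]) -> Set[str]:
    """Browse: list the categories directly under a path prefix."""
    n = len(prefix)
    return {path[n] for path in TAXONOMY.values() if path[:n] == prefix and len(path) > n}

for dirty in ["Black Ink", "India ink", "fountain pen ink, black"]:
    print(dirty, "->", suggest(dirty))
print(children(["Office Supplies"]))
```

In this toy data "Black Ink" scores the same against two nodes, which is the point of treating the output as a hint for a content manager to confirm instead of an automatic assignment.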

Themes in Content Access and Mapping

• Scalability in human terms

• “Content managers”, not geeks

• The name of the game: semi-automatic tools

– Statistical (“fuzzy”) techniques to provide hints (not silver bullets)

– Integrated into graphical programming-by-example interfaces

– Problem domains:

– Wrapper generation

– Data cleaning

– Schema mapping

– Taxonomy mapping

– Syndication

• One of the key “systems” challenges today

Road Map

• Setting

• Scenarios & Terminology

• Characteristics and Challenges of Content Integration

• Research Evangelism

Query Processing Issues

• Content to be integrated is increasingly “uncacheable”

– Arm’s-length accessibility

– Business rules, not data

– E.g. custom content throughout the dataflow

– Volatile information

– E.g. Availability

• Yet a great deal of content is cacheable and slowly changing

• Upshot: need a combined technology

– Prefetch/Cache/Replicate when possible

– Query live when impossible
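
The combined approach can be sketched as a small accessor that caches slowly changing attributes and always goes to the source for volatile ones. This is an illustrative sketch only: the class name, the per-attribute time-to-live table, and the `fake_supplier` stand-in are assumptions, and a real system would issue a remote query where the stub returns canned values.

```python
import time
from typing import Any, Callable, Dict, Tuple

class HybridAccessor:
    """Serve slowly changing attributes from a cache and volatile ones live."""

    def __init__(self, fetch_live: Callable[[str, str], Any],
                 ttl_seconds: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch_live = fetch_live      # hypothetical call to the remote provider
        self.ttl_seconds = ttl_seconds    # attribute -> how long a cached value stays valid
        self.clock = clock
        self.cache: Dict[Tuple[str, str], Tuple[float, Any]] = {}
        self.live_calls = 0

    def get(self, item: str, attribute: str) -> Any:
        # Attributes with no TTL entry are treated as uncacheable.
        ttl = self.ttl_seconds.get(attribute, 0.0)
        key = (item, attribute)
        now = self.clock()
        if ttl > 0 and key in self.cache:
            fetched_at, value = self.cache[key]
            if now - fetched_at <= ttl:
                return value              # cache hit, still fresh
        value = self.fetch_live(item, attribute)   # cache miss or volatile: query live
        self.live_calls += 1
        if ttl > 0:
            self.cache[key] = (now, value)
        return value

# Stand-in for a supplier's system; a real one would be a remote query.
def fake_supplier(item: str, attribute: str) -> Any:
    return {"description": "India ink, 30ml", "availability": 17}[attribute]

accessor = HybridAccessor(fake_supplier,
                          ttl_seconds={"description": 3600.0, "availability": 0.0})
accessor.get("sku-1", "description")   # live, then cached
accessor.get("sku-1", "description")   # served from cache
accessor.get("sku-1", "availability")  # always live
accessor.get("sku-1", "availability")  # always live
print(accessor.live_calls)
```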

Federated Query Processing

• DBMS community must shed our materialization myopia!

– ETL/Warehousing was inelegant and limited

– What do we do on a “cache miss”??

– Should be no distinction between materialized views and queries!

• Federated Query Processing

– Query across multiple sources

– Choose among multiple replicas, materialized views

– Consider staleness

• This is the natural extension of the modern database vision

– Cohera uses Mariposa’s economic model to do this

– Decouples optimization, cost estimation, storage and processing
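
As a rough illustration of choosing among replicas and materialized views with staleness in mind, the sketch below has each site quote a cost for answering a request and picks the cheapest quote that is fresh enough. It is a drastically simplified stand-in for an economic model, not Mariposa's or Cohera's actual protocol; the `Bid` fields, site names, and numbers are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Bid:
    site: str            # which replica or materialized view offers to answer
    cost: float          # price the site quotes; could reflect its current load
    staleness_s: float   # age of the site's copy in seconds (0 = live source)

def choose(bids: List[Bid], max_staleness_s: float) -> Optional[Bid]:
    """Pick the cheapest bid whose data is fresh enough, or None if there is none."""
    acceptable = [b for b in bids if b.staleness_s <= max_staleness_s]
    return min(acceptable, key=lambda b: b.cost) if acceptable else None

bids = [
    Bid("supplier-live", cost=9.0, staleness_s=0.0),
    Bid("regional-replica", cost=2.0, staleness_s=300.0),
    Bid("nightly-view", cost=0.5, staleness_s=86400.0),
]

print(choose(bids, max_staleness_s=3600.0))   # catalog browse: minutes-old data is fine
print(choose(bids, max_staleness_s=0.0))      # availability check: must be live
```

Because each site computes its own quote, the chooser needs no global cost model, which is one way to read the decoupling of optimization from cost estimation.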

Standard Queries Required

• Hand-coded queries are brittle: you want ad-hoc

– Don’t buy a handful of beans

• Need support for standard query languages

– SQL and XPath today

– SQL/XQuery tomorrow

• Everybody knows this!

– Part of industrial religion

– Oracle on one side

– Dotcoms on the other side

– You might get by claiming to be “XML compliant”

– But most people have cottoned on by now

IR capabilities need to be in the engine

• The best-integrated data will still be noisy (product names, etc.)

• Text search on taxonomies, names, descriptions

• Still no good integration of DBMS and IR engines

– Storage (compression huge in IR)

– Index concurrency (many updates per doc in IR)

– Query optimization challenges

• Note: this is not semi-structured querying!

– Integration of logic + statistics is the real model/query challenge

– Plus HCI issues

– Unify: “query”, “browse”, “mine”, “rank”

• Cohera integrates AltaVista into the engine & optimizer

Core Systems Issues Remain Important

• Availability, Scalability, Load Balancing

– All critically important in the B2B space

– Availability: you don’t even control the components! Outage=news.

– Scalability: MRO wants to grow up to very big installations

– Load Balancing: need to respect SLAs, etc.

• Need adaptive, load balancing, federated QP

– 100s to 1000s of “sites”

– Replication is key to availability, but optimizer must understand it

– Cohera’s economic model adapts for each query

– Other models being studied (see DE Bulletin 6/2000)

– Compile-time, centralized optimizers (R*, et al) will break

Query Processing: Themes

• Standards

• Logic + Statistics

• Adaptivity to changing performance, load, failures

• Optimizer Scalability

So What Really Matters Today?

• Cohera sells because…

– Customers need the content integration workbench today

– They are in integration pain!

– Comes in multiple guises (e-catalog, supplier enablement, etc.)

– Smart tools start cutting the pain immediately

– Customers want an open, standard solution

– Plain old SQL and relational schemas (vs. Requisite, e.g.)

– XML “in the bottom”, “out the top” for messaging/integration

– Customers want federated querying…tomorrow

– For today, they’ll settle for a centralized solution

– Want the flexibility to grow in that direction

– Federated query engine works fine centralized

– The converse clearly not true

Road Map

• Setting

• Scenarios & Terminology

• Characteristics and Challenges of Content Integration

• Research Evangelism

Research Evangelism

• Semi-Automatic Tools

– Statistical + logical techniques, with a user in the loop

– E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01] http://control.cs.berkeley.edu

– schema integration algebra

– interactive visualization

– programming-by-example

– statistical inferencing for discrepancies and domain detection

– A new class of “systems” work!

– “Tools”/“Apps” must be part of our agenda

– Many systems challenges here, especially on the stat/HCI side

– Architectural elegance, API design, extensibility, scalability, etc.
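
To give a feel for statistical discrepancy detection with a user in the loop, here is a deliberately simple sketch: abstract each value in a column to its "shape", take the most common shape as the inferred domain, and flag everything else for a person to review. This is an assumed stand-in written for illustration and is not the algorithm from the Potter's Wheel paper; the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def shape(value: str) -> str:
    """Abstract a value to its shape: runs of digits -> '9', runs of letters -> 'a'."""
    out: List[str] = []
    for ch in value:
        symbol = "9" if ch.isdigit() else "a" if ch.isalpha() else ch
        # Collapse repeated digit/letter symbols; keep every punctuation character.
        if not out or out[-1] != symbol or symbol not in "9a":
            out.append(symbol)
    return "".join(out)

def discrepancies(column: List[str]) -> Tuple[str, List[str]]:
    """Return the dominant shape of the column and the values that deviate from it."""
    counts = Counter(shape(v) for v in column)
    dominant, _ = counts.most_common(1)[0]
    return dominant, [v for v in column if shape(v) != dominant]

prices = ["$4.99", "$12.50", "$0.75", "4,99 EUR", "$100.00", "call for price"]
dominant, flagged = discrepancies(prices)
print(dominant, flagged)
```

The flagged values are candidates, not verdicts: a content manager decides whether "4,99 EUR" needs a currency transform or belongs to a different domain altogether.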

Research Evangelism, Cont.

• Adaptive Query Processing

– Critical to the federated B2B space

– Unpredictable world, you don’t control the components

– Also critical to the ubiquitous computing space

– Sensors are the next challenge

– Who’s the DBA of your housepaint? The freeway lines?

– Economic optimization (Mariposa) is one model

– Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)

– See http://telegraph.cs.berkeley.edu for examples, ideas, SW

Research Evangelism, Cont.

• Tired of research on relational? Choose wisely!

– One big direction here is to integrate IR

– Another is to abandon languages in favor of interfaces

– query+browse+mine: semi-automatic GUIs again!

• XML is critical to business, but under control

– We’re doing fine in this space, thank you

– XQuery will push (merge with?) SQL

– The end-result will resemble things you’ve seen before

• But text search is eating our lunch!

– Intellectual impact in the last decade?

– Industrial impact in the last decade?

– Text search is mostly “just” an access method + a sort metric

– Integrate into our composable algebras and architectures!

– Teach it in our undergrad classes
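
The claim that text search is an access method plus a sort metric can be sketched in a few lines: an inverted index finds candidate rows, an ordinary selection filters them on a structured attribute, and a TF-IDF style score supplies the ordering. The catalog rows, the particular scoring formula, and the function names are assumptions chosen for illustration; real engines use more refined ranking.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical catalog rows: (id, description, price)
ROWS = [
    (1, "black india ink for fountain pen", 4.99),
    (2, "fountain pen with black barrel", 24.00),
    (3, "india ink black 1 litre", 19.50),
    (4, "copy paper white", 3.25),
]

# Access method: an inverted index from term to the ids of rows containing it.
index: Dict[str, Set[int]] = defaultdict(set)
term_counts: Dict[int, Counter] = {}
for row_id, text, _ in ROWS:
    term_counts[row_id] = Counter(text.split())
    for term in term_counts[row_id]:
        index[term].add(row_id)

def score(row_id: int, terms: List[str]) -> float:
    """Sort metric: a basic TF-IDF sum over the query terms."""
    total = 0.0
    for term in terms:
        df = len(index.get(term, ()))
        if df:
            total += term_counts[row_id][term] * math.log(len(ROWS) / df + 1.0)
    return total

def search(query: str, max_price: float) -> List[Tuple[int, float]]:
    """Index lookup, then a relational selection, then ORDER BY the text score."""
    terms = query.split()
    candidates = set().union(*(index.get(t, set()) for t in terms))
    prices = {row_id: price for row_id, _, price in ROWS}
    hits = [(r, score(r, terms)) for r in candidates if prices[r] <= max_price]
    return sorted(hits, key=lambda h: (-h[1], h[0]))

print(search("black ink", max_price=30.0))
print(search("fountain pen", max_price=10.0))
```

Nothing here is outside the usual operator vocabulary: the index lookup is a scan, the price test is a selection, and the score is a computed column used for sorting.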

Summary

• Content Integration is a new, challenging industrial space

– Cohera provides the first complete solution

– Access with varying relationships, formats

– Support for multiple schemas and taxonomies

– Support for custom syndication

– Support for distributed data, both cacheable and uncacheable

– Ad hoc querying

– Fuzzy & structured search

– Availability, Scalability, Load Balancing

– Smart graphical tools for content managers

– A fertile area for research as well

– Join the fun!

