Post on 21-Dec-2015
New Generation of e-Business on the Internet
• Companies moving beyond marketing, storefronts
• Attempting to do operations on the Internet
– procurement
– supply chain
– customer relationships
– etc.
• In a cross-enterprise environment
• Requires cross-enterprise content integration
– catalog integration is the procurement instance of this problem
Content Integration
• Content integration across enterprises
– Not the “in-house” data warehousing problem
– Not the Enterprise App Integration (EAI) problem
• “Operational” data must be integrated
– As opposed to historical (trend) data
– E.g. pricing, availability, supply chain
• Structured and unstructured data
– Not just relational or XML queries
– Not just text search
– A combination of the two: logic meets statistics
The “Butterfly”
• Everybody’s favorite picture c. 1/2000:
[Figure: butterfly diagram labeled Suppliers, Marketplace, Buyers]
• The question (6/2001) is how many butterflies there are, and who owns them
– Not a startup opportunity (Transora vs. Chemdex)
– Perhaps one of the wings is smaller than the other (Home Depot)
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Some Scenarios for Content Integration
• Catalog Management: Integration and Syndication
– “MRO” (Maintenance, Repair and Operations) a la Grainger
– Thousands of suppliers, run by a “content manager”
• Availability and Pricing
– Travel industry
– Necessitates live, cross-enterprise querying
• Supply Chain Management
– E.g. auto industry
– Increase in production requires the entire supply chain (“the cows”)
– Contractual information along with catalog and availability
Marketing: The EcoSystem and its Terminology
• Enterprise Application Integration (EAI): App Glue
– Imperative, message-oriented programming (scripting languages)
– Transactional networking (persistent queues)
– Gateways to popular packaged apps
– Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc.
• Data Integration: Warehousing and associated processes
– Intra-enterprise, for “business intelligence” (historical trends)
– Vendors: Informatica, Ascential, DBMS vendors
• Content Management: Tools for content creation
– Web page and graphic design
– Versioning and configuration management
– Vendors: Vignette, Interwoven, etc.
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
– Content Access, Mapping and Transformation
– Query Processing
• Research Evangelism
Content Integration: Characteristics and Challenges
• New integration challenges for e-business
– cross-enterprise
– operational
– data-centric (not app-centric)
– structured/unstructured
• Two main thrusts
– Content Access, Mapping and Transformation
– Query Processing
Content Access: Relationships with Providers
• Varying relationships with content providers
– Direct DBMS access (typically in-house)
– Direct access to federated apps (SAP, etc.)
– Gateway vendors a la Merant, NEON, Attunity, etc.
– Arm’s-length relationships
– HTML screen scraping
– XML messaging
Relationships evolve over time!
MySimon example
Content Mapping
• Syntactic and semantic integration
– Formatting/normalization is one piece of the puzzle
– XML, HTML, Relational, etc.
– Semantics is much harder
– E.g. “price”. E.g. “delivery”.
• Semantics gate the process
– A “content manager” must own the transformation task
– Ease of use critical
– Home Depot has 60,000 suppliers!
– Standards can help a bit (e.g. UDDI)
– But graphical tools are the name of the game
Schemas and Taxonomies
• Cross-enterprise = multiple schemas
– Even if standards prevail (very optimistic)
– Early e-catalog systems were locked into one schema
– Great for service companies, e.g. Requisite
– Tools are sounding the death knell
• Taxonomies are critical
– Natural for browsing, especially with dirty data
– “Black Ink”, “India ink”, “fountain pen ink, black”
– Taxonomy per vertical markets, plus standards like UNSPSC
– Office Supplies->Ink and lead refills->India ink
– Taxonomy as data: query it, browse it, etc.
• Integration task includes taxonomy integration!
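A minimal sketch of "taxonomy as data" combined with a fuzzy hint, assuming a tiny hypothetical taxonomy fragment and plain token-overlap (Jaccard) scoring. The node names other than the Office Supplies path quoted above, and the functions `suggest_nodes` and `children`, are illustrative assumptions, not part of any product or standard. The scores are hints for a content manager to confirm, not automatic classifications.

```python
import re

# Hypothetical taxonomy fragment, stored as data: each entry is a root-to-leaf path.
TAXONOMY = [
    ("Office Supplies", "Ink and lead refills", "India ink"),
    ("Office Supplies", "Ink and lead refills", "Pen refills"),
    ("Office Supplies", "Writing instruments", "Fountain pens"),
]

def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_nodes(product_name, taxonomy=TAXONOMY, top_k=2):
    """Rank taxonomy leaves by token overlap with a (possibly dirty) product name."""
    name_tokens = tokens(product_name)
    scored = []
    for path in taxonomy:
        # Use the leaf and its parent so words in the parent category count too.
        node_tokens = tokens(path[-1]) | tokens(path[-2])
        scored.append((jaccard(name_tokens, node_tokens), path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def children(prefix, taxonomy=TAXONOMY):
    """Browse the taxonomy: distinct child labels directly under a path prefix."""
    n = len(prefix)
    return sorted({p[n] for p in taxonomy if p[:n] == tuple(prefix) and len(p) > n})
```

Calling `suggest_nodes("fountain pen ink, black")` returns ranked candidate paths, while `children(["Office Supplies"])` browses one level of the same data. A name such as "Black Ink" matches several nodes equally well under this scoring, which is why a user stays in the loop.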
Themes in Content Access and Mapping
• Scalability in human terms
• “Content managers”, not geeks
• The name of the game: semi-automatic tools
– Statistical (“fuzzy”) techniques to provide hints (not silver bullets)
– Integrated into graphical programming-by-example interfaces
– Problem domains:
– Wrapper generation
– Data cleaning
– Schema mapping
– Taxonomy mapping
– Syndication
• One of the key “systems” challenges today
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Query Processing Issues
• Content to be integrated is increasingly “uncacheable”
– Arm’s-length accessibility
– Business rules, not data
– E.g. custom content throughout the dataflow
– Volatile information
– E.g. Availability
• Yet a great deal of content is cacheable and slowly changing
• Upshot: need a combined technology
– Prefetch/Cache/Replicate when possible
– Query live when impossible
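The "cache when possible, query live when impossible" idea can be sketched as follows. This is an illustration under stated assumptions: `HybridSource`, `fetch_live`, and the time-to-live policy are hypothetical names and choices, with a TTL of zero standing in for volatile content such as availability.

```python
import time

class HybridSource:
    """Serve slowly changing content from a local copy; fetch volatile content live."""

    def __init__(self, fetch_live, ttl_seconds, clock=time.monotonic):
        self.fetch_live = fetch_live      # hypothetical callable: key -> value
        self.ttl_seconds = ttl_seconds    # 0 means "never cache" (volatile data)
        self.clock = clock                # injectable clock, handy for testing
        self._cache = {}                  # key -> (value, time stored)

    def get(self, key):
        now = self.clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self.ttl_seconds:
            return hit[0]                 # fresh enough: answer from the cache
        value = self.fetch_live(key)      # cache miss or stale entry: query live
        if self.ttl_seconds > 0:
            self._cache[key] = (value, now)
        return value
```

A product description might sit behind `HybridSource(fetch, ttl_seconds=3600)` while availability uses `ttl_seconds=0`, so both kinds of content go through the same interface.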
Federated Query Processing
• DBMS community must shed our materialization myopia!
– ETL/Warehousing was inelegant and limited
– What do we do on a “cache miss”??
– Should be no distinction between materialized views and queries!
• Federated Query Processing
– Query across multiple sources
– Choose among multiple replicas, materialized views
– Consider staleness
• This is the natural extension of the modern database vision
– Cohera uses Mariposa’s economic model to do this
– Decouples optimization, cost estimation, storage and processing
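As a rough illustration of choosing among replicas and materialized views while considering staleness, the sketch below has each site submit a bid and lets a broker take the cheapest bid that is fresh enough. It is only loosely in the spirit of bid-based optimization and is not Cohera's or Mariposa's actual algorithm; the `Bid` fields, `choose_site`, and the site names are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    site: str            # hypothetical site name
    cost: float          # price the site asks to answer the subquery
    staleness_s: float   # age of the site's copy in seconds (0 = live source)

def choose_site(bids, max_staleness_s):
    """Pick the cheapest bid whose data is fresh enough, or None if none qualifies."""
    acceptable = [b for b in bids if b.staleness_s <= max_staleness_s]
    return min(acceptable, key=lambda b: b.cost, default=None)

# Example bids: a day-old materialized view, a five-minute-old replica, a live source.
EXAMPLE_BIDS = [
    Bid("warehouse_view", cost=1.0, staleness_s=86400),
    Bid("regional_replica", cost=3.0, staleness_s=300),
    Bid("supplier_live", cost=10.0, staleness_s=0),
]
```

With these bids, a query that tolerates an hour of staleness goes to the replica, while one that needs live data pays for the supplier's own system. Because each site prices its own work, the broker needs no global cost model, which is one way to read the decoupling point above.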
Standard Queries Required
• Hand-coded queries are brittle: you want ad-hoc
– Don’t buy a handful of beans
• Need support for standard query languages
– SQL and XPath today
– SQL/XQuery tomorrow
• Everybody knows this!
– Part of industrial religion
– Oracle on one side
– Dotcoms on the other side
– You might get by claiming to be “XML compliant”
– But most people have cottoned on by now
IR capabilities need to be in the engine
• The best-integrated data will still be noisy (product names, etc.)
• Text search on taxonomies, names, descriptions
• Still no good integration of DBMS and IR engines
– Storage (compression huge in IR)
– Index concurrency (many updates per doc in IR)
– Query optimization challenges
• Note: this is not semi-structured querying!
– Integration of logic + statistics is the real model/query challenge
– Plus HCI issues
– Unify: “query”, “browse”, “mine”, “rank”
• Cohera integrates AltaVista into the engine & optimizer
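One way to picture "logic + statistics" in a single query: an exact predicate over structured fields removes rows, and a statistical text score ranks the survivors. The sketch below assumes a small hypothetical catalog and a smoothed TF-IDF-style score; it says nothing about how Cohera or AltaVista actually implement this.

```python
import math
import re
from collections import Counter

CATALOG = [  # hypothetical rows
    {"sku": "A1", "price": 4.50, "in_stock": True,  "name": "Black Ink, 2 oz bottle"},
    {"sku": "B2", "price": 6.00, "in_stock": True,  "name": "India ink black waterproof"},
    {"sku": "C3", "price": 3.25, "in_stock": False, "name": "fountain pen ink, black"},
    {"sku": "D4", "price": 2.00, "in_stock": True,  "name": "Pencil lead refills"},
]

def words(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def search(rows, predicate, query_text):
    """Filter rows with an exact predicate, then rank survivors by text relevance."""
    docs = [Counter(words(r["name"])) for r in rows]
    n = len(rows)

    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1.0   # smoothed inverse document frequency

    results = []
    for row, doc in zip(rows, docs):
        if not predicate(row):                      # logic: hard constraint
            continue
        score = sum(doc[t] * idf(t) for t in words(query_text))  # statistics: soft match
        if score > 0:
            results.append((score, row["sku"]))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [sku for _, sku in results]
```

For example, `search(CATALOG, lambda r: r["in_stock"] and r["price"] < 10, "india ink")` drops the out-of-stock row outright and ranks the remaining ink products by how well their names match. In an integrated engine the optimizer would decide whether to apply the predicate or the text index first; here the order is fixed for simplicity.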
Core Systems Issues Remain Important
• Availability, Scalability, Load Balancing
– All critically important in the B2B space
– Availability: you don’t even control the components! Outage=news.
– Scalability: MRO wants to grow up to very big installations
– Load Balancing: need to respect SLAs, etc.
• Need adaptive, load balancing, federated QP
– 100s to 1000’s of “sites”
– Replication is key to availability, but optimizer must understand it
– Cohera’s economic model adapts for each query
– Other models being studied (see DE Bulletin 6/2000)
– Compile-time, centralized optimizers (R* et al.) will break
Query Processing: Themes
• Standards
• Logic + Statistics
• Adaptivity to changing performance, load, failures
• Optimizer Scalability
So What Really Matters Today?
• Cohera sells because…
– Customers need the content integration workbench today
– They are in integration pain!
– Comes in multiple guises (e-catalog, supplier enablement, etc.)
– Smart tools start cutting the pain immediately
– Customers want an open, standard solution
– Plain old SQL and relational schemas (vs. Requisite, e.g.)
– XML “in the bottom”, “out the top” for messaging/integration
– Customers want federated querying…tomorrow
– For today, they’ll settle for a centralized solution
– Want the flexibility to grow in that direction
– Federated query engine works fine centralized
– The converse clearly not true
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Research Evangelism
• Semi-Automatic Tools
– Statistical + logical techniques, with a user in the loop
– E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ‘01] http://control.cs.berkeley.edu
– schema integration algebra
– interactive visualization
– programming-by-example
– statistical inferencing for discrepancies and domain detection
– A new class of “systems” work!
– “Tools”/“Apps” must be part of our agenda
– Many systems challenges here, especially on the stat/HCI side
– Architectural elegance, API design, extensibility, scalability, etc.
Research Evangelism, Cont.
• Adaptive Query Processing
– Critical to the federated B2B space
– Unpredictable world, you don’t control the components
– Also critical to the ubiquitous computing space
– Sensors are the next challenge
– Who’s the DBA of your housepaint? The freeway lines?
– Economic optimization (Mariposa) is one model
– Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)
– See http://telegraph.cs.berkeley.edu for examples, ideas, SW
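To give a flavor of fine-grained adaptivity, the sketch below reorders a set of filters per tuple according to the pass rates observed so far, so that the filter most likely to reject a tuple runs first. It is far simpler than an eddy (no joins, no routing policy beyond a sort) and every name in it, including `AdaptiveFilterChain`, is an assumption made for illustration.

```python
class AdaptiveFilterChain:
    """Apply filters to each tuple in an order that adapts to observed selectivity."""

    def __init__(self, filters):
        # filters: mapping of hypothetical filter name -> predicate function
        self.filters = dict(filters)
        self.seen = {name: 0 for name in self.filters}
        self.passed = {name: 0 for name in self.filters}

    def pass_rate(self, name):
        # Optimistic prior of 1.0 until the filter has been observed at least once.
        return self.passed[name] / self.seen[name] if self.seen[name] else 1.0

    def current_order(self):
        # Lowest pass rate first, so tuples that will be rejected are dropped early.
        return sorted(self.filters, key=self.pass_rate)

    def process(self, item):
        """Return True if the item passes every filter, updating statistics as we go."""
        for name in self.current_order():
            self.seen[name] += 1
            if self.filters[name](item):
                self.passed[name] += 1
            else:
                return False
        return True
```

No optimizer decides the order up front: if a filter's behavior shifts mid-stream (a slow site, a skewed batch of data), the order shifts with it. A real system would also weigh per-filter cost and age out old statistics, which this sketch omits.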
Research Evangelism, Cont.
• Tired of research on relational? Choose wisely!
– One big direction here is to integrate IR
– Another is to abandon languages in favor of interfaces
– query+browse+mine: semi-automatic GUIs again!
• XML is critical to business, but under control
– We’re doing fine in this space, thank you
– XQuery will push (merge with?) SQL
– The end-result will resemble things you’ve seen before
• But text search is eating our lunch!
– Intellectual impact in the last decade?
– Industrial impact in the last decade?
– Text search is mostly “just” an access method + a sort metric
– Integrate into our composable algebras and architectures!
– Teach it in our undergrad classes
Summary
• Content Integration is a new, challenging industrial space
– Cohera provides the first complete solution
– Access with varying relationships, formats
– Support for multiple schemas and taxonomies
– Support for custom syndication
– Support for distributed data, both cacheable and uncacheable
– Ad hoc querying
– Fuzzy & structured search
– Availability, Scalability, Load Balancing
– Smart graphical tools for content managers
– A fertile area for research as well
– Join the fun!